Analysis Considerations
Questions to Explore
Q6-7
Comparing which VCS they started to use and when (YYYY)
- Delete all columns except 2 columns: “When did you first start using a version control system?” and “Which version control system did you start out using?”
- Create a pivot table with year as rows and VCS as columns
- Values are the count of respondents who started using a particular VCS in each year
- Create a line graph with one line per VCS, year on the x-axis, and number of respondents on the y-axis, to visualize the rise and fall of VCS adoption over the years
- Need to code and count Other field for popular resources
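The pivot and line graph steps above could be sketched in R roughly as follows. This is a minimal sketch on toy data, not the survey export: the data frame `vcs_start` and the column names `year_started` and `first_vcs` are hypothetical stand-ins for the two retained columns.

```r
library(ggplot2)

# Toy stand-in for the two retained survey columns; the column names
# `year_started` and `first_vcs` are hypothetical.
vcs_start <- data.frame(
  year_started = c(2008, 2008, 2012, 2012, 2012, 2015, 2015),
  first_vcs    = c("SVN", "CVS", "Git", "Git", "SVN", "Git", "Git")
)

# Pivot-table equivalent: years as rows, VCS as columns, counts as values.
vcs_pivot <- table(vcs_start$year_started, vcs_start$first_vcs)

# Long form for plotting: one row per year/VCS pair.
vcs_counts <- as.data.frame(vcs_pivot, stringsAsFactors = FALSE)
names(vcs_counts) <- c("year", "vcs", "respondents")
vcs_counts$year <- as.integer(vcs_counts$year)

# One line per VCS, year on the x-axis, respondent count on the y-axis.
vcs_plot <- ggplot(vcs_counts, aes(x = year, y = respondents, colour = vcs)) +
  geom_line() +
  geom_point() +
  labs(title = "First VCS used, by starting year", x = "Year",
       y = "# Participants") +
  theme_bw()
```

Printing `vcs_plot` draws the chart. Year/VCS pairs with no respondents show up as zeros in `vcs_counts`, so each line spans every year.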
Q14
Why did you first enter the world of git and version control?
- Delete columns not related to Q14
- Create bar chart of predetermined variables (e.g. I need a vcs, My collaborators Use it, and I heard it would get me a job in the future)
- Qualitatively code the Other free text into 7 reason categories
- Pivot table to count how often each reason occurred, from most common to least (course learning; upgrade, tech needs; collaboration, sharing; keep up with tech standards; work requirement; reproducibility; backup)
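Once the Other responses have been coded by hand, the counting step above might look like this in base R. The data frame `q14` and its `reason` column are hypothetical, and the rows are toy values rather than survey results.

```r
# Toy stand-in for Q14 after manual coding; `reason` is a hypothetical column
# holding either a predetermined choice or one of the coded "Other" categories.
q14 <- data.frame(
  reason = c("course learning", "collaboration, sharing", "course learning",
             "backup", "work requirement", "course learning",
             "collaboration, sharing"),
  stringsAsFactors = FALSE
)

# Count each reason and order from most to least common.
reason_counts <- sort(table(q14$reason), decreasing = TRUE)
reason_counts <- as.data.frame(reason_counts, stringsAsFactors = FALSE)
names(reason_counts) <- c("reason", "respondents")
print(reason_counts)
```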
Q15
How did you learn to use version control? Check all that apply
- Delete columns not related to Q15
- Sum each choice column to get the number of respondents who selected each choice
- Transpose header and sum from wide to long
- Create bar chart with header (method of learning Git) as x-axis and count of respondents (sums) as y-axis
Q16
Who taught you git specifically?
- Delete columns not related to Q16 (kept Q17-18 for potential cross-tabulation)
- Sort Z-A (omitting blank text fields coded as “-99”) and skim for similar responses
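The filter-and-sort step above could be done in base R along these lines. This is only a sketch: `q16` and the column name `who_taught` are hypothetical, and the responses are made up for illustration.

```r
# Toy stand-in for Q16; `who_taught` is a hypothetical column name and
# "-99" marks a blank free-text field.
q16 <- data.frame(
  who_taught = c("A labmate", "-99", "Workshop instructor",
                 "a labmate", "-99", "Myself"),
  stringsAsFactors = FALSE
)

# Drop the blank responses.
answered <- q16[q16$who_taught != "-99", , drop = FALSE]

# Sort Z-A, ignoring case so near-identical responses sit next to each other.
answered <- answered[order(tolower(answered$who_taught), decreasing = TRUE), ,
                     drop = FALSE]
print(answered$who_taught)
```

Sorting on the lower-cased text is a design choice here: it keeps responses that differ only in capitalization adjacent, which makes skimming for similar answers easier.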
Other potential figures
Pull out specifically how particular groups of students learned to use Git.
library(dplyr)
library(tidyr)
library(ggplot2)
# get only the columns about how people learned (plus status), omit the NA values,
# and then filter on only students
status_learned <- data.frame(na.omit(output[c(32:39, 103)]))
status_learned <- dplyr::filter(status_learned, status != -99)
status_learned <- dplyr::filter(status_learned, status %in% c('Masters student', 'Doctoral student', 'Undergraduate student'))
# pivot long to get the method of learning in one column with counts in the other,
# with statuses (repeating)
status_learned_long <- pivot_longer(status_learned, cols = starts_with('how_learn'),
                                    names_to = "Method", values_to = "count")
# pivot wider to put the statuses as headers, with counts of # of participants who used
# those methods as values (leaving method as a column) in case we want the chart
# learned_status <- pivot_wider(status_learned_long, names_from = status, values_from = count,
#                               values_fn = list(count = sum))
# only keep the rows where the person *had* used a method
status_learned_long <- dplyr::filter(status_learned_long, count == 1)
# drop the first 10 characters of each method name (the shared how-learned prefix)
status_learned_long$Method <- substr(status_learned_long$Method, 11, 40)
# plot it as a grouped (dodged) bar chart, one bar per status within each method
ggplot(status_learned_long, aes(x=Method)) +
  geom_bar(aes(fill = status), position = "dodge", stat = "count") +
  labs(title="How students first learned Git", x="Method", y="# Participants") +
  theme_bw() + theme(axis.text.x = element_text(angle=25, vjust=0.75))
Simple bar chart showing the frequency of reteaching
# keep only the two columns the plot needs, so na.omit() does not drop rows
# that are merely missing answers to unrelated questions
reteach <- na.omit(output[c("freq_reteach", "status")])
reteach <- dplyr::filter(reteach, status != -99)
# geom_bar() counts rows per value, matching the earlier stat = "count" without
# the "Ignoring unknown parameters" warning that geom_histogram() raised
ggplot(reteach, aes(x=freq_reteach)) +
  geom_bar(aes(fill = status), position = "dodge") +
  labs(title="How frequently participants reteach themselves Git", x="Frequency", y="# Participants") +
  theme_bw() + theme(axis.text.x = element_text(angle=25, vjust=0.65))

Why folks use GHPs
library(stringr)
# get only the column about why participants use GHPs, omit the NA values and blank cells
scholghps <- data.frame(na.omit(output[57]))
scholghps <- dplyr::filter(scholghps, why_ghps != '')
# just get rid of the colon in other
scholghps$why_ghps <- str_replace(scholghps$why_ghps, "Other:", "Other")
# plot it; theme_bw() comes first because it resets any theme() settings made before it
ggplot(scholghps, aes(x=why_ghps)) +
  geom_bar(stat = "count", width = 0.6) +
  scale_x_discrete(breaks=c("Change tracking (e.g. changes/additions to code on a macro-scale)",
                            "Collaboration (e.g. editing and updating data/code within a team)",
                            "Method tracking (e.g. documenting methodologies and protocols)",
                            "Openness (e.g. sharing data and code for open access)",
                            "Other", "Publishing (e.g. making content available online)",
                            "Version Control"),
                   labels=c("Change tracking","Collaboration","Method Tracking",
                            "Openness", "Other", "Publishing", "Version Control")) +
  labs(title="Why participants use GHPs", x="\nReason",
       y="# Participants") +
  theme_bw() +
  theme(axis.text.x = element_text(vjust=0.75, angle=90),
        axis.title.y = element_text(angle=0))
In this graph, we want to show how people learned git.
# get only the columns about how people learned, omit the NA values
learned <- data.frame(na.omit(output[32:39]))
# drop the first 10 characters (the shared prefix) from each column name
names(learned) <- substring(names(learned), 11)
# pivot the dataframe to get method and count as columns
long_learned <- pivot_longer(learned, cols = everything(), names_to = "method", values_to = "count")
# convert the count into a numeric column so we can sum
long_learned$count <- as.numeric(long_learned$count)
# sum the counts grouped by method, so we get the list of methods + counts of
# how many participants used them
long_learned <- aggregate(long_learned$count, by=list(long_learned$method), FUN=sum)
# rename the columns to what we want
names(long_learned) <- c("Method", "Count")
# plot the counts
ggplot(long_learned, aes(x=Method, y=Count)) +
  geom_bar(stat="identity", width = 0.6) +
  labs(title="How participants first learned Git", x="Learning Method",
       y="\n# Participants") +
  theme_bw() + theme(axis.text.x = element_text(vjust=0.75, angle=90)) +
  theme(axis.title.y = element_text(angle=0))
- Where people deposit code/if people deposit
- Cross those with status
- Also cross with where you learned git
- Also cross with different features of GHPs used
- Engage in scholarly activities on GHP
- Features on GHPs
- Why use GHPs
- Why of CI
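The cross-tabulations listed above could start from something like the base R sketch below. The `status` column appears in the code earlier in these notes; the data frame `deposit` and the column `deposits_code` are hypothetical, and the rows are toy values.

```r
# Toy stand-in; `deposits_code` is a hypothetical column name.
deposit <- data.frame(
  status = c("Doctoral student", "Masters student", "Doctoral student",
             "Undergraduate student", "Masters student"),
  deposits_code = c("Yes", "No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: statuses as rows, deposit answers as columns.
deposit_by_status <- table(deposit$status, deposit$deposits_code)
print(deposit_by_status)

# Row percentages make groups of different sizes comparable.
print(round(100 * prop.table(deposit_by_status, margin = 1), 1))
```

The same `table()` call should work for the other crosses by swapping in the relevant pair of columns.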