Git moving into open scholarship

A few weeks ago, I broadly defined Git and scholarship as separate entities. My original plan for this second Git-focused post was to start listing the features that Git hosting platforms offer and matching them to the traditional scholarly responsibilities they serve (e.g. publish a book on GitHub, collaborate with reproducible data and software on GitLab, etc.). But as the school year came to a close, a spark hit me while I reflected on the courses I’ve taken, projects I’ve delivered, and conferences I’ve attended. Openness was a common theme throughout all of these experiences, so now I want to be explicit about why Git is not just for programmers and why the adaptations of this tool via Git hosting platforms are beneficial and influential to scholars, research institutions, and anyone else on the path of life-long learning.

This post will focus on the integration of the common Git hosting platforms into the academy, and why this fusion is important to us now more than ever. It is the combination of the open tool and the scholarly institution that will move us beyond the current plateau and into the broader landscape of open scholarship: diversity, equity, inclusion, reproducibility, and accessibility.

GCCUNY Open Pedagogy Symposium
The Graduate Center Library at CUNY hosted an Open Pedagogy Symposium where attendance priority was given to scholars from traditionally marginalized backgrounds. This is where discussions about Git hosting platforms could be happening. (@cunyGClibrary 2019)

Open scholarship

Open sounds like a buzzword these days: open source, open access, open pedagogy, open data, and the list goes on. Tack the term open before anything and you’re almost guaranteed to get that grant funded. But there is a rhyme to that reason: with the right intentions, actions, and use of tools, openness can reveal untold stories, connect disparate theories, and provide knowledge to wider audiences. Sharing openly and freely is just as important as "respecting the appropriate boundaries of spaces, conversations, and knowledge given the context", as Christina Hendricks, Professor of Teaching in the Philosophy Department at the University of British Columbia, states in her blog post "Openness and/as closure". Seen this way, privacy and access are not opposites but part of the same trajectory in the cycle of scholarship.

I first experienced the benefits of methodical open scholarship in 2013 when I was a consultant (i.e. pre-Elsevier customer support) at bepress for Digital Commons open access institutional repositories, where I helped professors strengthen their international networks through their open access journal editorial boards and peer review rosters. Now, I live through the effects of openness in my work with Preserve This Podcast, CUNY City Tech OERs, and the Dance Heritage Coalition. Through community-building around digital tools, each of these projects has given scholars and creators the agency to realize their works through an open platform (albeit with varying degrees of openness), and that is the same power that Git hosting platforms give us now.

On the more radical end of the open scholarship spectrum, there are scholar activists like Clelia O. Rodriguez, Professor of Humanities at Western University, who create open syllabi listing no assigned readings and emphasize that "the readings that often stay lingering around our heart are those that are not often published but have been cared for by the hands of elders" (Rodriguez 2018). On the other hand, for those who need structure, ProfHacker from The Chronicle of Higher Education has been posting since 2012 about another approach to the open syllabus: hosting course logistics on GitHub. This will be covered in more depth in the coming posts.

Open pedagogy can be approached in many ways, but as online tools have shifted the landscape toward high-throughput research, analysis, and sharing, the popularity of free Git hosting platforms has risen. This is why we need to understand the behavior and workflows of students, professors, and everyone in between who uses Git hosting platforms. Understanding these practices will give us insight into how people are attempting to expand the open movement by increasing access to their data, their results, and the opportunity to build on tested ideas.

As a part of this project’s larger scope, understanding scholars’ workflows with Git hosting platforms will make us more aware of what is important to save and preserve for future scholars. For now, we will dive into the common Git hosting platforms that have been used as tools to conduct scholarly tasks, which in turn adds to the world of open knowledge.

Git hosting platforms used in academia

There are at least seventeen Git-specific hosting platforms available on the World Wide Web where developers can upload, store, share, and collaborate on codebases. Since 2011, usage of Git hosting platforms has increased because funding agencies like the National Institutes of Health, the National Science Foundation (Rubenstein 2012), and the National Endowment for the Humanities mandated that publications and authors make raw data and/or source code available on an openly accessible server. Soon after, many other public and private investors followed suit, and the mandate expanded.

The most popular platforms among scholars are those that offer freemium accounts and/or free hosting functionality for those with a `.edu` email account. Based on a preliminary scan of papers and blog posts, GitHub, Bitbucket, GitLab, and SourceForge are by far the most popular Git hosting platforms among academics. Funding agencies do not require the use of these particular platforms, but researchers clearly favor them, for a variety of reasons, as places to store, publish, and give readers access to their code and data. A survey asking scholars why they choose one platform over another could help pinpoint the reasoning behind their Git host of choice. This is a research method that we will carry out throughout this coming year.

Photo of GitHub Headquarters library.
GitHub headquarters’ library in San Francisco, CA. Could the painting of Octocat the mascot be the popularity vote from scholars? ("GitHub" photographed by Ben Nuttall licensed under CC BY-SA)

One theory about GitHub’s popularity is that it is the oldest of the bunch. Founded in 2008, GitHub gained wide popularity when Wired.com published "Lord of the Files: How GitHub Tamed Free Software (And More)" in 2012, profiling the company’s startup history, swanky office space, and mission to drive open source software visibility. The "authors used GitHub as the platform for the writing of their article as an experiment" which sparked a trend for researchers to use GitHub as a tool for the entire workflow of research, writing, publishing, and collaboration. Since then, more blog posts and articles have appeared in which scholars recount their positive and negative experiences with GitHub as their productivity tool of choice, alongside GitLab and Bitbucket. It is important to note that Microsoft announced its acquisition of GitHub in June 2018, which might have an impact on the large academic user community on GitHub. At the same time, the news may have prompted migrations to other hosting platforms, since the features offered in GitLab and Bitbucket are comparable.

Chart of GitHub imports into GitLab from May to June 2018
@MovingToGitLab on Twitter was created the same month Microsoft announced it was buying GitHub. It is dedicated to migrating users from other Git hosting platforms into GitLab. This chart shows that between May 26 and June 6, 2018, GitLab received between zero and 113,900 repositories imported daily from GitHub alone. (#movingtogitlab blog post 2018)

Common scholarly Git experiences

Many recently published scholarly papers include a line similar to this one: "All our source code is available on GitHub, to allow the community to reproduce our results, from the training of the networks, until the statistical analyses." (Perez 2019) A footnote or bibliographic reference then gives the URL of the corresponding GitHub page.

Screenshot of source code in GitHub blurb
A screenshot of the source code blurb from Perez 2019, p. 2. A variation of this note is left for readers and researchers to view and reference in nearly all published papers involving code, data, and/or software.

This tells readers where to find the code and data used in the study. It is also the most obvious record of Git hosting platforms continuously appearing in scholarship. Ideally, a diligent researcher will include a README Markdown file that documents the repository’s contents and gives instructions on how to access the code and data, but a README still tells only part of the story within the repository. It does not reveal much about how the researchers used the hosting platform to carry out their experiments or analyses. Like many creators, scholars are rarely compelled to immediately write out the full methodology behind their computation and Git commits; creators are not ones to diligently document and preserve their work during the creation stage. Reflection is usually a postmortem activity, and we’ve only found a handful of blog posts and journal articles that are explicitly about GitHub (but not sponsored by GitHub) and that detail the benefits of using the hosted version control platform and its features. This will be covered in more depth in the coming posts.
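As a rough illustration of the kind of README described above, the following Python sketch assembles a minimal Markdown README from a description of a repository's contents and a list of access steps. The function name `build_readme`, the section headings, and the sample file names are hypothetical choices made for this example; they are not a standard prescribed by any hosting platform or funding agency.

```python
from typing import Mapping, Sequence


def build_readme(title: str, summary: str,
                 contents: Mapping[str, str],
                 steps: Sequence[str]) -> str:
    """Return Markdown text for a minimal research-repository README.

    All section names here are illustrative assumptions, not a standard.
    """
    lines = [f"# {title}", "", summary, "", "## Repository contents", ""]
    # One bullet per file or folder, with a short note on its role.
    for path, role in contents.items():
        lines.append(f"- `{path}`: {role}")
    lines += ["", "## Accessing the code and data", ""]
    # Numbered steps so a reader can follow them in order.
    for number, step in enumerate(steps, start=1):
        lines.append(f"{number}. {step}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical project details, used only to show the output shape.
    readme = build_readme(
        title="Example study",
        summary="Code and data accompanying a hypothetical paper.",
        contents={
            "data/": "raw data files used in the analysis",
            "analysis.py": "script that produces the reported results",
        },
        steps=[
            "Clone this repository with `git clone <repository URL>`.",
            "Run `python analysis.py` from the repository root.",
        ],
    )
    print(readme)
```

Saving the returned text as `README.md` at the top level of a repository is the usual way to have a hosting platform display it on the repository's landing page.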

In information science (e.g. libraries, archives, and data science), the architecture of documents and genres is significant for preservation, access, retrieval, and reference. This means that any research domain involving data gathering and analysis, especially in STEM and the digital humanities, will "develop special kinds of documents as adaptations to their specific needs" (Hjorland 2002). In this case, researchers are creating Git repositories on a Git hosting platform as a kind of document, which means there are many more diverse approaches to the use of Git hosting platforms than the one statement mentioned above: "All our source code is available on GitHub..." Knowing this, my next post will dive into the current state of the art of the scholarly experience on Git hosting platforms: their "communicative purposes and functions, their elements and composition and their potential values in information retrieval" (Hjorland 2002).

Bibliography

Hjorland, B. (2002). Domain Analysis in Information Science: Eleven Approaches - Traditional As Well As Innovative. Journal of Documentation, 58(4), 422-462.

Rodriguez, C. O. (2018). The #shitholes Syllabus: Undoing His(story). Radical Teacher: A Socialist, Feminist, and Anti-Racist Journal on the Theory and Practice of Teaching, 111. https://doi.org/10.5195/rt.2018.456

Perez, F., Avila, S., & Valle, E. (2019). Solo or Ensemble? Choosing a CNN Architecture for Melanoma Classification. arXiv:1904.12724v1 [cs.CV] 29 Apr 2019

Rubenstein, M. A. (2012, October 4). Dear Colleague Letter - Issuance of a new NSF Proposal & Award Policies and Procedures Guide [Letter]. Retrieved from https://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp?WT.mc_id=USNSF_109