Defining Scholarly Ephemera

The IASGE project has been busy writing updates for our community, participating in webinars, and attending conferences. We recently hit the six-month mark on the project and have made significant progress researching the ways in which source code produced by the scholarly community can be archived and preserved for future (re)use. In recent discussions about our project with colleagues, we have been careful to stress the importance of saving the contextual, scholarly ephemera associated with source code, not just the source code itself. But what exactly do we mean by "scholarly ephemera"? To answer this question, we wanted to take a minute to write out our definition, explain why we are seeking to archive it as part of the scholarly record, and elaborate on why it is an important way to understand source code more fully.

Git hosting platforms (GHPs) such as GitHub, GitLab, and Bitbucket are well-known places to host source code and make it available under a (hopefully permissive) license. With each repository, however, there is also rich material that provides context on how the source code was developed, including content that offers insight into the genesis of a project, communications between its contributors and collaborators, and the procedures and interactions that brought it to its current state. This context is important for understanding not only the history of a repository, but also how one repository might relate to another, how members of each repository branch out and form networks, and how this information can be used to track derivatives of current work.

Scholarly ephemera as it relates to Git, however, is not currently being archived. It mainly exists on the GHPs themselves and is not part of the Git repositories that one can clone. Quinn Dombrowski, Academic Technology Specialist at Stanford University, notes in her blog post "Dissemination and valuation of non-traditional forms of scholarship" that there is an interest in scholarly ephemera (which she defines as "notes, data sets, etc.") generally in the academy but that it "face[s] some of the same challenges as born digital non-traditional forms of scholarship: there are systems in place to preserve and disseminate printed works, but the resources needed to provide the same level of access to other materials are often limited or entirely unavailable." This is especially true for the scholarly ephemera on GHPs, which is often viewed as secondary, or tertiary, to the code and not vital to the proper functioning of any resulting software.

However, we know it doesn't have to be this way! This information is valuable and we believe that it should be preserved and integrated into the scholarly record as important context for source code. We see a future in which researchers and scholars can access and view, for instance, threads on issues in a repository as equally valid primary source material for their work. In such a future, source code and its ephemera are recognized as distinct forms of knowledge production. So now that we have told you why scholarly ephemera is important, let's look at four specific examples: pull requests, issues, project boards, and wikis, all of which are defined below.

Pull Requests: Represented in Git as `git-request-pull`, these are referred to as Pull Requests on GitHub and Bitbucket and as Merge Requests on GitLab. This type of ephemera records the process of proposing changes to a repository: a user submits the changes, and the repository's owner and/or its collaborators either accept or reject them. It is a collaborative feature explicitly designed to allow potential collaborators to propose code changes and to communicate with members of a project. Once a pull request is opened, the project team members (e.g. core maintainers of a repository) can discuss and review the potential changes with collaborators before merging them into the canonical code base. This can be considered an analog to the peer review and editorial process used in scholarly journals. In fact, such code reviews have been used as a peer vetting procedure in research publications as a means of ensuring that "the code is effective, understandable, maintainable, and secure" (Pereira da Cruz, 2019).

Pull requests are important because they provide a record of the interactions between the people involved in the development of source code, as well as insight into how code develops, how code reviews happen, who is most likely to get their code merged (or not), and who does the merging. They act somewhat like an evolutionary tree of the repository, showing the timeline of events. For more on pull requests in educational contexts, see Sarah's earlier post.

Cartoon of the pull request branch structure.
Pull Request (Image Credit: Kenya-Tech)
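
Because the naming differs across platforms, anyone collecting this ephemera has to ask each GHP for it under a different name. The following is a minimal sketch, not part of the IASGE project's tooling, that builds the REST endpoint for listing pull or merge requests on each platform. The function name `pull_request_endpoint` and the example owner and repository names are hypothetical, and the URL patterns assume the public cloud instances of each service; filtering and pagination parameters differ per platform and are left out.

```python
from urllib.parse import quote


def pull_request_endpoint(platform: str, owner: str, repo: str) -> str:
    """Return the REST URL that lists pull/merge requests for a repository.

    Assumes the hosted services (github.com, gitlab.com, bitbucket.org);
    self-hosted instances would use a different base URL.
    """
    if platform == "github":
        # GitHub calls them "pulls" in its REST API.
        return f"https://api.github.com/repos/{owner}/{repo}/pulls"
    if platform == "gitlab":
        # GitLab calls them merge requests and expects the project path
        # ("owner/repo") to be URL-encoded, so "/" becomes "%2F".
        project = quote(f"{owner}/{repo}", safe="")
        return f"https://gitlab.com/api/v4/projects/{project}/merge_requests"
    if platform == "bitbucket":
        # Bitbucket Cloud uses "pullrequests" under a workspace/repo path.
        return f"https://api.bitbucket.org/2.0/repositories/{owner}/{repo}/pullrequests"
    raise ValueError(f"unknown platform: {platform}")


# Hypothetical repository, used only for illustration.
for name in ("github", "gitlab", "bitbucket"):
    print(pull_request_endpoint(name, "example-org", "example-repo"))
```

None of these records come along with `git clone`, which is why they have to be requested from the platform separately.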

Issues: Issues on GHPs are an essential way to collaborate with team members (new or established), define a project's workflow, track tasks, and problem solve. The most common use cases for issues include tracking bugs, reviewing proposed features, idea sharing, and collaboration. As with pull requests, issues can be incorporated into the peer review process, as seen with the Journal of Open Source Software in their joss-reviews repository on GitHub. There are many attributes and metadata attached to issues, including title, description, authors, assignee(s), milestones, and a unique issue number and URL for each issue. Issues are also a way to participate in open source projects in ways that do not require programming experience. This widens the pool of potential participants and opens more ways to contribute to the development of source code.

Screenshot of JOSS' GitHub repository open issues page.
Screenshot of JOSS' open issues (Image Credit: joss-reviews)
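
To make the attributes listed above concrete, here is a small sketch of how a single issue might be captured as a standalone record for archiving. This is an illustration under our own assumptions, not a format any GHP exports: the class name `IssueRecord`, the `comments` field (added to hold the discussion thread), and all example values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class IssueRecord:
    """One issue, using the attributes named in the text above."""

    number: int                      # unique issue number within the repository
    url: str                         # unique URL of the issue
    title: str
    description: str
    author: str
    assignees: List[str] = field(default_factory=list)
    milestone: Optional[str] = None
    # Assumption: the discussion thread is stored as plain strings.
    comments: List[str] = field(default_factory=list)


def to_archive_json(issue: IssueRecord) -> str:
    """Serialize the record as JSON with stable key order."""
    return json.dumps(asdict(issue), indent=2, sort_keys=True)


# Hypothetical example values.
record = IssueRecord(
    number=42,
    url="https://example.org/example-org/example-repo/issues/42",
    title="Clarify installation steps",
    description="The README skips a required dependency.",
    author="reviewer-a",
    assignees=["maintainer-b"],
    milestone="v1.0",
    comments=["Thanks, we will update the README."],
)
print(to_archive_json(record))
```

Keeping the record as plain JSON means it can be stored next to the source code without depending on the platform that produced it.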

Issue Boards (GitLab), Boards (Bitbucket), and Project Boards (GitHub): These are project management tools, built into GHPs, that give developers a way to prioritize issues and pull requests in their repositories. They are often based on the Kanban method of development, which is a "lean method" to manage work across many people/systems. Issue boards can implement the Kanban board approach, which uses user-defined columns with cards (representing issues and pull requests) that can be reordered at will. In the software development context, Kanban boards are often used to manage the time contributors spend on a project (by prioritizing their development queue) and to roadmap future development (e.g. what is needed before the next version). They are excellent features for scholars who want to communicate priorities to their users as well as divide the work and time between all collaborators on a given project. For example, in the screenshot below, a scholar from the ReproServer project has a board column called `A-questions`, indicating the issues on which they need more input before development can proceed:

Screenshot of ViDA-NYU's GitLab repository Issue Boards page.
Screenshot of an active Issue Board with four columns (card statuses from left to right: Open, A-questions, A-pr-exists, and Closed) (Image Credit: VIDA-NYU)
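
The Kanban idea described above (user-defined columns holding cards that can be moved and reordered) can be sketched in a few lines. This is a toy model under our own assumptions, not how any GHP implements its boards: the `Board` class and the card names are hypothetical, while the column names are taken from the screenshot.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Board:
    """A board is an ordered set of named columns, each an ordered list of cards."""

    columns: Dict[str, List[str]] = field(default_factory=dict)

    def add_card(self, column: str, card: str) -> None:
        """Append a card to the bottom of a column, creating the column if needed."""
        self.columns.setdefault(column, []).append(card)

    def move_card(self, card: str, to_column: str, position: Optional[int] = None) -> None:
        """Move a card to another column, optionally at a given position."""
        for cards in self.columns.values():
            if card in cards:
                cards.remove(card)
                break
        else:
            raise KeyError(f"card not on board: {card}")
        target = self.columns.setdefault(to_column, [])
        if position is None:
            target.append(card)
        else:
            target.insert(position, card)


# Column names from the screenshot; card names are made up.
board = Board({"Open": [], "A-questions": [], "A-pr-exists": [], "Closed": []})
board.add_card("Open", "issue-1")
board.add_card("Open", "issue-2")
board.move_card("issue-2", "A-questions")              # needs more input
board.move_card("issue-1", "A-questions", position=0)  # prioritized above issue-2
print(board.columns)
```

The order of cards within a column carries meaning (priority), so an archive of a board would need to keep both the column membership and the ordering.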

Wikis: These are built into GHPs as a way to provide deeper documentation for a repository (also versioned, though not in the Git data format). While a README file is the standard for providing quick information about a repository, wikis provide additional information including, but not limited to, how to use the repository, the architecture of the code, ways to troubleshoot known issues, collected materials about the repository, and longer-form changelogs. For example, the OpenRefine GitHub repository's wiki contains a list of user-submitted tutorials. Wikis are configurable, can contain a sidebar displaying a table of contents, and can include images, figures, and any rich media you might want! They offer latitude in how a repository is described and documented, and multiple ways to present a project to the wider public. They can also be cloned as independent Git repositories that include a full history of changes. A wiki can also be used as an e-lab notebook of sorts. In Vicky's case, for example, she used a wiki as a place where graduate student employees could document their progress in contributing code and research to the ReproZip project.

Screenshot of ViDA-NYU's GitHub repository Wiki page.
Screenshot of ReproZip's Wiki in which student employees log their weekly activities. (Image Credit: VIDA-NYU)
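
Since wikis can be cloned as their own Git repositories, it helps to know where that repository lives. The sketch below derives a wiki's clone URL from the main repository URL, assuming the `<repository>.wiki.git` naming convention used by GitHub and GitLab; Bitbucket and self-hosted setups may differ. The function name `wiki_clone_url` and the example repository are hypothetical.

```python
def wiki_clone_url(repo_url: str) -> str:
    """Derive the wiki's Git URL from a repository URL.

    Assumes the GitHub/GitLab convention of appending ".wiki.git"
    to the repository path.
    """
    base = repo_url.rstrip("/")
    if base.endswith(".git"):
        base = base[: -len(".git")]
    return base + ".wiki.git"


# Hypothetical repository, used only for illustration.
url = wiki_clone_url("https://github.com/example-org/example-repo")
print(f"git clone {url}")
```

The resulting URL can be passed to `git clone` like any other repository, provided the wiki has at least one page.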

We hope that this clarifies a bit of what we mean when we write about and discuss the "scholarly ephemera" held in GHPs. We very much encourage questions and further research suggestions. If you don't see your favorite "ephemera" on our (admittedly non-exhaustive) list above, we'd also love to hear about it and how it's helped your work. Please feel free to submit an issue or merge request to our GitLab repository or email Vicky Steeves to continue the conversation.