We are excited to share with you a new series of Lab Notes detailing some of the more technical and hands-on aspects of our project. In a series of posts, I will share insights from my work using a few web archiving tools to capture git repositories, including Archive-It, Webrecorder, and Memento Tracer. I was previously the Web Archiving Fellow, and later the Web Archiving Technician, at NYARC, so I am really excited to experience web archiving workflows from a software preservation perspective. Today’s post will be on Archive-It, a subscription-based tool from the Internet Archive often used by GLAMs and governmental institutions for web archiving.
The time has come: survey time! Do you work in an academic institution of any kind? Have you been introduced to Git and/or actively use Git? Or maybe you have heard about Git and want to try it, but haven’t found the time?
We want to hear from you! From minimal to heavy users, this survey is for you! Please help us understand how people like you use Git and/or source code hosting platforms (e.g. GitLab, GitHub, BitBucket, etc.). We’re interested in using this data to strengthen training initiatives and improve the overall experience of using version control, which can have a steep learning curve for many.
It has been a few months since I last posted, but the open research movement has not stopped even in today's pandemic. Let's start this post on a light note with yesterday's quarantine experience:
Now onto the good stuff. To date, I have discussed other features of these platforms, such as community building, education, and method tracking (Part I and Part II). This post builds on that work and focuses on quality assurance.
My previous post discusses communities of practice that support software preservation and sustainability and demonstrates that software, and the source code behind it, is indeed part of our cultural heritage. Yet preserving this material is not trivial, as source code is spread across various platforms and infrastructures, often migrating from one to another. Millions of projects, for example, are currently hosted on GitHub, GitLab, and Bitbucket. Many of these platforms offer neither a long-term preservation plan nor any guarantee that they will not cease operation, as happened with Google Code and Gitorious. As a result, repositories are at risk in ways that many users do not anticipate and for which they do not (or cannot) prepare. While GitHub's Archive Program is a step in the right direction, it is, in many ways, predicated on the "set and forget" model of long-term storage of select repositories rather than long-term preservation of all repositories. In an effort to create stable archives, several projects have developed solutions and workflows aimed at saving software (and its version and project histories) hosted on Git platforms. These projects have built the infrastructure needed to save software at the institutional/organizational level and also on a larger scale. In what follows, I organize these efforts into three main approaches and provide examples to illustrate how each works. The first is the large-scale capture of event data about repositories carried out by GHTorrent and GHArchive. The second focuses on institutional/organizational solutions, put forward by the Software Archiving of Research Artifacts (SARA) initiative, that help researchers capture their own software and scholarly outputs. The third is the effort, spearheaded by Software Heritage, to create a comprehensive archive of all the world’s software.