Programmatically Capturing Software

My previous post discusses communities of practice that support software preservation and sustainability and demonstrates that software, and the source code behind it, is indeed part of our cultural heritage. Yet preserving this material is not trivial: source code is spread across various platforms and infrastructures, often migrating from one to another. Millions of projects, for example, are currently hosted on GitHub, GitLab, and Bitbucket. Many of these platforms have neither a long-term preservation plan nor any guarantee that they will not cease operation, as happened with Google Code and Gitorious. As a result, repositories are at risk in ways that many users do not anticipate and for which they do not (or cannot) prepare. While GitHub's Archive Program is a step in the right direction, it is, in many ways, predicated on a "set and forget" model of long-term storage for select repositories rather than long-term preservation of all repositories. In an effort to create stable archives, several projects have developed solutions and workflows aimed at saving software (and its version and project histories) hosted on Git hosting platforms. These projects have built the infrastructure needed to save software at the institutional or organizational level as well as at a much larger scale. In what follows, I organize these efforts into three main approaches and provide examples to illustrate how each works. The first is the large-scale capture of event data about repositories carried out by GHTorrent and GHArchive. The second focuses on institutional and organizational solutions, put forward by the Software Archiving of Research Artifacts (SARA) initiative, that help researchers capture their own software and scholarly outputs. The third, spearheaded by Software Heritage, is the effort to build a comprehensive archive of all the world's software.
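To make the first of these approaches concrete, below is a minimal sketch of what event-level capture can look like in practice. It downloads a single hourly dump of public GitHub events from GH Archive and tallies the event types it contains. The URL pattern (data.gharchive.org/YYYY-MM-DD-H.json.gz, gzip-compressed, newline-delimited JSON) follows GH Archive's published documentation; the specific date and hour are arbitrary examples, and a real harvesting pipeline would iterate over many such files rather than just one.

```python
import gzip
import json
import urllib.request
from collections import Counter

# One hourly dump of public GitHub events from GH Archive.
# The date and hour below are arbitrary examples.
ARCHIVE_URL = "https://data.gharchive.org/2020-01-01-15.json.gz"


def count_event_types(url: str) -> Counter:
    """Download one hourly GH Archive dump and tally GitHub event types."""
    counts = Counter()
    with urllib.request.urlopen(url) as response:
        # The dump is gzip-compressed, newline-delimited JSON: one event per line.
        with gzip.GzipFile(fileobj=response) as stream:
            for line in stream:
                event = json.loads(line)
                counts[event.get("type", "unknown")] += 1
    return counts


if __name__ == "__main__":
    for event_type, count in count_event_types(ARCHIVE_URL).most_common(10):
        print(f"{event_type}: {count}")
```

GHTorrent and GHArchive run this kind of harvesting continuously and across the whole platform, which is what turns a stream of individual events into datasets that researchers and archivists can actually work with.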

Read more…

Towards Greater Software Sustainability

Research relies on software and computational methods. Yet software developed across different disciplines has only recently been discussed in terms of its intellectual contributions and its status as a scholarly research output. Data, by contrast, has seen its profile rise in academia thanks to research data management roles, Data Management Plans (DMPs), data curation protocols, and the FAIR principles, and more generally through discussions of "big data", data journalism, and data science. Software has lagged behind in recognition, policy, and professional roles specifically dedicated to its curation and preservation, both within and beyond the library field. The source code behind software, however, can serve as an entry point, one that makes otherwise inaccessible information visible. This blog post discusses what (and who) is moving the conversation forward on recognizing software as a form of knowledge production, and highlights groups and organizations that are actively building communities of practice around software sustainability and preservation.

Read more…

Defining Scholarly Ephemera

The IASGE project has been busy writing updates for our community, participating in webinars, and attending conferences. We recently hit the project's six-month mark and have made significant progress researching the ways in which source code produced by the scholarly community can be archived and preserved for future (re)use. In recent discussions with colleagues about our project, we have been careful to stress the importance of saving the contextual, scholarly ephemera associated with source code, not just the source code itself. But what exactly do we mean by "scholarly ephemera"? To answer this question, we wanted to take a minute to write out our definition, explain why we are seeking to archive it as part of the scholarly record, and elaborate on why it is an important way to understand source code more fully.

Read more…