diff --git a/docs/data-model.rst b/docs/data-model.rst index f6e4f06..fc1639d 100644 --- a/docs/data-model.rst +++ b/docs/data-model.rst @@ -1,13 +1,257 @@ .. _data-model: Data model ========== +.. note:: The text below is adapted from §7 of the article `Software Heritage: + Why and How to Preserve Software Source Code + `_ (in proceedings of `iPRES + 2017 `_, 14th International Conference on Digital + Preservation, by Roberto Di Cosmo and Stefano Zacchiroli), which also + provides a more general description of Software Heritage for the digital + preservation research community. + +In any archival project the choice of the underlying data model—at the logical +level, independently from how data is actually stored on physical media—is +paramount. The data model adopted by Software Heritage to represent the +information that it collects is centered around the notion of *software +artifact*, described below. + +It is important to notice that according to our principles, we must store with +every software artifact full information on where it has been found +(provenance), that is also captured in our data model, so we start by providing +some basic information on the nature of this provenance information. + + +Source code hosting places +-------------------------- + +Currently, Software Heritage uses of a curated list of source code hosting +places to crawl. The most common entries we expect to place in such a list are +popular collaborative development forges (e.g., GitHub, Bitbucket), package +manager repositories that host source package (e.g., CPAN, npm), and FOSS +distributions (e.g., Fedora, FreeBSD). But we may of course allow also more +niche entries, such as URLs of personal or institutional project collections +not hosted on major forges. + +While currently entirely manual, the curation of such a list might easily be +semi-automatic, with entries suggested by fellow archivists and/or concerned +users that want to notify Software Heritage of the need of archiving specific +pieces of endangered source code. This approach is entirely compatible with +Web-wide crawling approaches: crawlers capable of detecting the presence of +source code might enrich the list. In both cases the list will remain curated, +with (semi-automated) review processes that will need to pass before a hosting +place starts to be used. + + +Software artifacts +------------------ + +Once the hosting places are known, they will need to be periodically looked at +in order to add to the archive missing software artifacts. Which software +artifacts will be found there? + +In general, each software distribution mechanism hosts multiple releases of a +given software at any given time. For VCS (Version Control Systems), this is +the natural behaviour; for software packages, while a single version of a +package is just a snapshot of the corresponding software product, one can often +retrieve both current and past versions of the package from its distribution +site. + +By reviewing and generalizing existing VCS and source package formats, we have +identified the following recurrent artifacts as commonly found at source code +hosting places. They form the basic ingredients of the Software Heritage +archive. As the terminology varies quite a bit from technology to technology, +we provide below both the canonical name used in Software Heritage and popular +synonyms. + +**contents** (AKA "blobs") + the raw content of (source code) files as a sequence of bytes, without file + names or any other metadata. File contents are often recurrent, e.g., across + different versions of the same software, different directories of the same + project, or different projects all together. + +**directories** + a list of named directory entries, each of which pointing to other artifacts, + usually file contents or sub-directories. Directory entries are also + associated to arbitrary metadata, which vary with technologies, but usually + includes permission bits, modification timestamps, etc. + +**revisions** (AKA "commits") + software development within a specific project is essentially a time-indexed + series of copies of a single "root" directory that contains the entire + project source code. Software evolves when a developer modifies the content + of one or more files in that directory and record their changes. + + Each recorded copy of the root directory is known as a "revision". It points + to a fully-determined directory and is equipped with arbitrary metadata. Some + of those are added manually by the developer (e.g., commit message), others + are automatically synthesized (timestamps, preceding commit(s), etc). + +**releases** (AKA "tags") + some revisions are more equals than others and get selected by developers as + denoting important project milestones known as "releases". Each release + points to the last commit in project history corresponding to the release and + might carry arbitrary metadata—e.g., release name and version, release + message, cryptographic signatures, etc. + + +Additionally, the following crawling-related information are stored as +provenance information in the Software Heritage archive: + +**origins** + code "hosting places" as previously described are usually large platforms + that host several unrelated software projects. For software provenance + purposes it is important to be more specific than that. + + Software origins are fine grained references to where source code artifacts + archived by Software Heritage have been retrieved from. They take the form of + ``(type, url)`` pairs, where ``url`` is a canonical URL (e.g., the address at + which one can ``git clone`` a repository or download a source tarball) and + ``type`` the kind of software origin (e.g., git, svn, or dsc for Debian + source packages). + +.. + **projects** + as commonly intended are more abstract entities that precise software + origins. Projects relate together several development resources, including + websites, issue trackers, mailing lists, as well as software origins as + intended by Software Heritage. + + The debate around the most apt ontologies to capture project-related + information for software hasn't settled yet, but the place projects will take + in the Software Heritage archive is fairly clear. Projects are abstract + entities, which will be arbitrarily nestable in a versioned + project/sub-project hierarchy, and that can be associated to arbitrary + metadata as well as origins where their source code can be found. + +**snapshots** + any kind of software origin offers multiple pointers to the "current" state + of a development project. In the case of VCS this is reflected by branches + (e.g., master, development, but also so called feature branches dedicated to + extending the software in a specific direction); in the case of package + distributions by notions such as suites that correspond to different maturity + levels of individual packages (e.g., stable, development, etc.). + + A "snapshot" of a given software origin records all entry points found there + and where each of them was pointing at the time. For example, a snapshot + object might track the commit where the master branch was pointing to at any + given time, as well as the most recent release of a given package in the + stable suite of a FOSS distribution. + +**visits** + links together software origins with snapshots. Every time an origin is + consulted a new visit object is created, recording when (according to + Software Heritage clock) the visit happened and the full snapshot of the + state of the software origin at the time. + + +Data structure +-------------- + .. _swh-merkle-dag: .. figure:: images/swh-merkle-dag.svg :width: 1024px :align: center Software Heritage archive as a Merkle DAG, augmented with crawling information (click to zoom). + +With all the bits of what we want to archive in place, the next question is how +to organize them, i.e., which logical data structure to adopt for their +storage. A key observation for this decision is that source code artifacts are +massively duplicated. This is so for several reasons: + +* code hosting diaspora (i.e., project development moving to the most + recent/cool collaborative development technology over time); +* copy/paste (AKA "vendoring") of parts or entire external FOSS software + components into other software products; +* large overlap between revisions of the same project: usually only a very + small amount of files/directories are modified by a single commit; +* emergence of DVCS (distributed version control systems), which natively work + by replicating entire repository copies around. GitHub-style pull requests + are the pinnacle of this, as they result in creating an additional repository + copy at each change done by a new developer; +* migration from one VCS to another—e.g., migrations from Subversion to Git, + which are really popular these days—resulting in additional copies, but in a + different distribution format, of the very same development histories. + +These trends seem to be neither stopping nor slowing down, and it is reasonable +to expect that they will be even more prominent in the future, due to the +decreasing costs of storage and bandwidth. + +For this reason we argue that any sustainable storage layout for archiving +source code in the very long term should support deduplication, allowing to pay +for the cost of storing source code artifacts that are encountered more than +once only once. For storage efficiency, deduplication should be supported for +all the software artifacts we have discussed, namely: file contents, +directories, revisions, releases, snapshots. + +Realizing that principle, the Software Heritage archive is conceptually a +single (big) `Merkle Direct Acyclic Graph (DAG) +`_, as depicted in Figure +:ref:`Software Heritage Merkle DAG `. In such a graph each of +the artifacts we have described—from file contents up to entire +snapshots—correspond to a node. Edges between nodes emerge naturally: +directory entries point to other directories or file contents; revisions point +to directories and previous revisions, releases point to revisions, snapshots +point to revisions and releases. Additionally, each node contains all metadata +that are specific to the node itself rather than to pointed nodes; e.g., commit +messages, timestamps, or file names. Note that the structure is really a DAG, +and not a tree, due to the fact that the line of revisions nodes might be +forked and merged back. + +.. + directory: fff3cc22cb40f71d26f736c082326e77de0b7692 + parent: e4feb05112588741b4764739d6da756c357e1f37 + author: Stefano Zacchiroli + date: 1443617461 +0200 + committer: Stefano Zacchiroli + commiter_date: 1443617461 +0200 + message: + objstorage: fix tempfile race when adding objects + + Before this change, two workers adding the same + object will end up racing to write .tmp. + [...] + + revisionid: 64a783216c1ec69dcb267449c0bbf5e54f7c4d6d + A revision node in the Software Heritage DAG + +In a Merkle structure each node is identified by an intrinsic identifier +computed as a cryptographic hash of the node content. In the case of Software +Heritage identifiers are computed taking into account both node-specific +metadata and the identifiers of child nodes. + +Consider the revision node in the picture whose identifier starts with +`c7640e08d..`. it points to a directory (identifier starting with +`45f0c078..`), which has also been archived. That directory contains a full +copy, at a specific point in time, of a software component—in the example the +`Hello World `_ software +component available on our forge. The revision node also points to the +preceding revision node (`43ef7dcd..`) in the project development history. +Finally, the node contains revision-specific metadata, such as the author and +committer of the given change, its timestamps, and the message entered by the +author at commit time. + +The identifier of the revision node itself (`c7640e08d..`) is computed as a +cryptographic hash of a (canonical representation of) all the information shown +in figure. A change in any of them—metadata and/or pointed nodes—would result +in an entirely different node identifier. All other types of nodes in the +Software Heritage archive behave similarly. + +The Software Heritage archive inherits useful properties from the underlying +Merkle structure. In particular, deduplication is built-in. Any software +artifacts encountered in the wild gets added to the archive only if a +corresponding node with a matching intrinsic identifier is not already +available in the graph—file content, commits, entire directories or project +snapshots are all deduplicated incurring storage costs only once. + +Furthermore, as a side effect of this data model choice, the entire development +history of all the source code archived in Software Heritage—which ambitions to +match all published source code in the world—is available as a unified whole, +making emergent structures such as code reuse across different projects or +software origins, readily available. Further reinforcing the Software Heritage +use cases, this object could become a veritable "map of the stars" of our +entire software commons.