specify a manifest format for documenting archived software
Closed, Migrated

Description

To support scientists who want to document the software relevant to a given paper, we want a manifest format capable of fully describing software archived by Software Heritage.

In essence, a manifest should point to Software Heritage objects using the URI scheme of T335.
Beyond that, different levels of detail are possible:

  • (minimum detail) just point to a manifest file archived somewhere (possibly on SWH), using a manifest ID
  • point to a directory
  • fully describe the directory content, with a pair <pathname, content id> for each file
  • (maximum detail) as above, plus a revision history describing each revision in full ← this will be very large for long histories

In addition to the above, various kinds of metadata could be added:

  • SWH-specific metadata: when and where the archived code has been found
  • user-provided metadata, submitted to SWH at the time of ingestion request (e.g., Dublin Core, paper references, etc.)
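
As a rough illustration of the "fully describe the directory content" level, the sketch below walks a directory and emits one <pathname, content id> pair per file. The function names and the choice of a Git-style blob hash as the content id are assumptions made for this example only; the actual identifiers are whatever the URI scheme of T335 defines.

```python
import hashlib
import os
from typing import List, Tuple


def content_id(data: bytes) -> str:
    """Hypothetical content id: a Git-style blob SHA-1 (assumed, not specified here)."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def directory_listing(root: str) -> List[Tuple[str, str]]:
    """Return sorted <pathname, content id> pairs for every file under root."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            with open(full_path, "rb") as f:
                cid = content_id(f.read())
            # Store paths relative to root, with "/" separators on every platform.
            rel_path = os.path.relpath(full_path, root).replace(os.sep, "/")
            entries.append((rel_path, cid))
    return sorted(entries)
```

Sorting the pairs keeps the listing independent of filesystem traversal order, which matters if the listing is later hashed or compared.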

Event Timeline

Depending on the form of the archived code, and the needs of the users who reference it, a variety of different references may be needed.
Here is the original list from the team meeting, which complements the one mentioned in the ticket.

  • Raw hash (optional): the hash of the physical binary object ingested, if it is an archive (tar, zip, etc.)
  • Tree hash: hash of the tree object in our store containing the directory with the (extracted) source code
  • Commit hash: hash of the commit object in our store containing the specified commit in the ingested VCS
  • Manifest: extended document containing a variety of detailed information (some of which is optional)
    • the ls -R of (the current version of) the source code directory, with hashes of the content files
    • raw hash
    • tree hash
    • commit hash
    • the list of all branches in the VCS, with their hashes
    • origin
    • timestamp of SWH crawling/visit
  • Hash of the manifest

We decided to provide *all of these*, and we will *recommend* storing the manifest alongside publications for scientific reproducibility.

Important: we want to provide this information *for all the content stored in SWH*.
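
To make the manifest and "hash of the manifest" items concrete, here is a minimal sketch that assembles the fields listed above into one document and hashes a canonical serialization of it. The field names, the JSON encoding, and the use of SHA-256 are all assumptions for illustration; nothing in this task fixes them.

```python
import hashlib
import json
from typing import Dict, List, Optional, Tuple


def build_manifest(
    listing: List[Tuple[str, str]],
    tree_hash: str,
    origin: str,
    visit_timestamp: str,
    raw_hash: Optional[str] = None,
    commit_hash: Optional[str] = None,
    branches: Optional[Dict[str, str]] = None,
) -> dict:
    """Assemble a manifest document; all field names here are hypothetical."""
    manifest = {
        "listing": [[path, cid] for path, cid in listing],
        "tree_hash": tree_hash,
        "origin": origin,
        "visit_timestamp": visit_timestamp,
    }
    # Optional fields are left out entirely when they do not apply,
    # e.g. no raw hash for code that was not ingested as an archive.
    if raw_hash is not None:
        manifest["raw_hash"] = raw_hash
    if commit_hash is not None:
        manifest["commit_hash"] = commit_hash
    if branches is not None:
        manifest["branches"] = branches
    return manifest


def manifest_hash(manifest: dict) -> str:
    """Hash a canonical JSON form (sorted keys, compact separators) with SHA-256."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Serializing with sorted keys means two manifests with the same content produce the same hash regardless of the order in which their fields were written, so the hash can serve as a stable reference to the manifest itself.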

olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:09 PM
zack claimed this task.
zack added a subscriber: moranegg.

The identification part of this task was completed by documenting and implementing our PIDs; the rest is better suited to the software citation work on which @moranegg is actively working.