specify a manifest format for documenting archived software
To support scientists who want to document the software relevant to a given paper, we want a manifest format capable of fully describing software archived by Software Heritage.

In essence, a manifest should point to Software Heritage objects using the URI scheme of T335.
Beyond that, different levels of detail are possible:

  • (minimum detail) just point to a manifest file archived somewhere (possibly on SWH), using a manifest ID
  • point to a directory
  • fully describe the directory content, with a pair <pathname, content id> for each file
  • (maximum detail) as above + a revision history, describing each revision in full ← this will be crazy-large for long histories
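
As a rough illustration of the third level of detail (one <pathname, content id> pair per file), here is a minimal sketch. It is not the format decided for this task: the function name `describe_directory` is hypothetical, and SHA-256 over the raw file bytes is only a stand-in for whatever content identifier the T335 scheme ends up using.

```python
import hashlib
import os


def describe_directory(root):
    """Return sorted (pathname, content id) pairs for every file under root.

    Assumption: the content id is the SHA-256 hex digest of the file bytes,
    used here only as a stand-in for the real identifier scheme.
    """
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            with open(full_path, "rb") as handle:
                content_id = hashlib.sha256(handle.read()).hexdigest()
            # Record paths relative to root, always with "/" separators.
            relative = os.path.relpath(full_path, root).replace(os.sep, "/")
            entries.append((relative, content_id))
    return sorted(entries)
```

Calling `describe_directory("path/to/source")` on an extracted source tree would give the list that a manifest at this level of detail needs to carry.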

In addition to the above, various kinds of metadata could be added:

  • SWH-specific metadata: when and where the archived code has been found
  • user-provided metadata, submitted to SWH at the time of ingestion request (e.g., Dublin Core, paper references, etc.)

zack created this task. Mar 4 2016, 1:05 PM

Depending on the form of the archived code and the needs of the user who references it, a variety of different references may be needed.
Here is the original list from the team meeting, which complements the one mentioned in the ticket.

  • raw hash (optional): the hash of the physical binary object ingested, if it is an archive (tar, zip, etc.)
  • Tree hash: hash of the tree object in our store containing the directory with the (extracted) source code
  • Commit hash: hash of the commit object in our store containing the specified commit in the ingested VCS
  • Manifest: extended document containing a variety of detailed information (some of which is optional)
    • the ls -R of the (current version of) the source code directory, with hashes of the content files
    • raw hash
    • tree hash
    • commit hash
    • the list of all branches in the VCS, with their hashes
    • origin
    • timestamp of SWH crawling/visit
  • Hash of the manifest

We decided to provide *all of these*, and we will *recommend* storing the manifest alongside publications for scientific reproducibility.

Important: we want to provide this information *for all the content stored in SWH*.
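
To make the list above more concrete, here is a minimal sketch of a manifest carrying those fields, together with a "hash of the manifest". Everything specific in it is an assumption for illustration only: the field names, the JSON serialization, the use of SHA-256, and the helper name `manifest_hash` are hypothetical, and the values are placeholders rather than real archive identifiers.

```python
import hashlib
import json


def manifest_hash(manifest):
    """Hash a manifest through a canonical JSON serialization.

    Assumption: sorted keys and compact separators are used so that the
    same manifest content always produces the same bytes, hence the same hash.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values only; field names are hypothetical.
example_manifest = {
    "raw_hash": None,  # optional: set only when an archive (tar, zip, ...) was ingested
    "tree_hash": "<tree hash>",
    "commit_hash": "<commit hash>",
    "files": [["README", "<content hash>"]],  # the "ls -R" with content hashes
    "branches": {"master": "<commit hash>"},
    "origin": "<origin>",
    "visit_timestamp": "<timestamp of SWH visit>",
}

print(manifest_hash(example_manifest))
```

Serializing with sorted keys means the hash depends only on the manifest content, not on the order in which the fields happen to be written, which matters if the hash is meant to be stored next to a publication and checked later.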

The identification part of this task has been completed by documenting and implementing our PIDs; the rest is better suited to the software citation work on which @moranegg is actively working.