
DB schema: add directory→tarball provenance information
Closed, Migrated

Description

We cannot reproduce a bit-for-bit identical tarball (or any other archive format: dsc, jar, zip, etc.) from the information stored in the DB, nor do we want to, given how brittle that would be.
Instead, we are going to store tarballs $somewhere in the SWH objstorage before unpacking them.

But we still want to be able to link a directory in the DB to the tarball that led to it.
To that end, we need to add a table mapping directory entries (the dirs obtained by unpacking) to content entries (the original tarballs).

Event Timeline

zack raised the priority of this task from to Normal.
zack updated the task description.
zack added a project: Developers.
zack added a subscriber: zack.

We should make sure our abstraction works for cases where the original artifact consists of several files (e.g. a Debian source package).

A few proposals:

The most generic approach is linking a virtual "directory" containing all the original artifacts, complete with their names/relative paths, to the directory resulting from unpacking the artifact.

  • Pros: this lets us reference arbitrarily complex original artifacts, and reuses our existing "directory" abstraction as a "set of named files".
  • Cons: this makes us generate "dangling" directories and complicates the single-original-artifact case (e.g. tarballs).

A more ad-hoc approach is creating an "original artifacts" table with the following columns (and one entry per file in each original artifact):

  • uncompressed directory
  • original artifact (content / skipped_content id)
  • original artifact relative name

  • Pros: doesn't clutter our directory/directory entry table
  • Cons: less code reuse for the original artifact import

This naive schema breaks when several original artifacts yield the same directory, so we probably need to add an "original artifact id".
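A minimal sketch of that failure mode, using plain tuples in place of table rows. All identifiers and file names below are hypothetical placeholders, not real archive objects.

```python
from collections import defaultdict

# Hypothetical, shortened identifier for one unpacked directory.
DIR = "dir-aaaa"

# Naive schema: (uncompressed directory, original artifact content, relative name).
# A two-file Debian source package and a standalone tarball unpack to the same tree.
naive_rows = [
    (DIR, "content-1", "foo_1.0.orig.tar.gz"),
    (DIR, "content-2", "foo_1.0-1.dsc"),
    (DIR, "content-3", "foo-1.0.tar.gz"),
]

def group_naive(rows):
    """Group file names by directory, which is all the naive schema allows."""
    groups = defaultdict(list)
    for directory, _content, name in rows:
        groups[directory].append(name)
    return dict(groups)

# Same rows with an explicit "original artifact id" as the first column.
rows_with_id = [
    (1, DIR, "content-1", "foo_1.0.orig.tar.gz"),
    (1, DIR, "content-2", "foo_1.0-1.dsc"),
    (2, DIR, "content-3", "foo-1.0.tar.gz"),
]

def group_by_artifact(rows):
    """Group file names by (artifact id, directory), keeping artifacts apart."""
    groups = defaultdict(list)
    for artifact_id, directory, _content, name in rows:
        groups[(artifact_id, directory)].append(name)
    return dict(groups)
```

With the naive rows, the three files collapse into a single group and there is no way to tell which ones belong together; with the extra id column, the two artifacts stay distinct.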

We could do a hybrid approach where an original artifact could either be a content (single-file original artifact) or a directory (multi-file original artifact).

We're settling on a new table:

original_artifact

  • id bigserial primary key
  • directory sha1_git references directory(id)
  • type enum ('archive', 'dsc', ...) // ?

original_artifact_content

  • id bigint references original_artifact(id)
  • content sha1_git // sha1_git is not the primary key for content/skipped_content; it was chosen for consistency with directory_entry_file's foreign key. We could use sha1 instead.
  • path bytea

We would register each directory produced by an original artifact in the original_artifact table.

Each file from the original artifact would be registered in our current content tables (currently, with an 'absent' status) if it doesn't exist yet. It would then be associated with the directory by adding an entry in original_artifact_content, with its relative path (we need to register the path for composite artifacts).
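To make the registration flow concrete, here is a rough sketch of the two proposed tables and a lookup by directory. It uses Python's sqlite3 with an in-memory database purely for illustration: the real DB is PostgreSQL, so the column types are approximations, the foreign key to directory(id) is left out because the sketch has no directory table, and the helper names register_artifact and artifacts_for_directory are made up for this example. The hashes are placeholder bytes.

```python
import sqlite3

SCHEMA = """
CREATE TABLE original_artifact (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,  -- bigserial in the proposal
    directory BLOB NOT NULL,                      -- sha1_git of the unpacked directory
    type      TEXT NOT NULL                       -- enum ('archive', 'dsc', ...) in the proposal
);
CREATE TABLE original_artifact_content (
    id      INTEGER NOT NULL REFERENCES original_artifact(id),
    content BLOB NOT NULL,                        -- sha1_git of the original file
    path    BLOB NOT NULL                         -- relative path within the artifact
);
"""

def register_artifact(conn, directory, artifact_type, files):
    """Record one original artifact and its files; return the artifact id."""
    cur = conn.execute(
        "INSERT INTO original_artifact (directory, type) VALUES (?, ?)",
        (directory, artifact_type),
    )
    artifact_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO original_artifact_content (id, content, path) VALUES (?, ?, ?)",
        [(artifact_id, content, path) for content, path in files],
    )
    return artifact_id

def artifacts_for_directory(conn, directory):
    """Return (artifact id, type, content, path) rows for one directory."""
    return conn.execute(
        "SELECT a.id, a.type, c.content, c.path"
        " FROM original_artifact a"
        " JOIN original_artifact_content c ON c.id = a.id"
        " WHERE a.directory = ? ORDER BY a.id, c.path",
        (directory,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

directory = bytes.fromhex("11" * 20)  # placeholder sha1_git, not a real object
dsc_id = register_artifact(conn, directory, "dsc", [
    (bytes.fromhex("22" * 20), b"foo_1.0-1.dsc"),
    (bytes.fromhex("33" * 20), b"foo_1.0.orig.tar.gz"),
])
rows = artifacts_for_directory(conn, directory)
```

Registering a second artifact for the same directory simply yields a new id, so the composite (dsc) case and the single-tarball case go through the same code path.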

We could also merge the two tables into a single one.

zack claimed this task.

This was fixed a while ago, for both Debian and tarball ingestion.

The provenance information is stored in the metadata column of the revision table, and associated with the synthetic commit that corresponds to the archive. For tarballs, the provenance metadata look like this:

{
  "original_artifact" : [
    {
      "name" : "dejagnu.texi.tar.gz",
      "sha1" : "9a5380aa7c2a9fb7f84036b223cbde07d9db7e67",
      "sha256" : "68f5a356bae22bfef8d09c6314d2c508016507b197e298bcb210a2cdf674cb93",
      "sha1_git" : "26a738cc74f236e7e2e1ee1a00817dfe5716aa2d",
      "archive_type" : "tar"
    }
  ]
}

(length is missing there, but that's a bug; see: T339)

For Debian packages the inner array has multiple elements, one for each source package component (.dsc, .orig.tar.gz, etc.).
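As an illustration of how such an entry can be derived from a file's bytes, here is a small sketch. It is not the loader's actual code: the function names are invented for this example, and it assumes that sha1_git is the standard Git blob hash (SHA-1 over a "blob <length>\0" header followed by the data). It emits only the fields shown in the sample above.

```python
import hashlib

def original_artifact_entry(name, data, archive_type):
    """Build one entry of the "original_artifact" list for an in-memory file."""
    # Git blob hash: SHA-1 over a "blob <length>\0" header plus the raw data.
    header = b"blob %d\x00" % len(data)
    return {
        "name": name,
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "sha1_git": hashlib.sha1(header + data).hexdigest(),
        "archive_type": archive_type,
    }

def provenance_metadata(entries):
    """Wrap artifact entries in the structure stored in the metadata column."""
    return {"original_artifact": list(entries)}

# Hypothetical usage with an empty file standing in for a real tarball.
metadata = provenance_metadata([original_artifact_entry("example.tar", b"", "tar")])
```

For a Debian source package, the same helper would be called once per component and all the entries passed to provenance_metadata together.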

Note that we don't store the tarballs yet, though (even though we haven't deleted any of them so far).
Doing so will be tracked in a separate task.

olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:05 PM