
Package loaders should write extrinsic metadata on directories instead of revisions/releases
Closed, Migrated; Edits Locked

Description

Currently, loaders write release metadata on revision objects (and we were planning to write it on release objects once T1258 is solved).

However, @rdicosmo pointed out that writing the metadata on directories would make it more useful: it could then be accessed using the hash of the directory (which is an intrinsic id), without knowing the hash of the synthetic revision.

There is an issue to work around before we can do this: if two releases have the exact same directory, the loader would write both metadata entries on the same directory with the same discovery date. One of them would then be lost, because (id, discovery_date, authority, fetcher) is a unique index on the metadata storage. We need a way to keep both entries.
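To make the collision concrete, here is a minimal sketch that models the unique index as a plain Python dict. The `store` dict, the `write_metadata` helper, the placeholder directory id, and the authority/fetcher names are all hypothetical; this is not the swh-storage API. Which of the two entries survives is also an assumption (here the first one is kept).

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for the metadata storage, keyed by the
# unique index described above: (id, discovery_date, authority, fetcher).
store = {}


def write_metadata(target_id, discovery_date, authority, fetcher, metadata):
    """Insert one metadata entry, honouring the unique index."""
    key = (target_id, discovery_date, authority, fetcher)
    # Assumption: on conflict the existing row is kept. A backend could
    # just as well overwrite it; either way only one entry survives.
    store.setdefault(key, metadata)


directory_id = "dir:0123abcd"  # placeholder, not a real SWHID
discovery_date = datetime(2020, 10, 6, tzinfo=timezone.utc)

# Two releases that unpack to the exact same directory, loaded in one visit.
write_metadata(directory_id, discovery_date, "pypi", "package-loader",
               b'{"version": "1.0"}')
write_metadata(directory_id, discovery_date, "pypi", "package-loader",
               b'{"version": "1.0.post1"}')

# Both writes map to the same key, so the store holds a single entry.
print(len(store))
```

When the target was a synthetic revision, each release had its own id, so the two writes could not collide in this way.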

Possible solutions:

  1. change the semantics of the discovery date, so that the two entries have different ones
  2. add the context to the metadata storage's unique key
  3. use a different unique key altogether (perhaps a randomly generated UUID)
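Options 2 and 3 can be sketched as alternative key functions. This is only an illustration: the function names and the `release` context field are hypothetical, and nothing here reflects how swh-storage actually builds its keys.

```python
import uuid
from datetime import datetime, timezone


def key_with_context(target_id, discovery_date, authority, fetcher, context):
    """Option 2: extend the unique key with the entry's context."""
    # The context is a dict, so it is turned into a sorted tuple of items
    # to make it hashable and independent of insertion order.
    return (target_id, discovery_date, authority, fetcher,
            tuple(sorted(context.items())))


def key_random():
    """Option 3: ignore the entry's fields and draw a random UUID."""
    return uuid.uuid4()


directory_id = "dir:0123abcd"  # placeholder, not a real SWHID
date = datetime(2020, 10, 6, tzinfo=timezone.utc)

# Same directory, same date, but the (hypothetical) context differs,
# so the two entries no longer collide.
key_a = key_with_context(directory_id, date, "pypi", "package-loader",
                         {"release": "1.0"})
key_b = key_with_context(directory_id, date, "pypi", "package-loader",
                         {"release": "1.0.post1"})
print(key_a != key_b)
```

One trade-off visible in the sketch: a random key never collides, but writing the same entry twice then produces two rows, so the write is no longer idempotent.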

None of these solutions seems great IMO.

https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html#artifact-metadata

Event Timeline

vlorentz renamed this task from Package loaders write extrinsic metadata on directories instead of revisions/releases to Package loaders should write extrinsic metadata on directories instead of revisions/releases. Oct 6 2020, 10:30 AM
vlorentz triaged this task as Normal priority.
vlorentz created this task.

Alternatively, we could keep writing the metadata on revision/releases, and use the provenance service (when it's ready) to find them from a directory SWHID. What do you think?

The suggestion was to have extrinsic metadata on directories that come from a deposit of a bundle (e.g. .tar.gz or .zip file coming from HAL), instead of on a synthetic revision as is currently the case, so they can be accessed knowing the hash of the directory (which is an intrinsic id).

The issue pointed out here seems to be related to how this is currently implemented. It would be useful to see a full example.

FTR, olasd, douardda and I discussed an inconsistency in the keys used in Kafka, and decided to use hashes for all origins/visits/visit statuses; doing the same for extrinsic metadata in both Kafka and the DB solves the issue of defining uniqueness.
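A rough sketch of what a hash-based key could look like: the id is derived from every field of the entry, so two entries on the same directory with the same discovery date still get distinct keys as long as any field differs. The serialization (sorted JSON) and the hash function (SHA-1) are assumptions made for illustration, as are the field names and the `metadata_id` helper; the thread does not specify the actual scheme.

```python
import hashlib
import json


def metadata_id(entry: dict) -> str:
    """Derive a key from the full content of a metadata entry."""
    # Assumption: a canonical JSON dump hashed with SHA-1. The real
    # serialization format and hash are not given in this thread.
    canonical = json.dumps(entry, sort_keys=True).encode("utf-8")
    return hashlib.sha1(canonical).hexdigest()


base = {
    "target": "dir:0123abcd",  # placeholder, not a real SWHID
    "discovery_date": "2020-10-06T00:00:00+00:00",
    "authority": "pypi",
    "fetcher": "package-loader",
}
entry_a = {**base, "metadata": '{"version": "1.0"}'}
entry_b = {**base, "metadata": '{"version": "1.0.post1"}'}

# Different content gives different keys, so both entries are kept.
print(metadata_id(entry_a) != metadata_id(entry_b))
# Identical content gives the same key, so re-writing an entry is idempotent.
print(metadata_id(entry_a) == metadata_id(dict(entry_a)))
```

Under this scheme, two entries collide only if they are identical in every field, in which case dropping one loses no information.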