Package loaders should write extrinsic metadata on directories instead of revisions/releases
Closed, Migrated

Authored By

	vlorentz
	Oct 6 2020, 10:30 AM

Description

Currently, loaders write release metadata on revision objects (and we were planning to write it on release objects once T1258 is solved).

However, @rdicosmo pointed out that writing the metadata on directories would make it more useful: it could then be accessed using the hash of the directory (which is an intrinsic id), without knowing the hash of the synthetic revision.

There is an issue to work around before we can do this: if two releases have the exact same directory, the loader would write both metadata entries on the same directory with the same discovery date. One of them would be lost, because (id, discovery_date, authority, fetcher) is a unique index in the metadata storage. We need a way to keep both.
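
To make the collision concrete, here is a minimal sketch in plain Python. It does not use the real swh-storage API: `MetadataStore`, its `add` and `get` methods, and all field values are hypothetical stand-ins that only model a unique index on (id, discovery_date, authority, fetcher). The sketch assumes a conflicting write is ignored; if it overwrote instead, one entry would be lost all the same.

```python
from datetime import datetime, timezone


class MetadataStore:
    """Toy store modelling a unique index on
    (id, discovery_date, authority, fetcher). Hypothetical, not swh-storage."""

    def __init__(self):
        self._rows = {}

    def add(self, target_id, discovery_date, authority, fetcher, metadata):
        key = (target_id, discovery_date, authority, fetcher)
        # setdefault keeps the first row for a key: a conflicting write is ignored.
        self._rows.setdefault(key, metadata)

    def get(self, target_id):
        return [m for key, m in self._rows.items() if key[0] == target_id]


store = MetadataStore()
when = datetime(2020, 10, 6, tzinfo=timezone.utc)
directory_id = "swh:1:dir:" + "0" * 40  # placeholder, not a real object

# Two releases whose unpacked content is the exact same directory.
store.add(directory_id, when, "authority", "fetcher", b'{"release": "1.0"}')
store.add(directory_id, when, "authority", "fetcher", b'{"release": "1.0-1"}')

found = store.get(directory_id)
print(len(found))
```

Only one of the two entries survives, which is the loss described above.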

Possible solutions:

change the semantics of the discovery date, so both entries have a different one
add contexts to the metadata storage's unique key
use a different unique key altogether (perhaps a randomly generated UUID)

None of these solutions seems great, IMO.
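
As an illustration of the second option only, the sketch below extends the uniqueness key with a context. The `make_key` helper and the `release` context name are hypothetical, chosen for the example; they are not the storage schema.

```python
from datetime import datetime, timezone


def make_key(target_id, discovery_date, authority, fetcher, context=()):
    """Build a uniqueness key; `context` is an iterable of (name, value) pairs.

    Hypothetical helper: the real unique index is defined in the storage schema.
    """
    return (target_id, discovery_date, authority, fetcher, tuple(sorted(context)))


rows = {}
when = datetime(2020, 10, 6, tzinfo=timezone.utc)
directory_id = "swh:1:dir:" + "0" * 40  # placeholder, not a real object

for release in ("1.0", "1.0-1"):
    key = make_key(
        directory_id, when, "authority", "fetcher",
        context=[("release", release)],
    )
    # Same directory and discovery date, but the context makes the keys differ.
    rows.setdefault(key, {"release": release})

print(len(rows))
```

Both entries are kept here, at the cost of a wider key whose shape depends on which contexts are set.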

https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html#artifact-metadata

Revisions and Commits

rDLDBASE Generic VCS/Package Loader
	D4347	rDLDBASE8f41aeee10c4 package loaders: write original_artifact metadata to directories instead of…
	D4346	rDLDBASEcf58604ccaa4 package loaders: write extrinsic metadata to directories instead of revisions.
rDSTO Storage manager
	D4349	rDSTO6e3e35096f61 migrate_extrinsic_metadata: Write metadata on directories instead of revisions.

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T2668 Package loaders should write extrinsic metadata on directories instead of revisions/releases
Migrated	gitlab-migration	T2703 Use intrinsic identifiers/hashes for RawExtrinsicMetadata objects
Migrated	gitlab-migration	T3017 Use hashes as keys in swh.journal.objects.raw_extrinsic_metadata
Migrated	gitlab-migration	T3018 Allow querying raw_extrinsic_metadata by hash in swh-storage
Migrated	gitlab-migration	T3022 Deduplicate RawExtrinsicMetadata by hash instead of a subset of their fields
Migrated	gitlab-migration	T3019 Add an index for raw_extrinsic_metadata.id in swh.storage.postgresql
Migrated	gitlab-migration	T3020 Add an "index" for raw_extrinsic_metadata.id in swh.storage.cassandra
Migrated	gitlab-migration	T3074 Migrate all packages away from the old SWHID class
Migrated	gitlab-migration	T3034 generalize usage of SWHID for referencing SWH archive objects

Event Timeline

vlorentz renamed this task from Package loaders write extrinsic metadata on directories instead of revisions/releases to Package loaders should write extrinsic metadata on directories instead of revisions/releases. Oct 6 2020, 10:30 AM

vlorentz triaged this task as Normal priority.

vlorentz created this task.

rdicosmo added subscribers: moranegg, olasd. Oct 6 2020, 10:37 AM

vlorentz updated the task description. Oct 6 2020, 10:45 AM

Alternatively, we could keep writing the metadata on revision/releases, and use the provenance service (when it's ready) to find them from a directory SWHID. What do you think?

The suggestion was to have extrinsic metadata on directories that come from a deposit of a bundle (e.g. .tar.gz or .zip file coming from HAL), instead of on a synthetic revision as is currently the case, so they can be accessed knowing the hash of the directory (which is an intrinsic id).

The issue pointed out here seems to be related to how this is currently implemented. It would be useful to see a full example.

@rdicosmo a full example of what?

vlorentz added a subtask: T2686: Use hashes for all kafka keys. Oct 12 2020, 1:06 PM

vlorentz mentioned this in T2686: Use hashes for all kafka keys.

FTR, olasd, douardda and I discussed an inconsistency in the keys used in Kafka, and decided to use hashes for all origins/visits/visit statuses. Doing the same for extrinsic metadata in both Kafka and the DB solves the issue of defining uniqueness.
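
A minimal sketch of that idea, under stated assumptions: the key is a digest computed over every field of the metadata entry, so two entries collide only when they are identical. The `metadata_key` helper, the sorted-key JSON serialization, and the use of SHA-1 are choices made for this example; the actual identifier format is the subject of T2703 and is not reproduced here.

```python
import hashlib
import json


def metadata_key(entry: dict) -> str:
    """Return a hex digest over all fields of the entry.

    Hypothetical helper: sorted-key JSON stands in for a canonical manifest.
    """
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


base = {
    "target": "swh:1:dir:" + "0" * 40,  # placeholder, not a real object
    "discovery_date": "2020-10-06T10:30:00+00:00",
    "authority": "authority",
    "fetcher": "fetcher",
}
entry_a = {**base, "metadata": '{"release": "1.0"}'}
entry_b = {**base, "metadata": '{"release": "1.0-1"}'}

rows = {}
# Writing entry_a twice is deduplicated; entry_b gets its own key.
for entry in (entry_a, entry_b, entry_a):
    rows[metadata_key(entry)] = entry

print(len(rows))
```

With this scheme, two releases sharing a directory and a discovery date keep separate entries as long as any field differs, and exact duplicates are stored once.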

vlorentz mentioned this in T2703: Use intrinsic identifiers/hashes for RawExtrinsicMetadata objects. Oct 14 2020, 2:01 PM

vlorentz removed a subtask: T2686: Use hashes for all kafka keys. Oct 14 2020, 2:08 PM

vlorentz mentioned this in D4346: package loaders: write extrinsic metadata to directories instead of revisions. Oct 23 2020, 4:53 PM

vlorentz added revisions: D4346: package loaders: write extrinsic metadata to directories instead of revisions., D4347: package loaders: write original_artifact metadata to directories instead of revisions. Oct 23 2020, 5:01 PM

vlorentz mentioned this in D4349: migrate_extrinsic_metadata: Write metadata on directories instead of revisions. Oct 23 2020, 5:26 PM

vlorentz added a revision: D4349: migrate_extrinsic_metadata: Write metadata on directories instead of revisions. Oct 23 2020, 5:26 PM

vlorentz added a commit: rDLDBASEcf58604ccaa4: package loaders: write extrinsic metadata to directories instead of revisions. Nov 2 2020, 12:22 PM

vlorentz added a commit: rDLDBASE8f41aeee10c4: package loaders: write original_artifact metadata to directories instead of….

vlorentz added a commit: rDSTO6e3e35096f61: migrate_extrinsic_metadata: Write metadata on directories instead of revisions.

vlorentz closed this task as Resolved. Nov 2 2020, 12:23 PM

vlorentz closed subtask T2703: Use intrinsic identifiers/hashes for RawExtrinsicMetadata objects as Resolved. Mar 23 2021, 2:33 PM