Page MenuHomeSoftware Heritage

Add artifact metadata to the extrinsic metadata storage specification.
ClosedPublic

Authored by vlorentz on May 14 2020, 2:53 PM.

Details

Summary

The context is loosely based on SWHID qualifiers, but not a direct
translation (revision + release instead of anchor, visit actually
references a visit, and extra snapshot key).

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision.May 14 2020, 2:53 PM

Build is green

Patch application report for D3154 (id=11200)

Rebasing onto 306aa69b26...

Current branch diff-target is up to date.
Changes applied before test
commit 35284ff11b80e9f2f2a2196be8c4fae1ad59c997
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu May 14 14:50:46 2020 +0200

    Add artifact metadata to the extrinsic metadata storage specification.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/172/ for more details.

zack added a subscriber: zack.May 14 2020, 3:15 PM

Question: can we support both context-full and context-less metadata for arbitrary artifacts?

There are metadata that will be independent of the context, but that we still don't want to store in the main storage. If we don't support that here I'm not sure where we will store those.

I don't know if the proposal is as simple as supporting a NULL/None value for context here, or if we should rather duplicate the methods you're proposing (one set for context-full, one for context-less). Either approach would be fine with me.

I don't think it makes sense to make this difference, because if a metadata only has origin in its context, it means it's context-free on all "axis" but the origin already. Having also no origin context is just one more axis it would be context-free on.

This is a great idea to add context.
I always thought the multiplicity will be from having a lot of authorities saying different things about artifacts.
Haven't realized that the same object in a different context can have in itself different metadata.

The question will remain, about how will we show the metadata.

I've added some changes to clarify the need of context without excluding the option to add metadata without context.

docs/extrinsic-metadata-specification.rst
195

I would prefer saying that:

extrinsic metadata can be given on a specific artifact within a specified context (for example: a directory in a specific revision from a specific visit on a specific origin) which will be stored along the metadata itself.

and add the note:

the context information isn't mandatory, metadata on artifacts can be added without the contextual information with an empty context field

198

I don't understand that.. which authorship? of the file?
so you can't add authors metadata on a dir or on a file without the contextual information?

I would change this example because it opens question on authorship that are not in the scope of this documentation.

If we have a HAL deposit on an artifact developed in GitHub, the deposit metadata is enhancing the information we might have on the repository (it might not be in contradiction to the one on GitHub).

So I would say,

For example, two origins may develop the same file independently;
the information about authorship, licensing or even description may vary about the same artifact in a different context.
This is why it is important to capture the complete context on which the metadata applies to.

201

I think that Therefore is not needed.

213

do we get all metadata on the identified artifact, if the context is empty?
If yes, this is worth mentioning.

223

typo a URL instead of an URL

Thanks for sharing this: we're definitely going forward!
Comments are in the text, but overall:

  • agree that we need a way to store and retrieve metadata that is contextless (or... valid in all contexts)
  • for the context, we need to use the SHWIDs themselves, not the sha1_git that is bound to version 1 of SWHIDS

All the rest LGTM.

docs/extrinsic-metadata-specification.rst
195

I agree that we want to be able to store metadata that does not depend on a context. Using an empty context field, or a special distinguished value for designating "all contexts" fills the bill.

198

Agree with this clarification of the text.
Small improvement:

This is why it is important to qualify the metadata with the complete context for which it is inteded.

213

I agree it is important to be able to formulate a query that allows to request the metadata available for all contexts.
This can be obtained either using an empty context in the query, or a special distinguished value for designating "all contexts".

223

For the snapshot key in the dictionary, use the snapshot core SWHID, not the visit integer (that is an internal implementation detail)

224–229

Here also, we do not want to commit to using sha1_git, as in the near future we might need to store metadata for SWHIDs that use different hashes. Let's use "(a release SWHID)", "(a revision SWHID)" etc.

for the context, we need to use the SHWIDs themselves, not the sha1_git that is bound to version 1 of SWHIDS

What's the difference? When we'll use something other than sha1_git we'll just migrate this data with the rest.

docs/extrinsic-metadata-specification.rst
213

Sorry, that's a typo. I didn't mean to add this context argument.

vlorentz added inline comments.May 16 2020, 4:16 PM
docs/extrinsic-metadata-specification.rst
223

What do you mean? Make snapshots have a snapshot context?

zack added a comment.May 16 2020, 4:23 PM

for the context, we need to use the SHWIDs themselves, not the sha1_git that is bound to version 1 of SWHIDS

What's the difference? When we'll use something other than sha1_git we'll just migrate this data with the rest.

I agree we should use SWHID everywhere. We should've used them in the beginning everywhere in fact, but now we should organically migrate to them.
There are many reasons for this, but one specific to this task is that it will make exports of this DB self-contained and meaningful, only depending on a publicly available spec, while using sub-parts of SWHID will lose that property.

vlorentz updated this revision to Diff 11226.May 18 2020, 3:12 PM

big rebase

Build is green

Patch application report for D3154 (id=11226)

Rebasing onto 6d24ed721a...

Current branch diff-target is up to date.
Changes applied before test
commit ab45de192802e92ec24845ac95c87464475a0b91
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/176/ for more details.

Notice also that if/when we introduce new versions of SWHIDs with other
hashing algorithms, we need to maintain backward compatibility.
Metadata introduced with a given context must not be "migrated" to the new
SWHIDs, but "duplicated" for the new SWHIDs.
Better think right now of a schema that allows to "share" metadata payloads
among multiple versions of SWHIDs without actually copying them over.

Le sam. 16 mai 2020 à 16:24, zack (Stefano Zacchiroli) <
forge@softwareheritage.org> a écrit :

zack added a comment. View Revision
https://forge.softwareheritage.org/D3154

In D3154#76782 https://forge.softwareheritage.org/D3154#76782, @vlorentz
https://forge.softwareheritage.org/p/vlorentz/ wrote:

for the context, we need to use the SHWIDs themselves, not the sha1_git
that is bound to version 1 of SWHIDS

What's the difference? When we'll use something other than sha1_git we'll
just migrate this data with the rest.

I agree we should use SWHID everywhere. We should've used them in the
beginning everywhere in fact, but now we should organically migrate to them.
There are many reasons for this, but one specific to *this* task is that
it will make exports of this DB self-contained and meaningful, only
depending on a publicly available spec, while using sub-parts of SWHID will
lose that property.

*REPOSITORY*
rDSTO Storage manager

*CHANGES SINCE LAST ACTION*
https://forge.softwareheritage.org/D3154/new/

*REVISION DETAIL*
https://forge.softwareheritage.org/D3154

*EMAIL PREFERENCES*
https://forge.softwareheritage.org/settings/panel/emailpreferences/

*To: *vlorentz, Reviewers, moranegg, rdicosmo
*Cc: *zack

vlorentz updated this revision to Diff 11228.May 18 2020, 3:38 PM

undo push on the wrong diff

vlorentz updated this revision to Diff 11229.May 18 2020, 3:40 PM

undo push on the wrong diff (for real this time)

Build is green

Patch application report for D3154 (id=11228)

Rebasing onto 6d24ed721a...

Current branch diff-target is up to date.
Changes applied before test
commit ab45de192802e92ec24845ac95c87464475a0b91
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/178/ for more details.

vlorentz updated this revision to Diff 11230.May 18 2020, 3:52 PM
vlorentz marked 9 inline comments as done.

apply most comments

I applied most of your comments, except:

For the snapshot key in the dictionary, use the snapshot core SWHID, not the visit integer (that is an internal implementation detail)

because I don't understand it.

Also, should the first argument of <X>_metadata_add and <X>_metadata_get be a core SWHID?

Build is green

Patch application report for D3154 (id=11229)

Rebasing onto 6d24ed721a...

First, rewinding head to replay your work on top of it...
Applying: Add artifact metadata to the extrinsic metadata storage specification.
Changes applied before test
commit bb531dda9c4707a41fdb828a3eb4c427b49e17fc
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu May 14 14:50:46 2020 +0200

    Add artifact metadata to the extrinsic metadata storage specification.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/179/ for more details.

Build is green

Patch application report for D3154 (id=11230)

Rebasing onto 6d24ed721a...

First, rewinding head to replay your work on top of it...
Applying: Add artifact metadata to the extrinsic metadata storage specification.
Changes applied before test
commit 00b0b8c8e5a061c8699b407026107ea8a252c4e1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu May 14 14:50:46 2020 +0200

    Add artifact metadata to the extrinsic metadata storage specification.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/180/ for more details.

I applied most of your comments,

Great!

except:

For the snapshot key in the dictionary, use the snapshot core SWHID, not the visit integer (that is an internal implementation detail)

because I don't understand it.

I removed that comment, it was incorrect: sorry for the noise.

Also, should the first argument of <X>_metadata_add and <X>_metadata_get be a core SWHID?

Yes, please!

vlorentz updated this revision to Diff 11235.May 18 2020, 6:13 PM

Make 'visit' depend on 'origin' being provided.

Build is green

Patch application report for D3154 (id=11235)

Rebasing onto 6d24ed721a...

First, rewinding head to replay your work on top of it...
Applying: Add artifact metadata to the extrinsic metadata storage specification.
Changes applied before test
commit a38509ce2449e1d2f2bef0012fae95741dab43e5
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu May 14 14:50:46 2020 +0200

    Add artifact metadata to the extrinsic metadata storage specification.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/182/ for more details.

ardumont accepted this revision.EditedMay 20 2020, 11:13 AM
ardumont added a subscriber: ardumont.

Looks good to me (1 typo in the comments above).


Note: I understand the functional reason behind wanting to store SWHID.

Just mentioning that techically, that might make things harder if we want to
join (sql speaking) between tables. That means those "join" will have to take
place python side. That might be what we want in the end as we have other
storage backends where join might not be possible (e.g cassandra comes to
mind).

docs/extrinsic-metadata-specification.rst
236

SWHID of a directory

This revision is now accepted and ready to land.May 20 2020, 11:13 AM
vlorentz updated this revision to Diff 11265.May 20 2020, 1:01 PM

fix typo + use SWHID as first argument

vlorentz marked an inline comment as done.May 20 2020, 1:01 PM

Build is green

Patch application report for D3154 (id=11265)

Rebasing onto f1b51a7a2b...

First, rewinding head to replay your work on top of it...
Applying: Add artifact metadata to the extrinsic metadata storage specification.
Changes applied before test
commit db1ecad241db08be6924bd69769e324562574241
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu May 14 14:50:46 2020 +0200

    Add artifact metadata to the extrinsic metadata storage specification.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/188/ for more details.

Adding a last comment after a last reread

docs/extrinsic-metadata-specification.rst
229

It seems that the XXXX in "(the core SWHID of a XXXX)" are off by one in lines 228, 230, 232: in 228 XXXX should be snapshot, not release, etc.

vlorentz updated this revision to Diff 11293.May 25 2020, 4:57 PM
vlorentz marked an inline comment as done.

fix mismatch

Build is green

Patch application report for D3154 (id=11293)

Rebasing onto f1b51a7a2b...

First, rewinding head to replay your work on top of it...
Applying: Add artifact metadata to the extrinsic metadata storage specification.
Changes applied before test
commit 2e574d33b74fa7dff700757f9b72f32a435a47b1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu May 14 14:50:46 2020 +0200

    Add artifact metadata to the extrinsic metadata storage specification.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/189/ for more details.

moranegg accepted this revision.May 26 2020, 11:50 AM

Looks good!
I added a comment about the an URL typo which we discussed on IRC (and decided it's a URL because we trust Wikipedia) :-)

docs/extrinsic-metadata-specification.rst
227

typo: an URL should be a URL

vlorentz updated this revision to Diff 11302.May 26 2020, 1:01 PM
  • fix typo
  • rebase
This revision was landed with ongoing or failed builds.May 26 2020, 1:02 PM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D3154 (id=11302)

Rebasing onto f1b51a7a2b...

First, rewinding head to replay your work on top of it...
Applying: Add artifact metadata to the extrinsic metadata storage specification.
Changes applied before test
commit ffea7a4d2801ce1b93a26d360f9bbc3704840b64
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu May 14 14:50:46 2020 +0200

    Add artifact metadata to the extrinsic metadata storage specification.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/190/ for more details.

Build is green

Patch application report for D3154 (id=11303)

Rebasing onto 213f1b1239...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-191-D3154.
Changes applied before test

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/191/ for more details.