
Specify a new element to describe the provenance of deposit metadata
ClosedPublic

Authored by vlorentz on Feb 15 2022, 1:12 PM.

Details

Summary

This will be useful for metadata-only deposits, as there is not necessarily
an origin in the referenced SWHID; and even when there is, it is usually
not the actual source of the metadata.

Therefore, we need this new field to link back to the provenance of the metadata.
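As an illustration of what the new field could look like to a consumer, here is a minimal sketch that reads the provenance URL from a deposit entry. The namespace URIs, the sample entry, the placeholder SWHID and URL, and the helper name `metadata_provenance_url` are all assumptions made for the example; the authoritative definitions are the ones in docs/specs/protocol-reference.rst and docs/specs/swh.xsd.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check them against the deposit specification.
NS = {
    "swh": "https://www.softwareheritage.org/schema/2018/deposit",
    "schema": "http://schema.org/",
}

# Hypothetical metadata-only deposit entry; the SWHID and URL are placeholders.
ENTRY = """\
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit"
       xmlns:schema="http://schema.org/">
  <swh:deposit>
    <swh:reference>
      <swh:object swhid="swh:1:dir:0000000000000000000000000000000000000000"/>
    </swh:reference>
    <swh:metadata-provenance>
      <schema:url>https://example.org/records/1234</schema:url>
    </swh:metadata-provenance>
  </swh:deposit>
</entry>
"""


def metadata_provenance_url(xml_text):
    """Return the metadata provenance URL of an entry, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find("swh:deposit/swh:metadata-provenance/schema:url", NS)
    if node is None or node.text is None:
        return None
    return node.text.strip()


print(metadata_provenance_url(ENTRY))
```

The lookup uses namespace URIs rather than prefixes, so it does not depend on how the depositing client names its prefixes.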

Related to T3677

Diff Detail

Repository
rDDEP Push deposit
Branch
metadata-provenance
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 26872
Build 42011: Phabricator diff pipeline on jenkins
Build 42010: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D7174 (id=25999)

Rebasing onto c30dc3da46...

Current branch diff-target is up to date.
Changes applied before test
commit 4d123a5e0b305ceeafce1db7e51ebfbd6fa2db1e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Feb 15 13:12:08 2022 +0100

    Specify a new element to describe the provenance of deposit metadata
    
    This will be useful for metadata-only deposit, as there is not necessarily
    and origin in the referenced SWHID; and even when there is, it is usually
    not the actual source of the metadata.
    
    Therefore, we need this new field to link back to the provenance of the metadata.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/700/ for more details.

fix namespace inconsistency wrt. the introduction

Build is green

Patch application report for D7174 (id=26000)

Rebasing onto c30dc3da46...

Current branch diff-target is up to date.
Changes applied before test
commit 233363fddce5510f54731db4695fa098a3225a6f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Feb 15 13:12:08 2022 +0100

    Specify a new element to describe the provenance of deposit metadata
    
    This will be useful for metadata-only deposit, as there is not necessarily
    and origin in the referenced SWHID; and even when there is, it is usually
    not the actual source of the metadata.
    
    Therefore, we need this new field to link back to the provenance of the metadata.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/701/ for more details.

Build is green

Patch application report for D7174 (id=26001)

Rebasing onto c30dc3da46...

Current branch diff-target is up to date.
Changes applied before test
commit 9697301835a30302fe340c9b63c10a1bc6bd94d9
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Feb 15 13:12:08 2022 +0100

    Specify a new element to describe the provenance of deposit metadata
    
    This will be useful for metadata-only deposit, as there is not necessarily
    and origin in the referenced SWHID; and even when there is, it is usually
    not the actual source of the metadata.
    
    Therefore, we need this new field to link back to the provenance of the metadata.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/702/ for more details.

I checked whether the PROV vocabulary has a metadata-provenance property, and it does not.
This diff is very good and I like the explanation you have introduced, which is clear.
I have two questions in comments, which aren't blockers IMHO.

docs/specs/protocol-reference.rst
281

Can we use <swh:metadata-provenance> and <swh:deposit>, as introduced in the example below, instead of <swhdeposit:metadata-provenance> and <swhdeposit:deposit>?

283

what is the schema.org namespace?

This revision is now accepted and ready to land. Feb 15 2022, 3:35 PM
docs/specs/protocol-reference.rst
281

This document already uses swhdeposit: in text and swh: in example to show the prefix does not matter; but I think I will change this in a future diff, because it is indeed confusing.
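To make the "prefix does not matter" point concrete, here is a small sketch showing that an XML parser resolves both prefixes to the same expanded element name, as long as they are bound to the same namespace URI. The URI in `SWH_NS` and the helper name `first_child_tag` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; check it against the deposit specification.
SWH_NS = "https://www.softwareheritage.org/schema/2018/deposit"


def first_child_tag(xml_text):
    """Return the expanded '{namespace}local-name' tag of the first child."""
    return ET.fromstring(xml_text)[0].tag


# Same structure, written once with the swh: prefix...
doc_swh = (
    f'<swh:deposit xmlns:swh="{SWH_NS}">'
    "<swh:metadata-provenance/></swh:deposit>"
)
# ...and once with the swhdeposit: prefix.
doc_swhdeposit = (
    f'<swhdeposit:deposit xmlns:swhdeposit="{SWH_NS}">'
    "<swhdeposit:metadata-provenance/></swhdeposit:deposit>"
)

tag_a = first_child_tag(doc_swh)
tag_b = first_child_tag(doc_swhdeposit)
print(tag_a == tag_b)
```

Both documents yield the same qualified name because only the namespace URI takes part in the comparison, which is also why mixing prefixes between text and examples can confuse readers without changing the meaning.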

283

it's a synonym for "schema.org vocabulary", but I'm using the terminology used in XML.

docs/specs/protocol-reference.rst
281

sure, in a future diff.
I know we have it elsewhere as well, but it might be confusing that it depends on the declaration of the namespace.

283

> it's a synonym for "schema.org vocabulary", but I'm using the terminology used in XML.

yes right, I don't understand this specific context.
Is it just for the type url? do we use schema.org anywhere else?
BTW, I didn't find a provenance property in schema.org, maybe you had more luck.

docs/specs/protocol-reference.rst
283

yes, for now it's only for url, but I phrased it this way to be extensible.

provenance seems outside schema.org's scope. there's https://www.w3.org/ns/prov if you really want a dedicated vocabulary, but it's waaaay outside what we can use while remaining compatible with codemeta.

lgtm

one question regarding the xsd.

docs/specs/swh.xsd
39

Don't we want to be restrictive here first and then open more when we extend this (if we ever do it)?

docs/specs/swh.xsd
39

I don't think there is harm in allowing other metadata here. Plus, if there is any external reader of the metadata, they would want to future-proof it anyway.
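A reader that is future-proof in this sense could look like the sketch below: it picks up the one child it knows about and skips anything else instead of failing. This assumes the XSD line under discussion concerns the children of the metadata-provenance element; the namespace URIs, the extra `schema:name` child, and the helper name `read_provenance` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check them against the deposit specification.
SWH = "https://www.softwareheritage.org/schema/2018/deposit"
SCHEMA = "http://schema.org/"


def read_provenance(xml_text):
    """Return (url, ignored), where ignored lists unrecognized child tags."""
    root = ET.fromstring(xml_text)
    url, ignored = None, []
    for child in root:
        if child.tag == f"{{{SCHEMA}}}url":
            url = (child.text or "").strip()
        else:
            # Unknown metadata is skipped rather than treated as an error.
            ignored.append(child.tag)
    return url, ignored


# Hypothetical element carrying one known and one unknown child.
SAMPLE = f"""\
<swh:metadata-provenance xmlns:swh="{SWH}" xmlns:schema="{SCHEMA}">
  <schema:url>https://example.org/records/1234</schema:url>
  <schema:name>hypothetical extra field</schema:name>
</swh:metadata-provenance>
"""

url, ignored = read_provenance(SAMPLE)
print(url, ignored)
```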

docs/specs/protocol-reference.rst
281–283

i think it's clearer. @vlorentz, @moranegg thoughts?

Also, I'm more inclined towards imposing the namespace (last sentence), because
otherwise we could be annoyed by the various differences we'll be receiving, and bugs
will slip through.

If not ^, then the last sentence can be reworded as:

It's preferably defined using...
This revision was landed with ongoing or failed builds. Feb 23 2022, 10:47 AM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D7174 (id=26191)

Rebasing onto 55ae87b13c...

Current branch diff-target is up to date.
Changes applied before test
commit cc3705a3d33c07c7542024deb90d89fbc2c66853
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Feb 15 13:12:08 2022 +0100

    Specify a new element to describe the provenance of deposit metadata
    
    This will be useful for metadata-only deposit, as there is not necessarily
    and origin in the referenced SWHID; and even when there is, it is usually
    not the actual source of the metadata.
    
    Therefore, we need this new field to link back to the provenance of the metadata.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/734/ for more details.