Add deposit info to objects added to swh-storage from metadata-only deposits
ClosedPublic

Authored by vlorentz on Mar 12 2021, 2:41 PM.

Details

Summary

Deposits with code objects are loaded as their own origin, so we can
look them up in the deposit database from their metadata (which holds the
origin as a context).

This is not true for metadata-only deposits, because we don't create an
origin for them; so we need to store this information somewhere.
The naive solution would be to insert it in the Atom entry provided by
the client, but that means altering a document before we archive it, which
is bad.

This commit makes the deposit server write a "metametadata" object (i.e.
a metadata object with another metadata object as target) to the metadata
storage.
This metametadata contains information on the deposit itself: id,
client, and collection.
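
To illustrate the two layers, here is a minimal sketch using only the standard library. It is not the code from this diff: `MetadataObject`, `build_deposit_info`, `make_metametadata`, the XML element names, the format string, and the `emd:` identifier scheme are all hypothetical stand-ins. The point is only that the client's Atom entry is stored untouched, and the deposit info goes into a second object whose target is the first one.

```python
import hashlib
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MetadataObject:
    """Hypothetical stand-in for a metadata object in the metadata storage."""

    target: str  # identifier of the object this metadata describes
    authority: str
    discovery_date: datetime
    format: str
    metadata: bytes

    def identifier(self) -> str:
        # Illustrative content-derived identifier; NOT the real id computation.
        h = hashlib.sha1()
        for part in (
            self.target,
            self.authority,
            self.discovery_date.isoformat(),
            self.format,
        ):
            h.update(part.encode("utf-8") + b"\0")
        h.update(self.metadata)
        return "emd:" + h.hexdigest()


def build_deposit_info(deposit_id: int, client: str, collection: str) -> bytes:
    """Serialize the deposit info; element names are assumptions."""
    root = ET.Element("deposit")
    ET.SubElement(root, "deposit_id").text = str(deposit_id)
    ET.SubElement(root, "deposit_client").text = client
    ET.SubElement(root, "deposit_collection").text = collection
    return ET.tostring(root, encoding="utf-8")


def make_metametadata(
    entry: MetadataObject,
    deposit_id: int,
    client: str,
    collection: str,
    authority: str,
) -> MetadataObject:
    """Build a second-layer object targeting the first-layer metadata object.

    The client's entry is only read here, never modified.
    """
    return MetadataObject(
        target=entry.identifier(),
        authority=authority,
        discovery_date=entry.discovery_date,
        format="hypothetical-deposit-info-xml",
        metadata=build_deposit_info(deposit_id, client, collection),
    )
```

In this sketch, going from an archived Atom entry back to its deposit means looking for metadata objects whose target is the entry's identifier.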

Depends on D5237 and D5238

Resolves T2779

Diff Detail

Repository
rDDEP Push deposit
Branch
metametadata
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 19909
Build 30923: Phabricator diff pipeline on jenkins
Build 30922: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D5239 (id=18777)

Could not rebase; Attempt merge onto 4353a323f6...

Updating 4353a323..58e4c62d
Fast-forward
 swh/deposit/api/common.py                          |  84 +++++++--
 swh/deposit/config.py                              |  16 ++
 swh/deposit/templates/deposit/deposit_info.xml     |   5 +
 swh/deposit/tests/api/test_collection_post_atom.py | 196 +++++++++++++++++----
 swh/deposit/tests/conftest.py                      |   2 +
 5 files changed, 260 insertions(+), 43 deletions(-)
 create mode 100644 swh/deposit/templates/deposit/deposit_info.xml
Changes applied before test
commit 58e4c62d46acec30ce039d853c5e76ffb87dc2b3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 12 14:13:29 2021 +0100

    Add deposit info to objects added to swh-storage from metadata-only deposits
    
    Deposits with code objects are loaded as their own origin, so we can
    look them up in the deposit database from their metadata (which hold the
    origin as a context).
    
    This is not true for metadata-only deposits, because we don't create an
    origin for them; so we need to store this information somewhere.
    The naive solution would be to insert them in the Atom entry provided by
    the client, but it means altering a document before we archive it, which
    is bad.

commit 5949c08cbc3c29e7871fa5eae62cd37784daa5cf
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 12 14:22:58 2021 +0100

    tests: Simplify discovery_date comparison.

commit bb053c7bd51bbd393db9bde991dfa9ffa1e7e202
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 11 17:29:12 2021 +0100

    Check a SWHID exists in the archive before accepting a metadata-only deposit

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/566/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/566/console

Harbormaster returned this revision to the author for changes because remote builds failed. Mar 12 2021, 2:44 PM
Harbormaster failed remote builds in B19875: Diff 18777!

Build is green

Patch application report for D5239 (id=18806)

Could not rebase; Attempt merge onto c4972584f1...

Updating c4972584..49e075ea
Fast-forward
 swh/deposit/api/common.py                          | 35 ++++++--
 swh/deposit/config.py                              | 15 ++++
 swh/deposit/templates/deposit/deposit_info.xml     |  5 ++
 swh/deposit/tests/api/test_collection_post_atom.py | 98 +++++++++++++++-------
 swh/deposit/tests/conftest.py                      |  1 +
 5 files changed, 116 insertions(+), 38 deletions(-)
 create mode 100644 swh/deposit/templates/deposit/deposit_info.xml
Changes applied before test
commit 49e075eae5fa40e5460b8b1705463bc8ccebb2d3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 12 14:13:29 2021 +0100

    Add deposit info to objects added to swh-storage from metadata-only deposits
    
    Deposits with code objects are loaded as their own origin, so we can
    look them up in the deposit database from their metadata (which hold the
    origin as a context).
    
    This is not true for metadata-only deposits, because we don't create an
    origin for them; so we need to store this information somewhere.
    The naive solution would be to insert them in the Atom entry provided by
    the client, but it means altering a document before we archive it, which
    is bad.

commit 22d3f3ef906c162572881b6df87b82cc3d4032ae
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 12 14:22:58 2021 +0100

    tests: Simplify discovery_date comparison.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/574/ for more details.

I'm not very fond of the commit message, especially the fact that it describes what is not done but not really what is done, and I find "Add deposit info to objects added to swh-storage from metadata-only deposits" unclear. What (type of) objects are we talking about? How is the deposit info "added" to them?

douardda requested changes to this revision. Edited Mar 15 2021, 1:50 PM

Otherwise LGTM. I'd really like a better commit message, and probably some documentation somewhere (in docs/ maybe?) explaining these 2 levels of metadata, especially documenting the second layer, since it's crafted by the deposit.

This revision now requires changes to proceed. Mar 15 2021, 1:50 PM

reword commit msg

I can't find a better way to phrase the first line, but I added a paragraph
explaining the changes.

Build is green

Patch application report for D5239 (id=18815)

Rebasing onto 22d3f3ef90...

Current branch diff-target is up to date.
Changes applied before test
commit 3a9b2fc4baa4e5080e9a17c974035550db7dab4d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 12 14:13:29 2021 +0100

    Add deposit info to objects added to swh-storage from metadata-only deposits
    
    Deposits with code objects are loaded as their own origin, so we can
    look them up in the deposit database from their metadata (which hold the
    origin as a context).
    
    This is not true for metadata-only deposits, because we don't create an
    origin for them; so we need to store this information somewhere.
    The naive solution would be to insert them in the Atom entry provided by
    the client, but it means altering a document before we archive it, which
    is bad.
    
    This commit makes the deposit server write a "metametadata" object (ie.
    a metadata object with an other metadata object as target) the metadata
    storage.
    This metametadata contains information on the deposit itself: id,
    client, and collection.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/575/ for more details.

probably some documentation somewhere (in docs/ maybe?) explaining these 2 levels of metadata, especially documenting the second layer, since it's crafted by the deposit.

D5247

moranegg added inline comments.
swh/deposit/config.py
115

Can you remind me where it is specified that the deposit authority type is a registry?

This revision is now accepted and ready to land. Mar 15 2021, 3:13 PM
swh/deposit/config.py
115

Nowhere. But it's not a forge/package-manager, and it's not a deposit client. The only remaining option is registry.

We could add a new authority type, e.g. "witness" or "notary", but the difference from registry is probably too subtle.
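
The elimination argument above can be written out as a small sketch. The enum and the function below are hypothetical stand-ins for illustration, assuming only the three authority types named in this thread; they are not taken from the diff.

```python
from enum import Enum


class AuthorityType(Enum):
    """Hypothetical stand-in listing the three types named in this thread."""

    DEPOSIT_CLIENT = "deposit_client"
    FORGE = "forge"  # forge or package manager
    REGISTRY = "registry"


def authority_type_for(is_deposit_client: bool, is_forge: bool) -> AuthorityType:
    """Pick an authority type by elimination, as argued above."""
    if is_deposit_client:
        return AuthorityType.DEPOSIT_CLIENT
    if is_forge:
        return AuthorityType.FORGE
    # Neither a client nor a forge/package manager: registry is what remains.
    return AuthorityType.REGISTRY


# The deposit server itself is neither a deposit client nor a forge.
deposit_server_type = authority_type_for(is_deposit_client=False, is_forge=False)
```

Adding a dedicated "witness" or "notary" type would only mean one more enum member and one more branch here.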