Add deposit info to objects added to swh-storage from metadata-only deposits
Closed, Public

Authored by vlorentz on Mar 12 2021, 2:41 PM.

Details

Summary

Deposits with code objects are loaded as their own origin, so we can
look them up in the deposit database from their metadata (which hold the
origin as a context).

This is not true for metadata-only deposits, because we don't create an
origin for them; so we need to store this information somewhere.
The naive solution would be to insert it in the Atom entry provided by
the client, but that means altering a document before we archive it,
which is bad.

This commit makes the deposit server write a "metametadata" object (i.e.
a metadata object with another metadata object as target) to the metadata
storage.
This metametadata contains information on the deposit itself: id,
client, and collection.
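
To illustrate the two levels described above, here is a minimal sketch. It does not use the swh-model or swh-storage APIs: `MetadataObject`, `identifier()`, `make_deposit_metametadata` and the `"deposit-server"` authority name are hypothetical stand-ins, and the deposit info is serialized as JSON only for brevity (the patch itself renders it through the `deposit_info.xml` template, whose contents are not shown here).

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataObject:
    """Hypothetical stand-in for an object in the metadata storage."""

    target: str  # identifier of the object this metadata describes
    authority: str  # who vouches for this metadata
    payload: bytes  # raw document, stored unmodified

    def identifier(self) -> str:
        # Illustrative content-derived id; NOT the real SWHID computation.
        h = hashlib.sha1()
        for part in (self.target.encode(), self.authority.encode(), self.payload):
            h.update(len(part).to_bytes(8, "big"))
            h.update(part)
        return "meta:" + h.hexdigest()


def make_deposit_metametadata(
    atom_metadata: MetadataObject, deposit_id: int, client: str, collection: str
) -> MetadataObject:
    """Build a second-level object whose target is the first-level metadata."""
    info = {"deposit_id": deposit_id, "client": client, "collection": collection}
    return MetadataObject(
        target=atom_metadata.identifier(),
        authority="deposit-server",  # assumed name, for illustration only
        payload=json.dumps(info, sort_keys=True).encode(),
    )


# First level: the client's Atom entry, archived byte-for-byte.
atom_entry = b'<entry xmlns="http://www.w3.org/2005/Atom"><title>example</title></entry>'
atom_metadata = MetadataObject(
    target="swh:1:dir:" + "0" * 40,  # placeholder SWHID of the described object
    authority="example-client",
    payload=atom_entry,
)

# Second level: deposit info pointing at the first-level object.
metametadata = make_deposit_metametadata(
    atom_metadata, deposit_id=42, client="example-client", collection="example-collection"
)
```

The point of the design is that the client's document is never touched: the deposit id, client and collection live in a separate object that references it.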

Depends on D5237 and D5238

Resolves T2779

Event Timeline

Build has FAILED

Patch application report for D5239 (id=18777)

Could not rebase; Attempt merge onto 4353a323f6...

Updating 4353a323..58e4c62d
Fast-forward
 swh/deposit/api/common.py                          |  84 +++++++--
 swh/deposit/config.py                              |  16 ++
 swh/deposit/templates/deposit/deposit_info.xml     |   5 +
 swh/deposit/tests/api/test_collection_post_atom.py | 196 +++++++++++++++++----
 swh/deposit/tests/conftest.py                      |   2 +
 5 files changed, 260 insertions(+), 43 deletions(-)
 create mode 100644 swh/deposit/templates/deposit/deposit_info.xml
Changes applied before test
commit 58e4c62d46acec30ce039d853c5e76ffb87dc2b3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 12 14:13:29 2021 +0100

    Add deposit info to objects added to swh-storage from metadata-only deposits
    
    Deposits with code objects are loaded as their own origin, so we can
    look them up in the deposit database from their metadata (which hold the
    origin as a context).
    
    This is not true for metadata-only deposits, because we don't create an
    origin for them; so we need to store this information somewhere.
    The naive solution would be to insert them in the Atom entry provided by
    the client, but it means altering a document before we archive it, which
    is bad.

commit 5949c08cbc3c29e7871fa5eae62cd37784daa5cf
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 12 14:22:58 2021 +0100

    tests: Simplify discovery_date comparison.

commit bb053c7bd51bbd393db9bde991dfa9ffa1e7e202
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 11 17:29:12 2021 +0100

    Check a SWHID exists in the archive before accepting a metadata-only deposit

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/566/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/566/console

Harbormaster returned this revision to the author for changes because remote builds failed. Mar 12 2021, 2:44 PM
Harbormaster failed remote builds in B19875: Diff 18777!

Build is green

Patch application report for D5239 (id=18806)

Could not rebase; Attempt merge onto c4972584f1...

Updating c4972584..49e075ea
Fast-forward
 swh/deposit/api/common.py                          | 35 ++++++--
 swh/deposit/config.py                              | 15 ++++
 swh/deposit/templates/deposit/deposit_info.xml     |  5 ++
 swh/deposit/tests/api/test_collection_post_atom.py | 98 +++++++++++++++-------
 swh/deposit/tests/conftest.py                      |  1 +
 5 files changed, 116 insertions(+), 38 deletions(-)
 create mode 100644 swh/deposit/templates/deposit/deposit_info.xml
Changes applied before test
commit 49e075eae5fa40e5460b8b1705463bc8ccebb2d3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 12 14:13:29 2021 +0100

    Add deposit info to objects added to swh-storage from metadata-only deposits
    
    Deposits with code objects are loaded as their own origin, so we can
    look them up in the deposit database from their metadata (which hold the
    origin as a context).
    
    This is not true for metadata-only deposits, because we don't create an
    origin for them; so we need to store this information somewhere.
    The naive solution would be to insert them in the Atom entry provided by
    the client, but it means altering a document before we archive it, which
    is bad.

commit 22d3f3ef906c162572881b6df87b82cc3d4032ae
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 12 14:22:58 2021 +0100

    tests: Simplify discovery_date comparison.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/574/ for more details.

I'm not very fond of the commit message, especially the fact that it describes what is not done but not really what is done, and I find "Add deposit info to objects added to swh-storage from metadata-only deposits" unclear. What (type of) objects are we talking about? How is the deposit info "added" to them?

douardda requested changes to this revision. Edited Mar 15 2021, 1:50 PM

Otherwise LGTM. I'd really like a better commit message, and probably some documentation somewhere (in docs/ maybe?) explaining these 2 levels of metadata, especially documenting the second layer, since it's crafted by the deposit.

This revision now requires changes to proceed. Mar 15 2021, 1:50 PM

reword commit msg

I can't find a better way to phrase the first line, but I added a paragraph
explaining the changes.

Build is green

Patch application report for D5239 (id=18815)

Rebasing onto 22d3f3ef90...

Current branch diff-target is up to date.
Changes applied before test
commit 3a9b2fc4baa4e5080e9a17c974035550db7dab4d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 12 14:13:29 2021 +0100

    Add deposit info to objects added to swh-storage from metadata-only deposits
    
    Deposits with code objects are loaded as their own origin, so we can
    look them up in the deposit database from their metadata (which hold the
    origin as a context).
    
    This is not true for metadata-only deposits, because we don't create an
    origin for them; so we need to store this information somewhere.
    The naive solution would be to insert them in the Atom entry provided by
    the client, but it means altering a document before we archive it, which
    is bad.
    
    This commit makes the deposit server write a "metametadata" object (ie.
    a metadata object with an other metadata object as target) the metadata
    storage.
    This metametadata contains information on the deposit itself: id,
    client, and collection.

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/575/ for more details.

> probably some documentation somewhere (in docs/ maybe?) explaining these 2 levels of metadata, especially documenting the second layer, since it's crafted by the deposit.

D5247

moranegg added inline comments.
swh/deposit/config.py
114

can you remind me where it is specified that the deposit authority type is a registry?

This revision is now accepted and ready to land. Mar 15 2021, 3:13 PM
swh/deposit/config.py
114

Nowhere. But it's not a forge/package-manager, and is not a deposit client. Only remaining option is registry.

We could add a new authority type, eg. "witness" or "notary", but the difference with registry is probably too subtle.
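
The elimination argument above can be sketched as follows. The enum only mirrors the three authority types named in this thread and is not imported from any library; `AuthorityType`, `pick_authority_type` and its two boolean parameters are hypothetical names used for illustration.

```python
from enum import Enum


class AuthorityType(Enum):
    # Hypothetical mirror of the three authority types discussed in this thread.
    DEPOSIT_CLIENT = "deposit_client"
    FORGE = "forge"
    REGISTRY = "registry"


def pick_authority_type(hosts_code: bool, submits_deposits: bool) -> AuthorityType:
    """Choose an authority type by ruling out the ones that do not apply."""
    if hosts_code:
        # Forges and package managers host the code they describe.
        return AuthorityType.FORGE
    if submits_deposits:
        # Deposit clients are the parties pushing deposits to the server.
        return AuthorityType.DEPOSIT_CLIENT
    # Neither applies, so registry is the only remaining option.
    return AuthorityType.REGISTRY


# The deposit server neither hosts code nor submits deposits itself.
deposit_server_type = pick_authority_type(hosts_code=False, submits_deposits=False)
```

A dedicated type such as "witness" or "notary" would add a fourth member to the enum; the thread leaves that option open.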