Page MenuHomeSoftware Heritage

deposit: Adapt loader to send extrinsic raw metadata to the metadata storage
ClosedPublic

Authored by ardumont on Sep 30 2020, 5:57 PM.

Details

Summary

This keeps the metadata written in the revision in the same format as before
though (json dict).

Questions:

  • (storage metadata) Do we keep the raw metadata attached to the revision as it's currently done or do we attach it to the directory instead or both?

-> currently the deposit server writes the raw metadata to the directory (during an update D4013), not on the revision as it cannot.

-> if we need to change it (attach the raw extrinsic metadata to the directory in the metadata storage), that's doable. It's a more involved refactoring though as the package loader does not allow it yet. And this can be done as a next diff.

  • (storage) Do we keep writing the metadata to the revision at all (we could stop now that the loaders are writing to the metadata storage)?

Remark:
Implementation wise, this sends one raw extrinsic metadata column per raw xml received (a deposit can be done in multiple requests, which translates into multiple raw xml files).
The first implementation of this was to join all xml into one concatenated xml. As this meant more handling down the line, when reading, this was changed into this instead.

Related to T2649

Related to D4100
Related to D4101

Test Plan

tox

Diff Detail

Event Timeline

Rework commit message to reference the task

Build has FAILED

Patch application report for D4105 (id=14460)

Rebasing onto 63e9d3caae...

Current branch diff-target is up to date.
Changes applied before test
commit 928ef5bcdb672fc47b9d7c0217187840b7654813
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Sep 30 17:54:49 2020 +0200

    deposit.loader: Adapt loader to send the extrinsic raw metadata
    
    This almost keeps the revision metadata format as is. It sends the raw
    metadata (xml instead of the parsed json before).

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/305/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/305/console

ardumont edited the summary of this revision. (Show Details)

Build has FAILED

Patch application report for D4105 (id=14461)

Rebasing onto 63e9d3caae...

Current branch diff-target is up to date.
Changes applied before test
commit c72ef7e526bec4d6754ef810a84a38b7a4e09aa2
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Sep 30 17:54:49 2020 +0200

    deposit.loader: Adapt loader to send the extrinsic raw metadata
    
    This almost keeps the revision metadata format as is. It sends the raw
    metadata (xml instead of the parsed json before).
    
    Related to T2649

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/306/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/306/console

Build is green

Patch application report for D4105 (id=14462)

Rebasing onto 63e9d3caae...

Current branch diff-target is up to date.
Changes applied before test
commit 95b674783eea1946eaf160b46ae43481c587764a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Sep 30 17:54:49 2020 +0200

    deposit.loader: Adapt loader to send the extrinsic raw metadata
    
    This almost keeps the revision metadata format as is. It sends the raw
    metadata (xml instead of the parsed json before).
    
    Related to T2649

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/307/ for more details.

ardumont edited the summary of this revision. (Show Details)
swh/loader/package/deposit/loader.py
70

The all raw xml join i'm talking about in the diff description.
It's either that or sending one RawExtrinsicMetadataCore per xml file below (around line 95) [1]
Like i did for the origin metadata...

I don't know what's best...
I'm now inclined to think that sending multiple raw metadata may be better.

Bottom line, It implies less action when reading.

197

Kept the existing json format as it was before...

211

Here, i did not join the xml files into 1 (as I did for the revision one), shout out if not ok ;)

Send one raw extrinsic metadata per raw xml received

Build is green

Patch application report for D4105 (id=14463)

Rebasing onto 63e9d3caae...

Current branch diff-target is up to date.
Changes applied before test
commit 177f5ec445a573502f53acc077fe3a3807a50102
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Sep 30 17:54:49 2020 +0200

    deposit.loader: Adapt loader to send the extrinsic raw metadata
    
    This almost keeps the revision metadata format as is. It sends the raw
    metadata (xml instead of the parsed json before).
    
    Related to T2649

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/308/ for more details.

swh/loader/package/deposit/loader.py
70

no longer relevant ^ (this now sends on RawExtrinsicMetadataCore per xml file).

211

The other part was aligned with this.

swh/loader/package/deposit/tests/test_deposit.py
306

cf. my question in the diff, do we still want to write metadata directly in a Revision model object?
My take is no. What do you think?

vlorentz requested changes to this revision.Oct 1 2020, 9:58 AM
vlorentz added a subscriber: vlorentz.

(storage metadata) Do we keep the raw metadata attached to the revision as it's currently done or do we attach it to the directory instead or both?
(storage) Do we keep writing the metadata to the revision at all (we could stop now that the loaders are writing to the metadata storage)?

we should keep writing the metadata to the revision, as JSON for now, just in case we choose to change the schema of metadata again before we migrate. Removing it should be the last step in the migration.

and we can additionally write as dir metadata, but it's outside the scope of this diff

This revision now requires changes to proceed.Oct 1 2020, 9:58 AM

we should keep writing the metadata to the revision, as JSON for now, just in case we choose to change the schema of metadata again before we migrate. Removing it should be the last step in the migration.

sounds like a reasonable plan forward, thx.
I'll go and adapt D4100/D4101 towards that.

and we can additionally write as dir metadata, but it's outside the scope of this diff

agreed.

sounds like a reasonable plan forward, thx.
I'll go and adapt D4100/D4101 towards that.

Actually standby, i'm no longer convinced this is a good idea, we are discussing in irc...

swh/loader/package/deposit/tests/test_deposit.py
304

isn't this the original XML?

swh/loader/package/deposit/tests/test_deposit.py
304

yes, it is.
It's prior to your requesting change to continue sending the json one.
I've reworked D4100 and D4101 so it continues sending the json metadata.
I'm currently updating the diff with this change.

  • Adapt according to review, keep the old json format for the metadata written in the revision
  • Rework commit message accordingly
ardumont retitled this revision from deposit.loader: Adapt loader to send the extrinsic raw metadata to deposit: Adapt loader to send extrinsic raw metadata to the metadata storage.Oct 1 2020, 1:19 PM
ardumont edited the summary of this revision. (Show Details)

Build is green

Patch application report for D4105 (id=14477)

Rebasing onto 63e9d3caae...

Current branch diff-target is up to date.
Changes applied before test
commit 777ea4468ad3c4e8981151b29f8b70ff7124c4ef
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Sep 30 17:54:49 2020 +0200

    deposit: Adapt loader to send extrinsic raw metadata to the metadata storage
    
    This keeps the metadata written in the revision in the same format as before
    though (json dict).
    
    Related to T2649

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/309/ for more details.

ardumont added inline comments.
swh/loader/package/deposit/tests/test_deposit.py
304

adapted back to keep the old json format now.

This revision is now accepted and ready to land.Oct 1 2020, 2:02 PM