Page MenuHomeSoftware Heritage

model: Add payload to ExtID class
AcceptedPublic

Authored by samplet on Oct 25 2022, 1:36 AM.

Details

Summary

This revision adds payload and payload type fields to
the ExtID class. The intent is to use these fields to store
Disarchive specifications to support recovering source tarballs.

See https://sympa.inria.fr/sympa/arc/swh-devel/2022-02/msg00022.html
where Stefano suggests using a generic payload mechanism for ExtIDs
and https://sympa.inria.fr/sympa/arc/swh-devel/2022-05/msg00027.html
where we decided on using object storage for the payload.

Diff Detail

Repository
rDMOD Data model
Branch
disarchive-3
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32613
Build 51088: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 51087: arc lint + arc unit

Event Timeline

samplet published this revision for review.Oct 25 2022, 4:58 AM

It looks like using the --draft flag when sending this with arc has confused the CI system. The build failed because it couldn't access the updated code:

fatal: Couldn't find remote ref refs/tags/phabricator/diff/31582

Sorry! Can the CI job be retried?

Hi, thanks for the diffs

It looks like using the --draft flag when sending this with arc has confused the CI system. The build failed because it couldn't access the updated code:

fatal: Couldn't find remote ref refs/tags/phabricator/diff/31582

It was because you hadn't signed the CLA when submitting the diff

Sorry! Can the CI job be retried?

Done

swh/model/git_objects.py
644–645

That's weird. I don't see why it would fail this time.

Anyway, I'm guessing it will fix itself when you send any update to the diff so let's ignore that issue for now.

This looks fine to me, though we might want to use something other than SHA1 for this.

We are currently in the process of moving the objstorage to multi-hashes, so eventually we will want to use multi-hashes here too. I wonder if we should do that before landing this diff (and D8760). It would be less effort overall, but will delay your work a bit.

eg. here it could be done by making payload a dictionary (with algos as keys and digests as values), and serialize the manifest as:

[payload_type $StrWithoutSpaces]
[payload blake2s256 $Hexdigest]
[payload sha1 $Hexdigest]
[payload sha1_git $Hexdigest]
[payload sha256 $Hexdigest]

@samplet @olasd What do you think?

swh/model/git_objects.py
665

Hi, thanks for the diffs

It looks like using the --draft flag when sending this with arc has confused the CI system. The build failed because it couldn't access the updated code:

fatal: Couldn't find remote ref refs/tags/phabricator/diff/31582

It was because you hadn't signed the CLA when submitting the diff

Nope, that's because the commit hasn't been pushed to the staging repo. I assume either you used --skip-staging when running arc diff, or the push failed for some reason.

You need to go through these instructions: https://docs.softwareheritage.org/devel/contributing/phabricator.html#enabling-git-push-to-our-forge to ensure that you're able to push to https://forge.softwareheritage.org/source/staging.git

Anyway, I'm guessing it will fix itself when you send any update to the diff so let's ignore that issue for now.

This looks fine to me, though we might want to use something other than SHA1 for this.

We are currently in the process of moving the objstorage to multi-hashes, so eventually we will want to use multi-hashes here too. I wonder if we should do that before landing this diff (and D8760). It would be less effort overall, but will delay your work a bit.

eg. here it could be done by making payload a dictionary (with algos as keys and digests as values), and serialize the manifest as:

[payload_type $StrWithoutSpaces]
[payload blake2s256 $Hexdigest]
[payload sha1 $Hexdigest]
[payload sha1_git $Hexdigest]
[payload sha256 $Hexdigest]

@samplet @olasd What do you think?

So, in essence, the payload would be an outbound edge from the extid node to a content node (and only a content node?), which needs to be recorded in the SWH archive separately, correct?

Rather than bodge together some multi-hash edge definition that would only be used for extid manifests (and which may or may not be what we end up retaining for SWHID v2), I think it would make more sense to use a definition consistent with the existing edges in our data model, that is, use the sha1_git of the content to refer to it (like directory entries or snapshot branches would, today). We can make the schema of extids (and the hash used to identify edges) evolve concurrently with the schema for the rest of the SWH object identifiers once we have a consistent viewpoint on how to do that.

Hi, thanks for the diffs

My pleasure! Sorry for fumbling with Phabricator so much. 😳 I've set up pushing to the forge now, so hopefully things will work after an update (as suggested).

In D8759#227867, @olasd wrote:

[...] I think it would make more sense to use a definition consistent with the existing edges in our data model, that is, use the sha1_git of the content to refer to it (like directory entries or snapshot branches would, today).

This makes sense to me, too. I had it this way, but was tempted away by the ease of using storage.content_get_data with the plain SHA-1. If we store the Git SHA-1, my understanding is that I will have to use storage.content_get to get the metadata and proceed as before with the SHA-1. (In light of your comment, which I agree with, it's weird that storage.content_get_data uses a plain SHA-1 in the first place.)

Use a Git SHA-1 for the payload.

This update switches from using a bare SHA-1 to a Git SHA-1
for the payload.

It also includes minor corrections suggested by @vlorentz.

This is a duplicate of the last update. I'm hoping that Arcanist
is now happy with my Git remote settings and will be able to push
the changes to the staging repo to be picked up by CI.

Sorry for the noise.

Build is green

Patch application report for D8759 (id=31659)

Rebasing onto fe8d55588a...

Current branch diff-target is up to date.
Changes applied before test
commit 49475e98028f2b6095fe5972cc51cc88919dfe3a
Author: Timothy Sample <samplet@ngyro.com>
Date:   Tue Sep 27 15:29:23 2022 -0400

    model: Add payload to ExtID class

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/544/ for more details.

olasd added inline comments.
docs/data-model.rst
281–286

Maybe worth adding a practical note that the extid payload actually refers to a content node in the archive, rather than being stored inline.

This revision is now accepted and ready to land.Nov 4 2022, 3:55 PM

Maybe worth @douardda or @marmoute having a look at this, as I think there were more usecases than just disarchive for this?

This comment was removed by vlorentz.