Page MenuHomeSoftware Heritage

Use mercurial tags as named pointer (referenced in the snapshot)
AbandonedPublic

Authored by ardumont on Sep 11 2018, 6:24 PM.

Details

Summary

As exchanged in irc discussion:

12:04 <+olasd> ardumont: so, just to be clear, what do tags look like in the mercurial loader now?
12:17 <+ardumont> as before, they just now target the right associated revision (and no longer the wrong mercurial ones)
12:17 <+ardumont> (https://forge.softwareheritage.org/D395 is the fix)
13:07 <+olasd> what's the point of having release objects with no message, no author, no date? this information could live entirely in the snapshot
13:08 <+olasd> (same as git "lightweight" tags)
13:11 <+ardumont> i think that was developed prior to the snapshot
13:12 <+olasd> snapshots have been in the data model forever (they were called occurrences before but they were the same thing)
13:13 <+olasd> if the object is, functionally, a named pointer for a revision id, there's no need for a full fat release object
13:17 <+olasd> (this conversation is orthogonal to the removal of the "bogus" objects)
13:20 <+ardumont> yes, right (about occurrences)
13:20 <+ardumont> for the choice then, "the point" was probably done to stick to the mercurial model, 1 tag -> 1 release
13:21 <+ardumont> which as you point out is probably not that good
13:23 <+olasd> well, we've been dithering on whether to synthesize release objects for most loaders now, but I think we're pretty much consistently avoiding them in favor of putting the information in the snapshot
13:24 <+ardumont> indeed
13:24 <+olasd> unless there really is more to it than having a named pointer to a given revision

Related T1189

Test Plan

Manual (unfortunately cf. T954)

  • No data in swh-dev instance
  • Charged 'distutils' origin (with tags)
  • Checked that no longer any releases are created.
  • Checked that the 'tags' are referenced as branch in the snapshot and that they still (as in last prior fix D395) target the right revisions.
$ python3
>>> project = 'distutils2' 
>>> # remote repository
... origin_url = 'https://www.mercurial-scm.org/repo/%s' % project
>>> # local clone
... directory = '/home/storage/hg/repo/%s' % project
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.mercurial.tasks import LoadMercurialTsk
>>>
>>> t = LoadMercurialTsk()
>>> t.run(origin_url=origin_url, directory=directory, visit_date='2016-05-03T15:16:32+00:00')
DEBUG:swh.scheduler.task.LoadMercurialTsk:Creating hg origin for https://www.mercurial-scm.org/repo/distutils2
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done creating hg origin for https://www.mercurial-scm.org/repo/distutils2
DEBUG:swh.scheduler.task.LoadMercurialTsk:Creating origin_visit for origin 1 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done Creating origin_visit for origin 1 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadMercurialTsk:Bundling at /home/storage/hg/repo/distutils2/HG20_none_bundle
INFO:swh.scheduler.task.LoadMercurialTsk:5847fb2b7549fb301882c1054eb8b3d0893e3570: b'1.0a1' -> 2e8c4a95ae8d6855d083a4bf3888c0c2bd1ab7d7
INFO:swh.scheduler.task.LoadMercurialTsk:61d1f457a279398c785f8ab729f30f526b6edd58: b'1.0a2' -> a6c182120215be5357db6d538dceee57f8f76d58
INFO:swh.scheduler.task.LoadMercurialTsk:e15c0aa57f52215fe8f184427b0207e6acfb65d6: b'1.0a3' -> c4dd5b9920c7bfefdf857f656c4f64939caf6fe4
INFO:swh.scheduler.task.LoadMercurialTsk:e15c0aa57f52215fe8f184427b0207e6acfb65d6: b'1.0a3' -> c4dd5b9920c7bfefdf857f656c4f64939caf6fe4
INFO:swh.scheduler.task.LoadMercurialTsk:7c8e61aa51f4748286964bc1405bd4169c270f46: b'1.0a3' -> d06ecb3a53cd93e3a9b90861a355e3f1f86b4e75
INFO:swh.scheduler.task.LoadMercurialTsk:7c8e61aa51f4748286964bc1405bd4169c270f46: b'1.0a3' -> d06ecb3a53cd93e3a9b90861a355e3f1f86b4e75
INFO:swh.scheduler.task.LoadMercurialTsk:d930ae6caab58bec92683235aa88d06bbc07ae36: b'1.0a3' -> b893f4bee89bb5dbc0cd8b1821fd37008b377a9c
INFO:swh.scheduler.task.LoadMercurialTsk:1e4d52d83e95c14e3f0bd2179a81ac1023ef32e9: b'1.0a4' -> ca984898a5f9218d0ab5496748002fdc1ec4c260
DEBUG:swh.scheduler.task.LoadMercurialTsk:Updating origin_visit for origin 1 with status full
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done updating origin_visit for origin 1 with status full
DEBUG:swh.scheduler.task.LoadMercurialTsk:Cleanup up working bundle /home/storage/hg/repo/distutils2/HG20_none_bundle
{'status': 'eventful'}
>>>

log format: "node id -> rev-target-id-as-hex" and check that id in db.

Diff Detail

Repository
rDLDHG Mercurial loader
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1381
Build 1725: arc lint + arc unit

Event Timeline

ardumont created this revision.Sep 11 2018, 6:24 PM
ardumont edited the test plan for this revision. (Show Details)Sep 11 2018, 6:29 PM
ardumont retitled this revision from Summary: Use mercurial tags as named pointer (referenced in the snapshot) to Use mercurial tags as named pointer (referenced in the snapshot).Sep 12 2018, 9:54 AM
ardumont edited the summary of this revision. (Show Details)Sep 12 2018, 10:02 AM
zack requested changes to this revision.Sep 12 2018, 1:12 PM
zack added a subscriber: zack.

So, as discussed today f2f, I don't like getting rid of release objects. For at least a couple of reasons.

  1. Even if they have no metadata associated (which is indeed a pity), any release object stores 1 bit of information which is quite important: a developer (even if we don't know who) has decided that the particular revision pointed to by the release object was a noteworthy point in the project development history. (That is a major drawback of SVN, for instance, where this information is available only via heuristics based on tag names, but at least there it is not our fault. Throwing mercurial release objects away would be our fault.)
  2. As a general principle we should do all we can to ensure round-tripping: it should be possible to reconstruct a semantically equivalent hg repo starting from what we have ingested into the archive. If we ditch release objects, that is no longer the case.

So it's nack for me on this proposal.

This revision now requires changes to proceed.Sep 12 2018, 1:12 PM
ardumont abandoned this revision.Sep 12 2018, 2:04 PM
In D409#7845, @zack wrote:

So, as discussed today f2f, I don't like getting rid of release objects. For at least a couple of reasons.

  1. Even if they have no metadata associated (which is indeed a pity), any release object stores 1 bit of information which is quite important: a developer (even if we don't know who) has decided that the particular revision pointed to by the release object was a noteworthy point in the project development history. (That is a major drawback of SVN, for instance, where this information is available only via heuristics based on tag names, but at least there it is not our fault. Throwing mercurial release objects away would be our fault.)
  2. As a general principle we should do all we can to ensure round-tripping: it should be possible to reconstruct a semantically equivalent hg repo starting from what we have ingested into the archive. If we ditch release objects, that is no longer the case.

So it's nack for me on this proposal.

Yes.

The type 'release' in itself holds metadata information.
And f2f, you got me at 'if you are trying to rebuild the mercurial repository, then you cannot rebuild the tags'.

Thanks for the heads up.

Cheers,

olasd added a comment.Sep 12 2018, 5:34 PM
In D409#7845, @zack wrote:

So, as discussed today f2f, I don't like getting rid of release objects. For at least a couple of reasons.

  1. Even if they have no metadata associated (which is indeed a pity), any release object stores 1 bit of information which is quite important: a developer (even if we don't know who) has decided that the particular revision pointed to by the release object was a noteworthy point in the project development history. (That is a major drawback of SVN, for instance, where this information is available only via heuristics based on tag names, but at least there it is not our fault. Throwing mercurial release objects away would be our fault.)
  2. As a general principle we should do all we can to ensure round-tripping: it should be possible to reconstruct a semantically equivalent hg repo starting from what we have ingested into the archive. If we ditch release objects, that is no longer the case.

So it's nack for me on this proposal.

The information about what tags have been stored is not thrown away, it is stored as pointers in the snapshot object, the exact same way git (lightweight) tags are currently stored as pointers in the snapshot object when we load a git repository.

I don't see at all how that prevents round-tripping. If anything it sounds more like a namespacing issue in the snapshot object.

Taking your reasoning to the extreme, should we change the git loader to synthesize empty release objects that carry no information for git lightweight tags as well?

The information about what tags have been stored is not thrown away, it is stored as pointers in the snapshot object,

Well, in that patch, it is. It's typed 'revision'.
That is the issue, i don't have much else to choose from (see below).

I don't see at all how that prevents round-tripping. If anything it sounds more like a namespacing issue in the snapshot object.

Right.

  1. If we choose 'release', it won't be solvable, as no release exist in this patch (no no).
  1. If we choose 'revision' (this patch), it's solvable. But for the repository creation (through vault), we won't be able to distinguish between a directly targeted revision and an indirection through tag.

So we end up with something different than the original repository ~> no /.hgtags will be created

  1. If we choose a new one (new snapshot's target_type enum to be defined), that could solve that issue indeed (and a minor edit to this patch).

...the exact same way git (lightweight) tags are currently stored as pointers in the snapshot object when we load a git repository.
Taking your reasoning to the extreme, should we change the git loader to synthesize empty release objects that carry no information for git lightweight tags as well?

For solution 2, it seems that way.
For solution 3., that would mean changing the target_type to that new elected one.

olasd added a comment.EditedSep 12 2018, 7:29 PM
In D409#7851, @ardumont wrote:

The information about what tags have been stored is not thrown away, it is stored as pointers in the snapshot object,

Well, in that patch, it is. It's typed 'revision'.

The target is a revision, why would the target_type field be anything else than revision?

I'll explain my namespace remark further as I think I wasn't clear.

Git snapshots use the following convention for "branch" (really, pointer) names:

  • a pointer named HEAD points to the default branch. In the general case, the git remote protocol doesn't let symbolic references through, so the pointer is an explicit revision. This could become an alias to an explicit branch name if the git remote protocol evolves.
  • pointers named refs/heads/<something> represent the branch named <something>. The target is a revision, which is the head of the given branch.
  • pointers named refs/tags/<something> represent the tag named <something>. The target can be any object (as git allows lightweight tags to point at any object type)
  • there's other arbitrary pointers named refs/<something>, which can have different meanings depending on the tooling used by the developers
    • git notes are stored within the refs/notes/<xxx> namespace
    • in the GitHub world, refs/pull/<number>/head represent the branch submitted as pull request <number>
    • in the GitLab world that's refs/merge-requests/<number>/head
    • in the Gerrit world, refs/changes/<id mod 100>/<id>/<ver> represents version <ver> of the review changeset with id <id>

I believe it's up to us to define a sensible Software Heritage-specific snapshot pointer name namespace convention for Mercurial repositories. Reusing an existing convention, such as the one git-cinnabar uses, could be a way forward.

(all in all reading the cinnabar docs about mercurial tags - the ones stored in .hgtags - makes me wonder if we really want to unpack this information in the snapshots, and if so, how do we decide from which branch we pull the .hgtags file from?)

In D409#7851, @ardumont wrote:

The information about what tags have been stored is not thrown away, it is stored as pointers in the snapshot object,

Well, in that patch, it is. It's typed 'revision'.

The target is a revision, why would the target_type field be anything else than revision?

Well, yes it's a revision but that's does not convey exactly the same meaning here. We would not discuss this otherwise.
It's an 'indirect' revision for lack of a better description.

And to clarify, in my mind, what i proposed remained a revision but typed differently (without the need to tamper with the branch's name to determine that).

Also, I can see how that typed revision is impactful on our current implementations.
So that might not be such a simple thing to change (if that were considered).

I'll explain my namespace remark further as I think I wasn't clear.

It was but i interpreted it differently than you intended. Thanks anyway.

And saw another possible implementation than the existing one (using branch name conventions).
Which, i believe makes sense. Otherwise, why would the upstream technology uses different terminologies... (annotated tags, lightweight tags, tags...).

Git snapshots use the following convention for "branch" (really, pointer) names:

  • a pointer named HEAD points to the default branch. In the general case, the git remote protocol doesn't let symbolic references through, so the pointer is an explicit revision. This could become an alias to an explicit branch name if the git remote protocol evolves.
  • pointers named refs/heads/<something> represent the branch named <something>. The target is a revision, which is the head of the given branch.
  • pointers named refs/tags/<something> represent the tag named <something>. The target can be any object (as git allows lightweight tags to point at any object type)
  • there's other arbitrary pointers named refs/<something>, which can have different meanings depending on the tooling used by the developers
    • git notes are stored within the refs/notes/<xxx> namespace
    • in the GitHub world, refs/pull/<number>/head represent the branch submitted as pull request <number>
    • in the GitLab world that's refs/merge-requests/<number>/head
    • in the Gerrit world, refs/changes/<id mod 100>/<id>/<ver> represents version <ver> of the review changeset with id <id>

I was only aware of some, so thanks for providing further detail.
So there exists already conventions (branch name), it's just not enforced by a specific 'type' differentiating from other revisions.

I believe it's up to us to define a sensible Software Heritage-specific snapshot pointer name namespace convention for Mercurial repositories. Reusing an existing convention, such as the one git-cinnabar uses, could be a way forward.

Thanks for the link.
As it's the current way of doing it (without impact on existing implementations), that sounds about right then.

If we go with this, I would choose something explicit like 'refs/tags/%s' or even 'tags/%s'.
Because as @anlambert pointed out to me, the user needs something explicit to understand it's a "tag" (in regards to browing experience).

(all in all reading the cinnabar docs about mercurial tags - the ones stored in .hgtags - makes me wonder if we really want to unpack this information in the snapshots, and if so, how do we decide from which branch we pull the .hgtags file from?)

That's a good question.

We could avoid making that choice and not unpack the information.
As far as i understood the loader (current state), there is no choice made either (regarding branch).
If a .hgtags is detected (in the contents), it's taken into account.

If we were to remove that .hgtags parsing step, we could still recreate the repository (since it's part of the contents).
So even not considering it, we do not lose the information.

(We lose a little for the browsing experience though as we will no longer see the tags there...)

Cheers,

zack added a comment.Sep 23 2018, 4:48 PM

I'm not sure why the status of this diff is abandoned, but I'm picking up the discussion on whether the mercurial loader should create release objects because it started here.
Since we last exchanged here, we also exchanged f2f, so rather than picking up all the sub-thread I'm focusing on the main points.

Regarding reproducible of the original state, I'm sold: I've no trouble believing that, using suitable branch names in the snapshot object we can make sure to recreate the original state. So this is indeed a no issue.
What remains is the second point: semantic adherence to the Software Heritage data model.

As we have described in several places now, release objects in the SWH DAG are "revisions [that] get selected by developers as denoting important project milestones". This means that the goal we should strive for is that, every time a developer uses their tool at hand to "make a release", we will have a corresponding tag object in the archive. Looking back, we have already failed this goal in several places:

  1. the non-VCS loaders (e.g., debian, tarball, etc.) that shortcut release objects and only create revision objects
  2. the git loader for non-annotated tags (i.e., plain "git tag" v. "git tag -a/-s")

For (1), the reasons we used were two (IIRC): a) release objects do not support arbitrary metadata while revision objects do; plus b) we considered the chain release → revision → directory "too long", and we shortened it to revision → directory. For (2) we didn't want to discriminate among various refs, treating refs/tags/ differently than others (I *guess*).

In hindsight, I think these were the wrong decisions. We should have instead:

  • added support for arbitrary metadata to release objects
  • use "release → directory" chains for tarballs & friends rather than "revision → directory"
  • discriminate Git refs by explicitly creating synthetic=true release objects for "git tag" tags

Doing so would have been more consistent with what we claim our data model captures.

Now, creating proper release objects for mercurial tags is consistent with the view above. At the same time is inconsistent with the (wrong, if you agree with the above) historical choices we have made elsewhere.

I'm tempted to starting get things right(er) in this respect and also try to fix what we have done suboptimally in the past. (Will need a separate task with the above rationale and a plan of action.)