Page MenuHomeSoftware Heritage

swh-loader-mercurial: Fix invalid release target and add missing data
ClosedPublic

Authored by anlambert on Jul 19 2018, 3:59 PM.

Details

Summary

When processing hg tags and transforming them to swh releases, the target revisions
were not correctly computed. Instead of pointing to a swh revision, the original hg
tag changeset id was used.

That diff adds a global mapping between hg changesets and swh revisions in order
to compute the correct revision target identifier when processing a mercurial tag.
Missing release data (message, author, date) are also added from the associated
target revision.

Related T1155

Diff Detail

Repository
rDLDHG Mercurial loader
Branch
fix-release-target-and-data
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1355
Build 1699: arc lint + arc unit

Event Timeline

ardumont added inline comments.
swh/loader/mercurial/bundle20_loader.py
445

Yes, that's the T1155 fix, targetting the swh revision, thanks.

451

But I'm not sure we want the message, author, and date since that does not exist in mercurial.
From the .hgtags file, we have solely the name and the targetted changeset.

From [1], a tag is a symbolic identifier for a changeset so indeed, does not contain much.

Here, we derived those information from the revision. We kinda force the symbolic resolution.
(Also we fetch the revision from the storage)...

[1] https://www.mercurial-scm.org/wiki/Tag
(careful now, that page says that some information may be no longer true, without specifying what is what...)

swh/loader/mercurial/bundle20_loader.py
451

My point here is to stick to the swh data model (not the mercurial one) and add some relevant info regarding a release (date is quite important for instance).

I agree that this is pure data duplication but for those who query the /release api endpoint
or browse a release page, it may be relevant to add those info.

For instance, the page https://archive.softwareheritage.org/browse/release/ac77c274b0e8060d349a1a7e8a5fa7fae12f4ca3/?origin=https://273280703-shooter-player.googlecode.com/hg/ looks sad without them.

ardumont added inline comments.
swh/loader/mercurial/bundle20_loader.py
451

My point here is to stick to the swh data model (not the mercurial one)

But, aren't we supposed to adhere to both? For me, we do.

and add some relevant info regarding a release (date is quite important for instance).

My take is that it's not for us to decide, we must state facts about the mercurial repository.
The fact for me is that there exists a tag without anything except a name and a target.
That translates in swh as a release with a name and a revision target, nothing more.

I agree that this is pure data duplication but for those who query the /release api endpoint

or browse a release page, it may be relevant to add those info.

Yes, but they already can by clicking one more link, the target one you fixed (well, they will at some point in the future, when we merge, deploy, clean up, etc...).

Also, i fear that modifying this for the archive's benefit is conflating multiple things.
Archive's concern should be in webapp, not pushed on the loader.

@zack any thoughts on this?


In any case, note that discussing with you some more, it appears the mercurial web interfaces actually shows the changeset when showing the release. So they solve the indirection in their ui.

That lessen my point a little, but not much.
The indirection solving could be implemented in the archive side (for origin type 'hg') where i think it belongs.

swh/loader/mercurial/bundle20_loader.py
451

(thanks for highlighting me, I wasn't following this)

The guiding principle for importing non-synthetic objects has indeed always been to represent as faithfully as possible the information available in the original source code origin. So I concur that, if hg releases do not have timestamps, we should not copy there the timestamps of the pointed revisions. (And that is supported by our underlying data model, where timestamps for release objects are nullable.) Same for other metadata that are not strictly required by our data model.

I totally understand the visualization constraint, though. But that can be fixed by heuristics, e.g., if the timestamp is null, try to lookup the timestamp of the pointed object and show that instead (maybe with a visual clue that helps users differentiate the two cases). FWIW, I don't think that should be done only for objects coming from specific origins, it's just a general thing. Also, the API should not do the same, and rather not return a timestamp if it's not in the object.

Update diff according to discussion: do not copy targeted revision metadata
(message, author, date) into release ones.

swh/loader/mercurial/bundle20_loader.py
451

Thanks for clarifying the situation, diff has been updated accordingly.

This revision is now accepted and ready to land.Jul 23 2018, 11:35 AM
This revision was automatically updated to reflect the committed changes.