Page MenuHomeSoftware Heritage

loader: Read snapshot out of the last origin visit status
ClosedPublic

Authored by ardumont on Jun 17 2020, 4:26 PM.

Details

Summary

Preparatory work to drop unused fields from OriginVisit

Related to T2310

Test Plan

tox

Diff Detail

Repository
rDLDHG Mercurial loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D3306 (id=11710)

Rebasing onto f1866671a4...

Current branch diff-target is up to date.
Changes applied before test
commit 63957ffc0cba296c37bc5f835a0d011ffa2e65e2
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 17 16:22:12 2020 +0200

    loader: Read snapshot out of the last origin visit status
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/9/ for more details.

Forgot to update test accordingly

Build is green

Patch application report for D3306 (id=11711)

Rebasing onto f1866671a4...

Current branch diff-target is up to date.
Changes applied before test
commit 151831ca932d85a2a75b93951f4245e3027f5c14
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 17 16:22:12 2020 +0200

    loader: Read snapshot out of the last origin visit status
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/10/ for more details.

Simplify according to latest storage algos.snapshot.snapshot-get-latest
endpoint

Build is green

Patch application report for D3306 (id=11806)

Rebasing onto cba09ab734...

Current branch diff-target is up to date.
Changes applied before test
commit 7837917c8ab71aefa1ad3da816defb900668f319
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jun 22 16:02:57 2020 +0200

    loader: Read snapshot out of the last origin visit status
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/11/ for more details.

swh/loader/mercurial/loader.py
175–180

I could require branches_count=1 as well as we don't do much with that snapshot.
Currently that call could end up being heavy if the snapshot is huge (and in the end we only use it to have its id...

Make it apparent we only use the snapshot for its id
(Also reduce the snapshot object model to the minimum)

Build is green

Patch application report for D3306 (id=11808)

Rebasing onto cba09ab734...

Current branch diff-target is up to date.
Changes applied before test
commit 33fe79eb34c074a760fae5a67f2130e1d47ea7c9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jun 22 16:02:57 2020 +0200

    loader: Retrieve last snapshot with snapshot_get_latest function
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/12/ for more details.

swh/loader/mercurial/loader.py
175–180

well, i did ;)
shout out if you don't agree.

douardda added a subscriber: douardda.

Not sure about implications of the branches_count=1 argument, but it may be related to my other comment. Otherwise, LGTM.

swh/loader/mercurial/loader.py
609

Is there a corner case situation here where snapshots[-1] is a partial snapshot (e.g. because of a failure in the middle of the previous loading) but snapshots[-2] is ok, and current snapshot is actually the same this later?

In other words, should we check for existence of a previous snapshot equal to the current one, and not only the last one?

This revision is now accepted and ready to land.Jun 23 2020, 9:42 AM
swh/loader/mercurial/loader.py
609

I guess that could happen yes.
Note that i'm just porting the code here ;)

We could adapt to iterate over all the snapshots we know for the origin.
(And i guess that could also be a common behavior for all loaders.)

I don't really know though:

  • if that situation happens enough
  • how complicate the code will get
  • if the worse situation case (no snapshot found over a lot of visits) how the performance will get degraded

All in all, i found the current code simple enough and to the point for most cases.
But as usual, i may be wrong ;)

Not sure about implications of the branches_count=1 argument, but it may be related to my other comment. Otherwise, LGTM.

As we only want the snapshot id here, there is no point in creating the full snapshot here (which in some edge cases could be large).
That's what the branches_count=1 is all about.

loader: Read snapshot out of the last origin visit status

Build is green

Patch application report for D3306 (id=11827)

Rebasing onto cba09ab734...

Current branch diff-target is up to date.
Changes applied before test
commit d210f73957a52888ba6e17e118fa4e4485e3536b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jun 22 16:02:57 2020 +0200

    loader: Read snapshot out of the last origin visit status
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/13/ for more details.