Page MenuHomeSoftware Heritage

Use extids to filter out already seen revisions across hg origins
ClosedPublic

Authored by ardumont on Sep 10 2021, 5:27 PM.

Details

Summary

For a given origin, if a previous snapshot exists, this keeps the behavior of filtering
out already seen revisions from that snapshot. Then it goes on and tries to filter the
remaining revisions through the extid mappings for other mercurial origins.

Note: This makes me think back that we need to update the storage interface so the
filtering on the extid_version can happen server side. So as to avoid filtering in the
client code. This is somewhat an independent work from this which got started in a
dedicated diff [1].

[1] D6249

Test Plan

tox

Diff Detail

Repository
rDLDHG Mercurial loader
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 23581
Build 36788: Phabricator diff pipeline on jenkinsJenkins console ยท Jenkins
Build 36787: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Explicit that the current behavior does not compute another snapshot when nothing
changes (thus everything gets filtered out).

To show the new behavior is consistent with it.

Marked as FIXME as i'm not sure we want such behavior though.

Build is green

Patch application report for D6240 (id=22644)

Rebasing onto 9be124af21...

Current branch diff-target is up to date.
Changes applied before test
commit 4e018d168fd6e11bc9fa60dfde47f3b7edac9660
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 10 17:00:15 2021 +0200

    Use extids to filter out already seen revisions across hg origins
    
    Note that no longer uses the snapshot to actually reuse the extid filtering.

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/283/ for more details.

Add back dropped test (stash mix up)

Build is green

Patch application report for D6240 (id=22645)

Rebasing onto 9be124af21...

Current branch diff-target is up to date.
Changes applied before test
commit e05d3184d0502a3e3bcfe9789e43834944b2cac9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 10 17:00:15 2021 +0200

    Use extids to filter out already seen revisions across hg origins
    
    Note that no longer uses the snapshot to actually reuse the extid filtering.

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/284/ for more details.

swh/loader/mercurial/from_disk.py
370โ€“371

@olasd Heads up, i've kept the old behavior as i currently don't see how to do what you
suggested ("filtering after the history filter"). I gather it's because I don't really
see how to merge consistently the filtered (HgFilteredSet/hgutil.smartset) results .

Maybe the current implementation is enough already. It uses best of both world. If there
is snapshot, then use it to filter out already seen revisions (that will work on smaller
subset, still when using the exitd mapping table, that will nonetheless uses other
fork's potential existing mappings).

Otherwise, let's fallback to filter over all extids for the default case.

What do you think of ^?

I did not yet work on grouping the extid reading calls though (@vlorentz's suggestion),
is that necessary?

ardumont edited the summary of this revision. (Show Details)

Possibly forgotten comment changes

Build is green

Patch application report for D6240 (id=22652)

Rebasing onto 9be124af21...

Current branch diff-target is up to date.
Changes applied before test
commit 80991602c2d0034a976ff389965be41c1ae310df
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 10 17:00:15 2021 +0200

    Use extids to filter out already seen revisions across hg origins
    
    Note that no longer uses the snapshot to actually reuse the extid filtering.

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/285/ for more details.

Actually filter through snapshot then on the remaining data set, filter out through
extids mapping.

Tests seem still happy

swh/loader/mercurial/from_disk.py
370โ€“371

I've tried to make it 1. then 2. instead of 1. or 2.

  1. being using the snapshot.
  2. being using the global extid mappings

Build is green

Patch application report for D6240 (id=22656)

Rebasing onto 9be124af21...

Current branch diff-target is up to date.
Changes applied before test
commit 3976e6fb7031010d2d25b845c0943207aac9bd8e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 10 17:00:15 2021 +0200

    Use extids to filter out already seen revisions across hg origins
    
    Note that no longer uses the snapshot to actually reuse the extid filtering.

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/286/ for more details.

Adapt docstring and rework commit message.

Build is green

Patch application report for D6240 (id=22658)

Rebasing onto 9be124af21...

Current branch diff-target is up to date.
Changes applied before test
commit 7a31286e4971d51bb965826963cb3ce6ffc13aa2
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 10 17:00:15 2021 +0200

    Add support to filter out already seen revisions across hg origins
    
    This uses the extid mapping from the storage to actually achieve this.
    
    Related to T3563

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/287/ for more details.

Drop no longer needed conditional

Build is green

Patch application report for D6240 (id=22659)

Rebasing onto 9be124af21...

Current branch diff-target is up to date.
Changes applied before test
commit 63904aa241861db0dd0f0aaaaa3454772fd7cbcd
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 10 17:00:15 2021 +0200

    Add support to filter out already seen revisions across hg origins
    
    This uses the extid mapping from the storage to actually achieve this.
    
    Related to T3563

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/288/ for more details.

swh/loader/mercurial/tests/test_from_disk.py
739

@olasd @vlorentz Note this though ^ and the comment below.

swh/loader/mercurial/tests/test_from_disk.py
739

A full visit shouldn't be able to have snapshot = None, so, yeah, there's something fishy going on.

swh/loader/mercurial/tests/test_from_disk.py
739

It's the current behavior of the implementation though.

That's one of the reason i added some checks in a previous test
about uneventful visit when you ingest twice a repository without change (line 189).

I'll double check that adaptations in the current master (line 189 onwards in the master tomorrow).
If we need to change this, i gather some breaking tests change though (in another diff though).

ardumont added inline comments.
swh/loader/mercurial/tests/test_from_disk.py
739

What i meant about current behavior is that when a revision is currently detected as seen.
It's filtered out and does not make it into a snapshot.
So in effect, a new visit of the same origin near enough (that is the origin did not change in between visits), will end up into 1 ingestion with 1 visit full with a snapshot the first time, then for the other visit, this will end up with no snapshot.

While it can be "somewhat" ok for the same origin, it becomes a "nono" for a fork since, for a given order, we won't be able to make a snapshot for forks.

The thing is, if we adapt the behavior to actually have a snapshot with already seen revisions (which would be closer to the truth), that will change the behavior for both scenario (same origin and forks). To do this is something that would be clearer to me and more to the point.
But that's more work [1] [2] (in another diff to make things clearer ;)

What do you think?

[1] in another diff to make things clearer

[2] That begs the question of "do we keep the old implementations? (because *sigh* at the idea of adapting this code as well...)

ardumont added inline comments.
swh/loader/mercurial/tests/test_from_disk.py
739

See D6262 to explicit the wrong behavior.
And T3571 to actually track and fix it.

I'm unsure about in which order to do this though.
I think landing this and then adapt correctly the snapshot handling after this makes sense.

swh/loader/mercurial/tests/test_from_disk.py
739

I think landing this and then adapt correctly the snapshot handling after this makes sense.

The reverse order may be easier: fixing the snapshot logic in the "simpler" case when extids are only involved for existing heads, then introducing the extid matching logic for arbitrary revisions.

Either way, it doesn't matter much, as I assume both will get deployed at the same time.

Build is green

Patch application report for D6240 (id=22674)

Could not rebase; Attempt merge onto 9be124af21...

Updating 9be124a..5fe2a9b
Fast-forward
 swh/loader/mercurial/from_disk.py            | 95 ++++++++++++++++++++--------
 swh/loader/mercurial/tests/test_from_disk.py | 19 +++++-
 2 files changed, 87 insertions(+), 27 deletions(-)
Changes applied before test
commit 5fe2a9bf77d6667b55a912b58c0f3350390d2cda
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 10 17:00:15 2021 +0200

    Add support to filter out already seen revisions across hg origins
    
    This uses the extid mapping from the storage to actually achieve this.
    
    Related to T3563

commit 139061bcc38bd5652a266b5eb0645a1492e7a5aa
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Sep 15 10:24:48 2021 +0200

    test: Explicit that 2 visits without change ends up with no snapshot

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/291/ for more details.

swh/loader/mercurial/tests/test_from_disk.py
739

Either way, it doesn't matter much, as I assume both will get deployed at the same time.

yeah, i'm in that mind logic as well.

Add back dropped test by mistake (rebase messed up probably)

Use correct range of commits

Build is green

Patch application report for D6240 (id=22695)

Rebasing onto 9be124af21...

Current branch diff-target is up to date.
Changes applied before test
commit 139061bcc38bd5652a266b5eb0645a1492e7a5aa
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Sep 15 10:24:48 2021 +0200

    test: Explicit that 2 visits without change ends up with no snapshot

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/293/ for more details.

Build is green

Patch application report for D6240 (id=22696)

Could not rebase; Attempt merge onto 9be124af21...

Updating 9be124a..3231f2e
Fast-forward
 swh/loader/mercurial/from_disk.py            | 95 ++++++++++++++++++++--------
 swh/loader/mercurial/tests/test_from_disk.py | 77 +++++++++++++++++++++-
 2 files changed, 145 insertions(+), 27 deletions(-)
Changes applied before test
commit 3231f2ed19962275d567f7bf39127a2d0621697d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 10 17:00:15 2021 +0200

    Add support to filter out already seen revisions across hg origins
    
    This uses the extid mapping from the storage to actually achieve this.
    
    Related to T3563

commit 139061bcc38bd5652a266b5eb0645a1492e7a5aa
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Sep 15 10:24:48 2021 +0200

    test: Explicit that 2 visits without change ends up with no snapshot

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/294/ for more details.

Build is green

Patch application report for D6240 (id=22715)

Could not rebase; Attempt merge onto 9be124af21...

Updating 9be124a..1c6c7e5
Fast-forward
 swh/loader/mercurial/from_disk.py            | 95 ++++++++++++++++++++--------
 swh/loader/mercurial/tests/test_from_disk.py | 75 +++++++++++++++++++++-
 2 files changed, 143 insertions(+), 27 deletions(-)
Changes applied before test
commit 1c6c7e5be82df5f2419e9727d6b5f0e74e9494d9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 10 17:00:15 2021 +0200

    Add support to filter out already seen revisions across hg origins
    
    This uses the extid mapping from the storage to actually achieve this.
    
    Related to T3563

commit bf39050051ec5c7b4c6cc7b4801079ea591cd549
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Sep 15 10:24:48 2021 +0200

    test: Explicit that 2 visits without change ends up with no snapshot

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/299/ for more details.

Group into chunks prior to discuss with storage

Build is green

Patch application report for D6240 (id=22722)

Rebasing onto bf39050051...

Current branch diff-target is up to date.
Changes applied before test
commit 4c9eb38ec516de3f65c3996d937cd3328ded6261
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 10 17:00:15 2021 +0200

    Add support to filter out already seen revisions across hg origins
    
    This uses the extid mapping from the storage to actually achieve this.
    
    Related to T3563

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/301/ for more details.

swh/loader/mercurial/from_disk.py
301

It's changed that way (from a comprehension to a for loop) because of the next step [1] (and i got lazy after the gazillionth rebase).

[1] D6268

LGTM

swh/loader/mercurial/from_disk.py
362

why did you remove the return type?

375

seen_revs will typically be very large; does Mercurial handle this query nicely?

377โ€“380
swh/loader/mercurial/from_disk.py
362

a mistake ;)
it gets removed, well transformed, in the next though.

375

Well, i don't know but it was already that way before.
It just now does some more substraction now.

How can I check this?

I'm planning (to land) and check that on staging first on overall large repositories.

to land or test it in a venv, i'm not sure yet.

swh/loader/mercurial/from_disk.py
362

Nope, it's transformed here.
I'll add it back nonetheless, thx.

swh/loader/mercurial/from_disk.py
377โ€“380

i'm not a huge fan of this writing but ok.

swh/loader/mercurial/from_disk.py
377โ€“380

Note, that's not it.
I'm not returning a list, i'm yielding stuff.

i'd prefer yield from self._new... then.

Adapt according to last comments

Build is green

Patch application report for D6240 (id=22744)

Rebasing onto bf39050051...

Current branch diff-target is up to date.
Changes applied before test
commit 765cb5aefa8bb370bc1343d990ad328188a4eba5
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 10 17:00:15 2021 +0200

    Add support to filter out already seen revisions across hg origins
    
    This uses the extid mapping from the storage to actually achieve this.
    
    Related to T3563

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/304/ for more details.

swh/loader/mercurial/from_disk.py
375

I don't know; load a large repo

swh/loader/mercurial/from_disk.py
375

yes, but i thought you add another idea ;)
The venv is giving me a hard time though [1] so i'm of a mind to land and check the result.

[1]
workers are running buster.
the install requires (pip install -e .) some confluent-kafka with a version 1.6 more recent than what's proposed on debian (1.5).
Hence not working...
And it seems i'm not able to say, drop that deps.

swh/loader/mercurial/from_disk.py
375

ah that's a new deps grown out of the new deps from storage on swh.counter in the 0.37 storage.
i've successfully installed my venv installing the previous version of storage.

swh/loader/mercurial/from_disk.py
375

pip hell unstuck, fixed.

I've made a pass on the origin https://foss.heptapod.net/mercurial/tortoisehg/thg which qualifies
It's faster and consistent se P1167 ;)

This revision was not accepted when it landed; it landed in state Needs Review.Sep 17 2021, 10:05 AM
This revision was automatically updated to reflect the committed changes.