Page MenuHomeSoftware Heritage

Implement discovering branch targets from the archive
ClosedPublic

Authored by olasd on Nov 4 2022, 7:33 PM.

Details

Summary

With the proper implementation of packfile negotiation, remotes can
return packfiles that do not contain all of our wanted objects. Consider
the following two histories:

* c1                                      c1     ← [refs/tags/original]
↑                                         ↑  ↖
* c2 ← [refs/heads/main]                  |   c3 ← [refs/heads/main]
                                          c2     ← [refs/heads/broken]

The first visit of the origin would load commits c1 and c2, and write a
snapshot referencing c2.

During the second visit, the loader would tell the origin that c2 is
known, and that c1 and c3 are wanted (as new heads). The origin, knowing
that c1 is a parent of c2, would be allowed by the git protocol to send
a packfile containing only c3. Under these circumstances, the loader
cannot tell what object type the snapshot branch
[refs/tags/original] should point to.

The repository in tests has a similar structure ([refs/heads/master] is
in the history of [refs/tags/branch2-before-delete]), so refactor the
incremental load test to exercise this specific behavior. This test can
be moved to the common tests as well.

(Bundle two small refactoring commits to make sure the dumb protocol loader has
the same behavior as the smart protocol loader for tests)

Test Plan

improved existing test to trigger the behavior

Diff Detail

Repository
rDLDG Git loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8817 (id=31769)

Rebasing onto 3cf7582aa7...

Current branch diff-target is up to date.
Changes applied before test
commit 96fab5241ad44909120e91a5929a89b0f8fe0741
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Nov 4 18:52:31 2022 +0100

    Implement discovering branch targets from the archive
    
    With the proper implementation of packfile negotiation, remotes can
    return packfiles that do not contain all of our wanted objects. Consider
    the following two histories:
    
    * c1                                      c1     ← [refs/tags/original]
    ↑                                         ↑  ↖
    * c2 ← [refs/heads/main]                  |   c3 ← [refs/heads/main]
                                              c2     ← [refs/heads/broken]
    
    The first visit of the origin would load commits c1 and c2, and write a
    snapshot referencing c2.
    
    During the second visit, the loader would tell the origin that c2 is
    known, and that c1 and c3 are wanted (as new heads). The origin, knowing
    that c1 is a parent of c2, would be allowed by the git protocol to send
    a packfile containing only c3. Under these circumstances, the loader
    cannot tell what object type the snapshot branch
    [refs/tags/original] should point to.
    
    The repository in tests has a similar structure ([refs/heads/master] is
    in the history of [refs/tags/branch2-before-delete]), so refactor the
    incremental load test to exercise this specific behavior. This test can
    be moved to the common tests as well.

commit e7988153e2ae8294a2d8ad451423ba587984687c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Nov 4 18:51:58 2022 +0100

    dumb loader: also filter the symbolic refs
    
    Even though this is only HEAD, we should make sure that it's filtered anyway.

commit c2ed09e0e86c1df75ea115a6a93f2f5a45427e52
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Nov 4 18:50:23 2022 +0100

    Make utils.filter_refs accept {bytes: bytes} and {bytes: HexBytes}
    
    In terms of mypy, this function is just doing some types-washing anyway.

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/238/ for more details.

olasd requested review of this revision.Nov 4 2022, 7:36 PM

More consistent commit message

Build is green

Patch application report for D8817 (id=31770)

Rebasing onto 3cf7582aa7...

Current branch diff-target is up to date.
Changes applied before test
commit 92d9ada9b73939134c363347fde11b5146d25842
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Nov 4 18:52:31 2022 +0100

    Implement discovering branch targets from the archive
    
    With the proper implementation of packfile negotiation, remotes can
    return packfiles that do not contain all of our wanted objects. Consider
    the following two histories:
    
    * c1                                      * c1     ← [refs/tags/original]
    ↑                                         ↑ ↖
    * c2 ← [refs/heads/main]                  |   * c3 ← [refs/heads/main]
                                              * c2     ← [refs/heads/broken]
    
    The first visit of the origin would load commits c1 and c2, and write a
    snapshot referencing c2.
    
    During the second visit, the loader would tell the origin that c2 is
    known, and that c1 and c3 are wanted (as new heads). The origin, knowing
    that c1 is a parent of c2, would be allowed by the git protocol to send
    a packfile containing only c3. Under these circumstances, the loader
    cannot tell what object type the snapshot branch
    [refs/tags/original] should point to.
    
    The repository in tests has a similar structure ([refs/heads/master] is
    in the history of [refs/tags/branch2-before-delete]), so refactor the
    incremental load test to exercise this specific behavior. This test can
    be moved to the common tests as well.

commit e7988153e2ae8294a2d8ad451423ba587984687c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Nov 4 18:51:58 2022 +0100

    dumb loader: also filter the symbolic refs
    
    Even though this is only HEAD, we should make sure that it's filtered anyway.

commit c2ed09e0e86c1df75ea115a6a93f2f5a45427e52
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Nov 4 18:50:23 2022 +0100

    Make utils.filter_refs accept {bytes: bytes} and {bytes: HexBytes}
    
    In terms of mypy, this function is just doing some types-washing anyway.

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/239/ for more details.

I've tested this on all the origins that triggered https://sentry.softwareheritage.org/share/issue/d04e5d7050cb49c6a080b0552fdee5ef/ (with D7838 applied to test on the real storage) and they load fine now.

neat

swh/loader/git/loader.py
508–517

we can save up to three set-to-list conversions by turning targets_unknown into a list like this

swh/loader/git/tests/test_loader.py
567–569 ↗(On Diff #31770)

what happened to this test?

swh/loader/git/loader.py
508–517

99% of cases will probably find all targets on the first call. and we still need to convert targets_unknown into a set before the subtraction to get known

swh/loader/git/tests/test_loader.py
567–569 ↗(On Diff #31770)

It's been functionally by the other one, which is more accurate (as it does two real loads instead of faking a previous visit).

swh/loader/git/tests/test_loader.py
567–569 ↗(On Diff #31770)

It's been superseded by the other one, which is more accurate (as it does two real loads instead of faking a previous visit).

This revision is now accepted and ready to land.Nov 4 2022, 8:35 PM