Page MenuHomeSoftware Heritage

Eagerly populate the set of local heads in RepoRepresentation.__init__
ClosedPublic

Authored by olasd on Nov 3 2022, 5:22 PM.

Details

Summary

As dulwich's client.fetch_pack expects an instance of history graph
walker with set of known heads, move the local heads caching from
determine_wants to the RepoRepresentation initialization logic.

Our previous code would always initialize the graph walker with an empty
set of heads (as the graph_walker() method is called before
determine_wants() has run, so self.heads was always empty), so we
would never actually fetch an incremental pack file.

Test Plan

Added a new test checking that the incremental loads really only
fetch new commits.

On an incremental load of chromium.git:

before:

Enumerating objects: 18141323, done.
Counting objects: 100% (4755/4755), done.
Compressing objects: 100% (2733/2733), done.
ERROR:swh.loader.git.loader.GitLoader:Loading failure, updating to `failed` status

after:

Enumerating objects: 1203181, done.
Counting objects: 100% (407/407), done.
Compressing objects: 100% (233/233), done.
Total 1203181 (delta 270), reused 279 (delta 174), pack-reused 1202774

Diff Detail

Repository
rDLDG Git loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D8808 (id=31738)

Could not rebase; Attempt merge onto d6d5ce2b58...

Updating d6d5ce2..42553ee
Fast-forward
 swh/loader/git/loader.py            | 28 ++++++++++++++-------
 swh/loader/git/tests/test_loader.py | 50 +++++++++++++++++++++++++++++++++++++
 2 files changed, 69 insertions(+), 9 deletions(-)
Changes applied before test
commit 42553ee2cea05b71e107de24a84c5a566677c783
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Nov 3 16:20:58 2022 +0100

    Implement the dulwich graph walker interface directly in RepoRepresentation
    
    As dulwich's client.fetch_pack expects an instance of history graph
    walker, and we don't know the list of heads already available locally
    until `determine_wants` has run, just implement the two
    methods (`__next__` and `ack`) that the graph walker expects in our
    RepoRepresentation class.
    
    Our previous code would always initialize the graph walker with an empty
    set of heads (as the `graph_walker()` method is called before
    `determine_wants()` has run, so `self.heads` is always empty), so we
    would never actually fetch an incremental pack file.

commit 35ecf1843696b3676275920c86bce0c88ca75427
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Nov 3 15:21:45 2022 +0100

    Dump all known and remote heads when debugging

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/234/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/234/console

Harbormaster returned this revision to the author for changes because remote builds failed.Nov 3 2022, 5:22 PM
Harbormaster failed remote builds in B32693: Diff 31738!

Build is green

Patch application report for D8808 (id=31739)

Could not rebase; Attempt merge onto d6d5ce2b58...

Updating d6d5ce2..d925d3d
Fast-forward
 swh/loader/git/loader.py            | 28 ++++++++++++------
 swh/loader/git/tests/test_loader.py | 59 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 77 insertions(+), 10 deletions(-)
Changes applied before test
commit d925d3df9ae6c8b1e09c6bc011aad759d52b818c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Nov 3 16:20:58 2022 +0100

    Implement the dulwich graph walker interface directly in RepoRepresentation
    
    As dulwich's client.fetch_pack expects an instance of history graph
    walker, and we don't know the list of heads already available locally
    until `determine_wants` has run, just implement the two
    methods (`__next__` and `ack`) that the graph walker expects in our
    RepoRepresentation class.
    
    Our previous code would always initialize the graph walker with an empty
    set of heads (as the `graph_walker()` method is called before
    `determine_wants()` has run, so `self.heads` is always empty), so we
    would never actually fetch an incremental pack file.

commit 35ecf1843696b3676275920c86bce0c88ca75427
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Nov 3 15:21:45 2022 +0100

    Dump all known and remote heads when debugging

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/235/ for more details.

olasd requested review of this revision.Nov 3 2022, 5:30 PM

Jeez, what a subtle bug !

However I found the fix a little bit hackish. How about sticking to the dulwich API and fetching the local heads
in the RepoRepresentation constructor instead ?

The following diff seems to do the trick and all tests still pass.

diff --git a/swh/loader/git/dumb.py b/swh/loader/git/dumb.py
index 35826e9..7197363 100644
--- a/swh/loader/git/dumb.py
+++ b/swh/loader/git/dumb.py
@@ -102,7 +102,7 @@ class GitObjectsFetcher:
                     # commit not already seen in the current load
                     parent not in self.objects[b"commit"]
                     # commit not already archived by a previous load
-                    and parent not in self.base_repo.heads
+                    and parent not in self.base_repo.local_heads
                 ):
                     commit_objects.append(cast(Commit, self._get_git_object(parent)))
                     self.objects[b"commit"].add(parent)
diff --git a/swh/loader/git/loader.py b/swh/loader/git/loader.py
index 76bc265..451eb43 100644
--- a/swh/loader/git/loader.py
+++ b/swh/loader/git/loader.py
@@ -15,6 +15,7 @@ from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional, Set,
 import dulwich.client
 from dulwich.errors import GitProtocolError, NotGitRepository
 from dulwich.objects import ShaFile
+from dulwich.object_store import ObjectStoreGraphWalker
 from dulwich.pack import PackData, PackInflater
 
 from swh.core.statsd import Statsd
@@ -58,8 +59,16 @@ class RepoRepresentation:
         else:
             self.base_snapshots = []
 
-        self.heads: Set[HexBytes] = set()
-        self.__iterator = None
+        # Cache existing heads
+        self.local_heads: Set[HexBytes] = set()
+        for base_snapshot in self.base_snapshots:
+            for branch_name, branch in base_snapshot.branches.items():
+                if not branch or branch.target_type == TargetType.ALIAS:
+                    continue
+                self.local_heads.add(HexBytes(hashutil.hash_to_bytehex(branch.target)))
+
+    def graph_walker(self) -> ObjectStoreGraphWalker:
+        return ObjectStoreGraphWalker(self.local_heads, get_parents=lambda commit: [])
 
     def determine_wants(self, refs: Dict[bytes, HexBytes]) -> List[HexBytes]:
         """Get the list of bytehex sha1s that the git loader should fetch.
@@ -71,16 +80,6 @@ class RepoRepresentation:
         if not refs:
             return []
 
-        # Cache existing heads
-        local_heads: Set[HexBytes] = set()
-        for base_snapshot in self.base_snapshots:
-            for branch_name, branch in base_snapshot.branches.items():
-                if not branch or branch.target_type == TargetType.ALIAS:
-                    continue
-                local_heads.add(HexBytes(hashutil.hash_to_bytehex(branch.target)))
-
-        self.heads = local_heads
-
         # Get the remote heads that we want to fetch
         remote_heads: Set[HexBytes] = set()
         for ref_name, ref_target in refs.items():
@@ -88,9 +87,9 @@ class RepoRepresentation:
                 continue
             remote_heads.add(ref_target)
 
-        logger.debug("local_heads_count=%s", len(local_heads))
+        logger.debug("local_heads_count=%s", len(self.local_heads))
         logger.debug("remote_heads_count=%s", len(remote_heads))
-        wanted_refs = list(remote_heads - local_heads)
+        wanted_refs = list(remote_heads - self.local_heads)
         logger.debug("wanted_refs_count=%s", len(wanted_refs))
         if self.statsd is not None:
             self.statsd.histogram(
@@ -100,20 +99,11 @@ class RepoRepresentation:
             )
             self.statsd.histogram(
                 "git_known_refs_percent",
-                len(local_heads & remote_heads) / len(remote_heads),
+                len(self.local_heads & remote_heads) / len(remote_heads),
                 tags={},
             )
         return wanted_refs
 
-    # Compatibility with the dulwich graph walker
-    def __next__(self):
-        if not self.__iterator:
-            self.__iterator = iter(self.heads)
-        return next(self.__iterator, None)
-
-    def ack(self, sha):
-        pass
-
 
 @dataclass
 class FetchPackReturn:
@@ -217,7 +207,7 @@ class GitLoader(BaseGitLoader):
         pack_result = client.fetch_pack(
             path,
             base_repo.determine_wants,
-            base_repo,
+            base_repo.graph_walker(),
             do_pack,
             progress=do_activity,
         )

Build has FAILED

Patch application report for D8808 (id=31749)

Could not rebase; Attempt merge onto d6d5ce2b58...

Updating d6d5ce2..3cf7582
Fast-forward
 swh/loader/git/dumb.py              |  2 +-
 swh/loader/git/loader.py            | 36 +++++++++++-----------
 swh/loader/git/tests/test_loader.py | 59 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 77 insertions(+), 20 deletions(-)
Changes applied before test
commit 3cf7582aa74d4a4b3721eaa7f1b5f8ad69a2b7be
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Nov 3 16:20:58 2022 +0100

    Eagerly populate the set of local heads in RepoRepresentation.__init__
    
    As dulwich's client.fetch_pack expects an instance of history graph
    walker with set of known heads, move the local heads caching from
    `determine_wants` to the RepoRepresentation initialization logic.
    
    Our previous code would always initialize the graph walker with an empty
    set of heads (as the `graph_walker()` method is called before
    `determine_wants()` has run, so `self.heads` was always empty), so we
    would never actually fetch an incremental pack file.

commit 35ecf1843696b3676275920c86bce0c88ca75427
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Nov 3 15:21:45 2022 +0100

    Dump all known and remote heads when debugging

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/236/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/236/console

LOL, thanks Jenkins

12:38:13    py3: commands succeeded
12:38:13    congratulations :)
[...]
12:38:13  Error when executing always post condition:
12:38:13  java.lang.ArrayIndexOutOfBoundsException
12:38:13  Caused: java.lang.ArrayIndexOutOfBoundsException
olasd retitled this revision from Implement the dulwich graph walker interface directly in RepoRepresentation to Eagerly populate the set of local heads in RepoRepresentation.__init__.Nov 4 2022, 1:05 PM
olasd edited the summary of this revision. (Show Details)
olasd edited the summary of this revision. (Show Details)

LGTM, I guess we should deploy this as soon as possible, right ?

This revision is now accepted and ready to land.Nov 4 2022, 1:24 PM

Build is green

Patch application report for D8808 (id=31749)

Rebasing onto 35ecf18436...

Current branch diff-target is up to date.
Changes applied before test
commit 3cf7582aa74d4a4b3721eaa7f1b5f8ad69a2b7be
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Nov 3 16:20:58 2022 +0100

    Eagerly populate the set of local heads in RepoRepresentation.__init__
    
    As dulwich's client.fetch_pack expects an instance of history graph
    walker with set of known heads, move the local heads caching from
    `determine_wants` to the RepoRepresentation initialization logic.
    
    Our previous code would always initialize the graph walker with an empty
    set of heads (as the `graph_walker()` method is called before
    `determine_wants()` has run, so `self.heads` was always empty), so we
    would never actually fetch an incremental pack file.

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/237/ for more details.

LGTM, I guess we should deploy this as soon as possible, right ?

Yeah. I tagged v2.0.0 for this.