Hello,

The current git loader consumes a lot of memory, depending on the size
of the repository. It fetches one packfile covering all the references
unknown so far (filtered against the last snapshot's references), then
parses that packfile multiple times to load, in order, contents,
directories, revisions and releases, and finally creates one snapshot
for the visit.

References in this context are the resolved tips of branches (e.g.
refs/heads/master, ...) or tags (e.g. refs/tags/...).
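
To make that flow concrete, here is a minimal sketch of the current
behavior; all helper names (ls_remote, fetch_pack, ingest,
snapshot_from) are made up for the example, not the actual
swh-loader-git API:

    # Illustrative sketch of the current single-packfile flow.
    def load(origin_url, last_snapshot_refs):
        remote_refs = ls_remote(origin_url)
        # Only ask for tips unseen in this origin's last snapshot.
        wanted = {name: sha for name, sha in remote_refs.items()
                  if sha not in last_snapshot_refs.values()}
        packfile = fetch_pack(origin_url, list(wanted.values()))  # one big fetch
        # The same packfile is walked once per object type, in order.
        for object_type in ("content", "directory", "revision", "release"):
            ingest(packfile, object_type)
        return snapshot_from(wanted)  # one snapshot for the visit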

While memory consumption is not a problem for small (< 200 refs) to
medium (<= 500 refs) repositories, it can become one on large
repositories (> 500 refs), in several ways:

1. The single packfile retrieved at the beginning of the loading is
   too big (> 4 GiB), which makes the ingestion fail immediately.
   Nothing gets done and the visit is marked as failed. If that happens
   too often (three times consecutively, iirc), the origin ends up
   disabled, so it is no longer scheduled (until it gets listed again).

2. The ingestion starts but, due to concurrency with other loading
   processes, the ingestion process gets killed. Some objects were
   partially ingested, but no snapshot nor finalized visit was
   recorded. This is a major problem for the scheduling of further
   visits of that origin. And if a further visit does happen somehow,
   it will skip the already ingested objects, although without a
   partial snapshot between visits those objects will still have been
   retrieved again.

3. Before a process gets killed, high memory pressure in the current
   deployment means that at some point heavy disk i/o happens (to
   relieve that pressure). That in turn creates other ceph problems
   down the line, which can cascade into an outage; see for example the
   events we faced during the holidays. Addressing this is a work in
   progress outside the scope of this email, but it can still happen,
   and in the meantime the solutions proposed here could incidentally
   prevent it.

To solve these problems, several approaches have been investigated and
tried.

A first, naive attempt was to iterate over the packfile only once,
keeping a dict of its references (so the packfile reference could be
dropped immediately) [1]. This failed: memory consumption spiked even
further, though it at least killed the loading very fast. The
conclusion of this attempt is that iterating over the packfile multiple
times (one iteration per object type in our model) is not actually the
problem.

[1] https://forge.softwareheritage.org/D6377

Another attempt was to modify the git loader to make the ingestion
fetch multiple packfiles ([2] [3], with a slight loader-core change
required [4]). This naturally takes care of 1. (no more huge packfile).
It works by requesting intervals of unknown remote refs, starting with
the tags (in natural order), then the branches [5]. The natural order
on tags sounds like a proper way to start loading the repository
incrementally, following its history [3]. If we don't follow the
history (only [2]), we could fetch a huge packfile first (with mostly
everything in it), which puts us back to square one. This does assume
tags exist in the repository (which should mostly be the case). The
only limitation seen with this approach is that we now talk to the
server repeatedly during the loading. It's a trade-off.

FWIW, this repeated conversation with the server is what the mercurial
loader currently does, without issues. That loader is very stable now
and not as memory-greedy as it used to be (hence one more motivation to
align the git loader's behavior). It ingested the sourceforge hg
origins without issues, and the still ongoing ingestion of the
bitbucket origins is running smoothly (ETA ~1 week or so).

[2] https://forge.softwareheritage.org/D6386 (ingest through multiple
packfile fetches)

[3] https://forge.softwareheritage.org/D6392 (follow tag refs in order,
then branches)

[4] https://forge.softwareheritage.org/D6380. This core loader
adaptation allows the git loader (well, DVCS loaders in general) to
create a partial visit targeting a snapshot after each internal
ingestion loop (for git, after each packfile consumption). This ensures
we create incremental snapshots prior to the final one (if we reach the
end of the ingestion). Such an adaptation takes care of point 2. (and
makes subsequent visits do less work even in case of failures). Thanks
to @vlorentz, who made me realize this on that diff's review. After a
couple of nights sleeping on it, it finally made sense!
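
In loop form, the idea of [4] looks roughly like this (a simplified
sketch with illustrative method names, not the exact swh-loader-core
API):

    # Simplified sketch of the core loader loop after [4].
    def load(loader):
        while loader.fetch_data():      # for git: one packfile per iteration
            loader.store_data()         # contents, directories, revisions, ...
            # New with [4]: record a partial visit targeting a snapshot of
            # the refs ingested so far, so a killed process still leaves a
            # snapshot for the next visit to filter against (point 2.).
            loader.record_visit_status("partial",
                                       snapshot=loader.partial_snapshot())
        # Only reached when every packfile has been consumed.
        loader.record_visit_status("full", snapshot=loader.final_snapshot())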

[5] Another idea (not followed through) would be to delay until the end
the ingestion of some special references assumed to be highly connected
within the graph. Typically "HEAD", "refs/heads/master",
"refs/heads/main", "refs/heads/develop", ... (others?). The hypothesis
is that these references cover the main connected part of the
repository, so starting with them would immediately yield a huge
packfile (so, on large repositories at least, we are back to the
initial problem). On the contrary, if we start with the other
references first and deal with the special ones at the end, only a bit
more work is needed to fill in the blanks. That could be yet another
optimization, which might help when the repository has no tags, for
example.
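
In code, that ordering could look like the following (hypothetical: the
set of special names is exactly the hypothesis above, nothing more):

    # Sketch of the "special refs last" idea from [5].
    SPECIAL_REFS = frozenset(
        ["HEAD", "refs/heads/master", "refs/heads/main", "refs/heads/develop"]
    )

    def ordered_refs(ref_names):
        ordinary = [r for r in ref_names if r not in SPECIAL_REFS]
        special = [r for r in ref_names if r in SPECIAL_REFS]
        # By the time the highly connected special refs are fetched, most
        # of their objects are already known, so their packfiles should
        # only fill in the blanks.
        return ordinary + special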

Another consideration we have not completely followed through yet is
using a depth parameter (in the internal lib we currently use to talk
to the server). It's not clear what depth value would be decent and
satisfying enough for all the repositories out there. It's been lightly
tested but dismissed for now because of that open question. It's not to
be excluded though: composed with the previous points, it could be a
further optimization reducing the loader's work (especially the part
walking the git graph).
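
For reference, a shallow fetch along those lines could look like the
following, assuming a dulwich version whose client fetch accepts a
depth argument; the value 50 is an arbitrary placeholder, which is
precisely the open question:

    # Sketch of a depth-limited fetch with dulwich.
    from dulwich.client import get_transport_and_path
    from dulwich.repo import Repo

    client, path = get_transport_and_path("https://example.org/repo.git")
    target = Repo.init("/tmp/shallow-target", mkdir=True)
    # Only walk ~50 commits deep per ref instead of the full history.
    result = client.fetch(path, target, depth=50)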

Not necessarily related to memory consumption, but as another
optimization point to further align the git loader with the mercurial
loader, it would be interesting to start using the extid table to map
what are considered git ids (which will change, afaiui) to
revision/release ids. This mapping would allow filtering even further
what we have already ingested across origins (all known refs, not only
those from the last snapshot of the same origin, as is done currently).
That optimization reduced the mercurial loader's work quite a lot,
especially on mercurial forks. In the git loader's case, filtering
known refs (revisions/releases) early enough would further reduce the
packfiles (provided the extid table is sufficiently filled, that is).
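
A sketch of what that early filtering could look like (assuming
swh.storage's extid_get_from_extid endpoint; the surrounding helper is
made up):

    # Hypothetical sketch of extid-based filtering across origins.
    def refs_unknown_everywhere(storage, remote_refs):
        # remote_refs: mapping of ref name -> git sha1 (bytes)
        ids = list(remote_refs.values())
        known = {extid.extid
                 for extid in storage.extid_get_from_extid("git", ids)}
        # Keep only tips never ingested from *any* origin, not just from
        # this origin's last snapshot.
        return {name: sha for name, sha in remote_refs.items()
                if sha not in known}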

As a heads-up, comparison runs are ongoing between staging nodes
(patched venv with diffs [2], [3] and [4]) and production nodes. Stats
will be updated along the way as more runs conclude [6]. In the
meantime, the first runs already confirm what's described in this
email: "it works". That is, less memory is used and it's not
necessarily slower. Current runs actually show staging nodes finishing
faster than production ones (possibly because their load is less
intensive). The results are also consistent in terms of snapshot id
computation: the final snapshot of the "full" visit remains the same
between production and patched staging nodes (when the visit actually
finishes on the production side).

[6] https://forge.softwareheritage.org/T3625#71604

Having described the problem and possible solutions, implemented some
of them, and with the first runs confirming that it works, I'm now
fairly convinced we have a way forward to improve the git loader. With
at least the first diffs in review, it will end up less memory-greedy,
thus more concurrency-friendly. That, in turn, should help our lag
subside even faster (we are keeping up [7], btw, but not as fast as
we'd like).

[7] https://grafana.softwareheritage.org/goto/crFAS4Dnk?orgId=1

Any thoughts, or pro/con arguments, before we drive this all the way
through?

Thanks in advance for any feedback, cheers,
@vsellier and @ardumont
--
tony / Antoine R. Dumont (@ardumont)
-----------------------------------------------------------------
gpg fingerprint BF00 203D 741A C9D5 46A8 BE07 52E2 E984 0D10 C3B8