The current git loader consumes a lot of memory, depending on the size of the repository.
It fetches one full packfile of unknown references/refs (filtered against the last
snapshot's references), then parses that packfile multiple times to load, in order,
contents, directories, revisions and releases, and finishes by creating one snapshot for
the visit.
References in this context are resolved tips of branches (e.g. refs/heads/master, ...) or
tags (e.g. refs/tags/...).
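To make that filtering step concrete, here is a minimal sketch of what "filtered against
the last snapshot's references" boils down to. The function name and arguments are
illustrative only, not the actual swh-loader-git code:

    # Illustrative sketch, not the actual loader code: keep only the advertised
    # targets that the previous snapshot did not already record.
    def determine_wants(remote_refs, previous_snapshot_targets):
        """remote_refs: {ref name (bytes): target sha (bytes)} as advertised by
        the git server; previous_snapshot_targets: set of sha1s recorded in the
        previous visit's snapshot for this origin."""
        wants = set()
        for refname, target in remote_refs.items():
            if refname.endswith(b"^{}"):  # skip peeled tag entries
                continue
            if target not in previous_snapshot_targets:
                wants.add(target)
        return list(wants)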
While memory consumption is not a problem for small (< 200 refs) to medium repositories
(<= 500 refs), it can become one for large repositories (> 500 refs):
1. The single packfile currently retrieved at the beginning of the loading is too big (>
4 GiB), which makes the ingestion fail immediately. Nothing gets done. The visit is
marked as failed. If that happens too often (thrice consecutively iirc), the origin ends
up disabled, so no longer scheduled (up until it's listed again).
2. The ingestion starts but, due to concurrency with other loading processes, the
ingestion process gets killed. That means partial ingestion of objects got done, but
no snapshot nor finalized visit. This is actually a major problem regarding the
scheduling of further visits for that origin. Nonetheless, if further visits happen
somehow, those will skip already ingested objects (which will have still been
retrieved again though, *without* a partial snapshot in between visits).
3. Prior to the process getting killed, the high memory pressure in the current
deployment situation implies that at some point heavy disk i/o happens (to reduce memory
pressure). But that actually creates other ceph problems down the line, which can
possibly cascade to outages, see the events we faced during the holidays for example...
Addressing that is a work in progress outside the scope of this email. But still, it
could happen again. In the mean time, the solutions proposed here could incidentally
prevent it.
To solve these problems, several approaches have been investigated and tried.
A first naive attempt was made to iterate over the packfile only once and keep a dict of
the references (so the packfile reference could be dropped immediately) [1]. This failed,
as the memory consumption spiked even further. It did have the advantage of killing the
loading very fast. So the conclusion of this attempt is that iterating over the packfile
multiple times (one iteration for each type of object in our model) is actually not the
problem.
[1] https://forge.softwareheritage.org/D6377
Another attempt was to modify the git loader to make the ingestion fetch multiple
packfiles ([2] [3], with a slight loader-core change required [4]). This has the
advantage of naturally taking care of 1. (no more huge packfile). This is done by asking
for intervals of unknown remote refs, starting with the tags (in natural order), then the
branches [5]; see the sketch below. The natural order on tags sounds like a proper way to
start incrementally loading the repository following its history [3]. If we don't follow
the history (only [2]), we could first fetch a huge packfile (with mostly everything in
it), thus back to square one. This does assume tags exist in the repository (which should
mostly be the case). The only limitation seen for that approach is that we now regularly
discuss with the server to retrieve information during the loading. It's a time
trade-off.
FWIW, this regular server discussion approach is what's currently done with the
mercurial loader without issues. That loader is very stable now and not as greedy in
memory as it used to be (hence one other motivation to align the git loader's behavior).
It did the sourceforge hg origins ingestion without issues, and the still ongoing
bitbucket origins ingestion is running smoothly (ETA ~1 week or so).
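To illustrate the "intervals of unknown remote refs, tags first" order described above,
here is a rough sketch. The batch size and the helpers mentioned in the trailing comment
are made up for illustration, not the actual loader code:

    # Sketch of the multiple-packfiles idea from [2]/[3]; names and batch size
    # are illustrative only.
    def ref_batches(unknown_refs, batch_size=200):
        """Yield successive batches of refs (bytes): tags first, in their
        natural (sorted) order, then branches, so that early packfiles stay
        small and roughly follow the repository history."""
        tags = sorted(r for r in unknown_refs if r.startswith(b"refs/tags/"))
        branches = sorted(r for r in unknown_refs if not r.startswith(b"refs/tags/"))
        ordered = tags + branches
        for i in range(0, len(ordered), batch_size):
            yield ordered[i:i + batch_size]

    # The loader would then fetch one packfile per batch instead of a single
    # huge one, e.g.:
    #   for batch in ref_batches(unknown_refs):
    #       ingest(fetch_packfile(wanted_refs=batch))  # hypothetical helpers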
[2] https://forge.softwareheritage.org/D6386 (ingest in multiple packfile fetch)
[3] https://forge.softwareheritage.org/D6392 (follow refs tags in order, then branches)
[4] https://forge.softwareheritage.org/D6380. This core loader adaptation actually
allows the git loader (well, DVCS loaders) to create a partial visit targeting a snapshot
after each internal ingestion loop (so, for git, after each packfile consumption). So
this makes sure that we create incremental snapshots prior to the final one (if we reach
the end of the ingestion). Such adaptation takes care of point 2. (and makes subsequent
visits do less work even in case of failures). Thanks to @vlorentz who made me realize
this on that diff's review. After a couple of nights sleeping on it, it finally clicked!
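For clarity, the shape of the ingestion loop once [4] is in place could look roughly like
the following pseudo-code driver; all the callables passed in are hypothetical stand-ins
for the real loader/storage primitives, not the actual loader-core API:

    # Rough shape of the per-packfile loop; callables are hypothetical stand-ins.
    def ingest_origin(batches, ingest_packfile, build_snapshot, update_visit_status):
        snapshot_id = None
        for batch in batches:
            ingest_packfile(batch)
            # After each packfile, record a *partial* visit targeting an
            # intermediate snapshot, so a crash does not lose track of what
            # was already ingested for scheduling purposes.
            snapshot_id = build_snapshot(partial=True)
            update_visit_status(status="partial", snapshot=snapshot_id)
        # Reaching this point means the whole ingestion succeeded.
        snapshot_id = build_snapshot(partial=False)
        update_visit_status(status="full", snapshot=snapshot_id)
        return snapshot_id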
[5] Another idea (not followed through) would be to delay to the end the ingestion of
some special references which are assumed to be highly connected within the graph.
Typically, e.g. "HEAD", "refs/heads/master", "refs/heads/main", "refs/heads/develop", ...
(others?). The hypothesis is that these references are the main connected part of the
repository. So starting with those would end up with a huge packfile immediately (so,
with large repositories at least, we are back to the initial problem). On the contrary,
if we start with the other references first, then deal with the special ones at the
end, only a bit more work would be needed to fill in the blanks. That could yet be
another optimization which would maybe help if there are no tags in the repository, for
example.
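A tiny sketch of that reordering, purely illustrative (the list of "special" refs is just
the guess stated above):

    # Illustrative only: which refs count as "special" is the hypothesis above.
    SPECIAL_REFS = (b"HEAD", b"refs/heads/master", b"refs/heads/main",
                    b"refs/heads/develop")

    def defer_special_refs(refs):
        """Reorder refs so the presumably highly connected ones come last,
        once most of the graph has already been filled in."""
        regular = [r for r in refs if r not in SPECIAL_REFS]
        special = [r for r in refs if r in SPECIAL_REFS]
        return regular + special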
Another consideration we did not follow completely through yet was to use a depth
parameter (in the current internal lib used to discuss with the server). It's not
completely clear what actual depth number would be relatively decent and satisfying
enough for all repositories out there. It's been slightly tested but dismissed for now
due to that question. It's not to be excluded though. It may simply be that this
solution, composed with the previous points, could be a deeper optimization on
reducing the loader's work (especially the part walking the git graph).
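For reference, a hedged sketch of what using such a depth parameter could look like,
assuming a dulwich version whose fetch_pack accepts a depth argument; the value 50 is
purely illustrative, and picking a good value is exactly the open question above:

    # Assumes a dulwich version where fetch_pack takes a `depth` argument;
    # the chosen depth value is illustrative only.
    from dulwich.client import get_transport_and_path

    def shallow_fetch(url, determine_wants, graph_walker, pack_sink, depth=50):
        client, path = get_transport_and_path(url)
        # pack_sink is a callable receiving the packfile data chunks.
        return client.fetch_pack(path, determine_wants, graph_walker,
                                 pack_sink, depth=depth)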
Not necessarily memory consumption related but still, as another optimization point to
further align the git loader with the mercurial loader, it would be interesting to start
using the extid table to map what's considered git ids (which will change afaiui) to the
revision/release ids. Using this mapping would allow filtering even further what we
already ingested across origins (known refs, and not only those from the last snapshot of
the same origin as currently). That optimization reduced the mercurial loader's work
quite a lot, especially regarding mercurial forks. In the git loader's case, filtering
known refs (revision/release) early enough would actually reduce the packfiles further
(providing the extid table is actually filled enough, that is).
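A minimal sketch of that extra filtering step; `lookup_extids` stands in for whatever
storage endpoint ends up exposing the extid mapping, it is not a real API call here:

    # Illustrative only: `lookup_extids` is a hypothetical callable returning
    # the subset of git shas already known to the archive via the extid table.
    def filter_known_refs(remote_refs, lookup_extids):
        """Drop refs whose target is already mapped to a revision/release,
        so they never make it into the requested packfiles."""
        known = set(lookup_extids(set(remote_refs.values())))
        return {ref: sha for ref, sha in remote_refs.items() if sha not in known}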
As a heads up, there are ongoing runs between staging (patched venv) and production
nodes. Stats will be updated along the way as more runs conclude [6]. In the mean
time, the first runs already confirm what's been described in this email: "it works",
less memory usage, and it's not necessarily slower. Current runs actually show that
staging nodes finish faster than production ones (even though they are "less big"
machines, they are also less loaded in terms of loading, so it might be just that). All
of this without impacting the current snapshot id computations, meaning the final
snapshot of the "full" visit remains the same between production and patched staging
nodes.
[6] https://forge.softwareheritage.org/T3625#71604
Having described the problem and possible solutions, implemented some of them, and with
the first runs actually confirming it works, I'm now fairly convinced that we have a way
forward to improve the git loader so it ends up being a more concurrency-friendly loader.
That should also help in making our lag subside even faster (we are keeping up [7] btw,
but not as fast as we could).
[7] https://grafana.softwareheritage.org/goto/crFAS4Dnk?orgId=1
Any thoughts, pros or cons arguments, prior to actually driving this all the way through?
Thanks in advance for any feedback, cheers,