
Reduce git loader memory footprint
Closed, Resolved · Public

Description

The git loader currently uses a huge amount of memory, especially on large
repositories. This is currently a blocking point: we have to decrease loader
concurrency so that not everything fails on OOM.

Looking at the code a bit, the loader currently retrieves the full packfile and then parses it multiple times to ingest the DAG objects.
We believe it's possible to retrieve packfiles incrementally.

This increases the communication with the server, but it should reduce memory usage.
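As a rough sketch of the idea (not the actual loader code): with dulwich, the set of objects fetched is driven by the determine_wants callback passed to fetch_pack, so returning only a few unknown ref targets per call yields smaller packfiles, at the cost of more round-trips. The helper below is purely illustrative; `batch_size` and `known_heads` are assumed names.

```python
def make_determine_wants(known_heads, batch_size=2):
    """Build a determine_wants callback (as used by dulwich's fetch_pack)
    that requests at most `batch_size` not-yet-known ref targets per fetch,
    instead of everything at once. Illustrative sketch only."""
    def determine_wants(remote_refs):
        wants = []
        seen = set(known_heads)  # objects we already have
        for _ref, target in sorted(remote_refs.items()):
            if target not in seen:
                wants.append(target)
                seen.add(target)
        return wants[:batch_size]
    return determine_wants
```

Repeatedly fetching with such a callback (updating `known_heads` after each pack is ingested) would approximate the incremental retrieval discussed above.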

Related to T3025

Event Timeline

ardumont created this task.
ardumont renamed this task from Improve git loader memory footprint to Reduce git loader memory footprint.Oct 1 2021, 6:06 PM
ardumont updated the task description. (Show Details)

D6377 actually increased the memory footprint, to the point of getting ingestion
killed fast. So it was closed!

D6386 (requires D6380 in the loader-core) is another attempt, which tries to fetch
several packfiles to ingest instead of one big packfile. The implementation is not
entirely satisfactory as it's not deterministic. It really depends on how connected
the git repository graph we are trying to ingest is (and on the order of the refs we
are ingesting [3]). If the graph is fully connected, we will get a full packfile
immediately (thus rendering the incremental approach moot).

We tried to have a look at the depth parameter of the fetch_pack call internally used
by the loader [1]. But that part is still fuzzy, notably how to determine what a good
depth factor could be.
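One illustrative (not implemented, purely hypothetical) way to sidestep picking a single good depth would be a deepening schedule: start shallow and grow the depth geometrically across successive fetches until a ceiling, then fall back to a full fetch. All names and values below are assumptions for the sketch.

```python
def deepening_schedule(start=100, factor=4, ceiling=10**6):
    """Yield candidate fetch depths for successive shallow fetches,
    growing geometrically until `ceiling`; yield None last to signal
    a final full (non-shallow) fetch. Hypothetical heuristic."""
    depth = start
    while depth < ceiling:
        yield depth
        depth *= factor
    yield None  # full fetch as a last resort
```

Each depth would bound the history walked per fetch, trading extra round-trips for a bounded working set, similar in spirit to `git fetch --depth=N` followed by `--deepen`.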

[1] https://www.dulwich.io/docs/api/dulwich.client.html#dulwich.client.AbstractHttpGitClient.fetch_pack

[2] @anlambert also suggested to have a look at it later on, see D6386#165823.

[3] Another idea, only discussed so far, would be to make sure we start by ingesting
tag references in order (under the assumption that we will then ingest the repository
mostly in its natural order), then focus on the remaining references (because if we
start with HEAD and/or master/main first, there is a high probability that we will
end up with most of the overall repository in one round).
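That ref-ordering idea can be sketched as a small pure function (names are hypothetical; this is not the loader's actual code):

```python
def order_refs(refs):
    """Order remote refs so tags come first (in name order), then HEAD
    and the main/master branches, then everything else. Sketch of the
    'tags first' ingestion order discussed above."""
    def key(item):
        name = item[0]
        if name.startswith(b"refs/tags/"):
            return (0, name)
        if name in (b"HEAD", b"refs/heads/main", b"refs/heads/master"):
            return (1, name)
        return (2, name)
    return sorted(refs.items(), key=key)
```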

D3692 (tests are fine locally).

A note to send to the #swh-devel ml has been drafted [1].
The diff is open as a draft for review first.

[1] P1192

All runs were done on medium to large repositories.
No diverging hashes, and the loader-git running the patched version consistently uses less memory.

|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| Machine | Origins         | Refs  | Snapshot                | Memory (max RSS kbytes) | Time (h:)mm:ss(.ms) | Note                                                         |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| staging | torvalds/linux  | 1496  | \xc2847...3fb4          |                 1361324 |             6:59:16 |                                                              |
| prod    | //              | //    | \xc2847...3fb4          |                 3080408 |            24:13:11 |                                                              |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| staging | CocoaPods/Specs | 14036 | X (hash mismatched) [1] |                 5789344 |            23:10:48 | unrelated error: hash mismatched error                       |
| prod    | //              | //    | X (killed) [2]          |                14280284 |            10:09:09 | (would have been the same error if not killed first)         |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| staging | keybase/client  | 19867 | \xf780...c5a7           |                  278568 |            35:26.36 | (loaded recently, visit start from same snapshot as prod)    |
|         | //              |       | \xf780...c5a7           |                  489256 |             4:41:04 | Run from scratch (which not so long ago would have crashed)  |
| prod    | //              | //    | \xf780...c5a7           |                 1381324 |             1:29:02 | (loaded recently, visit start from same snapshot as staging) |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| staging | cozy/cozy-stack | 3128  | \x58f9a...8edaf         |                  172404 |             3:35.43 |                                                              |
| prod    | //              | //    | \x58f9a...8edaf         |                 1096400 |             5:11.26 |                                                              |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| staging | git/git         | 1903  | \x2df8...848e           |                  464200 |             1:16:17 | unrelated error: hash mismatched error, partial snap         |
| prod    | //              | //    | X                       |                 2059192 |             2:31:38 | // (exactly the same hash mismatch)                          |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| staging | rust-lang/rust  | 48615 | \xc59a...9ff3           |                 1171708 |            22:23:37 | 1st run, divergent hash (new commit in between prod/staging) |
| staging | //              | //    | \x0abd...3d3d           |                  899700 |             1:15:27 | 2nd run                                                      |
| prod    | //              | //    | \x4035...9ecb           |                 3397172 |             3:05:09 | 1st run, divergent hash (new commit in between prod/staging) |
| prod    | //              | //    | \x0abd...3d3d           |                 3190956 |             2:02:09 | 2nd run                                                      |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|

Full log details and snapshot extraction in P1192#8008

One good question was raised on the mailing-list thread, by both @douardda and @stsp:

Do we already know for sure what is actually causing the memory bloat?

I'm actually not sure yet.
So I gave [1] a spin, triggering a run with it on the heaviest and most problematic
origins (keybase/client and CocoaPods/Specs, on production nodes).
Let's see what results that gives (ongoing).

[1] https://github.com/pythonspeed/filprofiler

That did not bear results, as the script [2] used for profiling was using a
subprocess call; the flamegraph stops at that call, so it's moot.

Runs on worker17 (so production) with the standard loader-git; flamegraph output in [3] [4].

[2] P1197

[3] CocoaPods/Specs:

[4] keybase/client

Triggered other runs with memory-profiler instead [1].
It's not perfect though: I cannot find the proper way to actually get the
legends as described in their documentation [2].

kubernetes/kubernetes, graph with memory over time:

Information with actual memory usage per line of code (corresponding to the graph):

We do not see much yet, and some functions I requested to be profiled are missing from the output for some reason.

[1] https://pypi.org/project/memory-profiler/

[2] https://github.com/pythonprofilers/memory_profiler#time-based-memory-usage

So, after doing some more analysis of memory usage patterns on these edge case repositories, my suspicion is that the high memory usage is generally being caused by the loader processing batches of large directories, closely packed together, at the same time.

The heuristics git uses to pack objects together mean that there's a high chance that very similar large objects (e.g. the history of a given, very large, directory) will be packed and deltified together; the git loader will just process these objects in order, and therefore has to handle these large packed objects sequentially, in close batches.

The buffer storage proxy currently only limits the number of directories in a given batch, but the number of entries in said directories is effectively unbounded, which means the memory usage itself is unbounded.

When stracing the loading of these repositories, we can find bundles of directories that are multiple hundreds of megabytes being sent at once to the storage backend. The memory usage sawtooths around these batches, once the serialization has happened and the data has been sent.

After trying to load an edge-case repository using D6427, which adds a threshold for the number of *entries* in a given batch of directories, the memory usage hovers around 500MB instead of tens of gigabytes.

(this also matches the fact that we've seen, on our main ingestion database, directory_add operations that would take multiple hours, and have knock-on effects on backups and replications because of the long-running insertion transactions)
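Conceptually, the fix amounts to flushing the buffered batch on either of two caps: directory count or cumulated entry count. The class below is an illustrative sketch with hypothetical names and thresholds, not the actual swh-storage buffer proxy code:

```python
class DirectoryBuffer:
    """Sketch of a buffer that bounds memory by flushing when either the
    number of buffered directories or the cumulated number of their
    entries crosses a threshold (thresholds are illustrative)."""

    def __init__(self, max_directories=100, max_entries=10_000):
        self.max_directories = max_directories
        self.max_entries = max_entries
        self.buffer = []
        self.entry_count = 0
        self.flushed = []  # stands in for batches sent to the backend

    def add(self, directory):
        self.buffer.append(directory)
        self.entry_count += len(directory["entries"])
        if (len(self.buffer) >= self.max_directories
                or self.entry_count >= self.max_entries):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushed.append(self.buffer)
            self.buffer = []
            self.entry_count = 0
```

With only a directory-count cap, a batch of a few huge directories could still hold hundreds of megabytes; the entry-count cap is what keeps the working set bounded.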

While we're at it, we should probably be adding some thresholds in the buffer proxy for:

  • cumulated length of messages for revisions and releases
  • cumulated number of parents for revisions

as those are two other areas where memory use is currently pretty much unbounded.
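Those two thresholds follow the same cumulative-cap idea: track a summed quantity per buffered batch of revisions and flush once it crosses a cap. A minimal sketch, with hypothetical names and cap values:

```python
def exceeds_thresholds(revisions, max_total_message=1_000_000,
                       max_total_parents=10_000):
    """Return True if a buffered batch of revisions should be flushed:
    either the cumulated message length or the cumulated number of
    parents crosses its cap. Caps and dict shape are illustrative."""
    total_message = sum(len(r.get("message", b"")) for r in revisions)
    total_parents = sum(len(r.get("parents", ())) for r in revisions)
    return (total_message >= max_total_message
            or total_parents >= max_total_parents)
```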

I concur with this analysis btw

2 birds, one stone!

We'll definitely stop having the very long directory_add db transactions, which create replication warnings along the way.
And this applies to all kinds of workers (only the git loader demonstrated it, but it could actually happen with any others).

Also, one log file of a run done on an edge case repository (NixOS/nixpkgs) can be found in the diff fixing this behavior [1].

[1] D6427#167037

In T3625#71799, @olasd wrote:

While we're at it, we should probably be adding some thresholds in the buffer proxy for:

  • cumulated length of messages for revisions and releases

that's D6446

  • cumulated number of parents for revisions

and that's D6445

as those are two other areas where the memory use is currently pretty much unbounded

ardumont changed the task status from Open to Work in Progress.Oct 8 2021, 5:55 PM

btw ^

Deployed storage v0.38 on workers (proxy buffer/filter adaptations, client/loader side).
Restarted all loaders with it.

It remains to give an update on the mailing-list thread.
I'll attend to this before the end of the week.

ardumont claimed this task.

Now deployed, and the number of OOMs has actually decreased.