
Reduce git loader memory footprint
Closed, Resolved · Public

Description

The git loader currently uses a huge amount of memory, especially on large
repositories. This is currently a blocking point: we have to decrease loader
concurrency so that not everything fails on OOM.

Looking at the code a bit, the loader currently retrieves the full packfile and then parses it multiple times to ingest the DAG objects.
We believe it's possible to retrieve packfiles incrementally.

This increases the communication with the server, but it should reduce memory usage.
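As a rough sketch of the idea (not the actual loader code): with dulwich, the set of objects fetched is driven by the determine_wants callback passed to fetch_pack, so returning only a few unknown ref targets per call yields smaller packfiles, at the cost of more round-trips. The helper below is purely illustrative; `batch_size` and `known_heads` are assumed names.

```python
def make_determine_wants(known_heads, batch_size=2):
    """Build a determine_wants callback (as used by dulwich's fetch_pack)
    that requests at most `batch_size` not-yet-known ref targets per fetch,
    instead of everything at once. Illustrative sketch only."""
    def determine_wants(remote_refs):
        wants = []
        seen = set(known_heads)  # objects we already have
        for _ref, target in sorted(remote_refs.items()):
            if target not in seen:
                wants.append(target)
                seen.add(target)
        return wants[:batch_size]
    return determine_wants
```

Repeatedly fetching with such a callback (updating `known_heads` after each pack is ingested) would approximate the incremental retrieval discussed above.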

Related to T3025

Event Timeline

ardumont created this task.
ardumont renamed this task from Improve git loader memory footprint to Reduce git loader memory footprint.Oct 1 2021, 6:06 PM
ardumont updated the task description. (Show Details)

D6377 actually increased the memory footprint, to the point of getting ingestion
killed fast. So it was closed!

D6386 (requires D6380 in the loader-core) is another attempt, which tries to fetch
several packfiles to ingest instead of one big packfile. The implementation is not
entirely satisfactory as it's not deterministic. It really depends on how connected
the git repository graph we are trying to ingest is (and on the order of the refs we
are ingesting [3]). If the graph is fully connected, we will get a full packfile
immediately (thus rendering the incremental approach moot).

We tried to have a look at the depth parameter of the fetch_pack call internally used
by the loader [1]. But that part is still fuzzy, notably how to determine what a good
depth factor could be.
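One illustrative (not implemented, purely hypothetical) way to sidestep picking a single good depth would be a deepening schedule: start shallow and grow the depth geometrically across successive fetches until a ceiling, then fall back to a full fetch. All names and values below are assumptions for the sketch.

```python
def deepening_schedule(start=100, factor=4, ceiling=10**6):
    """Yield candidate fetch depths for successive shallow fetches,
    growing geometrically until `ceiling`; yield None last to signal
    a final full (non-shallow) fetch. Hypothetical heuristic."""
    depth = start
    while depth < ceiling:
        yield depth
        depth *= factor
    yield None  # full fetch as a last resort
```

Each depth would bound the history walked per fetch, trading extra round-trips for a bounded working set, similar in spirit to `git fetch --depth=N` followed by `--deepen`.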

[1] https://www.dulwich.io/docs/api/dulwich.client.html#dulwich.client.AbstractHttpGitClient.fetch_pack

[2] @anlambert also suggested to have a look at it later on, see D6386#165823.

[3] Another idea, only discussed so far, would be to make sure we start by ingesting
tag references in order (under the assumption that we will then ingest the repository
mostly in its natural order), then focus on the remaining references (because if we
start with HEAD and/or master/main first, there is a high probability that we will
end up with most of the overall repository in one round).
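That ref-ordering idea can be sketched as a small pure function (names are hypothetical; this is not the loader's actual code):

```python
def order_refs(refs):
    """Order remote refs so tags come first (in name order), then HEAD
    and the main/master branches, then everything else. Sketch of the
    'tags first' ingestion order discussed above."""
    def key(item):
        name = item[0]
        if name.startswith(b"refs/tags/"):
            return (0, name)
        if name in (b"HEAD", b"refs/heads/main", b"refs/heads/master"):
            return (1, name)
        return (2, name)
    return sorted(refs.items(), key=key)
```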

D3692 (tests are fine locally).

A note to send to the #swh-devel ml has been drafted [1].
The diff is open as a draft for review first.

[1] P1192

All runs were done on medium to large repositories.
No diverging hashes, and the loader-git running the patched version consistently uses less memory.

|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| Machine | Origins         | Refs  | Snapshot                | Memory (max RSS kbytes) | Time (h:)mm:ss(.ms) | Note                                                         |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| staging | torvalds/linux  | 1496  | \xc2847...3fb4          |                 1361324 |             6:59:16 |                                                              |
| prod    | //              | //    | \xc2847...3fb4          |                 3080408 |            24:13:11 |                                                              |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| staging | CocoaPods/Specs | 14036 | X (hash mismatched) [1] |                 5789344 |            23:10:48 | unrelated error: hash mismatched error                       |
| prod    | //              | //    | X (killed) [2]          |                14280284 |            10:09:09 | (would have been the same error if not killed first)         |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| staging | keybase/client  | 19867 | \xf780...c5a7           |                  278568 |            35:26.36 | (loaded recently, visit start from same snapshot as prod)    |
|         | //              |       | \xf780...c5a7           |                  489256 |             4:41:04 | Run from scratch (which not so long ago would have crashed)  |
| prod    | //              | //    | \xf780...c5a7           |                 1381324 |             1:29:02 | (loaded recently, visit start from same snapshot as staging) |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| staging | cozy/cozy-stack | 3128  | \x58f9a...8edaf         |                  172404 |             3:35.43 |                                                              |
| prod    | //              | //    | \x58f9a...8edaf         |                 1096400 |             5:11.26 |                                                              |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| staging | git/git         | 1903  | \x2df8...848e           |                  464200 |             1:16:17 | unrelated error: hash mismatched error, partial snap         |
| prod    | //              | //    | X                       |                 2059192 |             2:31:38 | // (exactly the same hash mismatch)                          |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|
| staging | rust-lang/rust  | 48615 | \xc59a...9ff3           |                 1171708 |            22:23:37 | 1st run, divergent hash (new commit in between prod/staging) |
| staging | //              | //    | \x0abd...3d3d           |                  899700 |             1:15:27 | 2nd run                                                      |
| prod    | //              | //    | \x4035...9ecb           |                 3397172 |             3:05:09 | 1st run, divergent hash (new commit in between prod/staging) |
| prod    | //              | //    | \x0abd...3d3d           |                 3190956 |             2:02:09 | 2nd run                                                      |
|---------+-----------------+-------+-------------------------+-------------------------+---------------------+--------------------------------------------------------------|

Full log details and snapshot extraction in P1192#8008

One good question was raised on the mailing-list thread, by both @douardda and @stsp:

Do we already know for sure what is actually causing the memory bloat?

I'm actually not sure yet.
So I gave [1] a spin, triggering a run with it on the heaviest and most problematic
origins (keybase/client and CocoaPods/Specs, on production nodes).
Let's see what results that gives (ongoing).

[1] https://github.com/pythonspeed/filprofiler

That did not bear results, as the script [2] used for profiling was using a
subprocess call; the flamegraph stops at that call, so it's moot.

Runs on worker17 (so production) with the standard loader-git; flamegraph output in [3] [4].

[2] P1197

[3] CocoaPods/Specs:

[4] keybase/client

Triggered other runs with memory-profiler instead [1].
It's not perfect though: I cannot find the proper way to actually get the
legends as described in their documentation [2].

kubernetes/kubernetes, graph with memory over time:

Information with actual memory usage per line of code (corresponding to the graph):

We do not see much yet, and some functions I requested to be profiled are missing from the output for some reason.

[1] https://pypi.org/project/memory-profiler/

[2] https://github.com/pythonprofilers/memory_profiler#time-based-memory-usage

So, after doing some more analysis of memory usage patterns on these edge case repositories, my suspicion is that the high memory usage is generally being caused by the loader processing batches of large directories, closely packed together, at the same time.

The heuristics git uses to pack objects together mean that there's a high chance that very similar large objects (e.g. the history of a given, very large, directory) will be packed and deltified together; the git loader will just process these objects in order, and therefore has to handle these large packed objects sequentially, in close batches.

The buffer storage proxy currently only limits the number of directories in a given batch, but the number of entries in said directories is effectively unbounded, which means the memory usage itself is unbounded.

When stracing the loading of these repositories, we can find bundles of directories that are multiple hundreds of megabytes being sent at once to the storage backend. The memory usage sawtooths around these batches, once the serialization has happened and the data has been sent.

After trying to load an edge-case repository using D6427, which adds a threshold for the number of *entries* in a given batch of directories, the memory usage hovers around 500MB instead of tens of gigabytes.

(this also matches the fact that we've seen, on our main ingestion database, directory_add operations that would take multiple hours, and have knock-on effects on backups and replications because of the long-running insertion transactions)
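Conceptually, the fix amounts to flushing the buffered batch on either of two caps: directory count or cumulated entry count. The class below is an illustrative sketch with hypothetical names and thresholds, not the actual swh-storage buffer proxy code:

```python
class DirectoryBuffer:
    """Sketch of a buffer that bounds memory by flushing when either the
    number of buffered directories or the cumulated number of their
    entries crosses a threshold (thresholds are illustrative)."""

    def __init__(self, max_directories=100, max_entries=10_000):
        self.max_directories = max_directories
        self.max_entries = max_entries
        self.buffer = []
        self.entry_count = 0
        self.flushed = []  # stands in for batches sent to the backend

    def add(self, directory):
        self.buffer.append(directory)
        self.entry_count += len(directory["entries"])
        if (len(self.buffer) >= self.max_directories
                or self.entry_count >= self.max_entries):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushed.append(self.buffer)
            self.buffer = []
            self.entry_count = 0
```

With only a directory-count cap, a batch of a few huge directories could still hold hundreds of megabytes; the entry-count cap is what keeps the working set bounded.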

While we're at it, we should probably be adding some thresholds in the buffer proxy for:

  • cumulated length of messages for revisions and releases
  • cumulated number of parents for revisions

as those are two other areas where memory use is currently pretty much unbounded.
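Those two thresholds follow the same cumulative-cap idea: track a summed quantity per buffered batch of revisions and flush once it crosses a cap. A minimal sketch, with hypothetical names and cap values:

```python
def exceeds_thresholds(revisions, max_total_message=1_000_000,
                       max_total_parents=10_000):
    """Return True if a buffered batch of revisions should be flushed:
    either the cumulated message length or the cumulated number of
    parents crosses its cap. Caps and dict shape are illustrative."""
    total_message = sum(len(r.get("message", b"")) for r in revisions)
    total_parents = sum(len(r.get("parents", ())) for r in revisions)
    return (total_message >= max_total_message
            or total_parents >= max_total_parents)
```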

I concur with this analysis btw

2 birds, one stone!

We'll definitely stop having the very long directory_add db transactions, which create replication warnings along the way.
And this applies to all kinds of workers (only the git loader demonstrated it, but it could actually happen with any others).

Also, one log file of a run done on an edge case repository (NixOS/nixpkgs) can be found in the diff fixing this behavior [1].

[1] D6427#167037

In T3625#71799, @olasd wrote:

While we're at it, we should probably be adding some thresholds in the buffer proxy for:

  • cumulated length of messages for revisions and releases

that's D6446

  • cumulated number of parents for revisions

and that's D6445

as those are two other areas where the memory use is currently pretty much unbounded

ardumont changed the task status from Open to Work in Progress.Oct 8 2021, 5:55 PM

btw ^

Deployed storage v0.38 on workers (proxy buffer/filter adaptations, client/loader side).
Restarted all loaders with it.

It remains to give an update on the mailing-list thread.
I'll attend to this before the end of the week.

ardumont claimed this task.

Now deployed, and the number of OOMs has actually decreased.