
git loader OOM when loading huge repository
Closed, ResolvedPublic

Description

Consistently [1] unable to ingest some repositories on staging:

swhworker@worker1:~$ time SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh loader run git https://github.com/NixOS/nixpkgs.git
INFO:swh.core.config:Loading config file /etc/softwareheritage/global.ini
INFO:swh.core.config:Loading config file /etc/softwareheritage/loader_git.yml
Enumerating objects: 1151, done.
Counting objects: 100% (1151/1151), done.
Compressing objects: 100% (475/475), done.
Total 2367234 (delta 844), reused 697 (delta 671), pack-reused 2366083
INFO:swh.loader.git.BulkLoader:Listed 70404 refs for repo https://github.com/NixOS/nixpkgs.git
Killed

real    57m16.787s
user    50m33.560s
sys     0m40.689s

Note: That ends up with a lingering origin visit with status ongoing (thus T2372 is really interesting).

machine (worker1.internal.staging.swh.network):

  • 4 cores
  • 16 GiB RAM
  • no swap (our prod node does though) [2]

Nothing else runs there (the other loader services are stopped).

[1] https://grafana.softwareheritage.org/d/q6c3_H0iz/system-overview?orgId=1&var-instance=worker1.internal.staging.swh.network&from=1587553104919&to=1587561781178
(both peaks in memory usage are tryouts)

[2] I will add some swap to that node to check if that goes further with it.

Event Timeline

ardumont renamed this task from staging: loader git: failure to ingest repository to staging: git loader: failure to ingest huge repository (e.g. nixpkgs).Apr 22 2020, 3:33 PM
ardumont triaged this task as Normal priority.
ardumont created this task.

[2] I will add some swap to that node to check if that goes further with it.

Added 4 GiB of swap; it fails the same way (expected, but I wanted to be sure [3] ;).

swhworker@worker1:~$ time SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh loader run git https://github.com/NixOS/nixpkgs.git
INFO:swh.core.config:Loading config file /etc/softwareheritage/global.ini
INFO:swh.core.config:Loading config file /etc/softwareheritage/loader_git.yml
Enumerating objects: 1237, done.
Counting objects: 100% (1237/1237), done.
Compressing objects: 100% (458/458), done.
Total 2367288 (delta 929), reused 818 (delta 777), pack-reused 2366051
INFO:swh.loader.git.BulkLoader:Listed 70409 refs for repo https://github.com/NixOS/nixpkgs.git
Killed

real    60m0.376s
user    53m11.195s
sys     0m39.039s

[3] https://grafana.softwareheritage.org/d/q6c3_H0iz/system-overview?orgId=1&var-instance=worker1.internal.staging.swh.network&from=1587561176456&to=1587565786528

The base logic of the git loader regarding packfiles hasn't really been touched since it was first implemented: it's never been profiled or optimized with respect to its memory usage. This issue isn't specific to the staging infra; it's only more salient there because the workers have been provisioned with tight constraints.

There are a few strategies I can think of to relieve the memory pressure in the git loader:

  1. ignore pull request branches (should reduce the number of objects loaded, to some extent)
  2. split the pack files and load objects from different branches sequentially, instead of all at once
    • pros:
      • reduces the number of objects in memory at once
      • enables partial snapshots
    • cons:
      • more intensive on the server: more requests, less caching
      • makes the "we already have these objects" logic more tricky: need to overlay the contents of the archive and the objects that have been just downloaded
  3. write the packfile on disk instead of using a memory-backed object
    • pros:
      • memory usage goes to the OS page cache instead of our own process memory
    • cons:
      • more intensive on disk, even for loading tiny repositories
      • need to be careful with on-disk cleanup
  4. give workers more swap to "automate" 3.
    • pros:
      • only a small setup overhead; no code changes
      • will use RAM until it's not possible anymore
    • cons:
      • the control over what the OS swaps is pretty poor
      • swapping affects all workers, not just the one hogging all the memory

I wonder if there's a "disk-backed BytesIO object" that we can use in Python that would give us a decent middle ground between options 3 and 4.
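It turns out the standard library provides roughly that: `tempfile.SpooledTemporaryFile` behaves like an in-memory buffer until the written data exceeds `max_size`, then transparently rolls over to a real temporary file on disk (and cleans it up on close). A minimal sketch of how the packfile buffer could use it; the 50 MiB threshold is an arbitrary assumption, and the comment about where packfile bytes come from is illustrative:

```python
import tempfile

# A "disk-backed BytesIO": stays in memory up to max_size, then
# transparently spills to a temporary file on disk.
with tempfile.SpooledTemporaryFile(max_size=50 * 1024 * 1024) as pack:
    # e.g. packfile bytes streamed from the git server would be written here
    pack.write(b"PACK\x00\x00\x00\x02")
    pack.seek(0)
    data = pack.read()
```

This would give the middle ground between options 3 and 4: small repositories never touch the disk, and only oversized packfiles pay the I/O cost.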

Reading this again, and seeing that the workers have 16GB of RAM, there's something weird going on that's not related to the volume of the packfile (which is 2GB max).

It'd be useful to see when the worker crashes (with some debugging output turned on), because none of the strategies I talked about would change anything about what happens after the packfile is downloaded.

Currently running this again with debug logs...
Thanks for the input.

Currently running this again with debug logs...

done.

This seems to crash on the storage.content_missing call:

swhworker@worker1:~$ time SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh --log-level DEBUG loader run git https://github.com/NixOS/nixpkgs.git 2>&1 | tee /tmp/loader-git-run-nixpkgs-debug.txt
WARNING:swh.core.cli:Could not load subcommand indexer: cannot import name 'get_journal_client' from 'swh.journal.client' (/usr/lib/python3/dist-packages/swh/journal/client.py)
DEBUG:swh.loader.cli:kw: {}
DEBUG:swh.loader.cli:registry: {'task_modules': ['swh.loader.git.tasks'], 'loader': <class 'swh.loader.git.loader.GitLoader'>}
DEBUG:swh.loader.cli:loader class: <class 'swh.loader.git.loader.GitLoader'>
INFO:swh.core.config:Loading config file /etc/softwareheritage/global.ini
INFO:swh.core.config:Loading config file /etc/softwareheritage/loader_git.yml
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): storage0.internal.staging.swh.network:5002
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /origin/add HTTP/1.1" 200 38
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /origin/visit/add HTTP/1.1" 200 198
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /origin/visit/get_latest HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): github.com:443
DEBUG:urllib3.connectionpool:https://github.com:443 "GET /NixOS/nixpkgs.git/info/refs?service=git-upload-pack HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
... (snip) ...
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 3039
INFO:swh.loader.git.BulkLoader:Listed 70746 refs for repo https://github.com/NixOS/nixpkgs.git
DEBUG:urllib3.connectionpool:Resetting dropped connection: storage0.internal.staging.swh.network
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /content/missing HTTP/1.1" 200 9475383

real    59m41.024s
user    52m27.378s
sys     0m44.581s

FTR, in a test setup I made a few days ago on docker, I had a git loader crunching ~28 GB of resident memory (out of 32 available on that machine). Not sure which repo it was ingesting, but it was on codeberg.

So the content_missing call explodes mid-air client side (`"POST /content/missing
HTTP/1.1" 200 9475383`, so the client did receive the data).

It so happens that the content_missing API takes an unlimited number of byte
ids as input [1], and then "tries" to stream the results to the client (the
RPC layer in the middle makes that moot).

Pagination

As a unified effort around pagination has started, a way forward could be to
make its interface:

def content_missing(
        self, contents: List[Dict[str, Any]], key_hash: str = "sha1"
    ) -> PagedResult[List[bytes]]:

so that client code can iterate incrementally over the result instead of
reifying everything in RAM prior to the content_add call [2].

pros: sounds sensible
cons: not sure that's feasible; I don't fully understand the get_inflater use in the git loader code yet [2]
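To make the proposal concrete, here is a hypothetical sketch of client-side iteration over such a paginated endpoint. The `PagedResult` class below is a stand-in modelled on the paged-result pattern (`results` plus `next_page_token`); the `page_token` parameter on `content_missing` is part of the proposal, not the current API:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PagedResult:
    # Hypothetical stand-in for a paginated result type.
    results: List[bytes]
    next_page_token: Optional[str] = None

def iter_content_missing(storage, contents, key_hash="sha1"):
    """Yield missing content ids page by page instead of all at once."""
    token = None
    while True:
        page = storage.content_missing(contents, key_hash=key_hash, page_token=token)
        yield from page.results
        token = page.next_page_token
        if token is None:
            break
```

The loader would then consume `iter_content_missing(...)` lazily, keeping only one page of ids in memory at a time.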

Stop filtering altogether

As an alternative, we could also let the server do the missing filtering...
The _add storage methods already call the _missing endpoints consistently.

pros: less ram used since we no longer keep the internal content missing set
cons: more data on the wire that only gets filtered server side...

Clean up redundant code

And finally, I see our loader configuration now references the filter proxy
storage, which already does the missing calls. So maybe we could drop some
code in the git loader about that.

Unsure this one will help, though. That could make the loader less lazy with regard to how it currently reifies content objects...


As discussed on IRC anyway, profiling would be a way forward to pinpoint what actually happens.

[1] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/postgresql/storage.py$335-349

[2] https://forge.softwareheritage.org/source/swh-loader-git/browse/master/swh/loader/git/loader.py$309-324

[3]
https://forge.softwareheritage.org/source/puppet-swh-site/browse/production/data/defaults.yaml$1985-198
https://forge.softwareheritage.org/source/puppet-swh-site/browse/production/data/defaults.yaml$1242-1254

zack renamed this task from staging: git loader: failure to ingest huge repository (e.g. nixpkgs) to git loader OOM when loading huge repository.Sep 17 2020, 9:53 AM
zack raised the priority of this task from Normal to High.
zack added a subscriber: zack.

So the content_missing call explodes mid-air client side (`"POST /content/missing
HTTP/1.1" 200 9475383`, so the client did receive the data).

It so happens that the content_missing API takes an unlimited number of byte
ids as input [1], and then "tries" to stream the results to the client (the
RPC layer in the middle makes that moot).

AFAICT, the gradual refactorings of swh.storage and swh.loader.core around type checking, pipelined filtering and buffering have completely lost the "lazy" behavior of the git loader.

The most glaring issue currently is within swh.loader.core: https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/core/loader.py;v0.10.0$409-435

store_data generates a full list of all object types before sending them to storage. This means that even if you use the buffer filter in your storage pipeline, you end up with a full list of all the objects you're loading, even when the implementations of get_contents, get_directories, ... try very hard to be lazy generators.
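The difference can be illustrated with a small sketch: reifying a generator with `list()` materializes every object at once, while consuming it in bounded batches keeps memory proportional to the batch size. The `store_contents` helper and the `content_add` call shape below are hypothetical stand-ins, not the real loader API:

```python
from itertools import islice

def batches(iterable, size):
    """Consume any iterable in fixed-size chunks without materializing it."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Eager (roughly what store_data does): the whole list lives in memory.
#     storage.content_add(list(self.get_contents()))
# Lazy alternative: at most `size` objects are alive at any time.
def store_contents(storage, get_contents, size=1000):
    for batch in batches(get_contents(), size):
        storage.content_add(batch)  # hypothetical call shape
```

With this shape, the laziness of `get_contents` and friends would actually pay off instead of being defeated at the last step.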

So, in the same line of thinking, I think the buffer filter should also slice the backend calls to *_missing into chunks. Adding pagination to these endpoints seems quite overkill.

Adding pagination to these endpoints seems quite overkill.

Agreed, we just need to make smaller calls.
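Slicing the *_missing calls could be as simple as wrapping the backend call in a chunking helper inside the buffer proxy. A sketch; the chunk size is an arbitrary assumption and `backend_missing` stands for whichever *_missing endpoint is being wrapped:

```python
def chunked_missing(backend_missing, ids, chunk_size=1000):
    """Call a *_missing endpoint in bounded slices and merge the results.

    Keeps each RPC request/response small instead of shipping all ids
    in one giant call.
    """
    missing = []
    for i in range(0, len(ids), chunk_size):
        missing.extend(backend_missing(ids[i:i + chunk_size]))
    return missing
```

Since *_missing is a pure set-membership query, splitting it into slices doesn't change the result, only the size of each request on the wire.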

Status on this. Loader-core has been tagged 0.11.0 which includes D3976.

Tested on staging with multiple repositories that are either large or that I
had issues with in the past (with a configuration change, though: D3986):

|-----------------------------------------+--------+--------------------+----------------|
| repo                                    | status | ingestion time (m) | Number of refs |
|-----------------------------------------+--------+--------------------+----------------|
| https://github.com/python/cpython       | done   | 34 [1]             | ~22756         |
| https://github.com/NixOS/nixpkgs.git    | done   | 703 [2]            | ~80000         |
| https://github.com/guix-mirror/guix.git | done   | 41 [3]             | 94             |
| https://github.com/torvalds/linux       | done   | 971 [4]            | 1447           |
|-----------------------------------------+--------+--------------------+----------------|

Note:

  • I opened a similar diff for the configuration change in the docker environment, D3988 (test ran fine).
  • RAM on the worker machines has decreased to 12 GiB since we opened this task, btw

So, in the same line of thinking, I think the buffer filter should also slice
the backend calls to *_missing into chunks

Nothing has been done for that.

[1]

swhworker@worker1:~$ time SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh --log-level DEBUG loader run git https://github.com/python/cpython
...
{'status': 'eventful'}
real    34m15.161s
user    27m5.043s
sys     0m19.578s

[2]

swhworker@worker0:~$ time SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh --log-level DEBUG loader run git https://github.com/NixOS/nixpkgs.git 
...
{'status': 'eventful'}
real    703m18.613s
user    315m45.926s
sys     5m42.913s

[3]

swhworker@worker2:~$ time SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh --log-level DEBUG loader run git https://github.com/guix-mirror/guix.git 
...
{'status': 'eventful'}

real    41m50.139s
user    27m19.108s
sys     0m31.859s

[4]

swhworker@worker1:~$ time SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh --log-level DEBUG loader run git https://github.com/torvalds/linux
...
{'status': 'eventful'}

real    971m26.927s
user    630m15.199s
sys     5m2.752s
ardumont changed the task status from Open to Work in Progress.Sep 18 2020, 3:42 PM

I can confirm that with the current master HEAD of swh-loader-core (452fa224f9ca635a979cf1a8e98c88bb560ca98a), loading of the Linux kernel repo no longer OOMs.
(It failed after ~24 hours, but apparently for unrelated reasons.)

The peak memory usage of the celery worker I've observed is ~6.5 GB RAM, which is ~4x the size of the git packfile for that repository. In relative terms it seems acceptable. In absolute terms it is definitely OK for usage on desktop/laptop; I've no idea if that's acceptable for usage on our crawlers, which load multiple repos in parallel.

fwiw, loader-core v0.11.0 deployed in production.

fwiw, loader-core v0.11.0 deployed in production.

so what's missing to close this?

vlorentz claimed this task.

fwiw, loader-core v0.11.0 deployed in production.

so what's missing to close this?

well, the following were, respectively, not checked, addressed, or answered...

  1. running some of the sources in production. I have "save code now"ed the guix and nixpkgs repositories; I could also add the linux kernel (if the visit is old enough).
  2. the suggestion from olasd was not answered: "So, in the same line of thinking, I think the buffer filter should also slice the backend calls to *_missing into chunks."
  3. your question was not answered either: "I've no idea if that's acceptable for usage on our crawlers, which load multiple repos in parallel."

but now it's closed so i guess it's fine...

running some of the sources in production. I have "save code now"ed the guix
and nixpkgs repositories; I could also add the linux kernel (if the visit is
old enough).

Everything passed so it's fine.

I also threw torvalds/linux into the mix, but that is kinda biased as its last
visit was August 2020 or so ¯\_(ツ)_/¯.

the suggestion from olasd was not answered: "So, in the same line of
thinking, I think the buffer filter should also slice the backend calls to
*_missing into chunks."

That could go in another task, T2622.

your question was not answered either: " I've no idea if that's acceptable
for usage on our crawlers, which load multiple repos in parallel."

The previous large repositories passed, so I think that's not a problem.
I'll stop there now.

Cheers,