staging: git loader: failure to ingest huge repository (e.g. nixpkgs)
Open, Normal, Public

Description

Consistently [1] unable to ingest some repositories on staging:

swhworker@worker1:~$ time SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh loader run git https://github.com/NixOS/nixpkgs.git
INFO:swh.core.config:Loading config file /etc/softwareheritage/global.ini
INFO:swh.core.config:Loading config file /etc/softwareheritage/loader_git.yml
Enumerating objects: 1151, done.
Counting objects: 100% (1151/1151), done.
Compressing objects: 100% (475/475), done.
Total 2367234 (delta 844), reused 697 (delta 671), pack-reused 2366083
INFO:swh.loader.git.BulkLoader:Listed 70404 refs for repo https://github.com/NixOS/nixpkgs.git
Killed

real    57m16.787s
user    50m33.560s
sys     0m40.689s

Note: this leaves a lingering origin visit with status ongoing (which makes T2372 really interesting).

machine (worker1.internal.staging.swh.network):

  • 4 cores
  • 16 GiB RAM
  • no swap (our prod node does have some, though) [2]

Nothing else runs there (the other loader services are stopped).

[1] https://grafana.softwareheritage.org/d/q6c3_H0iz/system-overview?orgId=1&var-instance=worker1.internal.staging.swh.network&from=1587553104919&to=1587561781178
(both peaks in memory usage are attempts at this ingestion)

[2] I will add some swap to that node to check if that goes further with it.

Event Timeline

ardumont renamed this task from staging: loader git: failure to ingest repository to staging: git loader: failure to ingest huge repository (e.g. nixpkgs). Apr 22 2020, 3:33 PM
ardumont triaged this task as Normal priority.
ardumont created this task.
ardumont updated the task description. Apr 22 2020, 3:37 PM

[2] I will add some swap to that node to check if that goes further with it.

Added 4 GiB of swap; it fails the same way (as expected, but I wanted to be sure [3] ;).

swhworker@worker1:~$ time SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh loader run git https://github.com/NixOS/nixpkgs.git
INFO:swh.core.config:Loading config file /etc/softwareheritage/global.ini
INFO:swh.core.config:Loading config file /etc/softwareheritage/loader_git.yml
Enumerating objects: 1237, done.
Counting objects: 100% (1237/1237), done.
Compressing objects: 100% (458/458), done.
Total 2367288 (delta 929), reused 818 (delta 777), pack-reused 2366051
INFO:swh.loader.git.BulkLoader:Listed 70409 refs for repo https://github.com/NixOS/nixpkgs.git
Killed

real    60m0.376s
user    53m11.195s
sys     0m39.039s

[3] https://grafana.softwareheritage.org/d/q6c3_H0iz/system-overview?orgId=1&var-instance=worker1.internal.staging.swh.network&from=1587561176456&to=1587565786528

olasd added a subscriber: olasd. Apr 28 2020, 11:49 AM

The base logic of the git loader regarding packfiles hasn't really been touched since it was first implemented: it has never really been profiled or optimized with respect to its memory usage. This issue isn't specific to the staging infra; it's only more salient there because the workers have been set up with tight constraints.

There are a few strategies I can think of to relieve the memory pressure from the git loader:

  1. ignore pull request branches (should reduce the number of objects loaded, to some extent)
  2. split the pack files and load objects from different branches sequentially, instead of all at once
    • pros:
      • reduces the amount of objects in memory at once
      • enables partial snapshots
    • cons:
      • more intensive on the server: more requests, less caching
      • makes the "we already have these objects" logic trickier: need to overlay the contents of the archive and the objects that have just been downloaded
  3. write the packfile on disk instead of using a memory-backed object
    • pros:
      • memory usage shifts to the OS cache instead of our own process memory
    • cons:
      • more intensive on disk, even for loading tiny repositories
      • need to be careful with on-disk cleanup
  4. give workers more swap to "automate" 3.
    • pros:
      • only a small setup overhead; no code changes
      • will use RAM until it's not possible anymore
    • cons:
      • the control over what the OS swaps is pretty poor
      • swapping affects all workers, not just the one hogging all the memory

I wonder if there's a "disk-backed BytesIO object" that we can use in Python that would give us a decent middle ground between options 3 and 4.
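
One candidate from the standard library is tempfile.SpooledTemporaryFile, which behaves like an in-memory buffer until a size threshold is exceeded and then transparently rolls over to a temporary file on disk. The sketch below only shows that behaviour; the threshold value and the helper name are assumptions, and whether it fits the way the loader receives the packfile has not been checked here.

```python
import tempfile


def make_pack_buffer(max_in_memory_bytes: int = 256 * 1024 * 1024):
    """Return a file-like object kept in memory up to the given size.

    Hypothetical helper: once more than `max_in_memory_bytes` have been
    written, the data is moved to a temporary file on disk, which is
    deleted when the object is closed.
    """
    return tempfile.SpooledTemporaryFile(max_size=max_in_memory_bytes)


# Small threshold so the rollover is easy to observe in a demo.
pack_buffer = make_pack_buffer(max_in_memory_bytes=1024)
pack_buffer.write(b"x" * 2048)  # exceeds the threshold, spills to disk
pack_buffer.seek(0)
payload = pack_buffer.read()
pack_buffer.close()  # also removes the on-disk temporary file
```

Small repositories would stay entirely in memory, so the extra disk traffic mentioned for option 3 would only apply above the threshold.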

Reading this again, and seeing that the workers have 16GB of RAM, there's something weird going on that's not related to the volume of the packfile (which is 2GB max).

It'd be useful to see when the worker crashes (with some debugging output turned on), because none of the strategies I talked about would change anything about what happens after the packfile is downloaded.
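
To narrow down which stage the memory grows in, one option is to log the process's peak resident set size around each stage. This is a minimal sketch using the standard resource module (Unix only); the context manager name and the stage label are hypothetical, and nothing like it is claimed to exist in the loader.

```python
import logging
import resource
from contextlib import contextmanager

logger = logging.getLogger(__name__)


def peak_rss_kib() -> int:
    """Peak resident set size of this process so far.

    On Linux, ru_maxrss is reported in kilobytes (macOS reports bytes).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


@contextmanager
def log_peak_memory(stage: str):
    """Log how much the peak RSS grew while the wrapped block ran."""
    before = peak_rss_kib()
    try:
        yield
    finally:
        after = peak_rss_kib()
        logger.debug(
            "%s: peak RSS %d KiB (+%d KiB)", stage, after, after - before
        )


# Usage: wrap a stage of the load; the label is illustrative only.
with log_peak_memory("example stage"):
    scratch = bytearray(10 * 1024 * 1024)
```

Because this reports a high-water mark, it shows which stage pushed the peak up, not how much memory is in use at any given moment.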

Currently running this again with debug logs...
Thanks for the input.

ardumont added a comment. Edited Apr 28 2020, 1:20 PM

Currently running this again with debug logs...

done.

This seems to crash on the storage.content_missing call:

swhworker@worker1:~$ time SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh --log-level DEBUG loader run git https://github.com/NixOS/nixpkgs.git 2>&1 | tee /tmp/loader-git-run-nixpkgs-debug.txt
WARNING:swh.core.cli:Could not load subcommand indexer: cannot import name 'get_journal_client' from 'swh.journal.client' (/usr/lib/python3/dist-packages/swh/journal/client.py)
DEBUG:swh.loader.cli:kw: {}
DEBUG:swh.loader.cli:registry: {'task_modules': ['swh.loader.git.tasks'], 'loader': <class 'swh.loader.git.loader.GitLoader'>}
DEBUG:swh.loader.cli:loader class: <class 'swh.loader.git.loader.GitLoader'>
INFO:swh.core.config:Loading config file /etc/softwareheritage/global.ini
INFO:swh.core.config:Loading config file /etc/softwareheritage/loader_git.yml
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): storage0.internal.staging.swh.network:5002
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /origin/add HTTP/1.1" 200 38
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /origin/visit/add HTTP/1.1" 200 198
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /origin/visit/get_latest HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): github.com:443
DEBUG:urllib3.connectionpool:https://github.com:443 "GET /NixOS/nixpkgs.git/info/refs?service=git-upload-pack HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
... (snip) ...
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 23003
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /object/find_by_sha1_git HTTP/1.1" 200 3039
INFO:swh.loader.git.BulkLoader:Listed 70746 refs for repo https://github.com/NixOS/nixpkgs.git
DEBUG:urllib3.connectionpool:Resetting dropped connection: storage0.internal.staging.swh.network
DEBUG:urllib3.connectionpool:http://storage0.internal.staging.swh.network:5002 "POST /content/missing HTTP/1.1" 200 9475383

real    59m41.024s
user    52m27.378s
sys     0m44.581s
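
If the crash does turn out to be tied to querying all content ids at once (which this log suggests but does not prove), one possible mitigation is to query in bounded batches. The sketch below assumes a callable that takes a list of ids and returns the missing ones; that callable, the batch size, and the function names are hypothetical stand-ins and do not reproduce the real storage API signature.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[bytes], size: int) -> Iterator[List[bytes]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def missing_in_batches(
    ids: Iterable[bytes],
    content_missing: Callable[[List[bytes]], Iterable[bytes]],
    batch_size: int = 10000,
) -> Iterator[bytes]:
    """Query `content_missing` one batch at a time.

    `content_missing` is a hypothetical stand-in for the storage call;
    only one batch of ids and its response are held at any time.
    """
    for batch in batched(ids, batch_size):
        yield from content_missing(batch)


# Demo with a fake backend that reports ids ending in an odd byte as missing.
def fake_content_missing(batch: List[bytes]) -> List[bytes]:
    return [i for i in batch if i[-1] % 2 == 1]


all_ids = [bytes([n]) for n in range(10)]
missing = list(missing_in_batches(all_ids, fake_content_missing, batch_size=3))
```

This only bounds the size of each request and response; the ids themselves would also need to be produced lazily for the overall footprint to shrink.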