
Some git repositories are failing to be ingested because of MemoryError
Closed, ResolvedPublic

Description

Even on worker17, which has beefier hardware (64 GiB of memory), the load gets killed by the
OOM killer [2] (possibly related to the Sentry issue [1] referenced in the paste):

swhworker@worker17:~$ swh loader -C /etc/softwareheritage/loader_oneshot.yml run git https://github.com/keybase/client
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/keybase/client' with type 'git'
Enumerating objects: 556997, done.
Counting objects: 100% (2700/2700), done.
Compressing objects: 100% (2219/2219), done.
Total 556997 (delta 589), reused 2436 (delta 457), pack-reused 554297
INFO:swh.loader.git.loader.GitLoader:Listed 19843 refs for repo https://github.com/keybase/client
ERROR:swh.loader.git.loader.GitLoader:Loading failure, updating to `failed` status
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 339, in load
    self.store_data()
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 463, in store_data
    for release in self.get_releases():
  File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 349, in get_releases
    for raw_obj in self.iter_objects(b"tag"):
  File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 315, in iter_objects
    PackData.from_file(self.pack_buffer, self.pack_size)
  File "/usr/lib/python3/dist-packages/dulwich/pack.py", line 1337, in _walk_all_chains
    for result in self._follow_chain(offset, type_num, None):
  File "/usr/lib/python3/dist-packages/dulwich/pack.py", line 1393, in _follow_chain
    unpacked = self._resolve_object(offset, obj_type_num, base_chunks)
  File "/usr/lib/python3/dist-packages/dulwich/pack.py", line 1385, in _resolve_object
    unpacked.decomp_chunks)
MemoryError
{'status': 'failed'}

Noticed through P1114#7468 (similar sentry issue [1])

This echoes a previous task [3].

Versions running on the workers: loader git 0.10 and dulwich 0.19.11 [4]

[1] https://sentry.softwareheritage.org/share/issue/175ffd5551644b8b8171beaf627e105a/

[2] https://grafana.softwareheritage.org/goto/jqkSiFM7z?orgId=1

[3] possibly T2373

[4]

root@pergamon:~# clush -b -w @swh-workers "dpkg -l python3-dulwich python3-swh.loader.git" | grep ii
ii  python3-dulwich        0.19.11-2             amd64        Python Git library - Python3 module
ii  python3-swh.loader.git 0.10.0-1~swh1~bpo10+1 all          Software Heritage Git loader

Event Timeline

ardumont triaged this task as Normal priority.Aug 4 2021, 10:28 AM
ardumont created this task.
ardumont updated the task description.
ardumont added a subscriber: vlorentz.

For information, @vlorentz opened a related issue in dulwich [1].

[1] https://github.com/dulwich/dulwich/issues/894

It's exactly the same issue AFAIK

vlorentz renamed this task from Big git repositories are failing to be ingested to Some git repositories are failing to be ingested because of MemoryError.Aug 5 2021, 2:13 PM

[3] possibly T2373

In the end, it looks more related to T3025.

Another example in production: during the stop phase of a worker, the loader was alone on the server (with 12 GB of RAM) and was still OOM-killed:

Aug 10 08:53:24 worker05 python3[871]: [2021-08-10 08:53:24,745: INFO/ForkPoolWorker-1] Load origin 'https://github.com/evands/Specs' with type 'git'
Aug 10 08:54:17 worker05 python3[871]: [62B blob data]
Aug 10 08:54:17 worker05 python3[871]: [586B blob data]
Aug 10 08:54:17 worker05 python3[871]: [473B blob data]
Aug 10 08:54:29 worker05 python3[871]: Total 782419 (delta 6), reused 5 (delta 5), pack-reused 782401                                         
Aug 10 08:54:29 worker05 python3[871]: [2021-08-10 08:54:29,044: INFO/ForkPoolWorker-1] Listed 6 refs for repo https://github.com/evands/Specs
Aug 10 08:59:21 worker05 kernel: [    871]  1004   871   247194   161634  1826816    46260             0 python3                              
Aug 10 09:08:29 worker05 systemd[1]: swh-worker@loader_git.service: Unit process 871 (python3) remains running after unit stopped.            
Aug 10 09:15:29 worker05 kernel: [    871]  1004   871   412057   372785  3145728        0             0 python3                              
Aug 10 09:16:57 worker05 kernel: [    871]  1004   871   823648   784496  6443008        0             0 python3                              
Aug 10 09:24:44 worker05 kernel: CPU: 2 PID: 871 Comm: python3 Not tainted 5.10.0-0.bpo.7-amd64 #1 Debian 5.10.40-1~bpo10+1                   
Aug 10 09:24:44 worker05 kernel: [    871]  1004   871  2800000  2760713 22286336        0             0 python3                              
Aug 10 09:24:44 worker05 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-2,oom_memcg=/system.slice/system-swh\x2dworker.slice,task_memcg=/system.slice/system-swh\x2dworker.slice/swh-worker@loader_git.service,task=python3,pid=871,uid=1004           
Aug 10 09:24:44 worker05 kernel: Memory cgroup out of memory: Killed process 871 (python3) total-vm:11200000kB, anon-rss:11038844kB, file-rss:4008kB, shmem-rss:0kB, UID:1004 pgtables:21764kB oom_score_adj:0
Aug 10 09:24:45 worker05 kernel: oom_reaper: reaped process 871 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
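
As a sanity check on the kernel lines above: in the per-task dump, the columns after the uid and tgid are total_vm and rss (counted in pages), followed by pgtables_bytes. Assuming 4 KiB pages (the usual size on amd64, but not stated in the log), the last dump line for pid 871 matches the figures of the "Killed process" line. A minimal sketch of that arithmetic:

```python
PAGE_KIB = 4  # assumption: 4 KiB pages, not stated in the log

# Values copied from the last task-dump line for pid 871 (09:24:44).
total_vm_pages = 2800000
rss_pages = 2760713
pgtables_bytes = 22286336

total_vm_kib = total_vm_pages * PAGE_KIB
rss_kib = rss_pages * PAGE_KIB
pgtables_kib = pgtables_bytes // 1024

# Values copied from the "Killed process 871" line.
assert total_vm_kib == 11200000       # total-vm
assert rss_kib == 11038844 + 4008     # anon-rss + file-rss (shmem-rss is 0)
assert pgtables_kib == 21764          # pgtables

print(f"resident memory at kill time: {rss_kib / 1024**2:.2f} GiB")
```

In other words, the loader alone was holding roughly 10.5 GiB of resident memory when it was killed.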
zack raised the priority of this task from Normal to High.Aug 10 2021, 12:10 PM

I've opened a PR [1] with the patch initially proposed by val (I also patched the tests so that the dulwich CI goes green).

In the meantime, I've triggered runs with and without the patch (patched dulwich in a venv) so that we also have concrete data to add to the PR or the issue.
That will also allow us to check that the snapshots end up the same.

[1] https://github.com/dulwich/dulwich/pull/903

ardumont changed the task status from Open to Work in Progress.Sat, Sep 25, 4:10 PM

Draft analysis [1]
tl;dr: So far so good, the staging workers running the patched dulwich reliably
finish their ingestion (no hash mismatch).

[1] P1176


That patch would also help decrease the number of OOM kills we get [1].
The Mercurial loader is no longer an issue on that front, but the git loader still is,
as demonstrated in this task.

[1] https://grafana.softwareheritage.org/d/j_6mA_Gnk/workers-oom-killer?orgId=1&refresh=1m&from=now-7d&to=now

I forgot to mention that, with the patched dulwich installed locally, the loader-git tests pass as well.

To ensure everything works well with that patch, we ran multiple ingestions with and without the patched [1] dulwich version.

The goal is to check that the patch reduces the memory footprint enough for
the ingestion to run to completion on our standard workers, without affecting the standard
swh hash computations. For this last check, we verify that the snapshot hashes are the same
at the end of the ingestions (with standard and patched workers). As the swh model is a
Merkle DAG and the snapshot is the top-level model object, that's enough.

tl;dr: the patch fixes the problem without hash divergences.
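
To illustrate why comparing only the snapshot hash is enough, here is a minimal sketch of a Merkle DAG. The `node_id` and `snapshot_id` helpers and the toy object chain are hypothetical, and the hashing scheme is deliberately simplified: this is not the actual swh.model identifier computation.

```python
import hashlib
from typing import Sequence


def node_id(payload: bytes, children: Sequence[bytes]) -> bytes:
    """Hash a node from its own payload and the ids of its children."""
    h = hashlib.sha1()
    h.update(payload)
    for child in children:
        h.update(child)
    return h.digest()


def snapshot_id(file_content: bytes) -> bytes:
    """Build a toy content -> directory -> revision -> snapshot chain."""
    content = node_id(file_content, [])
    directory = node_id(b"dir", [content])
    revision = node_id(b"rev", [directory])
    return node_id(b"snapshot", [revision])


# Identical inputs give identical top-level ids...
same = snapshot_id(b"hello") == snapshot_id(b"hello")
# ...while a change in a leaf propagates all the way up to the snapshot id.
different = snapshot_id(b"hello") != snapshot_id(b"hellp")
```

Since every id covers the ids of the objects below it, two runs that end with the same snapshot id have, barring hash collisions, produced the same objects underneath.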

|-----------------------+----------------------+---------------+--------------------------+--------------------------------------------|
| origin (large)        | run (standard)       | run (patched) | snapshot hash comparison | snapshot hash                              |
|                       | X: oom killed        |               | ok: same hash both with  |                                            |
|                       | ok: finished         |               | standard/patched version |                                            |
|-----------------------+----------------------+---------------+--------------------------+--------------------------------------------|
| keybase/client        | X  (worker0.staging) | ok            | ok                       | \xcddaccc0a2d452098701dec921731e8c96630e2b |
| keybase/client        | ok (worker17)        |               |                          |                                            |
|-----------------------+----------------------+---------------+--------------------------+--------------------------------------------|
| torvalds/linux        | ok                   | ok            | ok                       | \xde499fdc325524ee0e7c3f57c6c2ae6a09091845 |
|-----------------------+----------------------+---------------+--------------------------+--------------------------------------------|
| kubernetes/kubernetes | ok                   | ok            | ok                       | \xa2a6299e3527bbba548eec0f0ef80cca9e80f545 |
|-----------------------+----------------------+---------------+--------------------------+--------------------------------------------|
| NixOS/nixpkgs         | ok                   | ok            | ok                       | \xda0e3e4a3eff6fb6370259fd2bdfcf932fa6ac69 |
|-----------------------+----------------------+---------------+--------------------------+--------------------------------------------|
| CocoaPods/Specs       | ongoing              | ongoing       | ongoing                  |                                            |
|-----------------------+----------------------+---------------+--------------------------+--------------------------------------------|
| origin (medium)       | run (standard)       | run (patched) | snapshot hash comparison | snapshot hash                              |
|-----------------------+----------------------+---------------+--------------------------+--------------------------------------------|
| rdicosmo/parmap       | ok                   | ok            | ok                       | \x2d869aa00591d2ac8ec8e7abacdda563d413189d |
|-----------------------+----------------------+---------------+--------------------------+--------------------------------------------|
| hylang/hy             | ok                   | ok            | ok                       | \x821f28af45edaedc6f70b84c9bc4d407e7436452 |
|-----------------------+----------------------+---------------+--------------------------+--------------------------------------------|
| hylang/hyrule         | ok                   | ok            | ok                       | \x882db61b629bd9f2c7ef3492924e3ff73382d3a6 |
|-----------------------+----------------------+---------------+--------------------------+--------------------------------------------|

More details in the paste about the snapshot extraction [3]

If you are not interested in the details, you can stop reading here. Otherwise, feel free
to continue.

3 nodes were used:

  • worker17 (production): a large machine overall (64 GiB RAM, 20 CPUs), able to handle the current large repository ingestions without being killed. Ingestion is expected to work as is, given enough time. The snapshots resulting from these loads serve as the references.
  • worker[0:2] (staging): these nodes are smaller (12 GiB RAM, 4 CPUs) and, without a patched dulwich version, their ingestion of large repositories fails (OOM kill). With the patch applied, the loading is expected to work. We can then compare the snapshots between the runs. The resulting snapshots should be the same as the ones generated on worker17 [2].

Note that the ingestion timing is not important for the analysis. The staging workers are
expected to be slower since the machines do not have the same specs. Plus, the
underlying databases do not hold the same information. The production one is more
complete than the staging one (although it's less loaded). The timing is only included to
give a rough idea of the order of magnitude of the time it takes to ingest those. Again, the most
important criterion is that the ingestions finish with the same snapshot.

[1] git pack walking in DFS instead of BFS order https://github.com/dulwich/dulwich/pull/903

[2] provided the origins ingested were the same at the time of ingestion (the snapshot also depends on the data being the same).

[3] P1176
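
The memory effect of the traversal order mentioned in [1] can be sketched on a toy delta tree. This is not dulwich's code: `make_tree`, `peak_live_objects` and the accounting rule (a resolved object stays in memory as long as one of its delta children is still pending) are simplifying assumptions made for illustration only.

```python
from collections import deque


def make_tree(width, depth):
    """A root base object with `width` delta chains of length `depth`."""
    children = {"root": []}
    for w in range(width):
        parent = "root"
        for d in range(depth):
            node = (w, d)
            children.setdefault(parent, []).append(node)
            parent = node
    return children


def peak_live_objects(children, root, depth_first):
    """Return the peak number of resolved objects that must stay in memory.

    Each pending entry is (object, base): the resolved base has to stay
    alive until the entry is processed, because a delta can only be
    applied on top of its base.
    """
    pending = deque([(root, None)])
    peak = 0
    while pending:
        obj, _base = pending.pop() if depth_first else pending.popleft()
        # `obj` is now resolved; its children become pending and pin it.
        for child in children.get(obj, ()):
            pending.append((child, obj))
        live = {base for _, base in pending if base is not None}
        live.add(obj)  # the object that was just resolved
        peak = max(peak, len(live))
    return peak


tree = make_tree(width=100, depth=5)
bfs_peak = peak_live_objects(tree, "root", depth_first=False)
dfs_peak = peak_live_objects(tree, "root", depth_first=True)
print(f"BFS peak: {bfs_peak} objects, DFS peak: {dfs_peak} objects")
```

In this toy model, the breadth-first walk keeps one resolved object per chain alive at the same time, while the depth-first walk finishes a chain before starting the next one, so its peak does not grow with the number of chains.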

I made our Jenkins CI build the patched dulwich with the fix discussed here.
It's currently uploaded in the swh Debian repository [1].
I've deployed this on the staging workers and triggered another run to
ensure everything is fine with it (again). If it is, I'll deploy it on the other workers tomorrow.

[1]

root@pergamon:/srv/softwareheritage/repository# reprepro ls dulwich
dulwich |   0.19.13-1~bpo9~swh+1 | stretch-swh | source
dulwich | 0.20.25-2~swh1~bpo10+1 |  buster-swh | source
dulwich |         0.20.25-2~swh1 |         sid | source
ardumont claimed this task.

I've deployed the patched dulwich on our workers.
As a bonus, upstream merged the patch \o/.

This can be closed.