
nixguix: Fails to finish as it's stuck in a loop up to memory error
Closed, Migrated

Description

As P640 shows, this loader does not finish.
In principle, this should be a common problem for all loaders.
It's just more apparent for this one, as it loads a lot of different sources in one go.

The issue lies in one of the proxy storages, the buffer one, which fails to flush its contents to the storage because of one of the real hash collisions referenced in P622.
The loader is then stuck in a loop: it retries content_add, fails, moves on to the next artifact, and adds some more contents to the buffer.
The content_add to the storage still fails (because the problematic content is still in the buffer).
Memory usage thus keeps growing [3] up to a fatal memory error.
The process is then oom-reaped [4].
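
To make the failure mode above concrete, here is a minimal sketch of it. It is not the actual swh.storage code: `FakeStorage`, `NaiveBufferProxy`, `HashCollision` and `load_artifacts` are all hypothetical stand-ins, and the only assumption carried over from the description is that a failed flush leaves the buffer untouched while the loader moves on to the next artifact.

```python
class HashCollision(Exception):
    """Hypothetical stand-in for the error raised on a colliding hash."""


class FakeStorage:
    """Hypothetical backend rejecting any batch that holds a colliding content."""

    def __init__(self, colliding):
        self.colliding = set(colliding)
        self.stored = []

    def content_add(self, contents):
        if any(c in self.colliding for c in contents):
            raise HashCollision("batch rejected")
        self.stored.extend(contents)


class NaiveBufferProxy:
    """Toy buffering proxy that keeps its buffer when a flush fails."""

    def __init__(self, storage, threshold=2):
        self.storage = storage
        self.threshold = threshold
        self.buffer = []

    def content_add(self, contents):
        self.buffer.extend(contents)
        if len(self.buffer) >= self.threshold:
            self.flush()

    def flush(self):
        # If this raises, the next line never runs and the buffer is kept.
        self.storage.content_add(self.buffer)
        self.buffer = []


def load_artifacts(proxy, artifacts):
    """Mimic the loader: on failure, give up on the artifact and go on."""
    sizes = []
    for contents in artifacts:
        try:
            proxy.content_add(contents)
        except HashCollision:
            pass
        sizes.append(len(proxy.buffer))
    return sizes


storage = FakeStorage(colliding={"bad"})
proxy = NaiveBufferProxy(storage)
sizes = load_artifacts(proxy, [["a", "bad"], ["b", "c"], ["d", "e"]])
print(sizes)           # buffer size after each artifact: [2, 4, 6]
print(storage.stored)  # nothing was ever written: []
```

In this toy model the buffer only ever grows once the colliding content is in it, which matches the memory curve in [3].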

Possible workarounds/fixes include:

  1. drop the buffer proxy storage from the configuration (that could be used as a test to ensure the loader does indeed finish)
  2. make the proxy storage (one of retry/buffer) exclude the colliding hash from the transaction (similar to what is currently implemented in the journal [1])
  3. deal properly with the hash collision in question
  4. exclude the sources that include the hash collision
  5. allow the buffer proxy storage to be cleared between failed add operations.
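
Option 2 could look roughly like the sketch below. It is only an illustration under stated assumptions: the exception is assumed to tell which content collided (the `colliding` attribute is hypothetical), and `FakeStorage` and `content_add_skipping_collisions` are made-up names, not the swh.storage or journal API.

```python
class HashCollision(Exception):
    """Hypothetical error that carries the colliding content."""

    def __init__(self, colliding):
        super().__init__(f"collision on {colliding!r}")
        self.colliding = colliding


class FakeStorage:
    """Hypothetical backend: one colliding content rejects the whole batch."""

    def __init__(self, bad):
        self.bad = set(bad)
        self.stored = []

    def content_add(self, contents):
        for c in contents:
            if c in self.bad:
                raise HashCollision(c)
        self.stored.extend(contents)


def content_add_skipping_collisions(storage, contents):
    """Retry the batch, dropping the colliding content after each failure.

    Returns the list of contents that were excluded.
    """
    pending = list(contents)
    skipped = []
    while pending:
        try:
            storage.content_add(pending)
            break
        except HashCollision as exc:
            # Each failure removes at least one content, so the loop ends.
            skipped.append(exc.colliding)
            pending = [c for c in pending if c != exc.colliding]
    return skipped


storage = FakeStorage(bad={"x", "y"})
skipped = content_add_skipping_collisions(storage, ["a", "x", "b", "y"])
print(skipped)         # ['x', 'y']
print(storage.stored)  # ['a', 'b']
```

The excluded contents are returned so the caller can log them or handle them separately once 3. is settled.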

Right now, heading for 2., as the solution for 3. is still a pending question [2].

[1] rDJNL3c0e491352934c67f1d92d1302760a32a333edee

[2] T2332

[3] https://grafana.softwareheritage.org/d/q6c3_H0iz/system-overview?orgId=1&var-instance=worker0.internal.staging.swh.network&from=1586080516490&to=1586253316491

[4]

[Tue Apr  7 03:34:46 2020] Memory cgroup out of memory: Kill process 16402 (python3) score 996 or sacrifice child
[Tue Apr  7 03:34:46 2020] Killed process 16402 (python3) total-vm:14862676kB, anon-rss:14686960kB, file-rss:9920kB, shmem-rss:8kB
[Tue Apr  7 03:34:47 2020] oom_reaper: reaped process 16402 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:8kB

Event Timeline

ardumont triaged this task as Normal priority. Apr 7 2020, 11:49 AM
ardumont created this task.

Right now, heading for 2., as the solution for 3. is still a pending question [2].

After discussing with the team, let's try 5. first.
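
A minimal sketch of what 5. could mean in practice, assuming a hypothetical `clear_buffers` method on the buffer proxy that the loader calls whenever an add operation fails. All names here (`FakeStorage`, `BufferProxy`, `clear_buffers`, `load_artifacts`) are illustrative and not taken from the actual code.

```python
class HashCollision(Exception):
    """Hypothetical stand-in for the error raised on a colliding hash."""


class FakeStorage:
    """Hypothetical backend rejecting any batch that holds a colliding content."""

    def __init__(self, colliding):
        self.colliding = set(colliding)
        self.stored = []

    def content_add(self, contents):
        if any(c in self.colliding for c in contents):
            raise HashCollision("batch rejected")
        self.stored.extend(contents)


class BufferProxy:
    """Toy buffering proxy whose pending state can be dropped on demand."""

    def __init__(self, storage, threshold=2):
        self.storage = storage
        self.threshold = threshold
        self.buffer = []

    def content_add(self, contents):
        self.buffer.extend(contents)
        if len(self.buffer) >= self.threshold:
            self.flush()

    def flush(self):
        self.storage.content_add(self.buffer)
        self.buffer = []

    def clear_buffers(self):
        """Drop pending contents without writing them (hypothetical method)."""
        self.buffer = []


def load_artifacts(proxy, artifacts):
    """Mimic the loader: on failure, clear the proxy state, then go on."""
    sizes = []
    for contents in artifacts:
        try:
            proxy.content_add(contents)
        except HashCollision:
            proxy.clear_buffers()
        sizes.append(len(proxy.buffer))
    return sizes


storage = FakeStorage(colliding={"bad"})
proxy = BufferProxy(storage)
sizes = load_artifacts(proxy, [["a", "bad"], ["b", "c"], ["d", "e"]])
print(sizes)           # [0, 0, 0]
print(storage.stored)  # ['b', 'c', 'd', 'e']
```

The trade-off in this sketch is that everything pending at the time of the failure is dropped along with the colliding content ("a" is never stored), but later artifacts are no longer blocked and memory stays bounded.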

In the meantime (pending reviews), a new load was triggered without the proxy storage and all went fine.

Apr 08 17:12:43 worker0 python3[30323]: [2020-04-08 17:12:43,745: INFO/ForkPoolWorker-1] Task swh.loader.package.nixguix.tasks.LoadNixguix[f31b14bf-875f-42e0-8b2b-ddae39113b42] succeeded in 90226.9008241836s: {'status': 'eventful', 'snapshot_id': '3a810cbc026b7bc147b535392babbda85942f906'}
ardumont renamed this task from "nixguix: Fails to finish" to "nixguix: Fails to finish as it's stuck in a loop up to memory error". Apr 11 2020, 11:51 AM
ardumont changed the task status from Open to Work in Progress. Apr 14 2020, 6:15 PM

while not finished, run is still happy so far

and it finished alright \m/

Apr 16 17:25:04 worker0 python3[15775]: [2020-04-16 17:25:04,091: INFO/ForkPoolWorker-1] Task swh.loader.package.nixguix.tasks.LoadNixguix[7f06250a-7606-475c-8565-9fe0ace6c69c] succeeded in 103331.24047457986s: {'status': 'eventful', 'snapshot_id': 'd820451681c74eec63693b6ea4e4b8417c76bb7a'}

Run with the fixed storage (fewer false hash collisions), the proxy storage (whose state is cleared in case of errors), and the fixed loader (which consistently clears its internal buffer in case of errors, and definitely sets a timeout on download operations):