
2018-02-16 worker disk full postmortem
Closed, Migrated

Description

The root filesystem on worker13 is full and the host cannot process Mercurial data anymore.

The culprit seems to be the Mercurial loader itself.
/tmp (part of the / filesystem) contains huge files and directories:

12G     sqldict6a19bb
17G     sqldict89fc16
786M    sqldict8f09fa
634M    sqldictc9c674
4.0K    swh.loader.deposit
14M     swh.loader.mercurial.10o7dn_1
4.0M    swh.loader.mercurial.31j86o5f
4.0K    swh.loader.mercurial.35gljinc
24M     swh.loader.mercurial.4p1f_0o0
52K     swh.loader.mercurial.d0y8x43e
50M     swh.loader.mercurial.dzr2_6_e
135M    swh.loader.mercurial.e1_1ejhy
100M    swh.loader.mercurial.g5d8i9y2
15M     swh.loader.mercurial.h_a2csx2
64M     swh.loader.mercurial.jz9z34ow
7.5M    swh.loader.mercurial.lcpqso3x
128M    swh.loader.mercurial.mjp_zken
919M    swh.loader.mercurial.mzlrf1cf
27M     swh.loader.mercurial.o4mn1ran
64M     swh.loader.mercurial.ox9etasm
64M     swh.loader.mercurial.sgyhmxh3
160K    swh.loader.mercurial.tk3ozyv2
35M     swh.loader.mercurial.wd3f3t4i
26M     swh.loader.mercurial.yo2tfris
12K     tmp7w5wd0ar
588K    tmpiary52zt
588K    tmpmiis9pv7
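
A listing like the one above can be reproduced with a short script. This is only a sketch: the function names are made up for illustration, and it sums apparent file sizes rather than allocated disk blocks, so the numbers may differ somewhat from what `du` reports.

```python
import os


def tree_size(path):
    """Total apparent size in bytes of the files under path (symlinks not followed)."""
    if os.path.islink(path) or not os.path.isdir(path):
        return os.lstat(path).st_size
    total = 0
    # Unreadable subdirectories are skipped silently instead of aborting the walk.
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_entries(directory, limit=10):
    """Return up to `limit` (size_in_bytes, name) pairs, largest first."""
    sizes = [
        (tree_size(os.path.join(directory, name)), name)
        for name in os.listdir(directory)
    ]
    return sorted(sizes, reverse=True)[:limit]
```

Calling `largest_entries("/tmp")` on a worker would list the biggest entries first, which is usually enough to spot leftovers such as the sqldict files.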

Event Timeline

> cannot process Mercurial data anymore.

Well, not only Mercurial: anything else that needs disk space to work is affected too (the svn loader comes to mind easily, but it is not limited to that).

> 14M swh.loader.mercurial.10o7dn_1

Those folders are expected, although they should be cleaned up once the load is done.

They host the repository uncompressed from the archive.

> 12G sqldict6a19bb
> 17G sqldict89fc16
> 786M sqldict8f09fa
> 634M sqldictc9c674

That must be the disk-based secondary storage cache used within the mercurial loader, the one @fiendish referred to.

This is quite large indeed!

I did not expect them at that location though.

So, I have now made the deposit/svn/mercurial loaders use a private /tmp mount.
Next time this happens, we should only need to stop or restart the service.
That will clean up the mess left behind by this kind of situation.

doc: https://www.freedesktop.org/software/systemd/man/systemd.exec.html#PrivateTmp=
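
For illustration, here is a minimal sketch of what enabling this looks like, written as a small Python helper that renders a systemd drop-in. `PrivateTmp=` is the directive documented at the link above; the helper name `render_dropin` and any drop-in file name are assumptions made for this example, not the actual deployment code.

```python
import configparser
import io


def render_dropin(options):
    """Render a [Service] drop-in for a systemd unit from a dict of directives."""
    parser = configparser.ConfigParser()
    # Keep directive names case-sensitive (PrivateTmp, not privatetmp).
    parser.optionxform = str
    parser["Service"] = options
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


# Prints a [Service] section containing the single line PrivateTmp=true.
print(render_dropin({"PrivateTmp": "true"}))
```

The rendered text would go into a drop-in file under the unit's `.service.d/` directory (the file name is up to the administrator), followed by a `systemctl daemon-reload` and a restart of the service.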

Now, to keep the OOM killer from kicking in too late and killing other services as well, I thought of a quota on memory usage (systemd permits this).

The memory limit alone won't be enough to keep the mercurial/svn loaders from taking down other stuff, though.

For example, the mercurial loader falls back to a secondary disk cache once memory usage reaches a certain point, so that cache could grow a lot.
Another example is the svn loader on a large repository, which could take up a lot of disk space as well.
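
To make the first example concrete, here is a rough sketch of a dictionary that spills to disk once it holds too many items. It is not the loader's actual cache: the class name `SpillDict`, the item-count threshold (the real trigger is described above as memory-based), and the pickle/SQLite layout are all assumptions. The `sqldict` file prefix merely mirrors the names seen in the /tmp listing.

```python
import os
import pickle
import sqlite3
import tempfile


class SpillDict:
    """Mapping that keeps at most `max_in_memory` items in RAM and spills the rest to SQLite."""

    def __init__(self, max_in_memory=1000, directory=None):
        self.max_in_memory = max_in_memory
        self.memory = {}
        # The on-disk part lives in a temporary file (under /tmp by default).
        fd, self.path = tempfile.mkstemp(prefix="sqldict", dir=directory)
        os.close(fd)
        self.db = sqlite3.connect(self.path)
        self.db.execute("CREATE TABLE items (key BLOB PRIMARY KEY, value BLOB)")

    def __setitem__(self, key, value):
        if key in self.memory or len(self.memory) < self.max_in_memory:
            self.memory[key] = value
        else:
            # Memory budget exhausted: everything else goes to disk.
            self.db.execute(
                "INSERT OR REPLACE INTO items VALUES (?, ?)",
                (pickle.dumps(key), pickle.dumps(value)),
            )

    def __getitem__(self, key):
        if key in self.memory:
            return self.memory[key]
        row = self.db.execute(
            "SELECT value FROM items WHERE key = ?", (pickle.dumps(key),)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(row[0])

    def close(self):
        """Drop the on-disk cache; never reached if the process is killed first."""
        self.db.close()
        os.unlink(self.path)
```

With this kind of design the disk file grows with the repository being processed, and it only disappears if `close()` gets a chance to run.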

I did not find a way to limit the disk usage (from systemd at least).

In the meantime, I'll reschedule, filtering out the huge ones (archive size > 200 MiB).
I'll keep a listing of those not rescheduled.
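
As a sketch of that filtering step (the function name and the input shape, a list of `(path, size_in_bytes)` pairs, are assumptions made for the example):

```python
MAX_ARCHIVE_SIZE = 200 * 1024 * 1024  # 200 MiB, the threshold mentioned above


def split_by_size(archives, limit=MAX_ARCHIVE_SIZE):
    """Split (path, size_in_bytes) pairs into (to_reschedule, skipped) path lists."""
    to_reschedule, skipped = [], []
    for path, size in archives:
        # Only archives strictly larger than the limit are held back.
        (skipped if size > limit else to_reschedule).append(path)
    return to_reschedule, skipped
```

The second list is the listing of archives that were not rescheduled and can be kept for later.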

> In the meantime, I'll reschedule, filtering out the huge ones (archive size > 200 MiB).

Well, first, I tried to reduce the disk usage (cf. commits just before that comment).

And second, there is other stuff to clean up anyway (T976).

If cache files are sticking around, then of course the code should make sure that they go away when done or aborted. But I think that a few G used during processing of extremely large repos should be acceptable. :/

> If cache files are sticking around, then of course the code should make sure that they go away when done

It does...

> or aborted.

...up to the point where they cannot (OOM killed). This is the current predicament.
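
The distinction can be sketched as follows. This is an illustration rather than the loader's code: the helper name and the `swh.loader.example.` prefix are made up. A context manager removes the working directory on normal completion and on exceptions, but the OOM killer sends SIGKILL, which a process cannot catch, so no cleanup code runs in that case.

```python
import tempfile


def load_with_cleanup(work):
    """Run work(tmpdir) in a fresh temporary directory that is removed afterwards."""
    # The directory is deleted when the block exits, whether work() returns
    # or raises. Nothing here runs if the process receives SIGKILL.
    with tempfile.TemporaryDirectory(prefix="swh.loader.example.") as tmpdir:
        return work(tmpdir)
```

This is why some cleanup has to happen from outside the killed process, for instance at service restart or by another loader.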

> But I think that a few G used during processing of extremely large repos should be acceptable. :/

It is.

It's just that we need to find a proper way to also clean up from outside the loaders when that situation arises.

The thing is, we are currently in a position where our workers are doing multiple things.

Wrapping up:

  • Loaders (swh-worker@swh_loader_{something}.service) are now part of a systemd slice that limits their memory usage (up to 90%). [1]
  • Loaders can now use a /tmp dedicated to their systemd service, so restarting the service automatically cleans that /tmp. This is activated for the svn, mercurial and deposit loaders. [2]
  • Loaders of the same type can clean up after one another (if some were killed and did not have time to finish their job). [3]
  • Relatedly, loaders now deal properly with a failing prepare phase (previously they neither cleaned up properly nor updated the visit status). [4]
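
One way the sibling cleanup could work is sketched below. It is not taken from the T982 changes. It assumes, hypothetically, that each loader names its working directory `<prefix><pid>.<random suffix>`, so another loader of the same type can tell whether the owner is still alive. The function names are invented for the example, and pid reuse can make a dead owner look alive.

```python
import os
import re
import shutil


def pid_is_alive(pid):
    """Return True if a process with this pid currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_dangling_dirs(root, prefix, is_alive=pid_is_alive):
    """Remove directories named '<prefix><pid>.<suffix>' under root whose pid is dead."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)\.")
    removed = []
    for name in sorted(os.listdir(root)):
        match = pattern.match(name)
        path = os.path.join(root, name)
        if match and os.path.isdir(path) and not is_alive(int(match.group(1))):
            shutil.rmtree(path, ignore_errors=True)
            removed.append(name)
    return removed
```

A loader would call something like `remove_dangling_dirs("/tmp", "swh.loader.mercurial.")` before starting its own work, leaving directories of live siblings and of other loader types untouched.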

[1] 730c7245f4a89866a97181b2af49808967cbba91
[2] 5092232d7750805734b79d3753c99db3d8f53d10
[3] T982
[4] T982