Description

The root filesystem on worker13 is full and the host cannot process Mercurial data anymore.

The culprit seems to be the Mercurial loader itself.
/tmp (part of the / filesystem) contains huge files and directories:

12G     sqldict6a19bb
17G     sqldict89fc16
786M    sqldict8f09fa
634M    sqldictc9c674
4.0K    swh.loader.deposit
14M     swh.loader.mercurial.10o7dn_1
4.0M    swh.loader.mercurial.31j86o5f
4.0K    swh.loader.mercurial.35gljinc
24M     swh.loader.mercurial.4p1f_0o0
52K     swh.loader.mercurial.d0y8x43e
50M     swh.loader.mercurial.dzr2_6_e
135M    swh.loader.mercurial.e1_1ejhy
100M    swh.loader.mercurial.g5d8i9y2
15M     swh.loader.mercurial.h_a2csx2
64M     swh.loader.mercurial.jz9z34ow
7.5M    swh.loader.mercurial.lcpqso3x
128M    swh.loader.mercurial.mjp_zken
919M    swh.loader.mercurial.mzlrf1cf
27M     swh.loader.mercurial.o4mn1ran
64M     swh.loader.mercurial.ox9etasm
64M     swh.loader.mercurial.sgyhmxh3
160K    swh.loader.mercurial.tk3ozyv2
35M     swh.loader.mercurial.wd3f3t4i
26M     swh.loader.mercurial.yo2tfris
12K     tmp7w5wd0ar
588K    tmpiary52zt
588K    tmpmiis9pv7
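
For reference, a listing like the one above can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it sums apparent file sizes rather than allocated disk blocks, so the numbers may differ slightly from what `du` reports.

```python
import os
from pathlib import Path


def tree_size(path: Path) -> int:
    """Total apparent size in bytes of a file or directory tree (symlinks not followed)."""
    if path.is_symlink() or path.is_file():
        return path.lstat().st_size
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return total


def largest_entries(directory: str, limit: int = 10) -> list[tuple[int, str]]:
    """Return the `limit` largest direct children of `directory`, biggest first."""
    sizes = [(tree_size(p), p.name) for p in Path(directory).iterdir()]
    return sorted(sizes, reverse=True)[:limit]
```

Calling `largest_entries("/tmp")` on a worker would surface the biggest offenders first.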

Related Objects

Status	Assigned	Task
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T367 ingest Google Code repositories
Migrated	gitlab-migration	T682 Ingest Google Code Mercurial repositories
Migrated	gitlab-migration	T561 ingest bitbucket (meta task)
Migrated	gitlab-migration	T593 ingest bitbucket hg/mercurial repositories
Migrated	gitlab-migration	T329 hg / mercurial loader
Migrated	gitlab-migration	T964 2018-02-16 worker disk full postmortem
Migrated	gitlab-migration	T982 failing worker consumes remaining tasks without processing them
Migrated	gitlab-migration	T985 loader*: Make prepare method resilient to error and origin visit status compliant

Event Timeline

ftigeot created this task. Feb 16 2018, 11:24 AM

ftigeot updated the task description.

cannot process Mercurial data anymore.

Well, not only Mercurial: anything else that needs disk space to work is affected too (the svn loader comes to mind, but it is not limited to that).

14M swh.loader.mercurial.10o7dn_1

Those folders are expected.
Well, they should have been cleaned up.

They hold the repository uncompressed from the archive.

12G sqldict6a19bb
17G sqldict89fc16
786M sqldict8f09fa
634M sqldictc9c674

That must be the disk-based secondary storage cache used from within the mercurial loader @fiendish referred to.

This is quite large indeed!

I did not expect them at that location though.
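
One way to keep such cache files from landing directly in /tmp is to create them inside the loader's own temporary working directory, so that removing the directory removes the cache too. A minimal sketch, assuming a SQLite-backed cache; the function name, file name and table layout are hypothetical and not taken from the loader's code:

```python
import os
import sqlite3
import tempfile


def load_with_private_workdir(base_tmp=None):
    """Create a working directory, put the disk cache inside it, clean up on exit."""
    # The prefix mirrors the directory names seen in the listing above.
    with tempfile.TemporaryDirectory(
        prefix="swh.loader.mercurial.", dir=base_tmp
    ) as workdir:
        # Hypothetical cache file name; the point is that it lives under workdir.
        cache_path = os.path.join(workdir, "cache.sqlite")
        conn = sqlite3.connect(cache_path)
        try:
            conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value BLOB)")
            conn.execute("INSERT INTO kv VALUES (?, ?)", ("rev", b"data"))
            conn.commit()
            existed = os.path.exists(cache_path)
        finally:
            conn.close()
    # Leaving the `with` block removed workdir and the cache file with it.
    return workdir, cache_path, existed
```

Note that this only helps on a normal exit or an exception. If the process is killed outright (for instance by the OOM killer), nothing runs and the directory stays behind.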

ardumont mentioned this in rDLDHG53566e6460c5: bundle20_loader: Set cache disk within loader's temporary workdir. Feb 16 2018, 2:29 PM

ardumont mentioned this in rSPSITEf683ce8ad10c: deploy/worker: svn/mercurial loader: Use specific /tmp folder. Feb 16 2018, 7:07 PM

ardumont mentioned this in rSPPROF5092232d7750: deploy/worker: Permit to use specific /tmp folder for worker service.

ardumont mentioned this in rSPPROF83c922dbb76f: deploy/worker: Fix to use the right parameter name. Feb 16 2018, 7:10 PM

ardumont mentioned this in rSPSITEe1dfccc9e67f: deploy/worker: deposit loader: Use specific /tmp folder. Feb 16 2018, 7:18 PM

ardumont mentioned this in rSPPROFee4b29647cb4: deploy/worker: deposit loader: Use specific /tmp folder. Feb 16 2018, 7:22 PM

So, I've now made the deposit/svn/mercurial loaders use a private /tmp mount.
Next time this happens, we should only need to restart or stop the service.
Doing so will clean up whatever mess was left behind.

doc: https://www.freedesktop.org/software/systemd/man/systemd.exec.html#PrivateTmp=
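
For illustration, the kind of systemd drop-in this relies on can be sketched as a small Python helper that renders it. The drop-in file name and the helper are hypothetical; the actual change went through the deployment commits referenced above, which are not reproduced here. `PrivateTmp=` itself is the documented systemd.exec option linked above.

```python
from pathlib import PurePosixPath


def private_tmp_dropin(unit: str) -> tuple[PurePosixPath, str]:
    """Return the (path, content) of a drop-in enabling PrivateTmp for `unit`."""
    # Drop-ins for a unit live in /etc/systemd/system/<unit>.d/*.conf.
    # The file name "private-tmp.conf" is an arbitrary choice for this sketch.
    path = PurePosixPath("/etc/systemd/system") / f"{unit}.d" / "private-tmp.conf"
    # With PrivateTmp=true the service gets its own /tmp and /var/tmp,
    # which systemd removes when the service is stopped.
    content = "[Service]\nPrivateTmp=true\n"
    return path, content
```

After writing such a file, `systemctl daemon-reload` followed by a restart of the unit is needed for it to take effect.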

Now, to keep the OOM killer from kicking in too late and killing other services as well, I thought of putting a quota on memory usage (systemd permits this).

A memory limit alone won't be enough to stop the mercurial/svn loaders from taking down other things, though.

For example, the mercurial loader uses a secondary disk cache once memory usage reaches a certain point, so disk usage could grow a lot.
Another example is the svn loader on a large repository, which could also grow a lot on disk.
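
To make the trade-off concrete, here is a minimal sketch of a cache that keeps entries in memory up to a threshold and spills the rest to disk. It is not the loader's actual cache: the class name is invented, the threshold counts items rather than bytes, and SQLite plus pickle are assumptions chosen to keep the example self-contained.

```python
import os
import pickle
import sqlite3
import tempfile


class SpillingCache:
    """Keep up to `max_in_memory` items in a dict; store the rest in SQLite."""

    def __init__(self, directory: str, max_in_memory: int = 1000):
        self._mem = {}
        self._max = max_in_memory
        self.path = os.path.join(directory, "cache.sqlite")
        self._db = sqlite3.connect(self.path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)"
        )

    def __setitem__(self, key: str, value) -> None:
        if key in self._mem or len(self._mem) < self._max:
            self._mem[key] = value
        else:
            # Memory budget exhausted: this is where disk usage starts growing.
            self._db.execute(
                "INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, pickle.dumps(value))
            )

    def __getitem__(self, key: str):
        if key in self._mem:
            return self._mem[key]
        row = self._db.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(row[0])

    def close(self) -> None:
        """Close the database and delete the on-disk file."""
        self._db.close()
        if os.path.exists(self.path):
            os.remove(self.path)
```

A memory quota bounds the in-memory dict only; the on-disk part keeps growing with the repository, which is why the disk has to be handled separately.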

I did not find a way to limit the disk usage (from systemd at least).

In the meantime, I'll reschedule the tasks, filtering out the huge ones (archive size > 200 MiB).
I'll keep a listing of those not rescheduled.
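
The filtering itself is simple. A sketch, assuming each task is described by an (archive name, size in bytes) pair, which is a made-up shape and not the scheduler's real one:

```python
MIB = 1024 * 1024
MAX_ARCHIVE_SIZE = 200 * MIB  # threshold mentioned above


def split_by_size(archives, limit=MAX_ARCHIVE_SIZE):
    """Split (name, size_in_bytes) pairs into (to_reschedule, skipped) name lists."""
    to_reschedule, skipped = [], []
    for name, size in archives:
        if size > limit:
            skipped.append(name)  # kept aside so it can be listed later
        else:
            to_reschedule.append(name)
    return to_reschedule, skipped
```

The second list is the "listing of those not rescheduled".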

ardumont mentioned this in T329: hg / mercurial loader. Feb 20 2018, 12:09 PM

ardumont mentioned this in rDLDHG4f5a16cc48db: bundle20_loader: Configure temporary working directory. Feb 20 2018, 2:09 PM

ardumont mentioned this in rDLDHG6e12c90b160a: bundle20_loader: Remove one call to load_directories.

ardumont mentioned this in rDLDHG1c47479d13c4: bundle20_loader: Opening cache size parameter.

ardumont mentioned this in rDLDHGa8087b5293df: bundle20_loader: Add swh specific name to cache and table names.

ardumont mentioned this in rDLDHGfd79b19a97c9: objects: Add custom serialization to use less disk space.

In the meantime, I'll reschedule the tasks, filtering out the huge ones (archive size > 200 MiB).

Well, first, I tried to reduce the disk usage (cf. the commits just before this comment).

And second, there is other stuff to clean up anyway (T976).

ardumont mentioned this in T965: googlecode import: Analyze and fix errors. Feb 20 2018, 4:49 PM

ardumont mentioned this in rDLDHG881814e9a432: bundle20_loader: Set cache disk within loader's temporary workdir. Feb 20 2018, 8:04 PM

ardumont mentioned this in rDLDHGf308be477d71: bundle20_loader: Add memory configuration cache for both.

If cache files are sticking around, then of course the code should make sure that they go away when done or aborted. But I think that a few G used during processing of extremely large repos should be acceptable. :/

If cache files are sticking around, then of course the code should make sure that they go away when done

It does...

or aborted.

...up to the point where it cannot (OOM killed). This is the current predicament.

But I think that a few G used during processing of extremely large repos should be acceptable. :/

It is.

It's just that we need to find a proper way to also clean up outside the loaders when that situation arises.

The thing is, we are currently in a position where our workers are doing multiple things
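
Cleaning up outside the loaders could look like the sketch below: a separate pass that removes leftover working directories by prefix once they are older than some age. The function name and the age heuristic are assumptions, not what was deployed. A directory's mtime only changes when its direct entries change, so a long-running loader could be misjudged as stale; a real version would need a safer liveness check.

```python
import os
import shutil
import time


def remove_stale_workdirs(tmp_root, prefix, max_age_seconds, now=None):
    """Remove entries of `tmp_root` named `prefix*` and older than `max_age_seconds`.

    Returns the sorted names of the removed entries.
    """
    now = time.time() if now is None else now
    with os.scandir(tmp_root) as it:
        entries = list(it)  # snapshot, since we delete while processing
    removed = []
    for entry in entries:
        if not entry.name.startswith(prefix):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age < max_age_seconds:
            continue  # recent: may belong to a loader that is still running
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path, ignore_errors=True)
        else:
            os.remove(entry.path)
        removed.append(entry.name)
    return sorted(removed)
```

For example, `remove_stale_workdirs("/tmp", "swh.loader.mercurial.", 24 * 3600)` would drop Mercurial loader directories untouched for a day.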

ardumont mentioned this in T982: failing worker consumes remaining tasks without processing them. Mar 5 2018, 11:41 AM

ardumont mentioned this in rDLDBASE7aa7ed927bff: core/loader: Move prepare method call within try except clause. Mar 7 2018, 11:39 AM

ardumont closed subtask T985: loader*: Make prepare method resilient to error and origin visit status compliant as Resolved. Mar 7 2018, 12:51 PM

ardumont closed subtask T982: failing worker consumes remaining tasks without processing them as Resolved. Mar 9 2018, 11:17 AM

Wrapping up:

Loaders (swh-worker@swh_loader_{something}.service) are now part of a systemd slice that limits their memory usage (up to 90%). [1]
Loaders can now use a /tmp dedicated to their systemd service, so that restarting the service automatically cleans that /tmp. This is activated for the svn, mercurial and deposit loaders. [2]
Loaders of the same type can clean up after one another (if some were killed and did not have time to finish their job). [3]
Relatedly, loaders now deal properly with the prepare phase failing (previously they neither cleaned up properly nor updated the visit status). [4]

[1] 730c7245f4a89866a97181b2af49808967cbba91
[2] 5092232d7750805734b79d3753c99db3d8f53d10
[3] T982
[4] T985
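
The last point of the wrap-up boils down to calling prepare inside the same try block as the rest of the load, so that a failure there still updates the visit status and still triggers cleanup. A simplified sketch: the class, method names and status strings are illustrative and do not reproduce the real loader base class.

```python
class SketchLoader:
    """Toy loader showing prepare() handled inside the try block."""

    def __init__(self):
        self.visit_status = None
        self.cleaned = False

    def prepare(self):
        """Set up working directories, unpack archives, etc. May raise."""

    def fetch_and_store(self):
        """Do the actual loading. May raise."""

    def cleanup(self):
        """Remove temporary files; must be safe after a failed prepare()."""
        self.cleaned = True

    def load(self):
        try:
            # prepare() sits inside the try block, so an error here is
            # handled like any other loading error.
            self.prepare()
            self.fetch_and_store()
        except Exception:
            self.visit_status = "partial"
            return {"status": "failed"}
        else:
            self.visit_status = "full"
            return {"status": "eventful"}
        finally:
            # Runs on success and on failure alike.
            self.cleanup()
```

This does not cover a loader killed by the OOM killer, since no Python code runs in that case; that is what the dedicated /tmp and the cleanup between loaders of the same type are for.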

ardumont mentioned this in rSPSITE5092232d7750: deploy/worker: Permit to use specific /tmp folder for worker service. Jun 15 2018, 2:30 PM

ardumont mentioned this in rSPSITE83c922dbb76f: deploy/worker: Fix to use the right parameter name.

ardumont mentioned this in rSPSITEee4b29647cb4: deploy/worker: deposit loader: Use specific /tmp folder.

This task has been migrated to GitLab.

gitlab-migration changed the status of subtask T982: failing worker consumes remaining tasks without processing them from Resolved to Migrated. Jan 8 2023, 4:24 PM

gitlab-migration changed the status of subtask T985: loader*: Make prepare method resilient to error and origin visit status compliant from Resolved to Migrated.

2018-02-16 worker disk full postmortem
Closed, Migrated
