failing worker consumes remaining tasks without processing them
Closed, Migrated

Authored By

	ardumont
	Mar 5 2018, 11:41 AM

Description

What is implicit in the parent task T964:

a worker that runs out of RAM is killed (it cannot clean up after itself since it is killed)
the node running that worker is then mostly idle for that particular work (compared to its sister nodes)
so it starts consuming the queue faster than the other workers (since they do actual work)
and fails faster
resulting in an empty queue in the end

That is what I was trying to solve in T964 (well, for the moment, finding a proper solution to implement).

As I realized it was not explicitly mentioned there, I am opening a dedicated issue for it.
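
To make the sequence above concrete, here is a toy simulation. It is only a sketch and is not taken from the actual deployment: the function name simulate_queue, the costs and the worker counts are all made up for illustration. One broken worker fails every task almost instantly, while healthy workers spend real time on each task, and all of them pull from one shared queue.

```python
import heapq


def simulate_queue(n_tasks, healthy_workers, task_cost=10.0, fail_cost=0.1):
    """Toy model of one shared queue (hypothetical numbers).

    Worker 0 is broken and fails each task after fail_cost time units.
    Every other worker spends task_cost time units per task.
    Returns (processed, failed).
    """
    # Heap of (time at which the worker is free again, worker id).
    free_at = [(0.0, wid) for wid in range(healthy_workers + 1)]
    heapq.heapify(free_at)
    processed = failed = 0
    for _ in range(n_tasks):
        # The next task goes to whichever worker becomes free first.
        now, wid = heapq.heappop(free_at)
        if wid == 0:
            failed += 1
            heapq.heappush(free_at, (now + fail_cost, wid))
        else:
            processed += 1
            heapq.heappush(free_at, (now + task_cost, wid))
    return processed, failed


if __name__ == "__main__":
    processed, failed = simulate_queue(1000, healthy_workers=3)
    print("processed:", processed, "failed:", failed)
```

Under these assumed costs, the broken worker ends up taking the large majority of the tasks, which is the "empty queue in the end" effect described above.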

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T367 ingest Google Code repositories
Migrated	gitlab-migration	T682 Ingest Google Code Mercurial repositories
Migrated	gitlab-migration	T561 ingest bitbucket (meta task)
Migrated	gitlab-migration	T593 ingest bitbucket hg/mercurial repositories
Migrated	gitlab-migration	T329 hg / mercurial loader
Migrated	gitlab-migration	T964 2018-02-16 worker disk full postmortem
Migrated	gitlab-migration	T982 failing worker consumes remaining tasks without processing them

Event Timeline

ardumont triaged this task as Normal priority. Mar 5 2018, 11:41 AM

ardumont created this task.

So far, the solutions I foresee:

|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| Possible solution               | Pros                                  | Cons                            | Question                           |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 1. Capture OOM event -> force a | - separation of concern (code)        |                                 | - is it possible?                  |
| cleanup (somehow)               |                                       |                                 | -> check syslog for killed process |
|                                 |                                       |                                 | -> listen to dedicated event       |
|                                 |                                       |                                 | (per worker)                       |
|                                 |                                       |                                 | -> mem_notify                      |
|                                 |                                       |                                 | https://lwn.net/Articles/267013    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 2. Loader starts -> post a      | - separation of concern (code)        | - edge case can go bad,         | how to determine:                  |
| cleanup message to a queue      |                                       | generating quite some messages  | - which worker to clean?           |
| (as soon as it can)             |                                       | in queue until the current      | -> worker-name as task message     |
|                                 |                                       | clean up takes place            | parameter                          |
|                                 |                                       |                                 | - when to clean?                   |
|                                 |                                       |                                 | -> pid alive check                 |
|                                 |                                       |                                 | (python3-psutil)                   |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 3. Loader starts -> check for   | - short-term: pragmatic (current      | - Separation of concern not     |                                    |
| dangling files in a specific    | state)                                | respected (doing much more than |                                    |
| location                        | - can be plugged in loader-core       | loading)                        |                                    |
|                                 | (all workers benefit)                 | - must implement this           |                                    |
|                                 | - simpler to implement                | potentially on other layers     |                                    |
|                                 |                                       | (lister, etc.)                  |                                    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 4. Worker (node?) supervisor    | - separation of concern               |                                 | 1. How to detect faulty behavior:  |
|                                 | - long-term solution                  |                                 | -> checks per worker type:         |
|                                 |                                       |                                 | - disk status                      |
|                                 |                                       |                                 | - ram usage                        |
|                                 |                                       |                                 | - ~task consuming rate             |
|                                 |                                       |                                 | 2. existing technology or ad-hoc?  |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 5. Separate nodes for           | - More control over the resource      | - The main issue could still    |                                    |
| specialized workers             | allocation (e.g. more disk for svn    | happen (but probably less       |                                    |
|                                 | and hg workers, less for git ones...) | often though)                   |                                    |
|                                 | - separation of concern (system       | - push the problem at           |                                    |
|                                 | level now)                            | system/provision/deployment     |                                    |
|                                 | - long-term solution                  | level                           |                                    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|

Note: org-mode table

Still assessing the possibilities (and reading documentation):

|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| Possible solution               | Pros                                  | Cons                            | Question                           |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 6. Check before scheduling task | - separation of concern               |                                 | How to check node's state?         |
| (same way scheduling is sent to |                                       |                                 |                                    |
| queue or not depending on size) |                                       |                                 |                                    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 7. Chain temporary folder       | - separation of concern (code)        | Same as 2.                      | pre-requisite:                     |
| step creation task + loading    | - can be applied to all loaders, ...  |                                 | force chaining to execute on       |
| task using temporary folder +   |                                       |                                 | the same worker (chains [2])       |
| cleaning up task (independent   |                                       |                                 |                                    |
| from loading task result)       |                                       |                                 |                                    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|

[2] http://docs.celeryproject.org/en/latest/userguide/canvas.html#canvas-chain
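
For the open question in option 6 ("How to check node's state?"), one possible local probe is sketched below. This is an assumption on my part, not something decided in this task: the name node_can_accept_task and the threshold parameter are hypothetical, and the sketch only covers the disk status check run on the node itself (ram usage and task consuming rate are not covered, nor is how the scheduler would query the node).

```python
import shutil


def node_can_accept_task(path, min_free_bytes):
    """Return True if the filesystem holding path has enough free space.

    Hypothetical helper: min_free_bytes is a threshold to be chosen per
    worker type (svn and hg loaders need more disk than git ones).
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= min_free_bytes
```

A worker could call this on its temporary root before picking up a task, for instance node_can_accept_task("/tmp", 5 * 1024 ** 3) to require 5 GiB free (both values are placeholders).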

ardumont mentioned this in rDLDBASE7aa7ed927bff: core/loader: Move prepare method call within try except clause. Mar 7 2018, 11:39 AM

|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| Possible solution               | Pros                                  | Cons                            | Question                           |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 8. Celery signal on postrun     | - separation of concern               | Not working, in OOM kill        | Documentation [3]                  |
|                                 |                                       | scenario, task is killed        |                                    |
|                                 |                                       | (tested)                        |                                    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|


[3] http://docs.celeryproject.org/en/latest/userguide/signals.html#task-postrun

ardumont mentioned this in rDLDBASEd47ef858cbd9: loader/core: Add optional pre_cleanup for dangling files cleaning. Mar 8 2018, 5:31 PM

ardumont mentioned this in rDLDBASE92f575368241: loader/utils: Add clean_dangling_folders function to ease clean up.

ardumont mentioned this in rDLDSVN2093816c8716: svn/loader: Add pre-cleanup step to clean potential dangling folders.

ardumont mentioned this in rDLDSVNe262bfe7a91c: svn/loader: Refactor: Reuse loader.core.utils.clean_dangling_folders.

ardumont mentioned this in rDLDHG6780a6980ace: d/control: Bump to latest python3-swh.loader.core.

ardumont mentioned this in rDLDHGb8d287da6697: bundle20_loader: Add a pre-cleanup step to clean dangling files.

ardumont mentioned this in rDLDHGa56a9da6c3c0: mercurial/loader: Refactor: Reuse loader.core.utils.clean_dangling_files.

Well, after much digging through documentation and some tryouts, I finally gave in to solution 3.

So, the loaders can now implement a pre_cleanup method (defined in loader-core, where it does nothing by default) to try and clean up after their siblings of the same type (svn, mercurial).

The loaders already used temporary folders for their computations (the temporary location is now sandboxed at the systemd level, so no collision should happen between loaders).
They still use temporary folders, but those now follow a name pattern (swh.loader.{type}-{unique-noise}-{pid}).
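
As an illustration of the naming pattern above, a folder with such a name can be created with the standard library alone. This is a minimal sketch, not the loaders' actual code: the helper name make_working_dir and its parameters are hypothetical.

```python
import os
import tempfile


def make_working_dir(root, loader_type):
    """Create a folder named swh.loader.<type>-<noise>-<pid> under root.

    mkdtemp inserts its own random characters between prefix and suffix,
    which plays the role of the unique noise here.
    """
    return tempfile.mkdtemp(
        prefix="swh.loader.%s-" % loader_type,
        suffix="-%d" % os.getpid(),
        dir=root,
    )
```

Putting the pid last in the name is what later allows another task to tell whether the process that created the folder is still alive.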

Logic:
When waking up, a task checks for dangling folders in the root temporary location (set in the configuration).
For each folder, it checks whether the name matches the pattern (according to the loader type) and whether the pid exists:
If there is no folder, or nothing matches, or the name matches and the pid is alive, then do nothing and continue with loading
If the name matches and the pid does not exist, then clean up that folder and continue with loading
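
The logic above can be sketched as follows. This is not the code that landed in loader-core, only an illustration under stated assumptions: the function names are hypothetical, the pid check uses os.kill(pid, 0) from the standard library (POSIX only) rather than python3-psutil, and the folder names are assumed to follow swh.loader.{type}-{unique-noise}-{pid} exactly.

```python
import os
import re
import shutil


def pid_alive(pid):
    """Return True if a process with this pid currently exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def clean_dangling_sketch(root, loader_type):
    """Remove folders left under root by dead loaders of the given type.

    Returns the list of removed paths (hypothetical return value).
    """
    pattern = re.compile(
        r"^swh\.loader\.%s-.+-(\d+)$" % re.escape(loader_type))
    removed = []
    if not os.path.isdir(root):
        return removed  # no folder at all: nothing to do
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        match = pattern.match(name)
        if match is None or not os.path.isdir(path):
            continue  # name does not match: leave it alone
        if pid_alive(int(match.group(1))):
            continue  # owner still running: leave it alone
        shutil.rmtree(path, ignore_errors=True)  # owner is gone: clean up
        removed.append(path)
    return removed
```

One caveat of this approach: pids get reused by the system, so a dangling folder can survive a pass if its pid has since been given to an unrelated process. It would then be picked up on a later pass once that pid is free again.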

This has been implemented for svn and mercurial using a common method installed in the loader-core.

ardumont mentioned this in T964: 2018-02-16 worker disk full postmortem. Mar 9 2018, 2:00 PM

This task has been migrated to GitLab.
