Details

Reviewers

Group Reviewers

Commits

rDSCH2cbfb788fdf3: Add tests to in memory elasticsearch implementation
rDSCHe9d8a5f95c58: scheduler.backend: Rename appropriately module elasticsearch_memory
rDSCH38d17dec7042: tests/common: Remove uneeded behavior
rDSCHba5920dbdc8e: backend_es: Add tests around elasticsearch client instantiation
rDSCHac32b5eca189: backend: Add alternate memory elasticsearch implem to allow testing
rDSCH7b1c2d58fa33: scheduler.backend_es: Allow using different elasticsearch clients
rDSCHec207fbfe0bb: scheduler.backend: Make the returned result a dict
rDSCHf97bff6ff74c: cli.task: Make page_token actually a string even from the cli
rDSCHd8859d79f3a3: backend_es: Add initialization endpoint
rDSCH18df1240abee: cli.tasks: Unify logging instruction
rDSCHd5cea209ae8f: backend_es: Remove unused endpoint
rDSCHc5e189b55ca3: test: Allow status definition during task template generation
rDSCH844f3e0c67c4: tests.scheduler: Extract common utility function
rDSCH2d566699da5c: scheduler.cli.task: Rename appropriately backend variable
rDSCH793c2337ceca: scheduler.backend_es: Rename backend class appropriately
rDSCHd5bf6b1deae5: cli.task: Rename internal method appropriately
rDSCHb376eb9b0340: backend_es: Enclose close instruction within finally
rDSCHeb1c3d3b5731: backend_es: Use consistent logging instruction
rDSCHf6726e9c4cf6: backend_es: Create index when it does not exist
rDSCH305422b9d0d7: cli.task: Tasks needs to be sorted prior to group by call
rDSCHad54c6bab13c: backend_es: Open indices prior to indexing method calls
rDSCHd603608f0c1d: cli.task: Use the configuration provided by the cli

Summary

This fixes many issues with the current archive cli which prevented it from
working:

cli.task: Use the configuration provided by the cli

-> it was no longer using the config provided, so failing to initialize the elasticsearch client.

cli.task: Tasks needs to be sorted prior to group by call

-> otherwise the initial group by's point is moot

backend_es: Open indices prior to indexing method calls (and close it back when done)

-> in production, most old indices are closed. So we need to open them, do our bidding, then close.

backend_es: Create index when it does not exist

-> new indices need to be created since the last time the cli ran...

Depends on D2457
Related to T1931

Test Plan

prod: I tried the following diff and D2457 in prod and it's green
Add tests using an in-memory elasticsearch client (oriented towards storage, not search)

Diff Detail

Repository

rDSCH Scheduling utilities

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont created this revision.Dec 14 2019, 11:34 AM

Herald added a reviewer: Reviewers. · View Herald TranscriptDec 14 2019, 11:34 AM

Build is green
See https://jenkins.softwareheritage.org/job/DSCH/job/tox/325/ for more details.

Harbormaster completed remote builds in B9761: Diff 8672.Dec 14 2019, 11:36 AM

ardumont edited the summary of this revision. (Show Details)Dec 14 2019, 11:37 AM

ardumont mentioned this in T1931: scheduler's cron cleanup error when filtering tasks to archive.

ardumont added inline comments.Dec 14 2019, 11:42 AM

swh/scheduler/backend_es.py
151 ↗	(On Diff #8672)	I did not find a better way to check for the index's state... If you know better, by all means ;)

Rebase on latest D2457's equivalent git branch

Build is green
See https://jenkins.softwareheritage.org/job/DSCH/job/tox/327/ for more details.

Harbormaster completed remote builds in B9763: Diff 8674.Dec 14 2019, 2:33 PM

It should have tests, either by mocking SWHElasticSearchClient or the elasticsearch library itself.

And this will be an issue if two people run this command at the same time. If user A runs it, then B runs it, and A finishes before B and closes the index, then what happens to B?

swh/scheduler/backend_es.py
189–191 ↗	(On Diff #8674)	You may want to have this in a `finally:`.
swh/scheduler/cli/task.py
529	I know it's off-topic, but this function should just be named `get_index_name`... it doesn't do any grouping.

This revision now requires changes to proceed.Dec 14 2019, 5:06 PM

Rebase on latest diff

Adapt according to review:

backend_es: Enclose close instruction within finally
cli.task: Rename internal method appropriately

+ backend_es: Use consistent logging instruction

Build is green
See https://jenkins.softwareheritage.org/job/DSCH/job/tox/330/ for more details.

Harbormaster completed remote builds in B9766: Diff 8677.Dec 14 2019, 6:48 PM

It should have tests, either by mocking SWHElasticSearchClient or the elasticsearch library itself.

Yes, well, that's what i mentioned in the diff's description ;)

No unit test for now, i'd need to check with @vlorentz to reuse swh-search's elasticsearch fixtures

And this will be an issue if two people run this command at the same time. If user A runs it, then B runs it, and A finishes before B and closes the index, then what happens to B?

Yes, nothing enforces concurrency.

Today, it's running through one cron once a month (with no concurrency planned).

ardumont edited the test plan for this revision. (Show Details)Dec 14 2019, 6:52 PM

ardumont retitled this revision from scheduler.cli.task: Make `task archive` cli work to scheduler.cli.task: Make `task archive` actually archive scheduler's tasks.Dec 16 2019, 9:59 AM

In D2458#58002, @ardumont wrote:
It should have tests, either by mocking SWHElasticSearchClient or the elasticsearch library itself.

Yes, well, that's what i mentioned in the diff's description ;)
No unit test for now, i'd need to check with @vlorentz to reuse swh-search's elasticsearch fixtures

I said mocking (replacing the actual ElasticSearch with stub code), not fixtures (running an actual ES from pytest)

LGTM now, except for the missing tests

I said mocking (replacing the actual ElasticSearch with stub code), not fixtures (running an actual ES from pytest)

yes, i understood you .
In the end, given how it stopped working for months, it's not enough to mock though.

I think mocking is good enough. Sure it won't catch all errors, but that's the job of monitoring.

Running ES in tests would slow them down considerably, because ES takes some time to start. Not to mention it needs to be installed and takes a couple hundred MB.

Add tests around the task cli:

add in-memory elasticsearch implementation (oriented towards storage, not search). I still need to add tests on this part (right now it's tested through the other cogs) (This makes the test fast)

This also opens an initialize endpoint in the backend_es (which so far was installed manually)

Main test is:
- add tasks to the scheduler
- simulate a run (associate task_run to the task)
- disable some of those tasks
- trigger the 'task archive' cli
- ensure the tasks are removed from the scheduler

This also makes the page_token a string (which it was not yet)

Build is green
See https://jenkins.softwareheritage.org/job/DSCH/job/tox/337/ for more details.

Harbormaster completed remote builds in B9813: Diff 8726.Dec 17 2019, 11:08 AM

ardumont edited the test plan for this revision. (Show Details)Dec 17 2019, 11:38 AM

Sorry, I didn't realize how much new code was needed to write these mocks.

It looks good, but I am slightly confused about names. "backend_es_memory.py" does not contain a backend, in the the sense that "ElasticSearchBackend" means "elasticsearch backend of the scheduler", right?

Sorry, I didn't realize how much new code was needed to write these mocks.

heh ;)

It looks good, but I am slightly confused about names. "backend_es_memory.py" does not contain a backend, in the the sense that "ElasticSearchBackend" means "elasticsearch backend of the scheduler", right?

Right.

The confusion also bothers me.
I'll rename but i need to find that name first ;)

backend_es_memory is currently defining MemoryElasticsearch which is really an elasticsearch implementation oriented towards storage.
That implementation is used for the backend_es.ElasticSearchBackend (within test) as an elasticsearch (collaborator) client.
(For production, the expected collaborator client implementation used is the official one elasticsearch.Elasticsearch).

Maybe in some way, we can see the memory implementation as a client.
So swh.scheduler.es_client.py maybe?