Page MenuHomeSoftware Heritage

scheduler.cli.task: Make `task archive` actually archive scheduler's tasks
ClosedPublic

Authored by ardumont on Dec 14 2019, 11:34 AM.

Details

Reviewers
vlorentz
Group Reviewers
Reviewers
Commits
rDSCH2cbfb788fdf3: Add tests to in memory elasticsearch implementation
rDSCHe9d8a5f95c58: scheduler.backend: Rename appropriately module elasticsearch_memory
rDSCH38d17dec7042: tests/common: Remove uneeded behavior
rDSCHba5920dbdc8e: backend_es: Add tests around elasticsearch client instantiation
rDSCHac32b5eca189: backend: Add alternate memory elasticsearch implem to allow testing
rDSCH7b1c2d58fa33: scheduler.backend_es: Allow using different elasticsearch clients
rDSCHec207fbfe0bb: scheduler.backend: Make the returned result a dict
rDSCHf97bff6ff74c: cli.task: Make page_token actually a string even from the cli
rDSCHd8859d79f3a3: backend_es: Add initialization endpoint
rDSCH18df1240abee: cli.tasks: Unify logging instruction
rDSCHd5cea209ae8f: backend_es: Remove unused endpoint
rDSCHc5e189b55ca3: test: Allow status definition during task template generation
rDSCH844f3e0c67c4: tests.scheduler: Extract common utility function
rDSCH2d566699da5c: scheduler.cli.task: Rename appropriately backend variable
rDSCH793c2337ceca: scheduler.backend_es: Rename backend class appropriately
rDSCHd5bf6b1deae5: cli.task: Rename internal method appropriately
rDSCHb376eb9b0340: backend_es: Enclose close instruction within finally
rDSCHeb1c3d3b5731: backend_es: Use consistent logging instruction
rDSCHf6726e9c4cf6: backend_es: Create index when it does not exist
rDSCH305422b9d0d7: cli.task: Tasks needs to be sorted prior to group by call
rDSCHad54c6bab13c: backend_es: Open indices prior to indexing method calls
rDSCHd603608f0c1d: cli.task: Use the configuration provided by the cli
Summary

This fixes many issues with the current archive cli which prevented it from
working:

  • cli.task: Use the configuration provided by the cli

    -> it was no longer using the config provided, so failing to initialize the elasticsearch client.
  • cli.task: Tasks needs to be sorted prior to group by call

    -> otherwise the initial group by's point is moot
  • backend_es: Open indices prior to indexing method calls (and close it back when done)

    -> in production, most old indices are closed. So we need to open them, do our bidding, then close.
  • backend_es: Create index when it does not exist

    -> new indices need to be created since the last time the cli ran...

Depends on D2457
Related to T1931

Test Plan
  • prod: I tried the following diff and D2457 in prod and it's green
  • Add tests using an in-memory elasticsearch client (oriented towards storage, not search)

Diff Detail

Repository
rDSCH Scheduling utilities
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

swh/scheduler/backend_es.py
151 ↗(On Diff #8672)

I did not find a better way to check for the index's state...
If you know better, by all means ;)

Rebase on latest D2457's equivalent git branch

It should have tests, either by mocking SWHElasticSearchClient or the elasticsearch library itself.

And this will be an issue if two people run this command at the same time. If user A runs it, then B runs it, and A finishes before B and closes the index, then what happens to B?

swh/scheduler/backend_es.py
189–191 ↗(On Diff #8674)

You may want to have this in a finally:.

swh/scheduler/cli/task.py
529

I know it's off-topic, but this function should just be named get_index_name... it doesn't do any grouping.

This revision now requires changes to proceed.Dec 14 2019, 5:06 PM

Rebase on latest diff

Adapt according to review:

  • backend_es: Enclose close instruction within finally
  • cli.task: Rename internal method appropriately

+ backend_es: Use consistent logging instruction

It should have tests, either by mocking SWHElasticSearchClient or the elasticsearch library itself.

Yes, well, that's what i mentioned in the diff's description ;)

No unit test for now, i'd need to check with @vlorentz to reuse swh-search's elasticsearch fixtures

And this will be an issue if two people run this command at the same time. If user A runs it, then B runs it, and A finishes before B and closes the index, then what happens to B?

Yes, nothing enforces concurrency.

Today, it's running through one cron once a month (with no concurrency planned).

ardumont retitled this revision from scheduler.cli.task: Make `task archive` cli work to scheduler.cli.task: Make `task archive` actually archive scheduler's tasks.Dec 16 2019, 9:59 AM

It should have tests, either by mocking SWHElasticSearchClient or the elasticsearch library itself.

Yes, well, that's what i mentioned in the diff's description ;)

No unit test for now, i'd need to check with @vlorentz to reuse swh-search's elasticsearch fixtures

I said mocking (replacing the actual ElasticSearch with stub code), not fixtures (running an actual ES from pytest)

LGTM now, except for the missing tests

I said mocking (replacing the actual ElasticSearch with stub code), not fixtures (running an actual ES from pytest)

yes, i understood you .
In the end, given how it stopped working for months, it's not enough to mock though.

I think mocking is good enough. Sure it won't catch all errors, but that's the job of monitoring.

Running ES in tests would slow them down considerably, because ES takes some time to start. Not to mention it needs to be installed and takes a couple hundred MB.

Add tests around the task cli:

  • add in-memory elasticsearch implementation (oriented towards storage, not search). I still need to add tests on this part (right now it's tested through the other cogs) (This makes the test fast)
  • This also opens an initialize endpoint in the backend_es (which so far was installed manually)
  • Main test is:
    • add tasks to the scheduler
    • simulate a run (associate task_run to the task)
    • disable some of those tasks
    • trigger the 'task archive' cli
    • ensure the tasks are removed from the scheduler
  • This also makes the page_token a string (which it was not yet)

Sorry, I didn't realize how much new code was needed to write these mocks.

It looks good, but I am slightly confused about names. "backend_es_memory.py" does not contain a backend, in the the sense that "ElasticSearchBackend" means "elasticsearch backend of the scheduler", right?

Sorry, I didn't realize how much new code was needed to write these mocks.

heh ;)

It looks good, but I am slightly confused about names. "backend_es_memory.py" does not contain a backend, in the the sense that "ElasticSearchBackend" means "elasticsearch backend of the scheduler", right?

Right.

The confusion also bothers me.
I'll rename but i need to find that name first ;)

backend_es_memory is currently defining MemoryElasticsearch which is really an elasticsearch implementation oriented towards storage.
That implementation is used for the backend_es.ElasticSearchBackend (within test) as an elasticsearch (collaborator) client.
(For production, the expected collaborator client implementation used is the official one elasticsearch.Elasticsearch).

Maybe in some way, we can see the memory implementation as a client.
So swh.scheduler.es_client.py maybe?

  • tests/common: Remove uneeded behavior
  • backend_es: Add tests around elasticsearch client instantiation
  • Add tests to in memory elasticsearch implementation

So swh.scheduler.es_client.py maybe?

even one of the following:

  • swh.scheduler.es_memory
  • swh.scheduler.elasticsearch_memory.py

i guess the longer name is more explicit.

This revision is now accepted and ready to land.Dec 17 2019, 1:15 PM
  • scheduler.backend: Rename appropriately module elasticsearch_memory
This revision was automatically updated to reflect the committed changes.