Page MenuHomeSoftware Heritage

scheduler.backend_es: Leave index opened when streaming bulk
ClosedPublic

Authored by ardumont on Feb 3 2020, 9:25 AM.

Details

Summary

Prior to this commit, we had the proper behavior of closing index when done
streaming. Unfortunately, this created too much gc on es nodes down the line.
So for now, we remove that behavior.

Note that this implies we need another cog that makes a pass once in a while on
indices to close.

Also, this has been running on production for approximately 1 or 2 weeks now
and no more gc issues arose since then.

Test Plan

tox

Diff Detail

Repository
rDSCH Scheduling utilities
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

douardda added inline comments.
swh/scheduler/backend_es.py
247

not very fond of the phrasing here. Maybe something like:
"keep the index open for performance reasons." and if it's true "The index will be automatically closed on exit" (because of some __del__ method, gc or whatever handles this properly).

Improve sentence phrasing

swh/scheduler/backend_es.py
247

not very fond of the phrasing here. Maybe something like: "keep the index open for performance reasons."

will adapt.

and if it's true "The index will be automatically closed on exit" (because of some del method, gc or whatever handles this properly).

it's not.
Thus the message in the description about another cog closing those (for now, it's me).

okay-ish but lifecycle of ES related services/objects is unclear to me.

About the index creation and opening, why isn't is done at class instantiation ? Why not also call the initialize() method *if needed* at this moment?

Would it be required to close the index at instance destruction (__del__) ?

As for refactorings, the ElasticSearchBackend.create() method looks useless to me (useless complexity). This is pretty obvious when reading this snippet of code:

if not self.storage.indices.exists(index_name):
    self.create(index_name)

where we test some condition using a method of self.storage.indices but we react to it using a method of`self` (not sure a I'm really clear here; let's discuss IRL if needed).

This revision now requires changes to proceed.Feb 6 2020, 11:11 AM

okay-ish but lifecycle of ES related services/objects is unclear to me.

ok

About the index creation and opening, why isn't is done at class instantiation ?

The class instantiation is done at the start of the cli.
And then we are working on multiple months periods (one index per month).
So we cannot know what period we are working on yet.

Why not also call the initialize() method *if needed* at this moment?

I guess we could call it in the constructor (that's not the point of the diff though ;)

I think the methods called from initialize are idempotent (since they are named with the put verb and underneath the calls are done on a rest api, afair).

Would it be required to close the index at instance destruction (__del__) ?

The point of the diff is to avoid closing indices as this creates too much gc on the esnodes.
Right now, we are in a transition period.
We need to cleanup old data sparsed around multiple years (running).
Scheduled data are inconsistent given the time period (cf. oral discussion from this morning) so even if the cli does its best to group data per month, it's not enough.
Thus the cli opening and closing too many indices for a same period.
That creates side-effects down the line, typically the gc on esnodes (which are also running kafka...)

As for refactorings, the ElasticSearchBackend.create() method looks useless to me (useless complexity). This is pretty obvious when reading this snippet of code:

if not self.storage.indices.exists(index_name):
    self.create(index_name)

where we test some condition using a method of self.storage.indices but we react to it using a method of`self` (not sure a I'm really clear here; let's discuss IRL if needed).

yeah, i don't get what you mean ;)
as usual, i just compacted code within methods but that may be irrelevant in the current context (although again, that's not in the current diff's scope either ;)

yeah, i don't get what you mean ;)
as usual, i just compacted code within methods but that may be irrelevant in the current context (although again, that's not in the current diff's scope either ;)

Maybe i start to understand slightly.
Yes, this class can be simplified a bit.

I started with i don't know how to do what i wanted to do.
So, i iterated adding red test, then slowly progressing, make it go green, adding methods, etc...
And making sure the stuff worked.
When it all worked, i stopped and forgot to refactor to simplify a bit...
That can be seen though the test_backend_es.py.
The tests there are mostly orchestration of that backend's methods.
Ensuring it's working as expected.

There you go ;)
That refactoring simplification, if needed, can be done but in another diff.

This revision was not accepted when it landed; it landed in state Needs Review.Feb 26 2020, 10:45 AM
This revision was automatically updated to reflect the committed changes.