Page MenuHomeSoftware Heritage

Add a new indexer that combines all metadata-related indexers in a single task.
ClosedPublic

Authored by vlorentz on Jan 22 2019, 3:01 PM.

Details

Summary

Scheduling a task at the end of each origin-head and revision-metadata task
was way too slow (approx. 5 seconds, see T1492).

This Diff adds a new indexer that calls the three other indexers without
going through the scheduler, which is a lot faster and requires less code.

Further Diffs will clean up the code that is now useless.

Test Plan

tox + docker

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision.Jan 22 2019, 3:01 PM
douardda requested changes to this revision.Jan 23 2019, 9:41 AM
douardda added a subscriber: douardda.

This kind of refactoring needs a bit of thought... The problem addressed here is a performance problem of the scheduler API, it seems. This response is presented as a workaround for that. I'd like to have better arguments than that to justify this refactoring.

The proper question to answer requires to have 'profiling' data: what is the 'payload'/'task management scaffolding' ratio? How long these other indexers take? Does it make sense to go though the scheduler for (very) short tasks? What 'short' means in this context?

Note that I don't mean this refactoring must not be done. I'd like to have strong arguments. And a beginning of reflexion on this scheduling problem.

swh/indexer/tests/test_origin_metadata.py
161

print?

213–215

same ? as above

This revision now requires changes to proceed.Jan 23 2019, 9:41 AM
vlorentz updated this revision to Diff 3130.Jan 23 2019, 10:35 AM
  • Remove prints.

The proper question to answer requires to have 'profiling' data: what is the 'payload'/'task management scaffolding' ratio?

Task management is two parts: running each of the three tasks in the pipeline and scheduling a new task at the end of OriginHeadIndexer and RevisionMetadataIndexer tasks. The former is fast enough, the latter takes 5 seconds twice.

The payload is approximately 1 second; so that's a 1/10 ratio.

How long these other indexers take?

What do you mean?

Does it make sense to go though the scheduler for (very) short tasks?
What 'short' means in this context?

No it doesn't. I decided early on to run these three indexers in separate tasks because it made sense "logically" (they do three separate things); but with time I no longer see the point.
There are still very connected and pass data to each other.

"Formerly", there were 4 read requests to the graph storage and 7 requests to the indexer storage. With this new indexer: 3 read requests to the graph storage and 5 to the indexer storage. Plus 1 read request to the objstorage for both.
Computation time is dominated by these either way.

There difference in the number of request is due to the different parts being able to pass data directly to each other instead of re-fecthing it, as they no longer communicate through the scheduler.

vlorentz updated this revision to Diff 3131.Jan 23 2019, 10:53 AM
  • Remove another print.
vlorentz edited the summary of this revision. (Show Details)Jan 24 2019, 1:24 PM
douardda requested changes to this revision.Jan 24 2019, 5:28 PM
douardda added inline comments.
swh/indexer/metadata.py
322–328

why not overload the prepare() method?

330–332

kill this method and add a USE_TOOLS = False class attribute, see D990

334–336

Should not be required any more, see D990

swh/indexer/tests/test_origin_metadata.py
140–143

If you wait for D993 (like 1/2h), I killed these awful XXXTestIndexer...

This revision now requires changes to proceed.Jan 24 2019, 5:28 PM
vlorentz updated this revision to Diff 3193.Jan 25 2019, 9:57 AM
vlorentz marked 3 inline comments as done.
  • Rebase
  • Apply changes required by D990.
vlorentz marked 2 inline comments as done.Jan 25 2019, 9:57 AM
vlorentz added inline comments.
swh/indexer/metadata.py
322–328

So tests can override just this part without reimplementing the whole prepare(). Though it should eventually go away

vlorentz marked an inline comment as done.Jan 25 2019, 9:58 AM
vlorentz added inline comments.
swh/indexer/tests/test_origin_metadata.py
140–143

Not all of them; OriginHeadTestIndexer and RevisionMetadataTestIndexer are used as "sub-indexer" by this test and test_pipeline.

vlorentz marked an inline comment as done.Jan 25 2019, 10:10 AM
vlorentz added inline comments.
swh/indexer/tests/test_origin_metadata.py
140–143

(I'm working on it)

ardumont accepted this revision.Jan 28 2019, 3:10 PM
vlorentz updated this revision to Diff 3221.Jan 28 2019, 4:02 PM
  • rebase
douardda requested changes to this revision.Jan 29 2019, 9:51 AM

Minor stuffs to fix then we are good to go.

swh/indexer/metadata.py
322–328

Did I miss something? I do not see this method being overloaded in the tests below.

swh/indexer/tests/test_origin_metadata.py
16

Please try to keep your imports grouped and sorted

This revision now requires changes to proceed.Jan 29 2019, 9:51 AM
vlorentz updated this revision to Diff 3234.Jan 29 2019, 10:25 AM
  • Rebase
  • Remove _prepare_sub_indexers.
  • Fix rebase
  • Sort imports.
vlorentz marked 3 inline comments as done.Jan 29 2019, 10:25 AM
ardumont accepted this revision.Jan 29 2019, 10:49 AM
This revision was not accepted when it landed; it landed in state Needs Review.Jan 29 2019, 12:06 PM
This revision was automatically updated to reflect the committed changes.