Page MenuHomeSoftware Heritage

Add a new indexer that combines all metadata-related indexers in a single task.
ClosedPublic

Authored by vlorentz on Jan 22 2019, 3:01 PM.

Details

Summary

Scheduling a task at the end of each origin-head and revision-metadata task
was way too slow (approx. 5 seconds, see T1492).

This Diff adds a new indexer that calls the three other indexers without
going through the scheduler, which is a lot faster and requires less code.

Further Diffs will clean up the code that is now useless.

Test Plan

tox + docker

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
full-origin-metadata-indexer
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 3652
Build 4751: tox-on-jenkinsJenkins
Build 4750: arc lint + arc unit

Event Timeline

douardda added a subscriber: douardda.

This kind of refactoring needs a bit of thought... The problem addressed here is a performance problem of the scheduler API, it seems. This response is presented as a workaround for that. I'd like to have better arguments than that to justify this refactoring.

The proper question to answer requires to have 'profiling' data: what is the 'payload'/'task management scaffolding' ratio? How long these other indexers take? Does it make sense to go though the scheduler for (very) short tasks? What 'short' means in this context?

Note that I don't mean this refactoring must not be done. I'd like to have strong arguments. And a beginning of reflexion on this scheduling problem.

swh/indexer/tests/test_origin_metadata.py
145

print?

197–199

same ? as above

This revision now requires changes to proceed.Jan 23 2019, 9:41 AM

The proper question to answer requires to have 'profiling' data: what is the 'payload'/'task management scaffolding' ratio?

Task management is two parts: running each of the three tasks in the pipeline and scheduling a new task at the end of OriginHeadIndexer and RevisionMetadataIndexer tasks. The former is fast enough, the latter takes 5 seconds twice.

The payload is approximately 1 second; so that's a 1/10 ratio.

How long these other indexers take?

What do you mean?

Does it make sense to go though the scheduler for (very) short tasks?
What 'short' means in this context?

No it doesn't. I decided early on to run these three indexers in separate tasks because it made sense "logically" (they do three separate things); but with time I no longer see the point.
There are still very connected and pass data to each other.

"Formerly", there were 4 read requests to the graph storage and 7 requests to the indexer storage. With this new indexer: 3 read requests to the graph storage and 5 to the indexer storage. Plus 1 read request to the objstorage for both.
Computation time is dominated by these either way.

There difference in the number of request is due to the different parts being able to pass data directly to each other instead of re-fecthing it, as they no longer communicate through the scheduler.

douardda added inline comments.
swh/indexer/metadata.py
335–341

why not overload the prepare() method?

343–345

kill this method and add a USE_TOOLS = False class attribute, see D990

347–349

Should not be required any more, see D990

swh/indexer/tests/test_origin_metadata.py
123–126

If you wait for D993 (like 1/2h), I killed these awful XXXTestIndexer...

This revision now requires changes to proceed.Jan 24 2019, 5:28 PM
vlorentz marked 3 inline comments as done.
  • Rebase
  • Apply changes required by D990.
vlorentz added inline comments.
swh/indexer/metadata.py
335–341

So tests can override just this part without reimplementing the whole prepare(). Though it should eventually go away

vlorentz added inline comments.
swh/indexer/tests/test_origin_metadata.py
123–126

Not all of them; OriginHeadTestIndexer and RevisionMetadataTestIndexer are used as "sub-indexer" by this test and test_pipeline.

vlorentz added inline comments.
swh/indexer/tests/test_origin_metadata.py
123–126

(I'm working on it)

Minor stuffs to fix then we are good to go.

swh/indexer/metadata.py
335–341

Did I miss something? I do not see this method being overloaded in the tests below.

swh/indexer/tests/test_origin_metadata.py
13

Please try to keep your imports grouped and sorted

This revision now requires changes to proceed.Jan 29 2019, 9:51 AM
  • Rebase
  • Remove _prepare_sub_indexers.
  • Fix rebase
  • Sort imports.
This revision was not accepted when it landed; it landed in state Needs Review.Jan 29 2019, 12:06 PM
This revision was automatically updated to reflect the committed changes.