Details

This kind of refactoring needs a bit of thought. The problem addressed here seems to be a performance problem in the scheduler API, and this change is presented as a workaround for it. I'd like better arguments than that to justify the refactoring.

Answering the real question requires profiling data: what is the 'payload'/'task management scaffolding' ratio? How long do these other indexers take? Does it make sense to go through the scheduler for (very) short tasks? What does 'short' mean in this context?

Note that I don't mean this refactoring must not be done. I'd like to have strong arguments, and the beginning of a reflection on this scheduling problem.

swh/indexer/tests/test_origin_metadata.py
160	print?
212–214	same as above?

This revision now requires changes to proceed. Jan 23 2019, 9:41 AM

Remove prints.

Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/246/ for more details.

Harbormaster completed remote builds in B3655: Diff 3130. Jan 23 2019, 10:38 AM

In D986#21034, @douardda wrote:

Answering the real question requires profiling data: what is the 'payload'/'task management scaffolding' ratio?

Task management has two parts: running each of the three tasks in the pipeline, and scheduling a new task at the end of the OriginHeadIndexer and RevisionMetadataIndexer tasks. The former is fast enough; the latter takes 5 seconds, twice.

The payload takes approximately 1 second, so that's a 1/10 ratio.
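
As a quick check of that ratio, here is a minimal sketch that uses only the figures quoted above (two scheduling steps of about 5 seconds each, and a payload of about 1 second). The variable names are illustrative.

```python
from fractions import Fraction

# Approximate figures quoted in this discussion.
SCHEDULING_DELAY_S = 5  # time to schedule one follow-up task
SCHEDULING_STEPS = 2    # after OriginHeadIndexer and after RevisionMetadataIndexer
PAYLOAD_S = 1           # actual indexing work

overhead_s = SCHEDULING_DELAY_S * SCHEDULING_STEPS
payload_to_overhead = Fraction(PAYLOAD_S, overhead_s)
payload_share = Fraction(PAYLOAD_S, PAYLOAD_S + overhead_s)

print(f"scheduling overhead: {overhead_s} s")
print(f"payload/overhead ratio: {payload_to_overhead}")
print(f"payload share of total time: {float(payload_share):.0%}")
```

Under these figures, the payload accounts for roughly one eleventh of the total wall-clock time.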

How long do these other indexers take?

What do you mean?

Does it make sense to go through the scheduler for (very) short tasks?
What does 'short' mean in this context?

No, it doesn't. I decided early on to run these three indexers in separate tasks because it made sense "logically" (they do three separate things), but over time I stopped seeing the point.
They are still tightly connected and pass data to each other.

"Formerly", there were 4 read requests to the graph storage and 7 requests to the indexer storage. With this new indexer: 3 read requests to the graph storage and 5 to the indexer storage. Plus 1 read request to the objstorage for both.
Computation time is dominated by these either way.

The difference in the number of requests is due to the different parts being able to pass data directly to each other instead of re-fetching it, as they no longer communicate through the scheduler.
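
To illustrate why chaining saves requests, here is a toy sketch. `CountingStorage`, `find_head`, and `extract_metadata` are hypothetical stand-ins, not the swh storage or indexer API, and the request counts are those of the toy rather than the figures quoted above.

```python
class CountingStorage:
    """Hypothetical in-memory storage that counts read requests."""

    def __init__(self, data):
        self._data = data
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._data[key]


def find_head(storage, origin):
    """Step 1: resolve an origin to its head revision (one read)."""
    return storage.get(("head", origin))


def extract_metadata(storage, revision):
    """Step 2: load the metadata attached to a revision (one read)."""
    return storage.get(("metadata", revision))


def run_as_separate_tasks(storage, origin):
    """Each task only receives the origin, so task 2 re-fetches the head."""
    find_head(storage, origin)             # task 1; its result is not handed over
    revision = find_head(storage, origin)  # task 2 fetches the same head again
    return extract_metadata(storage, revision)


def run_chained(storage, origin):
    """A single task passes the head revision directly to the next step."""
    revision = find_head(storage, origin)
    return extract_metadata(storage, revision)


DATA = {
    ("head", "origin-1"): "rev-1",
    ("metadata", "rev-1"): {"name": "example"},
}

separate = CountingStorage(DATA)
chained = CountingStorage(DATA)
result_separate = run_as_separate_tasks(separate, "origin-1")
result_chained = run_chained(chained, "origin-1")
print(separate.reads, chained.reads)
```

Both variants return the same metadata; the chained one simply avoids the read that only existed to recover a value the previous step already had.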

Remove another print.

Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/247/ for more details.

Harbormaster completed remote builds in B3656: Diff 3131. Jan 23 2019, 10:56 AM

vlorentz edited the summary of this revision. Jan 24 2019, 1:24 PM

douardda requested changes to this revision. Jan 24 2019, 5:28 PM

douardda added inline comments.

swh/indexer/metadata.py
321–327	why not overload the prepare() method?
329–331	kill this method and add a USE_TOOLS = False class attribute, see D990
333–335	Should not be required any more, see D990
swh/indexer/tests/test_origin_metadata.py
139–142	If you wait for D993 (about half an hour), these awful XXXTestIndexer classes will be gone; I killed them there.
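
For readers who have not seen D990, a rough sketch of what a `USE_TOOLS = False` class attribute could look like. Only the attribute name comes from this review; `BaseIndexerSketch`, `register_tools`, and the way the flag is consulted are assumptions made for illustration, not the actual swh.indexer code.

```python
class BaseIndexerSketch:
    """Hypothetical base class; only the USE_TOOLS name comes from the review."""

    # Subclasses that need no tool registration set this to False.
    USE_TOOLS = True

    def __init__(self):
        # Assumed behaviour: skip tool registration when the flag is off.
        self.tools = self.register_tools() if self.USE_TOOLS else []

    def register_tools(self):
        # Placeholder for whatever registration a real base class performs.
        return [{"name": "example-tool", "version": "0.0.1"}]


class OriginMetadataSketch(BaseIndexerSketch):
    """Assumed to rely on sub-indexers, so it declares that it needs no tools."""

    USE_TOOLS = False


with_tools = BaseIndexerSketch()
without_tools = OriginMetadataSketch()
print(len(with_tools.tools), len(without_tools.tools))
```

The point of the flag is that the subclass states its intent declaratively instead of overriding a method with an empty body.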

This revision now requires changes to proceed. Jan 24 2019, 5:28 PM

Rebase
Apply changes required by D990.

vlorentz marked 2 inline comments as done. Jan 25 2019, 9:57 AM

vlorentz added inline comments.

swh/indexer/metadata.py
321–327	So that tests can override just this part without reimplementing the whole `prepare()`, though it should eventually go away.
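
The trade-off being discussed can be sketched as follows. All names here (`IndexerSketch`, `make_sub_indexers`, the config contents) are hypothetical and only illustrate the pattern of a small overridable hook called from `prepare()`; they are not taken from swh/indexer/metadata.py.

```python
class IndexerSketch:
    """Hypothetical indexer whose prepare() delegates one step to a hook."""

    def prepare(self):
        # Shared setup that tests should not have to duplicate.
        self.config = {"batch_size": 10}
        # The only part tests need to replace lives in its own method.
        self.sub_indexers = self.make_sub_indexers()

    def make_sub_indexers(self):
        return ["origin-head", "revision-metadata"]


class IndexerSketchForTests(IndexerSketch):
    """Test double: swaps the sub-indexers and inherits the rest of prepare()."""

    def make_sub_indexers(self):
        return ["fake-origin-head", "fake-revision-metadata"]


indexer = IndexerSketchForTests()
indexer.prepare()
print(indexer.config, indexer.sub_indexers)
```

Overloading `prepare()` itself would also work, but the test subclass would then have to repeat or call the shared setup explicitly.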

vlorentz marked an inline comment as done. Jan 25 2019, 9:58 AM

vlorentz added inline comments.

swh/indexer/tests/test_origin_metadata.py
139–142	Not all of them; `OriginHeadTestIndexer` and `RevisionMetadataTestIndexer` are used as "sub-indexers" by this test and `test_pipeline`.

Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/271/ for more details.

Harbormaster completed remote builds in B3721: Diff 3193. Jan 25 2019, 10:00 AM

vlorentz marked an inline comment as done. Jan 25 2019, 10:10 AM

vlorentz added inline comments.

swh/indexer/tests/test_origin_metadata.py
139–142	(I'm working on it)

Rebase on top of D1008
Adapt for D1008.

vlorentz added a parent revision: D1008: Kill RevisionMetadataTestIndexer and one of the OriginHeadTestIndexers. Jan 25 2019, 10:20 AM

Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/273/ for more details.

Harbormaster completed remote builds in B3723: Diff 3195. Jan 25 2019, 10:25 AM

vlorentz added a child revision: D1010: Make metadata indexers store the mappings used to translate metadata. Jan 25 2019, 3:38 PM

ardumont accepted this revision. Jan 28 2019, 3:10 PM

rebase

Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/280/ for more details.

Harbormaster completed remote builds in B3758: Diff 3221. Jan 28 2019, 4:05 PM

Minor things to fix, then we are good to go.

swh/indexer/metadata.py
321–327	Did I miss something? I do not see this method being overloaded in the tests below.
swh/indexer/tests/test_origin_metadata.py
15	Please try to keep your imports grouped and sorted