Page MenuHomeSoftware Heritage

indexer: Persist computations every 'write_batch_size' computations
ClosedPublic

Authored by ardumont on Nov 15 2018, 6:53 PM.

Diff Detail

Branch
range-mimetype-indexer
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 2409
Build 2946: arc lint + arc unit

Event Timeline

Nitpick: I would expect a variable named batch_write to be a boolean. Maybe rename it to write_batch_size? (meaning "the size of a write batch")

swh/indexer/indexer.py
474

So, if my understanding is good, this return statement returns None if the for loop did not run, and the last results otherwise. Is that what you mean?

And it looks more complicated than it should, particularly because of the shadowing of results.

This revision now requires changes to proceed.Nov 16 2018, 10:14 AM
ardumont added inline comments.
swh/indexer/indexer.py
474

So, if my understanding is good, this return statement returns None if the for loop did not run, and the last results otherwise. Is that what you mean?

Yes. I do not particularly care for the results (i do for the tests here).
The content indexer's goal is to write to db (done by persist_index_computations).
What i am aiming with this that the indexer actually write somewhere even in the case of huge range...

Again, it's interesting it gives the result here for tests purposes only.
In the task, we do nothing with the results except for determining the status 'eventful' or not.

And it looks more complicated than it should, particularly because of the shadowing of results.

I might be able to drop the initial results. I did not try.

ardumont added inline comments.
swh/indexer/indexer.py
474

And it looks more complicated than it should, particularly because of the shadowing of results.

I might be able to drop the initial results. I did not try.

No, it does not. Unbound error.
Also, regarding complicated, the docstring explains the result is partial (at least, it tried to).

Change option name to write_batch_size

Build has FAILED

because stacked diff.

vlorentz added inline comments.
swh/indexer/indexer.py
434

I only just realized that if the caller of index_contents does not consume its values, the yield will block. You should either fix that or mark index_contents as protected method (ie. add a _ prefix)

474

Then remove the return, and set self.results instead? This way there is no return value to serialize and send through Redis.

This revision now requires changes to proceed.Nov 16 2018, 1:22 PM
ardumont added inline comments.
swh/indexer/indexer.py
434

It's a protected method in my mind indeed.

474

Then remove the return, and set self.results instead?

Not functional, i don't like that kind of stuff.
Especially for tests.
(I did not like my partial result as well but did not see better at the time so thx).

This way there is no return value to serialize and send through Redis.

Discussion took place on irc.

The gist of it is:

  • returns a boolean (True: data indexed, False: no data indexed)
  • test the data when testing index_contents instead.

Adapt according to discussion

  • d/*: Bump dependency to latest swh-storage
  • indexer: Use protected method names
  • indexer: Use functional approach to result
  • indexer: Use the right type information
ardumont retitled this revision from indexer: Persist computations every 'batch_write' computations to indexer: Persist computations every 'write_batch_size' computations.Nov 16 2018, 2:31 PM
This revision is now accepted and ready to land.Nov 16 2018, 2:35 PM
ardumont added inline comments.
swh/indexer/indexer.py
474

This way there is no return value to serialize and send through Redis.

Forgot to reply.
There is no serialization (and no redis? -> celery/rabbitmq).
The task (the one spawning the mimetype indexer here) only checks if there is data returned.
If there is, it returns a status eventful, uneventful (as a dict) otherwise ;)
And that's what is serialized.

That's why the boolean is enough.

This revision was automatically updated to reflect the committed changes.