Page MenuHomeSoftware Heritage

Implement mimetype indexer using range of sha1s
ClosedPublic

Authored by ardumont on Nov 15 2018, 4:08 PM.

Details

Summary

This implements the new mimetype range indexer only. The previous
indexers (base class ContentIndexer, and ContentMimetypeIndexer) are
still present.

The base class because it's still needed for the other not yet
migrated indexers (fossology license is next, language, ctags).

The ContentMimetypeIndexer because i'm not 100% certain we really want
to drop it entirely. That could be used if we wanted to make a journal
client just for the content indexing on new contents.

Note:

  • The tests need to be rewritten to reuse in-memory storages (and ideally data generation). It's somewhat partially done but it's not good enough.I did not want to start this right now because i want to deploy! It needs to be done nonethelesss at some point (i'll open a task).
  • I created a new type of tasks named swh.indexer.tasks.StatusTask who returns the task's status. That's more consistent with all other swh tasks (outside swh-indexer). We need to align on this behavior.

Related T991

Test Plan

tox

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz added inline comments.
swh/indexer/tests/test_mimetype.py
292 ↗(On Diff #2073)

assertCountEqual to fix tests.

328 ↗(On Diff #2073)

assertCountEqual to fix tests.

This revision now requires changes to proceed.Nov 15 2018, 4:28 PM
swh/indexer/indexer.py
374

That name doesn't tell at all what it should do. Maybe something like get_contents_from_range or inflate_range.

387

Should be renamed to something like list_content_to_index.

394

Set[bytes]

407

That loop runs either a single time or forever. You should rewrite the loop to:

while start:
    # do the thing
    start = result['next']
415

representing

424

Any reason to use a 2-tuple, rather than just taking start and end as parameters?

436

set()

swh/indexer/mimetype.py
165 ↗(On Diff #2073)

Same issue as ContentIndexer._new's loop.

ContentIndexer._new and MimetypeRangeIndexer.range seem to do very similar jobs. Shouldn't they be merged somehow?

swh/indexer/tasks.py
24–28 ↗(On Diff #2073)

wtf, that code is completely broken.

swh/indexer/tests/test_mimetype.py
82 ↗(On Diff #2073)

More indent on this line?

ardumont added inline comments.
swh/indexer/indexer.py
407

\m/

424

I kept the original args of the base indexer.

swh/indexer/mimetype.py
165 ↗(On Diff #2073)

Yes, that sounds appropriate (given how i messed up early on ;) but i'd like correctness first.

swh/indexer/tasks.py
24–28 ↗(On Diff #2073)

Well, now that the orchestrator is out of the way, might be the conditional is no longer relevant (i think so).

swh/indexer/tests/test_mimetype.py
82 ↗(On Diff #2073)

huh, right!

ardumont added inline comments.
swh/indexer/tests/test_mimetype.py
82 ↗(On Diff #2073)

In the end, no. flake8 complains otherwise:

./swh/indexer/tests/test_mimetype.py:82:17: E125 continuation line with same indent as next logical line
tox.ini
34 ↗(On Diff #2073)

Hum, that should not be here, removing.

  • indexer: Improve docstrings
  • indexer: Use meaningful method names
  • indexer: Fix and simplify pagination algorithm
ardumont marked an inline comment as done.

Remove one commit too many in the last diff update.

(~> my local repository was missing one commit on origin/master)

  • tests: Refactor to simplify testing and fix ci
vlorentz added inline comments.
swh/indexer/indexer.py
424

You kept the type, but they don't have the same meaning

swh/indexer/tasks.py
24–28 ↗(On Diff #2073)

What meant is that run() does not return an indexer, it returns the results. So it's either indexer = self.Indexer(); indexer.run(); return indexer.results or return self.Indexer().run().

Anyway, that's outside the scope of this Diff.

swh/indexer/tests/test_mimetype.py
82 ↗(On Diff #2073)

Even with one extra indent level?

This revision now requires changes to proceed.Nov 16 2018, 10:07 AM

In general, to be clear, there are still things to adapt.
But i'm doing it as incremental as i can.

For example, the 'policy_update': update-dups, ignore-dups is not that meaningful.
incremental for example is better...

But then i'd need to change so many docstrings...
I will do it but in due time.
Right now, i'd like to deploy when this is ready that is (which is pretty soon i think).

swh/indexer/indexer.py
374

Thanks. I settled on indexed_contents_in_range.

424

Ah and also i did not realize i could simply change the parameters ;)

swh/indexer/tasks.py
24–28 ↗(On Diff #2073)

Yes, it returns the result (that's what i did below).
I did not touch that part except for adding a docstring.
It's out of scope, i'll change in another diff.
I'll probably rename the task WithResultTask or something.

ardumont marked an inline comment as done.
  • indexer.mimetype: Fix default configuration
  • indexer: Clarify input parameters and explicit [start, end] range
  • tests: Fix visual indentation strangeness
ardumont added inline comments.
swh/indexer/indexer.py
423

Build has FAILED

That's why. Rebase error, that should be a list.
Fixing this.

swh/indexer/tests/test_mimetype.py
82 ↗(On Diff #2073)

yes

Fix error introduced during rebase

This revision is now accepted and ready to land.Nov 16 2018, 1:29 PM
This revision was automatically updated to reflect the committed changes.