Add a CLI tool to reindex origins based on mapping used.
Closed, Public

Authored by vlorentz on Feb 20 2019, 5:15 PM.

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
scheduling-cli
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 4357
Build 5758: tox-on-jenkins (Jenkins)
Build 5757: arc lint + arc unit

Event Timeline

douardda added a subscriber: douardda.

Please add an entry_point in the setup.py for this cli.

swh/indexer/cli.py
24–46

It's weird to be able to load a config file whose content does not seem to be used anywhere.

I would expect every config-like option of the schedule group to be a config entry read from the config file, with a possible CLI override via these options.

Also, the --no-dry-run flag value is useless IMHO. Just keep --dry-run with is_flag=True.
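
To illustrate the "config file first, CLI override second" behaviour requested above, here is a minimal sketch. The helper name merge_config, the keys, and the values are all hypothetical; they are not taken from swh/indexer/cli.py. It assumes an option not given on the command line arrives as None.

```python
def merge_config(file_config, cli_overrides):
    """Return a copy of file_config updated with the CLI options actually given.

    A CLI option left unset is represented by None and does not override
    the value read from the config file.
    """
    merged = dict(file_config)
    for key, value in cli_overrides.items():
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical keys and values, for illustration only.
file_config = {
    "scheduler_url": "http://scheduler.example/",
    "task_type": "indexer_origin_metadata",
}
cli_overrides = {"scheduler_url": None, "task_type": "custom_task_type"}

config = merge_config(file_config, cli_overrides)
```

With these inputs, scheduler_url keeps the value from the file and task_type takes the value passed on the command line; file_config itself is left unmodified.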

61

What's the point of this limit value? It looks like a hard maximum on the number of origins to reindex at once, so it should at least be mentioned in the command's docstring.

However, I'm wondering: won't this value prevent some origins from ever being reindexed? If I run this command once and hit this limit, how do I know that I did? How can I 'resume' the reindexing from there?

79–80

What are the expected values for this option?

81

What other task types can be used here? Could I create, say, an inconsistent 'swh-origin-git-update' task using this CLI tool?

87

Why is this command limited to already indexed origins only?

This revision now requires changes to proceed. Feb 21 2019, 10:42 AM
vlorentz added inline comments.
swh/indexer/cli.py
24–46

It's weird to be able to load a config file whose content does not seem to be used anywhere.

I would expect every config-like option of the schedule group to be a config entry read from the config file, with a possible CLI override via these options.

Will do

Also, the --no-dry-run flag value is useless IMHO. Just keep --dry-run with is_flag=True.

I agree; I did that for uniformity with swh.scheduler.
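
For reference, a minimal sketch of the single-flag form, assuming the CLI is built with click (which is_flag=True suggests). The command name reindex and the printed messages are hypothetical and not taken from swh/indexer/cli.py.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("--dry-run", is_flag=True, default=False,
              help="Show what would be scheduled without scheduling it.")
def reindex(dry_run):
    """Hypothetical command used only to show the flag's behaviour."""
    if dry_run:
        click.echo("dry run: nothing scheduled")
    else:
        click.echo("scheduling tasks")


# Exercise the command in-process rather than through a shell.
runner = CliRunner()
dry_output = runner.invoke(reindex, ["--dry-run"]).output
real_output = runner.invoke(reindex, []).output
```

Omitting the flag gives the real behaviour, so no separate --no-dry-run spelling is needed.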

61

That's a per-request limit, but this function keeps making requests for more origins until it exhausts them all (hence the while loop, in particular start = origins[-1]+1).
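
A minimal sketch of that pagination pattern, under stated assumptions: fetch_origin_ids is a hypothetical stand-in for the real storage request, origins are plain integer ids, and each request returns them in ascending order. Only the loop shape and the start = origins[-1] + 1 step come from the diff.

```python
def fetch_origin_ids(start, limit, all_ids):
    """Hypothetical stand-in for one storage request.

    Returns at most `limit` ids that are >= start, in ascending order.
    """
    return [i for i in sorted(all_ids) if i >= start][:limit]


def iter_all_origin_ids(all_ids, limit=3):
    """Yield every origin id, requesting at most `limit` of them at a time."""
    start = 0
    while True:
        origins = fetch_origin_ids(start, limit, all_ids)
        if not origins:
            # An empty page means every origin has been seen.
            break
        yield from origins
        # Resume just past the last id returned by this request.
        start = origins[-1] + 1


all_ids = [1, 2, 5, 8, 9, 13, 21]
collected = list(iter_all_origin_ids(all_ids, limit=3))
```

Here limit only bounds the size of each request; collected still contains all seven ids, so no origin is skipped when the limit is reached.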

81

You can. As indexer_origin_metadata is a config in swh.scheduler's db, it makes sense to have it configurable here too.

87

Origins not already indexed are out of scope of this diff, and will be handled as part of T1528.

vlorentz added inline comments.
swh/indexer/cli.py
87

Actually, that's also out of scope of that task, so here is a new one: T1536 (note that this new task is independent of the indexers).

vlorentz marked 5 inline comments as done.
  • Bump required swh.scheduler version.
  • Honor --dry-run.
  • Use a config file for scheduler/idx_storage/storage.
  • Better doc for --mappings.
This revision is now accepted and ready to land. Feb 22 2019, 11:27 AM
  • Add subcommand to list mappings.
This revision was automatically updated to reflect the committed changes.