Page MenuHomeSoftware Heritage

listers: Open cli to allow a first listing pass on new forge instance
ClosedPublic

Authored by ardumont on Aug 28 2019, 12:06 PM.

Details

Summary

The output of that listing results in scheduled tasks with policy 'oneshot' and some priority (well depending on the input cli call).
Thus ingesting those faster and without manual intervention as we currently do [1]

Example use case (for now is for small forges):

$ export SWH_CONFIG_FILENAME=/etc/softwareheritage/lister.yml
$ swh lister run --lister gitlab \
                 --priority high \
                 --db-url postgresql://postgres@localhost:5432/swh-listers \
                 api_baseurl=https://gitlab.ow2.org/api/v4/

Related T1919

[1] along the lines of psql update "the listed tasks with priority to high", then wait for it to finish somehow, and then at some point, remove that priority with psql update yet again... tedious!

Also, to explicit this.
This does not replace a usual deployment which triggers recurring tasks with no priority.
This is to ease a first pass on new (small?) forges without having to plunge within the scheduler and drag a whole new task
(we discussed about some refactoring, with schema impacts, on the scheduler but that's for another time).

Test Plan
doco up -d swh-lister

Then trigger the sample cli use case and check the scheduler/lister db:

 policy  | priority |                                              arguments                                          $
---------+----------+-------------------------------------------------------------------------------------------------$
 oneshot | high     | {"args": ["https://gitlab.ow2.org/sat4j/sat4j.git"], "kwargs": {}}
 oneshot | high     | {"args": ["https://gitlab.ow2.org/stamp/unit-test-ampli.git"], "kwargs": {}}
 oneshot | high     | {"args": ["https://gitlab.ow2.org/stamp/conf-test-ampli.git"], "kwargs": {}}
 ...

A usual deployment would have seen the policy set to recurring and the priority
to null.

Note:

docker-compose.override.yml:

version: '2'

services:
  swh-listers-db:
    ports:
      - "5432:5432"

  swh-scheduler-db:
    ports:
      - "5433:5432"

  swh-lister:
    volumes:
      - "$SWH_ENVIRONMENT_HOME/swh-lister:/src/swh-lister"

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont added a project: Lister.
ardumont retitled this revision from listers: Allow to override policy and priority for scheduled tasks to listers: Open cli to allow a first listing pass on new forge instance.Aug 28 2019, 1:48 PM
ardumont edited the summary of this revision. (Show Details)

Looking into how to add tests to the cli.
That should not block a first review pass though ;)

Looking into how to add tests to the cli.

Take a look at the index or scheduler cli

That should not block a first review pass though ;)

Doesn't it?

  • cli: Bootstrap tests on cli

At least on new_lister method which instantiates listers.

swh/lister/cli.py
31 ↗(On Diff #6439)

naming this function 'new_lister' is a bit weird, since this actually is a kind of factory. In other modules, we use "get_xxx" (get_storage, get_objstorage. etc.) for "similar" functions.

177 ↗(On Diff #6439)

For the sake of consistency, use the same 'postgres://lister' value as the 'list' command here.

swh/lister/cli.py
205 ↗(On Diff #6448)

This description is a bit confusing (IMHO).

First it should make it clear that this command does NOT schedule a lister task, but instead it runs the lister directly.

Then, it should be made crystal clear that the priority and policy concern the loader tasks this lister schedules only.

Thanks for your feedback all ;)

swh/lister/cli.py
177 ↗(On Diff #6439)

Right!

205 ↗(On Diff #6448)

Yes. I'll improve on it.

Also, i'm wondering whether to open the policy at all.
As a user, i clearly want to use it solely with oneshot for now.

I just want to bypass the manual interventions i need to do all the time when triggering a new forge instance (cgit, gitlab, phabricator so far).

What do you think?

swh/lister/cli.py
204 ↗(On Diff #6448)

Also: I'm not very fond of the subcommand name "list" (as in "swh lister list") ... replace this "list" by "scrape" may be? Or simply "run" since we do run a lister here.

Adapt according to reviews/discussion:

  • cli: Align --db-url's default value with the list subcommand
  • cli: Unify new_lister method name to get_lister
  • cli: Rename 'list' subcommand to 'run'
  • cli: Remove the --policy parameter
  • cli: Improve the 'run' subcommand dostring
ardumont edited the summary of this revision. (Show Details)
ardumont added inline comments.
swh/lister/cli.py
204 ↗(On Diff #6448)

Clearer, thx!

This revision is now accepted and ready to land.Aug 28 2019, 3:57 PM

Rework commits (squash some)

This revision was automatically updated to reflect the committed changes.
ardumont edited the summary of this revision. (Show Details)