Page MenuHomeSoftware Heritage

implement listers as plugins

Authored by douardda on May 22 2019, 2:49 PM.



This diff is mainly here for discussion on how to implement such a thing
and what kind of 'plugin system' could be provided, especially for features
like lister, loaders and more generally scheduler-managed workers.

Depends on D1503.

Diff Detail

rDLS Listers
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Would it be possible to deduplicate the register code by putting all these functions in a single file directly in swh/lister/?

And you can replace the model import by Lister.MODEL, that's one less import to do

anlambert added inline comments.

We should build that list dynamically instead of hardcoding it. This will ease the adding of new listers.

Iterating on the sub-directories of the swh/lister folder could do the trick (core and tests must be
excluded though).

Other solution, import every submodules from swh.lister and check the presence of the register
function to determine if it is a plugin or not.


For npm, there is two models to initialize: swh.lister.npm.models.NpmModel and swh.lister.npm.models.NpmVisitModel


this looks a lot like the code to generate SUPPORTED_LISTERS.

No you cannot, that's the whole point of this idea: being able to declare plugins without having to load every possible python package or (recursively?) look for them in "some well known places".

I did not want to force listers to be 'installed' in a 'swh.lister' namespace (in the sense of PEP420 ).

Using this method based on entry points, a lister can be anywhere and does not need to lies within our swh namespace, and it is effectively loaded only if needed.

I did implement the main swh.lister as plugins here mainly to show how it can be done. These default/basic listers could come with the main swh.lister package (preloaded) without using the plugin mechanism. This is debatable.

TBH, I'm far from convinced this 'register' function is fine as it is in this diff (neither the data structure returned by the function nor the function name, however this later can be anything, since it's fully in the entrypoint declaration).


I know... Not sure yet if it is a good idea to avoid it.


that's typically why I'm not convinced by the 'API' of the plugin loading mechanism here. The true initialization work is in fact done in the 'init' hook, which is a simple function and thus can initialize as many databases/tables as one wants.


you could do:

LISTERS = {'.', 1)[1]: entry_point
               for entry_point
               in pkg_resources.iter_entry_points('swh.workers')
               if'.', 1)[0] == 'lister'}

@douardda ping about this.

Also we added some new listers, so that need some rebase ;)


yeah, i like the @vlorentz's improvment proposal

This revision now requires changes to proceed.Jul 18 2019, 11:18 AM

rebase + changes in the plugin "API"

now the register function is expected to return a dict with:

  • 'models': list of SQLAlchemy model/tables
  • 'task_modules': list of module names where celery tasks are defined,
  • 'lister': as before (Lister class),
  • (optionnally) 'init': initialization function.

If 'init' is not present, default implementation creates tables corresponding
to table/models declared in 'models'.

rebase and small fixes

Depends on D1934

rebase and better commit message

This sounds good.

It untangles some part like the npm initialization (multiple tables to init). And makes it more declarative, /me likes.

The ci is unhappy though.

I'm not sure why... oh might be because of the needed changes in the scheduler part... (the diff this diff depends on or something).

This revision is now accepted and ready to land.Sep 3 2019, 1:20 PM

rebase + better ci msg + improve config handling in

  • implement the 'standard' config loading mechanism: use SWH_CONFIG_FILENAME

+ add the --config-file option at the swh lister (group) level (also move
the --db-url there).

  • drop the --lister option in db-init cli command (see ci message).

The ci is unhappy though.

I'm not sure why... oh might be because of the needed changes in the scheduler part... (the diff this diff depends on or something).

Because it needs D1939 and I forgot to specify the dependency...

This revision was automatically updated to reflect the committed changes.