Page MenuHomeSoftware Heritage

implement listers as plugins
ClosedPublic

Authored by douardda on May 22 2019, 2:49 PM.

Details

Summary

This diff is mainly here for discussion on how to implement such a thing
and what kind of 'plugin system' could be provided, especially for features
like lister, loaders and more generally scheduler-managed workers.

Depends on D1503.

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

douardda created this revision.May 22 2019, 2:49 PM
vlorentz added a subscriber: vlorentz.EditedMay 22 2019, 2:55 PM

Would it be possible to deduplicate the register code by putting all these functions in a single file directly in swh/lister/?

And you can replace the model import by Lister.MODEL, that's one less import to do

anlambert added inline comments.
setup.py
58–65

We should build that list dynamically instead of hardcoding it. This will ease the adding of new listers.

Iterating on the sub-directories of the swh/lister folder could do the trick (core and tests must be
excluded though).

Other solution, import every submodules from swh.lister and check the presence of the register
function to determine if it is a plugin or not.

swh/lister/npm/__init__.py
12

For npm, there is two models to initialize: swh.lister.npm.models.NpmModel and swh.lister.npm.models.NpmVisitModel

vlorentz added inline comments.May 27 2019, 2:23 PM
swh/lister/cli.py
83–86

this looks a lot like the code to generate SUPPORTED_LISTERS.

douardda added inline comments.Jun 4 2019, 5:26 PM
setup.py
58–65

No you cannot, that's the whole point of this idea: being able to declare plugins without having to load every possible python package or (recursively?) look for them in "some well known places".

I did not want to force listers to be 'installed' in a 'swh.lister' namespace (in the sense of PEP420 ).

Using this method based on entry points, a lister can be anywhere and does not need to lies within our swh namespace, and it is effectively loaded only if needed.

I did implement the main swh.lister as plugins here mainly to show how it can be done. These default/basic listers could come with the main swh.lister package (preloaded) without using the plugin mechanism. This is debatable.

TBH, I'm far from convinced this 'register' function is fine as it is in this diff (neither the data structure returned by the function nor the function name, however this later can be anything, since it's fully in the entrypoint declaration).

swh/lister/cli.py
83–86

I know... Not sure yet if it is a good idea to avoid it.

swh/lister/npm/__init__.py
12

that's typically why I'm not convinced by the 'API' of the plugin loading mechanism here. The true initialization work is in fact done in the 'init' hook, which is a simple function and thus can initialize as many databases/tables as one wants.

vlorentz added inline comments.Jun 26 2019, 12:01 PM
swh/lister/cli.py
83–86

you could do:

LISTERS = {entry_point.name.split('.', 1)[1]: entry_point
               for entry_point
               in pkg_resources.iter_entry_points('swh.workers')
               if entry_point.name.split('.', 1)[0] == 'lister'}
SUPPORTED_LISTERS = list(LISTERS)

@douardda ping about this.

Also we added some new listers, so that need some rebase ;)

swh/lister/cli.py
83–86

yeah, i like the @vlorentz's improvment proposal

vlorentz requested changes to this revision.Jul 18 2019, 11:18 AM
This revision now requires changes to proceed.Jul 18 2019, 11:18 AM
douardda updated this revision to Diff 6453.Aug 28 2019, 5:56 PM

rebase + changes in the plugin "API"

now the register function is expected to return a dict with:

  • 'models': list of SQLAlchemy model/tables
  • 'task_modules': list of module names where celery tasks are defined,
  • 'lister': as before (Lister class),
  • (optionnally) 'init': initialization function.

If 'init' is not present, default implementation creates tables corresponding
to table/models declared in 'models'.

douardda updated this revision to Diff 6491.Aug 30 2019, 6:05 PM

rebase and small fixes

Depends on D1934

douardda updated this revision to Diff 6517.Sep 2 2019, 1:30 PM

rebase and better commit message

ardumont accepted this revision.Sep 2 2019, 4:27 PM

This sounds good.

It untangles some part like the npm initialization (multiple tables to init). And makes it more declarative, /me likes.


The ci is unhappy though.

I'm not sure why... oh might be because of the needed changes in the scheduler part... (the diff this diff depends on or something).

vlorentz accepted this revision.Sep 3 2019, 1:20 PM
This revision is now accepted and ready to land.Sep 3 2019, 1:20 PM
douardda updated this revision to Diff 6530.Sep 3 2019, 3:08 PM

rebase + better ci msg + improve config handling in cli.py

  • implement the 'standard' config loading mechanism: use SWH_CONFIG_FILENAME

+ add the --config-file option at the swh lister (group) level (also move
the --db-url there).

  • drop the --lister option in db-init cli command (see ci message).
douardda added a comment.EditedSep 3 2019, 3:12 PM

The ci is unhappy though.
I'm not sure why... oh might be because of the needed changes in the scheduler part... (the diff this diff depends on or something).

Because it needs D1939 and I forgot to specify the dependency...

ardumont accepted this revision.Sep 3 2019, 3:16 PM
This revision was automatically updated to reflect the committed changes.