
Remarks on the tutorial "Run a new lister"
Open, NormalPublic

Description

I'm trying to follow the tutorial "Run a new lister" (https://docs.softwareheritage.org/devel/swh-lister/run_a_new_lister.html) and I'm running into some issues.

In para 1., I have to "edit" the file docker-compose-override.yml, but I was not able to find this file. I think I have to create it in the swh-docker-dev repository. Also, the first code block does not look correct to me: is the indentation of swh-listers correct?

In para 2., the instructions in the swh-lister README are not easy to follow:

  • where does the createdb tool come from?
  • how can I deploy a postgres database?
  • how do I get the python module swh.lister.cli?

I think I have to use swh-environment and/or swh-docker-dev repositories, but it's not clear to me!

Actually, it would be really nice to have this kind of tutorial built on a hello-world example, with all required steps explained on the same page.
Of course, I understand this is time-consuming and boring to do ;)
But I would be happy to help you improve this kind of tutorial by being a beta tester!

Event Timeline

lewo created this task. Sep 1 2019, 10:13 AM
lewo updated the task description. Sep 1 2019, 10:15 AM
zack triaged this task as Normal priority. Sep 1 2019, 10:45 AM

Hello,

Thanks, this is highly appreciated feedback ;)

In para 1., I have to "edit" the file docker-compose-override.yml, but I was not able to find this file. I think I have to create it in the swh-docker-dev repository. Also, the first code block does not look correct to me: is the indentation of swh-listers correct?

Yes, you have to create it.
In the swh-docker-dev repository, there is a sample committed.
We did not create/commit a docker-compose.override.yml since it's not a generic file. It's a user-specific file.

As for the indentation, I don't know; it might indeed be wrong.

Here is mine as another example (which actually works ;)

version: '2'

services:
  swh-listers-db:
    ports:
      - "5432:5432"

  swh-scheduler-db:
    ports:
      - "5433:5432"

  swh-lister:
    volumes:
      - "$SWH_ENVIRONMENT_HOME/swh-lister:/src/swh-lister"
  #     - "$SWH_ENVIRONMENT_HOME/swh-scheduler:/src/swh-scheduler"
  # swh-scheduler-api:
  #   volumes:
  #     - "$SWH_ENVIRONMENT_HOME/swh-scheduler:/src/swh-scheduler"

Note:

  • SWH_ENVIRONMENT_HOME is an env variable of mine to avoid repetition (it's where my swh-environment checkout is located). I quite like it as it allows machine-agnostic documentation (too bad I'm not pushing more to integrate it everywhere).
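
To check what the swh-lister volume entry above resolves to on a given machine, the variable expansion can be reproduced in a few lines of Python. This is only a sketch: docker-compose performs its own substitution, os.path.expandvars is used here as a stand-in for it, and the path /home/me/swh-environment is a made-up example value.

```python
import os
from typing import Tuple


def expand_volume(spec: str) -> Tuple[str, str]:
    """Expand env variables in a 'host:container' volume spec and split it."""
    expanded = os.path.expandvars(spec)
    # Split on the last ':' so the container path ends up on the right.
    host_path, container_path = expanded.rsplit(":", 1)
    return host_path, container_path


# Hypothetical location of the swh-environment checkout (assumption).
os.environ["SWH_ENVIRONMENT_HOME"] = "/home/me/swh-environment"

host, container = expand_volume("$SWH_ENVIRONMENT_HOME/swh-lister:/src/swh-lister")
print(host, "->", container)
```

If the variable is not exported in the shell that runs docker-compose, the host path will not point at the local checkout, which is worth ruling out first when local changes do not show up in the container.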

In para 2., the instructions in the swh-lister README are not easy to follow:

Well, it kind of depends on how you choose to deploy (docker-dev or your own machine...).
We'd rather users choose the swh-docker-dev path though (reproducibility, fewer cogs to set up for the user, less hassle on our part to understand what's wrong on the user's machine ;).

where does the createdb tool come from?

That most probably came from swh.lister.cli.
It seems that it no longer goes by that name though.

It's now exposed through the CLI in either of the following ways (prefer the first one):

swh lister db-init -h
Usage: swh lister db-init [OPTIONS] [github|gitlab|bitbucket|debian|pypi|npm|phabricator|gnu|cran|cgit|packagist|all]...

  Initialize the database model for given listers.

Options:
  -d, --db-url TEXT  SQLAlchemy DB URL; see <http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls>
  -D, --drop-tables  Drop tables before creating the database schema
  -h, --help         Show this message and exit.

or:

python3 -m swh.lister.cli db-init -h

Note: In docker-dev, the db is already set up though.
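
For a setup outside docker-dev, the value given to --db-url is a SQLAlchemy-style database URL. A minimal sketch of assembling one follows; the user name, password and helper name build_db_url are placeholders invented for the example, while the port and database name are taken from the compose file earlier in this thread.

```python
from urllib.parse import quote


def build_db_url(user, password, host, port, dbname):
    """Assemble a postgresql:// URL, percent-encoding the credentials."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, port, dbname
    )


# Placeholder credentials (assumption); port and db name as in the compose file.
url = build_db_url("lewo", "p@ss", "localhost", 5432, "swh-listers")
print(url)
```

Percent-encoding matters as soon as the password contains characters such as "@" or "/", which would otherwise be read as URL delimiters.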

how can I deploy a postgres database?

Mmm, install the postgres service, create a role for your user (as the postgres user: createuser lewo), then create the db (createdb, which ships with the postgres tools)...

But really, this should already be done in docker-dev.

To ensure this, take a look at the docker logs when booting up services.

docker-compose logs --follow swh-lister

You should see a mention of the db being initialized.

how do I get the python module swh.lister.cli?

If you followed the main tutorial, you should have a virtualenv set up.
In the following, I'll call that virtualenv swh.

$ workon swh
$ cd swh-lister
$ pip install -e .  # install from the cloned repository (editable mode)
$ pip install swh-lister   # or, alternatively, install from PyPI

Actually, it would be really nice to have this kind of tutorial built on a hello-world example, with all required steps explained on the same page.

Indeed.

Of course, I understand this is time-consuming and boring to do ;)

Well, for my part, the boring part is not the issue ;)

The issue is finding the sweet spot between being clear and not being too redundant (across the docs we have...).
I guess we should make clear (if it is not already) that we want people to use the swh-docker-dev environment,
and then continue documenting while implicitly relying on this.
Then, we could slice and dice the lister README (it is confusing right now).

But I would be happy to help you improve this kind of tutorial by being a beta tester!

Cool.

HTH

lewo added a comment. Sep 1 2019, 9:22 PM

Ok, thx a lot for your answers!

I've deployed the docker compose stack and I'm trying to run the lister locally. In the lister README, the provided Gitlab Python snippet needs to be updated since I get

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    'per_page': 20
  File "/mnt/data/home/lewo/repos/swh-environment/swh-lister/venv/lib/python3.7/site-packages/celery/local.py", line 191, in __call__
    return self._get_current_object()(*a, **kw)
  File "/mnt/data/home/lewo/repos/swh-environment/swh-lister/venv/lib/python3.7/site-packages/swh/scheduler/task.py", line 45, in __call__
    return super().__call__(*args, **kwargs)
  File "/mnt/data/home/lewo/repos/swh-environment/swh-lister/venv/lib/python3.7/site-packages/celery/app/task.py", line 394, in __call__
    return self.run(*args, **kwargs)
TypeError: full_gitlab_relister() takes 1 positional argument but 2 were given

It works better when I unpack the kwargs. I then get this log, but I don't know what to do next :/
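
The TypeError above is the usual symptom of passing a dict positionally to a callable that only accepts keyword arguments after its first parameter. The sketch below reproduces it with a hypothetical stand-in function (full_relister is not the real swh.lister task, and api_baseurl is an assumed parameter name; only per_page comes from the traceback).

```python
def full_relister(task, **lister_args):
    """Hypothetical stand-in for a bound task: one positional slot, the rest keyword-only."""
    return dict(lister_args)


# 'per_page' appears in the traceback; 'api_baseurl' is an assumed parameter name.
params = {"api_baseurl": "https://0xacab.org/api/v4", "per_page": 20}

error = None
try:
    # Passing the dict itself adds a second positional argument and fails.
    full_relister("task", params)
except TypeError as exc:
    error = str(exc)

# Unpacking the dict turns each key into a keyword argument and succeeds.
result = full_relister("task", **params)
print(error)
print(result)
```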

DEBUG:swh.lister.core.lister_base:Loading config from lister_gitlab
INFO:swh.core.config:Loading config file /home/lewo/.config/swh/lister_gitlab.yml
DEBUG:swh.lister.core.lister_base:<swh.lister.gitlab.lister.GitLabLister object at 0x7f9427d94990> CONFIG={'content_size_limit': 104857600, 'log_db': 'dbname=softwareheritage-log', 'storage': {'cls': 'remote', 'args': {'url': 'http://localhost:5002/'}}, 'scheduler': {'cls': 'remote', 'args': {'url': 'http://localhost:5008/'}}, 'lister': {'cls': 'local', 'args': {'db': 'postgresql:///postgresql@localhost/swh-listers'}}, 'credentials': [], 'cache_responses': True, 'celery': {'task_broker': 'amqp://guest:guest@localhost//', 'task_modules': ['swh.lister.bitbucket.tasks', 'swh.lister.cran.tasks', 'swh.lister.debian.tasks', 'swh.lister.github.tasks', 'swh.lister.gitlab.tasks', 'swh.lister.gnu.tasks', 'swh.lister.npm.tasks', 'swh.lister.phabricator.tasks', 'swh.lister.pypi.tasks'], 'task_queues': ['swh.lister.bitbucket.tasks.FullBitBucketRelister', 'swh.lister.bitbucket.tasks.IncrementalBitBucketLister', 'swh.lister.bitbucket.tasks.RangeBitBucketLister', 'swh.lister.bitbucket.tasks.ping', 'swh.lister.cran.tasks.CRANListerTask', 'swh.lister.cran.tasks.ping', 'swh.lister.debian.tasks.DebianListerTask', 'swh.lister.debian.tasks.ping', 'swh.lister.github.tasks.FullGitHubRelister', 'swh.lister.github.tasks.IncrementalGitHubLister', 'swh.lister.github.tasks.RangeGitHubLister', 'swh.lister.github.tasks.ping', 'swh.lister.gitlab.tasks.FullGitLabRelister', 'swh.lister.gitlab.tasks.IncrementalGitLabLister', 'swh.lister.gitlab.tasks.RangeGitLabLister', 'swh.lister.gitlab.tasks.ping', 'swh.lister.gnu.tasks.GNUListerTask', 'swh.lister.gnu.tasks.ping', 'swh.lister.npm.tasks.NpmIncrementalListerTask', 'swh.lister.npm.tasks.NpmListerTask', 'swh.lister.npm.tasks.ping', 'swh.lister.phabricator.tasks.FullPhabricatorLister', 'swh.lister.phabricator.tasks.IncrementalPhabricatorLister', 'swh.lister.phabricator.tasks.ping', 'swh.lister.pypi.tasks.PyPIListerTask', 'swh.lister.pypi.tasks.ping']}, 'cache_dir': '/home/lewo/.cache/swh/lister/gitlab'}
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
/mnt/data/home/lewo/repos/swh-environment/swh-lister/venv/lib/python3.7/site-packages/swh/scheduler/__init__.py:69: DeprecationWarning: Call to deprecated class SWHRemoteAPI. (Use the RPCClient instead) -- Deprecated since version 0.0.64.
  return SchedulerBackend(**args)
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): 0xacab.org:443
DEBUG:urllib3.connectionpool:https://0xacab.org:443 "HEAD /api/v4/projects?page=1&order_by=id&sort=asc HTTP/1.1" 200 0

Now I have to figure out whether it is working and how to list the created loader tasks...

A lister's input is a forge's URL (among other things).
A lister's output is a set of scheduling tasks (for the loader) in the scheduler db (with their arguments set for the loader to pick up).

$ workon swh
$ doco exec swh-scheduler-api bash
$ swh scheduler task list --task-type list-gitlab-full 

Note:

  • doco is an alias for docker-compose within the swh venv
  • Use swh scheduler task -h to check that I got the options right ;)

The listers use a db as cache though.
So you could check that cache.

$ doco exec swh-lister bash
$ psql swh-listers
> select count(*) from gitlab_repo;
...

Note:

  • Judging from the log, I assume you ran the GitLab lister
  • To know which table belongs to which lister, check the python module swh.lister.<lister-name>.models; the db table should be spelled out there (or use \d in the psql repl from the last sample).

Cheers,