Reschedule origin-intrinsic-metadata tasks for all origins
Closed, Resolved (Public)

Description

Now that the idx-storage writes to Kafka, we need to backfill the topics.

Rather than write code to copy data from the database to Kafka (like we did with swh-storage), this can be done simply by re-indexing all the origins, using swh scheduler schedule_origins.

Event Timeline

(As discussed elsewhere) True if we focus solely on origins for now.
Less so if we also want to deal with "content" indexers later ;)

The suggested CLI command does not show up, but I've only taken a quick glance ¯\_(ツ)_/¯:

swhscheduler@scheduler0:~$ /usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/backend.yml -h
Usage: swh scheduler [OPTIONS] COMMAND [ARGS]...

  Software Heritage Scheduler tools.

  Use a local scheduler instance by default (plugged to the main scheduler
  db).

Options:
  -C, --config-file FILE  Configuration file.
  -d, --database TEXT     Scheduling database DSN (imply cls is 'local')
  -u, --url TEXT          Scheduler's url access (imply cls is 'remote')
  --no-stdout             Do NOT output logs on the console
  -h, --help              Show this message and exit.

Commands:
  celery-monitor  Monitoring of Celery
  journal-client  Keep the the origin visits stats table up to date from a...
  origin          Manipulate listed origins.
  rpc-serve       Starts a swh-scheduler API HTTP server.
  simulator       Scheduler simulator.
  start-listener  Starts a swh-scheduler listener service.
  start-runner    Starts a swh-scheduler runner service.
  task            Manipulate tasks.
  task-type       Manipulate task types.

Note:

swhscheduler@scheduler0:~$ dpkg -l python3-swh.scheduler
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                  Version              Architecture Description
+++-=====================-====================-============-=================================
ii  python3-swh.scheduler 0.9.2-1~swh1~bpo10+1 all          Software Heritage Scheduler

Note 2: yes, I've looked at the origin subcommand and it does not seem related...

swhscheduler@scheduler0:~$ /usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/backend.yml origin -h
Usage: swh scheduler origin [OPTIONS] COMMAND [ARGS]...

  Manipulate listed origins.

Options:
  -h, --help  Show this message and exit.

Commands:
  grab-next       Grab the next COUNT origins to visit using the TYPE
                  loader...
  schedule-next   Send the next COUNT origin visits of the TYPE loader to...
  update-metrics  Update the scheduler metrics on listed origins.

Try swh scheduler task schedule_origins

That's it! [1]

Thanks

[1]

swhscheduler@scheduler0:~$ /usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/backend.yml task -h | grep schedule_origins
  schedule_origins  Schedules tasks for origins that are already known.

staging:

This needs storage access, so I edited a dedicated configuration file.

swhscheduler@scheduler0:~$ cat scheduler.yml
---
...
storage:
  cls: remote
  url: http://storage1.internal.staging.swh.network:5002/
...

Then trigger the run:

swhscheduler@scheduler0:~$ /usr/bin/swh scheduler --config-file scheduler.yml task schedule_origins index-origin-metadata
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.11.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/scheduler/cli/task.py", line 364, in schedule_origin_metadata_index
    storage = get_storage("remote", url=storage_url)
  File "/usr/lib/python3/dist-packages/swh/storage/__init__.py", line 68, in get_storage
    storage = Storage(**kwargs)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 230, in __init__
    base_url = url if url.endswith("/") else url + "/"
AttributeError: 'NoneType' object has no attribute 'endswith'

And *sighs*, it does not work...
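
The last frame of the traceback can be reproduced in isolation. This is a minimal sketch, assuming (as the traceback suggests) that the storage URL reaches the client as None when it is not passed on the command line. normalize_base_url is a hypothetical helper name and its guard is an illustration, not something swh.core provides:

```python
from typing import Optional


def normalize_base_url(url: Optional[str]) -> str:
    """Same expression as in the traceback, with an explicit check added."""
    if url is None:
        # Hypothetical guard: the code in the traceback calls url.endswith()
        # directly, which is what raises AttributeError on None.
        raise ValueError("storage URL is missing")
    return url if url.endswith("/") else url + "/"


# The unguarded expression from the traceback, fed None.
error_message = ""
try:
    url = None
    base_url = url if url.endswith("/") else url + "/"
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
print(normalize_base_url("http://storage1.internal.staging.swh.network:5002"))
```

With a URL present, the expression only appends the trailing slash when it is missing.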

Ah no! I misused the CLI. With the right flags:

swhscheduler@scheduler0:~$ /usr/bin/swh scheduler --config-file scheduler.yml task schedule_origins --storage-url http://storage1.internal.staging.swh.network:5002 index-origin-metadata
...
page_token: 9801

page_token: 9901
...
page_token: 500701

Scheduled 50079 tasks (500784 origins).
Done.

Although, now I'm wondering about something.
Is that enough to write what's missing from the topics?

@vlorentz Isn't the filtering done by the indexer (to avoid recomputing what's already computed) preventing the already computed metadata from being written to the topic?

ardumont changed the task status from Open to Work in Progress. Feb 11 2021, 9:41 AM
ardumont moved this task from Backlog to in-progress on the System administration board.

@ardumont no, OriginMetadataIndexer lacks a filter step.

Ok. I shall attend to it for production soon then.
Thanks.
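
To illustrate the point about the filter step, here is a toy sketch. It is not the swh-indexer code: every name in it (compute_metadata, index_with_filter, index_without_filter, already_indexed, journal) is hypothetical, and it assumes that each stored result is also written to the journal topic. It only shows why an indexer without a filter step emits results again for origins that were already indexed:

```python
from typing import Dict, Iterable, List, Set


def compute_metadata(origin: str) -> Dict[str, str]:
    # Stand-in for the real indexing work.
    return {"origin": origin, "metadata": f"metadata for {origin}"}


def index_with_filter(
    origins: Iterable[str], already_indexed: Set[str], journal: List[Dict[str, str]]
) -> None:
    """Skip origins that already have a result, so nothing is re-emitted."""
    for origin in origins:
        if origin in already_indexed:
            continue
        journal.append(compute_metadata(origin))
        already_indexed.add(origin)


def index_without_filter(
    origins: Iterable[str], journal: List[Dict[str, str]]
) -> None:
    """Recompute and emit a result for every origin, known or not."""
    for origin in origins:
        journal.append(compute_metadata(origin))


origins = ["https://example.org/a", "https://example.org/b"]
filtered_journal: List[Dict[str, str]] = []
unfiltered_journal: List[Dict[str, str]] = []

index_with_filter(origins, {"https://example.org/a"}, filtered_journal)
index_without_filter(origins, unfiltered_journal)

print(len(filtered_journal), len(unfiltered_journal))
```

In this toy model, only the unfiltered variant would repopulate a topic with results that were computed before the journal writer existed.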

Running:

swhscheduler@saatchi:~$ /usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/backend.yml task schedule_origins --storage-url http://saam.internal.softwareheritage.org:5002 --batch-size 20 index-origin-metadata | tee /tmp/schedule-origins.txt

Done scheduling:

...
page_token: 152321042

page_token: 152321142

Scheduled 7614775 tasks (152295489 origins).
Done.
ardumont moved this task from deployed/landed to done on the System administration board.
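
The totals reported by both runs are consistent with one task per batch of origins. A quick check, assuming a batch size of 10 for the staging run (that command did not pass --batch-size, so the value is inferred from the totals, not stated anywhere in this task) and the --batch-size 20 used in production:

```python
def expected_tasks(origins: int, batch_size: int) -> int:
    """Number of batches needed to cover all origins (ceiling division)."""
    return (origins + batch_size - 1) // batch_size


# Staging: 500784 origins, batch size assumed to be 10.
staging_tasks = expected_tasks(500_784, 10)
# Production: 152295489 origins, batch size 20 from the command line.
production_tasks = expected_tasks(152_295_489, 20)

print(staging_tasks, production_tasks)
```

Both values match the "Scheduled N tasks" lines in the command output.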