Now that the idx-storage writes to Kafka, we need to backfill.
Rather than writing code to read from the database and write to Kafka (as we did with swh-storage), this can be done simply by re-indexing all the origins, using swh scheduler schedule_origins
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T1523 Search tools on metadata
Migrated | gitlab-migration | T1117 Origin search is *slow* when you look for very common words
Migrated | gitlab-migration | T1910 Redesign origin search using a dedicated component (swh-search)
Migrated | gitlab-migration | T2052 Publish swh-search on PyPI
Migrated | gitlab-migration | T2167 Deploy swh-search
Migrated | gitlab-migration | T2174 Add debian package for swh-search
Migrated | gitlab-migration | T2182 Switch production swh-web to use swh-search instead of postgresql search.
Migrated | gitlab-migration | T2590 Finish the indexer -> swh-search pipeline
Migrated | gitlab-migration | T3037 Reschedule origin-intrinsic-metadata tasks for all origins
Migrated | gitlab-migration | T2780 Enable the journal-writer for the swh-idx-storage in production
Rather than write code to read from the database to kafka (like we did with swh-storage), this can be done simply by re-indexing all the origins, using swh scheduler schedule_origins
(As discussed elsewhere) True if we focus solely on origins for now.
Less so if we also want to deal with "content" indexers later ;)
The suggested CLI command does not show up, but I've only taken a quick glance ¯\_(ツ)_/¯:
    swhscheduler@scheduler0:~$ /usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/backend.yml -h
    Usage: swh scheduler [OPTIONS] COMMAND [ARGS]...

      Software Heritage Scheduler tools.

      Use a local scheduler instance by default (plugged to the main scheduler db).

    Options:
      -C, --config-file FILE  Configuration file.
      -d, --database TEXT     Scheduling database DSN (imply cls is 'local')
      -u, --url TEXT          Scheduler's url access (imply cls is 'remote')
      --no-stdout             Do NOT output logs on the console
      -h, --help              Show this message and exit.

    Commands:
      celery-monitor  Monitoring of Celery
      journal-client  Keep the the origin visits stats table up to date from a...
      origin          Manipulate listed origins.
      rpc-serve       Starts a swh-scheduler API HTTP server.
      simulator       Scheduler simulator.
      start-listener  Starts a swh-scheduler listener service.
      start-runner    Starts a swh-scheduler runner service.
      task            Manipulate tasks.
      task-type       Manipulate task types.
Note:
    swhscheduler@scheduler0:~$ dpkg -l python3-swh.scheduler
    Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name                  Version              Architecture Description
    +++-=====================-====================-============-=================================
    ii  python3-swh.scheduler 0.9.2-1~swh1~bpo10+1 all          Software Heritage Scheduler
Note 2: yes, I've looked at the origin subcommand and it does not seem related...
    swhscheduler@scheduler0:~$ /usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/backend.yml origin -h
    Usage: swh scheduler origin [OPTIONS] COMMAND [ARGS]...

      Manipulate listed origins.

    Options:
      -h, --help  Show this message and exit.

    Commands:
      grab-next       Grab the next COUNT origins to visit using the TYPE loader...
      schedule-next   Send the next COUNT origin visits of the TYPE loader to...
      update-metrics  Update the scheduler metrics on listed origins.
That's it! [1]
Thanks
[1]
    swhscheduler@scheduler0:~$ /usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/backend.yml task -h | grep schedule_origins
      schedule_origins  Schedules tasks for origins that are already known.
staging:
This needs storage access, so I wrote a dedicated configuration file:

    swhscheduler@scheduler0:~$ cat scheduler.yml
    ---
    ...
    storage:
      cls: remote
      url: http://storage1.internal.staging.swh.network:5002/
    ...
Then trigger the run:
    swhscheduler@scheduler0:~$ /usr/bin/swh scheduler --config-file scheduler.yml task schedule_origins index-origin-metadata
    Traceback (most recent call last):
      File "/usr/bin/swh", line 11, in <module>
        load_entry_point('swh.core==0.11.0', 'console_scripts', 'swh')()
      File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
        return swh(auto_envvar_prefix="SWH")
      File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
        return self.main(*args, **kwargs)
      File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
        rv = self.invoke(ctx)
      File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
        return callback(*args, **kwargs)
      File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/usr/lib/python3/dist-packages/swh/scheduler/cli/task.py", line 364, in schedule_origin_metadata_index
        storage = get_storage("remote", url=storage_url)
      File "/usr/lib/python3/dist-packages/swh/storage/__init__.py", line 68, in get_storage
        storage = Storage(**kwargs)
      File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 230, in __init__
        base_url = url if url.endswith("/") else url + "/"
    AttributeError: 'NoneType' object has no attribute 'endswith'
And *sighs*, it does not work...
Ah no! I misused the CLI. With the right flags:
    swhscheduler@scheduler0:~$ /usr/bin/swh scheduler --config-file scheduler.yml task schedule_origins --storage-url http://storage1.internal.staging.swh.network:5002 index-origin-metadata
    ...
    page_token: 9801
    page_token: 9901
    ...
    page_token: 500701
    Scheduled 50079 tasks (500784 origins).
    Done.
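For anyone hitting the same AttributeError: the last frame of the traceback shows the storage URL reaching the client as None when --storage-url is not passed. Below is a minimal sketch of that URL normalization with an explicit check added. The function name normalize_base_url and the error message are hypothetical and only illustrate the failure mode; this is not the actual swh.core code.

```python
from typing import Optional


def normalize_base_url(url: Optional[str]) -> str:
    """Append a trailing slash, as in the traceback's last frame.

    Hypothetical helper: the explicit None check is an addition for
    illustration, so the missing flag is reported clearly.
    """
    if url is None:
        raise ValueError("storage URL is missing; pass --storage-url")
    return url if url.endswith("/") else url + "/"


# URL taken from the staging command above.
print(normalize_base_url("http://storage1.internal.staging.swh.network:5002"))
```

With the URL passed on the command line, the value is a string and the trailing-slash logic applies as usual.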
Although, now I'm wondering something.
Is that enough to write what's not in the topics?
@vlorentz Isn't the filtering done by the indexer (to avoid recomputing what's already computed) preventing the already-computed metadata from being written to the topic?
@ardumont no, OriginMetadataIndexer lacks a filter step.
Ok. I shall attend to it for production soon then.
Thanks.
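To make the point about the filter step concrete, here is a toy sketch of the concern raised above. It is not swh-indexer code: origins_to_process, its parameters, and the example URLs are all hypothetical. It only shows why a filter step would have skipped already-indexed origins (and so written nothing for them to the topic), while an indexer without one processes everything again.

```python
from typing import Iterable, List, Set


def origins_to_process(
    origins: Iterable[str], already_indexed: Set[str], apply_filter: bool
) -> List[str]:
    """Return the origins a re-indexing run would actually process."""
    if apply_filter:
        # A filter step drops origins whose metadata is already stored.
        return [o for o in origins if o not in already_indexed]
    # Without a filter step, every scheduled origin is processed again.
    return list(origins)


origins = ["https://example.org/a", "https://example.org/b"]
already_indexed = {"https://example.org/a"}

print(origins_to_process(origins, already_indexed, apply_filter=False))
print(origins_to_process(origins, already_indexed, apply_filter=True))
```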
Running:
    swhscheduler@saatchi:~$ /usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/backend.yml task schedule_origins --storage-url http://saam.internal.softwareheritage.org:5002 --batch-size 20 index-origin-metadata | tee /tmp/schedule-origins.txt
Done scheduling:
    ...
    page_token: 152321042
    page_token: 152321142
    Scheduled 7614775 tasks (152295489 origins).
    Done.
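As a sanity check on the reported totals, the task counts match one task per batch of origins, rounded up. The sketch below assumes a batch size of 10 for the staging run; that value is inferred from the numbers, since no --batch-size was passed there. The production run used --batch-size 20 as shown in the command. The helper name expected_task_count is hypothetical.

```python
def expected_task_count(origin_count: int, batch_size: int) -> int:
    """Number of tasks needed when each task carries up to batch_size origins."""
    # Integer ceiling division, avoiding floating point.
    return (origin_count + batch_size - 1) // batch_size


# Staging: 500784 origins, assumed batch size of 10.
print(expected_task_count(500_784, 10))        # 50079

# Production: 152295489 origins, --batch-size 20.
print(expected_task_count(152_295_489, 20))    # 7614775
```

Both results agree with the "Scheduled ... tasks" lines in the command output.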