
save code now: also add new origins for unknown repos
Closed, Resolved (Public)

Description

When we save an unknown origin due to a Save code now request, we schedule a one-shot task for the ingestion, but don't add the origin for future crawling. It might make sense to do both.

It is possibly also the only reasonable place where we can have heuristics to de-duplicate URLs that point to the same repo, e.g., non-canonical GitHub repo URLs.
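As an illustration of what such a de-duplication heuristic could look like, here is a minimal sketch. It is not code from the webapp: `canonical_github_url` is a hypothetical helper, and the rules it applies (ignore case, drop a leading `www.`, drop a trailing `.git`, keep only the owner and repository path segments) are assumptions about which GitHub URL variants point to the same repo.

```python
from urllib.parse import urlsplit


def canonical_github_url(url: str) -> str:
    """Return a normalized form of a GitHub repository URL.

    Hypothetical helper: non-GitHub URLs, and GitHub URLs without both an
    owner and a repository segment, are returned unchanged.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    if host != "github.com":
        return url

    # Keep only the first two path segments: owner and repository name.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        return url
    owner, repo = segments[0], segments[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]

    # Assumption: owner and repository names are matched case-insensitively,
    # so lowercasing them does not change which repository is targeted.
    return f"https://github.com/{owner.lower()}/{repo.lower()}"
```

Two save requests would then count as duplicates when their canonical forms are equal, for instance `https://github.com/SoftwareHeritage/swh-web.git` and `http://www.GitHub.com/SoftwareHeritage/swh-web/`.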

(Thanks @singpolyma for the heads-up.)

Related to T1110
Related to T2187

Event Timeline

zack triaged this task as Low priority. Feb 10 2019, 1:25 PM
zack created this task.
ardumont raised the priority of this task from Low to Normal. Jun 2 2021, 12:39 PM
ardumont updated the task description.

It is possibly also the only reasonable place where we can have heuristics to
de-duplicate URLs that point to the same repo, e.g., non-canonical GitHub repo URLs.

That concern will be kept out of this task for now. It can be dealt with alongside [1]

[1] T2187#65715

When we save an unknown origin due to a Save code now request, we schedule a one-shot
task for the ingestion, but don't add the origin for future crawling. It might make
sense to do both.

It *makes* sense.

Implementation-wise, we treat "save code now" as a lister: it records origins in the
scheduler model, and those origins then get scheduled by the "next-gen" scheduler.
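To make the idea concrete, here is a small in-memory sketch of treating "save code now" as a lister. It is not the webapp or scheduler code: `InMemoryScheduler`, `ListedOrigin` and `on_save_request_done` are hypothetical stand-ins written for this example, and the `"succeeded"` status string is an assumption. Only the lister name `save-code-now` comes from this task.

```python
from dataclasses import dataclass, field
from typing import Dict, Set
from uuid import UUID, uuid4


@dataclass(frozen=True)
class ListedOrigin:
    """One origin known to the scheduler (hypothetical stand-in model)."""
    lister_id: UUID
    url: str
    visit_type: str


@dataclass
class InMemoryScheduler:
    """Toy replacement for the scheduler storage, for illustration only."""
    listers: Dict[str, UUID] = field(default_factory=dict)
    listed_origins: Set[ListedOrigin] = field(default_factory=set)

    def get_or_create_lister(self, name: str) -> UUID:
        # Reuse the lister id if the lister is already registered.
        return self.listers.setdefault(name, uuid4())

    def record_listed_origin(self, origin: ListedOrigin) -> None:
        # A set makes recording the same origin twice a no-op.
        self.listed_origins.add(origin)


def on_save_request_done(
    scheduler: InMemoryScheduler, url: str, visit_type: str, status: str
) -> bool:
    """Record the origin for future crawling if the save request succeeded."""
    if status != "succeeded":  # assumed status value
        return False
    lister_id = scheduler.get_or_create_lister("save-code-now")
    scheduler.record_listed_origin(ListedOrigin(lister_id, url, visit_type))
    return True
```

The point of the design is that the one-shot ingestion stays as it is, while a successful request additionally leaves a listed origin behind for the recurring scheduling to pick up.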

The first part is deployed (a change to the webapp routines that update the save code
now statuses).

The webapp now records successful save code now origins in the listed origins models of
the scheduler. [1]

What remains is to deploy the runner that actually consumes those origins regularly (or maybe that is already the case; I need to check that part).

[1]

softwareheritage-scheduler=> select now(), count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id where name='save-code-now';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-16 06:36:44.269692+00 |    22 |
+-------------------------------+-------+
(1 row)

Time: 27.663 ms

What remains is to deploy the runner that actually consumes those origins regularly (or maybe that is already the case; I need to check that part).

So that part [1] is not actually deployed; there is a dedicated task for it.

[1] T2345

The scheduler is getting there.
We are now able to trigger a runner for that part:

(ve) swhscheduler@saatchi:~$ swh scheduler -C $SWH_CONFIG_FILENAME origin send-to-celery --policy origins_without_last_update --lister-uuid '860d41f8-d0c0-4733-a4d8-437c386bc31f' --queue save_code_now:swh.loader.git.tasks.UpdateGitRepository git
10000 slots available in celery queue
5348 visits to send to celery

The UUID is the lister id of the save code now lister.
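The output above (10000 slots available, 5348 visits sent) suggests the runner sends at most as many visits as the queue has free slots. A rough sketch of that selection follows. It is not the swh-scheduler implementation: `Origin` and `pick_visits` are hypothetical, and reading `origins_without_last_update` as "origins of this lister whose `last_update` is unset" is an assumption based on the policy name.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Origin:
    """Minimal stand-in for a listed origin (hypothetical)."""
    url: str
    lister_id: str
    last_update: Optional[datetime] = None


def pick_visits(
    origins: List[Origin], lister_id: str, slots_available: int
) -> List[Origin]:
    """Select origins to send to the queue.

    Assumed policy: keep the origins of the given lister that have no
    last_update, then cap the result at the number of free queue slots.
    """
    candidates = [
        o for o in origins
        if o.lister_id == lister_id and o.last_update is None
    ]
    return candidates[: max(slots_available, 0)]
```

With more free slots than candidates, as in the run above, every candidate is sent; otherwise the remainder waits for a later run.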

It's regularly crawled now so closing this.