
semi-automated addition of new "forges"
Closed, Migrated


Use case: extend archive coverage to a specific GitLab instance (specified by URL) as seamlessly as possible.

(The obvious generalization is replacing GitLab with any kind of supported listable source code origin out there, e.g., another Debian-like distro, another PyPI instance, etc.)

With a single command, we can currently (1) add an entry to the list of "forges" being listed.
What we lack to implement the "as seamlessly as possible" part above is:

(2) immediately do the full listing, with high scheduling priority
(3) once (2) is done, immediately load all listed origins, with high scheduling priority
(4) bonus point: notify the user once (3) is done

As another bonus point, having the above doable with a single CLI command would be great.

Once we have this, it will be the obvious building block of a "save forge now" user-visible functionality in the Web UI (which will be tracked in a separate task).

Event Timeline

zack triaged this task as Normal priority. Feb 21 2019, 8:16 PM
zack created this task.
zack added a project: Scheduling utilities.

As a test case, we've been asked to archive this small (for now) GitLab instance:
We can easily find tons of other small public instances for further testing.

D1504 is related as it proposes a way to remove lister setup steps (db model adaptation, adding new lister task type in scheduler db). This works towards the "seamlessly as possible" part.

  • (1) add an entry to the list of "forges" being listed.
  • (2) immediately do the full listing, with high scheduling priority

As of today, with the current state of tooling and infra, we can do the following in 2 CLI
calls. The tooling has evolved a bit and an API for this more or less exists.

For each numbered need, here is the CLI call that addresses it:

(1) Provided the task is already registered in the scheduler [3]:



# default: gitlab instance (full or incremental listing, depends on the lister)
$ swh scheduler task add list-gitlab-(incremental|full) url=<url>

# heptapod instance (specific gitlab-full which can list other dvcs than git)
$ swh scheduler task add list-gitlab-full url=<url> name=heptapod

# cgit instance
$ swh scheduler task add list-cgit url=<url> ...
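
The invocations above share one shape: a task type followed by key=value arguments. A minimal dry-run sketch of that shape follows. `build_list_task_cmd` is a hypothetical helper, not part of the swh tooling, and the URL is a placeholder; the function only prints the command, so nothing is sent to the scheduler.

```shell
# Hypothetical helper (not part of swh): print the `swh scheduler task add`
# command for a lister task type and forge URL, plus any extra key=value
# arguments, without executing it.
build_list_task_cmd() {
    task_type="$1"   # e.g. list-gitlab-full, list-cgit
    url="$2"         # forge URL
    shift 2
    cmd="swh scheduler task add $task_type url=$url"
    # Append the remaining key=value arguments (e.g. name=heptapod) verbatim.
    for extra in "$@"; do
        cmd="$cmd $extra"
    done
    printf '%s\n' "$cmd"
}

# Example: the heptapod case, with a placeholder URL.
build_list_task_cmd list-gitlab-full https://forge.example.org name=heptapod
```

Printing rather than running keeps the sketch safe to try on a machine without scheduler access; piping the output to `sh` would be the obvious next step once the command looks right.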

As usual, the devil is in the details... Besides the url, which is a common parameter for
all listers, there can exist:

  • multiple listing natures: full or incremental, depending on the lister implementation. I gather we could immediately schedule a full listing and then schedule an incremental visit starting the very next day.
  • extra parameters: for example, the cgit lister has an optional base_git_url. It's useful when the listing scheme differs from the main url: origins are then only listed as url suffixes, to which we prepend the base_git_url so that the origins can actually be ingested.
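
To illustrate the base_git_url point, here is a small sketch of how a listed suffix could be combined with the base url to form an origin url. This is not the cgit lister's actual code; `join_origin_url` and the example urls are hypothetical.

```shell
# Hypothetical illustration (not the cgit lister's code): build an origin URL
# by prepending base_git_url to a listed suffix, with a single "/" between.
join_origin_url() {
    base="${1%/}"     # drop one trailing slash from the base, if any
    suffix="${2#/}"   # drop one leading slash from the suffix, if any
    printf '%s/%s\n' "$base" "$suffix"
}

# Prints: https://git.example.org/project/repo.git
join_origin_url https://git.example.org/ /project/repo.git
```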

(2) We now have at least one special worker which consumes tasks from dedicated queues:
oneshot:swh.loader.git.tasks.UpdateGitRepository, ... That was the loader in charge of
ingesting the sourceforge origins without putting too much stress on the upstream forge.

I guess we could generalize this to multiple dedicated workers specialized in consuming the
first listing of new forges. Subsequent listings are then dealt with normally like the

In any case, the cli I was referring to is:

$ swh scheduler origin send-to-celery \
  --lister-uuid $lister_uuid \
  --queue $queue \
  --policy never_visited_oldest_update_first \
  $visit_type

  • $visit_type in {git, svn, hg, ...}
  • $queue in {oneshot:swh.loader.git.tasks.UpdateGitRepository, ...}

Note: The current limit of that cli is that it requires a dedicated call per visit type, so
multiple calls would be needed for forges supporting multiple DVCSs. In that regard, the
remaining scheduler cog defined in [4] could most likely help. Perhaps we could, for
example, allow it a particular scheduling policy for first-visit origins?
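
Until then, a multi-DVCS forge needs one call per visit type. A dry-run sketch of that loop follows, under stated assumptions: `print_send_to_celery_calls` is a hypothetical helper, the visit type is assumed to be the trailing positional argument, and only the git queue name comes from this thread. The queue for any other visit type has to be supplied by the caller.

```shell
# Hypothetical helper (not part of swh): print one `send-to-celery` command
# per visit type, since the cli handles a single visit type per call.
# Usage: print_send_to_celery_calls LISTER_UUID VISIT_TYPE=QUEUE...
print_send_to_celery_calls() {
    lister_uuid="$1"
    shift
    for pair in "$@"; do
        visit_type="${pair%%=*}"   # text before the first "="
        queue="${pair#*=}"         # text after the first "="
        # Assumption: the visit type is the trailing positional argument.
        printf 'swh scheduler origin send-to-celery --lister-uuid %s --queue %s --policy never_visited_oldest_update_first %s\n' \
            "$lister_uuid" "$queue" "$visit_type"
    done
}

# Example with a placeholder lister uuid and the git queue named above.
print_send_to_celery_calls "<lister-uuid>" \
    git=oneshot:swh.loader.git.tasks.UpdateGitRepository
```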


[4] T3667