List interesting origins for the content provenance information prototype
List GitHub repositories sorted by stars, until we fill up disk spa^W^W^W^W have a reasonable number of them (10k repos sounds like a good start).

Relevant API:

rDSNIP0d79928 generates a list of repositories with 1k+ stars (pass in a GitHub API login and token as arguments).
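
For reference, here is a minimal sketch of how such a listing could be built on the GitHub search API. It is not the snippet referenced above; the function names (`search_url`, `fetch_json`, `list_repos`) and the choice of `clone_url` as the output field are assumptions made for illustration. The search endpoint returns at most 1000 results per query, so the sketch walks down the star counts in successive windows.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/repositories"


def search_url(min_stars, max_stars=None, page=1):
    """Build a search URL for a star range, most starred first."""
    if max_stars is None:
        query = f"stars:>={min_stars}"
    else:
        query = f"stars:{min_stars}..{max_stars}"
    params = {"q": query, "sort": "stars", "order": "desc",
              "per_page": 100, "page": page}
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url, token):
    """Perform one authenticated GET and decode the JSON body."""
    request = urllib.request.Request(url, headers={
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_repos(token, min_stars=1000, limit=10000, fetch=fetch_json):
    """Collect up to `limit` clone URLs, most starred first.

    `fetch` is injectable (hypothetical hook) so the paging logic can be
    exercised without network access.
    """
    urls, seen, max_stars = [], set(), None
    while len(urls) < limit:
        lowest, new = None, 0
        for page in range(1, 11):  # 10 pages x 100 results = the 1000-result cap
            items = fetch(search_url(min_stars, max_stars, page), token)["items"]
            if not items:
                break
            for item in items:
                lowest = item["stargazers_count"]  # results are sorted descending
                if item["clone_url"] not in seen:
                    seen.add(item["clone_url"])
                    urls.append(item["clone_url"])
                    new += 1
        if new == 0:
            break  # nothing new in this window: we are done (or stuck)
        max_stars = lowest  # next window starts at the lowest count seen
    return urls[:limit]
```

Writing the result one URL per line gives a file that can be loaded with the \copy command used further down. The search endpoint is rate-limited, so a real run would need to pause between requests.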

Getting the unprocessed origins:

create temporary table repos (url text);
\copy repos from '~/github_repos' csv
select id from repos
left join origin using(url)
where type='git' and
      not exists (select 1 from cache_revision_origin cro
                  where origin.id = cro.origin);
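
To check what the query selects, the sketch below runs it against a toy in-memory SQLite database. It assumes the left-hand side of the inner comparison is `origin.id`, and the toy schema (in particular the `revision` column and all sample rows) is made up for the test; only the table and column names used in the query come from the statement above.

```python
import sqlite3

# Same query as above, with the inner comparison assumed to be on origin.id.
QUERY = """
select id from repos
left join origin using(url)
where type = 'git' and
      not exists (select 1 from cache_revision_origin cro
                  where origin.id = cro.origin)
"""


def toy_database():
    """Build an in-memory database with a hypothetical, minimal schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table origin (id integer primary key, type text, url text);
        create table cache_revision_origin (revision text, origin integer);
        create temporary table repos (url text);
    """)
    conn.executemany("insert into origin values (?, ?, ?)", [
        (1, "git", "https://github.com/a/a"),  # listed, already processed
        (2, "git", "https://github.com/b/b"),  # listed, not processed yet
        (3, "hg", "https://github.com/c/c"),   # listed, wrong origin type
        (4, "git", "https://github.com/d/d"),  # known origin, not in the list
    ])
    conn.execute("insert into cache_revision_origin values ('r1', 1)")
    conn.executemany("insert into repos values (?)", [
        ("https://github.com/a/a",),
        ("https://github.com/b/b",),
        ("https://github.com/c/c",),
        ("https://github.com/e/e",),           # listed, unknown to the archive
    ])
    return conn


def unprocessed_origins(conn):
    """Return ids of listed git origins with no cache_revision_origin row."""
    return [row[0] for row in conn.execute(QUERY)]
```

On this toy data only origin 2 comes back. Listed URLs with no matching origin row are dropped as well, because the `type = 'git'` filter rejects the NULLs produced by the left join.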

Adding the tasks to the queue (in SWH_WORKER_INSTANCE=provenance-cache ipython3 on uffizi):

from swh.worker.celery_backend.config import app

ids = [...]
for id in ids:
    app.tasks[''].delay(id, 1)
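
The `ids = [...]` placeholder can be filled from the saved output of the query. A small sketch, assuming the output was redirected to a text file with one id per line; the helper name `read_ids` is hypothetical:

```python
def read_ids(lines):
    """Parse one integer id per line, keeping first occurrences only.

    Lines that are not plain integers (blank lines, the psql column
    header, the dashed separator, the "(N rows)" footer) are skipped.
    """
    ids, seen = [], set()
    for line in lines:
        text = line.strip()
        if not text.isdigit():
            continue
        value = int(text)
        if value not in seen:
            seen.add(value)
            ids.append(value)
    return ids
```

In the ipython3 session this could be used as `ids = read_ids(open(path))`, where `path` is wherever the query output was saved.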