List GitHub repositories sorted by stars, until we fill up disk spa^W^W^W^W have a reasonable amount of them (10k repos for a start sounds reasonable).
Relevant API: https://api.github.com/search/repositories?q=stars:%3E=1000&order=stars
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T547 Azure prototype: Content provenance information API
Migrated | gitlab-migration | T551 List interesting origins for the content provenance information prototype
rDSNIP0d79928 generates a list of repositories with 1000+ stars (pass a GitHub API login and token as arguments).
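For reference, a minimal sketch of that kind of listing (this is not the actual rDSNIP0d79928 snippet; the credential placeholders, function name and page cut-off are illustrative). Note that the GitHub search API caps results at 1000 per query, so collecting 10k repositories means slicing the query further, e.g. by star ranges:

```python
# Minimal sketch, not the rDSNIP0d79928 snippet: page through the GitHub
# search API and print clone URLs of repositories with >= 1000 stars.
import requests

GITHUB_LOGIN = 'someuser'   # placeholder credentials
GITHUB_TOKEN = 'sometoken'  # placeholder credentials
API = 'https://api.github.com/search/repositories'


def list_starred_repos(min_stars=1000, per_page=100, max_pages=10):
    """Yield clone URLs of repositories with at least `min_stars` stars.

    The search API stops at 1000 results per query (10 pages of 100),
    so a real run needs several queries over disjoint star ranges.
    """
    for page in range(1, max_pages + 1):
        r = requests.get(
            API,
            params={'q': 'stars:>=%d' % min_stars,
                    'sort': 'stars', 'order': 'desc',
                    'per_page': per_page, 'page': page},
            auth=(GITHUB_LOGIN, GITHUB_TOKEN),
        )
        r.raise_for_status()
        items = r.json().get('items', [])
        if not items:
            break
        for repo in items:
            yield repo['clone_url']


if __name__ == '__main__':
    for url in list_starred_repos():
        print(url)
```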
Getting the unprocessed origins:
```sql
create temporary table repos (url text);
\copy repos from '~/github_repos' csv
select id
from repos
left join origin using (url)
where type = 'git'
  and not exists (
    select 1 from cache_revision_origin cro where origin.id = cro.origin
  );
```
Adding the tasks to the queue (in `SWH_WORKER_INSTANCE=provenance-cache ipython3` on uffizi):

```python
from swh.worker.celery_backend.config import app
import swh.storage.provenance.tasks  # noqa: imported so the tasks get registered with the app

ids = [...]

for id in ids:
    app.tasks['swh.storage.provenance.tasks.PopulateCacheRevisionOrigin'].delay(id, 1)
```