List interesting origins for the content provenance information prototype
Closed, Migrated

Description

List GitHub repositories sorted by stars, until we fill up disk spa^W^W^W^W have a reasonable amount of them (10k repos for a start sounds reasonable).

Relevant API: https://api.github.com/search/repositories?q=stars:%3E=1000&sort=stars&order=desc
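
A minimal sketch of how the listing could be paged through, assuming the search API's `sort`, `order`, `per_page` and `page` query parameters. The helper names (`build_search_url`, `clone_urls`) are hypothetical, and whether `clone_url` is the URL form that matches our origins is an assumption. Note that the search API returns at most 1,000 results per query, so reaching 10k repos would likely need several queries over different star ranges.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(min_stars=1000, page=1, per_page=100):
    """Build a search URL for repos with at least `min_stars` stars, most-starred first."""
    params = {
        "q": f"stars:>={min_stars}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
        "page": page,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def clone_urls(search_response):
    """Extract clone URLs from a decoded search response (a dict with an "items" list)."""
    # Assumption: clone_url is the field we want; html_url would be the alternative.
    return [item["clone_url"] for item in search_response.get("items", [])]
```

Fetching each URL (with the login and token for a higher rate limit) and decoding the JSON body is left out on purpose; only the URL construction and result extraction are shown.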

Event Timeline

zack added a parent task: Unknown Object (Maniphest Task). Aug 30 2016, 2:04 PM

rDSNIP0d79928 generates a list of repositories with 1k+ stars (pass a GitHub API login and token as arguments).

Getting the unprocessed origins:

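-- Load the list of repository URLs into a temporary table
create temporary table repos (url text);
\copy repos from '~/github_repos' csv
-- Select the git origins from that list that have no entry in cache_revision_origin yet.
-- The type='git' filter drops repos rows without a matching origin.
select origin.id from repos
left join origin using(url)
where type='git' and
      not exists (select 1 from cache_revision_origin cro
                  where origin.id = cro.origin);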
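
For readers less used to the `not exists` idiom, here is the same selection restated in plain Python. This is only an illustration on in-memory data, not something that was run against the database; the function name and the tuple layout are hypothetical.

```python
def unprocessed_origin_ids(repo_urls, origins, cached_origin_ids):
    """Return ids of git origins whose URL is in `repo_urls` and which
    have no entry in the cache yet.

    origins: iterable of (id, type, url) tuples (hypothetical layout).
    cached_origin_ids: origin ids already present in cache_revision_origin.
    """
    wanted = set(repo_urls)
    done = set(cached_origin_ids)
    return [
        origin_id
        for (origin_id, origin_type, url) in origins
        if origin_type == "git" and url in wanted and origin_id not in done
    ]
```

Unlike the SQL query, this returns each origin at most once even if a URL appears several times in the input list.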

Adding the tasks to the queue (in SWH_WORKER_INSTANCE=provenance-cache ipython3 on uffizi):

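from swh.worker.celery_backend.config import app
import swh.storage.provenance.tasks  # imported so the task below is registered with the app

# Look up the task once instead of on every iteration
task = app.tasks['swh.storage.provenance.tasks.PopulateCacheRevisionOrigin']

ids = [...]  # origin ids returned by the query above
for origin_id in ids:  # renamed to avoid shadowing the builtin id()
    task.delay(origin_id, 1)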
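
The `ids` list is left elided above. One way to fill it without pasting, assuming the query output was saved to a text file with one id per line (the helper name and the file are hypothetical):

```python
def load_origin_ids(path):
    """Read one integer origin id per line, skipping blank lines."""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]
```

The result can then be assigned to `ids` in the ipython3 session before running the loop.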