Retrieve fork information for github repositories in swh.lister.github
Closed, Migrated

Description

Our current database for the GitHub lister doesn't store the "parent" repository of a fork. This prevents us from using the smart updater to its fullest, which would mean basing the initial clone of a fork on the data we already have for its parent.

This can be done by making one additional query to the GitHub API for each fork repository.
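As a rough sketch of what that per-fork query could look like: the GitHub REST API returns a `parent` object in the single-repository response (`GET /repos/{owner}/{repo}`) when `fork` is true. The helper names `fetch_repo` and `parent_full_name` below are made up for illustration and are not part of swh.lister.github.

```python
import json
import urllib.request
from typing import Optional

API_ROOT = "https://api.github.com"


def fetch_repo(full_name: str, token: Optional[str] = None) -> dict:
    """Fetch the detailed repository record (one API call per repository)."""
    req = urllib.request.Request(
        f"{API_ROOT}/repos/{full_name}",
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:
        # An authenticated request counts against the account's hourly quota.
        req.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def parent_full_name(repo: dict) -> Optional[str]:
    """Return the parent's "owner/name" for a fork, or None otherwise."""
    if not repo.get("fork"):
        return None
    parent = repo.get("parent")
    if not parent:
        return None
    return parent.get("full_name")
```

Keeping the parsing separate from the HTTP call means the same extraction could be reused on records coming from another source.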

Current estimate: 15 million fork repos / (5000 queries / hour) / (24 hours / day) = 125 days to populate a new table, using a single GitHub account for API access.
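The estimate can be reproduced with a few lines of arithmetic. The `accounts` parameter is an assumption added here for illustration: it supposes that throughput scales linearly with the number of API accounts, which the task does not claim.

```python
FORK_REPOS = 15_000_000      # fork repositories to look up
QUERIES_PER_HOUR = 5000      # API quota per account
HOURS_PER_DAY = 24


def days_to_populate(repos: int, queries_per_hour: int, accounts: int = 1) -> float:
    """Days needed at one query per repository (assumes linear scaling)."""
    hours = repos / (queries_per_hour * accounts)
    return hours / HOURS_PER_DAY


print(days_to_populate(FORK_REPOS, QUERIES_PER_HOUR))  # 125.0
```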

Event Timeline

We could probably cheat by using the data from GHTorrent for the repositories that have already been listed, as this is a one-shot job and stale data is more useful than no data. Worst case scenario: we base our clone on a repo that doesn't exist or doesn't contain the right data, and then we just fall back to regular cloning.
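A minimal sketch of that fallback, assuming two hypothetical callables (`clone_from_parent` and `clone_regular`) that stand in for whatever the loader actually does:

```python
from typing import Callable, Optional


def clone_with_fallback(
    url: str,
    parent_url: Optional[str],
    clone_from_parent: Callable[[str, str], None],
    clone_regular: Callable[[str], None],
) -> str:
    """Try a parent-based clone first; on any failure, clone normally.

    Returns "parent" or "regular" to say which path was taken.
    """
    if parent_url is not None:
        try:
            clone_from_parent(url, parent_url)
            return "parent"
        except Exception:
            # Stale fork data: parent missing or unusable, so ignore it.
            pass
    clone_regular(url)
    return "regular"
```

With this shape, wrong parent information only costs one failed attempt before the regular clone runs.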

olasd renamed this task from "Retrieve fork information for github repositories in swh.loader.github" to "Retrieve fork information for github repositories in swh.lister.github". Feb 23 2016, 1:38 PM
olasd added a project: GitHub lister.
olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:08 PM
olasd claimed this task.

We sidestepped the problem by just importing fork repos as regular repos.