What we can collect is a subset of what we can see here: https://api.github.com/repos/SoftwareHeritage/swh-core
The way we'll collect it depends on what info we want; so let's try to list it exhaustively:
* owner avatar/URL?
* description
* whether it's a fork, and of what
* whether it's a mirror, and of what
* created_at / updated_at
* pushed_at
* homepage
* "topics" (they don't seem to be in the "main" REST API yet, preview only: https://docs.github.com/en/rest/reference/repos#get-all-repository-topics-preview-notices and https://docs.github.com/en/rest/overview/api-previews#repository-topics)
* stargazers_count / watchers_count
* list of stargazers / watchers?
* forks count
* list of forks?
* license/language? (GH extracts it from the intrinsic metadata we collect too, so probably not very useful)
* assets? This should probably be done by a specific loader though; it's closer to a package manager than to metadata
* release notes that aren't on git tags?

| property | priority | [[ https://docs.github.com/en/rest/reference/repos#list-public-repositories | REST /repositories ]] | [[ https://docs.github.com/en/rest/reference/repos#list-repositories-for-a-user | REST /users/{username}/repos ]] (works for orgs too) | REST (specific query) | GraphQL via user (impossible for orgs) | GraphQL direct | comment |
| -- | -- | -- | -- | -- | -- | -- | -- |
| owner avatar + homepage URL | low | free (avatar only) | no | 1 req/user | free | 1 req + 1 point/user | |
| description | high | free | free | 1 req | free | N/A | |
| whether it's a fork | high | free | free | N/A | free | N/A | |
| what it's a fork of | high | no | no | 1 req/forked repo | 1 point/100 repos | 1 req + 1 point/forked repo | |
| whether it's a mirror | high | no | free | N/A | free | N/A | |
| what it's a mirror of | high | no | free | N/A | free | N/A | |
| created_at / updated_at | high | no | free | N/A | free | N/A | |
| pushed_at | high | free | free | N/A | free | N/A | |
| homepage | high | no | free | N/A | free | N/A | |
| "topics" | high | no | no | 1 [[ https://docs.github.com/en/rest/reference/repos#get-all-repository-topics-preview-notices | preview ]] req/100 repos/user | MAX_TOPICS points/100 repos/user | 1 req/repo + 1 point/repo (assuming less than 100 topics) | |
| stargazers_count / watchers_count | mid | no | free | N/A | free | N/A | |
| list of stargazers / watchers | low | no | no | 1 req/100 people | too expensive | 1 req/repo/100 people + ceil(1 point/1 repo + 1 point/100 people) | |
| forks count | mid | no | free | N/A | free | N/A | |
| list of forks | no | no | no | 1 req/100 forks/repo | too expensive | ceil(1 req/repo/100 forks) + ceil(1 point/1 repo + 1 point/100 forks) | |
| license | low | no | free | N/A | free | N/A | (GH extracts it from the intrinsic metadata we collect too, so probably not very useful) |
| main language | low | no | free | N/A | no | no | (ditto) |
| all languages | low | no | no | 1 req/repo (assuming <100 languages) | too expensive | 1 req/repo + 1 point/repo (assuming <100 languages) | (ditto) |
| assets | out of scope | | | | | | this should probably be done by a specific loader though; it's closer to a package manager than to metadata |
| release notes | out of scope | | | | | | that aren't on git tags |
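
To make the "free" entries of the `REST /users/{username}/repos` column concrete, here is a minimal sketch of how one page of that listing could be reduced to the properties we care about. The JSON key names (`mirror_url`, `forks_count`, etc.) are assumed from what the public repository endpoint returns today, and `listing_url` / `extract_free_metadata` are hypothetical helper names, not existing code of ours.

```python
from typing import Any, Dict

# Properties marked "free" for REST /users/{username}/repos in the table.
# Key names are an assumption based on the JSON visible at
# https://api.github.com/repos/SoftwareHeritage/swh-core
FREE_FIELDS = (
    "description", "fork", "mirror_url", "created_at", "updated_at",
    "pushed_at", "homepage", "stargazers_count", "watchers_count",
    "forks_count", "license", "language",
)


def listing_url(username: str, page: int = 1) -> str:
    """URL of one page (up to 100 repos) of a user's or org's repositories."""
    return f"https://api.github.com/users/{username}/repos?per_page=100&page={page}"


def extract_free_metadata(repo: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the "free" properties of one repository object.

    Keys missing from the response are returned as None.
    """
    return {field: repo.get(field) for field in FREE_FIELDS}
```

Each element of the JSON array returned by `listing_url(...)` would be passed to `extract_free_metadata`; anything not in `FREE_FIELDS` (topics, fork parent, lists of stargazers or forks) needs one of the extra queries from the other columns.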
Anything else?

Notes:
* Costs are computed assuming we will send a REST **OR** GraphQL query per user/org regardless of whether we want the property. These costs are:
  * for REST: ceil(1 req/100 repos/user)
  * for GraphQL: ceil(1 req/100 repos/user) + ceil(1 point/100 repos)
* Rate limits are:
  * for REST: 5000 req/hour/token
  * for GraphQL: 5000 points/hour/token (no req/hour limit AFAICT, but I'm including requests in the calculation because they use resources on our side)
* N/A means it's pointless to send that extra query, as we can get it in a strictly more efficient way
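
As a rough way to sanity-check these numbers, the baseline formulas and rate limits from the notes can be turned into a small budget estimate. This is only a sketch: the function names are made up for illustration, and it covers the baseline listing cost only, not the per-property extras from the table.

```python
import math
from typing import Iterable, Tuple

REST_REQS_PER_HOUR = 5000       # per token, from the notes
GRAPHQL_POINTS_PER_HOUR = 5000  # per token, from the notes


def rest_baseline(n_repos: int) -> int:
    """REST requests to list one user's repos: ceil(1 req/100 repos)."""
    return math.ceil(n_repos / 100)


def graphql_baseline(n_repos: int) -> Tuple[int, int]:
    """(requests, points) to list one user's repos over GraphQL."""
    pages = math.ceil(n_repos / 100)
    return pages, pages


def total_rest_requests(repo_counts: Iterable[int]) -> int:
    """Baseline REST requests for several users, given repos per user."""
    return sum(rest_baseline(n) for n in repo_counts)


def hours_for(cost: int, limit_per_hour: int = REST_REQS_PER_HOUR,
              tokens: int = 1) -> float:
    """Hours needed to spend `cost` requests (or points) under a rate limit."""
    return cost / (limit_per_hour * tokens)
```

For example, three users with 250, 30 and 100 repositories need 3 + 1 + 1 = 5 baseline REST requests, and a crawl costing 10000 requests takes 2 hours with a single token.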