
Decide what metadata we want to / can collect from GitHub
Closed, Resolved, Public


What we can collect is a subset of what we can see here:

The way we'll collect it depends on what info we want; so let's try to list it exhaustively:

| description | priority | REST /repositories | REST /users/{username}/repos (works for orgs too) | REST (specific query) | GraphQL via user (impossible for orgs) | GraphQL direct | comment |
| --- | --- | --- | --- | --- | --- | --- | --- |
| owner avatar + homepage URL | low | free (avatar only) | no | 1 req/user | free | 1 req + 1 point/user | |
| description | high | free | free | 1 req | free | N/A | |
| whether it's a fork | high | free | free | N/A | free | N/A | |
| what it's a fork of | high | no | no | 1 req/forked-repo | 1 point/100repos | 1 req + 1 point/forked-repo | |
| whether it's a mirror | high | no | free | N/A | free | N/A | |
| what it's a mirror of | high | no | free | N/A | free | N/A | |
| created_at / updated_at | high | no | free | N/A | free | N/A | |
| "topics" | high | no | no | 1 preview req/100repos/user | MAX_TOPICS points/100repos/user | 1 req/repo + 1 point/repo (assuming less than 100 topics) | Not available in the "production" REST API |
| stargazers_count / watchers_count | mid | no | free | N/A | free | N/A | |
| list of stargazers / watchers | low | no | no | 1 req/100people | too expensive | 1 req/repo/100people + ceil(1 point/1repo + 1 point/100people) | |
| forks count | mid | no | free | N/A | free | N/A | |
| list of forks | no | no | no | 1 req/100forks/repo | too expensive | ceil(1 req/repo/100forks) + ceil(1 point/1repo + 1 point/100forks) | |
| license | low/mid | no | free | N/A | free | N/A | (GH extracts it from the intrinsic metadata we collect too, so probably not very useful) |
| main language | low/mid | no | free | N/A | no | no | (ditto) |
| all languages | low | no | no | 1 req/repo (assuming <100 languages) | too expensive | 1 req/repo + 1 point/repo (assuming <100 languages) | (ditto) |
| assets | out of scope | | | | | | this should probably be done by a specific loader though; it's closer to a package manager than to metadata |
| release notes | out of scope | | | | | | that aren't on git tags |


  • costs are computed assuming we will send a REST OR GraphQL query per user/org regardless of whether we want the property. These costs are:
    • for REST: ceil(1 req/100repos/user)
    • for GraphQL: ceil(1 req/100repos/user) + ceil(1 point/100repos)
  • Rate-limits are:
    • for REST: 5000req/hour/token
    • for GraphQL: 5000points/hour/token (no req/hour limit AFAICT, but I'm including them in the calculation because they use resources on our side)
  • N/A means it's pointless to send that extra query, as we can get it in a strictly more efficient way
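
To make the baseline arithmetic above concrete, here is a minimal sketch of the per-account cost and of how many accounts one token could cover per hour. The function names are hypothetical, and it assumes (as the notes do) 100 repositories per page and that even an account with no repositories costs one query.

```python
import math

REST_LIMIT_PER_HOUR = 5000     # requests/hour/token (from the notes above)
GRAPHQL_LIMIT_PER_HOUR = 5000  # points/hour/token (from the notes above)
PAGE_SIZE = 100                # repositories per page (assumption from the notes)


def baseline_cost(repo_count: int) -> dict:
    """Baseline cost of listing all repositories of one user/org."""
    # At least one query is needed, even to learn there are no repositories.
    pages = max(1, math.ceil(repo_count / PAGE_SIZE))
    return {
        "rest_requests": pages,
        "graphql_requests": pages,
        "graphql_points": pages,
    }


def accounts_per_hour(repo_count: int) -> dict:
    """How many accounts of this size one token can list per hour."""
    cost = baseline_cost(repo_count)
    return {
        "rest": REST_LIMIT_PER_HOUR // cost["rest_requests"],
        "graphql": GRAPHQL_LIMIT_PER_HOUR // cost["graphql_points"],
    }


if __name__ == "__main__":
    print(baseline_cost(250))
    print(accounts_per_hour(250))
```

Any per-property extra from the table (for example 1 req/forked-repo) would be added on top of this baseline.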

Event Timeline

vlorentz triaged this task as Normal priority. Aug 31 2021, 11:11 AM
vlorentz created this task.
vlorentz updated the task description.

At the moment, I think that all the properties you have selected in the task are needed.
+1 for License (it is something they show on the interface even if it is based on a heuristic).

I agree that assets are a separate story (so they might need a separate loader).
I think that release information is also important, but it should be saved in the ERMDS on the release's SWHID.
Maybe add the latest release as part of the origin metadata (similar to the way GitHub shows it, with a link to the latest release).

> At the moment, I think that all the properties you have selected in the task are needed.

Is it an absolute need? Even for all the ones with question marks?

These are somewhat expensive to fetch, so fetching them will slow down the metadata fetchers.

Here's an opinionated and prioritized list.

Must have:

  • description
  • homepage
  • stargazers_count / watchers_count
  • license/language
  • "topics" (these are the "tags", right?)

Nice to have:

  • forks count
  • whether it's a fork, and of what
  • whether it's a mirror, and of what
  • created_at / updated_at / pushed_at

Don't care but why not if they're cheap:

  • owner avatar/URL
  • list of stargazers / watchers
  • list of forks

Off-topic (for this task):

  • assets? (cf. T17)
  • release notes that aren't on git tags?

I think we should strive to get both the must-haves and the nice-to-haves, and AFAIU it should be relatively easy to do so (except maybe for the "topics"?).

In the "but why not" category, all the lists could grow indefinitely, and we will have the converse information anyway via the "nice to have" items, so I'd pass on that.
I'm not clear on the owner/avatar thing. If it's a single data point per project, it could probably move up to the previous category. If not: what is that?

The off-topics I think belong elsewhere.

> "topics" (these are the "tags", right?)

Yes, but the name "tags" is already taken by git.

> Don't care but why not if they're cheap:

"list of stargazers / watchers" and "list of forks" are very expensive; they'll probably increase our API usage considerably (10 to 100 times is my guess).
So that answers the question.

> I'm not clear on the owner/avatar thing. If it's a single data point per project, it could probably move up the previous category.

It is.

Do we need the "list of forks" if we keep the "fork of what"? I mean, these are the two ends of the fork relation, right?

vlorentz updated the task description.

I updated the task with a breakdown of the cost of getting each piece of information.

In total/summary, to get all the mid and high priority items:

  • REST API:
    • main request: 1 req/100repos/user
    • to get the fork relationships: 1 req/forkedrepo (if querying from child) OR 1req/100forks/repo (if querying from parent)
    • topics: 1 preview req/100 repos/user (possibly mergeable with the main request)
  • GraphQL API:
    • main request: 1 req/100repos/user + 1 point/100repos
    • to get the fork relationships: 1 point/100repos (if querying in the main request) OR 1 req + 1 point/forked-repo (if using a specific request) -> clearly the first option is better, as far more than 1% of repositories are forks
    • topics: MAX_TOPICS points/100repos/user (if querying in the main request) OR 1 req/repo + 1 point/repo. MAX_TOPICS is a constant we can choose, so we could use a hybrid approach by setting it low to catch most repositories, and send specific queries for repos with many topics
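
The hybrid MAX_TOPICS approach described in the last bullet could look roughly like the sketch below. It only builds the query text and decides which repositories would need a follow-up query; it does not send anything. The helper names and the value chosen for MAX_TOPICS are hypothetical. The field names (`repositories`, `isFork`, `parent`, `repositoryTopics`, `totalCount`) are taken from GitHub's GraphQL schema as I understand it and should be checked against the current schema.

```python
MAX_TOPICS = 10  # hypothetical value; a constant we are free to choose

# Query template for the "main request". %d is replaced by the topic limit.
_REPOS_QUERY_TEMPLATE = """
query($login: String!, $cursor: String) {
  user(login: $login) {
    repositories(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes {
        nameWithOwner
        description
        isFork
        parent { nameWithOwner }
        repositoryTopics(first: %d) {
          totalCount
          nodes { topic { name } }
        }
      }
    }
  }
}
"""


def build_repos_query(max_topics: int = MAX_TOPICS) -> str:
    """Return the main-request query, fetching up to max_topics per repo."""
    return _REPOS_QUERY_TEMPLATE % max_topics


def needs_topic_followup(repo_node: dict, max_topics: int = MAX_TOPICS) -> bool:
    """True if the repo has more topics than the main request returned."""
    return repo_node["repositoryTopics"]["totalCount"] > max_topics
```

Only repositories for which `needs_topic_followup` is true would pay the extra 1 req/repo + 1 point/repo.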

Summary's summary: In terms of numbers, GraphQL is more attractive, as we can get slightly more info per account quota. Additionally, its responses are not 10 times larger than needed, whereas REST API responses are, because they include lots of URLs (thanks HATEOAS...)

However, two big disadvantages of the GraphQL API:

  1. we need to know somehow whether an entity is a user or an organization (i.e. try either, and waste 1 req/entity + 1 point/entity if our guess is wrong). Users can be turned into orgs (but not the other way around), so this is not entirely cacheable.
  2. there seems to be a bug that prevents querying repositories for organizations entirely (GitHub devs fixed it)
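
For the first disadvantage, one possible mitigation is to cache the last known kind and fall back to the other one on a miss. The sketch below is an illustration under assumptions, not an agreed design: `fetch_owner` is a hypothetical helper, and `fetchers` is assumed to map "user" and "org" to callables that return None when the login is not of that kind.

```python
from typing import Callable, Dict, Optional, Tuple

Fetcher = Callable[[str], Optional[dict]]


def fetch_owner(
    login: str,
    fetchers: Dict[str, Fetcher],
    kind_cache: Dict[str, str],
) -> Tuple[Optional[dict], int]:
    """Fetch an owner as user or org; return (data, number of wasted queries)."""
    # Start with the cached kind; without a cache entry, guess "user".
    first = kind_cache.get(login, "user")
    second = "org" if first == "user" else "user"
    wasted = 0
    for kind in (first, second):
        data = fetchers[kind](login)
        if data is not None:
            # Remember the kind. A cached "user" can still go stale later,
            # since users can be turned into orgs.
            kind_cache[login] = kind
            return data, wasted
        wasted += 1  # costs 1 req + 1 point per the estimate above
    return None, wasted
```

With this scheme a wrong guess is paid once per entity, plus once more if a user is later converted into an organization.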

Looks like *what* we want to collect is a solved issue.

For the *how*, we'll just have to wait and see.

In summary, we would archive everything with priority "high" or "mid", as well as the "license" and "main language" fields, as they are all easy to fetch and store.
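
As an illustration of that selection, here is a minimal sketch that keeps only those fields from a REST repository payload. `select_metadata` and the output key `fork_of` are hypothetical names. The input keys are the ones I believe the GitHub REST API uses for repository objects; `parent` is assumed to be present only in responses for a single forked repository.

```python
# Fields with priority "high" or "mid" that can be copied as-is.
SIMPLE_FIELDS = (
    "description",
    "fork",               # whether it's a fork
    "mirror_url",         # what it's a mirror of (None if not a mirror)
    "created_at",
    "updated_at",
    "topics",
    "stargazers_count",
    "watchers_count",
    "forks_count",
    "language",           # main language
)


def select_metadata(repo: dict) -> dict:
    """Keep only the fields we decided to archive from a repo payload."""
    selected = {field: repo.get(field) for field in SIMPLE_FIELDS}
    # The license is a nested object; keep only its SPDX identifier.
    license_info = repo.get("license") or {}
    selected["license"] = license_info.get("spdx_id")
    # "What it's a fork of": only available when the payload has a parent.
    parent = repo.get("parent") or {}
    selected["fork_of"] = parent.get("full_name")
    return selected
```

Fields missing from the payload come out as None, so the same function can be applied to list responses and to single-repository responses.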