Decide what metadata we want to / can collect from GitHub
Closed, Migrated

Assigned To

Authored By

	vlorentz
	Aug 31 2021, 11:11 AM

Description

What we can collect is a subset of what we can see here: https://api.github.com/repos/SoftwareHeritage/swh-core

The way we'll collect it depends on what info we want; so let's try to list it exhaustively:

description	priority	REST /repositories	REST /users/{username}/repos (works for orgs too)	REST (specific query)	GraphQL via user (impossible for orgs)	GraphQL direct	comment
owner avatar + homepage URL	low	free (avatar only)	no	1 req/user	free	1 req + 1 point/user
description	high	free	free	1 req	free	N/A
whether it's a fork	high	free	free	N/A	free	N/A
what it's a fork of	high	no	no	1 req/forkedrepo	1 point/100repos	1 req + 1 point/forked-repo
whether it's a mirror	high	no	free	N/A	free	N/A
what it's a mirror of	high	no	free	N/A	free	N/A
created_at / updated_at	high	no	free	N/A	free	N/A
pushed_at	high	free	free	N/A	free	N/A
homepage	high	no	free	N/A	free	N/A
"topics"	high	no	no	1 preview req/100 repos/user	MAX_TOPICS points/100repos/user	1 req/repo + 1 point/repo (assuming less than 100 topics)	Not available in the "production" REST API
stargazers_count / watchers_count	mid	no	free	N/A	free	N/A
list of stargazers / watchers	low	no	no	1 req/100people	too expensive	1 req/repo/100people + ceil(1 point/1repo + 1 point/100people)
forks count	mid	no	free	N/A	free	N/A
list of forks	no	no	no	1req/100forks/repo	too expensive	ceil(1 req/repo/100forks) + ceil(1 point/1repo + 1 point/100forks)
license	low/mid	no	free	N/A	free	N/A	(GH extracts it from the intrinsic metadata we collect too, so probably not very useful)
main language	low/mid	no	free	N/A	no	no	(ditto)
all languages	low	no	no	1req/repo (assuming <100 languages)	too expensive	1 req/repo + 1 point/repo (assuming <100 languages)	(ditto)
assets	out of scope						this should probably be done by a specific loader though; it's closer to a package manager than to metadata
release notes	out of scope						that aren't on git tags

Notes:

Costs are computed assuming we will send a REST or GraphQL query per user/org regardless of whether we want the property. These costs are:
- for REST: ceil(1 req/100repos/user)
- for GraphQL: ceil(1 req/100repos/user) + ceil(1 point/100repos)
Rate-limits are:
- for REST: 5000req/hour/token
- for GraphQL: 5000points/hour/token (no req/hour limit AFAICT, but I'm including them in the calculation because they use resources on our side)
N/A means it's pointless to send that extra query, as we can get it in a strictly more efficient way
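
To make the cost notes above concrete, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and it assumes that an account with zero repositories still costs one request, which the notes do not state.

```python
import math

# Figures taken from the notes above.
REST_REQS_PER_HOUR = 5000        # per token
GRAPHQL_POINTS_PER_HOUR = 5000   # per token
PAGE_SIZE = 100                  # repos returned per page


def pages(n_repos: int) -> int:
    """Pages needed to list one account's repos (assumed: at least one)."""
    return max(1, math.ceil(n_repos / PAGE_SIZE))


def rest_cost(n_repos: int) -> int:
    """REST requests for the main listing of one account."""
    return pages(n_repos)


def graphql_cost(n_repos: int) -> tuple:
    """(requests, points) for the main GraphQL listing of one account."""
    return pages(n_repos), pages(n_repos)


def accounts_per_hour_rest(repo_counts: list) -> float:
    """Accounts one token can list per hour, given a sample of repo counts."""
    total_reqs = sum(rest_cost(n) for n in repo_counts)
    return REST_REQS_PER_HOUR * len(repo_counts) / total_reqs
```

For example, a sample of three accounts with 5, 250 and 100 repositories costs 1 + 3 + 1 = 5 requests, so one token would cover about 3000 such accounts per hour over REST.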

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T4283 Load https://github.com/chromium/chromium with a higher packfile size limit
Migrated	gitlab-migration	T3273 Use "fork" relationships to speed-up initial load of large repositories
Migrated	gitlab-migration	T2201 Indexing / mining
Migrated	gitlab-migration	T2202 Collect extrinsic metadata
Migrated	gitlab-migration	T833 When listing an origin, add origin level metadata to RMD storage
Migrated	gitlab-migration	T2693 fetch extrinsic origin metadata from GitLab instances
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T1102 Handle all GitHub elements
Migrated	gitlab-migration	T1740 fetch extrinsic origin metadata from GitHub
Migrated	gitlab-migration	T1344 Write specs about metadata workflow
Migrated	gitlab-migration	T1738 Define and specify extrinsic origin metadata
Migrated	gitlab-migration	T1739 Define an architecture to fetch extrinsic metadata outside listers and loaders
Migrated	gitlab-migration	T1737 Define and specify metadata providers
Migrated	gitlab-migration	T1747 Review APIs to get metadata from supported origins
Migrated	gitlab-migration	T3542 Decide what metadata we want to / can collect from GitHub

Event Timeline

vlorentz triaged this task as Normal priority. Aug 31 2021, 11:11 AM

vlorentz created this task.

vlorentz updated the task description. Aug 31 2021, 11:15 AM

vlorentz updated the task description.

vlorentz updated the task description. Aug 31 2021, 11:52 AM

vlorentz updated the task description.

vlorentz updated the task description. Aug 31 2021, 11:55 AM

vlorentz updated the task description. Aug 31 2021, 11:58 AM

At the moment, I think that all the properties you have selected in the task are needed.
+1 for License (it is something they show on the interface even if it is based on a heuristic).

I agree that assets are a separate story (so they might need a separate loader).
I think that release information is also important but should be saved in the ERMDS on the release's SWHID.
Maybe add the latest release as part of the origin metadata (similar to the way GitHub shows a link to the latest release).

In T3542#69656, @moranegg wrote:

At the moment, I think that all the properties you have selected in the task are needed.

Is it an absolute need? Even for all the ones with question marks?

These are somewhat expensive to fetch, so they will slow down the metadata fetchers.

Here's an opinionated and prioritized list.

Must have:

description
homepage
stargazers_count / watchers_count
license/language
"topics" (these are the "tags", right?)

Nice to have:

forks count
whether it's a fork, and of what
whether it's a mirror, and of what
created_at / updated_at / pushed_at

Don't care but why not if they're cheap:

owner avatar/URL
list of stargazers / watchers
list of forks

Off-topic (for this task):

assets? (cf. T17)
release notes that aren't on git tags?

I think we should strive to get both the "must have" and the "nice to have" items, and AFAIU it should be relatively easy to do so (except maybe for the "topics"?).

In the "but why not" category, all the lists could grow indefinitely, and we will have the converse information anyway via the "nice to have" items, so I'd pass on that.
I'm not clear on the owner/avatar thing. If it's a single data point per project, it could probably move up to the previous category. If not: what is that?

The off-topics I think belong elsewhere.

"topics" (these are the "tags", right?)

yes, but the name "tags" is already taken by git

Don't care but why not if they're cheap:

"list of stargazers / watchers" and "list of forks" are very expensive; they'll probably increase our API usage considerably (10 to 100 times is my guess).
So that answers the question.

I'm not clear on the owner/avatar thing. If it's a single data point per project, it could probably move up to the previous category.

It is.

do we need the "list of forks" if we keep the "fork of what"? I mean these are the 2 ends of the fork relation, right?

no and yes, respectively

vlorentz updated the task description. Sep 1 2021, 4:03 PM

vlorentz updated the task description.

vlorentz updated the task description. Sep 2 2021, 11:54 AM

vlorentz updated the task description.

vlorentz updated the task description. Sep 2 2021, 12:04 PM

I updated the task with a breakdown of the cost of getting each info.

In total/summary, to get all the mid and high priority items:

REST API:
- main request: 1 req/100repos/user
- to get the fork relationships: 1 req/forkedrepo (if querying from child) OR 1req/100forks/repo (if querying from parent)
- topics: 1 preview req/100 repos/user (possibly mergeable with the main request)
GraphQL API:
- main request: 1 req/100repos/user + 1 point/100repos
- to get the fork relationships: 1 point/100repos (if querying in the main request) OR 1 req + 1 point/forked-repo (if using a specific request) -> clearly the first option is better, as far more than 1% of repositories are forks
- topics: MAX_TOPICS points/100repos/user (if querying in the main request) OR 1 req/repo + 1 point/repo. MAX_TOPICS is a constant we can choose, so we could use a hybrid approach by setting it low to catch most repositories, and send specific queries for repos with many topics
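
As an illustration of the hybrid approach for topics, the main GraphQL request and the follow-up selection could look roughly like the sketch below. The field names are my reading of GitHub's public GraphQL schema and should be checked against it; `MAX_TOPICS = 10` and the helper name `split_by_topic_count` are arbitrary choices for the example. The query is for a user; an organization would need `organization(login: ...)` instead.

```python
MAX_TOPICS = 10  # assumed value; the thread leaves this choice open

# Main request: up to 100 repos per page, with the fork parent and the
# first MAX_TOPICS topics fetched inline. Pass MAX_TOPICS as $maxTopics.
MAIN_QUERY = """
query($login: String!, $cursor: String, $maxTopics: Int!) {
  user(login: $login) {
    repositories(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes {
        nameWithOwner
        description
        homepageUrl
        isFork
        parent { nameWithOwner }
        isMirror
        mirrorUrl
        createdAt
        updatedAt
        pushedAt
        stargazerCount
        forkCount
        repositoryTopics(first: $maxTopics) {
          totalCount
          nodes { topic { name } }
        }
      }
    }
  }
}
"""


def split_by_topic_count(repo_nodes):
    """Split repos into those whose topics all fit in the main request
    and those that need a specific follow-up query."""
    complete, truncated = [], []
    for node in repo_nodes:
        topics = node["repositoryTopics"]
        if topics["totalCount"] > len(topics["nodes"]):
            # More topics exist than were returned: query this repo on its own.
            truncated.append(node["nameWithOwner"])
        else:
            names = [t["topic"]["name"] for t in topics["nodes"]]
            complete.append((node["nameWithOwner"], names))
    return complete, truncated
```

Only the repositories in `truncated` would then cost the extra 1 req/repo + 1 point/repo.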

Summary's summary: In terms of numbers, GraphQL is more attractive, as we can get slightly more info per account quota. Additionally, GraphQL responses are not 10 times larger than needed, unlike REST responses, which include lots of URLs (thanks, HATEOAS...)

However, two big disadvantages of the GraphQL API:

we need to know somehow whether an entity is a user or an organization (i.e. try either, and waste 1 req/entity + 1 point/entity if our guess is wrong). Users can be turned into orgs (but not the other way around), so this is not entirely cacheable.
~~there seems to be a bug that prevents querying repositories for organizations entirely~~ (GitHub devs fixed it)
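
A minimal sketch of the "try either" strategy for the first point, with the partial caching it allows. Everything here is hypothetical: `exists(kind, login)` stands in for whatever GraphQL call tells us that `user(login: ...)` or `organization(login: ...)` resolved, and `cache` is any dict-like store.

```python
def resolve_owner_kind(login, exists, cache):
    """Return (kind, wasted_queries) for a GitHub login.

    A cached "organization" is trusted forever, since orgs cannot turn
    back into users. A cached "user" is not trusted, because users can
    be converted into orgs, so it is re-checked on every call.
    """
    if cache.get(login) == "organization":
        return "organization", 0

    wasted = 0
    for kind in ("user", "organization"):
        if exists(kind, login):
            cache[login] = kind
            return kind, wasted
        wasted += 1  # wrong guess: 1 req + 1 point lost
    return None, wasted
```

With this ordering, a known user costs nothing extra, an organization wastes one query the first time it is seen and none afterwards, and a login that no longer exists wastes two.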

jayeshv added a subscriber: jayeshv. Sep 6 2021, 3:14 PM

Looks like *what* we want to collect is a solved issue.

For the *how*, we'll just have to wait and see.

In summary, we would archive everything with priority "high" or "mid", as well as the "license" and "main language" fields, as they are all easy to fetch and store.

vlorentz updated the task description. Apr 19 2022, 12:10 PM

This task has been migrated to GitLab.
