
GitHub loading optimization: skip repos with old enough updated_at/pushed_at timestamps
Open, Normal, Public


The GitHub API lets us inspect when a repo was last modified; see the updated_at/pushed_at fields in this example.

Given how significant GitHub is in our archive coverage, it makes sense to add a forge-specific optimization that skips loading repos for which those timestamps are older than our last visit of the corresponding origins.

(Note: I'm not exactly sure what the difference between the two fields is; I'm assuming pushed_at is for git push and updated_at for metadata changes. But I think even the most conservative approach, skipping only if both fields are older than our last visit, would be a good start.)
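A minimal sketch of that conservative check, assuming the repo metadata has already been fetched from the GitHub API as a dict and that the last visit date is available as a timezone-aware datetime. The function names and the way the last visit is passed in are hypothetical, not part of any existing loader code:

```python
from datetime import datetime, timezone
from typing import Mapping, Optional


def parse_github_timestamp(value: str) -> datetime:
    """Parse a GitHub API timestamp such as '2020-01-21T13:33:00Z' as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def can_skip_visit(
    repo: Mapping[str, Optional[str]], last_visit: Optional[datetime]
) -> bool:
    """Return True only if both timestamps are older than the last visit.

    `repo` is the decoded JSON of a repository as returned by the API.
    `last_visit` must be timezone-aware (assumption of this sketch).
    """
    if last_visit is None:
        # Never visited: we have to load the repo.
        return False
    for field in ("updated_at", "pushed_at"):
        value = repo.get(field)
        if not value:
            # Missing or null timestamp: be conservative and load.
            return False
        if parse_github_timestamp(value) >= last_visit:
            return False
    return True
```

Any doubt (no previous visit, a missing field, a timestamp equal to the visit date) falls back to loading the repo, so the check can only ever skip work, not lose updates, as long as the timestamps themselves are reliable.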

Assuming that doing an API call at the loader level is faster than actually trying to load the repo (which seems obvious to me, but it's not like I have actually benchmarked it *g*), this optimization should help a lot in clearing our backlog of repos to re-visit, for all GitHub repos that haven't changed.

I'm not sure where this forge-specific optimization belongs, but if it works it's something we can extend in the future to, e.g., GitLab.

Event Timeline

zack triaged this task as Normal priority. Jan 21 2020, 1:33 PM
zack created this task.
zack updated the task description.
olasd added a subscriber: olasd. Jan 22 2020, 1:25 PM

I agree that this may be a useful optimization for some upstreams where getting the state of the remote repository is expensive.

I've done some back-of-the-envelope timing to get a gut feel for whether this would be appropriate:

With the current git loader, "loading" repositories where there was no change since the last visit takes a consistent minimum of around 0.55 seconds. This number gets a bit larger if the upstream repository has lots of branches (/ pull requests), but not much. Note that this accounts for opening/closing the visit, as well as uploading a new snapshot (see below).

With a very scientific measurement (running time GET <api_url for a random repo> for a few random repos that we have just loaded), it looks like getting output from the GitHub API on a random repo takes a minimum of around 0.37 seconds, and has taken up to 1.5 seconds.
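To put the two timings side by side, here is a rough break-even sketch. It assumes a deliberately simplified model that is not measured anywhere in this task: a skipped repo costs only the API call, and a repo that did change pays the API call on top of its normal load. Under that model the check saves time overall only if the fraction p of unchanged repos satisfies p * (t_load - t_api) > (1 - p) * t_api, i.e. p > t_api / t_load.

```python
def break_even_unchanged_fraction(t_noop_load: float, t_api: float) -> float:
    """Fraction of unchanged repos above which the API check pays off.

    Simplified model (an assumption of this sketch):
    - unchanged repo: the check replaces a no-op load (t_noop_load) by t_api
    - changed repo: the check adds t_api to the regular load
    A result above 1.0 means the check never pays off under this model.
    """
    if t_noop_load <= 0:
        raise ValueError("t_noop_load must be positive")
    return t_api / t_noop_load


# Timings quoted above: ~0.55 s for a no-op load, 0.37 s to 1.5 s per API call.
best_case = break_even_unchanged_fraction(0.55, 0.37)
worst_case = break_even_unchanged_fraction(0.55, 1.5)
print(f"best case: {best_case:.2f}, worst case: {worst_case:.2f}")
```

With the fastest API timing, roughly two thirds of the visited repos need to be unchanged before the check is a net win; with the slowest one, the ratio is above 1, so the check would always cost more than the no-op load it replaces.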

I'm a bit worried that distributing API requests across workers to try to go faster will push us /way/ over the estimate of the number of requests that we used to ask for delisting of our workers on the GitHub side; we'll need to be careful about that.
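One way to stay careful, sketched here as an assumption rather than a plan: each worker could look at the X-RateLimit-Remaining and X-RateLimit-Reset headers that GitHub returns with API responses and back off before the budget runs out. The function name and the `reserve` parameter are hypothetical, and this only tracks the budget of the credentials a given worker uses, not a global request estimate across all workers.

```python
import time
from typing import Mapping, Optional


def seconds_until_next_request(
    headers: Mapping[str, str], now: Optional[float] = None, reserve: int = 0
) -> float:
    """Return how long to wait before sending another API request.

    `headers` are the response headers of the previous API call. Lookups are
    case-sensitive for a plain dict, so pass a case-insensitive mapping (or
    normalized keys) in real code. `reserve` is the number of requests to keep
    unused as a safety margin.
    """
    if now is None:
        now = time.time()
    if "X-RateLimit-Remaining" not in headers or "X-RateLimit-Reset" not in headers:
        # No rate-limit information: nothing to base a delay on.
        return 0.0
    remaining = int(headers["X-RateLimit-Remaining"])
    reset_at = int(headers["X-RateLimit-Reset"])  # epoch seconds, UTC
    if remaining > reserve:
        return 0.0
    return max(0.0, reset_at - now)
```

A worker would sleep for the returned duration before its next API call; a non-zero `reserve` leaves headroom for other consumers of the same credentials.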

(Unrelatedly, we're currently making a pass on all repositories to record snapshots with their symbolic references instead of dereferencing them, so we can't really deploy that optimization yet if we want that process to finish).