
GitHub loading optimization: skip repos with old enough updated_at/pushed_at timestamps
Open, Normal, Public

Description

The GitHub API allows inspecting when a repo was last modified; see the updated_at/pushed_at fields in this example.

Given how significant GitHub is in our archive coverage, it makes sense to add a forge-specific optimization that skips loading repos for which those timestamps are older than our last visit of the corresponding origins.

(Note: I'm not exactly sure what the difference between the two fields is; I'm assuming pushed_at is for git push and updated_at for metadata changes. But I think even the most conservative approach, skipping only if both fields are older than our last visit, would be a good start.)
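A minimal sketch of that conservative rule, assuming we have the JSON body from GET /repos/{owner}/{repo} and the datetime of our last visit (the helper name and its signature are hypothetical; only the field names come from the GitHub API):

```python
from datetime import datetime, timezone


def should_skip_visit(repo_info: dict, last_visit: datetime) -> bool:
    """Conservative skip rule: skip only when *both* GitHub timestamps
    (updated_at and pushed_at) predate our last visit of the origin.

    repo_info is the parsed JSON returned by GET /repos/{owner}/{repo};
    its timestamps are ISO 8601 strings such as "2020-01-21T13:33:00Z".
    """
    def parse(ts: str) -> datetime:
        # GitHub timestamps are UTC, marked with a trailing "Z".
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc
        )

    updated = parse(repo_info["updated_at"])
    pushed = parse(repo_info["pushed_at"])
    return updated < last_visit and pushed < last_visit
```

If either timestamp is newer than the last visit (or the interpretation of the fields turns out to be wrong), we fall through to a normal load, so the worst case is just the status quo.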

Assuming that doing an API call at the loader level is faster than actually trying to load the repo (which seems obvious to me, but I haven't actually benchmarked it *g*), this optimization should help a lot in clearing our backlog of repos to re-visit, for all GitHub repos that haven't changed.

I'm not sure where this forge-specific optimization belongs, but if it works, it's something we can extend in the future to, e.g., GitLab.

Event Timeline

zack triaged this task as Normal priority. Jan 21 2020, 1:33 PM
zack created this task.
zack updated the task description.
olasd added a subscriber: olasd. Jan 22 2020, 1:25 PM

I agree that this may be a useful optimization for some upstreams where getting the state of the remote repository is expensive.

I've done some back-of-the-envelope timing to get a gut feel for whether this would be appropriate:

With the current git loader, "loading" repositories where there was no change since the last visit takes a consistent minimum of around 0.55 seconds. This number gets a bit larger if the upstream repository has lots of branches (/ pull requests), but not much. Note that this accounts for opening/closing the visit, as well as uploading a new snapshot (see below).

With a very scientific measurement (running time GET <api_url for a random repo> for a few random repos that we have just loaded), it looks like getting output from the GitHub API for a random repo takes a minimum of around 0.37 seconds, and has taken up to 1.5 seconds.

I'm a bit worried that distributing API requests across workers to try to go faster will push us /way/ over the estimate of the number of requests that we used to ask for delisting of our workers on the GitHub side; we'll need to be careful about that.
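To make that concern concrete, here's a rough budget sketch. It assumes the documented authenticated limit of 5,000 requests per hour per token; the worker count and safety margin are made-up illustration values, not our actual numbers:

```python
def per_worker_interval(hourly_limit: int = 5000,
                        workers: int = 16,
                        safety: float = 0.5) -> float:
    """Minimum seconds between API calls for each worker so that the
    whole fleet stays under `safety * hourly_limit` requests per hour.

    hourly_limit: GitHub's authenticated rate limit (5000 req/h).
    workers: number of workers sharing the same token (hypothetical).
    safety: fraction of the limit we allow ourselves to consume.
    """
    budget_per_worker = hourly_limit * safety / workers  # requests/hour each
    return 3600.0 / budget_per_worker  # seconds between requests
```

With those illustrative numbers, each of 16 workers could fire a request only about every 23 seconds to stay within half the limit, which suggests the pre-visit API check should probably be done by a single rate-limited component rather than independently by each worker.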

(Unrelatedly, we're currently making a pass over all repositories to record snapshots with their symbolic references instead of dereferencing them, so we can't really deploy this optimization yet if we want that process to finish.)