skip exogenous branches when ingesting github/gitlab git repositories
Open, Normal, Public


When ingesting git repositories hosted on popular collaborative development platforms (e.g., GitHub, GitLab) we currently crawl all git-accessible branches, including branches that would not "normally" be retrieved by developers when cloning the repositories. Typical examples are branches pointing to pull requests/merge requests submitted to the repo (but not necessarily merged), although other cases exist. Let's call these branches "exogenous", as opposed to the native branches created by the repository owner.

This has various undesirable effects:

  • for software provenance use cases, we might conclude that a given repo has distributed a given piece of code even when in fact it is just a patch proposed by someone other than the repository owner and never accepted
  • having a lot of exogenous branches inflates the perceived activity: we might consider a git loader visit "eventful" (and hence warranting visiting the same repo again soon) simply because a pull request against that repo has been submitted, while no real activity *in* the repo took place

I propose to ignore exogenous branches when ingesting git repositories.

Several considerations are in order:

  • in terms of archival, we will archive the code pointed to by exogenous branches anyway, when archiving the originating repos (there is a race condition here, but that is nothing new)
  • we need to decide which branches to exclude; that is going to be a platform-specific heuristic, which will need to be maintained over time and properly engineered in the loader code (will be documented/iterated upon in followups to this task)
  • we will have a discrepancy between the kind of branches that "old" snapshots contain and what "new" snapshots will contain; as discussed on other occasions, that's just life and we should not try to rewrite old snapshots. We should just document this change, and when it takes effect, in a public journal of notable changes to the archive policy (T2460)

Event Timeline

zack triaged this task as Normal priority. Jun 19 2020, 9:50 AM
zack created this task.

as a related data point, the current graph export code applies the following heuristic to decide which outbound edges from snapshot nodes to emit:

  • keep branch names starting with refs/heads/
  • keep branch names starting with refs/tags/
  • drop everything else

It's a bit extreme, maybe (?), but empirically it works quite well and excludes all of the following ref name patterns, which are common offenders for github/gitlab:

  • refs/pull/* (GitHub)
  • refs/merge-requests/* (GitLab)
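
For illustration, the keep/drop rule described above can be sketched in a few lines of Python. This is not the actual graph export code: the function name `keep_branch`, the constant `KEPT_PREFIXES`, and the use of plain strings for branch names are all assumptions made for the example.

```python
# Prefixes kept by the heuristic described above; everything else is dropped.
# (Hypothetical names, not taken from the graph export code.)
KEPT_PREFIXES = ("refs/heads/", "refs/tags/")


def keep_branch(name: str) -> bool:
    """Return True if a snapshot branch name passes the heuristic."""
    # str.startswith accepts a tuple and matches any of its prefixes.
    return name.startswith(KEPT_PREFIXES)


branches = [
    "refs/heads/master",
    "refs/tags/v1.0",
    "refs/pull/42/head",           # GitHub pull request ref
    "refs/merge-requests/7/head",  # GitLab merge request ref
]
kept = [b for b in branches if keep_branch(b)]
```

With this sample input, only the `refs/heads/` and `refs/tags/` entries end up in `kept`; both platform-specific refs are dropped without the filter needing to know about either platform.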

It would also have the advantage of being a platform-agnostic heuristic, easy to integrate as Git loader configuration without special-casing any platform.

The heuristic you're talking about only applies to branches whose names start with refs/. All other branches are passed through unscathed, I think (which is a good thing, because most snapshots we generate as swh don't do refs/).
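
If that reading is right, the filter behaves roughly like the sketch below. It only illustrates the behaviour described in this comment and has not been checked against the export code; `keep_branch_passthrough` and `KEPT_REF_PREFIXES` are hypothetical names, and branch names are modelled as plain strings.

```python
# Hypothetical names; this models the behaviour described in the comment,
# not the actual graph export implementation.
KEPT_REF_PREFIXES = ("refs/heads/", "refs/tags/")


def keep_branch_passthrough(name: str) -> bool:
    """Filter only names under refs/; pass everything else through."""
    if not name.startswith("refs/"):
        # Branch names outside the refs/ namespace are never filtered.
        return True
    return name.startswith(KEPT_REF_PREFIXES)
```

The early return is what keeps snapshots whose branch names do not use `refs/` at all from being emptied by the filter.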

Before @seirl wrote the filtering code in the graph export, I pulled a list of distinct branch names by number of unique targets they point at. The result of my (fairly trivial) "name pattern" analysis is P670.

Git's "refspec" defaults to +refs/heads/*:refs/remotes/origin/*, which means "pull all refs named refs/heads/* to local refs named refs/remotes/origin/*". The "tag refspec" (which is enabled when doing git pull --tags) is refs/tags/*:refs/tags/*, which pulls all refs named refs/tags/*.

In contrast, git clone --mirror pulls all refs without filtering (so it also gets refs/pull/X/head and, even worse, refs/pull/X/merge, which are automatically generated by GitHub).
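
To make the difference concrete, here is a simplified Python sketch of how a refspec maps remote ref names to local ones. It is not git's implementation: `apply_refspec` is a hypothetical helper that only handles an optional leading `+` and at most one `*` per side. The mirror refspec `+refs/*:refs/*` is the fetch refspec that `git clone --mirror` configures.

```python
from typing import Optional

DEFAULT_REFSPEC = "+refs/heads/*:refs/remotes/origin/*"
TAG_REFSPEC = "refs/tags/*:refs/tags/*"
MIRROR_REFSPEC = "+refs/*:refs/*"  # what `git clone --mirror` configures


def apply_refspec(refspec: str, remote_ref: str) -> Optional[str]:
    """Return the local ref name for remote_ref, or None if it is not matched.

    Simplified model (hypothetical helper): an optional leading '+' is
    ignored, and each side may contain at most one '*'.
    """
    src, dst = refspec.lstrip("+").split(":", 1)
    if "*" not in src:
        # No wildcard: only an exact match is mapped.
        return dst if remote_ref == src else None
    prefix, suffix = src.split("*", 1)
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None
    # The part of the name matched by '*' is substituted into the destination.
    matched = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", matched, 1)


pr_ref = "refs/pull/1/head"
default_result = apply_refspec(DEFAULT_REFSPEC, pr_ref)
mirror_result = apply_refspec(MIRROR_REFSPEC, pr_ref)
```

Under the default refspec the pull request ref matches nothing and is not fetched, whereas under the mirror refspec it is copied over under the same name, which is how such refs end up in a mirror clone.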