When ingesting git repositories hosted on popular collaborative development platforms (e.g., GitHub, GitLab), we currently crawl all git-accessible branches, including branches that developers would not "normally" retrieve when cloning the repositories. A typical example is branches pointing to pull requests/merge requests submitted to the repo (but not necessarily merged), but other cases exist. Let's call these branches "exogenous" w.r.t. the native branches created by the repository owner.
This has various undesirable effects:
- for software provenance use cases, we might conclude that a given repo has distributed a given piece of code even when in fact it is just a patch proposed by someone other than the repository owner and never accepted
- having a lot of exogenous branches inflates the perceived activity: we might consider a git loader visit "eventful" (and hence warranting visiting the same repo again soon) simply because a pull request against that repo has been submitted, while no real activity *in* the repo took place
I propose to ignore exogenous branches when ingesting git repositories.
Several considerations are in order:
- in terms of archival, we will archive the code pointed to by exogenous branches anyway, when archiving the originating repos (there is a race condition here, but that is nothing new)
- we need to decide which branches to exclude, and that is going to be a platform-specific heuristic that will need to be maintained over time and properly engineered in the loader code (will be documented/iterated upon in followups to this task)
- we will have a discrepancy between the kinds of branches that "old" snapshots contain and those that "new" snapshots will contain; as discussed on other occasions, that's just life and we should not try to rewrite old snapshots. We should just document this change, and when it takes effect, in a public journal of notable changes to the archive policy (to be established)
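
To make the platform-specific heuristic more concrete, here is a minimal sketch of what the filtering could look like. It is not the loader's actual code: the names `EXOGENOUS_REF_PATTERNS`, `is_exogenous` and `filter_refs` are hypothetical, and so is the `platform` key. The patterns assume that GitHub exposes pull requests under `refs/pull/<id>/head` and `refs/pull/<id>/merge`, and that GitLab exposes merge requests under `refs/merge-requests/<id>/...`; the real list of patterns is exactly what remains to be decided in followups.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical per-platform patterns for "exogenous" refs.
# These are assumptions for illustration, not a settled policy.
EXOGENOUS_REF_PATTERNS: Dict[str, List[Pattern[bytes]]] = {
    "github": [re.compile(rb"^refs/pull/\d+/(head|merge)$")],
    "gitlab": [re.compile(rb"^refs/merge-requests/\d+/")],
}


def is_exogenous(ref_name: bytes, platform: str) -> bool:
    """Return True if ref_name matches an exogenous pattern for platform.

    Unknown platforms have no patterns, so nothing is filtered for them.
    """
    patterns = EXOGENOUS_REF_PATTERNS.get(platform, [])
    return any(p.match(ref_name) for p in patterns)


def filter_refs(refs: Dict[bytes, bytes], platform: str) -> Dict[bytes, bytes]:
    """Drop exogenous refs from a {ref name: target id} mapping."""
    return {
        name: target
        for name, target in refs.items()
        if not is_exogenous(name, platform)
    }
```

Applied to the refs advertised by a remote, `filter_refs` would keep `refs/heads/*` and `refs/tags/*` untouched and drop only the refs matching the selected platform's patterns. A side effect is that a visit during which only pull request refs changed would produce the same filtered set as the previous visit, and hence would no longer look "eventful".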