
Consider dropping pull request references from the git loader ingestion
Open, High, Public

Description

The git loader currently filters out references considered not that interesting [1] (sketched below):

  • auto-merged GitHub pull requests (reference names starting with refs/pull/ and ending with /merge)
  • peeled refs (reference names ending with ^{})
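
A minimal sketch of that predicate (the helper name is hypothetical; the actual code lives in swh/loader/git/utils.py [1]):

```python
def is_filtered_ref(ref_name: bytes) -> bool:
    """Hypothetical sketch of the current filtering."""
    if ref_name.endswith(b"^{}"):
        # Peeled ref: a duplicate pointing at the object a tag dereferences to
        return True
    if ref_name.startswith(b"refs/pull/") and ref_name.endswith(b"/merge"):
        # Auto-merged GitHub pull request ref
        return True
    return False
```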

Nonetheless, the git loader currently still loads pull request references, and there can
be a lot of them depending on the repository. See for example a recent snapshot of the
torvalds/linux repository [2].

We should consider whether it is still relevant to ingest those references.

The webapp already considers these references noise and has been filtering them out of the
browsing views since version v0.0.288 [3]. That argues for ignoring them during ingestion as well.

That should also alleviate other current concerns [4].

[1] https://forge.softwareheritage.org/source/swh-loader-git/browse/master/swh/loader/git/utils.py$89-90

[2] https://archive.softwareheritage.org/api/1/snapshot/c2847dfd741eae21606027cf29250d1ebcd63fb4/

[3] rDWAPPScc652d5240

[4] T3625


Event Timeline

ardumont triaged this task as High priority. Mon, Oct 4, 1:07 PM
ardumont created this task.

According to the snippet referenced by @ardumont, all branch names starting with refs/pull/ should be filtered out.
But in the recent snapshot of torvalds/linux there are a lot of branch names like that.
How come?

According to the snippet referenced by @ardumont, all branch names starting with refs/pull/ should be filtered out.

No, the snippet mentioned filters out refs whose names start with refs/pull/ and end with /merge
(I realize I made a typo in the main description... it's now fixed)

@zack Maybe the diff ^ will clarify a bit ;)

No, the snippet mentioned filters out refs whose names start with refs/pull/ and end with /merge
(I realize I made a typo in the main description... it's now fixed)

Oh, right! So I'm actually proposing that we filter out all branches whose names start with refs/pull/ (with no other conditions attached).
(I think this probably deserves a broader discussion on swh-devel, though.)

So I'm actually proposing that we filter out all branches whose names start with refs/pull/ (with no other conditions attached).

Yes, that's what diff D6401 does.
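
Roughly, the diff broadens the predicate to something like this (a sketch, not the actual contents of D6401):

```python
def is_filtered_ref(ref_name: bytes) -> bool:
    """Sketch of the broadened filter: drop peeled refs and *every*
    GitHub pull request ref, merged or not."""
    return ref_name.endswith(b"^{}") or ref_name.startswith(b"refs/pull/")
```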


@olasd @rdicosmo @anlambert @vsellier @vlorentz any thoughts about this?

FTR, without D6401, the packfile received from GitHub for the CocoaPods/Specs repo contains 21162 references, 21146 of which start with refs/pull/ and 7126 of which end with /merge (even though those were explicitly not requested, thanks to the filtering in RepoRepresentation.determine_wanted()).
When D6401 is applied, we only get the 20-ish references that are not pull request related.
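
For anyone wanting to reproduce such counts, a quick sketch (assuming git is installed; the function name is ours):

```python
import subprocess

def ref_counts(repo_url: str):
    """Count remote refs without cloning, via `git ls-remote`."""
    out = subprocess.run(
        ["git", "ls-remote", repo_url],
        capture_output=True, text=True, check=True,
    ).stdout
    # Each output line is "<sha>\t<refname>"
    refs = [line.split("\t", 1)[1] for line in out.splitlines() if "\t" in line]
    total = len(refs)
    pulls = sum(1 for r in refs if r.startswith("refs/pull/"))
    merges = sum(1 for r in refs if r.endswith("/merge"))
    return total, pulls, merges

# e.g. ref_counts("https://github.com/CocoaPods/Specs.git")
```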

Yes, we must filter this stuff out (we discussed this issue with @zack some time ago, and you may also see Torvalds' opinion: https://www.zdnet.com/article/linux-boosts-microsoft-ntfs-support-as-linus-torvalds-complains-about-github-merges/ )

Yes, we must filter this stuff out (we discussed this issue with @zack some time ago)

Overall, I'm not too comfortable with the idea of intentionally not archiving some data that's been generated by users of the repository, even though this interaction is done via a web interface.

These synthetic branches show the actual state of a pull request at the instant of archival, while any other means of finding that information (e.g., having archived the relevant fork, then correlating the branch name or head commit of the pull request within its metadata with that repository) is inherently a racy operation, which means we could be in a situation where the relevant code isn't archived.

I do agree that this is generally noise, if you're only looking at archival of the software itself rather than its overall development process — but I feel that this is noise that can have some value in some cross-cutting analyses that we wouldn't be otherwise able to guarantee.

I would much prefer extending the snapshot data model to mark these synthetic branches as such, giving users of snapshots the option of ignoring, or not, these branches.

Whichever way (removal or tagging) we use to filter these branches out, we should do a survey of *all* such synthetic branches, not just the ones that happen to be generated by GitHub, and handle them all at once. GitLab puts them in refs/merge-requests/*, Phabricator in refs/phabricator/*, Gerrit in refs/changes/*, ...
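
A hypothetical inventory of those prefixes, to give an idea of what such a survey might feed into the filter (the names and the completeness of the list are assumptions):

```python
# Hypothetical per-forge inventory; a real survey would need to confirm
# these and likely extend the list.
SYNTHETIC_REF_PREFIXES = {
    "github": (b"refs/pull/",),
    "gitlab": (b"refs/merge-requests/",),
    "phabricator": (b"refs/phabricator/",),
    "gerrit": (b"refs/changes/",),
}

def is_synthetic_ref(ref_name: bytes) -> bool:
    """True if the ref matches any known forge-generated prefix."""
    return any(
        ref_name.startswith(prefix)
        for prefixes in SYNTHETIC_REF_PREFIXES.values()
        for prefix in prefixes
    )
```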

Torvalds is ranting about seeing, in his repository (or in proposed changes to his repository), merge operations that have been generated by clicking on a GitHub button (because, in essence, they're not GPG signed by a maintainer). Even though it's an interesting rant, I don't think this has any bearing on the policies that we should follow for archival. We already ignore the "synthetic merges" that GitHub generates out of the pull request branches (refs/pull/*/merge), as these are definitely not user-generated.

Thanks for your feedback @olasd. I see three main arguments raised there: (1) the raciness of archiving those data via other means (= related forks), (2) the completeness of our canvassing of synthetic refs, (3) annotating rather than not archiving "synthetic" refs.

For (1), sure, it's racy, hence we could lose stuff that gets removed from GitHub before we have the time to archive it. But this is a drop in the ocean in comparison with our lag/backlog.

And it should be weighed against the drawbacks, from the clutter in our conceptual graph to the potential performance overhead. My main concern remains the former: any developer out there would consider as belonging to their repo only what they get with "git clone" (which is what repo maintainers have either pushed or merged themselves, not what random people have simply proposed for addition via a pull request). That and only that is what we should archive. If your concern is missing stuff, we should rather use those other refs as scheduler hints to speed up the archival of the corresponding repos, not cram all the extra stuff into that repo "while we are there".

For (2), also agreed: we should do that canvassing of all refs that should be pruned. But it sounds like a "perfect is the enemy of the good" situation. Either we do that canvassing quickly (it shouldn't take long, right?), or we proceed by incremental improvements: we start filtering out the GitHub refs, which will have a massive quantitative impact by itself, and later add other refs to the filtering.

I see the appeal of (3), but it still doesn't solve the conceptual problem of archiving stuff that does not correspond to the archival unit that IMO a developer would expect.

In T3627#71809, @zack wrote:

Thanks for your feedback @olasd. I see three main arguments raised there: (1) the raciness of archiving those data via other means (= related forks), (2) the completeness of our canvassing of synthetic refs, (3) annotating rather than not archiving "synthetic" refs.

For (1), sure, it's racy, hence we could lose stuff that gets removed from GitHub before we have the time to archive it. But this is a drop in the ocean in comparison with our lag/backlog.

Agreed.

In T3627#71809, @zack wrote:

And it should be weighed against the drawbacks, from the clutter in our conceptual graph to the potential performance overhead. My main concern remains the former: any developer out there would consider as belonging to their repo only what they get with "git clone" (which is what repo maintainers have either pushed or merged themselves, not what random people have simply proposed for addition via a pull request).

Filtering down to what developers get using git clone would mean only allowing the default git fetch refspecs (that is, the HEAD symbolic reference, refs/heads/* and refs/tags/*). Should we go all the way there?

(afaict, git clone --mirror pulls all refs indiscriminately, synthetic or not)
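
For illustration, "what a plain git clone fetches" boils down to a match on the default refspecs (a sketch; the function name is ours):

```python
def matches_default_refspecs(ref_name: bytes) -> bool:
    """Keep only what a default (non --mirror) clone fetches:
    HEAD, refs/heads/* and refs/tags/*."""
    return (
        ref_name == b"HEAD"
        or ref_name.startswith(b"refs/heads/")
        or ref_name.startswith(b"refs/tags/")
    )
```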

In T3627#71809, @zack wrote:

That and only that is what we should archive. If your concern is missing stuff, we should rather use those other refs as scheduler hints to speed up the archival of the corresponding repos, not cram all the extra stuff into that repo "while we are there".

We generally don't know *where* to find the code that's been submitted as pull requests (i.e. we'd need to scrape all their metadata, with forge-specific APIs), so I assume we won't get to actually doing that in the foreseeable future. There's also the case where the source of the pull request has been removed, and only the synthetic refs on the destination repo are available. But that's all probably fine, as long as it's an explicit decision ("a drop in the ocean", as you've put it).

In T3627#71809, @zack wrote:

For (2), also agreed: we should do that canvassing of all refs that should be pruned. But it sounds like a "perfect is the enemy of the good" situation. Either we do that canvassing quickly (it shouldn't take long, right?), or we proceed by incremental improvements: we start filtering out the GitHub refs, which will have a massive quantitative impact by itself, and later add other refs to the filtering.

Yes, sure. Evolutions of this filtering will need to be prominently documented, because we're starting to ignore data that used to be archived.

For now I see the following candidates, which should get us 90+% of the way there:

  • refs/pull/* (GitHub)
  • refs/merge-requests/* (GitLab)
  • refs/changes/* (Gerrit)
  • refs/phabricator/* (Phabricator)

In T3627#71809, @zack wrote:

I see the appeal of (3), but it still doesn't solve the conceptual problem of archiving stuff that does not correspond to the archival unit that IMO a developer would expect.

I'm not sure I agree with this, but I do understand the reasoning and, as long as it's taken as a conscious tradeoff that is well documented, it's fine by me.

Ah, another question I've been thinking about: should we go back to existing visits of git repositories and give them a new, pruned snapshot? Our data model now allows it: we can just append a new final OriginVisitStatus pointing at a pruned snapshot.
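
A rough sketch of what such a backfill could look like, assuming the swh.model/swh.storage APIs (the helper and its arguments are hypothetical):

```python
from datetime import datetime, timezone

from swh.model.model import OriginVisitStatus

def repoint_visit(storage, origin_url, visit_id, pruned_snapshot_id):
    """Append a new final status pointing at an already-stored pruned snapshot."""
    storage.origin_visit_status_add([
        OriginVisitStatus(
            origin=origin_url,
            visit=visit_id,
            date=datetime.now(tz=timezone.utc),
            status="full",
            snapshot=pruned_snapshot_id,  # sha1_git of the pruned snapshot
        )
    ])
```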

An alternative to annotating synthetic refs: add a "type" or "forge_type" attribute to snapshots.

Pros:

  1. somewhat consistent with revisions (except revisions have the type of VCS, snapshots would have the type of forge)
  2. no forge-specific code in the loader (the type can be passed by the lister)
  3. we can update the lists a posteriori without changing the snapshot object (i.e. it's easier to improve the filtering)
  4. allows fine-grained filtering by clients/UIs depending on how they want to use the data (a sketch follows the cons list below)

Cons:

  1. clients need to be aware of all forges if they want to do it. (possible solution: provide official recommendations in a machine-readable format, via the swh-web API)
  2. it feels weird
  3. swh-graph probably can't use it
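
To make pro (4) concrete, client-side filtering given such a forge_type could look like this (the attribute, the prefix inventory and the helper are all hypothetical):

```python
from typing import Dict, Tuple

# Hypothetical: per-forge synthetic ref prefixes, as surveyed earlier
SYNTHETIC_REF_PREFIXES: Dict[str, Tuple[bytes, ...]] = {
    "github": (b"refs/pull/",),
    "gitlab": (b"refs/merge-requests/",),
    "phabricator": (b"refs/phabricator/",),
    "gerrit": (b"refs/changes/",),
}

def user_branches(branches: Dict[bytes, object], forge_type: str) -> Dict[bytes, object]:
    """Filter a snapshot's branches down to user-created refs,
    given the snapshot's (hypothetical) forge_type attribute."""
    prefixes = SYNTHETIC_REF_PREFIXES.get(forge_type, ())
    return {
        name: target
        for name, target in branches.items()
        if not any(name.startswith(p) for p in prefixes)
    }
```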

An alternative to annotating synthetic refs: add a "type" or "forge_type" attribute to snapshots.

Isn't this (more or less) what the 'type' of the visit is about? Shouldn't it be on the visit?
I mean, in the end this is a piece of contextual information related to the visit of the origin.
OriginVisit.type tells which loader was used to visit this origin; this new annotation can be seen as a refinement of that piece of information.