
Consider dropping pull request references from the git loader ingestion
Open, High, Public

Description

The git loader currently filters out references considered not that interesting [1]:

  • auto-merged GitHub pull requests (reference names starting with refs/pulls and ending with /merge)
  • peeled refs (reference names ending with ^{})
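For illustration, the filtering rule described above amounts to something like the following minimal sketch. The actual implementation lives in swh/loader/git/utils.py [1]; the function name here is made up, and the prefix used is GitHub's actual refs/pull/ namespace:

```python
def is_filtered_ref(ref_name: bytes) -> bool:
    """True for references the loader currently drops: auto-merged GitHub
    pull request refs and peeled refs (sketch, not the real implementation)."""
    is_github_auto_merge = ref_name.startswith(b"refs/pull") and ref_name.endswith(b"/merge")
    is_peeled = ref_name.endswith(b"^{}")
    return is_github_auto_merge or is_peeled

refs = [b"refs/heads/master", b"refs/pull/1/merge", b"refs/tags/v1.0^{}", b"refs/pull/1/head"]
# refs/pull/N/head refs survive this filter, which is why snapshots still
# end up containing lots of pull request references
wanted = [r for r in refs if not is_filtered_ref(r)]
```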

Nonetheless, the current git loader still loads pull request references, and
there can be a lot of them depending on the repository. See for example a recent snapshot
on the torvalds/linux repository [2].

We should consider whether that's still relevant to ingest those references.

The webapp already considers these references noise and has been filtering them out of the
browsing views since version v0.0.288 [3]. That argues for ignoring them during ingestion as well.

That should also alleviate other current considerations [4].

[1] https://forge.softwareheritage.org/source/swh-loader-git/browse/master/swh/loader/git/utils.py$89-90

[2] https://archive.softwareheritage.org/api/1/snapshot/c2847dfd741eae21606027cf29250d1ebcd63fb4/

[3] rDWAPPScc652d5240

[4] T3625

Event Timeline

ardumont triaged this task as High priority. Oct 4 2021, 1:07 PM
ardumont created this task.

According to the snippet referenced by @ardumont, all branch names starting with refs/pull/ should be filtered out.
But in the recent snapshot of torvalds/linux there are a lot of branch names like that.
How come?

According to the snippet referenced by @ardumont, all branch names starting with refs/pull/ should be filtered out.

No, the snippet mentioned filters out refs whose names start with refs/pulls and end with /merge
(I realize I made a typo in the main description...; it's now fixed)

@zack Maybe the diff ^ will clarify a bit ;)

No, the snippet mentioned filters out refs whose names start with refs/pulls and end with /merge
(I realize I made a typo in the main description...; it's now fixed)

Oh, right! So I'm actually proposing that we filter out all branches whose names start with refs/pulls (with no other conditions attached).
(I think this probably deserves a more broad discussion on swh-devel though.)

So I'm actually proposing that we filter out all branches whose names start with refs/pulls (with no other conditions attached).

Yes, that's what diff D6401 does.


@olasd @rdicosmo @anlambert @vsellier @vlorentz any thoughts about this?

FTR, without D6401, the packfile received from GitHub for the CocoaPods/Specs repo contains 21162 references, 21146 of which start with /refs/pull/ and 7126 of which end with /merge (even though those were explicitly not asked for, thanks to the filtering in RepoRepresentation.determine_wanted()).
When D6401 is applied, we only get the 20-ish references that are not pull-request related.

Yes, we must filter this stuff out (we discussed this issue with @zack some time ago, and you may see Torvalds' opinion too https://www.zdnet.com/article/linux-boosts-microsoft-ntfs-support-as-linus-torvalds-complains-about-github-merges/ )

Yes, we must filter this stuff out (we discussed this issue with @zack some time ago)

Overall, I'm not too comfortable with the idea of intentionally not archiving some data that's been generated by users of the repository, even though this interaction is done via a web interface.

These synthetic branches show the actual state of a pull request at the instant of archival, while any other means of finding that information (e.g., having archived the relevant fork, then correlating the branch name or head commit of the pull request within its metadata with that repository) is inherently a racy operation, which means we could be in a situation where the relevant code isn't archived.

I do agree that this is generally noise, if you're only looking at archival of the software itself rather than its overall development process — but I feel that this is noise that can have some value in some cross-cutting analyses that we wouldn't be otherwise able to guarantee.

I would much prefer extending the snapshot data model to mark these synthetic branches as such, giving users of snapshots the option of ignoring, or not, these branches.

Whichever way (removal or tagging) we use to filter these branches out, we should do a survey of *all* such synthetic branches, not just the ones that happen to be generated by GitHub, and handle them all at once. GitLab puts them in refs/merge-requests/*, Phabricator in refs/phabricator/*, Gerrit in refs/changes/*, ...
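The forge-specific namespaces mentioned above can be collected into a small table, as a sketch; the mapping and names are illustrative, and the list is certainly incomplete (GitHub's namespace is refs/pull/):

```python
# Ref namespaces used by forges for their forge-generated ("synthetic")
# branches, as enumerated in the comment above
FORGE_SYNTHETIC_NAMESPACES = {
    "github": b"refs/pull/",
    "gitlab": b"refs/merge-requests/",
    "phabricator": b"refs/phabricator/",
    "gerrit": b"refs/changes/",
}

def is_forge_generated(ref_name: bytes) -> bool:
    """True if the ref lives in a known forge-generated namespace."""
    return ref_name.startswith(tuple(FORGE_SYNTHETIC_NAMESPACES.values()))
```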

Torvalds is ranting about seeing, in his repository (or in proposed changes to his repository), merge operations that have been generated by clicking on a GitHub button (because, in essence, they're not GPG signed by a maintainer). Even though it's an interesting rant, I don't think this has any bearing on the policies that we should follow for archival. We already ignore the "synthetic merges" that GitHub generates out of the pull request branches (refs/pull/*/merge), as these are definitely not user-generated.

Thanks for your feedback @olasd. I see three main arguments raised there: (1) the raciness of archiving those data via other means (= related forks), (2) the completeness of our canvassing of synthetic refs, (3) annotating rather than not archiving "synthetic" refs.

For (1), sure, it's racy, hence we could lose stuff that gets removed from GitHub before we have the time to archive it. But this is a drop in the ocean in comparison with our lag/backlog.

And it should be weighed against the drawbacks, from the clutter in our conceptual graph to the potential performance overhead. My main concern remains the former: any developer out there would consider as belonging to their repo only what they get with "git clone" (which is what repo maintainers have either pushed or merged themselves, not what random people have simply proposed for addition via a pull request). That and only that is what we should archive. If your concern is missing stuff, we should rather use other refs as scheduler hints to speed up the archival of corresponding repos, not cram all the extra stuff in that repo "while we are there".

For (2), also agreed, we should do that canvassing of all refs that should be pruned. But it sounds like a "perfect enemy of the good" situation. Either we just do that canvassing quickly (it shouldn't take long, right?), or we proceed by incremental improvements: we start filtering out the GitHub refs, which will have a massive quantitative impact by itself, and later on add other refs to the filtering.

I see the appeal of (3), but it still doesn't solve the conceptual problem of archiving stuff that does not correspond to the archival unit that IMO a developer would expect.

In T3627#71809, @zack wrote:

Thanks for your feedback @olasd. I see three main arguments raised there: (1) the raciness of archiving those data via other means (= related forks), (2) the completeness of our canvassing of synthetic refs, (3) annotating rather than not archiving "synthetic" refs.

For (1), sure, it's racy, hence we could lose stuff that gets removed from GitHub before we have the time to archive it. But this is a drop in the ocean in comparison with our lag/backlog.

Agreed.

In T3627#71809, @zack wrote:

And it should be weighed against the drawbacks, from the clutter in our conceptual graph to the potential performance overhead. My main concern remains the former: any developer out there would consider as belonging to their repo only what they get with "git clone" (which is what repo maintainers have either pushed or merged themselves, not what random people have simply proposed for addition via a pull request).

Filtering down to what developers get using git clone would mean only allowing the default git fetch refspecs (that is, the HEAD symbolic reference, refs/heads/* and refs/tags/*). Should we go all that way?

(afaict, git clone --mirror pulls all refs indiscriminately, synthetic or not)

In T3627#71809, @zack wrote:

That and only that is what we should archive. If your concern is missing stuff, we should rather use other refs as scheduler hints to speed up the archival of corresponding repos, not cram all the extra stuff in that repo "while we are there".

We generally don't know *where* to find the code that's been submitted as pull requests (i.e. we'd need to scrape all their metadata, with forge-specific APIs), so I assume we won't get to actually doing that in the foreseeable future. There's also the case where the source of the pull request has been removed, and only the synthetic refs on the destination repo are available. But that's all probably fine, as long as it's an explicit decision ("a drop in the ocean", as you've put it).

In T3627#71809, @zack wrote:

For (2), also agreed, we should do that canvassing of all refs that should be pruned. But it sounds like a "perfect enemy of the good" situation. Either we just do that canvassing quickly (it shouldn't take long, right?), or we proceed by incremental improvements: we start filtering out the GitHub refs, which will have a massive quantitative impact by itself, and later on add other refs to the filtering.

Yes, sure. Evolutions of this filtering will need to be prominently documented, because we're starting to ignore data that used to be archived.

For now I see the following candidates, which should get us 90+% of the way there:

In T3627#71809, @zack wrote:

I see the appeal of (3), but it still doesn't solve the conceptual problem of archiving stuff that does not correspond to the archival unit that IMO a developer would expect.

I'm not sure I agree with this, but I do understand the reasoning and, as long as it's taken as a conscious tradeoff that is well documented, it's fine by me.

Ah, another question I've been thinking about: should we go back to existing visits of git repositories and give them a new, pruned snapshot? Our data model now allows it: we can just append a new final OriginVisitStatus pointing at a pruned snapshot.

An alternative to annotating synthetic refs: add a "type" or "forge_type" attribute to snapshots.

Pros:

  1. somewhat consistent with revisions (except revisions have the type of VCS, snapshots would have the type of forge)
  2. no forge-specific code in the loader (the type can be passed by the lister)
  3. we can update the lists a posteriori without changing the snapshot object (i.e. it's easier to improve the filtering)
  4. allows fine-grained filtering by clients/UIs depending on how they want to use the data

Cons:

  1. clients need to be aware of all forges if they want to do it. (possible solution: provide official recommendations in a machine-readable format, via the swh-web API)
  2. it feels weird
  3. swh-graph probably can't use it

An alternative to annotating synthetic refs: add a "type" or "forge_type" attribute to snapshots.

Isn't this (more or less) what the 'type' of the visit is about? Shouldn't it be on the visit?
I mean, in the end, this is a piece of contextual information related to the visit of the origin.
The OriginVisit.type field tells which loader was used to visit this origin; this new annotation can be seen as a refinement of that piece of information.

I would like us to conclude this discussion soon.

Here's a couple options for a way forward for the git loader:

A1. Update the git loader to ignore the following "known autogenerated" ref name prefixes:

  • refs/pulls/
  • refs/changes/
  • refs/merge-requests/
  • refs/pipelines/
  • refs/environments/

A2. Update the git loader to only load the following ref name prefixes (replicate git fetch --tags):

  • HEAD
  • refs/heads/
  • refs/tags/

A3. Do nothing to the git loader
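Options A1 and A2 can be contrasted as predicates over ref names; a hedged sketch, with the prefixes taken verbatim from the lists above and made-up function names:

```python
# Option A1: denylist of known autogenerated ref name prefixes
AUTOGENERATED_PREFIXES = (
    b"refs/pulls/",
    b"refs/changes/",
    b"refs/merge-requests/",
    b"refs/pipelines/",
    b"refs/environments/",
)

def keep_ref_a1(name: bytes) -> bool:
    return not name.startswith(AUTOGENERATED_PREFIXES)

# Option A2: allowlist replicating a plain `git fetch`
def keep_ref_a2(name: bytes) -> bool:
    return name == b"HEAD" or name.startswith((b"refs/heads/", b"refs/tags/"))

# The two differ on refs outside both lists, e.g. refs/notes/commits:
# A1 keeps them, A2 drops them.
```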

For the data model, we have a couple options:

B1. Do nothing
B2. Add a new type field in Snapshot, to help downstreams interpret the data / branch schema for the given snapshot
B3. Add a new "synthetic" flag on Snapshot branches, to help downstreams filter in/out these branches

Finally, once we have a decision on questions A and B, we need to decide what we do to existing visits:

C1. Do nothing
C2. Generate a new snapshot for all git visits, with the branch filtering/tagging enabled, and reference it on a new OriginVisitStatus for these visits.

How should we go forward with these questions? I think we may need an "executive" decision here :-)

B3: I am not convinced a "synthetic" flag on the Snapshot branch makes sense, or at least I find the name confusing, especially considering we already have a synthetic flag on Revision: the branch is not synthetic in the sense of being an object crafted by SWH; it comes from the origin.

I am also not convinced this is the best approach: a branch may be considered "synthetic" with git loader version N, but not any more with version N+1; having this flag engraved in the branch itself makes it harder to handle this kind of evolution of the loaders.
Having a type field in the Snapshot (or the OriginVisit) to give context -- at using/reading time -- to interpret the branches seems more flexible and future-proof to me.

B3: I am not convinced a "synthetic" flag on the Snapshot branch makes sense, or at least I find the name confusing, especially considering we already have a synthetic flag on Revision: the branch is not synthetic in the sense of being an object crafted by SWH; it comes from the origin.

It doesn't really come from the origin either. It comes from the platform hosting the origin.

In the summary I'm in the process of sending to swh-devel, I've used the name "extrinsic" instead of synthetic, which feels a bit more descriptive.

(Unrelatedly, the synthetic and type flags in SWH revisions are inconsistently applied, and not part of the current manifest used for SWHID computation. With the rise of ExtIDs, I'm tempted to propose to file them down the memory hole and forget they ever existed)

I am also not convinced this is the best approach: a branch may be considered "synthetic" with git loader version N, but not any more with version N+1; having this flag engraved in the branch itself makes it harder to handle this kind of evolution of the loaders.

I don't see how this is different to changing the set of refs that would be transcribed in the snapshot out of the remote refs from one version of the loader to the next, or how this is different from adding support for symbolic refs, or any other change of heuristic in the loader. This is "just" something that would need to be documented in the archive changelog.

Having a type field in the Snapshot (or the OriginVisit) to give context -- at using/reading time -- to interpret the branches seems more flexible and future-proof to me.

It means that every single user gets to implement an ever growing set of heuristics to interpret the contents of a snapshot - and breaking in the process if they don't know how to do that. We already have 4 different sets of mildly inconsistent heuristics (swh.web, swh.dataset, swh.graph, and I assume swh.provenance). I'm not sure that's something that we should keep doing. IMO the loader is the best place to centralise these heuristics, and refine them over time.


After writing the summary for swh-devel, I have a new idea:

A4. Have the git loader filter the refs (using the heuristic from A1 or A2), fetch data only for filtered refs, and reference the filtered snapshot from the Origin Visit. Also make the git loader generate a full snapshot out of all the available refs (optionally load it, with dangling branches), and serialize that as timestamped extrinsic metadata on the origin.

This way, we don't /lose/ the information about the available extrinsic branches at time of fetching (even if we make it mildly harder to retrieve it), but we keep the data in the main graph clean.

(A4.5 would have us fetch all objects instead of just those from the partial snapshot - this would generate a bunch of dangling objects. Or if we store the snapshot in the archive, only the "full" snapshot itself would dangle.)
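A minimal sketch of the A4 split (function and variable names are illustrative, not the loader's actual API): the visit would reference only the filtered snapshot, while the full ref list gets serialized as extrinsic metadata.

```python
def in_default_refspec(name: bytes) -> bool:
    # Refs a plain `git clone` would fetch (option A2's allowlist)
    return name == b"HEAD" or name.startswith((b"refs/heads/", b"refs/tags/"))

def split_refs(remote_refs: dict) -> tuple:
    """Return (visit_refs, full_refs): the first set of refs goes into the
    snapshot referenced by the OriginVisitStatus, the second into a full
    snapshot stored as extrinsic metadata on the origin."""
    visit_refs = {n: s for n, s in remote_refs.items() if in_default_refspec(n)}
    return visit_refs, dict(remote_refs)

remote_refs = {
    b"HEAD": b"c0ffee",
    b"refs/heads/main": b"c0ffee",
    b"refs/pull/42/head": b"deadbeef",
}
visit_refs, full_refs = split_refs(remote_refs)
```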

Thanks for the summaries @olasd, both here and on list.
I've followed up on list.

Meanwhile here's what I propose we do (spoiler!):

a) A4: add to the archive Merkle DAG only the filtered snapshot (referencing "intrinsic" branches only, as per A2) and its transitive closure
b) (optional, and not blocking for going ahead with (a) in the meantime) create the non-filtered snapshot too, and store it as extrinsic metadata
c) extend the snapshot model to allow typing outgoing edges from snapshots. As terminology (bikeshed) goes:

  • I don't like "type", as it's too broad and we only have one dimension to discriminate on for now. If we add more in the future, a single category ("type") will not cut it anyway
  • (/me checks the dictionary) I don't think this is intrinsic v. extrinsic, but rather endogenous v. exogenous w.r.t. what a VCS tool does out of the box. Hence I propose to add an "endogeneous" boolean property to the outgoing snapshot edges
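Point (c) could look like the following on the branch model; this is an illustrative dataclass, not the real swh.model class, and the "endogeneous" spelling follows the proposal above (it is still being bikeshedded):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SnapshotBranch:
    """Sketch of a snapshot branch extended with the proposed boolean
    property on outgoing edges (names and fields are illustrative)."""
    target: bytes
    target_type: str  # e.g. "revision", "release", "alias"
    endogeneous: bool = True  # False for forge-generated refs, e.g. refs/pull/*

pr_branch = SnapshotBranch(
    target=b"\x00" * 20,
    target_type="revision",
    endogeneous=False,
)
```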

Regarding the past, I'm not opposed to C2 (generating new snapshots for all visits), with the following caveats:

  • it should not be a blocker for going ahead with (a)
  • we have a good answer for: how will downstream users be able to "pick" the right snapshots?
In T3627#73323, @zack wrote:

Thanks for the summaries @olasd, both here and on list.
I've followed up on list.

Meanwhile here's what I propose we do (spoiler!):

a) A4: add to the archive Merkle DAG only the filtered snapshot (referencing "intrinsic" branches only, as per A2) and its transitive closure

(for completeness, A2 is the "opt-in" version of filtering, i.e. only keeping the branches in the default git refspec)

b) (optional, and not blocking for going ahead with (a) in the meantime) create the non-filtered snapshot too, and store it as extrinsic metadata

To create a snapshot with properly typed outbound edges, we need to fetch the objects from the remote end (git doesn't type the objects in its protocol, so in general we don't know the type of objects before having fetched them).

In practice, that should not be a problem: @ardumont did a short analysis of fetching (large) repos with and without (a lot of) merge request branches, and didn't find a substantial difference in number of objects or packfile size (only a few percent size difference). At that point, as we've had to fetch them, I would be tempted to just add the objects to storage (all the way up to the snapshot), and leave them dangling?

c) extend the snapshot model to allow typing outgoing edges from snapshots. As terminology (bikeshed) goes:

  • I don't like "type", as it's too broad and we only have one dimension to discriminate on for now. If we add more in the future, a single category ("type") will not cut it anyway
  • (/me checks the dictionary) I don't think this is intrinsic v. extrinsic, but rather endogenous v. exogenous w.r.t. what a VCS tool does out of the box. Hence I propose to add an "endogeneous" boolean property to the outgoing snapshot edges

In your proposal, snapshots with "endogenous" branches (we need to agree on spelling, btw) would only be referenced as extrinsic metadata on origins, correct?

Regarding the past, I'm not opposed to C2 (generating new snapshots for all visits), with the following caveats:

  • it should not be a blocker for going ahead with (a)

Yeah, absolutely, that can happen asynchronously.

  • we have a good answer for: how will downstream users be able to "pick" the right snapshots?

The origin visit statuses have a date field, and the "last one" is supposed to take precedence. Now, whether all users of the data model really do that, I don't know. I know for sure that the dataset exports don't, for instance (they only filter origin visit statuses with respect to their "status": full, partial or otherwise).