
Survey revisions/releases with partially loaded history
Open, Low, Public

Description

To be able to enable global deduplication of objects in the git loader without mediation via the extid table (and as long as swhids are 1:1 compatible with git identifiers), we need to make sure that the revisions and releases currently in the archive have been properly loaded, with their full history.

We need to survey this, to figure out:

  • how many history holes we have (revision/release objects with parents unknown in the archive)
  • how many history holes we can fix (by reloading a known origin containing the holey object and its parents)
  • whether the remaining history holes would prevent us from turning on global deduplication altogether, or not.

Event Timeline

21:57 guest@softwareheritage => select count(distinct id) from revision_history where not exists (select 1 from revision where id=parent_id);
 count 
───────
  2218
(1 row)

That's... much lower than I expected, which is good!

In T3660, @grouss has found many more.
Might be for a different reason (the dataset he analyzed is not the live one), but it's worth a comparison.

You might be interested in what @grouss just opened in T3660
(ah, scratch that, zack already mentioned it)

According to the list of nodes provided by seirl, there were ~21,000,000 revisions without ancestors in the swh-graph snapshot (2020-12-15).
Checking in the current live swh DAG 2 days ago, 98% of them have one in release or snapshot_branch.
Indeed, I was surprised because I didn't have to loop over the revision history.

It means that probably fewer than 2% (2% is an upper bound) of the ~21 × 10^6 revisions still have a problem, unless it comes from the downstream process.

In T3656#72364, @grouss wrote:

According to the list of nodes provided by seirl, there were ~21,000,000 revisions without ancestors in the swh-graph snapshot (2020-12-15).

Note that we are comparing different things here.

@olasd's query counts the revisions that have a (declared) ancestor in the history graph which cannot be found in the archive.
@grouss's/@seirl's numbers above count the revisions that do not have a declared ancestor at all.

For instance, the initial commit of any given git repository will not appear in @olasd's count, while it does appear in @grouss's count.
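To make the difference between the two counts concrete, here is a minimal sketch on a toy history. The revision ids and the `declared_parents` mapping are made up for illustration; they are not archive data, and this is not how either number was actually computed.

```python
# Toy model (hypothetical data): revision id -> list of declared parent ids.
declared_parents = {
    "A": [],          # initial commit: declares no parent
    "B": ["A"],
    "C": ["B", "X"],  # "X" is declared as a parent but is not in the archive
}

# Revisions known to the (toy) archive.
archived = set(declared_parents)

# Count 1: revisions with at least one declared parent missing from the
# archive. This is what the revision_history query measures.
missing_parent = {
    rev
    for rev, parents in declared_parents.items()
    if any(parent not in archived for parent in parents)
}

# Count 2: revisions that declare no parent at all, such as initial commits.
# This is what the swh-graph based list measures.
no_declared_parent = {
    rev for rev, parents in declared_parents.items() if not parents
}

print(sorted(missing_parent))
print(sorted(no_declared_parent))
```

Only "C" is a history hole in the sense of this task; "A" is a legitimate root and only shows up in the second count.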

effort: low

  • quantify the “hole” problem which prevents us from doing T3655 altogether

Looks like the number of affected revisions is fluctuating a bit:

zcat revisions_missing_parent_20200123.gz | wc -l
1143

zcat revisions_missing_parent_20220125.gz | wc -l
5427

zcat revisions_missing_parent_20220401.gz | wc -l
4954

Some objects have come and gone between the different snapshots of the issue, so the fluctuation appears to be caused by the git loader loading revisions in an arbitrary order rather than by a more systemic problem: some in-flight revisions might show the issue temporarily.
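One way to check which objects "come and go" is to diff two of the dumps. The sketch below assumes each `.gz` file holds one revision id per line, which is an assumption about the dump format; `read_ids` and `classify_holes` are hypothetical helper names.

```python
import gzip


def read_ids(path):
    """Read one revision id per line from a gzip-compressed dump.

    Assumes a one-id-per-line layout; adjust if the dumps differ.
    """
    with gzip.open(path, "rt") as f:
        return {line.strip() for line in f if line.strip()}


def classify_holes(older, newer):
    """Split revision ids by how they evolved between two dumps."""
    older, newer = set(older), set(newer)
    return {
        "resolved": older - newer,    # hole gone: parent was loaded since
        "persistent": older & newer,  # still broken in both dumps
        "new": newer - older,         # appeared since: possibly in-flight
    }


# Toy example with made-up ids instead of real dump files.
result = classify_holes({"r1", "r2", "r3"}, {"r2", "r3", "r4"})
print({kind: sorted(ids) for kind, ids in result.items()})
```

With real dumps, the inputs would be something like `read_ids("revisions_missing_parent_20220125.gz")`. The "persistent" set is the interesting one: those are the candidates for historically broken revisions.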

Loading git revisions in topological order (T3654) would ensure that "in-flight" revisions don't show the problem anymore, leaving only historically broken revisions behind.
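As a rough illustration of what "topological order" means here (parents written before their children), a minimal sketch using Python's standard `graphlib` module (Python 3.9+). This is not the git loader's code; `load_order` and the toy `declared_parents` mapping are hypothetical.

```python
from graphlib import TopologicalSorter


def load_order(declared_parents):
    """Return the revisions of a batch so that parents come before children.

    declared_parents maps a revision id to the ids of its declared parents.
    Parents that are not part of the batch (for example, already archived)
    are left out of the result.
    """
    # TopologicalSorter expects node -> predecessors, and static_order()
    # yields every predecessor before the nodes that depend on it.
    order = TopologicalSorter(declared_parents).static_order()
    return [rev for rev in order if rev in declared_parents]


# Toy batch: "D" is a merge of "B" and "C"; "Z" is outside the batch.
batch = {
    "D": ["B", "C"],
    "B": ["A"],
    "C": ["A"],
    "A": ["Z"],
}
print(load_order(batch))
```

If revisions are inserted in this order, an interrupted or concurrent load never leaves a revision in the archive whose parents have not been written yet.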

It's still just a handful of revisions, so, as long as we do the topological sorting, and we make sure that this check is integrated in the scrubber, the number will go down, and I'll be fairly comfortable doing a global deduplication in all git origins using all the objects in the archive.