
Survey revisions/releases with partially loaded history
Open, Low, Public

Description

To enable global deduplication of objects in the git loader without mediation via the extid table (and as long as SWHIDs are 1:1 compatible with git identifiers), we need to make sure that the revisions and releases currently in the archive have been properly loaded, with their full history.

We need to survey this, to figure out:

  • how many history holes we have (revision/release objects with parents unknown in the archive)
  • how many history holes we can fix (by reloading a known origin containing the holey object and its parents)
  • whether the remaining history holes would prevent us from turning on global deduplication altogether, or not.

Event Timeline

21:57 guest@softwareheritage => select count(distinct id) from revision_history where not exists (select 1 from revision where id=parent_id);
 count 
───────
  2218
(1 row)

That's... much lower than I expected, which is good!
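
For readers without access to the archive database, here is a minimal sketch of what the query above counts, run against a toy in-memory SQLite database. The two-column schema, the sample identifiers and the helper names `build_toy_archive` and `count_history_holes` are assumptions made for illustration; the real archive runs on PostgreSQL and its tables carry more columns.

```python
import sqlite3


def build_toy_archive():
    """Create a tiny in-memory stand-in for the archive (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE revision (id TEXT PRIMARY KEY);
        CREATE TABLE revision_history (id TEXT, parent_id TEXT);
        """
    )
    # Revisions actually present in the toy archive.
    conn.executemany(
        "INSERT INTO revision VALUES (?)", [("a",), ("b",), ("c",), ("d",)]
    )
    # Declared parent edges: (revision, parent).
    conn.executemany(
        "INSERT INTO revision_history VALUES (?, ?)",
        [
            ("b", "a"),  # parent is archived: no hole
            ("c", "x"),  # parent "x" is missing: hole
            ("d", "c"),  # merge revision, first parent archived
            ("d", "y"),  # second parent missing: hole
            ("d", "z"),  # third parent missing: "d" is still counted once
        ],
    )
    return conn


def count_history_holes(conn):
    """Count distinct revisions with at least one declared parent not in `revision`."""
    query = """
        SELECT count(DISTINCT id)
        FROM revision_history
        WHERE NOT EXISTS (
            SELECT 1 FROM revision WHERE revision.id = revision_history.parent_id
        )
    """
    return conn.execute(query).fetchone()[0]


if __name__ == "__main__":
    print(count_history_holes(build_toy_archive()))
```

Revision "a" has no row in `revision_history` at all, so it can never be counted by this query, whatever the state of the archive.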

In T3660, @grouss has found many more.
Might be for a different reason (the dataset he analyzed is not the live one), but it's worth a comparison.

You might be interested in what @grouss just opened in T3660
(ah, scratch that, zack already mentioned it)

According to the list of nodes provided by seirl, there were ~21,000,000 revisions without ancestors in the swh-graph snapshot (2020-12-15).
Checking against the current live swh DAG 2 days ago, 98% of them have one in release or snapshot_branch.
Indeed, I was surprised because I didn't have to loop over the revision history.

It means that probably less than 2% (2% being an upper bound, i.e. at most about 420,000) of the ~21 × 10^6 revisions still have a problem, unless it comes from the downstream process.

In T3656#72364, @grouss wrote:

according to the list of nodes provided by seirl there were ~21,000,000 revisions without ancestors according to swh-graph snapshot (2020-12-15)

Note that we are comparing different things here.

@olasd's query counts the revisions that have a (declared) ancestor in the history graph that cannot be found in the archive.
@grouss's/@seirl's numbers above count the revisions that do not have a declared ancestor at all.

For instance, the initial commit of any given git repository will not appear in @olasd's count, while it does appear in @grouss's count.
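
To make the difference between the two counts concrete, here is a small sketch that classifies the revisions of a toy history both ways. The function name `classify_revisions`, the dict-based representation and the sample identifiers are assumptions for illustration only, not the way either survey was actually computed.

```python
def classify_revisions(archived, declared_parents):
    """Split archived revisions into the two populations being compared.

    archived: set of revision ids present in the (toy) archive.
    declared_parents: dict mapping a revision id to its list of declared parent ids.
    Returns (holes, roots):
      holes: revisions with at least one declared parent missing from `archived`
      roots: revisions that declare no parent at all
    """
    holes = {
        rev
        for rev in archived
        if any(parent not in archived for parent in declared_parents.get(rev, []))
    }
    roots = {rev for rev in archived if not declared_parents.get(rev)}
    return holes, roots


if __name__ == "__main__":
    archived = {"a", "b", "c"}
    declared_parents = {
        "a": [],     # initial commit: no declared parent
        "b": ["a"],  # parent archived
        "c": ["x"],  # declared parent "x" is not archived
    }
    holes, roots = classify_revisions(archived, declared_parents)
    print("history holes:", sorted(holes))
    print("no declared ancestor:", sorted(roots))
```

In this toy history, "c" falls in the first population (a history hole) and "a" in the second (an ordinary initial commit). The two sets cannot overlap, since a revision with no declared parent cannot have a missing one.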