Nodes with missing ancestors in SWH DAG / SWH-graph
Open, Low, Public


As part of a work to be submitted (A. Pietri, G. Rousseau, S. Zacchiroli), we have introduced some checks on the integrity of the dataset and on the statistical measures made.

One of the criteria checked is the absence of parents for DAG nodes.
Based on the 2020-12-15 export used for the most recent version of swh-graph, we have identified:

  • 427,531 release nodes without ancestors: ratio = 2.58 %
  • 1,343,830 directory nodes without ancestors: ratio = 0.01 %
  • 21,591,750 revision nodes without ancestors: ratio = 1.09 %
  • 54,736 snapshot nodes without ancestors: ratio = 0.03 %
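
To make the check concrete, here is a minimal sketch of how such per-type counts and ratios could be computed from an edge list. It is not the code used for the figures above: it assumes that "without ancestors" means "no incoming edge", that identifiers carry their node type as the third colon-separated field (as SWHIDs do), and the function name and toy identifiers are purely illustrative.

```python
from collections import Counter


def nodes_without_ancestors(nodes, edges):
    """Count, per node type, the nodes that no edge points to.

    nodes: iterable of identifiers such as "swh:1:rev:..." (type is field 3)
    edges: iterable of (src, dst) pairs, meaning src points to dst
    Returns {type: (orphan_count, total_count, ratio_in_percent)}.
    """
    has_ancestor = {dst for _, dst in edges}
    total, orphans = Counter(), Counter()
    for node in nodes:
        node_type = node.split(":")[2]
        total[node_type] += 1
        if node not in has_ancestor:
            orphans[node_type] += 1
    return {t: (orphans[t], total[t], 100.0 * orphans[t] / total[t]) for t in total}


# Toy graph with shortened, hypothetical identifiers (not real SWHIDs).
nodes = ["swh:1:snp:s1", "swh:1:rev:r1", "swh:1:rev:r2", "swh:1:dir:d1", "swh:1:dir:d2"]
edges = [
    ("swh:1:ori:o1", "swh:1:snp:s1"),
    ("swh:1:snp:s1", "swh:1:rev:r1"),
    ("swh:1:rev:r1", "swh:1:dir:d1"),
    ("swh:1:rev:r2", "swh:1:dir:d2"),  # r2 itself has no incoming edge
]
stats = nodes_without_ancestors(nodes, edges)
for node_type, (missing, count, ratio) in stats.items():
    print(f"{missing} {node_type} nodes without ancestors out of {count}: ratio = {ratio:.2f} %")
```

In this toy graph only the revision r2 has nothing pointing to it, so it is the only node reported as missing ancestors.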

Searching for at least some of the visits in which the first 1000 revisions of this file were seen shows that more than 98% do have an origin visit in the current version of the DAG (queries run between October 12 and 14, 2021 on the main database, PostgreSQL).

"Searching for at least some" means the search was limited to those of the first 1000 revisions for which at least one origin was found by a provenance query, the query itself being limited to revisions found as the target of a release or snapshot_branch.

Two extra files:

  • JSON dump, for each revision, of every node traversed during the provenance query

  • oldest visit for each revision for which at least one visit was found
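
For readers who want to reproduce something similar, the sketch below shows one possible shape for the provenance walk and for the two outputs. It is only an in-memory analogue under stated assumptions: the actual queries were run against PostgreSQL, and the `predecessors` mapping, the `visits` mapping, the function names, the output layout, and the shortened identifiers are all hypothetical.

```python
import json
from collections import deque
from datetime import date


def node_type(node_id):
    """Extract the type field ("rev", "snp", "ori", ...) from an identifier."""
    return node_id.split(":")[2]


def provenance(revision, predecessors):
    """Walk backwards from a revision; return (origins found, nodes traversed)."""
    seen, queue = {revision}, deque([revision])
    origins, traversed = [], []
    while queue:
        node = queue.popleft()
        traversed.append(node)
        if node_type(node) == "ori":
            origins.append(node)
            continue  # origins are roots, nothing points to them
        for pred in predecessors.get(node, []):
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return origins, traversed


# Toy data (hypothetical): node -> nodes pointing to it, origin -> visit dates.
predecessors = {
    "swh:1:rev:r1": ["swh:1:snp:s1", "swh:1:rel:l1"],  # target of a snapshot and a release
    "swh:1:rel:l1": ["swh:1:snp:s1"],
    "swh:1:snp:s1": ["swh:1:ori:o1"],
    "swh:1:rev:r2": [],  # no ancestor at all
}
visits = {"swh:1:ori:o1": [date(2019, 6, 2), date(2015, 3, 1)]}

traversal_dump = {}  # first file: nodes traversed for each revision
oldest_visit = {}    # second file: oldest visit, only where a visit was found
for rev in ("swh:1:rev:r1", "swh:1:rev:r2"):
    origins, traversed = provenance(rev, predecessors)
    traversal_dump[rev] = traversed
    dates = [d for origin in origins for d in visits.get(origin, [])]
    if dates:
        oldest_visit[rev] = min(dates).isoformat()

print(json.dumps(traversal_dump, indent=2))
print(json.dumps(oldest_visit, indent=2))
```

A breadth-first walk is used here simply because it records traversed nodes in order of distance from the revision; any traversal that visits each predecessor once would serve the same purpose.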