
Handling missing DAG nodes
Open, Normal, Public

Description

This is a long-standing and well-known issue, but I don't think a task has been opened for it yet.

When ingesting an origin, some nodes of the DAG may be missing, for various reasons:

  • corrupted data (e.g. a commit in the git history does not match its hash)
  • directories must be found "somewhere else" (e.g. SVN externals, T611)
  • revisions must be found "somewhere else" (e.g. Bazaar stacked branches)
  • ingestion of a (potentially large) repository might stop or crash after having ingested only some of its objects, and the repository might have disappeared by the time we try again

Currently, what happens is:

  • if the missing object is a git object, then we know its sha1_git, and it is just a dangling reference (though this will be an issue when we want to implement generation numbers, T1617)
    • even in this (fortunate) case, other objects transitively referenced might remain completely unknown
  • otherwise, objects referencing the missing object cannot even be represented in the SWH data model (and recursively, all objects referencing it)
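
To illustrate the dangling-reference case above, here is a minimal sketch. It assumes a toy in-memory graph (a dict mapping each known object id to the ids it references), not the actual SWH storage or data model; the names `find_dangling`, `affected_by_missing` and the object ids are made up for the example. It lists the referenced ids that are absent, then walks the reverse edges to find every known object that transitively depends on a missing one.

```python
from collections import deque


def find_dangling(graph):
    """Return the ids that are referenced but have no entry in `graph`."""
    return {ref for refs in graph.values() for ref in refs if ref not in graph}


def affected_by_missing(graph):
    """Return the known nodes that transitively reference a missing node."""
    # Reverse edges: target id -> set of nodes that reference it.
    referrers = {}
    for node, refs in graph.items():
        for ref in refs:
            referrers.setdefault(ref, set()).add(node)

    affected = set()
    queue = deque(find_dangling(graph))
    while queue:
        current = queue.popleft()
        for parent in referrers.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


# Toy example (hypothetical ids): "dir1" is referenced by "rev1" but was never ingested.
example = {
    "snp": ["rev2"],
    "rev2": ["rev1", "dir2"],
    "rev1": ["dir1"],
    "dir2": ["cnt1"],
    "cnt1": [],
}

print(find_dangling(example))                # {'dir1'}
print(sorted(affected_by_missing(example)))  # ['rev1', 'rev2', 'snp']
```

In this toy graph a single missing directory taints every ancestor up to the snapshot, which is presumably why per-node properties computed over the whole DAG (such as generation numbers) are hard to define when references dangle.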

Event Timeline

I think objects that we refuse to archive because of policy (that is, currently, contents larger than 100MB) also fit that description.

Examples of such missing objects are revisions with attributes that cannot fit the current data model, e.g. out-of-range dates. We have examples of such revisions in Kafka, as mentioned in T3200 and T3170.

In SWHIDv2, instead of having a hardcoded "pointer to another revision" directory entry type, we could enable pointers to more generic "unresolved external entities". When possible, we should make these pointers compatible with the current ExtID table, so that users of the data can lazily look up the contents of the pointed-to objects.
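
As a rough sketch of what such a generic pointer could look like: everything below is hypothetical (the `UnresolvedRef` class, its fields, the `resolve` helper, and the plain dict standing in for an ExtID-like index) and none of it is existing swh.model or SWHID API. The idea is only that the pointer carries an external identifier plus its type, and resolution is attempted lazily at read time.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class UnresolvedRef:
    """Hypothetical pointer to an entity that is not (yet) in the archive."""

    extid_type: str  # kind of external identifier (made-up labels in this sketch)
    extid: bytes     # opaque identifier in the external system
    reason: str      # why the target is absent, e.g. "corrupted" or "policy"


def resolve(
    ref: UnresolvedRef, extid_index: Dict[Tuple[str, bytes], str]
) -> Optional[str]:
    """Lazily look up the archived target of `ref`, or None if still unknown.

    `extid_index` is a stand-in for an ExtID-like table, keyed by
    (extid_type, extid) and mapping to an archive identifier.
    """
    return extid_index.get((ref.extid_type, ref.extid))


# Usage with made-up values: the first pointer resolves, the second stays unresolved.
index = {("example-vcs-id", b"\x01\x02"): "swh:1:rev:" + "0" * 40}
known = UnresolvedRef("example-vcs-id", b"\x01\x02", reason="found elsewhere")
unknown = UnresolvedRef("example-vcs-id", b"\xff", reason="corrupted")

print(resolve(known, index))
print(resolve(unknown, index))
```

Keeping the pointer keyed the same way as the lookup table means an entry that is unresolved today can start resolving later, once the target is ingested, without rewriting the object that references it.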