Handling missing DAG nodes
Open, Normal, Public

Description

This is a long-standing and well-known issue, but I don't think a task has been opened about it yet.

When ingesting an origin, some nodes of the DAG may be missing, for various reasons:

  • corrupted data (eg. a commit in the git history does not match its hash)
  • a directory must be found "somewhere else" (eg. SVN externals, T611)
  • revisions must be found "somewhere else" (eg. Bazaar stacked branches)
  • ingestion of a (potentially large) repo might stop/crash after having ingested only some of its objects, and the repository might have disappeared when we try again
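To illustrate the "corrupted data" case, here is a minimal sketch of how a git object can be checked against its claimed hash. It relies only on the standard git object hashing scheme (SHA-1 over a `<type> <size>\0` header followed by the object body); the function names `git_object_id` and `is_corrupted` are made up for this example and are not part of any SWH loader.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Compute the git identifier (sha1_git) of an object.

    Git hashes the header "<type> <size>\\0" followed by the raw body.
    """
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def is_corrupted(claimed_id: str, obj_type: str, body: bytes) -> bool:
    """Return True if the body does not hash to the identifier it claims."""
    return git_object_id(obj_type, body) != claimed_id


# The empty blob has a well-known identifier in every git repository.
EMPTY_BLOB_ID = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
print(is_corrupted(EMPTY_BLOB_ID, "blob", b""))         # False
print(is_corrupted(EMPTY_BLOB_ID, "blob", b"tampered"))  # True
```

An object failing this check cannot be stored under the identifier its referrers expect, which is how it ends up as a missing node.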

Currently, what happens is:

  • if the missing object is a git object, then we know its sha1_git, and it's just a dangling reference (though this will be an issue when we want to implement generation numbers, T1617)
    • even in this (fortunate) case, other objects transitively referenced might remain completely unknown
  • otherwise, objects referencing the missing object cannot even be represented in the SWH data model (nor, recursively, can any object referencing them)
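The two situations above can be sketched on a toy graph. This is only an illustration: the DAG is modelled as a plain dict mapping an object id to the ids it references, which is an assumption made for the example and not the SWH storage API. `dangling_references` finds ids that are referenced but absent, and `affected_by_missing` walks the edges backwards to collect every object that transitively depends on a missing one (the objects that cannot be represented in the non-git case, or whose generation number cannot be computed in the git case).

```python
from collections import deque


def dangling_references(graph: dict[str, list[str]]) -> set[str]:
    """Ids that are referenced by some node but are not nodes themselves."""
    return {t for targets in graph.values() for t in targets if t not in graph}


def affected_by_missing(graph: dict[str, list[str]]) -> set[str]:
    """Nodes that directly or transitively reference a missing object."""
    # Build the reverse edges: referenced id -> ids referencing it.
    referrers: dict[str, set[str]] = {}
    for node, targets in graph.items():
        for target in targets:
            referrers.setdefault(target, set()).add(node)

    affected: set[str] = set()
    queue = deque(dangling_references(graph))
    while queue:
        current = queue.popleft()
        for referrer in referrers.get(current, ()):
            if referrer not in affected:
                affected.add(referrer)
                queue.append(referrer)
    return affected


# Hypothetical origin: "dir1" was never ingested.
graph = {
    "snapshot": ["rev2"],
    "rev2": ["rev1", "dir2"],
    "rev1": ["dir1"],
    "dir2": [],
    "other_rev": ["dir2"],
}
print(dangling_references(graph))          # {'dir1'}
print(sorted(affected_by_missing(graph)))  # ['rev1', 'rev2', 'snapshot']
```

Note that a single missing directory makes the whole chain up to the snapshot affected, while `other_rev`, which only references known objects, is untouched.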

Event Timeline

zack updated the task description. Aug 20 2019, 10:34 AM
olasd added a subscriber: olasd. Aug 20 2019, 10:57 AM

I think objects that we refuse to archive because of policy (that is, currently, contents larger than 100MB) also fit that description.