Maniphest T1957

Handling missing DAG nodes
Closed, Migrated, Edits Locked

Assigned To

Authored By

	vlorentz
	Aug 20 2019, 9:58 AM

Description

This is a long-standing and well-known issue, but I don't think a task has been opened for it yet.

When ingesting an origin, some nodes of the DAG may be missing, for various reasons:

- corrupted data (e.g. a commit in the git history does not match its hash)
- a directory must be found "somewhere else" (e.g. SVN externals, T611)
- revisions must be found "somewhere else" (e.g. Bazaar stacked branches)
- ingestion of a (potentially large) repository may stop or crash after only some of its objects have been ingested, and the repository may have disappeared by the time we try again

Currently, what happens is:

- if the missing object is a git object, then we know its sha1_git, and it is just a dangling reference (though this will become an issue when we want to implement generation numbers, T1617)
  - even in this (fortunate) case, other objects transitively referenced by it may remain completely unknown
- otherwise, objects referencing the missing object cannot even be represented in the SWH data model (and neither, recursively, can any object referencing them)
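
To make the "dangling reference" case concrete, here is a minimal sketch, not the actual loader or storage code. It assumes the archive can be viewed as a plain mapping from object id to the ids it references; the function name `find_missing_nodes` and the ids are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def find_missing_nodes(
    graph: Dict[str, List[str]], roots: Iterable[str]
) -> Set[str]:
    """Return ids reachable from `roots` that are referenced but not stored.

    `graph` is a hypothetical stand-in for the archive: each known object id
    maps to the ids it references.
    """
    missing: Set[str] = set()
    seen: Set[str] = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node not in graph:
            # Dangling reference: we know the id, but not the object itself,
            # so whatever it references cannot be discovered from here.
            missing.add(node)
            continue
        stack.extend(graph[node])
    return missing


# Hypothetical archive state: "rev1" was referenced but never ingested.
archive = {
    "rev2": ["rev1", "dirB"],
    "dirB": ["fileX"],
    "fileX": [],
}
print(sorted(find_missing_nodes(archive, ["rev2"])))  # ['rev1']
```

The walk stops at `rev1`, which is the point of the sub-item above: anything only reachable through the missing object stays invisible.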

Related Objects

Status	Assigned	Task
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T367 ingest Google Code repositories
Migrated	gitlab-migration	T617 ingest Google Code Subversion repositories
		Unknown Object (Maniphest Task)
		Unknown Object (Maniphest Task)
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T328 svn / subversion loader
Migrated	gitlab-migration	T611 support for external definitions in the svn/subversion loader
Migrated	gitlab-migration	T1617 Experiment with generation numbers to improve revisions walk performance
Migrated	gitlab-migration	T1957 Handling missing DAG nodes
Migrated	gitlab-migration	T3282 Add support for "uninterpreted upstream object" in SWH model and storage

Event Timeline

vlorentz triaged this task as Normal priority. Aug 20 2019, 9:58 AM

vlorentz created this task.

vlorentz mentioned this in D1862: Allow -1 as Content length.

vlorentz added parent tasks: T611: support for external definitions in the svn/subversion loader, T1617: Experiment with generation numbers to improve revisions walk performance.

zack updated the task description. Aug 20 2019, 10:34 AM

I think objects that we refuse to archive because of policy (that is, currently, contents larger than 100MB) also fit that description.

olasd mentioned this in T2348: swh.journal silently loses large objects instead of rejecting them. Apr 6 2020, 10:22 PM

Examples of such missing objects are revisions with attributes that cannot fit the current data model, e.g. out-of-range dates. We have examples of such revisions in Kafka, as mentioned in T3200 and T3170.

douardda added a subtask: T3282: Add support for "uninterpreted upstream object" in SWH model and storage. Apr 22 2021, 2:43 PM

In SWHIDv2, instead of having a hardcoded "pointer to another revision" directory entry type, we could allow pointers to more generic "unresolved external entities". Where possible, we should make these pointers compatible with the current ExtID table, so that users of the data can lazily look up the contents of the pointed-to objects.
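
A rough sketch of what such a pointer could look like, purely for illustration. Everything here is an assumption: `UnresolvedEntry`, `resolve`, the field names, the "svn-external" type and the example ids are hypothetical and are not the swh.model or swh.storage API. The ExtID table is stood in for by a plain dict keyed by (type, external id).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class UnresolvedEntry:
    """Hypothetical directory entry pointing at an unresolved external entity."""

    name: bytes       # entry name within the directory
    extid_type: str   # kind of external identifier (assumed field)
    extid: bytes      # raw upstream identifier (assumed field)


# Stand-in for an ExtID-like table: (type, external id) -> archived object id.
ExtIDTable = Dict[Tuple[str, bytes], str]


def resolve(entry: UnresolvedEntry, table: ExtIDTable) -> Optional[str]:
    """Lazily look the pointer up; None means it is still unresolved."""
    return table.get((entry.extid_type, entry.extid))


table: ExtIDTable = {("svn-external", b"lib@42"): "example-archived-object-id"}
known = UnresolvedEntry(b"vendor/lib", "svn-external", b"lib@42")
unknown = UnresolvedEntry(b"vendor/other", "svn-external", b"other@7")
print(resolve(known, table))    # example-archived-object-id
print(resolve(unknown, table))  # None
```

The point of the design is that the entry stays representable whether or not the target has been archived, and resolution only happens when a reader asks for it.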

vlorentz added a parent task: T3134: SWHID v2. Oct 14 2021, 12:12 PM

vlorentz removed a parent task: T3134: SWHID v2.

vlorentz mentioned this in T3609: SWHIDv2: List issues with SWHIDv1 that should be fixed. Oct 14 2021, 12:15 PM

This task has been migrated to GitLab.

gitlab-migration closed subtask T3282: Add support for "uninterpreted upstream object" in SWH model and storage as Migrated. Jan 8 2023, 5:02 PM
