loader git: load revisions in topological order
Closed, Migrated (edits locked)
Assigned To

Authored By

	olasd
	Oct 14 2021, 11:11 AM

Description

Right now, the git loader loads revision objects in the order in which they appear in the packfile sent by upstream. If the loader is interrupted, for whatever reason, in the middle of adding revisions, the archive ends up with a set of revisions with no guarantee that their parents have been added as well.

This means that we cannot use a simple global lookup of revision objects to reduce the effort of loading "undeclared" forks: we cannot assume that a revision currently present in the SWH archive is complete, i.e. that its parents have been loaded too.

If we can ensure that existing revision objects have been properly loaded, including their parents, making the git loader sort revision objects in topological order before adding them would allow us to use a global lookup of revision objects to reduce the size of packfiles received from the server (instead of restricting ourselves to earlier snapshots of the same origin).
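As a rough illustration of the ordering proposed here (not the loader's actual code), the sketch below sorts revisions so that every revision comes after its parents, using the standard library's `graphlib`. The names `topological_order`, `prefix_is_parent_closed` and the `parents_by_rev` mapping are hypothetical; revision ids are plain strings, and parents that are not part of the packfile are assumed to be handled elsewhere.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List


def topological_order(parents_by_rev: Dict[str, Iterable[str]]) -> List[str]:
    """Return revision ids so that each revision comes after its parents.

    `parents_by_rev` is a hypothetical mapping: revision id -> parent ids.
    Parents absent from the mapping (not in the packfile) are ignored here.
    """
    in_pack = set(parents_by_rev)
    graph = {
        rev: [p for p in parents if p in in_pack]
        for rev, parents in parents_by_rev.items()
    }
    # TopologicalSorter treats the dict values as predecessors, so
    # static_order() yields parents before their children.
    return list(TopologicalSorter(graph).static_order())


def prefix_is_parent_closed(
    prefix: Iterable[str], parents_by_rev: Dict[str, Iterable[str]]
) -> bool:
    """Check that every in-pack parent of a revision was added before it."""
    in_pack = set(parents_by_rev)
    seen = set()
    for rev in prefix:
        if any(p in in_pack and p not in seen for p in parents_by_rev[rev]):
            return False
        seen.add(rev)
    return True


# Toy history: a <- b <- c, plus m with parents b and x (x not in the pack).
revs = {"c": ["b"], "b": ["a"], "a": [], "m": ["b", "x"]}
order = topological_order(revs)
```

With this ordering, an interruption after any number of added revisions leaves a set whose in-pack parents are all present, which is the property a global lookup would rely on. Adding revisions in packfile order gives no such guarantee: in the toy history, loading `c` first and stopping would leave `c` without `b`.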

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T2206 Quality of Service
Migrated	gitlab-migration	T4080 Minimize archival lag w.r.t. upstream code hosting platforms
Migrated	gitlab-migration	T2207 Improve ingestion efficiency
Migrated	gitlab-migration	T3655 loader git: enable global deduplication of head branches before fetching them
Migrated	gitlab-migration	T3654 loader git: load revisions in topological order

Event Timeline

(I've removed T3653 as parent as this is a somewhat longer term endeavour. Not the topological sorting itself, but making sure that (most) existing revisions aren't dangling, before we can use this topological guarantee)

olasd mentioned this in T3655: loader git: enable global deduplication of head branches before fetching them. (Oct 14 2021, 11:18 AM)

olasd added a parent task: T3655: loader git: enable global deduplication of head branches before fetching them.

effort: medium

This ensures that objects loaded into the archive are self-consistent,
but it increases the processing needed to load git repositories (i.e. it will slow loading down).

olasd mentioned this in T3656: Survey revisions/releases with partially loaded history. (Apr 11 2022, 4:10 PM)

This task has been migrated to GitLab.
