Use "fork" relationships to speed up initial load of large repositories
Closed, Migrated (Edits Locked)

Description

(I'm writing this task just so that I don't forget the idea, but I don't expect it to be actionable in the short term)

To work incrementally, VCS loaders fetch the last snapshot of the origin, which gives them a set of "heads" that they can pass to the origin, so that the origin can detect which revisions it doesn't need to send.
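
To make the mechanism concrete, here is a toy model of that negotiation. It is only a sketch: the commit graph, the `ancestors` and `revisions_to_send` helpers, and the revision ids are made up for illustration and are not the loaders' actual code.

```python
from typing import Dict, List, Set

# Toy commit graph held by the origin: revision id -> parent ids (made-up data).
Graph = Dict[str, List[str]]


def ancestors(graph: Graph, heads: Set[str]) -> Set[str]:
    """Return every revision reachable from `heads`, the heads included."""
    seen: Set[str] = set()
    # Heads the origin does not know about are simply ignored.
    stack = [h for h in heads if h in graph]
    while stack:
        rev = stack.pop()
        if rev in seen:
            continue
        seen.add(rev)
        stack.extend(p for p in graph[rev] if p in graph)
    return seen


def revisions_to_send(graph: Graph, wants: Set[str], haves: Set[str]) -> Set[str]:
    """Revisions reachable from `wants` that are not reachable from `haves`."""
    return ancestors(graph, wants) - ancestors(graph, haves)


# Linear history a <- b <- c <- d; our last snapshot pointed at "b".
graph = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
last_snapshot = {"refs/heads/main": "b"}

haves = set(last_snapshot.values())
to_send = revisions_to_send(graph, wants={"d"}, haves=haves)
```

With the heads from the last snapshot, only `c` and `d` need to be transferred; with an empty set of heads, the whole history would be sent.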

Unfortunately, when someone forks a large repository (such as https://github.com/chromium/chromium) and we see the fork for the first time, we don't have a snapshot for it, so the server has to send all revisions, and we then discard almost all of them because they are already in the archive.

However, if we could detect that new repositories are forks (from extrinsic metadata, from heuristics based on repository names, ...), we could fetch the snapshot of the original repository and use it as the base to load the fork incrementally.
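
A rough sketch of how a loader might pick its base heads under this idea. Everything here is hypothetical: `guess_parent_origin`, `base_heads`, the in-memory `snapshots` mapping, and the assumption that the extrinsic metadata carries a `fork` flag and a `parent.html_url` field are illustrative choices, not an existing API.

```python
from typing import Dict, Optional, Set

Snapshot = Dict[str, str]  # branch name -> revision id


def guess_parent_origin(metadata: dict) -> Optional[str]:
    """Return the URL of the forked-from repository, if the metadata names one.

    Assumes (hypothetically) a `fork` boolean and a `parent.html_url` field.
    """
    if not metadata.get("fork"):
        return None
    parent = metadata.get("parent") or {}
    return parent.get("html_url")


def base_heads(
    origin_url: str, snapshots: Dict[str, Snapshot], metadata: dict
) -> Set[str]:
    """Heads to advertise to the origin before the first fetch."""
    snapshot = snapshots.get(origin_url)
    if snapshot is None:
        # First visit: fall back to the parent's last snapshot, if any.
        parent_url = guess_parent_origin(metadata)
        if parent_url is not None:
            snapshot = snapshots.get(parent_url)
    return set(snapshot.values()) if snapshot else set()


# Made-up archive state: only the original repository has been loaded so far.
snapshots = {
    "https://github.com/chromium/chromium": {"refs/heads/main": "rev-1234"},
}
fork_metadata = {
    "fork": True,
    "parent": {"html_url": "https://github.com/chromium/chromium"},
}

heads = base_heads(
    "https://github.com/example-user/chromium", snapshots, fork_metadata
)
```

Here the fork has no snapshot of its own, so the heads of the original repository are used instead. If no parent can be guessed, the set is empty and the load behaves as it does today.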

Event Timeline

vlorentz triaged this task as Normal priority. Apr 19 2021, 1:49 PM
vlorentz lowered the priority of this task from Normal to Low.
vlorentz created this task.