
loader git: enable global deduplication of head branches before fetching them
Closed, Migrated


This task tracks the effort to (re-)enable global deduplication of revisions in the git loader, so that less data is downloaded from upstream repositories (and then needlessly converted by workers).

  • first, enabling partial global deduplication through extid mappings for snapshot heads (for which we know that a complete load of the history has been done): T3635
  • then, surveying the opportunity of "just" doing a global lookup for any object type: T3656, together with T3654 to avoid creating new "history holes"
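
To make the first step concrete, here is a minimal sketch of filtering remote heads against a mapping of already-loaded snapshot heads before fetching. It is not the loader's actual code: `select_wanted_heads`, `lookup_archived` and `ARCHIVED_HEADS` are hypothetical names, and the in-memory set merely stands in for an extid-style lookup.

```python
from typing import Callable, Dict, Iterable, Set


def select_wanted_heads(
    remote_refs: Dict[bytes, bytes],
    known_heads: Callable[[Iterable[bytes]], Set[bytes]],
) -> Set[bytes]:
    """Return the remote head ids that still need to be fetched.

    remote_refs maps ref names to the object ids they point to.
    known_heads is a hypothetical lookup returning the subset of ids
    that were already fully loaded as snapshot heads.
    """
    candidates = set(remote_refs.values())
    already_archived = known_heads(candidates)
    return candidates - already_archived


# Toy stand-in for an extid-style mapping (assumption, for illustration only).
ARCHIVED_HEADS = {b"aaa1", b"bbb2"}


def lookup_archived(ids: Iterable[bytes]) -> Set[bytes]:
    return {i for i in ids if i in ARCHIVED_HEADS}


remote = {
    b"refs/heads/main": b"aaa1",
    b"refs/heads/dev": b"ccc3",
    b"refs/tags/v1": b"bbb2",
}
wanted = select_wanted_heads(remote, lookup_archived)
```

Only the heads left in `wanted` would be requested from upstream; heads already known from a complete load are skipped.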

Event Timeline

Effort : HIGH

  • could be done globally (i.e. query if any branch target is already in the archive), but does not fill historical “holes”
  • T3635 is a safer but partial version of this
  • T3654 would enable doing this globally
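
The "holes" concern can be illustrated with a toy model, under the assumption that the archive is a simple mapping from revision id to parent ids. `missing_ancestors` and `toy_archive` are hypothetical names for this sketch only. A plain presence check on the branch target would treat it as archived and skip the fetch, even though an ancestor is absent.

```python
from typing import Dict, List, Set


def missing_ancestors(head: bytes, parents: Dict[bytes, List[bytes]]) -> Set[bytes]:
    """Walk the history from head and return the ids that are
    referenced but absent from the toy archive."""
    missing: Set[bytes] = set()
    seen: Set[bytes] = set()
    stack = [head]
    while stack:
        rev = stack.pop()
        if rev in seen:
            continue
        seen.add(rev)
        if rev not in parents:
            # Referenced by a descendant, but never archived: a hole.
            missing.add(rev)
            continue
        stack.extend(parents[rev])
    return missing


# Toy archive: archived revision id -> parent ids. c2 was never archived.
toy_archive = {b"c3": [b"c2"], b"c1": []}

head_is_present = b"c3" in toy_archive          # the global lookup succeeds
holes = missing_ancestors(b"c3", toy_archive)   # yet the history is incomplete
```

Here the head `c3` is present, so a global lookup alone would skip it and leave the missing `c2` unfilled.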