git loader: enable "partial" global deduplication of revisions via the extid mapping table
Closed, Migrated, Edits Locked

Description

We discussed with @olasd the possibility of leveraging the extid mapping table to make
the git loader do even less work. That would definitely help compute fewer ids,
especially when ingesting forks.

Currently, without this ^, we only reuse the previous snapshot references to avoid
ingesting known references again. So when we ingest another fork of something known, we
actually do the work again.

All in all, our deduplication exists at the storage level (db or objstorage) but not
completely at the computation level.

For this to happen, we need to start using the extid table to map the snapshot
branches references to their corresponding sha1_git ids.

So the git loader adaptations would be:

  • at the end of the loading, after the snapshot creation of the visit. At this point, we know we don't have dangling references. So it's fine to rely on this to skip some work.
  • retrieve the branches references of that snapshot (filtering aliases) and push for each reference the mapping (version 0, sha1-git of the commit/tag, revision/release id) into the extid table
  • at the beginning of the loading, adapt the reading of unknown references (for the origin) to filter out actual known references through the extid table. If they are present, we know them, we can skip the work. Nonetheless, those references should end up in the final snapshot.
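The two adaptations above can be sketched as follows. This is only an illustration under stated assumptions: the extid table is modelled as an in-memory dict, and `Branch`, `record_snapshot_heads` and `split_remote_refs` are hypothetical names, not the loader's or the storage's actual API.

```python
from typing import Dict, NamedTuple, Optional, Tuple

EXTID_VERSION = 0  # the "version 0" of the mapping mentioned above

# Hypothetical stand-in for the extid table:
# (version, git sha1 of the commit/tag) -> revision/release id in the archive.
ExtidTable = Dict[Tuple[int, bytes], bytes]


class Branch(NamedTuple):
    """Hypothetical, simplified view of a snapshot branch."""
    target_type: str            # "revision", "release" or "alias"
    git_sha1: Optional[bytes]   # None for aliases
    swh_id: Optional[bytes]     # None for aliases


def record_snapshot_heads(table: ExtidTable, branches: Dict[bytes, Branch]) -> None:
    """End of a successful load: store one mapping per non-alias branch."""
    for branch in branches.values():
        if branch.target_type == "alias":
            continue  # aliases point to other branches, not to objects
        table[(EXTID_VERSION, branch.git_sha1)] = branch.swh_id


def split_remote_refs(table: ExtidTable, remote_refs: Dict[bytes, bytes]):
    """Start of a load: separate refs already mapped from refs still to load."""
    known: Dict[bytes, bytes] = {}
    unknown: Dict[bytes, bytes] = {}
    for name, git_sha1 in remote_refs.items():
        swh_id = table.get((EXTID_VERSION, git_sha1))
        if swh_id is None:
            unknown[name] = git_sha1
        else:
            known[name] = swh_id  # skipped work, but kept for the final snapshot
    return known, unknown


# A first origin is loaded, then a fork sharing its main branch is visited.
table: ExtidTable = {}
record_snapshot_heads(table, {
    b"refs/heads/main": Branch("revision", b"\x01" * 20, b"\xa1" * 20),
    b"HEAD": Branch("alias", None, None),
})
known, unknown = split_remote_refs(table, {
    b"refs/heads/main": b"\x01" * 20,
    b"refs/heads/feature": b"\x02" * 20,
})
```

Here only `refs/heads/feature` would need to be fetched and hashed for the fork, while `refs/heads/main` still ends up in its snapshot through `known`.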

That's actually what's been done recently with the mercurial loader.

Event Timeline

ardumont triaged this task as Normal priority. Oct 8 2021, 2:12 PM
ardumont created this task.
olasd renamed this task from Reduce git loader work (use extid mapping table) to git loader: enable "partial" global deduplication of revisions via the extid mapping table. Oct 14 2021, 11:15 AM

Ok, I think what puzzles me in this description is the fact that the first 2 bullets of the "git loader adaptations" are actually only one point: at the end of a successful loading, store a mapping in the extid table.

Now, I still don't understand what mapping is to be stored in the extid table. What is the meaning of (version 0, sha1-git of the commit/tag, revision/release id) above? (I would expect a mapping to be a couple).

Then I don't really get how this can help if we don't load revisions in topological order.

Then I don't really get how this can help if we don't load revisions in topological order.

When the snapshot is being added, at the end of a load, the current git loader implementation guarantees that all the branch heads point to objects whose history has been fully loaded in the archive:

  • we know that, in the current loader run, all objects of the previous object types have been successfully loaded
  • we know that the packfile that we're currently loading contains all the objects that have been added in the origin between the snapshot being added and the one in the previous visit.
  • and recursively we know that the previous snapshots are history-complete. Hopefully? (I guess we need T3656 to be sure of it)
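To illustrate why this guarantee helps even without a topological load order, here is a toy sketch. It assumes a commit graph given as a plain `parents` dict and a set of heads already recorded as history-complete. The function name and the explicit graph walk are illustrative assumptions only, since the actual loader works from packfiles rather than walking commits like this.

```python
from typing import Dict, Iterable, List, Set


def commits_to_load(
    parents: Dict[str, List[str]],
    heads: Iterable[str],
    known_complete: Set[str],
) -> Set[str]:
    """Walk back from the heads, stopping at any history-complete commit."""
    to_load: Set[str] = set()
    stack = [head for head in heads if head not in known_complete]
    while stack:
        commit = stack.pop()
        if commit in to_load or commit in known_complete:
            continue  # everything behind a known head is already archived
        to_load.add(commit)
        stack.extend(parents.get(commit, []))
    return to_load


# History a <- b <- c, where "c" was a branch head of an already loaded origin.
# "f" forks from the known head "c"; "g" forks from "b", which was never a head.
parents = {"a": [], "b": ["a"], "c": ["b"], "f": ["c"], "g": ["b"]}
known_complete = {"c"}

from_known_head = commits_to_load(parents, ["f"], known_complete)
from_inner_commit = commits_to_load(parents, ["g"], known_complete)
```

In this toy model a fork branching from a recorded head only loads its own commit, while a fork branching from a commit that was never a branch head walks the shared history again. That is one way to read the "partial" in the task title.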

Now, I still don't understand what mapping is to be stored in the extid table. What is
the meaning of (version 0, sha1-git of the commit/tag, revision/release id) above? (I
would expect a mapping to be a couple).

It's a triplet. We have an extra 'version' field now because we need to distinguish
versions of the data: recall, for example, the mercurial loader's old and new
implementations, which made this necessary.

For git, afaik, the possible values are:

  • (version, sha1-git of the commit, swh revision id)
  • (version, sha1-git of the tag, swh release id)

The notation you did not get was a contraction of the above (^).
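As a rough sketch of the shape of such a triplet, the record below spells out its fields. `ExtidMapping` and its field names are hypothetical and only meant for illustration, not the actual model class; the explicit `target_type` field is an added assumption to tell the two variants apart.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExtidMapping:
    """Hypothetical record for one (version, sha1-git, swh id) triplet."""
    version: int       # distinguishes successive ways of computing the data
    git_sha1: bytes    # sha1-git of the commit or tag
    target_type: str   # "revision" for a commit, "release" for a tag (assumption)
    target_id: bytes   # swh revision or release id

    def key(self) -> Tuple[int, bytes]:
        # The lookup side of the mapping: version plus git identifier.
        return (self.version, self.git_sha1)


commit = ExtidMapping(0, bytes.fromhex("aa" * 20), "revision", bytes.fromhex("11" * 20))
tag = ExtidMapping(0, bytes.fromhex("bb" * 20), "release", bytes.fromhex("22" * 20))

# If the computation ever changed, a new version would keep old and new
# entries for the same git object from colliding.
recomputed = ExtidMapping(1, commit.git_sha1, "revision", bytes.fromhex("33" * 20))
```

Seen this way, the mapping is still a pair (key to target id); the version is simply part of the key.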

effort : low

  • this adds "redundant" data to the extid table (space usage increase)
  • but this is a cheaper way to enable global deduplication until the topological order is guaranteed