We discussed with @olasd the possibility of using the extid mapping table to make the
loader git do even less work. That would definitely help to compute fewer ids,
especially when ingesting forks.
Currently, without this ^, we only reuse the previous snapshot references to avoid
ingesting known references again. So when we ingest another fork of something known, we
actually do the work again.
All in all, our deduplication exists at the storage level (db or objstorage) but not
completely at the computation level.
For this to happen, we need to start using the extid table to map the snapshot
branch references to their corresponding sha1_git ids.
So the loader git adaptations would be to:
- at the end of the loading, after the snapshot creation of the visit (at this point, we know we don't have dangling references, so it's fine to rely on this to skip some work):
  - retrieve the branch references of that snapshot (filtering out aliases) and, for each reference, push the mapping (version 0, sha1-git of the commit/tag, revision/release id) into the extid table
- at the beginning of the loading, adapt the reading of unknown references (for the origin) to filter out the references that are actually already known, by looking them up in the extid table. If they are present, we already know them and can skip the work. Nonetheless, those references must still end up in the final snapshot.
That's actually what's been done recently with the mercurial loader.
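
To make the two adaptations above more concrete, here is a rough sketch of the idea. It is not the actual loader or storage API: the extid table is modelled as a plain dict, and `Branch`, `record_extids` and `split_refs` are hypothetical names made up for illustration only.

```python
from typing import Dict, NamedTuple, Tuple

EXTID_VERSION = 0  # "version 0" of the mapping, as described above


class Branch(NamedTuple):
    """Hypothetical snapshot branch: what it points to and its target."""
    target_type: str  # "revision", "release" or "alias"
    target: bytes     # SWH id of the object (or a ref name for aliases)


# Hypothetical in-memory stand-in for the extid table:
# (version, sha1_git of the commit/tag) -> (target_type, revision/release id)
ExtIDTable = Dict[Tuple[int, bytes], Tuple[str, bytes]]


def record_extids(
    table: ExtIDTable,
    remote_refs: Dict[bytes, bytes],
    snapshot_branches: Dict[bytes, Branch],
) -> None:
    """End of loading: push one mapping per non-alias snapshot branch."""
    for name, branch in snapshot_branches.items():
        if branch.target_type == "alias":
            continue  # aliases only point at other branches, skip them
        sha1_git = remote_refs[name]
        table[(EXTID_VERSION, sha1_git)] = (branch.target_type, branch.target)


def split_refs(
    table: ExtIDTable, remote_refs: Dict[bytes, bytes]
) -> Tuple[Dict[bytes, Branch], Dict[bytes, bytes]]:
    """Beginning of loading: separate already-known refs from unknown ones.

    Known refs need no further work but are returned so they can still
    end up in the final snapshot; unknown refs must be fetched and loaded.
    """
    known: Dict[bytes, Branch] = {}
    unknown: Dict[bytes, bytes] = {}
    for name, sha1_git in remote_refs.items():
        hit = table.get((EXTID_VERSION, sha1_git))
        if hit is None:
            unknown[name] = sha1_git
        else:
            known[name] = Branch(*hit)
    return known, unknown
```

With something like this, loading a fork would only have to process the references returned in `unknown`, while the `known` ones are added to the snapshot as-is.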