The current extid table is an n-to-n mapping between:
- external artifacts, identified by an extid type (e.g. `hg-nodeid`) and an extid (usually some form of hash)
- archived objects in Software Heritage, identified by their SWHID (in practice, an object type and a hash)
Over the lifetime of the archive, mostly when we refined our data model or improved the fidelity of loaders, we have ended up loading the same external object multiple times, generating multiple archived objects (and SWHIDs) for the same external artifact.
The mercurial loader currently assumes that there is a single archived object for each nodeid. I noticed in staging that this does not hold and brushed it off as a partial deployment issue, but the assumption is actually (very) false in production: we have 2 million archived mercurial revisions with multiple associated objects in the archive. (Ref: P1084)
From what I understand, the mercurial loader currently uses the extid table to avoid reprocessing mercurial changesets that have already been archived by a previous run of the loader (on any origin). For this use case, it needs a mapping from each extid to the "latest" archived object corresponding to it.
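To make the problem concrete, here is a minimal in-memory sketch of the n-to-n mapping and of how one could list the external artifacts that map to more than one archived object. The `ExtIDRow` class, its field names, the helper functions and the sample values are all illustrative assumptions, not the actual storage schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtIDRow:
    """Hypothetical row of the extid table (field names are assumptions)."""

    extid_type: str  # e.g. "hg-nodeid"
    extid: bytes     # identifier of the external artifact
    target: str      # SWHID of the archived object


def group_targets(rows):
    """Group the archived objects by (extid_type, extid)."""
    targets = defaultdict(set)
    for row in rows:
        targets[(row.extid_type, row.extid)].add(row.target)
    return targets


def ambiguous_extids(rows):
    """Return the external artifacts mapped to more than one archived object."""
    return {
        key: found for key, found in group_targets(rows).items() if len(found) > 1
    }


# Made-up sample data: the first changeset was loaded twice and produced
# two different revisions, the second one was loaded once.
rows = [
    ExtIDRow("hg-nodeid", b"\x01" * 20, "swh:1:rev:" + "a" * 40),
    ExtIDRow("hg-nodeid", b"\x01" * 20, "swh:1:rev:" + "b" * 40),
    ExtIDRow("hg-nodeid", b"\x02" * 20, "swh:1:rev:" + "c" * 40),
]
```

With this data, `ambiguous_extids(rows)` contains only the first nodeid, and nothing in the row itself says which of its two targets is the "latest" one.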
When discussing this with @vlorentz, multiple ways of resolving this issue came up:
1. Add a mapping version field to the extid table
   - add a `mapping_version` (`loader_version`?) field to the extid table.
   - allow loaders to query extid mappings using an extid type and a `mapping_version` "higher than x" constraint (or maybe just "equal to x"?)
   - have loaders bump the mapping version when storing future extids, if the external objects need to be reprocessed because the storage changed in an incompatible way
2. Have loaders change their `extid_type` when they make changes that should trigger a new archival
3. Wipe the old extids from the ExtID table when making an incompatible change
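As a rough illustration of option 1, the sketch below filters an in-memory list of rows on a minimum mapping version and returns the most recent match. `VersionedExtID`, `lookup` and the "at least `min_version`" semantics are assumptions made for this example; the real constraint ("higher than x" or "equal to x") and the storage API are still open questions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class VersionedExtID:
    """Hypothetical extid row extended with a mapping version."""

    extid_type: str
    extid: bytes
    target: str           # SWHID of the archived object
    mapping_version: int  # bumped by the loader on incompatible changes


def lookup(
    rows: Iterable[VersionedExtID],
    extid_type: str,
    extid: bytes,
    min_version: int,
) -> Optional[VersionedExtID]:
    """Return the row with the highest mapping version that is at least
    min_version, or None if the artifact needs to be (re)processed."""
    candidates = [
        row
        for row in rows
        if row.extid_type == extid_type
        and row.extid == extid
        and row.mapping_version >= min_version
    ]
    return max(candidates, key=lambda row: row.mapping_version, default=None)


# Made-up sample data: one changeset archived by both loader versions,
# another one archived only by the old version.
table = [
    VersionedExtID("hg-nodeid", b"\x01" * 20, "swh:1:rev:" + "a" * 40, 0),
    VersionedExtID("hg-nodeid", b"\x01" * 20, "swh:1:rev:" + "b" * 40, 1),
    VersionedExtID("hg-nodeid", b"\x02" * 20, "swh:1:rev:" + "c" * 40, 0),
]

fresh = lookup(table, "hg-nodeid", b"\x01" * 20, min_version=1)
stale = lookup(table, "hg-nodeid", b"\x02" * 20, min_version=1)
```

Here `fresh` is the version 1 row, so the loader could skip that changeset, while `stale` is `None`, which would tell the loader to process the changeset again. Option 2 would give a similar result by encoding the version in `extid_type` itself, at the cost of not being able to express a range of versions in one query.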
We should decide on a way forward and implement it, as this is another blocker for the deployment of the new version of the mercurial loader (loading an already archived project is broken).