
Decide a consistent policy on having multiple archived objects for the same extid
Closed, Migrated. Edits Locked.

Description

The current extid table is an n-to-n mapping between:

  • external artifacts, identified by an extid type (e.g. hg-nodeid) and an extid (usually some form of hash)
  • archived objects in Software Heritage, identified by their SWHID (in practice, an object type and a hash)
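To make the shape concrete, here is a minimal sketch of one row of that mapping; the field names are approximations of the swh.model definitions, not verbatim:

```python
from dataclasses import dataclass

# Illustrative sketch of one extid row; field names approximate the
# swh.model definitions rather than quoting them.
@dataclass(frozen=True)
class ExtIDRow:
    extid_type: str   # e.g. "hg-nodeid"
    extid: bytes      # usually some form of hash
    target_type: str  # SWHID object type, e.g. "revision"
    target: bytes     # SWHID object hash

# The mapping is n-to-n: the same (extid_type, extid) pair may point to
# several targets, and the same target may be pointed to by several extids.
```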

During the lifetime of the archive, mostly when we refine our data model or improve the fidelity of loaders, we've ended up loading the same external object multiple times, generating multiple archived objects (and SWHIDs) for the same external artifact.

The current assumption in the mercurial loader is that there is a single archived object for each nodeid. I noticed this assumption failing in staging and brushed it off as a partial deployment issue, but it is actually (very) false in production: we have 2 million archived mercurial revisions with multiple associated objects in the archive. (Ref: P1084)

From what I understand, the mercurial loader currently uses the extid table to avoid reprocessing mercurial changesets that have already been archived by a previous run of the loader (on any origin). For this use case, it needs a mapping from extid to the "latest" archived object corresponding to it.
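For illustration, the loader-side dedup check amounts to something like the following sketch; the storage method name and return shape are assumptions based on this discussion, not a verified swh.storage signature:

```python
# Hedged sketch of the loader-side dedup check; the storage call is an
# assumption, not a verified swh.storage signature.
def already_archived(storage, nodeid: bytes) -> bool:
    extids = storage.extid_get_from_extid("hg-nodeid", [nodeid])
    # The loader assumed at most one result per nodeid; in production a
    # single nodeid can map to several archived revisions, so only the
    # presence of a result is safe to rely on here.
    return len(extids) > 0
```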

When discussing this with @vlorentz, multiple ways of resolving this issue came up:

  1. Add a mapping version field to the extid table (see the sketch after this list)
    • add a mapping_version (loader_version?) field to the extid table.
    • allow loaders to query extid mappings using an extid type and a mapping_version "higher than x" constraint (or maybe just "equal to x"?)
    • have loaders bump the mapping version when storing future extids, if the external objects need to be reprocessed because the storage changed in an incompatible way
  2. Have loaders change their extid_type when they make changes that should trigger a new archival
  3. Wipe the old extids from the ExtID table when doing an incompatible change

We should decide on a way forward and implement it, as this is another blocker on the way to deploying the new version of the mercurial loader (loading an already-archived project is broken).

Event Timeline

olasd triaged this task as Unbreak Now! priority. Jun 30 2021, 6:49 PM
olasd created this task.

The "mapping version field" is the most fleshed out proposal as it would be my preference. My rationale for it against changing extid_type for backwards incompatible changes is that the extid_type is a property of the external artifact, while the mapping version is a property of our archiving infrastructure.

I'm "naturally averse" to the idea of wiping the extid entries (and it would probably be kind of a mess from a kafka perspective), but I'm not against it on principle either.

and it would probably be kind of a mess from a kafka perspective

Ah, good point.

I'm fine with options 1 and 2.

I have the feeling that option (1) will lead, in the long run, to an explosion in the size of the mapping, which will eventually make us converge (slowly) toward option (3).

Option (3) would also be the first one I'd consider here, because the main use case (in my mind at least) is external lookup of an extid to something in the archive. So if that mapping changes but always gives back an object in the archive (pointed to by a SWHID), that doesn't seem like a problem to me. It also doesn't seem to be a problem for the use case of avoiding re-archiving stuff (identified by extids) that has been archived in the past. So what kind of "relevant information" would we actually lose if we go with (3)?

Admittedly, (3) should ideally be implemented in a way that guarantees that extids that were resolvable in previous versions of the mapping will always be resolvable in future versions, though I don't know how easy or realistic that will be to guarantee in practice. But the simplicity of (3) still has a lot of appeal, even if this property cannot be strongly guaranteed.

So if that mapping changes but always gives back an object in the archive (pointed to by a SWHID)

(3) should ideally be implemented in a way that guarantees that extids that were resolvable in previous versions of the mapping will always be resolvable in future versions

I don't understand. Option 3 is to remove relations between extids and SWHIDs, so they won't be resolvable anymore.

So what kind of "relevant information" would we actually lose if we go with (3)?

IMO, none, because it's only used by loaders. However, it will become useful to other people when we implement T2430 (which depends on storing Disarchive data along with the extid).

(3) should ideally be implemented in a way that guarantees that extids that were resolvable in previous versions of the mapping will always be resolvable in future versions

I don't understand. Option 3 is to remove relations between extids and SWHIDs, so they won't be resolvable anymore.

We clarified this yesterday on IRC, but for the record: @vlorentz was pointing at the race condition between the time old mappings are removed and the time new ones are created; I was considering the fact that eventually all old extids will become resolvable again. There is also a way to avoid this race condition altogether: making the extid→swhid direction of the mapping append/update-only, inhibiting deletes. That way extids are always resolvable and will point to the new swhids as soon as the modified loaders are rerun.
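As a toy illustration of the append/update-only idea (a sketch over a plain in-memory dict, not real swh.storage code):

```python
# Toy model of an append/update-only extid→swhid mapping: writes may add or
# overwrite entries, but nothing is ever deleted, so any extid that once
# resolved keeps resolving (to the newest swhid once loaders are rerun).
extid_to_swhid = {}

def record_mapping(extid_type: str, extid: bytes, swhid: str) -> None:
    extid_to_swhid[(extid_type, extid)] = swhid  # upsert only, never delete

def resolve(extid_type: str, extid: bytes):
    return extid_to_swhid.get((extid_type, extid))  # None if never recorded
```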

(We also discussed a bunch of other things, which I presume @olasd will eventually summarize here. Not sure we have a decision yet.)

For now I went the simplest way I could think of, which is:

  • add an extid_version field to the ExtID object, defaulting to 0 (and backwards compatible, in terms of deduplication id, with old objects lacking the field) - D6019
  • in storage, store this field without using it at all: all queries in both directions still return all extid objects associated with the queried value, regardless of version. - D6023

In production, we currently have ~41 million entries in the extid table. I'm not sure it's worth much more effort to cut down their volume. Clients (i.e. loaders) can filter out objects with versions they don't understand, which will make them parse and load those objects again, as sketched below.
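For illustration, that client-side filtering can be as simple as the following sketch; only the extid_version field itself comes from D6019, the constant and helper names are hypothetical:

```python
# Hypothetical client-side filter; only extid_version comes from D6019.
SUPPORTED_EXTID_VERSION = 1  # whatever version this loader writes and reads

def understood_extids(extids):
    # Storage returns every version; the loader drops the ones it does not
    # understand and simply parses and loads those objects again.
    return [e for e in extids if e.extid_version == SUPPORTED_EXTID_VERSION]
```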

The worst-case scenario is having 41 million redundant entries, which is probably small enough. And I assume most objects will really be /new/ as we finally deploy the new mercurial loader and load all the bitbucket objects.

We should also reconsider this if we end up having a lot of churn in the "mapping version", but I don't really see that happening: it would only happen when we ship strongly backwards-incompatible versions of a loader, which is not something we'll do a lot (and even then, we'll only load "live" origins again, so the duplication should stay under control).

Shipped the following modules to solve that problem:

  • swh.model v2.7.0
  • swh.storage v0.35.0
  • swh.loader.mercurial v2.1.0

Now on to deployment: staging first, then production if all is well.

ardumont changed the task status from Open to Work in Progress. Jul 29 2021, 12:31 PM

by the way ^