
Decide a consistent policy on having multiple archived objects for the same extid
Closed, Migrated. Edits Locked.

Description

The current extid table is an n-to-n mapping between:

  • external artifacts, identified by an extid type (e.g. hg-nodeid) and an extid (usually some form of hash)
  • archived objects in Software Heritage, identified by their SWHID (in practice, an object type and a hash)
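To make the shape concrete, here is a minimal sketch of one row of that mapping; the field names are approximations of the swh.model definitions, not verbatim:

```python
from dataclasses import dataclass

# Illustrative sketch of one extid row; field names approximate the
# swh.model definitions rather than quoting them.
@dataclass(frozen=True)
class ExtIDRow:
    extid_type: str   # e.g. "hg-nodeid"
    extid: bytes      # usually some form of hash
    target_type: str  # SWHID object type, e.g. "revision"
    target: bytes     # SWHID object hash

# The mapping is n-to-n: the same (extid_type, extid) pair may point to
# several targets, and the same target may be pointed to by several extids.
```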

During the lifetime of the archive, mostly when we refine our data model or improve the fidelity of loaders, we've ended up loading the same external object multiple times, generating multiple archived objects (and SWHIDs) for the same external artifact.

The current assumption in the mercurial loader is that there is a single archived object for each nodeid. I noticed this assumption failing in staging and brushed it off as a partial deployment issue, but it is actually (very) false in production: we have 2 million archived mercurial revisions with multiple associated objects in the archive. (Ref: P1084)

From what I understand, the mercurial loader currently uses the extid table to avoid reprocessing mercurial changesets that have already been archived by a previous run of the loader (on any origin). For this use case, it needs a mapping from extid to the "latest" archived object corresponding to it.
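For illustration, the loader-side dedup check amounts to something like the following sketch; the storage method name and return shape are assumptions based on this discussion, not a verified swh.storage signature:

```python
# Hedged sketch of the loader-side dedup check; the storage call is an
# assumption, not a verified swh.storage signature.
def already_archived(storage, nodeid: bytes) -> bool:
    extids = storage.extid_get_from_extid("hg-nodeid", [nodeid])
    # The loader assumed at most one result per nodeid; in production a
    # single nodeid can map to several archived revisions, so only the
    # presence of a result is safe to rely on here.
    return len(extids) > 0
```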

When discussing this with @vlorentz, multiple ways of resolving this issue came up:

  1. Add a mapping version field to the extid table (see the sketch after this list)
    • add a mapping_version (loader_version?) field to the extid table.
    • allow loaders to query extid mappings using an extid type and a mapping_version "higher than x" constraint (or maybe just "equal to x"?)
    • have loaders bump the mapping version when storing future extids, if the external objects need to be reprocessed because the storage changed in an incompatible way
  2. Have loaders change their extid_type when they make changes that should trigger a new archival
  3. Wipe the old extids from the ExtID table when doing an incompatible change

We should decide on a way forward and implement it, as this is another blocker on the way to deploying the new version of the mercurial loader (loading an already-archived project is broken).

Event Timeline

olasd triaged this task as Unbreak Now! priority. Jun 30 2021, 6:49 PM
olasd created this task.

The "mapping version field" is the most fleshed out proposal as it would be my preference. My rationale for it against changing extid_type for backwards incompatible changes is that the extid_type is a property of the external artifact, while the mapping version is a property of our archiving infrastructure.

I'm "naturally averse" to the idea of wiping the extid entries (and it would probably be kind of a mess from a kafka perspective), but I'm not against it on principle either.

and it would probably be kind of a mess from a kafka perspective

Ah, good point.

I'm fine with options 1 and 2.

I have the feeling that option (1) will lead, in the long run, to an explosion in the size of the mapping, which will eventually make us converge (slowly) toward option (3).

Option (3) would also be the first one I'd consider here, because the main use case (in my mind at least) is external lookup of an extid to something in the archive. So if that mapping changes but always gives back an object in the archive (pointed to by a SWHID), that doesn't seem like a problem to me. It also doesn't seem to be a problem for the use case of avoiding re-archiving stuff (identified by extids) that has been archived in the past. So what kind of "relevant information" would we actually lose if we go with (3)?

Admittedly, (3) should ideally be implemented in a way that guarantees that extids that were resolvable in previous versions of the mapping will always be resolvable in future versions, though I don't know how easy or realistic that will be to guarantee in practice. But the simplicity of (3) still has a lot of appeal, even if this property cannot be strongly guaranteed.

So if that mapping changes but always gives back an object in the archive (pointed to by a SWHID)

(3) should ideally be implemented in a way that guarantees that extids that were resolvable in previous versions of the mapping will always be resolvable in future versions

I don't understand. Option 3 is to remove relations between extids and SWHIDs, so they won't be resolvable anymore.

So what kind of "relevant information" would we actually lose if we go with (3)?

IMO, none, because it's only used by loaders. However, it will become useful to other people when we implement T2430 (which depends on storing Disarchive data along with the extid).

(3) should ideally be implemented in a way that guarantees that extids that were resolvable in previous versions of the mapping will always be resolvable in future versions

I don't understand. Option 3 is to remove relations between extids and SWHIDs, so they won't be resolvable anymore.

We clarified this yesterday on IRC, but for the record: @vlorentz was pointing at the race condition between the time old mappings are removed and the time new ones are created; I was considering the fact that eventually all old extids will become resolvable again. There is also a way to avoid this race condition altogether: making the extid→swhid direction of the mapping append/update-only, inhibiting deletes. That way extids are always resolvable and will point to the new swhids as soon as the modified loaders are rerun.
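As a toy illustration of the append/update-only idea (a sketch over a plain in-memory dict, not real swh.storage code):

```python
# Toy model of an append/update-only extid→swhid mapping: writes may add or
# overwrite entries, but nothing is ever deleted, so any extid that once
# resolved keeps resolving (to the newest swhid once loaders are rerun).
extid_to_swhid = {}

def record_mapping(extid_type: str, extid: bytes, swhid: str) -> None:
    extid_to_swhid[(extid_type, extid)] = swhid  # upsert only, never delete

def resolve(extid_type: str, extid: bytes):
    return extid_to_swhid.get((extid_type, extid))  # None if never recorded
```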

(We also discussed a bunch of other things, which I presume @olasd will eventually summarize here. Not sure we have a decision yet.)

For now I went the simplest way I could think of, which is:

  • add an extid_version field to the ExtID object, defaulting to 0 (and backwards compatible, in terms of deduplication id, with old objects lacking the field) - D6019
  • in storage, store this field without using it at all: all queries in both directions still return all extid objects associated with the queried value, regardless of version. - D6023

In production, we currently have ~41 million entries in the extid table. I'm not sure it's worth much more effort to cut down their volume. Clients (i.e. loaders) can filter out objects with versions they don't understand, which will make them parse and load those objects again, as sketched below.
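For illustration, that client-side filtering can be as simple as the following sketch; only the extid_version field itself comes from D6019, the constant and helper names are hypothetical:

```python
# Hypothetical client-side filter; only extid_version comes from D6019.
SUPPORTED_EXTID_VERSION = 1  # whatever version this loader writes and reads

def understood_extids(extids):
    # Storage returns every version; the loader drops the ones it does not
    # understand and simply parses and loads those objects again.
    return [e for e in extids if e.extid_version == SUPPORTED_EXTID_VERSION]
```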

The worst-case scenario is having 41 million redundant entries, which is probably small enough. And I assume most objects will really be /new/ as we finally deploy the new mercurial loader and load all the bitbucket objects.

We should also reconsider this if we end up having a lot of churn in the "mapping version", but I don't really see that happening: it would only happen when we ship strongly backwards-incompatible versions of a loader, which is not something we'll do a lot (and even then, we'll only load "live" origins again, so the duplication should stay under control).

Shipped the following modules to solve that problem:

  • swh.model v2.7.0
  • swh.storage v0.35.0
  • swh.loader.mercurial v2.1.0

Now on to deployment: staging first, then production if all is well.

ardumont changed the task status from Open to Work in Progress. Jul 29 2021, 12:31 PM

by the way ^