
Design and implement a mapping from "original VCS ids" to SWHIDs to help incremental loaders
Closed, Migrated. Edits Locked.

Description

When doing incremental loading of VCS origins, one of our concerns is minimizing the processing that we have to do on the SWH worker, without incurring too much load on the database.

When we do an incremental load of a git origin, we tell the remote server the set of heads of the previous snapshot, and it returns a bunch of objects in a packfile. Before processing the objects in the packfile to convert them to swh.model objects, we filter out the objects that are already present in the swh archive. This is easy to do currently, as SWHIDs and git object ids are compatible, so we can just use the regular swh archive object filtering methods. This isn't future-proof: one day SWHIDs will migrate towards a git-incompatible specification, or git will migrate towards a SWHID-incompatible object id specification.
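For context, the current git-side filtering amounts to something like the sketch below; it relies entirely on git commit ids being byte-identical to the object ids embedded in SWHIDs. This is a simplified illustration, assuming a storage client with a batched "which of these ids are unknown" lookup (the method name is meant to be illustrative of swh.storage's *_missing style lookups, not an exact signature).

```python
def ids_to_load(storage, packfile_commit_ids: list) -> list:
    """Keep only the commit ids whose objects are not yet archived.

    This only works as long as a git commit id and the object id part of
    the corresponding SWHID are byte-for-byte identical; the proposed
    mapping would replace this identity assumption with an explicit
    vcs_id -> SWHID lookup.
    """
    # Hypothetical batched lookup returning the subset of unknown ids.
    missing = set(storage.revision_missing(packfile_commit_ids))
    return [cid for cid in packfile_commit_ids if cid in missing]
```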

To do an incremental load of a SVN repo, we work around the issue by spot-checking the last revision loaded, then only importing new revisions. This is easy to do thanks to SVN's linear nature.

Incremental loading of mercurial repos is more complex: once we've downloaded a set of nodes from the server, we're unable to map them to objects in the SWH archive. Computing the SWHIDs of all nodes again is expensive; some repos have millions of nodes, which we'd have to process before even starting to load any new object.

We could work around this by retrieving the set of nodeids from the history of all heads in the previous snapshot of the origin being loaded; but that would not help for forked repositories (which we'd have to process in full again), and it is still a somewhat expensive operation on the backend database (the original nodeid is buried in the revision extra headers), one that is not obvious to cache or index.

The proposal is to design and implement an "original VCS id" -> SWHID mapping, which would enable a more effective implementation of incremental loading of mercurial repos, and also future-proof the current implementation of incremental loading of git repos.

There are some design concerns to be solved:

  • scope: mercurial will only need the mapping for revisions; git could use the mapping for all object types, but it'd generally be useful for revisions and releases (snapshots pointing at other objects are quite rare). This drives the size of the mapping too.
  • schema: I suspect something like the following would work (a rough sketch follows this list):
    • vcs (str / enum)
    • vcs_id (bytea)
    • swhid_type (enum)
    • swhid_value (bytea)
  • storage: this could, arguably, be stored in the extrinsic metadata store. But a more targeted mapping would surely be more effective, as we don't currently have any reverse (value -> swhid) indexes on the metadata store.
  • API: I guess we want to look up a batch of vcs ids for a given vcs, and return a set of (vcsid, swhid) pairs.
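To make the schema and API points above concrete, here is a minimal sketch in Python, assuming a table with roughly the fields listed and a batched lookup method. All names (VcsIdMapping, vcs_id_add, vcs_id_get) are illustrative, not an existing swh.storage API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class VcsIdMapping:
    """One row of the proposed mapping table (fields follow the list above)."""

    vcs: str            # e.g. "git", "hg", "bzr"
    vcs_id: bytes       # original identifier in the upstream VCS
    swhid_type: str     # "revision", "release", ...
    swhid_value: bytes  # object id part of the SWHID


class VcsIdMappingStorage:
    """Hypothetical storage facade for the mapping; method names are illustrative."""

    def vcs_id_add(self, mappings: List[VcsIdMapping]) -> None:
        """Insert mappings, ignoring duplicates (the table is append-only)."""
        raise NotImplementedError

    def vcs_id_get(self, vcs: str, vcs_ids: List[bytes]) -> Iterable[VcsIdMapping]:
        """Batched lookup: return the known (vcs_id, SWHID) pairs for this VCS.

        Ids absent from the result are simply unknown to the mapping and
        have to be recomputed by the loader (graceful degradation).
        """
        raise NotImplementedError
```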

Such an index is also probably a useful user-facing feature in the web search / web api, but it might make sense to only do that within a generic metadata search engine rather than specifically for this.

Event Timeline

olasd triaged this task as Normal priority. Dec 3 2020, 11:14 AM
olasd created this task.

scope: mercurial will only need the mapping for revisions;

Mercurial could also use that mapping for every object. However, just keeping that mapping for revisions already provides a very large complexity boost (or reduction, I should say), so that is "good enough" for us. I strongly suspect the same will apply to Bazaar.

API-wise, batching would be useful, but we can do something "reasonably efficient" without it.

scope: mercurial will only need the mapping for revisions;

Mercurial could also use that mapping for every object. However, just keeping that mapping for revisions already provides a very large complexity boost (or reduction, I should say), so that is "good enough" for us. I strongly suspect the same will apply to Bazaar.

Ack. I think having this for all objects in the short term would make it "too large", and having it for "objects with a history where computing the swhid is (deeply) recursive and expensive" (i.e. revisions and releases) is a decent balance.

In T2849#53968, @olasd wrote:

scope: mercurial will only need the mapping for revisions;

Mercurial could also use that mapping for every object. However, just keeping that mapping for revisions already provides a very large complexity boost (or reduction, I should say), so that is "good enough" for us. I strongly suspect the same will apply to Bazaar.

Ack. I think having this for all objects in the short term would make it "too large", and having it for "objects with a history where computing the swhid is (deeply) recursive and expensive" (i.e. revisions and releases) is a decent balance.

Big +1 on this.

douardda raised the priority of this task from Normal to High. Dec 7 2020, 10:29 AM
douardda added a subscriber: douardda.

Thanks, this would be an important new feature. Some comments/random thoughts below.

A further design point, which I think is already implicit in what you wrote, but which would be useful to make explicit:

  • this new mapping table is needed "only" to speed things up; if we lose it, it will just mean we will be slower at future archival (at least for a while), but it won't be the end of the world

assuming this is true, loaders will need to be designed with graceful degradation for the incompleteness (or entire disappearance) of the mapping table.
(By the way, we need a cool name for this thing :-))

Something I'm less sure about:

  • the table is just a cache of derived information, which can always be rebuilt from other data we have

this is probably not true, unless we also store VCS IDs somewhere else, e.g., in the metadata storage.
One way or the other, I think this point should be documented in the design notes/spec of this thing.

Now a bold one:

  • how is this fundamentally different from the kind of mapping we need to address T2430?

sure, that one is not "recursive" and only applies to the external container, but that's not much different from the case of mercurial where only some objects have an ID we want to map to SWHIDs.
So, should we think about a DB/storage scheme that accounts for both needs?

In T2849#54254, @zack wrote:

Thanks, this would be an important new feature. Some comments/random thoughts below.

A further design point, which I think is already implicit in what you wrote, but which would be useful to make explicit:

  • this new mapping table is needed "only" to speed things up; if we lose it, it will just mean we will be slower at future archival (at least for a while), but it won't be the end of the world

assuming this is true, loaders will need to be designed with graceful degradation for the incompleteness (or entire disappearance) of the mapping table.

Agreed. Missing objects in the mapping table would just mean that the loader does a bit more work to re-create the associated objects, and, hopefully, generate the same SWHIDs again (which would then be used to populate the mapping table).

(By the way, we need a cool name for this thing :-))

Something I'm less sure about:

  • the table is just a cache of derived information, which can always be rebuilt from other data we have

this is probably not true, unless we also store VCS IDs somewhere else, e.g., in the metadata storage.
One way or the other, I think this point should be documented in the design notes/spec of this thing.

I think it's important that we keep storing the original upstream identifiers of the objects we're generating/ingesting, as metadata with the same long-term storage characteristics as the rest of the archive. I only see this new component/table/API as a reverse index for this metadata.

Now a bold one:

  • how is this fundamentally different from the kind of mapping we need to address T2430?

sure, that one is not "recursive" and only applies to the external container, but that's not much different from the case of mercurial where only some objects have an ID we want to map to SWHIDs.
So, should we think about a DB/storage scheme that accounts for both needs?

This use case had escaped me, but yes, this is the kind of thing I was thinking about when talking about making this index available for user lookups. It would surely make sense to ensure nix/guix tarball hashes or PyPI sdist SHA256s or other upstream package ids can fit in this index.

OK, so the plan is a first step as simple as possible: implement what @olasd proposed in the task, put this table in the storage, and provide a simple batch get API endpoint.
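Continuing the illustrative names from the earlier sketch, a loader could use such a batch endpoint roughly as follows: look up a batch of original ids, fall back to recomputation for misses, and write the newly computed pairs back so the mapping heals itself over time. This is a sketch of the intended usage pattern, not an existing loader API.

```python
from typing import List


def nodes_to_load(storage: "VcsIdMappingStorage", node_ids: List[bytes]) -> List[bytes]:
    """Return the mercurial node ids whose objects still need to be loaded.

    Misses are not an error: the loader recomputes those objects and their
    SWHIDs, then calls vcs_id_add() so the next run finds them in the mapping.
    """
    known = {m.vcs_id for m in storage.vcs_id_get("hg", node_ids)}
    return [nid for nid in node_ids if nid not in known]
```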

In T2849#54256, @olasd wrote:
In T2849#54254, @zack wrote:

Thanks, this would be an important new feature. Some comments/random thoughts below.

A further design point, which I think is already implicit in what you wrote, but which would be useful to make explicit:

  • this new mapping table is needed "only" to speed things up; if we lose it, it will just mean we will be slower at future archival (at least for a while), but it won't be the end of the world

assuming this is true, loaders will need to be designed with graceful degradation for the incompleteness (or entire disappearance) of the mapping table.

Agreed. Missing objects in the mapping table would just mean that the loader does a bit more work to re-create the associated objects, and, hopefully, generate the same SWHIDs again (which would then be used to populate the mapping table).

+1 here, missing data will just waste CPU, which is "fine"

(small note: if we end up having this for all objects, we can probably recover from some of the corruption issues where a content ID is known, but the actual content is missing/corrupted)

(By the way, we need a cool name for this thing :-))

Something I'm less sure about:

  • the table is just a cache of derived information, which can always be rebuilt from other data we have

this is probably not true, unless we also store VCS IDs somewhere else, e.g., in the metadata storage.
One way or the other, I think this point should be documented in the design notes/spec of this thing.

I think it's important that we keep storing the original upstream identifiers of the objects we're generating/ingesting, as metadata with the same long-term storage characteristics as the rest of the archive. I only see this new component/table/API as a reverse index for this metadata.

+1

Question: who should be responsible for filling this table? The loader or the storage (as a side effect of revision_add)?

douardda claimed this task.