When doing incremental loading of VCS origins, one of our concerns is minimizing the processing that we have to do on the SWH worker, without incurring too much load on the database.
When we do an incremental load of a git origin, we send the remote server the set of heads from the previous snapshot, and it sends back the new objects in a packfile. Before processing the objects in the packfile to convert them to swh.model objects, we filter out the objects that are already present in the SWH archive. This is easy to do today, as SWHIDs and git object ids are compatible, so we can just use the regular SWH archive object filtering methods. This isn't future-proof, though: SWHIDs may one day migrate to a git-incompatible specification, or git may migrate to a SWHID-incompatible object id format.
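For concreteness, here is a minimal sketch of that filtering step, assuming a storage client that exposes batch "missing" lookups (the method names are modelled on swh.storage's revision_missing / release_missing / directory_missing, but the exact signatures here are illustrative):

```python
from typing import Dict, Set


def filter_packfile_objects(
    storage, objects_by_type: Dict[str, Set[bytes]]
) -> Dict[str, Set[bytes]]:
    """Return, per git object type, the ids NOT already in the archive.

    This only works because a git object id and the corresponding SWHID
    currently share the same sha1_git value.
    """
    missing_lookup = {
        "commit": storage.revision_missing,
        "tag": storage.release_missing,
        "tree": storage.directory_missing,
    }
    result: Dict[str, Set[bytes]] = {}
    for obj_type, ids in objects_by_type.items():
        lookup = missing_lookup.get(obj_type)
        if lookup is None:
            # blobs (contents) go through the content deduplication path instead
            result[obj_type] = set(ids)
        else:
            # only the ids unknown to the archive need to be converted and loaded
            result[obj_type] = set(lookup(list(ids)))
    return result
```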
To do an incremental load of an SVN repo, we work around the issue by spot-checking the last revision loaded, then importing only the new revisions. This is easy to do thanks to SVN's linear history.
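Roughly, the spot-check amounts to something like the following sketch (the client helpers and state fields are hypothetical names, not the actual swh.loader.svn API):

```python
def revisions_to_load(svn_client, origin_url: str, previous_state) -> range:
    last_rev = previous_state.revision_number        # last revision loaded previously
    head_rev = svn_client.head_revision(origin_url)  # current HEAD revision on the remote

    # Spot-check: re-export the last loaded revision and compare it with what we
    # archived; if it diverged (e.g. history was rewritten), fall back to a full load.
    exported = svn_client.export_revision(origin_url, last_rev)
    if exported.swhid != previous_state.revision_swhid:
        return range(1, head_rev + 1)

    # Otherwise, SVN's linear history means everything after last_rev is new.
    return range(last_rev + 1, head_rev + 1)
```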
Incremental loading of mercurial repos is more complex: once we've downloaded a set of nodes from the server, we're unable to map them to objects in the SWH archive. Computing the SWHIDs of all nodes again is expensive; some repos have millions of nodes, which we'd have to process before even starting to load any new object.
We could work around this by retrieving the set of nodeids from the history of all heads in the previous snapshot of the origin being loaded (see the sketch below). But that would not help for forked repositories, which we'd still have to process in full, and it remains a fairly expensive operation on the backend database (the original nodeid is buried in the revision extra headers) that is not obvious to cache or index.
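Spelled out, that workaround would look roughly like this; it assumes the Mercurial loader records the original nodeid under a b"node" extra header and that the storage API can walk the ancestry of a set of revisions (both are illustrative assumptions, the exact key and traversal call may differ):

```python
from typing import Dict, Iterable


def nodeid_mapping_from_snapshot(
    storage, head_revision_ids: Iterable[bytes]
) -> Dict[bytes, bytes]:
    mapping: Dict[bytes, bytes] = {}
    # Walk the ancestry of the previous snapshot's heads; on a repository with
    # millions of revisions this traversal is exactly the expensive part.
    for revision in storage.revision_log(list(head_revision_ids)):
        for key, value in revision.get("extra_headers", ()):
            if key == b"node":
                # original hg nodeid -> sha1_git of the archived revision
                mapping[value] = revision["id"]
    return mapping
```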
The proposal is to design and implement an "original VCS id" -> SWHID mapping, which would enable a more effective implementation of incremental loading of mercurial repos, and also future-proof the current implementation of incremental loading of git repos.
There are some design concerns to be worked out:
- scope: mercurial will only need the mapping for revisions; git could use the mapping for all object types, but in practice it would mostly be useful for revisions and releases (snapshots pointing directly at other object types are quite rare). The scope also drives the size of the mapping.
- schema: I suspect something like the following would work (a concrete sketch of these columns follows the list):
- vcs (str / enum)
- vcs_id (bytea)
- swhid_type (enum)
- swhid_value (bytea)
- storage: this could, arguably, be stored in the extrinsic metadata store. But a more targeted mapping would surely be more effective, as we don't currently have any reverse (value -> swhid) indexes on the metadata store.
- API: I guess we want to look up a batch of vcs ids for a given vcs, and get back a set of (vcsid, swhid) pairs; a possible interface is sketched after this list.
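As a starting point for the schema discussion, the columns listed above could be expressed like this (a SQLAlchemy sketch; the names, enum values, and key choices are suggestions, not a final design):

```python
from sqlalchemy import Column, Enum, LargeBinary, MetaData, Table

metadata = MetaData()

original_vcs_id = Table(
    "original_vcs_id",
    metadata,
    # which VCS the original id comes from
    Column("vcs", Enum("git", "hg", "svn", name="vcs_type"), primary_key=True),
    # original object id in that VCS (e.g. a mercurial nodeid)
    Column("vcs_id", LargeBinary, primary_key=True),
    # type and value of the SWHID the original id maps to
    Column(
        "swhid_type",
        Enum("content", "directory", "revision", "release", "snapshot",
             name="swhid_object_type"),
        nullable=False,
    ),
    Column("swhid_value", LargeBinary, nullable=False),
)
```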
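And a possible shape for the batch lookup API, again as a non-authoritative sketch (the method names and return types are suggestions, not an existing swh.storage endpoint):

```python
from typing import Iterable, List, Tuple


class VcsIdMapping:
    def vcs_id_add(
        self, vcs: str, entries: Iterable[Tuple[bytes, str, bytes]]
    ) -> None:
        """Record (vcs_id, swhid_type, swhid_value) mappings for the given VCS."""
        ...

    def vcs_id_get(
        self, vcs: str, vcs_ids: Iterable[bytes]
    ) -> List[Tuple[bytes, bytes]]:
        """Look up a batch of original VCS ids for the given VCS.

        Returns (vcs_id, swhid_value) pairs for the ids that are known;
        unknown ids are simply absent from the result.
        """
        ...
```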
Such an index would probably also be a useful user-facing feature in the web search / web API, but it might make sense to offer it only as part of a generic metadata search engine rather than as something specific to this mapping.