Note: This is a partial copy/summary of a discussion on the devel mailing-list.
TL;DR: we may want to generalize the idea of internally referencing any kind of object in the SWH archive, especially Origin objects, in a uniform and consistent way.
Current situation
We currently have a SWHID object type defined in the SWH data model. It defines a SWHID as an entity with the following attributes:
class SWHID:
    namespace
    scheme_version
    object_type
    object_id
    metadata
(metadata being a dict-like structure to store the qualifiers part of a SWHID).
This SWHID entity type is currently used in the data model only by the RawExtrinsicMetadata object:
class RawExtrinsicMetadata(BaseModel):
    # target object
    type = Enum
    target = Union[str, SWHID]
    """URL if type=MetadataTargetType.ORIGIN, else core SWHID"""
    [...]
    # context
    origin = Optional[str]
    visit = Optional[int]
    snapshot = Optional[SWHID]
    release = Optional[SWHID]
    revision = Optional[SWHID]
    path = Optional[bytes]
    directory = Optional[SWHID]
"References" to other "core" SWH entity types are also found in the Release object (as the target_type/target pair of attributes) and, to some extent, in the Snapshot object via the target_type/target attributes of the SnapshotBranch object. The latter, however, extends this notion because of the ALIAS target type for branches.
Such references are also intrinsically present in all other relations of the Merkle DAG, but since the target type is fixed there, there is no need to store a target_type.
The problem
Storing relations to SWH objects
The lack of a SWHID for some internal objects, especially Origin, makes it necessary to use the kind of workaround used for RawExtrinsicMetadata.target (aka "URL if type=MetadataTargetType.ORIGIN, else core SWHID"), which is not very satisfying for several reasons:
- it comes with many if/else snippets in the code as soon as one needs to deal with a RawExtrinsicMetadata object,
- it requires storing SWHIDs as strings, which is not very efficient (the hex representation of the hash takes twice the space, and filtering on the target's type is impractical),
- it forces us to have a column that discriminates Origin targets but is meaningless for all other target types,
- (opinionated argument) it's overall quite inelegant.
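To make the first two points concrete, here is a minimal sketch of the branching and of the storage overhead. TargetType and describe_target are hypothetical names used for illustration only (they are not part of swh.model), and the all-zero hash is a placeholder.

```python
from enum import Enum


class TargetType(Enum):
    # Hypothetical stand-in for MetadataTargetType.
    ORIGIN = "origin"
    CONTENT = "content"
    DIRECTORY = "directory"


def describe_target(type_: TargetType, target: str) -> str:
    """Every consumer of the (type, target) pair has to branch like this."""
    if type_ is TargetType.ORIGIN:
        # For origins, target is a plain URL.
        return f"origin at {target}"
    else:
        # Otherwise, target is a core SWHID string: swh:1:<type>:<hex id>.
        _, _, _, hex_id = target.split(":")
        return f"{type_.value} with id {hex_id}"


# Storage overhead: a SHA1 is 20 bytes, its hex form is 40 characters,
# and the full SWHID string adds the "swh:1:dir:" prefix on top.
raw_id = bytes(20)  # placeholder hash
hex_id = raw_id.hex()
swhid_str = f"swh:1:dir:{hex_id}"

print(len(raw_id), len(hex_id), len(swhid_str))  # prints "20 40 50"
```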
This situation currently concerns only the RawExtrinsicMetadata object model, but it might show up in more cases in the future; see for example the support for ExtID under development as D4807, where it has been suggested to use a SWHID instead of a (target_type, target) pair.
Also note that the whole "context" part of RawExtrinsicMetadata (the origin, visit, snapshot, release, revision, path and directory attributes) can be seen as an extended way of storing the target SWHID.
Storing SWHID
As already stated above, a related topic is how we store references to other SWH objects, and especially SWHIDs, in the backend database. The current solution (storing the string representation for SWHIDs, and ad hoc multiple columns otherwise) is not ideal for the reasons already listed above.
Possible solutions
The idea is then to normalize (in the sense of "improve the consistency of") the SWH data model as much as possible by using SWHIDs everywhere it makes sense from a modeling point of view, and possibly to improve the way we store SWHID objects.
To do so, possible tasks would be:
- Extend current SWHID to Origin, using one of:
  - hash of the origin URL as identifier: keeps the fixed-size ids property, and is already used in some parts of swh; this would require storing origin hashes in a (computed) column of the origin table (in pg) and getting rid of the "index-on-sha1-of-column" hack.
  - the "hexlified" URL as identifier: keeps the "resolvable origins" property. This is probably unacceptable since it would break existing swh:1:ori SWHIDs, even if these are not to be used outside swh-graph, and breaking the fixed-size id property of the SWHID itself would require a SWHIDv2 spec.
- or reify the notion of a relation to a SWH object in the archive using a new dedicated object type (e.g. SWHRef) that mainly consists of a triplet (version, type, id); for origins, the same choice has to be made for the computation of the id. But since there is no backward-compatibility issue there, it is perfectly acceptable to use the URL itself as the identifier part of a SWHRef object.
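The two candidate origin identifiers can be compared with a short sketch. It assumes the hash is a SHA1 over the UTF-8 encoded URL (the text above only says "hash of the origin URL"); the function names and the example URL are made up for illustration.

```python
import hashlib


def origin_id_from_hash(url: str) -> bytes:
    """Fixed-size identifier: SHA1 of the URL (assumed hash function)."""
    return hashlib.sha1(url.encode("utf-8")).digest()


def origin_id_from_url(url: str) -> bytes:
    """Variable-size identifier: the URL bytes themselves, to be hexlified."""
    return url.encode("utf-8")


url = "https://example.org/user/project.git"  # placeholder origin URL
hashed = origin_id_from_hash(url)
hexlified = origin_id_from_url(url)

# The hashed form always yields 40 hex digits, like the other core SWHIDs.
print(f"swh:1:ori:{hashed.hex()}")
# The hexlified form grows with the URL, which breaks the fixed-size property...
print(f"swh:1:ori:{hexlified.hex()}")
# ...but it can be decoded back to the URL without any lookup table.
assert bytes.fromhex(hexlified.hex()).decode("utf-8") == url
```

The hashed variant needs a table (or a computed column) to get back from the id to the URL, while the hexlified variant is self-describing; this is the trade-off between the "fixed-size ids" and "resolvable origins" properties.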
As @olasd wrote:
[using a new SWHRef object] we can have:
- origins stored as (0, <enum value for origin>, <origin_url encoded as utf-8>)
- core SWHIDs v1 stored as (1, <enum value for swhid type>, <byte array for swhid "id">)
This also allows us to store an edge to a "SWHID v1 for the hash of an origin", without any ambiguity. And this allows us to decode the origin URLs without going through another table.
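A minimal sketch of the triplet described in this quote could look like the following. The class layout, the RefType enum values and the helper names are assumptions made for illustration; only the (version, type, id) shape and the two encodings come from the quote.

```python
from dataclasses import dataclass
from enum import IntEnum


class RefType(IntEnum):
    # Hypothetical enum values; the actual numbering is not specified here.
    ORIGIN = 0
    CONTENT = 1
    DIRECTORY = 2
    REVISION = 3
    RELEASE = 4
    SNAPSHOT = 5


@dataclass(frozen=True)
class SWHRef:
    version: int
    type: RefType
    id: bytes

    @classmethod
    def for_origin_url(cls, url: str) -> "SWHRef":
        # Version 0: the id is the origin URL encoded as UTF-8.
        return cls(0, RefType.ORIGIN, url.encode("utf-8"))

    @classmethod
    def for_core_swhid(cls, type_: RefType, object_id: bytes) -> "SWHRef":
        # Version 1: the id is the raw 20-byte hash of a core SWHID v1.
        if len(object_id) != 20:
            raise ValueError("a SWHID v1 object id must be 20 bytes long")
        return cls(1, type_, object_id)

    def origin_url(self) -> str:
        # Only version-0 origin references carry a decodable URL.
        if self.version != 0 or self.type != RefType.ORIGIN:
            raise ValueError("not a URL-based origin reference")
        return self.id.decode("utf-8")


by_url = SWHRef.for_origin_url("https://example.org/repo.git")
by_hash = SWHRef.for_core_swhid(RefType.ORIGIN, bytes(20))  # placeholder hash

print(by_url.origin_url())
print(by_url != by_hash)  # the version field keeps the two forms distinct
```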
Below, SWHRef refers to whichever solution is chosen above (either a new SWHRef object or an "extended" SWHID one).
Then:
- Refactor the RawExtrinsicMetadata to use only a SWHRef target attribute (get rid of the type)
- Refactor the RawExtrinsicMetadata to use a SWHRef origin attribute (for consistency)
- Use a composite custom type to store SWHRef in PostgreSQL, and the equivalent for Cassandra
- Refactor the Release object to use a SWHRef as target
- Refactor the SnapshotBranch to use a SWHRef as target; this would require either storing the alias in a dedicated attribute of the SnapshotBranch object, or introducing a dedicated SnapshotAliasBranch (or similar) object for aliases, depending on the choice made for the SWHRef model object.
- Refactor all the Merkle-DAG-related objects to use SWHRef as "references".
Note: Obviously, the last 3 points above are rather radical and would imply, if applied as is, a major migration of the database; they are listed here mostly for the sake of completeness.
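For the SnapshotBranch item in the list above, the two variants could be sketched as follows. All class shapes here are hypothetical (including the reduced SWHRef stand-in and the SnapshotTargetBranch name), and the sketch ignores dangling branches.

```python
from dataclasses import dataclass
from typing import Dict, NamedTuple, Optional, Union


class SWHRef(NamedTuple):
    # Reduced stand-in for the (version, type, id) triplet.
    version: int
    type: str
    id: bytes


# Variant 1: a single class with the alias in a dedicated attribute.
@dataclass(frozen=True)
class SnapshotBranch:
    target: Optional[SWHRef] = None
    alias: Optional[bytes] = None

    def __post_init__(self) -> None:
        # Exactly one of the two attributes must be set.
        if (self.target is None) == (self.alias is None):
            raise ValueError("exactly one of target or alias must be set")


# Variant 2: a dedicated class for aliases.
@dataclass(frozen=True)
class SnapshotTargetBranch:
    target: SWHRef


@dataclass(frozen=True)
class SnapshotAliasBranch:
    alias: bytes  # name of another branch in the same snapshot


Branch = Union[SnapshotTargetBranch, SnapshotAliasBranch]


def resolve(branches: Dict[bytes, Branch], name: bytes) -> SWHRef:
    """Follow aliases (variant 2) until a branch with a real target is found."""
    seen = set()
    while True:
        if name in seen:
            raise ValueError("alias loop detected")
        seen.add(name)
        branch = branches[name]
        if isinstance(branch, SnapshotTargetBranch):
            return branch.target
        name = branch.alias


main = SnapshotTargetBranch(SWHRef(1, "revision", bytes(20)))  # placeholder id
branches = {
    b"refs/heads/main": main,
    b"HEAD": SnapshotAliasBranch(b"refs/heads/main"),
}
print(resolve(branches, b"HEAD") == main.target)  # prints "True"
```

In both variants the target attribute only ever holds a SWHRef, so the ALIAS case no longer needs its own target type.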