//Note: This is a partial copy/summary of a [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00012.html | discussion on the devel mailing-list ]].//

TL;DR: we may want to generalize the idea of internally referencing any kind of object in the SWH archive in a uniform and consistent way, especially `Origin` objects.

= Current situation =

We currently have a `SWHID` object type defined in the SWH data model. It defines a SWHID as an entity with the following attributes:

```
class SWHID:
    namespace
    scheme_version
    object_type
    object_id
    metadata
```

(`metadata` being a dict-like structure that stores the qualifiers part of a SWHID.)

This `SWHID` entity type is currently used in the data model only by the `RawExtrinsicMetadata` object:

```
class RawExtrinsicMetadata(BaseModel):
    # target object
    type = Enum
    target = Union[str, SWHID]
    """URL if type=MetadataTargetType.ORIGIN, else core SWHID"""
    [...]
    # context
    origin = Optional[str]
    visit = Optional[int]
    snapshot = Optional[SWHID]
    release = Optional[SWHID]
    revision = Optional[SWHID]
    path = Optional[bytes]
    directory = Optional[SWHID]
```

"Edge"-type relations to other "core" SWH entity types are also found in the `Release` object (as the `target_type`/`target` pair of attributes) and, to some extent, in the `Snapshot` object via the `target_type`/`target` attributes of the `SnapshotBranch` object. The latter, however, extends this notion because of the `ALIAS` target type for branches.

Such relations are also intrinsically present in all other edges of the Merkle DAG, but since the target type is fixed there, there is no need to store a `target_type`.
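To illustrate the dual nature of `RawExtrinsicMetadata.target` described above, here is a minimal sketch of what code consuming it has to do. The `SWHID` dataclass is a simplified stand-in (qualifiers omitted), the enum lists only two members, and `target_to_db_value` is a hypothetical helper, not an actual swh function.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class MetadataTargetType(Enum):
    # Only the members needed for this illustration.
    ORIGIN = "origin"
    CONTENT = "content"


@dataclass(frozen=True)
class SWHID:
    # Simplified stand-in for the SWHID class above (qualifiers omitted).
    object_type: str
    object_id: str  # hex digest
    namespace: str = "swh"
    scheme_version: int = 1

    def __str__(self) -> str:
        return (
            f"{self.namespace}:{self.scheme_version}:"
            f"{self.object_type}:{self.object_id}"
        )


def target_to_db_value(
    type_: MetadataTargetType, target: Union[str, SWHID]
) -> str:
    """Return the string stored for a metadata target (hypothetical helper)."""
    if type_ is MetadataTargetType.ORIGIN:
        # Origins have no SWHID in this model: the target is the URL itself.
        if not isinstance(target, str):
            raise TypeError("origin targets must be URLs")
        return target
    # Every other target type is a core SWHID, serialized as text.
    if not isinstance(target, SWHID):
        raise TypeError("non-origin targets must be SWHIDs")
    return str(target)
```

Every consumer of `target` needs a branch of this kind, and the `type` value only carries information in the origin case.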
= The problem =

== Storing relations to SWH objects ==

The lack of a SWHID on some internal objects, especially on `Origin`, makes it necessary to use the kind of workaround used for `RawExtrinsicMetadata.target` (i.e. "URL if type=MetadataTargetType.ORIGIN, else core SWHID"), which is unsatisfying for several reasons:

- it leads to many `if/else` snippets in the code as soon as one needs to deal with a `RawExtrinsicMetadata` object,
- it requires storing SWHIDs as strings, which is inefficient (the hash takes twice the space when represented as hex, and filtering on the target's type is impractical),
- it forces us to have a column that discriminates `Origin` targets but is meaningless for all other target types,
- (opinionated argument) it's overall quite inelegant.

This situation currently concerns only the `RawExtrinsicMetadata` object model, but it might arise in more cases in the future (see for example the support for ExtID under development as D4807, where it has been suggested to use a SWHID instead of a (`target_type`, `target`) pair).

Also note that the whole "context" part of `RawExtrinsicMetadata` (the `origin`, `visit`, `snapshot`, `release`, `revision`, `path` and `directory` attributes) can be seen as an extended way of storing the `target` SWHID.

== Storing SWHID ==

As already stated above, a related topic is how we store references to other SWH objects, and especially SWHIDs, in the backend database. The current solution (store the string representation for SWHIDs, ad hoc multiple columns otherwise) is not ideal for the reasons already listed above.

= Possible solutions =

The idea is then to normalize (as in "improve the consistency of") the SWH data model as much as possible by using SWHIDs everywhere it makes sense in terms of modeling, and possibly to improve the way we store SWHID objects.
To do so, possible tasks would be:

- [] Extend the current SWHID to `Origin`, using one of:
  - the hash of the origin URL as identifier: this keeps the fixed-size ids property and is already used in some parts of swh; it would require storing origin hashes in a (computed) column of the origin table (in pg) and getting rid of the "index-on-sha1-of-column" hack.
  - the "hexlified" URL as identifier: this keeps the "resolvable origins" property. This is probably unacceptable since it would break existing swh:1:ori SWHIDs, even if these are not to be used outside swh-graph, and breaking the fixed-size id of the SWHID itself would require a SWHIDv2 spec.
- [] or reify the notion of a relation to a SWH object in the archive using a new dedicated object type (e.g. `SWHRef`) that consists mainly of a triplet (version, type, id); for origins, the same choice has to be made for the computation of the id. But since there is no backward compatibility issue there, it is perfectly acceptable to use the URL itself as the identifier part of a `SWHRef` object.

As @olasd wrote:

> [using a new SWHRef object] we can have:
> - origins stored as (0, <enum value for origin>, <origin_url encoded as utf-8>)
> - core SWHIDs v1 stored as (1, <enum value for swhid type>, <byte array for swhid "id">)
>
> This also allows us to store an edge to a "SWHID v1 for the hash of an origin", without any ambiguity. And this allows us to decode the origin urls without going through another table.

Below, `SWHRef` refers to the solution chosen above (either a new `SWHRef` object or an "extended" SWHID one).
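The (version, type, id) triplet quoted above could look roughly like the following. This is only a sketch under assumptions: the `RefType` enum and its integer values, the constructor names, and the use of SHA-1 for the origin hash are all illustrative choices, not part of the existing swh model.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


class RefType(Enum):
    # Hypothetical values; the real ones would be fixed with the schema.
    ORIGIN = 0
    SNAPSHOT = 1
    RELEASE = 2
    REVISION = 3
    DIRECTORY = 4
    CONTENT = 5


@dataclass(frozen=True)
class SWHRef:
    version: int  # 0: URL-based reference, 1: core SWHID v1
    type: RefType
    id: bytes  # UTF-8 URL for version 0, raw hash bytes for version 1

    @classmethod
    def from_origin_url(cls, url: str) -> "SWHRef":
        # (0, <origin>, <origin_url encoded as utf-8>)
        return cls(0, RefType.ORIGIN, url.encode("utf-8"))

    @classmethod
    def from_core_swhid(cls, type_: RefType, hex_id: str) -> "SWHRef":
        # (1, <swhid type>, <byte array for swhid "id">)
        return cls(1, type_, bytes.fromhex(hex_id))

    @classmethod
    def from_origin_hash(cls, url: str) -> "SWHRef":
        # "SWHID v1 for the hash of an origin"; SHA-1 is assumed here.
        digest = hashlib.sha1(url.encode("utf-8")).digest()
        return cls(1, RefType.ORIGIN, digest)

    def origin_url(self) -> str:
        """Decode the URL without going through another table."""
        if (self.version, self.type) != (0, RefType.ORIGIN):
            raise ValueError("not a URL-based origin reference")
        return self.id.decode("utf-8")
```

Because the version is part of the triplet, a URL-based origin reference and a hash-based one for the same origin remain distinct values, which is the "without any ambiguity" property mentioned in the quote.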
Then:

- [] Refactor `RawExtrinsicMetadata` to use only a `SWHRef` `target` attribute (getting rid of `type`)
- [] Refactor `RawExtrinsicMetadata` to use a `SWHRef` `origin` attribute (for consistency)
- [] Use a composite custom type to store `SWHRef` in PostgreSQL, and the equivalent for Cassandra
- [] Refactor the `Release` object to use a `SWHRef` as `target`
- [] Refactor `SnapshotBranch` to use a `SWHRef` as `target`; this would require either storing the alias in a dedicated attribute of the `SnapshotBranch` object or introducing a dedicated `SnapshotAliasBranch` (or similar) class for aliases, depending on the choice made for the `SWHRef` model object.
- [] Refactor all the Merkle-DAG-related objects to use `SWHRef` as "edge relation support".

//Note: Obviously the last 3 points above are rather radical and imply, if applied as is, a major migration of the database; they are listed here mostly for the sake of completeness.//
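For the `SnapshotBranch` item in the list above, the "dedicated attribute" variant might be sketched as follows. Everything here is hypothetical: `Ref` is a plain-tuple stand-in for the (version, type, id) triplet, and the attribute names and validation rule are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Stand-in for a SWHRef triplet: (version, type, id). Hypothetical.
Ref = Tuple[int, str, bytes]


@dataclass(frozen=True)
class SnapshotBranch:
    """A branch pointing either to an object or to another branch."""

    target: Optional[Ref] = None  # reference to an archived object
    alias: Optional[bytes] = None  # name of the branch this one aliases

    def __post_init__(self) -> None:
        # Exactly one of the two attributes must be set.
        if (self.target is None) == (self.alias is None):
            raise ValueError("set exactly one of target or alias")

    @property
    def is_alias(self) -> bool:
        return self.alias is not None
```

The alternative mentioned in the same item, a separate `SnapshotAliasBranch` class, would move the alias out of this class entirely and make the check above unnecessary, at the cost of a second branch type in `Snapshot`.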