Software Heritage

generalize usage of SWHID for referencing SWH archive objects
Open, High priority, Public

Description

Note: This is a partial copy/summary of a discussion on the devel mailing-list.

TL;DR: we may want to generalize the idea of internally referencing any kind of object in the SWH archive in a uniform and consistent way, especially Origin objects.

Current situation

We currently have a SWHID object type defined in the SWH data model. It allows defining a SWHID as an entity with the following attributes:

class SWHID:
    namespace: str
    scheme_version: int
    object_type: str
    object_id: str
    metadata: dict

(metadata being a dict-like structure to store the qualifiers part of a SWHID).
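To make the shape of this entity concrete, here is a minimal, hypothetical sketch of such a SWHID class with its canonical string rendering; the real swh.model class uses attrs and much stricter validation, and the qualifier rendering here is simplified:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class SWHID:
    # Illustrative stand-in for the swh.model SWHID entity described above.
    object_type: str                 # "cnt", "dir", "rev", "rel", "snp"
    object_id: str                   # 40-char hex sha1_git
    namespace: str = "swh"
    scheme_version: int = 1
    metadata: Dict[str, str] = field(default_factory=dict)  # qualifiers

    def __str__(self) -> str:
        # Core part: swh:1:<type>:<id>, then ";key=value" qualifiers
        s = f"{self.namespace}:{self.scheme_version}:{self.object_type}:{self.object_id}"
        if self.metadata:
            s += "".join(f";{k}={v}" for k, v in sorted(self.metadata.items()))
        return s
```

For example, `str(SWHID("rev", "94a9ed02..."))` would yield a `swh:1:rev:94a9ed02...` identifier string.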

This SWHID entity type is currently used in the data model only by the RawExtrinsicMetadata object:

class RawExtrinsicMetadata(BaseModel):
    # target object
    type: MetadataTargetType
    target: Union[str, SWHID]
    """URL if type=MetadataTargetType.ORIGIN, else core SWHID"""

    [...]
    # context
    origin: Optional[str]
    visit: Optional[int]
    snapshot: Optional[SWHID]
    release: Optional[SWHID]
    revision: Optional[SWHID]
    path: Optional[bytes]
    directory: Optional[SWHID]

"references" to other "core" SWH entity type are also found in the Release object (under the target_type/target couple of attributes), and somehow in the Snapshot object via the target_type/target attributes of the SnapshotBranch object. This later however extends this notion because of the presence of the ALIAS target type for branches.

It is also intrinsically present in all other relations of the Merkle DAG, but since the target type is fixed, there is no need to store a target_type.

The problem

Storing relations to SWH objects

The lack of a SWHID for some internal objects, especially Origin, makes it necessary to use the kind of workaround found in RawExtrinsicMetadata.target (aka "URL if type=MetadataTargetType.ORIGIN, else core SWHID"), which is not very satisfying for several reasons:

  • it leads to many if/else snippets in the code as soon as one needs to deal with a RawExtrinsicMetadata object,
  • it requires storing SWHIDs as strings, which is not very efficient (double the space, due to the hash being represented as hex, and filtering on the target's type is impractical),
  • it forces having a column to discriminate Origin targets, which is meaningless for all other target types,
  • (opinionated argument) it's overall quite inelegant.
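The first point above can be illustrated with a small sketch of the kind of type dispatch the Union forces on every consumer; the function name and URL scheme here are illustrative stand-ins, not actual swh code:

```python
def target_url(target_type: str, target: str) -> str:
    """Resolve a RawExtrinsicMetadata-style target to a browsable URL.

    Illustrative only: every consumer of the current model has to branch
    on the target type before it can interpret the target value.
    """
    if target_type == "origin":
        # target is already a plain URL string
        return target
    # otherwise target is (the string form of) a core SWHID
    return f"https://archive.softwareheritage.org/{target}"
```

Every such call site repeats this origin-vs-SWHID branching, which is exactly the duplication a uniform reference type would remove.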

This situation currently only concerns the RawExtrinsicMetadata object model, but it might occur in more cases in the future (see for example the support for ExtID under development in D4807, where it has been suggested to use a SWHID instead of a (target_type, target) couple).

Also note that the whole "context" part of the RawExtrinsicMetadata (the origin, visit, snapshot, release, revision, path and directory attributes) can be seen as an extended way of storing the target SWHID.

Storing SWHID

As already stated above, a related topic is how we store references to other SWH objects, and especially SWHIDs, in the backend database. The current solution (storing the SWHID string representation for SWHIDs, and ad hoc multiple columns otherwise) is not ideal for the reasons already listed above.

Possible solutions

The idea is then to normalize (as in "improve the consistency of") the SWH data model as much as possible by using SWHIDs everywhere it makes sense in terms of modeling, and possibly improve the way we store SWHID objects.

To do so, possible tasks would be:

  • Extend the current SWHID to Origin, using one of:
    • the hash of the origin URL as identifier: keeps the fixed-size id property, and is already used in some parts of swh; this would require storing origin hashes in a (computed) column of the origin table (in pg) and getting rid of the "index-on-sha1-of-column hack."
    • the "hexlified" URL as identifier: keeps the "resolvable origins" property. This is probably unacceptable since it would break existing swh:1:ori SWHIDs, even if these are not used outside swh-graph, and the loss of the fixed-size id property of the SWHID itself would require a SWHIDv2 spec.
  • or reify the notion of a reference to a SWH object in the archive using a new dedicated object type (e.g. SWHRef) that consists mainly of a (version, type, id) triplet; for origins, the same choice has to be made for the computation of the id. But since there is no backward compatibility issue there, it is perfectly acceptable to use the URL itself as the identifier part of a SWHRef object.

As @olasd wrote:

[using a new SWHRef object] we can have:

  • origins stored as (0, <enum value for origin>, <origin_url encoded as utf-8>)
  • core SWHIDs v1 stored as (1, <enum value for swhid type>, <byte array for swhid "id">)

This also allows us to store an edge to a "SWHID v1 for the hash of an origin", without any ambiguity. And this allows us to decode the origin urls
without going through another table.
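The triplet encoding quoted above could be sketched as follows; the SWHRef name, enum values and helper functions are illustrative assumptions, not the actual swh.model API:

```python
from dataclasses import dataclass
from enum import IntEnum


class RefType(IntEnum):
    # Illustrative enum values; the real ones would be chosen during design.
    ORIGIN = 0
    CONTENT = 1
    DIRECTORY = 2
    REVISION = 3
    RELEASE = 4
    SNAPSHOT = 5


@dataclass(frozen=True)
class SWHRef:
    version: int   # 0 for raw origin URLs, 1 for core SWHIDs v1
    type: RefType
    id: bytes      # UTF-8-encoded origin URL, or 20-byte sha1_git


def origin_ref(url: str) -> SWHRef:
    # (0, <enum value for origin>, <origin_url encoded as utf-8>)
    return SWHRef(version=0, type=RefType.ORIGIN, id=url.encode("utf-8"))


def swhid_ref(object_type: RefType, sha1_git: bytes) -> SWHRef:
    # (1, <enum value for swhid type>, <byte array for swhid "id">)
    assert len(sha1_git) == 20
    return SWHRef(version=1, type=object_type, id=sha1_git)
```

With this shape, an origin URL is directly recoverable from the reference itself (no extra table lookup), while hashed-origin SWHIDs can still be stored unambiguously under version 1.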

Below, SWHRef refers to whichever solution is chosen above (either a new SWHRef object or an "extended" SWHID one).
Then:

  • Refactor RawExtrinsicMetadata to use only a SWHRef target attribute (getting rid of the type attribute)
  • Refactor RawExtrinsicMetadata to use a SWHRef origin attribute (for consistency)
  • Use a custom composite type to store SWHRef in PostgreSQL, and the equivalent for Cassandra
  • Refactor the Release object to use a SWHRef as target
  • Refactor SnapshotBranch to use a SWHRef as target; this would require having aliases in a dedicated attribute of the SnapshotBranch object, or introducing a dedicated SnapshotAliasBranch (or similar) for aliases, depending on the choice made for the SWHRef model object.
  • Refactor all the Merkle-DAG-related objects to use SWHRef as "references".

Note: obviously the last 3 points above are rather radical and, if applied as is, imply a major migration of the database; they are listed here mostly for the sake of completeness.


Event Timeline

douardda triaged this task as High priority. Tue, Feb 9, 3:36 PM
douardda created this task.
zack renamed this task from "Generalise usage of SWHID for storing edges (relations) of the SWH archive graph" to "generalize usage of SWHID for referencing SWH archive objects". Tue, Feb 9, 4:34 PM

So, there are still a few separate issues in this task, which I'll try to spell out (at least for my own sake):

extending the current SWHID v1 spec for origins

In that regard, the cat is out of the bag already, and even if we try not to leak these to the public, in practice swh.graph and its interface with swh.web already enforce a definition for the "v1 SWHID of an Origin" (using a fixed-size 20-byte sha1 of the origin URL/IRI encoded as UTF-8), so we should document them and make them official. Any change to this definition is, AFAICT from @zack's objections, a non-starter, as swh.graph absolutely needs a fixed-size identifier for all the nodes it has to process.

  • blessing these as SWHIDs opens up the somewhat tangential question of defining a normalization process for the origin URL/IRIs;
  • by that point we should probably also bless the RawExtrinsicMetadata intrinsic id as a "v1 SWHID" too, even if they're meant to be internal to SWH.
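The de facto origin-SWHID definition described above (sha1 of the UTF-8-encoded origin URL) is short enough to sketch directly; the helper name is ours, swh.model would expose its own equivalent:

```python
import hashlib


def origin_swhid(url: str) -> str:
    """Compute the de facto "v1 SWHID of an Origin" for a given URL.

    Sketch of the definition enforced by swh.graph / swh.web: a fixed-size
    20-byte sha1 of the origin URL/IRI encoded as UTF-8, rendered as hex.
    Any URL normalization step (see the tangential question above) would
    have to happen before hashing.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"swh:1:ori:{digest}"
```

Note that without an agreed normalization process, two spellings of the same origin (trailing slash, scheme case, etc.) hash to different identifiers, which is exactly why blessing these SWHIDs raises the normalization question.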
Within the Python data model of a node of the graph, consistent typing of references to other nodes

The status quo of having

  • Union[SWHID, OriginUrl] attributes on some objects, where we actually disable some of the features of the union members (SWHID qualifiers)
  • (target_type: Enum, target: bytes) attribute pairs on some objects

is unfortunate.

We have a few proposals to move forward:

  • introducing a new explicit SwhRef type
    • overlapping with the current SWHID type, without qualifiers, for objects with a SWHID
    • supporting explicit origin url references (rather than using a hashed swh:1:ori: SWHID)
  • or using
    • the (normalized) Origin URL, inlined, if that's the only possible type of node referenced (e.g. in origin visits / origin visit statuses)
    • core SWHIDs directly everywhere else, and fully blessing references using hashed swh:1:ori: SWHIDs

This second option is growing on me, because of its uniformity, and because we avoid introducing a new, very similar type. We may want to introduce a CoreSWHID type which disables qualifiers, but it would be a plain subset of SWHID, which doesn't have the funky smell of the current Union + attribute validator combo.
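A CoreSWHID along these lines could be sketched as follows; this is a hypothetical illustration of "the current SWHID, sans qualifiers", not the eventual swh.model implementation:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreSWHID:
    # Same core fields as SWHID, but no qualifier/metadata attribute at
    # all: invalid states become unrepresentable, instead of being
    # rejected by an attribute validator on a Union type.
    object_type: str   # "cnt", "dir", "rev", "rel", "snp" (or "ori")
    object_id: bytes   # 20-byte sha1_git, stored compactly as bytes
    namespace: str = "swh"
    scheme_version: int = 1

    def __str__(self) -> str:
        return (f"{self.namespace}:{self.scheme_version}:"
                f"{self.object_type}:{self.object_id.hex()}")
```

Keeping object_id as bytes (rather than a hex string) also lines up with the compactness goal for the storage backends discussed below.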

Within the storage backends (SQL / Cassandra), consistent storage of references to other nodes of the graph, origins included

For reference, the status quo is:

  • some tables have a (target bytes, target_type enum) pair of columns, equivalent to the binary storage for a core SWHID v1
  • some tables have a (target str, target_type enum) pair of columns, where the target is either the hexadecimal string representation of a SWHID v1, or an origin url
  • some tables (directory entries) are split by target type and only use a (target bytes) column to reference the other node

We would like any future storage of these columns to be:

  • consistent (i.e. use the same column type/set of columns to store references in all tables)
  • compact (i.e. use bytes for hashes)
  • future-proof (i.e. not having to redesign the database for the migration to SWHIDv2)

It would also make sense to gradually migrate the current storage to the new schema.

If we decide to only use core SWHIDs in the model layer, we can store them in a consistent composite (version short, type enum, id bytes) column.

If we decide to introduce a new composite type in the model layer, we can probably use the same composite (version short, type enum, id bytes) column type, fudging the version/type items to unambiguously store references to origins using their full IRI, encoded as bytes.
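To make the composite column concrete, here is a byte-level sketch of the (version, type, id) encoding, with the version/type "fudging" for origins; the actual backends would use a native composite type (PostgreSQL) or a tuple/UDT (Cassandra) rather than manual packing, so this is purely illustrative:

```python
import struct
from typing import Tuple

# Illustrative wire layout: 2-byte version, 2-byte type enum (big-endian),
# then the raw id bytes (20-byte sha1_git, or a UTF-8 origin IRI when
# version/type are fudged for origins).


def pack_ref(version: int, type_: int, id_: bytes) -> bytes:
    return struct.pack(">HH", version, type_) + id_


def unpack_ref(blob: bytes) -> Tuple[int, int, bytes]:
    version, type_ = struct.unpack(">HH", blob[:4])
    return version, type_, blob[4:]
```

Because the version and type prefix is fixed-size, references remain cheap to filter on, while the variable-size tail accommodates both 20-byte hashes and full origin IRIs.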

Bonus track : serialization of references to other nodes in the Journal

I guess that's what @douardda has in mind when he says that resolving the current discussion is necessary for the archival of SWH to Vitam.

As far as I can tell, this issue is more intricately linked to the decisions we're taking in the Python data model, than to the storage backend considerations.

Once we have the updated attributes in swh.model, we will want to rewrite the contents of the journal to use the new schema, but we will likely have to write a conversion layer to support the deserialization of the old entries stored in swh.journal.

In either case, we'll probably want to use the consistency-improved version of the swh.model objects before serializing them for long term archival in Vitam.

Personal conclusion

After rewriting all of this, I have a much better sense of how all these decisions are tangled with one another.

My preference would be:

  1. bless the hashed-url version of SWHID v1 of origins (and of the RawExtrinsicMetadata object id while we're at it)
  2. implement a CoreSWHID type in swh.model, which would be the current SWHID, sans qualifiers
  3. define the swh.journal serialization of CoreSWHID
  4. gradually migrate all swh.model (target, target type) attribute pairs to CoreSWHIDs.
    • Now:
      • RawExtrinsicMetadata is the obvious first target
      • ExtID too
    • Soon:
      • Release.(target, target_type)
      • SnapshotBranch.(target, target_type) (plausibly introducing a new type for branch aliases ?)
    • Eventually:
      • Revision.directory
      • Directory.entries
  5. Gradually migrate storage of CoreSWHID attributes to a new composite column type

We already have a storage -> swh.model mapping layer for all backends, so the migrations in swh.storage and swh.model don't have to be entangled. We can convert all existing data stored in swh.storage to generate CoreSWHID objects from the current target/target_type columns and vice versa.
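The conversion between the legacy column pair and CoreSWHID-style objects described above is mechanical; a hypothetical sketch of what the mapping layer would do (type names and helpers are simplified stand-ins for the swh.storage converters):

```python
from typing import Tuple

# Mapping between long-form target types stored in the database and the
# short codes used in SWHID strings; illustrative, mirroring the v1 spec.
_TYPE_CODES = {"content": "cnt", "directory": "dir", "revision": "rev",
               "release": "rel", "snapshot": "snp"}
_CODE_TYPES = {v: k for k, v in _TYPE_CODES.items()}


def row_to_swhid(target_type: str, target: bytes) -> str:
    """Convert legacy (target_type, target) columns to a core SWHID string."""
    return f"swh:1:{_TYPE_CODES[target_type]}:{target.hex()}"


def swhid_to_row(swhid: str) -> Tuple[str, bytes]:
    """Convert a core SWHID string back to legacy (target_type, target)."""
    _ns, _ver, code, hexid = swhid.split(":")
    return _CODE_TYPES[code], bytes.fromhex(hexid)
```

Since both directions are lossless, the swh.model migration can ship first and the column migration can follow at its own pace, as the comment above suggests.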

zack updated the task description.

(I've finally caught up with the backlog in this task, sorry I'm late to the party.)

In short:

Personal conclusion

After rewriting all of this, I have a much better sense of how all these decisions are tangled with one another.

My preference would be:

  1. bless the hashed-url version of SWHID v1 of origins (and of the RawExtrinsicMetadata object id while we're at it)
  2. implement a CoreSWHID type in swh.model, which would be the current SWHID, sans qualifiers
  3. define the swh.journal serialization of CoreSWHID
  4. gradually migrate all swh.model (target, target type) attribute pairs to CoreSWHIDs.
  5. Gradually migrate storage of CoreSWHID attributes to a new composite column type

This plan is also my preference, and the course of action also LGTM.

Two minor caveats:

  • I'm not so sure about the idea of pseudo-SWHIDs for RawExtrinsicMetadata (REM), but maybe it's because I'm missing/not understanding a few details. In particular, it is one thing to need a stable serialization format and an intrinsic ID on them, another to need a SWHID to reference them. Who needs to reference REMs in the first place?
  • for the storage migration, this seems profound enough that, once ready, it will warrant a "stop the world" → "migrate everything" → restart approach? But maybe that's what you have in mind and the gradual part only applies to the code (which of course will be gradual anyway)
In T3034#58665, @zack wrote:

Who needs to reference REMs in the first place?

I'll let @vlorentz give a better answer, but I think this is needed at least for the metadata-only deposit feature.

In T3034#58665, @zack wrote:

Who needs to reference REMs in the first place?

Option 3 in T2779#55692