
Make model objects immutable
Open, Normal, Public

Description

The problem

Currently, model objects, as defined in swh.model.model, are declared as frozen using attrs, typically:

import attr
from typing import Optional
# BaseModel and type_validator are defined alongside the models in swh.model.model

@attr.s(frozen=True)
class Person(BaseModel):
    """Represents the author/committer of a revision or release."""

    fullname = attr.ib(type=bytes, validator=type_validator())
    name = attr.ib(type=Optional[bytes], validator=type_validator())
    email = attr.ib(type=Optional[bytes], validator=type_validator())

However, some model objects have mutable attributes, which breaks the immutability expected from frozen classes.

Besides the inconsistency this situation creates, it also prevents these model objects from being hashable, and thus from being put in structures like sets or dicts.

And with the current move to "use swh.model.model model objects everywhere", it would be nice to have consistent core model scaffolding.

Affected model classes are:

  • Revision because of the parents attribute (List) and the metadata attribute (Dict),
  • Directory because of the entries attribute (List),
  • OriginVisit because of the metadata attribute (Dict),
  • OriginVisitStatus because of the metadata attribute (Dict),
  • Snapshot because of the branches attribute (Dict),
  • Release because of the metadata attribute (Dict).
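
The problem can be demonstrated with a minimal sketch (using a stdlib frozen dataclass as a stand-in for an attrs frozen class, so the snippet runs without the attrs dependency; the behaviour is the same): the attribute binding is frozen, but a mutable value can still be changed in place, and the mutable field makes the object unhashable.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Revision:  # illustrative stand-in, not the real swh.model.model.Revision
    id: bytes
    parents: List[bytes] = field(default_factory=list)

rev = Revision(id=b"abc", parents=[b"p1"])

# Rebinding the attribute is correctly refused...
try:
    rev.parents = []
except AttributeError:
    pass  # FrozenInstanceError

# ...but the list's contents can still be mutated, breaking immutability:
rev.parents.append(b"p2")
assert rev.parents == [b"p1", b"p2"]

# And the mutable field makes the "frozen" object unhashable:
try:
    hash(rev)
    hashable = True
except TypeError:
    hashable = False
assert hashable is False
```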

Proposed fix

  1. Replace List attribute types with Tuple ones (namely Revision.parents and Directory.entries). An implementation of this idea is proposed in D3177.
  2. Replace Snapshot.branches with a Tuple. It is currently a dict whose keys are branch names ({branch_name: branch}), so this would require adding a new name attribute to the SnapshotBranch class and storing the branches in a Snapshot as a tuple.
  3. Replace metadata attributes with either a json-encoded string or a Tuple[Tuple[str, str]] (or similar, as long as it remains a hashable structure).
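
A minimal sketch of what point 1 buys us (again with a stdlib frozen dataclass standing in for attrs): once all fields hold immutable values, instances become hashable and can be put in sets and used as dict keys.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Revision:  # illustrative stand-in, not the real swh.model.model.Revision
    id: bytes
    parents: Tuple[bytes, ...] = ()

r1 = Revision(id=b"abc", parents=(b"p1", b"p2"))
r2 = Revision(id=b"abc", parents=(b"p1", b"p2"))

# With only immutable fields, equal objects hash equal and deduplicate in sets:
assert r1 == r2
assert len({r1, r2}) == 1
```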

Impacts

Tuple instead of List

The first point should have very limited impact on other swh packages, as long as the .from_dict() method is nice enough to convert the list into a tuple on the fly.
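
A hedged sketch of such a .from_dict() (illustrative names, not the actual swh.model implementation): callers keep passing lists, which get converted to tuples on the fly.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Directory:  # illustrative stand-in for the real model class
    id: bytes
    entries: Tuple[bytes, ...] = ()

    @classmethod
    def from_dict(cls, d):
        d = dict(d)
        # Backward compatibility: accept a list and convert it on the fly
        d["entries"] = tuple(d.get("entries", ()))
        return cls(**d)

d = Directory.from_dict({"id": b"dir", "entries": [b"a", b"b"]})
assert isinstance(d.entries, tuple)
```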

Snapshot.branches as a Tuple

Second point (Snapshot.branches as a tuple) has possible impact on almost everything (namely on swh.deposit, swh.indexer, swh.loader.core, swh.loader.git, swh.loader.mercurial, swh.loader.svn, swh.storage, swh.web.client and swh.web).

However, when grepping accesses to the .branches attribute, we have only a few hits:

~/swh/swh-environment$ pygrep "[.]branches" swh-*/swh
swh-loader-core/swh/loader/package/loader.py:            for rev in snapshot.branches.values()
swh-loader-git/swh/loader/git/loader.py:        for branch in base_snapshot.branches.values():
swh-loader-git/swh/loader/git/loader.py:            eventful = bool(self.snapshot.branches)
swh-loader-git/swh/loader/git/from_disk.py:            eventful = bool(self.snapshot.branches)
swh-loader-mercurial/swh/loader/mercurial/loader.py:        self.branches = {}
swh-model/swh/model/tests/test_hypothesis_strategies.py:    branches = snapshot.branches
swh-model/swh/model/tests/test_hypothesis_strategies.py:    assert len(snapshot.branches) == _min_snp_size
swh-storage/swh/storage/in_memory.py:            sorted_branch_names = sorted(snapshot.branches)
swh-storage/swh/storage/in_memory.py:            for branch in snapshot.branches.values()
swh-storage/swh/storage/in_memory.py:                branch = snapshot.branches[branch_name]
swh-storage/swh/storage/in_memory.py:                branch_name: snapshot.branches[branch_name]
swh-storage/swh/storage/storage.py:                            for name, info in snapshot.branches.items()
swh-storage/swh/storage/cassandra/storage.py:            for (branch_name, branch) in snapshot.branches.items():

The only piece of code really using the dict feature here is swh.storage.in_memory.

So it should be possible to make this modification without too much trouble for other swh packages, by using a backward-compatible .from_dict() method.
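
Such a backward-compatible .from_dict() for branches could look like this (a sketch with illustrative names, not the actual swh.model code): the old {branch_name: branch} mapping is accepted and turned into a sorted tuple of (name, branch) pairs.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Snapshot:  # illustrative stand-in for the real model class
    id: bytes
    branches: Tuple[Tuple[bytes, bytes], ...] = ()

    @classmethod
    def from_dict(cls, d):
        branches = d.get("branches", ())
        if isinstance(branches, dict):
            # Backward compatibility: dictify-ed input becomes a sorted tuple of pairs
            branches = tuple(sorted(branches.items()))
        return cls(id=d["id"], branches=branches)

snp = Snapshot.from_dict({"id": b"snp", "branches": {b"refs/heads/master": b"rev1"}})
assert snp.branches == ((b"refs/heads/master", b"rev1"),)
```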

<cls>.metadata as a str (json encoded MD structure)

TODO

Event Timeline

douardda triaged this task as Normal priority.May 26 2020, 4:52 PM
douardda created this task.
olasd added a subscriber: olasd.May 26 2020, 7:05 PM

I think there's a conflation of two issues:

  1. our model objects are not hashable (which prevents us from using them as members of a set or as keys in a dict);
  2. some of our model objects are mutable when they're marked as frozen (breaking an attrs "contract").

In most cases, fixing the "unhashable" issue would be as easy as making __hash__ return the computed identifier of the object; variations in ancillary fields that aren't part of the object identifier should not be a problem. Where are we trying to use these objects as dict keys / set items?
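
The "__hash__ on the computed identifier" idea can be sketched as follows (an illustrative class, not the actual HashableObject implementation; whether equality should also ignore ancillary fields is a design choice, assumed here for the sake of the demo):

```python
class IdentifiedObject:  # hypothetical name, standing in for a model class
    def __init__(self, id: bytes, metadata=None):
        self.id = id          # intrinsic, part of the object identifier
        self.metadata = metadata  # ancillary, not part of the identifier

    def __eq__(self, other):
        # Assumption for this sketch: identity is fully captured by the id
        return isinstance(other, IdentifiedObject) and self.id == other.id

    def __hash__(self):
        # Hash only the computed identifier: variations in ancillary
        # fields don't affect set/dict membership
        return hash(self.id)

a = IdentifiedObject(b"abc", metadata={"k": "v"})
b = IdentifiedObject(b"abc", metadata=None)
assert len({a, b}) == 1
```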

The mutability of model objects feels to me like only a theoretical/conceptual problem: this mutability could cause some bugs in theory, but I haven't seen any in practice. Did I miss something?

Of course, it would make sense to clean these conceptual issues up before we backfill all our objects into a new kafka cluster...

As for your proposed changes:

Conceptually, both Snapshot.branches and Directory.entries are mappings, not arrays of objects: snapshots and directories must not contain entries with repeated names (which is something that's currently enforced for snapshots, but not for directories).

In that sense, I think it would be more semantically correct to introduce a frozen mapping type, and to make sure that Directory.entries and Snapshot.branches use this frozen mapping type. It'd also be way more convenient to have "dict-like" access to directory entries rather than the current list-based lookups.
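
The frozen mapping type suggested here could be as small as the following sketch (names are illustrative, not an actual swh implementation): dict-like read access, no mutation API, and a hash derived from the items.

```python
from collections.abc import Mapping

class FrozenDict(Mapping):  # hypothetical name for the frozen mapping type
    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)

    def __getitem__(self, key):
        return self._d[key]

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __hash__(self):
        # Hashable as long as keys and values are hashable
        return hash(frozenset(self._d.items()))

branches = FrozenDict({b"HEAD": b"rev1"})
assert branches[b"HEAD"] == b"rev1"       # convenient dict-like access
assert hash(branches) == hash(FrozenDict({b"HEAD": b"rev1"}))
```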

I think we should do away with the metadata attributes completely. I'm not sure why we added them to OriginVisitStatus in the end, instead of removing them, after we checked that they were always empty in production data.

Only Revisions actually have metadata stored currently, and they're being phased out in favor of the extrinsic metadata storage. We only need to move extra_headers, which are taken into account in the actual object hash, to a top-level attribute of revisions. That's T2423, which is something that we should implement regardless of what we decide in this task.

Conceptually, both Snapshot.branches and Directory.entries are mappings, not arrays of objects: snapshots and directories must not contain entries with repeated names (which is something that's currently enforced for snapshots, but not for directories).

What about order preservation?

@olasd any reason not to do step 1? Whether or not we "fix" the remaining hashability issue by overriding the __hash__ method, I see no harm, and it makes the model (IMHO) cleaner.

The mutability of model objects feels to me like only a theoretical/conceptual problem

I tend to disagree; any time you leave open the possibility of something going wrong, it will happen at some point. I see no harm in doing things "right", and I believe the model should be as clean and correct as possible.

Where are we trying to use these objects as dict keys / set items?

I just wanted to compare objects in a destination storage after a replay session (in tests) by comparing equality of sets (because a replay does not keep order)... I know I can do comparison between sets of ids, but still, having frozen model objects and not being able to put them in a set/dict looks very odd to me. Having HashableObject instances that are not hashable (in python) is somewhat ironic...

Also, reimplementing the __hash__ method should be something we do if it's a sensible thing to do w.r.t. the conceptual model, not as an easy fix for the hashability problem. It seems to be the case at first sight, especially for HashableObject subclasses, but we could also look at this from the "append only" nature of the archive. In this regard, the "immutability" is not a matter of whether some attributes are used in the computation of the hash or not.

Conceptually, both Snapshot.branches and Directory.entries are mappings, not arrays of objects: snapshots and directories must not contain entries with repeated names (which is something that's currently enforced for snapshots, but not for directories).

What about order preservation?

In both cases, entries are canonically sorted when the identifier is computed. So having two directories with the same set of entries in a different order will yield the same identifier (and both directories are identical); same for snapshots.
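
The canonical-sort argument can be illustrated with a toy identifier function (the manifest format below is hypothetical; the real swh identifier computation is more involved): two directories with the same entries in a different order yield the same identifier.

```python
import hashlib

def directory_id(entries):
    # Hypothetical manifest: entries are canonically sorted by name before
    # hashing, so input order does not affect the identifier
    manifest = b"".join(name + b"\x00" + target for name, target in sorted(entries))
    return hashlib.sha1(manifest).digest()

id_a = directory_id([(b"b", b"t2"), (b"a", b"t1")])
id_b = directory_id([(b"a", b"t1"), (b"b", b"t2")])
assert id_a == id_b
```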

@olasd any reason not to do step 1? Whether or not we "fix" the remaining hashability issue by overriding the __hash__ method, I see no harm, and it makes the model (IMHO) cleaner.

I see no reason not to do step 1.

The mutability of model objects feels to me like only a theoretical/conceptual problem

I tend to disagree; any time you leave open the possibility of something going wrong, it will happen at some point. I see no harm in doing things "right", and I believe the model should be as clean and correct as possible.

Where are we trying to use these objects as dict keys / set items?

I just wanted to compare objects in a destination storage after a replay session (in tests) by comparing equality of sets (because a replay does not keep order)... I know I can do comparison between sets of ids, but still, having frozen model objects and not being able to put them in a set/dict looks very odd to me. Having HashableObject instances that are not hashable (in python) is somewhat ironic...

Putting composite ("large") objects in a set or as dict keys always feels a bit weird to me, but I can see how this would be convenient.

Also, reimplementing the __hash__ method should be something we do if it's a sensible thing to do w.r.t. the conceptual model, not as an easy fix for the hashability problem. It seems to be the case at first sight, especially for HashableObject subclasses, but we could also look at this from the "append only" nature of the archive. In this regard, the "immutability" is not a matter of whether some attributes are used in the computation of the hash or not.

Sure. But taken in isolation, the proposal feels like a substantial amount of work for a very thin benefit in a fringe use case (which can fairly easily be worked around by adding .id in a few places). With adaptations, I'm slowly warming up to it, as I can see how it would improve some things in the long run. But I'd like to use the opportunity of this cleanup to go a bit further than "the minimal amount of work for pedantic correctness", and actually make changes that have a conceptual meaning.

I think we should do away with the metadata attributes completely.
I'm not sure why we added them to OriginVisitStatus in the end, instead of
removing them, after we checked that they were always empty in production
data.

Note that I'm fine either way but my understanding of this discussion [1] was
we agreed to keep them to avoid migrating back and forth in case we need it.

Which, now that I put it like this, I know @douardda will drop on me ninja
style saying "yagni?" or something ;)

[1] T2346#43058

Re-reading (and realizing again) the argument on metadata being phased out. So,
we can drop those unused [1] metadata fields from origin-visit/origin-visit-status then.

[1] They are indeed not used in production. I have checked again on staging,
which got a recent dump from production (well, recently enough).

swh=> select count(*) from origin_visit_status where metadata is not null;
 count
-------
     0
(1 row)

swh=> select count(*) from origin_visit where metadata is not null;
 count
-------
     0
(1 row)

But I'd like to use the opportunity of this cleanup to go a bit further than "the minimal amount of work for pedantic correctness", and actually make changes that have a conceptual meaning.

my point exactly (whether or not I've been clear in my motivations); maybe the "pedantic scale" is not a universal one, however.

Reading the code dealing with snapshot branches in several storage implementations, it really seems to me that storing them as a dict-like structure has no advantage.

As discussed on IRC:

12:06 <+douardda> I really don't see the advantage of using a dict(-like) structure for snapshot branches (rather than a sorted tuple of couples)
12:07 <+douardda> all the code we have that deals with branches needs these branches to be sorted
12:07 < vlorentz> except to access b"master" and b"HEAD"
12:07 <+douardda> the only place where the dict api is convenient is (possibly) in user code (user == user of the storage)
12:08 < vlorentz> right
12:08 <+douardda> where it's easy to dictify the returned branches
12:08 < vlorentz> indeed
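
The "easy to dictify" point from the IRC log above can be sketched in one line: if branches become a sorted tuple of (name, branch) pairs, user code recovers dict access trivially where it is convenient.

```python
# Hypothetical branches value, as a sorted tuple of (name, branch) pairs
branches = ((b"HEAD", b"rev1"), (b"refs/heads/master", b"rev1"))

# User code that wants dict-style access just dictifies the returned branches:
as_dict = dict(branches)
assert as_dict[b"HEAD"] == b"rev1"
```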