The problem
Currently, model objects, as defined in swh.model.model are declared using attrs as frozen, typically:
@attr.s(frozen=True) class Person(BaseModel): """Represents the author/committer of a revision or release.""" fullname = attr.ib(type=bytes, validator=type_validator()) name = attr.ib(type=Optional[bytes], validator=type_validator()) email = attr.ib(type=Optional[bytes], validator=type_validator())
However, some object models have mutable attributes, which fails the immutable property expected when using frozen entity types.
Besides the inconsistency that this situation provoke, it also prevents these model objects from being hashable, thus from being put in structures likes sets or dicts.
And with the current move to "use swh.model.model model objects everywhere", it would be nice to have consistent core model scaffolding.
Affected model classes are:
- Revision because of the parents attribute (List) and the metadata attribute (Dict),
- Directory because of the entries attribute (List),
- OriginVisit because of the metadata attribute (Dict),
- OriginVisitStatus because of the metadata attribute (Dict),
- Snapshot because of the branches attribute (Dict),
- Release because of the metadata attribute (Dict).
Proposed fix
- Replace List attribute types by Tuple ones (namely Revision.parents and Directory.entries). An implementation of this idea is proposed is D3177.
- Replace Snapshot.branches by a Tuple. It is currently a dict which keys are branch names ({branch_name: branch}). So it would require to add a new name attribute to the SnapshotBranch class and store them in a Snapshot as a tuple.
- Replace metadata attributes by either a json-encoded string or a Tuple[Tuple[str, str]] (or similar, as long as it remains a hashable structure).
Impacts
Tuple instead of List
The first point should have very limited impact on other swh packages, as long as the .from_dict() method is nice enough to convert the list into a tuple on the fly.
Snapshot.branches as a Tuple
Second point (Snapshot.branches as a tuple) has possible impact on almost everything (namely on swh.deposit, swh.indexer, swh.loader.core, swh.loader.git, swh.loader.mercurial, swh.loader.svn, swh.storage, swh.web.client and swh.web).
However, when grepping accesses to the .branches attribute, we have only a few hits:
~/swh/swh-environment$ pygrep "[.]branches" swh-*/swh swh-loader-core/swh/loader/package/loader.py: for rev in snapshot.branches.values() swh-loader-git/swh/loader/git/loader.py: for branch in base_snapshot.branches.values(): swh-loader-git/swh/loader/git/loader.py: eventful = bool(self.snapshot.branches) swh-loader-git/swh/loader/git/from_disk.py: eventful = bool(self.snapshot.branches) swh-loader-mercurial/swh/loader/mercurial/loader.py: self.branches = {} swh-model/swh/model/tests/test_hypothesis_strategies.py: branches = snapshot.branches swh-model/swh/model/tests/test_hypothesis_strategies.py: assert len(snapshot.branches) == _min_snp_size swh-storage/swh/storage/in_memory.py: sorted_branch_names = sorted(snapshot.branches) swh-storage/swh/storage/in_memory.py: for branch in snapshot.branches.values() swh-storage/swh/storage/in_memory.py: branch = snapshot.branches[branch_name] swh-storage/swh/storage/in_memory.py: branch_name: snapshot.branches[branch_name] swh-storage/swh/storage/storage.py: for name, info in snapshot.branches.items() swh-storage/swh/storage/cassandra/storage.py: for (branch_name, branch) in snapshot.branches.items():
The only piece of code really using the dict feature here is swh.storage.in_memory.
So it could be possible to make this modification without too much trouble for other swh packages, by using a bw-compatible .from_dict() method.
<cls>.metadata as a str (json encoded MD structure)
TODO