Page MenuHomeSoftware Heritage

Add a 'unique_key' method on model objects
ClosedPublic

Authored by vlorentz on Sep 29 2020, 2:11 PM.

Details

Summary

that returns a value suitable for unicity constraints.

Motivation:

  • this is somewhat more of a model concern than a journal/kafka concern IMO
  • this is one step toward adding support for non-model objects in KafkaJournalWriter

Implementation of the unique_key methods comes from
swh.journal.serializers.object_key.

Diff Detail

Repository
rDMOD Data model
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 15686
Build 24152: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 24151: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D4078 (id=14390)

Rebasing onto 362ebf609b...

Current branch diff-target is up to date.
Changes applied before test
commit 33437fa2f551848e0300682ba91b8ea9c67cab70
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 14:08:08 2020 +0200

    Add a 'unique_key' method on model objects
    
    that returns a value suitable for unicity constraints.
    
    Motivation:
    
    * this is somewhat more of a model concern than a journal/kafka
      concern IMO
    * this is one step toward adding support for non-model objects in
      KafkaJournalWriter

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/141/ for more details.

Build is green

Patch application report for D4078 (id=14392)

Rebasing onto 362ebf609b...

Current branch diff-target is up to date.
Changes applied before test
commit 951523a89e1fdef0e04e71d1409c616de187601f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 14:08:08 2020 +0200

    Add a 'unique_key' method on model objects
    
    that returns a value suitable for unicity constraints.
    
    Motivation:
    
    * this is somewhat more of a model concern than a journal/kafka
      concern IMO
    * this is one step toward adding support for non-model objects in
      KafkaJournalWriter
    
    Implementation of the unique_key methods comes from
    `swh.journal.serializers.object_key`.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/142/ for more details.

vlorentz retitled this revision from [WIP] Add a 'unique_key' method on model objects to Add a 'unique_key' method on model objects.Oct 8 2020, 11:18 AM

Build is green

Patch application report for D4078 (id=14762)

Rebasing onto bdfde82845...

Current branch diff-target is up to date.
Changes applied before test
commit a251df2e5b31a5d59d7e69e51a441bb22b1a7b0b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 14:08:08 2020 +0200

    Add a 'unique_key' method on model objects
    
    that returns a value suitable for unicity constraints.
    
    Motivation:
    
    * this is somewhat more of a model concern than a journal/kafka
      concern IMO
    * this is one step toward adding support for non-model objects in
      KafkaJournalWriter
    
    Implementation of the unique_key methods comes from
    `swh.journal.serializers.object_key`.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/145/ for more details.

maybe stupid question, but why using dict as unique key (in many model classes)? Why not use a tuple? I mean it seems to me that such a UID should be usable as dict keys or in a set directly.

maybe stupid question, but why using dict as unique key (in many model classes)? Why not use a tuple? I mean it seems to me that such a UID should be usable as dict keys or in a set directly.

Also (most probably dumb idea, writing as it pops in my mind), wouldn't it make sense to add some kind of 'per-object class model version' in the key?

maybe stupid question, but why using dict as unique key (in many model classes)? Why not use a tuple? I mean it seems to me that such a UID should be usable as dict keys or in a set directly.

I don't know, I just copied what we were already doing in swh-journal. Dicts have the nice property of being somewhat "self-documenting" though.

Also (most probably dumb idea, writing as it pops in my mind), wouldn't it make sense to add some kind of 'per-object class model version' in the key?

This would prevent compacting away old versions of objects. Is this something we want?

maybe stupid question, but why using dict as unique key (in many model classes)? Why not use a tuple? I mean it seems to me that such a UID should be usable as dict keys or in a set directly.

I don't know, I just copied what we were already doing in swh-journal. Dicts have the nice property of being somewhat "self-documenting" though.

Also (most probably dumb idea, writing as it pops in my mind), wouldn't it make sense to add some kind of 'per-object class model version' in the key?

This would prevent compacting away old versions of objects. Is this something we want?

Yes I though about this, dunno, needs a bit a pros/cons of this idea... maybe what we want is a version Final attribute on model classes instead.

yes, it would make sense for values. Do you want to open a task for that?

yes, it would make sense for values. Do you want to open a task for that?

you read my mind :-)

maybe stupid question, but why using dict as unique key (in many model classes)? Why not use a tuple? I mean it seems to me that such a UID should be usable as dict keys or in a set directly.

I don't know, I just copied what we were already doing in swh-journal. Dicts have the nice property of being somewhat "self-documenting" though.

I believe it would be best to enforce hashable unique keys. The "self-documented" looks less of the desired feature to me. @olasd what's your opinion?

yes, it would make sense for values. Do you want to open a task for that?

you read my mind :-)

I've raised this idea in T1279

First of all, I totally agree that this should be pushed down to swh.model.

maybe stupid question, but why using dict as unique key (in many model classes)? Why not use a tuple? I mean it seems to me that such a UID should be usable as dict keys or in a set directly.

I don't know, I just copied what we were already doing in swh-journal. Dicts have the nice property of being somewhat "self-documenting" though.

I believe it would be best to enforce hashable unique keys. The "self-documented" looks less of the desired feature to me. @olasd what's your opinion?

Well. The current implementation is a dict, but it probably should have been a named tuple or a frozen dict or any other immutable, hashable type.

If we change the shape of these values (now or in the future), we need to make sure that swh.journal will convert them to bytes in a backwards-compatible way, or we will lose the kafka compaction behavior. This deserves a big fat warning in every implementation of this method, IMO.

Considering our *current* operational status (backfill of objects barely started, ingestion quasi-stopped, kafka cluster almost empty), if there's one point where we should be doing such a change, it's now: we can just drop the contents of the kafka cluster and start fresh.

This revision is now accepted and ready to land.Oct 12 2020, 11:59 AM