Differential D4078

Add a 'unique_key' method on model objects
Closed · Public

Authored by vlorentz on Sep 29 2020, 2:11 PM.

Details

Reviewers

olasd

Group Reviewers

Reviewers

Commits

rDMODa251df2e5b31: Add a 'unique_key' method on model objects

Summary

that returns a value suitable for unicity constraints.

Motivation:

* this is somewhat more of a model concern than a journal/kafka concern IMO
* this is one step toward adding support for non-model objects in KafkaJournalWriter

Implementation of the unique_key methods comes from
swh.journal.serializers.object_key.
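As a rough illustration of what the summary describes (not the actual swh.model code), a `unique_key` method might look like the sketch below. The class names `ExampleOrigin` and `ExampleRevision` and their fields are hypothetical. The assumption is that some objects are keyed by a small dict of identifying fields and others by an intrinsic identifier.

```python
from dataclasses import dataclass
from typing import Dict, Union

# Hypothetical key type: either a dict of identifying fields or raw bytes.
KeyType = Union[Dict[str, str], bytes]


@dataclass(frozen=True)
class ExampleOrigin:
    """Hypothetical model object identified by its URL."""

    url: str

    def unique_key(self) -> KeyType:
        # The key only contains the fields that identify the object.
        return {"url": self.url}


@dataclass(frozen=True)
class ExampleRevision:
    """Hypothetical model object identified by an intrinsic id."""

    id: bytes
    message: bytes

    def unique_key(self) -> KeyType:
        # Non-identifying fields (here, message) stay out of the key.
        return self.id
```

With this shape, a journal writer could call `obj.unique_key()` without knowing the object's type, which is the point of moving the logic to the model.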

Diff Detail

Repository

rDMOD Data model

Branch

master

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 15685
Build 24150: Phabricator diff pipeline on jenkins
Build 24149: arc lint + arc unit

Event Timeline

vlorentz created this revision. · Sep 29 2020, 2:11 PM

Herald added a reviewer: Reviewers. · Sep 29 2020, 2:11 PM

vlorentz planned changes to this revision. · Sep 29 2020, 2:11 PM

Build is green

Patch application report for D4078 (id=14390)

Rebasing onto 362ebf609b...

Current branch diff-target is up to date.

Changes applied before test

commit 33437fa2f551848e0300682ba91b8ea9c67cab70
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 14:08:08 2020 +0200

    Add a 'unique_key' method on model objects
    
    that returns a value suitable for unicity constraints.
    
    Motivation:
    
    * this is somewhat more of a model concern than a journal/kafka
      concern IMO
    * this is one step toward adding support for non-model objects in
      KafkaJournalWriter

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/141/ for more details.

Harbormaster completed remote builds in B15685: Diff 14390. · Sep 29 2020, 2:12 PM

improve commit message

vlorentz planned changes to this revision. · Sep 29 2020, 2:12 PM

vlorentz edited the summary of this revision. · Sep 29 2020, 2:12 PM

Build is green

Patch application report for D4078 (id=14392)

Rebasing onto 362ebf609b...

Current branch diff-target is up to date.

Changes applied before test

commit 951523a89e1fdef0e04e71d1409c616de187601f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 14:08:08 2020 +0200

    Add a 'unique_key' method on model objects
    
    that returns a value suitable for unicity constraints.
    
    Motivation:
    
    * this is somewhat more of a model concern than a journal/kafka
      concern IMO
    * this is one step toward adding support for non-model objects in
      KafkaJournalWriter
    
    Implementation of the unique_key methods comes from
    `swh.journal.serializers.object_key`.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/142/ for more details.

Harbormaster completed remote builds in B15686: Diff 14392. · Sep 29 2020, 2:14 PM

vlorentz mentioned this in D4079: Remove swh.journal.serializers.object_key, use BaseModel.unique_key instead. · Sep 29 2020, 2:16 PM

vlorentz added a child revision: D4079: Remove swh.journal.serializers.object_key, use BaseModel.unique_key instead. · Sep 29 2020, 3:09 PM

rebase

vlorentz retitled this revision from "[WIP] Add a 'unique_key' method on model objects" to "Add a 'unique_key' method on model objects". · Oct 8 2020, 11:18 AM

Build is green

Patch application report for D4078 (id=14762)

Rebasing onto bdfde82845...

Current branch diff-target is up to date.

Changes applied before test

commit a251df2e5b31a5d59d7e69e51a441bb22b1a7b0b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 14:08:08 2020 +0200

    Add a 'unique_key' method on model objects
    
    that returns a value suitable for unicity constraints.
    
    Motivation:
    
    * this is somewhat more of a model concern than a journal/kafka
      concern IMO
    * this is one step toward adding support for non-model objects in
      KafkaJournalWriter
    
    Implementation of the unique_key methods comes from
    `swh.journal.serializers.object_key`.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/145/ for more details.

Harbormaster completed remote builds in B16028: Diff 14762. · Oct 8 2020, 11:19 AM

vlorentz added a child revision: D4194: model: use visit ids in the unique key, instead of their date. · Oct 8 2020, 11:22 AM

maybe stupid question, but why using dict as unique key (in many model classes)? Why not use a tuple? I mean it seems to me that such a UID should be usable as dict keys or in a set directly.

In D4078#103938, @douardda wrote:

maybe stupid question, but why using dict as unique key (in many model classes)? Why not use a tuple? I mean it seems to me that such a UID should be usable as dict keys or in a set directly.

Also (most probably dumb idea, writing as it pops in my mind), wouldn't it make sense to add some kind of 'per-object class model version' in the key?

In D4078#103938, @douardda wrote:

maybe stupid question, but why using dict as unique key (in many model classes)? Why not use a tuple? I mean it seems to me that such a UID should be usable as dict keys or in a set directly.

I don't know, I just copied what we were already doing in swh-journal. Dicts have the nice property of being somewhat "self-documenting" though.

In D4078#103940, @douardda wrote:

Also (most probably dumb idea, writing as it pops in my mind), wouldn't it make sense to add some kind of 'per-object class model version' in the key?

This would prevent compacting away old versions of objects. Is this something we want?

In D4078#103941, @vlorentz wrote:

In D4078#103938, @douardda wrote:

maybe stupid question, but why using dict as unique key (in many model classes)? Why not use a tuple? I mean it seems to me that such a UID should be usable as dict keys or in a set directly.

I don't know, I just copied what we were already doing in swh-journal. Dicts have the nice property of being somewhat "self-documenting" though.

In D4078#103940, @douardda wrote:

Also (most probably dumb idea, writing as it pops in my mind), wouldn't it make sense to add some kind of 'per-object class model version' in the key?

This would prevent compacting away old versions of objects. Is this something we want?

Yes, I thought about this. Dunno, the idea needs a bit of pros/cons analysis... maybe what we want is a version Final attribute on model classes instead.

yes, it would make sense for values. Do you want to open a task for that?
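To make the key/value distinction in this exchange concrete, here is a minimal sketch, assuming a hypothetical `ExampleOrigin` class with a hypothetical `to_dict` method. The version is a class-level `Final` attribute that ends up in the serialized value but not in the key, so records written under different model versions would still share one key and remain candidates for compaction.

```python
from typing import Any, Dict, Final


class ExampleOrigin:
    """Hypothetical model class; not the swh.model implementation."""

    # Hypothetical per-class model version, fixed for all instances.
    version: Final = 2

    def __init__(self, url: str) -> None:
        self.url = url

    def unique_key(self) -> Dict[str, str]:
        # The version is deliberately absent from the key.
        return {"url": self.url}

    def to_dict(self) -> Dict[str, Any]:
        # The version travels with the value instead.
        return {"url": self.url, "version": self.version}
```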

In D4078#103968, @vlorentz wrote:

yes, it would make sense for values. Do you want to open a task for that?

you read my mind :-)

In D4078#103941, @vlorentz wrote:

In D4078#103938, @douardda wrote:

maybe stupid question, but why using dict as unique key (in many model classes)? Why not use a tuple? I mean it seems to me that such a UID should be usable as dict keys or in a set directly.

I don't know, I just copied what we were already doing in swh-journal. Dicts have the nice property of being somewhat "self-documenting" though.

I believe it would be best to enforce hashable unique keys. The "self-documented" looks less of the desired feature to me. @olasd what's your opinion?
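The hashability concern can be shown in a few lines. In this sketch, `ExampleVisitKey` and its fields are made up for illustration. A plain dict cannot be hashed, while a named tuple can be placed in a set or used as a dict key and still keeps field names.

```python
from typing import NamedTuple


class ExampleVisitKey(NamedTuple):
    """Hypothetical immutable, hashable key with named fields."""

    origin: str
    date: str


dict_key = {"origin": "https://example.org/repo", "date": "2020-09-29"}
tuple_key = ExampleVisitKey(**dict_key)

# A dict is mutable, so Python refuses to hash it.
try:
    hash(dict_key)
    dict_is_hashable = True
except TypeError:
    dict_is_hashable = False

# A named tuple of hashable fields can be stored in a set directly.
seen = {tuple_key}
```

The named tuple still exposes `tuple_key.origin` and `tuple_key._asdict()`, so part of the "self-documenting" property of dicts is preserved.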

In D4078#103969, @douardda wrote:

In D4078#103968, @vlorentz wrote:

yes, it would make sense for values. Do you want to open a task for that?

you read my mind :-)

I've raised this idea in T1279

First of all, I totally agree that this should be pushed down to swh.model.

In D4078#103970, @douardda wrote:

In D4078#103941, @vlorentz wrote:

In D4078#103938, @douardda wrote:

maybe stupid question, but why using dict as unique key (in many model classes)? Why not use a tuple? I mean it seems to me that such a UID should be usable as dict keys or in a set directly.

I don't know, I just copied what we were already doing in swh-journal. Dicts have the nice property of being somewhat "self-documenting" though.

I believe it would be best to enforce hashable unique keys. The "self-documented" looks less of the desired feature to me. @olasd what's your opinion?

Well. The current implementation is a dict, but it probably should have been a named tuple or a frozen dict or any other immutable, hashable type.

If we change the shape of these values (now or in the future), we need to make sure that swh.journal will convert them to bytes in a backwards-compatible way, or we will lose the kafka compaction behavior. This deserves a big fat warning in every implementation of this method, IMO.

Considering our *current* operational status (backfill of objects barely started, ingestion quasi-stopped, kafka cluster almost empty), if there's one point where we should be doing such a change, it's now: we can just drop the contents of the kafka cluster and start fresh.
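The backwards-compatibility constraint on the byte conversion can be sketched as follows. This is an illustration using stdlib JSON, not the serialization swh.journal actually uses; `key_to_bytes` and `ExampleVisitKey` are hypothetical names. The idea is that if the in-memory shape of a key changes (dict to named tuple here), the bytes sent as the Kafka message key stay identical, so compaction still treats old and new records as the same key.

```python
import json
from typing import Dict, NamedTuple, Union


class ExampleVisitKey(NamedTuple):
    """Hypothetical new-style hashable key."""

    origin: str
    date: str


KeyType = Union[Dict[str, str], ExampleVisitKey, bytes]


def key_to_bytes(key: KeyType) -> bytes:
    """Hypothetical converter from a unique key to stable bytes."""
    if isinstance(key, bytes):
        # Intrinsic identifiers are used as-is.
        return key
    if isinstance(key, ExampleVisitKey):
        # Map the new shape back to the old dict shape before encoding.
        key = key._asdict()
    # Sorted keys and fixed separators make the output deterministic.
    return json.dumps(key, sort_keys=True, separators=(",", ":")).encode("utf-8")


old_style = {"origin": "https://example.org/repo", "date": "2020-09-29"}
new_style = ExampleVisitKey(origin="https://example.org/repo", date="2020-09-29")
```

Here `key_to_bytes(old_style)` and `key_to_bytes(new_style)` produce the same bytes, which is the property a shape change would need to preserve.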

olasd accepted this revision. · Oct 12 2020, 11:59 AM

This revision is now accepted and ready to land. · Oct 12 2020, 11:59 AM

Closed by commit rDMODa251df2e5b31: Add a 'unique_key' method on model objects (authored by vlorentz). · Oct 12 2020, 12:16 PM

This revision was automatically updated to reflect the committed changes.

vlorentz added a commit: rDMODa251df2e5b31: Add a 'unique_key' method on model objects.