Page MenuHomeSoftware Heritage

model: use visit ids in the unique key, instead of their date.
AbandonedPublic

Authored by vlorentz on Oct 8 2020, 11:22 AM.

Details

Summary

dates are not unique (ie. multiple visits can share a date, and they
do in practice) and timestamps are also not great for testing equality
(eg. migrating between systems is lossy).
Additionally, visit statuses already use visit ids in their
unique key.

Depends on D4078

Resolves T2672

Diff Detail

Repository
rDMOD Data model
Branch
idx-journal
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 16035
Build 24671: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 24670: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D4194 (id=14763)

Could not rebase; Attempt merge onto bdfde82845...

Updating bdfde82..719f57a
Fast-forward
 swh/model/model.py            | 53 +++++++++++++++++++++++++++++++++++++++----
 swh/model/tests/test_model.py | 30 ++++++++++++++++++++++++
 2 files changed, 79 insertions(+), 4 deletions(-)
Changes applied before test
commit 719f57afe24995d0705e9d1698c91bd7bcd3426f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Oct 8 11:22:10 2020 +0200

    model: use visit ids in the unique key, instead of their date.
    
    dates are not unique (ie. multiple visits can share a date, and they
    do in practice); and visit statuses already use visit ids in their
    unique key.

commit a251df2e5b31a5d59d7e69e51a441bb22b1a7b0b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 14:08:08 2020 +0200

    Add a 'unique_key' method on model objects
    
    that returns a value suitable for unicity constraints.
    
    Motivation:
    
    * this is somewhat more of a model concern than a journal/kafka
      concern IMO
    * this is one step toward adding support for non-model objects in
      KafkaJournalWriter
    
    Implementation of the unique_key methods comes from
    `swh.journal.serializers.object_key`.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/146/ for more details.

dates are not unique (ie. multiple visits can share a date, and they
do in practice); and visit statuses already use visit ids in their
unique key.

Have you an idea of how many of these origin,date collision we have?
Looks to me that should be really a rare event, shouldn't it? (since date are stored with a microsecond resolution, right?)

microsecond in postgres, millisecond in cassandra.

that's another good reason not to use them as identifiers.

IMO, more generally, timestamps are like floating-point numbers, they should never be used as operands in an equality.

microsecond in postgres, millisecond in cassandra.

Ah ok, maybe add this statement somewhere in the commit message?

that's another good reason not to use them as identifiers.

IMO, more generally, timestamps are like floating-point numbers, they should never be used as operands in an equality.

I agree on the principle. Just wanted to know how much of an existing problem this is. I guess not so much in pg, but then pg->cassandra becomes a lossy transfer.

Thx

Build is green

Patch application report for D4194 (id=14769)

Could not rebase; Attempt merge onto bdfde82845...

Updating bdfde82..d0d6693
Fast-forward
 swh/model/model.py            | 53 +++++++++++++++++++++++++++++++++++++++----
 swh/model/tests/test_model.py | 30 ++++++++++++++++++++++++
 2 files changed, 79 insertions(+), 4 deletions(-)
Changes applied before test
commit d0d669361700c8e83fc7a94b268186be3e0129bb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Oct 8 11:22:10 2020 +0200

    model: use visit ids in the unique key, instead of their date.
    
    dates are not unique (ie. multiple visits can share a date, and they
    do in practice), and timestamps are also not great for testing equality
    (eg. migrating between systems is lossy).
    Additionally, visit statuses already use visit ids in their
    unique key.

commit a251df2e5b31a5d59d7e69e51a441bb22b1a7b0b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 29 14:08:08 2020 +0200

    Add a 'unique_key' method on model objects
    
    that returns a value suitable for unicity constraints.
    
    Motivation:
    
    * this is somewhat more of a model concern than a journal/kafka
      concern IMO
    * this is one step toward adding support for non-model objects in
      KafkaJournalWriter
    
    Implementation of the unique_key methods comes from
    `swh.journal.serializers.object_key`.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/147/ for more details.