cassandra: Bump next_visit_id when origin_visit_add is called by a replayer
Closed, Public

Authored by vlorentz on Aug 20 2021, 6:12 PM.

Details

Summary

When origin_visit_add is called by a replayer, the visit.visit field is
already set, but origin.next_visit_id was never incremented. On the next
loader run, the new visit would therefore be assigned id 1 even if a
visit with that id already exists.
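
To illustrate the fix described above, here is a minimal in-memory sketch. It is not the swh.storage implementation: OriginState, the dict of origins, and this simplified origin_visit_add signature are assumptions made for the example. It only shows the idea of moving the counter past any visit id supplied by a replayer.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class OriginState:
    """Hypothetical stand-in for an origin row; not the swh.storage model."""

    next_visit_id: int = 1
    visits: Set[int] = field(default_factory=set)


def origin_visit_add(
    origins: Dict[str, OriginState], url: str, visit: Optional[int] = None
) -> int:
    """Record a visit for `url` and return its id (simplified, assumed signature)."""
    state = origins.setdefault(url, OriginState())
    if visit is None:
        # Loader path: no id was supplied, so allocate the next one.
        visit = state.next_visit_id
        state.next_visit_id += 1
    else:
        # Replayer path: the id is already set. Bump the counter past it so
        # that a later loader run cannot hand out the same id again.
        state.next_visit_id = max(state.next_visit_id, visit + 1)
    state.visits.add(visit)
    return visit
```

In this sketch, replaying visits 1 and 2 and then running a loader yields visit id 3 for the loader. Without the max() bump, the loader would reuse id 1.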

Event Timeline

Build is green

Patch application report for D6120 (id=22142)

Could not rebase; Attempt merge onto 9f00eb9dba...

Updating 9f00eb9d..724a67e0
Fast-forward
 swh/storage/cassandra/cql.py       | 45 +++++++++++++++++++++++++++++++++++
 swh/storage/cassandra/model.py     |  4 ++--
 swh/storage/cassandra/schema.py    |  2 +-
 swh/storage/cassandra/storage.py   | 21 ++++++++++++++++-
 swh/storage/in_memory.py           | 11 +++++++++
 swh/storage/tests/storage_tests.py | 48 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 127 insertions(+), 4 deletions(-)
Changes applied before test
commit 724a67e06fd6e6c9ed93c28dae79db43239e7fc9
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 20 18:12:26 2021 +0200

    cassandra: Bump next_visit_id when origin_visit_add is called by a replayer
    
    When called by a replayer, the visit.visit field is set; but
    origin.next_visit_id was never incremented, so on the next loader
    run, the visit id would be 1 even if there is already a visit
    with that id.

commit a3cc0dc7b104bc8b7f05988a7e0e26fae462ac7f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 20 13:52:17 2021 +0200

    cassandra: Make content_missing query in batches
    
    Instead of calling content_find() for each object, which needs to make
    two queries for each.
    
    Given the latency of Cassandra queries, this should be a significant
    speed-up (possibly up to 100 times faster, as this is the value of
    PARTITION_KEY_RESTRICTION_MAX_SIZE).
    
    This also changes the schema, because CQL does not allow doing `IN`
    queries on compound partition keys.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1365/ for more details.
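
The batching described in the content_missing commit message above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the diff: find_present is a hypothetical callback standing in for a single batched query, and BATCH_SIZE simply reuses the value 100 quoted for PARTITION_KEY_RESTRICTION_MAX_SIZE.

```python
from typing import Callable, Iterable, Iterator, List, Set

# Assumed to mirror the value quoted for PARTITION_KEY_RESTRICTION_MAX_SIZE.
BATCH_SIZE = 100


def grouper(items: Iterable[bytes], n: int) -> Iterator[List[bytes]]:
    """Yield successive lists of at most `n` items."""
    batch: List[bytes] = []
    for item in items:
        batch.append(item)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:
        yield batch


def content_missing(
    ids: Iterable[bytes],
    find_present: Callable[[List[bytes]], Set[bytes]],
    batch_size: int = BATCH_SIZE,
) -> Iterator[bytes]:
    """Yield the ids that the (hypothetical) batched lookup does not return."""
    for batch in grouper(ids, batch_size):
        # One round trip per batch instead of one or more per object.
        present = find_present(batch)
        for content_id in batch:
            if content_id not in present:
                yield content_id
```

With 250 ids, this sketch calls find_present three times (batches of 100, 100 and 50) rather than once per id, which is where the latency saving would come from.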

This revision is now accepted and ready to land. Aug 23 2021, 2:50 PM

Build is green

Patch application report for D6120 (id=22163)

Could not rebase; Attempt merge onto 7113198fd6...

Updating 7113198f..cf880db3
Fast-forward
 swh/storage/cassandra/cql.py       | 45 +++++++++++++++++++++++++++++++++++
 swh/storage/cassandra/model.py     |  4 ++--
 swh/storage/cassandra/schema.py    |  2 +-
 swh/storage/cassandra/storage.py   | 21 ++++++++++++++++-
 swh/storage/in_memory.py           | 11 +++++++++
 swh/storage/tests/storage_tests.py | 48 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 127 insertions(+), 4 deletions(-)
Changes applied before test
commit cf880db30bb549ccbdbb2cdd05b61d124ed90be7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 20 18:12:26 2021 +0200

    cassandra: Bump next_visit_id when origin_visit_add is called by a replayer
    
    When called by a replayer, the visit.visit field is set; but
    origin.next_visit_id was never incremented, so on the next loader
    run, the visit id would be 1 even if there is already a visit
    with that id.

commit 54b5abfb26267ad56a67ad9fa2dd9d5d075e30f0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 20 13:52:17 2021 +0200

    cassandra: Make content_missing query in batches
    
    Instead of calling content_find() for each object, which needs to make
    two queries for each.
    
    Given the latency of Cassandra queries, this should be a significant
    speed-up (possibly up to 100 times faster, as this is the value of
    PARTITION_KEY_RESTRICTION_MAX_SIZE).
    
    This also changes the schema, because CQL does not allow doing `IN`
    queries on compound partition keys.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1369/ for more details.