Fix CardinalityViolation in grab_next_visits on duplicate origins
ClosedPublic

Authored by vlorentz on Nov 22 2021, 1:37 PM.

Details

Summary

grab_next_visits selects rows from listed_origins, whose primary key is
(lister_id, url, visit_type), and uses them to upsert into origin_visit_stats,
whose primary key is (url, visit_type).
When the same (url, visit_type) pair is grabbed twice (for example under two
different lister_id values), the upsert fails with the error `ON CONFLICT DO
UPDATE command cannot affect row a second time`.

This commit deduplicates the (origin, type) pairs before upserting.
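
As an illustration only, the deduplication step can be sketched in Python. This is not the code from the diff: the `ListedOrigin` tuple, the `dedup_origin_pairs` helper and the sample rows are hypothetical names made up for this example, and the actual fix may well be done on the SQL side.

```python
from typing import Iterable, List, NamedTuple, Tuple


class ListedOrigin(NamedTuple):
    # Hypothetical stand-in for a row of listed_origins,
    # keyed by (lister_id, url, visit_type).
    lister_id: str
    url: str
    visit_type: str


def dedup_origin_pairs(origins: Iterable[ListedOrigin]) -> List[Tuple[str, str]]:
    """Return the unique (url, visit_type) pairs, in first-seen order."""
    # dict.fromkeys drops repeated keys while preserving insertion order.
    return list(dict.fromkeys((o.url, o.visit_type) for o in origins))


# Two listers report the same git origin; only one pair should be upserted.
rows = [
    ListedOrigin("lister-a", "https://example.org/repo", "git"),
    ListedOrigin("lister-b", "https://example.org/repo", "git"),
    ListedOrigin("lister-a", "https://example.org/repo", "hg"),
]
pairs = dedup_origin_pairs(rows)
```

With each (url, visit_type) pair appearing only once, a single upsert statement keyed on that pair no longer touches the same origin_visit_stats row twice.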

Resolves SWH-SCHEDULER-6A

Diff Detail

Repository
rDSCH Scheduling utilities
Branch
dedup
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 25099
Build 39215: Phabricator diff pipeline on jenkins
Build 39214: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6664 (id=24222)

Rebasing onto 00ff02eab9...

Current branch diff-target is up to date.
Changes applied before test
commit 2abb39368405ce684e2fb54dda03d4504328db6f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 22 13:32:20 2021 +0100

    Fix CardinalityViolation in grab_next_visits on duplicate origins
    
    grab_next_visits grabs from `listed_origins`, whose primary key is
    `(lister_id, url, visit_type)` and uses it to upsert in origin_visit_stats,
    whose primary key is `(url, visit_type)`.
    This causes the error `ON CONFLICT DO UPDATE command cannot affect row a
    second time` when the same (origin, type) pair is grabbed twice.
    
    This commit deduplicates the (origin, type) pairs before upserting.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/495/ for more details.

This revision is now accepted and ready to land. (Nov 22 2021, 2:28 PM)