Page MenuHomeSoftware Heritage

storage: check uniqueness based on all relevant columns, not just id and indexer_configuration_id.
ClosedPublic

Authored by vlorentz on Oct 1 2020, 6:14 PM.

Details

Summary

ctags and fossology can legitimately add multiple objects with the same
id and indexer_configuration_id at the same time, but different values
for other columns.

Before this commit, it would immediately crash the 'check_id_duplicates'
checks.

As a side-effect, this commit also remove the 'add_merge' logic in the
in-memory storage, and stores it like proper rows like the pg does instead
of working around this broken deduplication.

As an other side-effect, removing this broken logic removes an
inconsistency between the in-mem and pg storage (see the added tests
named test_add_empty for Ctags and Fossology), which Fossology indexer
tests relied on, so this commit updates these tests as well.

Sorry, this is quite a large diff; I intended it as a small refactoring but
it uncovered a bunch of bugs that I needed to fix too

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D4118 (id=14513)

Rebasing onto cd650f102d...

Current branch diff-target is up to date.
Changes applied before test
commit 83a02daaba15a087f5f09095754ec734e86643b3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Oct 1 18:13:40 2020 +0200

    storage: check uniqueness based on all relevant columns, not just id and indexer_configuration_id.
    
    ctags and fossology can legitimately add multiple objects with the same
    id and indexer_configuration_id at the same time, but different values
    for other columns.
    
    Before this commit, it would immediately crash the 'check_id_duplicates'
    checks.
    
    As a side-effect, this commit also remove the 'add_merge' logic in the
    in-memory storage, and stores it like proper rows like the pg does instead
    of working around this broken deduplication.

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/56/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/56/console

fix tests (see fourth paragraph of the description of this diff)

Build is green

Patch application report for D4118 (id=14550)

Rebasing onto cd650f102d...

Current branch diff-target is up to date.
Changes applied before test
commit f006cd254963d2a38f2384fb3640a684a4c06cc1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Oct 1 18:13:40 2020 +0200

    storage: check uniqueness based on all relevant columns, not just id and indexer_configuration_id.
    
    ctags and fossology can legitimately add multiple objects with the same
    id and indexer_configuration_id at the same time, but different values
    for other columns.
    
    Before this commit, it would immediately crash the 'check_id_duplicates'
    checks.
    
    As a side-effect, this commit also remove the 'add_merge' logic in the
    in-memory storage, and stores it like proper rows like the pg does instead
    of working around this broken deduplication.
    
    As an other side-effect, removing this broken logic removes an
    inconsistency between the in-mem and pg storage (see the added tests
    named `test_add_empty` for Ctags and Fossology), which Fossology indexer
    tests relied on, so this commit updates these tests as well.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/57/ for more details.

anlambert added a subscriber: anlambert.

I get the idea and did not notice any particular issue in the diff.

Nevertheless, these changes break swh-web tests (P797).

I did not analyze in details the issues but my guess is that a regression is introduced in the indexer memory storage.

swh/indexer/storage/in_memory.py
322

Could you add types to that method signature ?

349

same here

This revision now requires changes to proceed.Oct 2 2020, 1:37 PM

fix fossology test, which didn't test the right endpoint; then fix a bug the test missed, which caused the issues in swh-web.

Build is green

Patch application report for D4118 (id=14553)

Rebasing onto cd650f102d...

Current branch diff-target is up to date.
Changes applied before test
commit 07821f0da7c4a6e693244193b8bd2ee83b0c9613
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Oct 1 18:13:40 2020 +0200

    storage: check uniqueness based on all relevant columns, not just id and indexer_configuration_id.
    
    ctags and fossology can legitimately add multiple objects with the same
    id and indexer_configuration_id at the same time, but different values
    for other columns.
    
    Before this commit, it would immediately crash the 'check_id_duplicates'
    checks.
    
    As a side-effect, this commit also remove the 'add_merge' logic in the
    in-memory storage, and stores it like proper rows like the pg does instead
    of working around this broken deduplication.
    
    As an other side-effect, removing this broken logic removes an
    inconsistency between the in-mem and pg storage (see the added tests
    named `test_add_empty` for Ctags and Fossology), which Fossology indexer
    tests relied on, so this commit updates these tests as well.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/58/ for more details.

swh/indexer/storage/in_memory.py
322

That's on my todo-list for next couple of weeks.

This revision is now accepted and ready to land.Oct 2 2020, 2:18 PM