Page MenuHomeSoftware Heritage

storage: check uniqueness based on all relevant columns, not just id and indexer_configuration_id.
ClosedPublic

Authored by vlorentz on Oct 1 2020, 6:14 PM.

Details

Summary

ctags and fossology can legitimately add multiple objects with the same
id and indexer_configuration_id at the same time, but different values
for other columns.

Before this commit, it would immediately crash the 'check_id_duplicates'
checks.

As a side-effect, this commit also remove the 'add_merge' logic in the
in-memory storage, and stores it like proper rows like the pg does instead
of working around this broken deduplication.

As an other side-effect, removing this broken logic removes an
inconsistency between the in-mem and pg storage (see the added tests
named test_add_empty for Ctags and Fossology), which Fossology indexer
tests relied on, so this commit updates these tests as well.

Sorry, this is quite a large diff; I intended it as a small refactoring but
it uncovered a bunch of bugs that I needed to fix too

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
attrs
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 15780
Build 24286: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 24285: arc lint + arc unit

Unit TestsFailed

TimeTest
2 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.storage.__init__::swh.indexer.storage.check_id_duplicates
071 072 If any two row models in `data` have the same unique key, raises 073 a `ValueError`.
12 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.tests.test_fossology_license.TestFossologyLicenseIndexer::test_index
self = <swh.indexer.tests.test_fossology_license.TestFossologyLicenseIndexer testMethod=test_index> def test_index(self):
13 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.tests.test_fossology_license.TestFossologyLicenseIndexer::test_index_one_unknown_sha1
self = <swh.indexer.tests.test_fossology_license.TestFossologyLicenseIndexer testMethod=test_index_one_unknown_sha1> def test_index_one_unknown_sha1(self):
25 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.tests.test_origin_metadata::test_origin_metadata_indexer_delete_metadata
idx_storage = <swh.indexer.storage.in_memory.IndexerStorage object at 0x7f90c5bd8cc0> storage = <swh.storage.in_memory.InMemoryStorage object at 0x7f90c2707048> obj_storage = <swh.objstorage.backends.in_memory.InMemoryObjStorage object at 0x7f90c0525e48>
3 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.codemeta::swh.indexer.codemeta.merge_values
View Full Test Results (4 Failed · 315 Passed · 15 Skipped)

Event Timeline

Build has FAILED

Patch application report for D4118 (id=14513)

Rebasing onto cd650f102d...

Current branch diff-target is up to date.
Changes applied before test
commit 83a02daaba15a087f5f09095754ec734e86643b3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Oct 1 18:13:40 2020 +0200

    storage: check uniqueness based on all relevant columns, not just id and indexer_configuration_id.
    
    ctags and fossology can legitimately add multiple objects with the same
    id and indexer_configuration_id at the same time, but different values
    for other columns.
    
    Before this commit, it would immediately crash the 'check_id_duplicates'
    checks.
    
    As a side-effect, this commit also remove the 'add_merge' logic in the
    in-memory storage, and stores it like proper rows like the pg does instead
    of working around this broken deduplication.

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/56/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/56/console

fix tests (see fourth paragraph of the description of this diff)

Build is green

Patch application report for D4118 (id=14550)

Rebasing onto cd650f102d...

Current branch diff-target is up to date.
Changes applied before test
commit f006cd254963d2a38f2384fb3640a684a4c06cc1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Oct 1 18:13:40 2020 +0200

    storage: check uniqueness based on all relevant columns, not just id and indexer_configuration_id.
    
    ctags and fossology can legitimately add multiple objects with the same
    id and indexer_configuration_id at the same time, but different values
    for other columns.
    
    Before this commit, it would immediately crash the 'check_id_duplicates'
    checks.
    
    As a side-effect, this commit also remove the 'add_merge' logic in the
    in-memory storage, and stores it like proper rows like the pg does instead
    of working around this broken deduplication.
    
    As an other side-effect, removing this broken logic removes an
    inconsistency between the in-mem and pg storage (see the added tests
    named `test_add_empty` for Ctags and Fossology), which Fossology indexer
    tests relied on, so this commit updates these tests as well.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/57/ for more details.

anlambert added a subscriber: anlambert.

I get the idea and did not notice any particular issue in the diff.

Nevertheless, these changes break swh-web tests (P797).

I did not analyze in details the issues but my guess is that a regression is introduced in the indexer memory storage.

swh/indexer/storage/in_memory.py
317

Could you add types to that method signature ?

344

same here

This revision now requires changes to proceed.Oct 2 2020, 1:37 PM

fix fossology test, which didn't test the right endpoint; then fix a bug the test missed, which caused the issues in swh-web.

Build is green

Patch application report for D4118 (id=14553)

Rebasing onto cd650f102d...

Current branch diff-target is up to date.
Changes applied before test
commit 07821f0da7c4a6e693244193b8bd2ee83b0c9613
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Oct 1 18:13:40 2020 +0200

    storage: check uniqueness based on all relevant columns, not just id and indexer_configuration_id.
    
    ctags and fossology can legitimately add multiple objects with the same
    id and indexer_configuration_id at the same time, but different values
    for other columns.
    
    Before this commit, it would immediately crash the 'check_id_duplicates'
    checks.
    
    As a side-effect, this commit also remove the 'add_merge' logic in the
    in-memory storage, and stores it like proper rows like the pg does instead
    of working around this broken deduplication.
    
    As an other side-effect, removing this broken logic removes an
    inconsistency between the in-mem and pg storage (see the added tests
    named `test_add_empty` for Ctags and Fossology), which Fossology indexer
    tests relied on, so this commit updates these tests as well.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/58/ for more details.

swh/indexer/storage/in_memory.py
317

That's on my todo-list for next couple of weeks.

This revision is now accepted and ready to land.Oct 2 2020, 2:18 PM