Paths

Table of Contentst

Differential D4118

storage: check uniqueness based on all relevant columns, not just id and indexer_configuration_id.
ClosedPublic
Actions

Authored by vlorentz on Oct 1 2020, 6:14 PM.

Details

Reviewers

anlambert

Group Reviewers

Reviewers

Commits

rDCIDX07821f0da7c4: storage: check uniqueness based on all relevant columns, not just id and…

Summary

ctags and fossology can legitimately add multiple objects with the same
id and indexer_configuration_id at the same time, but different values
for other columns.

Before this commit, it would immediately crash the 'check_id_duplicates'
checks.

As a side-effect, this commit also remove the 'add_merge' logic in the
in-memory storage, and stores it like proper rows like the pg does instead
of working around this broken deduplication.

As an other side-effect, removing this broken logic removes an
inconsistency between the in-mem and pg storage (see the added tests
named test_add_empty for Ctags and Fossology), which Fossology indexer
tests relied on, so this commit updates these tests as well.

Sorry, this is quite a large diff; I intended it as a small refactoring but
it uncovered a bunch of bugs that I needed to fix too

Diff Detail

Repository

rDCIDX Metadata indexer

Branch

attrs

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 15780
Build 24286: Phabricator diff pipeline on jenkins	Jenkins console · Jenkins
Build 24285: arc lint + arc unit

Unit TestsFailed
View All

	Time	Test
	2 ms	Jenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.storage.__init__::swh.indexer.storage.check_id_duplicates 071 072 If any two row models in `data` have the same unique key, raises 073 a `ValueError`.
	12 ms	Jenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.tests.test_fossology_license.TestFossologyLicenseIndexer::test_index self = <swh.indexer.tests.test_fossology_license.TestFossologyLicenseIndexer testMethod=test_index> def test_index(self):
	13 ms	Jenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.tests.test_fossology_license.TestFossologyLicenseIndexer::test_index_one_unknown_sha1 self = <swh.indexer.tests.test_fossology_license.TestFossologyLicenseIndexer testMethod=test_index_one_unknown_sha1> def test_index_one_unknown_sha1(self):
	25 ms	Jenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.tests.test_origin_metadata::test_origin_metadata_indexer_delete_metadata idx_storage = <swh.indexer.storage.in_memory.IndexerStorage object at 0x7f90c5bd8cc0> storage = <swh.storage.in_memory.InMemoryStorage object at 0x7f90c2707048> obj_storage = <swh.objstorage.backends.in_memory.InMemoryObjStorage object at 0x7f90c0525e48>
	3 ms	Jenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.codemeta::swh.indexer.codemeta.merge_values
		View Full Test Results (4 Failed · 315 Passed · 15 Skipped)

Event Timeline

vlorentz created this revision.Oct 1 2020, 6:14 PM

Herald added a reviewer: Reviewers. · View Herald TranscriptOct 1 2020, 6:14 PM

Build has FAILED

Patch application report for D4118 (id=14513)

Rebasing onto cd650f102d...

Current branch diff-target is up to date.

Changes applied before test

commit 83a02daaba15a087f5f09095754ec734e86643b3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Oct 1 18:13:40 2020 +0200

    storage: check uniqueness based on all relevant columns, not just id and indexer_configuration_id.
    
    ctags and fossology can legitimately add multiple objects with the same
    id and indexer_configuration_id at the same time, but different values
    for other columns.
    
    Before this commit, it would immediately crash the 'check_id_duplicates'
    checks.
    
    As a side-effect, this commit also remove the 'add_merge' logic in the
    in-memory storage, and stores it like proper rows like the pg does instead
    of working around this broken deduplication.

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/56/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/56/console

Harbormaster failed remote builds in B15780: Diff 14513!Oct 1 2020, 6:16 PM

vlorentz planned changes to this revision.Oct 1 2020, 6:18 PM

fix tests (see fourth paragraph of the description of this diff)

vlorentz edited the summary of this revision. (Show Details)Oct 2 2020, 12:18 PM

Build is green

Patch application report for D4118 (id=14550)

Rebasing onto cd650f102d...

Current branch diff-target is up to date.

Changes applied before test

commit f006cd254963d2a38f2384fb3640a684a4c06cc1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Oct 1 18:13:40 2020 +0200

    storage: check uniqueness based on all relevant columns, not just id and indexer_configuration_id.
    
    ctags and fossology can legitimately add multiple objects with the same
    id and indexer_configuration_id at the same time, but different values
    for other columns.
    
    Before this commit, it would immediately crash the 'check_id_duplicates'
    checks.
    
    As a side-effect, this commit also remove the 'add_merge' logic in the
    in-memory storage, and stores it like proper rows like the pg does instead
    of working around this broken deduplication.
    
    As an other side-effect, removing this broken logic removes an
    inconsistency between the in-mem and pg storage (see the added tests
    named `test_add_empty` for Ctags and Fossology), which Fossology indexer
    tests relied on, so this commit updates these tests as well.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/57/ for more details.

Harbormaster completed remote builds in B15812: Diff 14550.Oct 2 2020, 12:20 PM

vlorentz edited the summary of this revision. (Show Details)Oct 2 2020, 12:20 PM

I get the idea and did not notice any particular issue in the diff.

Nevertheless, these changes break swh-web tests (P797).

I did not analyze in details the issues but my guess is that a regression is introduced in the indexer memory storage.

swh/indexer/storage/in_memory.py
317	Could you add types to that method signature ?
344	same here

This revision now requires changes to proceed.Oct 2 2020, 1:37 PM

fix fossology test, which didn't test the right endpoint; then fix a bug the test missed, which caused the issues in swh-web.

Build is green

Patch application report for D4118 (id=14553)

Rebasing onto cd650f102d...

Current branch diff-target is up to date.

Changes applied before test

commit 07821f0da7c4a6e693244193b8bd2ee83b0c9613
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Oct 1 18:13:40 2020 +0200

    storage: check uniqueness based on all relevant columns, not just id and indexer_configuration_id.
    
    ctags and fossology can legitimately add multiple objects with the same
    id and indexer_configuration_id at the same time, but different values
    for other columns.
    
    Before this commit, it would immediately crash the 'check_id_duplicates'
    checks.
    
    As a side-effect, this commit also remove the 'add_merge' logic in the
    in-memory storage, and stores it like proper rows like the pg does instead
    of working around this broken deduplication.
    
    As an other side-effect, removing this broken logic removes an
    inconsistency between the in-mem and pg storage (see the added tests
    named `test_add_empty` for Ctags and Fossology), which Fossology indexer
    tests relied on, so this commit updates these tests as well.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/58/ for more details.

Harbormaster completed remote builds in B15815: Diff 14553.Oct 2 2020, 2:05 PM

vlorentz added inline comments.Oct 2 2020, 2:05 PM