Page MenuHomeSoftware Heritage

Add checks in the idx_storage that the same content/rev/orig is not present twice in the new data.
ClosedPublic

Authored by vlorentz on Feb 5 2019, 12:00 PM.

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
storage-duplicate
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 4062
Build 5342: tox-on-jenkinsJenkins
Build 5341: arc lint + arc unit

Event Timeline

vlorentz created this revision.Feb 5 2019, 12:00 PM
douardda requested changes to this revision.Feb 7 2019, 9:44 AM
douardda added a subscriber: douardda.

This lacks a unit test: just add a test for the new method. It's nice to also have the test for the revision_metadata_add() call, but there should be a test that ensure the _check_duplicates() behaves ok (once again, with valid and invalid inputs), and ensure the newly added validator works in every modified method. I count 7 of them patched (plus the hunk in in_memory.py). I would expect (functionally) 8 tests.

This revision now requires changes to proceed.Feb 7 2019, 9:44 AM
vlorentz updated this revision to Diff 3433.Feb 7 2019, 2:17 PM
  • Move _check_duplicates outside a class and add doctests.

This lacks a unit test: just add a test for the new method.

Done.

and ensure the newly added validator works in every modified method. I count 7 of them patched

Done in D1095 for 5 of them. The 2 other ones will come in a later Diff (because these endpoints behave differently)

(plus the hunk in in_memory.py).

It's already tested by the tests in test_storage.py (which are run three times each: pg storage, memory storage, and client api).

Done in D1095 for 5 of them. The 2 other ones will come in a later Diff (because these endpoints behave differently)

I'm not a fan of adding missing pieces of a diff in another one.
Those tests are missing here, not in 1095.

I understand the need for refactoring but we should strive to keep the perimeter of a diff sandboxed.
Let 1095 be a tests refactoring and not 'a tests refactoring + a bug fix + adding missing tests'.

I understand the need for refactoring but we should strive to keep the perimeter of a diff sandboxed.
Let 1095 be a tests refactoring and not 'a tests refactoring + a bug fix + adding missing tests'.

I don't want to write non-trivial code that will get deleted in the next diff anyway.

I understand the need for refactoring but we should strive to keep the perimeter of a diff sandboxed.
Let 1095 be a tests refactoring and not 'a tests refactoring + a bug fix + adding missing tests'.

I don't want to write non-trivial code that will get deleted in the next diff anyway.

There could be a middle ground somewhere though.

You could have started by the refactoring first.
Then a diff to add this diff (and the 5 missing tests easier to add a priori with 1095)
Then the bugfix 1095 includes as another diff.

vlorentz updated this revision to Diff 3435.Feb 7 2019, 3:01 PM
  • Rewrite as a child of D1095

You could have started by the refactoring first.
Then a diff to add this diff (and the 5 missing tests easier to add a priori with 1095)

Done; I swapped the order of D1079 and D1095

Then the bugfix 1095 includes as another diff.

Done as D1096.

douardda requested changes to this revision.Feb 7 2019, 5:59 PM
douardda added inline comments.
swh/indexer/storage/__init__.py
51–79

I'm still quite not happy with this for several reasons:

  • the arg name 'column' make no sense in the context of checking stuff on dictionnaries,
  • the psycopg stuff in the docstring is very specific to the intended (primary) usage of this function, but there is no reason to make it specific to that usage,
  • more importantly, I still do not see a unit test for this function. For example it's very easy to make it fail in an unexpected way using a dict as value in a dict (since these are not hashable, the function will generates a TypeError). If we do not bother to be bullet proof, make it explicit in the doctring.
This revision now requires changes to proceed.Feb 7 2019, 5:59 PM
vlorentz updated this revision to Diff 3487.Feb 8 2019, 1:38 PM
  • Remove doc about psycopg error.
  • s/column/key/g
  • Document that values must be hashable.
vlorentz marked an inline comment as done.Feb 8 2019, 1:38 PM
vlorentz added inline comments.
swh/indexer/storage/__init__.py
51–79

the arg name 'column' make no sense in the context of checking stuff on dictionnaries,

Indeed, fixed.

the psycopg stuff in the docstring is very specific to the intended (primary) usage of this function, but there is no reason to make it specific to that usage,

It's in the module that implements the pg backend and its name is prefixed with _, so it's unlikely to be used anywhere else. Fixed anyway.

I still do not see a unit test for this function

There's the doctest

If we do not bother to be bullet proof, make it explicit in the doctring.

Indeed, fixed.

vlorentz updated this revision to Diff 3488.Feb 8 2019, 1:39 PM
  • Rebase
douardda accepted this revision.Feb 8 2019, 2:12 PM
This revision is now accepted and ready to land.Feb 8 2019, 2:12 PM
vlorentz updated this revision to Diff 3522.Feb 12 2019, 12:14 PM
  • rebase + squash
This revision was automatically updated to reflect the committed changes.