
replayer: Filter out colliding contents when replaying
Closed, Public

Authored by ardumont on Mar 10 2020, 3:48 PM.

Details

Summary

Now that we have the details of the colliding contents [1], we can be less
lossy when replaying contents to another storage.

Prior to this commit, we would not write the batch of contents to the
destination if a collision was detected [2]. Now we filter the colliding
contents out of the batch, if any, and write the rest, effectively not
writing the colliding hashes to the destination storage.

We still have to decide what to do with the colliding contents... In the
meantime, we still have the logs referencing the colliding contents (and Sentry [1]).

Note that this could be a shared behavior, in a retrying proxy for example?
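
In sketch form, the idea is roughly the following (a sketch only: HashCollision comes from swh.storage.exc, but the colliding_content_hashes() accessor and the helper name are assumptions):

from typing import List

from swh.model.model import BaseContent
from swh.storage.exc import HashCollision


def add_without_collisions(storage, contents: List[BaseContent]) -> None:
    """Write a batch of contents, leaving any colliding ones out."""
    try:
        storage.content_add([c.to_dict() for c in contents])
    except HashCollision as e:
        colliding = e.colliding_content_hashes()  # assumed accessor
        # keep the colliding contents out, write the rest
        rest = [c for c in contents if c.hashes() not in colliding]
        storage.content_add([c.to_dict() for c in rest])

(As the review below points out, a single retry like this breaks when one batch contains several independent collisions.)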

[1] https://sentry.softwareheritage.org/share/issue/db05509e5a9d40dc8fe782be90f2f712/

[2] Related to D2777

Test Plan

tox

Diff Detail

Repository
rDJNL Journal infrastructure
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 11056
Build 16656: tox-on-jenkins (Jenkins)
Build 16655: arc lint + arc unit

Event Timeline

olasd requested changes to this revision. Mar 10 2020, 4:25 PM

Interesting approach.

I think the collision case should be split out of the main test:

  • This would allow testing for the retry behavior explicitly
  • This would allow checking our ability to parse the logs and fish out colliding object ids for later processing

In general, I think we should make sure that the replay problems we're circumventing emit logs that can be machine parsed and handled after the fact (and that the format of these logs is tested).

swh/journal/replay.py
239

Not a fan of that name; collision_aware_content_add?

250–258

I think this should really be a while loop, starting with an empty set of colliding_contents, a full list of contents, then winnowing that down until:

  • either the function returns without exception
  • or the set of colliding contents doesn't grow for a given number of retries (2 retries?)

The current implementation breaks if you have two pairs of colliding contents in the same batch, which doesn't sound out of the question, as people might have added several sets of colliding PDFs to a given repo.

You can also avoid the conversion function argument, by converting objects to BaseContents outside of that call, then unconditionally calling to_dict when calling content_add_fn inside this function.
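
A sketch of that loop (hedged: the HashCollision accessor is an assumption, and the stop condition is expressed as "no new collision learned" rather than a fixed retry count):

from typing import Callable, List

from swh.model.model import BaseContent
from swh.storage.exc import HashCollision


def collision_aware_content_add(
    content_add_fn: Callable[[List[dict]], None],
    contents: List[BaseContent],
) -> None:
    """Add contents, winnowing colliding ones out until the add succeeds."""
    colliding_hashes: List[dict] = []
    while contents:
        try:
            content_add_fn([c.to_dict() for c in contents])
            return  # the function returned without exception
        except HashCollision as e:
            new_collisions = e.colliding_content_hashes()  # assumed accessor
            if all(h in colliding_hashes for h in new_collisions):
                raise  # the set of collisions stopped growing: give up
            colliding_hashes.extend(new_collisions)
            # winnow the batch down and try again
            contents = [
                c for c in contents if c.hashes() not in colliding_hashes
            ]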

269–281

I think we should collapse the two for loops into a single one building two lists (see the sketch after the list), which allows us to:

  • not call "content_add" on empty lists
  • call BaseContent.from_dict only once for each object
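
A sketch of the collapsed loop (`objects`, `storage` and collision_aware_content_add stand in for the surrounding replay code; routing SkippedContent objects to storage.skipped_content_add is an assumption about the split):

from swh.model.model import BaseContent, SkippedContent

contents, skipped_contents = [], []
for object_ in objects:  # the batch of raw content dicts being replayed
    converted = BaseContent.from_dict(object_)  # converted only once
    if isinstance(converted, SkippedContent):
        skipped_contents.append(converted)
    else:
        contents.append(converted)

# never call the add functions on empty lists
if contents:
    collision_aware_content_add(storage.content_add, contents)
if skipped_contents:
    collision_aware_content_add(storage.skipped_content_add, skipped_contents)
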
This revision now requires changes to proceed. Mar 10 2020, 4:25 PM

I also can't help but wonder if we should push that behavior to the storage instead of reimplementing it here.

swh/journal/replay.py
239

ok

250–258

I think this should really be a while loop, starting with an empty set of colliding_contents, a full list of ...

ok, sounds better indeed.

The current implementation breaks if you have two pairs of colliding contents in the same batch, which doesn't sound out of the question, as people might have added several sets of colliding PDFs to a given repo.

Indeed. I missed that.
I entertained the idea of the function calling itself with the content subset in the except clause at first...
That would have caught the problem.

Meh, I prefer your proposal ;)

You can also avoid the conversion function argument, by converting objects to BaseContents outside of that call, then unconditionally calling to_dict when calling content_add_fn inside this function.

I tried and it does not work.
I did not check further than that, but I think it's the storage's 'validate' proxy configuration.
And that's something we cannot have partially; it's an all-or-nothing situation.
Meaning we can't have it for release and not for content, for example.

I also can't help but wonder if we should push that behavior to the storage instead of reimplementing it here.

The DRY principle would agree.

But I would not know exactly where to put it (in the 3 backend implementations, in the retry proxy, or in yet another one?)

(IMO, it's somehow a 'retry', which is why I'm proposing that proxy in particular.)

swh/journal/replay.py
250–258

I tried and it does not work.
I did not check further than that, but I think it's the storage's 'validate' proxy configuration.
And that's something we cannot have partially; it's an all-or-nothing situation.
Meaning we can't have it for release and not for content, for example.

I'm not sure why the following wouldn't work:

  • In the _insert_objects for loop, call BaseContent.from_dict (only when object_type == 'content')
  • In the collision_aware_content_add function, when calling out to storage, pass the objects through .to_dict()
  • In the exception handler, you can call .hashes() to match the colliding objects directly

At some point, when the validate filter gets dropped, we can just drop the .to_dict() call, and the conversion function shouldn't be needed anymore.
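
For reference, the round trip being discussed, with dummy hash values (a sketch of the swh.model calls named above):

from swh.model.model import BaseContent

# hypothetical journal message for a content (dummy hashes)
content_d = {
    'sha1': b'\x01' * 20,
    'sha1_git': b'\x02' * 20,
    'sha256': b'\x03' * 32,
    'blake2s256': b'\x04' * 32,
    'length': 3,
    'status': 'visible',
}

content = BaseContent.from_dict(content_d)  # model object, converted up front
hashes = content.hashes()    # hash dict, for matching collisions directly
as_dict = content.to_dict()  # plain dict again, for the storage call
# once the validate filter is gone, the to_dict() step can be dropped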

swh/journal/replay.py
250–258

In the collision_aware_content_add function, when calling out to storage, pass the objects through .to_dict()

Well, yes, I did not add that extra conversion indeed...
I found it rather convoluted, even though it is the correct way to make the thing work (converting back and forth).

But ok, let's do that; it's not happening that often and it's "temporary".
As a next step, I'll look at removing the validate filter; I think the journal part is the last one using it.

swh/journal/replay.py
250–258

You can also avoid the conversion function argument, by converting objects to BaseContents outside of that call, then unconditionally calling to_dict when calling content_add_fn inside this function.

Ok, I meant that I tried to do the conversion outside the loop,
without unconditionally calling to_dict within the collision_aware_content_add function...
And that could not work, hence my reply from yesterday.

Doing exactly what you say should work, but I was wary of that much back and forth in the first place...

Adapt according to review:

  • Split test with collision from main test
  • Rename content_add to collision_aware_content_add
  • Log formatted colliding contents and check logs in test
  • Rework collision_aware_content_add implementation

TODO:

  • collapse the 2 for-comprehensions

collapse the 2 for-comprehensions

Update test scenarios' docstring to clarify test intent

swh/journal/tests/test_replay.py
235

That should be done before the log messages are emitted, not that late. I guess this works because capturing error messages is the default.

240

Could you also check that the hashes are available in the log arguments?
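
A sketch of the test side with pytest's caplog fixture (the replay call, logger name, and message text are placeholders):

import logging

def test_content_collision_is_logged(caplog):
    # set the capture up *before* the replay emits its error messages
    caplog.set_level(logging.ERROR, logger='swh.journal.replay')

    run_replay_with_colliding_contents()  # hypothetical trigger

    records = [r for r in caplog.records
               if 'collision' in r.getMessage().lower()]
    assert records, 'expected at least one collision log entry'
    for record in records:
        # the colliding hashes should travel in the log arguments
        assert 'hashes' in record.args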

swh/journal/tests/test_replay.py
235

Right.

240

Right, currently doing it.

  • Add hashes checks
  • Set up caplog fixture prior to replay
  • Rebase on latest master
olasd requested changes to this revision. Mar 13 2020, 2:16 PM

Sorry for having you go another round, but I've just thought of another problem :/

swh/journal/replay.py
259

So, now that I've given it a thorough think, I'm worried about the "cardinality" of this data:

When logging in production, we "flatten" the logging.foo() arguments, as journald only supports flat key-value pairs.

The schema you've implemented for this data means that, for each collision, we'll pass the following k/v pairs to the logging framework:

swh_hashes_{algo}-{colliding_id}_0_sha1=<sha1 of content 0>
swh_hashes_{algo}-{colliding_id}_0_sha1_git=<sha1_git of content 0>
swh_hashes_{algo}-{colliding_id}_0_sha256=<sha256 of content 0>
swh_hashes_{algo}-{colliding_id}_0_blake2s256=<blake2s256 of content 0>
swh_hashes_{algo}-{colliding_id}_1_sha1=<sha1 of content 1>
swh_hashes_{algo}-{colliding_id}_1_sha1_git=<sha1_git of content 1>
swh_hashes_{algo}-{colliding_id}_1_sha256=<sha256 of content 1>
swh_hashes_{algo}-{colliding_id}_1_blake2s256=<blake2s256 of content 1>

We're going to create 8 new fields per collision in our logging infra, which is not great (for instance, elasticsearch doesn't let any given index have more than 1000 fields). It also makes queries harder (because the names of the fields are often fixed in queries).

I think we should make the algorithm and the colliding hash value part of the logged values rather than part of the key.

All in all, I suggest turning colliding_content_hashes into a collisions list, with values following the schema:

{
   'algorithm': algo,
   'hash': hash_id,
   'objects': colliding_hashes,
}

and logging a different error for each item in the collisions list; each error will generate the following key/values:

swh_collision_algorithm=<algorithm>
swh_collision_hash=<colliding hash>
swh_collision_objects_0_sha1=<sha1 of content 0>
swh_collision_objects_0_sha1_git=<sha1_git of content 0>
swh_collision_objects_0_sha256=<sha256 of content 0>
swh_collision_objects_0_blake2s256=<blake2s256 of content 0>
swh_collision_objects_1_sha1=<sha1 of content 1>
swh_collision_objects_1_sha1_git=<sha1_git of content 1>
swh_collision_objects_1_sha256=<sha256 of content 1>
swh_collision_objects_1_blake2s256=<blake2s256 of content 1>
[... if we're unlucky, more objects go here ...]

This makes the set of logging keys constant, which avoids making our logging framework blow up, and allows us to do meaningful queries on the data.

We also need to encode all values as strings, as I don't think binary data is passed quite properly through the full logging stack...
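
A sketch of that logging (hash_to_hex is swh.model.hashutil's; the message text and the collisions structure follow the schema suggested above):

import logging

from swh.model.hashutil import hash_to_hex

logger = logging.getLogger(__name__)


def log_collisions(collisions):
    # one record per collision: the flattened key names stay constant
    # whatever the colliding algorithm or hash values are
    for collision in collisions:
        logger.error('Collision detected: %(collision)s', {
            'collision': {
                'algorithm': collision['algorithm'],
                'hash': hash_to_hex(collision['hash']),
                # hex-encode everything: binary values do not survive
                # the full logging stack
                'objects': [
                    {algo: hash_to_hex(value) for algo, value in obj.items()}
                    for obj in collision['objects']
                ],
            },
        })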

267–269

Following the previous comment, this should turn into a for loop. You should also rename the key to collision instead of hashes, which would give the generated k/v pairs more meaning.

swh/journal/tests/test_replay.py
362–370

I guess changing the schema to fixed keys will make this test a little bit simpler as well.

This revision now requires changes to proceed. Mar 13 2020, 2:16 PM
  • Rework collision log message
This revision is now accepted and ready to land. Mar 13 2020, 3:08 PM