Page MenuHomeSoftware Heritage

replayer: Filter out colliding contents when replaying
ClosedPublic

Authored by ardumont on Mar 10 2020, 3:48 PM.

Details

Summary

Now that we have the colliding contents details [1], we can be less lossy when
replaying contents to another storage.

Prior to this commit, we would not write the batch of contents to the
destination if a collision was detected [2]. Now we keep those out of
the loop if any. Effectively, not writing the colliding hashes to the destination
storage.

We still have to decide what to do for the colliding contents... For that, we
still have the logs referencing the colliding contents though (and sentry [1]).

Note that could be a shared behavior, in a retrying proxy for example?

[1] https://sentry.softwareheritage.org/share/issue/db05509e5a9d40dc8fe782be90f2f712/

[2] Related to D2777

Test Plan

tox

Diff Detail

Repository
rDJNL Journal infrastructure
Branch
filter-out-colliding-contents
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 11061
Build 16664: tox-on-jenkinsJenkins
Build 16663: arc lint + arc unit

Event Timeline

olasd requested changes to this revision.Mar 10 2020, 4:25 PM

Interesting approach.

I think the test should be separated from the main test:

  • This would allow testing for the retry behavior explicitly
  • This would allow checking our ability to parse the logs and fish out colliding object ids for later processing

In general, I think we should make sure that the replay problems we're circumventing emit logs that can be machine parsed and handled after the fact (and that the format of these logs is tested).

swh/journal/replay.py
239

Not a fan of that name; collision_aware_content_add?

250–258

I think this should really be a while loop, starting with an empty set of colliding_contents, a full list of contents, then winnowing that down until:

  • either the function returns without exception
  • or the set of colliding contents doesn't grow for a given number of retries (2 retries ?)

The current implementation breaks if you have two pairs of colliding contents in the same batch, which doesn't sound out of question as people might have added several sets of colliding pdfs to a given repo.

You can also avoid the conversion function argument, by converting objects to BaseContents outside of that call, then unconditionally calling to_dict when calling content_add_fn inside this function.

286–287

I think we should collapse the two for loops in a single one building two lists, which allows us to:

  • not call "content_add" on empty lists
  • call BaseContent.from_dict only once for each object
This revision now requires changes to proceed.Mar 10 2020, 4:25 PM

I also can't help but wonder if we should push that behavior to the storage instead of reimplementing it here.

swh/journal/replay.py
239

ok

250–258

I think this should really be a while loop, starting with an empty set of colliding_contents, a full list of ...

ok, sounds better indeed.

The current implementation breaks if you have two pairs of colliding contents in the same batch, which doesn't sound out of question as people might have added several sets of colliding pdfs to a given repo.

Indeed. I missed that.
I entertained the idea of the function calling itself with the content subset in the except clause at first...
That would have caught the problem.

meh, i prefer your proposal ;)

You can also avoid the conversion function argument, by converting objects to BaseContents outside of that call, then unconditionally calling to_dict when calling content_add_fn inside this function.

I tried and it does not work.
I did not check further than that but i think it's the storage's'validate' proxy configuration.
And that's something we cannot have partially, it's a all or nothing situation.
Meaning we can't have it for release and not for content for example.

I also can't help but wonder if we should push that behavior to the storage instead of reimplementing it here.

DRY principle would agree.

But i would not know exactly where to put that (in the 3 backend implems or in the retry proxy or in yet another one?)

(IMO, It's somehow a 'retry', thus why i'm proposing that proxy in particular).

swh/journal/replay.py
250–258

I tried and it does not work.
I did not check further than that but i think it's the storage's'validate' proxy configuration.
And that's something we cannot have partially, it's a all or nothing situation.
Meaning we can't have it for release and not for content for example.

I'm not sure why the following wouldn't work:

  • In the _insert_objects for loop, call BaseContent.from_dict (only when object_type == 'content')
  • In the collision_aware_content_add function, when calling out to storage, pass the objects through .as_dict()
  • In the exception handler, you can call .hashes() to match the colliding objects directly

At some point, when the validate filter gets dropped, we can just drop the .as_dict() call. And the conversion function shouldn't be needed anymore.

swh/journal/replay.py
250–258

In the collision_aware_content_add function, when calling out to storage, pass the objects through .as_dict()

well, yes, i did not add that extra conversion indeed...
I found this rather convoluted even though the correct way to make the thing work (convert back and forth).

But ok, let's do that, it's not happening that often and it's "temporary".
As a next step, i'll look at removing the validate filter, i think the journal part is the last one using it.

swh/journal/replay.py
250–258

You can also avoid the conversion function argument, by converting objects to BaseContents outside of that call, then unconditionally calling to_dict when calling content_add_fn inside this function.

Ok, i meant i tried to do the conversion outside the loop.
Without the unconditionnally calling to_dict within the function collision_aware_content_add...
And that could not work, thus my reply from yesterday.

Doing exactly what you say should work but i was wary of that many back and forth in the first place...

Adapt according to review:

  • Split test with collision from main test
  • Rename content_add to collision_aware_content_add
  • Log formatted colliding contents and check logs in test
  • Rework collision_aware_content_add implementation

TODO:

  • collapse the 2 for-comprehensions

collapse the 2 for-comprehensions

Update test scenarios' docstring to clarify test intent

swh/journal/tests/test_replay.py
250

That should be done before the log messages are emitted, not that late. I guess this works because capturing error messages is the default.

255

Could you also check that the hashes are available in the log arguments?

swh/journal/tests/test_replay.py
250

Right.

255

Right, currently doing it.

  • Add hashes checks
  • Set up caplog fixture prior to replay
  • Rebase on latest master
olasd requested changes to this revision.Mar 13 2020, 2:16 PM

Sorry for having you go another round, but I've just thought of another problem :/

swh/journal/replay.py
259

So, now that I've given it a thorough think, I'm worried about the "cardinality" of this data:

When logging in production, we "flatten" the logging.foo() arguments, as journald only supports flat key-value pairs.

The schema you've implemented for this data means that, for each collision, we'll pass the following k/v pairs to the logging framework:

swh_hashes_{algo}-{colliding_id}_0_sha1=<sha1 of content 0>
swh_hashes_{algo}-{colliding_id}_0_sha1_git=<sha1_git of content 0>
swh_hashes_{algo}-{colliding_id}_0_sha256=<sha256 of content 0>
swh_hashes_{algo}-{colliding_id}_0_blake2s256=<blake2s256 of content 0>
swh_hashes_{algo}-{colliding_id}_1_sha1=<sha1 of content 1>
swh_hashes_{algo}-{colliding_id}_1_sha1_git=<sha1_git of content 1>
swh_hashes_{algo}-{colliding_id}_1_sha256=<sha256 of content 1>
swh_hashes_{algo}-{colliding_id}_1_blake2s256=<blake2s256 of content 1>

We're going create 8 new fields per collision in our logging infra, which is not great (for instance, elasticsearch doesn't let any given index have more than 1000 fields). It also makes queries harder (because often, the names of the fields are fixed in queries).

I think we should make the algorithm and the colliding hash values rather than a part of the key.

All in all, I suggest turning colliding_content_hashes into a collisions list, with values following the schema:

{
   'algorithm': algo,
   'hash': hash_id,
   'objects': colliding_hashes,
}

and logging a different error for each item in the collision list; Each error will generate the following key/values:

swh_collision_algorithm=<algorithm>
swh_collision_hash=<colliding hash>
swh_collision_objects_0_sha1=<sha1 of content 0>
swh_collision_objects_0_sha1_git=<sha1_git of content 0>
swh_collision_objects_0_sha256=<sha256 of content 0>
swh_collision_objects_0_blake2s256=<blake2s256 of content 0>
swh_collision_objects_1_sha1=<sha1 of content 1>
swh_collision_objects_1_sha1_git=<sha1_git of content 1>
swh_collision_objects_1_sha256=<sha256 of content 1>
swh_collision_objects_1_blake2s256=<blake2s256 of content 1>
[... if we're unlucky, more objects go here ...]

This makes the set of logging keys constant, which avoids making our logging framework blow up, and allows us to do meaningful queries on the data.

We also need to encode all values as strings, as I don't think binary data is passed quite properly through the full logging stack...

267–269

Following the previous comment, this should turn in a for loop. You should rename the key collision instead of hashes which would give the generated k/v pairs more meaning.

swh/journal/tests/test_replay.py
377–385

I guess changing the schema to fixed keys will make this test a little bit simpler as well

This revision now requires changes to proceed.Mar 13 2020, 2:16 PM
  • Rework collision log message
This revision is now accepted and ready to land.Mar 13 2020, 3:08 PM