Page MenuHomeSoftware Heritage

journal.replay: Inline `_fix_origin_visit` for loop in insert_object
ClosedPublic

Authored by ardumont on Mar 17 2020, 12:12 PM.

Details

Summary

This aligns with what's done for the other object types.

Also:

  • Add missing test cases around existing scenario
  • Clarify intent about stopping the replayer when an old origin_visit format is

detected.

Test Plan

tox

Diff Detail

Repository
rDJNL Journal infrastructure
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

swh/journal/tests/test_replay.py
343–349

should go away.

365

I wanted to use the caplog stanza here.
But cannot really as those tests use mocked replayer...

So, I'll keep that improvment for later.

visits = [visit, visit2]
caplog.set_level(logging.ERROR, 'swh.journal.replay')
_test_write_replay_origin_visit(visits, with_skipped_visits=True)

nb_visits_skipped = 0
skipped_visits = []
for record in caplog.records:
    logtext = record.getMessage()
    if 'Old origin visit format detected:' in logtext:
        nb_visits_skipped += 1
        skipped_visits.append(record.args['visit'])

assert nb_visits_skipped == len(visits)
for visit in visits:
    assert visit in skipped_visits

Remove unused caplog fixture

vlorentz requested changes to this revision.EditedMar 17 2020, 2:01 PM
vlorentz added a subscriber: vlorentz.

This aligns the behavior with other methods (skipped
unfit revision, skipped colliding hash, etc...).

Except we shouldn't align it like this. Those methods are skipped because they were not added in the existing storage. This diff would skip visits that are already in it, which is not acceptable.

The correct fixes for this issue are either

  1. to overwrite the old messages in Kafka (with a run of the backfiller)
  2. convert the old format to the new when reading
This revision now requires changes to proceed.Mar 17 2020, 2:01 PM

Except we shouldn't align it like this.

ok?
^ I'm confused ;)

Those methods are skipped because they were not added in the existing storage.

I do not understand.

This diff would skip visits that are already in it, which is not acceptable.

We no longer have visits without type, don't we?

The correct fixes for this issue are either

  • to overwrite the old messages in Kafka (with a run of the backfiller)

who does such run?

  • convert the old format to the new when reading

except we do not have the origin_visit's type here.

Except we shouldn't align it like this.

ok?
^ I'm confused ;)

Those methods are skipped because they were not added in the existing storage.

I do not understand.

Those objects, I mean. Invalid revisions were refused by the pg storage after being pushed to Kafka.

While the origin visits you are proposing to skip were accepted by the pg storage.

This diff would skip visits that are already in it, which is not acceptable.

We no longer have visits without type, don't we?

We don't, but they were created before having a type and were migrated with an SQL script, without the update being sent to Kafka.

The correct fixes for this issue are either

  • to overwrite the old messages in Kafka (with a run of the backfiller)

who does such run?

You, olasd, me, ...

  • convert the old format to the new when reading

except we do not have the origin_visit's type here.

Indeed. So the correct fix is to overwrite them in Kafka.

Those objects, I mean. Invalid revisions were refused by the pg storage after being pushed to Kafka.

ok, so they should not be there in the first place. Thus the backfill you want to run again.

While the origin visits you are proposing to skip were accepted by the pg storage.

I'm not changing anything?
The replayer now skips the same bad origin_visit as before but instead of failing the process, the replay continues on doing its job.

Or maybe, please, could you pinpoint the code you say is faulty?

swh/journal/replay.py
388

Is that what you think is wrong?

Just send all origins from the origin_visit and not only the one that are converted correctly?

I'm not changing anything?
The replayer now skips the same bad origin_visit as before but instead of failing the process, the replay continues on doing its job.

They are not bad visits. They are valid visits in a wrong format. If we skip them, then a replayed storage would be missing these visits that the current storage has.
It is indeed bad that the replay crashes, but dropping data we already have is worse.

Or maybe, please, could you pinpoint the code you say is faulty?

It's not the code, it's the behavior you want

Those objects, I mean. Invalid revisions were refused by the pg storage after being pushed to Kafka.

ok, so they should not be there in the first place. Thus the backfill you want to run again.

The backfill operation only adds new objects, or reformats existing objects. It doesn't remove faulty objects that shouldn't have been here in the first place.

While the origin visits you are proposing to skip were accepted by the pg storage.

I'm not changing anything?

I can confirm that the logic seems almost identical; the only change is that, before this change, the replayer would crash if _fix_origin_visit returned None, instead of just carrying on with the valid origins.

Now, I think that @vlorentz's issue (and mine as well) is that this crash shouldn't have happened in the first place : the entries in the origin_visit topic with the old, type-less schema, should have been rewritten by a pass of the backfiller, a while ago.

How did the issue get noticed? Do you have a partition/offset reference of a crashy message?

(for instance, as we've changed the format of the keys of the origin_visit topic a few times, it's possible that objects which used the old key schema haven't been compacted away)

In D2839#68284, @olasd wrote:

The backfill operation only adds new objects, or reformats existing objects. It doesn't remove faulty objects that shouldn't have been here in the first place.

Indeed, that's what I meant by overwrite.

that the logic seems almost identical; the only change is that, before this change, the replayer would crash if _fix_origin_visit returned None, instead of just carrying on with the valid origins.

Now, I think that @vlorentz's issue (and mine as well) is that this crash shouldn't have happened in the first place : the entries in the origin_visit topic with the old, type-less schema, should have been rewritten by a pass of the backfiller, a while ago.

How did the issue get noticed? Do you have a partition/offset reference of a crashy message?

Good point. I don't remember seeing this issue with my replayers (but I don't monitor them very closely)

Clarify intent about stopping the replayer when an old origin_visit format is
detected.

ardumont retitled this revision from journal.replay: Log and skip old origin_visit format instead of failing to journal.replay: Align fix_origin_visit method with others.Mar 17 2020, 6:39 PM
ardumont edited the summary of this revision. (Show Details)
ardumont edited the summary of this revision. (Show Details)

I like this much better!

Could you change the title of the diff and commit accordingly?

swh/journal/replay.py
275–290

You don't need pytest in doctests, see https://docs.python.org/3/library/doctest.html#what-about-exceptions

(And it makes sense not to use pytest, as doctests are also documentation examples not just tests)

This revision now requires changes to proceed.Mar 18 2020, 10:38 AM
swh/journal/replay.py
275–290

Ok, is there a doctest configuration file where i could put
https://docs.python.org/3/library/doctest.html#doctest.IGNORE_EXCEPTION_DETAIL

because i don't want to leak the traceback which could change over time (i see 3.7 in my current setup).
And i know test for the debian run both for 3.8 and 3.7 so that won't work...

276     >>> _fix_origin_visit({
UNEXPECTED EXCEPTION: ValueError("Old origin visit format detected: {'origin': {'url': 'http://foo'}, 'date': datetime.datetime(2020, 2, 27, 14, 39, 19, tzinfo=datetime.timezone.utc), 'status': 'ongoing', 'snapshot': None}")
Traceback (most recent call last):
  File "/usr/lib/python3.7/doctest.py", line 1329, in __run
    compileflags, 1), test.globs)
  File "<doctest swh.journal.replay._fix_origin_visit[5]>", line 5, in <module>
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-journal/.tox/py3/lib/python3.7/site-packages/swh/journal/replay.py", line 305, in _fix_origin_visit
    raise ValueError(f'Old origin visit format detected: {visit}')
ValueError: Old origin visit format detected: {'origin': {'url': 'http://foo'}, 'date': datetime.datetime(2020, 2, 27, 14, 39, 19, tzinfo=datetime.timezone.utc), 'status': 'ongoing', 'snapshot': None}

I'd like to only set the following as output:

ValueError: Old origin visit format detected: {'origin': {'url': 'http://foo'}, 'date': datetime.datetime(2020, 2, 27, 14, 39, 19, tzinfo=datetime.timezone.utc), 'status': 'ongoing', 'snapshot': None}

I like this much better!

awesome ;)

Could you change the title of the diff and commit accordingly?

Like what, i thought i did already even though i admit* it's not that good.

swh/journal/replay.py
275–290

ah but the documentation i target seems to explicit it already...

-o IGNORE_EXCEPTION_DETAIL

let's check ;)

Like what, i thought i did already even though i admin it's not that good.

it still says "Align fix origin_visit endpoint with others", while it should mention raising an error (earlier)

swh/journal/replay.py
275–290
swh/journal/replay.py
275–290

let's check

it did not work... or i misused it...

tox.ini (tryout):

commands =
   pytest --cov={envsitepackagesdir}/swh/journal \
         {envsitepackagesdir}/swh/journal \
         --cov-branch \
         --doctest-modules -o doctest.IGNORE_EXCEPTION_DETAIL=1 {posargs}

meh, i'll do like in your sample, thx

What do you mean? ...

I must be missing the Traceback... that the output starts with in your sample.
And i have an error about that...

Well, i finally made it work...

>>> _fix_origin_visit({
...     'origin': 'http://foo',
...     'date': date,
...     'status': 'ongoing',
...     'snapshot': None
... })
Traceback (most recent call last):
...
ValueError: Old origin visit format detected...

There is a lot of noise though...
I just wanted to mention the exception and that's it.

Anyway, thx a lot!

journal.replay: Make _fix_origin_visit raise earlier

Also:

  • unify with other _fix methods to work on one origin_visit (and not a list)
  • Add test cases around such scenario
  • Clarify intent about stopping the replayer when an old origin_visit format is

detected.

ardumont retitled this revision from journal.replay: Align fix_origin_visit method with others to journal.replay: Make _fix_origin_visit raise earlier.Mar 18 2020, 1:36 PM
ardumont edited the summary of this revision. (Show Details)
  • unify with other _fix methods to work on one origin_visit (and not a list)

what? it wasn't working on a list before this commit

swh/journal/tests/test_replay.py
358–364

For consistency, you should renumber test_write_replay_legacy_origin_visit2 and test_write_replay_legacy_origin_visit3, and add a new test_write_replay_legacy_origin_visit2 test for this case

This revision now requires changes to proceed.Mar 18 2020, 2:09 PM
swh/journal/tests/test_replay.py
358–364

or maybe renumber differently, depending on the chronology. I don't see where the format you're introducing fits in the timeline

what? it wasn't working on a list before this commit

hmm yeah, that's true!
i forgot i already pushed that and that it's another commit.

For the test order, no i won't change it.
In the end, nothing changed much here.
I just inlined a for loop to avoid one iteration (to build directly the origins at the same time as visits)
I added missing test scenario the code already did.
And i explicited what was implicit in the code.

I'll try to improve the commit message though.

ardumont edited the summary of this revision. (Show Details)

Fix commit message

ardumont retitled this revision from journal.replay: Make _fix_origin_visit raise earlier to journal.replay: Inline `_fix_origin_visit` for loop in insert_object.Mar 18 2020, 4:40 PM
ardumont edited the summary of this revision. (Show Details)
ardumont added a project: Journal.
This revision is now accepted and ready to land.Mar 18 2020, 7:44 PM