Page MenuHomeSoftware Heritage

journal.replay: Batch insert contents/skipped_contents in storage backend
ClosedPublic

Authored by ardumont on Mar 6 2020, 10:48 AM.

Diff Detail

Repository
rDJNL Journal infrastructure
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 10949
Build 16480: tox-on-jenkinsJenkins
Build 16479: arc lint + arc unit

Event Timeline

swh/journal/replay.py
249

Do we add a fallback behavior to ensure other objects from the failing transaction are written nonetheless?

Sounds sensible to do so (in the mirror context).

Maybe we should evolve the hash collision exception to refeerence the content in error.
That'd make it easier to fix client side the behavior if we had the culprit immediately in sight.

swh/journal/replay.py
249

Maybe we should evolve the hash collision exception to refeerence the content in error.

Right now we have HashCollision 'type' of hash collision as message (e.g. sha1)... [1]
That's not helpful enough.

[1] https://sentry.softwareheritage.org/share/issue/5a6f8c09fea1468997bb88ca3e19fc2d/

swh/journal/replay.py
235–236

are these vars necessary?

Why not simply use (with proper exception handling stuff):

storage.skipped_content_add(
              (obj for obj in objects if obj.get('status') == 'absent'))
storage.content_add_metadata(
              (obj for obj in objects if obj.get('status') != 'absent'))

ok (besides my remark).

This revision is now accepted and ready to land.Mar 6 2020, 11:54 AM
swh/journal/replay.py
235–236

Depends if we want to act upon my question below or not i guess ;)

swh/journal/replay.py
235–236

Well one thing at a time. Let's deal with this as is, then discuss how to improve cases of ingestion error...

As a first step, no need for further check so no need for temporary variables