Page MenuHomeSoftware Heritage

retry: Make content_add endpoints maximize content writes to storage
AbandonedPublic

Authored by ardumont on Apr 10 2020, 3:50 PM.

Details

Summary

In effect, allows to write all but the colliding contents to the storage. This
is the same behavior currently existing in the journal replayer (D2800). This
shares the behavior within the retry proxy as this is a common need, for
example in loaders as well (D2973).

And the colliding hashes are stored in the log in a formatted way so we do not
lose that information.

Related to D2800#66985
Related to D2973#72387

Note:
It could go in an entirely different proxy. That would maximize composition
through configuration and not entangle the retry and that new behavior (which
is somehow a retry, thus why it's proposed here in the first place ;)

Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Branch
collision-aware-content-add
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 11837
Build 17955: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 17954: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D3012 (id=10681)

Rebasing onto ddac3d27e3...

Current branch diff-target is up to date.
Changes applied before test
commit 15c1caa21862d5f7973a37599e338c4e7186a40f
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Fri Apr 10 12:03:04 2020 +0200

    retry: Make content_add endpoints maximize content writes
    
    In effect, allow to write all but the colliding contents to the storage. This
    is the same behavior currently existing in the journal replayer. This shares
    the behavior within the retry proxy as this is a common need, for example in
    loaders.
    
    And the colliding hashes are stored in the log in a formatted way so we do not
    lose that information.
    
    Related to D2800
    Related to D2973

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/89/ for more details.

vlorentz added a subscriber: vlorentz.

Nice. I always felt this was needed, but never actually did it.

You significantly changed the behavior in case of hash collisions, as the exceptions no longer bubble up. Is this intentional, and did you discuss it with the others?

swh/storage/retry.py
95

Use a collections.Counter, so you can just do: global_summary += summary

96–114

Should add a check to not loop indefinitely if there's a bug in the storage

102–108

What about a namedtuple or a dataclass instead of a dict?

117

This is more idiomatic:

logger.error("Collision detected: %(collision)s", collision=collision)
152–164

Just make collision_aware_content_add take an iterable and convert to list there

swh/storage/tests/test_retry.py
76

Could you duplicate this test, to check what happens if you insert [cont, cont2, cont3, cont2]?

123

You should check the length of caplog.records, it will give a better error

151–152

nit: 1st & 2nd tries go ko

229–230

same

This revision now requires changes to proceed.Apr 14 2020, 11:32 AM

Build is green

Patch application report for D3012 (id=10709)

Rebasing onto e5e594358b...

Current branch diff-target is up to date.
Changes applied before test
commit cdd8fac799a29fe5370063a47fcb6ae8e9558eb3
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Fri Apr 10 12:03:04 2020 +0200

    retry: Make content_add endpoints maximize content writes
    
    In effect, allow to write all but the colliding contents to the storage. This
    is the same behavior currently existing in the journal replayer. This shares
    the behavior within the retry proxy as this is a common need, for example in
    loaders.
    
    And the colliding hashes are stored in the log in a formatted way so we do not
    lose that information.
    
    Related to D2800
    Related to D2973

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/94/ for more details.

You significantly changed the behavior in case of hash collisions, as the exceptions no longer bubble up.
Is this intentional,

Well, yes and no.

The exception is written through logs now so it's not lost (thus why it's tested with caplog).

I'm not completely sure it should be in the retry proxy, thus why i proposed another one dedicated to this (in the main diff description).

and did you discuss it with the others?

Yes, the proposition appeared in the replayer D2800.
Now the rest of the discussion can happen here :D

swh/storage/retry.py
95

neat ;)

102–108

What's the pros for this change?

(it's local to the function and it's clear as it is, i'm kinda lazy right now ;)

117
        logger.error("Houston, we have a %s", "major problem", exc_info=1)
        """
        if self.isEnabledFor(ERROR):
>           self._log(ERROR, msg, args, **kwargs)
E           TypeError: _log() got an unexpected keyword argument 'collision'

What is fine is:

logger.error("Collision detected: %(collision)s", dict(collision=collision))
swh/storage/tests/test_retry.py
76

sure

123

Except i've no idea how long the length should be.
I'm only interested in the collision one.

  • Rebase on latest master
  • Adapt according to review

TODO:

  • Add another test on hash collision scenario

Build has FAILED

Patch application report for D3012 (id=10716)

Rebasing onto 2cc263da97...

Current branch diff-target is up to date.
Changes applied before test
commit bb2799a312d3583ba4e3f45188b570db6e327942
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Fri Apr 10 12:03:04 2020 +0200

    retry: Make content_add endpoints maximize content writes
    
    In effect, allow to write all but the colliding contents to the storage. This
    is the same behavior currently existing in the journal replayer. This shares
    the behavior within the retry proxy as this is a common need, for example in
    loaders.
    
    And the colliding hashes are stored in the log in a formatted way so we do not
    lose that information.
    
    Related to D2800
    Related to D2973

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/96/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/96/console

Build is green

Patch application report for D3012 (id=10717)

Rebasing onto 2cc263da97...

Current branch diff-target is up to date.
Changes applied before test
commit 039de11c1fdbda64e58424fa9214dd478e683ae2
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Fri Apr 10 12:03:04 2020 +0200

    retry: Make content_add endpoints maximize content writes
    
    In effect, allow to write all but the colliding contents to the storage. This
    is the same behavior currently existing in the journal replayer. This shares
    the behavior within the retry proxy as this is a common need, for example in
    loaders.
    
    And the colliding hashes are stored in the log in a formatted way so we do not
    lose that information.
    
    Related to D2800
    Related to D2973

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/97/ for more details.

Build is green

Patch application report for D3012 (id=10718)

Rebasing onto 2cc263da97...

Current branch diff-target is up to date.
Changes applied before test
commit 4cee13fc8d08afbe250eb20a2c99fd0f2ecbbabd
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Fri Apr 10 12:03:04 2020 +0200

    retry: Make content_add endpoints maximize content writes
    
    In effect, allow to write all but the colliding contents to the storage. This
    is the same behavior currently existing in the journal replayer. This shares
    the behavior within the retry proxy as this is a common need, for example in
    loaders.
    
    And the colliding hashes are stored in the log in a formatted way so we do not
    lose that information.
    
    Related to D2800
    Related to D2973

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/98/ for more details.

olasd requested changes to this revision.EditedApr 14 2020, 12:59 PM
olasd added a subscriber: olasd.

You significantly changed the behavior in case of hash collisions, as the exceptions no longer bubble up.
Is this intentional,

Well, yes and no.

The exception is written through logs now so it's not lost (thus why it's tested with caplog).

I'm not completely sure it should be in the retry proxy, thus why i proposed another one dedicated to this (in the main diff description).

and did you discuss it with the others?

Yes, the proposition appeared in the replayer D2800.
Now the rest of the discussion can happen here :D

This diff is trying to conflate two different cases, with two different tradeoffs, where we need to handle colliding contents. The choice it makes is one of these tradeoffs, but it introduces a critical issue in the other.

(c/p from the argument I wrote in D2973)

Replayers need to handle colliding contents in the journal, which we know happens in two situations:

  • before the compaction, when two identical contents have been added with a different ctime (which is a bug that's now been fixed and should happen less often)
  • after compaction, when there's actually two colliding contents in the journal (that have been rejected when being added by postgres)

In both of these cases, we want the replayer to _keep on working_, to get as close a copy as possible of the main archive in the mirror.

Loaders need to write consistent data to the storage, from the snapshot object down. Handling colliding contents in loaders is only incidental, because we're starting to have loaders which write an enormous amount of data, and where the probability of loading colliding contents is 1. For these loaders, we want to "maximize" the size of the partial snapshots loaded. But handling these colliding contents must not in any way result in inconsistent, hard-to-recover-from data being loaded into the archive.

The choice made in this diff is to make the "log hash collisions, keep on going without error" behavior of the journal replayer the default. If loaders start using this, which they will as they all have the retry proxy in their storage pipeline, this will break the integrity of the archive, and lose data forever!

At this point, I'm not even sure that I agree that this behavior needs to be lifted in the storage (considering that the journal replayer will be integrated in swh.storage soon). If we decide to do so, it absolutely needs to be opt-in (a new add_content_metadata_ignoring_hash_collisions endpoint, with a big fat warning in the documentation). The default behavior of content_add/content_add_metadata needs to stay a hard failure, and specific users can opt-in to the lossy behavior if they've made this choice explicitly.

This revision now requires changes to proceed.Apr 14 2020, 12:59 PM