Page MenuHomeSoftware Heritage

storage.retry: Implement remaining `_add_endpoint` with retry policy
ClosedPublic

Authored by ardumont on Jan 11 2020, 12:37 PM.

Details

Reviewers
vlorentz
Group Reviewers
Reviewers
Commits
rDSTOC8dcac2b0c3a7: retry: Implement content_add_metadata endpoint with retry policy
rDSTOCaa588c995442: retry: Migrate to tenacity
rDSTOCdf3f33fb6c9b: storage.retry: Implement release_add with retry policy
rDSTOC2b7d770ac9cc: storage.retry: Implement snapshot_add with retry policy
rDSTOC4aa4d79dc8ef: test_retry: Improve and align consistently assertion checks
rDSTOC54890f722c7a: storage.retry: Implement revision_add with retry policy
rDSTOC2dd578c7ab9e: storage.retry: Implement origin_visit_update with retry policy
rDSTOCdddb6d91e704: in_memory: Make directory_get_random return None when storage empty
rDSTOCa8efa9568d8d: storage.retry: Implement directory_add with retry policy
rDSTOC32c460cb5c78: storage.retry: Implement origin_metadata_add endpoint with retry policy
rDSTOC08f2f384b2fd: storage.retry: Implement tool_add endpoint with retry policy
rDSTOCfe6440ee825f: storage.retry: Implement origin_visit_add endpoint with retry policy
rDSTOC3cf7adb89c6a: storage.retry: Implement metadata_provider_add endpoint with retry policy
rDSTOC351b9777c065: storage.retry: Implement origin_add_one endpoint with retry policy
rDSTO8dcac2b0c3a7: retry: Implement content_add_metadata endpoint with retry policy
rDSTO2b7d770ac9cc: storage.retry: Implement snapshot_add with retry policy
rDSTO4aa4d79dc8ef: test_retry: Improve and align consistently assertion checks
rDSTOaa588c995442: retry: Migrate to tenacity
rDSTO54890f722c7a: storage.retry: Implement revision_add with retry policy
rDSTOdf3f33fb6c9b: storage.retry: Implement release_add with retry policy
rDSTOa8efa9568d8d: storage.retry: Implement directory_add with retry policy
rDSTOdddb6d91e704: in_memory: Make directory_get_random return None when storage empty
rDSTO2dd578c7ab9e: storage.retry: Implement origin_visit_update with retry policy
rDSTO32c460cb5c78: storage.retry: Implement origin_metadata_add endpoint with retry policy
rDSTO3cf7adb89c6a: storage.retry: Implement metadata_provider_add endpoint with retry policy
rDSTO08f2f384b2fd: storage.retry: Implement tool_add endpoint with retry policy
rDSTOfe6440ee825f: storage.retry: Implement origin_visit_add endpoint with retry policy
rDSTO351b9777c065: storage.retry: Implement origin_add_one endpoint with retry policy
Summary

This will allow to align the loader package's implementations with the
loader-core's regarding retry policy (no retry policy on the package ones).

Also, as a next step, this will allow again to simplify the loader-core's
implementation.

This should help in fixing or at least decreasing the frequency of [1] (which happens for all package loaders, not only cran's).

endpoints:

  • origin_visit_add
  • origin_metadata_add
  • origin_visit_update
  • directory_add
  • revision_add
  • release_add
  • snapshot_add
  • tool_add
  • metadata_provider_add

Note:
The diff is a tad big because of tests.

[1] https://sentry.softwareheritage.org/share/issue/d4f1208b7eec4b43b11e38494ff039cc/

Related to D2511 (which only added content_add)

Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 10074
Build 14919: tox-on-jenkinsJenkins
Build 14918: arc lint + arc unit

Event Timeline

from irc discussion [1], it appears, this is not needed.
With the current state of affairs, only the content_add endpoint requires this.
So closing the task

[1]

10:56 <+olasd> are we sure we really want to retry adding all object types? the only object type where we know there are transient failures that are worth retrying are contents. I believe a failure to add anything else is a symptom of a deeper issue, that won't be fixed by retrying three times
11:03 <+pinkieval> I said that too, but ardumont had a compelling answer which I forgot
11:04 <+pinkieval> ah yes, it's that we're already doing the retries in loader-core
11:04 <+pinkieval> (and I assumed there was a reason we were doing this in loader-core in the first place)
11:08 <douardda> olasd: thinking about this, not sure why it could not be the case for other objects than content?
11:09 <douardda> I mean if 2 workers load 2 forks of the same git repo they may very well (attempt to) create the exact same objects
11:10 <douardda> for any entity types other than snapshots and so
11:10 <douardda> even if the collision window for these smaller objects is narrower
11:11 <douardda> how sure are we of this statement "the only object type where we know there are transient failures that are worth retrying are contents"?
11:32 <+olasd> douardda: the addition of the other kinds of objects is properly idempotent as they have a single primary key; contents are special and have multiple unique columns which makes idempotency harder
11:33 <+olasd> postgres knows how to serialize the addition of all other kinds of objects between transactions (because of the single unique primary key), not contents
11:33 <+ardumont> good, that means, i can close 2518 then
11:39 <+olasd> douardda: that's what vlorentz tried to address with D2248 (creating a read dependency to force the serialization of the transactions) but I'm not quite sure of the full implications of it
11:39 -- Notice(swhbot): D2248 (author: vlorentz, Needs Review) on swh-storage: [WIP] In case of race condition in content_add, raise SerializationFailure instead of HashCollision. <https://forge.softwareheritage.org/D2248>

Further discussion revealed we need this in the end:

12:00 <+ardumont> > double check that the retry behavior: well, i was going for dropping code, then staging + sentry initially but docker-dev should be enough at first
12:09 <+olasd> ardumont: I think the current retry behavior logs something, doesn't it?
12:10 <+ardumont> ah yes, it logs a warning
12:10 <+olasd> could be worth a poke at kibana
12:10 <+ardumont> i looked-up only retry which is too generic
12:10 <+ardumont> (in kibana)
12:10 <+ardumont> Retry loading a batch
12:10 <+ardumont> is better indeed
12:10 <+olasd> aha
12:18 <+ardumont> it's used, i see integrity errors on directory_add and revision_add (for the person adding step for example)
12:20 <+ardumont> (trying to paste a link is harder than it feels apparently)
12:21 <+ardumont> http://kibana0.internal.softwareheritage.org:5601/goto/8449adafcc24d8b89c3e71a0fa8c9c67
12:25 <+ardumont> so makes 2518 needed after all, at least on revision and directory
12:26 <+ardumont> i guess for the release_add endpoints as well as person can be created through the author field as well
13:07 <+olasd> blerg, ok; I guess we can generalize the retries then
13:08 <+olasd> but we should probably add a random backoff, because right now it looks like they're just failing three times in a row
13:08 <+ardumont> right, i'll check retrying's doc and adapt accordingly
13:08 <+olasd> thanks for checking
13:10 <+olasd> before adding the backoff, we might want to consider tenacity as a retrying replacement, which is more expressive in the way it does retries (and has the good taste of being maintained upstream)
13:11 <+ardumont> ack
13:12 <+ardumont> looking up tenacity
13:13 <+olasd> something like `@retry(stop=stop_after_attempt(5), wait=wait_fixed(2) + wait_random(0, 3))` (in tenacity)
13:13 <+olasd> or wait_random_exponential(multiplier=1, max=20)

Rebase on latest master and:

  • retry: Migrate to tenacity
  • retry: Implement content_add_metadata endpoint with retry policy
swh/storage/retry.py
38

Could be written as reduce...

r = reduce(
    lambda x, y: x | retry_if_exception_type(y),
    RETRY_EXCEPTIONS[1:],
    retry_if_exception_type(RETRY_EXCEPTIONS[0])
)
retry = r(error)

or even:

for exc in RETRY_EXCEPTIONS:
    if retry_if_exception_type(exc)(error):
        error_name = error.__module__ + '.' + error.__class__.__name__
        logger.warning('Retry adding a batch', exc_info=False, extra={
            'swh_type': 'storage_retry',
            'swh_exception_type': error_name,
            'swh_exception': traceback.format_exc(),
        })
        return True
return False

Simplify should_retry_adding function

Fix typo (auto-indent played game with me and i missed it ;)

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/storage/in_memory.py
371–372

No, self._contents items are supposed to always have a ctime.

swh/storage/retry.py
27–28

I don't think it needs such a specific comment. Just "when the server is restarting" is enough.

65–123

To much duplication.

my_retry = retry(retry=should_retry_adding,
           wait=wait_random_exponential(multiplier=1, max=10),
           stop=stop_after_attempt(3))

class RetryingProxyStorage:
    # ...

    @my_retry
    def content_add(self, content: List[Dict]) -> Dict:
        # ...

    @my_retry
    def content_add_metadata(self, content: List[Dict]) -> Dict:
        # ...
swh/storage/tests/conftest.py
219

Wrong format. data.cont3 has a data field and is missing a ctime.

This revision now requires changes to proceed.Jan 14 2020, 1:20 PM
swh/storage/in_memory.py
371

This is for content_add_metadata tests.
I do not understand:

  • why we remove this in the first place though.
  • or why there is content without ctime if it's mandatory...
swh/storage/in_memory.py
371

why we remove this in the first place though.

because that's what the postgresql storage did. But it's not a very good reason.

or why there is content without ctime if it's mandatory...

because the in-mem storage doesn't check it's provided when using content_metadata_add

swh/storage/in_memory.py
371

because that's what the postgresql storage did. But it's not a very good reason.

i meant the initial reason...

because the in-mem storage doesn't check it's provided when using content_metadata_add

it should be fixed then (in another diff).
i checked back (again) the pg schema and there is a not null constraint there.

so there discrepancy which is annoying T.T

swh/storage/retry.py
27–28

ack

65–123

indeed i was gonna address it at some point.
should have prior to update the diff maybe.

Thanks, working on the adaptations mentioned ;)

  • retry: Remove duplication in decorator
  • retry: Simplify exception comment
  • test_retry: Fix test content data
ardumont added inline comments.
swh/storage/tests/conftest.py
219

it does now, thx for the heads up.

This revision is now accepted and ready to land.Jan 14 2020, 2:24 PM
ardumont marked an inline comment as done.

Rebase and plug to master

This revision was automatically updated to reflect the committed changes.