Page MenuHomeSoftware Heritage

Add a new TenaciousProxyStorage
ClosedPublic

Authored by douardda on Jun 22 2020, 5:10 PM.

Details

Summary

This proxy storage attempt to add buckets of objects, but in case of failure,
it splits the bucket in parts so every valid object in the bucket get a chance
to be inserted.

Also provides an error rate-limiting feature.

:Warning: the behavior of this proxy when the rate limit threshold has been
reach is not ok and needs to be discussed and adapted before landing.

Diff Detail

Event Timeline

Build has FAILED

Patch application report for D3334 (id=11815)

Could not rebase; Attempt merge onto 2d497ff07c...

Auto-merging swh/storage/tests/algos/test_origin.py
Merge made by the 'recursive' strategy.
 mypy.ini                               |   3 +
 requirements.txt                       |   1 +
 swh/storage/__init__.py                |   3 +
 swh/storage/cassandra/storage.py       |  17 +-
 swh/storage/in_memory.py               |  15 +-
 swh/storage/interface.py               |   7 +-
 swh/storage/storage.py                 |  38 +++--
 swh/storage/tenacious.py               | 124 ++++++++++++++
 swh/storage/tests/algos/test_origin.py |   7 +-
 swh/storage/tests/test_storage.py      |  83 ++++++----
 swh/storage/tests/test_tenacious.py    | 293 +++++++++++++++++++++++++++++++++
 swh/storage/validate.py                |   7 +-
 12 files changed, 531 insertions(+), 67 deletions(-)
 create mode 100644 swh/storage/tenacious.py
 create mode 100644 swh/storage/tests/test_tenacious.py
Changes applied before test
commit 438ccc7c3b4a37efe5afc39d1f695382d0af6b95
Merge: 2d497ff0 3f7075e8
Author: Jenkins user <jenkins@localhost>
Date:   Mon Jun 22 15:19:41 2020 +0000

    Merge branch 'diff-target' into HEAD

commit 3f7075e8a07e1d54b951b76b3575141f5a1c3bb7
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 18 18:41:24 2020 +0200

    Add a new TenaciousProxyStorage
    
    This proxy storage attempt to add buckets of objects, but in case of failure,
    it splits the bucket in parts so every valid object in the bucket get a chance
    to be inserted.
    
    Also provides an error rate-limiting feature.

commit 4f156595f7894a6e346fed1e5f5f3262ea3ffbab
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon Jun 22 11:27:54 2020 +0200

    Deprecate the origin_add_one() endpoint
    
    This endpoint is not really useful since the origin_add() can be used
    instead.
    
    Using a single API endpoint would also make the API a bit more consistant
    (most other endpoints only provide a xxx_add endpoint) ; having a single
    endpoint per object_type make is enough and make the whole API simpler.

commit f6aef1c74cb30d9e890ee538baf1010157542967
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 18 18:28:51 2020 +0200

    Make Storage.add_origin() return a sumary dict
    
    make it consistent with other add_xxx methods by making it return a
    summary dict `{"origin:add": int}`.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/333/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/333/console

performance nitpick: you're using to_add as a stack, but with the top at the beginning, so every operation copies the whole list. You should flip that.

Stopping after 10 failures over 1000 insertions seems a bit harsh. IMO it should only stop if there is something obviously wrong with the storage, so more something of the order of 10 failures over 50 or 100 insertions

swh/storage/tenacious.py
44

If

57–60

no more args:

62–63

could you document these two numbers? And I think total would be better named window_size or something of the sort.

81

_tenacious_add

93–96

It should raise an exception, else the client will think it went fine and will proceed to adding objects referencing it, so it will create holes in the graph.

104–106

I think it should be an error (so it appears in sentry), and contain the exception's name and args.

108

nice.

121–124

you're reimplementing collections.Counter. Instead, you can make results a Counter and call results.update()

Build has FAILED

Patch application report for D3334 (id=11842)

Could not rebase; Attempt merge onto 2d497ff07c...

Updating 2d497ff0..73f9b659
Fast-forward
 mypy.ini                               |   3 +
 requirements.txt                       |   1 +
 swh/storage/__init__.py                |   3 +
 swh/storage/cassandra/storage.py       |  19 ++-
 swh/storage/in_memory.py               |  15 +-
 swh/storage/interface.py               |   7 +-
 swh/storage/storage.py                 |  38 +++--
 swh/storage/tenacious.py               | 124 ++++++++++++++
 swh/storage/tests/algos/test_origin.py |   7 +-
 swh/storage/tests/test_storage.py      |  83 ++++++----
 swh/storage/tests/test_tenacious.py    | 293 +++++++++++++++++++++++++++++++++
 swh/storage/validate.py                |   7 +-
 12 files changed, 532 insertions(+), 68 deletions(-)
 create mode 100644 swh/storage/tenacious.py
 create mode 100644 swh/storage/tests/test_tenacious.py
Changes applied before test
commit 73f9b6595362f848d002c2f6c4db45c7b5cd15e6
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 18 18:41:24 2020 +0200

    Add a new TenaciousProxyStorage
    
    This proxy storage attempt to add buckets of objects, but in case of failure,
    it splits the bucket in parts so every valid object in the bucket get a chance
    to be inserted.
    
    Also provides an error rate-limiting feature.

commit 621fc8d3779535656e15499464c44acecd7eb8bf
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon Jun 22 11:27:54 2020 +0200

    Deprecate the origin_add_one() endpoint
    
    This endpoint is not really useful since the origin_add() can be used
    instead.
    
    Using a single API endpoint would also make the API a bit more consistant
    (most other endpoints only provide a xxx_add endpoint) ; having a single
    endpoint per object_type make is enough and make the whole API simpler.

commit fb603e1b9874a3bbb393f7653ec5004bfa18b251
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 18 18:28:51 2020 +0200

    Make Storage.add_origin() return a sumary dict
    
    make it consistent with other add_xxx methods by making it return a
    summary dict `{"origin:add": int}`.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/339/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/339/console

This revision now requires changes to proceed.Sep 10 2020, 9:17 AM

@douardda ping?

still note sure if it's a good idea or not...

Rebased and updated to current HEAD

revived because it may be useful for the mirror graph replayer.

Build was aborted

Patch application report for D3334 (id=20007)

Rebasing onto 2c477ec442...

Current branch diff-target is up to date.
Changes applied before test
commit 59f78ce383ede20b742bf81eb6d96017f8490999
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 18 18:41:24 2020 +0200

    Add a new TenaciousProxyStorage
    
    This proxy storage attempt to add buckets of objects, but in case of failure,
    it splits the bucket in parts so every valid object in the bucket get a chance
    to be inserted.
    
    Also provides an error rate-limiting feature.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1271/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1271/console

rebase and improve logging messages

Build is green

Patch application report for D3334 (id=20242)

Rebasing onto 051b771523...

Current branch diff-target is up to date.
Changes applied before test
commit d528cf95e67551603c58c865f301abd681dd92d3
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 18 18:41:24 2020 +0200

    Add a new TenaciousProxyStorage
    
    This proxy storage attempt to add buckets of objects, but in case of failure,
    it splits the bucket in parts so every valid object in the bucket get a chance
    to be inserted.
    
    Also provides an error rate-limiting feature.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1299/ for more details.

swh/storage/tenacious.py
104–106

no because the problematic object insertion will be attempted again (but in a smaller batch), and in most of the cases, this error is a transient error (typically a concurrency related one), so the object will most probably be properly inserted right after this situation occurs. When we really fail at inserting the object (i.e. when the batch size has fallen to 1), then we log the exception.

121–124

good thinking, thx

Answer (most) vlorentz comments/remark

  • use 'window_size' instead of 'total'
  • raise a RuntimeError in case of error rate limit reached
  • use a Counter object to gather statistics
  • rename tenacious_add() as _tenacious_add()
  • fix the docstring of the class

Build is green

Patch application report for D3334 (id=20254)

Rebasing onto 051b771523...

Current branch diff-target is up to date.
Changes applied before test
commit cd455ab8478a137a9f045955dac2d4e8e0ad92c2
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 18 18:41:24 2020 +0200

    Add a new TenaciousProxyStorage
    
    This proxy storage attempt to add buckets of objects, but in case of failure,
    it splits the bucket in parts so every valid object in the bucket get a chance
    to be inserted.
    
    Also provides an error rate-limiting feature.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1300/ for more details.

Thanks. One last thing though

swh/storage/tenacious.py
104–106

ok

144–147

it's dead code now

This revision is now accepted and ready to land.May 4 2021, 4:59 PM
swh/storage/tenacious.py
144–147

jeeez, thx!

remove dead code and improve a bit the commit message

Build is green

Patch application report for D3334 (id=20284)

Rebasing onto 051b771523...

Current branch diff-target is up to date.
Changes applied before test
commit ffb38f71d9b634cea6c6bc7d7d3baf5076919396
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jun 18 18:41:24 2020 +0200

    Add a new TenaciousProxyStorage
    
    This proxy storage attempt to add buckets of objects, but in case of failure,
    it splits the bucket in parts so every valid object in the bucket get a chance
    to be inserted.
    
    Also provides an error rate-limiting feature.
    
    This proxy storage is mainly dedicated to help mirrorring an archive
    using the replayer stack.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1305/ for more details.

This revision was automatically updated to reflect the committed changes.