
winery: basic implementation of the backend
ClosedPublic

Authored by dachary on Dec 8 2021, 5:18 PM.

Details

Summary

Related to: T3533

Test Plan

Some of the tests require a live Ceph cluster to run, otherwise they will be skipped. Follow the instructions in winery-test-environment/README.md to set up this cluster and run the tests. It should produce the following coverage:

Name                                                                              Stmts   Miss Branch BrPart  Cover   Missing
.tox/py3/lib/python3.9/site-packages/swh/objstorage/backends/winery/__init__.py                     1      0      0      0   100%
.tox/py3/lib/python3.9/site-packages/swh/objstorage/backends/winery/database.py                    37      5      8      1    87%   63->72, 67-71
.tox/py3/lib/python3.9/site-packages/swh/objstorage/backends/winery/objstorage.py                  92      4     22      2    95%   67, 82, 106, 112
.tox/py3/lib/python3.9/site-packages/swh/objstorage/backends/winery/roshard.py                     78      5     10      1    93%   28, 37, 65-67
.tox/py3/lib/python3.9/site-packages/swh/objstorage/backends/winery/rwshard.py                     58      1     10      1    97%   74
.tox/py3/lib/python3.9/site-packages/swh/objstorage/backends/winery/sharedbase.py                 100      2     16      0    98%   88-89

Diff Detail

Repository
rDOBJS Object storage
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline


Just so we're clear: these ansible files are only used for setting up the test environment (local or fed4fire). Are you suggesting that part of the test environment be moved into the snippets repository? Or did you maybe think that these ansible files were not used for testing purposes?

Do you mean they are invoked via pytest?

No, they are to be run right before pytest to provision the Ceph cluster (either via libvirt or via fed4fire).

Hmm... alright, let's keep it in this repo. But move it to the root (ie. something like swh-objstorage/winery-benchmarks/ instead of swh-objstorage/swh/objstorage/tests/winery/), as the current location implies it is used from the Python code.

Sounds good?

Hmm... alright, let's keep it in this repo. But move it to the root (ie. something like swh-objstorage/winery-benchmarks/ instead of swh-objstorage/swh/objstorage/tests/winery/), as the current location implies it is used from the Python code.

Sounds good?

Absolutely, I'll do that :-)

  • move the winery test environment to the root of the repository, as suggested

Build is green

Patch application report for D6796 (id=24766)

Rebasing onto 8f992e8a20...

Current branch diff-target is up to date.
Changes applied before test
commit 546cf88ecf824bfe35b0bbeaec47d4b0751a1880
Author: Loïc Dachary <loic@dachary.org>
Date:   Fri Dec 10 17:58:49 2021 +0100

    winery: basic implementation
    
    Relates to: T3533

See https://jenkins.softwareheritage.org/job/DOBJS/job/tests-on-diff/95/ for more details.

  • avoid race condition when two processes try to create a table

Build is green

Patch application report for D6796 (id=24787)

Rebasing onto 8f992e8a20...

Current branch diff-target is up to date.
Changes applied before test
commit 52743ddc48096fdd39578544cc01f8d20cdff80f
Author: Loïc Dachary <loic@dachary.org>
Date:   Fri Dec 10 17:58:49 2021 +0100

    winery: basic implementation
    
    Relates to: T3533

See https://jenkins.softwareheritage.org/job/DOBJS/job/tests-on-diff/96/ for more details.

Build is green

Patch application report for D6796 (id=24859)

Rebasing onto 4238095a6c...

First, rewinding head to replay your work on top of it...
Applying: winery: first implementation
Changes applied before test
commit 13dd82948a6d4093a9ab314cf0c8067b332d0eab
Author: Loïc Dachary <loic@dachary.org>
Date:   Fri Dec 10 17:58:49 2021 +0100

    winery: first implementation
    
    Relates to: T3533

See https://jenkins.softwareheritage.org/job/DOBJS/job/tests-on-diff/97/ for more details.

requirements.txt
9

Could you use subprocess instead? The API may be a little less nice, but it saves us an extra dependency.

swh/objstorage/backends/winery/roshard.py
33

I don't think this works when $LANG is not an English locale. You should force the locale when calling it.
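For instance, something along these lines (just a rough sketch tying this to the subprocess suggestion above; the helper name is made up):

    # Rough sketch: run an external tool via subprocess with a forced locale,
    # so its output can be parsed regardless of the user's $LANG.
    import os
    import subprocess

    def run_tool(*args):
        env = dict(os.environ, LC_ALL="C")  # force a predictable locale
        return subprocess.run(
            args, env=env, capture_output=True, text=True, check=True
        ).stdout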

Looks like solid ground work, thanks!

On the code, I've left a few inline comments. I would appreciate having a high-level documentation of the architecture of the code, understanding what class does what. For instance, having a high level idea of the lifetime of shards and what bits of the code handle them would be useful, even if for now I think reading the code is fine.

One thing that's not explicit at all is that the WineryObjStorage class handles write operations on a single shard (and that "multi-shard parallelism" is extrinsic to that class), but it handles read operations on any shard. This could be an operational concern.

Generally, we don't capitalize the module-level logger object (compare mr grep LOGGER vs. mr grep logger in swh-environment).

(I haven't gone through reading the tests code yet)


I've tried to run the integration tests (using a local libvirt), and I've had a couple of issues, which I've eventually worked around.

I think the shell scripts need some hardening (set -e and only dropping machines if they're defined), but the most important issue is being able to set the QEMU connection string. For some reason my "qemu:///session" is unable to create bridges, but I have privileges to create bridges in "qemu:///system"; I've had to set -c in virsh calls, and --connection in virt-create.

The scripts depend on being run in the winery-test-environment dir, but I'm not sure that's explicit. Maybe just add a cd $(dirname $0) or some such.

I suggest installing mitogen and using ANSIBLE_STRATEGY=mitogen_linear to speed up the ansibling of hosts (it keeps persistent ssh connections and a single remote python interpreter to do its work, instead of spawning a script for each task). It seems to work just fine in my attempts.


(this is just a generic comment, I don't think anything in there is actionable at all)

And could you update the CI environment to run the full tests? Adding a line to https://forge.softwareheritage.org/source/swh-jenkins-dockerfiles/browse/master/base-buster/Dockerfile to install Ceph from the Debian repo should be enough.

I would if I could, but docker is not going to cut it. The test environment relies on libvirt and requires around 32GB of RAM. In addition it is prone to environmental failures during initialization. That is rather trivial to cope with when bootstrapping manually, but it would be a significant amount of work to make it reliable enough for the CI to be useful. Since 32GB of RAM, nested virtualization and environmental failures are difficult to solve, an alternative would be to run a kubernetes cluster (maybe with k3s) to reduce the memory footprint. It is possibly more stable, although I don't have actual experience setting this up, so it would also be a significant amount of work.

You could argue (rightfully) that this amount of work is worth investing because it will ensure there is no regression. However, whatever modification is done to the codebase that requires a Ceph cluster to verify also requires benchmarking to verify there is no performance regression.

I don't think I agree with that. Ideally we would want to pass most changes through regression integration tests (which would also contain "historical" data to ensure continued backwards compatibility), while benchmarking for performance regressions could "only" happen before releasing.

And such benchmarking would be *very* difficult to fit in the CI. Last but not least, the overhead of manually running the Ceph integration/performance tests is likely to be an order of magnitude less over the next five years than the overhead of making sure they can run from the CI without making so much noise (because of environmental failures) that it often requires manual intervention anyways.

We should assume that anything that must happen but is not automated will, eventually, not happen, in a way that's detrimental.

I expect that we will have a persistent "staging" ceph cluster against which we would be able to run these kinds of workloads (or be able to spawn a scratch cluster on kubernetes, or something).

mypy.ini
25–27

Is this really needed?

swh/objstorage/backends/winery/bench.py
1 ↗(On Diff #24859)

Could you consider s/bench/benchmark/ (at least in the module name)? I do wonder if it should move to the tests submodule too.

84 ↗(On Diff #24859)

This generates only contents that are neatly aligned on 1024 byte multiple boundaries. Maybe it would make sense to add some fuzz/randomness to these object sizes, to exercise some obviously-unaligned reads in the ceph cluster.
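For instance, something like this (illustrative only, the helper is made up):

    # Illustrative only: fuzz object sizes so they are never a 1024-byte multiple.
    import random

    def object_size(base_kib: int) -> int:
        # adding 1..1023 extra bytes guarantees an unaligned size
        return base_kib * 1024 + random.randint(1, 1023)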

swh/objstorage/backends/winery/database.py
19

It would be a better pattern to override only the dbname in the provided dsn.

It may be worth having a common method/context manager for opening an "admin connection" wrapping all of the common operations you're doing on all these methods (connecting, setting autocommit, getting a cursor, doing stuff, then closing the connection on __exit__).
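Something along these lines (a sketch assuming psycopg2, the names are made up):

    # Sketch of a reusable "admin connection" context manager.
    from contextlib import contextmanager

    import psycopg2

    @contextmanager
    def admin_cursor(dsn):
        db = psycopg2.connect(dsn)
        db.autocommit = True  # e.g. CREATE DATABASE cannot run inside a transaction
        try:
            yield db.cursor()
        finally:
            db.close()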

74
swh/objstorage/backends/winery/objstorage.py
60–61

I've ended up writing the same comment on the current version of this code, so I think there's a gap in the logic somewhere.

Here's the scenario which worries me:

A loader sends object A to be written; it's picked up by the objstorage worker for shard1 which inserts (A, shard1, inflight=true) in signature2shard, and swiftly crashes because of a bug. The loader in turn crashes.

The write for object A is attempted again by another loader; the objstorage worker for shard2 picks it up, but, AFAICT, fails phase 1 (because of the unique index on signature2shard(signature)). What happens then to object A?

swh/objstorage/backends/winery/roshard.py
21–25

Doesn't look like init has any side effect, so all calls can probably go to the __init__ function

62–100

I would be (much) more comfortable if these functions were only in a Pool subclass defined and used by tests.

swh/objstorage/backends/winery/sharedbase.py
29

What is the meaning of this magic number?

31–45

Probably worth using triple-quoted strings for SQL queries generally, but that's purely stylistic.

32–37

Do we really need an id and a name? Either way the name column should probably be made unique.

I would suggest merging both into a single uuid-based column (and reconstructing the database names from the shard uuid on the fly).

In any case the char type has some weird padding behavior with spaces, which is probably not what we want (even though it probably doesn't matter in practice).
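For instance (a sketch of the idea, the naming scheme is only an example):

    # Derive the shard database name on the fly from a single uuid column.
    import uuid

    def shard_dbname(shard_uuid: uuid.UUID) -> str:
        return "shard_" + shard_uuid.hex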

95

Seeing the way these booleans are handled, I think what @vlorentz had in mind was making the shard "state" a single non-null column containing a ternary enum (with values such as "writable", "packing", "readonly").
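Roughly something like this (illustrative DDL only, table and column names are guesses, not the actual schema):

    # A single ternary "state" column with a CHECK constraint, instead of two booleans.
    CREATE_SHARDS_TABLE = """
    CREATE TABLE shards (
        id BIGSERIAL PRIMARY KEY,
        name TEXT UNIQUE NOT NULL,
        state TEXT NOT NULL DEFAULT 'writable'
            CHECK (state IN ('writable', 'packing', 'readonly'))
    )
    """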

128–143

These return values are inconsistent, I assume we would want to make them all consistent (maybe using a dataclass for ShardInfo?)
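For example (a rough sketch, the field names are guesses):

    # A consistent return type for shard information.
    from dataclasses import dataclass

    @dataclass
    class ShardInfo:
        id: int
        name: str
        packing: bool
        readonly: bool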

winery-test-environment/build-vms.sh
1

Missing shebang, not everyone uses bash :P

100

On the code, I've left a few inline comments.

I've addressed them (hopefully) thank you.

I would appreciate having a high-level documentation of the architecture of the code, understanding what class does what. For instance, having a high level idea of the lifetime of shards and what bits of the code handle them would be useful, even if for now I think reading the code is fine.

I'll do that in the next few days. I intended to (documentation + types) but postponed that in case the review required a complete change in the architecture.

One thing that's not explicit at all is that the WineryObjStorage class handles write operations on a single shard (and that "multi-shard parallelism" is extrinsic to that class), but it handles read operations on any shard. This could be an operational concern.

The multi-shard parallelism for writes is part of swh-winery (in swh-environment/docker) and should be considered part of winery. It is currently almost identical to swh-objstorage but may evolve independently.

The asymmetry between write and read shards reflects the fact that there will always be a limited number of write shards, hence it won't be an issue to have one worker per write shard. And it is likely to make things simpler in the long run. Workers handling reads, however, must deal with an ever growing number of shards, which requires them to be able to reach any of them.

Generally, we don't capitalize the module-level logger object (compare mr grep LOGGER vs. mr grep logger in swh-environment).

Oh... I took the seaweedfs implementation as an inspiration but indeed... I'll revert that.

(I haven't gone through reading the tests code yet)


I've tried to run the integration tests (using a local libvirt), and I've had a couple of issues, which I've eventually worked around.

I think the shell scripts need some hardening (set -e

That's done, good idea.

and only dropping machines if they're defined),

I kept the brute force approach because it proved reliable, but made it ignore errors to avoid the disturbing noise.

but the most important issue is being able to set the QEMU connection string. For some reason my "qemu:///session" is unable to create bridges, but I have privileges to create bridges in "qemu:///system"; I've had to set -c in virsh calls, and --connection in virt-create.

I set the default to "qemu:///system" with a variable so that it is easier to override. It is indeed a common problem.

The scripts depend on being run in the winery-test-environment dir, but I'm not sure that's explicit. Maybe just add a cd $(dirname $0) or some such.

I did that too, good idea.

I suggest installing mitogen and using ANSIBLE_STRATEGY=mitogen_linear to speed up the ansibling of hosts (it keeps persistent ssh connections and a single remote python interpreter to do its work, instead of spawning a script for each task). It seems to work just fine in my attempts.

I'll leave that to the taste of the person running the script. I have mixed feelings about mitogen and fear that it could create environmental problems that end up costing more human time than it saves in computing time.


(this is just a generic comment, I don't think anything in there is actionable at all)

And could you update the CI environment to run the full tests? Adding a line to https://forge.softwareheritage.org/source/swh-jenkins-dockerfiles/browse/master/base-buster/Dockerfile to install Ceph from the Debian repo should be enough.

I would if I could, but docker is not going to cut it. The test environment relies on libvirt and requires around 32GB of RAM. In addition it is prone to environmental failures during initialization. That is rather trivial to cope with when bootstrapping manually, but it would be a significant amount of work to make it reliable enough for the CI to be useful. Since 32GB of RAM, nested virtualization and environmental failures are difficult to solve, an alternative would be to run a kubernetes cluster (maybe with k3s) to reduce the memory footprint. It is possibly more stable, although I don't have actual experience setting this up, so it would also be a significant amount of work.

You could argue (rightfully) that this amount of work is worth investing because it will ensure there is no regression. However, whatever modification is done to the codebase that requires a Ceph cluster to verify also requires benchmarking to verify there is no performance regression.

I don't think I agree with that. Ideally we would want to pass most changes through regression integration tests (which would also contain "historical" data to ensure continued backwards compatibility), while benchmarking for performance regressions could "only" happen before releasing.

And such benchmarking would be *very* difficult to fit in the CI. Last but not least, the overhead of manually running the Ceph integration/performance tests is likely to be an order of magnitude less over the next five years than the overhead of making sure they can run from the CI without making so much noise (because of environmental failures) that it often requires manual intervention anyways.

We should assume that anything that must happen but is not automated will, eventually, not happen, in a way that's detrimental.

I expect that we will have a persistent "staging" ceph cluster against which we would be able to run these kinds of workloads (or be able to spawn a scratch cluster on kubernetes, or something).

I think we can agree on one thing: there is a need and whatever the solution it is not trivial to implement, unfortunately.

requirements.txt
9

The upside of using sh rather than subprocess is also that it takes significantly less code to do the same thing. But if you think this advantage is not worth the extra dependency, I'll be happy to do the extra work.

swh/objstorage/backends/winery/bench.py
1 ↗(On Diff #24859)

Agreed, it was moved and renamed to winery_benchmark.py.

winery-test-environment/build-vms.sh
1

Good catch, added.

100

Added.

mypy.ini
25–27

No. My bad, removed.

swh/objstorage/backends/winery/bench.py
84 ↗(On Diff #24859)

Good point. I added +1 to all of them, which makes them all unaligned. Neatly aligned objects are the exception anyway.

swh/objstorage/backends/winery/database.py
19

Done in a context manager, more DRY is good. Also changed the connect as suggested.

74

Good catch, fixed.

swh/objstorage/backends/winery/objstorage.py
60–61

The idea is that SharedBase.add_phase_1 either

  • creates the row or,
  • if there is a unique violation, returns the shard id for the inflight object or,
  • if the object is no longer in flight returns none

Which allows it to pick up where the crashed worker left off.

However, reading the code again, I saw there was a bug in how the unique violation was handled and fixed it.

That being said, the tests still do not cover corner cases such as this one, and they are what I intend to work on next. They are tricky to get right and I wanted to wait for the first wave of review before getting to them. I think I'm ready now :-)
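In pseudo-code, the intended behaviour is roughly the following (a simplified sketch, not the actual implementation; table and column names follow the discussion above):

    def add_phase_1(cursor, signature, shard):
        # Try to claim the object for this shard; on conflict, look up who owns it.
        cursor.execute(
            "INSERT INTO signature2shard (signature, shard, inflight) "
            "VALUES (%s, %s, TRUE) ON CONFLICT (signature) DO NOTHING",
            (signature, shard),
        )
        if cursor.rowcount == 1:
            return shard  # we created the row and own the write
        cursor.execute(
            "SELECT shard, inflight FROM signature2shard WHERE signature = %s",
            (signature,),
        )
        other_shard, inflight = cursor.fetchone()
        # None means the object is no longer in flight, i.e. already written.
        return other_shard if inflight else None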

swh/objstorage/backends/winery/roshard.py
21–25

Done.

62–100

Moved to the PoolHelper class and to the test directory. Dangerous indeed...

swh/objstorage/backends/winery/sharedbase.py
29

There is no meaning, it just has to be unique for the lock to work. I added a comment to clarify that.
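For context, assuming the number is used as a PostgreSQL advisory lock key, the pattern is roughly (a simplified sketch, not the actual code):

    # The constant has no meaning by itself; it only needs to be unique to this
    # application so that concurrent processes contend on the same lock.
    WINERY_LOCK_KEY = 314116  # arbitrary, application-chosen value

    def lock_shared_base(cursor):
        cursor.execute("SELECT pg_advisory_lock(%s)", (WINERY_LOCK_KEY,))

    def unlock_shared_base(cursor):
        cursor.execute("SELECT pg_advisory_unlock(%s)", (WINERY_LOCK_KEY,))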

31–45

Done!

32–37

Do we really need an id and a name?

The id matches the shard field in the signature2shard table below to save space. I'll make the name field unique. There will be a small number of rows in this table; is your recommendation to merge the fields about simplicity or about saving space?

95

I understood what @vlorentz had in mind but thought having two fields made the whole thing a little clearer.

128–143

This function was used exclusively for testing and was indeed confusing because of the inconsistency. I moved it to the test directory for clarity.

  • address olasd & vlorentz comments

Build has FAILED

Patch application report for D6796 (id=24888)

Rebasing onto 4238095a6c...

First, rewinding head to replay your work on top of it...
Applying: winery: first implementation
Changes applied before test
commit 84e0f76596bacae7404b83dacb3fa254cc981bf4
Author: Loïc Dachary <loic@dachary.org>
Date:   Fri Dec 10 17:58:49 2021 +0100

    winery: first implementation
    
    Relates to: T3533

Link to build: https://jenkins.softwareheritage.org/job/DOBJS/job/tests-on-diff/98/
See console output for more information: https://jenkins.softwareheritage.org/job/DOBJS/job/tests-on-diff/98/console

+[mypy-swh.perfecthash.*]
+ignore_missing_imports = True

is actually necessary for jenkins to be happy. Not sure why it passes on my machine, that needs to be investigated.

00:50:33  mypy run-test-pre: PYTHONHASHSEED='1622736210'
00:50:33  mypy run-test: commands[0] | mypy swh
00:50:39  swh/objstorage/backends/winery/roshard.py:10: error: Skipping analyzing "swh.perfecthash": module is installed, but missing library stubs or py.typed marker
00:50:39  swh/objstorage/backends/winery/roshard.py:10: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
00:50:47  Found 1 error in 1 file (checked 54 source files)

Build is green

Patch application report for D6796 (id=24889)

Rebasing onto 4238095a6c...

First, rewinding head to replay your work on top of it...
Applying: winery: first implementation
Changes applied before test
commit d2cc4f3b384f001610e91fcbac9a84a75defc900
Author: Loïc Dachary <loic@dachary.org>
Date:   Fri Dec 10 17:58:49 2021 +0100

    winery: first implementation
    
    Relates to: T3533

See https://jenkins.softwareheritage.org/job/DOBJS/job/tests-on-diff/99/ for more details.

Build is green

Patch application report for D6796 (id=24890)

Rebasing onto 4238095a6c...

First, rewinding head to replay your work on top of it...
Applying: winery: first implementation
Changes applied before test
commit 63c2945249a8454f587e4b27ca2503399dfaad4c
Author: Loïc Dachary <loic@dachary.org>
Date:   Fri Dec 10 17:58:49 2021 +0100

    winery: first implementation
    
    Relates to: T3533

See https://jenkins.softwareheritage.org/job/DOBJS/job/tests-on-diff/100/ for more details.

  • Reworked the Winery class hierarchy to separate reader and writer

Build is green

Patch application report for D6796 (id=24893)

Rebasing onto 4238095a6c...

First, rewinding head to replay your work on top of it...
Applying: winery: first implementation
Changes applied before test
commit 9331629aa8ed2039d8ec973621bbab0d4aba0d46
Author: Loïc Dachary <loic@dachary.org>
Date:   Fri Dec 10 17:58:49 2021 +0100

    winery: first implementation
    
    Relates to: T3533

See https://jenkins.softwareheritage.org/job/DOBJS/job/tests-on-diff/101/ for more details.

  • add implementation notes to the documentation

Build is green

Patch application report for D6796 (id=24905)

Rebasing onto 4238095a6c...

First, rewinding head to replay your work on top of it...
Applying: winery: first implementation
Changes applied before test
commit d4cf25677a8d8ed02b3ba1a3014150b0de40b033
Author: Loïc Dachary <loic@dachary.org>
Date:   Fri Dec 10 17:58:49 2021 +0100

    winery: first implementation
    
    Relates to: T3533

See https://jenkins.softwareheritage.org/job/DOBJS/job/tests-on-diff/102/ for more details.

  • Update misleading comment in the check function

Build is green

Patch application report for D6796 (id=24906)

Rebasing onto 4238095a6c...

First, rewinding head to replay your work on top of it...
Applying: winery: first implementation
Changes applied before test
commit da80b03b8155340a4580d15eac428bbff412ac2a
Author: Loïc Dachary <loic@dachary.org>
Date:   Fri Dec 10 17:58:49 2021 +0100

    winery: first implementation
    
    Relates to: T3533

See https://jenkins.softwareheritage.org/job/DOBJS/job/tests-on-diff/103/ for more details.

(accepting the diff as myself so I don't block it while on vacation, as you addressed my own comments)

swh/objstorage/backends/winery/sharedbase.py
95

I'm not a huge fan of having two booleans to encode a 3-state state machine, btw. But it's acceptable as long as you add a CHECK as I mentioned in another comment.

Alright, let's accept it so you aren't stuck for another week.

This revision is now accepted and ready to land. Dec 28 2021, 1:39 AM
docs/winery.rst
11

because this isn't markdown (ditto below)

Hum, it looks like I forgot to submit two replies a week ago or so, doing that now.

@vlorentz thanks for the additional review and the kind thought :-) Although I am blocked, I have other things to keep me busy and I fully expected there would be delays this time of year.

docs/winery.rst
8

@olasd the high level description that you suggested to add is here.

swh/objstorage/backends/winery/objstorage.py
19

@olasd the WineryObjStorage is split into four classes for clarity.

Build is green

Patch application report for D6796 (id=25008)

Rebasing onto 4238095a6c...

Current branch diff-target is up to date.
Changes applied before test
commit 9a88cafbc72cea6accbf32194b7b2ba2cd58fb31
Author: Loïc Dachary <loic@dachary.org>
Date:   Fri Dec 10 17:58:49 2021 +0100

    winery: first implementation
    
    Relates to: T3533

See https://jenkins.softwareheritage.org/job/DOBJS/job/tests-on-diff/104/ for more details.

This revision was automatically updated to reflect the committed changes.