
Add winery backend
Open, Normal, Public

Description

Assuming there exists an object storage as described in T3054, there needs to be an object storage backend to store and retrieve objects. The name is winery (using the "winery namespace" agreed on the mailing list last week). The implementation is a replacement of the object storage at the bottom of the global architecture.

The backend runs in a *sgi app. At boot time it:

  • Gets an exclusive lock on a Shard in the Write Storage (creating a new one or resuming an existing one)
  • Gets a connection to the Write Storage and the global index
  • Registers for throttling purposes (probably a table in the database)
  • Serves both reads and writes from and to the Read Storage and the Write Storage

When the Shard it is responsible for in the Write Storage is full, it:

  • Stops accepting requests
  • Writes the Shard to the Read Storage
  • Shuts down

Spawning the *sgi app or controlling the number of workers is not in the scope of this implementation.
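
The boot and shutdown steps described above can be sketched as follows. This is a minimal in-memory illustration and not the actual implementation: the class `WorkerSketch`, the `locks` set, the `max_bytes` limit and the `shard-N` naming are all hypothetical, and plain dictionaries stand in for the Write Storage, the Read Storage and the global index.

```python
class ShardFull(Exception):
    """Raised when the worker has stopped accepting requests."""


class WorkerSketch:
    """Hypothetical worker owning one writable Shard (names are assumptions)."""

    def __init__(self, write_storage, read_storage, index, locks, max_bytes):
        self.write_storage = write_storage  # dict: shard name -> {obj_id: bytes}
        self.read_storage = read_storage    # dict: shard name -> {obj_id: bytes}
        self.index = index                  # dict: obj_id -> shard name
        self.locks = locks                  # set of shard names currently locked
        self.max_bytes = max_bytes          # assumed size limit of a Shard
        self.accepting = True
        self.shard = self._lock_shard()

    def _lock_shard(self):
        # Resume an existing unlocked Shard, or create a new one.
        for name in self.write_storage:
            if name not in self.locks:
                self.locks.add(name)
                return name
        name = f"shard-{len(self.write_storage) + len(self.read_storage)}"
        self.write_storage[name] = {}
        self.locks.add(name)
        return name

    def _size(self):
        return sum(len(c) for c in self.write_storage[self.shard].values())

    def add(self, obj_id, content):
        if not self.accepting:
            raise ShardFull(self.shard)
        self.write_storage[self.shard][obj_id] = content
        self.index[obj_id] = self.shard
        if self._size() >= self.max_bytes:
            self._pack_and_shutdown()

    def get(self, obj_id):
        if not self.accepting:
            raise ShardFull(self.shard)
        name = self.index[obj_id]
        if name in self.write_storage:
            return self.write_storage[name][obj_id]
        return self.read_storage[name][obj_id]

    def _pack_and_shutdown(self):
        # Stop accepting requests, move the Shard to the Read Storage, release the lock.
        self.accepting = False
        self.read_storage[self.shard] = dict(self.write_storage.pop(self.shard))
        self.locks.discard(self.shard)
```

In this sketch a replacement worker started after the shutdown creates a fresh Shard and can still read objects that were packed into the Read Storage, since lookups go through the shared index.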

Event Timeline

dachary changed the task status from Open to Work in Progress. Jul 19 2021, 12:03 PM
dachary triaged this task as Normal priority.
dachary created this task.
dachary created this object in space S1 Public.
dachary updated the task description.

I misrepresented @olasd's suggestions; here is the chat log on the matter.

(13:14:12) olasd: I didn't really think of the api dispatcher as an actual component that needs writing. We deploy our API servers using gunicorn in a prefork model. I would implement the write storage as an (a/w)sgi app that owns the connection to its currently active shard and gets shut down and replaced by a new one when the shard is full; reads over the index and other shards can happen directly within this
(13:14:14) olasd: (a/w)sgi app?
(13:14:53) olasd: the number of active write shards gets controlled by the number of active (preforked) processes in the wsgi server
(13:16:54) ***dachary thinking
(13:18:19) olasd: the main concern is having n^2 connections open to be able to read objects, in the worst case (if all preforked workers get requests for objects in all shards). but reading all the fresh objects is not a common usecase (we do it 2 or 3 times, not n times) so I don't think it would be that bad in practice.
(13:19:15) dachary: there is a single endpoint for read and writes at the moment, correct?
(13:19:17) olasd: if that really happens we can route writes and reads to different backend services on the reverse proxy
(13:19:25) vlorentz: and if we ever need to read lots of fresh objects, we can probably add a cache layer in front of the objstorage
(13:19:35) olasd: vlorentz: yeah, that too
(13:19:55) vlorentz: (could be useful for other backends too)
(13:20:36) olasd: dachary: there's a single HTTP service (backed by however many preforked workers), with separate http endpoints for writes and reads
(13:21:32) dachary: thanks for understanding my poorly phrased question properly, you have a good parser
(13:23:22) olasd: so, we have lots of options for "API dispatch" but I think "writing code" should probably be the last one if we can help it
(13:23:26) olasd: :)
(13:25:51) dachary: so a given *sgi app would run with configuration parameters that are interpreted as "be a winery backend and be in charge of writes". And the winery backend would figure that out by reading the configuration, am I getting closer ?
(13:29:26) dachary: that makes sense to me
(13:30:02) olasd: the *sgi app implements the objstorage RPC api endpoints for reads and writes, either for the write storage case (where writes write to an index and a database, and reads read from an index and a database), or for the read storage case (where writes fail, and reads read from an index then a ceph rbd image)
(13:31:09) olasd: most likely the *sgi app can be generated from an implementation of the ObjstorageInterface (or whatever the name of the abstract class is) which doesn't care about the *sgi side of things at all
(13:36:01) dachary: the fact that a single app handles both reads and writes confuses me though: when it starts writing the Shard to the Read Storage, it will block for a long time. Won't that be a problem?
(13:36:58) vlorentz: then incoming requests should be sent to other workers
(13:38:44) dachary: hum, true. The app will actually shut down at that point and no longer accept requests.
(13:40:28) olasd: dachary: we can split the reads and writes at the reverse proxy level, and have reads and writes handled by different pools of workers
(13:40:29) dachary: To rephrase: there will be N apps running (N being under the control of the sysadmin using tooling that already exists). All apps will receive reads and writes, there will not be an app specialized in reading or writing.
(13:40:45) olasd: but I don't think we /need/ to considering the workload
(13:40:54) olasd: which makes the initial implementation simpler
(13:41:18) olasd: but operationally even if the implementation has both reads and writes, we can dedicate different pools to both workloads
(13:41:28) olasd: and dispatch at a higher level
(13:42:49) dachary: the workload (as in how much CPU/RAM the workers use) is not a concern, I/O throttling is. The workers need to communicate to figure out how much read/write bandwidth each of them is allowed to consume.
(13:43:24) dachary: But that can be done with the architecture you suggest.
(13:44:42) dachary: And it would not require a higher level dispatch / worker pools. Pools of independent workers is what the benchmarks use but it is problematic.
(13:47:01) dachary: olasd I think I have enough to get going, thank you! I'll go in this direction.
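
The two cases olasd describes (a write storage where reads and writes go through an index and a database, and a read storage where writes fail and reads go through the index) could look roughly like the sketch below. It is only an illustration under assumptions: the class names, the `mode` configuration key and the dictionaries standing in for the index, the database and the Ceph RBD image are hypothetical, and none of this is the swh.objstorage API. The *sgi layer is left out on purpose, since the backend is not supposed to care about it.

```python
class ReadOnlyError(Exception):
    """Raised when a write is attempted on the read storage case."""


class WriteStorageBackend:
    """Writes and reads go through an index and a database (both dicts here)."""

    def __init__(self, index, store):
        self.index = index
        self.store = store

    def add(self, obj_id, content):
        self.store[obj_id] = content
        self.index[obj_id] = "database"

    def get(self, obj_id):
        if obj_id not in self.index:
            raise KeyError(obj_id)
        return self.store[obj_id]


class ReadStorageBackend:
    """Writes fail; reads look up the index, then the image (a dict here)."""

    def __init__(self, index, store):
        self.index = index
        self.store = store

    def add(self, obj_id, content):
        raise ReadOnlyError(obj_id)

    def get(self, obj_id):
        if obj_id not in self.index:
            raise KeyError(obj_id)
        return self.store[obj_id]


def backend_from_config(config, index, store):
    # The "mode" key is an assumed configuration parameter.
    classes = {"write": WriteStorageBackend, "read": ReadStorageBackend}
    return classes[config["mode"]](index, store)
```

A worker would then pick its behaviour by reading its configuration, for example `backend_from_config({"mode": "write"}, index, store)`.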

On the topic of throttling, the following discussion happened on IRC:

(13:46:47) vlorentz: > The workers need to communicate
(13:46:50) vlorentz: ah!
(13:47:03) vlorentz: then indeed that's going to be an issue
(13:47:37) vlorentz: how will they communicate? it looks like you're restricting yourself to in-process communication
(13:47:44) dachary: vlorentz: they are connected to a database anyways and dedicating a table to throttling with updates every N seconds seems practical and efficient.
(13:47:45) olasd: I'm still not entirely convinced about the need for i/o throttling in practice
(13:47:56) olasd: we'll see
(13:49:32) olasd: (in any case I would suggest getting a working implementation for a single shard and shared index being written to/read from first, then we can have a look at coordinating all of this)
(13:49:34) dachary: olasd: the benchmarks demonstrate that no throttling is going to hurt your performance. Applying high pressure on writes for 5 minutes will significantly slow down your reads and multiply the response time.
(13:50:27) dachary: and if you do not have throttling, you cannot prevent that from happening at random moments
(13:51:21) dachary: writing a Shard to the Read Storage will have exactly that kind of effect
(13:51:32) zack: olasd: i had a very similar gut feeling about throttling while reading the benchmark reports
(13:52:11) zack: dachary: but explicit throttling will impact our real-world read workload, which is not something we want to do
(13:52:42) zack: (or, at least, i don't think so, maybe i'm wrong)
(13:52:51) dachary: to be clear I mean throttling the Ceph I/O workload only
(13:53:37) zack: oh, and also i was reading it backward, you want to throttle *writes*
(13:54:12) dachary: both reads and writes but for Ceph only
(13:55:45) dachary: read throttling is less of an immediate concern, writes are immediately problematic if not throttled because of the batch writes when a Shard is written to the Read Storage
(13:56:25) dachary: but the benchmarks also show that high pressure reads (which are possible although not immediately predictable) will hurt both reads and writes
(13:59:13) dachary: my intention is not to implement throttling in the first draft :-) But it needs to be there when the real work begins.

dachary changed the task status from Work in Progress to Open. Aug 29 2021, 1:08 PM
dachary changed the status of subtask T3104: Persistent readonly perfect hash table from Work in Progress to Open.