In order to request that the vault cook a bundle, or to retrieve an already cooked one, the vault has to be accessible through an API, which means we need an internal client/server to plug in.
See the API section in rDSTO1e974580ac3c.
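To make the client/server shape concrete, here is a minimal sketch of what such an internal vault client could look like. The endpoint layout (`/vault/<type>/<id>/` to request cooking, `.../raw/` to fetch the bundle) and the class name are assumptions for illustration, not the actual SWH API.

```python
# Hypothetical internal vault client; the URL scheme is an assumption.
class VaultClient:
    """Minimal client for a cook/fetch vault API (illustrative only)."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def cook_url(self, bundle_type, obj_id):
        # POSTing here would request the cooking of a bundle.
        return f"{self.base_url}/vault/{bundle_type}/{obj_id}/"

    def fetch_url(self, bundle_type, obj_id):
        # GETting here would retrieve an already cooked bundle.
        return f"{self.base_url}/vault/{bundle_type}/{obj_id}/raw/"
```

The point is only that "request cooking" and "retrieve bundle" are two distinct operations the internal API must expose.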
Related: rDSTO Storage manager | D108 rDSTOa71109b66538 Http API to access the SWH vault

Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T67 prototype: git clone from SWH
Migrated | gitlab-migration | T508 prototype: git archive from SWH
Migrated | gitlab-migration | T530 Software Heritage Vault
Migrated | gitlab-migration | T532 Vault API
Currently the cookers store their bundles in an objstorage. The current design of the objstorage requires having the whole object in RAM, and it would take significant changes to be able to "stream" big objects into the objstorage. This is a big problem for cooking requests on big repositories.
I see two solutions for that:
This will require a lot of changes (I'm not even sure I've seen everything there is to change, but I think it's a pretty big overhaul, especially for the remote storage).
Advantages:
I really like this option. I see a ton of advantages:
Maybe I missed some of the reasons for wanting efficient retrieval for the user? I really like having the user "share the responsibility" for the computing time by maintaining an open connection.
From IRC with permission.
16:46:39 seirl ╡ olasd: i don't remember if you had an opinion on that too
16:49:32 olasd ╡ my opinion is that we should try very hard to avoid doing long-running stuff without checkpoints
16:50:59 ╡ I don't think it's reasonable to expect a connection to stay open for hours
16:53:37 ╡ this disqualifies any client on an unreliable connection, which is maybe half the world?
16:54:14 seirl ╡ okay, i'm not excluding trying to find a way to "resume" the download
16:55:06 ╡ that way we can just store the state of the cookers, which is pretty small
16:55:29 ╡ also, people on unstable connections tend to not want to download 52GB files
16:55:38 olasd ╡ except when they do
16:55:45 nicolas17 ╡ o/
16:56:03 ╡ I don't mind downloading 52GB
16:56:32 olasd ╡ swh doesn't intend to serve {people on stable, fast connections}, it intends to serve people
16:56:41 nicolas17 ╡ but if it's bigger than 100MB and you don't support resuming then I hate you
16:56:44 seirl ╡ okay there's a misunderstanding by what I meant by that
16:56:53 ╡ assuming we DO implement checkpoints
16:57:04 ╡ (and resuming)
16:57:15 ╡ people with unstable connections are usually people with slow download speeds
16:57:47 ╡ so they won't be impacted a lot by the fact that streaming the response while it's being cooked has a lower throughput
16:57:55 olasd ╡ I still don't think streaming is a reasonable default
16:58:09 seirl ╡ okay
16:59:13 olasd ╡ however, I think making the objstorage support chunking is a reasonable goal
16:59:34 ╡ even if it's restricted to the local api for now
16:59:55 seirl ╡ oh, i hadn't thought of chunking the bundles
17:00:02 nicolas17 ╡ if I start downloading and you stream the response, and the connection drops, what happens? will it keep processing and storing the result in the server, or will it abort?
17:00:29 seirl ╡ nicolas17: i was thinking about storing the state of the processing (which is small) somewhere
17:00:34 ╡ in maybe an LRU cache
17:00:48 ╡ if the user reconnects, the state is restored and the processing can continue
17:01:15 nicolas17 ╡ would this be a plain HTTP download from the user's viewpoint?
17:01:21 seirl ╡ yeah
17:01:27 nicolas17 ╡ would the state be restored such that the file being produced is bitwise identical?
17:01:33 seirl ╡ that's the idea
17:01:45 ╡ we can deduce which state to retrieve from the Range: header
17:02:06 nicolas17 ╡ great then
17:03:04 olasd ╡ nicolas17: mind if I paste this conversation to the forge ?
17:03:15 nicolas17 ╡ go ahead
17:03:17 * ╡ olasd is lazy
17:03:23 seirl ╡ that said i perfectly understand that wanting the retrieval to be fast and simple for the users is an important goal, if we're not concerned about the storage and we can easily do chunking that might be a good way to go
17:03:42 nicolas17 ╡ the bitwise-identical thing is important or HTTP-level resuming would cause a corrupted mess :P
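The "deduce which state to retrieve from the Range: header" idea from the log can be sketched as follows. This is a hypothetical illustration: the helper names and the fixed-interval checkpoint scheme are assumptions, and it only works if cooking is deterministic (bitwise-identical output), as nicolas17 points out.

```python
def parse_range_start(range_header):
    """Extract the start offset from a 'Range: bytes=N-' header.
    Returns 0 when no header is given (fresh, full download)."""
    if not range_header:
        return 0
    unit, _, spec = range_header.partition("=")
    if unit.strip() != "bytes":
        raise ValueError("unsupported range unit")
    start, _, _ = spec.partition("-")
    return int(start)

def resume_point(offset, checkpoint_every):
    """Given a byte offset, find the latest cooker checkpoint at or
    before it, plus how many already-produced bytes to skip before
    streaming resumes. Assumes one checkpoint every `checkpoint_every`
    output bytes (a hypothetical scheme)."""
    checkpoint = offset // checkpoint_every
    skip = offset - checkpoint * checkpoint_every
    return checkpoint, skip
```

On reconnect, the server would restore cooker state `checkpoint` (e.g. from the LRU cache mentioned above), re-run from there, discard `skip` bytes, and answer with `206 Partial Content`.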
We should also consider that the API server where the public requests a bundle, and the workers that actually cook them, are very likely to be isolated from one another, which would make streaming to clients tricky to implement.
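If streaming from the cookers through the API server is off the table, the natural alternative is a cook-then-poll-then-fetch flow. A minimal sketch of the client side, under the assumption that the public API exposes a status endpoint and a fetch endpoint (the function names here are placeholders, not real API calls):

```python
import time

def wait_for_bundle(get_status, fetch, poll_interval=1.0, timeout=60.0):
    """Poll `get_status()` until it reports 'done', then call `fetch()`.
    `get_status` and `fetch` stand in for HTTP calls to the public API;
    no connection stays open while the cooker works."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == "done":
            return fetch()
        if status == "failed":
            raise RuntimeError("cooking failed")
        time.sleep(poll_interval)
    raise TimeoutError("bundle was not cooked in time")
```

This keeps the API server and the cookers fully decoupled: they only need to share the bundle's status and, eventually, its bytes through the objstorage.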
Just a couple of comments: