In order to request the vault to cook a bundle, or to retrieve an already cooked one, the vault has to be accessible through an API, meaning that we need an internal client/server to plug in.
See the API section in rDSTO1e974580ac3c.
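To make this concrete, here is a minimal sketch of what such an internal client could look like, assuming a plain HTTP API with `cook` and `fetch` endpoints. The paths, the `VaultClient` name and the JSON shape are illustrative assumptions, not the API described in rDSTO1e974580ac3c.

```python
# Minimal sketch of an internal HTTP client for the vault.
# Endpoint paths and response shapes are assumptions for illustration.
import requests


class VaultClient:
    """Hypothetical client talking to an internal vault HTTP server."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def cook(self, bundle_type, object_id):
        """Ask the vault to start cooking a bundle for the given object."""
        r = requests.post(f"{self.base_url}/cook/{bundle_type}/{object_id}")
        r.raise_for_status()
        return r.json()  # e.g. {"status": "pending"} or {"status": "done"}

    def fetch(self, bundle_type, object_id):
        """Retrieve an already cooked bundle as raw bytes."""
        r = requests.get(f"{self.base_url}/fetch/{bundle_type}/{object_id}")
        r.raise_for_status()
        return r.content


# Example use (hostname and identifiers are made up):
# vault = VaultClient("http://vault.internal:5005")
# vault.cook("git", "some-object-id")
# bundle = vault.fetch("git", "some-object-id")
```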
Related objects:

- rDSTO Storage manager
- D108 (rDSTOa71109b66538): Http API to access the SWH vault

| Status | Assigned | Task |
|---|---|---|
| Migrated | gitlab-migration | T67 prototype: git clone from SWH |
| Migrated | gitlab-migration | T508 prototype: git archive from SWH |
| Migrated | gitlab-migration | T530 Software Heritage Vault |
| Unknown Object (Maniphest Task) | | |
| Migrated | gitlab-migration | T532 Vault API |
Currently the cookers store their bundles in an objstorage. The current design of the objstorage requires having the whole object in RAM, and it would require significant changes to be able to "stream" big objects to the objstorage. This is a big problem for cooking requests on big repositories.
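For concreteness, here is a rough sketch of the difference from a cooker's point of view. `objstorage.add` is the existing whole-object interface; `add_stream` is an assumed method that does not exist today and is only shown to illustrate what chunked writes would mean.

```python
# Illustration only: add_stream is a hypothetical chunked interface,
# not something the objstorage currently provides.

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per chunk (arbitrary)


def add_whole(objstorage, obj_id, path):
    # Current situation: the full bundle must fit in RAM before being added.
    with open(path, "rb") as f:
        content = f.read()  # a 52 GB bundle means 52 GB of RAM
    objstorage.add(content, obj_id=obj_id)


def add_streamed(objstorage, obj_id, path):
    # Hypothetical chunked interface: the bundle is pushed chunk by chunk,
    # so memory usage stays bounded by CHUNK_SIZE.
    def chunks():
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    return
                yield chunk

    objstorage.add_stream(chunks(), obj_id=obj_id)  # assumed method
```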
I see two solutions for that:

1. **Make the objstorage able to stream big objects.** This will require a lot of changes (I'm not even sure I've seen everything there is to change, but I think it's a pretty big overhaul, especially for the remote storage).

   Advantages:

2. **Stream the bundle to the user while it is being cooked** (sketched below). I really like this option. I see a ton of advantages:

Maybe I missed some reasons for wanting efficient retrieval on the user's side? I really like having the user "share the responsibility" of the computing time by maintaining an open connection.
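As a sketch of the second option: the HTTP response is streamed while the cooker produces the bundle, so no finished bundle ever needs to be stored, and the connection simply stays open for the duration of the cooking. Flask, the route and `cook_git_bundle` are all illustrative assumptions.

```python
# Sketch of option 2: stream the bundle while it is being cooked.
from flask import Flask, Response

app = Flask(__name__)


def cook_git_bundle(obj_id):
    # Placeholder cooker: a real one would walk the storage and yield the
    # bundle bytes as they are produced.
    for i in range(3):
        yield f"chunk {i} of bundle for {obj_id}\n".encode()


@app.route("/vault/git/<obj_id>/raw")
def stream_bundle(obj_id):
    # The client keeps the connection open for the whole cooking time, thus
    # "sharing the responsibility" of the computing time.
    return Response(cook_git_bundle(obj_id),
                    mimetype="application/octet-stream")
```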
From IRC with permission.
16:46:39 seirl ╡ olasd: i don't remember if you had an opinion on that too
16:49:32 olasd ╡ my opinion is that we should try very hard to avoid doing long-running stuff without checkpoints
16:50:59 ╡ I don't think it's reasonable to expect a connection to stay open for hours
16:53:37 ╡ this disqualifies any client on an unreliable connection, which is maybe half the world?
16:54:14 seirl ╡ okay, i'm not excluding trying to find a way to "resume" the download
16:55:06 ╡ that way we can just store the state of the cookers, which is pretty small
16:55:29 ╡ also, people on unstable connections tend to not want to download 52GB files
16:55:38 olasd ╡ except when they do
16:55:45 nicolas17 ╡ o/
16:56:03 ╡ I don't mind downloading 52GB
16:56:32 olasd ╡ swh doesn't intend to serve {people on stable, fast connections}, it intends to serve people
16:56:41 nicolas17 ╡ but if it's bigger than 100MB and you don't support resuming then I hate you
16:56:44 seirl ╡ okay there's a misunderstanding by what I meant by that
16:56:53 ╡ assuming we DO implement checkpoints
16:57:04 ╡ (and resuming)
16:57:15 ╡ people with unstable connections are usually people with slow download speeds
16:57:47 ╡ so they won't be impacted a lot by the fact that streaming the response while it's being cooked has a lower throughput
16:57:55 olasd ╡ I still don't think streaming is a reasonable default
16:58:09 seirl ╡ okay
16:59:13 olasd ╡ however, I think making the objstorage support chunking is a reasonable goal
16:59:34 ╡ even if it's restricted to the local api for now
16:59:55 seirl ╡ oh, i hadn't thought of chunking the bundles
17:00:02 nicolas17 ╡ if I start downloading and you stream the response, and the connection drops, what happens? will it keep processing and storing the result in the server, or will it abort?
17:00:29 seirl ╡ nicolas17: i was thinking about storing the state of the processing (which is small) somewhere
17:00:34 ╡ in maybe an LRU cache
17:00:48 ╡ if the user reconnects, the state is restored and the processing can continue
17:01:15 nicolas17 ╡ would this be a plain HTTP download from the user's viewpoint?
17:01:21 seirl ╡ yeah
17:01:27 nicolas17 ╡ would the state be restored such that the file being produced is bitwise identical?
17:01:33 seirl ╡ that's the idea
17:01:45 ╡ we can deduce which state to retrieve from the Range: header
17:02:06 nicolas17 ╡ great then
17:03:04 olasd ╡ nicolas17: mind if I paste this conversation to the forge ?
17:03:15 nicolas17 ╡ go ahead
17:03:17 * ╡ olasd is lazy
17:03:23 seirl ╡ that said i perfectly understand that wanting the retrieval to be fast and simple for the users is an important goal, if we're not concerned about the storage and we can easily do chunking that might be a good way to go
17:03:42 nicolas17 ╡ the bitwise-identical thing is important or HTTP-level resuming would cause a corrupted mess :P

We should also consider that the API server where the public requests a bundle, and the workers that actually cook them, are very likely to be isolated from one another, which would make streaming to clients tricky to implement.
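To make the resuming idea from the discussion concrete, here is a minimal sketch, assuming a small in-memory LRU cache for the cooker state and a deterministic cooker so that resumed output is bitwise identical. Every name here (`cooker_states`, `resume_cooking`, the route) is an illustrative assumption, not an existing API.

```python
# Sketch: resumable streaming via the Range header, with small cooker state
# kept in an in-memory LRU cache. All names are illustrative.
from collections import OrderedDict

from flask import Flask, Response, request

app = Flask(__name__)

# Small LRU cache mapping obj_id -> cooker checkpoint (the state is small).
MAX_STATES = 128
cooker_states = OrderedDict()


def save_state(obj_id, state):
    cooker_states[obj_id] = state
    cooker_states.move_to_end(obj_id)
    if len(cooker_states) > MAX_STATES:
        cooker_states.popitem(last=False)  # evict the least recently used


def resume_cooking(obj_id, offset):
    """Produce the bundle deterministically, starting at byte `offset`.

    A real cooker would restore its checkpoint from cooker_states and
    fast-forward to the nearest checkpoint before `offset`; this dummy
    generator just re-produces deterministic content and skips bytes, so a
    resumed download is bitwise identical to an uninterrupted one.
    """
    produced = 0
    for i in range(1000):
        chunk = f"deterministic chunk {i} of {obj_id}\n".encode()
        if produced + len(chunk) > offset:
            yield chunk[max(0, offset - produced):]
        produced += len(chunk)
        save_state(obj_id, {"chunks_done": i + 1})


@app.route("/vault/git/<obj_id>/raw")
def download(obj_id):
    # Honour a simple "bytes=N-" Range header so interrupted downloads can
    # resume; a real implementation would also send Content-Range.
    range_header = request.headers.get("Range", "bytes=0-")
    offset = int(range_header.split("=")[1].split("-")[0] or 0)
    return Response(resume_cooking(obj_id, offset),
                    status=206 if offset else 200,
                    mimetype="application/octet-stream")
```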
Just a couple of comments: