Page MenuHomeSoftware Heritage

Vault: polish web-ui API
Closed, MigratedEdits Locked

Event Timeline

seirl created this task.
seirl triaged this task as Normal priority.Aug 3 2017, 6:58 PM

What is needed now is mainly to change the way formats are given in parameter (revision_gitfast/<hash> -> revision/<hash>/gitfast maybe?)

My current idea for the user-facing api is below. I have highlighted in bold the
things that require decisions or comments.

Object types and options

GET /1/vault/directory/<sha1:dir_id>/         (default format = tarball)
GET /1/vault/directory/<sha1:dir_id>/tarball

GET /1/vault/revision/<sha1:rev_id>/          (default format = ?)
GET /1/vault/revision/<sha1:rev_id>/gitfast
GET /1/vault/revision/<sha1:rev_id>/bare
GET /1/vault/revision/<sha1:rev_id>/gitfast?depth=1337

GET /1/vault/snapshot/<sha1:rev_id>/          (default format = ?)
GET /1/vault/snapshot/<sha1:rev_id>/gitfast
GET /1/vault/snapshot/<sha1:rev_id>/bare
GET /1/vault/snapshot/<sha1:rev_id>/bare?depth=42

Internally, the type and format dispatch to different cookers, so when
transmitting the request to the vault backend, we'd need a:

obj_type = '{}_{}'.format(object, format)

Does that split on the user side/join on the backend side and the use of
default formats look consistent to you?

Getting progress vs requesting cooking

(note that the URLs are the same for progress and cooking, only the HTTP verb
changes. should we add /cook or is that fine?)

GET /1/vault/snapshot/<sha1:rev_id>/bare     -> {"status": null}
POST /1/vault/snapshot/<sha1:rev_id>/bare    -> {"status": "new"}

GET /1/vault/snapshot/<sha1:rev_id>/bare     -> {"status": "pending"}
...
GET /1/vault/snapshot/<sha1:rev_id>/bare     -> {"status": "pending"}
GET /1/vault/snapshot/<sha1:rev_id>/bare     -> {"status": "done"}

GET /1/vault/snapshot/<sha1:rev_id>/bare/raw -> <repo.tar.gz>

Email notification

Option 1 (request parameter):

POST /1/vault/snapshot/<sha1:rev_id>/bare?depth=1&email=test@test.com

Option 2 (request data):

curl -X POST -d 'email=test@test.com' \
    /1/vault/snapshot/<sha1:rev_id>/bare?depth=1

(I think both should be supported)

Getting progress of objects with options

This is rather long term, but if we allow shallow clones with like ?depth=1 or
other options, we will no longer be able to reference bundles only by (type,
id, format), to fetch them or to see their progress. There's two
possibilites here (not necessarily exclusive, we just need to choose what we
support
):

  • Add *all the params* to all subsequent requests (HTTP parameters can be tricky, so it can be hard to find a way to uniquely identify bundles that way, but we can probably normalize that).
POST /1/vault/snapshot/<sha1:rev_id>/bare?depth=42     -> cook
GET /1/vault/snapshot/<sha1:rev_id>/bare?depth=42      -> progress
GET /1/vault/snapshot/<sha1:rev_id>/bare/raw?depth=42  -> fetch

GET /1/vault/snapshot/<sha1:rev_id>/bare/raw?depth=41
   -> ERROR, wasn't cooked like that
  • The POST that requests a cooking returns an ID, we can use that as an identifier for subsequent requests:
POST /1/vault/snapshot/<sha1:rev_id>/bare?depth=42     -> cook
   -> {'id': 1337}

GET /1/vault/1337  -> progress
GET /1/vault/1337/raw  -> fetch

First of all thanks for this useful summary @seirl !

My take on the open questions (and more):

  • shouldn't the format be a parameter instead of a part of the URL? i.e., ?format=foo instead of /foo at the end?
  • the split looks good to me, but i'm not sure i understand which kind of (potential) inconsistency you are worried about…
  • dispatching on GET v. POST looks good to me, I (slightly) prefer it to different URLs. YMMV
  • for polling, while reading I was about to propose returning a receipt that can be used for progress report, so there goes my preference :-) it might be tricky though. A couple of potential caveats come to mind:
    • the plan would be to guarantee that if multiple people request the same cooking (same params, format, etc.) they should be given the same cooking ID, right?
    • we should check (with @anlambert) that this choice doesn't pose problems in terms of showing in the web UI the current status of cooking
  • shouldn't the format be a parameter instead of a part of the URL? i.e., ?format=foo instead of /foo at the end?

I see some reasons why not:

  • I like having different endpoints ↔ different ways of handling the result data. If you do ?depth=1, the way you handle the data doesn't change compared to ?depth=42, you still import it the same way.
  • I think it's important to distance ourselves from the backend implementation and this is why we do the split, but the format is not just an option, it's an inherent part of how we do the dispatching, so I think it's important to reflect that even in the user API
  • Having multiple views for different formats allows us to have an API documentation per format, with detailed info on how to extract/import it locally, as seen here: https://archive.softwareheritage.org/api/1/vault/revision_gitfast/594617d1cd9d9d6bc0cfbd531bbaa1ed19627e9b/
  • the split looks good to me, but i'm not sure i understand which kind of (potential) inconsistency you are worried about…

I was just wondering if it looked good in general or if you thought it was unclear for some reason, I'm not particularly worried.

  • dispatching on GET v. POST looks good to me, I (slightly) prefer it to different URLs. YMMV

The industry is currently in love with REST apis for whatever reasons, so I think having a dispatch over GET/POST is the """de-facto industry standard""" right now, so I think it's less surprising to do it like that. But I don't really care either.

  • the plan would be to guarantee that if multiple people request the same cooking (same params, format, etc.) they should be given the same cooking ID, right?

Yes, that's already the case for the tuple (obj_type, obj_id) as enforced here : https://forge.softwareheritage.org/source/swh-vault/browse/master/swh/vault/backend.py;e7ff2ad49da6989b30beb1a314d1452e47a1f91f$184-186
We need to find a canonical way to serialize the "optional parameters" so that we can add them to that identifying tuple. But I don't think we should give it *that much thought* for now, as this will require a lot of changes in the vault anyway... for instance, all the backend methods currently only take (obj_type, obj_id) so we would need a new version of all these methods that take an id instead.

In T743#16529, @seirl wrote:
  • shouldn't the format be a parameter instead of a part of the URL? i.e., ?format=foo instead of /foo at the end?

I see some reasons why not:

ok, i'm sold!

seirl added a project: Restricted Project.Jan 8 2018, 12:45 PM
zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jan 9 2018, 12:45 PM

New URL scheme was done in rDWAPPS195058a21902, dispatch between GET and POST in rDWAPPS03406f1d07af , and I made some changes in the Vault to behave nicely with the RPC and the documented exceptions in rDVAUd95bf6d37bec .

What's remaining (getting progress of objects with options) is a long term goal, so I'm closing this task. The API should be definitive for the FOSDEM release now.

seirl moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jan 12 2018, 2:42 PM