
Public API v2
Closed, Migrated

Description

Motivation: We want to stop encoding arguments in the request path, and use query parameters instead. This makes more sense for HTTP (and the various proxy layers we have), and is the only way to properly pass an origin URL as an argument to the API (%2F and / are indistinguishable in Django's request routing). This will require changing the whole structure of the API.
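
As a rough illustration of the difference, here is a Python sketch of a path-encoded lookup versus a query-parameter lookup (both URLs are illustrative of the style only; the v2 route and parameter name are hypothetical, not a proposed design):

```python
import requests
from urllib.parse import quote

ORIGIN = "https://github.com/python/cpython"

# Path-encoded style: the origin URL is embedded in the request path, so every
# "/" must be percent-encoded as %2F -- which Django's URL routing cannot tell
# apart from a literal "/".
v1_style = f"https://archive.softwareheritage.org/api/1/origin/{quote(ORIGIN, safe='')}/get/"

# Query-parameter style (hypothetical v2 route): the origin URL travels as a
# query parameter, so the client library handles the encoding unambiguously.
v2_response = requests.get(
    "https://archive.softwareheritage.org/api/2/origin/",
    params={"origin_url": ORIGIN},
)
```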

This meta task lists all other breaking changes we want to include in version 2 of the public API.

  1. Use query parameters instead of encoding arguments in the request path
  2. No leak of the origin id (use only origin URLs)
  3. Use SWHIDs everywhere (core SWHIDs, without qualifiers)
  4. Compatibility with at least one well-known API specification format (OpenAPI, SPARQL, ...)
  5. Consistent pagination used across all endpoints
  6. Authentication
  7. Standardize "batch invocation" of endpoints on multiple objects
  8. Consistent results for the same object accessed via different endpoints (e.g. /revision/<rev>/directory and /directory/<dir_id> do not return the same type of result; one is a superset of the other).
  9. Future-proofing, w.r.t. changes of hash algorithms (currently sha1_git)
  10. Consider dropping /revision/log/ (?) (see T2450)

Event Timeline

vlorentz triaged this task as Normal priority. Jun 14 2019, 12:06 PM
vlorentz created this task.
douardda updated the task description.
vlorentz renamed this task from Public API v2 (meta task) to Public API v2. Jan 22 2020, 4:23 PM
vlorentz added a project: meta-task.
zack updated the task description.

Rereading this task, I have a few comments/questions.

Item 3 Use SWHIDs everywhere - does the new API handle SWHID qualifiers too? Always? Sometimes? Or are qualifiers actually "the way" to specify arguments? Are there needed endpoints that would not fit (besides query-execution arguments like pagination support, etc.)?
Using SWHIDs will work well to navigate the archive (since we only manipulate existing archive objects, so everything has a SWHID), but not for "discovery" usage (looking for objects within a range, matching a pattern, etc.).

Maybe those 2 use cases should be considered separately.
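
For instance, one way the two could coexist (purely illustrative; the endpoint and parameter names below are hypothetical) is to keep the core SWHID as the main argument and carry the qualifiers as query parameters:

```python
from urllib.parse import urlencode

QUALIFIED = (
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"
    ";origin=https://github.com/torvalds/linux"
    ";lines=9-15"
)

# Split the core SWHID from its qualifiers (qualifiers are ;-separated
# key=value pairs appended to the core identifier).
core, *qualifier_parts = QUALIFIED.split(";")
qualifiers = dict(part.split("=", 1) for part in qualifier_parts)

# Hypothetical v2 resolution endpoint taking the core SWHID plus qualifiers
# as plain query parameters.
print("/api/2/resolve/?" + urlencode({"swhid": core, **qualifiers}))
```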

Item 4 Use a well-known API tool - I'm pretty sure SPARQL is out of reach for us, so I'd go for OpenAPI. A decent first step would be to begin writing an OpenAPI definition for this new API.
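
To make that concrete, here is what a minimal skeleton of such a definition could look like, written as a Python dict and dumped to YAML (requires PyYAML; the path and parameter names are placeholders, not a proposed design):

```python
import yaml  # PyYAML

spec = {
    "openapi": "3.0.3",
    "info": {"title": "Software Heritage API", "version": "2.0.0"},
    "paths": {
        "/origin/": {
            "get": {
                "summary": "Look up an origin by URL",
                "parameters": [
                    {
                        "name": "origin_url",
                        "in": "query",
                        "required": True,
                        "schema": {"type": "string", "format": "uri"},
                    }
                ],
                "responses": {"200": {"description": "The origin, if archived"}},
            }
        }
    },
}

print(yaml.safe_dump(spec, sort_keys=False))
```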

Items 5, 6, 7 aka pagination, auth and batches - I believe these come naturally with item 4 (specification wise)

Item 8 Consistent results w.r.t. access path - with the rise of "SWHIDs everywhere" and the new existence of "SWHIDs with context", how does this point play out? Don't we want dedicated views/results when using a contextualised SWHID?

Item 9 Future-proofing w.r.t. hash algos - unless I'm wrong, from the API point of view, and according to item 3, this is a matter of SWHID specification, not an API one.

Overall, I'm not sure how far to go with this "use SWHIDs everywhere". In fact, using SWHIDs, we don't even need most entry points (/content, /snapshot, etc.); keeping them is a bit redundant. But can we imagine an API based on SWHIDs only?

Then, during this API v2 design session, the following should be considered:

  • which parts of API v1 do we need? Can we get rid of some? (e.g. /revision/log, as pointed out in item 10)
  • are there missing endpoints?
  • who are the current users of this API? Who are the expected new users?
  • what do they need?

And before jumping into writing code, a few client scenarios should be implemented using this new API, to check how convenient it is.
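
One such scenario could look roughly like this (a sketch only, against hypothetical v2 routes and response shapes, none of which exist yet): find an origin, follow its latest snapshot, and list the root directory.

```python
import requests

BASE = "https://archive.softwareheritage.org/api/2"  # hypothetical base URL


def browse_latest_root(origin_url: str) -> dict:
    """Walk from an origin URL to the root directory of its latest snapshot."""
    origin = requests.get(f"{BASE}/origin/", params={"origin_url": origin_url}).json()
    snapshot = requests.get(
        f"{BASE}/object/", params={"swhid": origin["latest_snapshot"]}
    ).json()
    root = snapshot["branches"]["HEAD"]["root_directory"]
    return requests.get(f"{BASE}/object/", params={"swhid": root}).json()
```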

I suspect that when this task was initially submitted we didn't yet have SWHIDs with qualifiers :)
From the point of view of APIv2, given v1 was using only hashes, for feature parity we should indeed only need SWHIDs without qualifiers, i.e., "core" SWHIDs. (I'm gonna edit the task description to reflect that.) Thanks for noticing this!

Good point about /content v. /directory etc. I agree that with SWHIDs everywhere they seem redundant. That's similar to something I experienced when writing the Python Web client. There I went for (1) a generic get() method that takes a SWHID and returns any kind of object, together with (2) type-specific methods (revision(), directory(), etc.). The advantage of having both is that, on the one hand, you get type checking that helps avoid passing the wrong type of SWHID if your application, say, only ever wants to deal with revisions. On the other hand, you get a natural namespace for additional type-specific methods. For instance, if we get rid of /revision, where do we put /revision/log? (Maybe not the greatest of examples, as we are considering dropping that method, but other "rich" methods can easily show up for any kind of object, I think.)
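
A rough sketch of that pattern (not the actual swh.web.client code, just an illustration of the idea):

```python
class WebAPIClient:
    """Sketch of a client combining a generic accessor with typed wrappers."""

    def get(self, swhid: str) -> dict:
        """Fetch any archive object by its core SWHID."""
        ...  # dispatch on the object type encoded in the SWHID

    def revision(self, swhid: str) -> dict:
        """Fetch a revision; reject SWHIDs of any other type."""
        if not swhid.startswith("swh:1:rev:"):
            raise ValueError(f"not a revision SWHID: {swhid}")
        return self.get(swhid)

    def directory(self, swhid: str) -> dict:
        """Fetch a directory; a natural home for extra directory-specific methods."""
        if not swhid.startswith("swh:1:dir:"):
            raise ValueError(f"not a directory SWHID: {swhid}")
        return self.get(swhid)
```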

Items 5, 6, 7 aka pagination, auth and batches - I believe these come naturally with item 4 (specification wise)

They don't. OpenAPI is a specification to describe APIs, and it contains absolutely nothing about pagination or batches.

But they have to be taken into account when writing the OpenAPI description of the API, if that's what you meant.
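
For instance (illustrative only), pagination could be expressed as reusable parameter components that every paginated endpoint references, so the convention lives in one place in the OpenAPI document:

```python
# Fragment of an OpenAPI document, as a Python dict; all names are placeholders.
pagination_components = {
    "components": {
        "parameters": {
            "PerPage": {
                "name": "per_page",
                "in": "query",
                "schema": {"type": "integer", "maximum": 1000, "default": 100},
            },
            "PageToken": {
                "name": "page_token",
                "in": "query",
                "description": "Opaque cursor returned by the previous page",
                "schema": {"type": "string"},
            },
        }
    }
}

# Paginated endpoints would then reference them with
# {"$ref": "#/components/parameters/PerPage"} and so on.
```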

Items 5, 6, 7 aka pagination, auth and batches - I believe these come naturally with item 4 (specification wise)

They don't. OpenAPI is a specification to describe APIs, and it contains absolutely nothing about pagination or batches.

note: an old comment I did not submit and forgot about from back then

sure, but

it's true these do not come "for free", but I still have the impression there is an "OpenAPI way" of handling them, and we should stick to it.

One point I'm not sure how/if we want to pay attention to: the query-parameter approach limits the capabilities of the batch invocation mechanism (for batch input, due to limited and poorly standardized URL size limits).

So do we want to also have endpoints which support "parameters" given in the request's payload (typically to support big lists of SWHIDs)?

it's true these do not come "for free", but I still have the impression there is an "OpenAPI way" of handling them, and we should stick to it.

Thanks for these comments. I agree that we should look into these kinds of best practices to implement our requirements on top of something like OpenAPI, given it seems to be the current state of the art.

One point I'm not sure how/if we want to pay attention to: the query-parameter approach limits the capabilities of the batch invocation mechanism (for batch input, due to limited and poorly standardized URL size limits).

So do we want to also have endpoints which support "parameters" given in the request's payload (typically to support big lists of SWHIDs)?

I think so, yes. In fact, that is already how we do things with the /known endpoint: given it needs to take in "a lot" of SWHIDs, they are passed as a POST payload. I think it would make sense to generalize this approach to all methods that need to be able to handle batches (maybe all of them, maybe not, but that's a separate question).
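
A sketch of how that generalization could look from a client's perspective (the v2 route and request body shape are hypothetical; only the /known call reflects the existing v1 convention):

```python
import requests

swhids = [
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2",
    "swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505",
]

# v1 /known: the batch of SWHIDs is sent as a JSON body, so URL length limits
# never come into play.
known = requests.post(
    "https://archive.softwareheritage.org/api/1/known/", json=swhids
).json()

# A hypothetical v2 batch endpoint following the same convention.
batch = requests.post(
    "https://archive.softwareheritage.org/api/2/objects/", json={"swhids": swhids}
)
```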