
expose swh-graph API at archive.s.o/api/1/graph/
Closed, Migrated

Description

We would like to expose the swh-graph API to the public, by mounting it at a novel /graph endpoint of the Web API.

Caveat: as it is currently easy to DoS the swh-graph API (e.g., by requesting a full BFS visit of a significant part of the archive), we want, for now, to restrict use of that API to selected users, e.g., via their authentication tokens.

This requires some discussion of which components will be responsible for what, between at least: the swh-graph API server, the swh-web API server, a reverse proxy between the two, and the authentication/authorization backend.

So, discuss :-)

Event Timeline

zack renamed this task from "expose the compressed graph API at archive.s.o/api/1/graph/" to "expose swh-graph API at archive.s.o/api/1/graph/". Sep 14 2020, 2:37 PM
zack triaged this task as Normal priority.
zack created this task.

So, my first instinct for this was to implement the "mount" at the reverse proxy level (before even hitting swh-web), but:

  • I don't know how much of the token implementation that's been rolled on top of basic OpenID Connect we can support there
  • I'm not sure how we can handle user filtering at the reverse proxy level either without hardcoding a list in the RP config (ew)
  • In the end, we'll want to have some API endpoints merging data from swh.graph and swh.storage (to my surprise, I couldn't find a task for this after a cursory glance; T2113 is one specific instance of it). It makes sense that swh-web would call into the graph backend and massage the data before returning it to the API consumers.

So all in all I guess implementing the reverse proxy at the swh-web level would be a decent way of moving forward...

I agree with @olasd that we should do the reverse proxying at the webapp level. The main advantages are:

  • We can use the same Web API authentication backend to manage authentication and user permissions. API authentication is based on an OIDC offline refresh token, and access token renewal is handled in the Django DRF authentication backend. While it should be possible to implement that process at the reverse proxy level, user filtering would not be as easy there as using the fine-grained permissions of the Django User API.
  • We can process swh-graph responses to enrich the data (notably get origin URLs from their sha1, or turn SWHIDs into dicts) and return them in JSON format.
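As an illustration of the fine-grained permission idea above, here is a minimal sketch of a check that only lets selected users through. It mirrors the `has_permission(request, view)` interface of DRF permission classes and the `is_authenticated` / `has_perm` members of Django users, but is written without importing either so it can be read in isolation. The permission name `swh.web.api.graph` and the class name are hypothetical, not something decided in this task.

```python
# Hypothetical permission name; the real one would be defined in the
# authentication backend.
GRAPH_PERMISSION = "swh.web.api.graph"


class GraphAccessPermission:
    """Duck-typed equivalent of a DRF permission class.

    In swh-web this would presumably subclass
    rest_framework.permissions.BasePermission.
    """

    def has_permission(self, request, view):
        user = getattr(request, "user", None)
        # Require both a logged-in user and the dedicated graph permission.
        return bool(
            user is not None
            and user.is_authenticated
            and user.has_perm(GRAPH_PERMISSION)
        )
```

With something like this in place, granting or revoking graph access becomes a matter of editing user permissions rather than maintaining a list in the reverse proxy configuration.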

A basic skeleton for such a proxy in swh-web could be the following:

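import requests
from rest_framework.decorators import permission_classes
from rest_framework.permissions import IsAuthenticated

# api_route, get_config and make_api_response are assumed to come from swh-web;
# process_response is a placeholder that still has to be written.

@api_route(r"/graph/(?P<graph_endpoint>.+)/", "api-1-graph")
@permission_classes([IsAuthenticated])
def api_graph_proxy(request, graph_endpoint):
    # forward the requested path and query string to the swh-graph server
    graph_endpoint_url = get_config()["graph"]["server_url"]
    graph_endpoint_url += graph_endpoint
    if request.GET:
        graph_endpoint_url += f"?{request.GET.urlencode(safe='/;:')}"
    response = requests.get(graph_endpoint_url)
    # process returned data according to content type (text, json, ndjson)
    enriched_response = process_response(response)
    return make_api_response(request, enriched_response)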
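To make the `process_response` placeholder a bit more concrete, here is a rough sketch of how it could dispatch on the content type. The MIME type strings and the choice to treat plain text as one item per line are assumptions, not confirmed swh-graph behavior. The function only relies on the response exposing `.headers` and `.text`, as `requests` responses do.

```python
import json


def process_response(response):
    """Decode a graph server response into plain Python data."""
    content_type = response.headers.get("Content-Type", "")
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";")[0].strip().lower()
    lines = [line for line in response.text.splitlines() if line.strip()]
    if mime == "application/json":
        return json.loads(response.text)
    if mime == "application/x-ndjson":
        # Newline-delimited JSON: one JSON document per line.
        return [json.loads(line) for line in lines]
    # Fallback (assumed plain text): one item, e.g. a SWHID, per line.
    return lines
```

Enrichment (resolving origin identifiers to URLs, turning SWHIDs into dicts) would then operate on the decoded data before it is handed to `make_api_response`.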

FTR, in a previous life, I set up JSON Web Token auth validation in Varnish.

> • We can process swh-graph responses to enrich the data (notably get origin urls from their sha1 or turn swhids into dicts) and return them in JSON format

This one is indeed more important than I thought at first. In particular, to avoid leaking URI sha1s, we should indeed address something like T2113 at least at the web app level (and if we do that there, maybe we can avoid doing it altogether in swh-graph, which would be a nice separation of concerns).

However, unless I'm missing something, I think right now origin sha1s are not stored at all in swh-storage, or are they?
If they indeed aren't, a required sub-task of this one is adding sha1s to the origin table, together with an index to do the reverse sha1 -> url, and a matching swh-storage API method.

> However, unless I'm missing something, I think right now origin sha1s are not stored at all in swh-storage, or are they?
> If they indeed aren't, a required sub-task of this one is adding sha1s to the origin table, together with an index to do the reverse sha1 -> url, and a matching swh-storage API method.

You are right, they are not stored in the database, but there is a storage.origin_get_by_sha1 method.
Its performance on the replica database (somerset) is not bad, while it times out on the main one (belvedere) because the two do not have the same computed indices.

Anyway, adding a sha1 column to the origin table is clearly needed for optimal performance.

> You are right, they are not stored in the database, but there is a storage.origin_get_by_sha1 method.

Ah, good point, I completely forgot about that index. It should in theory be enough; we'll see about performance. (In particular, for reverse lookups on graph endpoints we will need to translate batches of swh:1:ori: SWHIDs to URLs, which might be an additional factor here.)
But in terms of expressivity (i.e., having the method to do the SHA1->URL translation in swh-web) it then looks like nothing is missing. Wonderful!
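For reference, the batch SHA1 -> URL translation discussed above can be sketched as follows. It assumes an origin identifier is `swh:1:ori:` followed by the hex SHA1 of the UTF-8 encoded origin URL. The in-memory dictionary is only a stand-in for the real lookup, which would go through swh-storage (e.g. storage.origin_get_by_sha1), and the function names are hypothetical.

```python
import hashlib


def origin_swhid(url):
    """Compute the swh:1:ori: identifier of an origin URL (assumed scheme)."""
    return "swh:1:ori:" + hashlib.sha1(url.encode("utf-8")).hexdigest()


def resolve_origin_swhids(swhids, known_urls):
    """Translate a batch of origin SWHIDs to URLs.

    Unknown identifiers map to None. The index built here stands in for
    a database index on the origin sha1.
    """
    index = {origin_swhid(url): url for url in known_urls}
    return {swhid: index.get(swhid) for swhid in swhids}
```

Building the index once per batch, rather than hashing candidates per identifier, is the in-memory analogue of why an indexed sha1 column matters for performance.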

anlambert changed the task status from Open to Work in Progress. Sep 22 2020, 3:39 PM