
GraphQL APIs for SWH
Open, Normal, Public

Description

REST APIs are great, but at times it is too hard to write clients using only resource-based CRUD operations. This is especially true for a system like SWH, which has a lot of centralized relational data. A GraphQL API layer could be a good addition to make data retrieval easy and efficient.
A graph API lets a client application get exactly the data it requires in a single request, even when the requested attributes belong to different resources from a REST point of view.

Most of the SWH APIs are read-only. This makes the implementation of the graph layer particularly easy: we only have to think about a schema and can mostly forget about mutations.
GraphQL will also reduce the complexity of versioning the REST APIs. This layer can be added without disturbing any of the existing API contracts, and the REST APIs would remain free to evolve independently.

One example use case:
A third-party user is building a data dashboard that shows all the projects newly archived in 2021.
With the current APIs, it takes many round trips to gather the related data needed to populate the client's page:
1: Get all the origins.
2: Get the visits for each origin to identify the archived date. (It might be possible to filter by archived date in the previous request, but I haven't found that in the docs.)
3: Get the metadata for each origin for further information.
This could be a difficult and time-consuming task to code in the client. It will also cause unnecessary load on the server.
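
To make the cost concrete, here is a rough sketch of the three steps above in Python. The client class and its method names are hypothetical stand-ins, not the real SWH REST endpoints, and "newly archived in 2021" is assumed to mean that the earliest visit is dated 2021. The client counts requests instead of doing network I/O:

```python
from typing import Dict, List


class CountingClient:
    """Hypothetical stand-in for a REST client; counts calls, no network I/O."""

    def __init__(self, visits: Dict[str, List[str]], metadata: Dict[str, dict]):
        self._visits = visits      # origin URL -> ISO visit dates
        self._metadata = metadata  # origin URL -> metadata document
        self.requests = 0

    def get_origins(self) -> List[str]:
        self.requests += 1
        return sorted(self._visits)

    def get_visits(self, origin: str) -> List[str]:
        self.requests += 1
        return list(self._visits[origin])

    def get_metadata(self, origin: str) -> dict:
        self.requests += 1
        return self._metadata.get(origin, {})


def newly_archived(client: CountingClient, year: int) -> List[dict]:
    """Collect dashboard rows for origins whose earliest visit falls in `year`."""
    rows = []
    for origin in client.get_origins():        # step 1: one request
        visits = client.get_visits(origin)     # step 2: one request per origin
        if not visits:
            continue
        first_visit = min(visits)  # ISO date strings sort chronologically
        if first_visit.startswith(f"{year}-"):
            rows.append({
                "origin": origin,
                "first_visit": first_visit,
                "metadata": client.get_metadata(origin),  # step 3: one per match
            })
    return rows
```

For N origins this costs 1 + N requests for steps 1 and 2, plus one more request for every origin that matches.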
With a GraphQL query, it would be possible to gather all this related data (depending on the published schema) in a single request.
This would also avoid complex client-side validation (clients get exactly what they asked for) and make JSON unmarshalling easier.
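
As an illustration, the same dashboard data could be requested with a single POST body like the one built below. The schema is an assumption: the `origins`, `visits`, `date` and `metadata` fields and the `first` argument are made up for this sketch, since no schema has been published yet. Only the `{"query": ..., "variables": ...}` JSON envelope is the usual GraphQL-over-HTTP convention.

```python
import json
from typing import Any, Dict, List

# Hypothetical query; every field and argument name is illustrative only.
DASHBOARD_QUERY = """
query NewlyArchived($first: Int!) {
  origins(first: $first) {
    nodes {
      url
      visits(first: 1) { nodes { date } }
      metadata
    }
  }
}
"""


def build_request_body(query: str, variables: Dict[str, Any]) -> str:
    """Serialize a GraphQL request as the JSON body of a single HTTP POST."""
    return json.dumps({"query": query, "variables": variables})


def to_rows(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a response into dashboard rows.

    The response mirrors the shape of the query, so no guessing is needed.
    Assumes the hypothetical schema returns visits oldest first.
    """
    rows = []
    for node in response["data"]["origins"]["nodes"]:
        visit_nodes = node["visits"]["nodes"]
        rows.append({
            "url": node["url"],
            "first_visit": visit_nodes[0]["date"] if visit_nodes else None,
            "metadata": node["metadata"],
        })
    return rows
```

Because the client names every field it wants, the unmarshalling code in `to_rows` only has to handle the keys that appear in the query.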

A few possible implementation ideas are:
1: As a new package in SWH-Web.
A graph endpoint can be exposed in the current Django system. Existing code used by the REST API can be reused to gather data for the graph API.
Major disadvantages are:

  • It will make the already big SWH-Web bigger.
  • The complexity of gathering data lies mostly in the storage project, so the scope for code reuse is actually minimal.

2: As a new service using existing REST APIs.
A new service that in turn calls the existing APIs to gather data. This could be useful in case we have to split the existing APIs into multiple microservices in the future.
Some disadvantages are:

  • This will depend heavily on the existing API. We would be forced to add a REST endpoint before exposing anything in the GraphQL schema.
  • REST would be an unnecessary extra layer for those using the graph API.
  • Both APIs may have to go through the same auth and throttling checks.

3: As a new service with direct calls to other services.
This way, we can have an independent Graph API layer. This layer would be free to call other services or the existing APIs to gather data.
One possible problem is that, going forward, resources and attributes in the REST and graph APIs might diverge a bit.

The third approach seems to be the best given the current code base.
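
A minimal sketch of what the third approach could look like, in plain Python and without any GraphQL library. The `StorageBackend` interface and its method names are hypothetical and are not the real storage API; the point is only that a resolver calls the backing service directly and touches it only for the fields the client asked for.

```python
from typing import Dict, Iterable, List, Protocol


class StorageBackend(Protocol):
    """Hypothetical interface to a backing service (not the real storage API)."""

    def list_origin_urls(self) -> List[str]: ...

    def list_visit_dates(self, origin_url: str) -> List[str]: ...


class InMemoryStorage:
    """Fake backend for illustration; records which calls were made."""

    def __init__(self, visits_by_origin: Dict[str, List[str]]):
        self._visits = dict(visits_by_origin)
        self.calls: List[str] = []

    def list_origin_urls(self) -> List[str]:
        self.calls.append("list_origin_urls")
        return sorted(self._visits)

    def list_visit_dates(self, origin_url: str) -> List[str]:
        self.calls.append(f"list_visit_dates:{origin_url}")
        return list(self._visits[origin_url])


def resolve_origins(storage: StorageBackend, fields: Iterable[str]) -> List[dict]:
    """Resolve the requested fields for every origin, skipping unrequested ones."""
    wanted = set(fields)
    nodes = []
    for url in storage.list_origin_urls():
        node: dict = {}
        if "url" in wanted:
            node["url"] = url
        if "visits" in wanted:
            # Only hit the backend for visits when the client selected them.
            node["visits"] = storage.list_visit_dates(url)
        nodes.append(node)
    return nodes
```

In a real service, a GraphQL library would drive the field selection; the resolver functions would keep roughly this shape.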

Event Timeline

vlorentz triaged this task as Normal priority. Jun 23 2021, 5:20 PM

I stumbled across the GitLab GraphQL API while working on T3442; it could be a great source of inspiration.