Page MenuHomeSoftware Heritage

Graph compression REST API proposal
Needs ReviewPublic

Authored by haltode on Mon, May 13, 9:30 AM.

Details

Reviewers
zack
Group Reviewers
Reviewers

Diff Detail

Repository
rDGCMP Graph compression
Branch
docs-api
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 5810
Build 7957: arc lint + arc unit

Event Timeline

haltode created this revision.Mon, May 13, 9:30 AM
haltode added a subscriber: seirl.Mon, May 13, 9:33 AM

As suggested by @seirl, we should specify the output order as the traversal order.

When you write sha1, you probably mean sha1_git (it's the hash of a dir/revision/... formatted with git's layout, which is what we use as identifiers in SWH, for all nodes of the graph)

I propose a different set of URI paths, more consistent with what we do in swh-storage:

  • /graph/directory/list
  • /graph/revision/<revision>/directory/ls
  • /graph/revision/log
api/docs/api.rst
47

It could be useful to also provide their hash.

62–63

same

78

What do you mean by "revision logs"? revision messages?

zack requested changes to this revision.Mon, May 13, 11:55 AM
zack added a subscriber: zack.

Good first iteration! I've proposed various ways to make this more general, and I'm very open to discuss them.

api/docs/api.rst
2

Minor point: the fact it's compressed is an implementation detail. What we are defining here is a generic API to efficiently run graph algos. I *think* that most of them will be traversal graphs, so this is likely gonna be the "Graph traversal API", and can be further generalized later if need be.

14–16

The notion of dataset will be gone soon: we have it now because we compressed different subgraphs separately, but we really want to compress the entire graph at once (as it's needed for cross-subgraph traversals).

As a result all subgraph-specific aspects of the API should be generalized. A first proposal to do so is to replace all "SRC_to_DST" strings with "/SRC/DST" URL paths, e.g., "dir_to_rev" → "/dir/rev" fixing the implicit notion that the first is the source, the second the destination. They're not strictly speaking hierarchical, but it seems way nicer than ?from=SRC&to=DST.

Not sure if there are better/more idiomatic REST patterns for this. (It might be useful to look around and check other existing REST APIs for traversals in existing graph databases.)

15

Various naming inconsistencies here:

  • "file" should be "content" or "cnt" for short
  • we are mixing short and long version of object types/names, e.g., "dir" and "rev" v. "snapshot" and "release"

cc: @seirl

18

this can be made more hierarchic to reduce duplication of information, e.g.:

counts: {
  nodes: ...;
  edges: ...;
};
indegree: {
  max: ...;
  min: ...;
}

etc.

33

{space,time,spacetime}-recursive are our use cases, but what the API is offering here are just endpoints to visit the graph.
Additionally, the other use cases not listed yet (e.g., provenance lookups) are still visits, they just happen to traverse the graph in the opposite direction.
Last, it seems to me that the only differentiating point between the various visits are the edges that we allow ourselves to cross during the visit. If we make that a parameter of the API, we essentially obtain an API with a single "visit" method, which is quite elegant.

All things considered, how about a method like this:

  • /graph/visit/SRC_TYPE/DST_TYPE[?direction={forward,backward}]

with direction defaulting to forward.

I realized this is able to deal only with the case where we allow ourselves to only traverse one type of edges. It could be generalized by an optional parameter to specify which *additional* edges we allow ourselves to traverse, e.g., as a list of "SRC_TYPE/DST_TYPE" strings.

Once we have that, we might add shortcut methods for common use cases, e.g., visiting rev→dir→{dir,cnt} subgraphs. They will just be syntactical shortcuts that require no ad-hoc implementation.

47

Yep, a list of paths will not do here.
We need a list of node identifiers. A node identifier is currently a pair, <node type, sha1>.

I'm not clear on where/how to include full paths. In general visits they can become very long, potentially up to the diameter of the subgraph explored by the visit. Maybe it should be an optional additional attribute, defaulting to false.

This revision now requires changes to proceed.Mon, May 13, 11:55 AM
vlorentz added inline comments.Mon, May 13, 12:00 PM
api/docs/api.rst
78

(and if it's messages, then it must be bytes instead of strings; so you may want to use msgpack instead of JSON)

Would this be another use case of the node type relationship graph? Or rather, its transitive closure (and the transitive closure of its transposition for the "backwards" case)? It would allow us to express that /release/revision?direction=forwards is doable, but /release/snapshot?direction=forwards isn't, and throw the appropriate errors.

zack added a comment.Mon, May 13, 12:06 PM

Oh, BTW, we should look at GraphQL for inspiration for this API. I'm not entirely sure it's worth it/useful for visits, but we should at least check.

api/docs/api.rst
78

We want to have minimal information returned attached to the nodes here: so by default is should just be the revision identifiers. It should be up to the user to go fetch additional info. And, we can support additional parameters that allows the user to request including in the response additional fields (e.g., log messages in this case).

vlorentz added inline comments.Mon, May 13, 12:08 PM
api/docs/api.rst
78

Okay. So it should say "revision ids" instead of "revision logs"

zack added a comment.Mon, May 13, 12:12 PM
In D1460#32446, @seirl wrote:

Would this be another use case of the node type relationship graph? Or rather, its transitive closure (and the transitive closure of its transposition for the "backwards" case)? It would allow us to express that /release/revision?direction=forwards is doable, but /release/snapshot?direction=forwards isn't, and throw the appropriate errors.

Yes, it is — my generalization idea comes from that graph too. But AFAICT your proposal is on the implementation side and note on the API one, right? I.e., if we have the static info that some edges are forbidden, we can error out instead of trying (that said, would it be worth it?).

haltode added a comment.EditedTue, May 14, 7:37 AM

Thanks for all the great feedbacks!

So as you suggested @zack we could have such API:

GET /graph/visit/src_type/dst_type/[?direction={forward,backward}][?extra_edge="src_type/dst_type"]*

Concerning the Stats operation, right now I am using the files generated from WebGraph so we could either use:

GET /graph/stats/src_type/dst_type

As a mapping instead of the datasets name, and parse the stats file. Or maybe compute our own stats after a graph traversal and generalize with the visit operation:

GET /graph/visit/src_type/dst_type/stats/[?extra_edge="src_type/dst_type"]*

Or have its own endpoint:

GET /graph/stats/src_type/dst_type/[?extra_edge="src_type/dst_type"]*
api/docs/api.rst
47

Yes, by "path" I meant a path of hashes. I only have access to those informations from the graph datasets (.nodes.csv.gz and .edges.csv.gz).

78

Yep, once again I meant to use the revisions identifiers.

zack added a comment.Thu, May 16, 8:51 AM
GET /graph/visit/src_type/dst_type/[?direction={forward,backward}][?extra_edge="src_type/dst_type"]*

This looks god. My only remaining nitpick would be on the parameter name, maybe extra_edges (trailing 's').

Concerning the Stats operation, right now I am using the files generated from WebGraph so we could either use:

GET /graph/stats/src_type/dst_type

I like this one, and I like it way more than adding stats at the end of the visit endpoint (the rational being that the initial part of an endpoint is the method verb and should define its main function).

What remains to be defined is how to specify extra attributes that will be added to the list of returned node identifiers.

In D1460#32835, @zack wrote:
GET /graph/visit/src_type/dst_type/[?direction={forward,backward}][?extra_edge="src_type/dst_type"]*

This looks god. My only remaining nitpick would be on the parameter name, maybe extra_edges (trailing 's').

Here is a revisited version, I also added a starting point parameter for the visit:

GET /graph/visit/src_type/dst_type/src_hash/[?direction={forward,backward}][?extra_edges=["src_type/dst_type",...]]

What remains to be defined is how to specify extra attributes that will be added to the list of returned node identifiers.

On the server side, the only information we have is <node_type, node_id> so adding extra attributes means calling the SWH API anyway. I can think of multiple options here:

  • Let the client do the extra API call
  • Define some higher level API endpoints to return formated results
  • Add additional options to the visit/ endpoint to attach the extra info

I think that option 1 or 2 are best, since the visit/ endpoint already have quite a lot of information to it.

zack added a comment.Sun, May 19, 9:17 AM

Here is a revisited version, I also added a starting point parameter for the visit:

GET /graph/visit/src_type/dst_type/src_hash/[?direction={forward,backward}][?extra_edges=["src_type/dst_type",...]]

Approved.

As a matter of terminology, as it is gonna be a recurrent term, let's use "node identifier" (or "node_id" for short) instead of src_hash.

And, on that matter, given we already need a pair node_type/hash there, I was thinking that we can just use SWH PIDs and be done with it. (We might consider stripping the prefix "swh:1:", but I doubt it's worth the saving, given the cost of divergence from the standard.)

What remains to be defined is how to specify extra attributes that will be added to the list of returned node identifiers.

On the server side, the only information we have is <node_type, node_id> so adding extra attributes means calling the SWH API anyway. I can think of multiple options here:

  • Let the client do the extra API call
  • Define some higher level API endpoints to return formated results
  • Add additional options to the visit/ endpoint to attach the extra info

I think that option 1 or 2 are best, since the visit/ endpoint already have quite a lot of information to it.

After our chat on this point, I agree we should just have the identifiers here and nothing else.

We might consider a separate service to "do the JOINs" from path to other attributes, but it is indeed a separate concern that is not in scope of graph compression.

haltode updated this revision to Diff 4883.Sun, May 19, 4:28 PM

Update API description based on feedbacks

haltode marked 12 inline comments as done.Sun, May 19, 4:29 PM