⚙ D1460 Graph compression REST API proposal

haltode created this revision.May 13 2019, 9:30 AM

Herald added a reviewer: Reviewers. · View Herald TranscriptMay 13 2019, 9:30 AM

Harbormaster completed remote builds in B5686: Diff 4787.May 13 2019, 9:30 AM

As suggested by @seirl, we should specify the output order as the traversal order.

When you write sha1, you probably mean sha1_git (it's the hash of a dir/revision/... formatted with git's layout, which is what we use as identifiers in SWH, for all nodes of the graph)

I propose a different set of URI paths, more consistent with what we do in swh-storage:

/graph/directory/list
/graph/revision/<revision>/directory/ls
/graph/revision/log

api/docs/api.rst
46	It could be useful to also provide their hash.
61–62	same
77	What do you mean by "revision logs"? revision messages?

Good first iteration! I've proposed various ways to make this more general, and I'm very open to discuss them.

api/docs/api.rst
1	Minor point: the fact it's compressed is an implementation detail. What we are defining here is a generic API to efficiently run graph algos. I think that most of them will be traversal graphs, so this is likely gonna be the "Graph traversal API", and can be further generalized later if need be.
13–15	The notion of dataset will be gone soon: we have it now because we compressed different subgraphs separately, but we really want to compress the entire graph at once (as it's needed for cross-subgraph traversals). As a result all subgraph-specific aspects of the API should be generalized. A first proposal to do so is to replace all "SRC_to_DST" strings with "/SRC/DST" URL paths, e.g., "dir_to_rev" → "/dir/rev" fixing the implicit notion that the first is the source, the second the destination. They're not strictly speaking hierarchical, but it seems way nicer than `?from=SRC&to=DST`. Not sure if there are better/more idiomatic REST patterns for this. (It might be useful to look around and check other existing REST APIs for traversals in existing graph databases.)
14	Various naming inconsistencies here: "file" should be "content" or "cnt" for short we are mixing short and long version of object types/names, e.g., "dir" and "rev" v. "snapshot" and "release" cc: @seirl
17	this can be made more hierarchic to reduce duplication of information, e.g.: counts: { nodes: ...; edges: ...; }; indegree: { max: ...; min: ...; } etc.
32	{space,time,spacetime}-recursive are our use cases, but what the API is offering here are just endpoints to visit the graph. Additionally, the other use cases not listed yet (e.g., provenance lookups) are still visits, they just happen to traverse the graph in the opposite direction. Last, it seems to me that the only differentiating point between the various visits are the edges that we allow ourselves to cross during the visit. If we make that a parameter of the API, we essentially obtain an API with a single "visit" method, which is quite elegant. All things considered, how about a method like this: `/graph/visit/SRC_TYPE/DST_TYPE[?direction={forward,backward}]` with `direction` defaulting to `forward`. I realized this is able to deal only with the case where we allow ourselves to only traverse one type of edges. It could be generalized by an optional parameter to specify which additional edges we allow ourselves to traverse, e.g., as a list of "SRC_TYPE/DST_TYPE" strings. Once we have that, we might add shortcut methods for common use cases, e.g., visiting `rev→dir→{dir,cnt}` subgraphs. They will just be syntactical shortcuts that require no ad-hoc implementation.
46	Yep, a list of paths will not do here. We need a list of node identifiers. A node identifier is currently a pair, <node type, sha1>. I'm not clear on where/how to include full paths. In general visits they can become very long, potentially up to the diameter of the subgraph explored by the visit. Maybe it should be an optional additional attribute, defaulting to false.

This revision now requires changes to proceed.May 13 2019, 11:55 AM

vlorentz added inline comments.May 13 2019, 12:00 PM

api/docs/api.rst
77	(and if it's messages, then it must be bytes instead of strings; so you may want to use msgpack instead of JSON)

Would this be another use case of the node type relationship graph? Or rather, its transitive closure (and the transitive closure of its transposition for the "backwards" case)? It would allow us to express that /release/revision?direction=forwards is doable, but /release/snapshot?direction=forwards isn't, and throw the appropriate errors.

Oh, BTW, we should look at GraphQL for inspiration for this API. I'm not entirely sure it's worth it/useful for visits, but we should at least check.

api/docs/api.rst
77	We want to have minimal information returned attached to the nodes here: so by default is should just be the revision identifiers. It should be up to the user to go fetch additional info. And, we can support additional parameters that allows the user to request including in the response additional fields (e.g., log messages in this case).

vlorentz added inline comments.May 13 2019, 12:08 PM

api/docs/api.rst
77	Okay. So it should say "revision ids" instead of "revision logs"

In D1460#32446, @seirl wrote:

Would this be another use case of the node type relationship graph? Or rather, its transitive closure (and the transitive closure of its transposition for the "backwards" case)? It would allow us to express that /release/revision?direction=forwards is doable, but /release/snapshot?direction=forwards isn't, and throw the appropriate errors.

Yes, it is — my generalization idea comes from that graph too. But AFAICT your proposal is on the implementation side and note on the API one, right? I.e., if we have the static info that some edges are forbidden, we can error out instead of trying (that said, would it be worth it?).

Thanks for all the great feedbacks!

So as you suggested @zack we could have such API:

GET /graph/visit/src_type/dst_type/[?direction={forward,backward}][?extra_edge="src_type/dst_type"]*

Concerning the Stats operation, right now I am using the files generated from WebGraph so we could either use:

GET /graph/stats/src_type/dst_type

As a mapping instead of the datasets name, and parse the stats file. Or maybe compute our own stats after a graph traversal and generalize with the visit operation:

GET /graph/visit/src_type/dst_type/stats/[?extra_edge="src_type/dst_type"]*

Or have its own endpoint:

GET /graph/stats/src_type/dst_type/[?extra_edge="src_type/dst_type"]*

api/docs/api.rst
46	Yes, by "path" I meant a path of hashes. I only have access to those informations from the graph datasets (.nodes.csv.gz and .edges.csv.gz).
77	Yep, once again I meant to use the revisions identifiers.

In D1460#32488, @haltode wrote:

GET /graph/visit/src_type/dst_type/[?direction={forward,backward}][?extra_edge="src_type/dst_type"]*

This looks god. My only remaining nitpick would be on the parameter name, maybe extra_edges (trailing 's').

Concerning the Stats operation, right now I am using the files generated from WebGraph so we could either use:
GET /graph/stats/src_type/dst_type

I like this one, and I like it way more than adding stats at the end of the visit endpoint (the rational being that the initial part of an endpoint is the method verb and should define its main function).

What remains to be defined is how to specify extra attributes that will be added to the list of returned node identifiers.

In D1460#32835, @zack wrote:
In D1460#32488, @haltode wrote:
GET /graph/visit/src_type/dst_type/[?direction={forward,backward}][?extra_edge="src_type/dst_type"]*
This looks god. My only remaining nitpick would be on the parameter name, maybe extra_edges (trailing 's').

Here is a revisited version, I also added a starting point parameter for the visit:

GET /graph/visit/src_type/dst_type/src_hash/[?direction={forward,backward}][?extra_edges=["src_type/dst_type",...]]

What remains to be defined is how to specify extra attributes that will be added to the list of returned node identifiers.

On the server side, the only information we have is <node_type, node_id> so adding extra attributes means calling the SWH API anyway. I can think of multiple options here:

Let the client do the extra API call
Define some higher level API endpoints to return formated results
Add additional options to the visit/ endpoint to attach the extra info

I think that option 1 or 2 are best, since the visit/ endpoint already have quite a lot of information to it.

In D1460#33101, @haltode wrote:
Here is a revisited version, I also added a starting point parameter for the visit:
GET /graph/visit/src_type/dst_type/src_hash/[?direction={forward,backward}][?extra_edges=["src_type/dst_type",...]]

Approved.

As a matter of terminology, as it is gonna be a recurrent term, let's use "node identifier" (or "node_id" for short) instead of src_hash.

And, on that matter, given we already need a pair node_type/hash there, I was thinking that we can just use SWH PIDs and be done with it. (We might consider stripping the prefix "swh:1:", but I doubt it's worth the saving, given the cost of divergence from the standard.)

What remains to be defined is how to specify extra attributes that will be added to the list of returned node identifiers.

On the server side, the only information we have is <node_type, node_id> so adding extra attributes means calling the SWH API anyway. I can think of multiple options here:

Let the client do the extra API call

Define some higher level API endpoints to return formated results

Add additional options to the visit/ endpoint to attach the extra info

I think that option 1 or 2 are best, since the visit/ endpoint already have quite a lot of information to it.

After our chat on this point, I agree we should just have the identifiers here and nothing else.

We might consider a separate service to "do the JOINs" from path to other attributes, but it is indeed a separate concern that is not in scope of graph compression.

Update API description based on feedbacks

Harbormaster completed remote builds in B5810: Diff 4883.May 19 2019, 4:28 PM

haltode marked 12 inline comments as done.May 19 2019, 4:29 PM

I think that the src_type/dst_type in both the URL and extra_edges is a bit redundant. We could refactor the visit function into the following endpoint:

GET /graph/visit/swh_id/[?allowed_edges=["src_type/dst_type",...]][?direction={forward,backward}]

Where by default we can explore the graph following all types of edges, and restrict it if necessary.

In D1460#33390, @haltode wrote:
I think that the src_type/dst_type in both the URL and extra_edges is a bit redundant. We could refactor the visit function into the following endpoint:
GET /graph/visit/swh_id/[?allowed_edges=["src_type/dst_type",...]][?direction={forward,backward}]
Where by default we can explore the graph following all types of edges, and restrict it if necessary.

This is in fact totally along the lines of what I mentioned last week in chat, but didn't have the time to write properly; so thanks for doing so :-)
My main reason for linking this more than the previous approach is that "visit this node" is really the main intent that one has, everything else (including the edges you're allowed to cross) is a "detail".

Some other thoughts that crossed my mind:

given we no longer have the constraint of putting the "main" type of edges before the query parameters, we can use ":" as separator for from/to edges, I think it'd be nicer on the eye than "/"
s/allowed_edges/edges/, maybe?
if edges are not specified, we should follow *all* edges during the visit
in addition to direction, we can also have a type of visit, which could take "dfs", "bfs", defaulting to "dfs"

zack requested changes to this revision.May 22 2019, 9:15 AM

This revision now requires changes to proceed.May 22 2019, 9:15 AM

In D1460#33391, @zack wrote:

Some other thoughts that crossed my mind:

given we no longer have the constraint of putting the "main" type of edges before the query parameters, we can use ":" as separator for from/to edges, I think it'd be nicer on the eye than "/"

s/allowed_edges/edges/, maybe?

Sure.

if edges are not specified, we should follow *all* edges during the visit

Yes, this is what I meant here: "Where by default we can explore the graph following all types of edges"

in addition to direction, we can also have a type of visit, which could take "dfs", "bfs", defaulting to "dfs"

I chose only DFS because the memory overhead of the BFS on the entire graph might be too much if we already have the compressed graph in memory.

In D1460#33397, @haltode wrote:

if edges are not specified, we should follow *all* edges during the visit

Yes, this is what I meant here: "Where by default we can explore the graph following all types of edges"

Ah, sorry, I overlooked that. (So: yes, full agreement !)

in addition to direction, we can also have a type of visit, which could take "dfs", "bfs", defaulting to "dfs"

I chose only DFS because the memory overhead of the BFS on the entire graph might be too much if we already have the compressed graph in memory.

Let's separate two concerns here: the API spec from what is doable behind the scenes. If that's the "only" issue, let's put the parameter here, with a single value allowed which is also the default. Once doing more will become doable, we'll add the other option.

Factor 'visit' endpoint
Add 'traversal' query string (dfs or bfs)

Harbormaster completed remote builds in B5868: Diff 4927.May 22 2019, 10:09 AM

Remove 'bfs' as an option for 'traversal'

Harbormaster completed remote builds in B5869: Diff 4928.May 22 2019, 10:11 AM

zack requested changes to this revision.May 30 2019, 10:33 AM

zack added inline comments.

api/docs/api.rst
13	minor: this deserve an hyperlink to the PID doc
16	I don't see how "colon-separated list of node types" is expressive enough. We need to specify a list of edge type, and each edge type is a pair of node types. So I think this should be something like "comma-separated list of src:dst strings", otherwise you can only specify one path type (possibly crossing multiple types of edges), but no more than that.

This revision now requires changes to proceed.May 30 2019, 10:33 AM

Add SWH id documentation link
Rephrase edge type description

Harbormaster completed remote builds in B5982: Diff 5031.May 31 2019, 7:32 AM

haltode marked 2 inline comments as done.May 31 2019, 7:32 AM

Rebasing on master.

Harbormaster completed remote builds in B5983: Diff 5032.May 31 2019, 8:16 AM

zack accepted this revision.Jun 4 2019, 2:07 PM

This revision is now accepted and ready to land.Jun 4 2019, 2:07 PM

Closed by commit rDGRPH384312b79b40: api: docs: add api.rst (authored by haltode). · Explain WhyJun 4 2019, 2:43 PM

This revision was automatically updated to reflect the committed changes.

Graph compression REST API proposal
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 4883

api/docs/api.rst

Graph compression REST API proposalClosedPublicActions