Page MenuHomeSoftware Heritage

Graph compression REST API proposal
ClosedPublic

Authored by haltode on May 13 2019, 9:30 AM.

Details

Reviewers
zack
Group Reviewers
Reviewers
Commits
rDGRPH384312b79b40: api: docs: add api.rst

Diff Detail

Repository
rDGRPH Graph service
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

haltode created this revision.May 13 2019, 9:30 AM
haltode added a subscriber: seirl.May 13 2019, 9:33 AM

As suggested by @seirl, we should specify the output order as the traversal order.

When you write sha1, you probably mean sha1_git (it's the hash of a dir/revision/... formatted with git's layout, which is what we use as identifiers in SWH, for all nodes of the graph)

I propose a different set of URI paths, more consistent with what we do in swh-storage:

  • /graph/directory/list
  • /graph/revision/<revision>/directory/ls
  • /graph/revision/log
api/docs/api.rst
47

It could be useful to also provide their hash.

62–63

same

78

What do you mean by "revision logs"? revision messages?

zack requested changes to this revision.May 13 2019, 11:55 AM
zack added a subscriber: zack.

Good first iteration! I've proposed various ways to make this more general, and I'm very open to discuss them.

api/docs/api.rst
2

Minor point: the fact it's compressed is an implementation detail. What we are defining here is a generic API to efficiently run graph algos. I *think* that most of them will be traversal graphs, so this is likely gonna be the "Graph traversal API", and can be further generalized later if need be.

14–16

The notion of dataset will be gone soon: we have it now because we compressed different subgraphs separately, but we really want to compress the entire graph at once (as it's needed for cross-subgraph traversals).

As a result all subgraph-specific aspects of the API should be generalized. A first proposal to do so is to replace all "SRC_to_DST" strings with "/SRC/DST" URL paths, e.g., "dir_to_rev" → "/dir/rev" fixing the implicit notion that the first is the source, the second the destination. They're not strictly speaking hierarchical, but it seems way nicer than ?from=SRC&to=DST.

Not sure if there are better/more idiomatic REST patterns for this. (It might be useful to look around and check other existing REST APIs for traversals in existing graph databases.)

15

Various naming inconsistencies here:

  • "file" should be "content" or "cnt" for short
  • we are mixing short and long version of object types/names, e.g., "dir" and "rev" v. "snapshot" and "release"

cc: @seirl

18

this can be made more hierarchic to reduce duplication of information, e.g.:

counts: {
  nodes: ...;
  edges: ...;
};
indegree: {
  max: ...;
  min: ...;
}

etc.

33

{space,time,spacetime}-recursive are our use cases, but what the API is offering here are just endpoints to visit the graph.
Additionally, the other use cases not listed yet (e.g., provenance lookups) are still visits, they just happen to traverse the graph in the opposite direction.
Last, it seems to me that the only differentiating point between the various visits are the edges that we allow ourselves to cross during the visit. If we make that a parameter of the API, we essentially obtain an API with a single "visit" method, which is quite elegant.

All things considered, how about a method like this:

  • /graph/visit/SRC_TYPE/DST_TYPE[?direction={forward,backward}]

with direction defaulting to forward.

I realized this is able to deal only with the case where we allow ourselves to only traverse one type of edges. It could be generalized by an optional parameter to specify which *additional* edges we allow ourselves to traverse, e.g., as a list of "SRC_TYPE/DST_TYPE" strings.

Once we have that, we might add shortcut methods for common use cases, e.g., visiting rev→dir→{dir,cnt} subgraphs. They will just be syntactical shortcuts that require no ad-hoc implementation.

47

Yep, a list of paths will not do here.
We need a list of node identifiers. A node identifier is currently a pair, <node type, sha1>.

I'm not clear on where/how to include full paths. In general visits they can become very long, potentially up to the diameter of the subgraph explored by the visit. Maybe it should be an optional additional attribute, defaulting to false.

This revision now requires changes to proceed.May 13 2019, 11:55 AM
vlorentz added inline comments.May 13 2019, 12:00 PM
api/docs/api.rst
78

(and if it's messages, then it must be bytes instead of strings; so you may want to use msgpack instead of JSON)

Would this be another use case of the node type relationship graph? Or rather, its transitive closure (and the transitive closure of its transposition for the "backwards" case)? It would allow us to express that /release/revision?direction=forwards is doable, but /release/snapshot?direction=forwards isn't, and throw the appropriate errors.

zack added a comment.May 13 2019, 12:06 PM

Oh, BTW, we should look at GraphQL for inspiration for this API. I'm not entirely sure it's worth it/useful for visits, but we should at least check.

api/docs/api.rst
78

We want to have minimal information returned attached to the nodes here: so by default is should just be the revision identifiers. It should be up to the user to go fetch additional info. And, we can support additional parameters that allows the user to request including in the response additional fields (e.g., log messages in this case).

vlorentz added inline comments.May 13 2019, 12:08 PM
api/docs/api.rst
78

Okay. So it should say "revision ids" instead of "revision logs"

zack added a comment.May 13 2019, 12:12 PM
In D1460#32446, @seirl wrote:

Would this be another use case of the node type relationship graph? Or rather, its transitive closure (and the transitive closure of its transposition for the "backwards" case)? It would allow us to express that /release/revision?direction=forwards is doable, but /release/snapshot?direction=forwards isn't, and throw the appropriate errors.

Yes, it is — my generalization idea comes from that graph too. But AFAICT your proposal is on the implementation side and note on the API one, right? I.e., if we have the static info that some edges are forbidden, we can error out instead of trying (that said, would it be worth it?).

haltode added a comment.EditedMay 14 2019, 7:37 AM

Thanks for all the great feedbacks!

So as you suggested @zack we could have such API:

GET /graph/visit/src_type/dst_type/[?direction={forward,backward}][?extra_edge="src_type/dst_type"]*

Concerning the Stats operation, right now I am using the files generated from WebGraph so we could either use:

GET /graph/stats/src_type/dst_type

As a mapping instead of the datasets name, and parse the stats file. Or maybe compute our own stats after a graph traversal and generalize with the visit operation:

GET /graph/visit/src_type/dst_type/stats/[?extra_edge="src_type/dst_type"]*

Or have its own endpoint:

GET /graph/stats/src_type/dst_type/[?extra_edge="src_type/dst_type"]*
api/docs/api.rst
47

Yes, by "path" I meant a path of hashes. I only have access to those informations from the graph datasets (.nodes.csv.gz and .edges.csv.gz).

78

Yep, once again I meant to use the revisions identifiers.

zack added a comment.May 16 2019, 8:51 AM
GET /graph/visit/src_type/dst_type/[?direction={forward,backward}][?extra_edge="src_type/dst_type"]*

This looks god. My only remaining nitpick would be on the parameter name, maybe extra_edges (trailing 's').

Concerning the Stats operation, right now I am using the files generated from WebGraph so we could either use:

GET /graph/stats/src_type/dst_type

I like this one, and I like it way more than adding stats at the end of the visit endpoint (the rational being that the initial part of an endpoint is the method verb and should define its main function).

What remains to be defined is how to specify extra attributes that will be added to the list of returned node identifiers.

In D1460#32835, @zack wrote:
GET /graph/visit/src_type/dst_type/[?direction={forward,backward}][?extra_edge="src_type/dst_type"]*

This looks god. My only remaining nitpick would be on the parameter name, maybe extra_edges (trailing 's').

Here is a revisited version, I also added a starting point parameter for the visit:

GET /graph/visit/src_type/dst_type/src_hash/[?direction={forward,backward}][?extra_edges=["src_type/dst_type",...]]

What remains to be defined is how to specify extra attributes that will be added to the list of returned node identifiers.

On the server side, the only information we have is <node_type, node_id> so adding extra attributes means calling the SWH API anyway. I can think of multiple options here:

  • Let the client do the extra API call
  • Define some higher level API endpoints to return formated results
  • Add additional options to the visit/ endpoint to attach the extra info

I think that option 1 or 2 are best, since the visit/ endpoint already have quite a lot of information to it.

zack added a comment.May 19 2019, 9:17 AM

Here is a revisited version, I also added a starting point parameter for the visit:

GET /graph/visit/src_type/dst_type/src_hash/[?direction={forward,backward}][?extra_edges=["src_type/dst_type",...]]

Approved.

As a matter of terminology, as it is gonna be a recurrent term, let's use "node identifier" (or "node_id" for short) instead of src_hash.

And, on that matter, given we already need a pair node_type/hash there, I was thinking that we can just use SWH PIDs and be done with it. (We might consider stripping the prefix "swh:1:", but I doubt it's worth the saving, given the cost of divergence from the standard.)

What remains to be defined is how to specify extra attributes that will be added to the list of returned node identifiers.

On the server side, the only information we have is <node_type, node_id> so adding extra attributes means calling the SWH API anyway. I can think of multiple options here:

  • Let the client do the extra API call
  • Define some higher level API endpoints to return formated results
  • Add additional options to the visit/ endpoint to attach the extra info

I think that option 1 or 2 are best, since the visit/ endpoint already have quite a lot of information to it.

After our chat on this point, I agree we should just have the identifiers here and nothing else.

We might consider a separate service to "do the JOINs" from path to other attributes, but it is indeed a separate concern that is not in scope of graph compression.

haltode updated this revision to Diff 4883.May 19 2019, 4:28 PM

Update API description based on feedbacks

haltode marked 12 inline comments as done.May 19 2019, 4:29 PM
haltode added a comment.EditedMay 22 2019, 6:31 AM

I think that the src_type/dst_type in both the URL and extra_edges is a bit redundant. We could refactor the visit function into the following endpoint:

GET /graph/visit/swh_id/[?allowed_edges=["src_type/dst_type",...]][?direction={forward,backward}]

Where by default we can explore the graph following all types of edges, and restrict it if necessary.

zack added a comment.May 22 2019, 9:14 AM

I think that the src_type/dst_type in both the URL and extra_edges is a bit redundant. We could refactor the visit function into the following endpoint:

GET /graph/visit/swh_id/[?allowed_edges=["src_type/dst_type",...]][?direction={forward,backward}]

Where by default we can explore the graph following all types of edges, and restrict it if necessary.

This is in fact totally along the lines of what I mentioned last week in chat, but didn't have the time to write properly; so thanks for doing so :-)
My main reason for linking this more than the previous approach is that "visit this node" is really the main intent that one has, everything else (including the edges you're allowed to cross) is a "detail".

Some other thoughts that crossed my mind:

  • given we no longer have the constraint of putting the "main" type of edges before the query parameters, we can use ":" as separator for from/to edges, I think it'd be nicer on the eye than "/"
  • s/allowed_edges/edges/, maybe?
  • if edges are not specified, we should follow *all* edges during the visit
  • in addition to direction, we can also have a type of visit, which could take "dfs", "bfs", defaulting to "dfs"
zack requested changes to this revision.May 22 2019, 9:15 AM
This revision now requires changes to proceed.May 22 2019, 9:15 AM
In D1460#33391, @zack wrote:

Some other thoughts that crossed my mind:

  • given we no longer have the constraint of putting the "main" type of edges before the query parameters, we can use ":" as separator for from/to edges, I think it'd be nicer on the eye than "/"
  • s/allowed_edges/edges/, maybe?

Sure.

  • if edges are not specified, we should follow *all* edges during the visit

Yes, this is what I meant here: "Where by default we can explore the graph following all types of edges"

  • in addition to direction, we can also have a type of visit, which could take "dfs", "bfs", defaulting to "dfs"

I chose only DFS because the memory overhead of the BFS on the entire graph might be too much if we already have the compressed graph in memory.

zack added a comment.May 22 2019, 9:50 AM
  • if edges are not specified, we should follow *all* edges during the visit

Yes, this is what I meant here: "Where by default we can explore the graph following all types of edges"

Ah, sorry, I overlooked that. (So: yes, full agreement !)

  • in addition to direction, we can also have a type of visit, which could take "dfs", "bfs", defaulting to "dfs"

I chose only DFS because the memory overhead of the BFS on the entire graph might be too much if we already have the compressed graph in memory.

Let's separate two concerns here: the API spec from what is doable behind the scenes. If that's the "only" issue, let's put the parameter here, with a single value allowed which is also the default. Once doing more will become doable, we'll add the other option.

haltode updated this revision to Diff 4927.May 22 2019, 10:09 AM
  • Factor 'visit' endpoint
  • Add 'traversal' query string (dfs or bfs)
haltode updated this revision to Diff 4928.May 22 2019, 10:11 AM

Remove 'bfs' as an option for 'traversal'

zack requested changes to this revision.May 30 2019, 10:33 AM
zack added inline comments.
api/docs/api.rst
14

minor: this deserve an hyperlink to the PID doc

17

I don't see how "colon-separated list of node types" is expressive enough. We need to specify a list of edge type, and each edge type is a pair of node types. So I think this should be something like "comma-separated list of src:dst strings", otherwise you can only specify one *path* type (possibly crossing multiple types of edges), but no more than that.

This revision now requires changes to proceed.May 30 2019, 10:33 AM
haltode updated this revision to Diff 5031.May 31 2019, 7:32 AM
  • Add SWH id documentation link
  • Rephrase edge type description
haltode marked 2 inline comments as done.May 31 2019, 7:32 AM
haltode updated this revision to Diff 5032.May 31 2019, 8:16 AM

Rebasing on master.

zack accepted this revision.Jun 4 2019, 2:07 PM
This revision is now accepted and ready to land.Jun 4 2019, 2:07 PM
This revision was automatically updated to reflect the committed changes.