Page MenuHomeSoftware Heritage

api: docs: new graph API (path/ and explore/ endpoints)
ClosedPublic

Authored by haltode on Jun 21 2019, 8:51 AM.

Details

Reviewers
zack
Group Reviewers
Reviewers
Summary

This API is both more general and flexible than previous one.

The visit/ endpoint was re-worked into the explore/ endpoint:

  • Removing traversal option since we care about *all* paths (cf path/ endpoint, here a BFS does not carry out additional informations)
  • Add a 'nodes' field in the JSON output, we do not necessarily always want full paths.
  • Add a 'destination' in the URL that can either be a specific node id, or a node type.

A new path/ endpoint was added to account for single path information. The
traversal option with the BFS is used here because it has real meanings:

  • The BFS can be used to get the shortest path between nodes.
  • DFS/BFS can be chosen depending on the subgraph topology to explore it more effectively.

Diff Detail

Repository
rDGRPH Compressed graph representation
Branch
new-visit-api
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 6488
Build 9032: arc lint + arc unit

Event Timeline

zack added a subscriber: seirl.

Thanks for this updated version, we're definitely refining various use cases here.

I've added some comments and requested changes, but I'd also love to have @seirl's feedback before you rework this.

api/docs/api.rst
71–72

i propose "Walk" for this endpoint

79–80

As we have to state this multiple time, it should be factorized. Let's add a preliminary section, before the endpoints, where we clarify how to specify:

  1. individual graph nodes → using SWH PIDs
  2. node types → using the 3-letter specifiers
  3. edge types → using the syntax we defined

note that (2) and (3) should also cater for the * qualifier, which I think got lost in this version of the API even if we discussed it

81–82

see suggested refactoring above

82

this should be something like "the traversal will stop at the first node encountered which match the desired destination" (we should also explicitly state that the given node will be included in the returned node/path list)

85

to be refactored, as noted above

98

How about allowing the user specify if they want paths/nodes/both in the output?

The default should be having only nodes, and it should allow for a much more efficient implementation if we can skip paths.

102

this key looks redundant, we can just return the list without an extra dictionary, no?

(OTOH this makes me wonder whether we wont a different kind of wrapping, like always returning a dictionary with a result field for the main result, as that will leave room for returning out-of-band information. To be checked against best practices for JSON-based API and compared with the current SWH API.)

111–112

I think this one should remain "Visit", as it's really the right terminology for this in graph theory.

119–121

see suggested refactoring above

122–124

I'm not sure I see a use case for this one.

If we use the "type" based destination, it's not really more expressive than using edge restrictions (aside from the possibility of stopping at the first encountered node of a given type, but that seems more like a odd side-effect than desired behavior).

If we use the "node identity" based destination, than it is more expressive than only having edge restrictions, but I still don't see a use case for listing all paths to a given node. Usually when you want that you're just traversing in the other direction starting from that node.

Bottom line: unless I'm missing some relevant use case, I'd just remove the dst parameter from here.

zack requested changes to this revision.Jun 23 2019, 4:39 PM
This revision now requires changes to proceed.Jun 23 2019, 4:39 PM
api/docs/api.rst
122–124

Oh, something else in passing about this one, which is not really about the API, but rather its upcoming implementation: for the node-to-node requests, the most relevant implementation is complete, incremental visit with backtracking; whereas for the node-to-node_type requests the most relevant implementation is probably a Bernoulli random walk with backtracking.

api/docs/api.rst
79–80

Yes the API was definitely intended to work with the '*' semantic, this was an oversight.

98

How about this (instead of a query string):

GET /graph/visit/:src -> returns both
GET /graph/visit/nodes/:src -> returns only nodes
GET /graph/visit/paths/:src -> returns only paths

102

From what I read in the rest of SWH API, removing the unnecessary path key seems the way to go.

122–124

I initially added the dst parameter for homogeneity in the two endpoints, but indeed there is no particular use case for us.

  • Add semantics section with examples
  • Rename endpoints to walk/ and visit/
  • Update JSON output style for walk/ endpoint
zack requested changes to this revision.Jun 25 2019, 7:53 PM

Looks great ! I've noted down a bunch of requested changes, but they really only about style. Nothing that will get in the way of actually implementing any of this.

api/docs/api.rst
3–6

This section now looks good, except a couple minor points about weird wording, this is one of them: s/Semantics/Terminology/ and something like "This API uses the following notions:" as introductory phrase.

8–9

if you want to highlight the node here, it should be something like: "Node: a node in the Software Heritage graph, identified by a persistent identifier", where you can hyperlink "SWH graph" to the data model page in the doc, and "persistent identifier" to the PID one.

It's also worth introducing the "[SWH] PID" acronym here.

Alternatively you can highlight the identifier itself, but that looks less clean.

19–27

each of this should state first what it is w.r.t. the previous section, e.g.:

  • ``swh:1:cnt:...'' the identifier of a node of type content containing ...
  • `"dir:dir,dir:cnt"` node types allowing edges from directories to directories, and …
  • etc.
35

s/ending point included/final destination node included/

37

let's be more explicit: "starting node specified as a SWH PID"

38–39

again, more explicit: "specification of the desired destination node, either as a node identifier or as a node type"

43–44

traversal algorithm; can be either `dfs` or …

45–46

direction in which graph edges will be followed; can be either `forward` or …

48

s/no error success/

49

shorter: "invalid query string provided"

50

shorter: "starting node cannot be found"

79–80

this can now be just "SWH PID", with the new terminology section

80–82

for consistency with previous suggestion: ""specification of the desired destination node, either as a node identifier or as a node type""

86–89

as above:
200: success
400: invalid query string provided
404: starting node cannot be found

98

This is pretty very elegant, I'm sold !

This revision now requires changes to proceed.Jun 25 2019, 7:53 PM
  • Better wording and explanations
  • Add visit sub-endpoints (/visit/nodes and /visit/paths)
  • Remove 'dst' attribute in /visit/ endpoint
This revision is now accepted and ready to land.Jun 26 2019, 1:12 PM