Changeset View
Standalone View
api/docs/api.rst
- This file was added.
Graph traversal API | |||||
zack: Minor point: the fact it's compressed is an implementation detail. What we are defining here is… | |||||
=================== | |||||
Visit | |||||
----- | |||||
.. http:get:: /graph/visit/:swh_id | |||||
Performs a graph traversal and returns explored paths (in the order of the | |||||
traversal). | |||||
:param string swh_id: starting point for the traversal, represented as a | |||||
`Software Heritage persitent identifier | |||||
Done Inline Actionsminor: this deserve an hyperlink to the PID doc zack: minor: this deserve an hyperlink to the PID doc | |||||
<https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers>`_ | |||||
Done Inline ActionsVarious naming inconsistencies here:
cc: @seirl zack: Various naming inconsistencies here:
- "file" should be "content" or "cnt" for short
- we are… | |||||
Done Inline ActionsThe notion of dataset will be gone soon: we have it now because we compressed different subgraphs separately, but we really want to compress the entire graph at once (as it's needed for cross-subgraph traversals). As a result all subgraph-specific aspects of the API should be generalized. A first proposal to do so is to replace all "SRC_to_DST" strings with "/SRC/DST" URL paths, e.g., "dir_to_rev" → "/dir/rev" fixing the implicit notion that the first is the source, the second the destination. They're not strictly speaking hierarchical, but it seems way nicer than ?from=SRC&to=DST. Not sure if there are better/more idiomatic REST patterns for this. (It might be useful to look around and check other existing REST APIs for traversals in existing graph databases.) zack: The notion of dataset will be gone soon: we have it now because we compressed different… | |||||
:query string edges: edges types the traversal can follow, expressed as a | |||||
Done Inline ActionsI don't see how "colon-separated list of node types" is expressive enough. We need to specify a list of edge type, and each edge type is a pair of node types. So I think this should be something like "comma-separated list of src:dst strings", otherwise you can only specify one *path* type (possibly crossing multiple types of edges), but no more than that. zack: I don't see how "colon-separated list of node types" is expressive enough. We need to specify a… | |||||
comma-separated list of ``src:dst`` strings (e.g.: ``"rev:rev,rev:dir"``). | |||||
Done Inline Actionsthis can be made more hierarchic to reduce duplication of information, e.g.: counts: { nodes: ...; edges: ...; }; indegree: { max: ...; min: ...; } etc. zack: this can be made more hierarchic to reduce duplication of information, e.g.:
```
counts: {… | |||||
If no string is provided, the traversal will use all types of edges. | |||||
:query string traversal: default to ``dfs`` | |||||
:query string direction: can be either ``forward`` or ``backward``, default | |||||
to ``forward`` | |||||
:statuscode 200: no error | |||||
:statuscode 400: an invalid query string has been provided | |||||
:statuscode 404: requested starting node cannot be found in the graph | |||||
.. sourcecode:: http | |||||
HTTP/1.1 200 OK | |||||
Content-Type: application/json | |||||
{ | |||||
Done Inline Actions{space,time,spacetime}-recursive are our use cases, but what the API is offering here are just endpoints to visit the graph. All things considered, how about a method like this:
with direction defaulting to forward. I realized this is able to deal only with the case where we allow ourselves to only traverse one type of edges. It could be generalized by an optional parameter to specify which *additional* edges we allow ourselves to traverse, e.g., as a list of "SRC_TYPE/DST_TYPE" strings. Once we have that, we might add shortcut methods for common use cases, e.g., visiting rev→dir→{dir,cnt} subgraphs. They will just be syntactical shortcuts that require no ad-hoc implementation. zack: {space,time,spacetime}-recursive are our use cases, but what the API is offering here are just… | |||||
"paths": [ | |||||
[ | |||||
"swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2", | |||||
... | |||||
], | |||||
[ | |||||
"swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505", | |||||
... | |||||
], | |||||
... | |||||
] | |||||
} | |||||
Stats | |||||
Done Inline ActionsIt could be useful to also provide their hash. vlorentz: It could be useful to also provide their hash. | |||||
Done Inline ActionsYep, a list of paths will not do here. I'm not clear on where/how to include full paths. In general visits they can become very long, potentially up to the diameter of the subgraph explored by the visit. Maybe it should be an optional additional attribute, defaulting to false. zack: Yep, a list of paths will not do here.
We need a list of node identifiers. A node identifier is… | |||||
Done Inline ActionsYes, by "path" I meant a path of hashes. I only have access to those informations from the graph datasets (.nodes.csv.gz and .edges.csv.gz). haltode: Yes, by "path" I meant a path of hashes. I only have access to those informations from the… | |||||
----- | |||||
.. http:get:: /graph/stats/:src_type/:dst_type | |||||
Returns statistics on the compressed graph. | |||||
:param string src_type: can either be ``dir``, ``ori``, ``rel``, ``rev``, or | |||||
``snp`` | |||||
:param string dst_type: can either be ``cnt``, ``dir``, ``obj``, ``rev``, or | |||||
``snp`` | |||||
:statuscode 200: no error | |||||
:statuscode 404: requested subgraph cannot be found | |||||
.. sourcecode:: http | |||||
Done Inline Actionssame vlorentz: same | |||||
HTTP/1.1 200 OK | |||||
Content-Type: application/json | |||||
{ | |||||
"counts": { | |||||
"nodes": 16222788, | |||||
"edges": 9907464 | |||||
}, | |||||
"ratios": { | |||||
"compression": 0.367, | |||||
"bits_per_node": 5.846, | |||||
"bits_per_edge": 9.573, | |||||
"avg_locality": 270.369 | |||||
}, | |||||
"indegree": { | |||||
Done Inline ActionsWhat do you mean by "revision logs"? revision messages? vlorentz: What do you mean by "revision logs"? revision messages? | |||||
Done Inline Actions(and if it's messages, then it must be bytes instead of strings; so you may want to use msgpack instead of JSON) vlorentz: (and if it's messages, then it must be bytes instead of strings; so you may want to use msgpack… | |||||
Done Inline ActionsWe want to have minimal information returned attached to the nodes here: so by default is should just be the revision identifiers. It should be up to the user to go fetch additional info. And, we can support additional parameters that allows the user to request including in the response additional fields (e.g., log messages in this case). zack: We want to have minimal information returned attached to the nodes here: so by default is… | |||||
Done Inline ActionsOkay. So it should say "revision ids" instead of "revision logs" vlorentz: Okay. So it should say "revision ids" instead of "revision logs" | |||||
Done Inline ActionsYep, once again I meant to use the revisions identifiers. haltode: Yep, once again I meant to use the revisions identifiers. | |||||
"min": 0, | |||||
"max": 12382, | |||||
"avg": 0.6107127825377487 | |||||
}, | |||||
"outdegree": { | |||||
"min": 0, | |||||
"max": 1, | |||||
"avg": 0.6107127825377487 | |||||
} | |||||
} |
Minor point: the fact it's compressed is an implementation detail. What we are defining here is a generic API to efficiently run graph algos. I *think* that most of them will be traversal graphs, so this is likely gonna be the "Graph traversal API", and can be further generalized later if need be.