Conventions
===========
- **Node identification**: in the following, nodes are always identified by
their Software Heritage Persistent Identifiers (SWH PIDs).
Use cases
=========
Browsing
--------
The following use cases require traversing the *forward graph*.
- **ls**: given a directory node, list (non recursively) all linked nodes of
type directory and content
Implementation:
**TODO** /!\ we don't have a ready-to-use API endpoint to answer this;
we need a simple "return adjacency list" method (or, alternatively, a
BFS that stops after one layer, but the former will be more generally
useful). Proposal:
/graph/edges/:src?edges=...
will return all edges that has :src: as from node, matching the
restriction in edges=..., as usual.
- **ls -R**: given a directory node, recursively list all linked nodes of type
directory and content
Implementation:
/graph/visit/paths/:NODE_ID?edges=dir:cnt,dir:dir
- **git log**: given a revision node, recursively list all linked nodes of type
revision
Implementation:
/graph/visit/nodes/:NODE_ID?edges=rev:rev
Vault
-----
The following use cases require traversing the *forward graph*.
- **tarball** (same as *ls -R* above)
- **git bundle**: given a node, recursively list all linked nodes of any kind
Implementation:
/graph/visit/nodes/:NODE_ID?edges=*
Provenance
----------
The following use cases require traversing the *backward (transposed) graph*.
- **commit provenance**: given a content or directory node, return *a* commit
whose directory (recursively) contains it
Implementation:
TODO
- **complete commit provenance**: given a content or directory node, return
*all* commits whose directory (recursively) contains it
Implementation:
TODO
- **origin provenance**: given a content, directory, or commit node, return
*an* origin that has at least one snapshot that (recursively) contains it
Implementation:
TODO
- **complete origin provenance**: given a content, directory, or commit node,
return *all* origins that have at least one snapshot that (recursively)
contains it
Implementation:
TODO
- *SLOC tracking*: left as future work
### Provenance statistics
The following use cases require traversing the *backward (transposed) graph*.
- **content popularity across commits**: for each content, count the number of
commits (or *commit popularity*) that link to a directory that (recursively)
includes it. Plot the distribution of content popularity across commits.
Implementation:
TODO
- **commit popularity across origins**: for each commit, count the number of
origins (or *origin popularity*) that have a snapshot that (recursively)
includes it. Plot the distribution of commit popularity across origins.
Implementation:
TODO
- *SLOC popularity across contents*: left as future work
Benchmarks
==========
Notes on how to benchmark graph access:
- separate pure-graph timings from timings related to use additional mappings
(e.g., node types), no matter if the mappings are in-memory or on-disk
- separate in-memory timings from on-disk timings; in particular, separate the
timing of translating node identifiers between internal integers and SWH PIDs
- for each use case that requires a node as input, we will randomize the choice
of the input node and repeat the experiment a suitable number of times; where
possible we will aggregate results computing basic statistics (average,
standard deviation), as well as normalize results w.r.t. the "size" of the
chosen node (e.g., number of nodes/path length in the resulting visit)
Basic benchmarks
----------------
- **Edge traversal**: given a node, retrieve the first node in its adjacency
list.
For reference: Apostolico, Drovandi in *Graph Compression by BFS* report
times to retrieve the adjacency list of a node (and/or test if an edge exists
between two nodes) in the 2-3 us range, for the largest graph in their
experiments (22 M nodes, 600 M edges).
Each use case is a benchmark
----------------------------
In addition to abstract benchmark, we will use each use case above as a
scenario-based benchmark.