This document lists use cases and benchmark scenarii for the Software Heritage
graph service.
Conventions
===========
- **Node identification**: in the following, nodes are always identified by
their Software Heritage P:ref:`persistent I-identifiers` (SWH PIDs).
Use cases
=========
Browsing
--------
The following use cases require traversing the *forward graph*.
- **ls**: given a directory node, list (non recursively) all linked nodes of
type directory and content
Implementation: **TODO** we don't have a ready-to-use API endpoint for this.
We need a simple "return adjacency list" method (or, alternatively, a BFS
that stops after one layer, but the former will be more generally useful).
Proposal: T1886.:
/graph/neighbos/:NODE_ID?edges=dir:cnt,dir:dir
- **ls -R**: given a directory node, recursively list all linked nodes of type
directory and content
Implementation::
/graph/visit/paths/:NODE_ID?edges=dir:cnt,dir:dir
- **git log**: given a revision node, recursively list all linked nodes of type
revision
Implementation::
/graph/visit/nodes/:NODE_ID?edges=rev:rev
Vault
-----
The following use cases require traversing the *forward graph*.
- **tarball** (same as *ls -R* above)
- **git bundle**: given a node, recursively list all linked nodes of any kind
Implementation::
/graph/visit/nodes/:NODE_ID?edges=*
Provenance
----------
The following use cases require traversing the *backward (transposed)
graph*.
- **commit provenance**: given a content or directory node, return *a* commit
whose directory (recursively) contains it
Implementation:
:
/graph/walk/:NODE_ID/rev?direction=backward
- **complete commit provenance**: given a content or directory node, return
*all* commits whose directory (recursively) contains it
Implementation:
/graph/visit/nodes/:src?edges=dir:dir,cnt:dir,dir:rev?directory=backward
- *TODO*: this implementation is not satisfactory, as it also return
intermediate nodes, while we only care about commit nodes (i.e., the last
"layer"):
Proposal: T1889/graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev
- *TODO*: when traversing the graph backward it is not entirely clear how
edges should be specified (Related: `T1888). We're assuming here that we
should list the type of edges in the transposted graph (hence,<https://forge.softwareheritage.org/T1888>`_). `dir:dir`We're assuming here that we
should list the type of edges occurring in the transposed graph (hence,
``dir:dir`` follow edges from directories to *parent* directories).
- **origin provenance**: given a content, directory, or commit node, return
*an* origin that has at least one snapshot that (recursively) contains it
Implementation:
:
/graph/walk/:NODE_ID/ori?direction=backward
- **complete origin provenance**: given a content, directory, or commit node,
return *all* origins that have at least one snapshot that (recursively)
contains it
Implementation::
/graph/visit/nodes/:src?leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev,rev:snp,snp:ori?directory=backward
- The same two caveats/*TODO*s of *complete commit provenance* applies here.
- *SLOC tracking*: left as future work
### Provenance statistics
- **revision size** distribution: for each revision, count the number of content,that recursively link to it. Plot the popularity of revision size (sum provide the number of edges in the revision-content flat model).Provenance statistics
~~~~~~~~~~~~~~~~~~~~~
- **origin size** distribution: for each origin, count the number of revisions,that recursively link to it. Plot the popularity of origin size.
The following use cases require traversing the *backward (transposed)
graph*.
- **content popularity across commits**: for each content, count the number of
commits (or *commit popularity*) that link to a directory that (recursively)
includes it. Plot the distribution of content popularity across commits.
Implementation: apply *complete commit provenance* to each content node,
count the returned commits, aggregate.
- **commit popularity across origins**: for each commit, count the number of
origins (or *origin popularity*) that have a snapshot that (recursively)
includes it. Plot the distribution of commit popularity across origins.
Implementation: apply *complete origin provenance* to each commit node, count
the returned origins, aggregate.
- *SLOC popularity across contents*: left as future work
The following use cases require traversing the *forward graph*.
- **revision size** (as n. of contents) distribution: for each revision, count
the number of contents that are (recursively) reachable from it. Plot the
distribution of revision sizes.
- **origin size** (as n. of revisions) distribution: for each origin, count the
number of revisions that are (recursively) reachable from it. Plot the
distribution of origin sizes.
Benchmarks
==========
Notes on how to benchmark graph access:
- separate pure-graph timings from timings related to use additional mappings
(e.g., node types), no matter if the mappings are in-memory or on-disk
- separate in-memory timings from on-disk timings; in particular, separate the
timing of translating node identifiers between internal integers and SWH PIDs
- for each use case that requires a node as input, we will randomize the choice
of the input node and repeat the experiment a suitable number of times; where
possible we will aggregate results computing basic statistics (average,
standard deviation), as well as normalize results w.r.t. the "“size"” of the
chosen node (e.g., number of nodes/path length in the resulting visit)
Basic benchmarks
----------------
- **Edge traversal**: given a node, retrieve the first node in its adjacency
list.
For reference: Apostolico, Drovandi in *Graph Compression by BFS* report
times to retrieve the adjacency list of a node (and/or test if an edge exists
between two nodes) in the 2-3 us range, for the largest graph in their
experiments (22 M nodes, 600 M edges).
Each use case is a benchmark
----------------------------
In addition to abstract benchmark, we will use each use case above as a
scenario-based benchmark.