Page MenuHomeSoftware Heritage

use-cases.rst
No OneTemporary

use-cases.rst

=========
Use cases
=========
This document lists use cases and benchmark scenarios for the Software Heritage
graph service.
Conventions
===========
- **Node identification**: in the following, nodes are always identified by
their :ref:`SWHIDs <persistent-identifiers>`.
Use cases
=========
Browsing
--------
The following use cases require traversing the *forward graph*.
- **ls**: given a directory node, list (non recursively) all linked nodes of
type directory and content
Implementation::
/graph/neighbors/:DIR_ID?edges=dir:cnt,dir:dir
- **ls -R**: given a directory node, recursively list all linked nodes of type
directory and content
Implementation::
/graph/visit/paths/:DIR_ID?edges=dir:cnt,dir:dir
- **git log**: given a revision node, recursively list all linked nodes of type
revision
Implementation::
/graph/visit/nodes/:REV_ID?edges=rev:rev
Vault
-----
The following use cases require traversing the *forward graph*.
- **tarball** (same as *ls -R* above)
- **git bundle**: given a node, recursively list all linked nodes of any kind
Implementation::
/graph/visit/nodes/:NODE_ID?edges=*
Provenance
----------
The following use cases require traversing the *backward (transposed)
graph*.
- **commit provenance**: given a content or directory node, return *a* commit
whose directory (recursively) contains it
Implementation::
/graph/walk/:NODE_ID/rev?direction=backward&edges=dir:dir,cnt:dir,dir:rev
- **complete commit provenance**: given a content or directory node, return
*all* commits whose directory (recursively) contains it
Implementation::
/graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev
- **origin provenance**: given a content, directory, or commit node, return
*an* origin that has at least one snapshot that (recursively) contains it
Implementation::
/graph/walk/:NODE_ID/ori?direction=backward&edges=*
- **complete origin provenance**: given a content, directory, or commit node,
return *all* origins that have at least one snapshot that (recursively)
contains it
Implementation::
/graph/leaves/:NODE_ID?direction=backward&edges=*
- *SLOC tracking*: left as future work
Provenance statistics
~~~~~~~~~~~~~~~~~~~~~
The following use cases require traversing the *backward (transposed)
graph*.
- **content popularity across commits**: for each content, count the number of
commits (or *commit popularity*) that link to a directory that (recursively)
includes it. Plot the distribution of content popularity across commits.
Implementation: apply *complete commit provenance* to each content node,
count the returned commits, aggregate.
- **commit popularity across origins**: for each commit, count the number of
origins (or *origin popularity*) that have a snapshot that (recursively)
includes it. Plot the distribution of commit popularity across origins.
Implementation: apply *complete origin provenance* to each commit node, count
the returned origins, aggregate.
- *SLOC popularity across contents*: left as future work
The following use cases require traversing the *forward graph*.
- **revision size** (as n. of contents) distribution: for each revision, count
the number of contents that are (recursively) reachable from it. Plot the
distribution of revision sizes.
- **origin size** (as n. of revisions) distribution: for each origin, count the
number of revisions that are (recursively) reachable from it. Plot the
distribution of origin sizes.
Benchmarks
==========
Notes on how to benchmark graph access:
- separate pure-graph timings from timings related to use additional mappings
(e.g., node types), no matter if the mappings are in-memory or on-disk
- separate in-memory timings from on-disk timings; in particular, separate the
timing of translating node identifiers between internal integers and SWHIDs
- for each use case that requires a node as input, we will randomize the choice
of the input node and repeat the experiment a suitable number of times; where
possible we will aggregate results computing basic statistics (average,
standard deviation), as well as normalize results w.r.t. the “size” of the
chosen node (e.g., number of nodes/path length in the resulting visit)
Basic benchmarks
----------------
- **Edge traversal**: given a node, retrieve the first node in its adjacency
list.
For reference: Apostolico, Drovandi in *Graph Compression by BFS* report
times to retrieve the adjacency list of a node (and/or test if an edge exists
between two nodes) in the 2-3 us range, for the largest graph in their
experiments (22 M nodes, 600 M edges).
Each use case is a benchmark
----------------------------
In addition to abstract benchmark, we will use each use case above as a
scenario-based benchmark.

File Metadata

Mime Type
text/plain
Expires
Fri, Jul 4, 2:39 PM (4 d, 23 h ago)
Storage Engine
blob
Storage Format
Raw Data
Storage Handle
3287645

Event Timeline