Page MenuHomeSoftware Heritage
Paste P461

swh-graph use cases
ActivePublic

Authored by zack on Jul 7 2019, 1:20 PM.
This document lists use cases and benchmark scenarii for the Software Heritage
graph service.
Conventions
===========
- **Node identification**: in the following, nodes are always identified by
their Software Heritage :ref:`persistent-identifiers` (SWH PIDs).
Use cases
=========
Browsing
--------
The following use cases require traversing the *forward graph*.
- **ls**: given a directory node, list (non recursively) all linked nodes of
type directory and content
Implementation::
/graph/neighbos/:NODE_ID?edges=dir:cnt,dir:dir
- **ls -R**: given a directory node, recursively list all linked nodes of type
directory and content
Implementation::
/graph/visit/paths/:NODE_ID?edges=dir:cnt,dir:dir
- **git log**: given a revision node, recursively list all linked nodes of type
revision
Implementation::
/graph/visit/nodes/:NODE_ID?edges=rev:rev
Vault
-----
The following use cases require traversing the *forward graph*.
- **tarball** (same as *ls -R* above)
- **git bundle**: given a node, recursively list all linked nodes of any kind
Implementation::
/graph/visit/nodes/:NODE_ID?edges=*
Provenance
----------
The following use cases require traversing the *backward (transposed)
graph*.
- **commit provenance**: given a content or directory node, return *a* commit
whose directory (recursively) contains it
Implementation::
/graph/walk/:NODE_ID/rev?direction=backward
- **complete commit provenance**: given a content or directory node, return
*all* commits whose directory (recursively) contains it
Implementation::
/graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev
- *TODO*: when traversing the graph backward it is not entirely clear how
edges should be specified (Related: `T1888
<https://forge.softwareheritage.org/T1888>`_). We're assuming here that we
should list the type of edges occurring in the transposed graph (hence,
``dir:dir`` follow edges from directories to *parent* directories).
- **origin provenance**: given a content, directory, or commit node, return
*an* origin that has at least one snapshot that (recursively) contains it
Implementation::
/graph/walk/:NODE_ID/ori?direction=backward
- **complete origin provenance**: given a content, directory, or commit node,
return *all* origins that have at least one snapshot that (recursively)
contains it
Implementation::
/graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev,rev:snp,snp:ori
- The same *TODO* of *complete commit provenance* applies here.
- *SLOC tracking*: left as future work
Provenance statistics
~~~~~~~~~~~~~~~~~~~~~
The following use cases require traversing the *backward (transposed)
graph*.
- **content popularity across commits**: for each content, count the number of
commits (or *commit popularity*) that link to a directory that (recursively)
includes it. Plot the distribution of content popularity across commits.
Implementation: apply *complete commit provenance* to each content node,
count the returned commits, aggregate.
- **commit popularity across origins**: for each commit, count the number of
origins (or *origin popularity*) that have a snapshot that (recursively)
includes it. Plot the distribution of commit popularity across origins.
Implementation: apply *complete origin provenance* to each commit node, count
the returned origins, aggregate.
- *SLOC popularity across contents*: left as future work
The following use cases require traversing the *forward graph*.
- **revision size** (as n. of contents) distribution: for each revision, count
the number of contents that are (recursively) reachable from it. Plot the
distribution of revision sizes.
- **origin size** (as n. of revisions) distribution: for each origin, count the
number of revisions that are (recursively) reachable from it. Plot the
distribution of origin sizes.
Benchmarks
==========
Notes on how to benchmark graph access:
- separate pure-graph timings from timings related to use additional mappings
(e.g., node types), no matter if the mappings are in-memory or on-disk
- separate in-memory timings from on-disk timings; in particular, separate the
timing of translating node identifiers between internal integers and SWH PIDs
- for each use case that requires a node as input, we will randomize the choice
of the input node and repeat the experiment a suitable number of times; where
possible we will aggregate results computing basic statistics (average,
standard deviation), as well as normalize results w.r.t. the “size” of the
chosen node (e.g., number of nodes/path length in the resulting visit)
Basic benchmarks
----------------
- **Edge traversal**: given a node, retrieve the first node in its adjacency
list.
For reference: Apostolico, Drovandi in *Graph Compression by BFS* report
times to retrieve the adjacency list of a node (and/or test if an edge exists
between two nodes) in the 2-3 us range, for the largest graph in their
experiments (22 M nodes, 600 M edges).
Each use case is a benchmark
----------------------------
In addition to abstract benchmark, we will use each use case above as a
scenario-based benchmark.

Event Timeline

zack changed the title of this paste from untitled to Masterwork From Distant Lands.
zack changed the title of this paste from Masterwork From Distant Lands to swh-graph use cases.Jul 7 2019, 1:21 PM
zack changed the visibility from "All Users" to "Public (No Login Required)".
zack added a project: Compressed graph service.
zack updated the paste's language from autodetect to remarkup.
zack edited the content of this paste. (Show Details)
zack edited the content of this paste. (Show Details)
zack updated the paste's language from remarkup to rst.Jul 15 2019, 4:19 PM
zack edited the content of this paste. (Show Details)