This document lists use cases and benchmark scenarios for the Software
Heritage graph service.

Conventions
===========

- **Node identification**: in the following, nodes are always identified by
  their Software Heritage :ref:`persistent-identifiers` (SWH PIDs).

Use cases
=========

Browsing
--------

The following use cases require traversing the *forward graph* (a client-side
sketch of invoking the endpoints below over HTTP is given at the end of the
*Use cases* section).

- **ls**: given a directory node, list (non-recursively) all linked nodes of
  type directory and content

  Implementation::

    /graph/neighbors/:NODE_ID?edges=dir:cnt,dir:dir

- **ls -R**: given a directory node, recursively list all linked nodes of
  type directory and content

  Implementation::

    /graph/visit/paths/:NODE_ID?edges=dir:cnt,dir:dir

- **git log**: given a revision node, recursively list all linked nodes of
  type revision

  Implementation::

    /graph/visit/nodes/:NODE_ID?edges=rev:rev

Vault
-----

The following use cases require traversing the *forward graph*.

- **tarball** (same as *ls -R* above)

- **git bundle**: given a node, recursively list all linked nodes of any kind

  Implementation::

    /graph/visit/nodes/:NODE_ID?edges=*

Provenance
----------

The following use cases require traversing the *backward (transposed) graph*.

- **commit provenance**: given a content or directory node, return *a* commit
  whose directory (recursively) contains it

  Implementation::

    /graph/walk/:NODE_ID/rev?direction=backward

- **complete commit provenance**: given a content or directory node, return
  *all* commits whose directory (recursively) contains it

  Implementation::

    /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev

- *TODO*: when traversing the graph backward it is not entirely clear how
  edge restrictions should be specified (related: T1888). We assume here that
  they list the edge types occurring in the transposed graph (hence
  ``dir:dir`` follows edges from directories to *parent* directories).

- **origin provenance**: given a content, directory, or commit node, return
  *an* origin that has at least one snapshot that (recursively) contains it

  Implementation::

    /graph/walk/:NODE_ID/ori?direction=backward

- **complete origin provenance**: given a content, directory, or commit node,
  return *all* origins that have at least one snapshot that (recursively)
  contains it

  Implementation::

    /graph/leaves/:NODE_ID?direction=backward&edges=dir:dir,cnt:dir,dir:rev,rev:snp,snp:ori

- The same *TODO* as for *complete commit provenance* applies here.

- *SLOC tracking*: left as future work

Provenance statistics
~~~~~~~~~~~~~~~~~~~~~

The following use cases require traversing the *backward (transposed) graph*.

- **content popularity across commits**: for each content, count the number
  of commits (or *commit popularity*) that link to a directory that
  (recursively) includes it. Plot the distribution of content popularity
  across commits.

  Implementation: apply *complete commit provenance* to each content node,
  count the returned commits, aggregate.

- **commit popularity across origins**: for each commit, count the number of
  origins (or *origin popularity*) that have a snapshot that (recursively)
  includes it. Plot the distribution of commit popularity across origins.

  Implementation: apply *complete origin provenance* to each commit node,
  count the returned origins, aggregate.

- *SLOC popularity across contents*: left as future work

The following use cases require traversing the *forward graph*.

- **revision size** (as number of contents) distribution: for each revision,
  count the number of contents that are (recursively) reachable from it. Plot
  the distribution of revision sizes.

- **origin size** (as number of revisions) distribution: for each origin,
  count the number of revisions that are (recursively) reachable from it.
  Plot the distribution of origin sizes.
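To make the API usage above concrete, here is a minimal client-side sketch
for two of the use cases (*ls* and *complete commit provenance*). Only the
endpoint paths and query parameters come from the implementations listed
above; the base URL (including the port) is a placeholder to adjust to the
actual deployment, and the sketch assumes the service answers with one SWH
PID per line as plain text::

    import requests

    # Placeholder base URL of a swh-graph HTTP server; adjust to the deployment.
    GRAPH_API = "http://localhost:5009/graph"

    def ls(dir_pid):
        """ls use case: direct directory/content children of a directory node."""
        resp = requests.get(
            f"{GRAPH_API}/neighbors/{dir_pid}",
            params={"edges": "dir:cnt,dir:dir"},
        )
        resp.raise_for_status()
        # Assumption: the response is plain text, one SWH PID per line.
        return resp.text.splitlines()

    def complete_commit_provenance(cnt_pid):
        """complete commit provenance use case: all commits containing a content."""
        resp = requests.get(
            f"{GRAPH_API}/leaves/{cnt_pid}",
            params={"direction": "backward", "edges": "dir:dir,cnt:dir,dir:rev"},
        )
        resp.raise_for_status()
        return resp.text.splitlines()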
Benchmarks
==========

Notes on how to benchmark graph access:

- separate pure-graph timings from timings related to the use of additional
  mappings (e.g., node types), regardless of whether the mappings are in
  memory or on disk

- separate in-memory timings from on-disk timings; in particular, separate
  the timing of translating node identifiers between internal integers and
  SWH PIDs

- for each use case that requires a node as input, we will randomize the
  choice of the input node and repeat the experiment a suitable number of
  times; where possible we will aggregate results computing basic statistics
  (average, standard deviation), as well as normalize results w.r.t. the
  “size” of the chosen node (e.g., number of nodes/path length in the
  resulting visit); a minimal harness sketch along these lines is given at
  the end of this document

Basic benchmarks
----------------

- **Edge traversal**: given a node, retrieve the first node in its adjacency
  list.

  For reference: Apostolico and Drovandi, in *Graph Compression by BFS*,
  report times to retrieve the adjacency list of a node (and/or test if an
  edge exists between two nodes) in the 2-3 µs range, for the largest graph
  in their experiments (22 M nodes, 600 M edges).

Each use case is a benchmark
----------------------------

In addition to these abstract benchmarks, we will use each of the use cases
above as a scenario-based benchmark.
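A minimal sketch of such a scenario harness, following the methodology above
(random input node, repeated runs, basic statistics, normalization by visit
size), could look as follows; ``query`` and ``candidate_nodes`` are
placeholders for whatever the concrete scenario uses, not part of the graph
service::

    import random
    import statistics
    import time

    def benchmark(query, candidate_nodes, runs=30):
        """Time `query` on randomly chosen input nodes and aggregate the results."""
        timings, per_node = [], []
        for _ in range(runs):
            node = random.choice(candidate_nodes)
            start = time.perf_counter()
            result = query(node)  # e.g., one of the use-case functions above
            elapsed = time.perf_counter() - start
            timings.append(elapsed)
            # Normalize w.r.t. the "size" of the chosen node (here: visit size).
            per_node.append(elapsed / max(len(result), 1))
        return {
            "avg": statistics.mean(timings),
            "stdev": statistics.stdev(timings) if runs > 1 else 0.0,
            "avg_per_visited_node": statistics.mean(per_node),
        }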