diff --git a/tools/git2graph/README.md b/docs/git2graph.md
similarity index 99%
copy from tools/git2graph/README.md
copy to docs/git2graph.md
index 9372122..c44914d 100644
--- a/tools/git2graph/README.md
+++ b/docs/git2graph.md
@@ -1,69 +1,69 @@
 git2graph
 =========

 `git2graph` crawls a Git repository and outputs it as a graph, i.e., as a pair
 of textual files . The nodes file will contain a list of graph
 nodes as Software Heritage (SWH) Persistent Identifiers (PIDs); the edges file
 a list of graph edges as PID pairs.


 Dependencies
 ------------

 Build time dependencies:

 - [libgit2](https://libgit2.org/)

 Test dependencies:

 - [bats](https://github.com/bats-core/bats-core)


 Micro benchmark
 ---------------

     $ time ./git2graph -n >(zstdmt > nodes.csv.zst) -e >(zstdmt -c > edges.csv.zst) /srv/src/linux
     160,38s user 12,72s system 98% cpu 2:55,02 total
-    
+
     $ zstdcat nodes.csv.zst | wc -l
     6503403
     $ zstdcat edges.csv.zst | wc -l
     305096029


 Parallel use
 ------------

 `git2graph` writes fixed-length lines, long either 51 bytes (nodes) or 102
 bytes (edges). When writing to a FIFO less than `PIPE_BUF` bytes (which is 4096
 bytes on Linux, and guaranteed to be at least 512 bytes by POSIX), writes are
 atomic. Hence it is possible to mass analyze many repositories in parallel with
 something like:

     $ mkfifo nodes.fifo edges.fifo
     $ sort -u < nodes.fifo | zstdmt > nodes.csv.zst &
     $ sort -u < edges.fifo | zstdmt > edges.csv.zst &
     $ parallel git2graph -n nodes.fifo -e edges.fifo -- repo_dir_1 repo_dir_2 ...
     $ rm nodes.fifo edges.fifo

 Note that you most likely want to tune `sort` in order to be parallel
 (`--parallel`), use a large buffer size (`-S`), and use a temporary directory
 with enough available space (`-T`). (The above example uses `parallel`
 from [moreutils](https://joeyh.name/code/moreutils/), but it could trivially be
 adapted to use [GNU parallel](https://www.gnu.org/software/parallel/) or
 similar parallelization tools.)


 Limitations
 -----------

 Snapshot PID calculation is not fully compatible with the
 [spec](https://docs.softwareheritage.org/devel/apidoc/swh.model.html#swh.model.identifiers.snapshot_identifier),
 because currently only HEAD is considered as a symbolic reference. Other
 symbolic refs, if present, will be ignored, potentially leading to a different
 snapshot PID than what Software Heritage will obtain. This is due to a
 limitation of libgit2, that at the time of writing doesn't allow to list all
 symbolic references.

 The graph structure is not affected, but looking up obtained snapshots by PID
 on the main Software Heritage archive might fail.
diff --git a/docs/index.rst b/docs/index.rst
index ed16a6c..d1daf3e 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,18 +1,19 @@
 .. _swh-graph:

 Software Heritage - graph service
 =================================

 Tooling and service providing fast access to the graph representation of the
 Software Heritage archive. The service is in-memory, based on a compressed
 representation of the Software Heritage Merkle DAG (see :ref:`data-model`).

 .. toctree::
    :maxdepth: 1
    :caption: Overview

    compression
    api
    use-cases
    docker
+   git2graph
    /apidoc/swh.graph
diff --git a/tools/git2graph/README.md b/tools/git2graph/README.md
deleted file mode 100644
index 9372122..0000000
--- a/tools/git2graph/README.md
+++ /dev/null
@@ -1,69 +0,0 @@
-git2graph
-=========
-
-`git2graph` crawls a Git repository and outputs it as a graph, i.e., as a pair
-of textual files . The nodes file will contain a list of graph
-nodes as Software Heritage (SWH) Persistent Identifiers (PIDs); the edges file
-a list of graph edges as PID pairs.
-
-
-Dependencies
-------------
-
-Build time dependencies:
-
-- [libgit2](https://libgit2.org/)
-
-Test dependencies:
-
-- [bats](https://github.com/bats-core/bats-core)
-
-
-Micro benchmark
----------------
-
-    $ time ./git2graph -n >(zstdmt > nodes.csv.zst) -e >(zstdmt -c > edges.csv.zst) /srv/src/linux
-    160,38s user 12,72s system 98% cpu 2:55,02 total
-    
-    $ zstdcat nodes.csv.zst | wc -l
-    6503403
-    $ zstdcat edges.csv.zst | wc -l
-    305096029
-
-
-Parallel use
-------------
-
-`git2graph` writes fixed-length lines, long either 51 bytes (nodes) or 102
-bytes (edges). When writing to a FIFO less than `PIPE_BUF` bytes (which is 4096
-bytes on Linux, and guaranteed to be at least 512 bytes by POSIX), writes are
-atomic. Hence it is possible to mass analyze many repositories in parallel with
-something like:
-
-    $ mkfifo nodes.fifo edges.fifo
-    $ sort -u < nodes.fifo | zstdmt > nodes.csv.zst &
-    $ sort -u < edges.fifo | zstdmt > edges.csv.zst &
-    $ parallel git2graph -n nodes.fifo -e edges.fifo -- repo_dir_1 repo_dir_2 ...
-    $ rm nodes.fifo edges.fifo
-
-Note that you most likely want to tune `sort` in order to be parallel
-(`--parallel`), use a large buffer size (`-S`), and use a temporary directory
-with enough available space (`-T`). (The above example uses `parallel`
-from [moreutils](https://joeyh.name/code/moreutils/), but it could trivially be
-adapted to use [GNU parallel](https://www.gnu.org/software/parallel/) or
-similar parallelization tools.)
-
-
-Limitations
------------
-
-Snapshot PID calculation is not fully compatible with the
-[spec](https://docs.softwareheritage.org/devel/apidoc/swh.model.html#swh.model.identifiers.snapshot_identifier),
-because currently only HEAD is considered as a symbolic reference. Other
-symbolic refs, if present, will be ignored, potentially leading to a different
-snapshot PID than what Software Heritage will obtain. This is due to a
-limitation of libgit2, that at the time of writing doesn't allow to list all
-symbolic references.
-
-The graph structure is not affected, but looking up obtained snapshots by PID
-on the main Software Heritage archive might fail.
diff --git a/tools/git2graph/README.md b/tools/git2graph/README.md
new file mode 120000
index 0000000..5ae67b0
--- /dev/null
+++ b/tools/git2graph/README.md
@@ -0,0 +1 @@
+../../docs/git2graph.md
\ No newline at end of file
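
The fixed line lengths quoted in the README's "Parallel use" section (51 bytes for nodes, 102 bytes for edges) can be sanity-checked with a little shell arithmetic. The sketch below is illustrative only and is not part of the patch. It assumes a PID has the form `swh:1:<3-letter type>:<40 hex digits>` (50 bytes) and that an edge line is two PIDs joined by a single one-byte separator (a space is assumed here); the all-zero hash is a placeholder, not a real object.

```shell
#!/bin/sh
# Assumption: a PID is "swh:1:<3-letter type>:<40 hex digits>", i.e. 50 bytes.
hash=$(printf '%040d' 0)          # placeholder hash: 40 zeros, not a real object
pid="swh:1:rev:${hash}"

# Node line: one PID plus a trailing newline.
node_line_bytes=$(( $(printf '%s\n' "$pid" | wc -c) ))
# Edge line: two PIDs, an assumed single-space separator, plus a newline.
edge_line_bytes=$(( $(printf '%s %s\n' "$pid" "$pid" | wc -c) ))

echo "node line: ${node_line_bytes} bytes"
echo "edge line: ${edge_line_bytes} bytes"

# Compare against the PIPE_BUF limit of the local filesystem root.
pipe_buf=$(getconf PIPE_BUF /)
[ "$edge_line_bytes" -lt "$pipe_buf" ] && echo "edge lines fit within PIPE_BUF (${pipe_buf})"
```

Under these assumptions both line sizes are far below even the POSIX minimum `PIPE_BUF` of 512 bytes. The atomicity argument in the README additionally relies on each line reaching the FIFO in a single write, which this sketch does not verify.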