diff --git a/PKG-INFO b/PKG-INFO new file mode 100644 index 0000000..18ff922 --- /dev/null +++ b/PKG-INFO @@ -0,0 +1,56 @@ +Metadata-Version: 2.1 +Name: swh.graph +Version: 0.5.0 +Summary: Software Heritage graph service +Home-page: https://forge.softwareheritage.org/diffusion/DGRPH +Author: Software Heritage developers +Author-email: swh-devel@inria.fr +License: UNKNOWN +Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest +Project-URL: Funding, https://www.softwareheritage.org/donate +Project-URL: Source, https://forge.softwareheritage.org/source/swh-graph +Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-graph/ +Platform: UNKNOWN +Classifier: Programming Language :: Python :: 3 +Classifier: Intended Audience :: Developers +Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) +Classifier: Operating System :: OS Independent +Classifier: Development Status :: 3 - Alpha +Requires-Python: >=3.7 +Description-Content-Type: text/x-rst +Provides-Extra: testing +License-File: LICENSE +License-File: AUTHORS + +Software Heritage - graph service +================================= + +Tooling and services, collectively known as ``swh-graph``, providing fast +access to the graph representation of the `Software Heritage +`_ +`archive `_. The service is in-memory, +based on a compressed representation of the Software Heritage Merkle DAG. + + +Bibliography +------------ + +In addition to accompanying technical documentation, ``swh-graph`` is also +described in the following scientific paper. If you publish results based on +``swh-graph``, please acknowledge it by citing the paper as follows: + +.. note:: + + Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli. + `Ultra-Large-Scale Repository Analysis via Graph Compression + `_. In proceedings of `SANER + 2020 `_: The 27th IEEE International + Conference on Software Analysis, Evolution and Reengineering, pages + 184-194. IEEE 2020. 
+ + Links: `preprint + `_, + `bibtex + `_. + + diff --git a/README.rst b/README.rst deleted file mode 120000 index cffceba..0000000 --- a/README.rst +++ /dev/null @@ -1 +0,0 @@ -docs/README.rst \ No newline at end of file diff --git a/README.rst b/README.rst new file mode 100644 index 0000000..d7abf98 --- /dev/null +++ b/README.rst @@ -0,0 +1,30 @@ +Software Heritage - graph service +================================= + +Tooling and services, collectively known as ``swh-graph``, providing fast +access to the graph representation of the `Software Heritage +`_ +`archive `_. The service is in-memory, +based on a compressed representation of the Software Heritage Merkle DAG. + + +Bibliography +------------ + +In addition to accompanying technical documentation, ``swh-graph`` is also +described in the following scientific paper. If you publish results based on +``swh-graph``, please acknowledge it by citing the paper as follows: + +.. note:: + + Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli. + `Ultra-Large-Scale Repository Analysis via Graph Compression + `_. In proceedings of `SANER + 2020 `_: The 27th IEEE International + Conference on Software Analysis, Evolution and Reengineering, pages + 184-194. IEEE 2020. + + Links: `preprint + `_, + `bibtex + `_. 
diff --git a/setup.cfg b/setup.cfg index 8d79b7e..1d722c2 100644 --- a/setup.cfg +++ b/setup.cfg @@ -1,6 +1,8 @@ [flake8] -# E203: whitespaces before ':' -# E231: missing whitespace after ',' -# W503: line break before binary operator ignore = E203,E231,W503 max-line-length = 88 + +[egg_info] +tag_build = +tag_date = 0 + diff --git a/swh.graph.egg-info/PKG-INFO b/swh.graph.egg-info/PKG-INFO new file mode 100644 index 0000000..18ff922 --- /dev/null +++ b/swh.graph.egg-info/PKG-INFO @@ -0,0 +1,56 @@ +Metadata-Version: 2.1 +Name: swh.graph +Version: 0.5.0 +Summary: Software Heritage graph service +Home-page: https://forge.softwareheritage.org/diffusion/DGRPH +Author: Software Heritage developers +Author-email: swh-devel@inria.fr +License: UNKNOWN +Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest +Project-URL: Funding, https://www.softwareheritage.org/donate +Project-URL: Source, https://forge.softwareheritage.org/source/swh-graph +Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-graph/ +Platform: UNKNOWN +Classifier: Programming Language :: Python :: 3 +Classifier: Intended Audience :: Developers +Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) +Classifier: Operating System :: OS Independent +Classifier: Development Status :: 3 - Alpha +Requires-Python: >=3.7 +Description-Content-Type: text/x-rst +Provides-Extra: testing +License-File: LICENSE +License-File: AUTHORS + +Software Heritage - graph service +================================= + +Tooling and services, collectively known as ``swh-graph``, providing fast +access to the graph representation of the `Software Heritage +`_ +`archive `_. The service is in-memory, +based on a compressed representation of the Software Heritage Merkle DAG. + + +Bibliography +------------ + +In addition to accompanying technical documentation, ``swh-graph`` is also +described in the following scientific paper. 
If you publish results based on +``swh-graph``, please acknowledge it by citing the paper as follows: + +.. note:: + + Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli. + `Ultra-Large-Scale Repository Analysis via Graph Compression + `_. In proceedings of `SANER + 2020 `_: The 27th IEEE International + Conference on Software Analysis, Evolution and Reengineering, pages + 184-194. IEEE 2020. + + Links: `preprint + `_, + `bibtex + `_. + + diff --git a/swh.graph.egg-info/SOURCES.txt b/swh.graph.egg-info/SOURCES.txt new file mode 100644 index 0000000..e2e1364 --- /dev/null +++ b/swh.graph.egg-info/SOURCES.txt @@ -0,0 +1,193 @@ +.gitignore +.pre-commit-config.yaml +AUTHORS +CODE_OF_CONDUCT.md +CONTRIBUTORS +LICENSE +MANIFEST.in +Makefile +Makefile.local +README.rst +mypy.ini +pyproject.toml +pytest.ini +requirements-swh.txt +requirements-test.txt +requirements.txt +setup.cfg +setup.py +tox.ini +docker/Dockerfile +docker/build.sh +docker/run.sh +docs/.gitignore +docs/Makefile +docs/Makefile.local +docs/README.rst +docs/api.rst +docs/cli.rst +docs/compression.rst +docs/conf.py +docs/docker.rst +docs/git2graph.md +docs/index.rst +docs/quickstart.rst +docs/use-cases.rst +docs/_static/.placeholder +docs/_templates/.placeholder +docs/images/.gitignore +docs/images/Makefile +docs/images/compression_steps.dot +java/.coding-style.xml +java/.gitignore +java/AUTHORS +java/LICENSE +java/README.md +java/pom.xml +java/src/main/java/org/softwareheritage/graph/AllowedEdges.java +java/src/main/java/org/softwareheritage/graph/AllowedNodes.java +java/src/main/java/org/softwareheritage/graph/Entry.java +java/src/main/java/org/softwareheritage/graph/Graph.java +java/src/main/java/org/softwareheritage/graph/Node.java +java/src/main/java/org/softwareheritage/graph/NodesFiltering.java +java/src/main/java/org/softwareheritage/graph/SWHID.java +java/src/main/java/org/softwareheritage/graph/Stats.java +java/src/main/java/org/softwareheritage/graph/Subgraph.java 
+java/src/main/java/org/softwareheritage/graph/SwhPath.java +java/src/main/java/org/softwareheritage/graph/Traversal.java +java/src/main/java/org/softwareheritage/graph/algo/TopologicalTraversal.java +java/src/main/java/org/softwareheritage/graph/benchmark/AccessEdge.java +java/src/main/java/org/softwareheritage/graph/benchmark/BFS.java +java/src/main/java/org/softwareheritage/graph/benchmark/Benchmark.java +java/src/main/java/org/softwareheritage/graph/benchmark/Browsing.java +java/src/main/java/org/softwareheritage/graph/benchmark/Provenance.java +java/src/main/java/org/softwareheritage/graph/benchmark/Vault.java +java/src/main/java/org/softwareheritage/graph/benchmark/utils/Random.java +java/src/main/java/org/softwareheritage/graph/benchmark/utils/Statistics.java +java/src/main/java/org/softwareheritage/graph/benchmark/utils/Timing.java +java/src/main/java/org/softwareheritage/graph/experiments/forks/FindCommonAncestor.java +java/src/main/java/org/softwareheritage/graph/experiments/forks/FindPath.java +java/src/main/java/org/softwareheritage/graph/experiments/forks/ForkCC.java +java/src/main/java/org/softwareheritage/graph/experiments/forks/ForkCliques.java +java/src/main/java/org/softwareheritage/graph/experiments/forks/ListEmptyOrigins.java +java/src/main/java/org/softwareheritage/graph/experiments/multiplicationfactor/GenDistribution.java +java/src/main/java/org/softwareheritage/graph/experiments/topology/AveragePaths.java +java/src/main/java/org/softwareheritage/graph/experiments/topology/ClusteringCoefficient.java +java/src/main/java/org/softwareheritage/graph/experiments/topology/ConnectedComponents.java +java/src/main/java/org/softwareheritage/graph/experiments/topology/InOutDegree.java +java/src/main/java/org/softwareheritage/graph/experiments/topology/SubdatasetSizeFunction.java +java/src/main/java/org/softwareheritage/graph/labels/AbstractLongListLabel.java +java/src/main/java/org/softwareheritage/graph/labels/DirEntry.java 
+java/src/main/java/org/softwareheritage/graph/labels/FixedWidthLongListLabel.java +java/src/main/java/org/softwareheritage/graph/labels/SwhLabel.java +java/src/main/java/org/softwareheritage/graph/maps/LabelMapBuilder.java +java/src/main/java/org/softwareheritage/graph/maps/MapFile.java +java/src/main/java/org/softwareheritage/graph/maps/NodeIdMap.java +java/src/main/java/org/softwareheritage/graph/maps/NodeMapBuilder.java +java/src/main/java/org/softwareheritage/graph/maps/NodeTypesMap.java +java/src/main/java/org/softwareheritage/graph/server/App.java +java/src/main/java/org/softwareheritage/graph/server/Endpoint.java +java/src/main/java/org/softwareheritage/graph/utils/ExportSubdataset.java +java/src/main/java/org/softwareheritage/graph/utils/FindEarliestRevision.java +java/src/main/java/org/softwareheritage/graph/utils/MPHTranslate.java +java/src/main/java/org/softwareheritage/graph/utils/ReadGraph.java +java/src/main/java/org/softwareheritage/graph/utils/ReadLabelledGraph.java +java/src/main/java/org/softwareheritage/graph/utils/WriteRevisionTimestamps.java +java/src/test/java/org/softwareheritage/graph/AllowedEdgesTest.java +java/src/test/java/org/softwareheritage/graph/GraphTest.java +java/src/test/java/org/softwareheritage/graph/LeavesTest.java +java/src/test/java/org/softwareheritage/graph/NeighborsTest.java +java/src/test/java/org/softwareheritage/graph/SubgraphTest.java +java/src/test/java/org/softwareheritage/graph/VisitTest.java +java/src/test/java/org/softwareheritage/graph/WalkTest.java +java/target/swh-graph-0.5.0.jar +reports/.gitignore +reports/benchmarks/Makefile +reports/benchmarks/benchmarks.tex +reports/experiments/Makefile +reports/experiments/experiments.tex +reports/linux_log/LinuxLog.java +reports/linux_log/Makefile +reports/linux_log/linux_log.tex +reports/node_mapping/Makefile +reports/node_mapping/NodeIdMapHaloDB.java +reports/node_mapping/NodeIdMapRocksDB.java +reports/node_mapping/node_mapping.tex +swh/__init__.py 
+swh.graph.egg-info/PKG-INFO +swh.graph.egg-info/SOURCES.txt +swh.graph.egg-info/dependency_links.txt +swh.graph.egg-info/entry_points.txt +swh.graph.egg-info/requires.txt +swh.graph.egg-info/top_level.txt +swh/graph/__init__.py +swh/graph/backend.py +swh/graph/cli.py +swh/graph/client.py +swh/graph/config.py +swh/graph/dot.py +swh/graph/graph.py +swh/graph/naive_client.py +swh/graph/py.typed +swh/graph/swhid.py +swh/graph/webgraph.py +swh/graph/server/__init__.py +swh/graph/server/app.py +swh/graph/tests/__init__.py +swh/graph/tests/conftest.py +swh/graph/tests/test_api_client.py +swh/graph/tests/test_cli.py +swh/graph/tests/test_graph.py +swh/graph/tests/test_swhid.py +swh/graph/tests/dataset/.gitignore +swh/graph/tests/dataset/example.edges.csv +swh/graph/tests/dataset/example.edges.csv.zst +swh/graph/tests/dataset/example.nodes.csv +swh/graph/tests/dataset/example.nodes.csv.zst +swh/graph/tests/dataset/generate_graph.sh +swh/graph/tests/dataset/img/.gitignore +swh/graph/tests/dataset/img/Makefile +swh/graph/tests/dataset/img/example.dot +swh/graph/tests/dataset/output/example-transposed.graph +swh/graph/tests/dataset/output/example-transposed.obl +swh/graph/tests/dataset/output/example-transposed.offsets +swh/graph/tests/dataset/output/example-transposed.properties +swh/graph/tests/dataset/output/example.graph +swh/graph/tests/dataset/output/example.indegree +swh/graph/tests/dataset/output/example.mph +swh/graph/tests/dataset/output/example.node2swhid.bin +swh/graph/tests/dataset/output/example.node2type.map +swh/graph/tests/dataset/output/example.obl +swh/graph/tests/dataset/output/example.offsets +swh/graph/tests/dataset/output/example.order +swh/graph/tests/dataset/output/example.outdegree +swh/graph/tests/dataset/output/example.properties +swh/graph/tests/dataset/output/example.stats +swh/graph/tests/dataset/output/example.swhid2node.bin +tools/dir2graph +tools/swhid2int2int2swhid.sh +tools/git2graph/.gitignore +tools/git2graph/Makefile 
+tools/git2graph/README.md +tools/git2graph/git2graph.c +tools/git2graph/tests/edge-filters.bats +tools/git2graph/tests/full-graph.bats +tools/git2graph/tests/node-filters.bats +tools/git2graph/tests/repo_helper.bash +tools/git2graph/tests/data/sample-repo.tgz +tools/git2graph/tests/data/graphs/dir-nodes/edges.csv +tools/git2graph/tests/data/graphs/dir-nodes/nodes.csv +tools/git2graph/tests/data/graphs/from-dir-edges/edges.csv +tools/git2graph/tests/data/graphs/from-dir-edges/nodes.csv +tools/git2graph/tests/data/graphs/from-rel-edges/edges.csv +tools/git2graph/tests/data/graphs/from-rel-edges/nodes.csv +tools/git2graph/tests/data/graphs/fs-nodes/edges.csv +tools/git2graph/tests/data/graphs/fs-nodes/nodes.csv +tools/git2graph/tests/data/graphs/full/edges.csv +tools/git2graph/tests/data/graphs/full/nodes.csv +tools/git2graph/tests/data/graphs/rev-edges/edges.csv +tools/git2graph/tests/data/graphs/rev-edges/nodes.csv +tools/git2graph/tests/data/graphs/rev-nodes/edges.csv +tools/git2graph/tests/data/graphs/rev-nodes/nodes.csv +tools/git2graph/tests/data/graphs/to-rev-edges/edges.csv +tools/git2graph/tests/data/graphs/to-rev-edges/nodes.csv \ No newline at end of file diff --git a/swh.graph.egg-info/dependency_links.txt b/swh.graph.egg-info/dependency_links.txt new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/swh.graph.egg-info/dependency_links.txt @@ -0,0 +1 @@ + diff --git a/swh.graph.egg-info/entry_points.txt b/swh.graph.egg-info/entry_points.txt new file mode 100644 index 0000000..cfdaffe --- /dev/null +++ b/swh.graph.egg-info/entry_points.txt @@ -0,0 +1,6 @@ + + [console_scripts] + swh-graph=swh.graph.cli:main + [swh.cli.subcommands] + graph=swh.graph.cli + \ No newline at end of file diff --git a/swh.graph.egg-info/requires.txt b/swh.graph.egg-info/requires.txt new file mode 100644 index 0000000..aef777c --- /dev/null +++ b/swh.graph.egg-info/requires.txt @@ -0,0 +1,12 @@ +aiohttp +click +py4j +psutil +swh.core[http]>=0.3 +swh.model>=0.13.0 + 
+[testing] +pytest +types-click +types-pyyaml +types-requests diff --git a/swh.graph.egg-info/top_level.txt b/swh.graph.egg-info/top_level.txt new file mode 100644 index 0000000..0cb0f8f --- /dev/null +++ b/swh.graph.egg-info/top_level.txt @@ -0,0 +1 @@ +swh diff --git a/tools/git2graph/README.md b/tools/git2graph/README.md deleted file mode 120000 index 5ae67b0..0000000 --- a/tools/git2graph/README.md +++ /dev/null @@ -1 +0,0 @@ -../../docs/git2graph.md \ No newline at end of file diff --git a/tools/git2graph/README.md b/tools/git2graph/README.md new file mode 100644 index 0000000..b16fc93 --- /dev/null +++ b/tools/git2graph/README.md @@ -0,0 +1,71 @@ +git2graph +========= + +`git2graph` crawls a Git repository and outputs it as a graph, i.e., as a pair +of textual files . The nodes file will contain a list of graph +nodes as [Software Heritage](https://www.softwareheritage.org/) +{ref}`identifiers (SWHIDs) `; the edges file a list of +graph edges as SWHID pairs. + + +Dependencies +------------ + +Build time dependencies: + +- [glib](https://developer.gnome.org/glib/) +- [libgit2](https://libgit2.org/) + +Test dependencies: + +- [bats](https://github.com/bats-core/bats-core) + + +Micro benchmark +--------------- + + $ time ./git2graph -n >(zstdmt > nodes.csv.zst) -e >(zstdmt -c > edges.csv.zst) /srv/src/linux + 160,38s user 12,72s system 98% cpu 2:55,02 total + + $ zstdcat nodes.csv.zst | wc -l + 6503403 + $ zstdcat edges.csv.zst | wc -l + 305096029 + + +Parallel use +------------ + +`git2graph` writes fixed-length lines, long either 51 bytes (nodes) or 102 +bytes (edges). When writing to a FIFO less than `PIPE_BUF` bytes (which is 4096 +bytes on Linux, and guaranteed to be at least 512 bytes by POSIX), writes are +atomic. 
Hence it is possible to mass-analyze many repositories in parallel with +something like: + + $ mkfifo nodes.fifo edges.fifo + $ sort -u < nodes.fifo | zstdmt > nodes.csv.zst & + $ sort -u < edges.fifo | zstdmt > edges.csv.zst & + $ parallel git2graph -n nodes.fifo -e edges.fifo -- repo_dir_1 repo_dir_2 ... + $ rm nodes.fifo edges.fifo + +Note that you most likely want to tune `sort` to run in parallel +(`--parallel`), use a large buffer size (`-S`), and use a temporary directory +with enough available space (`-T`). (The above example uses `parallel` +from [moreutils](https://joeyh.name/code/moreutils/), but it could trivially be +adapted to use [GNU parallel](https://www.gnu.org/software/parallel/) or +similar parallelization tools.) + + +Limitations +----------- + +SWHID calculation for snapshots is not fully compatible with the +{py:func}`specification `, because +currently only HEAD is considered as a symbolic reference. Other symbolic refs, +if present, will be ignored, potentially leading to a different snapshot SWHID +than what Software Heritage will obtain. This is due to a limitation of +libgit2, which at the time of writing does not allow listing all symbolic +references. + +The graph structure is not affected, but looking up obtained snapshots by SWHID +on the main Software Heritage archive might fail.