diff --git a/docs/java-api.rst b/docs/java-api.rst
index 07d0041..982236a 100644
--- a/docs/java-api.rst
+++ b/docs/java-api.rst
@@ -1,744 +1,744 @@
.. _swh-graph-java-api:

Using the Java API
==================

.. highlight:: java

While the :ref:`HTTP API ` is useful for many common use-cases, it is often not sufficient to implement more complex algorithms. This section describes the low-level Java API that ``swh-graph`` provides on top of the WebGraph framework to manipulate the compressed graph of Software Heritage.

A cursory understanding of the `WebGraph framework `_ and its API is helpful to understand the notions detailed here.

.. _swh-graph-java-basics:

Basics
------

In the WebGraph framework, graphs are generally subclasses of `ImmutableGraph `_, the abstract class providing the core API to manipulate and iterate on graphs. Under the hood, compressed graphs are stored as the `BVGraph `_ class, which contains the actual codec used to compress and decompress adjacency lists.

Graph **nodes** are mapped to a contiguous set of integers :math:`[0, n - 1]` where *n* is the total number of nodes in the graph. Each node has an associated *adjacency list*, i.e., the list of destination nodes reachable directly from that source node. This list represents the **edges** (or **arcs**) of the graph.

**Note**: edges are always directed. Undirected graphs are internally stored as a pair of directed edges (src → dst, dst → src), and are called "symmetric" graphs.

On disk, a simple BVGraph with the basename ``graph`` would be represented as the following set of files:

- ``graph.graph``: contains the compressed adjacency lists of each node, which can be decompressed by the BVGraph codec.
- ``graph.properties``: contains metadata on the graph, such as the number of nodes and arcs, as well as additional loading information needed by the BVGraph codec.
- ``graph.offsets``: a list of offsets of where the adjacency list of each node is stored in the main graph file.
- ``graph.obl``: optionally, an "offset big-list file" which can be used to load graphs faster.

An ImmutableGraph can be loaded using different *load methods*, each of which has different performance implications:

- ``load()``: the entire graph is loaded in RAM and supports random access.
- ``loadMapped()``: the graph is loaded by memory-mapping it from disk (see ``mmap(1)``), at the cost of being potentially slower, especially when doing random access on slow storage.
- ``loadOffline()``: no data is actually loaded in memory, only sequential iteration is possible.

The following code loads a graph stored on disk under the ``compressed/graph`` basename, using the memory-mapped loading mode, and stores it as a generic ImmutableGraph:

.. code-block:: java

    ImmutableGraph graph = ImmutableGraph.loadMapped("compressed/graph");

Note that most of the time you will want to use the SWH-specific subclass **SwhUnidirectionalGraph** instead, which has the same API as an ImmutableGraph except that it adds SWH-specific methods. It is described later in the :ref:`swh-graph-java-node-mappings` section.

Running the code
----------------

To run a piece of Java code written using the Java API, you need to run it with all the dependencies in your classpath (the WebGraph libraries and the swh-graph library). The easiest way to do so is to use the *fat jar* shipped in the swh-graph library on PyPI, which contains all the required dependencies.

.. code-block:: console
    $ java -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java

Note that for bigger graphs, the default heap size of the JVM is likely to be insufficient to load them entirely in memory. It is advised to increase this heap size with the ``-Xmx`` flag:

.. code-block:: console

    $ java -Xmx300G -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java

For more information on performance tuning and memory considerations, you should also read the :ref:`swh-graph-memory` page, in which we recommend additional JVM options for loading large graphs.

Simple traversal
----------------

The ImmutableGraph class provides primitives to iterate and traverse graphs. It contains the following methods:

- ``graph.numNodes()`` returns the number of nodes in the graph (*n*).
- ``graph.numArcs()`` returns the number of arcs in the graph.

And, given a node ID :math:`k \in [0, n - 1]`:

- ``graph.successors(k)`` returns a LazyLongIterator on the nodes that are *adjacent* to *k* (i.e., its outgoing *neighbors*).
- ``graph.outdegree(k)`` returns the number of outgoing neighbors of *k*.

Example: Average outdegree
~~~~~~~~~~~~~~~~~~~~~~~~~~

The following code can be used to compute the average outdegree of a graph, which is a useful measure of its density:

.. code-block:: java

    public static long averageOutdegree(ImmutableGraph graph) {
        return ((long) graph.numArcs()) / graph.numNodes();
    }

Example: Degree distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the ``outdegree()`` primitive, we can compute the outdegree distribution of the graph by iterating on all its nodes. The distribution will be returned as a map that associates to each degree *d* the number of nodes with outdegree *d*.

.. code-block:: java

    public static Map<Long, Long> outdegreeDistribution(ImmutableGraph graph) {
        HashMap<Long, Long> distribution = new HashMap<>();
        for (long k = 0; k < graph.numNodes(); ++k) {
            distribution.merge(graph.outdegree(k), 1L, Long::sum);
        }
        return distribution;
    }

Example: Depth-First Traversal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``successors`` primitive can be used to write a simple stack-based DFS traversal on the graph which starts from a given node and prints all the descendant nodes in its transitive closure:

.. code-block:: java
    :emphasize-lines: 10

    public static void visitNodesDFS(ImmutableGraph graph, long srcNodeId) {
        Stack<Long> stack = new Stack<>();
        HashSet<Long> visited = new HashSet<>();
        stack.push(srcNodeId);
        visited.add(srcNodeId);

        while (!stack.isEmpty()) {
            long currentNodeId = stack.pop();
            System.out.println(currentNodeId);
            LazyLongIterator it = graph.successors(currentNodeId);
            for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
                if (!visited.contains(neighborNodeId)) {
                    stack.push(neighborNodeId);
                    visited.add(neighborNodeId);
                }
            }
        }
    }

Example: Breadth-First Traversal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Swapping the stack for a queue changes the traversal order from depth-first to breadth-first:

.. code-block:: java
    :emphasize-lines: 2

    public static void visitNodesBFS(ImmutableGraph graph, long srcNodeId) {
        Queue<Long> queue = new ArrayDeque<>();
        HashSet<Long> visited = new HashSet<>();
        queue.add(srcNodeId);
        visited.add(srcNodeId);

        while (!queue.isEmpty()) {
            long currentNodeId = queue.poll();
            System.out.println(currentNodeId);
            LazyLongIterator it = graph.successors(currentNodeId);
            for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
                if (!visited.contains(neighborNodeId)) {
                    queue.add(neighborNodeId);
                    visited.add(neighborNodeId);
                }
            }
        }
    }

.. _swh-graph-java-node-mappings:
Node types and SWHIDs
---------------------

In the Software Heritage archive, nodes are not represented by a simple integer, but by a :ref:`SWHID `, which contains both the *type* of the node (revision, directory, blob...) and its unique identifier. We use **node mappings** which allow us to translate between SWHIDs and the compact node IDs used in the compressed graph.

Most notably, we use a MPH (Minimal Perfect Hash) function implemented in the `GOVMinimalPerfectHashFunction `_ class of the Sux4J library, which maps N keys to N consecutive integers with no collisions.

The following files are used to store the mappings between the nodes and their types:

- ``graph.mph``: contains a serialized minimal perfect hash function computed on the list of all the SWHIDs in the graph.
- ``graph.order``: contains the permutation that associates with each output of the MPH the node ID to which it corresponds.
- ``graph.node2swhid.bin``: contains a compact binary representation of all the SWHIDs in the graph, ordered by their rank in the graph file.
- ``graph.node2type.bin``: contains a `LongBigArrayBitVector `_ which stores the type of each node.

To use these mappings easily, we provide the class **SwhUnidirectionalGraph**, an ImmutableGraph which wraps the underlying graph and adds a few utility methods to obtain SWH-specific information on the graph.

A SwhUnidirectionalGraph can be loaded in a similar way to any ImmutableGraph, as long as the mapping files listed above are present::

    SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);

This class exposes the same graph primitives as an ImmutableGraph, but it additionally contains the following methods:

- ``SWHID getSWHID(long nodeId)``: returns the SWHID associated with a given node ID. This function looks up the SWHID at offset *nodeId* in the file ``graph.node2swhid.bin``.
- ``long getNodeId(SWHID swhid)``: returns the node ID associated with a given SWHID. It works by hashing the SWHID with the function stored in ``graph.mph``, then permuting it using the permutation stored in ``graph.order``. It does additional domain-checking by calling ``getSWHID()`` on its own result to check that the input SWHID was valid.
-- ``Node.Type getNodeType(long nodeID)``: returns the type of a given node, as
+- ``SwhType getNodeType(long nodeID)``: returns the type of a given node, as an enum of all the different object types in the Software Heritage data model. It does so by looking up the value at offset *nodeID* in the bit vector stored in ``graph.node2type.bin``.

Example: Find the target directory of a revision
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As an example, we use the methods mentioned above to perform the following task: "given a revision, return its target directory". To do so, we first look up the node ID of the given revision in the compressed graph. We iterate on the successors of that node, and return the SWHID of the first destination node that has the "directory" type.

.. code-block:: java
    :emphasize-lines: 2
    public SWHID findDirectoryOfRevision(SwhUnidirectionalGraph graph, SWHID revSwhid) {
        long src = graph.getNodeId(revSwhid);
-       assert graph.getNodeType(src) == Node.Type.REV;
+       assert graph.getNodeType(src) == SwhType.REV;
        LazyLongIterator it = graph.successors(src);
        for (long dst; (dst = it.nextLong()) != -1;) {
-           if (graph.getNodeType(dst) == Node.Type.DIR) {
+           if (graph.getNodeType(dst) == SwhType.DIR) {
                return graph.getSWHID(dst);
            }
        }
        throw new RuntimeException("Revision has no target directory");
    }

.. _swh-graph-java-node-properties:

Node properties
---------------

The Software Heritage graph is a *property graph*, which means it has various properties associated with its nodes and edges (e.g., commit timestamps, authors, messages, ...). We compress these properties and store them in files alongside the compressed graph. This allows you to write traversal algorithms that depend on these properties.

By default, properties are not assumed to be present and are not loaded when the graph itself is loaded. If you want to use a property, you need to explicitly load it first. As an example, this is how you load the "content length" property to get the length of a given blob::

    SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);
    graph.loadContentLength();
    long blobSize = graph.getContentLength(graph.getNodeId(swhid));

The documentation of the SwhGraphProperties class (**TODO: link**) lists all the different properties, their types, and the methods used to load them and to get their value for a specific node.

A few things of note:

- A single loading call can load multiple properties at once; this is because they are stored in the same file to be more space efficient.
- Persons (authors, committers, etc.) are exported as a single pseudonymized integer ID, that uniquely represents a full name + email.
- Timestamps are stored as a long integer (for the timestamp itself) and a short integer (for the UTC offset).

.. _swh-graph-java-edge-properties:

Edge labels
-----------

While looking up graph properties on the *nodes* of the graph is relatively straightforward, doing so for labels on the *arcs* is comparatively more difficult. These include the names and permissions of directory entries, as well as the branch names of snapshots.

The `ArcLabelledImmutableGraph `_ class in WebGraph wraps an ImmutableGraph, but augments its iterators by making them *labelled iterators*, which essentially allow us to look up the label of the arcs while iterating on them.

This labelled graph is stored in the following files:

- ``graph-labelled.properties``: a property file describing the graph, notably containing the basename of the wrapped graph.
- ``graph-labelled.labels``: the compressed labels.
- ``graph-labelled.labeloffsets``: the offsets used to access the labels in random order.

The SwhUnidirectionalGraph class contains *labelled* loading methods (``loadLabelled()``, ``loadLabelledMapped()``, ...). When these loading methods are used instead of the standard non-labelled ones, the graph is loaded as an ArcLabelledImmutableGraph instead of an ImmutableGraph. The following methods can then be used:

- ``labelledSuccessors(k)`` returns a `LabelledArcIterator `_ which is used in the same way as a LazyLongIterator except it also contains a ``label()`` method to get the label of the currently traversed arc.
- ``labelledNodeIterator()`` returns an `ArcLabelledNodeIterator `_ of all the nodes in the graph, which replaces the LazyLongIterator of the ``successors()`` function by a LabelledArcIterator similar to above.

Label format
~~~~~~~~~~~~

The labels of each arc are returned as a ``DirEntry[]`` array. They encode both the name of a directory entry and its permissions. For snapshot branches, only the "name" field is useful.

Arc label names are encoded as an integer ID representing each unique entry/branch name present in the graph. To retrieve the actual name associated with a given label ID, one needs to load the reverse mapping similar to how you would do for a normal property::

    SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.loadLabelled(basename);
    graph.loadLabelNames();

The byte array representing the actual label name can then be loaded with::

    byte[] name = graph.getLabelName(label.filenameId);

Multiedges
~~~~~~~~~~

The Software Heritage graph is not a *simple graph*, where at most one edge can exist between two vertices, but a *multigraph*, where multiple edges can be incident to the same two vertices. Consider for instance the case of a single directory ``test/`` containing twice the same file blob (e.g., the empty file), under two different names (e.g., ``ISSUES.txt`` and ``TODO.txt``, both completely empty). The simple graph view of this directory will represent it as a single edge ``test`` → *empty file*, while the multigraph view will represent it as *two* edges between the same nodes.

Due to the copy-list model of compression, WebGraph only stores simple graphs, and thus stores multiedges as single edges, to which we cannot associate a single label name (in our example, we need to associate both names ``ISSUES.txt`` and ``TODO.txt``). To represent this possibility of having multiple file names for a single arc, in the case of multiple relationships between two identical nodes, each arc label is stored as an *array* of DirEntry, each record representing one relationship between two nodes.

Example: Printing all the entries of a directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following code showcases how one can print all the entries (name, permission and target SWHID) of a given directory, using the labelled methods seen above.

.. code-block:: java

    public static void printEntries(SwhUnidirectionalGraph g, long dirNode) {
        LabelledArcIterator s = g.labelledSuccessors(dirNode);
        for (long dst; (dst = s.nextLong()) >= 0;) {
            DirEntry[] labels = (DirEntry[]) s.label().get();
            for (DirEntry label : labels) {
                System.out.format(
                    "%s %s %d\n",
                    g.getSWHID(dst),
                    new String(g.getLabelName(label.filenameId)),
                    label.permission
                );
            }
        }
    }

    // Usage: $PROGRAM
    public static void main(String[] args) {
        SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadLabelledMapped(args[0]);
        g.loadLabelNames();
        long dirNode = g.getNodeId(new SWHID(args[1]));
        printEntries(g, dirNode);
    }

Transposed graph
----------------

Up until now, we have only looked at how to traverse the *forward* graph, i.e., the directed graph whose edges are in the same direction as the Merkle DAG of the Software Heritage archive. For many purposes, especially that of finding the *provenance* of software artifacts, it is useful to query the *backward* (or *transposed*) graph instead, which is the same as the forward graph except all the edges are reversed.
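
As a minimal sketch of such a provenance query (assuming the transposed graph has already been loaded as a SwhUnidirectionalGraph named ``backward``, and that ``dirNodeId`` is a valid directory node ID; both names are illustrative assumptions, not part of the API), listing the revisions that directly point to a given directory boils down to iterating on its successors in the backward graph:

.. code-block:: java

    // Sketch: direct provenance of a directory, using the transposed graph.
    // `backward` and `dirNodeId` are illustrative assumptions.
    LazyLongIterator it = backward.successors(dirNodeId);
    for (long pred; (pred = it.nextLong()) != -1;) {
        if (backward.getNodeType(pred) == SwhType.REV) {
            System.out.println(backward.getSWHID(pred));
        }
    }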
The transposed graph has its own set of files, counterparts to the files needed for the forward graph:

- ``graph-transposed.graph``
- ``graph-transposed.properties``
- ``graph-transposed.offsets``
- ``graph-transposed.obl``
- ``graph-transposed-labelled.labels``
- ``graph-transposed-labelled.labeloffsets``
- ``graph-transposed-labelled.properties``

However, because node IDs are the same in the forward and the backward graph, all the files that pertain to mappings between the node IDs and various properties (SWHIDs, property data, node permutations etc) remain the same.

Example: Earliest revision containing a given blob
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following code loads all the committer timestamps of the revisions in the graph, then walks the *transposed* graph to return the earliest revision containing a given object.

.. code-block:: java

    public static long findEarliestRevisionContaining(SwhUnidirectionalGraph g, long src) {
        long oldestRev = -1;
        long oldestRevTs = Long.MAX_VALUE;

        Stack<Long> stack = new Stack<>();
        HashSet<Long> visited = new HashSet<>();
        stack.push(src);
        visited.add(src);
        while (!stack.isEmpty()) {
            long currentNodeId = stack.pop();
            LazyLongIterator it = g.successors(currentNodeId);
            for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
                if (!visited.contains(neighborNodeId)) {
                    stack.push(neighborNodeId);
                    visited.add(neighborNodeId);
-                   if (g.getNodeType(neighborNodeId) == Node.Type.REV) {
+                   if (g.getNodeType(neighborNodeId) == SwhType.REV) {
                        Long ts = g.getCommitterTimestamp(neighborNodeId);
                        if (ts != null && ts < oldestRevTs) {
                            oldestRev = neighborNodeId;
                            oldestRevTs = ts;
                        }
                    }
                }
            }
        }
        return oldestRev;
    }

    // Usage: $PROGRAM
    public static void main(String[] args) {
        // Load the backward (= transposed) graph as a SwhUnidirectionalGraph
        SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadMapped(args[0] + "-transposed");
        g.loadCommitterTimestamps();
        long node = g.getNodeId(new SWHID(args[1]));
        long oldestRev = findEarliestRevisionContaining(g, node);
        System.out.println(g.getSWHID(oldestRev));
    }

Bidirectional Graph
-------------------

BidirectionalImmutableGraph
~~~~~~~~~~~~~~~~~~~~~~~~~~~

While ``graph-transposed`` can be loaded as a simple SwhUnidirectionalGraph and then manipulated just like the forward graph, it is often convenient to have *both* the forward and the backward graph in memory. Some traversal algorithms require first going down in the forward graph to select some nodes, then going up to find their provenance.

To achieve that, we use the `BidirectionalImmutableGraph `_ class from WebGraph, which stores both a graph and its transpose. This class provides the following methods to iterate on the **backward** graph, shown here with their counterparts for the forward graph (a short usage sketch follows this list):

.. list-table::
   :header-rows: 1

   * - Forward graph operation
     - Backward graph operation
   * - ``outdegree(k)``
     - ``indegree(k)``
   * - ``successors(k)``
     - ``predecessors(k)``

In addition, the class offers a few convenience methods which are generally useful when you have both a graph and its transpose:

- ``transpose()`` returns the transpose of the BidirectionalImmutableGraph by inverting the references to the forward and the backward graphs. Successors become predecessors, and vice-versa.
- ``symmetrize()`` returns the symmetric (= undirected) version of the bidirectional graph. It is implemented by a union between the forward and the backward graph, which basically boils down to removing the directionality of the edges (the successors of a node are also its predecessors).
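
As a minimal usage sketch (assuming ``graph`` is an already-loaded BidirectionalImmutableGraph and ``nodeId`` a valid node ID; both names are illustrative assumptions), the backward primitives mirror the forward ones: counting the predecessors of a node by iteration should match its indegree:

.. code-block:: java

    // Sketch: `graph` and `nodeId` are illustrative assumptions.
    long count = 0;
    LazyLongIterator it = graph.predecessors(nodeId);
    while (it.nextLong() != -1) {
        count++;
    }
    assert count == graph.indegree(nodeId);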
SwhBidirectionalGraph
~~~~~~~~~~~~~~~~~~~~~

As with ImmutableGraph, we extend the BidirectionalImmutableGraph with SWH-specific methods, in the subclass ``SwhBidirectionalGraph``. Notably, it contains the method ``labelledPredecessors()``, the equivalent of ``labelledSuccessors()`` but on the backward graph.

Because SwhUnidirectionalGraph inherits from ImmutableGraph, and SwhBidirectionalGraph inherits from BidirectionalImmutableGraph, we put the common behavior between the two classes in a SwhGraph interface, which can represent either a unidirectional or a bidirectional graph.

To avoid loading the node properties two times (once for each direction), they are stored in a separate class called SwhGraphProperties. In a SwhBidirectionalGraph, the two SwhUnidirectionalGraph instances share their node properties in memory by storing references to the same SwhGraphProperties object.

.. code-block:: text

                           ┌──────────────┐
                           │ImmutableGraph◄────────┐
                           └────▲─────────┘        │extends
                                │                  │
                                │       ┌──────────┴────────────────┐
                         extends│       │BidirectionalImmutableGraph│
                                │       └────────────▲──────────────┘
                                │                    │extends
             ┌──────────────┴───────┐     ┌──────┴──────────────┐
             │SwhUnidirectionalGraph│◄────┤SwhBidirectionalGraph│
             └──┬──────────────┬────┘     └────────┬───────────┬┘
                │              │  contains x2      │           │
                │              │                   │           │
                │    implements│                   │implements │
                │             ┌▼──────────┐        │           │
                │             │SwhGraph(I)◄────────┘           │
        contains│             └───────────┘                    │contains
                │                                              │
                │            ┌──────────────────┐              │
                └────────────►SwhGraphProperties◄──────────────┘
                             └──────────────────┘

Example: Find all the shared-commit forks of a given origin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to define the *forks* of an origin as the set of origins which share at least one revision with that origin.

The following code loads the graph in both directions using a SwhBidirectionalGraph. Given an origin SWHID, it first walks the *forward* graph to find all its root revisions. It then walks the *backward* graph to find all the origins containing these root revisions, i.e., its *forks*.

.. code-block:: java
    public static void findSharedCommitForks(SwhBidirectionalGraph g, long srcOrigin) {
        Stack<Long> forwardStack = new Stack<>();
        HashSet<Long> forwardVisited = new HashSet<>();
        Stack<Long> backwardStack = new Stack<>();
        HashSet<Long> backwardVisited = new HashSet<>();

        // First traversal (forward graph): find all the root revisions of the
        // origin
        forwardStack.push(srcOrigin);
        forwardVisited.add(srcOrigin);
        while (!forwardStack.isEmpty()) {
            long curr = forwardStack.pop();
            LazyLongIterator it = g.successors(curr);
            boolean isRootRevision = true;
            for (long succ; (succ = it.nextLong()) != -1;) {
-               Node.Type nt = g.getNodeType(succ);
+               SwhType nt = g.getNodeType(succ);
                if (!forwardVisited.contains(succ)
-                       && nt != Node.Type.DIR && nt != Node.Type.CNT) {
+                       && nt != SwhType.DIR && nt != SwhType.CNT) {
                    forwardStack.push(succ);
                    forwardVisited.add(succ);
                    isRootRevision = false;
                }
            }
-           if (g.getNodeType(curr) == Node.Type.REV && isRootRevision) {
+           if (g.getNodeType(curr) == SwhType.REV && isRootRevision) {
                // Found a root revision, add it to the second stack
                backwardStack.push(curr);
                backwardVisited.add(curr);
            }
        }

        // Second traversal (backward graph): find all the origins containing
        // any of these root revisions and print them
        while (!backwardStack.isEmpty()) {
            long curr = backwardStack.pop();
            LazyLongIterator it = g.predecessors(curr);
            for (long succ; (succ = it.nextLong()) != -1;) {
-               Node.Type nt = g.getNodeType(succ);
+               SwhType nt = g.getNodeType(succ);
                if (!backwardVisited.contains(succ)) {
                    backwardStack.push(succ);
                    backwardVisited.add(succ);
-                   if (nt == Node.Type.ORI) {
+                   if (nt == SwhType.ORI) {
                        // Found an origin, print it.
                        System.out.println(g.getSWHID(succ));
                    }
                }
            }
        }
    }

    // Usage: $PROGRAM
    public static void main(String[] args) {
        // Load both forward and backward graphs as a SwhBidirectionalGraph
        SwhBidirectionalGraph g = SwhBidirectionalGraph.loadMapped(args[0]);
        long node = g.getNodeId(new SWHID(args[1]));
        findSharedCommitForks(g, node);
    }

Large-scale processing
----------------------

Multithreading
~~~~~~~~~~~~~~

ImmutableGraph is not thread-safe. When writing multithreaded algorithms, calling ``successors()`` on the same graph from multiple threads will return garbage. Instead, each thread should create its own "lightweight copy" of the graph by calling ``.copy()``. This will not actually copy the entire graph data, which will remain shared across threads, but it will create new instances of the iterators so that each thread can independently iterate on the graph data.

Data structures for large traversals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When doing very large traversals, such as a BFS on the entire graph, the usual data structures (HashSet, Stack, ArrayDeque, etc.) will be quite inefficient. If you know you are going to traverse large parts of the graph, it's better to use more appropriate data structures, a lot of which can be found in the dsiutils library. In particular:

- `LongArrayBitVector `_ is an efficient bit-vector implementation, which can be used to store the nodes that have already been seen in the visit. Its memory footprint is too big to use for small traversals, but it is very efficient to traverse the full graph, as every node only takes a single bit (see the sketch after this list).
- `ByteDiskQueue `_ can be used to efficiently store the queue of nodes to visit on disk, when it is too large to fit in RAM.
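
As a minimal sketch (assuming ``graph`` is an already-loaded ImmutableGraph; starting arbitrarily from node ``0``), a large BFS can replace the ``HashSet<Long>`` visited set with a LongArrayBitVector, at one bit per node:

.. code-block:: java

    // Sketch: BFS using a bit vector as the visited set.
    // For truly huge traversals the queue itself could also move to
    // disk (e.g., with ByteDiskQueue), which this sketch does not do.
    public static void visitAllBFS(ImmutableGraph graph) {
        LongArrayBitVector visited = LongArrayBitVector.ofLength(graph.numNodes());
        Queue<Long> queue = new ArrayDeque<>();
        queue.add(0L);
        visited.set(0L);
        while (!queue.isEmpty()) {
            long currentNodeId = queue.poll();
            LazyLongIterator it = graph.successors(currentNodeId);
            for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
                if (!visited.getBoolean(neighborNodeId)) {
                    visited.set(neighborNodeId);
                    queue.add(neighborNodeId);
                }
            }
        }
    }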
Other types in dsiutils and fastutil can save significant memory: ``LongArrayList`` saves at least 8 bytes per entry over ``ArrayList``, and ``Long2LongOpenHashMap`` saves at least 16 bytes for every entry over ``HashMap``. We strongly recommend reading the documentation of the unimi libraries and looking at the code for usage examples.

BigArrays
~~~~~~~~~

When working with the Software Heritage graph, it is often necessary to store large arrays of values, with a size exceeding 2^32 items. Unfortunately, standard Java arrays do not support this.

To circumvent this, WebGraph uses the `BigArrays scheme `_ from the fastutil library: "big arrays" are stored as arrays of arrays, supporting quadrillions of records.

A BigArray ``long[][] a`` can be used with the following methods:

- ``BigArrays.get(a, i)`` to get the value at index *i*.
- ``BigArrays.set(a, i, v)`` to set the value at index *i* to *v*.
- ``BigArrays.length(a)`` to get the total length of the bigarray.

diff --git a/java/src/main/java/org/softwareheritage/graph/AllowedEdges.java b/java/src/main/java/org/softwareheritage/graph/AllowedEdges.java
index 0c382f7..a91276f 100644
--- a/java/src/main/java/org/softwareheritage/graph/AllowedEdges.java
+++ b/java/src/main/java/org/softwareheritage/graph/AllowedEdges.java
@@ -1,98 +1,98 @@
/*
 * Copyright (c) 2019-2022 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph;

import java.util.ArrayList;

/**
 * Edge restriction based on node types, used when visiting the graph.
 *
 * The Software Heritage graph contains multiple node types (contents, directories, revisions, ...)
 * and restricting the traversal to specific node types is necessary for many querying use cases.
 *
 * @author The Software Heritage developers
 */
public class AllowedEdges {
    /**
     * 2D boolean matrix storing access rights for all combinations of src/dst node types (first
     * dimension is source, second dimension is destination); when edge restriction is not enforced
     * this array is set to null for early bypass.
     */
    public boolean[][] restrictedTo;

    /**
     * Constructor.
     *
     * @param edgesFmt a formatted string describing allowed edges
     */
    public AllowedEdges(String edgesFmt) {
-        int nbNodeTypes = Node.Type.values().length;
+        int nbNodeTypes = SwhType.values().length;
        this.restrictedTo = new boolean[nbNodeTypes][nbNodeTypes];
        // Special values (null, empty, "*")
        if (edgesFmt == null || edgesFmt.isEmpty()) {
            return;
        }
        if (edgesFmt.equals("*")) {
            // Allows for quick bypass (with simple null check) when no edge restriction
            restrictedTo = null;
            return;
        }

        // Format: "src1:dst1,src2:dst2,[...]"
        String[] edgeTypes = edgesFmt.split(",");
        for (String edgeType : edgeTypes) {
            String[] nodeTypes = edgeType.split(":");
            if (nodeTypes.length != 2) {
                throw new IllegalArgumentException("Cannot parse edge type: " + edgeType);
            }

-            ArrayList<Node.Type> srcTypes = Node.Type.parse(nodeTypes[0]);
-            ArrayList<Node.Type> dstTypes = Node.Type.parse(nodeTypes[1]);
-            for (Node.Type srcType : srcTypes) {
-                for (Node.Type dstType : dstTypes) {
+            ArrayList<SwhType> srcTypes = SwhType.parse(nodeTypes[0]);
+            ArrayList<SwhType> dstTypes = SwhType.parse(nodeTypes[1]);
+            for (SwhType srcType : srcTypes) {
+                for (SwhType dstType : dstTypes) {
                    restrictedTo[srcType.ordinal()][dstType.ordinal()] = true;
                }
            }
        }
    }

    /**
     * Checks if a given edge can be followed during graph traversal.
     *
     * @param srcType edge source type
     * @param dstType edge destination type
     * @return true if allowed and false otherwise
     */
-    public boolean isAllowed(Node.Type srcType, Node.Type dstType) {
+    public boolean isAllowed(SwhType srcType, SwhType dstType) {
        if (restrictedTo == null)
            return true;
        return restrictedTo[srcType.ordinal()][dstType.ordinal()];
    }

    /**
     * Return a new AllowedEdges instance with reversed edge restrictions. e.g. "src1:dst1,src2:dst2"
     * becomes "dst1:src1,dst2:src2"
     *
     * @return a new AllowedEdges instance with reversed edge restrictions
     */
    public AllowedEdges reverse() {
        AllowedEdges reversed = new AllowedEdges(null);
        reversed.restrictedTo = new boolean[restrictedTo.length][restrictedTo[0].length];
        for (int i = 0; i < restrictedTo.length; i++) {
            for (int j = 0; j < restrictedTo[0].length; j++) {
                reversed.restrictedTo[i][j] = restrictedTo[j][i];
            }
        }
        return reversed;
    }
}

diff --git a/java/src/main/java/org/softwareheritage/graph/AllowedNodes.java b/java/src/main/java/org/softwareheritage/graph/AllowedNodes.java
index b63edf2..d80cae4 100644
--- a/java/src/main/java/org/softwareheritage/graph/AllowedNodes.java
+++ b/java/src/main/java/org/softwareheritage/graph/AllowedNodes.java
@@ -1,57 +1,57 @@
/*
 * Copyright (c) 2020 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph;

/**
 * Node type restriction, useful to implement filtering of returned nodes during traversal.
* * @author The Software Heritage developers */ public class AllowedNodes { public boolean[] restrictedTo; /** * Constructor. * * @param nodesFmt a formatted string describing allowed nodes */ public AllowedNodes(String nodesFmt) { - int nbNodeTypes = Node.Type.values().length; + int nbNodeTypes = SwhType.values().length; this.restrictedTo = new boolean[nbNodeTypes]; // Special values (null, empty, "*") if (nodesFmt == null || nodesFmt.isEmpty()) { return; } if (nodesFmt.equals("*")) { // Allows for quick bypass (with simple null check) when no node restriction restrictedTo = null; return; } // Format: "nodeType1,nodeType2,[...]" String[] nodeTypesStr = nodesFmt.split(","); for (String nodeTypeStr : nodeTypesStr) { - for (Node.Type nodeType : Node.Type.parse(nodeTypeStr)) { - this.restrictedTo[Node.Type.toInt(nodeType)] = true; + for (SwhType nodeType : SwhType.parse(nodeTypeStr)) { + this.restrictedTo[SwhType.toInt(nodeType)] = true; } } } /** * Checks if a given node type is allowed. * * @param nodeType node type to check * @return true if allowed and false otherwise */ - public boolean isAllowed(Node.Type nodeType) { + public boolean isAllowed(SwhType nodeType) { if (restrictedTo == null) return true; - return restrictedTo[Node.Type.toInt(nodeType)]; + return restrictedTo[SwhType.toInt(nodeType)]; } } diff --git a/java/src/main/java/org/softwareheritage/graph/Node.java b/java/src/main/java/org/softwareheritage/graph/Node.java deleted file mode 100644 index 9d46a76..0000000 --- a/java/src/main/java/org/softwareheritage/graph/Node.java +++ /dev/null @@ -1,146 +0,0 @@ -/* - * Copyright (c) 2019-2022 The Software Heritage developers - * See the AUTHORS file at the top-level directory of this distribution - * License: GNU General Public License version 3, or any later version - * See top-level LICENSE file for more information - */ - -package org.softwareheritage.graph; - -import java.util.*; - -/** - * A node in the Software Heritage graph. - * - * @author The Software Heritage developers - */ - -public class Node { - /** - * Software Heritage graph node types, as described in the - * data model. - */ - public enum Type { - /** Content node */ - CNT, - /** Directory node */ - DIR, - /** Origin node */ - ORI, - /** Release node */ - REL, - /** Revision node */ - REV, - /** Snapshot node */ - SNP; - - /** - * Converts integer to corresponding SWH node type. - * - * @param intType node type represented as an integer - * @return the corresponding {@link Node.Type} value - * @see org.softwareheritage.graph.Node.Type - */ - public static Node.Type fromInt(int intType) { - switch (intType) { - case 0: - return CNT; - case 1: - return DIR; - case 2: - return ORI; - case 3: - return REL; - case 4: - return REV; - case 5: - return SNP; - } - return null; - } - - /** - * Converts node types to the corresponding int value - * - * @param type node type as an enum - * @return the corresponding int value - */ - public static int toInt(Node.Type type) { - switch (type) { - case CNT: - return 0; - case DIR: - return 1; - case ORI: - return 2; - case REL: - return 3; - case REV: - return 4; - case SNP: - return 5; - } - throw new IllegalArgumentException("Unknown node type: " + type); - } - - /** - * Converts string to corresponding SWH node type. 
-     *
-     * @param strType node type represented as a string
-     * @return the corresponding {@link Node.Type} value
-     * @see org.softwareheritage.graph.Node.Type
-     */
-    public static Node.Type fromStr(String strType) {
-        if (!strType.matches("cnt|dir|ori|rel|rev|snp")) {
-            throw new IllegalArgumentException("Unknown node type: " + strType);
-        }
-        return Node.Type.valueOf(strType.toUpperCase());
-    }
-
-    /**
-     * Converts byte array name to the int code of the corresponding SWH node type. Used for
-     * performance-critical deserialization.
-     *
-     * @param name node type represented as a byte array (e.g. b"cnt")
-     * @return the ordinal value of the corresponding {@link Node.Type}
-     * @see org.softwareheritage.graph.Node.Type
-     */
-    public static int byteNameToInt(byte[] name) {
-        if (Arrays.equals(name, "cnt".getBytes())) {
-            return 0;
-        } else if (Arrays.equals(name, "dir".getBytes())) {
-            return 1;
-        } else if (Arrays.equals(name, "ori".getBytes())) {
-            return 2;
-        } else if (Arrays.equals(name, "rel".getBytes())) {
-            return 3;
-        } else if (Arrays.equals(name, "rev".getBytes())) {
-            return 4;
-        } else if (Arrays.equals(name, "snp".getBytes())) {
-            return 5;
-        } else
-            return -1;
-    }
-
-    /**
-     * Parses SWH node type possible values from formatted string (see the
-     * API syntax).
-     *
-     * @param strFmtType node types represented as a formatted string
-     * @return a list containing the {@link Node.Type} values
-     * @see org.softwareheritage.graph.Node.Type
-     */
-    public static ArrayList<Node.Type> parse(String strFmtType) {
-        ArrayList<Node.Type> types = new ArrayList<>();
-
-        if (strFmtType.equals("*")) {
-            List<Node.Type> nodeTypes = Arrays.asList(Node.Type.values());
-            types.addAll(nodeTypes);
-        } else {
-            types.add(Node.Type.fromStr(strFmtType));
-        }
-
-        return types;
-    }
-    }
-}

diff --git a/java/src/main/java/org/softwareheritage/graph/SWHID.java b/java/src/main/java/org/softwareheritage/graph/SWHID.java
index 18951fc..3ccb90a 100644
--- a/java/src/main/java/org/softwareheritage/graph/SWHID.java
+++ b/java/src/main/java/org/softwareheritage/graph/SWHID.java
@@ -1,125 +1,125 @@
/*
 * Copyright (c) 2019 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph;

import com.fasterxml.jackson.annotation.JsonValue;
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.binary.Hex;

/**
 * A Software Heritage persistent identifier (SWHID), see the persistent
 * identifier documentation.
 *
 * @author The Software Heritage developers
 */
public class SWHID {
    /** Fixed hash length of the SWHID */
    public static final int HASH_LENGTH = 40;

    /** Full SWHID as a string */
    String swhid;

    /** SWHID node type */
-    Node.Type type;
+    SwhType type;

    /**
     * Constructor.
     *
     * @param swhid full SWHID as a string
     */
    public SWHID(String swhid) {
        this.swhid = swhid;
        // SWHID format: 'swh:1:type:hash'
        String[] parts = swhid.split(":");
        if (parts.length != 4 || !parts[0].equals("swh") || !parts[1].equals("1")) {
            throw new IllegalArgumentException("malformed SWHID: " + swhid);
        }
-        this.type = Node.Type.fromStr(parts[2]);
+        this.type = SwhType.fromStr(parts[2]);
        if (!parts[3].matches("[0-9a-f]{" + HASH_LENGTH + "}")) {
            throw new IllegalArgumentException("malformed SWHID: " + swhid);
        }
    }

    /**
     * Creates a SWHID from a compact binary representation.
     *

* The binary format is specified in the Python module swh.graph.swhid:str_to_bytes . */ public static SWHID fromBytes(byte[] input) { byte[] digest = new byte[20]; System.arraycopy(input, 2, digest, 0, digest.length); - String swhidStr = String.format("swh:%d:%s:%s", input[0], Node.Type.fromInt(input[1]).toString().toLowerCase(), + String swhidStr = String.format("swh:%d:%s:%s", input[0], SwhType.fromInt(input[1]).toString().toLowerCase(), Hex.encodeHexString(digest)); return new SWHID(swhidStr); } @Override public boolean equals(Object otherObj) { if (otherObj == this) return true; if (!(otherObj instanceof SWHID)) return false; SWHID other = (SWHID) otherObj; return swhid.equals(other.getSWHID()); } @Override public int hashCode() { return swhid.hashCode(); } @Override public String toString() { return swhid; } /** * Converts SWHID to a compact binary representation. *

     * The binary format is specified in the Python module swh.graph.swhid:str_to_bytes.
     */
    public byte[] toBytes() {
        byte[] bytes = new byte[22];
        byte[] digest;
        bytes[0] = (byte) 1; // namespace version
-        bytes[1] = (byte) Node.Type.toInt(this.type); // SWHID type
+        bytes[1] = (byte) SwhType.toInt(this.type); // SWHID type
        try {
            digest = Hex.decodeHex(this.swhid.substring(10)); // SHA1 hash
            System.arraycopy(digest, 0, bytes, 2, digest.length);
        } catch (DecoderException e) {
            throw new IllegalArgumentException("invalid hex sequence in SWHID: " + this.swhid);
        }
        return bytes;
    }

    /**
     * Returns full SWHID as a string.
     *
     * @return full SWHID string
     */
    @JsonValue
    public String getSWHID() {
        return swhid;
    }

    /**
     * Returns SWHID node type.
     *
-     * @return SWHID corresponding {@link Node.Type}
-     * @see org.softwareheritage.graph.Node.Type
+     * @return SWHID corresponding {@link SwhType}
+     * @see SwhType
     */
-    public Node.Type getType() {
+    public SwhType getType() {
        return type;
    }
}

diff --git a/java/src/main/java/org/softwareheritage/graph/Subgraph.java b/java/src/main/java/org/softwareheritage/graph/Subgraph.java
index 591279c..9cafc0b 100644
--- a/java/src/main/java/org/softwareheritage/graph/Subgraph.java
+++ b/java/src/main/java/org/softwareheritage/graph/Subgraph.java
@@ -1,231 +1,231 @@
/*
 * Copyright (c) 2020 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph;

import it.unimi.dsi.big.webgraph.ImmutableGraph;
import it.unimi.dsi.big.webgraph.LazyLongIterator;
import it.unimi.dsi.big.webgraph.NodeIterator;

import java.util.NoSuchElementException;

public class Subgraph extends ImmutableGraph {
    private final SwhBidirectionalGraph underlyingGraph;
    public final AllowedNodes allowedNodeTypes;

    private long nodeCount = -1;

    /**
     * Constructor.
     *
     */
    public Subgraph(SwhBidirectionalGraph underlyingGraph, AllowedNodes allowedNodeTypes) {
        this.underlyingGraph = underlyingGraph.copy();
        this.allowedNodeTypes = allowedNodeTypes;
    }

    /**
     * Return a flyweight copy of the graph.
     */
    @Override
    public Subgraph copy() {
        return new Subgraph(this.underlyingGraph.copy(), allowedNodeTypes);
    }

    @Override
    public boolean randomAccess() {
        return underlyingGraph.randomAccess();
    }

    /**
     * Return a transposed version of the graph.
     */
    public Subgraph transpose() {
        return new Subgraph(underlyingGraph.transpose(), allowedNodeTypes);
    }

    /**
     * Return a symmetric version of the graph.
     */
    public Subgraph symmetrize() {
        return new Subgraph(underlyingGraph.symmetrize(), allowedNodeTypes);
    }

    /**
     * Returns number of nodes in the graph.
     *
     * @return number of nodes in the graph
     */
    @Override
    public long numNodes() {
        if (nodeCount == -1) {
            // -1 is only a "not yet computed" sentinel: reset before counting
            nodeCount = 0;
            for (long i = 0; i < underlyingGraph.numNodes(); ++i) {
                if (nodeExists(i))
                    ++nodeCount;
            }
        }
        return nodeCount;
    }

    /**
     * Returns number of edges in the graph.
     *
     * @return number of edges in the graph
     */
    @Override
    public long numArcs() {
        throw new UnsupportedOperationException("Cannot determine the number of arcs in a subgraph");
    }

    public long maxNodeNumber() {
        return underlyingGraph.numNodes();
    }

    public boolean nodeExists(long node) {
        return allowedNodeTypes.isAllowed(underlyingGraph.getNodeType(node));
    }

    /**
     * Returns lazy iterator of successors of a node.
     *
     * @param nodeId node specified as a long id
     * @return lazy iterator of successors of the node, specified as a
     *         WebGraph LazyLongIterator
     */
    @Override
    public LazyLongIterator successors(long nodeId) {
        if (!nodeExists(nodeId)) {
            throw new IllegalArgumentException("Node " + nodeId + " not in subgraph");
        }
        LazyLongIterator allSuccessors = underlyingGraph.successors(nodeId);
        return new LazyLongIterator() {
            @Override
            public long nextLong() {
                long neighbor;
                while ((neighbor = allSuccessors.nextLong()) != -1) {
                    if (nodeExists(neighbor)) {
                        return neighbor;
                    }
                }
                return -1;
            }

            @Override
            public long skip(final long n) {
                long i;
                for (i = 0; i < n && nextLong() != -1; i++)
                    ;
                return i;
            }
        };
    }

    /**
     * Returns the outdegree of a node.
     *
     * @param nodeId node specified as a long id
     * @return outdegree of a node
     */
    @Override
    public long outdegree(long nodeId) {
        long deg = 0;
        for (LazyLongIterator allSuccessors = successors(nodeId); allSuccessors.nextLong() != -1; ++deg)
            ;
        return deg;
    }

    @Override
    public NodeIterator nodeIterator() {
        return new NodeIterator() {
            final long n = numNodes();
            long i = -1;
            long done = 0;

            @Override
            public boolean hasNext() {
                return done < n;
            }

            @Override
            public long nextLong() {
                if (!hasNext())
                    throw new NoSuchElementException();
                do {
                    ++i;
                    if (i >= underlyingGraph.numNodes())
                        throw new NoSuchElementException();
                } while (!nodeExists(i));
                ++done;
                return i;
            }

            @Override
            public long outdegree() {
                return Subgraph.this.outdegree(i);
            }

            @Override
            public LazyLongIterator successors() {
                return Subgraph.this.successors(i);
            }
        };
    }

    /**
     * Returns lazy iterator of predecessors of a node.
     *
     * @param nodeId node specified as a long id
     * @return lazy iterator of predecessors of the node, specified as a
     *         WebGraph LazyLongIterator
     */
    public LazyLongIterator predecessors(long nodeId) {
        return this.transpose().successors(nodeId);
    }

    /**
     * Returns the indegree of a node.
     *
     * @param nodeId node specified as a long id
     * @return indegree of a node
     */
    public long indegree(long nodeId) {
        return this.transpose().outdegree(nodeId);
    }

    /**
     * Converts {@link SWHID} node to long.
     *
     * @param swhid node specified as a {@link SWHID}
     * @return internal long node id
     * @see SWHID
     */
    public long getNodeId(SWHID swhid) {
        return underlyingGraph.getNodeId(swhid);
    }

    /**
     * Converts long id node to {@link SWHID}.
     *
     * @param nodeId node specified as a long id
     * @return external SWHID
     * @see SWHID
     */
    public SWHID getSWHID(long nodeId) {
        return underlyingGraph.getSWHID(nodeId);
    }

    /**
     * Returns node type.
     *
     * @param nodeId node specified as a long id
     * @return corresponding node type
-     * @see Node.Type
+     * @see SwhType
     */
-    public Node.Type getNodeType(long nodeId) {
+    public SwhType getNodeType(long nodeId) {
        return underlyingGraph.getNodeType(nodeId);
    }
}

diff --git a/java/src/main/java/org/softwareheritage/graph/SwhGraph.java b/java/src/main/java/org/softwareheritage/graph/SwhGraph.java
index 432de35..aee50cd 100644
--- a/java/src/main/java/org/softwareheritage/graph/SwhGraph.java
+++ b/java/src/main/java/org/softwareheritage/graph/SwhGraph.java
@@ -1,151 +1,151 @@
/*
 * Copyright (c) 2021-2022 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph;

import java.io.IOException;

/**
 * Common interface for SWH graph classes.
* * This interface forwards all property loading/access methods to the SwhGraphProperties object * returned by the getProperties() method of the implementing class. This allows API users to write * graph.getNodeType() instead of graph.getProperties().getNodeType(). */ public interface SwhGraph { /** * Cleans up graph resources after use. */ void close() throws IOException; /** * Returns the SWH graph properties object of this graph. * * @return graph properties */ SwhGraphProperties getProperties(); /** @see SwhGraphProperties#getPath() */ default String getPath() { return getProperties().getPath(); } /** @see SwhGraphProperties#getNodeId(SWHID) */ default long getNodeId(SWHID swhid) { return getProperties().getNodeId(swhid); } /** @see SwhGraphProperties#getSWHID(long) */ default SWHID getSWHID(long nodeId) { return getProperties().getSWHID(nodeId); } /** @see SwhGraphProperties#getNodeType(long) */ - default Node.Type getNodeType(long nodeId) { + default SwhType getNodeType(long nodeId) { return getProperties().getNodeType(nodeId); } /** @see SwhGraphProperties#loadContentLength() */ default void loadContentLength() throws IOException { getProperties().loadContentLength(); } /** @see SwhGraphProperties#getContentLength(long) */ default Long getContentLength(long nodeId) { return getProperties().getContentLength(nodeId); } /** @see SwhGraphProperties#loadPersonIds() */ default void loadPersonIds() throws IOException { getProperties().loadPersonIds(); } /** @see SwhGraphProperties#getAuthorId(long) */ default Long getAuthorId(long nodeId) { return getProperties().getAuthorId(nodeId); } /** @see SwhGraphProperties#getCommitterId(long) */ default Long getCommitterId(long nodeId) { return getProperties().getCommitterId(nodeId); } /** @see SwhGraphProperties#loadContentIsSkipped() */ default void loadContentIsSkipped() throws IOException { getProperties().loadContentIsSkipped(); } /** @see SwhGraphProperties#isContentSkipped(long) */ default boolean isContentSkipped(long nodeId) { return getProperties().isContentSkipped(nodeId); } /** @see SwhGraphProperties#loadAuthorTimestamps() */ default void loadAuthorTimestamps() throws IOException { getProperties().loadAuthorTimestamps(); } /** @see SwhGraphProperties#getAuthorTimestamp(long) */ default Long getAuthorTimestamp(long nodeId) { return getProperties().getAuthorTimestamp(nodeId); } /** @see SwhGraphProperties#getAuthorTimestampOffset(long) */ default Short getAuthorTimestampOffset(long nodeId) { return getProperties().getAuthorTimestampOffset(nodeId); } /** @see SwhGraphProperties#loadCommitterTimestamps() */ default void loadCommitterTimestamps() throws IOException { getProperties().loadCommitterTimestamps(); } /** @see SwhGraphProperties#getCommitterTimestamp(long) */ default Long getCommitterTimestamp(long nodeId) { return getProperties().getCommitterTimestamp(nodeId); } /** @see SwhGraphProperties#getCommitterTimestampOffset(long) */ default Short getCommitterTimestampOffset(long nodeId) { return getProperties().getCommitterTimestampOffset(nodeId); } /** @see SwhGraphProperties#loadMessages() */ default void loadMessages() throws IOException { getProperties().loadMessages(); } /** @see SwhGraphProperties#getMessage(long) */ default byte[] getMessage(long nodeId) { return getProperties().getMessage(nodeId); } /** @see SwhGraphProperties#getUrl(long) */ default String getUrl(long nodeId) { return getProperties().getUrl(nodeId); } /** @see SwhGraphProperties#loadTagNames() */ default void loadTagNames() throws IOException { 
        getProperties().loadTagNames();
    }

    /** @see SwhGraphProperties#getTagName(long) */
    default byte[] getTagName(long nodeId) {
        return getProperties().getTagName(nodeId);
    }

    /** @see SwhGraphProperties#loadLabelNames() */
    default void loadLabelNames() throws IOException {
        getProperties().loadLabelNames();
    }

    /** @see SwhGraphProperties#getLabelName(long) */
    default byte[] getLabelName(long labelId) {
        return getProperties().getLabelName(labelId);
    }
}

diff --git a/java/src/main/java/org/softwareheritage/graph/SwhGraphProperties.java b/java/src/main/java/org/softwareheritage/graph/SwhGraphProperties.java
index 9de9762..3372947 100644
--- a/java/src/main/java/org/softwareheritage/graph/SwhGraphProperties.java
+++ b/java/src/main/java/org/softwareheritage/graph/SwhGraphProperties.java
@@ -1,330 +1,330 @@
/*
 * Copyright (c) 2021-2022 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph;

import it.unimi.dsi.big.util.MappedFrontCodedStringBigList;
import it.unimi.dsi.bits.LongArrayBitVector;
import it.unimi.dsi.fastutil.bytes.ByteBigList;
import it.unimi.dsi.fastutil.bytes.ByteMappedBigList;
import it.unimi.dsi.fastutil.ints.IntBigList;
import it.unimi.dsi.fastutil.ints.IntMappedBigList;
import it.unimi.dsi.fastutil.io.BinIO;
import it.unimi.dsi.fastutil.longs.LongBigList;
import it.unimi.dsi.fastutil.longs.LongMappedBigList;
import it.unimi.dsi.fastutil.shorts.ShortBigList;
import it.unimi.dsi.fastutil.shorts.ShortMappedBigList;
import it.unimi.dsi.sux4j.util.EliasFanoLongBigList;
import org.apache.commons.configuration2.ex.ConfigurationException;
import org.softwareheritage.graph.maps.NodeIdMap;
import org.softwareheritage.graph.maps.NodeTypesMap;

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Base64;

/**
 * This object contains SWH graph properties such as node labels.
 *
 * Some property mappings are necessary because Software Heritage uses string-based persistent
 * identifiers (SWHID) while WebGraph uses integers internally.
 *
 * The two node ID mappings (long id ↔ SWHID) are used for the input (users refer to the graph
 * using SWHID) and the output (convert back to SWHID for users' results).
 *
 * Since graph traversal can be restricted depending on the node type (see {@link AllowedEdges}), a
 * long id → node type map is stored as well to avoid a full SWHID lookup.
* * @see NodeIdMap * @see NodeTypesMap */ public class SwhGraphProperties { private final String path; private final NodeIdMap nodeIdMap; private final NodeTypesMap nodeTypesMap; private LongBigList authorTimestamp; private ShortBigList authorTimestampOffset; private LongBigList committerTimestamp; private ShortBigList committerTimestampOffset; private LongBigList contentLength; private LongArrayBitVector contentIsSkipped; private IntBigList authorId; private IntBigList committerId; private ByteBigList messageBuffer; private LongBigList messageOffsets; private ByteBigList tagNameBuffer; private LongBigList tagNameOffsets; private MappedFrontCodedStringBigList edgeLabelNames; protected SwhGraphProperties(String path, NodeIdMap nodeIdMap, NodeTypesMap nodeTypesMap) { this.path = path; this.nodeIdMap = nodeIdMap; this.nodeTypesMap = nodeTypesMap; } public static SwhGraphProperties load(String path) throws IOException { return new SwhGraphProperties(path, new NodeIdMap(path), new NodeTypesMap(path)); } /** * Cleans up resources after use. */ public void close() throws IOException { edgeLabelNames.close(); } /** Return the basename of the compressed graph */ public String getPath() { return path; } /** * Converts {@link SWHID} node to long. * * @param swhid node specified as a {@link SWHID} * @return internal long node id * @see SWHID */ public long getNodeId(SWHID swhid) { return nodeIdMap.getNodeId(swhid); } /** * Converts long id node to {@link SWHID}. * * @param nodeId node specified as a long id * @return external SWHID * @see SWHID */ public SWHID getSWHID(long nodeId) { return nodeIdMap.getSWHID(nodeId); } /** * Returns node type. * * @param nodeId node specified as a long id * @return corresponding node type - * @see Node.Type + * @see SwhType */ - public Node.Type getNodeType(long nodeId) { + public SwhType getNodeType(long nodeId) { return nodeTypesMap.getType(nodeId); } private static LongBigList loadMappedLongs(String path) throws IOException { try (RandomAccessFile raf = new RandomAccessFile(path, "r")) { return LongMappedBigList.map(raf.getChannel()); } } private static IntBigList loadMappedInts(String path) throws IOException { try (RandomAccessFile raf = new RandomAccessFile(path, "r")) { return IntMappedBigList.map(raf.getChannel()); } } private static ShortBigList loadMappedShorts(String path) throws IOException { try (RandomAccessFile raf = new RandomAccessFile(path, "r")) { return ShortMappedBigList.map(raf.getChannel()); } } private static ByteBigList loadMappedBytes(String path) throws IOException { try (RandomAccessFile raf = new RandomAccessFile(path, "r")) { return ByteMappedBigList.map(raf.getChannel()); } } private static LongBigList loadEFLongs(String path) throws IOException { try { return (EliasFanoLongBigList) BinIO.loadObject(path); } catch (ClassNotFoundException e) { throw new IOException(e); } } private static byte[] getLine(ByteBigList byteArray, long start) { long end = start; while (end < byteArray.size64() && byteArray.getByte(end) != '\n') { end++; } int length = (int) (end - start); byte[] buffer = new byte[length]; byteArray.getElements(start, buffer, 0, length); return buffer; } /** Load the sizes of the content nodes */ public void loadContentLength() throws IOException { contentLength = loadMappedLongs(path + ".property.content.length.bin"); } /** Get the size (in bytes) of the given content node */ public Long getContentLength(long nodeId) { if (contentLength == null) { throw new IllegalStateException("Content lengths not loaded"); } long res = 
contentLength.getLong(nodeId); return (res >= 0) ? res : null; } /** Load the IDs of the persons (authors and committers) */ public void loadPersonIds() throws IOException { authorId = loadMappedInts(path + ".property.author_id.bin"); committerId = loadMappedInts(path + ".property.committer_id.bin"); } /** Get a unique integer ID representing the author of the given revision or release node */ public Long getAuthorId(long nodeId) { if (authorId == null) { throw new IllegalStateException("Author IDs not loaded"); } long res = authorId.getInt(nodeId); return (res >= 0) ? res : null; } /** Get a unique integer ID representing the committer of the given revision node */ public Long getCommitterId(long nodeId) { if (committerId == null) { throw new IllegalStateException("Committer IDs not loaded"); } long res = committerId.getInt(nodeId); return (res >= 0) ? res : null; } /** * Loads a boolean array indicating whether the given content node was skipped during archive * ingestion */ public void loadContentIsSkipped() throws IOException { try { contentIsSkipped = (LongArrayBitVector) BinIO.loadObject(path + ".property.content.is_skipped.bin"); } catch (ClassNotFoundException e) { throw new IOException(e); } } /** Returns whether the given content node was skipped during archive ingestion */ public boolean isContentSkipped(long nodeId) { if (contentIsSkipped == null) { throw new IllegalStateException("Skipped content array not loaded"); } return contentIsSkipped.getBoolean(nodeId); } /** Load the timestamps at which the releases and revisions were authored */ public void loadAuthorTimestamps() throws IOException { authorTimestamp = loadMappedLongs(path + ".property.author_timestamp.bin"); authorTimestampOffset = loadMappedShorts(path + ".property.author_timestamp_offset.bin"); } /** Return the timestamp at which the given revision or release was authored */ public Long getAuthorTimestamp(long nodeId) { if (authorTimestamp == null) { throw new IllegalStateException("Author timestamps not loaded"); } long res = authorTimestamp.getLong(nodeId); return (res > Long.MIN_VALUE) ? res : null; } /** Return the timestamp offset at which the given revision or release was authored */ public Short getAuthorTimestampOffset(long nodeId) { if (authorTimestampOffset == null) { throw new IllegalStateException("Author timestamp offsets not loaded"); } short res = authorTimestampOffset.getShort(nodeId); return (res > Short.MIN_VALUE) ? res : null; } /** Load the timestamps at which the releases and revisions were committed */ public void loadCommitterTimestamps() throws IOException { committerTimestamp = loadMappedLongs(path + ".property.committer_timestamp.bin"); committerTimestampOffset = loadMappedShorts(path + ".property.committer_timestamp_offset.bin"); } /** Return the timestamp at which the given revision was committed */ public Long getCommitterTimestamp(long nodeId) { if (committerTimestamp == null) { throw new IllegalStateException("Committer timestamps not loaded"); } long res = committerTimestamp.getLong(nodeId); return (res > Long.MIN_VALUE) ? res : null; } /** Return the timestamp offset at which the given revision was committed */ public Short getCommitterTimestampOffset(long nodeId) { if (committerTimestampOffset == null) { throw new IllegalStateException("Committer timestamp offsets not loaded"); } short res = committerTimestampOffset.getShort(nodeId); return (res > Short.MIN_VALUE) ? 
res : null; } /** Load the revision messages, the release messages and the origin URLs */ public void loadMessages() throws IOException { messageBuffer = loadMappedBytes(path + ".property.message.bin"); messageOffsets = loadMappedLongs(path + ".property.message.offset.bin"); } /** Get the message of the given revision or release node */ public byte[] getMessage(long nodeId) { if (messageBuffer == null || messageOffsets == null) { throw new IllegalStateException("Messages not loaded"); } long startOffset = messageOffsets.getLong(nodeId); if (startOffset == -1) { return null; } return Base64.getDecoder().decode(getLine(messageBuffer, startOffset)); } /** Get the URL of the given origin node */ public String getUrl(long nodeId) { byte[] url = getMessage(nodeId); return (url != null) ? new String(url) : null; } /** Load the release names */ public void loadTagNames() throws IOException { tagNameBuffer = loadMappedBytes(path + ".property.tag_name.bin"); tagNameOffsets = loadMappedLongs(path + ".property.tag_name.offset.bin"); } /** Get the name of the given release node */ public byte[] getTagName(long nodeId) { if (tagNameBuffer == null || tagNameOffsets == null) { throw new IllegalStateException("Tag names not loaded"); } long startOffset = tagNameOffsets.getLong(nodeId); if (startOffset == -1) { return null; } return Base64.getDecoder().decode(getLine(tagNameBuffer, startOffset)); } /** Load the arc label names (directory entry names and snapshot branch names) */ public void loadLabelNames() throws IOException { try { edgeLabelNames = MappedFrontCodedStringBigList.load(path + ".labels.fcl"); } catch (ConfigurationException e) { throw new IOException(e); } } /** * Get the arc label name (either a directory entry name or snapshot branch name) associated with * the given label ID */ public byte[] getLabelName(long labelId) { if (edgeLabelNames == null) { throw new IllegalStateException("Label names not loaded"); } return Base64.getDecoder().decode(edgeLabelNames.getArray(labelId)); } } diff --git a/java/src/main/java/org/softwareheritage/graph/SwhType.java b/java/src/main/java/org/softwareheritage/graph/SwhType.java new file mode 100644 index 0000000..5578837 --- /dev/null +++ b/java/src/main/java/org/softwareheritage/graph/SwhType.java @@ -0,0 +1,140 @@ +/* + * Copyright (c) 2022 The Software Heritage developers + * See the AUTHORS file at the top-level directory of this distribution + * License: GNU General Public License version 3, or any later version + * See top-level LICENSE file for more information + */ + +package org.softwareheritage.graph; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +/** + * Software Heritage graph node types, as described in the + * data model. + */ +public enum SwhType { + /** Content node */ + CNT, + /** Directory node */ + DIR, + /** Origin node */ + ORI, + /** Release node */ + REL, + /** Revision node */ + REV, + /** Snapshot node */ + SNP; + + /** + * Converts integer to corresponding SWH node type. 
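+ * For example, {@code fromInt(0)} returns {@link SwhType#CNT} and {@code fromInt(5)} returns
+ * {@link SwhType#SNP}; any other value yields {@code null}.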
+ * + * @param intType node type represented as an integer + * @return the corresponding {@link SwhType} value + * @see SwhType + */ + public static SwhType fromInt(int intType) { + switch (intType) { + case 0: + return CNT; + case 1: + return DIR; + case 2: + return ORI; + case 3: + return REL; + case 4: + return REV; + case 5: + return SNP; + } + return null; + } + + /** + * Converts node types to the corresponding int value + * + * @param type node type as an enum + * @return the corresponding int value + */ + public static int toInt(SwhType type) { + switch (type) { + case CNT: + return 0; + case DIR: + return 1; + case ORI: + return 2; + case REL: + return 3; + case REV: + return 4; + case SNP: + return 5; + } + throw new IllegalArgumentException("Unknown node type: " + type); + } + + /** + * Converts string to corresponding SWH node type. + * + * @param strType node type represented as a string + * @return the corresponding {@link SwhType} value + * @see SwhType + */ + public static SwhType fromStr(String strType) { + if (!strType.matches("cnt|dir|ori|rel|rev|snp")) { + throw new IllegalArgumentException("Unknown node type: " + strType); + } + return SwhType.valueOf(strType.toUpperCase()); + } + + /** + * Converts byte array name to the int code of the corresponding SWH node type. Used for + * performance-critical deserialization. + * + * @param name node type represented as a byte array (e.g. b"cnt") + * @return the ordinal value of the corresponding {@link SwhType} + * @see SwhType + */ + public static int byteNameToInt(byte[] name) { + if (Arrays.equals(name, "cnt".getBytes())) { + return 0; + } else if (Arrays.equals(name, "dir".getBytes())) { + return 1; + } else if (Arrays.equals(name, "ori".getBytes())) { + return 2; + } else if (Arrays.equals(name, "rel".getBytes())) { + return 3; + } else if (Arrays.equals(name, "rev".getBytes())) { + return 4; + } else if (Arrays.equals(name, "snp".getBytes())) { + return 5; + } else + return -1; + } + + /** + * Parses SWH node type possible values from formatted string (see the + * API syntax). 
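+ * <p>
+ * For example, {@code parse("*")} returns all six node types, while {@code parse("rev")} returns
+ * a singleton list containing {@link SwhType#REV}.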
+ * + * @param strFmtType node types represented as a formatted string + * @return a list containing the {@link SwhType} values + * @see SwhType + */ + public static ArrayList parse(String strFmtType) { + ArrayList types = new ArrayList<>(); + + if (strFmtType.equals("*")) { + List nodeTypes = Arrays.asList(SwhType.values()); + types.addAll(nodeTypes); + } else { + types.add(SwhType.fromStr(strFmtType)); + } + + return types; + } +} diff --git a/java/src/main/java/org/softwareheritage/graph/compress/ExtractNodes.java b/java/src/main/java/org/softwareheritage/graph/compress/ExtractNodes.java index 9d58fff..9e2ad40 100644 --- a/java/src/main/java/org/softwareheritage/graph/compress/ExtractNodes.java +++ b/java/src/main/java/org/softwareheritage/graph/compress/ExtractNodes.java @@ -1,411 +1,411 @@ /* * Copyright (c) 2022 The Software Heritage developers * See the AUTHORS file at the top-level directory of this distribution * License: GNU General Public License version 3, or any later version * See top-level LICENSE file for more information */ package org.softwareheritage.graph.compress; import com.github.luben.zstd.ZstdOutputStream; import com.martiansoftware.jsap.*; import it.unimi.dsi.logging.ProgressLogger; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import org.softwareheritage.graph.Node; +import org.softwareheritage.graph.SwhType; import org.softwareheritage.graph.utils.Sort; import java.io.*; import java.nio.charset.StandardCharsets; import java.util.*; import java.util.concurrent.ExecutionException; import java.util.concurrent.ForkJoinPool; import java.util.concurrent.TimeUnit; import java.util.concurrent.atomic.AtomicInteger; import java.util.concurrent.atomic.AtomicLong; import java.util.concurrent.atomic.AtomicLongArray; /** * Read a graph dataset and extract all the unique node SWHIDs it contains, including the ones that * are not stored as actual objects in the graph, but only referred to by the edges. * Additionally, extract the set of all unique edge labels in the graph. * *
 * <ul>
 *   <li>The set of nodes is written in <code>${outputBasename}.nodes.csv.zst</code>, as a
 *   zst-compressed sorted list of SWHIDs, one per line.</li>
 *   <li>The set of edge labels is written in <code>${outputBasename}.labels.csv.zst</code>, as a
 *   zst-compressed sorted list of labels encoded in base64, one per line.</li>
 *   <li>The number of unique nodes referred to in the graph is written in a text file,
 *   <code>${outputBasename}.nodes.count.txt</code></li>
 *   <li>The number of unique edges referred to in the graph is written in a text file,
 *   <code>${outputBasename}.edges.count.txt</code></li>
 *   <li>The number of unique edge labels is written in a text file,
 *   <code>${outputBasename}.labels.count.txt</code></li>
 *   <li>Statistics on the number of nodes of each type are written in a text file,
 *   <code>${outputBasename}.nodes.stats.txt</code></li>
 *   <li>Statistics on the number of edges of each type are written in a text file,
 *   <code>${outputBasename}.edges.stats.txt</code></li>
 * </ul>
 * <p>
 * Rationale: Because the graph can contain holes, loose objects and dangling objects, some nodes
 * that are referred to as destinations in the edge relationships might not actually be stored in
 * the graph itself. However, to compress the graph using a graph compression library, it is
 * necessary to have a list of all the nodes in the graph, including the ones that are simply
 * referred to by the edges but not actually stored as concrete objects.
 * <p>
 * This class reads the entire graph dataset, and uses <code>sort -u</code> to extract the set of
 * all the unique nodes and unique labels that will be needed as an input for the compression
 * process.
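 * <p>
 * A sketch of a typical invocation (dataset path and output basename are illustrative only):
 *
 * <pre>{@code
 * java org.softwareheritage.graph.compress.ExtractNodes -f orc /srv/dataset/orc out/graph
 * }</pre>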
*/ public class ExtractNodes { private final static Logger logger = LoggerFactory.getLogger(ExtractNodes.class); // Create one thread per processor. final static int numThreads = Runtime.getRuntime().availableProcessors(); // Allocate up to 20% of maximum memory for sorting subprocesses. final static long sortBufferSize = (long) (Runtime.getRuntime().maxMemory() * 0.2 / numThreads / 2); private static JSAPResult parseArgs(String[] args) { JSAPResult config = null; try { SimpleJSAP jsap = new SimpleJSAP(ComposePermutations.class.getName(), "", new Parameter[]{ new UnflaggedOption("dataset", JSAP.STRING_PARSER, JSAP.REQUIRED, "Path to the edges dataset"), new UnflaggedOption("outputBasename", JSAP.STRING_PARSER, JSAP.REQUIRED, "Basename of the output files"), new FlaggedOption("format", JSAP.STRING_PARSER, "orc", JSAP.NOT_REQUIRED, 'f', "format", "Format of the input dataset (orc, csv)"), new FlaggedOption("sortBufferSize", JSAP.STRING_PARSER, String.valueOf(sortBufferSize) + "b", JSAP.NOT_REQUIRED, 'S', "sort-buffer-size", "Size of the memory buffer used by each sort process"), new FlaggedOption("sortTmpDir", JSAP.STRING_PARSER, null, JSAP.NOT_REQUIRED, 'T', "temp-dir", "Path to the temporary directory used by sort")}); config = jsap.parse(args); if (jsap.messagePrinted()) { System.exit(1); } } catch (JSAPException e) { System.err.println("Usage error: " + e.getMessage()); System.exit(1); } return config; } public static void main(String[] args) throws IOException, InterruptedException { JSAPResult parsedArgs = parseArgs(args); String datasetPath = parsedArgs.getString("dataset"); String outputBasename = parsedArgs.getString("outputBasename"); String datasetFormat = parsedArgs.getString("format"); String sortBufferSize = parsedArgs.getString("sortBufferSize"); String sortTmpPath = parsedArgs.getString("sortTmpDir", null); File sortTmpDir = new File(sortTmpPath); sortTmpDir.mkdirs(); // Open edge dataset GraphDataset dataset; if (datasetFormat.equals("orc")) { dataset = new ORCGraphDataset(datasetPath); } else if (datasetFormat.equals("csv")) { dataset = new CSVEdgeDataset(datasetPath); } else { throw new IllegalArgumentException("Unknown dataset format: " + datasetFormat); } extractNodes(dataset, outputBasename, sortBufferSize, sortTmpDir); } public static void extractNodes(GraphDataset dataset, String outputBasename, String sortBufferSize, File sortTmpDir) throws IOException, InterruptedException { // Read the dataset and write the nodes and labels to the sorting processes AtomicLong edgeCount = new AtomicLong(0); - AtomicLongArray edgeCountByType = new AtomicLongArray(Node.Type.values().length * Node.Type.values().length); + AtomicLongArray edgeCountByType = new AtomicLongArray(SwhType.values().length * SwhType.values().length); int numThreads = Runtime.getRuntime().availableProcessors(); ForkJoinPool forkJoinPool = new ForkJoinPool(numThreads); Process[] nodeSorters = new Process[numThreads]; File[] nodeBatchPaths = new File[numThreads]; Process[] labelSorters = new Process[numThreads]; File[] labelBatches = new File[numThreads]; long[] progressCounts = new long[numThreads]; AtomicInteger nextThreadId = new AtomicInteger(0); ThreadLocal threadLocalId = ThreadLocal.withInitial(nextThreadId::getAndIncrement); ProgressLogger pl = new ProgressLogger(logger, 10, TimeUnit.SECONDS); pl.itemsName = "edges"; pl.start("Reading node/edge files and writing sorted batches."); GraphDataset.NodeCallback nodeCallback = (node) -> { int threadId = threadLocalId.get(); if (nodeSorters[threadId] == null) { 
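// First node seen by this worker thread: lazily create its batch file and spawn a
// dedicated external sort process writing into it.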
nodeBatchPaths[threadId] = File.createTempFile("nodes", ".txt", sortTmpDir); nodeSorters[threadId] = Sort.spawnSort(sortBufferSize, sortTmpDir.getPath(), List.of("-o", nodeBatchPaths[threadId].getPath())); } OutputStream nodeOutputStream = nodeSorters[threadId].getOutputStream(); nodeOutputStream.write(node); nodeOutputStream.write('\n'); }; GraphDataset.NodeCallback labelCallback = (label) -> { int threadId = threadLocalId.get(); if (labelSorters[threadId] == null) { labelBatches[threadId] = File.createTempFile("labels", ".txt", sortTmpDir); labelSorters[threadId] = Sort.spawnSort(sortBufferSize, sortTmpDir.getPath(), List.of("-o", labelBatches[threadId].getPath())); } OutputStream labelOutputStream = labelSorters[threadId].getOutputStream(); labelOutputStream.write(label); labelOutputStream.write('\n'); }; try { forkJoinPool.submit(() -> { try { dataset.readEdges((node) -> { nodeCallback.onNode(node); }, (src, dst, label, perm) -> { nodeCallback.onNode(src); nodeCallback.onNode(dst); if (label != null) { labelCallback.onNode(label); } edgeCount.incrementAndGet(); // Extract type of src and dst from their SWHID: swh:1:XXX byte[] srcTypeBytes = Arrays.copyOfRange(src, 6, 6 + 3); byte[] dstTypeBytes = Arrays.copyOfRange(dst, 6, 6 + 3); - int srcType = Node.Type.byteNameToInt(srcTypeBytes); - int dstType = Node.Type.byteNameToInt(dstTypeBytes); + int srcType = SwhType.byteNameToInt(srcTypeBytes); + int dstType = SwhType.byteNameToInt(dstTypeBytes); if (srcType != -1 && dstType != -1) { - edgeCountByType.incrementAndGet(srcType * Node.Type.values().length + dstType); + edgeCountByType.incrementAndGet(srcType * SwhType.values().length + dstType); } else { System.err.println("Invalid edge type: " + new String(srcTypeBytes) + " -> " + new String(dstTypeBytes)); System.exit(1); } int threadId = threadLocalId.get(); if (++progressCounts[threadId] > 1000) { synchronized (pl) { pl.update(progressCounts[threadId]); } progressCounts[threadId] = 0; } }); } catch (IOException e) { throw new RuntimeException(e); } }).get(); } catch (ExecutionException e) { throw new RuntimeException(e); } // Close all the sorters stdin for (int i = 0; i < numThreads; i++) { if (nodeSorters[i] != null) { nodeSorters[i].getOutputStream().close(); } if (labelSorters[i] != null) { labelSorters[i].getOutputStream().close(); } } // Wait for sorting processes to finish for (int i = 0; i < numThreads; i++) { if (nodeSorters[i] != null) { nodeSorters[i].waitFor(); } if (labelSorters[i] != null) { labelSorters[i].waitFor(); } } pl.done(); ArrayList nodeSortMergerOptions = new ArrayList<>(List.of("-m")); ArrayList labelSortMergerOptions = new ArrayList<>(List.of("-m")); for (int i = 0; i < numThreads; i++) { if (nodeBatchPaths[i] != null) { nodeSortMergerOptions.add(nodeBatchPaths[i].getPath()); } if (labelBatches[i] != null) { labelSortMergerOptions.add(labelBatches[i].getPath()); } } // Spawn node merge-sorting process Process nodeSortMerger = Sort.spawnSort(sortBufferSize, sortTmpDir.getPath(), nodeSortMergerOptions); nodeSortMerger.getOutputStream().close(); OutputStream nodesFileOutputStream = new ZstdOutputStream( new BufferedOutputStream(new FileOutputStream(outputBasename + ".nodes.csv.zst"))); NodesOutputThread nodesOutputThread = new NodesOutputThread( new BufferedInputStream(nodeSortMerger.getInputStream()), nodesFileOutputStream); nodesOutputThread.start(); // Spawn label merge-sorting process Process labelSortMerger = Sort.spawnSort(sortBufferSize, sortTmpDir.getPath(), labelSortMergerOptions); 
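// "sort -m" merges the already-sorted batch files given as arguments, so the merger reads
// nothing from stdin; close it immediately.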
labelSortMerger.getOutputStream().close(); OutputStream labelsFileOutputStream = new ZstdOutputStream( new BufferedOutputStream(new FileOutputStream(outputBasename + ".labels.csv.zst"))); LabelsOutputThread labelsOutputThread = new LabelsOutputThread( new BufferedInputStream(labelSortMerger.getInputStream()), labelsFileOutputStream); labelsOutputThread.start(); pl.logger().info("Waiting for merge-sort and writing output files..."); nodeSortMerger.waitFor(); labelSortMerger.waitFor(); nodesOutputThread.join(); labelsOutputThread.join(); - long[][] edgeCountByTypeArray = new long[Node.Type.values().length][Node.Type.values().length]; + long[][] edgeCountByTypeArray = new long[SwhType.values().length][SwhType.values().length]; for (int i = 0; i < edgeCountByTypeArray.length; i++) { for (int j = 0; j < edgeCountByTypeArray[i].length; j++) { - edgeCountByTypeArray[i][j] = edgeCountByType.get(i * Node.Type.values().length + j); + edgeCountByTypeArray[i][j] = edgeCountByType.get(i * SwhType.values().length + j); } } // Write node, edge and label counts/statistics printEdgeCounts(outputBasename, edgeCount.get(), edgeCountByTypeArray); printNodeCounts(outputBasename, nodesOutputThread.getNodeCount(), nodesOutputThread.getNodeTypeCounts()); printLabelCounts(outputBasename, labelsOutputThread.getLabelCount()); // Clean up sorted batches for (int i = 0; i < numThreads; i++) { if (nodeBatchPaths[i] != null) { nodeBatchPaths[i].delete(); } if (labelBatches[i] != null) { labelBatches[i].delete(); } } } private static void printEdgeCounts(String basename, long edgeCount, long[][] edgeTypeCounts) throws IOException { PrintWriter nodeCountWriter = new PrintWriter(basename + ".edges.count.txt"); nodeCountWriter.println(edgeCount); nodeCountWriter.close(); PrintWriter nodeTypesCountWriter = new PrintWriter(basename + ".edges.stats.txt"); TreeMap edgeTypeCountsMap = new TreeMap<>(); - for (Node.Type src : Node.Type.values()) { - for (Node.Type dst : Node.Type.values()) { - long cnt = edgeTypeCounts[Node.Type.toInt(src)][Node.Type.toInt(dst)]; + for (SwhType src : SwhType.values()) { + for (SwhType dst : SwhType.values()) { + long cnt = edgeTypeCounts[SwhType.toInt(src)][SwhType.toInt(dst)]; if (cnt > 0) edgeTypeCountsMap.put(src.toString().toLowerCase() + ":" + dst.toString().toLowerCase(), cnt); } } for (Map.Entry entry : edgeTypeCountsMap.entrySet()) { nodeTypesCountWriter.println(entry.getKey() + " " + entry.getValue()); } nodeTypesCountWriter.close(); } private static void printNodeCounts(String basename, long nodeCount, long[] nodeTypeCounts) throws IOException { PrintWriter nodeCountWriter = new PrintWriter(basename + ".nodes.count.txt"); nodeCountWriter.println(nodeCount); nodeCountWriter.close(); PrintWriter nodeTypesCountWriter = new PrintWriter(basename + ".nodes.stats.txt"); TreeMap nodeTypeCountsMap = new TreeMap<>(); - for (Node.Type v : Node.Type.values()) { - nodeTypeCountsMap.put(v.toString().toLowerCase(), nodeTypeCounts[Node.Type.toInt(v)]); + for (SwhType v : SwhType.values()) { + nodeTypeCountsMap.put(v.toString().toLowerCase(), nodeTypeCounts[SwhType.toInt(v)]); } for (Map.Entry entry : nodeTypeCountsMap.entrySet()) { nodeTypesCountWriter.println(entry.getKey() + " " + entry.getValue()); } nodeTypesCountWriter.close(); } private static void printLabelCounts(String basename, long labelCount) throws IOException { PrintWriter nodeCountWriter = new PrintWriter(basename + ".labels.count.txt"); nodeCountWriter.println(labelCount); nodeCountWriter.close(); } private static class NodesOutputThread 
extends Thread { private final InputStream sortedNodesStream; private final OutputStream nodesOutputStream; private long nodeCount = 0; - private final long[] nodeTypeCounts = new long[Node.Type.values().length]; + private final long[] nodeTypeCounts = new long[SwhType.values().length]; NodesOutputThread(InputStream sortedNodesStream, OutputStream nodesOutputStream) { this.sortedNodesStream = sortedNodesStream; this.nodesOutputStream = nodesOutputStream; } @Override public void run() { BufferedReader reader = new BufferedReader( new InputStreamReader(sortedNodesStream, StandardCharsets.UTF_8)); try { String line; while ((line = reader.readLine()) != null) { nodesOutputStream.write(line.getBytes(StandardCharsets.UTF_8)); nodesOutputStream.write('\n'); nodeCount++; try { - Node.Type nodeType = Node.Type.fromStr(line.split(":")[2]); - nodeTypeCounts[Node.Type.toInt(nodeType)]++; + SwhType nodeType = SwhType.fromStr(line.split(":")[2]); + nodeTypeCounts[SwhType.toInt(nodeType)]++; } catch (ArrayIndexOutOfBoundsException e) { System.err.println("Error parsing SWHID: " + line); System.exit(1); } } nodesOutputStream.close(); } catch (IOException e) { throw new RuntimeException(e); } } public long getNodeCount() { return nodeCount; } public long[] getNodeTypeCounts() { return nodeTypeCounts; } } private static class LabelsOutputThread extends Thread { private final InputStream sortedLabelsStream; private final OutputStream labelsOutputStream; private long labelCount = 0; LabelsOutputThread(InputStream sortedLabelsStream, OutputStream labelsOutputStream) { this.labelsOutputStream = labelsOutputStream; this.sortedLabelsStream = sortedLabelsStream; } @Override public void run() { BufferedReader reader = new BufferedReader( new InputStreamReader(sortedLabelsStream, StandardCharsets.UTF_8)); try { String line; while ((line = reader.readLine()) != null) { labelsOutputStream.write(line.getBytes(StandardCharsets.UTF_8)); labelsOutputStream.write('\n'); labelCount++; } labelsOutputStream.close(); } catch (IOException e) { throw new RuntimeException(e); } } public long getLabelCount() { return labelCount; } } } diff --git a/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java b/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java index 74ef2f3..105a921 100644 --- a/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java +++ b/java/src/main/java/org/softwareheritage/graph/compress/NodeMapBuilder.java @@ -1,201 +1,201 @@ /* * Copyright (c) 2019-2022 The Software Heritage developers * See the AUTHORS file at the top-level directory of this distribution * License: GNU General Public License version 3, or any later version * See top-level LICENSE file for more information */ package org.softwareheritage.graph.compress; import com.github.luben.zstd.ZstdInputStream; import it.unimi.dsi.bits.LongArrayBitVector; import it.unimi.dsi.fastutil.BigArrays; import it.unimi.dsi.fastutil.Size64; import it.unimi.dsi.fastutil.io.BinIO; import it.unimi.dsi.fastutil.longs.LongBigArrays; import it.unimi.dsi.fastutil.longs.LongBigList; import it.unimi.dsi.fastutil.objects.Object2LongFunction; import it.unimi.dsi.io.FastBufferedReader; import it.unimi.dsi.io.LineIterator; import it.unimi.dsi.logging.ProgressLogger; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import org.softwareheritage.graph.Node; import org.softwareheritage.graph.SWHID; +import org.softwareheritage.graph.SwhType; import org.softwareheritage.graph.maps.NodeIdMap; import 
org.softwareheritage.graph.maps.NodeTypesMap; import java.io.*; import java.nio.charset.StandardCharsets; import java.util.Scanner; import java.util.concurrent.TimeUnit; /** * Create maps needed at runtime by the graph service, in particular: *
 * <ul>
 *   <li>WebGraph long node id → SWHID</li>
 *   <li>WebGraph long node id → SWH node type (enum)</li>
 * </ul>
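 * <p>
 * A sketch of the expected invocation (paths are illustrative; stdin must be the zstd-compressed
 * nodes list, such as the one produced by ExtractNodes):
 *
 * <pre>{@code
 * java org.softwareheritage.graph.compress.NodeMapBuilder graph /tmp/swh-sort < graph.nodes.csv.zst
 * }</pre>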
* * @author The Software Heritage developers */ public class NodeMapBuilder { final static String SORT_BUFFER_SIZE = "40%"; final static Logger logger = LoggerFactory.getLogger(NodeMapBuilder.class); /** * Main entrypoint. * * @param args command line arguments */ public static void main(String[] args) throws IOException { if (args.length != 2) { logger.error("Usage: COMPRESSED_GRAPH_BASE_NAME TEMP_DIR < NODES_CSV"); System.exit(1); } String graphPath = args[0]; String tmpDir = args[1]; logger.info("starting maps generation..."); precomputeNodeIdMap(graphPath, tmpDir); logger.info("maps generation completed"); } /** * Computes and dumps on disk mapping files. * * @param graphPath path of the compressed graph */ static void precomputeNodeIdMap(String graphPath, String tmpDir) throws IOException { ProgressLogger plSWHID2Node = new ProgressLogger(logger, 10, TimeUnit.SECONDS); ProgressLogger plNode2SWHID = new ProgressLogger(logger, 10, TimeUnit.SECONDS); plSWHID2Node.itemsName = "nodes"; plNode2SWHID.itemsName = "nodes"; // first half of SWHID->node mapping: SWHID -> WebGraph MPH (long) Object2LongFunction mphMap = NodeIdMap.loadMph(graphPath + ".mph"); long nbIds = (mphMap instanceof Size64) ? ((Size64) mphMap).size64() : mphMap.size(); plSWHID2Node.expectedUpdates = nbIds; plNode2SWHID.expectedUpdates = nbIds; // second half of SWHID->node mapping: WebGraph MPH (long) -> BFS order (long) long[][] bfsMap = LongBigArrays.newBigArray(nbIds); logger.info("loading BFS order file..."); long loaded = BinIO.loadLongs(graphPath + ".order", bfsMap); logger.info("BFS order file loaded"); if (loaded != nbIds) { logger.error("graph contains " + nbIds + " nodes, but read " + loaded); System.exit(2); } /* * Read on stdin a list of SWHIDs, hash them with MPH, then permute them according to the .order * file */ FastBufferedReader buffer = new FastBufferedReader( new InputStreamReader(new ZstdInputStream(new BufferedInputStream(System.in)))); LineIterator swhidIterator = new LineIterator(buffer); /* * The WebGraph node id -> SWHID mapping can be obtained from the SWHID->node one by numerically * sorting on node id and sequentially writing obtained SWHIDs to a binary map. Delegates the * sorting job to /usr/bin/sort via pipes */ ProcessBuilder processBuilder = new ProcessBuilder(); processBuilder.command("sort", "--numeric-sort", "--key", "2", "--buffer-size", SORT_BUFFER_SIZE, "--temporary-directory", tmpDir); Process sort = processBuilder.start(); BufferedOutputStream sort_stdin = new BufferedOutputStream(sort.getOutputStream()); BufferedInputStream sort_stdout = new BufferedInputStream(sort.getInputStream()); // for the binary format of nodeToSwhidMap, see Python module swh.graph.swhid:IntToSwhidMap try (BufferedOutputStream nodeToSwhidMap = new BufferedOutputStream( new FileOutputStream(graphPath + NodeIdMap.NODE_TO_SWHID))) { /* * background handler for sort output, it will be fed SWHID/node pairs, and will itself fill * nodeToSwhidMap as soon as data from sort is ready. */ SortOutputHandler outputHandler = new SortOutputHandler(sort_stdout, nodeToSwhidMap, plNode2SWHID); outputHandler.start(); /* * Type map from WebGraph node ID to SWH type. Used at runtime by pure Java graph traversals to * efficiently check edge restrictions. 
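 * Each type is packed into ceil(log2(#types)) bits; with the 6 SWH node types this is 3 bits
 * per node, so the whole map occupies nbBitsPerNodeType * nbIds bits.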
*/ - final int nbBitsPerNodeType = (int) Math.ceil(Math.log(Node.Type.values().length) / Math.log(2)); + final int nbBitsPerNodeType = (int) Math.ceil(Math.log(SwhType.values().length) / Math.log(2)); LongArrayBitVector nodeTypesBitVector = LongArrayBitVector.ofLength(nbBitsPerNodeType * nbIds); LongBigList nodeTypesMap = nodeTypesBitVector.asLongBigList(nbBitsPerNodeType); plSWHID2Node.start("Hashing SWHIDs to fill sort input"); for (long iNode = 0; iNode < nbIds && swhidIterator.hasNext(); iNode++) { String swhidStr = swhidIterator.next().toString(); SWHID swhid = new SWHID(swhidStr); long mphId = mphMap.getLong(swhidStr.getBytes(StandardCharsets.US_ASCII)); long nodeId = BigArrays.get(bfsMap, mphId); sort_stdin.write((swhidStr + "\t" + nodeId + "\n").getBytes(StandardCharsets.US_ASCII)); nodeTypesMap.set(nodeId, swhid.getType().ordinal()); plSWHID2Node.lightUpdate(); } plSWHID2Node.done(); sort_stdin.close(); // write type map logger.info("storing type map"); BinIO.storeObject(nodeTypesMap, graphPath + NodeTypesMap.NODE_TO_TYPE); logger.info("type map stored"); // wait for nodeToSwhidMap filling try { logger.info("waiting for node2swhid map..."); int sortExitCode = sort.waitFor(); if (sortExitCode != 0) { logger.error("sort returned non-zero exit code: " + sortExitCode); System.exit(2); } outputHandler.join(); } catch (InterruptedException e) { logger.error("processing of sort output failed with: " + e); System.exit(2); } } } private static class SortOutputHandler extends Thread { private final Scanner input; private final OutputStream output; private final ProgressLogger pl; SortOutputHandler(InputStream input, OutputStream output, ProgressLogger pl) { this.input = new Scanner(input, StandardCharsets.US_ASCII); this.output = output; this.pl = pl; } public void run() { boolean sortDone = false; logger.info("node2swhid: waiting for sort output..."); while (input.hasNextLine()) { if (!sortDone) { sortDone = true; this.pl.start("filling node2swhid map"); } String line = input.nextLine(); // format: SWHID NODE_ID SWHID swhid = new SWHID(line.split("\\t")[0]); // get SWHID try { output.write(swhid.toBytes()); } catch (IOException e) { logger.error("writing to node->SWHID map failed with: " + e); } this.pl.lightUpdate(); } this.pl.done(); } } } diff --git a/java/src/main/java/org/softwareheritage/graph/experiments/forks/ForkCC.java b/java/src/main/java/org/softwareheritage/graph/experiments/forks/ForkCC.java index 9352853..446b0e1 100644 --- a/java/src/main/java/org/softwareheritage/graph/experiments/forks/ForkCC.java +++ b/java/src/main/java/org/softwareheritage/graph/experiments/forks/ForkCC.java @@ -1,256 +1,256 @@ /* * Copyright (c) 2019 The Software Heritage developers * See the AUTHORS file at the top-level directory of this distribution * License: GNU General Public License version 3, or any later version * See top-level LICENSE file for more information */ package org.softwareheritage.graph.experiments.forks; import com.google.common.primitives.Longs; import com.martiansoftware.jsap.*; import it.unimi.dsi.big.webgraph.LazyLongIterator; import it.unimi.dsi.bits.LongArrayBitVector; import it.unimi.dsi.fastutil.Arrays; import it.unimi.dsi.io.ByteDiskQueue; import it.unimi.dsi.logging.ProgressLogger; import org.softwareheritage.graph.SwhBidirectionalGraph; -import org.softwareheritage.graph.Node; +import org.softwareheritage.graph.SwhType; import java.io.File; import java.io.FileNotFoundException; import java.io.IOException; import java.util.*; public class ForkCC { public Boolean 
includeRootDir; private SwhBidirectionalGraph graph; private Long emptySnapshot; private LongArrayBitVector visited; private LongArrayBitVector whitelist; private static JSAPResult parse_args(String[] args) { JSAPResult config = null; try { SimpleJSAP jsap = new SimpleJSAP(ForkCC.class.getName(), "", new Parameter[]{ new FlaggedOption("graphPath", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'g', "graph", "Basename of the compressed graph"), new FlaggedOption("whitelistPath", JSAP.STRING_PARSER, null, JSAP.NOT_REQUIRED, 't', "whitelist", "Whitelist of origins"), new FlaggedOption("includeRootDir", JSAP.BOOLEAN_PARSER, "false", JSAP.NOT_REQUIRED, 'R', "includerootdir", "Include root directory (default: false)"), new FlaggedOption("outdir", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'o', "outdir", "Directory where to put the results"),}); config = jsap.parse(args); if (jsap.messagePrinted()) { System.exit(1); } } catch (JSAPException e) { e.printStackTrace(); } return config; } private static void printDistribution(ArrayList> components) { TreeMap distribution = new TreeMap<>(); for (ArrayList component : components) { distribution.merge((long) component.size(), 1L, Long::sum); } for (Map.Entry entry : distribution.entrySet()) { System.out.format("%d %d\n", entry.getKey(), entry.getValue()); } } private static void printLargestComponent(ArrayList> components) { int indexLargest = 0; for (int i = 1; i < components.size(); ++i) { if (components.get(i).size() > components.get(indexLargest).size()) indexLargest = i; } ArrayList component = components.get(indexLargest); for (Long node : component) { System.out.println(node); } } private void load_graph(String graphBasename) throws IOException { System.err.println("Loading graph " + graphBasename + " ..."); this.graph = SwhBidirectionalGraph.loadMapped(graphBasename).symmetrize(); System.err.println("Graph loaded."); this.emptySnapshot = null; this.whitelist = null; this.visited = null; this.includeRootDir = null; } private boolean nodeIsEmptySnapshot(Long node) { - if (this.emptySnapshot == null && this.graph.getNodeType(node) == Node.Type.SNP + if (this.emptySnapshot == null && this.graph.getNodeType(node) == SwhType.SNP && this.graph.outdegree(node) == 0) { System.err.println("Found empty snapshot: " + node); this.emptySnapshot = node; } return node.equals(this.emptySnapshot); } private Boolean shouldVisit(Long node) { - Node.Type nt = this.graph.getNodeType(node); - if (nt == Node.Type.CNT) { + SwhType nt = this.graph.getNodeType(node); + if (nt == SwhType.CNT) { return false; } - if (nt == Node.Type.DIR && !includeRootDir) + if (nt == SwhType.DIR && !includeRootDir) return false; if (this.nodeIsEmptySnapshot(node)) return false; if (visited.getBoolean(node)) return false; return true; } private ArrayList> compute(ProgressLogger pl) throws IOException { final long n = graph.numNodes(); // Allow enough memory to behave like in-memory queue int bufferSize = (int) Math.min(Arrays.MAX_ARRAY_SIZE & ~0x7, 8L * n); // Use a disk based queue to store BFS frontier final File queueFile = File.createTempFile(ForkCC.class.getSimpleName(), "queue"); final ByteDiskQueue queue = ByteDiskQueue.createNew(queueFile, bufferSize, true); final byte[] byteBuf = new byte[Long.BYTES]; // WARNING: no 64-bit version of this data-structure, but it can support // indices up to 2^37 visited = LongArrayBitVector.ofLength(n); pl.expectedUpdates = n; pl.itemsName = "nodes"; pl.start("Starting connected components visit..."); ArrayList> components = new 
ArrayList<>(); for (long i = 0; i < n; i++) { - if (!shouldVisit(i) || this.graph.getNodeType(i) == Node.Type.DIR) + if (!shouldVisit(i) || this.graph.getNodeType(i) == SwhType.DIR) continue; ArrayList component = new ArrayList<>(); queue.enqueue(Longs.toByteArray(i)); visited.set(i); while (!queue.isEmpty()) { queue.dequeue(byteBuf); final long currentNode = Longs.fromByteArray(byteBuf); - Node.Type cur_nt = this.graph.getNodeType(currentNode); - if (cur_nt == Node.Type.ORI && (this.whitelist == null || this.whitelist.getBoolean(currentNode))) { + SwhType cur_nt = this.graph.getNodeType(currentNode); + if (cur_nt == SwhType.ORI && (this.whitelist == null || this.whitelist.getBoolean(currentNode))) { // TODO: add a check that the origin has >=1 non-empty snapshot component.add(currentNode); } final LazyLongIterator iterator = graph.successors(currentNode); long succ; while ((succ = iterator.nextLong()) != -1) { if (!shouldVisit(succ)) continue; - if (this.graph.getNodeType(succ) == Node.Type.DIR && cur_nt != Node.Type.REV) + if (this.graph.getNodeType(succ) == SwhType.DIR && cur_nt != SwhType.REV) continue; visited.set(succ); queue.enqueue(Longs.toByteArray(succ)); } pl.update(); } if (component.size() > 0) { components.add(component); } } pl.done(); queue.close(); return components; } private static void printDistribution(ArrayList> components, Formatter out) { TreeMap distribution = new TreeMap<>(); for (ArrayList component : components) { distribution.merge((long) component.size(), 1L, Long::sum); } for (Map.Entry entry : distribution.entrySet()) { out.format("%d %d\n", entry.getKey(), entry.getValue()); } } private static void printLargestComponent(ArrayList> components, Formatter out) { int indexLargest = 0; for (int i = 1; i < components.size(); ++i) { if (components.get(i).size() > components.get(indexLargest).size()) indexLargest = i; } ArrayList component = components.get(indexLargest); for (Long node : component) { out.format("%d\n", node); } } private static void printAllComponents(ArrayList> components, Formatter out) { for (int i = 1; i < components.size(); ++i) { ArrayList component = components.get(i); for (Long node : component) { out.format("%d ", node); } out.format("\n"); } } private void parseWhitelist(String path) { System.err.println("Loading whitelist " + path + " ..."); this.whitelist = LongArrayBitVector.ofLength(this.graph.numNodes()); Scanner scanner; try { scanner = new Scanner(new File(path)); while (scanner.hasNextLong()) { whitelist.set(scanner.nextLong()); } System.err.println("Whitelist loaded."); } catch (FileNotFoundException e) { e.printStackTrace(); } } public static void main(String[] args) { JSAPResult config = parse_args(args); String graphPath = config.getString("graphPath"); String whitelistPath = config.getString("whitelistPath"); boolean includeRootDir = config.getBoolean("includeRootDir"); String outdirPath = config.getString("outdir"); ForkCC forkCc = new ForkCC(); try { forkCc.load_graph(graphPath); forkCc.includeRootDir = includeRootDir; } catch (IOException e) { System.out.println("Could not load graph: " + e); System.exit(2); } if (whitelistPath != null) { forkCc.parseWhitelist(whitelistPath); } ProgressLogger logger = new ProgressLogger(); // noinspection ResultOfMethodCallIgnored new File(outdirPath).mkdirs(); try { ArrayList> components = forkCc.compute(logger); printDistribution(components, new Formatter(outdirPath + "/distribution.txt")); printLargestComponent(components, new Formatter(outdirPath + "/largest_clique.txt")); 
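// Note: the output file names say "clique", but ForkCC actually emits connected components.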
printAllComponents(components, new Formatter(outdirPath + "/all_cliques.txt")); } catch (IOException e) { e.printStackTrace(); } logger.done(); } } diff --git a/java/src/main/java/org/softwareheritage/graph/experiments/forks/ForkCliques.java b/java/src/main/java/org/softwareheritage/graph/experiments/forks/ForkCliques.java index 361cce8..746d51e 100644 --- a/java/src/main/java/org/softwareheritage/graph/experiments/forks/ForkCliques.java +++ b/java/src/main/java/org/softwareheritage/graph/experiments/forks/ForkCliques.java @@ -1,230 +1,230 @@ /* * Copyright (c) 2020 The Software Heritage developers * See the AUTHORS file at the top-level directory of this distribution * License: GNU General Public License version 3, or any later version * See top-level LICENSE file for more information */ package org.softwareheritage.graph.experiments.forks; import ch.qos.logback.classic.Level; import ch.qos.logback.classic.Logger; import com.google.common.primitives.Longs; import com.martiansoftware.jsap.*; import it.unimi.dsi.big.webgraph.LazyLongIterator; import it.unimi.dsi.bits.LongArrayBitVector; import it.unimi.dsi.logging.ProgressLogger; import org.slf4j.LoggerFactory; import org.softwareheritage.graph.SwhBidirectionalGraph; -import org.softwareheritage.graph.Node; +import org.softwareheritage.graph.SwhType; import java.io.File; import java.io.FileNotFoundException; import java.io.IOException; import java.security.MessageDigest; import java.security.NoSuchAlgorithmException; import java.util.*; public class ForkCliques { private SwhBidirectionalGraph graph; private LongArrayBitVector whitelist; private void load_graph(String graphBasename) throws IOException { System.err.println("Loading graph " + graphBasename + " ..."); this.graph = SwhBidirectionalGraph.loadMapped(graphBasename); System.err.println("Graph loaded."); this.whitelist = null; } private static JSAPResult parse_args(String[] args) { JSAPResult config = null; try { SimpleJSAP jsap = new SimpleJSAP(ForkCliques.class.getName(), "", new Parameter[]{ new FlaggedOption("graphPath", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'g', "graph", "Basename of the compressed graph"), new FlaggedOption("whitelistPath", JSAP.STRING_PARSER, null, JSAP.NOT_REQUIRED, 't', "whitelist", "Whitelist of origins"), new FlaggedOption("outdir", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'o', "outdir", "Directory where to put the results"),}); config = jsap.parse(args); if (jsap.messagePrinted()) { System.exit(1); } } catch (JSAPException e) { e.printStackTrace(); } return config; } private ArrayList dfsAt(Long baseNode) { ArrayList res = new ArrayList<>(); final Deque stack = new ArrayDeque<>(); HashSet seen = new HashSet<>(); stack.push(baseNode); while (!stack.isEmpty()) { final Long currentNode = stack.pop(); final LazyLongIterator iterator = this.graph.predecessors(currentNode); long succ; while ((succ = iterator.nextLong()) != -1) { if (!seen.contains(succ)) { - Node.Type nt = this.graph.getNodeType(succ); - if (nt == Node.Type.DIR || nt == Node.Type.CNT) + SwhType nt = this.graph.getNodeType(succ); + if (nt == SwhType.DIR || nt == SwhType.CNT) continue; - if (nt == Node.Type.ORI && (this.whitelist == null || this.whitelist.getBoolean(succ))) { + if (nt == SwhType.ORI && (this.whitelist == null || this.whitelist.getBoolean(succ))) { res.add(succ); } else { stack.push(succ); seen.add(succ); } } } } Collections.sort(res); return res; } private boolean isBaseRevision(Long node) { - if (this.graph.getNodeType(node) != Node.Type.REV) + if 
(this.graph.getNodeType(node) != SwhType.REV) return false; final LazyLongIterator iterator = this.graph.successors(node); long succ; while ((succ = iterator.nextLong()) != -1) { - if (this.graph.getNodeType(succ) == Node.Type.REV) + if (this.graph.getNodeType(succ) == SwhType.REV) return false; } return true; } static private String fingerprint(ArrayList cluster) { MessageDigest digest; try { digest = MessageDigest.getInstance("SHA-256"); } catch (NoSuchAlgorithmException e) { e.printStackTrace(); return null; } for (Long n : cluster) digest.update(Longs.toByteArray(n)); return new String(digest.digest()); } private ArrayList> compute(ProgressLogger pl) { final long n = this.graph.numNodes(); HashSet fingerprints = new HashSet<>(); ArrayList> clusters = new ArrayList<>(); pl.expectedUpdates = n; pl.itemsName = "nodes"; pl.start("Starting topological sort..."); for (long i = 0; i < n; i++) { if (isBaseRevision(i)) { ArrayList currentCluster = dfsAt(i); String clusterFp = fingerprint(currentCluster); if (!fingerprints.contains(clusterFp)) { fingerprints.add(clusterFp); clusters.add(currentCluster); } } pl.update(); } pl.done(); return clusters; } private static void printDistribution(ArrayList> components, Formatter out) { TreeMap distribution = new TreeMap<>(); for (ArrayList component : components) { distribution.merge((long) component.size(), 1L, Long::sum); } for (Map.Entry entry : distribution.entrySet()) { out.format("%d %d\n", entry.getKey(), entry.getValue()); } } private static void printLargestComponent(ArrayList> components, Formatter out) { int indexLargest = 0; for (int i = 1; i < components.size(); ++i) { if (components.get(i).size() > components.get(indexLargest).size()) indexLargest = i; } ArrayList component = components.get(indexLargest); for (Long node : component) { out.format("%d\n", node); } } private static void printAllComponents(ArrayList> components, Formatter out) { for (int i = 1; i < components.size(); ++i) { ArrayList component = components.get(i); for (Long node : component) { out.format("%d ", node); } out.format("\n"); } } private void parseWhitelist(String path) { System.err.println("Loading whitelist " + path + " ..."); this.whitelist = LongArrayBitVector.ofLength(this.graph.numNodes()); Scanner scanner; try { scanner = new Scanner(new File(path)); while (scanner.hasNextLong()) { whitelist.set(scanner.nextLong()); } System.err.println("Whitelist loaded."); } catch (FileNotFoundException e) { e.printStackTrace(); } } public static void main(String[] args) { JSAPResult config = parse_args(args); String graphPath = config.getString("graphPath"); String whitelistPath = config.getString("whitelistPath"); String outdirPath = config.getString("outdir"); ForkCliques forkCliques = new ForkCliques(); try { forkCliques.load_graph(graphPath); } catch (IOException e) { System.out.println("Could not load graph: " + e); System.exit(2); } if (whitelistPath != null) { forkCliques.parseWhitelist(whitelistPath); } Logger rootLogger = (ch.qos.logback.classic.Logger) LoggerFactory.getLogger(org.slf4j.Logger.ROOT_LOGGER_NAME); rootLogger.setLevel(Level.DEBUG); ProgressLogger logger = new ProgressLogger(rootLogger); ArrayList> components = forkCliques.compute(logger); // noinspection ResultOfMethodCallIgnored new File(outdirPath).mkdirs(); try { printDistribution(components, new Formatter(outdirPath + "/distribution.txt")); printLargestComponent(components, new Formatter(outdirPath + "/largest_clique.txt")); printAllComponents(components, new Formatter(outdirPath + 
"/all_cliques.txt")); } catch (FileNotFoundException e) { e.printStackTrace(); } logger.done(); } } diff --git a/java/src/main/java/org/softwareheritage/graph/experiments/forks/ListEmptyOrigins.java b/java/src/main/java/org/softwareheritage/graph/experiments/forks/ListEmptyOrigins.java index 5067c28..0ffb690 100644 --- a/java/src/main/java/org/softwareheritage/graph/experiments/forks/ListEmptyOrigins.java +++ b/java/src/main/java/org/softwareheritage/graph/experiments/forks/ListEmptyOrigins.java @@ -1,95 +1,95 @@ /* * Copyright (c) 2019 The Software Heritage developers * See the AUTHORS file at the top-level directory of this distribution * License: GNU General Public License version 3, or any later version * See top-level LICENSE file for more information */ package org.softwareheritage.graph.experiments.forks; import com.martiansoftware.jsap.*; import it.unimi.dsi.big.webgraph.ImmutableGraph; import it.unimi.dsi.big.webgraph.LazyLongIterator; import org.softwareheritage.graph.SwhBidirectionalGraph; -import org.softwareheritage.graph.Node; +import org.softwareheritage.graph.SwhType; import java.io.IOException; import java.util.ArrayList; public class ListEmptyOrigins { private SwhBidirectionalGraph graph; private Long emptySnapshot; private static JSAPResult parse_args(String[] args) { JSAPResult config = null; try { SimpleJSAP jsap = new SimpleJSAP(ListEmptyOrigins.class.getName(), "", new Parameter[]{new FlaggedOption("graphPath", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'g', "graph", "Basename of the compressed graph"),}); config = jsap.parse(args); if (jsap.messagePrinted()) { System.exit(1); } } catch (JSAPException e) { e.printStackTrace(); } return config; } public static void main(String[] args) { JSAPResult config = parse_args(args); String graphPath = config.getString("graphPath"); ListEmptyOrigins leo = new ListEmptyOrigins(); try { leo.load_graph(graphPath); } catch (IOException e) { System.out.println("Could not load graph: " + e); System.exit(2); } ArrayList badlist = leo.compute(leo.graph); for (Long bad : badlist) { System.out.println(bad); } } private void load_graph(String graphBasename) throws IOException { System.err.println("Loading graph " + graphBasename + " ..."); this.graph = SwhBidirectionalGraph.loadMapped(graphBasename); System.err.println("Graph loaded."); this.emptySnapshot = null; } private boolean nodeIsEmptySnapshot(Long node) { System.err.println(this.graph.getNodeType(node) + " " + this.graph.outdegree(node) + " " + node); - if (this.emptySnapshot == null && this.graph.getNodeType(node) == Node.Type.SNP + if (this.emptySnapshot == null && this.graph.getNodeType(node) == SwhType.SNP && this.graph.outdegree(node) == 0) { System.err.println("Found empty snapshot: " + node); this.emptySnapshot = node; } return node.equals(this.emptySnapshot); } private ArrayList compute(ImmutableGraph graph) { final long n = graph.numNodes(); ArrayList bad = new ArrayList<>(); for (long i = 0; i < n; i++) { - Node.Type nt = this.graph.getNodeType(i); - if (nt != Node.Type.ORI) + SwhType nt = this.graph.getNodeType(i); + if (nt != SwhType.ORI) continue; final LazyLongIterator iterator = graph.successors(i); long succ; boolean found = false; while ((succ = iterator.nextLong()) != -1) { if (this.graph.outdegree(succ) > 0) { found = true; } } if (!found) bad.add(i); } return bad; } } diff --git a/java/src/main/java/org/softwareheritage/graph/experiments/topology/ClusteringCoefficient.java 
b/java/src/main/java/org/softwareheritage/graph/experiments/topology/ClusteringCoefficient.java index 9195560..558aa39 100644 --- a/java/src/main/java/org/softwareheritage/graph/experiments/topology/ClusteringCoefficient.java +++ b/java/src/main/java/org/softwareheritage/graph/experiments/topology/ClusteringCoefficient.java @@ -1,332 +1,331 @@ /* * Copyright (c) 2020 The Software Heritage developers * See the AUTHORS file at the top-level directory of this distribution * License: GNU General Public License version 3, or any later version * See top-level LICENSE file for more information */ package org.softwareheritage.graph.experiments.topology; import com.martiansoftware.jsap.*; import it.unimi.dsi.Util; import it.unimi.dsi.big.webgraph.ImmutableGraph; import it.unimi.dsi.big.webgraph.LazyLongIterator; import it.unimi.dsi.fastutil.BigArrays; import it.unimi.dsi.fastutil.longs.LongBigArrays; import it.unimi.dsi.logging.ProgressLogger; import it.unimi.dsi.util.XoRoShiRo128PlusRandom; import org.softwareheritage.graph.SwhBidirectionalGraph; -import org.softwareheritage.graph.Node; +import org.softwareheritage.graph.SwhType; import java.io.*; import java.util.*; import java.util.concurrent.*; public class ClusteringCoefficient { private final SwhBidirectionalGraph graph; private final String outdirPath; private final ConcurrentHashMap result_full; private final ConcurrentHashMap result_dircnt; private final ConcurrentHashMap result_rev; private final ConcurrentHashMap result_revrel; private final ConcurrentHashMap result_orisnp; public ClusteringCoefficient(String graphBasename, String outdirPath) throws IOException { this.outdirPath = outdirPath; System.err.println("Loading graph " + graphBasename + " ..."); SwhBidirectionalGraph directedGraph = SwhBidirectionalGraph.loadMapped(graphBasename); this.graph = directedGraph.symmetrize(); System.err.println("Graph loaded."); result_full = new ConcurrentHashMap<>(); result_dircnt = new ConcurrentHashMap<>(); result_rev = new ConcurrentHashMap<>(); result_revrel = new ConcurrentHashMap<>(); result_orisnp = new ConcurrentHashMap<>(); } private static JSAPResult parse_args(String[] args) { JSAPResult config = null; try { SimpleJSAP jsap = new SimpleJSAP(AveragePaths.class.getName(), "", new Parameter[]{ new FlaggedOption("graphPath", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'g', "graph", "Basename of the compressed graph"), new FlaggedOption("outdir", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'o', "outdir", "Directory where to put the results"), new FlaggedOption("numThreads", JSAP.INTEGER_PARSER, "32", JSAP.NOT_REQUIRED, 't', "numthreads", "Number of threads"),}); config = jsap.parse(args); if (jsap.messagePrinted()) { System.exit(1); } } catch (JSAPException e) { e.printStackTrace(); } return config; } private void run(int numThreads) throws InterruptedException { final long END_OF_QUEUE = -1L; ArrayBlockingQueue queue = new ArrayBlockingQueue<>(numThreads); ExecutorService service = Executors.newFixedThreadPool(numThreads + 1); service.submit(() -> { try { SwhBidirectionalGraph thread_graph = graph.copy(); long[][] randomPerm = Util.identity(thread_graph.numNodes()); LongBigArrays.shuffle(randomPerm, new XoRoShiRo128PlusRandom()); long n = thread_graph.numNodes(); ProgressLogger pl = new ProgressLogger(); pl.expectedUpdates = n; pl.itemsName = "nodes"; pl.start("Filling processor queue..."); for (long j = 0; j < n; ++j) { long node = BigArrays.get(randomPerm, j); queue.put(node); if (j % 10000 == 0) { printResult(); } 
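// Log one progress tick per node handed to the workers; intermediate distributions were
// flushed above every 10000 queued nodes.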
pl.update(); } } catch (Exception e) { e.printStackTrace(); } finally { for (int i = 0; i < numThreads; ++i) { try { queue.put(END_OF_QUEUE); } catch (InterruptedException e) { e.printStackTrace(); } } } }); for (int i = 0; i < numThreads; ++i) { service.submit(() -> { try { SwhBidirectionalGraph thread_graph = graph.copy(); while (true) { Long node = null; try { node = queue.take(); } catch (InterruptedException e) { e.printStackTrace(); } if (node == null || node == END_OF_QUEUE) { return; } computeAt(thread_graph, node); } } catch (Exception e) { e.printStackTrace(); } }); } service.shutdown(); service.awaitTermination(365, TimeUnit.DAYS); } private void computeAt(SwhBidirectionalGraph graph, long node) { long d = graph.outdegree(node); if (d < 2) { return; } - Node.Type nodeType = graph.getNodeType(node); + SwhType nodeType = graph.getNodeType(node); HashSet neighborhood = new HashSet<>(); long succ; final LazyLongIterator iterator = graph.successors(node); while ((succ = iterator.nextLong()) != -1) { neighborhood.add(succ); } long triangles_full = 0; long triangles_dircnt = 0; long triangles_rev = 0; long triangles_revrel = 0; long triangles_orisnp = 0; for (Long neighbor : neighborhood) { - Node.Type neighborNodeType = graph.getNodeType(neighbor); + SwhType neighborNodeType = graph.getNodeType(neighbor); final LazyLongIterator it = graph.successors(neighbor); while ((succ = it.nextLong()) != -1) { if (neighborhood.contains(succ)) { - Node.Type succNodeType = graph.getNodeType(succ); + SwhType succNodeType = graph.getNodeType(succ); triangles_full++; - if ((nodeType == Node.Type.DIR || nodeType == Node.Type.CNT) - && (neighborNodeType == Node.Type.DIR || neighborNodeType == Node.Type.CNT) - && (succNodeType == Node.Type.DIR || succNodeType == Node.Type.CNT)) { + if ((nodeType == SwhType.DIR || nodeType == SwhType.CNT) + && (neighborNodeType == SwhType.DIR || neighborNodeType == SwhType.CNT) + && (succNodeType == SwhType.DIR || succNodeType == SwhType.CNT)) { triangles_dircnt++; - } else if ((nodeType == Node.Type.REV || nodeType == Node.Type.REL) - && (neighborNodeType == Node.Type.REV || neighborNodeType == Node.Type.REL) - && (succNodeType == Node.Type.REV || succNodeType == Node.Type.REL)) { + } else if ((nodeType == SwhType.REV || nodeType == SwhType.REL) + && (neighborNodeType == SwhType.REV || neighborNodeType == SwhType.REL) + && (succNodeType == SwhType.REV || succNodeType == SwhType.REL)) { triangles_revrel++; - if (nodeType == Node.Type.REV && neighborNodeType == Node.Type.REV - && succNodeType == Node.Type.REV) + if (nodeType == SwhType.REV && neighborNodeType == SwhType.REV && succNodeType == SwhType.REV) triangles_rev++; - } else if ((nodeType == Node.Type.ORI || nodeType == Node.Type.SNP) - && (neighborNodeType == Node.Type.ORI || neighborNodeType == Node.Type.SNP) - && (succNodeType == Node.Type.ORI || succNodeType == Node.Type.SNP)) { + } else if ((nodeType == SwhType.ORI || nodeType == SwhType.SNP) + && (neighborNodeType == SwhType.ORI || neighborNodeType == SwhType.SNP) + && (succNodeType == SwhType.ORI || succNodeType == SwhType.SNP)) { triangles_orisnp++; } } } } result_full.merge(triangles_full, 1L, Long::sum); result_dircnt.merge(triangles_dircnt, 1L, Long::sum); result_rev.merge(triangles_rev, 1L, Long::sum); result_revrel.merge(triangles_revrel, 1L, Long::sum); result_orisnp.merge(triangles_orisnp, 1L, Long::sum); } public void printSortedDistribution(String distribPath, Map distrib) throws IOException { PrintWriter f = new PrintWriter(new 
FileWriter(distribPath)); TreeMap sortedDistribution = new TreeMap<>(distrib); for (Map.Entry entry : sortedDistribution.entrySet()) { f.println(entry.getKey() + " " + entry.getValue()); } f.close(); } public void printResult() throws IOException { new File(outdirPath).mkdirs(); printSortedDistribution(outdirPath + "/distribution-full.txt", result_full); printSortedDistribution(outdirPath + "/distribution-dircnt.txt", result_dircnt); printSortedDistribution(outdirPath + "/distribution-rev.txt", result_rev); printSortedDistribution(outdirPath + "/distribution-relrev.txt", result_revrel); printSortedDistribution(outdirPath + "/distribution-orisnp.txt", result_orisnp); } public static void main(String[] args) throws IOException, InterruptedException { JSAPResult config = parse_args(args); String graphPath = config.getString("graphPath"); String outdir = config.getString("outdir"); int numThreads = config.getInt("numThreads"); ClusteringCoefficient cc = new ClusteringCoefficient(graphPath, outdir); cc.run(numThreads); cc.printResult(); } // Old unused functions private long oppositeEdges(ImmutableGraph graph, long node) { HashSet neighborhood = new HashSet<>(); long succ; final LazyLongIterator iterator = graph.successors(node); while ((succ = iterator.nextLong()) != -1) { neighborhood.add(succ); } long closed_triplets = 0; for (Long neighbor : neighborhood) { final LazyLongIterator it = graph.successors(neighbor); while ((succ = it.nextLong()) != -1) { if (neighborhood.contains(succ)) { closed_triplets++; } } } return closed_triplets; } public void compute(ProgressLogger pl, Formatter out_local, Formatter out_global) { final long n = this.graph.numNodes(); pl.expectedUpdates = n; pl.itemsName = "nodes"; long nodes_d2 = 0; double cum_lcc = 0; double cum_lcc_c0 = 0; double cum_lcc_c1 = 0; HashMap distribution = new HashMap<>(); for (long node = 0; node < n; node++) { long d = graph.outdegree(node); if (d >= 2) { double t = (d * (d - 1)); double m = oppositeEdges(graph, node); double lcc = m / t; distribution.merge(lcc, 1L, Long::sum); cum_lcc += lcc; nodes_d2++; } else { cum_lcc_c1++; } pl.update(); } pl.done(); for (Map.Entry entry : distribution.entrySet()) { out_local.format("%f %d\n", entry.getKey(), entry.getValue()); } double gC = cum_lcc / nodes_d2; double gC0 = cum_lcc_c0 / n; double gC1 = cum_lcc_c1 / n; out_global.format("C: %f\n", gC); out_global.format("C0: %f\n", gC0); out_global.format("C1: %f\n", gC1); } public void compute_approx(Formatter out_global) { final long n = this.graph.numNodes(); long trials = 0; long triangles = 0; while (true) { long node = ThreadLocalRandom.current().nextLong(0, n); long d = graph.outdegree(node); if (d >= 2) { Long u = null; Long v = null; long u_idx = ThreadLocalRandom.current().nextLong(0, d); long v_idx = ThreadLocalRandom.current().nextLong(0, d - 1); if (v_idx >= u_idx) { v_idx++; } long succ; final LazyLongIterator node_iterator = graph.successors(node); for (long succ_idx = 0; (succ = node_iterator.nextLong()) != -1; succ_idx++) { if (succ_idx == u_idx) { u = succ; } if (succ_idx == v_idx) { v = succ; } } final LazyLongIterator u_iterator = graph.successors(u); while ((succ = u_iterator.nextLong()) != -1) { if (succ == v) triangles++; } } trials++; if (trials % 100 == 0 || true) { double gC = (double) triangles / (double) trials; out_global.format("C: %f (triangles: %d, trials: %d)\n", gC, triangles, trials); System.out.format("C: %f (triangles: %d, trials: %d)\n", gC, triangles, trials); } } } } diff --git 
diff --git a/java/src/main/java/org/softwareheritage/graph/experiments/topology/ConnectedComponents.java b/java/src/main/java/org/softwareheritage/graph/experiments/topology/ConnectedComponents.java
index b6f6072..36c9c52 100644
--- a/java/src/main/java/org/softwareheritage/graph/experiments/topology/ConnectedComponents.java
+++ b/java/src/main/java/org/softwareheritage/graph/experiments/topology/ConnectedComponents.java
@@ -1,207 +1,204 @@
/*
 * Copyright (c) 2020 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph.experiments.topology;

import com.google.common.primitives.Longs;
import com.martiansoftware.jsap.*;
import it.unimi.dsi.big.webgraph.LazyLongIterator;
import it.unimi.dsi.bits.LongArrayBitVector;
import it.unimi.dsi.fastutil.Arrays;
import it.unimi.dsi.io.ByteDiskQueue;
import it.unimi.dsi.logging.ProgressLogger;
-import org.softwareheritage.graph.AllowedNodes;
-import org.softwareheritage.graph.SwhBidirectionalGraph;
-import org.softwareheritage.graph.Node;
-import org.softwareheritage.graph.Subgraph;
+import org.softwareheritage.graph.*;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.*;

public class ConnectedComponents {
    private Subgraph graph;

    private void load_graph(String graphBasename, String nodeTypes) throws IOException {
        System.err.println("Loading graph " + graphBasename + " ...");
        var underlyingGraph = SwhBidirectionalGraph.loadMapped(graphBasename);
        var underlyingGraphSym = underlyingGraph.symmetrize();
        graph = new Subgraph(underlyingGraphSym, new AllowedNodes(nodeTypes));
        System.err.println("Graph loaded.");
    }

    private static JSAPResult parse_args(String[] args) {
        JSAPResult config = null;
        try {
            SimpleJSAP jsap = new SimpleJSAP(ConnectedComponents.class.getName(), "", new Parameter[]{
                    new FlaggedOption("graphPath", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'g', "graph",
                            "Basename of the compressed graph"),
                    new FlaggedOption("outdir", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'o', "outdir",
                            "Directory where to put the results"),
                    new Switch("byOrigins", JSAP.NO_SHORTFLAG, "by-origins",
                            "Compute size of components by number of origins"),
                    new FlaggedOption("nodeTypes", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'n',
                            "nodetypes", "Allowed node types (comma-separated)"),});

            config = jsap.parse(args);
            if (jsap.messagePrinted()) {
                System.exit(1);
            }
        } catch (JSAPException e) {
            e.printStackTrace();
        }
        return config;
    }

    private HashMap<Long, Long> /* ArrayList<ArrayList<Long>> */ compute(ProgressLogger pl, boolean byOrigin)
            throws IOException {
        final long n = graph.numNodes();
        final long maxN = graph.maxNodeNumber();

        // Allow enough memory to behave like in-memory queue
        int bufferSize = (int) Math.min(Arrays.MAX_ARRAY_SIZE & ~0x7, 8L * maxN);

        // Use a disk based queue to store BFS frontier
        final File queueFile = File.createTempFile(ConnectedComponents.class.getSimpleName(), "queue");
        final ByteDiskQueue queue = ByteDiskQueue.createNew(queueFile, bufferSize, true);
        final byte[] byteBuf = new byte[Long.BYTES];
        // WARNING: no 64-bit version of this data-structure, but it can support
        // indices up to 2^37
        LongArrayBitVector visited = LongArrayBitVector.ofLength(maxN);

        pl.expectedUpdates = n;
        pl.itemsName = "nodes";
        pl.start("Starting connected components visit...");

        // ArrayList<ArrayList<Long>> components = new ArrayList<>();
        HashMap<Long, Long> componentDistribution = new HashMap<>();

        var it = graph.nodeIterator();
        while (it.hasNext()) {
            long i = it.nextLong();
            if (visited.getBoolean(i))
                continue;

            // ArrayList<Long> component = new ArrayList<>();
            long componentNodes = 0;

            queue.enqueue(Longs.toByteArray(i));
            visited.set(i);

            while (!queue.isEmpty()) {
                queue.dequeue(byteBuf);
                final long currentNode = Longs.fromByteArray(byteBuf);
                // component.add(currentNode);
-                if (!byOrigin || graph.getNodeType(currentNode) == Node.Type.ORI)
+                if (!byOrigin || graph.getNodeType(currentNode) == SwhType.ORI)
                    componentNodes += 1;

                final LazyLongIterator iterator = graph.successors(currentNode);
                long succ;
                while ((succ = iterator.nextLong()) != -1) {
                    if (visited.getBoolean(succ))
                        continue;
                    visited.set(succ);
                    queue.enqueue(Longs.toByteArray(succ));
                }

                pl.update();
            }

            /*
             * if (component.size() > 0) { components.add(component); }
             */
            if (componentNodes > 0)
                componentDistribution.merge(componentNodes, 1L, Long::sum);
        }
        pl.done();

        // return components;
        return componentDistribution;
    }

    private static void printDistribution(ArrayList<ArrayList<Long>> components, Formatter out) {
        TreeMap<Long, Long> distribution = new TreeMap<>();
        for (ArrayList<Long> component : components) {
            distribution.merge((long) component.size(), 1L, Long::sum);
        }

        for (Map.Entry<Long, Long> entry : distribution.entrySet()) {
            out.format("%d %d\n", entry.getKey(), entry.getValue());
        }
        out.close();
    }

    private static void printLargestComponent(ArrayList<ArrayList<Long>> components, Formatter out) {
        int indexLargest = 0;
        for (int i = 1; i < components.size(); ++i) {
            if (components.get(i).size() > components.get(indexLargest).size())
                indexLargest = i;
        }

        ArrayList<Long> component = components.get(indexLargest);
        for (Long node : component) {
            out.format("%d\n", node);
        }
        out.close();
    }

    private static void printAllComponents(ArrayList<ArrayList<Long>> components, Formatter out) {
        for (int i = 1; i < components.size(); ++i) {
            ArrayList<Long> component = components.get(i);
            for (Long node : component) {
                out.format("%d ", node);
            }
            out.format("\n");
        }
        out.close();
    }

    public static void main(String[] args) {
        JSAPResult config = parse_args(args);
        String graphPath = config.getString("graphPath");
        String outdirPath = config.getString("outdir");
        String nodeTypes = config.getString("nodeTypes");
        boolean byOrigin = config.getBoolean("byOrigins");

        ConnectedComponents connectedComponents = new ConnectedComponents();
        try {
            connectedComponents.load_graph(graphPath, nodeTypes);
        } catch (IOException e) {
            System.out.println("Could not load graph: " + e);
            System.exit(2);
        }

        ProgressLogger logger = new ProgressLogger();
        // noinspection ResultOfMethodCallIgnored
        new File(outdirPath).mkdirs();
        try {
            // ArrayList<ArrayList<Long>> components = connectedComponents.compute(logger);
            // components.sort(Comparator.comparing(ArrayList::size).reversed());
            // printDistribution(components, new Formatter(outdirPath + "/distribution.txt"));
            // printLargestComponent(components, new Formatter(outdirPath + "/largest_component.txt"));
            // printAllComponents(components, new Formatter(outdirPath + "/all_components.txt"));
            HashMap<Long, Long> componentDistribution = connectedComponents.compute(logger, byOrigin);
            PrintWriter f = new PrintWriter(new FileWriter(outdirPath + "/distribution.txt"));
            TreeMap<Long, Long> sortedDistribution = new TreeMap<>(componentDistribution);
            for (Map.Entry<Long, Long> entry : sortedDistribution.entrySet()) {
                f.println(entry.getKey() + " " + entry.getValue());
            }
            f.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        logger.done();
    }
}
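
The visit above spills its BFS frontier to disk with ``ByteDiskQueue`` instead of growing an in-memory queue: node IDs are serialized to 8-byte arrays and dequeued into a reusable buffer. A condensed sketch of the pattern, assuming ``startNode`` is a valid node ID and using 8192 bytes as an arbitrary initial buffer size:

.. code-block:: java

    // Disk-backed BFS frontier, as used by ConnectedComponents above.
    File queueFile = File.createTempFile("bfs", "queue");
    ByteDiskQueue queue = ByteDiskQueue.createNew(queueFile, 8192, true);
    byte[] byteBuf = new byte[Long.BYTES];

    queue.enqueue(Longs.toByteArray(startNode)); // push a node ID
    while (!queue.isEmpty()) {
        queue.dequeue(byteBuf); // pop into the reusable buffer
        long node = Longs.fromByteArray(byteBuf);
        // ... enqueue the unvisited successors of node ...
    }
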
diff --git a/java/src/main/java/org/softwareheritage/graph/experiments/topology/InOutDegree.java b/java/src/main/java/org/softwareheritage/graph/experiments/topology/InOutDegree.java
index 54e53cb..603be4d 100644
--- a/java/src/main/java/org/softwareheritage/graph/experiments/topology/InOutDegree.java
+++ b/java/src/main/java/org/softwareheritage/graph/experiments/topology/InOutDegree.java
@@ -1,246 +1,246 @@
/*
 * Copyright (c) 2020 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph.experiments.topology;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.lang.reflect.InvocationTargetException;
import java.util.*;

import com.martiansoftware.jsap.JSAP;
import com.martiansoftware.jsap.JSAPException;
import com.martiansoftware.jsap.JSAPResult;
import com.martiansoftware.jsap.Parameter;
import com.martiansoftware.jsap.SimpleJSAP;
import com.martiansoftware.jsap.UnflaggedOption;

import it.unimi.dsi.logging.ProgressLogger;
import org.softwareheritage.graph.SwhBidirectionalGraph;
-import org.softwareheritage.graph.Node;
+import org.softwareheritage.graph.SwhType;

public class InOutDegree {
    private InOutDegree() {
    }

-    private static final int NODE_ARRAY_SIZE = Node.Type.values().length + 1;
-    private static final int TYPE_ALL = Node.Type.values().length;
-    private static final int TYPE_CNT = Node.Type.toInt(Node.Type.CNT);
-    private static final int TYPE_DIR = Node.Type.toInt(Node.Type.DIR);
-    private static final int TYPE_REV = Node.Type.toInt(Node.Type.REV);
-    private static final int TYPE_REL = Node.Type.toInt(Node.Type.REL);
-    private static final int TYPE_SNP = Node.Type.toInt(Node.Type.SNP);
-    private static final int TYPE_ORI = Node.Type.toInt(Node.Type.ORI);
+    private static final int NODE_ARRAY_SIZE = SwhType.values().length + 1;
+    private static final int TYPE_ALL = SwhType.values().length;
+    private static final int TYPE_CNT = SwhType.toInt(SwhType.CNT);
+    private static final int TYPE_DIR = SwhType.toInt(SwhType.DIR);
+    private static final int TYPE_REV = SwhType.toInt(SwhType.REV);
+    private static final int TYPE_REL = SwhType.toInt(SwhType.REL);
+    private static final int TYPE_SNP = SwhType.toInt(SwhType.SNP);
+    private static final int TYPE_ORI = SwhType.toInt(SwhType.ORI);

    public static long[] outdegreeTypes(final SwhBidirectionalGraph graph, long node) {
        long[] out = new long[NODE_ARRAY_SIZE];
        var successors = graph.successors(node);
        long neighbor;
        while ((neighbor = successors.nextLong()) != -1) {
-            out[Node.Type.toInt(graph.getNodeType(neighbor))]++;
+            out[SwhType.toInt(graph.getNodeType(neighbor))]++;
            out[TYPE_ALL]++;
        }
        return out;
    }

    public static long[] indegreeTypes(final SwhBidirectionalGraph graph, long node) {
        return outdegreeTypes(graph.transpose(), node);
    }

    public static void writeDistribution(HashMap<Long, Long> distribution, String filename) throws IOException {
        PrintWriter f = new PrintWriter(new FileWriter(filename));
        TreeMap<Long, Long> sortedDistribution = new TreeMap<>(distribution);
        for (Map.Entry<Long, Long> entry : sortedDistribution.entrySet()) {
            f.println(entry.getKey() + " " + entry.getValue());
        }
        f.close();
    }

    public static void run(final SwhBidirectionalGraph graph, String resultsDir) throws IOException {
        // Per-type
        var cnt_in_dir = new HashMap<Long, Long>();
        var dir_in_dir = new HashMap<Long, Long>();
        var dir_in_rev = new HashMap<Long, Long>();
        var dir_in_all = new HashMap<Long, Long>();
        var dir_out_all = new HashMap<Long, Long>();
        var dir_out_dir = new HashMap<Long, Long>();
        var dir_out_cnt = new HashMap<Long, Long>();
        var dir_out_rev = new HashMap<Long, Long>();
        var rev_in_dir = new HashMap<Long, Long>();
        var rev_in_rel = new HashMap<Long, Long>();
        var rev_in_rev = new HashMap<Long, Long>();
        var rev_in_snp = new HashMap<Long, Long>();
        var rev_in_all = new HashMap<Long, Long>();
        var rev_out_rev = new HashMap<Long, Long>();
        var rel_in_snp = new HashMap<Long, Long>();
        var snp_in_ori = new HashMap<Long, Long>();
        var snp_out_all = new HashMap<Long, Long>();
        var snp_out_rel = new HashMap<Long, Long>();
        var snp_out_rev = new HashMap<Long, Long>();
        var ori_out_snp = new HashMap<Long, Long>();

        // Aggregated per layer
        var full_in = new HashMap<Long, Long>();
        var full_out = new HashMap<Long, Long>();
        var dircnt_in = new HashMap<Long, Long>();
        var dircnt_out = new HashMap<Long, Long>();
        var orisnp_in = new HashMap<Long, Long>();
        var orisnp_out = new HashMap<Long, Long>();
        var relrev_in = new HashMap<Long, Long>();
        var relrev_out = new HashMap<Long, Long>();
        var rev_in = rev_in_rev; // alias for single-type layer
        var rev_out = rev_out_rev;

        final ProgressLogger pl = new ProgressLogger();
        pl.itemsName = "nodes";
        pl.expectedUpdates = graph.numNodes();
        pl.start("Scanning...");

        long[] in;
        long[] out;

        for (long i = graph.numNodes(); i-- != 0;) {
            long d_in = graph.indegree(i);
            long d_out = graph.outdegree(i);

            full_in.merge(d_in, 1L, Long::sum);
            full_out.merge(d_out, 1L, Long::sum);

            switch (graph.getNodeType(i)) {
                case CNT:
                    cnt_in_dir.merge(d_in, 1L, Long::sum);

                    dircnt_in.merge(d_in, 1L, Long::sum);
                    dircnt_out.merge(0L, 1L, Long::sum);
                    break;
                case DIR:
                    in = indegreeTypes(graph, i);
                    out = outdegreeTypes(graph, i);

                    dir_in_all.merge(in[TYPE_ALL], 1L, Long::sum);
                    dir_out_all.merge(out[TYPE_ALL], 1L, Long::sum);
                    dir_in_dir.merge(in[TYPE_DIR], 1L, Long::sum);
                    dir_in_rev.merge(in[TYPE_REV], 1L, Long::sum);
                    dir_out_cnt.merge(out[TYPE_CNT], 1L, Long::sum);
                    dir_out_dir.merge(out[TYPE_DIR], 1L, Long::sum);
                    dir_out_rev.merge(out[TYPE_REV], 1L, Long::sum);

                    dircnt_in.merge(in[TYPE_DIR] + in[TYPE_CNT], 1L, Long::sum);
                    dircnt_out.merge(out[TYPE_DIR] + out[TYPE_CNT], 1L, Long::sum);
                    break;
                case REV:
                    in = indegreeTypes(graph, i);
                    out = outdegreeTypes(graph, i);

                    rev_in_all.merge(in[TYPE_ALL], 1L, Long::sum);
                    rev_in_dir.merge(in[TYPE_DIR], 1L, Long::sum);
                    rev_in_rev.merge(in[TYPE_REV], 1L, Long::sum);
                    rev_in_rel.merge(in[TYPE_REL], 1L, Long::sum);
                    rev_in_snp.merge(in[TYPE_SNP], 1L, Long::sum);
                    rev_out_rev.merge(out[TYPE_REV], 1L, Long::sum);

                    relrev_in.merge(in[TYPE_REL] + in[TYPE_REV], 1L, Long::sum);
                    relrev_out.merge(out[TYPE_REL] + out[TYPE_REV], 1L, Long::sum);
                    break;
                case REL:
                    rel_in_snp.merge(d_in, 1L, Long::sum);

                    relrev_in.merge(0L, 1L, Long::sum);
                    relrev_out.merge(d_out, 1L, Long::sum);
                    break;
                case SNP:
                    out = outdegreeTypes(graph, i);

                    snp_in_ori.merge(d_in, 1L, Long::sum);
                    snp_out_all.merge(out[TYPE_ALL], 1L, Long::sum);
                    snp_out_rel.merge(out[TYPE_REL], 1L, Long::sum);
                    snp_out_rev.merge(out[TYPE_REV], 1L, Long::sum);

                    orisnp_in.merge(d_in, 1L, Long::sum);
                    orisnp_out.merge(out[TYPE_REL] + out[TYPE_REV], 1L, Long::sum);
                    break;
                case ORI:
                    ori_out_snp.merge(d_out, 1L, Long::sum);

                    orisnp_in.merge(0L, 1L, Long::sum);
                    orisnp_out.merge(d_out, 1L, Long::sum);
                    break;
                default :
                    pl.logger().warn("Invalid node type at pos {}", i);
                    break;
            }

            pl.update();
        }
        pl.done();

        (new File(resultsDir)).mkdir();
        writeDistribution(full_in, resultsDir + "/full_in.txt");
        writeDistribution(full_out, resultsDir + "/full_out.txt");
        writeDistribution(dircnt_in, resultsDir + "/dir+cnt_in.txt");
        writeDistribution(dircnt_out, resultsDir + "/dir+cnt_out.txt");
        writeDistribution(relrev_in, resultsDir + "/rel+rev_in.txt");
        writeDistribution(relrev_out, resultsDir + "/rel+rev_out.txt");
        writeDistribution(orisnp_in, resultsDir + "/ori+snp_in.txt");
        writeDistribution(orisnp_out, resultsDir + "/ori+snp_out.txt");
        writeDistribution(rev_in, resultsDir + "/rev_in.txt");
        writeDistribution(rev_out, resultsDir + "/rev_out.txt");

        String resultsTypeDir = resultsDir + "/per_type";
        (new File(resultsTypeDir)).mkdir();
        writeDistribution(cnt_in_dir, resultsTypeDir + "/cnt_in_dir.txt");
        writeDistribution(dir_in_dir, resultsTypeDir + "/dir_in_dir.txt");
        writeDistribution(dir_in_rev, resultsTypeDir + "/dir_in_rev.txt");
        writeDistribution(dir_in_all, resultsTypeDir + "/dir_in_all.txt");
        writeDistribution(dir_out_all, resultsTypeDir + "/dir_out_all.txt");
        writeDistribution(dir_out_dir, resultsTypeDir + "/dir_out_dir.txt");
        writeDistribution(dir_out_cnt, resultsTypeDir + "/dir_out_cnt.txt");
        writeDistribution(dir_out_rev, resultsTypeDir + "/dir_out_rev.txt");
        writeDistribution(rev_in_dir, resultsTypeDir + "/rev_in_dir.txt");
        writeDistribution(rev_in_rel, resultsTypeDir + "/rev_in_rel.txt");
        writeDistribution(rev_in_rev, resultsTypeDir + "/rev_in_rev.txt");
        writeDistribution(rev_in_snp, resultsTypeDir + "/rev_in_snp.txt");
        writeDistribution(rev_in_all, resultsTypeDir + "/rev_in_all.txt");
        writeDistribution(rev_out_rev, resultsTypeDir + "/rev_out_rev.txt");
        writeDistribution(rel_in_snp, resultsTypeDir + "/rel_in_snp.txt");
        writeDistribution(snp_in_ori, resultsTypeDir + "/snp_in_ori.txt");
        writeDistribution(snp_out_all, resultsTypeDir + "/snp_out_all.txt");
        writeDistribution(snp_out_rel, resultsTypeDir + "/snp_out_rel.txt");
        writeDistribution(snp_out_rev, resultsTypeDir + "/snp_out_rev.txt");
        writeDistribution(ori_out_snp, resultsTypeDir + "/ori_out_snp.txt");
    }

    static public void main(final String[] arg) throws IllegalArgumentException, SecurityException,
            IllegalAccessException, InvocationTargetException, NoSuchMethodException, JSAPException, IOException,
            ClassNotFoundException {
        final SimpleJSAP jsap = new SimpleJSAP(InOutDegree.class.getName(),
                "Computes in and out degrees of the given SWHGraph",
                new Parameter[]{
                        new UnflaggedOption("basename", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED,
                                JSAP.NOT_GREEDY, "The basename of the graph."),
                        new UnflaggedOption("resultsDir", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.NOT_REQUIRED,
                                JSAP.NOT_GREEDY, "The directory of the resulting files."),});

        final JSAPResult jsapResult = jsap.parse(arg);
        if (jsap.messagePrinted())
            System.exit(1);
        final String basename = jsapResult.getString("basename");
        final String resultsDir = jsapResult.userSpecified("resultsDir") ? jsapResult.getString("resultsDir")
                : basename;

        final ProgressLogger pl = new ProgressLogger();

        SwhBidirectionalGraph graph = SwhBidirectionalGraph.loadMapped(basename);
        run(graph, resultsDir);
    }
}
diff --git a/java/src/main/java/org/softwareheritage/graph/experiments/topology/SubdatasetSizeFunction.java b/java/src/main/java/org/softwareheritage/graph/experiments/topology/SubdatasetSizeFunction.java
index 3f55826..189b3af 100644
--- a/java/src/main/java/org/softwareheritage/graph/experiments/topology/SubdatasetSizeFunction.java
+++ b/java/src/main/java/org/softwareheritage/graph/experiments/topology/SubdatasetSizeFunction.java
@@ -1,105 +1,105 @@
/*
 * Copyright (c) 2020 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph.experiments.topology;

import com.google.common.primitives.Longs;
import com.martiansoftware.jsap.*;
import it.unimi.dsi.Util;
import it.unimi.dsi.big.webgraph.LazyLongIterator;
import it.unimi.dsi.bits.LongArrayBitVector;
import it.unimi.dsi.fastutil.Arrays;
import it.unimi.dsi.fastutil.BigArrays;
import it.unimi.dsi.fastutil.longs.LongBigArrays;
import it.unimi.dsi.io.ByteDiskQueue;
import it.unimi.dsi.logging.ProgressLogger;
import it.unimi.dsi.util.XoRoShiRo128PlusRandom;
import org.softwareheritage.graph.SwhBidirectionalGraph;
-import org.softwareheritage.graph.Node;
+import org.softwareheritage.graph.SwhType;
import org.softwareheritage.graph.experiments.forks.ForkCC;

import java.io.*;

public class SubdatasetSizeFunction {
    private SubdatasetSizeFunction() {
    }

    public static void run(final SwhBidirectionalGraph graph) throws IOException {
        final ProgressLogger pl = new ProgressLogger();
        pl.itemsName = "nodes";
        pl.expectedUpdates = graph.numNodes();

        long n = graph.numNodes();
        LongArrayBitVector visited = LongArrayBitVector.ofLength(n);

        int bufferSize = (int) Math.min(Arrays.MAX_ARRAY_SIZE & ~0x7, 8L * n);
        final File queueFile = File.createTempFile(ForkCC.class.getSimpleName(), "queue");
        final ByteDiskQueue queue = ByteDiskQueue.createNew(queueFile, bufferSize, true);
        final byte[] byteBuf = new byte[Long.BYTES];

        long[][] randomPerm = Util.identity(graph.numNodes());
        LongBigArrays.shuffle(randomPerm, new XoRoShiRo128PlusRandom());

        long visitedNodes = 0;
        long visitedEdges = 0;
        long visitedOrigins = 0;
        long visitedContents = 0;
        pl.start("Running traversal starting from origins...");
        for (long j = 0; j < n; ++j) {
            long i = BigArrays.get(randomPerm, j);
-            if (visited.getBoolean(i) || graph.getNodeType(i) != Node.Type.ORI) {
+            if (visited.getBoolean(i) || graph.getNodeType(i) != SwhType.ORI) {
                continue;
            }
            visitedOrigins++;

            queue.enqueue(Longs.toByteArray(i));
            visited.set(i);

            while (!queue.isEmpty()) {
                queue.dequeue(byteBuf);
                final long currentNode = Longs.fromByteArray(byteBuf);
                visitedNodes++;
-                if (graph.getNodeType(currentNode) == Node.Type.CNT)
+                if (graph.getNodeType(currentNode) == SwhType.CNT)
                    visitedContents++;

                final LazyLongIterator iterator = graph.successors(currentNode);
                long succ;
                while ((succ = iterator.nextLong()) != -1) {
                    visitedEdges++;
                    if (visited.getBoolean(succ))
                        continue;
                    visited.set(succ);
                    queue.enqueue(Longs.toByteArray(succ));
                }

                pl.update();
            }

            if (visitedOrigins % 10000 == 0)
                System.out.println(visitedNodes + " " + visitedEdges + " " + visitedContents);
        }

        pl.done();
    }

    static public void main(final String[] arg)
            throws IllegalArgumentException, SecurityException, JSAPException, IOException {
        final SimpleJSAP jsap = new SimpleJSAP(SubdatasetSizeFunction.class.getName(),
                "Computes subdataset size functions using a random uniform order",
                new Parameter[]{new UnflaggedOption("basename", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED,
                        JSAP.NOT_GREEDY, "The basename of the graph."),});

        final JSAPResult jsapResult = jsap.parse(arg);
        if (jsap.messagePrinted())
            System.exit(1);
        final String basename = jsapResult.getString("basename");

        SwhBidirectionalGraph graph = SwhBidirectionalGraph.loadMapped(basename);
        run(graph);
    }
}
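
``SubdatasetSizeFunction`` above visits origins in uniform random order by shuffling an identity permutation kept as a fastutil *big array* (a plain ``long[]`` cannot index more than 2^31 elements). The pattern in isolation, assuming ``graph`` is an already-loaded ``SwhBidirectionalGraph``:

.. code-block:: java

    // Iterate over all node IDs in a uniformly random order.
    long[][] perm = Util.identity(graph.numNodes());
    LongBigArrays.shuffle(perm, new XoRoShiRo128PlusRandom());
    for (long j = 0; j < graph.numNodes(); j++) {
        long node = BigArrays.get(perm, j); // next node, in random order
        // ... process node ...
    }
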
diff --git a/java/src/main/java/org/softwareheritage/graph/maps/NodeTypesMap.java b/java/src/main/java/org/softwareheritage/graph/maps/NodeTypesMap.java
index 3332607..40aafb1 100644
--- a/java/src/main/java/org/softwareheritage/graph/maps/NodeTypesMap.java
+++ b/java/src/main/java/org/softwareheritage/graph/maps/NodeTypesMap.java
@@ -1,62 +1,62 @@
/*
 * Copyright (c) 2019-2022 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph.maps;

import it.unimi.dsi.fastutil.io.BinIO;
import it.unimi.dsi.fastutil.longs.LongBigList;
-import org.softwareheritage.graph.Node;
+import org.softwareheritage.graph.SwhType;

import java.io.IOException;

/**
 * Mapping between long node id and SWH node type as described in the data model.
 * <p>
 * The type mapping is pre-computed and dumped on disk in the
 * {@link org.softwareheritage.graph.compress.NodeMapBuilder} class, then it is loaded in-memory
 * here using fastutil LongBigList. To be space-efficient, the mapping is stored as a bitmap
 * using minimum number of bits per
- * {@link Node.Type}.
+ * {@link SwhType}.
 *
 * @author The Software Heritage developers
 */
public class NodeTypesMap {
    /** File extension for the long node id to node type map */
    public static final String NODE_TO_TYPE = ".node2type.map";

    /**
     * Array storing for each node its type
     */
    public LongBigList nodeTypesMap;

    /**
     * Constructor.
     *
     * @param graphPath path and basename of the compressed graph
     */
    public NodeTypesMap(String graphPath) throws IOException {
        try {
            nodeTypesMap = (LongBigList) BinIO.loadObject(graphPath + NODE_TO_TYPE);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown class object: " + e);
        }
    }

    /**
     * Returns node type from a node long id.
     *
     * @param nodeId node as a long id
-     * @return corresponding {@link Node.Type} value
-     * @see org.softwareheritage.graph.Node.Type
+     * @return corresponding {@link SwhType} value
+     * @see SwhType
     */
-    public Node.Type getType(long nodeId) {
+    public SwhType getType(long nodeId) {
        long type = nodeTypesMap.getLong(nodeId);
-        return Node.Type.fromInt((int) type);
+        return SwhType.fromInt((int) type);
    }
}
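
The map loaded here is presumably the lookup behind ``getNodeType()`` on the graph classes. A hedged usage sketch, assuming a graph compressed under the basename ``compressed/graph`` (so that a ``compressed/graph.node2type.map`` file exists):

.. code-block:: java

    // Resolve a node's type directly through the on-disk type map.
    NodeTypesMap typeMap = new NodeTypesMap("compressed/graph");
    SwhType type = typeMap.getType(42); // type of the node with ID 42
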
diff --git a/java/src/main/java/org/softwareheritage/graph/utils/FindEarliestRevision.java b/java/src/main/java/org/softwareheritage/graph/utils/FindEarliestRevision.java
index bf59b6f..7ff29b8 100644
--- a/java/src/main/java/org/softwareheritage/graph/utils/FindEarliestRevision.java
+++ b/java/src/main/java/org/softwareheritage/graph/utils/FindEarliestRevision.java
@@ -1,120 +1,120 @@
/*
 * Copyright (c) 2021 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph.utils;

import it.unimi.dsi.big.webgraph.LazyLongIterator;
import org.softwareheritage.graph.*;

import java.io.IOException;
import java.time.Duration;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Stack;

/* sample invocation on granet.internal.softwareheritage.org for benchmarking
 * purposes, with the main swh-graph service already running:
 *
 * $ java -cp ~/swh-environment/swh-graph/java/target/swh-graph-0.3.0.jar -Xmx300G -XX:PretenureSizeThreshold=512M -XX:MaxNewSize=4G -XX:+UseLargePages -XX:+UseTransparentHugePages -XX:+UseNUMA -XX:+UseTLAB -XX:+ResizeTLAB org.softwareheritage.graph.utils.FindEarliestRevision --timing /dev/shm/swh-graph/default/graph
 */

public class FindEarliestRevision {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        String graphPath = args[0];
        boolean timing = false;
        long ts, elapsedNanos;
        Duration elapsed;

        if (args.length >= 2 && (args[0].equals("-t") || args[0].equals("--timing"))) {
            timing = true;
            graphPath = args[1];
            System.err.println("started with timing option, will keep track of elapsed time");
        }

        System.err.println("loading transposed graph...");
        ts = System.nanoTime();
        SwhBidirectionalGraph graph = SwhBidirectionalGraph.loadMapped(graphPath).transpose();
        elapsed = Duration.ofNanos(System.nanoTime() - ts);
        System.err.println(String.format("transposed graph loaded (duration: %s).", elapsed));

        System.err.println("loading revision timestamps...");
        ts = System.nanoTime();
        graph.loadCommitterTimestamps();
        elapsed = Duration.ofNanos(System.nanoTime() - ts);
        System.err.println(String.format("revision timestamps loaded (duration: %s).", elapsed));

        Scanner stdin = new Scanner(System.in);
        AllowedEdges edges = new AllowedEdges("cnt:dir,dir:dir,dir:rev");
        String rawSWHID = null;
        SWHID srcSWHID = null;
        long lineCount = 0;
        long srcNodeId = -1;
        if (timing) {
            System.err.println("starting SWHID processing...");
            elapsed = Duration.ZERO;
        }
        while (stdin.hasNextLine()) {
            if (timing)
                ts = System.nanoTime();
            rawSWHID = stdin.nextLine().strip();
            lineCount++;
            try {
                srcSWHID = new SWHID(rawSWHID);
                srcNodeId = graph.getNodeId(srcSWHID);
            } catch (IllegalArgumentException e) {
                System.err
                        .println(String.format("skipping invalid or unknown SWHID %s on line %d", rawSWHID, lineCount));
                continue;
            }

            if (timing)
                System.err.println("starting traversal for: " + srcSWHID.toString());

            Stack<Long> stack = new Stack<>();
            HashSet<Long> visited = new HashSet<>();
            stack.push(srcNodeId);
            visited.add(srcNodeId);

            long minRevId = -1;
            long minTimestamp = Long.MAX_VALUE;
            while (!stack.isEmpty()) {
                long currentNodeId = stack.pop();
-                if (graph.getNodeType(currentNodeId) == Node.Type.REV) {
+                if (graph.getNodeType(currentNodeId) == SwhType.REV) {
                    long committerTs = graph.getCommitterTimestamp(currentNodeId);
                    if (committerTs < minTimestamp) {
                        minRevId = currentNodeId;
                        minTimestamp = committerTs;
                    }
                }

                LazyLongIterator it = graph.successors(currentNodeId);
                for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {
                    if (!edges.isAllowed(graph.getNodeType(currentNodeId), graph.getNodeType(neighborNodeId))) {
                        continue;
                    }
                    if (!visited.contains(neighborNodeId)) {
                        stack.push(neighborNodeId);
                        visited.add(neighborNodeId);
                    }
                }
            }

            if (minRevId == -1) {
                System.err.println("no revision found containing: " + srcSWHID.toString());
            } else {
                System.out.println(srcSWHID.toString() + "\t" + graph.getSWHID(minRevId).toString());
            }
            if (timing) {
                elapsedNanos = System.nanoTime() - ts; // processing time for current SWHID
                elapsed = elapsed.plus(Duration.ofNanos(elapsedNanos)); // cumulative processing time for all SWHIDs
                System.err.printf("visit time (s):\t%.6f\n", (double) elapsedNanos / 1_000_000_000);
            }
        }
        if (timing)
            System.err.printf("processed %d SWHIDs in %s (%s avg)\n", lineCount, elapsed,
                    elapsed.dividedBy(lineCount));
    }
}
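
``FindEarliestRevision`` above restricts its traversal with an ``AllowedEdges`` specification, following only cnt→dir, dir→dir and dir→rev arcs on the transposed graph, i.e. walking from a content up to the revisions containing it. The check in isolation, with hypothetical ``src`` and ``dst`` node IDs:

.. code-block:: java

    // Edge filtering during a traversal, as done in the DFS above.
    AllowedEdges edges = new AllowedEdges("cnt:dir,dir:dir,dir:rev");
    if (edges.isAllowed(graph.getNodeType(src), graph.getNodeType(dst))) {
        // the arc src -> dst may be followed
    }
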
diff --git a/java/src/test/java/org/softwareheritage/graph/AllowedEdgesTest.java b/java/src/test/java/org/softwareheritage/graph/AllowedEdgesTest.java
index 022f2b6..6cd9b68 100644
--- a/java/src/test/java/org/softwareheritage/graph/AllowedEdgesTest.java
+++ b/java/src/test/java/org/softwareheritage/graph/AllowedEdgesTest.java
@@ -1,120 +1,120 @@
/*
 * Copyright (c) 2022 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph;

import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

import java.util.ArrayList;

public class AllowedEdgesTest extends GraphTest {
    static class EdgeType {
-        Node.Type src;
-        Node.Type dst;
+        SwhType src;
+        SwhType dst;

-        public EdgeType(Node.Type src, Node.Type dst) {
+        public EdgeType(SwhType src, SwhType dst) {
            this.src = src;
            this.dst = dst;
        }

        @Override
        public boolean equals(Object otherObj) {
            if (otherObj == this)
                return true;
            if (!(otherObj instanceof EdgeType))
                return false;

            EdgeType other = (EdgeType) otherObj;
            return src == other.src && dst == other.dst;
        }
    }

    void assertEdgeRestriction(AllowedEdges edges, ArrayList<EdgeType> expectedAllowed) {
-        Node.Type[] nodeTypes = Node.Type.values();
-        for (Node.Type src : nodeTypes) {
-            for (Node.Type dst : nodeTypes) {
+        SwhType[] nodeTypes = SwhType.values();
+        for (SwhType src : nodeTypes) {
+            for (SwhType dst : nodeTypes) {
                EdgeType edge = new EdgeType(src, dst);
                boolean isAllowed = edges.isAllowed(src, dst);
                boolean isExpected = false;
                for (EdgeType expected : expectedAllowed) {
                    if (expected.equals(edge)) {
                        isExpected = true;
                        break;
                    }
                }
                Assertions.assertEquals(isAllowed, isExpected, "Edge type: " + src + " -> " + dst);
            }
        }
    }

    @Test
    public void dirToDirDirToCntEdges() {
        AllowedEdges edges = new AllowedEdges("dir:dir,dir:cnt");
        ArrayList<EdgeType> expected = new ArrayList<>();
-        expected.add(new EdgeType(Node.Type.DIR, Node.Type.DIR));
-        expected.add(new EdgeType(Node.Type.DIR, Node.Type.CNT));
+        expected.add(new EdgeType(SwhType.DIR, SwhType.DIR));
+        expected.add(new EdgeType(SwhType.DIR, SwhType.CNT));
        assertEdgeRestriction(edges, expected);
    }

    @Test
    public void relToRevRevToRevRevToDirEdges() {
        AllowedEdges edges = new AllowedEdges("rel:rev,rev:rev,rev:dir");
        ArrayList<EdgeType> expected = new ArrayList<>();
-        expected.add(new EdgeType(Node.Type.REL, Node.Type.REV));
-        expected.add(new EdgeType(Node.Type.REV, Node.Type.REV));
-        expected.add(new EdgeType(Node.Type.REV, Node.Type.DIR));
+        expected.add(new EdgeType(SwhType.REL, SwhType.REV));
+        expected.add(new EdgeType(SwhType.REV, SwhType.REV));
+        expected.add(new EdgeType(SwhType.REV, SwhType.DIR));
        assertEdgeRestriction(edges, expected);
    }

    @Test
    public void revToAllDirToDirEdges() {
        AllowedEdges edges = new AllowedEdges("rev:*,dir:dir");
        ArrayList<EdgeType> expected = new ArrayList<>();
-        for (Node.Type dst : Node.Type.values()) {
-            expected.add(new EdgeType(Node.Type.REV, dst));
+        for (SwhType dst : SwhType.values()) {
+            expected.add(new EdgeType(SwhType.REV, dst));
        }
-        expected.add(new EdgeType(Node.Type.DIR, Node.Type.DIR));
+        expected.add(new EdgeType(SwhType.DIR, SwhType.DIR));
        assertEdgeRestriction(edges, expected);
    }

    @Test
    public void allToCntEdges() {
        AllowedEdges edges = new AllowedEdges("*:cnt");
        ArrayList<EdgeType> expected = new ArrayList<>();
-        for (Node.Type src : Node.Type.values()) {
-            expected.add(new EdgeType(src, Node.Type.CNT));
+        for (SwhType src : SwhType.values()) {
+            expected.add(new EdgeType(src, SwhType.CNT));
        }
        assertEdgeRestriction(edges, expected);
    }

    @Test
    public void allEdges() {
        AllowedEdges edges = new AllowedEdges("*:*");
        ArrayList<EdgeType> expected = new ArrayList<>();
-        for (Node.Type src : Node.Type.values()) {
-            for (Node.Type dst : Node.Type.values()) {
+        for (SwhType src : SwhType.values()) {
+            for (SwhType dst : SwhType.values()) {
                expected.add(new EdgeType(src, dst));
            }
        }
        assertEdgeRestriction(edges, expected);

        // Special null value used to quickly bypass edge check when no restriction
        AllowedEdges edges2 = new AllowedEdges("*");
        Assertions.assertNull(edges2.restrictedTo);
    }

    @Test
    public void noEdges() {
        AllowedEdges edges = new AllowedEdges("");
        AllowedEdges edges2 = new AllowedEdges(null);
        ArrayList<EdgeType> expected = new ArrayList<>();
        assertEdgeRestriction(edges, expected);
        assertEdgeRestriction(edges2, expected);
    }
}
diff --git a/java/src/test/java/org/softwareheritage/graph/AllowedNodesTest.java b/java/src/test/java/org/softwareheritage/graph/AllowedNodesTest.java
index 7d66391..4da3c59 100644
--- a/java/src/test/java/org/softwareheritage/graph/AllowedNodesTest.java
+++ b/java/src/test/java/org/softwareheritage/graph/AllowedNodesTest.java
@@ -1,60 +1,59 @@
/*
 * Copyright (c) 2022 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph;

import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

import java.util.Set;

public class AllowedNodesTest extends GraphTest {
-    void assertNodeRestriction(AllowedNodes nodes, Set<Node.Type> expectedAllowed) {
-        Node.Type[] nodeTypes = Node.Type.values();
-        for (Node.Type t : nodeTypes) {
+    void assertNodeRestriction(AllowedNodes nodes, Set<SwhType> expectedAllowed) {
+        SwhType[] nodeTypes = SwhType.values();
+        for (SwhType t : nodeTypes) {
            boolean isAllowed = nodes.isAllowed(t);
            boolean isExpected = expectedAllowed.contains(t);
            Assertions.assertEquals(isAllowed, isExpected, "Node type: " + t);
        }
    }

    @Test
    public void dirCntNodes() {
        AllowedNodes edges = new AllowedNodes("dir,cnt");
-        Set<Node.Type> expected = Set.of(Node.Type.DIR, Node.Type.CNT);
+        Set<SwhType> expected = Set.of(SwhType.DIR, SwhType.CNT);
        assertNodeRestriction(edges, expected);
    }

    @Test
    public void revDirNodes() {
        AllowedNodes edges = new AllowedNodes("rev,dir");
-        Set<Node.Type> expected = Set.of(Node.Type.DIR, Node.Type.REV);
+        Set<SwhType> expected = Set.of(SwhType.DIR, SwhType.REV);
        assertNodeRestriction(edges, expected);
    }

    @Test
    public void relSnpCntNodes() {
        AllowedNodes edges = new AllowedNodes("rel,snp,cnt");
-        Set<Node.Type> expected = Set.of(Node.Type.REL, Node.Type.SNP, Node.Type.CNT);
+        Set<SwhType> expected = Set.of(SwhType.REL, SwhType.SNP, SwhType.CNT);
        assertNodeRestriction(edges, expected);
    }

    @Test
    public void allNodes() {
        AllowedNodes edges = new AllowedNodes("*");
-        Set<Node.Type> expected = Set.of(Node.Type.REL, Node.Type.SNP, Node.Type.CNT, Node.Type.DIR, Node.Type.REV,
-                Node.Type.ORI);
+        Set<SwhType> expected = Set.of(SwhType.REL, SwhType.SNP, SwhType.CNT, SwhType.DIR, SwhType.REV, SwhType.ORI);
        assertNodeRestriction(edges, expected);
    }

    @Test
    public void noNodes() {
        AllowedNodes edges = new AllowedNodes("");
-        Set<Node.Type> expected = Set.of();
+        Set<SwhType> expected = Set.of();
        assertNodeRestriction(edges, expected);
    }
}
diff --git a/java/src/test/java/org/softwareheritage/graph/compress/ExtractNodesTest.java b/java/src/test/java/org/softwareheritage/graph/compress/ExtractNodesTest.java
index 4576aae..9338a2f 100644
--- a/java/src/test/java/org/softwareheritage/graph/compress/ExtractNodesTest.java
+++ b/java/src/test/java/org/softwareheritage/graph/compress/ExtractNodesTest.java
@@ -1,113 +1,113 @@
/*
 * Copyright (c) 2022 The Software Heritage developers
 * See the AUTHORS file at the top-level directory of this distribution
 * License: GNU General Public License version 3, or any later version
 * See top-level LICENSE file for more information
 */

package org.softwareheritage.graph.compress;

import org.apache.commons.codec.digest.DigestUtils;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;
import org.softwareheritage.graph.GraphTest;
-import org.softwareheritage.graph.Node;
+import org.softwareheritage.graph.SwhType;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.TreeSet;

public class ExtractNodesTest extends GraphTest {
    /** Generate a fake SWHID for a given node type and numeric ID */
    private static byte[] f(String type, int id) {
        String hash = new String(DigestUtils.sha1Hex(type + id).getBytes());
        return String.format("swh:1:%s:%s", type, hash).getBytes();
    }

    static class FakeDataset implements GraphDataset {
        @Override
        public void readEdges(NodeCallback nodeCb, EdgeCallback edgeCb) throws IOException {
            // For each node type, write nodes {1..4} as present in the graph
-            for (Node.Type type : Node.Type.values()) {
+            for (SwhType type : SwhType.values()) {
                for (int i = 1; i <= 4; i++) {
                    byte[] node = f(type.toString().toLowerCase(), i);
                    nodeCb.onNode(node);
                }
            }

            edgeCb.onEdge(f("ori", 1), f("snp", 1), null, -1);
            edgeCb.onEdge(f("ori", 2), f("snp", 2), null, -1);
            edgeCb.onEdge(f("ori", 3), f("snp", 3), null, -1);
            edgeCb.onEdge(f("ori", 4), f("snp", 404), null, -1);

            edgeCb.onEdge(f("snp", 1), f("rev", 1), "dup1".getBytes(), -1);
            edgeCb.onEdge(f("snp", 1), f("rev", 1), "dup2".getBytes(), -1);
            edgeCb.onEdge(f("snp", 3), f("cnt", 1), "c1".getBytes(), -1);
            edgeCb.onEdge(f("snp", 4), f("rel", 1), "r1".getBytes(), -1);

            edgeCb.onEdge(f("rel", 1), f("rel", 2), null, -1);
            edgeCb.onEdge(f("rel", 2), f("rev", 1), null, -1);
            edgeCb.onEdge(f("rel", 3), f("rev", 2), null, -1);
            edgeCb.onEdge(f("rel", 4), f("dir", 1), null, -1);

            edgeCb.onEdge(f("rev", 1), f("rev", 1), null, -1);
            edgeCb.onEdge(f("rev", 1), f("rev", 1), null, -1);
            edgeCb.onEdge(f("rev", 1), f("rev", 2), null, -1);
            edgeCb.onEdge(f("rev", 2), f("rev", 404), null, -1);
            edgeCb.onEdge(f("rev", 3), f("rev", 2), null, -1);
            edgeCb.onEdge(f("rev", 4), f("dir", 1), null, -1);

            edgeCb.onEdge(f("dir", 1), f("cnt", 1), "c1".getBytes(), 42);
            edgeCb.onEdge(f("dir", 1), f("dir", 1), "d1".getBytes(), 1337);
            edgeCb.onEdge(f("dir", 1), f("rev", 1), "r1".getBytes(), 0);
        }
    }

    @Test
    public void testExtractNodes(@TempDir Path outputDir, @TempDir Path sortTmpDir)
            throws IOException, InterruptedException {
        FakeDataset dataset = new FakeDataset();
        ExtractNodes.extractNodes(dataset, outputDir.toString() + "/graph", "2M", sortTmpDir.toFile());

        // Check count files
        Long nodeCount = Long.parseLong(Files.readString(outputDir.resolve("graph.nodes.count.txt")).strip());
        Long edgeCount = Long.parseLong(Files.readString(outputDir.resolve("graph.edges.count.txt")).strip());
        Long labelCount = Long.parseLong(Files.readString(outputDir.resolve("graph.labels.count.txt")).strip());
        Assertions.assertEquals(26L, nodeCount);
        Assertions.assertEquals(21L, edgeCount);
        Assertions.assertEquals(5L, labelCount);

        // Check stat files
        List<String> nodeStats = Files.readAllLines(outputDir.resolve("graph.nodes.stats.txt"));
        List<String> edgeStats = Files.readAllLines(outputDir.resolve("graph.edges.stats.txt"));
        Assertions.assertEquals(nodeStats, List.of("cnt 4", "dir 4", "ori 4", "rel 4", "rev 5", "snp 5"));
        Assertions.assertEquals(edgeStats, List.of("dir:cnt 1", "dir:dir 1", "dir:rev 1", "ori:snp 4", "rel:dir 1",
                "rel:rel 1", "rel:rev 2", "rev:dir 1", "rev:rev 5", "snp:cnt 1", "snp:rel 1", "snp:rev 2"));

        // Build ordered set of expected node IDs
        TreeSet<String> expectedNodes = new TreeSet<>();
-        for (Node.Type type : Node.Type.values()) {
+        for (SwhType type : SwhType.values()) {
            for (int i = 1; i <= 4; i++) {
                byte[] node = f(type.toString().toLowerCase(), i);
                expectedNodes.add(new String(node));
            }
        }
        expectedNodes.add(new String(f("snp", 404)));
        expectedNodes.add(new String(f("rev", 404)));
        String[] nodeLines = readZstFile(outputDir.resolve("graph.nodes.csv.zst"));
        Assertions.assertArrayEquals(expectedNodes.toArray(new String[0]), nodeLines);

        // Build ordered set of expected label IDs
        TreeSet<String> expectedLabels = new TreeSet<>();
        expectedLabels.add("dup1");
        expectedLabels.add("dup2");
        expectedLabels.add("c1");
        expectedLabels.add("r1");
        expectedLabels.add("d1");
        String[] labelLines = readZstFile(outputDir.resolve("graph.labels.csv.zst"));
        Assertions.assertArrayEquals(expectedLabels.toArray(new String[0]), labelLines);
    }
}