Differential D7839 Diff 28444 docs/java-api.rst

Changeset View

Standalone View

View Options

docs/java-api.rst

This file was added.

.. _swh-graph-java-api:

Using the Java API

==================

.. highlight:: java

While the :ref:`HTTP API <swh-graph-api>` is useful for many common use-cases,

it is often not sufficient to implement more complex algorithms. This section

describes the low-level Java API that ``swh-graph`` provides on top of the

WebGraph framework to manipulate the compressed graph of Software Heritage.

A cursory understanding of the `WebGraph framework

<https://webgraph.di.unimi.it/>`_ and its API is helpful to understand the

notions detailed here.

.. _swh-graph-java-basics:

Basics

------

In the WebGraph framework, graphs are generally subclasses of

`ImmutableGraph

<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/ImmutableGraph.html>`_,

the abstract class providing the core API to manipulate and iterate on graphs.

Under the hood, compressed graphs are stored as the `BVGraph

<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/BVGraph.html>`_

class, which contains the actual codec used to compress and decompress

adjacency lists.

Graphs **nodes** are mapped to a contiguous set of integers :math:`[0, n - 1]`

where *n* is the total number of nodes in the graph.

JaredR26Unsubmitted

Done

Graphs **nodes** are mapped to a contiguous set of integers :math:`[0, n - 1]`

- where *n* is the total number of nodes in the graph).

+ where *n* is the total number of nodes in the graph.

Each node has an associated *adjacency list*, i.e., a list of nodes going from

JaredR26:

Each node has an associated *adjacency list*, i.e., a list of nodes going from

that source node to a destination node. This list represents the **edges** (or

**arcs**) of the graph.

**Note**: edges are always directed. Undirected graphs are internally stored as

a pair of directed edges (src → dst, dst → src), and are called "symmetric"

JaredR26Unsubmitted

Done

Does this ever apply to SWH, anywhere? It is interesting to know but might confuse people who haven't yet read that everything in SWH is unidirectional.

JaredR26: Does this ever apply to SWH, anywhere? It is interesting to know but might confuse people who…

seirlAuthorUnsubmitted

Done

For any algorithm where you don't care about the direction, yes. For instance if you want to compute connected components, for LLP, or even for a BFS if you want.

seirl: For any algorithm where you don't care about the direction, yes. For instance if you want to…

graphs.

On disk, a simple BVGraph with the basename ``graph`` would be represented as

the following set of files:

- ``graph.graph``: contains the compressed adjacency lists of each node, which

can be decompressed by the BVGraph codec.

- ``graph.properties``: contains metadata on the graph, such as the number of

nodes and arcs, as well as additional loading information needed by the

BVGraph codec.

- ``graph.offsets``: a list of offsets of where the adjacency list of each node

is stored in the main graph file.

- ``graph.obl``: optionally, an "offset big-list file" which can be used to

load graphs faster.

JaredR26Unsubmitted

Done

I think this section needs to be lower down., or maybe in an advanced details page. It's an implementation detail that is useful to know when someone is down to the nuts and bolts, but not information they need to know or understand when doing more basic or early work. And when someone gets that far down, they probably want to know about more of the files than just these.

JaredR26: I think this section needs to be lower down., or maybe in an advanced details page. It's an…

seirlAuthorUnsubmitted

Done

I disagree, it's useful to know in advance because it tells you what are the files you need to download for your particular use case.

seirl: I disagree, it's useful to know in advance because it tells you what are the files you need to…

JaredR26Unsubmitted

Not Done

That's a good point. Then I guess more to Zack's point, putting the full list in one spot would help instead of having to scroll and learn piecewise. It's the kind of info people will reference but not learn.

JaredR26: That's a good point. Then I guess more to Zack's point, putting the full list in one spot…

An ImmutableGraph can be loaded using different *load methods*, which have each

different performance implications:

- ``load()``: the entire graph is loaded in RAM and supports random access.

- ``loadMapped()``: the graph is loaded by memory-mapping it from disk (see

``mmap(1)``), at the cost of being potentially slower, especially when doing

random access on slow storage.

- ``loadOffline()``: no data is actually loaded is memory, only sequential

iteration is possible.

The following code loads a graph stored on disk under the ``compressed/graph``

basename, using the memory-mapped loading mode, and stores it as a generic

ImmutableGraph:

.. code-block:: java

ImmutableGraph graph = ImmutableGraph.loadMapped("compressed/graph");

JaredR26Unsubmitted

Done

Shouldn't we encourage people to use the SWH wrappers w/ properties support, etc, instead of the ImmutableGraph directly? And above, the ImmutableGraph info is helpful when someone gets stuck or tries to push the limits as I have, but initially people more just want to know how to utilize the SWH wrappers you've built.

JaredR26: Shouldn't we encourage people to use the SWH wrappers w/ properties support, etc, instead of…

seirlAuthorUnsubmitted

Done

Again, it depends. The Swh*Graph wrappers are great, but they require other files. We want people to be able to only fetch the minimum they need for their analysis. I'm leaving that as-is but I'm writing a note that says most of the time they'll want to use SwhUnidirectionalGraph.

seirl: Again, it depends. The Swh*Graph wrappers are great, but they require other files. We want…

Note that most of the time you will want to use the SWH-specific subclass

**SwhUnidirectionalGraph** instead, which has the same API as an ImmutableGraph

except it adds other SWH-specific methods. It is described later in the

:ref:`swh-graph-java-node-mappings` section.

Running the code

----------------

JaredR26Unsubmitted

Done

It's more work but I'd suggest a simple "hello world" java project quickstart for this. Barebones like load a graph, convert a swhid to a nodeId, step forward arbitrarily, print the new nodeid or SWHID, exit. Include imports, compilation and execution example commands.

I say this as someone frequently tackling projects of varying sizes. The more of a foundational example I have to build from, the more likely I am to start things off right (or not get frustrated and switch to a different solution/approach/source entirely). I hate trying to wrangle with project setup, dependency hell, path and environment variables setup, and every language & project is different. So if there's a simple working "hello world java" thing, I'm going to do exactly that before I start writing my own stuff.

For example I've hacked my rather complicated(at this point) project into the SWH utils directory, just because I hadn't worked with java-linux-CLI stuff like this before and wanted to get something going quickly (and it worked!). If I were to redo it, I'd wrap it into a separate project as you describe above.

JaredR26: It's more work but I'd suggest a simple "hello world" java project quickstart for this.

seirlAuthorUnsubmitted

Done

I think having an "examples" directory is a great idea. I'll do that in another diff.

seirl: I think having an "examples" directory is a great idea. I'll do that in another diff.

To run a piece of Java code written using the Java API, you need to run it with

all the dependencies in your classpath (the WebGraph libraries and the

swh-graph library). The easiest way to do it is to add the *fat jar*

shipped in the swh-graph library on PyPI, which contains all the required

dependencies.

.. code-block:: console

$ java -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java

Note that to load bigger graphs, the default heap size of the JVM is likely to

be insufficient to load entire graphs in memory. It is advised to increase this

heap size with the ``-Xmx`` flag:

.. code-block:: console

$ java -Xmx300G -cp venv/share/swh-graph/swh-graph-0.5.2.jar MyAlgo.java

For more information on performance tuning and memory considerations, you

should also read the :ref:`swh-graph-memory` page, in which we recommend

additional JVM options for loading large graphs.

Simple traversal

----------------

The ImmutableGraph class provides primitives to iterate and traverse graphs. It

contains the following methods:

- ``graph.numNodes()`` returns the number of nodes in the graph (*n*).

JaredR26Unsubmitted

Done

And, given a node ID :math:`k \in [0, n - 1]`:

- - ``graph.successors(k)`` returns an LazyLongIterator on the nodes that are

+ - ``graph.successors(k)`` returns a LazyLongIterator on the nodes that are

*adjacent* to *k* (i.e., its *neighbors*).

JaredR26:

- ``graph.numArcs()`` returns the number of arcs in the graph.

JaredR26Unsubmitted

Done

- ``graph.successors(k)`` returns an LazyLongIterator on the nodes that are

- *adjacent* to *k* (i.e., its *neighbors*).

+ *adjacent* to *k* (i.e., subsequent steps in the directed graph).

- ``graph.outdegree(k)`` returns the number of neighbors of *k*.

JaredR26:

seirlAuthorUnsubmitted

Done

I don't understand this comment, the goal here is to introduce the "neighbor" terminology.

seirl: I don't understand this comment, the goal here is to introduce the "neighbor" terminology.

JaredR26Unsubmitted

Done

I think that trying to use neighbors as terminology is going to confuse people, and adjacent is almost as bad. Neighbors don't have a direction; Successors and predecessors ALWAYS do. The graph ALWAYS does. Even if a person is swapping between viewing it in reverse or forward to find connections to a node, they still have to do two separate operations to find the 'before neighbors' and 'after neighbors', so they still need to be aware of the distinction that everything is either a before neighbor or an after neighbor, at least at first.

Maybe subsequent steps isn't the best word, but neighbors drops the directionality entirely. For me learning the structure of this early on, I constantly had to remind myself that "adjacent" nodes always meant before or after (or as above it means "before plus after" but the distinction is still there).

JaredR26: I think that trying to use neighbors as terminology is going to confuse people, and adjacent is…

seirlAuthorUnsubmitted

Done

I think "outgoing neighbors" should fix the issue.

seirl: I think "outgoing neighbors" should fix the issue.

And, given a node ID :math:`k \in [0, n - 1]`:

- ``graph.successors(k)`` returns a LazyLongIterator on the nodes that are

*adjacent* to *k* (i.e., its outgoing *neighbors*).

- ``graph.outdegree(k)`` returns the number of outgoing neighbors of *k*.

Example: Average outdegree

~~~~~~~~~~~~~~~~~~~~~~~~~~

The following code can be used to compute the average

outdegree of a graph, which is a useful measure of its density:

.. code-block:: java

public static long averageOutdegree(ImmutableGraph graph) {

return ((long) graph.numArcs()) / graph.numNodes();

}

Example: Degree distributions

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the ``outdegree()`` primitive, we can compute the outdegree distribution

of the graph by iterating on all its nodes. The distribution will be returned

as a map that associates to each degree *d* the number of nodes with outdegree

*d*.

.. code-block:: java

public static Map<Long, Long> outdegreeDistribution(ImmutableGraph graph) {

HashMap<Long, Long> distribution = new HashMap<Long, Long>();

for (long k = 0; k < graph.numNodes(); ++k) {

distribution.merge(graph.outdegree(k), 1L, Long::sum);

}

return distribution;

}

Example: Depth-First Traversal

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``successors`` primitive can be used to write a simple stack-based DFS

traversal on the graph which starts from a given node and prints all the

descendant nodes in its transitive closure:

.. code-block:: java

:emphasize-lines: 10

public static void visitNodesDFS(ImmutableGraph graph, long srcNodeId) {

Stack<Long> stack = new Stack<>();

HashSet<Long> visited = new HashSet<Long>();

stack.push(srcNodeId);

visited.add(srcNodeId);

while (!stack.isEmpty()) {

long currentNodeId = stack.pop();

System.out.println(currentNodeId);

LazyLongIterator it = graph.successors(currentNodeId);

for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {

if (!visited.contains(neighborNodeId)) {

stack.push(neighborNodeId);

visited.add(neighborNodeId);

}

Example: Breadth-First Traversal

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Swapping the stack for a queue changes the traversal order from depth-first

to breadth-first:

.. code-block:: java

JaredR26Unsubmitted

Not Done

I'll add a comment on the other page, but the performance & tuning page really needs to mention the DSI fastutils classes and give some examples. HashSet<Long> won't scale to the big dataset for most usecases, but they won't realize that until they try to run it and fail.

I'll try to note some good examples when I get to that page.

JaredR26: I'll add a comment on the other page, but the performance & tuning page really needs to mention…

:emphasize-lines: 2

public static void visitNodesBFS(ImmutableGraph graph, long srcNodeId) {

Queue<Long> queue = new ArrayDeque<>();

HashSet<Long> visited = new HashSet<Long>();

queue.add(srcNodeId);

visited.add(srcNodeId);

while (!queue.isEmpty()) {

long currentNodeId = queue.poll();

System.out.println(currentNodeId);

LazyLongIterator it = graph.successors(currentNodeId);

for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {

if (!visited.contains(neighborNodeId)) {

queue.add(neighborNodeId);

visited.add(neighborNodeId);

}

.. _swh-graph-java-node-mappings:

Node types and SWHIDs

---------------------

In the Software Heritage archive, nodes are not represented by a simple

integer, but by a :ref:`SWHID <persistent-identifiers>`, which contain both the

*type* of the node (revision, directory, blob...) and its unique identifier. We

use **node mappings** which allow us to translate between SWHIDs and the

compact node IDs used in the compressed graph.

Most notably, we use a MPH (Minimal Perfect Hash) function implemented in the

`GOVMinimalPerfectHashFunction

<http://sux.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/GOVMinimalPerfectHashFunction.html>`_

zackUnsubmitted

Done

Given this info is split throughout this file (and referenced from outside), I'm wondering if it should be put in a separate, self-contained page ("compressed graph on-disk format", or something such) and referenced where it's needed. I suspect a lot of people will want to read/reference that, even if they don't need to understand the Java API.

zack: Given this info is split throughout this file (and referenced from outside), I'm wondering if…

seirlAuthorUnsubmitted

Done

For me, it's really important that the files are documented in this file, because we want to encourage people only fetching the files they need for their own research goal. We could add another page, but the information is going to be even more spread out. I feel like the current version is a good compromise, but if someone has a better idea maybe we could do that in a subsequent diff.

seirl: For me, it's really important that the files are documented in this file, because we want to…

JaredR26Unsubmitted

Not Done

I'd suggest moving it to a lower section and hyperlinking to it. You can list the relevant ones inline if you want, without bullet points, and people can find the info about them when they need it. As I mentioned, it's the kind of information they won't "learn" when reading, but they will want to reference when they get to that part of their project (I did and am!).

JaredR26: I'd suggest moving it to a lower section and hyperlinking to it. You can list the relevant…

class of the Sux4J library, which maps N keys to N consecutive integers with no

collisions.

The following files are used to store the mappings between the nodes and their

types:

- ``graph.mph``: contains a serialized minimal perfect hash function computed

on the list of all the SWHIDs in the graph.

- ``graph.order``: contains the permutation that associates with each output of

the MPH the node ID to which it corresponds

- ``graph.node2swhid.bin``: contains a compact binary representation of all the

SWHIDs in the graph, ordered by their rank in the graph file.

JaredR26Unsubmitted

Done

This mention / introduction can be much higher. There's many reasons people would want to use SwhUnidirectionalGraph, and relatively few I can think of where they should use ImmutableGraph directly instead. It might be better to pretend ImmutableGraph doesn't exist until deeper in the page.

JaredR26: This mention / introduction can be much higher. There's many reasons people would want to use…

seirlAuthorUnsubmitted

Done

Disagree for the reasons mentioned above, but I did leave a note earlier.

seirl: Disagree for the reasons mentioned above, but I did leave a note earlier.

- ``graph.node2type.bin``: contains a `LongBigArrayBitVector

<https://dsiutils.di.unimi.it/docs/it/unimi/dsi/bits/LongBigArrayBitVector.html>`_

which stores the type of each node.

To use these mappings easily, we provide the class **SwhUnidirectionalGraph**,

an ImmutableGraph which wraps the underlying graph and adds a few

utility methods to obtain SWH-specific information on the graph.

A SwhUnidirectionalGraph can be loaded in a similar way to any ImmutableGraph,

as long as the mapping files listed above are present::

SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);

This class exposes the same graph primitives as an ImmutableGraph, but it

additionally contains the following methods:

- ``SWHID getSWHID(long nodeId)``: returns the SWHID associated with a given

node ID. This function does a lookup of the SWHID at offset *i* in the file

``graph.node2swhid.bin``.

- ``long getNodeID(SWHID swhid)``: returns the node ID associated with a given

SWHID. It works by hashing the SWHID with the function stored in

``graph.mph``, then permuting it using the permutation stored in

``graph.order``. It does additional domain-checking by calling ``getSWHID()``

on its own result to check that the input SWHID was valid.

- ``Node.Type getNodeType(long nodeID)``: returns the type of a given node, as

an enum of all the different object types in the Software Heritage data

model. It does so by looking up the value at offset *i* in the bit vector

stored in ``graph.node2type.bin``.

Example: Find the target directory of a revision

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As an example, we use the methods mentioned above to perform the

following task: "given a revision, return its target directory". To do so, we

first look up the node ID of the given revision in the compressed graph. We

iterate on the successors of that node, and return the SWHID of the first

destination node that has the "directory" type.

.. code-block:: java

:emphasize-lines: 2

public SWHID findDirectoryOfRevision(SwhUnidirectionalGraph graph, SWHID revSwhid) {

long src = graph.getNodeId(revSwhid);

assert graph.getNodeType(src) == Node.Type.REV;

LazyLongIterator it = graph.successors(currentNodeId);

for (long dst; (dst = it.nextLong()) != -1;) {

if (graph.getNodeType(dst) == Node.Type.DIR) {

return graph.getSWHID(dst);

}

throw new RuntimeError("Revision has no target directory");

}

.. _swh-graph-java-node-properties:

Node properties

---------------

The Software Heritage Graph is a *property graph*, which means it has various

properties associated with its nodes and edges (e.g., commit timestamps,

authors, messages, ...). We compress these properties and store them in files

alongside the compressed graph. This allows you to write traversal algorithms

that depend on these properties.

By default, properties are not assumed to be present are are not loaded when

the graph itself is loaded. If you want to use a property, you need to

explicitly load it first. As an example, this is how you load the "content

length" property to get the length of a given blob::

SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.load(basename);

graph.loadContentLength();

long blobSize = graph.getContentLength(graph.getNodeID(swhid));

The documentation of the SwhGraphProperties class (**TODO: link**) lists all

the different properties, their types, and the methods used to load them and to get

their value for a specific node.

A few things of note:

- A single loading call can load multiple properties at once; this is because

they are stored in the same file to be more space efficient.

JaredR26Unsubmitted

Done

I'd also mention getUrl and that origin-url's are stored as a node property, not an edge property, The difference gets confusing when someone is trying to understand directory & revision labels.

JaredR26: I'd also mention getUrl and that origin-url's are stored as a node property, not an edge…

JaredR26Unsubmitted

Done

Also something, somewhere, needs to document the whole swh:1:ori:LONGHASHSTRING issue; The website doesn't seem to support ANY ori:HASH lookups, whereas the graph ONLY supports ori:HASH lookups. A reverse mapping might be very helpful, and the ability to lookup by ori:HASH on the website would be helpful, but at least documenting the difference is a must here.

Sorry if I mentioned this twice, can't recall if I mentioned it already.

JaredR26: Also something, somewhere, needs to document the whole swh:1:ori:LONGHASHSTRING issue; The…

- Persons (authors, committers etc) are exported as a single pseudonymized

integer ID, that uniquely represents a full name + email.

- Timestamps are stored as a long integer (for the timestamp itself) and a

short integer (for the UTC offset).

.. _swh-graph-java-edge-properties:

Edge labels

-----------

While looking up graph properties on the *nodes* of the graph is relatively

straightforward, doing so for labels on the *arcs* is comparatively more

difficult. These include the names and permissions of directory entries, as

well as the branch names of snapshots.

The `ArcLabelledImmutableGraph

<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/labelling/ArcLabelledImmutableGraph.html>`_

class in WebGraph wraps an ImmutableGraph, but augments its iterators by making them

*labelled iterators*, which essentially allow us to look up the label of the

arcs while iterating on them.

This labelled graph is stored in the following files:

- ``graph-labelled.properties``: a property file describing the graph, notably

containing the basename of the wrapped graph.

- ``graph-labelled.labels``: the compressed labels

JaredR26Unsubmitted

Done

I haven't tested it recently but loadLabelledMapped didn't work for me a month or two ago, FYI. Crashed/exception infinite loop when trying to load it.

JaredR26: I haven't tested it recently but loadLabelledMapped didn't work for me a month or two ago, FYI.

seirlAuthorUnsubmitted

Done

Fixed in https://github.com/vigna/webgraph-big/pull/5 we just need to wait for a new release with this change

seirl: Fixed in https://github.com/vigna/webgraph-big/pull/5 we just need to wait for a new release…

- ``graph-labelled.labeloffsets``: the offsets used to access the labels in

random order.

The SwhUnidirectionalGraph class contains *labelled* loading methods

(``loadLabelled()``, ``loadLabelledMapped()``, ...). When these loading methods

are used instead of the standard non-labelled ones, the graph is loaded as an

ArcLabelledImmutableGraph instead of an ImmutableGraph. The following methods

can then be used:

- ``labelledSuccessors(k)`` returns a `LabelledArcIterator

<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/labelling/ArcLabelledNodeIterator.LabelledArcIterator.html>`_

which is used in the same way as a LazyLongIterator except it also contains a

``label()`` method to get the label of the currently traversed arc.

- ``labelledNodeIterator()`` returns an `ArcLabelledNodeIterator

<https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/labelling/ArcLabelledNodeIterator.html>`_

of all the nodes in the graph, which replaces the LazyLongIterator of the

``successor()`` function by a LabelledArcIterator similar to above.

Label format

~~~~~~~~~~~~

The labels of each arc are returned as a ``DirEntry[]`` array. They encode

both the name of a directory entry and its permissions. For snapshot branches,

only the "name" field is useful.

Arc label names are encoded as an integer ID representing each unique

entry/branch name present in the graph. To retrieve the actual name associated

with a given label ID, one needs to load the reverse mapping similar to how you

would do for a normal property::

SwhUnidirectionalGraph graph = SwhUnidirectionalGraph.loadLabelled(basename);

JaredR26Unsubmitted

Done

The byte array representing the actual label name can then be loaded with::

- byte[] name = graph.getLabelName(label.filenameId);

+ byte[] nameBytes = graph.getLabelName(label.filenameId);

+ String name = nameBytes== null ? new String() : new String(nameBytes);

Multiedges

Helpful to just show converting it into what people will want to use.

JaredR26: Helpful to just show converting it into what people will want to use.

seirlAuthorUnsubmitted

Done

We don't want people to assume that these bytestrings have a defined encoding though. It could be arbitrary bytes.

seirl: We don't want people to assume that these bytestrings have a defined encoding though. It could…

JaredR26Unsubmitted

Done

Are there any edges that have arbitrary bytes right now? No, right, they are all strings at least?

I think in that case, that's a very unlikely exception to a rule, kinda like having a directory named "/" or " ", or a file named "."

As an exception to the general rule, it would be worth mentioning it in a bullet point or casually, but I'd still suggest showing the normal-case handling.

If the structure changes enough to make "arbitrary bytes" a very common case, then update the documentation. I found "bytes" to be confusing personally when I knew I wanted strings and knew I was (?almost always?) dealing with strings.

JaredR26: Are there any edges that have arbitrary bytes right now? No, right, they are all strings at…

seirlAuthorUnsubmitted

Done

All the edges are arbitrary bytes, actually. Git doesn't normalize to unicode, and neither do we. These should really be handled as bytes, unless you're willing to lose data by ignoring encodings you can't decode.

seirl: All the edges are arbitrary bytes, actually. Git doesn't normalize to unicode, and neither do…

graph.loadLabelNames();

The byte array representing the actual label name can then be loaded with::

byte[] name = graph.getLabelName(label.filenameId);

Multiedges

~~~~~~~~~~

JaredR26Unsubmitted

Done

to the same two vertices. Consider for instance the case of a single directory

- ``test/`` containing twice the same file blob (e.g., the empty file), under two

+ ``test/`` containing twice the same file blob (e.g., a placeholder file), under two

different names (e.g., ``a`` and ``b``). The simple graph view of this

JaredR26:

The Software Heritage is not a *simple graph*, where at most one edge can exist

between two vertices, but a *multigraph*, where multiple edges can be incident

JaredR26Unsubmitted

Done

``test/`` containing twice the same file blob (e.g., the empty file), under two

- different names (e.g., ``a`` and ``b``). The simple graph view of this

- directory will represent it as a single edge ``test`` → *empty file*, while the

+ different names (e.g., ``item.c`` and ``net.c``). The simple graph view of this

+ directory will represent it as a single edge ``test`` → *placeholder file*, while the

multigraph view will represent it as *two* edges between the same nodes.

Due to the copy-list model of compression, WebGraph only stores simple graphs,

More clear with real example files I think.

JaredR26: More clear with real example files I think.

to the same two vertices. Consider for instance the case of a single directory

``test/`` containing twice the same file blob (e.g., the empty file), under two

different names (e.g., ``ISSUES.txt`` and ``TODO.txt``, both completely empty).

The simple graph view of this directory will represent it as a single edge

JaredR26Unsubmitted

Done

multigraph view will represent it as *two* edges between the same nodes.

- Due to the copy-list model of compression, WebGraph only stores simple graphs,

- and thus stores multiedges as single edges, to which we cannot associate a

- single file name.

+ WebGraph only stores simple graphs,

+ and thus stores multiedges as single edges, which cannot both have the name ``item.c`` and ``net.c``.

To represent this possibility of having multiple file names for a single arc,

JaredR26:

``test`` → *empty file*, while the multigraph view will represent it as *two*

edges between the same nodes.

Due to the copy-list model of compression, WebGraph only stores simple graphs,

and thus stores multiedges as single edges, to which we cannot associate

a single label name (in our example, we need to associate both names

``ISSUES.txt`` and ``TODO.txt``).

To represent this possibility of having multiple file names for a single arc,

in the case of multiple relationships between two identical nodes, each arc label is

stored as an *array* of DirEntry, each record representing one relationship

between two nodes.

Example: Printing all the entries of a directory

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following code showcases how one can print all the entries (name,

permission and target SWHID) of a given directory, using the labelled methods

seen above.

.. code-block:: java

public static void printEntries(ImmutableGraph g, long dirNode) {

LabelledArcIterator s = g.labelledSuccessors(dirNode);

for (long dst; (dst = it.nextLong()) >= 0;) {

DirEntry[] labels = (DirEntry[]) s.label().get();

for (DirEntry label : labels) {

System.out.format(

"%s %s %d\n",

graph.getSWHID(dst);

new String(graph.getLabelName(label.filenameId)),

label.permission

);

}

// Usage: $PROGRAM <GRAPH_BASENAME> <DIR_SWHID>

public static void main(String[] args) {

SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadLabelledMapped(args[0]);

g.loadLabelNames();

long dirNode = g.getNodeID(new SWHID(args[1]));

printEntries(g, dirNode);

}

Transposed graph

----------------

Up until now, we have only looked at how to traverse the *forward* graph, i.e.,

the directed graph whose edges are in the same direction as the Merkle DAG of

the Software Heritage archive.

For many purposes, especially that of finding the *provenance* of software

artifacts, it is useful to query the *backward* (or *transposed*) graph

instead, which is the same as the forward graph except all the edges are

reversed.

The transposed graph has its own set of files, counterparts to the files needed

for the forward graph:

- ``graph-transposed.graph``

- ``graph-transposed.properties``

- ``graph-transposed.offsets``

- ``graph-transposed.obl``

- ``graph-transposed-labelled.labels``

- ``graph-transposed-labelled.labeloffsets``

- ``graph-transposed-labelled.properties``

However, because node IDs are the same in the forward and the backward graph,

all the files that pertain to mappings between the node IDs and various

properties (SWHIDs, property data, node permutations etc) remain the same.

Example: Earliest revision containing a given blob

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following code loads all the committer timestamps of the revisions in the

graph, then walks the *transposed* graph to return the earliest revision

containing a given object.

.. code-block:: java

public static long findEarliestRevisionContaining(SwhUnidirectionalGraph g, long src) {

long oldestRev = -1;

long oldestRevTs = Long.MAX_VALUE;

Stack<Long> stack = new Stack<>();

HashSet<Long> visited = new HashSet<Long>();

stack.push(src);

visited.add(src);

while (!stack.isEmpty()) {

long currentNodeId = stack.pop();

LazyLongIterator it = graph.successors(currentNodeId);

for (long neighborNodeId; (neighborNodeId = it.nextLong()) != -1;) {

if (!visited.contains(neighborNodeId)) {

stack.push(neighborNodeId);

visited.add(neighborNodeId);

if (g.getNodeType(neighborNodeId) == Node.Type.REV) {

Long ts = g.getCommitterTimestamp(neighborNodeId);

if (ts != null && ts < oldestRevTs) {

oldestRev = neighborNodeId;

oldestRevTs = ts;

}

return oldestRev;

}

// Usage: $PROGRAM <GRAPH_BASENAME> <BLOB_SWHID>

public static void main(String[] args) {

// Load the backward (= transposed) graph as a SwhUnidirectionalGraph

SwhUnidirectionalGraph g = SwhUnidirectionalGraph.loadMapped(args[0] + "-transposed");

g.loadCommitterTimestamps();

long node = g.getNodeID(new SWHID(args[1]));

JaredR26Unsubmitted

Done

I think somewhere it would be good to mention or list some of the utilities that are in java/src/utils. They're both potentially useful for debugging and also good references for how to do some things. In particular, DumpProperties.java, FindEarliestRevision.java, and MPHTranslate.java.

JaredR26: I think somewhere it would be good to mention or list some of the utilities that are in…

seirlAuthorUnsubmitted

Done

These should probably go in the examples dir! Out of the scope of this diff I think, but good idea regardless.

seirl: These should probably go in the examples dir! Out of the scope of this diff I think, but good…

long oldestRev = findEarliestRevisionContaining(g, node);

System.out.println(g.getSWHID(oldestRev));

}

Bidirectional Graph

-------------------

BidirectionalImmutableGraph

~~~~~~~~~~~~~~~~~~~~~~~~~~~

While ``graph-transposed`` can be loaded as a simple SwhUnidirectionalGraph and

then manipulated just like the forward graph, it is often convenient to have

*both* the forward and the backward graph in memory. Some traversal algorithms

require first going down in the forward graph to select some nodes, then going

up to find their provenance.

To achieve that, we use the `BidirectionalImmutableGraph

<https://webgraph.di.unimi.it/docs-big/it/unimi/dsi/big/webgraph/BidirectionalImmutableGraph.html>`_

class from WebGraph, which stores both a graph and its transpose.

This class provides the following methods to iterate on the **backward** graph,

shown here with their counterparts for the forward graph:

.. list-table::

:header-rows: 1

* - Forward graph operation

- Backward graph operation

* - ``outdegree(k)``

- ``indegree(k)``

* - ``successors(k)``

- ``predecessors(k)``

In addition, the class offers a few convenience methods which are generally

useful when you have both a graph and its transpose:

- ``transpose()`` returns the transpose of the BidirectionalImmutableGraph by

inverting the references to the forward and the backward graphs. Successors

become predecessors, and vice-versa.

- ``symmetrize()`` returns the symmetric (= undirected) version of the

bidirectional graph. It is implemented by a union between the forward and the

backward graph, which basically boils down to removing the directionality of

the edges (the successors of a node are also its predecessors).

SwhBidirectionalGraph

~~~~~~~~~~~~~~~~~~~~~

Like for ImmutableGraph, we extend the BidirectionalImmutableGraph with

SWH-specific methods, in the subclass ``SwhBidirectionalGraph``. Notably, it

contains the method ``labelledPredecessors()``, the equivalent of

``labelledSuccessors()`` but on the backward graph.

Because SwhUnidirectionalGraph inherits from ImmutableGraph, and

SwhBidirectionalGraph inherits from BidirectionalImmutableGraph, we put the

common behavior between the two classes in a SwhGraph interface, which can

represent either an unidirectional or a bidirectional graph.

To avoid loading the node properties two times (once for each direction), they

are stored in a separate class called SwhGraphProperties. In a

SwhBidirectionalGraph, the two SwhUnidirectionalGraph share their node

properties in memory by storing references to the same SwhGraphProperty

object.

.. code-block:: text

┌──────────────┐

│ImmutableGraph◄────────┐

└────▲─────────┘ │extends

│ │

│ ┌──────────┴────────────────┐

extends│ │BidirectionalImmutableGraph│

│ └────────────▲──────────────┘

│ │extends

┌──────────────┴───────┐ ┌──────┴──────────────┐

│SwhUnidirectionalGraph│◄────┤SwhBidirectionalGraph│

└──┬──────────────┬────┘ └────────┬───────────┬┘

│ │ contains x2 │ │

│ │ │ │

│ implements│ │implements │

│ ┌▼──────────┐ │ │

│ │SwhGraph(I)◄────────┘ │

contains│ └───────────┘ │contains

│ │

│ ┌──────────────────┐ │

└────────────►SwhGraphProperties◄──────────────┘

└──────────────────┘

Example: Find all the shared-commit forks of a given origin

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to define the *forks* of an origin as being the set of origins

which share at least one revision with that origin.

The following code loads the graph in both directions using a

SwhBidirectionalGraph. Given an origin SWHID, it first walks the *forward*

graph to find all its root revisions. It then walks the *backward* graph to

find all the origins containing these root revisions, i.e., its *forks*.

.. code-block:: java

public static void findSharedCommitForks(SwhUnidirectionalGraph g, long srcOrigin) {

Stack<Long> forwardStack = new Stack<>();

HashSet<Long> forwardVisited = new HashSet<Long>();

Stack<Long> backwardStack = new Stack<>();

HashSet<Long> backwardVisited = new HashSet<Long>();

// First traversal (forward graph): find all the root revisions of the

// origin

forwardStack.push(srcOrigin);

forwardVisited.add(srcOrigin);

while (!forwardStack.isEmpty()) {

long curr = forwardStack.pop();

LazyLongIterator it = graph.successors(curr);

boolean isRootRevision = true;

for (long succ; (succ = it.nextLong()) != -1;) {

Node.Type nt = g.getNodeType(succ);

if (!forwardVisited.contains(succ)

&& nt != Node.Type.DIR && nt != Node.Type.CNT) {

forwardStack.push(succ);

forwardVisited.add(succ);

isRootRevision = false;

}

if (g.getNodeType(curr) == Node.Type.REV && isRootRevision) {

// Found a root revision, add it to the second stack

backwardStack.push(curr);

backwardVisited.add(curr);

}

// Second traversal (backward graph): find all the origins containing

// any of these root revisions and print them

while (!backwardStack.isEmpty()) {

long curr = backwardStack.pop();

LazyLongIterator it = graph.predecessors(curr);

boolean isRootRevision = true;

for (long succ; (succ = it.nextLong()) != -1;) {

Node.Type nt = g.getNodeType(succ);

if (!backwardVisited.contains(succ)) {

backwardStack.push(succ);

backwardVisited.add(succ);

if (nt == Node.Type.ORI) {

// Found an origin, print it.

System.out.println(g.getSWHID(succ));

}

// Usage: $PROGRAM <GRAPH_BASENAME> <ORI_SWHID>

public static void main(String[] args) {

// Load both forward and backward graphs as a SwhBidirectionalGraph

SwhBidirectionalGraph g = SwhBidirectionalGraph.loadMapped(args[0]);

long node = g.getNodeID(new SWHID(args[1]));

findSharedCommitForks(g, node);

}

Large-scale processing

----------------------

Multithreading

~~~~~~~~~~~~~~

ImmutableGraph is not thread-safe. When writing multithreaded algorithms,

calling ``successors()`` on the same graph from multiple threads will return

garbage.

Instead, each thread should create its own "lightweight copy" of the graph by

calling ``.copy()``. This will not actually copy the entire graph data, which

will remain shared across threads, but it will create new instances of the

iterators so that each thread can independently iterate on the graph data.

Data structures for large traversals

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When doing very large traversals, such as a BFS on the entire graph, the

usual data structures (HashSet, Stack, ArrayDeque, etc.) will be quite

inefficient. If you know you are going to traverse large parts of the graph,

it's better to use more appropriate data structures, a lot of which can be

found in the dsiutils library. In particular:

- `LongArrayBitVector

<https://dsiutils.di.unimi.it/docs/it/unimi/dsi/bits/LongArrayBitVector.html>`_

is an efficient bit-vector implementation, which can be used to store the

nodes that have already been seen in the visit. Its memory footprint is too

big to use for small traversals, but it is very efficient to traverse the

full graph, as every node only takes a single bit.

- `ByteDiskQueue

<https://dsiutils.di.unimi.it/docs/it/unimi/dsi/io/ByteDiskQueue.html>`_ can

be used to efficiently store the queue of nodes to visit on disk, when it is

too large to fit in RAM.

Other types in dsiutils and fastutil can save significant memory:

``LongArrayList`` saves at least 8 bytes per entry over ``ArrayList<Long>``,

and ``Long2LongOpenHashMap`` saves at least 16 bytes for every entry over

``HashMap<Long, Long>``. We strongly recommend reading the documentation of the

unimi libraries and looking at the code for usage examples.

BigArrays

~~~~~~~~~

When working with the Software Heritage graph, is often necessary to store

large arrays of values, with a size exceeding 2^32 items. Unfortunately,

standard Java arrays do not support this.

To circumvent this, WebGraph uses the `BigArrays scheme

<https://fastutil.di.unimi.it/docs/it/unimi/dsi/fastutil/BigArrays.html>`_ from

the fastutil library: "big arrays" are stored as arrays of arrays, supporting

quadrillions of records.

A BigArray ``long[][] a`` can be used with the following methods:

- ``BigArrays.get(a, i)`` to get the value at index *i*

- ``BigArrays.set(a, i, v)`` to set the value at index *i* to *v*.

- ``BigArrays.length(a)`` to get the total length of the bigarray.