
support graph export for the cassandra backend
Closed, Migrated. Edits Locked

Description

We want support for unsupervised export from the archive content (as captured by swh-storage) to its graph structure (the input required by swh-graph). Currently this is not easily doable with the Postgres backend (due to the huge join imposed by the directory entry layer), but it should be doable with the Cassandra backend.

The required export output is a pair of compressed textual files:

  • swh.nodes.csv.zst: one Merkle DAG node per line, represented as a SWH PID, plus one origin per line (also as a SWH PID, using the "ori" qualifier); this file should be sorted alphabetically
  • swh.edges.csv.zst: one pair of Merkle DAG nodes (or origins) per line, represented as SWH PIDs and separated by a space; the first element of the pair is the edge's "from" node, the second is its "to" node.
  • bonus point: also export swh.{nodes,edges}.count files, containing the total count of nodes/edges respectively

For Merkle DAG nodes the edges match the Merkle structure; for origin nodes outgoing edges point to the known snapshots of a given origin.
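
For illustration, a minimal sketch of the expected file layout (the PIDs below are made up, and the zstandard Python package is assumed for writing the .zst output):

    # Minimal sketch of the expected output layout; the PIDs are made up
    # for illustration, and the "zstandard" package is assumed for .zst output.
    import zstandard as zstd

    # One PID per line, sorted alphabetically (Merkle DAG nodes plus origins).
    nodes = sorted([
        "swh:1:dir:0000000000000000000000000000000000000001",
        "swh:1:rev:0000000000000000000000000000000000000002",
        "swh:1:ori:0000000000000000000000000000000000000003",
        "swh:1:snp:0000000000000000000000000000000000000004",
    ])
    # One "from to" pair per line: Merkle edges, plus origin -> snapshot edges.
    edges = [
        ("swh:1:rev:0000000000000000000000000000000000000002",
         "swh:1:dir:0000000000000000000000000000000000000001"),
        ("swh:1:ori:0000000000000000000000000000000000000003",
         "swh:1:snp:0000000000000000000000000000000000000004"),
    ]

    cctx = zstd.ZstdCompressor()

    with open("swh.nodes.csv.zst", "wb") as fh, cctx.stream_writer(fh) as out:
        for pid in nodes:
            out.write(f"{pid}\n".encode())

    with open("swh.edges.csv.zst", "wb") as fh, cctx.stream_writer(fh) as out:
        for src, dst in edges:
            out.write(f"{src} {dst}\n".encode())

    # Bonus: total node/edge counts.
    with open("swh.nodes.count", "w") as fh:
        fh.write(f"{len(nodes)}\n")
    with open("swh.edges.count", "w") as fh:
        fh.write(f"{len(edges)}\n")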

Examples

  • the most recent nodes/edges export can be found here: https://annex.softwareheritage.org/public/dataset/graph/latest/edges/ (files all.{nodes,edges}.{csv.gz,count})
  • SQL queries to export from Postgres to the above format can be found in the snippets repository (warning: they do not work on the full Postgres DB, so don't try that; also, they are incomplete and do not export some edges, as noted in comments in the SQL)

Related Objects

Event Timeline

zack triaged this task as Normal priority. Oct 31 2019, 2:09 PM
zack created this task.

I wrote a prototype for exporting revisions: https://forge.softwareheritage.org/source/snippets/browse/master/vlorentz/cassandra_stream_graph.py

It performs at 23k revisions per second. It's not parallel at all, but shouldn't be too hard to parallelize.
I just don't want to parallelize it now because I'm using it to troubleshoot https://issues.apache.org/jira/browse/CASSANDRA-15358
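
For illustration, a rough sketch of the kind of sequential scan such an export performs (this is not the linked prototype; the keyspace, table, and column names are assumptions):

    # Rough sketch of a sequential revision scan with the DataStax
    # cassandra-driver; keyspace, table, and column names are assumptions,
    # not taken from the prototype linked above.
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("swh")

    # Server-side paging keeps memory flat while streaming the whole table.
    query = SimpleStatement("SELECT id, directory FROM revision", fetch_size=10000)

    for row in session.execute(query):
        rev = f"swh:1:rev:{row.id.hex()}"
        dir_ = f"swh:1:dir:{row.directory.hex()}"
        print(rev)           # node line
        print(rev, dir_)     # rev -> dir edge line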

Looks good, thanks!

It's already a pretty good start: it would take ~12 hours to export all revisions with no (internal) parallelism.

Minor update: I've updated the spec to change the compression format from gz to zstd and to add the requirement that the nodes file should be sorted.
That sorting and compression can (and probably should) be outsourced to external tools, e.g., by writing to nodes/edges FIFOs.
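
A rough sketch of that FIFO approach (the file names and the pid_stream() placeholder are hypothetical):

    # Sketch: hand sorting and compression off to external tools via a FIFO
    # (file names and the pid_stream() placeholder are made up).
    import os
    import subprocess

    def pid_stream():
        # Stand-in for the real exporter's stream of node PIDs.
        yield "swh:1:rev:0000000000000000000000000000000000000002"
        yield "swh:1:dir:0000000000000000000000000000000000000001"

    os.mkfifo("swh.nodes.fifo")
    # External pipeline reads the FIFO, sorts, and compresses with zstd.
    pipeline = subprocess.Popen(
        "sort -u swh.nodes.fifo | zstd -q -f -o swh.nodes.csv.zst", shell=True
    )
    # The exporter only writes plain-text PIDs; everything else is external.
    with open("swh.nodes.fifo", "w") as fifo:
        for pid in pid_stream():
            fifo.write(pid + "\n")
    pipeline.wait()
    os.unlink("swh.nodes.fifo")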

Throughput improved to 34k/s just by not querying unneeded fields.

Added parallelism. 450k/s with 16 workers and no compression. I won't try with 32 workers because Python processes would use too much CPU on my machine.

Parallelism is done by spawning multiple Python processes from bash: https://forge.softwareheritage.org/source/snippets/browse/master/vlorentz/cassandra_stream_graph.sh
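
For illustration only, the same idea expressed in pure Python: split the Murmur3 token range across a pool of workers, each scanning its own slice of the table (not the linked bash script; table and column names are assumptions):

    # Illustration: split the Murmur3 token range across worker processes,
    # each scanning its own slice of the "revision" table (names assumed).
    from multiprocessing import Pool

    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1
    N_WORKERS = 16

    def split_ranges(n):
        step = (MAX_TOKEN - MIN_TOKEN) // n
        bounds = [MIN_TOKEN + i * step for i in range(n)] + [MAX_TOKEN]
        return list(zip(bounds, bounds[1:]))

    def export_range(bounds):
        lo, hi = bounds
        session = Cluster(["127.0.0.1"]).connect("swh")
        query = SimpleStatement(
            "SELECT id, directory FROM revision"
            " WHERE token(id) > %s AND token(id) <= %s",
            fetch_size=10000,
        )
        count = 0
        for row in session.execute(query, (lo, hi)):
            count += 1  # a real worker would emit node/edge lines here
        return count

    if __name__ == "__main__":
        with Pool(N_WORKERS) as pool:
            total = sum(pool.map(export_range, split_ranges(N_WORKERS)))
        print(total, "revisions scanned")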

> Added parallelism. 450k/s with 16 workers and no compression. I won't try with 32 workers because Python processes would use too much CPU on my machine.

<3

Do you think the export throughput will be roughly the same across different object types or not?

(because if yes, we can already estimate the total export time here; if not, we need per-type benchmarks)

Probably not. I'm working on adding support for other objects.

Forgot to mention it here, but it's done now.

I just ran it on Azure. It has a different schema (the "revision" table is split into "revision" and "revision_parent"), so the benchmarks are not exactly comparable.
I still use 16 workers, all running on the same machine, with no compression.

Exporting the list of revision nodes + links to directories: ~300k/s
Exporting the list of parents of each revision: ~650k/s (there is about one parent per revision on average)
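
Roughly, the split schema implies two separate scans, along these lines (column names are assumptions):

    # Sketch of the two scans implied by the split schema (column names are
    # assumptions): "revision" gives the rev -> dir edges, "revision_parent"
    # gives the rev -> parent rev edges.
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("swh")

    for row in session.execute(
            SimpleStatement("SELECT id, directory FROM revision", fetch_size=10000)):
        print(f"swh:1:rev:{row.id.hex()} swh:1:dir:{row.directory.hex()}")

    for row in session.execute(
            SimpleStatement("SELECT id, parent_id FROM revision_parent", fetch_size=10000)):
        print(f"swh:1:rev:{row.id.hex()} swh:1:rev:{row.parent_id.hex()}")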