We want support for unsupervised export from the archive content (as captured by swh-storage) to its graph structure (as required as input by swh-graph). Currently this is not easily doable with the Postgres backend (due to the huge join imposed by the directory entry layer), but it should be doable with the Cassandra backend.
The required export output is a pair of compressed textual files:
- `swh.nodes.csv.gz`: one node per line, represented as a SWH PID; this covers both Merkle DAG nodes and origins (origins are also written as SWH PIDs, using the "ori" qualifier)
- `swh.edges.csv.gz`: a pair of Merkle DAG nodes (or origins) per line, represented as SWH PIDs and separated by a space. First element of the pair is the edge "from" node, second is the edge "to" node.
- bonus point: also export `swh.{nodes,edges}.count` files, containing the total count of nodes/edges respectively
For Merkle DAG nodes the edges match the Merkle structure; for origin nodes outgoing edges point to the known snapshots of a given origin.
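A minimal sketch of a writer for the file layout described above, assuming the edges are already available as an iterable of `(from, to)` PID pairs. The function name `write_export`, its parameters, and the one-number-per-file layout of the `.count` files are hypothetical, not taken from swh-graph or swh-storage. Nodes are derived from edge endpoints and held in memory, which is only reasonable for small test datasets; a real export would enumerate nodes from the storage backend, which also covers nodes that have no edges.

```python
import gzip
from pathlib import Path


def write_export(edges, out_dir, prefix="swh"):
    """Write <prefix>.{nodes,edges}.csv.gz and <prefix>.{nodes,edges}.count.

    `edges` is an iterable of (src, dst) pairs of PID strings.
    Hypothetical helper: nodes are collected from edge endpoints only,
    so nodes without any edge are NOT exported by this sketch.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    nodes = set()
    edge_count = 0

    # One edge per line: "from" PID, a single space, "to" PID.
    with gzip.open(out / f"{prefix}.edges.csv.gz", "wt", encoding="ascii") as f:
        for src, dst in edges:
            f.write(f"{src} {dst}\n")
            nodes.update((src, dst))
            edge_count += 1

    # One node per line; sorted only to make the output deterministic.
    with gzip.open(out / f"{prefix}.nodes.csv.gz", "wt", encoding="ascii") as f:
        for pid in sorted(nodes):
            f.write(f"{pid}\n")

    # Assumed layout for the count files: a single integer and a newline.
    (out / f"{prefix}.nodes.count").write_text(f"{len(nodes)}\n")
    (out / f"{prefix}.edges.count").write_text(f"{edge_count}\n")

    return len(nodes), edge_count
```

For instance, calling `write_export([(origin_pid, snapshot_pid), (snapshot_pid, revision_pid)], "out/")` with three placeholder PIDs would produce an edges file with two lines and a nodes file with three lines.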
Examples
- the most recent nodes/edges export can be found here: https://annex.softwareheritage.org/public/dataset/graph/latest/edges/ (files `all.{nodes,edges}.{csv.gz,count}`)
- SQL queries to export from Postgres to the above format can be found [[ https://forge.softwareheritage.org/source/snippets/browse/master/sql/swh-graph/export/ | in snippets ]] (warning: they do not work on the full Postgres DB, so don't try that; they are also incomplete and do not export some edges, as noted in comments in the SQL)