# Edge dataset

The dataset in this folder contains only information about the **edges** of the
Software Heritage Graph (and none of the associated metadata). This is useful
for studying the **topology** of the graph.

- Each `.edges.csv.gz` file contains all edges of a given `<src, dst>` type. The
  format is a gzip-compressed text file with one edge per line, as a
  `"SRC_ID SPACE DST_ID"` string, where identifiers are the intrinsic SHA1
  checksums of each node (hex-encoded, as usual), except for origins, which
  are identified by an integer ID.
- Each `.nodes.csv.gz` file contains a sorted list of unique node identifiers
  appearing in the corresponding `.edges.csv.gz` file. The format is a
  compressed text file with one hex-encoded SHA1 checksum per line.
- Each `.count` text file contains the number of lines in its matching file.
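
As an illustration of the edge file format described above, here is a minimal Python sketch that reads one `.edges.csv.gz` file and yields `(src, dst)` pairs. The function name `iter_edges` is hypothetical and not part of the dataset; the sketch assumes each non-empty line holds exactly two identifiers separated by a single space.

```python
import gzip
from typing import Iterator, Tuple


def iter_edges(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (src, dst) identifier pairs from one gzip-compressed edge file.

    Assumption: every non-empty line has the form "SRC_ID DST_ID".
    """
    with gzip.open(path, "rt", encoding="ascii") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            src, dst = line.split(" ", 1)
            yield src, dst
```

For example, `for src, dst in iter_edges("rev_to_dir.edges.csv.gz"): ...` streams the file one edge at a time, which avoids loading a whole file into memory.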

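If you want the entire graph and do not care about the division by edge type,
it should be enough to concatenate all the `.edges.csv.gz` files (for example
with `cat`; concatenated gzip files form a valid gzip stream) and process the
result as if it were a single file.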
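
The same effect can be obtained without creating a concatenated file, by reading the edge files one after the other. The following is a sketch only: `iter_all_edges` is a hypothetical helper, and it assumes all the edge files sit in one local folder and follow the `*.edges.csv.gz` naming used here.

```python
import glob
import gzip
import os
from typing import Iterator, Tuple


def iter_all_edges(folder: str) -> Iterator[Tuple[str, str]]:
    """Yield (src, dst) pairs from every *.edges.csv.gz file in `folder`,
    ignoring which edge type each file holds."""
    pattern = os.path.join(folder, "*.edges.csv.gz")
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="ascii") as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2:  # skip blank or malformed lines
                    yield parts[0], parts[1]
```

Files are visited in sorted name order only to make the iteration order reproducible; the order has no meaning for the graph itself.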

If you want to pay attention to the edge types, the files are named as follows:

- `origin_to_snapshot.edges.csv.gz`: the edges from the origin ID (integer) to
  the snapshot ID (sha1).
- `snapshot_to_obj.edges.csv.gz`: the edges from the snapshot ID (sha1) to the
  object it points to (sha1), which can be a release, a snapshot, a revision,
  a directory or a content.
- `release_to_obj.edges.csv.gz`: the edges from the release ID (sha1) to the
  object it points to (sha1), which can be a release, a revision, a directory
  or a content.
- `rev_to_rev.edges.csv.gz`: the edges from each revision (sha1) to its parent
  revisions (sha1). This is the full **development history** recorded in the
  dataset.
- `rev_to_dir.edges.csv.gz`: the edges from each revision (sha1) to the
  directory it points to (sha1).
- `dir_to_dir.edges.csv.gz`: the edges from each directory (sha1) to its
  child directories (sha1).
- `dir_to_file.edges.csv.gz`: the edges from each directory (sha1) to its
  child files (sha1).
- `dir_to_rev.edges.csv.gz`: the edges from each directory (sha1) to its
  child revisions (sha1).
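
As an example of working with a single edge type, the sketch below counts the parents of each revision in `rev_to_rev.edges.csv.gz` and lists the revisions with two or more parents (merges). The helper names `parent_counts` and `merge_revisions` are hypothetical. The sketch keeps one counter per revision in memory, so it is only meant for a sample or a small subset of the file.

```python
import gzip
from collections import Counter
from typing import List


def parent_counts(path: str) -> Counter:
    """Count outgoing edges per source node in one edge file.

    For rev_to_rev, each line is "REVISION PARENT", so the count for a
    revision is its number of parents.
    """
    counts: Counter = Counter()
    with gzip.open(path, "rt", encoding="ascii") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:  # skip blank or malformed lines
                counts[parts[0]] += 1
    return counts


def merge_revisions(counts: Counter) -> List[str]:
    """Return, in sorted order, the revisions with at least two parents."""
    return sorted(rev for rev, n in counts.items() if n >= 2)
```

Root revisions (those without parents) never appear as a source in this file, so they are absent from the counter rather than counted as zero.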