Paste P399

Annex edge dataset README

Authored by haltode on May 12 2019, 1:56 PM.
# Edge dataset
The dataset in this folder contains only information about the **edges** of the
Software Heritage Graph (and none of the associated metadata). This is useful
for studying the **topology** of the graph.
- Each `.edges.csv.gz` file contains all edges of a given `<src, dst>` type. The
format is a compressed text file with one edge per line, as a `"SRC_ID
SPACE DST_ID"` string, where identifiers are the intrinsic SHA1 checksums of
each node (hex-encoded, as usual), except for origins, which are identified by
an integer ID.
- Each `.nodes.csv.gz` file contains a sorted list of unique node identifiers
appearing in the corresponding `.edges.csv.gz` file. The format is a
compressed text file with one hex-encoded SHA1 checksum per line.
- Each `.count` text file contains the number of lines of its matching file.
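
As a minimal sketch of the edge format described above, the following Python function streams the edges of a single file. The name `iter_edges` is illustrative and not part of any dataset tooling; the sketch assumes every non-empty line holds exactly two identifiers separated by a single space.

```python
import gzip
from typing import Iterator, Tuple


def iter_edges(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (src_id, dst_id) pairs from one `.edges.csv.gz` file."""
    # "rt" decompresses on the fly and decodes to text, one line at a time,
    # so the whole file never has to fit in memory.
    with gzip.open(path, "rt", encoding="ascii") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate a trailing blank line
            # Assumption: exactly two space-separated fields per line.
            src, dst = line.split(" ")
            yield src, dst
```

Because the function is a generator, it can be fed directly into a counter or a graph library without building an intermediate list.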
If you want the entire graph and do not care about the division by edge type,
it should be enough to `cat` all the `.edges.csv.gz` files together and process
the result as if it were a single file.
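
The same effect can be had without creating a concatenated file on disk. The sketch below chains several edge files into one stream of lines; the function name `iter_all_edge_lines` and the glob pattern in the usage note are assumptions made for illustration.

```python
import gzip
from typing import Iterable, Iterator


def iter_all_edge_lines(paths: Iterable[str]) -> Iterator[str]:
    """Yield every edge line from the given files, as if they were one file."""
    for path in paths:
        # Each file is decompressed lazily, one after the other.
        with gzip.open(path, "rt", encoding="ascii") as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    yield line
```

For example, `iter_all_edge_lines(sorted(glob.glob("*.edges.csv.gz")))` (after `import glob`) would walk every edge file in the current directory, leaving out the `.nodes.csv.gz` and `.count` files.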
If you want to pay attention to the edge types, the files are named as follows:
- `origin_to_snapshot.edges.csv.gz`: the edges from the origin ID (integer) to
the snapshot ID (sha1).
- `snapshot_to_obj.edges.csv.gz`: the edges from the snapshot ID (sha1) to the
object it points to (sha1), which can be a release, a snapshot, a revision, a
directory or a content.
- `release_to_obj.edges.csv.gz`: the edges from the release ID (sha1) to the
object it points to (sha1), which can be a release, a revision, a directory or
a content.
- `rev_to_rev.edges.csv.gz`: the edges from each revision (sha1) to its parent
revisions (sha1). This is the full **development history** of the dataset.
- `rev_to_dir.edges.csv.gz`: the edges from each revision (sha1) to the
directory it points to (sha1).
- `dir_to_dir.edges.csv.gz`: the edges from each directory (sha1) to its
child directories (sha1).
- `dir_to_file.edges.csv.gz`: the edges from each directory (sha1) to its
child files (sha1).
- `dir_to_rev.edges.csv.gz`: the edges from each directory (sha1) to its
child revisions (sha1), such as git submodules.
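
When processing several files at once, it can help to recover the edge type from the file name. The lookup table below is a sketch transcribed from the list above; the names `EDGE_FILE_TYPES` and `edge_types`, and the type labels used as strings, are illustrative choices rather than anything defined by the dataset.

```python
import os
from typing import Dict, Tuple

# Source node type and possible destination node types, keyed by the
# file name prefix. Transcribed from the README list; labels are illustrative.
EDGE_FILE_TYPES: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "origin_to_snapshot": ("origin", ("snapshot",)),
    "snapshot_to_obj": (
        "snapshot",
        ("release", "snapshot", "revision", "directory", "content"),
    ),
    "release_to_obj": (
        "release",
        ("release", "revision", "directory", "content"),
    ),
    "rev_to_rev": ("revision", ("revision",)),
    "rev_to_dir": ("revision", ("directory",)),
    "dir_to_dir": ("directory", ("directory",)),
    "dir_to_file": ("directory", ("file",)),
    "dir_to_rev": ("directory", ("revision",)),
}

EDGE_SUFFIX = ".edges.csv.gz"


def edge_types(filename: str) -> Tuple[str, Tuple[str, ...]]:
    """Return (source type, possible destination types) for an edge file."""
    base = os.path.basename(filename)
    if not base.endswith(EDGE_SUFFIX):
        raise ValueError(f"not an edge file: {filename}")
    return EDGE_FILE_TYPES[base[: -len(EDGE_SUFFIX)]]
```

Note that for the `*_to_obj` files the file name alone does not say which destination type a given edge has, only which types are possible.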
