# Edge dataset

The dataset in this folder contains only information about the **edges** of the Software Heritage Graph (and none of the associated metadata). This is useful for studying the **topology** of the graph.

- Each `.edges.csv.gz` file contains all edges of a given type. The format is a compressed text file with one edge per line, as a `"SRC_ID SPACE DST_ID"` string, where identifiers are the intrinsic SHA1 checksums of each node (hex-encoded, as usual).
- Each `.nodes.csv.gz` file contains a sorted list of the unique node identifiers appearing in the corresponding `.edges.csv.gz` file. The format is a compressed text file with one hex-encoded SHA1 checksum per line.
- Each `.count` text file contains the number of lines of its matching file.

If you want the entire graph and do not care about the division by edge type, it should be enough to concatenate all the edge files (for example with `cat`) and process the result as if it were a single file.

If you want to pay attention to the edge types, the files are named as follows:

- `origin_to_snapshot.edges.csv.gz`: the edges from the origin ID (integer) to the snapshot ID (sha1).
- `snapshot_to_obj.edges.csv.gz`: the edges from the snapshot ID (sha1) to the object it points to (sha1), which can be a release, a snapshot, a revision, a directory or a content.
- `release_to_obj.edges.csv.gz`: the edges from the release ID (sha1) to the object it points to (sha1), which can be a release, a revision, a directory or a content.
- `rev_to_rev.edges.csv.gz`: the edges from each revision (sha1) to its parent revisions (sha1). This is the full **development history** of the dataset.
- `rev_to_dir.edges.csv.gz`: the edges from each revision (sha1) to the directory it points to (sha1).
- `dir_to_dir.edges.csv.gz`: the edges from each directory (sha1) to its child directories (sha1).
- `dir_to_file.edges.csv.gz`: the edges from each directory (sha1) to its child files (sha1).
- `dir_to_rev.edges.csv.gz`: the edges from each directory (sha1) to its child revisions (sha1).
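
As an illustration of the file formats described above, here is a minimal Python sketch for streaming edges out of one `.edges.csv.gz` file. The function names (`iter_edges`, `out_degree_counts`, `read_count`) are hypothetical helpers, not part of the dataset. The sketch treats identifiers as opaque strings (origin IDs are integers, not SHA1s) and assumes that a `.count` file holds a single integer; adjust it if your files differ.

```python
import gzip
from collections import Counter
from typing import Iterator, Tuple


def iter_edges(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (src_id, dst_id) pairs from a gzip-compressed edge file.

    Each line is expected to look like "SRC_ID DST_ID" (one space).
    Identifiers are kept as strings and are not validated.
    """
    with gzip.open(path, "rt", encoding="ascii") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            src, dst = line.split(" ", 1)
            yield src, dst


def out_degree_counts(path: str) -> Counter:
    """Count outgoing edges per source node in one edge file."""
    counts: Counter = Counter()
    for src, _dst in iter_edges(path):
        counts[src] += 1
    return counts


def read_count(path: str) -> int:
    """Read a .count file, assuming it contains a single integer."""
    with open(path, "rt", encoding="ascii") as f:
        return int(f.read().strip())
```

For example, `out_degree_counts("rev_to_rev.edges.csv.gz")` would give the number of parents of each revision that has at least one. Note that the counter holds one entry per distinct source node in memory, so for the larger files a streaming or external-memory approach is likely to be more practical.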