P399
Annex edge dataset README
Authored by haltode on May 12 2019, 1:56 PM.
# Edge dataset
The dataset in this folder only contains information about the **edges** of the
Software Heritage Graph (and none of the associated metadata). This is useful
for studying the **topology** of the graph.
- Each `.edges.csv.gz` file contains all edges of a given `<src, dst>` type. The
format is a compressed text file with one edge per line, as a `"SRC_ID
SPACE DST_ID"` string, where identifiers are the intrinsic SHA1 checksums of
each node (hex-encoded, as usual), except for origins, which are identified by
an integer.
- Each `.nodes.csv.gz` file contains a sorted list of unique node identifiers
appearing in the corresponding `.edges.csv.gz` file. The format is a
compressed text file with one hex-encoded SHA1 checksum per line.
- Each `.count` text file contains the number of lines of its matching file.
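As a minimal sketch of reading these files in Python: the helper names below (`iter_edges`, `count_lines`, `read_count`) are hypothetical, and `read_count` assumes that a `.count` file holds nothing but a single integer, which this README does not spell out.

```python
import gzip
from typing import Iterator, Tuple


def iter_edges(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (src_id, dst_id) pairs from one .edges.csv.gz file."""
    with gzip.open(path, "rt", encoding="ascii") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate a trailing empty line
            # One edge per line: "SRC_ID SPACE DST_ID"
            src, dst = line.split(" ", 1)
            yield src, dst


def count_lines(path: str) -> int:
    """Count the lines of a gzip-compressed text file."""
    with gzip.open(path, "rb") as f:
        return sum(1 for _ in f)


def read_count(count_path: str) -> int:
    """Read a .count file (assumption: it contains a single integer)."""
    with open(count_path, "r", encoding="ascii") as f:
        return int(f.read().strip())
```

Comparing `count_lines(...)` with `read_count(...)` for a matching pair of files is a cheap way to check that a download was not truncated.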
If you want the entire graph and do not care about the division by edge type,
it should be enough to concatenate (`cat`) all the edge files together and
process them as if they were a single file.
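The same idea can be sketched in Python without creating a concatenated copy on disk. This assumes that "all files" means every `*.edges.csv.gz` file in one folder; the function name `iter_all_edges` and the `folder` parameter are hypothetical.

```python
import glob
import gzip
import os
from typing import Iterator, Tuple


def iter_all_edges(folder: str) -> Iterator[Tuple[str, str]]:
    """Yield (src_id, dst_id) from every edge file in `folder`, ignoring edge types."""
    pattern = os.path.join(folder, "*.edges.csv.gz")
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="ascii") as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2:  # skip blank or malformed lines
                    yield parts[0], parts[1]
```

Streaming the files one after another keeps memory use flat, which matters if the graph is too large to load at once.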
If you want to pay attention to the edge types, the files are named as follows:
- `origin_to_snapshot.edges.csv.gz`: the edges from the origin ID (integer) to
the snapshot ID (sha1).
- `snapshot_to_obj.edges.csv.gz`: the edges from the snapshot ID (sha1) to the
object it points to (sha1), that can be either a release, a snapshot, a
revision, a directory or a content.
- `release_to_obj.edges.csv.gz`: the edges from the release ID (sha1) to the
object it points to (sha1), that can be either a release, a revision, a
directory or a content.
- `rev_to_rev.edges.csv.gz`: the edges from each revision (sha1) to its parent
revisions (sha1). This is the full **development history** of the dataset.
- `rev_to_dir.edges.csv.gz`: the edges from each revision (sha1) to the
directory it points to (sha1).
- `dir_to_dir.edges.csv.gz`: the edges from each directory (sha1) to its
child directories (sha1).
- `dir_to_file.edges.csv.gz`: the edges from each directory (sha1) to its
child files (sha1).
- `dir_to_rev.edges.csv.gz`: the edges from each directory (sha1) to its
child revisions (sha1).
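For type-aware processing, the naming scheme above can be captured in a small lookup table. The following is only a sketch: the `EDGE_TYPES` table, the type labels, and the helpers `edge_file_name` and `parse_edge` are hypothetical names that mirror the list above, not part of the dataset.

```python
from typing import Dict, Tuple, Union

# File name prefix -> (source node type, possible destination node types),
# transcribed from the list of edge files in this README.
EDGE_TYPES: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "origin_to_snapshot": ("origin", ("snapshot",)),
    "snapshot_to_obj": ("snapshot", ("release", "snapshot", "revision",
                                     "directory", "content")),
    "release_to_obj": ("release", ("release", "revision", "directory",
                                   "content")),
    "rev_to_rev": ("revision", ("revision",)),
    "rev_to_dir": ("revision", ("directory",)),
    "dir_to_dir": ("directory", ("directory",)),
    "dir_to_file": ("directory", ("file",)),
    "dir_to_rev": ("directory", ("revision",)),
}


def edge_file_name(kind: str) -> str:
    """Return the edge file name for a known edge kind, e.g. 'rev_to_rev'."""
    if kind not in EDGE_TYPES:
        raise KeyError(f"unknown edge kind: {kind}")
    return f"{kind}.edges.csv.gz"


def parse_edge(kind: str, line: str) -> Tuple[Union[int, str], str]:
    """Split one "SRC_ID DST_ID" line; origin IDs are integers, the rest SHA1 hex."""
    src, dst = line.strip().split(" ", 1)
    if EDGE_TYPES[kind][0] == "origin":
        return int(src), dst
    return src, dst
```

Keeping the source and destination types next to the file name makes it easy to select, for example, only the `rev_to_rev` edges when studying the development history.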