
graph dataset: update to use persistent identifiers everywhere
Closed, Resolved (Public)


The graph dataset uses bare SHA1s as node identifiers and relies on file names to convey the type of node.
That is inconsistent and leads to ambiguities: in edge lists whose targets can be of several node types (e.g., snapshot_to_obj and release_to_obj), a SHA1 alone does not say which type of node it points to.

We should redo the exports (or hot-patch the existing ones) to use SWH persistent identifiers (PIDs) as node identifiers.
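
A minimal sketch of the conversion, for illustration only. It assumes that the node type of each endpoint is known at rewrite time (for the ambiguous edge lists, the target type would have to come from a lookup that is not shown here) and that the stored hash is the one the PID scheme expects (for contents, that is the sha1_git hash). The function names to_swh_pid and rewrite_edge and the long type names used as dictionary keys are hypothetical; only the swh:1:<type>:<hex> layout and the type codes come from the SWH PID scheme.

```python
import re

# Type codes used by version 1 of the SWH PID scheme.
# The long names on the left are illustrative keys, not dataset column names.
TYPE_CODES = {
    "content": "cnt",
    "directory": "dir",
    "revision": "rev",
    "release": "rel",
    "snapshot": "snp",
}

_SHA1_HEX = re.compile(r"[0-9a-f]{40}")


def to_swh_pid(node_type, sha1_hex):
    """Build a PID of the form swh:1:<type>:<40 hex digits>."""
    if node_type not in TYPE_CODES:
        raise ValueError(f"unknown node type: {node_type!r}")
    sha1_hex = sha1_hex.lower()
    if not _SHA1_HEX.fullmatch(sha1_hex):
        raise ValueError(f"not a 40-digit hex SHA1: {sha1_hex!r}")
    return f"swh:1:{TYPE_CODES[node_type]}:{sha1_hex}"


def rewrite_edge(src_type, src_sha1, dst_type, dst_sha1):
    """Turn a (sha1, sha1) edge into a (PID, PID) edge.

    Both types must be supplied by the caller: the source type can be read
    off the edge list name, the destination type cannot when the list mixes
    several target types.
    """
    return to_swh_pid(src_type, src_sha1), to_swh_pid(dst_type, dst_sha1)
```

With identifiers in this form, an edge from a snapshot to a revision and an edge from a snapshot to a release are distinguishable from the identifiers alone, without consulting the file name.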

Event Timeline

zack triaged this task as Normal priority. May 23 2019, 2:32 PM
zack created this task.

We no longer export edges per file type.