
Write a script to generate qualified SWHID from swh-graph
Closed, Migrated

Description

Idea from @rdicosmo: getting the shortest path from a content to an origin is a good heuristic to get the first revision that contains that content.

It could be nice to be able to extract a fully qualified SWHID from that path.

Event Timeline

vlorentz renamed this task from Writa a script t generate qualified SWHID from swh-graph to Writa a script to generate qualified SWHID from swh-graph. Sep 22 2022, 2:53 PM
vlorentz triaged this task as Normal priority.
vlorentz created this task.
zack renamed this task from Writa a script to generate qualified SWHID from swh-graph to Write a script to generate qualified SWHID from swh-graph. Sep 22 2022, 4:06 PM
zack added a project: Compressed graph service.

Here is a dump of my design notes on this task.

Getting the shortest path from a content to an origin

For a given content and origin, the shortest provenance path can be found with a single gRPC call.
Suppose the graph service is available on localhost:50091, and that a previous call has returned swh:1:ori:8903a90cff8f07159be7aed69f19d66d33db3f86 as the SWHID of an origin containing the content swh:1:cnt:8722d84d658e5e11519b807abb5c05bfbfc531f0. Here is the relevant part of what the gRPC call returns:

grpc_cli call localhost:50091 swh.graph.TraversalService.FindPathBetween     "src: 'swh:1:ori:8903a90cff8f07159be7aed69f19d66d33db3f86', dst: 'swh:1:cnt:8722d84d658e5e11519b807abb5c05bfbfc531f0', mask: {paths: ['swhid','ori.url']}" | egrep 'swhid|url'
connecting to localhost:50091
swhid: "swh:1:ori:8903a90cff8f07159be7aed69f19d66d33db3f86"
  url: "https://github.com/rdicosmo/parmap"
swhid: "swh:1:snp:1527a93b039d70f6a781b05d76b77c6209912887"
swhid: "swh:1:rev:82df563aecf86b9164eee7d10d40f2d8cbd1c78d"
swhid: "swh:1:dir:484db39bb2825886191837bb0960b7450f9099bb"
swhid: "swh:1:dir:4d15e44b378fe39dd23817abee756cd47ad14575"
swhid: "swh:1:cnt:8722d84d658e5e11519b807abb5c05bfbfc531f0"
Rpc succeeded with OK status
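
A script would need to turn this output into something structured. Here is a minimal sketch, assuming the text layout shown above (one swhid: line per node, and one url: line for the origin); the function name parse_find_path_output is made up for illustration, and a real script would more likely use the generated gRPC client than parse grpc_cli text.

```python
import re

# Matches lines such as:  swhid: "swh:1:rev:82df..."
SWHID_RE = re.compile(
    r'^\s*swhid:\s*"(swh:1:(?:ori|snp|rel|rev|dir|cnt):[0-9a-f]{40})"', re.M
)
# Matches lines such as:  url: "https://github.com/rdicosmo/parmap"
URL_RE = re.compile(r'^\s*url:\s*"([^"]*)"', re.M)


def parse_find_path_output(text):
    """Return (list of SWHIDs in path order, origin URL or None).

    Assumes the filtered grpc_cli output format shown in this task.
    """
    swhids = SWHID_RE.findall(text)
    match = URL_RE.search(text)
    return swhids, (match.group(1) if match else None)
```

The returned list keeps the order of the path, from the origin down to the content.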

Build the fully qualified SWHID corresponding to a shortest provenance path
  • core id = swhid of the cnt node
  • qualifiers
    • visit = the only snapshot in the path
    • anchor = the revision in the path (usually there is only one; if there are several, pick the last one, i.e. the one closest to the content)
    • origin = the ori.url value
    • path = currently not provided by swh-graph, can be rebuilt client-side using the SWH API

For the example above, this gives:
swh:1:cnt:8722d84d658e5e11519b807abb5c05bfbfc531f0;visit=swh:1:snp:1527a93b039d70f6a781b05d76b77c6209912887;anchor=swh:1:rev:82df563aecf86b9164eee7d10d40f2d8cbd1c78d;origin=https://github.com/rdicosmo/parmap
The only missing piece is the path qualifier.
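
The rules listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name build_qualified_swhid and its parameters are made up, the input is the list of SWHIDs in path order (origin first, content last), and escaping ";" in qualifier values as %3B is my reading of the SWHID syntax rather than something checked against the graph service.

```python
def build_qualified_swhid(path_swhids, origin_url=None, file_path=None):
    """Build a qualified SWHID from a provenance path.

    path_swhids: SWHIDs in path order, e.g. ori, snp, rev, dir, ..., cnt.
    origin_url:  value of ori.url, if known.
    file_path:   optional path qualifier (not provided by swh-graph).
    """
    def of_type(kind):
        # The object type is the third colon-separated field of a SWHID.
        return [s for s in path_swhids if s.split(":")[2] == kind]

    contents = of_type("cnt")
    if not contents:
        raise ValueError("no content node in path")

    qualifiers = []
    snapshots = of_type("snp")
    if snapshots:
        qualifiers.append(("visit", snapshots[0]))
    revisions = of_type("rev")
    if revisions:
        # If there are several revisions, keep the last one in the path.
        qualifiers.append(("anchor", revisions[-1]))
    if origin_url:
        # Assumption: ';' separates qualifiers, so it is escaped in values.
        qualifiers.append(("origin", origin_url.replace(";", "%3B")))
    if file_path:
        qualifiers.append(("path", file_path.replace(";", "%3B")))

    return ";".join([contents[-1]] + [f"{k}={v}" for k, v in qualifiers])
```

Called with the six SWHIDs of the example path and the parmap origin URL, it produces the qualified SWHID shown above, without the path qualifier.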

Hmm, strangely, file/dir names are missing from the response even when omitting the mask when querying the graph server on granet; but based on the .proto file, they should be available via the successor field of Node.

AFAIK, from a chat with @seirl, this is not (yet) implemented, but it's really not a blocker for a script: we can rebuild the path by navigating the directory SWHIDs via the usual API on the client side.
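
For reference, here is a rough sketch of that client-side reconstruction. It assumes the public Web API endpoint /api/1/directory/<sha1_git>/ returns a JSON list of entries that each have "name" and "target" fields; the helper names fetch_directory and rebuild_path are made up, and the lookup function is injectable so the logic can be exercised without network access.

```python
import json
from urllib.request import urlopen

API = "https://archive.softwareheritage.org/api/1"


def fetch_directory(dir_swhid):
    """List a directory via the SWH Web API (network call, rate limited)."""
    sha1_git = dir_swhid.rsplit(":", 1)[1]
    with urlopen(f"{API}/directory/{sha1_git}/") as resp:
        return json.load(resp)


def rebuild_path(path_swhids, list_directory=fetch_directory):
    """Rebuild the file path from the dir/cnt nodes of a provenance path."""
    kinds = [s.split(":")[2] for s in path_swhids]
    # Start right after the last revision (the anchor), if there is one.
    rev_positions = [i for i, k in enumerate(kinds) if k == "rev"]
    start = rev_positions[-1] + 1 if rev_positions else 0
    chain = [s for s in path_swhids[start:] if s.split(":")[2] in ("dir", "cnt")]

    names = []
    for parent, child in zip(chain, chain[1:]):
        child_hash = child.rsplit(":", 1)[1]
        # A child may appear under several names; keep the first match.
        entry = next(
            (e for e in list_directory(parent) if e["target"] == child_hash),
            None,
        )
        if entry is None:
            raise LookupError(f"{child} not found in {parent}")
        names.append(entry["name"])
    return "/" + "/".join(names)
```

The result can then be appended to the qualified SWHID as ;path=<value>. This costs one API request per directory in the path, which should be fine for a script given how short these paths are.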

rdicosmo claimed this task.

We should add some tests so the code doesn't break when we change other parts of the codebase. I'll try to do it this week.