Idea from @rdicosmo: getting the shortest path from a content to an origin is a good heuristic to get the first revision that contains that content.
It could be nice to be able to extract a fully qualified SWHID from that path.
Here is a dump of my design notes on this task.
For a content and a given origin, the shortest provenance path can be found with a single gRPC call.
Supposing the graph service is available on localhost:50091, and that a previous call has returned swh:1:ori:8903a90cff8f07159be7aed69f19d66d33db3f86 as the swh-graph id of an origin containing
the content swh:1:cnt:8722d84d658e5e11519b807abb5c05bfbfc531f0, here is the relevant part of what we get from the GRPC call:
grpc_cli call localhost:50091 swh.graph.TraversalService.FindPathBetween "src: 'swh:1:ori:8903a90cff8f07159be7aed69f19d66d33db3f86', dst: 'swh:1:cnt:8722d84d658e5e11519b807abb5c05bfbfc531f0', mask: {paths: ['swhid','ori.url']}" | egrep 'swhid|url'
connecting to localhost:50091
swhid: "swh:1:ori:8903a90cff8f07159be7aed69f19d66d33db3f86"
url: "https://github.com/rdicosmo/parmap"
swhid: "swh:1:snp:1527a93b039d70f6a781b05d76b77c6209912887"
swhid: "swh:1:rev:82df563aecf86b9164eee7d10d40f2d8cbd1c78d"
swhid: "swh:1:dir:484db39bb2825886191837bb0960b7450f9099bb"
swhid: "swh:1:dir:4d15e44b378fe39dd23817abee756cd47ad14575"
swhid: "swh:1:cnt:8722d84d658e5e11519b807abb5c05bfbfc531f0"
Rpc succeeded with OK status
From the example above, we can easily build the following qualified SWHID:
swh:1:cnt:8722d84d658e5e11519b807abb5c05bfbfc531f0;visit=swh:1:snp:1527a93b039d70f6a781b05d76b77c6209912887;anchor=swh:1:rev:82df563aecf86b9164eee7d10d40f2d8cbd1c78d;origin=https://github.com/rdicosmo/parmap
What is still missing is the path qualifier.
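To make the construction above concrete, here is a minimal Python sketch that turns the list of SWHIDs returned by FindPathBetween into the qualified SWHID. The function name `qualified_swhid` is made up for illustration and is not necessarily what the actual script does. It assumes the path goes origin, snapshot, revision, directories, content as in the example, and that the anchor should be the last revision (or release) on the path. It does not escape special characters in the origin URL.

```python
def qualified_swhid(node_swhids, origin_url):
    """Build a qualified SWHID from an origin-to-content path.

    node_swhids: SWHIDs in path order (origin first, content last).
    origin_url: URL of the origin, as returned in the ori.url field.
    """
    last_of_type = {}
    for swhid in node_swhids:
        # Core SWHIDs look like swh:1:<type>:<hex hash>.
        obj_type = swhid.split(":")[2]
        # Later nodes overwrite earlier ones, so we keep the node
        # closest to the content for each object type.
        last_of_type[obj_type] = swhid

    # Assumption: anchor on the revision if there is one, else on a release.
    anchor = last_of_type.get("rev") or last_of_type.get("rel")
    qualifiers = [
        ("visit", last_of_type.get("snp")),
        ("anchor", anchor),
        ("origin", origin_url),
    ]
    parts = [last_of_type["cnt"]]
    parts += [f"{key}={value}" for key, value in qualifiers if value]
    return ";".join(parts)
```

Called with the six SWHIDs from the response above and the parmap URL, this should give back the same string as the hand-built one.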
Hmm, strangely, file/dir names are missing from the response even when omitting the mask when querying the graph server on granet; but based on the .proto file, they should be available via the successor field of Node.
AFAIK, from a chat with @seirl, this is not (yet) implemented, but it's really not a blocker for a script: we can rebuild the path by navigating the directory SWHIDs via the usual API on the client side.
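A rough sketch of that client-side reconstruction, in Python. The `list_directory` argument is a hypothetical helper, not an existing function: it is assumed to take a directory SWHID and return its entries as dicts with at least a "name" and a "target" (hex hash) key, which is my understanding of what the Web API directory endpoint returns. Passing it in as a parameter keeps the sketch independent of how the entries are actually fetched.

```python
def rebuild_path(node_swhids, list_directory):
    """Return the path of the content relative to the root directory.

    node_swhids: SWHIDs in path order (origin first, content last).
    list_directory: hypothetical callable mapping a directory SWHID to a
        list of entries, each a dict with "name" and "target" keys.
    """
    # Only the directory-to-content tail of the path carries names.
    tail = [s for s in node_swhids if s.split(":")[2] in ("dir", "cnt")]
    names = []
    for parent, child in zip(tail, tail[1:]):
        child_hash = child.split(":")[3]
        for entry in list_directory(parent):
            if entry["target"] == child_hash:
                # If the same object appears under several names,
                # the first matching entry wins.
                names.append(entry["name"])
                break
        else:
            raise LookupError(f"{child} not found in {parent}")
    return "/" + "/".join(names)
```

The result can then be appended to the qualified SWHID as a path= qualifier. Note that this costs one directory listing per directory on the path, which should be fine for a script.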
This is now fully implemented in https://forge.softwareheritage.org/source/swh-graph/browse/master/tools/swh-graph-lookup/swh-graph-lookup.py
We should add some tests so the code doesn't break when we change other stuff. I'll try to do it this week.