Page MenuHomeSoftware Heritage

swh-graph: add random walk endpoint
ClosedPublic

Authored by zack on Nov 11 2019, 7:05 PM.

Details

Summary

probably still a lot to be improved in terms of code cleanness, but the main
functionality seems to work and is both pretty robust and fast

Closes T2077

Diff Detail

Repository
rDGRPH Compressed graph representation
Branch
feature/random-walk
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 8934
Build 13048: tox-on-jenkinsJenkins
Build 13047: arc lint + arc unit

Event Timeline

seirl requested changes to this revision.Nov 12 2019, 1:12 AM
seirl added inline comments.
docs/api.rst
199

Not the correct format but this should be fixed in all the methods, so I think it's fine to have the wrong one until the rest of the doc is fixed.

java/src/main/java/org/softwareheritage/graph/algo/Traversal.java
309

I'm not quite sure this is the semantic we want here, as we do access the edges even when we don't pick them. (Feel free to disagree with my idea of what nbEdgesAccessed should represent.)

349

Asserts are disabled in release mode, maybe we want to throw an actual error here?

swh/graph/server/app.py
92

Maybe simpler:

if random:
    it = backend.random_walk(...)
else:
    it = backend.walk(...)

for res_node in it:
    ...
This revision now requires changes to proceed.Nov 12 2019, 1:12 AM
zack marked 4 inline comments as done.Nov 12 2019, 8:47 AM
zack added inline comments.
java/src/main/java/org/softwareheritage/graph/algo/Traversal.java
309

ah, good point, I'm updating to also count the skipped edges

(not that it matters much, as we are no longer returning the stats and we should probably remove that accounting anyway; but while it's in it should indeed be correct)

349

ah, that's why I didn't hit this while debugging :-)
fixed

zack marked 2 inline comments as done.
  • skeleton for random walk endpoint
  • REST client: add binding for /randomwalk
  • test_api_client: add test case for /randomwalk
  • REST doc: update /randomwalk doc to match what is returned
  • access edge stats: also count skipped edges for /randomwalk
  • /randomwalk: use proper runtime errors instead of assertions
  • app.py: add TODO about making RANDOM_RETRIES configurable
  • app.py: refactor random v. non-random walk iteration logic
  • randomwalk: do not bother randomizing singleton sets

swh-graph: add random walk endpoint

  • random walk: include starting node into returned path
  • random walk: use reservoir sampling to pick random successor
This revision is now accepted and ready to land.Nov 13 2019, 5:44 PM
  • add random walk endpoint
  • random walk: include starting node into returned path
  • random walk: use reservoir sampling to pick random successor

merged in 97e79c289aff6625ebdd9068a006c10aca28fc12