graph: use an sqlite3 on-disk set to avoid processing nodes twice
Closed, Public

Authored by seirl on May 4 2020, 9:48 PM.

Diff Detail

Repository
rDDATASET Datasets
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz added a subscriber: vlorentz.

Some things you could try to improve performance after you land this diff:

  • WITHOUT ROWID: https://sqlite.org/withoutrowid.html
  • using a cursor, adding IF NOT EXISTS ... to the query, and checking cursor.total_changes
  • alternatively, just use IF NOT EXISTS ... without checking the changes, remove the creation of nodes.csv from this process, and create it in another process from the sqlite DB
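
To make the first two suggestions concrete, here is a minimal sketch of an on-disk set, not the code from this diff. The class name `SQLiteSet`, the table name `seen`, and the key format are hypothetical. It also assumes `INSERT OR IGNORE` as the way to express the "IF NOT EXISTS" idea for an insert. Note that in Python's `sqlite3` module, `total_changes` is an attribute of the connection, while the cursor exposes `rowcount`; the sketch uses the latter.

```python
import os
import sqlite3
import tempfile


class SQLiteSet:
    """Hypothetical on-disk set backed by a single-column SQLite table."""

    def __init__(self, db_path: str) -> None:
        self.conn = sqlite3.connect(db_path)
        # WITHOUT ROWID stores rows directly in the primary-key B-tree,
        # so there is no separate rowid table plus index to maintain.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (key BLOB PRIMARY KEY) WITHOUT ROWID"
        )

    def add(self, key: bytes) -> bool:
        """Insert key; return True if it was new, False if already present."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO seen (key) VALUES (?)", (key,)
        )
        # rowcount is 1 when the row was inserted, 0 when it was ignored.
        return cur.rowcount == 1

    def close(self) -> None:
        self.conn.commit()
        self.conn.close()


# Usage: the second insertion of the same key reports a duplicate.
with tempfile.TemporaryDirectory() as tmp:
    seen = SQLiteSet(os.path.join(tmp, "seen.sqlite3"))
    first = seen.add(b"node-1")
    second = seen.add(b"node-1")
    seen.close()
```

This does the membership test and the insertion in a single statement, so a node is processed only when `add` returns `True`.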
swh/dataset/utils.py
46

Please add a short docstring.

58

here too, for the return type

This revision is now accepted and ready to land. May 4 2020, 10:03 PM
olasd requested changes to this revision. May 4 2020, 11:18 PM
olasd added a subscriber: olasd.
olasd added inline comments.
swh/dataset/graph.py
49–53

I think you need both the origin and the visit ID here, or you'll only get one visit per origin

This revision now requires changes to proceed. May 4 2020, 11:18 PM
swh/dataset/graph.py
49–53

And you probably need to filter the visits to keep only the ones whose status is "final"

swh/dataset/graph.py
49–53

Good catch for the visit ID, thanks!
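
The two points raised in this thread can be sketched as follows. This is an illustration under stated assumptions, not the code in swh/dataset/graph.py: the dictionary fields `origin`, `visit`, and `status`, the helper name `iter_new_final_visits`, and the set of statuses treated as "final" are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# Assumption: which statuses count as "final" is a guess for illustration.
FINAL_STATUSES = {"full", "partial"}


def iter_new_final_visits(
    visits: Iterable[Dict], seen: Set[Tuple[str, int]]
) -> Iterator[Dict]:
    """Yield each final visit once, keyed on (origin, visit ID)."""
    for visit in visits:
        if visit["status"] not in FINAL_STATUSES:
            continue  # skip visits that have not reached a final state
        # Keying on the origin alone would keep only one visit per origin.
        key = (visit["origin"], visit["visit"])
        if key in seen:
            continue
        seen.add(key)
        yield visit


visits = [
    {"origin": "https://example.org/repo", "visit": 1, "status": "full"},
    {"origin": "https://example.org/repo", "visit": 2, "status": "full"},
    {"origin": "https://example.org/repo", "visit": 2, "status": "full"},
    {"origin": "https://example.org/repo", "visit": 3, "status": "ongoing"},
]
kept = [v["visit"] for v in iter_new_final_visits(visits, set())]
```

Here an in-memory `set` stands in for the on-disk set; the same key could be serialized and stored in SQLite instead.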

  • rebase
  • graph: do not deduplicate different visits from the same origin
This revision is now accepted and ready to land. May 5 2020, 6:47 PM