graph: use an sqlite3 on-disk set to avoid processing nodes twice
Closed, Public

Authored by seirl on Mon, May 4, 9:48 PM.

Diff Detail

Repository
rDDATASET Datasets
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

seirl created this revision. Mon, May 4, 9:48 PM
vlorentz accepted this revision. (Edited) Mon, May 4, 10:03 PM
vlorentz added a subscriber: vlorentz.

Some things you could try to improve performance after you land this diff:

  • WITHOUT ROWID https://sqlite.org/withoutrowid.html
  • using a cursor, adding IF NOT EXISTS ... to the query and checking cursor.total_changes
  • alternatively, just use IF NOT EXISTS ... without checking the changes, remove the creation of nodes.csv from this process, and create it in another process from the SQLite DB
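
A minimal sketch of the first two suggestions, not the code from this diff. The class name `SQLiteSet`, the table name `seen`, and the column name `key` are hypothetical. The sketch assumes SQLite's `INSERT OR IGNORE` as the "insert only if not already present" form, and reads `cursor.rowcount` to learn whether a row was actually written (in Python's `sqlite3` module, `total_changes` is an attribute of the connection rather than of the cursor).

```python
import sqlite3


class SQLiteSet:
    """Hypothetical on-disk set: remembers which keys were already seen."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        # WITHOUT ROWID stores rows directly in the primary-key B-tree,
        # avoiding a separate rowid index for a single-column table.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (key BLOB PRIMARY KEY) WITHOUT ROWID"
        )

    def add(self, key: bytes) -> bool:
        """Insert key; return True if it was new, False if already present."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO seen (key) VALUES (?)", (key,)
        )
        # rowcount is 1 when the row was inserted, 0 when it was ignored.
        return cur.rowcount == 1

    def close(self) -> None:
        self.conn.commit()
        self.conn.close()
```

With this shape, a caller can write `if seen.add(node_id): process(node_id)` and skip nodes that were already handled, in a single query per node.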
swh/dataset/utils.py
47

a short docstring plz

59

here too, for the return type

This revision is now accepted and ready to land. Mon, May 4, 10:03 PM
olasd requested changes to this revision. Mon, May 4, 11:18 PM
olasd added a subscriber: olasd.
olasd added inline comments.
swh/dataset/graph.py
49–53

I think you need origin and the visit id here, or you'll only get one visit per origin

This revision now requires changes to proceed. Mon, May 4, 11:18 PM
olasd added inline comments. Mon, May 4, 11:20 PM
swh/dataset/graph.py
49–53

And you probably need to filter the visits to keep only the ones whose status is "final"
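
A rough sketch of the two points raised on these lines, not the code under review. The dict fields `origin`, `visit`, and `status`, the function name `select_visits`, and the contents of `FINAL_STATUSES` are all assumptions made for illustration; the actual field names and the list of final statuses should be checked against the storage model.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# Assumption: which statuses count as "final" is a guess for this sketch.
FINAL_STATUSES = {"full", "partial"}


def select_visits(visits: Iterable[Dict]) -> Iterator[Dict]:
    """Yield each final visit once, deduplicating on (origin, visit id)."""
    seen: Set[Tuple[str, int]] = set()
    for visit in visits:
        # Skip visits that have not reached a final status.
        if visit["status"] not in FINAL_STATUSES:
            continue
        # Keying on the origin alone would keep one visit per origin;
        # including the visit id keeps every distinct visit.
        key = (visit["origin"], visit["visit"])
        if key in seen:
            continue
        seen.add(key)
        yield visit
```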

seirl added inline comments. Tue, May 5, 5:04 PM
swh/dataset/graph.py
49–53

Good catch for the visit ID, thanks!

seirl updated this revision to Diff 11109. Tue, May 5, 6:43 PM
  • rebase
  • graph: do not deduplicate different visits from the same origin
olasd accepted this revision. Tue, May 5, 6:47 PM
This revision is now accepted and ready to land. Tue, May 5, 6:47 PM