Differential D3121

graph: use an sqlite3 on-disk set to avoid processing nodes twice
Closed, Public

Authored by seirl on May 4 2020, 9:48 PM.

Tags

None

Subscribers

Details

Reviewers

vlorentz
olasd

Group Reviewers

Commits

rDDATASET56fee87710ce: graph: do not deduplicate different visits from the same origin
rDDATASET7a34fb115d38: graph: use an sqlite3 on-disk set to avoid processing nodes twice

Diff Detail

Repository

rDDATASET Datasets

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

seirl created this revision. May 4 2020, 9:48 PM

Herald added a reviewer: Reviewers. May 4 2020, 9:48 PM

Harbormaster completed remote builds in B12259: Diff 11089. May 4 2020, 9:48 PM

Some things you could try to improve performance after you land this diff:

- declare the table WITHOUT ROWID: https://sqlite.org/withoutrowid.html
- use a cursor, add IF NOT EXISTS ... to the query, and check cursor.total_changes
- alternatively, just use IF NOT EXISTS ... without checking the changes, remove the creation of nodes.csv from this process, and create it from the sqlite DB in another process
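
The suggestions in this comment can be sketched with Python's standard sqlite3 module. This is a minimal sketch under stated assumptions, not the code from the diff: the class name `SQLiteSet`, the table name `nodes`, and the column name `id` are hypothetical. It uses `INSERT OR IGNORE` as one way to get insert-if-absent behavior, and it reads `cursor.rowcount` for the per-statement result, because in the standard module `total_changes` is an attribute of the connection rather than of the cursor.

```python
import sqlite3
import tempfile
from pathlib import Path


class SQLiteSet:
    """On-disk set of byte strings backed by sqlite3 (hypothetical helper)."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(str(db_path))
        # WITHOUT ROWID stores rows directly in the primary-key B-tree,
        # so no separate rowid table and index have to be maintained.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS nodes (id BLOB PRIMARY KEY) WITHOUT ROWID"
        )

    def add(self, item: bytes) -> bool:
        """Insert item; return True only if it was not already present."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO nodes (id) VALUES (?)", (item,)
        )
        # rowcount is 1 when a row was inserted, 0 when the insert was ignored.
        return cur.rowcount == 1

    def __iter__(self):
        # Lets a later step (or another process) dump every stored node.
        for (item,) in self.conn.execute("SELECT id FROM nodes ORDER BY id"):
            yield item

    def close(self):
        self.conn.commit()
        self.conn.close()
```

With a helper like this, the caller only processes a node when `add()` returns True. The third suggestion corresponds to ignoring the return value during the export and iterating over the stored rows afterwards to write nodes.csv.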

swh/dataset/utils.py
46	a short docstring plz
58	here too, for the return type

This revision is now accepted and ready to land. May 4 2020, 10:03 PM

olasd requested changes to this revision. May 4 2020, 11:18 PM

olasd added a subscriber: olasd.

olasd added inline comments.

swh/dataset/graph.py
49–53	I think you need origin and the visit id here, or you'll only get one visit per origin

This revision now requires changes to proceed. May 4 2020, 11:18 PM

olasd added inline comments. May 4 2020, 11:20 PM

swh/dataset/graph.py
49–53	And you probably need to filter visits to keep only the ones whose states are "final"

seirl added inline comments. May 5 2020, 5:04 PM

swh/dataset/graph.py
49–53	Good catch for the visit ID, thanks!
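
The point raised in this thread can be illustrated with a small sketch. It is not the code from swh/dataset/graph.py: the function names `visit_key` and `select_visits`, the record fields `origin`, `visit`, and `status`, and the status values treated as final are all assumptions made for illustration. An in-memory `set` stands in for the on-disk set.

```python
def visit_key(origin: str, visit_id: int) -> bytes:
    """Build a deduplication key from the origin AND the visit id.

    Keying on the origin alone would collapse every visit of that
    origin into a single entry.
    """
    return f"{origin}:{visit_id}".encode()


def select_visits(visits, final_statuses=frozenset({"full", "partial"})):
    """Yield each final visit once (field names and statuses are assumed)."""
    seen = set()  # stand-in for the on-disk set
    for visit in visits:
        if visit["status"] not in final_statuses:
            continue  # skip visits that are not in a final state
        key = visit_key(visit["origin"], visit["visit"])
        if key in seen:
            continue  # this exact visit was already processed
        seen.add(key)
        yield visit


example_visits = [
    {"origin": "https://example.org/repo", "visit": 1, "status": "full"},
    {"origin": "https://example.org/repo", "visit": 2, "status": "full"},
    {"origin": "https://example.org/repo", "visit": 2, "status": "full"},
    {"origin": "https://example.org/repo", "visit": 3, "status": "ongoing"},
]
kept = [v["visit"] for v in select_visits(example_visits)]
```

Here visits 1 and 2 are both kept even though they share an origin, the repeated record for visit 2 is dropped, and visit 3 is filtered out because its status is not in the assumed set of final statuses.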

rebase
graph: do not deduplicate different visits from the same origin

Harbormaster completed remote builds in B12282: Diff 11109. May 5 2020, 6:43 PM

olasd accepted this revision. May 5 2020, 6:47 PM

This revision is now accepted and ready to land. May 5 2020, 6:47 PM

Closed by commit rDDATASET7a34fb115d38: graph: use an sqlite3 on-disk set to avoid processing nodes twice (authored by seirl). May 5 2020, 6:50 PM

This revision was automatically updated to reflect the committed changes.

seirl added a commit: rDDATASET7a34fb115d38: graph: use an sqlite3 on-disk set to avoid processing nodes twice.

seirl added a commit: rDDATASET56fee87710ce: graph: do not deduplicate different visits from the same origin.

Revision Contents

Path                              Size
swh/dataset/graph.py              32 lines
swh/dataset/test/test_graph.py    22 lines
swh/dataset/test/test_utils.py    15 lines
swh/dataset/utils.py              44 lines

Diff 11110

swh/dataset/graph.py

swh/dataset/test/test_graph.py

swh/dataset/test/test_utils.py

swh/dataset/utils.py
