Page MenuHomeSoftware Heritage

dataset: add graph export based on kafka
ClosedPublic

Authored by seirl on Apr 10 2020, 3:45 PM.

Details

Reviewers
vlorentz
Group Reviewers
Reviewers

Diff Detail

Repository
rDDATASET Datasets
Branch
wip_graph_export
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 12255
Build 18587: arc lint + arc unit

Event Timeline

vlorentz added a subscriber: vlorentz.

Could you add some docstrings and tests?

swh/dataset/exporter.py
43

when is it raised?

122

remove this print?

swh/dataset/graph.py
97–121

isn't it needlessly expansive to take the list of edges as input and deduplicate it? why not output nodes directly from the journal client?

99–104

why not just env={}?

This revision now requires changes to proceed.Apr 10 2020, 4:05 PM
swh/dataset/graph.py
97–121

I will add a docstring to explain this step, but the idea is that we also need nodes that are referenced as edge targets. For instance a rev_entry in a directory that points to a revision that we don't have should be a node in the graph.

99–104

The manual of sort specifically suggests to set this locale to C to do a byte-wise sorting. Also we probably still need PATH etc.

  • Add docstrings and types
seirl marked 4 inline comments as done.
seirl requested review of this revision.Apr 21 2020, 7:40 PM
  • Rework graph export pipeline
  • Graph export: add unit tests

it's not a blocker, and can be added in a subsequent commit, but can you please also add documentation of the CLI, in the style of https://docs.softwareheritage.org/devel/swh-model/cli.html ?

swh/dataset/graph.py
99–104

It says "The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values."

if you clear the environment, then it's not needed.

This revision is now accepted and ready to land.Apr 24 2020, 10:08 AM

looks fine to me, but the few minor comments. I'll let @vlorentz gives his green stamp on this.

swh/dataset/exporter.py
64

remove this pass I guess

145–146

is this black-legit?

157

the get_offset() usage here looks a bit weird. IMHO it would be clearer to use the returned value from get_offset() each time, or make it a property.

swh/dataset/graph.py
38–39

weird, a few chunks above you are using

(lo, hi) = xxx

but no parenthesis here :-)

seirl marked 4 inline comments as done.May 4 2020, 7:05 PM
seirl added inline comments.
swh/dataset/exporter.py
145–146

No, this branch is pre-black.

157

The problem is that get_offsets is really something I just want to do once and keep consistent afterwards (because it takes time to fetch and can change over time). I'm changing this so it looks less weird in this instance, but the underlying logic will stay the same.

swh/dataset/graph.py
99–104

Right, but I still need PATH, so setting LOCALE=C is the easiest way anyway.

seirl marked 3 inline comments as done.May 4 2020, 7:07 PM
seirl added inline comments.
swh/dataset/graph.py
38–39

Just a matter of taste, I consider (lo, hi) to be a single "watermark offset" object, but src/dst are just different stuff crammed into one argument. I can change one or the other if it bugs you :-P

  • tests/graph: check the presence of duplicate nodes
  • exporter: minor style changes

Rebase and update against swh-storage/master

Don't know why this hasn't been autoclosed, but it's merged in master.