exporters/edges: Make swhid() format directly instead of instantiating ExtendedSWHID
Closed, Public

Authored by vlorentz on Aug 6 2021, 12:40 PM.

Details

Summary

Before this commit, between 30 and 40% of the run time was spent in this
function (especially ExtendedSWHID.__init__).

Now, it is under 10%.
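
To illustrate the kind of change described above, here is a minimal sketch, not the actual diff. It assumes the usual `swh:1:<type>:<hex id>` textual form of a SWHID. `ValidatedSWHID` is a hypothetical stand-in for a validating identifier class (it is not the real `ExtendedSWHID`), and the function names and parameters are invented for the example.

```python
# Illustrative stand-in only: this is NOT the real swh.model ExtendedSWHID class.
class ValidatedSWHID:
    """Hypothetical identifier object that validates its fields on construction."""

    # Assumed set of type tags, for the sake of the example.
    VALID_TYPES = {"cnt", "dir", "rev", "rel", "snp", "ori"}

    def __init__(self, object_type: str, object_id: bytes) -> None:
        # Per-object validation like this is the kind of work that adds up
        # when the constructor is called once per exported edge.
        if object_type not in self.VALID_TYPES:
            raise ValueError(f"invalid object type: {object_type!r}")
        if len(object_id) != 20:
            raise ValueError("object_id must be 20 bytes")
        self.object_type = object_type
        self.object_id = object_id

    def __str__(self) -> str:
        return f"swh:1:{self.object_type}:{self.object_id.hex()}"


def swhid_via_object(object_type: str, object_id: bytes) -> str:
    """Slower path: build a validating object, then stringify it."""
    return str(ValidatedSWHID(object_type, object_id))


def swhid_direct(object_type: str, object_id: bytes) -> str:
    """Faster path: format the string directly, skipping validation.

    Tradeoff: malformed input is no longer rejected here, so callers must
    already hold trusted values.
    """
    return f"swh:1:{object_type}:{object_id.hex()}"
```

Both functions return the same string for valid input; the direct version simply trades the constructor's checks for speed, which is only reasonable when the inputs come from an already validated source.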

Diff Detail

Repository
rDDATASET Datasets
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision.

Great. One suggestion inline.

This diff also adds type but it's not mentioned in the description ;)

Also what tool did you use to measure?

swh/dataset/exporters/edges.py
22–25

Maybe it's worth a comment explaining the tradeoff here,
so that no one refactors this back into the previous implementation by mistake.

This revision is now accepted and ready to land. (Aug 6 2021, 2:29 PM)

> Also what tool did you use to measure?

$ pip3 install pyprof2calltree
$ sudo apt install kcachegrind
$ vim swh-dataset/swh/dataset/journalprocessor.py  # to make it single-process and single-thread
$ python3 -m cProfile -o ~/dataset_export.pyprof $(which swh) dataset -C graph.yml graph export /tmp/g --processes=8 --formats=edges
[...]
^C
$ pyprof2calltree -i ~/dataset_export.pyprof -k
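
For anyone without kcachegrind at hand, a profile file written by `cProfile` can also be inspected with the standard-library `pstats` module. The sketch below is only an illustration of that alternative, not what was used for the measurement above: `busy_work` is a placeholder workload and both helper names are made up for the example.

```python
import cProfile
import os
import pstats
import tempfile


def busy_work(n: int) -> int:
    """Placeholder workload standing in for the real export run."""
    return sum(i * i for i in range(n))


def profile_to_file(path: str) -> None:
    """Profile the placeholder workload and dump the raw stats to `path`."""
    profiler = cProfile.Profile()
    profiler.enable()
    busy_work(100_000)
    profiler.disable()
    profiler.dump_stats(path)


def top_functions(path: str, limit: int = 10) -> pstats.Stats:
    """Load a cProfile dump and print the `limit` costliest entries."""
    stats = pstats.Stats(path)
    # "cumulative" sorts by time spent in a function including its callees,
    # which is how a hot spot such as a constructor tends to show up.
    stats.sort_stats("cumulative").print_stats(limit)
    return stats


# Small demonstration using a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pyprof")
profile_to_file(demo_path)
top_functions(demo_path, limit=5)
```

The same `top_functions` call should work on a file produced by `python3 -m cProfile -o <file> ...`, since both write the same stats format.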

explain rationale in a comment