This is a 40 to 70% speed-up of the overall run time (wall clock).
Details
- Reviewers: seirl, douardda
- Group Reviewers: Reviewers
- Commits: rDGRPH424e75a9d0f8: bytes_to_str: Format strings directly, instead of constructing ExtendedSWHID
Diff Detail
- Repository: rDGRPH Compressed graph representation
- Branch: optim
- Lint: No Linters Available
- Unit: No Unit Test Coverage
- Build Status: Buildable 22951
  - Build 35785: Phabricator diff pipeline on jenkins (Jenkins console · Jenkins)
  - Build 35784: arc lint + arc unit
Event Timeline
Build is green
Patch application report for D6073 (id=21987)
Could not rebase; Attempt merge onto a48b5be584...
Updating a48b5be..424e75a
Fast-forward
 swh/graph/server/app.py | 14 ++++++++++++--
 swh/graph/swhid.py      | 13 +++++++++----
 2 files changed, 21 insertions(+), 6 deletions(-)
Changes applied before test
commit 424e75a9d0f888c43c75fa7e9fef8b7d46716514
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Aug 10 12:23:00 2021 +0200
bytes_to_str: Format strings directly, instead of constructing ExtendedSWHID
This is a 40 to 70% speed-up of the overall run time (wall clock).
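For illustration only, here is a minimal sketch of the pattern this commit describes, based on the constructor and f-string shown in the review comments further down; the function names and signatures are hypothetical and are not the actual swh/graph/swhid.py code:

```python
# Import path may differ across swh.model versions.
from swh.model.identifiers import ExtendedObjectType, ExtendedSWHID


def object_id_to_swhid_str_slow(object_type: ExtendedObjectType, object_id: bytes) -> str:
    # Before: build a full ExtendedSWHID object (with validation), then serialize it.
    return str(ExtendedSWHID(object_type=object_type, object_id=object_id))


def object_id_to_swhid_str_fast(object_type: ExtendedObjectType, object_id: bytes) -> str:
    # After: format the SWHID string directly, skipping object construction.
    return f"swh:1:{object_type.value}:{object_id.hex()}"
```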
commit b54ed982e2039e0bca87cbe17dd63aa667db6d40
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Aug 10 12:03:36 2021 +0200
StreamingGraphView: Buffer lines before writing
Most of the time is spent maxing out the CPU in the Python process.
This change has two effects:
1. lines are joined before being encoded (instead of encoding them one-by-one)
2. larger network packets are sent, instead of a single packet per line
I don't know which of the two is responsible, but overall this is
a consistent 25 to 35% speed-up of the overall run time of
SimpleTraversalView.

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/132/ for more details.
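To make the two effects listed in the StreamingGraphView commit message above concrete, here is a hedged sketch of the buffering pattern; the class name, buffer size, and writer callback are illustrative, not the actual swh/graph/server/app.py implementation:

```python
class BufferedLineWriter:
    """Illustrative only: collect lines and flush them in large batches,
    instead of encoding and writing each line as its own packet."""

    def __init__(self, write, buffer_size: int = 1024):
        self._write = write              # async callable accepting bytes
        self._buffer_size = buffer_size  # number of lines per flush
        self._lines = []

    async def write_line(self, line: str) -> None:
        self._lines.append(line)
        if len(self._lines) >= self._buffer_size:
            await self.flush()

    async def flush(self) -> None:
        if not self._lines:
            return
        # Effect 1: join the buffered lines and encode them in one pass.
        data = ("\n".join(self._lines) + "\n").encode()
        self._lines.clear()
        # Effect 2: a single large write, hence fewer and larger network packets.
        await self._write(data)
```

In an aiohttp-style streaming response, write_line would replace per-line write-and-encode calls, with a final flush before closing the stream.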
swh/graph/swhid.py:97

If you want speed, why not also cut the hash_to_hex call and simply use .hex()? A quick-and-dirty test showed a 2x factor between the two on my laptop (just a timeit in ipython, building a list of 1k SWHIDs):

In [21]: %timeit z = [str(ExtendedSWHID(object_type=ExtendedObjectType.REVISION, object_id=v)) for v in h]
6.14 ms ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [22]: %timeit z = [f"swh:1:{ExtendedObjectType.REVISION.value}:{hash_to_hex(v)}" for v in h]
624 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [23]: %timeit z = [f"swh:1:{ExtendedObjectType.REVISION.value}:{v.hex()}" for v in h]
359 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
swh/graph/swhid.py:97

hash_to_hex is cached.
swh/graph/swhid.py:97

Yeah, well, it's currently cached with the default lru_cache maxsize, which is a very small 128, so I'm not sure it's a lifesaver here. And you could just lru_cache this bytes_to_str function :-) Do we have an idea of the average cache-hit ratio when it's used in swh-graph?
swh/graph/swhid.py:97

I don't, but it's probably very low.
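To ground the caching discussion above, here is a sketch of what the reviewer's lru_cache suggestion could look like; the function name, arguments, and maxsize are hypothetical, and as the last comment notes, the benefit hinges on a cache-hit ratio that is probably very low here:

```python
from functools import lru_cache


# functools.lru_cache defaults to maxsize=128, which is far too small when
# millions of distinct object ids flow through; an explicit larger maxsize
# (or maxsize=None for an unbounded cache) would be needed for hits to matter.
@lru_cache(maxsize=1_000_000)
def cached_bytes_to_str(object_type_value: str, object_id: bytes) -> str:
    # Hypothetical cached variant: format the SWHID string directly.
    return f"swh:1:{object_type_value}:{object_id.hex()}"
```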