
bytes_to_str: Format strings directly, instead of constructing ExtendedSWHID
Closed, Public

Authored by vlorentz on Aug 10 2021, 12:23 PM.

Details

Summary

This is a 40 to 70% speed-up of the overall run time (wall clock).

Diff Detail

Repository
rDGRPH Compressed graph representation
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D6073 (id=21987)

Could not rebase; Attempt merge onto a48b5be584...

Updating a48b5be..424e75a
Fast-forward
 swh/graph/server/app.py | 14 ++++++++++++--
 swh/graph/swhid.py      | 13 +++++++++----
 2 files changed, 21 insertions(+), 6 deletions(-)
Changes applied before test
commit 424e75a9d0f888c43c75fa7e9fef8b7d46716514
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Aug 10 12:23:00 2021 +0200

    bytes_to_str: Format strings directly, instead of constructing ExtendedSWHID
    
    This is a 40 to 70% speed-up of the overall run time (wall clock).

commit b54ed982e2039e0bca87cbe17dd63aa667db6d40
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Aug 10 12:03:36 2021 +0200

    StreamingGraphView: Buffer lines before writing
    
    Most of the time is spent maxing out the CPU in the Python process.
    This change has two effects:
    
    1. lines are joined before being encoded (instead of encoding them one-by-one)
    2. larger network packets are sent, instead of a single packet per line
    
    I don't know which affects the performance, but overall, this is
    a consistent 25 to 35% speed-up to the overall run time of
    SimpleTraversalView.

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/132/ for more details.
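The buffering described in the StreamingGraphView commit message above can be sketched roughly as follows. This is only an illustration of the idea (join lines first, encode once, emit larger chunks), not the code from the commit; the function name `buffered_chunks` and the `buffer_size` parameter are hypothetical.

```python
from typing import Iterable, Iterator, List


def buffered_chunks(lines: Iterable[str], buffer_size: int = 1000) -> Iterator[bytes]:
    """Hypothetical helper: group lines and encode each group once.

    Instead of encoding and sending every line separately, lines are
    accumulated, joined, and encoded in a single call per chunk.
    """
    buffer: List[str] = []
    for line in lines:
        buffer.append(line)
        if len(buffer) >= buffer_size:
            # One join + one encode for the whole group of lines.
            yield ("\n".join(buffer) + "\n").encode()
            buffer.clear()
    if buffer:
        # Flush whatever is left at the end of the stream.
        yield ("\n".join(buffer) + "\n").encode()


# Five lines with a buffer of two yield three chunks.
chunks = list(buffered_chunks((f"line{i}" for i in range(5)), buffer_size=2))
```

Each yielded chunk could then be written to the response in one call, which is where the larger network packets would come from.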

douardda added inline comments.
swh/graph/swhid.py
96

If you want speed, why not also drop the hash_to_hex call and simply use .hex()?

A quick test showed roughly a 2x difference between the two on my laptop (just a timeit in IPython, building a list of 1k SWHIDs):

In [21]: %timeit z = [str(ExtendedSWHID(object_type=ExtendedObjectType.REVISION, object_id=v)) for v in h]
6.14 ms ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [22]: %timeit z = [f"swh:1:{ExtendedObjectType.REVISION.value}:{hash_to_hex(v)}" for v in h]
624 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [23]: %timeit z = [f"swh:1:{ExtendedObjectType.REVISION.value}:{v.hex()}" for v in h]
359 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
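For readers without swh.model installed, the third variant above can be reproduced with the standard library alone. This is a minimal sketch: the function name `bytes_to_str_direct` is hypothetical, the literal "rev" is assumed to be the value of ExtendedObjectType.REVISION, and the ids are synthetic 20-byte values rather than real hashes.

```python
import timeit


def bytes_to_str_direct(object_type: str, object_id: bytes) -> str:
    """Hypothetical helper: format a SWHID string without building an object."""
    # bytes.hex() is a builtin method, so no extra function call is needed.
    return f"swh:1:{object_type}:{object_id.hex()}"


# 1k synthetic 20-byte identifiers (the length of a SHA-1 digest).
ids = [i.to_bytes(20, "big") for i in range(1000)]
swhids = [bytes_to_str_direct("rev", v) for v in ids]

# Time 100 runs of building the full list; numbers depend on the machine.
elapsed = timeit.timeit(
    lambda: [bytes_to_str_direct("rev", v) for v in ids], number=100
)
print(f"{elapsed / 100 * 1e6:.0f} µs per 1k SWHIDs")
```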
swh/graph/swhid.py
96

hash_to_hex is cached

swh/graph/swhid.py
96

Yeah, well, it's currently cached with the default lru_cache maxsize, which is a very small 128, so I'm not sure it's a lifesaver here. And you could just lru_cache this bytes_to_str function itself :-)

Do we have an idea of the average cache-hit ratio we have when used in swh-graph?

but anyway, it looks fine to me

This revision is now accepted and ready to land. Aug 13 2021, 9:55 AM
swh/graph/swhid.py
96

I don't, but it's probably very low.
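One way the cache-hit ratio could be measured, if someone wanted an actual number, is functools.lru_cache's cache_info(). The sketch below is an assumption-laden illustration: `cached_bytes_to_str`, `hit_ratio`, the maxsize of 4096, and the synthetic ids are all hypothetical and not taken from swh-graph.

```python
from functools import lru_cache


@lru_cache(maxsize=4096)  # hypothetical size; the default would be 128
def cached_bytes_to_str(object_type: str, object_id: bytes) -> str:
    """Hypothetical cached variant of the direct string formatting."""
    return f"swh:1:{object_type}:{object_id.hex()}"


def hit_ratio() -> float:
    """Fraction of calls served from the cache so far."""
    info = cached_bytes_to_str.cache_info()
    total = info.hits + info.misses
    return info.hits / total if total else 0.0


# Synthetic workload: 100 distinct ids, then 25 of them requested again.
ids = [i.to_bytes(20, "big") for i in range(100)]
for v in ids + ids[:25]:
    cached_bytes_to_str("rev", v)

print(f"hit ratio: {hit_ratio():.2f}")
```

On a real traversal, where most node ids are seen only once, the ratio would be expected to stay low, which matches the guess above.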