Paths

Table of Contentst

Differential D6319

model: optimize dictify()
AbandonedPublic
Actions

Authored by vlorentz on Sep 22 2021, 11:39 AM.

Details

Reviewers

None

Group Reviewers

Reviewers

Summary

reorder conditionals to minimize the number of tests needed
use hasattr() instead of the expensive isinstance()

Motivation: Ironically, this is the bottleneck of my checksumming
script.

Benchmark:

In [1]: from swh.storage import get_storage

In [2]: from swh.model.identifiers import CoreSWHID, _BaseSWHID

In [3]: s = get_storage('remote', url='http://moma.internal.softwareheritage.org:5002/')

In [4]: rev = s.revision_get([bytes.fromhex("747675816d815e86b7482b5a0acb9110eeeec590")])[0]

Before this commit:

In [18]: %timeit rev.to_dict()
10000 loops, best of 5: 70.4 µs per loop

In [19]: %timeit rev.to_dict()
10000 loops, best of 5: 69.3 µs per loop

After this commit:

In [5]: %timeit rev.to_dict()
10000 loops, best of 5: 48.4 µs per loop

In [6]: %timeit rev.to_dict()
10000 loops, best of 5: 45.7 µs per loop

In [7]: %timeit rev.to_dict()
10000 loops, best of 5: 47.5 µs per loop

Unfortunately there isn't much more we can do, 90% of the time is spent
constructing a dict (even when replacing the dictcomp
{k: dictify(v) for k, v in value.items()} with map() + dict()).

Diff Detail

Repository

rDMOD Data model

Branch

master

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 23886
Build 37249: Phabricator diff pipeline on jenkins	Jenkins console · Jenkins
Build 37248: arc lint + arc unit

Event Timeline

vlorentz created this revision.Sep 22 2021, 11:39 AM

Herald added a reviewer: Reviewers. · View Herald TranscriptSep 22 2021, 11:39 AM

Build is green

Patch application report for D6319 (id=22955)

Rebasing onto b6f5e30b53...

Current branch diff-target is up to date.

Changes applied before test

commit 25e0c71dafa7e0e4540adfb5ea492930264f14ba
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 11:33:42 2021 +0200

    model: optimize dictify()
    
    1. reorder conditionals to minimize the number of tests needed
    2. use hasattr() instead of the expensive isinstance()
    
    Motivation: Ironically, this is the bottleneck of my checksumming
    script.
    
    Benchmark:

In [1]: from swh.storage import get_storage

In [2]: from swh.model.identifiers import CoreSWHID, _BaseSWHID

In [3]: s = get_storage('remote', url='http://moma.internal.softwareheritage.org:5002/')

In [4]: rev = s.revision_get([bytes.fromhex("747675816d815e86b7482b5a0acb9110eeeec590")])[0]
```

Before this commit:

```
In [18]: %timeit rev.to_dict()
10000 loops, best of 5: 70.4 µs per loop

In [19]: %timeit rev.to_dict()
10000 loops, best of 5: 69.3 µs per loop
```

After this commit:

```
In [5]: %timeit rev.to_dict()
10000 loops, best of 5: 48.4 µs per loop

In [6]: %timeit rev.to_dict()
10000 loops, best of 5: 45.7 µs per loop

In [7]: %timeit rev.to_dict()
10000 loops, best of 5: 47.5 µs per loop
```

Unfortunately there isn't much more we can do, 90% of the time is spent
constructing a dict (even when replacing the dictcomp
`{k: dictify(v) for k, v in value.items()}` with `map()` + `dict()`).

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/370/ for more details.

Harbormaster completed remote builds in B23886: Diff 22955.Sep 22 2021, 11:41 AM

vlorentz requested review of this revision.Sep 22 2021, 11:41 AM

On second thought, this is a bad idea. Let's solve the bottleneck by computing identifiers directly using model classes instead of having to dictify them.

vlorentz abandoned this revision.Sep 22 2021, 2:02 PM

Revision Contents
Changeset List

Path

Size

Coverage (All)

Coverage (Touched)

swh/

model/

model.py

18 lines

97%

Diff 22955

View Options

model: optimize dictify()AbandonedPublicActions