
model: optimize dictify()
Abandoned · Public

Authored by vlorentz on Sep 22 2021, 11:39 AM.

Details

Reviewers
None
Group Reviewers
Reviewers
Summary
  1. reorder conditionals to minimize the number of tests needed
  2. use hasattr() instead of the expensive isinstance()

Motivation: Ironically, this is the bottleneck of my checksumming
script.
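The two changes can be sketched as follows (a hypothetical stand-in, not the actual swh.model code): the common case, an object exposing `to_dict()`, is detected first with a cheap `hasattr()` call instead of an `isinstance()` check against the model base class, and the container types are only tested afterwards.

```python
def dictify(value):
    # Duck-typed check first: cheaper than isinstance() against the
    # model base class, and it short-circuits the common case.
    if hasattr(value, "to_dict"):
        return value.to_dict()
    elif isinstance(value, dict):
        return {k: dictify(v) for k, v in value.items()}
    elif isinstance(value, (list, tuple)):
        return [dictify(v) for v in value]
    else:
        # Scalars (bytes, str, int, None, ...) pass through unchanged.
        return value


class Person:
    # Hypothetical model-like object, standing in for swh.model classes.
    def __init__(self, name):
        self.name = name

    def to_dict(self):
        return {"name": dictify(self.name)}


dictify({"author": Person(b"jdoe"), "parents": (b"\x01", b"\x02")})
# → {"author": {"name": b"jdoe"}, "parents": [b"\x01", b"\x02"]}
```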

Benchmark:

```
In [1]: from swh.storage import get_storage

In [2]: from swh.model.identifiers import CoreSWHID, _BaseSWHID

In [3]: s = get_storage('remote', url='http://moma.internal.softwareheritage.org:5002/')

In [4]: rev = s.revision_get([bytes.fromhex("747675816d815e86b7482b5a0acb9110eeeec590")])[0]
```

Before this commit:

```
In [18]: %timeit rev.to_dict()
10000 loops, best of 5: 70.4 µs per loop

In [19]: %timeit rev.to_dict()
10000 loops, best of 5: 69.3 µs per loop
```

After this commit:

```
In [5]: %timeit rev.to_dict()
10000 loops, best of 5: 48.4 µs per loop

In [6]: %timeit rev.to_dict()
10000 loops, best of 5: 45.7 µs per loop

In [7]: %timeit rev.to_dict()
10000 loops, best of 5: 47.5 µs per loop
```
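The same measurement can be reproduced outside IPython with the stdlib `timeit` module; a self-contained sketch (the stand-in `Rev` below replaces the revision fetched from storage above):

```python
import timeit


class Rev:
    # Stand-in for the revision object obtained via s.revision_get(...).
    def to_dict(self):
        return {"id": b"\xaa" * 20, "message": b"msg", "parents": []}


rev = Rev()
# Equivalent of IPython's %timeit: best of 5 runs of 10000 loops each.
best = min(timeit.repeat(rev.to_dict, number=10_000, repeat=5)) / 10_000
print(f"best of 5: {best * 1e6:.1f} µs per loop")
```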

Unfortunately there isn't much more we can do: 90% of the time is spent
constructing the dict (even when replacing the dictcomp
`{k: dictify(v) for k, v in value.items()}` with `map()` + `dict()`).
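For reference, the `map()`-based variant mentioned above builds the same dict without a comprehension; as noted, it benchmarks no faster, since the dominant cost is the dict construction itself (`dictify` here is a trivial stand-in for the recursive helper):

```python
def dictify(v):
    return v  # trivial stand-in for the recursive helper


value = {"id": b"\xaa" * 20, "message": b"fix things"}

# Comprehension form used in the code:
d1 = {k: dictify(v) for k, v in value.items()}
# map() + dict() form, which performs no better:
d2 = dict(map(lambda item: (item[0], dictify(item[1])), value.items()))

print(d1 == d2)  # → True
```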

Diff Detail

Repository
rDMOD Data model
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 23886
Build 37249: Phabricator diff pipeline on Jenkins · Jenkins console
Build 37248: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6319 (id=22955)

Rebasing onto b6f5e30b53...

Current branch diff-target is up to date.
Changes applied before test
commit 25e0c71dafa7e0e4540adfb5ea492930264f14ba
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 11:33:42 2021 +0200

    model: optimize dictify()
    
    1. reorder conditionals to minimize the number of tests needed
    2. use hasattr() instead of the expensive isinstance()
    
    Motivation: Ironically, this is the bottleneck of my checksumming
    script.
    
    Benchmark:
```
In [1]: from swh.storage import get_storage

In [2]: from swh.model.identifiers import CoreSWHID, _BaseSWHID

In [3]: s = get_storage('remote', url='http://moma.internal.softwareheritage.org:5002/')

In [4]: rev = s.revision_get([bytes.fromhex("747675816d815e86b7482b5a0acb9110eeeec590")])[0]
```

Before this commit:

```
In [18]: %timeit rev.to_dict()
10000 loops, best of 5: 70.4 µs per loop

In [19]: %timeit rev.to_dict()
10000 loops, best of 5: 69.3 µs per loop
```

After this commit:

```
In [5]: %timeit rev.to_dict()
10000 loops, best of 5: 48.4 µs per loop

In [6]: %timeit rev.to_dict()
10000 loops, best of 5: 45.7 µs per loop

In [7]: %timeit rev.to_dict()
10000 loops, best of 5: 47.5 µs per loop
```

Unfortunately there isn't much more we can do: 90% of the time is spent
constructing the dict (even when replacing the dictcomp
`{k: dictify(v) for k, v in value.items()}` with `map()` + `dict()`).
See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/370/ for more details.

On second thought, this is a bad idea. Let's solve the bottleneck by computing identifiers directly using model classes instead of having to dictify them.