Before this change we would do the following : 1) translate from_disk's object into `dict`, 2) sort these dict, 3) feed the list to `Directory.from_dict`, 4) create DirectoryEntry from these dict. Skipping the directory creating and directly creating the DirectoryEntries provide us with a small but stable and noticeable performance win. We tested this change on simple information of the Mercurial loader, with a noop-loader stockage: swh loader run mercurial https://foss.heptapod.net/mercurial/mercurial-devel directory=/data/repos/mercurial-devel = Median time of 3 run = before: 11 minute 56 seconds aftere: 11 minute 50 seconds On a profile of the same run, the `to_model` call of the from_disk's `Directory` class took the following percentage: before: 17% after: 15%
Details
- Reviewers
olasd vlorentz - Group Reviewers
Reviewers - Maniphest Tasks
- T4595: Improve common tooling for loading
- Commits
- rDMODa2e8f18c2a68: from_disk: skip intermediate dictionnary creation when building model
Run tox, profiling and timing
Diff Detail
- Repository
- rDMOD Data model
- Branch
- D8525
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 31704 Build 49603: Phabricator diff pipeline on jenkins Jenkins console · Jenkins Build 49602: arc lint + arc unit
Event Timeline
Build is green
Patch application report for D8525 (id=30706)
Could not rebase; Attempt merge onto 9ce6feb9d6...
Updating 9ce6feb..ad7774b Fast-forward CONTRIBUTORS | 1 + swh/model/from_disk.py | 33 ++++++++++++++++++++++++++++++--- swh/model/git_objects.py | 14 +++++++++----- 3 files changed, 40 insertions(+), 8 deletions(-)
Changes applied before test
commit ad7774bec2b48af2ee74d59e34dc6de02751896a Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Thu Sep 22 18:24:46 2022 +0200 from_disk: skip intermediate dictionnary creation when building model This should speed things up. commit 41500db7ea8856f75812f4fe8d8a66c6fdc43a82 Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Thu Sep 22 18:12:43 2022 +0200 model: avoid another extra creation of Model object Do not create model object while sorting entry before creating model object. commit 36185705f1805465d565b00b0df917b0e4f1c8e4 Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Tue Sep 20 14:26:17 2022 +0200 from_disk: only build a model object once Before this change, a Directory object was built to compute the `id` of we fed to the Directory object we built for `to_model`.
See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/511/ for more details.
Build is green
Patch application report for D8525 (id=30714)
Could not rebase; Attempt merge onto 9ce6feb9d6...
Updating 9ce6feb..ad7774b Fast-forward CONTRIBUTORS | 1 + swh/model/from_disk.py | 33 ++++++++++++++++++++++++++++++--- swh/model/git_objects.py | 14 +++++++++----- 3 files changed, 40 insertions(+), 8 deletions(-)
Changes applied before test
commit ad7774bec2b48af2ee74d59e34dc6de02751896a Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Thu Sep 22 18:24:46 2022 +0200 from_disk: skip intermediate dictionnary creation when building model This should speed things up. commit 41500db7ea8856f75812f4fe8d8a66c6fdc43a82 Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Thu Sep 22 18:12:43 2022 +0200 model: avoid another extra creation of Model object Do not create model object while sorting entry before creating model object. commit 36185705f1805465d565b00b0df917b0e4f1c8e4 Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Tue Sep 20 14:26:17 2022 +0200 from_disk: only build a model object once Before this change, a Directory object was built to compute the `id` of we fed to the Directory object we built for `to_model`.
See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/516/ for more details.
Build is green
Patch application report for D8525 (id=30722)
Could not rebase; Attempt merge onto 9ce6feb9d6...
Updating 9ce6feb..ad7774b Fast-forward CONTRIBUTORS | 1 + swh/model/from_disk.py | 33 ++++++++++++++++++++++++++++++--- swh/model/git_objects.py | 14 +++++++++----- 3 files changed, 40 insertions(+), 8 deletions(-)
Changes applied before test
commit ad7774bec2b48af2ee74d59e34dc6de02751896a Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Thu Sep 22 18:24:46 2022 +0200 from_disk: skip intermediate dictionnary creation when building model This should speed things up. commit 41500db7ea8856f75812f4fe8d8a66c6fdc43a82 Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Thu Sep 22 18:12:43 2022 +0200 model: avoid another extra creation of Model object Do not create model object while sorting entry before creating model object. commit 36185705f1805465d565b00b0df917b0e4f1c8e4 Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Tue Sep 20 14:26:17 2022 +0200 from_disk: only build a model object once Before this change, a Directory object was built to compute the `id` of we fed to the Directory object we built for `to_model`.
See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/520/ for more details.
I am in the process to adding more timing/profile data to the commit message (then to fight to get phabricator to display them)
Build is green
Patch application report for D8525 (id=30727)
Could not rebase; Attempt merge onto 9ce6feb9d6...
Updating 9ce6feb..a2e8f18 Fast-forward CONTRIBUTORS | 1 + swh/model/from_disk.py | 33 ++++++++++++++++++++++++++++++--- swh/model/git_objects.py | 14 +++++++++----- 3 files changed, 40 insertions(+), 8 deletions(-)
Changes applied before test
commit a2e8f18c2a685c7d4aad53345c77a7177190854e Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Thu Sep 22 18:24:46 2022 +0200 from_disk: skip intermediate dictionnary creation when building model Before this change we would do the following : 1) translate from_disk's object into `dict`, 2) sort these dict, 3) feed the list to `Directory.from_dict`, 4) create DirectoryEntry from these dict. Skipping the directory creating and directly creating the DirectoryEntries provide us with a small but stable and noticeable performance win. We tested this change on simple information of the Mercurial loader, with a noop-loader stockage: swh loader run mercurial https://foss.heptapod.net/mercurial/mercurial-devel directory=/data/repos/mercurial-devel = Median time of 3 run = before: 11 minute 56 seconds aftere: 11 minute 50 seconds On a profile of the same run, the `to_model` call of the from_disk's `Directory` class took the following percentage: before: 17% after: 15% commit ad3ecac9beae015b9e9144c1da180097958b2827 Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Thu Sep 22 18:12:43 2022 +0200 model: avoid another extra creation of Model object Do not create model object while sorting entry before creating model object. This is another case of "let us create object X to prepare the creation of object X", slowing things down. In practice, we will likely skip this code-path after the next changeset, however this seems useful to get this performance footgun out the way. We tested this change on simple information of the Mercurial loader, with a noop-loader stockage: swh loader run mercurial https://foss.heptapod.net/mercurial/mercurial-devel directory=/data/repos/mercurial-devel = Median time of 3 run = before 12 minutes 59 seconds after: 11 minute 56 seconds On a profile of the same run, the `to_model` call of the from_disk's `Directory` class took the following percentage: before: 24% after: 17% commit 814a6c8416d56f5f8b3e590d419d5aea7a888ab2 Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Tue Sep 20 14:26:17 2022 +0200 from_disk: only build a model object once Before this change, a Directory object was built to compute the `id` of we fed to the Directory object we built for `to_model`. We tested this change on simple information of the Mercurial loader, with a noop-loader stockage: swh loader run mercurial https://foss.heptapod.net/mercurial/mercurial-devel directory=/data/repos/mercurial-devel = Median time of 3 run = before: 17 minutes 48 seconds after: 12 minutes 59 seconds On a profile of the same run, the `to_model` call of the from_disk's `Directory` class took the following percentage: before: 43% after: 24%
See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/524/ for more details.
You are planning to merge this new code with child_to_directory_entry's in a future diff, right?
Not really. child_to_directory_entry take a single from_disk's child and turn it into a dictionary. Here we are processing the list of child directly to model's DirectoryEntry
Build is green
Patch application report for D8525 (id=30734)
Could not rebase; Attempt merge onto 9ce6feb9d6...
Updating 9ce6feb..a2e8f18 Fast-forward CONTRIBUTORS | 1 + swh/model/from_disk.py | 33 ++++++++++++++++++++++++++++++--- swh/model/git_objects.py | 14 +++++++++----- 3 files changed, 40 insertions(+), 8 deletions(-)
Changes applied before test
commit a2e8f18c2a685c7d4aad53345c77a7177190854e Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Thu Sep 22 18:24:46 2022 +0200 from_disk: skip intermediate dictionnary creation when building model Before this change we would do the following : 1) translate from_disk's object into `dict`, 2) sort these dict, 3) feed the list to `Directory.from_dict`, 4) create DirectoryEntry from these dict. Skipping the directory creating and directly creating the DirectoryEntries provide us with a small but stable and noticeable performance win. We tested this change on simple information of the Mercurial loader, with a noop-loader stockage: swh loader run mercurial https://foss.heptapod.net/mercurial/mercurial-devel directory=/data/repos/mercurial-devel = Median time of 3 run = before: 11 minute 56 seconds aftere: 11 minute 50 seconds On a profile of the same run, the `to_model` call of the from_disk's `Directory` class took the following percentage: before: 17% after: 15% commit ad3ecac9beae015b9e9144c1da180097958b2827 Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Thu Sep 22 18:12:43 2022 +0200 model: avoid another extra creation of Model object Do not create model object while sorting entry before creating model object. This is another case of "let us create object X to prepare the creation of object X", slowing things down. In practice, we will likely skip this code-path after the next changeset, however this seems useful to get this performance footgun out the way. We tested this change on simple information of the Mercurial loader, with a noop-loader stockage: swh loader run mercurial https://foss.heptapod.net/mercurial/mercurial-devel directory=/data/repos/mercurial-devel = Median time of 3 run = before 12 minutes 59 seconds after: 11 minute 56 seconds On a profile of the same run, the `to_model` call of the from_disk's `Directory` class took the following percentage: before: 24% after: 17% commit 814a6c8416d56f5f8b3e590d419d5aea7a888ab2 Author: Pierre-Yves David <pierre-yves.david@ens-lyon.org> Date: Tue Sep 20 14:26:17 2022 +0200 from_disk: only build a model object once Before this change, a Directory object was built to compute the `id` of we fed to the Directory object we built for `to_model`. We tested this change on simple information of the Mercurial loader, with a noop-loader stockage: swh loader run mercurial https://foss.heptapod.net/mercurial/mercurial-devel directory=/data/repos/mercurial-devel = Median time of 3 run = before: 17 minutes 48 seconds after: 12 minutes 59 seconds On a profile of the same run, the `to_model` call of the from_disk's `Directory` class took the following percentage: before: 43% after: 24%
See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/528/ for more details.