Page MenuHomeSoftware Heritage

Add fast-path to the bzr loader
ClosedPublic

Authored by Alphare on Feb 9 2022, 2:23 PM.

Details

Summary

Computing a delta from the currently stored snapshot for each revision
is much more efficient than always reading every single file.

We intentionally don't handle the cases where renames or copies are
involved. The complexity of the logic needed to figure out nested or
otherwise conflicting results coupled with the lack of nice primitives
in Breezy itself is really too much to be worth the effort, as revisions
with renames are far from being the majority.

Loading lp:glamour on my laptop before this change takes 59 minutes,
less than 3 minutes with this change.

Diff Detail

Repository
rDLDBZR BZR loader
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 26750
Build 41826: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 41825: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D7130 (id=25849)

Rebasing onto 078fb50296...

Current branch diff-target is up to date.
Changes applied before test
commit 22056d5a9e14e0e65986a08a540b28ef47daaff0
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed Feb 9 12:37:33 2022 +0100

    Add fast-path to the bzr loader
    
    Computing a delta from the currently stored snapshot for each revision
    is much more efficient than always reading every single file.
    
    We intentionally don't handle the cases where renames or copies are
    involved. The complexity of the logic needed to figure out nested or
    otherwise conflicting results coupled with the lack of nice primitives
    in Breezy itself is really too much to be worth the effort, as revisions
    with renames are far from being the majority.
    
    Loading `lp:glamour` on my laptop before this change takes 59 minutes,
    less than 3 minutes with this change.

Link to build: https://jenkins.softwareheritage.org/job/DLDBZR/job/tests-on-diff/29/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBZR/job/tests-on-diff/29/console

Harbormaster returned this revision to the author for changes because remote builds failed.Feb 9 2022, 2:28 PM
Harbormaster failed remote builds in B26691: Diff 25849!

Build is green

Patch application report for D7130 (id=25852)

Rebasing onto 078fb50296...

Current branch diff-target is up to date.
Changes applied before test
commit 407bc08b550ef8c880fefc232247fbcefd1ebfbf
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed Feb 9 12:37:33 2022 +0100

    Add fast-path to the bzr loader
    
    Computing a delta from the currently stored snapshot for each revision
    is much more efficient than always reading every single file.
    
    We intentionally don't handle the cases where renames or copies are
    involved. The complexity of the logic needed to figure out nested or
    otherwise conflicting results coupled with the lack of nice primitives
    in Breezy itself is really too much to be worth the effort, as revisions
    with renames are far from being the majority.
    
    Loading `lp:glamour` on my laptop before this change takes 59 minutes,
    less than 3 minutes with this change.

See https://jenkins.softwareheritage.org/job/DLDBZR/job/tests-on-diff/31/ for more details.

ardumont added 1 blocking reviewer(s): Reviewers.
ardumont added a subscriber: ardumont.

lgtm, it'd be good if another would also have a look.

One typing suggestion.

swh/loader/bzr/loader.py
133

or bytes, not sure so you tell me ;)

Build is green

Patch application report for D7130 (id=25864)

Rebasing onto 080db96f11...

Current branch diff-target is up to date.
Changes applied before test
commit 25cb8dcbc0b3389dea2a7b1c9208ebeedf2e7be8
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed Feb 9 12:37:33 2022 +0100

    Add fast-path to the bzr loader
    
    Computing a delta from the currently stored snapshot for each revision
    is much more efficient than always reading every single file.
    
    We intentionally don't handle the cases where renames or copies are
    involved. The complexity of the logic needed to figure out nested or
    otherwise conflicting results coupled with the lack of nice primitives
    in Breezy itself is really too much to be worth the effort, as revisions
    with renames are far from being the majority.
    
    Loading `lp:glamour` on my laptop before this change takes 59 minutes,
    less than 3 minutes with this change.

See https://jenkins.softwareheritage.org/job/DLDBZR/job/tests-on-diff/34/ for more details.

Build is green

Patch application report for D7130 (id=25866)

Rebasing onto 080db96f11...

Current branch diff-target is up to date.
Changes applied before test
commit 96fb8e773e1d0e4aa9a41b007436ec774d031b82
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed Feb 9 12:37:33 2022 +0100

    Add fast-path to the bzr loader
    
    Computing a delta from the currently stored snapshot for each revision
    is much more efficient than always reading every single file.
    
    We intentionally don't handle the cases where renames or copies are
    involved. The complexity of the logic needed to figure out nested or
    otherwise conflicting results coupled with the lack of nice primitives
    in Breezy itself is really too much to be worth the effort, as revisions
    with renames are far from being the majority.
    
    Loading `lp:glamour` on my laptop before this change takes 59 minutes,
    less than 3 minutes with this change.

See https://jenkins.softwareheritage.org/job/DLDBZR/job/tests-on-diff/35/ for more details.

vlorentz added inline comments.
swh/loader/bzr/loader.py
145–148

this seems to be equivalent (assuming source_path cannot be the empty string).

Or do I misunderstand this completely?

462

are these lists typically long? if yes, try benchmarking this:

for change in itertools.chain(delta.kind_changed, delta.added, delta.modified):

to avoid two copies of delta.kind_changed and delta.added, and one copy of delta.modified.

swh/loader/bzr/loader.py
145–148

There used to be more a complex logic due to renames before, but I think you're correct, this is now overly complicated.

462

I don't think these changes are *ever* more that a few thousand entries in rare cases. But why not take this change anyway, it does not make the code harder to understand and is faster. :)

Simplify sorting, use itertools.chain and limit the lru_cache

Build is green

Patch application report for D7130 (id=25891)

Rebasing onto 5077af52cb...

First, rewinding head to replay your work on top of it...
Applying: Add fast-path to the bzr loader
Changes applied before test
commit d32d9d308075931488721885db3d4413d741a0aa
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed Feb 9 12:37:33 2022 +0100

    Add fast-path to the bzr loader
    
    Computing a delta from the currently stored snapshot for each revision
    is much more efficient than always reading every single file.
    
    We intentionally don't handle the cases where renames or copies are
    involved. The complexity of the logic needed to figure out nested or
    otherwise conflicting results coupled with the lack of nice primitives
    in Breezy itself is really too much to be worth the effort, as revisions
    with renames are far from being the majority.
    
    Loading `lp:glamour` on my laptop before this change takes 59 minutes,
    less than 3 minutes with this change.

See https://jenkins.softwareheritage.org/job/DLDBZR/job/tests-on-diff/38/ for more details.

Add special case for empty snapshots

This revision is now accepted and ready to land.Feb 10 2022, 3:11 PM

Build is green

Patch application report for D7130 (id=25903)

Rebasing onto 5077af52cb...

First, rewinding head to replay your work on top of it...
Applying: Add fast-path to the bzr loader
Changes applied before test
commit 997b8759561d00704a693b360c9881fc98b42629
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed Feb 9 12:37:33 2022 +0100

    Add fast-path to the bzr loader
    
    Computing a delta from the currently stored snapshot for each revision
    is much more efficient than always reading every single file.
    
    We intentionally don't handle the cases where renames or copies are
    involved. The complexity of the logic needed to figure out nested or
    otherwise conflicting results coupled with the lack of nice primitives
    in Breezy itself is really too much to be worth the effort, as revisions
    with renames are far from being the majority.
    
    Loading `lp:glamour` on my laptop before this change takes 59 minutes,
    less than 3 minutes with this change.

See https://jenkins.softwareheritage.org/job/DLDBZR/job/tests-on-diff/39/ for more details.

Build is green

Patch application report for D7130 (id=25906)

Rebasing onto 5077af52cb...

Current branch diff-target is up to date.
Changes applied before test
commit 0ffc124709cf63bfc279e64530d6c3d7b823f55e
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed Feb 9 12:37:33 2022 +0100

    Add fast-path to the bzr loader
    
    Computing a delta from the currently stored snapshot for each revision
    is much more efficient than always reading every single file.
    
    We intentionally don't handle the cases where renames or copies are
    involved. The complexity of the logic needed to figure out nested or
    otherwise conflicting results coupled with the lack of nice primitives
    in Breezy itself is really too much to be worth the effort, as revisions
    with renames are far from being the majority.
    
    Loading `lp:glamour` on my laptop before this change takes 59 minutes,
    less than 3 minutes with this change.

See https://jenkins.softwareheritage.org/job/DLDBZR/job/tests-on-diff/40/ for more details.

This revision was automatically updated to reflect the committed changes.