Details

Reviewers

None

Group Reviewers

Commits

rDLDCVSc23d42501546: loader: Yield only modified objects in process_cvs_changesets
rDLDCVSb976aa6a1f80: loader: Reconstruct repo filesystem incrementally at each revision

Summary

That diff contains two commits that should greatly improve the loading
of large CVS repositories in terms of performance.

loader: Reconstruct repo filesystem incrementally at each revision

Instead of creating a from_disk.Directory instance after each replayed
CVS revision by recursively scanning all directories of the repository,
prefer to have a single one as class member kept synchronized with the
recontructed filesystem after each revision replay.

This should improve loader in terms of performance, especially when
delaing with large repositories.

loader: Yield only modified objects in process_cvs_changesets

Previously, after each revision replay all files and directories of the
CVS repository being loaded were collected and sent to the storage.
This is a real bottleneck in terms of loading performances as it delegates
the filtering of new objects to archive to the storage filtering proxy.

As we known exactly the set of paths that have been modified in a CVS
revision, prefer to do that filtering on the loader side and only
send modified objects to storage instead of the whole set of contents
and directories from the reconstructed filesystem.

This should greatly improve loading performance for large repositories
but also reduce loader memory consumption.

Diff Detail

Repository

rDLDCVS CVS Loader

Branch

cvs-loader-performance-optimisation

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 32310
Build 50610: Phabricator diff pipeline on jenkins	Jenkins console · Jenkins
Build 50609: arc lint + arc unit

Event Timeline

anlambert created this revision.Oct 13 2022, 7:33 PM

Herald added a reviewer: Reviewers. · View Herald TranscriptOct 13 2022, 7:33 PM

Build is green

Patch application report for D8682 (id=31355)

Rebasing onto 965c3de498...

Current branch diff-target is up to date.

Changes applied before test

commit 633a9f8e73851cd3e8f9f13c58c43d1f31dd6a24
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 18:00:37 2022 +0200

    loader: Yield only modified objects in process_cvs_changesets
    
    Previously, after each revision replay all files and directories of the
    CVS repository being loaded were collected and sent to the storage.
    This is a real bottleneck in terms of loading performances as it delegates
    the filtering of new objects to archive to the storage filtering proxy.
    
    As we known exactly the set of paths that have been modified in a CVS
    revision, prefer to do that filtering on the loader side and only
    send modified objects to storage instead of the whole set of contents
    and directories from the reconstructed filesystem.
    
    This should greatly improve loading performance for large repositories
    but also reduce loader memory consumption.

commit 76a19ee665b39e6ec31399d1c814b95264b26912
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 17:30:51 2022 +0200

    loader: Reconstruct repo filesystem incrementally at each revision
    
    Instead of creating a from_disk.Directory instance after each replayed
    CVS revision by recursively scanning all directories of the repository,
    prefer to have a single one as class member kept synchronized with the
    recontructed filesystem after each revision replay.
    
    This should improve loader in terms of performance, especially when
    delaing with large repositories.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/131/ for more details.

Harbormaster completed remote builds in B32303: Diff 31355.Oct 13 2022, 7:35 PM

anlambert requested review of this revision.Oct 13 2022, 7:35 PM

anlambert mentioned this in T4625: staging: ingest netbsd.org cvs forge.Oct 13 2022, 7:36 PM

swh.model.from_disk.Directory has a collect method which is supposed to do the change tracking by itself (it only returns the nodes that have changed since the last time .collect() was called). This should allow you to drop the modified_paths tracking altogether.

In D8682#226117, @olasd wrote:

swh.model.from_disk.Directory has a collect method which is supposed to do the change tracking by itself (it only returns the nodes that have changed since the last time .collect() was called). This should allow you to drop the modified_paths tracking altogether.

Ah, collect uses get_data which yields a bunch of dicts. Meh. That should probably be updated to just yield the nodes themselves.

In D8682#226118, @olasd wrote:

In D8682#226117, @olasd wrote:

swh.model.from_disk.Directory has a collect method which is supposed to do the change tracking by itself (it only returns the nodes that have changed since the last time .collect() was called). This should allow you to drop the modified_paths tracking altogether.

Ah, collect uses get_data which yields a bunch of dicts. Meh. That should probably be updated to just yield the nodes themselves.

Oh nice, I did not know we already have such feature in swh-model. I will try to use it and adapt implementation if needed.

In D8682#226119, @anlambert wrote:

In D8682#226118, @olasd wrote:

In D8682#226117, @olasd wrote:

swh.model.from_disk.Directory has a collect method which is supposed to do the change tracking by itself (it only returns the nodes that have changed since the last time .collect() was called). This should allow you to drop the modified_paths tracking altogether.

Ah, collect uses get_data which yields a bunch of dicts. Meh. That should probably be updated to just yield the nodes themselves.

Oh nice, I did not know we already have such feature in swh-model. I will try to use it and adapt implementation if needed.

So I gave a shot using the MerkleNode.collect method in CVS loader and it works like a charm.

I did the adaptation for the collect method to return a set of MerkleNode instead in D8686.

Use from_disk.Directory.collect to get added/modified objects instead of maintaining a set of paths.

Build has FAILED

Patch application report for D8682 (id=31362)

Rebasing onto 965c3de498...

Current branch diff-target is up to date.

Changes applied before test

commit b47790a2c8260e5b4e1c3ef8981a76db6563c139
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 18:00:37 2022 +0200

    loader: Yield only modified objects in process_cvs_changesets
    
    Previously, after each revision replay all files and directories of the
    CVS repository being loaded were collected and sent to the storage.
    This is a real bottleneck in terms of loading performances as it delegates
    the filtering of new objects to archive to the storage filtering proxy.
    
    As we known exactly the set of paths that have been modified in a CVS
    revision, prefer to do that filtering on the loader side and only
    send modified objects to storage instead of the whole set of contents
    and directories from the reconstructed filesystem.
    
    This should greatly improve loading performance for large repositories
    but also reduce loader memory consumption.

commit 76a19ee665b39e6ec31399d1c814b95264b26912
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 17:30:51 2022 +0200

    loader: Reconstruct repo filesystem incrementally at each revision
    
    Instead of creating a from_disk.Directory instance after each replayed
    CVS revision by recursively scanning all directories of the repository,
    prefer to have a single one as class member kept synchronized with the
    recontructed filesystem after each revision replay.
    
    This should improve loader in terms of performance, especially when
    delaing with large repositories.

Link to build: https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/134/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/134/console

Harbormaster failed remote builds in B32310: Diff 31362!Oct 14 2022, 4:26 PM

In D8682#226178, @swh-public-ci wrote:

Build has FAILED

Patch application report for D8682 (id=31362)

Rebasing onto 965c3de498...

Current branch diff-target is up to date.

Changes applied before test

commit b47790a2c8260e5b4e1c3ef8981a76db6563c139
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 18:00:37 2022 +0200

    loader: Yield only modified objects in process_cvs_changesets
    
    Previously, after each revision replay all files and directories of the
    CVS repository being loaded were collected and sent to the storage.
    This is a real bottleneck in terms of loading performances as it delegates
    the filtering of new objects to archive to the storage filtering proxy.
    
    As we known exactly the set of paths that have been modified in a CVS
    revision, prefer to do that filtering on the loader side and only
    send modified objects to storage instead of the whole set of contents
    and directories from the reconstructed filesystem.
    
    This should greatly improve loading performance for large repositories
    but also reduce loader memory consumption.

commit 76a19ee665b39e6ec31399d1c814b95264b26912
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 17:30:51 2022 +0200

    loader: Reconstruct repo filesystem incrementally at each revision
    
    Instead of creating a from_disk.Directory instance after each replayed
    CVS revision by recursively scanning all directories of the repository,
    prefer to have a single one as class member kept synchronized with the
    recontructed filesystem after each revision replay.
    
    This should improve loader in terms of performance, especially when
    delaing with large repositories.

Build will fail until D8682 gets landed.

That's a surprisingly small diff for such a change, nice!

What speedup do you get with this?

swh/loader/cvs/loader.py
322	`else: assert False, obj_type` just to be safe

In D8682#226232, @vlorentz wrote:

That's a surprisingly small diff for such a change, nice!

What speedup do you get with this?

Speedup will depend on the number of directories and files stored in the repository, the bigger it is the faster the loading will.

But we can also notice a small speedup for lightweight repos, for instance after three consecutive loading of rsync://a.cvs.sourceforge.net/cvsroot/titano/titano,
I got the following mean loading time:

before these changes: 29,770945431 seconds
after these changes: 22,8639589363 seconds

Bump swh.model

Build is green

Patch application report for D8682 (id=31394)

Rebasing onto 734207ba58...

Current branch diff-target is up to date.

Changes applied before test

commit e918d3ef80948fb07d295ff7adff7b39dc526e14
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 18:00:37 2022 +0200

    loader: Yield only modified objects in process_cvs_changesets
    
    Previously, after each revision replay all files and directories of the
    CVS repository being loaded were collected and sent to the storage.
    This is a real bottleneck in terms of loading performances as it delegates
    the filtering of new objects to archive to the storage filtering proxy.
    
    As we known exactly the set of paths that have been modified in a CVS
    revision, prefer to do that filtering on the loader side and only
    send modified objects to storage instead of the whole set of contents
    and directories from the reconstructed filesystem.
    
    This should greatly improve loading performance for large repositories
    but also reduce loader memory consumption.

commit b976aa6a1f805d8c36bacefaed0273d3e991956d
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 17:30:51 2022 +0200

    loader: Reconstruct repo filesystem incrementally at each revision
    
    Instead of creating a from_disk.Directory instance after each replayed
    CVS revision by recursively scanning all directories of the repository,
    prefer to have a single one as class member kept synchronized with the
    recontructed filesystem after each revision replay.
    
    This should improve loader in terms of performance, especially when
    delaing with large repositories.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/136/ for more details.

Harbormaster completed remote builds in B32341: Diff 31394.Oct 17 2022, 7:15 PM

Add assert fallback

Build is green

Patch application report for D8682 (id=31395)

Rebasing onto 734207ba58...

Current branch diff-target is up to date.

Changes applied before test

commit c23d425015467ebbdc143227d8320ba39bebb058
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 18:00:37 2022 +0200

    loader: Yield only modified objects in process_cvs_changesets
    
    Previously, after each revision replay all files and directories of the
    CVS repository being loaded were collected and sent to the storage.
    This is a real bottleneck in terms of loading performances as it delegates
    the filtering of new objects to archive to the storage filtering proxy.
    
    As we known exactly the set of paths that have been modified in a CVS
    revision, prefer to do that filtering on the loader side and only
    send modified objects to storage instead of the whole set of contents
    and directories from the reconstructed filesystem.
    
    This should greatly improve loading performance for large repositories
    but also reduce loader memory consumption.

commit b976aa6a1f805d8c36bacefaed0273d3e991956d
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 17:30:51 2022 +0200

    loader: Reconstruct repo filesystem incrementally at each revision
    
    Instead of creating a from_disk.Directory instance after each replayed
    CVS revision by recursively scanning all directories of the repository,
    prefer to have a single one as class member kept synchronized with the
    recontructed filesystem after each revision replay.
    
    This should improve loader in terms of performance, especially when
    delaing with large repositories.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/137/ for more details.

Harbormaster completed remote builds in B32342: Diff 31395.Oct 17 2022, 7:21 PM

This revision was not accepted when it landed; it landed in state Needs Review.Oct 17 2022, 7:26 PM

Closed by commit rDLDCVSb976aa6a1f80: loader: Reconstruct repo filesystem incrementally at each revision (authored by anlambert). · Explain Why

This revision was automatically updated to reflect the committed changes.

anlambert added a commit: rDLDCVSb976aa6a1f80: loader: Reconstruct repo filesystem incrementally at each revision.

anlambert added a commit: rDLDCVSc23d42501546: loader: Yield only modified objects in process_cvs_changesets.

Improve CVS loader performances
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Patch application report for D8682 (id=31355)

Changes applied before test

Patch application report for D8682 (id=31362)

Changes applied before test

Patch application report for D8682 (id=31362)

Changes applied before test

Patch application report for D8682 (id=31394)

Changes applied before test

Patch application report for D8682 (id=31395)

Changes applied before test

Revision Contents
Changeset List

Diff 31362

swh/loader/cvs/loader.py

Improve CVS loader performancesClosedPublicActions

Details

Diff Detail

Event Timeline

Patch application report for D8682 (id=31355)

Changes applied before test

Patch application report for D8682 (id=31362)

Changes applied before test

Patch application report for D8682 (id=31362)

Changes applied before test

Patch application report for D8682 (id=31394)

Changes applied before test

Patch application report for D8682 (id=31395)

Changes applied before test

Revision ContentsChangeset List

Diff 31362

swh/loader/cvs/loader.py

Improve CVS loader performances
ClosedPublic
Actions

Revision Contents
Changeset List