Software Heritage

Improve CVS loader performance
Closed, Public

Authored by anlambert on Oct 13 2022, 7:33 PM.

Details

Summary

This diff contains two commits that should greatly improve loading
performance for large CVS repositories.

loader: Reconstruct repo filesystem incrementally at each revision

Instead of creating a from_disk.Directory instance after each replayed
CVS revision by recursively scanning all directories of the repository,
prefer to have a single one as a class member kept synchronized with the
reconstructed filesystem after each revision replay.

This should improve loader performance, especially when
dealing with large repositories.
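The idea can be sketched as follows. This is a minimal illustration, not the loader's actual code: `DirectoryTree` and `apply_revision` are hypothetical names standing in for the single `from_disk.Directory` instance kept in sync per revision.

```python
# Minimal sketch: keep one directory tree in memory and update only the
# paths touched by each replayed revision, instead of rescanning the whole
# checkout after every revision.

class DirectoryTree:
    """Tiny stand-in for from_disk.Directory: maps paths to file contents."""

    def __init__(self):
        self.files = {}

    def apply_revision(self, changed_paths, removed_paths=()):
        # Only the modified entries are updated; untouched paths keep
        # their previously computed state.
        for path, content in changed_paths.items():
            self.files[path] = content
        for path in removed_paths:
            self.files.pop(path, None)


tree = DirectoryTree()
tree.apply_revision({"src/main.c": b"v1", "README": b"hello"})
tree.apply_revision({"src/main.c": b"v2"}, removed_paths=["README"])
```

The cost of each replay is then proportional to the size of the changeset, not to the size of the repository.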

loader: Yield only modified objects in process_cvs_changesets

Previously, after each revision replay all files and directories of the
CVS repository being loaded were collected and sent to the storage.
This is a real bottleneck for loading performance, as it delegates
the filtering of objects new to the archive to the storage filtering proxy.

As we know exactly the set of paths that have been modified in a CVS
revision, prefer to do that filtering on the loader side and only
send modified objects to storage instead of the whole set of contents
and directories from the reconstructed filesystem.

This should greatly improve loading performance for large repositories
but also reduce loader memory consumption.
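A rough sketch of this loader-side filtering, assuming only the modified paths of a revision are known; `objects_to_send` and `ancestor_dirs` are illustrative names, not the loader's actual API:

```python
# Given the set of paths modified by a revision, compute only the affected
# contents and their ancestor directories, rather than every object in the
# reconstructed filesystem.
import posixpath


def ancestor_dirs(path):
    """Yield every ancestor directory of a path, up to the root ''."""
    while path:
        path = posixpath.dirname(path)
        yield path


def objects_to_send(modified_paths):
    """Return the contents and directories impacted by a revision."""
    contents = set(modified_paths)
    directories = set()
    for path in modified_paths:
        # A modified file changes the hash of every directory above it.
        directories.update(ancestor_dirs(path))
    return contents, directories


contents, directories = objects_to_send({"a/b/file.c", "a/readme"})
```

Only these objects need to be sent to storage; everything else in the tree is unchanged since the previous revision.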

Diff Detail

Repository
rDLDCVS CVS Loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8682 (id=31355)

Rebasing onto 965c3de498...

Current branch diff-target is up to date.
Changes applied before test
commit 633a9f8e73851cd3e8f9f13c58c43d1f31dd6a24
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 18:00:37 2022 +0200

    loader: Yield only modified objects in process_cvs_changesets
    
    Previously, after each revision replay all files and directories of the
    CVS repository being loaded were collected and sent to the storage.
    This is a real bottleneck for loading performance, as it delegates
    the filtering of objects new to the archive to the storage filtering proxy.

    As we know exactly the set of paths that have been modified in a CVS
    revision, prefer to do that filtering on the loader side and only
    send modified objects to storage instead of the whole set of contents
    and directories from the reconstructed filesystem.
    
    This should greatly improve loading performance for large repositories
    but also reduce loader memory consumption.

commit 76a19ee665b39e6ec31399d1c814b95264b26912
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 17:30:51 2022 +0200

    loader: Reconstruct repo filesystem incrementally at each revision
    
    Instead of creating a from_disk.Directory instance after each replayed
    CVS revision by recursively scanning all directories of the repository,
    prefer to have a single one as a class member kept synchronized with the
    reconstructed filesystem after each revision replay.

    This should improve loader performance, especially when
    dealing with large repositories.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/131/ for more details.

swh.model.from_disk.Directory has a collect method which is supposed to do the change tracking by itself (it only returns the nodes that have changed since the last time .collect() was called). This should allow you to drop the modified_paths tracking altogether.

In D8682#226117, @olasd wrote:

swh.model.from_disk.Directory has a collect method which is supposed to do the change tracking by itself (it only returns the nodes that have changed since the last time .collect() was called). This should allow you to drop the modified_paths tracking altogether.

Ah, collect uses get_data which yields a bunch of dicts. Meh. That should probably be updated to just yield the nodes themselves.
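The change-tracking pattern behind collect can be illustrated with a toy node; this mimics the idea discussed here, not swh.model's actual implementation:

```python
# A node remembers whether it changed since the last collect() call, so the
# caller gets only new/modified nodes and needs no external set of modified
# paths.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = None
        self._collected = False

    def update(self, data):
        self.data = data
        self._collected = False  # mark the node dirty again

    def collect(self):
        # Return the node only if it changed since the last collect().
        if self._collected:
            return []
        self._collected = True
        return [self]


a = Node("a")
a.update(b"v1")
first = a.collect()   # node is returned once...
second = a.collect()  # ...then skipped until it changes again
a.update(b"v2")
third = a.collect()
```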

In D8682#226118, @olasd wrote:
In D8682#226117, @olasd wrote:

swh.model.from_disk.Directory has a collect method which is supposed to do the change tracking by itself (it only returns the nodes that have changed since the last time .collect() was called). This should allow you to drop the modified_paths tracking altogether.

Ah, collect uses get_data which yields a bunch of dicts. Meh. That should probably be updated to just yield the nodes themselves.

Oh nice, I did not know we already had such a feature in swh-model. I will try to use it and adapt the implementation if needed.

So I gave the MerkleNode.collect method a shot in the CVS loader and it works like a charm.

I adapted the collect method to return a set of MerkleNode instances instead, in D8686.

Use from_disk.Directory.collect to get added/modified objects instead of maintaining a set of paths.

Build has FAILED

Patch application report for D8682 (id=31362)

Rebasing onto 965c3de498...

Current branch diff-target is up to date.
Changes applied before test
commit b47790a2c8260e5b4e1c3ef8981a76db6563c139
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 18:00:37 2022 +0200

    loader: Yield only modified objects in process_cvs_changesets
    
    Previously, after each revision replay all files and directories of the
    CVS repository being loaded were collected and sent to the storage.
    This is a real bottleneck for loading performance, as it delegates
    the filtering of objects new to the archive to the storage filtering proxy.

    As we know exactly the set of paths that have been modified in a CVS
    revision, prefer to do that filtering on the loader side and only
    send modified objects to storage instead of the whole set of contents
    and directories from the reconstructed filesystem.
    
    This should greatly improve loading performance for large repositories
    but also reduce loader memory consumption.

commit 76a19ee665b39e6ec31399d1c814b95264b26912
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 17:30:51 2022 +0200

    loader: Reconstruct repo filesystem incrementally at each revision
    
    Instead of creating a from_disk.Directory instance after each replayed
    CVS revision by recursively scanning all directories of the repository,
    prefer to have a single one as a class member kept synchronized with the
    reconstructed filesystem after each revision replay.

    This should improve loader performance, especially when
    dealing with large repositories.

Link to build: https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/134/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/134/console


Build will fail until D8682 gets landed.

That's a surprisingly small diff for such a change, nice!

What speedup do you get with this?

swh/loader/cvs/loader.py:334

Add "else: assert False, obj_type" just to be safe.
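The suggested defensive pattern looks like this; the handler and the obj_type values below are illustrative, not the loader's actual dispatch:

```python
# After dispatching on an object type, guard the final branch with an assert
# so an unexpected type fails loudly instead of being silently ignored.

def handle(obj_type):
    if obj_type == "content":
        return "stored content"
    elif obj_type == "directory":
        return "stored directory"
    else:
        # Defensive: any new obj_type must be handled explicitly.
        assert False, obj_type


result = handle("content")
```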

The speedup will depend on the number of directories and files stored in the repository: the bigger it is, the faster the loading will be.

But we can also notice a small speedup for lightweight repos; for instance, after three consecutive loadings of rsync://a.cvs.sourceforge.net/cvsroot/titano/titano,
I got the following mean loading times:

  • before these changes: 29.770945431 seconds
  • after these changes: 22.8639589363 seconds

Build is green

Patch application report for D8682 (id=31394)

Rebasing onto 734207ba58...

Current branch diff-target is up to date.
Changes applied before test
commit e918d3ef80948fb07d295ff7adff7b39dc526e14
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 18:00:37 2022 +0200

    loader: Yield only modified objects in process_cvs_changesets
    
    Previously, after each revision replay all files and directories of the
    CVS repository being loaded were collected and sent to the storage.
    This is a real bottleneck for loading performance, as it delegates
    the filtering of objects new to the archive to the storage filtering proxy.

    As we know exactly the set of paths that have been modified in a CVS
    revision, prefer to do that filtering on the loader side and only
    send modified objects to storage instead of the whole set of contents
    and directories from the reconstructed filesystem.
    
    This should greatly improve loading performance for large repositories
    but also reduce loader memory consumption.

commit b976aa6a1f805d8c36bacefaed0273d3e991956d
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 17:30:51 2022 +0200

    loader: Reconstruct repo filesystem incrementally at each revision
    
    Instead of creating a from_disk.Directory instance after each replayed
    CVS revision by recursively scanning all directories of the repository,
    prefer to have a single one as a class member kept synchronized with the
    reconstructed filesystem after each revision replay.

    This should improve loader performance, especially when
    dealing with large repositories.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/136/ for more details.

Build is green

Patch application report for D8682 (id=31395)

Rebasing onto 734207ba58...

Current branch diff-target is up to date.
Changes applied before test
commit c23d425015467ebbdc143227d8320ba39bebb058
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 18:00:37 2022 +0200

    loader: Yield only modified objects in process_cvs_changesets
    
    Previously, after each revision replay all files and directories of the
    CVS repository being loaded were collected and sent to the storage.
    This is a real bottleneck for loading performance, as it delegates
    the filtering of objects new to the archive to the storage filtering proxy.

    As we know exactly the set of paths that have been modified in a CVS
    revision, prefer to do that filtering on the loader side and only
    send modified objects to storage instead of the whole set of contents
    and directories from the reconstructed filesystem.
    
    This should greatly improve loading performance for large repositories
    but also reduce loader memory consumption.

commit b976aa6a1f805d8c36bacefaed0273d3e991956d
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 17:30:51 2022 +0200

    loader: Reconstruct repo filesystem incrementally at each revision
    
    Instead of creating a from_disk.Directory instance after each replayed
    CVS revision by recursively scanning all directories of the repository,
    prefer to have a single one as a class member kept synchronized with the
    reconstructed filesystem after each revision replay.

    This should improve loader performance, especially when
    dealing with large repositories.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/137/ for more details.