loader: Compress dump file and rework truncated dump handling
Closed, Public

Authored by anlambert on Oct 27 2022, 3:49 PM.

Details

Summary

When dumping a subversion repository to a file before loading it, compress
that file with gzip while it is being produced. This saves significant
disk space when dumping a large repository.
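
As a rough illustration of compressing a dump while it is produced, the sketch below streams the standard output of a command into a gzip file without ever writing the uncompressed data to disk. It is not the loader's actual code: the function name `dump_to_gzip` and the file name in the usage note are hypothetical.

```python
import gzip
import shutil
import subprocess
from typing import Sequence


def dump_to_gzip(cmd: Sequence[str], dump_path: str) -> int:
    """Run cmd and gzip its stdout into dump_path as it is produced.

    Hypothetical helper for illustration only; returns the exit code of cmd.
    """
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc, gzip.open(
        dump_path, "wb"
    ) as out:
        assert proc.stdout is not None
        # Copy in chunks so only a small buffer of uncompressed data
        # is held in memory at any time.
        shutil.copyfileobj(proc.stdout, out)
    # Leaving the with block waits for the process to terminate.
    return proc.returncode
```

Under these assumptions, a call could look like `dump_to_gzip(["svnrdump", "dump", "-q", "svn://tug.org/texlive"], "texlive.svndump.gz")`.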

Also rework how a truncated dump is handled now that the dump file is
compressed: the expected max revision number to load is provided to
svnadmin. If the number of loaded revisions matches, we can safely
continue with the partial loading of the repository.
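
The check described above might be sketched as follows. This is an assumption-laden illustration rather than the code of this diff: the helper names are hypothetical, it assumes the installed `svnadmin load` accepts a `-r LOWER:UPPER` revision range, and it uses `svnlook youngest` as one possible way to find out how many revisions were loaded.

```python
import subprocess
from typing import List


def svnadmin_load_cmd(repo_path: str, max_rev: int = -1) -> List[str]:
    """Build an `svnadmin load` command line (hypothetical helper).

    When max_rev is positive, only revisions 1..max_rev are loaded, so a
    truncated trailing revision in the dump is never read.
    """
    cmd = ["svnadmin", "load", "-q", repo_path]
    if max_rev > 0:
        cmd += ["-r", f"1:{max_rev}"]
    return cmd


def youngest_revision(repo_path: str) -> int:
    """Return the highest revision number in a local repository."""
    result = subprocess.run(
        ["svnlook", "youngest", repo_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def partial_load_is_usable(loaded_rev: int, expected_max_rev: int) -> bool:
    """A partial load is kept only if every expected revision was loaded."""
    return expected_max_rev > 0 and loaded_rev == expected_max_rev
```

After running the load command, comparing `youngest_revision(repo_path)` with the expected max revision through `partial_load_is_usable` decides whether to continue with the partially loaded repository.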

Prior to this change, the size of the dump file for svn://tug.org/texlive was
more than 80 GB, while its compressed version has a size of 46.9 GB.

Depends on D8786

Diff Detail

Repository
rDLDSVN Subversion (SVN) loader
Branch
svnrdump-gzip
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32633
Build 51123: Phabricator diff pipeline on jenkins
Build 51122: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8787 (id=31670)

Rebasing onto 8c709079ce...

Current branch diff-target is up to date.
Changes applied before test
commit b46ee14525187afad80f066adfe9a113c495b37f
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 12:17:22 2022 +0200

    loader: Compress dump file and rework truncated dump handling
    
    When dumping a subversion repository to file before loading it, compress
    that file using gzip while producing it. It enables to save significant
    disk space while dumping a large repository.
    
    Also rework the way how truncated dump is handled now dump file is
    compressed by providing the expected max revision number to be loaded
    by svnadmin. If the number of loaded revisions matches, we can safely
    continue the partial loading of the repository.

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/345/ for more details.

Build is green

Patch application report for D8787 (id=31673)

Could not rebase; Attempt merge onto 8c709079ce...

Updating 8c70907..e0e971d
Fast-forward
 swh/loader/svn/loader.py            | 44 ++++++++++++++-----------------------
 swh/loader/svn/svn.py               | 16 +++++++++-----
 swh/loader/svn/tests/test_loader.py | 27 ++++++++++++++++++++++-
 swh/loader/svn/tests/test_utils.py  | 41 +++++++++++++++++++++++++++++++++-
 swh/loader/svn/utils.py             | 23 ++++++++++++++-----
 5 files changed, 109 insertions(+), 42 deletions(-)
Changes applied before test
commit e0e971ddda8d503d75ae734e76a2fc43157be9e4
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 12:17:22 2022 +0200

    loader: Compress dump file and rework truncated dump handling
    
    When dumping a subversion repository to file before loading it, compress
    that file using gzip while producing it. It enables to save significant
    disk space while dumping a large repository.
    
    Also rework the way how truncated dump is handled now dump file is
    compressed by providing the expected max revision number to be loaded
    by svnadmin. If the number of loaded revisions matches, we can safely
    continue the partial loading of the repository.

commit 991e2b4ffced44d5483b115298e3f56d1f34c90b
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 11:29:36 2022 +0200

    svn: Ensure to quote URLs provided as parameters to client methods
    
    URLs provided as parameters to subvertpy.client.Client methods must
    be quoted or an assertion will be raised by libsvn otherwise.

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/346/ for more details.

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/loader/svn/tests/test_utils.py
130

assert dump_lines > 150 to make sure the test is useful

This revision is now accepted and ready to land. Oct 27 2022, 6:54 PM

Build is green

Patch application report for D8787 (id=31681)

Could not rebase; Attempt merge onto 8c709079ce...

Updating 8c70907..c9f006e
Fast-forward
 swh/loader/svn/loader.py            | 44 ++++++++++++++-----------------------
 swh/loader/svn/replay.py            |  2 +-
 swh/loader/svn/svn.py               | 14 +++++++-----
 swh/loader/svn/tests/test_loader.py | 27 ++++++++++++++++++++++-
 swh/loader/svn/tests/test_utils.py  | 41 +++++++++++++++++++++++++++++++++-
 swh/loader/svn/utils.py             | 23 ++++++++++++++-----
 6 files changed, 109 insertions(+), 42 deletions(-)
Changes applied before test
commit c9f006e1f4abfce0ad8da4bebcc738dc1a3689f6
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 12:17:22 2022 +0200

    loader: Compress dump file and rework truncated dump handling
    
    When dumping a subversion repository to file before loading it, compress
    that file using gzip while producing it. It enables to save significant
    disk space while dumping a large repository.
    
    Also rework the way how truncated dump is handled now dump file is
    compressed by providing the expected max revision number to be loaded
    by svnadmin. If the number of loaded revisions matches, we can safely
    continue the partial loading of the repository.

commit c6d39b7bb70bb5896840a0fc22f8763d5d8b35cf
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 11:29:36 2022 +0200

    svn: Ensure to quote URLs provided as parameters to client methods
    
    URLs provided as parameters to subvertpy.client.Client methods must
    be quoted when it contains space characters or an assertion will be
    raised by libsvn otherwise.

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/348/ for more details.

ardumont added a subscriber: ardumont.

lgtm

One non-blocking suggestion: make the compressed nature of the dump explicit inline.

swh/loader/svn/loader.py
662

Build is green

Patch application report for D8787 (id=31699)

Could not rebase; Attempt merge onto 8c709079ce...

Updating 8c70907..d24ba1a
Fast-forward
 swh/loader/svn/loader.py            | 54 +++++++++++++++----------------------
 swh/loader/svn/replay.py            |  2 +-
 swh/loader/svn/svn.py               | 14 ++++++----
 swh/loader/svn/tests/test_loader.py | 27 ++++++++++++++++++-
 swh/loader/svn/tests/test_utils.py  | 42 ++++++++++++++++++++++++++++-
 swh/loader/svn/utils.py             | 23 +++++++++++-----
 6 files changed, 115 insertions(+), 47 deletions(-)
Changes applied before test
commit d24ba1a5ccd6c825cc362a62d673446c3176ab7c
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 12:17:22 2022 +0200

    loader: Compress dump file and rework truncated dump handling
    
    When dumping a subversion repository to file before loading it, compress
    that file using gzip while producing it. It enables to save significant
    disk space while dumping a large repository.
    
    Also rework the way how truncated dump is handled now dump file is
    compressed by providing the expected max revision number to be loaded
    by svnadmin. If the number of loaded revisions matches, we can safely
    continue the partial loading of the repository.

commit c6d39b7bb70bb5896840a0fc22f8763d5d8b35cf
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 11:29:36 2022 +0200

    svn: Ensure to quote URLs provided as parameters to client methods
    
    URLs provided as parameters to subvertpy.client.Client methods must
    be quoted when it contains space characters or an assertion will be
    raised by libsvn otherwise.

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/353/ for more details.