SvnLoaderFromRemoteDump: Drop dump when svn repository is mounted
Closed, Public

Authored by ardumont on Nov 9 2021, 10:40 AM.

Details

Summary

This reduces the disk pressure we currently have when ingesting a svn repository from a
remote dump. We first fetch a dump, then mount a svn repository from it, and finally
ingest the repository while a working copy grows on disk [1]. That makes up to 3 copies,
which takes a lot of disk space.

With this commit, we remove the one unnecessary copy (the svn dump) as soon as possible,
that is, once the svn repository we ingest is mounted.

[1] Implementation detail for speed: we iterate over the commit log and apply deltas on
disk, computing swh hashes along the way.
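
The flow described above can be sketched roughly as follows. This is a minimal illustration and not the loader's actual code: `mount_then_drop_dump`, `load_dump`, and `fake_load` are hypothetical names, and the stand-in loader only copies a file where a real one would populate a svn repository from the dump.

```python
import os
import shutil
import tempfile
from typing import Callable


def mount_then_drop_dump(
    dump_path: str,
    load_dump: Callable[[str, str], None],
    root_dir: str,
) -> str:
    """Mount a repository from ``dump_path`` and delete the dump right after.

    ``load_dump(dump_path, repo_path)`` is a hypothetical callable standing in
    for whatever actually populates the repository from the dump.
    """
    repo_path = tempfile.mkdtemp(prefix="svn-repo-", dir=root_dir)
    load_dump(dump_path, repo_path)
    # The repository is mounted: the dump is the one copy no longer needed,
    # so drop it before the (disk-hungry) ingestion starts.
    os.remove(dump_path)
    return repo_path


def fake_load(dump_path: str, repo_path: str) -> None:
    """Stand-in loader (assumption): copy the dump into the repository directory."""
    shutil.copy(dump_path, os.path.join(repo_path, "db"))
```

With this ordering, at most two copies (the mounted repository and the growing working copy) coexist during ingestion, instead of three.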

This also:

  • improves the existing docstrings
  • adds types
  • adds tests for the utils parts that lacked them (currently tested only indirectly)

Related to T3719

Test Plan

tox (still happy)

Diff Detail

Repository
rDLDSVN Subversion (SVN) loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D6622 (id=24054)

Rebasing onto 189dfd5300...

Current branch diff-target is up to date.
Changes applied before test
commit 5dec13a169437cea39f02f3325e69151ab7eb22c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Nov 9 10:35:01 2021 +0100

    SvnLoaderFromRemoteDump: Drop dump as soon as it's no longer required
    
    This will reduce the disk pressure [1] we have when a repository is large.
    
    [1] up to 3 copies of the svn repository: 1 dump, 1 svn repository, and the svn disk
    copy we are creating along the ingestion.
    
    Related to T3719

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/171/ for more details.

Improve implementations:

  • add tests
  • add types

Build has FAILED

Patch application report for D6622 (id=24056)

Rebasing onto 189dfd5300...

Current branch diff-target is up to date.
Changes applied before test
commit e6ab42d05f2334c895147f78288d48ba745ea0f7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Nov 9 10:35:01 2021 +0100

    SvnLoaderFromRemoteDump: Drop dump as soon as it's no longer required
    
    This will decrease the disk pressure currently existing when ingesting a svn repository
    out of a remote dump. We first fetch a dump, then mount a svn repository out of it, and
    at last we ingest the repository and have a growing copy on disk [1].
    
    So we are up to 3 copies which takes lots of disk space. With the following, we take
    down the 1 unnecessary copy as soon as possible.
    
    [1] implementation detail for speed.
    
    Related to T3719

Link to build: https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/172/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/172/console

Fix tests (I forgot one adaptation in another test)

Build is green

Patch application report for D6622 (id=24057)

Rebasing onto 189dfd5300...

Current branch diff-target is up to date.
Changes applied before test
commit 69d8c72e9d675af39e2444d1f3b3cdd0e1eee864
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Nov 9 10:35:01 2021 +0100

    SvnLoaderFromRemoteDump: Drop dump when svn repository is mounted
    
    This will decrease the disk pressure currently existing when ingesting a svn repository
    out of a remote dump. We first fetch a dump, then mount a svn repository out of it, and
    at last we ingest the repository and have a growing copy on disk [1]. So we are up to 3
    copies which takes lots of disk space.
    
    With the following commit, we take down the 1 unnecessary copy (svn dump) as soon as
    possible (when the svn repository is mounted which we ingest).
    
    [1] implementation detail for speed, we iterate over the commit log, apply deltas (and
    computing swh hashes) along the way on disk.
    
    Related to T3719

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/173/ for more details.

ardumont retitled this revision from "SvnLoaderFromRemoteDump: Drop dump as soon as it's no longer required" to "SvnLoaderFromRemoteDump: Drop dump when svn repository is mounted". Nov 9 2021, 11:35 AM
ardumont edited the summary of this revision.

Build is green

Patch application report for D6622 (id=24058)

Rebasing onto 189dfd5300...

Current branch diff-target is up to date.
Changes applied before test
commit 57eeb866a435353a76281e8ed52347fd8dbf06a3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Nov 9 10:35:01 2021 +0100

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/174/ for more details.

Build is green

Patch application report for D6622 (id=24060)

Rebasing onto 189dfd5300...

Current branch diff-target is up to date.
Changes applied before test
commit b73db0485ebd35ea8c4dea8fd0e9e8be16bb9ff7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Nov 9 10:35:01 2021 +0100

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/176/ for more details.

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/loader/svn/utils.py
152

Can os.remove fail without an error?

This revision is now accepted and ready to land. Nov 9 2021, 12:07 PM
swh/loader/svn/utils.py
152

I'll check.

swh/loader/svn/utils.py
152

I guess it could happen, say, in the scenario where an exception has already been raised.
I'll wrap this in a try/except. Diff on its way.

Wrap the remove instruction in a try/except in case the removal fails.
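
For context on the question above: `os.remove` raises `OSError` (for example `FileNotFoundError`) when it cannot delete the file, so a failure does surface as an exception. A guarded removal along the lines discussed might look like the sketch below. The helper name `remove_dump_quietly` and the choice to log a warning and return a boolean are assumptions for illustration, not the code from this diff.

```python
import logging
import os

logger = logging.getLogger(__name__)


def remove_dump_quietly(dump_path: str) -> bool:
    """Try to delete ``dump_path``; log and report failure instead of raising.

    Hypothetical helper: the cleanup is best effort, so a failed removal
    should not abort the ingestion that follows.
    """
    try:
        os.remove(dump_path)
    except OSError as exc:
        # Covers a missing file, permission errors, and similar failures.
        logger.warning("Failed to remove dump %s: %s", dump_path, exc)
        return False
    return True
```

Catching `OSError` rather than a bare `Exception` keeps unrelated bugs visible while still tolerating the file-system failures that a cleanup step can reasonably ignore.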

swh/loader/svn/utils.py
152

Diff "update" on its way.

Build is green

Patch application report for D6622 (id=24062)

Rebasing onto 189dfd5300...

Current branch diff-target is up to date.
Changes applied before test
commit e36caff0a4f639da835583b49564908392ff87dc
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Nov 9 10:35:01 2021 +0100

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/177/ for more details.