
SvnLoaderFromRemoteDump: Drop dump when svn repository is mounted
Closed · Public

Authored by ardumont on Nov 9 2021, 10:40 AM.

Details

Summary

This reduces the disk pressure that currently builds up when ingesting a svn repository
from a remote dump. We first fetch the dump, then mount a svn repository out of it, and
finally ingest that repository, which keeps a growing copy on disk [1]. That makes up to
3 copies, which takes a lot of disk space.

With this commit, we delete the one unnecessary copy (the svn dump) as soon as possible,
that is, once the svn repository we ingest is mounted.

[1] An implementation detail for speed: we iterate over the commit log and apply the
deltas on disk along the way, computing swh hashes as we go.
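
The ordering described above (mount first, then drop the dump) can be sketched as follows. This is a minimal illustration and not the loader's actual code: `mount_then_drop_dump`, `load_dump`, and `cleanup_dump` are hypothetical names, and `load_dump` stands in for whatever step really turns the dump into a repository.

```python
import os
import tempfile
from typing import Callable


def mount_then_drop_dump(
    dump_path: str,
    load_dump: Callable[[str, str], None],
    cleanup_dump: bool = True,
) -> str:
    """Mount a repository out of dump_path, then delete the dump.

    load_dump(dump_path, repo_path) is a hypothetical callable standing in
    for the real loading step; it is not the loader's actual API.
    """
    repo_path = tempfile.mkdtemp(prefix="svn-repo-")
    # If loading raises, we never reach the removal below, so the dump is
    # kept and can still be inspected or retried.
    load_dump(dump_path, repo_path)
    # The repository is mounted: the dump is now the one redundant copy.
    if cleanup_dump:
        os.remove(dump_path)
    return repo_path
```

Removing the dump only after a successful load keeps the peak disk usage at two copies during ingestion, under the assumption that nothing else needs the dump afterwards.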

This also:

  • improves the existing docstrings
  • adds types
  • adds tests for the utils parts that were only tested indirectly so far

Related to T3719

Test Plan

tox (still happy)

Diff Detail

Unit Tests: Failed

Time      Test
28 ms     Jenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.svn.tests.test_utils::test_init_svn_repo_from_archive_dump
datadir = '/var/lib/jenkins/workspace/DLDSVN/tests-on-diff/.tox/py3/lib/python3.7/site-packages/swh/loader/svn/tests/data' tmp_path = PosixPath('/tmp/pytest-of-jenkins/pytest-0/test_init_svn_repo_from_archiv0')
1,753 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.svn.tests.test_utils::test_init_svn_repo_from_archive_dump_and_cleanup
datadir = '/var/lib/jenkins/workspace/DLDSVN/tests-on-diff/.tox/py3/lib/python3.7/site-packages/swh/loader/svn/tests/data' tmp_path = PosixPath('/tmp/pytest-of-jenkins/pytest-0/test_init_svn_repo_from_archiv1')
26 ms     Jenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.svn.tests.test_utils::test_init_svn_repo_from_dump
datadir = '/var/lib/jenkins/workspace/DLDSVN/tests-on-diff/.tox/py3/lib/python3.7/site-packages/swh/loader/svn/tests/data' tmp_path = PosixPath('/tmp/pytest-of-jenkins/pytest-0/test_init_svn_repo_from_dump0')
2 ms      Jenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.svn.tests.test_utils::test_init_svn_repo_from_dump_and_cleanup
datadir = '/var/lib/jenkins/workspace/DLDSVN/tests-on-diff/.tox/py3/lib/python3.7/site-packages/swh/loader/svn/tests/data' tmp_path = PosixPath('/tmp/pytest-of-jenkins/pytest-0/test_init_svn_repo_from_dump_a0')
1 ms      Jenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.svn.tests.test_converters::test_build_swh_revision_default

Test results: 4 Failed · 41 Passed

Event Timeline

Build is green

Patch application report for D6622 (id=24054)

Rebasing onto 189dfd5300...

Current branch diff-target is up to date.
Changes applied before test
commit 5dec13a169437cea39f02f3325e69151ab7eb22c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Nov 9 10:35:01 2021 +0100

    SvnLoaderFromRemoteDump: Drop dump as soon as it's no longer required
    
    This will reduce the disk pressure [1] we have when a repository is large.
    
    [1] up to 3 copies of the svn repository: 1 dump, 1 svn repository, and the svn disk
    copy we are creating along the ingestion.
    
    Related to T3719

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/171/ for more details.

Improve implementations:

  • add tests
  • add types

Build has FAILED

Patch application report for D6622 (id=24056)

Rebasing onto 189dfd5300...

Current branch diff-target is up to date.
Changes applied before test
commit e6ab42d05f2334c895147f78288d48ba745ea0f7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Nov 9 10:35:01 2021 +0100

    SvnLoaderFromRemoteDump: Drop dump as soon as it's no longer required
    
    This will decrease the disk pressure currently existing when ingesting a svn repository
    out of a remote dump. We first fetch a dump, then mount a svn repository out of it, and
    at last we ingest the repository and have a growing copy on disk [1].
    
    So we are up to 3 copies which takes lots of disk space. With the following, we take
    down the 1 unnecessary copy as soon as possible.
    
    [1] implementation detail for speed.
    
    Related to T3719

Link to build: https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/172/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/172/console

Fix tests (I forgot to adapt one other test)

Build is green

Patch application report for D6622 (id=24057)

Rebasing onto 189dfd5300...

Current branch diff-target is up to date.
Changes applied before test
commit 69d8c72e9d675af39e2444d1f3b3cdd0e1eee864
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Nov 9 10:35:01 2021 +0100

    SvnLoaderFromRemoteDump: Drop dump when svn repository is mounted
    
    This will decrease the disk pressure currently existing when ingesting a svn repository
    out of a remote dump. We first fetch a dump, then mount a svn repository out of it, and
    at last we ingest the repository and have a growing copy on disk [1]. So we are up to 3
    copies which takes lots of disk space.
    
    With the following commit, we take down the 1 unnecessary copy (svn dump) as soon as
    possible (when the svn repository is mounted which we ingest).
    
    [1] implementation detail for speed, we iterate over the commit log, apply deltas (and
    computing swh hashes) along the way on disk.
    
    Related to T3719

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/173/ for more details.

ardumont retitled this revision from "SvnLoaderFromRemoteDump: Drop dump as soon as it's no longer required" to "SvnLoaderFromRemoteDump: Drop dump when svn repository is mounted". (Nov 9 2021, 11:35 AM)
ardumont edited the summary of this revision.

Build is green

Patch application report for D6622 (id=24058)

Rebasing onto 189dfd5300...

Current branch diff-target is up to date.
Changes applied before test
commit 57eeb866a435353a76281e8ed52347fd8dbf06a3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Nov 9 10:35:01 2021 +0100

    SvnLoaderFromRemoteDump: Drop dump when svn repository is mounted
    
    This will decrease the disk pressure currently existing when ingesting a svn repository
    out of a remote dump. We first fetch a dump, then mount a svn repository out of it, and
    at last we ingest the repository and have a growing copy on disk [1]. So we are up to 3
    copies which takes lots of disk space.
    
    With the following commit, we take down the 1 unnecessary copy (svn dump) as soon as
    possible (when the svn repository is mounted which we ingest).
    
    [1] implementation detail for speed, we iterate over the commit log, apply deltas (and
    computing swh hashes) along the way on disk.
    
    Related to T3719

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/174/ for more details.

Build is green

Patch application report for D6622 (id=24060)

Rebasing onto 189dfd5300...

Current branch diff-target is up to date.
Changes applied before test
commit b73db0485ebd35ea8c4dea8fd0e9e8be16bb9ff7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Nov 9 10:35:01 2021 +0100

    SvnLoaderFromRemoteDump: Drop dump when svn repository is mounted
    
    This will decrease the disk pressure currently existing when ingesting a svn repository
    out of a remote dump. We first fetch a dump, then mount a svn repository out of it, and
    at last we ingest the repository and have a growing copy on disk [1]. So we are up to 3
    copies which takes lots of disk space.
    
    With the following commit, we take down the 1 unnecessary copy (svn dump) as soon as
    possible (when the svn repository is mounted which we ingest).
    
    [1] implementation detail for speed, we iterate over the commit log, apply deltas (and
    computing swh hashes) along the way on disk.
    
    Related to T3719

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/176/ for more details.

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/loader/svn/utils.py, line 149

Can os.remove fail without an error?

This revision is now accepted and ready to land. (Nov 9 2021, 12:07 PM)
swh/loader/svn/utils.py, line 149

I'll check.

swh/loader/svn/utils.py, line 149

I guess it could fail, say in a scenario where an exception has already been raised.
I'll wrap this in a try/except. Diff on its way.

Wrap the remove call in a try/except in case the removal fails.
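
For context on the question above: `os.remove` does not fail silently, it raises an `OSError` subclass such as `FileNotFoundError` or `PermissionError`. A minimal sketch of the kind of guard discussed here could look like the following; `remove_quietly` is a hypothetical helper name, and whether the actual diff logs the failure or simply ignores it is an assumption.

```python
import logging
import os

logger = logging.getLogger(__name__)


def remove_quietly(path: str) -> bool:
    """Try to delete path; return True on success, False if removal failed.

    Hypothetical helper, not taken from swh/loader/svn/utils.py.
    """
    try:
        os.remove(path)
    except OSError as exc:
        # Covers a missing file, permission errors, path being a directory...
        # A leftover dump should not abort the ingestion, so only log it.
        logger.warning("Failed to remove %s: %s", path, exc)
        return False
    return True
```

Catching `OSError` rather than a bare `Exception` keeps unrelated programming errors visible while still tolerating a dump that is already gone.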

swh/loader/svn/utils.py, line 149

Diff update on its way.

Build is green

Patch application report for D6622 (id=24062)

Rebasing onto 189dfd5300...

Current branch diff-target is up to date.
Changes applied before test
commit e36caff0a4f639da835583b49564908392ff87dc
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Nov 9 10:35:01 2021 +0100

    SvnLoaderFromRemoteDump: Drop dump when svn repository is mounted
    
    This will decrease the disk pressure currently existing when ingesting a svn repository
    out of a remote dump. We first fetch a dump, then mount a svn repository out of it, and
    at last we ingest the repository and have a growing copy on disk [1]. So we are up to 3
    copies which takes lots of disk space.
    
    With the following commit, we take down the 1 unnecessary copy (svn dump) as soon as
    possible (when the svn repository is mounted which we ingest).
    
    [1] implementation detail for speed, we iterate over the commit log, apply deltas (and
    computing swh hashes) along the way on disk.
    
    Related to T3719

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/177/ for more details.