Page MenuHomeSoftware Heritage

replay: Ensure copyfrom operations are properly handled
ClosedPublic

Authored by anlambert on Oct 28 2022, 6:13 PM.

Details

Summary

A subversion revision can contain new directories and files copied from
ancestor revisions but those were not perfectly handled in the commit
editor used to reconstruct the repository filesystem when replaying
revisions.

In particular previous implementation could not handle the case where a
path copied from an ancestor revision is replaced in a same commit (for
instance replacing a directory by a file with the same name).

These changes ensure that info about source path and source revision from
which a path is copied is passed to the commit editor methods in order to
let them handle the copies but also that the replace operations will be
correctly replayed.

It also prevents OS error Too many open files when a really large files
tree is copied from an ancestor revision.

Depends on D8787

It fixes the loading of https://svn.code.sf.net/p/mp3splt/code reported in SWH-LOADER-SVN-2Y.

It also fixes the loading of svn://tug.org/texlive (two previous diffs in that stack are also required for fixing the loading of this big one)
as before these changes the loading were encountering the following issue:

ocker-swh-loader-1  | [2022-10-28 14:08:25,177: DEBUG/ForkPoolWorker-1] rev: 3972, swhrev: 192cedd037af713fb92fa12b09425d3a773c19e2, dir: da2338bdad222b133cc4e2378e18b33678999f49
docker-swh-loader-1  | [2022-10-28 14:08:36,398: ERROR/ForkPoolWorker-1] [Errno 24] Can't open file '/home/anlambert/tmp/texlive_repo/db/revs/0/90': Too many open files
docker-swh-loader-1  | Traceback (most recent call last):
docker-swh-loader-1  |   File "/tmp/tmp.Y8nDYSdruk/swh-loader-svn/swh/loader/svn/loader.py", line 446, in fetch_data
docker-swh-loader-1  |     data = next(self.swh_revision_gen)
docker-swh-loader-1  |   File "/tmp/tmp.Y8nDYSdruk/swh-loader-svn/swh/loader/svn/loader.py", line 342, in process_svn_revisions
docker-swh-loader-1  |     for rev, commit, new_objects, root_directory in gen_revs:
docker-swh-loader-1  |   File "/tmp/tmp.Y8nDYSdruk/swh-loader-svn/swh/loader/svn/svn.py", line 542, in swh_hash_data_per_revision
docker-swh-loader-1  |     objects = self.swhreplay.compute_objects(rev)
docker-swh-loader-1  |   File "/tmp/tmp.Y8nDYSdruk/swh-loader-svn/swh/loader/svn/replay.py", line 963, in compute_objects
docker-swh-loader-1  |     self.replay(rev)
docker-swh-loader-1  |   File "/tmp/tmp.Y8nDYSdruk/swh-loader-svn/swh/loader/svn/replay.py", line 945, in replay
docker-swh-loader-1  |     self.conn.replay(rev, rev + 1, self.editor)
docker-swh-loader-1  | OSError: [Errno 24] Can't open file '/home/anlambert/tmp/texlive_repo/db/revs/0/90': Too many open files

Diff Detail

Repository
rDLDSVN Subversion (SVN) loader
Branch
replay-handle-copy-from-operations
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32643
Build 51137: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 51136: arc lint + arc unit

Event Timeline

anlambert retitled this revision from replay: Ensure copyfrom operations are handled to replay: Ensure copyfrom operations are properly handled.Oct 28 2022, 6:15 PM

Build is green

Patch application report for D8793 (id=31690)

Could not rebase; Attempt merge onto 8c709079ce...

Updating 8c70907..d64dfa7
Fast-forward
 swh/loader/svn/loader.py            |  44 +++++---------
 swh/loader/svn/replay.py            |  56 +++++++++++++----
 swh/loader/svn/svn.py               |  35 ++++++++---
 swh/loader/svn/tests/test_loader.py | 116 +++++++++++++++++++++++++++++++++++-
 swh/loader/svn/tests/test_utils.py  |  41 ++++++++++++-
 swh/loader/svn/tests/utils.py       |  10 +++-
 swh/loader/svn/utils.py             |  23 +++++--
 7 files changed, 267 insertions(+), 58 deletions(-)
Changes applied before test
commit d64dfa7221331628e4f8971c388fb6491981310f
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Fri Oct 21 18:09:28 2022 +0200

    replay: Ensure copyfrom operations are handled
    
    A subversion revision can contain new directories and files copied from
    ancestor revisions but those were not perfectly handled in the commit
    editor used to reconstruct the repository filesystem when replaying
    revisions.
    
    In particular previous implementation could not handle the case where a
    path copied from an ancestor revision is replaced in a same commit (for
    instance replacing a directory by a file with the same name).
    
    These changes ensure that info about source path and source revision from
    which a path is copied is passed to the commit editor methods in order to
    let it handle the copies but also that the replace operations will be
    correctly replayed.
    
    It also prevents OS error "Too many open files" when a really large files
    tree is copied from an ancestor revision.

commit c9f006e1f4abfce0ad8da4bebcc738dc1a3689f6
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 12:17:22 2022 +0200

    loader: Compress dump file and rework truncated dump handling
    
    When dumping a subversion repository to file before loading it, compress
    that file using gzip while producing it. It enables to save significant
    disk space while dumping a large repository.
    
    Also rework the way how truncated dump is handled now dump file is
    compressed by providing the expected max revision number to be loaded
    by svnadmin. If the number of loaded revisions matches, we can safely
    continue the partial loading of the repository.

commit c6d39b7bb70bb5896840a0fc22f8763d5d8b35cf
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 11:29:36 2022 +0200

    svn: Ensure to quote URLs provided as parameters to client methods
    
    URLs provided as parameters to subvertpy.client.Client methods must
    be quoted when it contains space characters or an assertion will be
    raised by libsvn otherwise.

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/349/ for more details.

Build is green

Patch application report for D8793 (id=31691)

Could not rebase; Attempt merge onto 8c709079ce...

Updating 8c70907..165dffd
Fast-forward
 swh/loader/svn/loader.py            |  44 +++++---------
 swh/loader/svn/replay.py            |  56 +++++++++++++----
 swh/loader/svn/svn.py               |  35 ++++++++---
 swh/loader/svn/tests/test_loader.py | 116 +++++++++++++++++++++++++++++++++++-
 swh/loader/svn/tests/test_utils.py  |  41 ++++++++++++-
 swh/loader/svn/tests/utils.py       |  10 +++-
 swh/loader/svn/utils.py             |  23 +++++--
 7 files changed, 267 insertions(+), 58 deletions(-)
Changes applied before test
commit 165dffd1c52813d2954a1364f88c0d1899c62819
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Fri Oct 21 18:09:28 2022 +0200

    replay: Ensure copyfrom operations are handled
    
    A subversion revision can contain new directories and files copied from
    ancestor revisions but those were not perfectly handled in the commit
    editor used to reconstruct the repository filesystem when replaying
    revisions.
    
    In particular previous implementation could not handle the case where a
    path copied from an ancestor revision is replaced in a same commit (for
    instance replacing a directory by a file with the same name).
    
    These changes ensure that info about source path and source revision from
    which a path is copied is passed to the commit editor methods in order to
    let them handle the copies but also that the replace operations will be
    correctly replayed.
    
    It also prevents OS error "Too many open files" when a really large files
    tree is copied from an ancestor revision.

commit c9f006e1f4abfce0ad8da4bebcc738dc1a3689f6
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 12:17:22 2022 +0200

    loader: Compress dump file and rework truncated dump handling
    
    When dumping a subversion repository to file before loading it, compress
    that file using gzip while producing it. It enables to save significant
    disk space while dumping a large repository.
    
    Also rework the way how truncated dump is handled now dump file is
    compressed by providing the expected max revision number to be loaded
    by svnadmin. If the number of loaded revisions matches, we can safely
    continue the partial loading of the repository.

commit c6d39b7bb70bb5896840a0fc22f8763d5d8b35cf
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 11:29:36 2022 +0200

    svn: Ensure to quote URLs provided as parameters to client methods
    
    URLs provided as parameters to subvertpy.client.Client methods must
    be quoted when it contains space characters or an assertion will be
    raised by libsvn otherwise.

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/350/ for more details.

Build is green

Patch application report for D8793 (id=31692)

Could not rebase; Attempt merge onto 8c709079ce...

Updating 8c70907..f224366
Fast-forward
 swh/loader/svn/loader.py            |  44 +++++---------
 swh/loader/svn/replay.py            |  56 +++++++++++++----
 swh/loader/svn/svn.py               |  35 ++++++++---
 swh/loader/svn/tests/test_loader.py | 116 +++++++++++++++++++++++++++++++++++-
 swh/loader/svn/tests/test_utils.py  |  41 ++++++++++++-
 swh/loader/svn/tests/utils.py       |  10 +++-
 swh/loader/svn/utils.py             |  23 +++++--
 7 files changed, 267 insertions(+), 58 deletions(-)
Changes applied before test
commit f224366c0077daa2b8b81b5340ed6c726b9cb65c
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Fri Oct 21 18:09:28 2022 +0200

    replay: Ensure copyfrom operations are properly handled
    
    A subversion revision can contain new directories and files copied from
    ancestor revisions but those were not perfectly handled in the commit
    editor used to reconstruct the repository filesystem when replaying
    revisions.
    
    In particular previous implementation could not handle the case where a
    path copied from an ancestor revision is replaced in a same commit (for
    instance replacing a directory by a file with the same name).
    
    These changes ensure that info about source path and source revision from
    which a path is copied is passed to the commit editor methods in order to
    let them handle the copies but also that the replace operations will be
    correctly replayed.
    
    It also prevents OS error "Too many open files" when a really large files
    tree is copied from an ancestor revision.

commit c9f006e1f4abfce0ad8da4bebcc738dc1a3689f6
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 12:17:22 2022 +0200

    loader: Compress dump file and rework truncated dump handling
    
    When dumping a subversion repository to file before loading it, compress
    that file using gzip while producing it. It enables to save significant
    disk space while dumping a large repository.
    
    Also rework the way how truncated dump is handled now dump file is
    compressed by providing the expected max revision number to be loaded
    by svnadmin. If the number of loaded revisions matches, we can safely
    continue the partial loading of the repository.

commit c6d39b7bb70bb5896840a0fc22f8763d5d8b35cf
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 11:29:36 2022 +0200

    svn: Ensure to quote URLs provided as parameters to client methods
    
    URLs provided as parameters to subvertpy.client.Client methods must
    be quoted when it contains space characters or an assertion will be
    raised by libsvn otherwise.

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/351/ for more details.

Update:

  • rename some variables
  • fix some typing
  • harmonize replay.DirEditor.add_[file|directory] implementation

Build is green

Patch application report for D8793 (id=31693)

Could not rebase; Attempt merge onto 8c709079ce...

Updating 8c70907..12ce928
Fast-forward
 swh/loader/svn/loader.py            |  44 +++++---------
 swh/loader/svn/replay.py            |  56 +++++++++++++----
 swh/loader/svn/svn.py               |  35 ++++++++---
 swh/loader/svn/tests/test_loader.py | 116 +++++++++++++++++++++++++++++++++++-
 swh/loader/svn/tests/test_utils.py  |  41 ++++++++++++-
 swh/loader/svn/tests/utils.py       |  10 +++-
 swh/loader/svn/utils.py             |  23 +++++--
 7 files changed, 267 insertions(+), 58 deletions(-)
Changes applied before test
commit 12ce928bd205085bafdbe2e598764d42e38b0303
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Fri Oct 21 18:09:28 2022 +0200

    replay: Ensure copyfrom operations are properly handled
    
    A subversion revision can contain new directories and files copied from
    ancestor revisions but those were not perfectly handled in the commit
    editor used to reconstruct the repository filesystem when replaying
    revisions.
    
    In particular previous implementation could not handle the case where a
    path copied from an ancestor revision is replaced in a same commit (for
    instance replacing a directory by a file with the same name).
    
    These changes ensure that info about source path and source revision from
    which a path is copied is passed to the commit editor methods as paramaters
    in order to let them handle the copies but also that the replace operations
    will be correctly replayed.
    
    It also prevents OS error "Too many open files" when a really large files
    tree is copied from an ancestor revision.

commit c9f006e1f4abfce0ad8da4bebcc738dc1a3689f6
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 12:17:22 2022 +0200

    loader: Compress dump file and rework truncated dump handling
    
    When dumping a subversion repository to file before loading it, compress
    that file using gzip while producing it. It enables to save significant
    disk space while dumping a large repository.
    
    Also rework the way how truncated dump is handled now dump file is
    compressed by providing the expected max revision number to be loaded
    by svnadmin. If the number of loaded revisions matches, we can safely
    continue the partial loading of the repository.

commit c6d39b7bb70bb5896840a0fc22f8763d5d8b35cf
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 11:29:36 2022 +0200

    svn: Ensure to quote URLs provided as parameters to client methods
    
    URLs provided as parameters to subvertpy.client.Client methods must
    be quoted when it contains space characters or an assertion will be
    raised by libsvn otherwise.

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/352/ for more details.

This revision is now accepted and ready to land.Oct 31 2022, 2:38 PM

Build is green

Patch application report for D8793 (id=31700)

Could not rebase; Attempt merge onto 8c709079ce...

Updating 8c70907..04566a7
Fast-forward
 swh/loader/svn/loader.py            |  54 +++++++----------
 swh/loader/svn/replay.py            |  56 +++++++++++++----
 swh/loader/svn/svn.py               |  35 ++++++++---
 swh/loader/svn/tests/test_loader.py | 116 +++++++++++++++++++++++++++++++++++-
 swh/loader/svn/tests/test_utils.py  |  42 ++++++++++++-
 swh/loader/svn/tests/utils.py       |  10 +++-
 swh/loader/svn/utils.py             |  23 +++++--
 7 files changed, 273 insertions(+), 63 deletions(-)
Changes applied before test
commit 04566a7f3616fb94063355aae491a4e5ad4a832f
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Fri Oct 21 18:09:28 2022 +0200

    replay: Ensure copyfrom operations are properly handled
    
    A subversion revision can contain new directories and files copied from
    ancestor revisions but those were not perfectly handled in the commit
    editor used to reconstruct the repository filesystem when replaying
    revisions.
    
    In particular previous implementation could not handle the case where a
    path copied from an ancestor revision is replaced in a same commit (for
    instance replacing a directory by a file with the same name).
    
    These changes ensure that info about source path and source revision from
    which a path is copied is passed to the commit editor methods as paramaters
    in order to let them handle the copies but also that the replace operations
    will be correctly replayed.
    
    It also prevents OS error "Too many open files" when a really large files
    tree is copied from an ancestor revision.

commit d24ba1a5ccd6c825cc362a62d673446c3176ab7c
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 12:17:22 2022 +0200

    loader: Compress dump file and rework truncated dump handling
    
    When dumping a subversion repository to file before loading it, compress
    that file using gzip while producing it. It enables to save significant
    disk space while dumping a large repository.
    
    Also rework the way how truncated dump is handled now dump file is
    compressed by providing the expected max revision number to be loaded
    by svnadmin. If the number of loaded revisions matches, we can safely
    continue the partial loading of the repository.

commit c6d39b7bb70bb5896840a0fc22f8763d5d8b35cf
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 27 11:29:36 2022 +0200

    svn: Ensure to quote URLs provided as parameters to client methods
    
    URLs provided as parameters to subvertpy.client.Client methods must
    be quoted when it contains space characters or an assertion will be
    raised by libsvn otherwise.

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/354/ for more details.