ra: Send modified objects only to storage after replaying a revision
Closed, Public

Authored by anlambert on Jan 14 2022, 12:17 PM.

Details

Summary

Previously, all contents and directories of the reconstructed filesystem
were sent to the storage after replaying an svn revision.
The filtering of the new contents and directories to write to the storage
was then delegated to the storage filtering proxy.

Proceeding like this has a huge performance impact when loading large
subversion repositories, as large sets of objects to archive are filtered
again and again after each revision replay.

This commit performs the object filtering at the loader level instead of
delegating that task to the storage filtering proxy.
It does so by maintaining a set of added or modified paths for a given
revision while replaying it. As we use the svn_ra API, that set of paths
can be computed easily and with confidence.

This change provides a really significant speedup to the overall loading
time of a subversion repository.
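
The loader-level filtering described above can be sketched as follows. This is only an illustration, not the code of this diff: the `Tree` mapping, the object identifiers and the `objects_to_send` helper are hypothetical, and the reconstructed filesystem is reduced to a plain dictionary.

```python
from typing import Dict, List, Set

# Hypothetical in-memory view of the reconstructed filesystem after a replay:
# path -> object identifier (contents and directories alike).
Tree = Dict[str, str]


def objects_to_send(tree: Tree, modified_paths: Set[str]) -> List[str]:
    """Return the identifiers of the objects touched by the replayed revision.

    Paths recorded as modified but no longer present in the tree (for
    example after a deletion) are skipped.
    """
    return [tree[path] for path in sorted(modified_paths) if path in tree]


# State after replaying one revision that only changed a/file.txt.
tree = {
    "": "dir-root-v2",
    "a": "dir-a-v2",
    "a/file.txt": "content-v2",
    "b": "dir-b-v1",
    "b/other.txt": "content-other-v1",
}
modified = {"", "a", "a/file.txt"}

# Only three objects are sent instead of the five making up the whole tree.
to_send = objects_to_send(tree, modified)
```

The untouched `b` subtree is never handed to the storage, which is where the saving comes from on repositories with many revisions and large trees.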

For my tests, I used the large TortoiseSVN repository.
Before that change, the loading took around 24h in the docker environment.
After that change, the loading took around 4h, so a 6x speedup!

Related to T3839

Depends on D6925

Test Plan

I added snapshot integrity checks to the tests that did not perform them,
to ensure all objects referenced by a snapshot can be found in the
archive after a loading. No issues were detected.
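
To make the intent of such an integrity check concrete, here is a minimal sketch of a reachability walk. It is not the helper used by the test suite: the `Archive` mapping and the `missing_objects` function are hypothetical, and the archive is modelled as a dictionary from an object id to the ids it references.

```python
from typing import Dict, List, Set

# Hypothetical archive: object id -> ids of the objects it references.
Archive = Dict[str, List[str]]


def missing_objects(archive: Archive, snapshot_targets: List[str]) -> Set[str]:
    """Walk every object reachable from the snapshot and report absent ids."""
    missing: Set[str] = set()
    seen: Set[str] = set()
    stack = list(snapshot_targets)
    while stack:
        obj_id = stack.pop()
        if obj_id in seen:
            continue
        seen.add(obj_id)
        if obj_id not in archive:
            # Referenced by a parent object but never written to the archive.
            missing.add(obj_id)
            continue
        stack.extend(archive[obj_id])
    return missing
```

An empty result means every revision, directory and content reachable from the snapshot was found, which is the property the added checks are meant to guard.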

Diff Detail

Repository: rDLDSVN Subversion (SVN) loader
Branch: svn-loader-performance-optimization
Lint: No Linters Available
Unit: No Unit Test Coverage
Build Status: Buildable 26101
  Build 40794: Phabricator diff pipeline on jenkins
  Build 40793: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6950 (id=25170)

Could not rebase; Attempt merge onto 93b4f2fdd8...

Updating 93b4f2f..4d156d9
Fast-forward
 swh/loader/svn/ra.py                | 297 ++++++++++++++-
 swh/loader/svn/svn.py               |  50 ++-
 swh/loader/svn/tests/test_loader.py | 742 +++++++++++++++++++++++++++++++-----
 swh/loader/svn/tests/test_utils.py  | 225 ++++++++++-
 swh/loader/svn/utils.py             | 126 +++++-
 5 files changed, 1325 insertions(+), 115 deletions(-)
Changes applied before test
commit 4d156d9ee10e0b746d85ad80725f6b9f8e561ff1
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 19:34:44 2022 +0100

    ra: Send modified objects only to storage after replaying a revision
    
    Previously all contents and directories of the reconstructed filesystem
    were sent to the storage after having replayed a svn revision.
    The filtering of the new contents and directories to write to the storage
    is then delegated to the storage filtering proxy.
    
    Proceeding like this has a huge performance impact on the loading of large
    subversion repositories as large sets of objects to archive are filtered
    again and again after each revision replay.
    
    That commit performs the objects filtering at the loader level instead of
    delegating that task to the storage filtering proxy.
    It is done by maintaining a set of added or modified paths for a given
    revision when replaying it. As we use the svn_ra API, that set of paths
    can be easily computed with confidence.
    
    This change provides a really significant speedup to the overall loading
    time of a subversion repository.
    
    Related to T3839

commit d19f08e3fd5f9d803300f6be698dedf30bd5f527
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 20:09:56 2022 +0100

    ra: Put externals in cache to avoid exporting them again
    
    Some subversion repositories can define the same external on different paths.
    
    In order to avoid exporting it multiple times, which consumes network bandwidth
    and slows down the loading, save the exported external in a temporary directory
    on the local filesystem and reuse that copy when the external is set on a path.
    
    Also ensure all the temporary directories created for externals will be deleted
    at the end of the loading process.
    
    Related to T611

commit 85aa87be50fea493c437b44cbcc544f285912e5d
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Dec 8 11:43:38 2021 +0100

    ra: Add support for subversion external definitions
    
    Subversion external definitions set on directories through the use of the
    svn:externals property are now handled by the loader.
    
    As with a svn export operation, externals will be attempted to be exported
    in the paths they are defined. If an external is no longer valid (404),
    the error will be ignored and the next one will be processed.
    
    The implementation takes care of keeping the reconstructed repository
    filesystem for a revision in sync with a svn export operation while
    externals are added, updated or removed across revisions replay.
    
    Related to T611

commit 2f5fd60ab91f5af90c3333f369cb7b72b28b3fbe
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Dec 14 17:39:07 2021 +0100

    utils: Add a function to parse a subversion external definition
    
    Add a function to parse an external definition according to official
    specifications in order to extract or compute:
    
      - the relative path where the external should be exported
    
      - the URL of the external
    
      - the optional revision of the external to export
    
    Related to T611

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/236/ for more details.
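
The last commit listed above adds a function that extracts a path, a URL and an optional revision from an external definition. A much-simplified sketch of that idea follows. It is not the loader's implementation: `parse_external` is a hypothetical name, the example URLs are placeholders, and peg revisions (`URL@REV`) as well as other corner cases of the format are deliberately ignored.

```python
import shlex
from typing import Optional, Tuple


def parse_external(definition: str) -> Tuple[str, str, Optional[int]]:
    """Return (path, url, revision) for a single external definition line."""
    path = url = ""
    revision: Optional[int] = None
    tokens = shlex.split(definition)
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "-r":
            # Revision given as two tokens: "-r 123".
            revision = int(tokens[i + 1])
            i += 1
        elif token.startswith("-r"):
            # Revision given as one token: "-r123".
            revision = int(token[2:])
        elif "://" in token or token.startswith(("^/", "//", "/", "../")):
            # Absolute URL or one of the relative URL prefixes.
            url = token
        else:
            # Anything else is taken as the export path.
            path = token
        i += 1
    return path, url, revision
```

Both orderings of the definition are accepted, for example `third-party/skins -r148 http://svn.example.com/skinproj` (path first) and `-r 21 http://svn.example.com/skin-maker third-party/skins/toolkit` (URL first).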

Build is green

Patch application report for D6950 (id=25174)

Could not rebase; Attempt merge onto 93b4f2fdd8...

Updating 93b4f2f..95220c5
Fast-forward
 swh/loader/svn/ra.py                | 297 +++++++++++++-
 swh/loader/svn/svn.py               |  49 ++-
 swh/loader/svn/tests/test_loader.py | 745 +++++++++++++++++++++++++++++++-----
 swh/loader/svn/tests/test_utils.py  | 225 ++++++++++-
 swh/loader/svn/utils.py             | 126 +++++-
 5 files changed, 1326 insertions(+), 116 deletions(-)
Changes applied before test
commit 95220c58ad58e1ff3d8cad7453432d567b010575
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 19:34:44 2022 +0100

    ra: Send modified objects only to storage after replaying a revision
    

commit f3d3eafe017a971952fb1359f19ee1f269edaf4f
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 20:09:56 2022 +0100

    ra: Put externals in cache to avoid exporting them again
    

commit 8c7046e0ab03ae4b8f93134ff0d85d80c4352c69
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Dec 8 11:43:38 2021 +0100

    ra: Add support for subversion external definitions
    

commit 2f5fd60ab91f5af90c3333f369cb7b72b28b3fbe
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Dec 14 17:39:07 2021 +0100

    utils: Add a function to parse a subversion external definition
    

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/239/ for more details.

Build is green

Patch application report for D6950 (id=25178)

Could not rebase; Attempt merge onto 93b4f2fdd8...

Updating 93b4f2f..0d734fa
Fast-forward
 swh/loader/svn/ra.py                | 297 +++++++++++++-
 swh/loader/svn/svn.py               |  52 ++-
 swh/loader/svn/tests/test_loader.py | 745 +++++++++++++++++++++++++++++++-----
 swh/loader/svn/tests/test_utils.py  | 225 ++++++++++-
 swh/loader/svn/utils.py             | 126 +++++-
 5 files changed, 1329 insertions(+), 116 deletions(-)
Changes applied before test
commit 0d734fa0dd778d15db4f8eaddbbfb8c00a1ab693
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 19:34:44 2022 +0100

    ra: Send modified objects only to storage after replaying a revision
    

commit fe9fc903e547ed6f00a8ba2cc4642faa54427989
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 20:09:56 2022 +0100

    ra: Put externals in cache to avoid exporting them again
    

commit e76adb16d51b726eb0a3c5b266579eeba566f48d
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Dec 8 11:43:38 2021 +0100

    ra: Add support for subversion external definitions
    

commit 2f5fd60ab91f5af90c3333f369cb7b72b28b3fbe
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Dec 14 17:39:07 2021 +0100

    utils: Add a function to parse a subversion external definition
    

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/242/ for more details.

This looks like an impressive speedup, kudos.

Rather than add this logic on the svn loader only, we could consider either making swh.model.from_disk support incremental computations by keeping track of the ctime / mtime of on-disk data, and by collecting objects for the new loader. This would make this logic reusable by all loaders.

Without changing the swh.model.from_disk logic, we could also just diff the sets of objects between iterations (and rely on the OS cache for the new computation to be vaguely efficient), and only send new ones to the storage.
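
As a rough illustration of this second option (not code from swh.model: `object_ids` and `new_objects` are hypothetical helpers, the tree is a plain dictionary, and only file contents are hashed while directories are left out), the set difference could look like this:

```python
import hashlib
from typing import Dict, Set


def object_ids(files: Dict[str, bytes]) -> Set[str]:
    """Hash every file of a (hypothetical) in-memory tree: path -> bytes."""
    return {hashlib.sha1(data).hexdigest() for data in files.values()}


def new_objects(previous: Set[str], current: Set[str]) -> Set[str]:
    """Identifiers present in the current iteration but not in the previous one."""
    return current - previous


rev1 = {"a.txt": b"one", "b.txt": b"two"}
rev2 = {"a.txt": b"one", "b.txt": b"three"}

# Only the object for the changed file b.txt would be sent to the storage.
to_send = new_objects(object_ids(rev1), object_ids(rev2))
```

The full set of identifiers is still recomputed at every iteration here, which is the cost the comment expects the OS cache to absorb.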

If we go for the svn-loader-only logic, then, I couldn't figure out if the following cases are properly handled:

  • when modifying the file <root>/a/b/c/d, we need to record a new object for the content <root>/a/b/c/d, but also for the directories <root>/a/b/c, <root>/a/b, <root>/a and <root>. I can only see where your change computes the new hash for the new content, and for the root directory, but not where any of the intermediate dirs are considered.
  • when removing a file <root>/a/b/c, we need to record new objects for the directories <root>/a/b, <root>/a and <root>. In your logic, I only see that <root>/a/b/c is removed from consideration, but I don't see where the parent directories are refreshed.

Is the check_snapshot function really properly fully recursive?

In D6950#180642, @olasd wrote:

> This looks like an impressive speedup, kudos.
>
> Rather than add this logic on the svn loader only, we could consider either making swh.model.from_disk support incremental computations by keeping track of the ctime / mtime of on-disk data, and by collecting objects for the new loader. This would make this logic reusable by all loaders.
>
> Without changing the swh.model.from_disk logic, we could also just diff the sets of objects between iterations (and rely on the OS cache for the new computation to be vaguely efficient), and only send new ones to the storage.

Indeed, that feature could be implemented on the swh.model.from_disk side. The first proposal seems the most reasonable to me in terms of performance
for a loader, but the second one could also be worth having.
Nevertheless, I am not sure the implementation will be that straightforward, and there will be a lot of cases to cover with tests, so it is not a one-day job.
Currently the subversion loader is the only one that needs that directory diff feature, so I think we can keep the implementation as it is for the
moment, but I will create a task to implement the directory diff feature in swh.model.from_disk.

> If we go for the svn-loader-only logic, then, I couldn't figure out if the following cases are properly handled:
>
>   • when modifying the file <root>/a/b/c/d, we need to record a new object for the content <root>/a/b/c/d, but also for the directories <root>/a/b/c, <root>/a/b, <root>/a and <root>. I can only see where your change computes the new hash for the new content, and for the root directory, but not where any of the intermediate dirs are considered.
>   • when removing a file <root>/a/b/c, we need to record new objects for the directories <root>/a/b, <root>/a and <root>. In your logic, I only see that <root>/a/b/c is removed from consideration, but I don't see where the parent directories are refreshed.

That's not something that can be easily guessed from the code but the svn_ra API will trigger the creation of DirEditor objects
for each subpath of added/modified paths. So we are sure to have all modified paths in the set after a replayed revision.
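
The property relied on here, namely that every parent directory of a changed path ends up in the set, can be stated as a small sketch. It does not show how the svn_ra editor callbacks work. `with_ancestors` is a hypothetical helper that only spells out which paths the set is expected to contain, with the repository root written as the empty string.

```python
from typing import Iterable, Set


def with_ancestors(changed: Iterable[str]) -> Set[str]:
    """Expand changed paths with every parent directory, up to the root ('')."""
    result: Set[str] = set()
    for path in changed:
        parts = [part for part in path.split("/") if part]
        # Prefixes of length 0..len(parts): root, each ancestor, the path itself.
        for depth in range(len(parts) + 1):
            result.add("/".join(parts[:depth]))
    return result
```

For a modification of `a/b/c/d` this yields the content path plus `a/b/c`, `a/b`, `a` and the root. For a removal of `a/b/c` the same expansion gives the parent directories that need a new object, once the removed path itself is discarded.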

> Is the check_snapshot function really properly fully recursive?

Yes it is, see code.

Build is green

Patch application report for D6950 (id=25181)

Could not rebase; Attempt merge onto 93b4f2fdd8...

Updating 93b4f2f..aff346c
Fast-forward
 swh/loader/svn/ra.py                | 298 ++++++++++++++-
 swh/loader/svn/svn.py               |  52 ++-
 swh/loader/svn/tests/test_loader.py | 745 +++++++++++++++++++++++++++++++-----
 swh/loader/svn/tests/test_utils.py  | 225 ++++++++++-
 swh/loader/svn/utils.py             | 126 +++++-
 5 files changed, 1330 insertions(+), 116 deletions(-)
Changes applied before test
commit aff346cf118a51c1adbf2bd0815ea4a08eeb5a74
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 19:34:44 2022 +0100

    ra: Send modified objects only to storage after replaying a revision
    

commit fe9fc903e547ed6f00a8ba2cc4642faa54427989
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 20:09:56 2022 +0100

    ra: Put externals in cache to avoid exporting them again
    

commit e76adb16d51b726eb0a3c5b266579eeba566f48d
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Dec 8 11:43:38 2021 +0100

    ra: Add support for subversion external definitions
    

commit 2f5fd60ab91f5af90c3333f369cb7b72b28b3fbe
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Dec 14 17:39:07 2021 +0100

    utils: Add a function to parse a subversion external definition
    

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/243/ for more details.

In D6950#180642, @olasd wrote:

>> Rather than add this logic on the svn loader only, we could consider either making swh.model.from_disk support incremental computations by keeping track of the ctime / mtime of on-disk data, and by collecting objects for the new loader. This would make this logic reusable by all loaders.
>>
>> Without changing the swh.model.from_disk logic, we could also just diff the sets of objects between iterations (and rely on the OS cache for the new computation to be vaguely efficient), and only send new ones to the storage.
>
> Indeed, that feature could be implemented on the swh.model.from_disk side. The first proposal seems the most reasonable to me in terms of performance
> for a loader, but the second one could also be worth having.
> Nevertheless, I am not sure the implementation will be that straightforward, and there will be a lot of cases to cover with tests, so it is not a one-day job.
> Currently the subversion loader is the only one that needs that directory diff feature, so I think we can keep the implementation as it is for the
> moment, but I will create a task to implement the directory diff feature in swh.model.from_disk.

Yeah, of course, that makes sense.

I'm mostly worried about us failing to record one of the changed paths and then being confused about a bunch of missing objects. Having that feature built into the underlying data structure would make sure the logic and edge-case management are centralized. But it's fine to do that later.

>> If we go for the svn-loader-only logic, then, I couldn't figure out if the following cases are properly handled:
>>
>>   • when modifying the file <root>/a/b/c/d, we need to record a new object for the content <root>/a/b/c/d, but also for the directories <root>/a/b/c, <root>/a/b, <root>/a and <root>. I can only see where your change computes the new hash for the new content, and for the root directory, but not where any of the intermediate dirs are considered.
>>   • when removing a file <root>/a/b/c, we need to record new objects for the directories <root>/a/b, <root>/a and <root>. In your logic, I only see that <root>/a/b/c is removed from consideration, but I don't see where the parent directories are refreshed.
>
> That's not something that can be easily guessed from the code but the svn_ra API will trigger the creation of DirEditor objects
> for each subpath of added/modified paths. So we are sure to have all modified paths in the set after a replayed revision.

Okay, good.

>> Is the check_snapshot function really properly fully recursive?
>
> Yes it is, see code.

Awesome. Then I guess I'm fine with not adding a test specific to the contents of the editor.modified_paths set.

This revision is now accepted and ready to land. (Jan 14 2022, 5:44 PM)

Build is green

Patch application report for D6950 (id=25231)

Could not rebase; Attempt merge onto cb4bf60c0e...

Merge made by the 'recursive' strategy.
 swh/loader/svn/ra.py                | 300 ++++++++++++++-
 swh/loader/svn/svn.py               |  52 ++-
 swh/loader/svn/tests/test_loader.py | 745 +++++++++++++++++++++++++++++++-----
 swh/loader/svn/tests/test_utils.py  | 261 ++++++++++++-
 swh/loader/svn/utils.py             | 128 ++++++-
 5 files changed, 1370 insertions(+), 116 deletions(-)
Changes applied before test
commit f41ce6901c8735cc85a48b1736b866668226c37f
Merge: cb4bf60 ca1b045
Author: Jenkins user <jenkins@localhost>
Date:   Tue Jan 18 10:01:46 2022 +0000

    Merge branch 'diff-target' into HEAD

commit ca1b045001694df9a0cbdbba790f960896c7fb7f
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 19:34:44 2022 +0100

    ra: Send modified objects only to storage after replaying a revision
    

commit 76a33cb8f2bcc33129c8c8df160ae21d46767284
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 20:09:56 2022 +0100

    ra: Put externals in cache to avoid exporting them again
    

commit aba4e6e29600cc6b4e4a4c686ff4946b5ce9b077
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Dec 8 11:43:38 2021 +0100

    ra: Add support for subversion external definitions
    

commit 30b3c8427391edc0f88dce202dcb8abb07bf548b
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Dec 14 17:39:07 2021 +0100

    utils: Add a function to parse a subversion external definition
    

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/248/ for more details.

Build is green

Patch application report for D6950 (id=25240)

Could not rebase; Attempt merge onto cb4bf60c0e...

Updating cb4bf60..024ea4c
Fast-forward
 swh/loader/svn/ra.py                | 300 ++++++++++++++-
 swh/loader/svn/svn.py               |  52 ++-
 swh/loader/svn/tests/test_loader.py | 745 +++++++++++++++++++++++++++++++-----
 swh/loader/svn/tests/test_utils.py  | 261 ++++++++++++-
 swh/loader/svn/utils.py             | 128 ++++++-
 5 files changed, 1370 insertions(+), 116 deletions(-)
Changes applied before test
commit 024ea4cbd964340c6c5e75b0295be9ca9de6b9f9
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 19:34:44 2022 +0100

    ra: Send modified objects only to storage after replaying a revision
    

commit aea0e4e4216c2a8d43dbeed65208df232723cfd0
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 20:09:56 2022 +0100

    ra: Put externals in cache to avoid exporting them again
    

commit 13eb16e499e79b7a5914af4961cdc5afeea9eada
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Dec 8 11:43:38 2021 +0100

    ra: Add support for subversion external definitions
    

commit f1913512a5faa0c99d23607b9d63fc6003c729fb
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Dec 14 17:39:07 2021 +0100

    utils: Add a function to parse a subversion external definition
    

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/252/ for more details.

Build is green

Patch application report for D6950 (id=25243)

Could not rebase; Attempt merge onto cb4bf60c0e...

Updating cb4bf60..b35f08f
Fast-forward
 swh/loader/svn/ra.py                | 299 ++++++++++++++-
 swh/loader/svn/svn.py               |  52 ++-
 swh/loader/svn/tests/test_loader.py | 745 +++++++++++++++++++++++++++++++-----
 swh/loader/svn/tests/test_utils.py  | 261 ++++++++++++-
 swh/loader/svn/utils.py             | 128 ++++++-
 5 files changed, 1369 insertions(+), 116 deletions(-)
Changes applied before test
commit b35f08fe60c2ccc22c4d7c0648a371581786b7ec
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 19:34:44 2022 +0100

    ra: Send modified objects only to storage after replaying a revision
    
    Previously, all contents and directories of the reconstructed filesystem
    were sent to the storage after having replayed a svn revision.
    The filtering of the new contents and directories to write to the storage
    was then delegated to the storage filtering proxy.
    
    Proceeding like this has a huge performance impact on the loading of large
    subversion repositories as large sets of objects to archive are filtered
    again and again after each revision replay.
    
    This commit performs the object filtering at the loader level instead of
    delegating that task to the storage filtering proxy.
    It does so by maintaining a set of added or modified paths for a given
    revision while replaying it. As we use the svn_ra API, that set of paths
    can be computed easily and reliably.
    
    This change provides a really significant speedup to the overall loading
    time of a subversion repository.
    
    Related to T3839

commit cf199264266935238d7fb93c5d5f754edce7c589
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 20:09:56 2022 +0100

    ra: Put externals in cache to avoid exporting them again
    
    Some subversion repositories define the same external on different paths.
    
    In order to avoid exporting it multiple times, which consumes network bandwidth
    and slows down the loading, save the exported external in a temporary directory
    on the local filesystem and reuse that copy when the external is set on a path.
    
    Also ensure all the temporary directories created for externals will be deleted
    at the end of the loading process.
    
    Related to T611

commit 9dd1921a9935d118bd97645b2f6844b14db536db
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Dec 8 11:43:38 2021 +0100

    ra: Add support for subversion external definitions
    
    Subversion external definitions set on directories through the use of the
    svn:externals property are now handled by the loader.
    
    As with a svn export operation, the loader attempts to export externals
    in the paths where they are defined. If an external is no longer valid
    (404), the error is ignored and the next one is processed.
    
    The implementation takes care of keeping the reconstructed repository
    filesystem for a revision in sync with a svn export operation while
    externals are added, updated or removed across revision replays.
    
    Related to T611

commit f1913512a5faa0c99d23607b9d63fc6003c729fb
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Dec 14 17:39:07 2021 +0100

    utils: Add a function to parse a subversion external definition
    
    Add a function to parse an external definition according to official
    specifications in order to extract or compute:
    
      - the relative path where the external should be exported
    
      - the URL of the external
    
      - the optional revision of the external to export
    
    Related to T611

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/255/ for more details.
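
The externals cache described in the "ra: Put externals in cache to avoid exporting them again" commit above can be sketched as follows. This is only an illustration under stated assumptions, not the loader's implementation: the class name `ExternalsCache`, the `(url, revision)` cache key and the injected `export` callable are all hypothetical, and the real code may key and store exports differently.

```python
import shutil
import tempfile
from typing import Callable, Dict, Optional, Tuple

# Hypothetical cache key: an external is identified by its URL and revision.
ExternalKey = Tuple[str, Optional[int]]


class ExternalsCache:
    """Export each distinct external once and reuse the local copy."""

    def __init__(self, export: Callable[[str, Optional[int], str], None]):
        # `export(url, revision, dest_dir)` stands in for the real export
        # operation; it is injected so the sketch stays self-contained.
        self._export = export
        self._dirs: Dict[ExternalKey, str] = {}

    def get(self, url: str, revision: Optional[int] = None) -> str:
        """Return a local directory holding the exported external."""
        key = (url, revision)
        if key not in self._dirs:
            tmp_dir = tempfile.mkdtemp(prefix="external-")
            try:
                self._export(url, revision, tmp_dir)
            except Exception:
                # Do not leave a half-populated directory behind.
                shutil.rmtree(tmp_dir, ignore_errors=True)
                raise
            self._dirs[key] = tmp_dir
        return self._dirs[key]

    def cleanup(self) -> None:
        """Delete every temporary directory created for externals."""
        for tmp_dir in self._dirs.values():
            shutil.rmtree(tmp_dir, ignore_errors=True)
        self._dirs.clear()
```

A caller would copy the directory returned by `get()` to each path where the external is set, and call `cleanup()` at the end of the loading process, for example from a `finally` block.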

Currently the subversion loader is the only one that needs the directory diff feature, so I think we can keep the implementation as it is for now.
I will create a task to implement the directory diff feature in swh.model.from_disk.

I have created T3858.
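
For readers following the discussion, the loader-level filtering described in the summary (tracking added or modified paths while replaying a revision) can be sketched roughly as below. This is a simplified illustration, not the code in ra.py: the function names and the flat path-to-object mapping are hypothetical. It assumes that every ancestor directory of a modified path must also be sent, since a directory's identifier depends on its entries.

```python
import posixpath
from typing import Dict, Iterable, Set


def paths_to_send(modified_paths: Iterable[str]) -> Set[str]:
    """Return the modified paths plus all their ancestor directories.

    The root directory is represented by the empty string.
    """
    result: Set[str] = set()
    for path in modified_paths:
        path = path.strip("/")
        while True:
            result.add(path)
            if not path:
                break  # reached the root
            path = posixpath.dirname(path)
    return result


def filter_objects(
    tree: Dict[str, str], modified_paths: Iterable[str]
) -> Dict[str, str]:
    """Keep only the objects of `tree` touched by the replayed revision.

    `tree` is a hypothetical flat mapping from path to object identifier
    standing in for the reconstructed filesystem.
    """
    keep = paths_to_send(modified_paths)
    return {path: obj for path, obj in tree.items() if path in keep}
```

With a tree containing `trunk/a.c`, `trunk/b.c` and `tags`, a revision that only modifies `trunk/a.c` selects the root, `trunk` and `trunk/a.c`, and leaves `trunk/b.c` and `tags` out of what is sent to the storage.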

Build is green

Patch application report for D6950 (id=25251)

Could not rebase; Attempt merge onto cb4bf60c0e...

Updating cb4bf60..f6fbbb7
Fast-forward
 swh/loader/svn/ra.py                | 299 ++++++++++++++-
 swh/loader/svn/svn.py               |  52 ++-
 swh/loader/svn/tests/test_loader.py | 745 +++++++++++++++++++++++++++++++-----
 swh/loader/svn/tests/test_utils.py  | 261 ++++++++++++-
 swh/loader/svn/utils.py             | 128 ++++++-
 5 files changed, 1369 insertions(+), 116 deletions(-)
Changes applied before test
commit f6fbbb789715e3a82603ad9916ea04719e95d7ec
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 19:34:44 2022 +0100

    ra: Send modified objects only to storage after replaying a revision
    
    Previously, all contents and directories of the reconstructed filesystem
    were sent to the storage after having replayed a svn revision.
    The filtering of the new contents and directories to write to the storage
    was then delegated to the storage filtering proxy.
    
    Proceeding like this has a huge performance impact on the loading of large
    subversion repositories as large sets of objects to archive are filtered
    again and again after each revision replay.
    
    This commit performs the object filtering at the loader level instead of
    delegating that task to the storage filtering proxy.
    It does so by maintaining a set of added or modified paths for a given
    revision while replaying it. As we use the svn_ra API, that set of paths
    can be computed easily and reliably.
    
    This change provides a really significant speedup to the overall loading
    time of a subversion repository.
    
    Related to T3839

commit a820d7eab8d56c1b793b4144c6e2f16bf1d78ff1
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Jan 11 20:09:56 2022 +0100

    ra: Put externals in cache to avoid exporting them again
    
    Some subversion repositories define the same external on different paths.
    
    In order to avoid exporting it multiple times, which consumes network bandwidth
    and slows down the loading, save the exported external in a temporary directory
    on the local filesystem and reuse that copy when the external is set on a path.
    
    Also ensure all the temporary directories created for externals will be deleted
    at the end of the loading process.
    
    Related to T611

commit 473fe145f4b7cd6cd1e1c0f9ec72cad04c38c4a8
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Dec 8 11:43:38 2021 +0100

    ra: Add support for subversion external definitions
    
    Subversion external definitions set on directories through the use of the
    svn:externals property are now handled by the loader.
    
    As with a svn export operation, the loader attempts to export externals
    in the paths where they are defined. If an external is no longer valid
    (404), the error is ignored and the next one is processed.
    
    The implementation takes care of keeping the reconstructed repository
    filesystem for a revision in sync with a svn export operation while
    externals are added, updated or removed across revision replays.
    
    Related to T611

commit f1913512a5faa0c99d23607b9d63fc6003c729fb
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Dec 14 17:39:07 2021 +0100

    utils: Add a function to parse a subversion external definition
    
    Add a function to parse an external definition according to official
    specifications in order to extract or compute:
    
      - the relative path where the external should be exported
    
      - the URL of the external
    
      - the optional revision of the external to export
    
    Related to T611

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/259/ for more details.