SVN loader: Add efficient loader based on remote dumps
ClosedPublic

Authored by anlambert on Sep 24 2018, 2:21 PM.

Details

Reviewers
ardumont
Group Reviewers
Reviewers
Summary

This diff adds a new loader class, SvnLoaderFromRemoteDump, that loads
svn repositories more efficiently. The loader relies on dump files
generated with the svnrdump tool. Using dump files greatly speeds up
the loading process compared to a client/server-based approach.
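
As a rough illustration of the dump-based approach (not the code in this diff), the flow can be sketched as: dump the remote repository with svnrdump, create an empty local repository, then replay the dump into it with svnadmin. The function names `svnrdump_command` and `dump_and_mount` are hypothetical, and the sketch assumes the `svnrdump` and `svnadmin` binaries are on the PATH.

```python
import subprocess
from typing import List, Optional


def svnrdump_command(svn_url: str, max_rev: Optional[int] = None) -> List[str]:
    """Build the svnrdump invocation that writes a dump of svn_url to stdout.

    Hypothetical helper, not part of the loader under review.
    """
    cmd = ["svnrdump", "dump", "--quiet"]
    if max_rev is not None:
        # Always start at revision 0 so the dump can be replayed on its own.
        cmd += ["--revision", f"0:{max_rev}"]
    cmd.append(svn_url)
    return cmd


def dump_and_mount(svn_url: str, dump_path: str, repo_path: str) -> None:
    """Dump a remote repository, then replay the dump into a new local one."""
    with open(dump_path, "wb") as dump_file:
        subprocess.run(svnrdump_command(svn_url), stdout=dump_file, check=True)
    subprocess.run(["svnadmin", "create", repo_path], check=True)
    with open(dump_path, "rb") as dump_file:
        subprocess.run(
            ["svnadmin", "load", "--quiet", repo_path],
            stdin=dump_file,
            check=True,
        )
```

Keeping the command construction separate from its execution makes the invocation easy to check without a live repository.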

This is an updated version of D433 (automatically closed as I
pushed it in a separate branch to keep track of the work done
so far). It discards the incremental loading approach based on
partial dumps, as numerous tests with real-world repositories
ended in errors.

Related T1161

Diff Detail

Repository
rDLDSVN Subversion (SVN) loader
Branch
svnrdump-loader
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1434
Build 1778: arc lint + arc unit

Event Timeline

Sounds good.

Having a test for it would be nice.

Not that I would know how to proceed, though... Would file:///something/somewhere be considered a remote svn repository? I would think so, but I'm not 100% certain.

swh/loader/svn/loader.py
695

Wondering if that could also be used in the loader that mounts a dump from an archive dump:

  • Retrieve last svn revision seen for it (if any)
  • Truncate the dump (both compressed/uncompressed now that it is opened here ;)
  • Then load

Update diff:

  • remove unneeded parameter to SvnLoaderFromRemoteDump constructor
  • add test
swh/loader/svn/loader.py
695

Clarifying.
Today, when mounting a compressed archive dump as a repository prior to loading it, we do not check in advance if we have data for it or not.

Whatever the reality of it, we mount it completely.
I was hoping to reduce the mounting to whatever is necessary.

If we did what you do there, checking for the last svn revision seen, we could try to truncate the dump starting from that revision only.

But maybe i'm mixing information from the initial diff and/or that is not doable ;)

Cheers,

ardumont added inline comments.
swh/loader/svn/tests/test_loader.py
975 ↗(On Diff #1356)

\m/

This revision is now accepted and ready to land. Sep 25 2018, 12:07 PM
swh/loader/svn/loader.py
695

Unfortunately, truncating a dump file to a range [revN, revM] where N > 0 can leave svnadmin unable to mount a repository from it.

This is what I got in https://forge.softwareheritage.org/T1161#22461 when trying to incrementally load the apache repo.

Besides that, there is no issue truncating to the range [rev0, revN], as the whole repository history can be replayed without errors.
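
A minimal sketch of truncating to [rev0, revN], assuming the dump is read line by line and cut just before the first `Revision-number:` header greater than N. The helper `truncate_dump` is hypothetical and not part of this diff. It is deliberately naive: a versioned file whose content contains a line that looks like such a header would fool it, so a robust version would need to honour the content-length headers of the dump format.

```python
import io
import re
from typing import BinaryIO

# Revision records in a Subversion dump stream start with this header line.
REVISION_HEADER = re.compile(rb"Revision-number: (\d+)")


def truncate_dump(src: BinaryIO, dst: BinaryIO, max_rev: int) -> int:
    """Copy src to dst up to and including revision max_rev.

    Returns the highest revision number copied, or -1 if none was copied.
    """
    last_copied = -1
    for line in src:
        match = REVISION_HEADER.fullmatch(line.rstrip(b"\r\n"))
        if match:
            rev = int(match.group(1))
            if rev > max_rev:
                break  # everything from here on belongs to later revisions
            last_copied = rev
        dst.write(line)
    return last_copied


# Tiny hand-written stand-in for a dump stream (not a complete, loadable dump).
SAMPLE = (
    b"SVN-fs-dump-format-version: 3\n\n"
    b"Revision-number: 0\n\n"
    b"Revision-number: 1\n\n"
    b"Revision-number: 2\n\n"
)
truncated = io.BytesIO()
last_rev = truncate_dump(io.BytesIO(SAMPLE), truncated, max_rev=1)
```

Because the function only needs an iterable binary stream, the same code should work on a plain file opened in binary mode or on a stream returned by `gzip.open`.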

Closed by commit rDLDSVN677225bbd57d (I fixed a last minute typo detected by codespell, so original commit id changed).

ok ;)

swh/loader/svn/loader.py
695

> Unfortunately, truncating a dump file to a range [revN, revM] where N > 0 can leave svnadmin unable to mount a repository from it.
> This is what I got in https://forge.softwareheritage.org/T1161#22461 when trying to incrementally load the apache repo.

Right. But that's not where I was aiming.

> Besides that, there is no issue truncating to the range [rev0, revN], as the whole repository history can be replayed without errors.

That's what I wanted to say, sorry for not being so clear today ;)
That way, we can replay the same local dump time and again without always mounting it for nothing (since nothing new exists).
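
One way to picture that check, as a sketch only: scan the dump for its highest revision header and skip the mount when it does not exceed the last revision already loaded. The names `dump_head_revision` and `needs_mounting` are hypothetical, and how the last seen revision is retrieved is left out. The scan is line-based and shares the usual caveat that file content resembling a revision header would mislead it.

```python
import io
import re
from typing import BinaryIO, Optional

REVISION_HEADER = re.compile(rb"Revision-number: (\d+)")


def dump_head_revision(dump: BinaryIO) -> int:
    """Return the highest revision number found in the dump, or -1 if none."""
    head = -1
    for line in dump:
        match = REVISION_HEADER.fullmatch(line.rstrip(b"\r\n"))
        if match:
            head = max(head, int(match.group(1)))
    return head


def needs_mounting(dump: BinaryIO, last_seen_rev: Optional[int]) -> bool:
    """Decide whether the dump holds revisions newer than the last one loaded."""
    if last_seen_rev is None:
        return True  # nothing loaded yet, so the dump must be mounted
    return dump_head_revision(dump) > last_seen_rev


# Hand-written stand-in for a dump stream whose head is revision 2.
SAMPLE_DUMP = b"Revision-number: 0\n\nRevision-number: 1\n\nRevision-number: 2\n\n"
already_loaded = needs_mounting(io.BytesIO(SAMPLE_DUMP), last_seen_rev=2)
has_new_revisions = needs_mounting(io.BytesIO(SAMPLE_DUMP), last_seen_rev=1)
```

Reading the dump once is still work, but it avoids creating a repository and replaying the history when there is nothing new to load.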