SVN loader: Create local dump of remote repository to speed up loading task
Closed, Migrated

Description

As the "save code now" feature will be soon deployed to production, it would be nice to offer the possibility to save subversion origins as they are numerous public repositories available out there.

Currently, there are two ways of loading a subversion repository into the archive:

  • through a dump file of the repository (generated with svnadmin)
  • directly through the repository url, using an svn client to request the needed data from the associated server

While the latter is what we want to use to save arbitrary public svn repositories, there are a couple of issues with it:

  • it floods the subversion server with a lot of requests
  • it is quite slow (due to the client/server ping pong)

I think we should load svn origins the same way we load git and mercurial ones: clone the whole repository locally, then load it into the archive from that local copy.
Fortunately, subversion has a utility command called svnrdump [1] that can generate a dump stream of a repository's revisions. By saving that dump stream to a file, we can then use it to load the repository into the archive.

My first experiments show a great speedup in svn loading tasks when using a local repository dump.
For instance, loading this svn repository with 434 revisions hosted on sourceforge: http://svn.code.sf.net/p/e-foto/code/,
took 2986s with the client/server ping-pong approach and 594s with the dump-then-load approach, a 5x speedup.

[1] http://svnbook.red-bean.com/en/1.7/svn.ref.svnrdump.c.dump.html

Related T336

Event Timeline

anlambert triaged this task as Wishlist priority. Jul 27 2018, 2:54 PM
anlambert created this task.
anlambert renamed this task from "Subversion loader: Create a local dump of a remote repository to speed up loading task" to "SVN loader: Create local dump of remote repository to speed up loading task". Jul 27 2018, 4:46 PM

So I took some time to dig a little further into that idea of creating a dump file using the
svnrdump command from the official tools shipped with subversion.

As said in the task description, the simple approach of generating a dump of a remote
repository, mounting it locally, and then loading it into the archive works.
To summarize, the commands issued before starting the loading process are:

$ mkdir -p <tmp_dir>
$ cd <tmp_dir>
$ svnrdump dump http://svn.code.sf.net/p/e-foto/code/ > e-foto.svndump
* Dumped revision 0.
* Dumped revision 1.
* Dumped revision 2.
* Dumped revision 3.
* Dumped revision 4.
* Dumped revision 5.
...
$ svnadmin create e-foto
$ cat e-foto.svndump | svnadmin load e-foto
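
For reference, here is a minimal Python sketch of how those steps could be automated by a loader. The helper name is hypothetical; the only assumption is that svnrdump and svnadmin are available on the PATH.

import os
import subprocess
import tempfile

def dump_and_mount(remote_url, name):
    """Dump a remote svn repository and mount it in a local repository.

    Returns the file:// url of the local mount, to be used as the svn url
    while the remote url is kept as the origin url in the archive.
    """
    tmp_dir = tempfile.mkdtemp()
    dump_file = os.path.join(tmp_dir, name + '.svndump')
    repo_dir = os.path.join(tmp_dir, name)
    # svnrdump streams the whole repository history to stdout
    with open(dump_file, 'wb') as f:
        subprocess.run(['svnrdump', 'dump', remote_url], stdout=f, check=True)
    # mount the dump into a freshly created local repository
    subprocess.run(['svnadmin', 'create', repo_dir], check=True)
    with open(dump_file, 'rb') as f:
        subprocess.run(['svnadmin', 'load', repo_dir], stdin=f, check=True)
    return 'file://' + repo_dir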

The SVN loader then imports the repository contents using file:///<tmp_dir>/e-foto
as the svn url while keeping http://svn.code.sf.net/p/e-foto/code/ as the origin url in the archive.
The main benefit of proceeding that way is loading performance: there is a nice
speedup compared to the basic approach of relying on an svn client (which needs to
send a lot of requests to the server).

Another question came to my mind when experimenting with svnrdump:
could we use the tool to implement an efficient, incremental loader for subversion?
That is, once an active svn origin has been loaded into the archive, would it be
possible on a later visit to dump and load only the new revisions (thus avoiding the
download of data we already have)? The svnrdump command allows specifying a range
of revisions to dump, so there might be a way. Below is what I tried so far on the subject.

For my tests, I will use some extracts of a googlecode svn repository named pyang-repo.
The idea is to simulate an active svn repository by incrementally adding new revisions to
it from generated incremental svn dumps. Let's generate those dump files first.

$ cd ~/tmp
$ scp anlambert@uffizi:/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/p/pyang/pyang-repo.svndump.gz .
$ svnadmin create pyang-repo
$ gzip -dc pyang-repo.svndump.gz | svnadmin load pyang-repo
$ svnrdump dump -r0:10 file:///home/antoine/tmp/pyang-repo > pyang-repo-r0-10.svndump
$ svnrdump dump -r11:20 --incremental file:///home/antoine/tmp/pyang-repo > pyang-repo-r11-20.svndump
$ svnrdump dump -r21:30 --incremental file:///home/antoine/tmp/pyang-repo > pyang-repo-r21-30.svndump
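
(A note on the flags above, based on my understanding of svnrdump: the first dump is a full one, while --incremental makes each subsequent range start with a diff against the previous revision instead of a full snapshot of the tree, which is what allows these extracts to be loaded on top of one another.)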

Next, we create an svn repository containing the first 10 revisions.

$ svnadmin create pyang-repo-test
$ cat pyang-repo-r0-10.svndump | svnadmin load pyang-repo-test

We then create a loading task to load that repository into my local archive.
I use a modified version of the loader that calls svnrdump to generate and mount the
repository dump prior to executing the loading process. As this is the first visit
of the repository, the load is successful.

[2018-09-20 17:24:27,373: INFO/MainProcess] Received task: swh.loader.svn.tasks.DumpMountAndLoadSvnRepositoryTsk[e196a5c3-18ae-4ba7-beb5-4d89679a5913]
[2018-09-20 17:24:27,395: DEBUG/Worker-2] Executing svnrdump dump file:///home/antoine/tmp/pyang-repo-test
[2018-09-20 17:24:28,058: DEBUG/Worker-2] PID 18980 is live, skipping
[2018-09-20 17:24:28,060: DEBUG/Worker-2] Creating svn origin for file:///home/antoine/tmp/pyang-repo-test
[2018-09-20 17:24:28,072: DEBUG/Worker-2] Done creating svn origin for file:///home/antoine/tmp/pyang-repo-test
[2018-09-20 17:24:28,073: DEBUG/Worker-2] Creating origin_visit for origin 3 at time 2018-09-20 15:24:28.073184+00:00
[2018-09-20 17:24:28,086: DEBUG/Worker-2] Done Creating origin_visit for origin 3 at time 2018-09-20 15:24:28.073184+00:00
[2018-09-20 17:24:28,113: INFO/Worker-2] Processing revisions [1-10] for {'local_url': b'/home/antoine/tmp/swh.loader.svn.geqbf0z1-18980/tmprg49g3go', 'swh-origin': 3, 'uuid': b'2ad08786-310e-4110-8bb1-5030b78c55b1', 'remote_url': 'file:///home/antoine/tmp/swh.loader.svn.5hb_9qu8-18980/tmprg49g3go'}
[2018-09-20 17:24:28,137: DEBUG/Worker-2] rev: 1, swhrev: 227ec29ee52db842e0554ccd5ae055bd50ee5f81, dir: 75ed58f260bfa4102d0e09657803511f5f0ab372
[2018-09-20 17:24:28,138: DEBUG/Worker-2] Checking hash computations on revision 1...
[2018-09-20 17:24:28,178: DEBUG/Worker-2] rev: 2, swhrev: 8ba8db4661940c44c61eb50671d0d5e790a693a9, dir: e393131425cbfb316701d83a4e0c8130c4bb91d1
[2018-09-20 17:24:28,179: DEBUG/Worker-2] Checking hash computations on revision 2...
[2018-09-20 17:24:28,210: DEBUG/Worker-2] rev: 3, swhrev: faf0579298fbd0d80caa36b37ea3349f4d62fda7, dir: f85772679d57e094c6cf2e55d872e0e91c1f22af
[2018-09-20 17:24:28,211: DEBUG/Worker-2] Checking hash computations on revision 3...
[2018-09-20 17:24:28,233: DEBUG/Worker-2] rev: 4, swhrev: de936f10d58297e4f6c700a67a9bc1c0126f41fe, dir: 20ce49b93ce92da5f4280ae9c0bd1a53ad7ca892
[2018-09-20 17:24:28,233: DEBUG/Worker-2] Checking hash computations on revision 4...
[2018-09-20 17:24:28,248: DEBUG/Worker-2] rev: 5, swhrev: 28dd784503f9f447ff296b41d54950dd1edf9610, dir: 04b96c66f27d63788b7b884ad3e3496d50946963
[2018-09-20 17:24:28,248: DEBUG/Worker-2] Checking hash computations on revision 5...
[2018-09-20 17:24:28,284: DEBUG/Worker-2] rev: 6, swhrev: 115cf96872d25e22c550cf8315f2e6628464bd00, dir: 1282ef91d4dd0217ed66b9b79b35544acdec40e6
[2018-09-20 17:24:28,285: DEBUG/Worker-2] Checking hash computations on revision 6...
[2018-09-20 17:24:28,326: DEBUG/Worker-2] rev: 7, swhrev: 9487fc27a8a561484fbc413d33e123549f0f6c22, dir: b24bb7392d3ae4cddfc38d44a2e564da9d22c87a
[2018-09-20 17:24:28,326: DEBUG/Worker-2] Checking hash computations on revision 7...
[2018-09-20 17:24:28,357: DEBUG/Worker-2] rev: 8, swhrev: 775ffdb79ea907ac7dcf86b3a7f3acad957cd14d, dir: 8f1a4abdb51379314dd7576d9dff278a3ca46e7f
[2018-09-20 17:24:28,357: DEBUG/Worker-2] Checking hash computations on revision 8...
[2018-09-20 17:24:28,389: DEBUG/Worker-2] rev: 9, swhrev: 840a47a47a618aa6146c3452e1db4a1997fef4f0, dir: 8572aa3c5922a9a3c70ddb7bfc7f8db8082bc4d5
[2018-09-20 17:24:28,389: DEBUG/Worker-2] Checking hash computations on revision 9...
[2018-09-20 17:24:28,426: DEBUG/Worker-2] rev: 10, swhrev: e7ab2783be5b60554b3a638cbaf96fa9ac74ea75, dir: da231a71a515ab96cce80e14b656ebcefc1f89cc
[2018-09-20 17:24:28,426: DEBUG/Worker-2] Checking hash computations on revision 10...
[2018-09-20 17:24:28,454: DEBUG/Worker-2] Sending 84 contents
[2018-09-20 17:24:28,734: DEBUG/Worker-2] Done sending 84 contents
[2018-09-20 17:24:28,735: DEBUG/Worker-2] Sending 47 directories
[2018-09-20 17:24:28,771: DEBUG/Worker-2] Done sending 47 directories
[2018-09-20 17:24:28,772: DEBUG/Worker-2] Sending 10 revisions
[2018-09-20 17:24:28,792: DEBUG/Worker-2] Done sending 10 revisions
[2018-09-20 17:24:28,792: DEBUG/Worker-2] Processed 10 revisions: [..., e7ab2783be5b60554b3a638cbaf96fa9ac74ea75]
[2018-09-20 17:24:28,792: DEBUG/Worker-2] snapshot: {'branches': {b'master': {'target': b"\xe7\xab'\x83\xbe[`UK:c\x8c\xba\xf9o\xa9\xact\xeau", 'target_type': 'revision'}}, 'id': b'\x8a\xa4\xb0\x94\x9c\x9c\xb9P\x13\xcf|\t!+f\t\xf0\xb4!j'}
[2018-09-20 17:24:28,808: DEBUG/Worker-2] Updating origin_visit for origin 3 with status full
[2018-09-20 17:24:28,813: DEBUG/Worker-2] Done updating origin_visit for origin 3 with status full
[2018-09-20 17:24:28,816: DEBUG/Worker-2] Clean up temporary directory dump /home/antoine/tmp/swh.loader.svn.5hb_9qu8-18980 for project tmprg49g3go
[2018-09-20 17:24:28,832: INFO/MainProcess] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepositoryTsk[e196a5c3-18ae-4ba7-beb5-4d89679a5913] succeeded in 1.454350747000717s: {'status': 'eventful'}

Then we add ten more revisions to our test repository.

$ cat pyang-repo-r11-20.svndump | svnadmin load pyang-repo-test

Next, we create a new loading task that will visit the repository again. This time,
the -r<rev_start>:<rev_end> option of svnrdump will be used to dump only
the new revisions, and the loader will try to ingest only those. The idea is to create
a partial dump starting at the last ingested revision. The last loaded svn revision was number 10,
so the underlying executed command will be svnrdump dump -r10:head file:///home/antoine/tmp/pyang-repo-test.
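
A rough sketch of how that command could be derived from the archive state (hypothetical names, not the loader's actual API):

import subprocess

def dump_new_revisions(svn_url, last_loaded_rev, dump_path):
    # Start the range at the last revision already in the archive, so the
    # loader can first check it against the archived hashes and then ingest
    # only the revisions that follow it.
    rev_range = '%d:head' % last_loaded_rev
    with open(dump_path, 'wb') as f:
        subprocess.run(['svnrdump', 'dump', '-r', rev_range, svn_url],
                       stdout=f, check=True)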
Let's see the result of the loading task.

[2018-09-20 17:31:19,917: INFO/MainProcess] Received task: swh.loader.svn.tasks.DumpMountAndLoadSvnRepositoryTsk[b0451629-f18c-4a19-8ab6-dc80e97458d2]
[2018-09-20 17:31:19,948: DEBUG/Worker-1] Executing svnrdump dump -r10:head file:///home/antoine/tmp/pyang-repo-test
svnadmin: E160006: Relative source revision -1 is not available in current repository
[2018-09-20 17:31:20,231: ERROR/MainProcess] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepositoryTsk[b0451629-f18c-4a19-8ab6-dc80e97458d2] raised unexpected: ValueError('Failed to mount the svn dump for project tmpoej9czhs',)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/antoine/swh/swh-environment/swh-scheduler/swh/scheduler/task.py", line 161, in run
    raise e from None
  File "/home/antoine/swh/swh-environment/swh-scheduler/swh/scheduler/task.py", line 158, in run
    result = self.run_task(*args, **kwargs)
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/tasks.py", line 81, in run_task
    loader = SWHSvnLoaderFromRemoteDump(svn_url)
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 657, in __init__
    root_dir=self.temp_directory)
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/utils.py", line 80, in init_svn_repo_from_dump
    raise e
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/utils.py", line 76, in init_svn_repo_from_dump
    project_name)
ValueError: Failed to mount the svn dump for project tmpoej9czhs

Hum, that does not look so good. Let's try to mount the partial dump ourselves to get more details
about the error.

$ svnadmin create pyang-repo-r10-20
$ svnrdump dump -r10:20 file:///home/antoine/tmp/pyang-repo-test > pyang-repo-r10-20.svndump
$ cat pyang-repo-r10-20.svndump | svnadmin load pyang-repo-r10-20
... 
------- Committed new rev 1 (loaded from original rev 10) >>>

<<< Started new transaction, based on original revision 11
     * editing path : trunk/pyang/main.py ... done.
     * editing path : trunk/test/test_bad/expect/xt1.yang.out ... done.
     * editing path : trunk/test/test_bad/expect/xt2.yang.out ... done.
     * editing path : trunk/test/test_bad/update_expect.sh ...svnadmin: E160006: Relative source revision -1 is not available in current repository

Let's track down the issue in the dump file; it comes from this extract:

Node-path: trunk/test/test_bad/update_expect.sh
Node-kind: file
Node-action: add
Node-copyfrom-rev: 8
Node-copyfrom-path: trunk/test/test_bad/expect_update.sh
Text-delta: true
Text-delta-base-md5: 0554650cd871a4eadca9c1cd56d7b61e
Text-content-md5: 23b2b61694b69d1fa3eaf62d95645a7b
Text-content-length: 125
Content-length: 125

The problematic part here is the line Node-copyfrom-rev: 8. A file has been copied with
an svn copy command, and subversion keeps a reference to the last revision in which that file was modified. Based
on my understanding, subversion allows copying a file into the current source tree from any previous revision
using svn copy <file>@<rev>, which is why that information ends up in the dump file.
The problem here is that we only wanted to load revisions 10 to 20 into an empty repository. As the first
ten revisions are not in it, there is no revision 8 available: the load fails and it is impossible to mount that partial dump.
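
For what it's worth, this failure mode could be detected up front with a naive scan of the dump for copy sources that predate the dumped range. The helper below is a hypothetical sketch: a robust version would have to honor the Content-length records so that file contents cannot be misread as headers.

def out_of_range_copy_sources(dump_path, first_dumped_rev):
    # Collect Node-copyfrom-rev values pointing before the dumped range;
    # any hit means svnadmin load will fail on the partial dump.
    hits = []
    with open(dump_path, 'rb') as f:
        for line in f:
            if line.startswith(b'Node-copyfrom-rev: '):
                rev = int(line.split(b': ', 1)[1])
                if rev < first_dumped_rev:
                    hits.append(rev)
    return hits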

So at this point, the possibility of incrementally loading a subversion repository went away, as this is an error
that will likely occur pretty often.

Nevertheless, I continued digging for information on the subject and then I found rsvndump (http://rsvndump.sourceforge.net/)!
What is that thing? Let's read its description:

rsvndump is a command line tool that is able to dump a Subversion repository that resides on a remote server. All data is dumped in the format that can be read and written by svnadmin load/dump, so the data which is produced can easily be imported into a new Subversion repository.

rsvndump supports most of the functionality of the normal svn client program and svnadmin, e.g. authentication, SSL-support, incremental dumps and support for dumping text deltas instead of full text.

Starting with Subversion 1.7 in October 2011, the official distribution already contains a remote dumpfile tool called svnrdump. While both tools serve the same purpose, rsvndump is also able to dump a subdirectory of a repository by resolving copy operation as needed, even without read access to the repository root. Furthermore, rsvndump may work better with old Subversion servers (i.e. versions prior to 1.5).

Also, this extract of the man page directly rings a bell:

rsvndump will generally replace a copy action by a simple add operation if both of the following conditions are true:

    The source of the copy is outside the directory tree which is being dumped

    The source of the copy is not included in the dump because the revision range has been limited using the --revision flag

So let's try again to load revisions 11 to 20, using rsvndump instead of svnrdump
in our custom loader. Below is the output of the loading task.

[2018-09-20 19:31:27,312: INFO/MainProcess] Received task: swh.loader.svn.tasks.DumpMountAndLoadSvnRepositoryTsk[570ae581-f5ae-433c-8013-4df2a227b764]
[2018-09-20 19:31:27,348: DEBUG/Worker-4] Executing rsvndump -r 10:HEAD --deltas file:///home/antoine/tmp/pyang-repo-test
[2018-09-20 19:31:27,969: DEBUG/Worker-4] PID 31620 is live, skipping
[2018-09-20 19:31:27,970: DEBUG/Worker-4] Creating svn origin for file:///home/antoine/tmp/pyang-repo-test
[2018-09-20 19:31:27,977: DEBUG/Worker-4] Done creating svn origin for file:///home/antoine/tmp/pyang-repo-test
[2018-09-20 19:31:27,977: DEBUG/Worker-4] Creating origin_visit for origin 1 at time 2018-09-20 17:31:27.977670+00:00
[2018-09-20 19:31:27,985: DEBUG/Worker-4] Done Creating origin_visit for origin 1 at time 2018-09-20 17:31:27.977670+00:00
[2018-09-20 19:31:28,011: DEBUG/Worker-4] svn export --ignore-keywords file:///home/antoine/tmp/swh.loader.svn.8puo04xw-31620/tmpo09mpuwv@10
[2018-09-20 19:31:28,073: DEBUG/Worker-4] snapshot: {'id': b'\x8a\xa4\xb0\x94\x9c\x9c\xb9P\x13\xcf|\t!+f\t\xf0\xb4!j', 'branches': {b'master': {'target_type': 'revision', 'target': b"\xe7\xab'\x83\xbe[`UK:c\x8c\xba\xf9o\xa9\xact\xeau"}}}
[2018-09-20 19:31:28,079: ERROR/Worker-4] Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/home/antoine/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 890, in load
    self.store_data()
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 485, in store_data
    start_from_scratch=self.start_from_scratch)
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 287, in process_repository
    raise SvnLoaderHistoryAltered(msg)
swh.loader.svn.exception.SvnLoaderHistoryAltered: History of svn file:///home/antoine/tmp/swh.loader.svn.8puo04xw-31620/tmpo09mpuwv@10 altered. Skipping...
[2018-09-20 19:31:28,085: DEBUG/Worker-4] Updating origin_visit for origin 1 with status partial
[2018-09-20 19:31:28,090: DEBUG/Worker-4] Done updating origin_visit for origin 1 with status partial
[2018-09-20 19:31:28,094: DEBUG/Worker-4] Clean up temporary directory dump /home/antoine/tmp/swh.loader.svn.8puo04xw-31620 for project tmpo09mpuwv
[2018-09-20 19:31:28,120: INFO/MainProcess] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepositoryTsk[570ae581-f5ae-433c-8013-4df2a227b764] succeeded in 0.8060262239960139s: {'status': 'failed'}

That's a little bit better but it still fails. The good news is that the generated partial dump has been successfully mounted
by svnadmin load this time. The bad news is that the loading fails when trying to ingest the first revision in the dump.

After some investigation, the issue comes from the fact that when dumping a specific range of revisions with
rsvndump (or even svnrdump), revisions get renumbered in the range [0, N]. So when the loader tries to
ingest revision number 10 (it knows the last svn revision saved into the archive thanks to the swh
revision metadata), it first exports that same revision from the dump in order to check that its checksum
matches the one in the archive. But due to the renumbering, revision number 10 in the dump
is in fact revision number 19 of the original repository, which is why SvnLoaderHistoryAltered is raised.

If we could somehow keep the same revision numbers in the dump file as in the original repository,
that should do the trick. But wait, what is that rsvndump option?

--keep-revnums

   Keep the revision numbers in the output in sync with the repository. This is done by inserting empty revisions for padding if necessary.

That looks awesome for our issue! Let's try again to load revisions 11 to 20 with it.

[2018-09-20 19:33:41,630: INFO/MainProcess] Received task: swh.loader.svn.tasks.DumpMountAndLoadSvnRepositoryTsk[4f898276-3542-402f-a6ee-f1ff226bcdf4]
[2018-09-20 19:33:41,679: DEBUG/Worker-2] Executing rsvndump -r 10:HEAD --keep-revnums --deltas file:///home/antoine/tmp/pyang-repo-test
[2018-09-20 19:33:42,604: DEBUG/Worker-2] PID 1018 is live, skipping
[2018-09-20 19:33:42,605: DEBUG/Worker-2] Creating svn origin for file:///home/antoine/tmp/pyang-repo-test
[2018-09-20 19:33:42,613: DEBUG/Worker-2] Done creating svn origin for file:///home/antoine/tmp/pyang-repo-test
[2018-09-20 19:33:42,613: DEBUG/Worker-2] Creating origin_visit for origin 1 at time 2018-09-20 17:33:42.613542+00:00
[2018-09-20 19:33:42,622: DEBUG/Worker-2] Done Creating origin_visit for origin 1 at time 2018-09-20 17:33:42.613542+00:00
[2018-09-20 19:33:42,647: DEBUG/Worker-2] svn export --ignore-keywords file:///home/antoine/tmp/swh.loader.svn.w_7thaci-1018/tmpm5be3jh0@10
[2018-09-20 19:33:42,682: INFO/Worker-2] Processing revisions [11-20] for {'uuid': b'2ad08786-310e-4110-8bb1-5030b78c55b1', 'local_url': b'/home/antoine/tmp/swh.loader.svn.1byilb6q-1018/tmpm5be3jh0', 'remote_url': 'file:///home/antoine/tmp/swh.loader.svn.w_7thaci-1018/tmpm5be3jh0', 'swh-origin': 1}
[2018-09-20 19:33:42,703: DEBUG/Worker-2] rev: 11, swhrev: 2f5adbfc07df85e93547c190a290d718c1aa541e, dir: ddbd7c962f529afcbab632b79272eb80592994cc
[2018-09-20 19:33:42,703: DEBUG/Worker-2] Checking hash computations on revision 11...
[2018-09-20 19:33:42,743: DEBUG/Worker-2] rev: 12, swhrev: 9eaadb41fb62b1c5d2fde9e0ec696c61602c8097, dir: 0fc5d1e825be8699a701fbed265ba4a8daa7cfaf
[2018-09-20 19:33:42,743: DEBUG/Worker-2] Checking hash computations on revision 12...
[2018-09-20 19:33:42,778: DEBUG/Worker-2] rev: 13, swhrev: 8bc90449382e41270c888dc5f0600381eb38dc31, dir: a0efae7d0533574de1fdd33b6b718073dcaed643
[2018-09-20 19:33:42,778: DEBUG/Worker-2] Checking hash computations on revision 13...
[2018-09-20 19:33:42,814: DEBUG/Worker-2] rev: 14, swhrev: d8fa2e0c35afa202f753325b9290f867dca3bfcb, dir: 9fd979af6ccba32a6f51d96b80d3edff0d4c1b8e
[2018-09-20 19:33:42,815: DEBUG/Worker-2] Checking hash computations on revision 14...
[2018-09-20 19:33:42,852: DEBUG/Worker-2] rev: 15, swhrev: cb0a7c987e432ecf56fbc18fa2761692c8badffe, dir: 183f8ecad659fc699c28bc81fbb7abf3a72f89de
[2018-09-20 19:33:42,852: DEBUG/Worker-2] Checking hash computations on revision 15...
[2018-09-20 19:33:42,915: DEBUG/Worker-2] rev: 16, swhrev: a1b3cf2aebc70ebccc6393b1ab7caa5034a0a138, dir: 98f79f2df6d3dcce9cfa547eb2bc57d77fbd718b
[2018-09-20 19:33:42,915: DEBUG/Worker-2] Checking hash computations on revision 16...
[2018-09-20 19:33:42,973: DEBUG/Worker-2] rev: 17, swhrev: 0d387d54db879d6dd1c9b23ebd8dca6c4790ab0b, dir: 7084f428baaf22c5334bd768b8d7902d7fd2544d
[2018-09-20 19:33:42,973: DEBUG/Worker-2] Checking hash computations on revision 17...
[2018-09-20 19:33:43,035: DEBUG/Worker-2] rev: 18, swhrev: b5a79d613e688928cc1c692be65941c838c15d08, dir: 39b642d852585bd042446877200be5a12b31abfc
[2018-09-20 19:33:43,035: DEBUG/Worker-2] Checking hash computations on revision 18...
[2018-09-20 19:33:43,095: DEBUG/Worker-2] rev: 19, swhrev: 4b7d6f572127fe226c2a9648912aa2922b37d39a, dir: 6c03d6bb437df7cab15f5fefc6222e12ad6f2255
[2018-09-20 19:33:43,095: DEBUG/Worker-2] Checking hash computations on revision 19...
[2018-09-20 19:33:43,155: DEBUG/Worker-2] rev: 20, swhrev: 44d3cb050d23779b5de6b979c3d83fc2ace20123, dir: 4c2c91a2de93aa8300d2d05e59305e0c5b2c7705
[2018-09-20 19:33:43,155: DEBUG/Worker-2] Checking hash computations on revision 20...
[2018-09-20 19:33:43,209: DEBUG/Worker-2] Sending 26 contents
[2018-09-20 19:33:43,340: DEBUG/Worker-2] Done sending 26 contents
[2018-09-20 19:33:43,341: DEBUG/Worker-2] Sending 40 directories
[2018-09-20 19:33:43,368: DEBUG/Worker-2] Done sending 40 directories
[2018-09-20 19:33:43,369: DEBUG/Worker-2] Sending 10 revisions
[2018-09-20 19:33:43,389: DEBUG/Worker-2] Done sending 10 revisions
[2018-09-20 19:33:43,389: DEBUG/Worker-2] Processed 10 revisions: [..., 44d3cb050d23779b5de6b979c3d83fc2ace20123]
[2018-09-20 19:33:43,389: DEBUG/Worker-2] snapshot: {'branches': {b'master': {'target_type': 'revision', 'target': b'D\xd3\xcb\x05\r#w\x9b]\xe6\xb9y\xc3\xd8?\xc2\xac\xe2\x01#'}}, 'id': b'\xcd\xfek\xae\xf2\x97\x8d(\xcegF\xc47\x1a \x98"\x9dr\x8d'}
[2018-09-20 19:33:43,407: DEBUG/Worker-2] Updating origin_visit for origin 1 with status full
[2018-09-20 19:33:43,411: DEBUG/Worker-2] Done updating origin_visit for origin 1 with status full
[2018-09-20 19:33:43,416: DEBUG/Worker-2] Clean up temporary directory dump /home/antoine/tmp/swh.loader.svn.w_7thaci-1018 for project tmpm5be3jh0
[2018-09-20 19:33:43,435: INFO/MainProcess] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepositoryTsk[4f898276-3542-402f-a6ee-f1ff226bcdf4] succeeded in 1.7993088069997611s: {'status': 'eventful'}

It worked! Incremental subversion loading through partial dumps is feasible!
The resulting code can be found in D433.

Using rsvndump instead of svnrdump seems the best solution. The only drawback is that it is
not packaged in Debian. But building it from the source tarball is really easy, so that's not a blocker.

\m/
Thanks for the thorough description!
It's awesome.

I'll take a look at D432 tomorrow.

Cheers,

Hello,

thinking more about this, I see something missing in the description.

[2018-09-20 19:33:43,389: DEBUG/Worker-2] snapshot: {'branches': {b'master': {'target_type': 'revision', 'target': b'D\xd3\xcb\x05\r#w\x9b]\xe6\xb9y\xc3\xd8?\xc2\xac\xe2\x01#'}}, 'id': b'\xcd\xfek\xae\xf2\x97\x8d(\xcegF\xc47\x1a \x98"\x9dr\x8d'}

Have you checked that the last part results in the same snapshot as the actual svn loader?
That is, do the full loading with the actual svn loader (up to the 20 revisions), take the snapshot, and compare it with the quoted one.


Using rsvndump instead of svnrdump seems the best solution. The only drawback is that it is not packaged in Debian.

I found a debian wontfix bug report for an old version (0.5.2) [1].
Well, it was a long time ago (2009, a priori prior to svnrdump being included in svn, according to the bug report ;).
That's not a blocker though, just mentioning what I found.

I also found a possible github mirror repository [2] (holding a debian folder... ~> debian package files).
The sourceforge one does not hold any.
That might be useful to package it (I did not check further than opening a tarball downloaded from them).

Heading towards the diff now!

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=539374

[2] https://github.com/jgehring/rsvndump

Cheers,

Have you checked that the last part results in the same snapshot as the actual svn loader?
That is, do the full loading with the actual svn loader (up to the 20 revisions), take the snapshot, and compare it with the quoted one.

I just executed a one-pass client-based loading task with D432 applied. Below is the result: same hashes everywhere, so we are good!

[2018-09-21 11:50:20,018: INFO/MainProcess] Received task: swh.loader.svn.tasks.LoadSvnRepository[2d386d79-2d5c-4068-be86-3bc1a4080a7c]
[2018-09-21 11:50:20,026: DEBUG/Worker-2] Creating svn origin for file:///home/antoine/tmp/pyang-repo-test-cp
[2018-09-21 11:50:20,041: DEBUG/Worker-2] Done creating svn origin for file:///home/antoine/tmp/pyang-repo-test-cp
[2018-09-21 11:50:20,041: DEBUG/Worker-2] Creating origin_visit for origin 2 at time 2018-09-21 09:50:20.041854+00:00
[2018-09-21 11:50:20,048: DEBUG/Worker-2] Done Creating origin_visit for origin 2 at time 2018-09-21 09:50:20.041854+00:00
[2018-09-21 11:50:20,064: INFO/Worker-2] Processing revisions [1-20] for {'uuid': b'2ad08786-310e-4110-8bb1-5030b78c55b1', 'swh-origin': 2, 'local_url': b'/home/antoine/tmp/swh.loader.svn.rv_h3j93-2690/pyang-repo-test-cp', 'remote_url': 'file:///home/antoine/tmp/pyang-repo-test-cp'}
[2018-09-21 11:50:20,067: DEBUG/Worker-2] rev: 1, swhrev: 227ec29ee52db842e0554ccd5ae055bd50ee5f81, dir: 75ed58f260bfa4102d0e09657803511f5f0ab372
[2018-09-21 11:50:20,067: DEBUG/Worker-2] Checking hash computations on revision 1...
[2018-09-21 11:50:20,090: DEBUG/Worker-2] rev: 2, swhrev: 8ba8db4661940c44c61eb50671d0d5e790a693a9, dir: e393131425cbfb316701d83a4e0c8130c4bb91d1
[2018-09-21 11:50:20,090: DEBUG/Worker-2] Checking hash computations on revision 2...
[2018-09-21 11:50:20,112: DEBUG/Worker-2] rev: 3, swhrev: faf0579298fbd0d80caa36b37ea3349f4d62fda7, dir: f85772679d57e094c6cf2e55d872e0e91c1f22af
[2018-09-21 11:50:20,113: DEBUG/Worker-2] Checking hash computations on revision 3...
[2018-09-21 11:50:20,140: DEBUG/Worker-2] rev: 4, swhrev: de936f10d58297e4f6c700a67a9bc1c0126f41fe, dir: 20ce49b93ce92da5f4280ae9c0bd1a53ad7ca892
[2018-09-21 11:50:20,140: DEBUG/Worker-2] Checking hash computations on revision 4...
[2018-09-21 11:50:20,164: DEBUG/Worker-2] rev: 5, swhrev: 28dd784503f9f447ff296b41d54950dd1edf9610, dir: 04b96c66f27d63788b7b884ad3e3496d50946963
[2018-09-21 11:50:20,165: DEBUG/Worker-2] Checking hash computations on revision 5...
[2018-09-21 11:50:20,216: DEBUG/Worker-2] rev: 6, swhrev: 115cf96872d25e22c550cf8315f2e6628464bd00, dir: 1282ef91d4dd0217ed66b9b79b35544acdec40e6
[2018-09-21 11:50:20,217: DEBUG/Worker-2] Checking hash computations on revision 6...
[2018-09-21 11:50:20,269: DEBUG/Worker-2] rev: 7, swhrev: 9487fc27a8a561484fbc413d33e123549f0f6c22, dir: b24bb7392d3ae4cddfc38d44a2e564da9d22c87a
[2018-09-21 11:50:20,269: DEBUG/Worker-2] Checking hash computations on revision 7...
[2018-09-21 11:50:20,308: DEBUG/Worker-2] rev: 8, swhrev: 775ffdb79ea907ac7dcf86b3a7f3acad957cd14d, dir: 8f1a4abdb51379314dd7576d9dff278a3ca46e7f
[2018-09-21 11:50:20,308: DEBUG/Worker-2] Checking hash computations on revision 8...
[2018-09-21 11:50:20,342: DEBUG/Worker-2] rev: 9, swhrev: 840a47a47a618aa6146c3452e1db4a1997fef4f0, dir: 8572aa3c5922a9a3c70ddb7bfc7f8db8082bc4d5
[2018-09-21 11:50:20,342: DEBUG/Worker-2] Checking hash computations on revision 9...
[2018-09-21 11:50:20,380: DEBUG/Worker-2] rev: 10, swhrev: e7ab2783be5b60554b3a638cbaf96fa9ac74ea75, dir: da231a71a515ab96cce80e14b656ebcefc1f89cc
[2018-09-21 11:50:20,380: DEBUG/Worker-2] Checking hash computations on revision 10...
[2018-09-21 11:50:20,421: DEBUG/Worker-2] rev: 11, swhrev: 2f5adbfc07df85e93547c190a290d718c1aa541e, dir: ddbd7c962f529afcbab632b79272eb80592994cc
[2018-09-21 11:50:20,422: DEBUG/Worker-2] Checking hash computations on revision 11...
[2018-09-21 11:50:20,458: DEBUG/Worker-2] rev: 12, swhrev: 9eaadb41fb62b1c5d2fde9e0ec696c61602c8097, dir: 0fc5d1e825be8699a701fbed265ba4a8daa7cfaf
[2018-09-21 11:50:20,458: DEBUG/Worker-2] Checking hash computations on revision 12...
[2018-09-21 11:50:20,492: DEBUG/Worker-2] rev: 13, swhrev: 8bc90449382e41270c888dc5f0600381eb38dc31, dir: a0efae7d0533574de1fdd33b6b718073dcaed643
[2018-09-21 11:50:20,493: DEBUG/Worker-2] Checking hash computations on revision 13...
[2018-09-21 11:50:20,528: DEBUG/Worker-2] rev: 14, swhrev: d8fa2e0c35afa202f753325b9290f867dca3bfcb, dir: 9fd979af6ccba32a6f51d96b80d3edff0d4c1b8e
[2018-09-21 11:50:20,528: DEBUG/Worker-2] Checking hash computations on revision 14...
[2018-09-21 11:50:20,568: DEBUG/Worker-2] rev: 15, swhrev: cb0a7c987e432ecf56fbc18fa2761692c8badffe, dir: 183f8ecad659fc699c28bc81fbb7abf3a72f89de
[2018-09-21 11:50:20,568: DEBUG/Worker-2] Checking hash computations on revision 15...
[2018-09-21 11:50:20,646: DEBUG/Worker-2] rev: 16, swhrev: a1b3cf2aebc70ebccc6393b1ab7caa5034a0a138, dir: 98f79f2df6d3dcce9cfa547eb2bc57d77fbd718b
[2018-09-21 11:50:20,647: DEBUG/Worker-2] Checking hash computations on revision 16...
[2018-09-21 11:50:20,708: DEBUG/Worker-2] rev: 17, swhrev: 0d387d54db879d6dd1c9b23ebd8dca6c4790ab0b, dir: 7084f428baaf22c5334bd768b8d7902d7fd2544d
[2018-09-21 11:50:20,709: DEBUG/Worker-2] Checking hash computations on revision 17...
[2018-09-21 11:50:20,770: DEBUG/Worker-2] rev: 18, swhrev: b5a79d613e688928cc1c692be65941c838c15d08, dir: 39b642d852585bd042446877200be5a12b31abfc
[2018-09-21 11:50:20,770: DEBUG/Worker-2] Checking hash computations on revision 18...
[2018-09-21 11:50:20,834: DEBUG/Worker-2] rev: 19, swhrev: 4b7d6f572127fe226c2a9648912aa2922b37d39a, dir: 6c03d6bb437df7cab15f5fefc6222e12ad6f2255
[2018-09-21 11:50:20,834: DEBUG/Worker-2] Checking hash computations on revision 19...
[2018-09-21 11:50:20,930: DEBUG/Worker-2] rev: 20, swhrev: 44d3cb050d23779b5de6b979c3d83fc2ace20123, dir: 4c2c91a2de93aa8300d2d05e59305e0c5b2c7705
[2018-09-21 11:50:20,930: DEBUG/Worker-2] Checking hash computations on revision 20...
[2018-09-21 11:50:21,022: DEBUG/Worker-2] snapshot: {'id': b'\xcd\xfek\xae\xf2\x97\x8d(\xcegF\xc47\x1a \x98"\x9dr\x8d', 'branches': {b'master': {'target': b'D\xd3\xcb\x05\r#w\x9b]\xe6\xb9y\xc3\xd8?\xc2\xac\xe2\x01#', 'target_type': 'revision'}}}
[2018-09-21 11:50:21,034: DEBUG/Worker-2] Updating origin_visit for origin 2 with status full
[2018-09-21 11:50:21,041: DEBUG/Worker-2] Done updating origin_visit for origin 2 with status full
[2018-09-21 11:50:21,062: INFO/MainProcess] Task swh.loader.svn.tasks.LoadSvnRepository[2d386d79-2d5c-4068-be86-3bc1a4080a7c] succeeded in 1.042523018004431s: {'status': 'eventful'}
anlambert raised the priority of this task from Wishlist to Normal. Sep 21 2018, 1:56 PM

Hum, it seems there exist some subtle corner cases where incremental loading will fail...
This is what I got, for instance, when playing with the Apache Subversion repository by
loading it incrementally (killing rsvndump randomly in order to load what we had dumped so far).

[2018-09-21 17:54:28,618: INFO/MainProcess] Received task: swh.loader.svn.tasks.DumpMountAndLoadSvnRepository[d8a609f8-590a-48c4-a9b3-586f6ae0161a]
[2018-09-21 17:54:28,627: DEBUG/Worker-3] Creating svn origin for https://svn.apache.org/repos/asf/
[2018-09-21 17:54:28,638: DEBUG/Worker-3] Done creating svn origin for https://svn.apache.org/repos/asf/
[2018-09-21 17:54:28,638: DEBUG/Worker-3] Creating origin_visit for origin 2 at time 2018-09-21 15:54:28.638882+00:00
[2018-09-21 17:54:28,645: DEBUG/Worker-3] Done Creating origin_visit for origin 2 at time 2018-09-21 15:54:28.638882+00:00
[2018-09-21 17:54:28,670: DEBUG/Worker-3] Executing rsvndump -r 1292:HEAD --keep-revnums https://svn.apache.org/repos/asf/
[2018-09-21 17:55:37,020: DEBUG/Worker-3] rsvndump did not dump all expected revisions but revisions range 1293:1353 are available in the generated dump file and will be loaded
[2018-09-21 17:55:37,021: DEBUG/Worker-3] Truncating dump file after the last successfully dumped revision (1353) to avoid the loading of corrupted data
svnadmin: E160013: File not found: transaction '1291-zv', path '/incubator/directory/ldap/trunk/ldapd-installer'
[2018-09-21 17:56:24,996: ERROR/Worker-3] Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/home/antoine/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 886, in load
    self.prepare(*args, **kwargs)
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 706, in prepare
    root_dir=self.temp_dir)
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/utils.py", line 80, in init_svn_repo_from_dump
    raise e
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/utils.py", line 76, in init_svn_repo_from_dump
    project_name)
ValueError: Failed to mount the svn dump for project tmpumc63hfo
[2018-09-21 17:56:25,004: DEBUG/Worker-3] Updating origin_visit for origin 2 with status partial
[2018-09-21 17:56:25,010: DEBUG/Worker-3] Done updating origin_visit for origin 2 with status partial
[2018-09-21 17:56:25,035: INFO/MainProcess] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepository[d8a609f8-590a-48c4-a9b3-586f6ae0161a] succeeded in 116.41378015601367s:

The issue comes from the fact that the first dumped revision (1292) is this one: https://svn.apache.org/viewvc?view=revision&revision=1292
A new directory is added as a copy of another one in the repository, and the source of the copy is then deleted.

Looking at the dump file, the new files are added and the delete command is issued, but as the source directory never
ended up in the reconstructed file system, svnadmin load fails.

This looks like a bug in rsvndump to me. A quick search in its source code (https://github.com/jgehring/rsvndump/blob/master/src/delta.c#L1321-L1403) shows that in some cases delete commands are simply ignored and do not end up in the dump file.
The case I stumbled upon is not handled, but I think it should be.

To be sure, I quickly patched the rsvndump source code and the issue went away, so my assumption seems right.

$ git diff
diff --git a/src/delta.c b/src/delta.c
index 2b3ef43..28ddc64 100644
--- a/src/delta.c
+++ b/src/delta.c
@@ -1382,6 +1382,14 @@ static svn_error_t *de_close_edit(void *edit_baton, apr_pool_t *pool)
                                parent = svn_path_dirname(parent, pool);
                        }
 
+                       /*
+                        * Ensure not to dump the deleted node if the pointed path does not
+                        * exist in the repository
+                        */
+                       if (!path_repo_exists(de_baton->path_repo, path, de_baton->local_revnum, pool)) {
+                               skip = 1;
+                       }
+
                        if (!skip) {
                                de_node_baton_t *node = delta_create_node_no_parent(path, de_baton, de_baton->revision_pool);
                                node->kind = svn_node_none; /* Does not matter for deleted nodes */

Anyway, more real-world tests are needed to assess the robustness of the incremental approach.

Hum, it seems there exist some subtle corner cases where incremental loading will fail ...
...
To be sure, I quickly patched the rsvndump source code and the issue went away, so my assumption seems right.
....

Nice catch.

At worst, to begin with, that patch can become part of the debian package (while waiting for this to eventually be patched upstream, after determining where upstream is ;).

Anyway, more real-world tests are needed to assess the robustness of the incremental approach.

Quite.

Cheers,

FWIW, this is my main worry about this approach.

rsvndump saw its last release in 2012 and its last commit in 2013. It has
been refused for packaging in Debian due to its overlap with the
(almost, but not quite) equivalent functionality in SVN upstream.

I'm not particularly confident rsvndump is complete enough for our
needs, and it is quite likely we will be on our own if fixes or
additional features are needed to handle corner cases.

The current upstream seems to be here:
https://github.com/jgehring/rsvndump

Contacting him to check on the state of the project before going
further would be a good idea.

FWIW, this is my main worry about this approach.

Yes, that's not something I like to rely upon either.
I'm trying to be my usual optimistic self.

rsvndump saw its last release in 2012 and its last commit in 2013.

Yes, I generally agree.

But reflecting on this, that reasoning (lack of maintenance) is not something we can rely upon alone.
There must be a trade-off somewhere.

As swh's goal is to ingest everything, including old technologies, we will run into this case often (if not always).
In the end, everyone moves on to supposedly better technology (the latest mainstream one being git).

The speed ratio that anlambert reports seems very interesting!
Recall that I iterated over at least 3 implementations to get the current one right (speed included).
We cannot make it faster now (from a hash computation point of view).

The only play here is what anlambert tries: reducing the network latency (the client-server ping-pong) by dumping first.

I'm not particularly confident rsvndump is complete enough for our
needs, and it is quite likely we will be on our own if fixes or
additional features are needed to handle corner cases.

Well, yes.
Indeed, that's the risk.

Also, note that @anlambert might have been a tad violent in his testing (not that I would have known how to test better what he wanted to ;).
Quoting him: killing rsvndump randomly in order to load what we dumped so far.
I don't know if that particular use case is something we will really encounter (because I don't know how transactional all the tools we use are: svnadmin, rsvndump).

Like he said, we need more runtime tests.

It's also my understanding that the failing case concerns an optimization (which could become optional if we so choose).
His diff adds 2 things:

  1. using rsvndump to actually try to read the remote svn repository as a dump first
  2. if 1. partially succeeds, splitting the dump up to where it succeeded and trying to load that

And I think here it's 2. that fails.
So at worst, we could remove that part first (instead of dropping everything altogether ;)
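
For reference, a minimal sketch of that truncation step in 2., assuming "Revision-number: N" headers mark revision boundaries in the dump stream (a robust version would need to honor Content-length records to avoid matching file contents):

def truncate_dump_file(dump_path, last_good_rev):
    # Cut the dump right before the first revision that was not fully
    # dumped, so that svnadmin load only sees complete revisions.
    marker = b'Revision-number: %d\n' % (last_good_rev + 1)
    with open(dump_path, 'rb') as f:
        data = f.read()
    pos = data.find(marker)
    if pos != -1:
        with open(dump_path, 'r+b') as f:
            f.truncate(pos)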

The current upstream seems to be here: https://github.com/jgehring/rsvndump

Yes, it seems so, but I did not check further than that.

Contacting him to check about the state of the project before going
further would be a good idea.

Undeniably right you are :)

Cheers,

Based on my last tests, I was too confident that svnadmin would be able to load a dump containing an arbitrary revision range
(whether generated by svnrdump or rsvndump). So let's put that incremental dump idea on hold for the moment, as it needs
more investigation.

Anyway, some parts of D433 can still be landed in the subversion loader.
So I will do the following:

  • Remove the incremental dump approach, creating a full dump at each visit instead
  • Use svnrdump instead of rsvndump, as the latter was only needed for the incremental approach
  • Back up the work done so far with rsvndump in a dedicated branch