
Test - Ingest XXL svn repository
Closed, Migrated

Description

Some svn repositories are just huge in terms of svn revisions.
For example, http://svn.apache.org/repos/asf contains around 1.75M revisions.

Even svn's own tools break on such cases (svnsync must be iteratively called to continue).

Make sure our svn loader succeeds in ingesting such a repository.

Event Timeline

Tryout information:

  • repo: asf via http connection (as mentioned in the description)
  • machine: worker01, with a local remote storage plugged into the softwareheritage-test-svn db.

1st tryout: using svn connection over http.

The worker did ~40k revisions.
A bad http connection then occurred, preventing the job from finishing.
Error message was: Unexpected HTTP status 400 'Bad Request' on '/repos/asf'\n

Jun 25 09:50:00 worker01 python3[21680]: [2016-06-25 09:50:00,966: DEBUG/Worker-10] rev: 39991, swhrev: f2bbc8e3268f25697ed0a3a350924b882edf61fe, dir: ed42eef1930e46be16b7306c1c5fa426d2234f0d
Jun 25 09:50:52 worker01 python3[21680]: [2016-06-25 09:50:52,395: ERROR/MainProcess] Task swh.loader.svn.tasks.LoadSvnRepositoryTsk[62db3115-41b5-486e-b1e1-dbaa73c4aefe] raised unexpected: SubversionException("Unexpected HTTP status 400 '
Bad Request' on '/repos/asf'\n", 175002)

As a consequence, the code was adapted with a retry policy (a basic retry of 3) on the sensitive part that broke.
This was packaged and redeployed on worker01.
The same job was then triggered again.
Since the first run did not finish (no new occurrences were created), the worker restarted from scratch (except for the origin, which stays the same).

What I mean by restart is:

  • start from revision 1 and hash each svn revision tree up to the svn repository's HEAD revision
  • send any missing data to storage

In effect, the first 40k revisions (and their contents/directories) won't be sent again since they are already stored.
Still, we lose the time spent hashing those 40k revisions.
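
For reference, a minimal sketch of what such a retry policy looks like conceptually (not the actual swh-loader-svn code; replay_revision is a hypothetical stand-in for the call that talks to the remote svn server, and subvertpy is assumed from the logged exception):

import time

from subvertpy import SubversionException  # the bindings behind the logged error


def with_retry(action, max_retries=3, backoff=1):
    """Run action(), retrying up to max_retries times on transient
    network or server errors before giving up for good."""
    for attempt in range(1, max_retries + 1):
        try:
            return action()
        except (SubversionException, ConnectionResetError):
            if attempt == max_retries:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff

# usage: rev_data = with_retry(lambda: replay_revision(conn, rev))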

2nd tryout

This time, the worker got further but still failed, at around ~94k revisions.
Error message: ConnectionResetError: [Errno 104] Error running context: Connection reset by peer

Jun 28 02:41:17 worker01 python3[2087]: [2016-06-28 02:41:17,433: DEBUG/Worker-10] rev: 93918, swhrev: af31126f9821cb96261771fb0009bfe3b9827360, dir: 8b056fadd6be62827758445264853aaf8cbc0d26
Jun 28 02:41:29 worker01 python3[2087]: [2016-06-28 02:41:29,870: DEBUG/Worker-10] Sending 370 contents
Jun 28 02:41:45 worker01 python3[2087]: [2016-06-28 02:41:45,123: DEBUG/Worker-10] Done sending 370 contents
Jun 28 02:41:45 worker01 python3[2087]: [2016-06-28 02:41:45,140: DEBUG/Worker-10] Sending 536 directories
Jun 28 02:41:45 worker01 python3[2087]: [2016-06-28 02:41:45,815: DEBUG/Worker-10] Done sending 536 directories
Jun 28 02:42:43 worker01 python3[2087]: [2016-06-28 02:42:43,955: ERROR/MainProcess] Task swh.loader.svn.tasks.LoadSvnRepositoryTsk[b53e86c7-79b2-4e3c-894f-54919af901f3] raised unexpected: ConnectionResetError(104, 'Error running context:
Connection reset by peer')

At the moment, an svn update is only possible after a full pass finishes.
That is, when done, the loader creates an occurrence which targets a revision.
When a known svn repository is triggered again, we retrieve that targeted revision, which holds the associated svn revision number.
We then start from that one (provided the history has not been altered).
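
In rough terms, that resume logic amounts to this sketch (all names made up; occurrence_get, revision_get and the metadata layout stand in for the actual swh storage API):

def resume_point(storage, origin_id, branch='master'):
    """Return the svn revision to start hashing from for a known origin:
    1 if no complete pass exists yet, otherwise the revision right after
    the one targeted by the stored occurrence. Illustrative names only."""
    occurrence = storage.occurrence_get(origin_id, branch)
    if occurrence is None:
        return 1  # no finished pass yet: start from scratch
    revision = storage.revision_get(occurrence['target'])
    # the swh revision records which svn revision it was built from
    svn_rev = revision['metadata']['extra_headers']['svn_revision']
    return svn_rev + 1  # resume right after it, if history is unaltered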

So, for very large repositories, this tactic won't be sufficient, since we'll still need to hash everything again even though the data is already in storage.

Possible improvements:

  • Add a cache of already seen revisions for that origin, independent of occurrences (see the sketch after this list)
  • Improve the loader to reschedule the task with the last seen revision for that origin
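
The first option could be as simple as a persistent per-origin record of already hashed svn revisions, checked before any hashing work (purely illustrative; nothing like this is implemented):

class SeenRevisionCache:
    """Per-origin record of svn revisions already hashed and stored, so a
    restarted load can skip them without re-hashing. A real implementation
    would persist this (e.g. in a db table) to survive worker restarts."""

    def __init__(self):
        self._seen = {}  # origin_id -> set of svn revision numbers

    def add(self, origin_id, svn_rev):
        self._seen.setdefault(origin_id, set()).add(svn_rev)

    def contains(self, origin_id, svn_rev):
        return svn_rev in self._seen.get(origin_id, set())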

ardumont renamed this task from Test - Ingest huge repository to Test - Ingest XXL svn repository. Jun 28 2016, 11:52 AM
ardumont updated the task description.

Updates on this are at https://sympa.inria.fr/sympa/arc/swh-devel/2016-06/msg00011.html

This is the implementation of the 3rd solution from that email; it is tested and deployed on the current worker01.

The repositories impacted are:

  • swh-scheduler (in a branch, tagged, debian packaged, uploaded to pergamon)
  • swh-loader-svn (in a branch, tagged, debian packaged, uploaded to pergamon)

The same debian packaging as usual has been used. The only difference is that the git tags are on the respective branches.

Infra details:

  • swh-scheduler and swh-loader-svn deployed
  • db: softwareheritage-test-svn for the storage
  • db: ardumont-swh-scheduler for the scheduling part.

As explained in the email, a producer produces the asf svn repository url to load:

  • this will result in a task to load that repository
  • if it fails along the way, the task reschedules a one-shot task with the necessary information (last known swh revision, svn revision) and is then stopped and considered done (see the sketch after this list)
  • (for now) a cron job runs regularly and triggers a script in charge of loading one-shot tasks from the scheduler
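
The rescheduling step boils down to something like this (sketch only; create_oneshot_task and the argument names are stand-ins for the actual swh-scheduler API):

def reschedule_on_failure(scheduler, svn_url, last_swh_revision, last_svn_revision):
    """Record enough state in a one-shot task for the next run to resume
    the load where this one stopped. Names are illustrative, not the real API."""
    scheduler.create_oneshot_task(
        task_type='load-svn',
        arguments={
            'svn_url': svn_url,
            'last_known_swh_revision': last_swh_revision,  # swh revision id
            'start_from': last_svn_revision + 1,           # svn revision number
        },
    )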

Cheers,

> Even svn's own tools break on such cases (svnsync must be iteratively called to continue).

A recent email discussion with Greg Stein, a former member of the googlecode team, revealed that some defenses exist on the server side of the asf repositories.

This explains why I had issues when trying to mirror the svn repositories for test purposes.

And good news: incremental dumps are available.
Those will help lift the veil on whether or not T570 is fixed.

They will also help to ingest, and possibly stay up to date with, that repository further on.

Those dumps are currently being retrieved in uffizi:/srv/storage/space/mirrors/asf.

Well, trying to set up a mirror of the repository on the side seems to be a task of its own:

$ svnadmin create asf-mirror
$ 7z x -so svn-asf-public-r0:1164363.7z | svnadmin load ./asf-mirror
...
------- Committed revision 923 >>>

<<< Started new transaction, based on original revision 924
     * editing path : incubator/directory/ldap/trunk/sandbox0/.cvsignore ... done.

------- Committed revision 924 >>>

<<< Started new transaction, based on original revision 925
     * editing path : incubator/directory/ldap/trunk/sandbox0/newbackend/src/java/ldapd/server/jndi/InterceptorPipeline.java ... done.
svnadmin: E125005: Invalid property value found in dumpstream; consider repairing the source or using --bypass-prop-validation while loading.
svnadmin: E125005: Cannot accept non-LF line endings in 'svn:log' property
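
As the error message suggests, the load can presumably be resumed with --bypass-prop-validation. Driven from Python, that would look roughly like this (untested sketch, reusing the archive and repository paths from the transcript above):

import subprocess

# pipe the 7z extraction straight into svnadmin load, this time bypassing
# property validation as svnadmin itself recommends for this dump
extract = subprocess.Popen(
    ['7z', 'x', '-so', 'svn-asf-public-r0:1164363.7z'],
    stdout=subprocess.PIPE,
)
subprocess.run(
    ['svnadmin', 'load', '--bypass-prop-validation', './asf-mirror'],
    stdin=extract.stdout,
)
extract.stdout.close()
extract.wait()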

Opening T923.

@anlambert what's the status of ingesting very large SVN repos, now that we have put the loader in production?

Related to T611 (the asf repo is a monorepo and contains svn:externals properties)
Related to T3839 (we are dealing with large repositories here)

@anlambert ^ with your current awesome work, that may actually converge at some point ;)