Mount the asf svn repository mirror
Closed, Migrated

Description

As per title, mount that svn repository.

Dumps have been retrieved from http://svn-dump.apache.org/ and integrity-checked (OK).
They are stored at uffizi:/srv/storage/space/mirrors/asf

A basic first attempt already failed.

Event Timeline

Repeating my initial comment.

$ svnadmin create asf-mirror
$ 7z x -so svn-asf-public-r0:1164363.7z | svnadmin load ./asf-mirror
...
------- Committed revision 923 >>>

<<< Started new transaction, based on original revision 924
     * editing path : incubator/directory/ldap/trunk/sandbox0/.cvsignore ... done.

------- Committed revision 924 >>>

<<< Started new transaction, based on original revision 925
     * editing path : incubator/directory/ldap/trunk/sandbox0/newbackend/src/java/ldapd/server/jndi/InterceptorPipeline.java ... done.
svnadmin: E125005: Invalid property value found in dumpstream; consider repairing the source or using --bypass-prop-validation while loading.
svnadmin: E125005: Cannot accept non-LF line endings in 'svn:log' property

Repairing the source seems a no-go, as it would alter the original log message (at least for that revision).
That alone would be enough to change the revision hashes and mess up the history.

The --bypass-prop-validation suggestion is currently being tested, and so far so good (more than 700k revisions loaded).
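
For reference, the load with property validation bypassed could be scripted roughly as below. This is a minimal sketch, not the exact command that was run: the function names are hypothetical, while the `7z x -so` and `svnadmin load` invocations and the `--bypass-prop-validation` flag come from the transcript and error message above.

```python
import shlex
import subprocess


def build_load_pipeline(archive, repo):
    """Return two argv lists: 7z extraction to stdout, then svnadmin load."""
    extract = ["7z", "x", "-so", archive]
    # --bypass-prop-validation is the flag suggested by the E125005 error.
    load = ["svnadmin", "load", "--bypass-prop-validation", repo]
    return extract, load


def as_shell(archive, repo):
    """Render the pipeline as one shell line, for logging or review."""
    extract, load = build_load_pipeline(archive, repo)
    return " | ".join(shlex.join(cmd) for cmd in (extract, load))


def run_load(archive, repo):
    """Stream the extracted dump into svnadmin load without a temporary file."""
    extract, load = build_load_pipeline(archive, repo)
    with subprocess.Popen(extract, stdout=subprocess.PIPE) as producer:
        consumer = subprocess.run(load, stdin=producer.stdout)
    # Leaving the "with" block closes the pipe and waits for 7z to exit.
    return producer.returncode == 0 and consumer.returncode == 0
```

Calling `as_shell("svn-asf-public-r0:1164363.7z", "./asf-mirror")` gives the same pipeline as in the transcript, with the extra flag added to `svnadmin load`.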

ardumont changed the task status from Open to Work in Progress. Jan 12 2018, 2:33 PM

> The --bypass-prop-validation suggestion is currently being tested, and so far so good (more than 700k revisions loaded).

The first dump loaded without further issue.
Now trying the next dumps...

Update on this: as the dumps tested so far went well, I have automated the mounting of the remaining dumps.
It is currently running.

Note:
I have a point-in-time backup in case that fails.

During our latest exchange with our ASF contact (Greg Stein), I asked about history modification. Here is his answer:

> The ASF *never* rewrites svn revision history. Given the size of our repository, it would be prohibitive, even if we philosophically thought it was proper (and we don't! static!)

That's good.

> That said, we *do* allow log messages to be edited. Records of such changes are only on mailing lists. We have no structured history for this.

That's not good news.
From our viewpoint, they do modify their history: the log message is used in the revision hash computation.
So we will hit altered-history hiccups when loading the ASF mirror incrementally.

That also means the same can happen with any other live repository.
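
To make the concern concrete, here is a toy illustration of why an edited log message forks the history. It is only a sketch: `revision_id` and `chain_ids` are hypothetical helpers, and the hashed payload is a simplification, not the actual Software Heritage revision format. It assumes only that the log message and the parent revision both feed into the hash.

```python
import hashlib


def revision_id(parent_id, svn_rev, log_message):
    """Toy identifier: SHA-1 over the parent id, svn revision and log message."""
    payload = f"{parent_id}\n{svn_rev}\n{log_message}".encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def chain_ids(log_messages):
    """Compute the identifiers of a linear history, one per log message."""
    ids = []
    parent = ""  # the first revision has no parent
    for svn_rev, message in enumerate(log_messages, start=1):
        parent = revision_id(parent, svn_rev, message)
        ids.append(parent)
    return ids


original = chain_ids(["import", "fix typo", "release"])
edited = chain_ids(["import", "fix typo (log edited later)", "release"])

# Revisions before the edited one keep their identifier; the edited
# revision and every descendant get a new one.
unchanged = [a == b for a, b in zip(original, edited)]
```

Here `unchanged` is `[True, False, False]`: the edit to the second log message changes the identifiers of the second and third revisions, even though the third revision itself was not touched.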

I don't see anything new here. Subversion offers no integrity guarantees; it applies to the ASF repos like it applies to any other SVN repo out there. We need to decide on a policy about when (if at all) to redo full ingestions of Subversion repos (which would allow re-injecting modified objects, at the cost of forking the resulting history on Software Heritage), or just say *shrug* and never re-ingest, in a non-incremental way, any Subversion repo we have previously ingested.

> I don't see anything new here.

I had it in mind, but it hit me much harder when I read it.
Also, explicit is better than implicit.

> Subversion offers no integrity guarantees,

Yes. That's somewhat bad.

> it applies to the ASF repos like it applies to any other SVN repo out there.

Sure.

> We need to decide on a policy about when (if at all) to redo full ingestions of Subversion repos (which would allow re-injecting modified objects, at the cost of forking the resulting history on Software Heritage)

Please, let's decide then...

I recently added a start_from_scratch flag (to permit rescheduling missing objects in the googlecode dumps). So, this can be leveraged.

> or just say *shrug* and never re-ingest, in a non-incremental way, any Subversion repo we have previously ingested.

That sounds rough, and at odds with the overall swh goal as I have come to understand it.

I prefer option 1.

We could also mix 1. and 2. depending on the repository's size (in terms of svn revisions).

There may also be a third option, as a trade-off for very large repositories: don't rely on the swh revision hash to move forward, but on the svn revision number (the swh revision records the svn revision).
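
The options discussed here could be combined along these lines. This is purely a sketch to frame the decision: `Mode`, `choose_mode` and the `small_repo_limit` threshold are hypothetical, no such policy has been decided, and the code does not reflect how the loader or its start_from_scratch flag is actually wired.

```python
from enum import Enum


class Mode(Enum):
    INCREMENTAL = "incremental"                        # normal case
    FROM_SCRATCH = "from_scratch"                      # option 1: full re-ingestion
    RESUME_BY_SVN_REVISION = "resume_by_svn_revision"  # third option


def choose_mode(revision_count, history_altered, small_repo_limit=100_000):
    """Pick an ingestion mode for a repository that was already ingested.

    small_repo_limit is an arbitrary placeholder, in number of svn revisions.
    """
    if not history_altered:
        return Mode.INCREMENTAL
    if revision_count <= small_repo_limit:
        # Small enough to reload entirely, accepting the forked history.
        return Mode.FROM_SCRATCH
    # Too large to reload: continue from the last known svn revision number.
    return Mode.RESUME_BY_SVN_REVISION
```

With these placeholder values, a repository of 2,000 revisions whose history was altered would be reloaded from scratch, while one with well over a million revisions, like the ASF one, would resume from its svn revision number.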

Status on this: everything was fine up until today.

All went smoothly up to revision 1722480.

After that, it fails. Will dig in later.

------- Committed revision 1722480 >>>

<<< Started new transaction, based on original revision 1727879
     * editing path : db/derby/site/trunk/src/documentation/content/xdocs/releases/release-10.10.1.1.html ... done.
     * editing path : db/derby/site/trunk/src/documentation/content/xdocs/releases/release-10.10.2.0.html ... done.
     * editing path : db/derby/site/trunk/src/documentation/content/xdocs/releases/release-10.8.1.2.html ... done.
     * editing path : db/derby/site/trunk/src/documentation/content/xdocs/releases/release-10.8.2.2.html ... done.
     * editing path : db/derby/site/trunk/src/documentation/content/xdocs/releases/release-10.8.3.0.html ... done.
     * editing path : db/derby/site/trunk/src/documentation/content/xdocs/releases/release-10.9.1.0.html ... done.

------- Committed new rev 1722481 (loaded from original rev 1727879) >>>

<<< Started new transaction, based on original revision 1727880
     * editing path : incubator/public/trunk/content/guides/retirement.xml ...svnadmin: E200014: Base checksum mismatch on '/incubator/public/trunk/content/guides/retirement.xml':
   expected:  86861356f2d6f824e026a1e28291cb0c
     actual:  5b45390df7a05d7fc1cecc3cd312b068

<<< Started new transaction, based on original revision 1732986
     * editing path : httpd/httpd/trunk/modules/proxy/proxy_util.c ...svnadmin: E200014: Base checksum mismatch on '/httpd/httpd/trunk/modules/proxy/proxy_util.c':
   expected:  4ad1d5775dfa2cc9a6191818c6f978aa
     actual:  4151f78e54a01982a48216b7e2e36dcc

Well, nothing too serious: a wrong, or possibly even missing, dump file!
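
The tell-tale sign in the output above is the "Committed new rev ... (loaded from original rev ...)" line, which appears once the loaded revision number no longer matches the original one. A small check along these lines could flag it early. It is a sketch only: `find_renumbered` is a hypothetical helper, and the pattern is derived solely from the line shown in the transcript above.

```python
import re

# Pattern taken from the svnadmin load output quoted above.
RENUMBERED = re.compile(
    r"Committed new rev (\d+) \(loaded from original rev (\d+)\)"
)


def find_renumbered(lines):
    """Return (new_rev, original_rev) pairs for every renumbered commit."""
    found = []
    for line in lines:
        match = RENUMBERED.search(line)
        if match:
            found.append((int(match.group(1)), int(match.group(2))))
    return found


log = [
    "------- Committed revision 1722480 >>>",
    "------- Committed new rev 1722481 (loaded from original rev 1727879) >>>",
]
renumbered = find_renumbered(log)
```

Any non-empty result means the local mirror has drifted from the original numbering, which here points to a gap between the dumps being loaded.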

Listing of the scheduled inputs:

...
1711711 svn-asf-public-r1711711:1717366.7z
1717367 svn-asf-public-r1717367:1722480.7z
<missing-the-right-dump>
1727879 svn-asf-public-r1727879:1732985.7z

Well, yeah, that file is missing.

> Well, yeah, that file is missing.

I meant missing from uffizi. Checking the source, it is also missing from the index page listing.
I need to notify our ASF contact, but I'll make sure there are no other holes first.

> I need to notify our ASF contact, but I'll make sure there are no other holes first.

No need. Everything that needs to be there is there.

The joke is on me about the dumps.
There are overlaps (and holes) in the dumps provided...

1693677 ['1700376', '1727878']
1700377 ['1706178'] <- redundant with 2nd dump (1693677:1727878)
1706179 ['1711710'] <- redundant with 2nd dump
1711711 ['1717366'] <- redundant with 2nd dump
1717367 ['1722480'] <- redundant with 2nd dump
<holes from 1722480 to 1727878 which must be present in the 2nd dump from 1693677 to 1727878 >
1727879 ['1732985']
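
A check like the following could catch such overlaps and holes before mounting anything. It is a sketch under assumptions: `parse_range` and `plan_load_order` are hypothetical helpers, the file naming scheme is inferred from the names in the earlier listing, and preferring the widest dump when two start at the same revision is just one possible choice.

```python
import re

# Assumed naming scheme, inferred from the dump file names listed earlier.
DUMP_NAME = re.compile(r"svn-asf-public-r(\d+):(\d+)\.7z$")


def parse_range(name):
    """Extract the (first, last) revision numbers from a dump file name."""
    match = DUMP_NAME.search(name)
    if match is None:
        raise ValueError(f"unexpected dump name: {name}")
    return int(match.group(1)), int(match.group(2))


def plan_load_order(names, first_revision=0):
    """Return (selected, redundant, holes) for a set of dump files."""
    # Sort by start revision; for equal starts, try the widest dump first.
    ranges = sorted(
        ((*parse_range(name), name) for name in names),
        key=lambda item: (item[0], -item[1]),
    )
    selected, redundant, holes = [], [], []
    next_needed = first_revision
    for start, end, name in ranges:
        if end < next_needed:
            redundant.append(name)  # fully covered by dumps already selected
            continue
        if start < next_needed:
            raise ValueError(f"{name} partially overlaps revisions already covered")
        if start > next_needed:
            holes.append((next_needed, start - 1))  # revisions nobody provides
        selected.append(name)
        next_needed = end + 1
    return selected, redundant, holes
```

Fed with the seven ranges of the listing and `first_revision=1693677`, this selects only the 1693677:1727878 and 1727879:1732985 dumps, reports the five others as redundant, and finds no hole. Without the wide dump, it reports the hole 1722481 to 1727878, which matches the failure seen earlier.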

Well, the dump is all messed up now, to put it politely...
Restoring to an earlier point in time.

> Well, the dump is all messed up now, to put it politely...

My sentence is incorrect.
I mean that the current local mirror I am mounting on uffizi is messed up.
The dumps are fine!

  • Restored to an earlier point in time.
  • Fixed the list of input dumps so that they mount in order.
  • Restarted the dump mounting routine.

ardumont claimed this task.

That probably won't have to happen anymore.
An svn monorepository needs to be ingested in a specific way.
A new task will be created for this.