Page MenuHomeSoftware Heritage

svn loader: CRLF/LF mess in svn history results in hash computations divergence
Closed, MigratedEdits Locked

Description

When ingesting the asf mirror, if some error occurs during the loading, the process stops.

That's expected and dealt with.
At the moment, this will create an occurrence targeting a revision which holds the svn revision and update the origin_visit to set its status accordingly.

When starting again the loading of the repository, as it is already known, it will fetch the last known revision (and its corresponding svn revision).
Then svn exports at that svn revision on the disk and then compute its hash tree and then the revision hash.
Compare that revision hash with the one already stored in the db.

If it matches, it continues to load from that point on.
If it mismatches, it logs an error message about altered history and stops.

This is what currently happens.

BUT: The asf mirror in question did not change and this also happens.
It's a bug.

This is symptomatic of a divergence in the directory hash tree computation (which cascades to the revision hash computation).
This is about investigating further this tree divergence.

Related Objects

StatusAssignedTask
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration

Event Timeline

ardumont renamed this task from Investigate svn bug about history altered to Investigate svn bug about altered history.Sep 23 2016, 10:25 AM
ardumont created this task.

It falls down to the revision 23242 which creates a diverging tree.
Files have a CRLF end of line whereas it should be LF (well according to svn export).

Standard svn or subvertpy export interprets it well.
That is, it's transformed to LF through the svn:eol-style property [1].

The loader svn (well subvertpy's remote agent) does nothing about it.
Thus the divergence.

Now, investigating some more since there is something that resonates wrong somewhere but i don't quite pinpoint it yet.

[1] http://svnbook.red-bean.com/en/1.7/svn.advanced.props.file-portability.html

Taking a look at why the export on subvertpy worked, it's because, the 'native_eol' flag is NULL by default in subvertpy [1].
This then goes on and call the svn lib api binding [2].
That svn lib api binding, for that null value for that flag, falls back to using the 'standard eol marker'. Which is LF on Linux platforms.
So ok for why subvertpy exports works out of the box like svn.

[1] https://github.com/jelmer/subvertpy/blob/master/subvertpy/client.c#L813
[2] https://subversion.apache.org/docs/api/1.7/group__Export.html#ga012408fea426f8bc283890fdad6f0b1c

ardumont updated the task description. (Show Details)

I am unable to reproduce this behavior in integration tests...
Svn-loader's initial code computes correctly the hashes (I have exported each revision and computed separately the hashes independently to be certain of this fact).

When:

  • adding crlf file in the svn repository and then 'exporting' (svn, subvertpy) that file, it does contain the crlf eol as expected.
  • adding crlf file with svn:eol-style at CRLF and then exporting that file, it does contains the crlf as expected.
  • adding crlf file svn:eol-style either with native (on debian) or LF, it is converted at commit time. Exporting that file, the file contains LF as expected (again, because we are Linux based).
  • The same goes exactly the same for CR eol files...

When discussing with jelmer on irc (subvertpy's author). he suggested to check if the file was considered a binary.
It is not (checking if the svn:mime-type property was set but it was not).

So far, what i've been able to gather is that the diff contains CRLF (or CR) characters.
And that's correctly dealt with checkout or export operations.
Not with subvertpy's replay mechanism (which uses diff).

The road so far...

As the property svn:eol-style is set to 'native' and we are running this on debian, the CRLF should be transformed on LF without nothing to do (only applying the patch as usual).
According to documentation, svn's checkout or export operations, and some integration tests design to reproduce the issue (which don't) confirm it.

I have not yet found the reason why those *not-so-nice-adjectives* CRLF are there yet.

I have tried to work on some workarounds. Up until yesterday, I thought i had one that worked.
Unfortunately, it's not working since creating divergence in the tree to fix the CRLF causes some future diffs to fail...

The replay strategy i use from subvertpy to have performance is not using a svn working copy representation, so the diffs are static and not dynamic. Meaning, the diffs you are receiving stay the same, whatever you do with prior diffs.

Note that for the last part, come to think of it, i had already encountered the issue of future diff failures.
It's with the symbolic links use case. I overcame it by swapping representation on a needed basis (real symbol link for hash computation, regular file when new diff against it is needed). I could have not touched that disk representation and adapted the hash computation function (that was the initial implementation btw) but that created analysis issue down the line (regarding exterior hash computations).

This could hint at something to improve the actual workaround.
I don't like where this is going though.
I'd rather found the root cause of the diff but so far, it's a *fail*.

...
This could hint at something to improve the actual workaround.

After multipe tryouts iterating over that workaround implementation, this does not work.
(I kept the commit and its reverted commit voluntarily).

Indeed, the actual implementation rely on a replay which sends static patchs.
The patchs are static because whatever your local representation, the next patchs remains the same.
That means that we cannot diverge on the local representation (change files locally).
Otherwise, the future patchs sent for the file (or whatnot) won't work.
If we were to change the local representation, we'd thus need to store those local change to revert them back before applying the new patch and then apply again those modification to compute the correct hash... This will soon become a hassle on large project. And I'm also no longer sure that we'd be winning against the first implementation of the loader-svn (based on checkout/export on a working copy which were hellishly slow).

Note that it does work for the symlinks because the symlink representation is a simple one.
Also, the loader-svn already do keep the symlink state (by transforming it back and forth between its unix representation for hash computation and its svn one for applying patch).

This is not solved yet.

A run on a large set of googlecode svn dump (~50k) reveals for now that it is a marginal case (only 2 cases amongst those).
The code has been adapted to check on a certain frequency (per configuration basis) if some revision hash divergence occurred.
If it is detected, the error is logged and the loading is stopped.

This is a trade-off as no solution were found for now.

This comment was removed by ardumont.

tl;dr;

So far, other dumps with the error. All of those errors have the same reason, a mix of CRLF lines terminator where export does not expect that:

  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/p/pyang/pyang-repo.svndump.gz (hash divergence detected at revision: 343)
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/m/mediawiki/mediawiki-repo.svndump.gz (revision: 337)
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/l/libaprilui/libaprilui-repo.svndump.gz (revision: 407)
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/m/miku/miku-repo.svndump.gz (revision: 951)

details

The following are merely samples of the issue for memory purposes.
The first file of the following commands is the result of exporting the svn repository to a given svn revision (svn export -q --revision 343 --ignore-keywords swh-pyang-repo-343)
The second file tested is the result of the incremental application of the diffs with the remote access pattern (done within the loader-svn).

$ file diff -r swh-pyang-repo-343/trunk/xslt/iso_svrl_for_xslt1.xsl /tmp/swh.loader.svn.b4lw992n.tmp/pyang-repo/trunk/xslt/iso_svrl_for_xslt1.xsl
swh-pyang-repo-343/trunk/xslt/iso_svrl_for_xslt1.xsl:                          XML 1.0 document, ASCII text
/tmp/swh.loader.svn.b4lw992n.tmp/pyang-repo/trunk/xslt/iso_svrl_for_xslt1.xsl: XML 1.0 document, ASCII text, with CRLF, LF line terminators

$ file libaprilui-repo-337/branches/refactor_2/src/aprilui.cpp /tmp/swh.loader.svn.ykabphgm.tmp/libaprilui-repo/branches/refactor_2/src/aprilui.cpp
libaprilui-repo-337/branches/refactor_2/src/aprilui.cpp:                              C++ source, ASCII text
/tmp/swh.loader.svn.ykabphgm.tmp/libaprilui-repo/branches/refactor_2/src/aprilui.cpp: C++ source, ASCII text, with CRLF line terminators

$ file mediawiki-repo-407/trunk/extensions/ImageLink/ImageLink.php /tmp/swh.loader.svn.4bsewexf.tmp/mediawiki-repo/trunk/extensions/ImageLink/ImageLink.php
mediawiki-repo-407/trunk/extensions/ImageLink/ImageLink.php:                              PHP script, ASCII text
/tmp/swh.loader.svn.4bsewexf.tmp/mediawiki-repo/trunk/extensions/ImageLink/ImageLink.php: PHP script, ASCII text, with CRLF line terminators

$ file miku-repo-951/trunk/libs/Qwin/IsEqual.php /tmp/swh.loader.svn.dq1hfnvs.tmp/miku-repo/trunk/libs/Qwin/IsEqual.php
miku-repo-951/trunk/libs/Qwin/IsEqual.php:                              PHP script, ASCII text
/tmp/swh.loader.svn.dq1hfnvs.tmp/miku-repo/trunk/libs/Qwin/IsEqual.php: PHP script, ASCII text, with CRLF line terminators

Ok digging further, all repositories are corrupted, or at least not consistent.
And this inconsistency is unfortunately permitted in the svn toolchain (at least, it was at some point).

Some things are done in svn's side to enforce that property but it's not bullet proof.
And so, instead of having the LF as expected, we have the CRLF/LF mix mess.

Detail for 1 repository, but i checked the other ones, the issue is indeed the same (for asf it's still running but so far it did not break):

pyang-repo:
change-prop: svn:eol-style native b'/tmp/swh.loader.svn.qqjt8rdf.tmp/pyang-repo/trunk/xslt/iso_abstract_expand.xsl'
change-prop: svn:eol-style native b'/tmp/swh.loader.svn.qqjt8rdf.tmp/pyang-repo/trunk/xslt/iso_schematron_skeleton_for_xslt1.xsl'
change-prop: svn:eol-style native b'/tmp/swh.loader.svn.qqjt8rdf.tmp/pyang-repo/trunk/xslt/iso_svrl_for_xslt1.xsl'
change-prop: svn:eol-style native b'/tmp/swh.loader.svn.qqjt8rdf.tmp/pyang-repo/trunk/xslt/svrl2text.xsl'

libaprilui-repo:
change-prop: svn:eol-style native b'/tmp/swh.loader.svn.h2daal9e.tmp/libaprilui-repo/trunk/include/aprilui/aprilui.h'
change-prop: svn:eol-style native b'/tmp/swh.loader.svn.h2daal9e.tmp/libaprilui-repo/trunk/include/aprilui/apriluiExport.h'
change-prop: svn:eol-style native b'/tmp/swh.loader.svn.h2daal9e.tmp/libaprilui-repo/trunk/src/aprilui.cpp'

mediawiki-repo:
change-prop: svn:eol-style native b'/tmp/swh.loader.svn.nhf7ttn4.tmp/mediawiki-repo/tags/ImageLink/imagelink-1.0.0/ImageLink/ImageLink.php'

miku-repo:
change-prop: svn:eol-style LF b'/tmp/swh.loader.svn.wmrurwq_.tmp/miku-repo/trunk/libs/Qwin/IsEqual.php'

Note: Same revision, same repository, same failure. The temporary folders mentioned are different since it's another run for dumping more information.

I don't have easy workarounds/fixes for such inconsistency.

As i explained earlier in this thread here and here, we cannot just fix the files ourselves and move on. Because, if we diverge in some ways, we will fail later at some revision touching that file again, when trying to apply the new patch (as the file to apply it too won't be patch compliant anymore).

If we still want to go that way, we'll need to keep that state somewhere somehow (that's the not easy part).

Possible solutions:

  • Fail: I can however check for this (if svn:eol-type set to native or LF, and CRLF are still detected after applying the patch) then fail with a corruption/inconsistency message.

That will only permit to detect this though. That could be a problem for live svn repositories to follow.

  • not so easy way (and absolutely no guarantee it can work):

Like in the previous solution, if we detect the situation, instead of failing, we fix the files ourselves (be it manual fix, or try to see if svn/subvertpy can checkouts/exports a file representation directly).
We keep some form of index keeping that file reference and its 'faulty' representation somewhere.
When a new patch must be applied to that file, we check the index and revert back to that form before applying the patch.
And check back all other again to keep or not the reference (svn:eol-type could also be removed later on or something...).

I am open to any other suggestions.

Of course i forgot to mention (because i only remember it now), it's not only CRLF, it's sometimes a mix with CR/CRLF/LF as well...

There, asf has the same inconsistency error, which is now detected:

DEBUG:swh.scheduler.task.LoadSWHSvnRepositoryTsk:rev: 23241, swhrev: 5e90b0ab118a65b5d3757d51dec125ab5d9844d5, dir: f9e7753411df30a5f8b443cc6bb5965563d07b15
ERROR:swh.scheduler.task.LoadSWHSvnRepositoryTsk:Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 862, in load
    self.store_data()
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 461, in store_data
    start_from_scratch=self.start_from_scratch)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 280, in process_repository
    svnrepo, revision_start, revision_end, revision_parents)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 380, in process_swh_revisions
    raise e
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 360, in process_swh_revisions
    self.config['revision_packet_size']):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-core/swh/core/utils.py", line 40, in grouper
    for _data in itertools.zip_longest(*args, fillvalue=None):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 304, in process_svn_revisions
    for rev, nextrev, commit, new_objects, root_directory in gen_revs:
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/svn.py", line 226, in swh_hash_data_per_revision
    objects = self.swhreplay.compute_hashes(rev)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 407, in compute_hashes
    self.replay(rev)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 392, in replay
    self.conn.replay(rev, rev+1, self.editor)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 208, in close
    raise ValueError(msg)
ValueError: Inconsistency. CRLF detected in a converted file b'/tmp/swh.loader.svn.s5go_tpz.tmp/asf-mirror/cocoon/site/trunk/site/1.x/3rdparty.html' (svn:eol-style: native)

For completeness's sake:

file asf-mirror-23242/cocoon/site/trunk/site/1.x/3rdparty.html /tmp/swh.loader.svn.teevimrs.tmp/asf-mirror/cocoon/site/trunk/site/1.x/3rdparty.html
asf-mirror-23242/cocoon/site/trunk/site/1.x/3rdparty.html:                            HTML document, ASCII text, with very long lines
/tmp/swh.loader.svn.teevimrs.tmp/asf-mirror/cocoon/site/trunk/site/1.x/3rdparty.html: HTML document, ASCII text, with very long lines, with CRLF line terminators

Note: (swh-)asf-mirror-23242 is an svn export --revision 23242 --ignore-keyword

ardumont renamed this task from Investigate svn bug about altered history to svn loader: CRLF/LF mess in svn history results in hash computations divergence.Sep 10 2018, 10:29 AM

I took some time to dig into that issue last week and this what I understood of it.

To load the content of a subversion repository into the archive, the svn remote
access C api is used trough the subvertpy Python module [1]. Thanks to the svn_ra_replay
function [2], the file system associated to each svn revision can be reconstructed
by incrementally applying the diffs between each successive revision. This is an approach
much more efficient as exporting the subversion tree for each revision.

In order to verify that the tree computed at each revision with the incremental diffs
is valid and contain the exact same files as those obtained with 'svn export', a comparison
of the root directory checksum obtained with the two methods is performed every
N processed revisions.

For some processed repositories (like the Apache one, see posts above), a tree divergence
is obtained due to the internal handling of end of line styles by subversion.
Indeed subversion enables to set a property named 'svn:eol-style' [3] to each versioned
file in order to force the end of line style when a user exports or checkouts a repository.

Based on my understanding, this property does not seem to be taken into account
when using the remote access replay api and thus the reconstructed files may
contain different line endings as those generated by 'svn export', so there is a
tree divergence.

After analysing the repositories that failed to load into the archive,
all contains some CRLF end of lines for versioned files with the svn:eol-style
property set to 'native'. But according to subversion official documentation, this
should not be the case:

native

This causes the file to contain the EOL markers that are native to the operating system on which Subversion was run. In other words, if a user on a Windows machine checks out a working copy that contains a file with an svn:eol-style property set to native, that file will contain CRLF EOL markers. A Unix user checking out a working copy that contains the same file will see LF EOL markers in his copy of the file.

Note that Subversion will actually store the file in the repository using normalized LF EOL markers regardless of the operating system. This is basically transparent to the user, though.

So it means that somehow a file with the svn:eol-style property set to native did
not get stored properly in the svn repository, surely due to a buggy subversion client
or server.

I think the only way to fix those corner case issues is to ensure that each reconstructed
file from the incremental diffs has the same line ending policy as the one obtained
with 'svn export'. While we can not modify a reconstructed file directly on disk (as the
subsequent diffs will likely fail), the current implementation of the loader still allow us
to normalize the line endings according to the eol-style property value, compute the
updated checksum and send the file to the archive from a buffer in memory (see D407 for an attempt to fix the issue).

[1] https://github.com/jelmer/subvertpy/blob/master/subvertpy/_ra.c#L1200-L1226
[2] https://subversion.apache.org/docs/api/latest/svn__ra_8h.html#ad79d196db14e261a932f3207f41da5ae
[3] http://svnbook.red-bean.com/en/1.7/svn.advanced.props.file-portability.html#svn.advanced.props.special.eol-style

I took some time to dig into that issue last week and this what I understood of it.

Well, awesome, thanks.

...

And i think that's a pretty accurate summing up.
Cool.

I think the only way to fix those corner case issues is to ensure that each reconstructed
file from the incremental diffs has the same line ending policy as the one obtained
with 'svn export'.

I agree and tried that approach, but failed... [1]
I was also keeping the local normalized patch applied on disk...
I did so to overcome the future patch appliance error...

While we can not modify a reconstructed file directly on disk (as the
subsequent diffs will likely fail),

Scratch likely, it fails! ;) [2]

the current implementation of the loader still allow us to normalize the line endings
according to the eol-style property value, compute the updated checksum

Yes

and send the file to the archive from a buffer in memory.

Brilliant!
Indeed dropping the problematic applying patch to disk step sounds the right approach here...

That's the difference with my implementation tryouts!
I kept trying to modify the original content on disk (to match its hashes for further patchs appliance...).
It never occurred to me that in this case, skipping that part would help.
Cool.
Thanks.


@anlambert by the way...

According to documentation, svn's checkout or export operations, and some integration tests design to reproduce the issue (which don't) confirm it. [2]

Here is probably why we don't quite understand the test dubbed eol-style now, it's simply does not capture the issue... [3]

[1] T570#10631

[2] T570#10528

[3] https://forge.softwareheritage.org/source/swh-loader-svn/browse/master/swh/loader/svn/tests/test_loader.py$623

Cheers,

Based on my understanding, this property does not seem to be taken into account
when using the remote access replay api and thus the reconstructed files may
contain different line endings as those generated by 'svn export', so there is a
tree divergence.

Thanks for your analysis and for nailing down the issue to this problem.

I think the only way to fix those corner case issues is to ensure that each reconstructed
file from the incremental diffs has the same line ending policy as the one obtained
with 'svn export'. While we can not modify a reconstructed file directly on disk (as the
subsequent diffs will likely fail), the current implementation of the loader still allow us
to normalize the line endings according to the eol-style property value, compute the
updated checksum and send the file to the archive from a buffer in memory (see D407 for an attempt to fix the issue).

Just to be sure I'm getting your proposal right: this will ensure that the normalization is done aligning with what a checkout will do (rather than the other way around), right?
If so, I agree this is definitely the way to go, as we will be guaranteeing that we archive what users would get out of a (badly stored) SVN revision.

Just to be sure I'm getting your proposal right: this will ensure that the normalization is done aligning with what a checkout will do (rather than the other way around), right?
If so, I agree this is definitely the way to go, as we will be guaranteeing that we archive what users would get out of a (badly stored) SVN revision.

Yes, this is what I meant. We should store in the archive the files as a subversion user will get when performing a checkout. This way, the checksums
he can compute on the retrieved contents will be the same as the ones in the archive.

tl; dr

as in T946#22626:

  • 23 origins in error scheduled back [1]
  • workers' dashboard log

[1] P304