Page MenuHomeSoftware Heritage

loader-svn: googlecode import: UnicodeDecodeError in user svn properties fails the loading
Closed, ResolvedPublic

Description

dump: /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/9/9i00/9i00-repo.svndump.gz

$ python3
Python 3.6.4 (default, Jan  5 2018, 02:13:53)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> dump = '9i00-repo.svndump.gz'
>>> origin_url = 'http://%s.googlecode.com' % dump
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.svn.tasks import MountAndLoadSvnRepositoryTsk
>>>
>>> t = MountAndLoadSvnRepositoryTsk()
>>> t.run(archive_path=dump, origin_url=origin_url, visit_date='2016-05-03T15:16:32+00:00')
INFO:swh.loader.svn.SvnLoader:Archive to mount and load 9i00-repo.svndump.gz
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:Creating svn origin for http://9i00-repo.svndump.gz.googlecode.com
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:Done creating svn origin for http://9i00-repo.svndump.gz.googlecode.com
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:Creating origin_visit for origin 12 at time 2016-05-03T15:16:32+00:00
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:Done Creating origin_visit for origin 12 at time 2016-05-03T15:16:32+00:00
INFO:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:[revision_start-revision_end]: [1-35]
INFO:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:Processing {'remote_url': 'file:///tmp/swh.loader.svn.q093b5py.tmp', 'local_url': b'/tmp/swh.loader.svn.n650d22c.tmp/swh.loader.svn.q093b5py.tmp', 'uuid': b'5d117054-d222-43b4-afd8-064e7e915043', 'swh-origin': 12}.
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:rev: 1, swhrev: b6694c21369e5108b68bd0434740140b42adf042, dir: 75ed58f260bfa4102d0e09657803511f5f0ab372
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:rev: 2, swhrev: 0d7c42ea166ae91fc0023a72c25d13896d722258, dir: e64d3ccab08070f74c9a4577dce5b1e8b531ce55
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:rev: 3, swhrev: 2cf883ee78002af3cdfd3513005b15620b585fe2, dir: 85b2a5eaf76ac0cdc42efcd7f4e293c03c27b8ae
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:rev: 4, swhrev: 1ead2ed0bc20326ea38d6a121190b21e85d7d629, dir: cc19a1ab046114d03b465082556cb55731a6fe6e
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:rev: 5, swhrev: c7b9409849750ce074a75f5b4212b72fbdb66f3b, dir: 84a0b4786ce67c65bef611cf6b3a2c640b641763
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:rev: 6, swhrev: d3b567d6904cb944e30177e006fd28fcfc77f4a8, dir: f6f53fc35b557199494cf777160abc4a36dcbef5
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:rev: 7, swhrev: efcab6300195ccd0aaefa2dee2d5642e22db2344, dir: 31af4c42188972543ae2a548015f866567526b65
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:rev: 8, swhrev: 022aa6dccf16a9706876960cb3969903e0427343, dir: cd54af8d241cc86eea8872b4296c1bfc9a0f6b0c
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:rev: 9, swhrev: 8b03663681c85c94ae815624ef0d5bf2040fbcf6, dir: 4a5c0dd34fc1d9585d7bc741bfcfa12d709733c0
ERROR:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 862, in load
    self.store_data()
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 341, in store_data
    start_from_scratch=self.start_from_scratch)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 539, in process_repository
    svnrepo, revision_start, revision_end, revision_parents)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 262, in process_swh_revisions
    raise e
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 241, in process_swh_revisions
    self.config['revision_packet_size']):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-core/swh/core/utils.py", line 40, in grouper
    for _data in itertools.zip_longest(*args, fillvalue=None):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/loader.py", line 185, in process_svn_revisions
    for rev, nextrev, commit, new_objects, root_directory in gen_revs:
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/svn.py", line 278, in swh_hash_data_per_revision
    objects = self.swhreplay.compute_hashes(rev)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 374, in compute_hashes
    self.replay(rev)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 359, in replay
    self.conn.replay(rev, rev+1, self.editor)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 0: invalid start byte
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:Updating origin_visit for origin 12 with status partial
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:Done updating origin_visit for origin 12 with status partial
DEBUG:swh.scheduler.task.MountAndLoadSvnRepositoryTsk:Clean up temp directory /tmp/swh.loader.svn.q093b5py.tmp for project
{'status': 'failed'}

Event Timeline

ardumont created this task.Feb 2 2018, 1:52 PM

This issue is reproduced on 6 repositories (so far, still running locally on some other big repositories).
Those fails are commits holding non-unicode named characters in tree or filename (japanese for example).
What's failing is not clear at all though.

Digging in deeper in the c lib dev dependencies... as the relevant stacktrace part here is quite concise...

ardumont lowered the priority of this task from High to Normal.Sep 24 2018, 4:14 PM

This one is still true. Still banging my head on this.

To ease the reproductibility, one can use the following command (in repository)

python3 -m swh.loader.svn.ra --local-url /tmp/loader-svn/ --svn-url file:///home/tony/scratch/swh/loader-svn/9i00 --revision-start 1 --revision-end 10 --nocleanup

Note: --nocleanup permits to leave the /tmp/loader-svn temporary work folder in place after the run.

ardumont added a comment.EditedSep 24 2018, 4:24 PM

(If you are not interested in details, no need to look further, it's a run extract ;)

Digging in deeper in the c lib dev dependencies... as the relevant stacktrace part here is quite concise...

Yes, Adding some logs shows the problem seems to come from deeper.
What's frustrating is that the stacktrace just starts at the function call...

As far as i dug (again), i entered within svn's internals and got lost.
So far: subvertpy:ra.py -> _ra.c ~> libsvn:svn_ra_replay_range

$ python3 -m swh.loader.svn.ra --local-url /tmp/loader-svn/ --svn-url file:///home/tony/scratch/swh/loader-svn/9i00 --revision-start 1 --revision-end 10 --nocleanup
DEBUG:root:Replay: 1 -> 2
DEBUG:root:Editor: open_root -1
DEBUG:root:basedireditor: __init__ Directory(id=None, entries=[]) b'/tmp/loader-svn/x_2ooy6e-9i00'
DEBUG:root:DirEditor: add_directory branches None -1
DEBUG:root:DirEditor: path b'branches'
DEBUG:root:basedireditor: close
DEBUG:root:DirEditor: add_directory tags None -1
DEBUG:root:DirEditor: path b'tags'
DEBUG:root:basedireditor: close
DEBUG:root:DirEditor: add_directory trunk None -1
DEBUG:root:DirEditor: path b'trunk'
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
INFO:root:r1 75ed58f260bfa4102d0e09657803511f5f0ab372 (0 new contents, 2 new directories)
DEBUG:root:Replay: 2 -> 3
DEBUG:root:Editor: open_root -1
DEBUG:root:basedireditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00'
DEBUG:root:basedireditor: add_file trunk/readme.txt None -1
DEBUG:root:basedireditor: path: b'trunk/readme.txt'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00' b'trunk/readme.txt'
DEBUG:root:fileeditor: apply_textdelta None
DEBUG:root:fileeditor: close
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
INFO:root:r2 e64d3ccab08070f74c9a4577dce5b1e8b531ce55 (1 new contents, 2 new directories)
DEBUG:root:Replay: 3 -> 4
DEBUG:root:Editor: open_root -1
DEBUG:root:basedireditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00'
DEBUG:root:DirEditor: add_directory trunk/doc None -1
DEBUG:root:DirEditor: path b'trunk/doc'
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
INFO:root:r3 85b2a5eaf76ac0cdc42efcd7f4e293c03c27b8ae (0 new contents, 3 new directories)
DEBUG:root:Replay: 4 -> 5
DEBUG:root:Editor: open_root -1
DEBUG:root:basedireditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00'
DEBUG:root:basedireditor: open_file: trunk/readme.txt
DEBUG:root:basedireditor: path b'trunk/readme.txt'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00' b'trunk/readme.txt'
DEBUG:root:fileeditor: apply_textdelta b42e4e5f9a6437b8e61610cec96bad6a
DEBUG:root:fileeditor: close
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
INFO:root:r4 cc19a1ab046114d03b465082556cb55731a6fe6e (1 new contents, 2 new directories)
DEBUG:root:Replay: 5 -> 6
DEBUG:root:Editor: open_root -1
DEBUG:root:basedireditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00'
DEBUG:root:basedireditor: open_file: trunk/readme.txt
DEBUG:root:basedireditor: path b'trunk/readme.txt'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00' b'trunk/readme.txt'
DEBUG:root:fileeditor: apply_textdelta cc1606c0d23389bb9168324dbdbfc09b
DEBUG:root:fileeditor: close
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
INFO:root:r5 84a0b4786ce67c65bef611cf6b3a2c640b641763 (1 new contents, 2 new directories)
DEBUG:root:Replay: 6 -> 7
DEBUG:root:Editor: open_root -1
DEBUG:root:basedireditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00'
DEBUG:root:basedireditor: open_file: trunk/readme.txt
DEBUG:root:basedireditor: path b'trunk/readme.txt'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00' b'trunk/readme.txt'
DEBUG:root:fileeditor: apply_textdelta 7416d916050e0bbf11da53902d04eea8
DEBUG:root:fileeditor: close
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
INFO:root:r6 f6f53fc35b557199494cf777160abc4a36dcbef5 (1 new contents, 2 new directories)
DEBUG:root:Replay: 7 -> 8
DEBUG:root:Editor: open_root -1
DEBUG:root:basedireditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00'
DEBUG:root:DirEditor: add_directory trunk/demo None -1
DEBUG:root:DirEditor: path b'trunk/demo'
DEBUG:root:basedireditor: add_file trunk/demo/db.php None -1
DEBUG:root:basedireditor: path: b'trunk/demo/db.php'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00' b'trunk/demo/db.php'
DEBUG:root:fileeditor: apply_textdelta None
DEBUG:root:fileeditor: close
DEBUG:root:basedireditor: add_file trunk/demo/index.php None -1
DEBUG:root:basedireditor: path: b'trunk/demo/index.php'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00' b'trunk/demo/index.php'
DEBUG:root:fileeditor: apply_textdelta None
DEBUG:root:fileeditor: close
DEBUG:root:basedireditor: add_file trunk/demo/insert.php None -1
DEBUG:root:basedireditor: path: b'trunk/demo/insert.php'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00' b'trunk/demo/insert.php'
DEBUG:root:fileeditor: apply_textdelta None
DEBUG:root:fileeditor: close
DEBUG:root:basedireditor: add_file trunk/demo/operation.php None -1
DEBUG:root:basedireditor: path: b'trunk/demo/operation.php'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00' b'trunk/demo/operation.php'
DEBUG:root:fileeditor: apply_textdelta None
DEBUG:root:fileeditor: close
DEBUG:root:basedireditor: add_file trunk/demo/test_input.sql None -1
DEBUG:root:basedireditor: path: b'trunk/demo/test_input.sql'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00' b'trunk/demo/test_input.sql'
DEBUG:root:fileeditor: apply_textdelta None
DEBUG:root:fileeditor: close
DEBUG:root:basedireditor: add_file trunk/demo/update.php None -1
DEBUG:root:basedireditor: path: b'trunk/demo/update.php'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00' b'trunk/demo/update.php'
DEBUG:root:fileeditor: apply_textdelta None
DEBUG:root:fileeditor: close
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
INFO:root:r7 31af4c42188972543ae2a548015f866567526b65 (6 new contents, 3 new directories)
DEBUG:root:Replay: 8 -> 9
DEBUG:root:Editor: open_root -1
DEBUG:root:basedireditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00'
DEBUG:root:basedireditor: open_file: trunk/demo/test_input.sql
DEBUG:root:basedireditor: path b'trunk/demo/test_input.sql'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00' b'trunk/demo/test_input.sql'
DEBUG:root:fileeditor: apply_textdelta 30338026475849d808224714161e448f
DEBUG:root:fileeditor: close
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
INFO:root:r8 cd54af8d241cc86eea8872b4296c1bfc9a0f6b0c (1 new contents, 3 new directories)
DEBUG:root:Replay: 9 -> 10
DEBUG:root:Editor: open_root -1
DEBUG:root:basedireditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00'
DEBUG:root:basedireditor: open_file: trunk/demo/index.php
DEBUG:root:basedireditor: path b'trunk/demo/index.php'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00' b'trunk/demo/index.php'
DEBUG:root:fileeditor: apply_textdelta 5545bd8e4650e2ff7aeb4c008fc3958f
DEBUG:root:fileeditor: close
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
INFO:root:r9 4a5c0dd34fc1d9585d7bc741bfcfa12d709733c0 (1 new contents, 3 new directories)
DEBUG:root:Replay: 10 -> 11
DEBUG:root:Editor: open_root -1
DEBUG:root:basedireditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00'
DEBUG:root:basedireditor: open_file: trunk/demo/index.php
DEBUG:root:basedireditor: path b'trunk/demo/index.php'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/x_2ooy6e-9i00' b'trunk/demo/index.php'
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 513, in <module>
    main()
  File "/usr/lib/python3/dist-packages/click/core.py", line 759, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 714, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 951, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 552, in invoke
    return callback(*args, **kwargs)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 496, in main
    objects = replay.compute_hashes(rev)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 451, in compute_hashes
    self.replay(rev)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 436, in replay
    self.conn.replay(rev, rev+1, self.editor)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 0: invalid start byte

Note:

  • swh-loader-svn's new branch tracking-unicodedecode-error-issue-t946
  • i changed swh.model.from_disk.Directory to have a str function that displays None for the id (or else that fails when trying to hash... don't know why) [1]

[1] swh-model diff

diff --git a/swh/model/from_disk.py b/swh/model/from_disk.py
index 4767c39..14511f7 100644
--- a/swh/model/from_disk.py
+++ b/swh/model/from_disk.py
@@ -345,3 +345,8 @@ class Directory(MerkleNode):
             id_to_str(self.hash),
             ', '.join(str(entry) for entry in self),
         )
+
+    def __str__(self):
+        return 'Directory(id=None, entries=[%s])' % (
+            ', '.join(str(entry) for entry in self),
+        )
anlambert added a subscriber: anlambert.EditedSep 24 2018, 4:55 PM

Tracking down the issue in subvertpy source code, the error occurs in the subvertpy._ra C extension module.
More precisely, an exception is raised at line 1068 in file subvertpy/editor.c when Python tries to decode a svn property value from 'utf-8' encoding.

1062  static svn_error_t *py_cb_editor_change_prop(void *dir_baton, const char *name, const svn_string_t *value, apr_pool_t *pool)
1063  {
1064          PyObject *self = (PyObject *)dir_baton, *ret;
1065          PyGILState_STATE state = PyGILState_Ensure();
1066
1067          if (value != NULL) {
1068                  ret = PyObject_CallMethod(self, "change_prop", "sz#", name, value->data, value->len);
1069          } else {
1070                  ret = PyObject_CallMethod(self, "change_prop", "sO", name, Py_None);
1071          }
1072          CB_CHECK_PYRETVAL(ret);
1073          Py_DECREF(ret);
1074          PyGILState_Release(state);
1075          return NULL;
1076  }

By analyzing the repository dump file in emacs at Revision number 10 which is the problematic one, we get:

Revision-number: 10
Prop-content-length: 134
Content-length: 134

K 13
google:author
V 7
unknown
K 10
svn:author
V 9
kewenxing
K 8
svn:date
V 27
2006-11-16T06:45:43.333644Z
K 7
svn:log
V 0

PROPS-END

Node-path: trunk/demo/index.php
Node-kind: file
Node-action: change
Prop-delta: true
Prop-content-length: 34
Content-length: 34

K 8
version1
V 6
\265\332\322\273\260\346
PROPS-END

So it looks like the value of the 'version1' property is not correctly encoded.
After check, the used encoding is GB18030.

Thinking a bit more about the issue, there might a way to workaround it in our client code instead
of hacking in subvertpy.

Python allows to override the error handlers for the codecs module: https://docs.python.org/3.7/library/codecs.html?highlight=register_error#codecs.register_error

We could then temporarily override the 'strict' error handler which is the default one and either
try to decode ourself the problematic bytes object or simply returning an empty string.

Something like:

import codecs

def _codecs_error_handler(exception):
    return u"", exception.end

# then in Replay.replay implementation
codecs.register_error("strict", _codecs_error_handler)
self.conn.replay(rev, rev+1, self.editor)
codecs.register_error("strict", codecs.strict_errors)

It seems to do the trick with my first tests.

ardumont added a comment.EditedSep 25 2018, 10:58 AM

By analyzing the repository dump file in emacs

Nice to know, it did not even occur to me to use emacs for that!

So it looks like the value of the 'version1' property is not correctly encoded.

Quite.

It seems to do the trick with my first tests.

I concur that your proposed solution work (cf. below for detail).

Note that installing the error handler as proposed will catch any other exception raised (not only the one raised by change_prop).
There are no other identified so far.
That is not only the one raised by change_prop.
I don't have anything better to propose though ;)

Patched ra's main according to what you proposed:

diff --git a/swh/loader/svn/ra.py b/swh/loader/svn/ra.py
index 737e2e3..1f65137 100644
--- a/swh/loader/svn/ra.py
+++ b/swh/loader/svn/ra.py
@@ -490,6 +490,13 @@ def main(local_url, svn_url, revision_start, revision_end, debug, cleanup):
     revision_end = min(revision_end, revision_end_max)
 
     try:
+        import codecs
+
+        def _codecs_error_handler(e):
+            logging.exception('############ Exception caught')
+            return u"", e.end
+
+        codecs.register_error("strict", _codecs_error_handler)
         replay = Replay(conn, rootpath)
 
         for rev in range(revision_start, revision_end+1):
@@ -504,6 +511,7 @@ def main(local_url, svn_url, revision_start, revision_end, debug, cleanup):
         if debug:
             logging.info('%s' % rootpath.decode('utf-8'))
     finally:
+        codecs.register_error("strict", codecs.strict_errors)
         if cleanup:
             if os.path.exists(rootpath):
                 shutil.rmtree(rootpath)

And run:

$ python3 -m swh.loader.svn.ra --local-url /tmp/loader-svn/ --svn-url file:///home/tony/scratch/swh/loader-svn/9i00 --revision-start 1 --revision-end 10 --nocleanup 
...
INFO:root:r9 4a5c0dd34fc1d9585d7bc741bfcfa12d709733c0 (1 new contents, 3 new directories)
DEBUG:root:Replay: 10 -> 11
DEBUG:root:Editor: open_root -1
DEBUG:root:basedireditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/4ufc6_3s-9i00'
DEBUG:root:basedireditor: open_file: trunk/demo/index.php
DEBUG:root:basedireditor: path b'trunk/demo/index.php'
DEBUG:root:FileEditor: __init__ Directory(id=None, entries=[b'branches', b'tags', b'trunk']) b'/tmp/loader-svn/4ufc6_3s-9i00' b'trunk/demo/index.php'
ERROR:root:############ Exception caught
ERROR:root:############ Exception caught
ERROR:root:############ Exception caught
ERROR:root:############ Exception caught
NoneType: None
DEBUG:root:fileeditor: change_prop: version1 һ
DEBUG:root:fileeditor: close
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
DEBUG:root:basedireditor: close
INFO:root:r10 4a5c0dd34fc1d9585d7bc741bfcfa12d709733c0 (1 new contents, 3 new directories)
INFO:root:/tmp/loader-svn/4ufc6_3s-9i00

The hash tree computation is still ok:

$ svn checkout file://$(pwd)/9i00@10 9i00@10
A    9i00@10/branches
A    9i00@10/tags
A    9i00@10/trunk
A    9i00@10/trunk/demo
A    9i00@10/trunk/demo/db.php
A    9i00@10/trunk/demo/index.php
A    9i00@10/trunk/demo/insert.php
A    9i00@10/trunk/demo/operation.php
A    9i00@10/trunk/demo/test_input.sql
A    9i00@10/trunk/demo/update.php
A    9i00@10/trunk/doc
A    9i00@10/trunk/readme.txt
Checked out revision 10.
$ swh-hashtree --path 9i00@10 --ignore .svn
4a5c0dd34fc1d9585d7bc741bfcfa12d709733c0
$ swh-hashtree --path /tmp/loader-svn/4ufc6_3s-9i00
4a5c0dd34fc1d9585d7bc741bfcfa12d709733c0

Note: swh-hashtree in swh-loader-svn/bin

Cheers,

Digging deeper to try and improve the result to return (at the moment, empty string).
I tried to use chardet to detect the language to try and decode the bytes.
This fails as nothing appropriate is found.

It chooses russian over chinese.

DEBUG:root:###### objects: b'\xb5\xda\xd2\xbb\xb0\xe6'
...
DEBUG:chardet.charsetprober:EUC-JP Japanese confidence = 0.01
DEBUG:chardet.charsetprober:GB2312 Chinese confidence = 0.01
DEBUG:chardet.charsetprober:KOI8-R Russian confidence = 0.5119798157077455
...
DEBUG:root:####### {'encoding': 'KOI8-R', 'confidence': 0.5119798157077455, 'language': 'Russian'}
DEBUG:root:###### KOI8-R data: ╣зр╩╟Ф  ~> no translation
DEBUG:root:###### EUC-JP data: 及匯井   ~> no translation
DEBUG:root:###### GB2312 data: 第一版   ~> `first edition`

After trial and error using google-translate, only GB2312 encoding result in a meaning: first edition.

So as we cannot (yet?) automate this decoding part, the sensible thing to do seems to go with returning the empty string.
In effect, that drops the property.
That's a repository specific property we do not read anyway.

The next sensible thing to do would be to check the other repositories...
But...

This issue is reproduced on 6 repositories (so far, still running locally on some other big repositories).

To my old self, what are those other 5 repositories?

Cheers,

ardumont renamed this task from googlecode import: UnicodeDecodeError to loader-svn: googlecode import: UnicodeDecodeError.Sep 25 2018, 2:06 PM

This issue is reproduced on 6 repositories (so far, still running locally on some other big repositories).

To my old self, what are those other 5 repositories?

Well, digging through that at the moment.
I had to configure back the new kibana0 (was banco before) to start parsing those logs back.

I had to configure back the new kibana0 (was banco before) to start parsing those logs back.

Well, not lucky enough, i don't have the arguments of those task messages...

{'args': {}, 'kwargs': {}, 'exception':...

That must have run prior to fixing that.
Taking another route.

Great to have kibana back!

By playing a bit with it, I managed to isolate this svn repo (I had to filter on the date and on the swh_task_id as the error log containing the UnicodeDecodeError exception do not have task parameters in it).

Using a patched subvertpy to track the errors when Python C API functions are called, the UnicodeDecodeError is raised when
loading revision 1317. This is again due to the change of value for a non standard svn property called "jortho" (which we simply ignore
in the loader).

Another try with another svn repo gives me the following output:

r1 a9d1fb73fb683bfa494d1fe569136e0b4d644178 (0 new contents, 2 new directories)
r2 81dfb436704321c162cacbecc2466381a0163434 (2 new contents, 4 new directories)
prop name = svn:mime-type
r3 648f57b9e333fda765c1c7ce21f2e0f7318d5d15 (1 new contents, 2 new directories)
prop name = hehe
py_cb_editor_change_prop
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 498, in <module>
    main()
  File "/usr/lib/python3/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 481, in main
    objects = replay.compute_hashes(rev)
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 438, in compute_hashes
    self.replay(rev)
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 422, in replay
    self.conn.replay(rev, rev+1, self.editor)
  File "/home/antoine/swh/swh-environment/swh-loader-svn/swh/loader/svn/ra.py", line 418, in _codecs_error_handler
    raise e
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 2: invalid continuation byte

Once again, the issue comes from a non standard svn property here named 'hehe'.

ardumont added a comment.EditedSep 27 2018, 3:15 PM

Great to have kibana back!

:)

By playing a bit with it, I managed to isolate th...

Awesome. Thanks.

That's what i'm trying to make sure of.

I'm trying to list all those repositories i mentioned first ;)

(I had to filter on the date and on the swh_task_id as the error log containing the UnicodeDecodeError exception do not have task parameters in it).

Yes, we can join as you did but i found this rather tedious (and not scriptable?)

I have a snippet script to retrieve the information from elastic nodes [1] that i tried to leverage again.
That's the {args, kwargs} i mentioned earlier... (and the actual fact that we miss the arguments in that extract is an old bug in the loaders that has been fixed... but after those loading apparently...)

taking another route...

I was aiming at the fetch_history table but that misses indexes... (prado has some, somerset has none).
So it's too slow (well without creating the missing index)

I started an extract of those table to mount the local db and play with it.
But now i have other problems with my machine... T.T

Today is not a good day for me ;)

[1] https://forge.softwareheritage.org/source/snippets/browse/master/ardumont/kibana_fetch_logs.py

ardumont added a comment.EditedSep 27 2018, 3:53 PM

Today is not a good day for me ;)

Anyway, that will not stop me ;)

So it's too slow (well without creating the missing index)

...

Well, improving on the query, i was finally able to run something adequate directly there.

\copy (select o.url, fh.status, fh.stderr from origin o inner join origin_visit ov on o.id=ov.origin inner join fetch_history fh on fh.origin=o.id where ov.status = 'partial' and o.type='svn' and not fh.status and ov.visit = (select max(visit) from origin_visit where origin=o.id) and stderr like '%UnicodeDecodeError%') to '/home/ardumont/data-transit' ;
COPY 81

[1] P303 for the result

Note: 81 for all history (there will be duplicates).

Only 10 different origins after checks...:

http://9i00.googlecode.com/svn/ /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/9/9i00/9i00-repo.svndump.gz
http://coterie.googlecode.com/svn/ /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/c/coterie/coterie-repo.svndump.gz
http://httthttt.googlecode.com/svn/ /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/h/httthttt/httthttt-repo.svndump.gz
http://javasimplelearn.googlecode.com/svn/ /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/j/javasimplelearn/javasimplelearn-repo.svndump.gz
http://parrot-im.googlecode.com/svn/ /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/p/parrot-im/parrot-im-repo.svndump.gz
http://rayan.googlecode.com/svn/ /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/r/rayan/rayan-repo.svndump.gz
http://stm32-ewarm.googlecode.com/svn/ /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/s/stm32-ewarm/stm32-ewarm-repo.svndump.gz
http://svnant4java.googlecode.com/svn/ /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/s/svnant4java/svnant4java-repo.svndump.gz
http://veille-technologique.googlecode.com/svn/ /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/v/veille-technologique/veille-technologique-repo.svndump.gz
http://vitaludus.googlecode.com/svn/ /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/v/vitaludus/vitaludus-repo.svndump.gz

I still need to do the join to find the dumps now (done) [2]
But hey, that's progress \m/.

[2] uffizi:/srv/storage/space/mirrors/code.google.com/sources/INDEX-svn-dumps

Cheers,

Awesome \o/!

I was trying to play with your Python scripts to query kibana logs but it's been a while since I
did not write a query for elastic search and their json format is still awful.

Anyway, I will load and report the errors I obtained with those repos.

So here are the results:

http://9i00.googlecode.com/svn/ => Error when trying to decode the content of a svn property named version1
http://coterie.googlecode.com/svn/ => Error when trying to decode the content of a svn property named introduction
http://httthttt.googlecode.com/svn/ => Error when trying to decode the content of a svn property named thanh
http://javasimplelearn.googlecode.com/svn/ => Error when trying to decode the content of a svn property named bugtraq:label
http://parrot-im.googlecode.com/svn/ => Error when trying to decode the content of a svn property named jortho
http://rayan.googlecode.com/svn/ => Error when trying to decode the content of a svn property named revision
http://stm32-ewarm.googlecode.com/svn/ => Error when trying to decode the content of a svn property named hehe
http://svnant4java.googlecode.com/svn/ => Error when trying to decode the content of a svn property named myPicture
http://veille-technologique.googlecode.com/svn/ => Error when trying to decode the content of a svn property named TODO
http://vitaludus.googlecode.com/svn/ => Error when trying to decode the content of a svn property named tsvn:projectlanguage

So each time the decoding of a non standard svn property content fails.
As we simply ignore them in the loader, returning an empty string in the codecs
error handler should be enough to finally close that task.

Ok, so going for that fix.

Thanks for the help.

ardumont renamed this task from loader-svn: googlecode import: UnicodeDecodeError to loader-svn: googlecode import: UnicodeDecodeError in user svn properties.Sep 27 2018, 6:12 PM
ardumont renamed this task from loader-svn: googlecode import: UnicodeDecodeError in user svn properties to loader-svn: googlecode import: UnicodeDecodeError in user svn properties fails the loading.

tl; dr

  • new loader svn packaged and deployed
  • origins in error scheduled back
  • workers logs (kibana dashboard) [1]

Cheers,


Detail

Scheduling origins:

cat data.csv | python3 -m swh.scheduler.cli task schedule -c type -c policy -c kwargs --delimiter ';' -

Note: scheduler using the right db

data.csv:

swh-loader-mount-dump-and-load-svn-repository;oneshot;{"origin_url": "http://9i00.googlecode.com/svn/", "archive_path": "/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/9/9i00/9i00-repo.svndump.gz"}
swh-loader-mount-dump-and-load-svn-repository;oneshot;{"origin_url": "http://coterie.googlecode.com/svn/", "archive_path": "/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/c/coterie/coterie-repo.svndump.gz"}
swh-loader-mount-dump-and-load-svn-repository;oneshot;{"origin_url": "http://httthttt.googlecode.com/svn/", "archive_path": "/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/h/httthttt/httthttt-repo.svndump.gz"}
swh-loader-mount-dump-and-load-svn-repository;oneshot;{"origin_url": "http://javasimplelearn.googlecode.com/svn/", "archive_path": "/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/j/javasimplelearn/javasimplelearn-repo.svndump.gz"}
swh-loader-mount-dump-and-load-svn-repository;oneshot;{"origin_url": "http://parrot-im.googlecode.com/svn/", "archive_path": "/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/p/parrot-im/parrot-im-repo.svndump.gz"}
swh-loader-mount-dump-and-load-svn-repository;oneshot;{"origin_url": "http://rayan.googlecode.com/svn/", "archive_path": "/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/r/rayan/rayan-repo.svndump.gz"}
swh-loader-mount-dump-and-load-svn-repository;oneshot;{"origin_url": "http://stm32-ewarm.googlecode.com/svn/", "archive_path": "/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/s/stm32-ewarm/stm32-ewarm-repo.svndump.gz"}
swh-loader-mount-dump-and-load-svn-repository;oneshot;{"origin_url": "http://svnant4java.googlecode.com/svn/", "archive_path": "/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/s/svnant4java/svnant4java-repo.svndump.gz"}
swh-loader-mount-dump-and-load-svn-repository;oneshot;{"origin_url": "http://veille-technologique.googlecode.com/svn/", "archive_path": "/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/v/veille-technologique/veille-technologique-repo.svndump.gz"}
swh-loader-mount-dump-and-load-svn-repository;oneshot;{"origin_url": "http://vitaludus.googlecode.com/svn/", "archive_path": "/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/v/vitaludus/vitaludus-repo.svndump.gz"}

[1] http://kibana0.internal.softwareheritage.org:5601/app/kibana#/dashboard/06168050-c0bd-11e8-8222-07f3ec376cd5?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:%272018-09-27T22:00:00.000Z%27,mode:absolute,to:%272018-09-29T21:59:59.999Z%27))&_a=(description:%27This%20is%20a%20general%20dashboard%20listing%20the%20full%20logs%20of%20the%20swh-workers%27,filters:!((%27$state%27:(store:appState),meta:(alias:!n,disabled:!f,index:%2720de3150-c0bb-11e8-8222-07f3ec376cd5%27,key:systemd_unit,negate:!f,params:(query:%27swh-worker@swh_loader_svn.service%27,type:phrase),type:phrase,value:%27swh-worker@swh_loader_svn.service%27),query:(match:(systemd_unit:(query:%27swh-worker@swh_loader_svn.service%27,type:phrase)))),(%27$state%27:(store:appState),meta:(alias:!n,disabled:!f,index:%2720de3150-c0bb-11e8-8222-07f3ec376cd5%27,key:message,negate:!t,params:(query:%27%5B.*%5D%20UnicodeDecodeError.*%27,type:phrase),type:phrase,value:%27%5B.*%5D%20UnicodeDecodeError.*%27),query:(match:(message:(query:%27%5B.*%5D%20UnicodeDecodeError.*%27,type:phrase)))),(%27$state%27:(store:appState),meta:(alias:!n,disabled:!f,index:%2720de3150-c0bb-11e8-8222-07f3ec376cd5%27,key:priority,negate:!t,params:(query:%273%27,type:phrase),type:phrase,value:%273%27),query:(match:(priority:(query:%273%27,type:phrase))))),fullScreenMode:!f,options:(darkTheme:!t,hidePanelTitles:!f,useMargins:!f),panels:!((embeddableConfig:(),gridData:(h:31,i:%271%27,w:48,x:0,y:0),id:%2755ea1930-c0bc-11e8-8222-07f3ec376cd5%27,panelIndex:%271%27,type:search,version:%276.3.2%27)),query:(language:lucene,query:%27%27),timeRestore:!f,title:%27swh-loader-svn%20workers!%27%20logs%27,viewMode:view)

ardumont closed this task as Resolved.Sep 28 2018, 3:04 PM