
translator: Fix parsing of multibyte characters
Closed, Public

Authored by vlorentz on Sep 8 2021, 5:10 PM.

Details

Summary

tree-sitter returns byte indices, not char indices.

Resolves SWH-SEARCH-12
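
To illustrate the byte-versus-character mismatch named in the summary, here is a minimal sketch in plain Python. It does not call tree-sitter; the sample string, the variable names, and the helper `slice_by_byte_offsets` are hypothetical and are not taken from `swh/search/translator.py`. It only shows why offsets counted in UTF-8 bytes select the wrong text when applied directly to a `str`.

```python
def slice_by_byte_offsets(text: str, start_byte: int, end_byte: int) -> str:
    """Return the part of `text` covered by UTF-8 byte offsets [start_byte, end_byte)."""
    # Encode first so the offsets index bytes, then decode the selected span.
    return text.encode("utf-8")[start_byte:end_byte].decode("utf-8")


# Hypothetical input: each "é" is 1 character but 2 bytes in UTF-8.
text = "été > 3"
encoded = text.encode("utf-8")

# Byte span of the ">" token, as a byte-oriented parser would report it.
start_byte = encoded.index(b">")
end_byte = start_byte + 1

wrong = text[start_byte:end_byte]  # treats byte offsets as character offsets
right = slice_by_byte_offsets(text, start_byte, end_byte)

print(repr(wrong), repr(right))  # prints: '3' '>'
```

The sketch assumes the offsets fall on character boundaries; if a span cut through a multibyte character, `decode` would raise `UnicodeDecodeError`.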

Diff Detail

Repository
rDSEA Archive search
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D6217 (id=22496)

Could not rebase; Attempt merge onto 7479282c70...

Updating 7479282..e59807b
Fast-forward
 swh/search/tests/test_translator.py | 51 ++++++++++++++++++++++++++++++++++++-
 swh/search/translator.py            |  6 ++---
 swh/search/utils.py                 | 11 ++++++--
 3 files changed, 62 insertions(+), 6 deletions(-)
Changes applied before test
commit e59807b5c78bed7547305a8887d6c52521ba3044
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 8 17:09:43 2021 +0200

    translator: Fix parsing of multibyte characters
    
    tree-sitter returns byte indices, not char indices.

commit 7f1f1be3f253e9ed59807491eb1043616a0bf4e3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 8 17:08:46 2021 +0200

    utils: Fix unescape() on non-ASCII strings.
    
    'unicode_escape' assumes latin-1 as input.

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/292/ for more details.
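
The second commit message in the report above notes that 'unicode_escape' assumes latin-1 input. The sketch below reproduces that behaviour and shows one common workaround. Both function names are hypothetical, and the workaround is an assumption for illustration; it is not necessarily the change made to `unescape()` in `swh/search/utils.py`.

```python
def naive_unescape(s: str) -> str:
    """Decode backslash escapes, but mangle non-ASCII characters."""
    # The UTF-8 bytes of "é" are read back as two latin-1 characters.
    return s.encode("utf-8").decode("unicode_escape")


def unescape_sketch(s: str) -> str:
    """Decode backslash escapes while keeping non-ASCII characters intact."""
    # latin-1 matches what 'unicode_escape' expects for code points <= 0xFF;
    # anything above that is turned into a \uXXXX or \UXXXXXXXX escape by
    # 'backslashreplace', which 'unicode_escape' then decodes back.
    return s.encode("latin-1", "backslashreplace").decode("unicode_escape")


sample = "é\\n"  # an "é" followed by a literal backslash and "n"

print(repr(naive_unescape(sample)))   # prints: 'Ã©\n'
print(repr(unescape_sketch(sample)))  # prints: 'é\n'
```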

This revision is now accepted and ready to land. Sep 9 2021, 11:30 AM