
Improve handling of comment characters in config file

Description

Comment characters inside balanced quotes should not cause the rest of
the line to be discarded:

[branch "my#branch"]

is a valid section header for a branch named my#branch (# is a valid
character in a branch name [0]). Previously this was normalized to:

[branch "my

which raised an exception because the section header was invalid.
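
The quote-aware behaviour described above can be sketched as follows.
This is a minimal illustration, not the code from this commit: the
function name strip_comment is hypothetical, the line is assumed to be a
bytes object, and backslash-escaped quotes are deliberately ignored.

```python
def strip_comment(line: bytes) -> bytes:
    """Drop a trailing comment, ignoring # and ; inside balanced quotes.

    Hypothetical helper for illustration; escaped quotes are not handled.
    """
    comment_bytes = {ord("#"), ord(";")}
    quote = ord('"')
    in_quotes = False
    for i, byte in enumerate(line):  # iterating over bytes yields ints
        if byte == quote:
            # Each quote toggles whether we are inside a quoted string
            in_quotes = not in_quotes
        elif not in_quotes and byte in comment_bytes:
            # First comment character outside quotes starts the comment
            return line[:i]
    return line
```

With this sketch, strip_comment(b'[branch "my#branch"]') returns the line
unchanged, while a # or ; outside quotes still truncates the line.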


Note that the current comment parser leaves an ambiguity for
configuration files in a multi-byte encoding: a byte inside a multi-byte
sequence may have the same value as ASCII ", # or ;, in which case a
valid configuration file is likely to be treated as malformed, depending
on e.g. branch names. In real-life systems, the only multi-byte encoding
likely to be encountered in Git configuration files is UTF-8, which is
safe from such accidental clashes because every byte of a multi-byte
UTF-8 sequence has its high bit set and so never equals an ASCII byte.
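
The byte-clash argument can be checked with a short script. This is only
a sketch: the helper name clashes is hypothetical, and UTF-16 (big
endian) is used purely as an example of an encoding where a clash can
occur, not as an encoding the commit message names.

```python
# Byte values of the characters the parser treats specially: " # ;
SPECIAL = frozenset(b'"#;')


def clashes(char: str, encoding: str) -> bool:
    """Return True if any byte of the encoded character is a special byte.

    Hypothetical helper for illustration only.
    """
    return any(byte in SPECIAL for byte in char.encode(encoding))


# Every non-ASCII code point below the surrogate range, encoded as UTF-8:
# all bytes of a multi-byte sequence have the high bit set, so none clash.
utf8_clashes = [cp for cp in range(0x80, 0xD800) if clashes(chr(cp), "utf-8")]

# U+2322 in big-endian UTF-16 is the byte pair 0x23 0x22, i.e. '#' and '"'.
utf16_example = clashes("\u2322", "utf-16-be")
```

Here utf8_clashes stays empty, while the single UTF-16 example already
produces both a # byte and a " byte.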

The same non-UTF-8 multi-byte issue seems to hold for Git itself, given
how its current configuration parser is written [1], and it makes sense
to follow the reference implementation in this regard.

[0]: man git-check-ref-format
[1]: https://github.com/git/git/blob/c2ece9dc/config.c#L504

Details

Provenance
Daniel Andersson <dandersson@users.noreply.github.com> Authored on Nov 10 2017, 4:54 PM
ardumont Pushed on Sep 27 2021, 5:34 PM
Parents
rPPDWd4ced71b4fa0: Merge branch 'issue-577' of https://github.com/jonashaag/dulwich