diff --git a/docs/tutorial/encoding.txt b/docs/tutorial/encoding.txt
new file mode 100644
index 00000000..0dd0d7e7
--- /dev/null
+++ b/docs/tutorial/encoding.txt
@@ -0,0 +1,26 @@
+Encoding
+========
+
+You will notice that all lower-level functions in Dulwich take byte strings
+rather than unicode strings. This is intentional.
+
+Although `C git`_ recommends the use of UTF-8 for encoding, this is not
+strictly enforced and C git treats filenames as sequences of non-NUL bytes.
+There are repositories in the wild that use non-UTF-8 encoding for filenames
+and commit messages.
+
+.. _C git: https://github.com/git/git/blob/master/Documentation/i18n.txt
+
+The library should be able to read *all* existing git repositories,
+regardless of what encoding they use. This is the main reason why Dulwich
+does not convert paths to unicode strings.
+
+A further consideration is that converting back and forth to unicode
+adds a performance penalty. E.g. if you are just iterating over file
+contents, there is no need to decode any strings. Users of the library
+may have specific assumptions they can make about the encoding - e.g. they
+could just decide that all their data is latin-1, or the default Python
+encoding.
+
+Higher-level functions, such as the porcelain in dulwich.porcelain, will
+automatically convert unicode strings to UTF-8 bytestrings.
diff --git a/docs/tutorial/index.txt b/docs/tutorial/index.txt
index 7d085a1f..5a249de3 100644
--- a/docs/tutorial/index.txt
+++ b/docs/tutorial/index.txt
@@ -1,18 +1,19 @@
 .. _tutorial:
 
 ========
 Tutorial
 ========
 
 .. toctree::
    :maxdepth: 2
 
    introduction
+   encoding
    file-format
    repo
    object-store
    remote
    tag
    porcelain
    conclusion
 
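
The point the new encoding page makes about non-UTF-8 filenames can be illustrated with a minimal standard-library sketch. It does not call Dulwich at all; `raw_name` and `try_decode` are hypothetical names introduced only for this illustration, and the Latin-1 filename is an assumed example of what a repository "in the wild" might contain.

```python
# A filename as a repository might store it: "café.txt" encoded as
# Latin-1 rather than UTF-8. (Assumed example data, not from a real repo.)
raw_name = b"caf\xe9.txt"


def try_decode(name: bytes, encoding: str):
    """Return the decoded name, or None if the bytes are invalid in `encoding`."""
    try:
        return name.decode(encoding)
    except UnicodeDecodeError:
        return None


# Forcing UTF-8 on every path would fail for this name...
utf8_result = try_decode(raw_name, "utf-8")

# ...while a caller who knows their data is Latin-1 can decode it themselves.
latin1_result = try_decode(raw_name, "latin-1")

# Re-encoding with the same codec recovers the original bytes exactly,
# which is what a caller needs in order to hand the path back as bytes.
roundtrip = latin1_result.encode("latin-1")
```

Keeping the path as bytes and leaving the decode step to the caller is what lets a byte-oriented API handle a name like this one, which a UTF-8-only API would reject.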