Currently, we assume that the revision data (except its message) is always UTF-8-encoded. We would like to catch any and all decoding failures within the converter, and to provide another way of accessing the content we were not able to decode (downloading the raw data?), in the same way the revision message is handled.
Some of the fields currently assumed to be UTF-8 are:
- author/committer name
- author/committer email
- author/committer full name
However, for those fields, I believe a "raw download" is a bit too much and we should rather look at somehow escaping the field.
You should also make sure that the same process is applied to releases.
Finally, there are also some occurrence branch names that aren't valid UTF-8.
For inspiration, swh.storage.converters.decode_with_escape converts raw bytes into a backslash-escaped Unicode codepoint sequence that is valid for JSON serialization. Its purpose is to allow serializing arbitrary byte sequences into a PostgreSQL jsonb field, but it could probably be moved into swh.core and reused here.
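To make the escaping idea concrete, here is a minimal sketch of what such a helper could look like. This is not the actual swh.storage code: the names escape_decode and unescape_to_bytes are hypothetical, and the special handling of NUL bytes is an assumption (made because PostgreSQL jsonb rejects the NUL codepoint). It relies only on Python's built-in "backslashreplace" error handler.

```python
import json
import re


def escape_decode(raw: bytes) -> str:
    """Decode bytes as UTF-8, backslash-escaping anything undecodable.

    Hypothetical helper, not the swh.storage implementation.
    """
    # Double existing backslashes so the escapes added below stay unambiguous.
    raw = raw.replace(b"\\", b"\\\\")
    # Assumption: NUL is escaped too, since PostgreSQL jsonb cannot store it.
    raw = raw.replace(b"\x00", b"\\x00")
    # Invalid bytes come out as a literal backslash followed by xNN.
    return raw.decode("utf-8", errors="backslashreplace")


# Matches either a doubled backslash or a \xNN escape produced above.
_ESCAPE = re.compile(r"\\(\\|x[0-9a-f]{2})")


def unescape_to_bytes(text: str) -> bytes:
    """Inverse of escape_decode: recover the original byte string."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        # Copy the plain text before this escape, re-encoded as UTF-8.
        out += text[pos:match.start()].encode("utf-8")
        token = match.group(1)
        if token == "\\":
            out += b"\\"
        else:
            out.append(int(token[1:], 16))
        pos = match.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    latin1_name = "André".encode("latin-1")  # not valid UTF-8
    escaped = escape_decode(latin1_name)
    print(json.dumps({"name": escaped}))
    print(unescape_to_bytes(escaped) == latin1_name)
```

Because the escaping is reversible, the original bytes of an author name, email, or branch name can still be recovered from the escaped string, which is what makes a full "raw download" unnecessary for these short fields.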