
Fail gracefully if the revision decoding process fails
Closed, Migrated

Description

Currently, we assume that the revision data (except its message) is always UTF-8 encoded. We would like to catch any and all decoding failures within the converter, and to provide another way of accessing the content we were not able to decode (downloading the raw data?), in the same fashion as the revision message is handled.
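As a rough illustration of the idea (not the actual converter code), the sketch below tries to decode a field as UTF-8 and, on failure, reports the failure instead of raising, so the caller can offer the raw bytes some other way. The function name `try_decode_utf8` and its return shape are assumptions made for this example.

```python
from typing import Optional, Tuple


def try_decode_utf8(raw: bytes) -> Tuple[Optional[str], bool]:
    """Hypothetical helper: return (text, decoding_failed).

    On success the decoded string is returned with decoding_failed=False.
    On failure the text is None and decoding_failed=True, so the caller
    can fall back to exposing the raw bytes (e.g. via a download link).
    """
    try:
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        return None, True


# Usage: a valid field decodes normally, an invalid one is flagged.
print(try_decode_utf8(b"Jane Doe"))
print(try_decode_utf8(b"Jane D\xf6e"))  # Latin-1 bytes, not valid UTF-8
```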

Event Timeline

jbertran triaged this task as Normal priority. Jun 9 2016, 2:07 PM
jbertran created this task.
jbertran created this object in space S1 Public.

Some of the fields currently assumed to be UTF-8 are:

  • author/committer name
  • author/committer email
  • author/committer full name

However, for those fields, I believe a "raw download" is a bit too much and we should rather look at somehow escaping the field.

You should also make sure that the same process is applied to releases.

Finally, there are also some occurrences of "branch names" that aren't proper UTF-8.

For inspiration, swh.storage.converters.decode_with_escape converts raw bytes into a backslash-escaped unicode codepoint sequence that is valid for JSON serialization. Its purpose is to allow serializing arbitrary byte sequences into a PostgreSQL jsonb field, but it could probably be moved into swh.core and reused for escaping these fields as well.
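To make the escaping approach concrete, here is a minimal sketch of what such a function could look like. It is not the real swh.storage implementation: the name `decode_with_escape_sketch` is made up for this example, and it relies on Python's built-in `backslashreplace` error handler (available for decoding since Python 3.5) rather than whatever handler the actual code registers.

```python
def decode_with_escape_sketch(raw: bytes) -> str:
    """Hypothetical sketch: decode bytes as UTF-8, escaping what cannot be decoded.

    Invalid bytes become a literal backslash-x sequence (e.g. the byte 0xE9
    becomes the four characters \\xe9), so the result is always a valid str.
    """
    # Double existing backslashes first so escaped output stays unambiguous.
    escaped = raw.replace(b"\\", b"\\\\")
    # Assumption: NUL bytes are escaped too, since PostgreSQL text/jsonb
    # values cannot contain them.
    escaped = escaped.replace(b"\x00", b"\\x00")
    # The built-in handler replaces each undecodable byte with \xNN.
    return escaped.decode("utf-8", errors="backslashreplace")


# Usage: a Latin-1 encoded author name survives as an escaped string.
print(decode_with_escape_sketch(b"Ren\xe9 <rene@example.org>"))
```

The same function could be applied to author/committer names and emails, release fields, and branch names, which avoids a separate "raw download" path for short fields.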