Improve UTF8 UnicodeDecodeError handling in JSON conversion layer and update API documentation
Closed, Migrated

Description

When converting the raw bytes of swh objects to a JSON-serializable representation, swh-web catches the UnicodeDecodeError exceptions raised while decoding strings that are expected to be UTF-8 encoded:

  • revision authors and committers: when a person's name or fullname cannot be decoded, a new key named decoding_failures is added to the person dictionary, indicating which fields could not be decoded; the non-UTF-8 strings are then decoded in backslash escape mode, see example
  • revision messages: when a revision message cannot be decoded, a new key named message_decoding_failed is added to the revision dictionary and the message is set to None, see example
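
The two behaviours described above can be sketched as follows. This is only an illustration, not the actual swh-web code: the function names decode_person and decode_revision_message are hypothetical, and the exact values stored under decoding_failures (assumed here to be a list of field names) and message_decoding_failed (assumed here to be True) are assumptions.

```python
def decode_person(person):
    """Decode the bytes fields of a person dictionary (hypothetical helper)."""
    result = {}
    failures = []
    for key, value in person.items():
        if isinstance(value, bytes):
            try:
                result[key] = value.decode("utf-8")
            except UnicodeDecodeError:
                # Record the field and fall back to backslash escape mode.
                failures.append(key)
                result[key] = value.decode("utf-8", "backslashreplace")
        else:
            result[key] = value
    if failures:
        # Assumption: the key holds the list of fields that failed to decode.
        result["decoding_failures"] = failures
    return result


def decode_revision_message(revision):
    """Decode the message of a revision dictionary (hypothetical helper)."""
    result = dict(revision)
    try:
        result["message"] = revision["message"].decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the flag is a boolean; the message itself is dropped.
        result["message_decoding_failed"] = True
        result["message"] = None
    return result
```

Note the asymmetry: a person field that fails to decode keeps an escaped form of its content, while a revision message that fails to decode is lost entirely.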

This handling of UTF-8 decoding errors is inconsistent across object fields and should be made more generic.
Applying the error handler implemented for revision authors globally seems the right way to do it.
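
One possible shape for such a generic handler is sketched below, assuming the person-style behaviour (backslash escape plus a decoding_failures key) is applied to every dictionary in the converted object. The function name to_json_ready and the choice to attach decoding_failures at each nesting level are assumptions made for illustration, not the design that was eventually adopted.

```python
def _decode(value):
    """Return (text, ok); fall back to backslash escape mode on failure."""
    try:
        return value.decode("utf-8"), True
    except UnicodeDecodeError:
        return value.decode("utf-8", "backslashreplace"), False


def to_json_ready(obj):
    """Recursively convert bytes to str (hypothetical generic converter)."""
    if isinstance(obj, dict):
        result, failures = {}, []
        for key, value in obj.items():
            if isinstance(value, bytes):
                text, ok = _decode(value)
                result[key] = text
                if not ok:
                    failures.append(key)
            else:
                result[key] = to_json_ready(value)
        if failures:
            # Same key as the one already used for revision authors.
            result["decoding_failures"] = failures
        return result
    if isinstance(obj, list):
        return [to_json_ready(item) for item in obj]
    if isinstance(obj, bytes):
        return _decode(obj)[0]
    return obj
```

With this approach a revision message that cannot be decoded would be reported through decoding_failures like any other field, and its escaped content would be kept instead of being set to None.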

Once this is done, a new section should be added to the top-level Web API documentation describing the fields related to UTF-8 decoding errors that may appear in JSON responses.