I've extracted some mimetype stats from content_mimetypes:
Some obviously bogus values stand out, most likely due to bugs in the relevant swh-indexer component, e.g.:
$ grep '\[' mimetype-stats.txt
[ [application/octet-stream|27805
[ [ [application/octet-stream|5068
[application/x-archive [ [ [|4496
[application/x-archive [|2843
[application/x-archive [ [|2610
[ [ [ [application/octet-stream|2602
[application/octet-stream|1122
[application/x-archive [application/x-archive|677
[application/x-archive [application/x-archive [ [|245
[application/x-archive [application/x-archive [application/x-archive [application/x-archive|176
[application/x-archive [application/x-archive [|153
[application/x-archive [application/x-archive [application/x-archive|127
[application/x-archive [application/x-archive [application/x-archive [|63
[application/x-archive|53
[application/x-archive [ [ [application/x-archive|50
[application/x-archive [application/x-archive [ [application/x-archive|33
[ [application/x-archive|25
[ [ [ [application/x-archive|25
[ [ [application/x-archive|14
[application/x-archive [ [application/x-archive|3
The pipe '|' here is the separator between the mimetype value and the count of contents with that mimetype. All the '[' characters, however, are part of the mimetype value itself, which looks wrong.
There might be other bogus values in the stats that I haven't noticed.
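To look for other bogus values more systematically than grepping for '[', something along these lines could help. It is only a rough sketch: it assumes every line of mimetype-stats.txt has the `value|count` shape shown above, and it assumes a well-formed value is a bare `type/subtype` pair made of the characters RFC 6838 allows in names. The function name `find_bogus` and the regex are mine, not part of swh-indexer, and the pattern may well be too strict for some legitimate values.

```python
import re

# Assumption: a valid value is "type/subtype", each part starting with an
# alphanumeric character and using only RFC 6838 "restricted name" characters.
NAME = r"[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*"
MIMETYPE_RE = re.compile(rf"^{NAME}/{NAME}$")


def find_bogus(lines):
    """Return the stats lines whose mimetype value does not look well formed.

    Each line is expected to be "<mimetype>|<count>". The split is done on the
    last '|' so that stray characters inside the value do not confuse it.
    """
    bogus = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        value, sep, count = line.rpartition("|")
        if not sep or not count.strip().isdigit() or not MIMETYPE_RE.match(value):
            bogus.append(line)
    return bogus
```

Feeding it the lines of mimetype-stats.txt (for example `find_bogus(open("mimetype-stats.txt"))`) would list every entry that fails the check, including the '[' ones above and anything else that is not a plain `type/subtype`.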