analyze bogus mimetype values in content_mimetype table
Closed, Migrated

Description

I've extracted some mimetype stats from content_mimetype.

Some obviously bogus values stand out, most likely due to bugs in the relevant swh-indexer component, e.g.:

$ grep '\[' mimetype-stats.txt
[ [application/octet-stream|27805
[ [ [application/octet-stream|5068
[application/x-archive [ [ [|4496
[application/x-archive [|2843
[application/x-archive [ [|2610
[ [ [ [application/octet-stream|2602
[application/octet-stream|1122
[application/x-archive [application/x-archive|677
[application/x-archive [application/x-archive [ [|245
[application/x-archive [application/x-archive [application/x-archive [application/x-archive|176
[application/x-archive [application/x-archive [|153
[application/x-archive [application/x-archive [application/x-archive|127
[application/x-archive [application/x-archive [application/x-archive [|63
[application/x-archive|53
[application/x-archive [ [ [application/x-archive|50
[application/x-archive [application/x-archive [ [application/x-archive|33
[ [application/x-archive|25
[ [ [ [application/x-archive|25
[ [ [application/x-archive|14
[application/x-archive [ [application/x-archive|3

The pipe '|' here is the separator between the mimetype value and the count of contents with that mimetype. All the '[' characters, however, are part of the mimetype value, and they look wrong.
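To make that check reproducible, here is a minimal sketch that splits each stats line on its last '|' and flags values that do not look like a plain type/subtype pair. The function names and the regular expression are assumptions made for illustration (a simplified reading of the media type grammar), not code from swh-indexer.

```python
import re

# Simplified "type/subtype" pattern (assumption: good enough to spot the
# '['-prefixed and empty values; not a full media type validator).
MIMETYPE_RE = re.compile(
    r"^[a-z0-9][a-z0-9!#$&^_.+-]*/[a-z0-9][a-z0-9!#$&^_.+-]*$", re.I
)


def parse_stat_line(line):
    """Split 'mimetype|count' on the last pipe; the mimetype may be empty."""
    mimetype, _, count = line.rstrip("\n").rpartition("|")
    return mimetype, int(count)


def is_bogus(mimetype):
    """Return True when the value is empty or not a plain type/subtype pair."""
    return MIMETYPE_RE.match(mimetype) is None


# The first line is taken from the stats above; the last one is a made-up
# well-formed value, added only for contrast.
for line in ["[ [application/octet-stream|27805", "text/plain|1"]:
    mimetype, count = parse_stat_line(line)
    print(repr(mimetype), count, "bogus" if is_bogus(mimetype) else "ok")
```

Splitting on the last pipe keeps the sketch correct even if a bogus value happened to contain a '|' itself.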

There might be other bogus values in the stats that I haven't noticed.

Event Timeline

FTR, the query I've used to generate the stats is:


(the encoding there is needed due to T818)

Off the top of my head, I would say that I forgot to clean up those bogus values after the initial runs around December 2016.
I don't see how I can easily check this, though, since we don't have the sha1 provenance yet.

...

Oh no, it's still ongoing.

I picked one sample value at random:
[ [ [application/x-archive|14, which is 14 occurrences in your snapshot.
I checked in the db and I have 15 results now.

> select id, convert_from(mimetype, 'utf-8'), convert_from(encoding, 'utf-8') as mt from content_mimetype where mimetype='[ [ [application/x-archive' ;

                     id                     |        convert_from        |   mt
--------------------------------------------+----------------------------+--------
 \x05f9383db2ed85942c3f5ade8693bed782d677ad | [ [ [application/x-archive | binary
 \x04cbacc138da81e9e8b299bcfed7e4e5afd57b11 | [ [ [application/x-archive | binary
 \x03a3a7aeb55bc242ed1aeb388718c6c0f703d2aa | [ [ [application/x-archive | binary
 \x06379c7aa4dc351d3a876650f5ba3f7bf718b3cf | [ [ [application/x-archive | binary
 \x05fa43b3e91ffea6843046359c3990f78b79f5a4 | [ [ [application/x-archive | binary
 \x0352e12ed2b91b614fa1df4d9ff3770a993e7e0d | [ [ [application/x-archive | binary
 \x065e5445c8a1cb4c3daf1643e9128e7ab3085995 | [ [ [application/x-archive | binary
 \xf53c5436ed35812baf9ec400195ba29a10591b12 | [ [ [application/x-archive | binary
 \x977d8aba80077635734aea1eb4a43bce0bf298c0 | [ [ [application/x-archive | binary
 \x84beea1d16119f180e590aaed9c1bbb8c5d82e9b | [ [ [application/x-archive | binary
 \xa2802ae983a3d9f5866a6c94d5d8ec608f6141de | [ [ [application/x-archive | binary
 \x92ba592f3baa955d5002485a3986b0c99a90f0a3 | [ [ [application/x-archive | binary
 \x6ea705b4e7a68c28b8ff8e4162e641822d3f73d7 | [ [ [application/x-archive | binary
 \xd894fd820572ff01c7f49a4509867ffcd24db0f3 | [ [ [application/x-archive | binary
 \x1cd355dd999a3d7233e73c9e2013e91595d537c5 | [ [ [application/x-archive | binary

Note: in that case, it's not an issue for the remaining indexers, since it's a binary mimetype anyway (so it's not scheduled further).
But: 1. it's factually wrong, so it needs fixing regardless; 2. it may not hold for all the other bogus values.

I'll take this as a hint to migrate away from plain file-based detection and output parsing.
I'll use the python3-magic library (which we already use in swh-web).

I'll list those bogus values and reschedule them once the fix is done and deployed.

ardumont renamed this task from bogus mimetype values in content_mimetypes table to analyze bogus mimetype values in content_mimetypes table. Nov 15 2017, 11:38 AM

> I don't see how I can easily check this, though, since we don't have the sha1 provenance yet.

Well, you could write tests for it (yes, speaking to myself :)

> There might be other bogus values in the stats that I haven't noticed.

The only additional bogus value I have seen is the empty mimetype.

$ grep '^|' mimetype-stats.txt
|95

That gives 48285 sha1s to reschedule.
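The 48285 figure can be checked by adding the counts of the '['-prefixed values listed in the description to the 95 empty ones. A small sketch, with the counts copied from the two grep outputs (the variable names are mine):

```python
# Counts of the '['-prefixed values, in the order of the grep output
# in the description.
bracket_counts = [
    27805, 5068, 4496, 2843, 2610, 2602, 1122, 677, 245, 176,
    153, 127, 63, 53, 50, 33, 25, 25, 14, 3,
]
empty_count = 95  # from: grep '^|' mimetype-stats.txt

total = sum(bracket_counts) + empty_count
print(total)  # 48285
```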

ardumont renamed this task from analyze bogus mimetype values in content_mimetypes table to analyze bogus mimetype values in content_mimetype table. Nov 15 2017, 4:24 PM

I am waiting for the queue to drop to 10000, as that will avoid rescheduling the 10000 already done (well, except for the new bogus values :)

In the meantime, I added an index on the table content_mimetype to ease future listing and cleanup.

create index concurrently content_mimetype_encoding on content_mimetype(mimetype);

And now I can list those bogus values fast:

softwareheritage=> explain select count(distinct id) from content_mimetype where mimetype LIKE '[%' or mimetype like '';
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=506166.69..506166.70 rows=1 width=8)
   ->  Bitmap Heap Scan on content_mimetype  (cost=4786.03..505422.79 rows=297562 width=21)
         Recheck Cond: ((mimetype ~~ '\x5b25'::bytea) OR (mimetype ~~ '\x'::bytea))
         Filter: ((mimetype ~~ '\x5b25'::bytea) OR (mimetype ~~ '\x'::bytea))
         ->  BitmapOr  (cost=4786.03..4786.03 rows=297927 width=0)
               ->  Bitmap Index Scan on content_mimetype_encoding  (cost=0.00..57.19 rows=3062 width=0)
                     Index Cond: ((mimetype >= '\x5b'::bytea) AND (mimetype < '\x5c'::bytea))
               ->  Bitmap Index Scan on content_mimetype_encoding  (cost=0.00..4580.06 rows=294865 width=0)
                     Index Cond: (mimetype = '\x'::bytea)
(9 rows)

softwareheritage=> select count(distinct id) from content_mimetype where mimetype LIKE '[%' or mimetype like '';
 count
-------
 50716
(1 row)

This roughly matches the order of magnitude (it also confirms that this is a real bug, since the count continues to grow \m/).
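The Index Cond lines in the plan show why the new index helps: the prefix match LIKE '[%' is answered as a byte range, >= '\x5b' and < '\x5c'. The sketch below illustrates that equivalence in Python. `prefix_range` is a hypothetical helper written for this comment, not PostgreSQL code, and it assumes the prefix is non-empty and its last byte is below 0xff.

```python
def prefix_range(prefix: bytes):
    """Return (low, high) such that low <= v < high iff v starts with prefix.

    Assumption: prefix is non-empty and its last byte is not 0xff.
    """
    if not prefix or prefix[-1] == 0xFF:
        raise ValueError("unsupported prefix for this sketch")
    # Increment the last byte to get the exclusive upper bound.
    high = prefix[:-1] + bytes([prefix[-1] + 1])
    return prefix, high


low, high = prefix_range(b"[")
print(low, high)  # b'[' b'\\'  (bytes 0x5b and 0x5c)

samples = [b"[ [application/octet-stream", b"application/x-archive", b""]
for value in samples:
    # The range test and the prefix test agree for every sample.
    assert (low <= value < high) == value.startswith(b"[")
```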

...

I'm still waiting for the queue to drop...

Status:

  • Final listing of bogus values: /srv/storage/space/lists/indexer/mimetype/sha1-with-bogus-values.txt.gz (50733)
  • Queue reached a sane point.
  • Workers stopped.
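For the rescheduling step, the listing can be streamed in batches rather than loaded whole. This is only a sketch under stated assumptions: that the file holds one hex-encoded sha1 per line, that a batch size of 1000 is acceptable (an arbitrary choice), and `read_sha1_batches` is a hypothetical helper. The call that actually sends a batch to the indexer queue is not shown, because it is not described in this task.

```python
import gzip
from itertools import islice


def read_sha1_batches(path, batch_size=1000):
    """Yield lists of sha1s (as bytes) read from a gzipped text file.

    Assumption: one hex-encoded sha1 per line; blank lines are skipped.
    """
    with gzip.open(path, "rt") as f:
        sha1s = (bytes.fromhex(line.strip()) for line in f if line.strip())
        while True:
            batch = list(islice(sha1s, batch_size))
            if not batch:
                return
            yield batch
```

Each yielded batch could then be handed to whatever scheduling call is used for the mimetype indexer.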
ardumont claimed this task.