
reschedule indexing of contents with bogus mimetype values
Closed, Migrated, Edits Locked

Description

As explained in the parent task, the mimetype indexer has stored bogus values.
Once the fix (T849) has been deployed, we need to list the affected contents and schedule them for indexing again.

Event Timeline

ardumont renamed this task from Schedule back bogus mimetype values for indexing to Schedule back bogus mimetype values for indexation. Nov 15 2017, 11:42 AM
ardumont created this task.
zack renamed this task from Schedule back bogus mimetype values for indexation to reschedule indexing of contents with bogus mimetype values. Nov 15 2017, 12:25 PM

Depends on T761

One worker (worker08.euwest.azure) has been migrated so it's working alone for now.

ardumont changed the task status from Open to Work in Progress. Nov 22 2017, 4:01 PM

The old tool is id 7, the new one is 9:

softwareheritage=> select * from indexer_configuration where id in (7, 9);
 id | tool_name |  tool_version   |                   tool_configuration
----+-----------+-----------------+--------------------------------------------------------
  7 | file      | 5.22            | {"command_line": "file --mime <filepath>"}
  9 | file      | 1:5.30-1+deb9u1 | {"type": "library", "debian-package": "python3-magic"}
(2 rows)

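Count of the old, bogus values (as written, AND binds tighter than OR, so the tool id filter applies only to the empty-mimetype branch):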

softwareheritage=> select count(*) from content_mimetype where mimetype LIKE '[%' or mimetype like '' and indexer_configuration_id=7;
 count
-------
 50733
(1 row)
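
The count query above has no parentheses around the OR, while the later check for tool 9 does. A minimal sketch of how the two forms can differ, using an in-memory SQLite table rather than the real PostgreSQL database. The rows are made up for illustration, and the text columns stand in for the real table's types, which this sketch assumes are different. Whether the missing parentheses changed the count of 50733 cannot be told from this task.

```python
import sqlite3

# In-memory stand-in for content_mimetype; all rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content_mimetype ("
    "id TEXT, mimetype TEXT, indexer_configuration_id INTEGER)"
)
rows = [
    ("a", "[ [application/octet-stream", 7),  # bogus, old tool
    ("b", "", 7),                             # empty, old tool
    ("c", "text/plain", 7),                   # valid, old tool
    ("d", "[text/plain", 9),                  # hypothetical bogus row, new tool
    ("e", "application/x-mach-binary", 9),    # valid, new tool
]
conn.executemany("INSERT INTO content_mimetype VALUES (?, ?, ?)", rows)

# Same shape as the query in the task: parsed as A OR (B AND tool = 7).
unparenthesized = conn.execute(
    "SELECT count(*) FROM content_mimetype "
    "WHERE mimetype LIKE '[%' OR mimetype LIKE '' "
    "AND indexer_configuration_id = 7"
).fetchone()[0]

# Parenthesized form: (A OR B) AND tool = 7.
parenthesized = conn.execute(
    "SELECT count(*) FROM content_mimetype "
    "WHERE (mimetype LIKE '[%' OR mimetype LIKE '') "
    "AND indexer_configuration_id = 7"
).fetchone()[0]

print(unparenthesized, parenthesized)
```

With these invented rows, the unparenthesized query also counts row "d", which belongs to tool 9, so the two counts differ.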

Those ids have been listed and scheduled back, and they have now been reindexed.
Checking for bogus values produced by the new tool (id 9) returns nothing:

softwareheritage=> select count(*) from content_mimetype where (mimetype LIKE '[%' or mimetype like '') and indexer_configuration_id=9;
 count
-------
     0
(1 row)
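
The task does not show how the ids were listed or handed to the scheduler. As a rough sketch only, assuming the ids were exported from psql in hex format and sent in fixed-size batches, the preparation step could look like this. `parse_psql_bytea`, `batches` and the batch size are hypothetical names and values, not part of the Software Heritage code base; the two ids are the ones inspected in this task.

```python
from typing import Iterable, Iterator, List


def parse_psql_bytea(hex_id: str) -> bytes:
    """Convert a psql hex-format literal such as '\\x8fea...' to raw bytes."""
    if not hex_id.startswith("\\x"):
        raise ValueError(f"not a hex bytea literal: {hex_id!r}")
    return bytes.fromhex(hex_id[2:])


def batches(ids: Iterable[bytes], size: int) -> Iterator[List[bytes]]:
    """Yield consecutive lists of at most `size` ids."""
    batch: List[bytes] = []
    for content_id in ids:
        batch.append(content_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # last, possibly shorter, batch
        yield batch


# The two ids inspected in this task, as psql prints them.
exported = [
    r"\x8feab4fd3881e396012724e166801bb3a4b41419",
    r"\xcd0187768974258b2e959320f52137389b020bce",
]
ids = [parse_psql_bytea(h) for h in exported]

# Hypothetical batch size of 1, just to show the grouping.
groups = list(batches(ids, size=1))
```

Each group would then be passed to whatever scheduling call is in use, which is not shown here.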

Checking some ids that had bogus values, there are indeed 2 rows for each (one from the old tool, which is bogus, and one from the new tool, which is not):

softwareheritage=> select convert_from(mimetype, 'utf-8'), convert_from(encoding, 'utf-8'), indexer_configuration_id from content_mimetype where id='\x8feab4fd3881e396012724e166801bb3a4b41419';
        convert_from         | convert_from | indexer_configuration_id
-----------------------------+--------------+--------------------------
 [ [application/octet-stream | binary       |                        7
 application/x-mach-binary   | binary       |                        9
(2 rows)

softwareheritage=> select convert_from(mimetype, 'utf-8'), convert_from(encoding, 'utf-8'), indexer_configuration_id from content_mimetype where id='\xcd0187768974258b2e959320f52137389b020bce';
         convert_from          | convert_from | indexer_configuration_id
-------------------------------+--------------+--------------------------
 [ [ [application/octet-stream | binary       |                        7
 application/x-mach-binary     | binary       |                        9
(2 rows)
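
The same check can be expressed outside the database. This is a small sketch that mirrors the SQL filter used in this task (`LIKE '[%'` or `LIKE ''`) on the decoded values shown above. `is_bogus_mimetype` is a hypothetical helper written for illustration, not an existing indexer function.

```python
def is_bogus_mimetype(mimetype: bytes) -> bool:
    """Mirror the SQL filter: mimetype LIKE '[%' OR mimetype LIKE ''."""
    return mimetype == b"" or mimetype.startswith(b"[")


# (mimetype, indexer_configuration_id) pairs copied from the results above.
rows = [
    (b"[ [application/octet-stream", 7),
    (b"application/x-mach-binary", 9),
    (b"[ [ [application/octet-stream", 7),
    (b"application/x-mach-binary", 9),
]

# Tool ids that produced at least one bogus value in this sample.
bogus_tools = {tool for mimetype, tool in rows if is_bogus_mimetype(mimetype)}
print(bogus_tools)
```

On this sample only tool 7 shows up, which matches the zero count obtained for tool 9.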