Indexers: compute (and maintain up-to-date) the filetype of all blobs
Closed, Migrated

Description

We want metadata in the DB that associates each blob with its intrinsic filetype.

As a first approximation, the filetype could be encoded as a MIME type and computed using file --mime-type.
Recording the detected encoding as well (as reported by file --mime-encoding) would be useful and would help the webapp quite a bit.
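
As a rough illustration of the approach described above, here is a minimal Python sketch that shells out to the file utility once for the MIME type and once for the encoding. The function names (run_file, detect_filetype), the result keys and the injectable runner parameter are hypothetical and are not taken from any existing indexer code. The sketch also assumes the blob is available as a file on disk.

```python
import subprocess
from typing import Callable, Dict, List


def run_file(args: List[str]) -> str:
    """Run the `file` utility in brief mode and return its stripped output."""
    result = subprocess.run(
        ["file", "--brief", *args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if `file` fails
    )
    return result.stdout.strip()


def detect_filetype(
    path: str,
    runner: Callable[[List[str]], str] = run_file,  # hypothetical hook, eases testing
) -> Dict[str, str]:
    """Return the MIME type and encoding that `file` detects for one blob.

    The dictionary keys are illustrative only, not an actual DB schema.
    """
    return {
        # "--" stops option parsing, so paths starting with "-" are safe
        "mimetype": runner(["--mime-type", "--", path]),
        "encoding": runner(["--mime-encoding", "--", path]),
    }
```

Calling detect_filetype("/path/to/blob") returns a dictionary with the two detected values, which an indexer could then store alongside the blob's identifier. Keeping the two invocations separate means each output can be stored as-is, without parsing a combined string.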

More advanced and structured information could be detected using other tools, some of which are summarized in LWN.net's article "File-format analysis tools for archivists".

Event Timeline

ardumont renamed this task from "compute (and maintain up-to-date) the filetype of all blobs" to "Indexers: compute (and maintain up-to-date) the filetype of all blobs". Oct 5 2018, 2:47 PM
ardumont claimed this task.