Indexers: compute (and maintain up-to-date) the filetype of all blobs
Closed, Migrated
Edits Locked

Assigned To

Authored By

	zack
	Jun 13 2016, 4:06 PM

Description

We want metadata in the DB that associates each blob with its intrinsic filetype.

As a first approximation, the filetype might be encoded as a MIME type and computed using file --mime-type.
Also having the detected encoding (as per file --mime-encoding) would be nice and would help the webapp quite a bit.
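
A minimal sketch of the first approximation described above, assuming the file(1) utility is installed and that blobs are available as in-memory bytes. The function names and the "mimetype" / "encoding" keys are hypothetical, chosen for illustration only; this is not the indexer's actual implementation.

```python
import shutil
import subprocess


def file_command(flag: str) -> list[str]:
    """Build a file(1) invocation that reads the blob from standard input."""
    # --brief drops the leading "filename:" prefix; "-" tells file to read stdin.
    return ["file", "--brief", flag, "-"]


def detect_filetype(blob: bytes) -> dict[str, str]:
    """Return the MIME type and encoding that file(1) reports for a blob.

    Hypothetical helper: the result keys are illustrative, not a real schema.
    """
    if shutil.which("file") is None:
        raise RuntimeError("the file(1) utility is not installed")
    result = {}
    for key, flag in (("mimetype", "--mime-type"), ("encoding", "--mime-encoding")):
        # One call per flag, mirroring the two commands named in the description.
        proc = subprocess.run(
            file_command(flag), input=blob, capture_output=True, check=True
        )
        result[key] = proc.stdout.decode("ascii", "replace").strip()
    return result
```

Calling detect_filetype(blob) on each blob yields a small record that could be stored next to the blob identifier. Running file once per flag keeps the sketch close to the two commands mentioned above, at the cost of spawning two processes per blob.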

More advanced and structured information could be detected using other tools, some of which are summarized in the LWN.net article "File-format analysis tools for archivists".

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T439 Indexers: compute (and maintain up-to-date) the filetype of all blobs
Migrated	gitlab-migration	T359 Indexers: batch content analyzer infrastructure
Migrated	gitlab-migration	T1227 General improvements of the indexer: Schedule indexer tasks
Migrated	gitlab-migration	T1229 Indexers: Make orchestrators use swh-scheduler for scheduling
Migrated	gitlab-migration	T1290 Indexers: Use swh.scheduler instead of directly relying on Celery
Migrated	gitlab-migration	T1230 Indexers: Improve readme to be more explicit on how to run locally
Migrated	gitlab-migration	T1310 Simplify indexer design: move away from the pipeline approach
Migrated	gitlab-migration	T1311 indexer: Remove orchestrators
Migrated	gitlab-migration	T1312 indexer: Adapt textual content indexer to actually filter textual content themselves
Migrated	gitlab-migration	T1324 Deploy metadata indexers in production
Migrated	gitlab-migration	T1326 metadata indexer: Deploy origin head
Migrated	gitlab-migration	T991 Indexers: Send range of ids instead of list of ids
Migrated	gitlab-migration	T1375 Deploy revision metadata indexer
Migrated	gitlab-migration	T1376 Deploy origin indexer
Migrated	gitlab-migration	T1374 content indexer: Determine the identifier ranges to use to schedule those
Migrated	gitlab-migration	T818 indexer DB should not use bytea for mimetype and encoding columns

Event Timeline

zack created this task. Jun 13 2016, 4:06 PM

zack added a subtask: T359: Indexers: batch content analyzer infrastructure.

ardumont renamed this task from compute (and maintain up-to-date) the filetype of all blobs to Indexers: compute (and maintain up-to-date) the filetype of all blobs. Oct 5 2018, 2:47 PM

zack edited projects, added Indexer; removed Developers. Oct 18 2018, 9:17 PM

ardumont closed subtask T359: Indexers: batch content analyzer infrastructure as Resolved. Jan 15 2019, 2:44 PM

ardumont closed this task as Resolved. Jan 15 2019, 2:47 PM

ardumont claimed this task.

This task has been migrated to GitLab.
