Guess extension from detected mime type to add to filename
Abandoned, Public

Authored by anlambert on Apr 9 2019, 7:48 AM.
This revision can not be accepted until the required legal agreements have been signed.

Details

Summary

Partial workaround for T1167 to enable downloading raw contents with file extensions under the /content/raw/ endpoint.

If the detected MIME type matches "text/*" but is not "text/plain", which suggests that the programming language of the content may have been detected, guess the file extension using the mimetypes library and append it to the filename.

This way, users who download raw contents using "Save Page As" get a file with an extension, making it easier to inspect the content locally.

Limitations include mime types guessed wrong or not detected.

Future work plan: get content's filename by the hash value.
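
The approach described in this summary can be sketched as follows. This is a minimal illustration and not the code of the diff itself: the function name `add_guessed_extension` is hypothetical, and the only library call it relies on is `mimetypes.guess_extension` from the Python standard library.

```python
import mimetypes


def add_guessed_extension(filename, mime_type):
    """Append an extension guessed from mime_type to filename.

    Hypothetical helper: only "text/*" types other than "text/plain"
    are considered, as described in the summary.
    """
    if not mime_type.startswith("text/") or mime_type == "text/plain":
        return filename
    # guess_extension returns None when the type is unknown.
    extension = mimetypes.guess_extension(mime_type)
    if extension is None:
        return filename
    return filename + extension
```

The mapping from MIME type to extension depends on the tables known to `mimetypes`, so results can vary between systems, and a wrongly detected MIME type still yields a wrong extension.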

Test Plan

Edit the test cases to match the changed raw content filename proposed in this diff.

Diff Detail

Repository: rDWAPPS Web applications
Branch: guess-extension-from-mime-type
Lint: No Linters Available
Unit: No Unit Test Coverage
Build Status: Buildable 5209
Build 7036: tox-on-jenkins (Jenkins)
Build 7035: arc lint + arc unit

Event Timeline

timokratia created this revision. Apr 9 2019, 7:48 AM

I am not really convinced by this as there is no guarantee that the selected extension will be the right one.
As you said:

Limitations include mime types guessed wrong or not detected.

Nevertheless, the filename is available when browsing a content in an origin context
(see https://archive.softwareheritage.org/browse/origin/https://github.com/git/git/content/git.c/ for instance,
when you click on the Raw button then execute a "Save Page As" operation, the original filename will
be used to save the file to disk).

When browsing a content without origin context (for instance https://archive.softwareheritage.org/browse/content/sha1_git:2014aab6b83c61695d50ac39a18864a8d77858e0/)
and one wants to save its raw bytes, it is up to the user to save it to an adequate filename.

Future work plan: get content's filename by the hash value.

The issue here is that numerous filenames can correspond to a given hash, so this is not a one-to-one lookup.
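
To illustrate the point: the `sha1_git` of a content is computed from its bytes only (it is Git's blob hash), so identical files stored under different names share a single hash. The sketch below uses made-up paths and contents; only the hashing scheme itself is factual.

```python
import hashlib
from collections import defaultdict


def sha1_git(data: bytes) -> str:
    """Git blob hash: SHA-1 over a "blob <length>" header, a NUL byte, then the bytes."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# Made-up (path, content) pairs: the first two files have identical bytes.
files = [
    ("project-a/LICENSE", b"some license text\n"),
    ("project-b/COPYING", b"some license text\n"),
    ("project-a/README", b"hello\n"),
]

# One hash can map to several filenames, hence a set per hash.
filenames_by_hash = defaultdict(set)
for path, data in files:
    filenames_by_hash[sha1_git(data)].add(path)

for digest, paths in filenames_by_hash.items():
    print(digest, sorted(paths))
```

Looking up a filename from a hash therefore returns a set of candidates, and choosing one requires extra context such as the origin or directory.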

I will not accept this diff, so you can close it.

anlambert commandeered this revision. Apr 9 2019, 10:10 PM
anlambert abandoned this revision.
anlambert added a reviewer: timokratia.

Hi anlambert,

Thanks for reviewing this diff, and apologies for my late reply and for not closing it; I could not find out how to close it when I first saw your message.

After submitting this diff, I read the paper about SWH's identifiers for my digital preservation course paper. I found that file names are considered contextual information, and that a content file can appear in multiple directories while keeping the same hash value/identifier.

Although it is not a required feature for the identifiers, providing file names when accessing contents via hash value can be useful in certain use cases, such as making it easier to download and examine a content file. Do you think deriving file extensions or file names is a worthwhile feature to implement? I would like to look into this in the summer, whether or not I am admitted into GSoC. Thank you for reading this post, and I look forward to your feedback!

anlambert added a comment. Edited Wed, Apr 24, 11:19 AM

Hi @timokratia,

Currently, you can pass a filename as a query parameter to the content view of swh-web
(see https://archive.softwareheritage.org/browse/content/sha1_git:d0158ee2e79b461bf25b5b66e6778671c2114263/?path=xmlrpc.php
as an example).

Nevertheless, it would indeed be of interest to add a filename as a new optional part of SWH identifiers.
Currently, only the origin info is handled as optional information for a persistent identifier.

With the example above, we could derive the following identifier:
swh:1:cnt:d0158ee2e79b461bf25b5b66e6778671c2114263;filename=xmlrpc.php
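
A rough sketch of how such an identifier could be split into its core part and its optional qualifiers. The `filename` qualifier is only the proposal made in this comment, `parse_identifier` is a hypothetical helper, and the example string assumes the `cnt` object type that SWH identifiers use for contents.

```python
def parse_identifier(identifier):
    """Split "core;key=value;..." into (core, {key: value})."""
    core, *qualifiers = identifier.split(";")
    parsed = {}
    for qualifier in qualifiers:
        # partition keeps everything after the first "=" as the value.
        key, _, value = qualifier.partition("=")
        parsed[key] = value
    return core, parsed


core, qualifiers = parse_identifier(
    "swh:1:cnt:d0158ee2e79b461bf25b5b66e6778671c2114263;filename=xmlrpc.php"
)
print(core)
print(qualifiers.get("filename"))
```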

It is then up to the web application to redirect to an adequate view taking the filename into account.
One advantage I see in providing the filename is that code highlighting will be better.
When no filename is provided, the language to highlight is selected from the content mime type,
but using the file extension often leads to a better result.
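
The selection order described here could look like the sketch below. The two lookup tables and the function name are illustrative assumptions, not swh-web's actual data or code.

```python
import os

# Illustrative tables only; a real application would need far larger ones.
LANGUAGE_BY_EXTENSION = {".php": "php", ".py": "python", ".c": "c"}
LANGUAGE_BY_MIME_TYPE = {
    "text/x-php": "php",
    "text/x-python": "python",
    "text/x-c": "c",
}


def pick_highlight_language(mime_type, filename=None):
    """Prefer the file extension; fall back to the MIME type."""
    if filename:
        extension = os.path.splitext(filename)[1].lower()
        if extension in LANGUAGE_BY_EXTENSION:
            return LANGUAGE_BY_EXTENSION[extension]
    # Returns None when neither source identifies a language.
    return LANGUAGE_BY_MIME_TYPE.get(mime_type)
```

With these tables, a content detected as "text/plain" but named xmlrpc.php would still be highlighted as PHP.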

I have created T1687 to keep track of the proposal.
