Differential D1369

Guess extension from detected mime type to add to filename
Abandoned · Public

Authored by anlambert on Apr 9 2019, 7:48 AM.

Details

Reviewers

timokratia

Group Reviewers

Reviewers

Required Signatures

L3 Software Heritage Contributor License Agreement, version 1.0

Summary

Partial workaround for T1167 to enable downloading raw contents with file extensions from the /content/raw/ endpoint.

If the detected mime type is "text/*" but not "text/plain", which suggests that the programming language of the content may have been detected, guess the file extension using the mimetypes library and append it to the filename.

This way, users who download raw contents using "Save Page As" get a file with an extension, making it easier to inspect the content locally.

Limitations include mime types guessed wrong or not detected.
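
As a rough illustration of the approach described in this summary (not the code of this diff, which is hidden), a minimal sketch using the standard mimetypes module could look like the following. The function name guess_raw_filename and its parameters are hypothetical.

```python
import mimetypes
from typing import Optional


def guess_raw_filename(base_name: str, mime_type: Optional[str]) -> str:
    """Return base_name, with a guessed extension appended when one is plausible.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    # Only act on "text/*" types other than "text/plain", as the summary describes.
    if not mime_type or not mime_type.startswith("text/") or mime_type == "text/plain":
        return base_name
    # guess_extension returns None when the mime type is unknown to the library.
    extension = mimetypes.guess_extension(mime_type)
    if extension is None:
        return base_name
    return base_name + extension
```

The stated limitations show up directly: an undetected or unknown mime type leaves the filename unchanged, and a wrongly detected one yields a misleading extension.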

Future work plan: get content's filename by the hash value.

Test Plan

Edit test cases to match the changed raw content filename proposed in this diff.

Diff Detail

Repository

rDWAPPS Web applications

Branch

guess-extension-from-mime-type

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 5209
Build 7036: tox-on-jenkins	Jenkins
Build 7035: arc lint + arc unit

Event Timeline

timokratia created this revision. Apr 9 2019, 7:48 AM

Herald added a reviewer: Reviewers. Apr 9 2019, 7:48 AM

Herald added a required legal document: L3 Software Heritage Contributor License Agreement, version 1.0.

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tox/378/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tox/378/console

Harbormaster failed remote builds in B5209: Diff 4401! Apr 9 2019, 7:51 AM

I am not really convinced by this as there is no guarantee that the selected extension will be the right one.
As you said:

Limitations include mime types guessed wrong or not detected.

Nevertheless, the filename is available when browsing a content in an origin context
(see https://archive.softwareheritage.org/browse/origin/https://github.com/git/git/content/git.c/ for instance,
when you click on the Raw button then execute a "Save Page As" operation, the original filename will
be used to save the file to disk).

When browsing a content without origin context (for instance https://archive.softwareheritage.org/browse/content/sha1_git:2014aab6b83c61695d50ac39a18864a8d77858e0/)
and one wants to save its raw bytes, it is up to the user to save it to an adequate filename.

Future work plan: get content's filename by the hash value.

The issue here is that numerous filenames can correspond to a given hash, so this is not a one-to-one lookup.
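
To make the point concrete, here is a small sketch with made-up directory entries. It computes the Git blob hash (the sha1_git used in the browse URLs above) and shows two different filenames resolving to the same hash. The entries and the filenames_by_hash mapping are purely illustrative, not archive data.

```python
import hashlib
from collections import defaultdict


def sha1_git(data: bytes) -> str:
    """Compute the Git blob identifier of raw bytes."""
    # Git hashes the header "blob <size>\0" followed by the content itself.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# Hypothetical directory entries: (path, raw bytes).
entries = [
    ("project-a/LICENSE", b"Example license text\n"),
    ("project-b/COPYING", b"Example license text\n"),
    ("project-b/README", b"Something else\n"),
]

# Group the final path component of each entry under the hash of its bytes.
filenames_by_hash = defaultdict(set)
for path, data in entries:
    filenames_by_hash[sha1_git(data)].add(path.rsplit("/", 1)[-1])
```

Here the hash of the license text maps to both LICENSE and COPYING, so a lookup by hash alone cannot pick a single filename.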

I will not accept that Diff so you can close it.

anlambert commandeered this revision. Apr 9 2019, 10:10 PM

anlambert abandoned this revision.

anlambert added a reviewer: timokratia.

Hi anlambert,

Thanks for reviewing this diff, and apologies for my late reply and for not closing it. I couldn't find out how to close it when I first saw your message.

After submitting this diff, I read the paper about SWH's identifiers for my digital preservation course paper. I found that file names are considered contextual information and are thus context dependent, and that a content file can appear in multiple directories while having the same hash value/identifier.

Although it's not a required feature for the identifiers, providing file names when accessing contents via hash value can be useful in certain use cases, such as making it easier to download and examine a content file. Do you think deriving file extensions or file names is a worthwhile feature to implement? I would like to look into this in the summer, whether or not I am admitted into GSoC. Thank you for reading this post, and I look forward to your feedback!

Hi @timokratia,

Currently, you can pass a filename as a query parameter to the content view of swh-web
(see https://archive.softwareheritage.org/browse/content/sha1_git:d0158ee2e79b461bf25b5b66e6778671c2114263/?path=xmlrpc.php
as an example).
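
For illustration, such a URL can be assembled as in the sketch below. It follows the URL pattern of the example above; the helper name content_browse_url is hypothetical and not part of swh-web.

```python
from urllib.parse import urlencode

ARCHIVE_BASE = "https://archive.softwareheritage.org"


def content_browse_url(sha1_git: str, path: str = "") -> str:
    """Build a content browse URL, optionally carrying a filename.

    Hypothetical helper mirroring the URL pattern shown in the example above.
    """
    url = f"{ARCHIVE_BASE}/browse/content/sha1_git:{sha1_git}/"
    if path:
        # The filename travels as the "path" query parameter.
        url += "?" + urlencode({"path": path})
    return url
```

Calling content_browse_url("d0158ee2e79b461bf25b5b66e6778671c2114263", "xmlrpc.php") reproduces the example link.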

Nevertheless, it would indeed be of interest to add a filename as a new optional part of SWH's identifiers.
Currently, only the origin info is handled as optional information for a persistent identifier.

With the example above, we could derive the following identifier:
swh:1:d0158ee2e79b461bf25b5b66e6778671c2114263;filename=xmlrpc.php

It is then up to the web application to redirect to an adequate view that takes the filename into account.
One advantage I see in providing the filename is that code highlighting will be better.
When no filename is provided, the language to highlight is selected from the content mime type,
but using the file extension often leads to better results.
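
A minimal sketch of how a web application might handle such an identifier, assuming the ";filename=" qualifier syntax proposed above. The functions split_qualifiers and highlight_hint are hypothetical and not part of swh-web; the sketch only decides which piece of information to hand to a highlighter.

```python
import os
from typing import Dict, Tuple


def split_qualifiers(identifier: str) -> Tuple[str, Dict[str, str]]:
    """Split an identifier into its core part and its ";key=value" qualifiers."""
    core, *raw_qualifiers = identifier.split(";")
    qualifiers = {}
    for item in raw_qualifiers:
        key, sep, value = item.partition("=")
        if sep:  # ignore malformed qualifiers that lack "="
            qualifiers[key] = value
    return core, qualifiers


def highlight_hint(identifier: str, mime_type: str) -> Tuple[str, str]:
    """Prefer the filename extension for highlighting; fall back to the mime type."""
    _, qualifiers = split_qualifiers(identifier)
    extension = os.path.splitext(qualifiers.get("filename", ""))[1]
    if extension:
        return ("extension", extension)
    return ("mime", mime_type)
```

With the identifier above, the hint is based on the ".php" extension. Without a filename qualifier, or with an extensionless name such as Makefile, the mime type is used instead.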

I have created T1687 to keep track of the proposal.

anlambert mentioned this in T1687: Add filename as an optional part in persistent identifiers. Apr 24 2019, 11:24 AM

Content Hidden

The content of this revision is hidden until the author has signed all of the required legal agreements.
