Page MenuHomeSoftware Heritage

assets: XSS filtering improvements

Authored by anlambert on Apr 18 2019, 4:58 PM.


  • put related code in a dedicated file
  • use a XSS filtering hook to fix some image relative src urls included in README HTML rendering (load image bytes from the archive content if available)
  • remove previoulsy introduced hacks in Python code as correct image loading in README HTML rendering is now handled client-side by the feature described above

Related T1641

Diff Detail

rDWAPPS Web applications
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

anlambert created this revision.Apr 18 2019, 4:58 PM
vlorentz added inline comments.

Do we need to re-select the node here? Can't we alter the node parameter directly?


Why can't we use dir_info['checksums']['sha1'] anymore?


This name and route are too generic. I suggest to use something like _directory_resolve_content_path and directory/resolve/content-path. Or _directory_content_path_to_hash/_directory_path_to_content_hash

anlambert added inline comments.Apr 18 2019, 5:48 PM

Yes, this is needed as the node passed as parameter is not the one that ends up inserted in the DOM.

I supposed this is how DOMPurify works or it might be related that we are making an asynchronous call
to fetch here.


Well seen, will update accordingly


I would go for directory/resolve/content-path/ and _directory_path_to_content_hash.

I think also adding the content checksums to the returned JSON result would better honored
that renaming.

vlorentz added inline comments.Apr 18 2019, 5:50 PM

JSONPurify's example does change the given node:

Note that you need to return it.


and I think you can drop the new endpoint.

olasd requested changes to this revision.Apr 18 2019, 5:51 PM
olasd added a subscriber: olasd.

I'm not familiar with this code at all, so I apologize in advance for asking dumb questions :)

To my untrained eyes, this looks like a sensible change, so please take my comments as just comments rather than as strong blockers, I'm happy to be told off :).


I understand that this is the current behavior, so I'm fine with this accepting the updated check as is.

However, I'm slightly uncomfortable with the status quo, and we should ask ourselves the question of whether we really should load externally referenced resources by default (which leaks PII from of our users to external hosters), rather than on-demand.

I'd be in favor or adding an explicit click-through that our users would use to actually load external media, which would be more privacy-conscious.

We can even add a checkmark saying "I don't care" which would enable autoloading by default for further navigation.

(One can argue that there's browser extensions like uMatrix designed by/for people who want to avoid this, but I think we'd be better citizens by raising awareness of the issue rather than just dismissing it)


So, if I understand correctly, this assumes that path-absolute URLs are actually relative to the current directory (so, that the current directory is the root of the resource), correct?

Obviously that'll not always be the case (but it's hard to ever know when it is), so at least we should add a comment to make this explicit.


Before passing the path through to the database, we should use os.path.normpath and bail out if the path starts with ... This would minimize database queries and allow us to resolve weird paths such as a/b/../c/./d (which could certainly happen in autogenerated stuff)

This revision now requires changes to proceed.Apr 18 2019, 5:51 PM
anlambert added inline comments.Apr 18 2019, 6:20 PM

Considering the only external data loaded are image bytes (all sensible stuffs like scripts are removed by DOMPurify) , I do not know if we want to bother the users to explicitly authorize their loading. They will likely end up to always authorize the images loading as some README rendering will look really sad without them.

Nevertheless, I can start documenting myself on how to implement that.


In fact, I have retested and this piece of code is not needed after all.
Whether there is a leading slash or not, if the image is stored in the archive,
it will be found in both cases. So I will remove it.


I will give it a try again but I am worried about the asynchronous call to fetch.



anlambert updated this revision to Diff 4631.Apr 19 2019, 12:03 PM


  • address comments
  • simplify DOMPurify hook: remove no more needed node selection and asynchronous call to fetch
  • handle url redirection directly in the new renamed directory/resolve/content-path/ endpoint
anlambert added inline comments.Apr 19 2019, 12:08 PM

I prefer to keep the new endpoint and let this one returns a 404 error when providing a path pointing to a file (as only directories and sub-ones should be handled by that endpoint).

olasd accepted this revision.Apr 19 2019, 3:14 PM
olasd added inline comments.

should be ../ (if I were very nasty, I could call a directory ...)

This revision is now accepted and ready to land.Apr 19 2019, 3:14 PM
anlambert updated this revision to Diff 4638.Apr 19 2019, 3:34 PM

Update: address latest olasd comment ('..' => '../')

This revision was automatically updated to reflect the committed changes.