Page MenuHomeSoftware Heritage

Add support for client side rendering of Jupyter notebooks
ClosedPublic

Authored by anlambert on Apr 12 2019, 6:18 PM.

Details

Summary

That diff adds support for client side rendering of Jupyter notebooks inside
the content view of swh-web/browse.

Related T1641

Diff Detail

Repository
rDWAPPS Web applications
Branch
jupyter-nb-rendering
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 5499
Build 7465: tox-on-jenkinsJenkins
Build 7464: arc lint + arc unit

Event Timeline

Update:

  • rebase
  • add support for math typesetting upon notebook rendering through the use of the MathJax library
  • prevent notebook rendering to override our default font
  • increase max content display size

Update:

  • Rebase
  • Add support for image rendering in the notebook when the src url is relative to a directory present in the swh archive
  • Put XSS filtering related code in a dedicated file
  • Fix numerous issues regarding math typesetting (it was tough to handle all possible math formula as some LaTeX code escaping and small hacks were required in markdown code prior converting it to HTML, but I ended up with something that works great everywhere)
  • CSS improvements and fixes

Currently this diff contains a single commit. I intend to improve the git history
as future planned changes.

Update:

  • move the XSS filtering improvements in a separate diff (D1425)
  • more CSS fixes and improvements (better handle small screens for instance)
  • factorize WebLabels configuration for MathJax
anlambert retitled this revision from [WIP] Add support for client side rendering of Jupyter notebooks to Add support for client side rendering of Jupyter notebooks.
vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/web/assets/config/mathjax-js-files.js
21–23 ↗(On Diff #4627)

Do we really want to start pulling JS from two other domains/orgs?

swh/web/assets/config/webpack-plugins/generate-weblabels-webpack-plugin/index.js
180–182

This seems like a bad way to check for relativity, eg. an absolute URL starting with ftp:// or a relative starting with http:blahblah.

Maybe search for :// in the URL instead?

191–195

Same.

(Maybe there should be a utility function to check for relativity?)

191–196

In the chunk above, you do:

if relative:
   url = base + url

and here you do:

if relative:
    foo = base + url
else:
    foo = url
354–357

Same

swh/web/assets/src/bundles/webapp/notebook-rendering.js
18–20

yum

24–26

JS is weird...

61

Do we use transpiling to older JS versions? If not, this will break on not-so-old browsers.

75–85

Why don't we want these underscores to be interprested by showdown?

129–136

Let's open a task to remember that

swh/web/assets/src/bundles/webapp/notebook.css
107

Why?

This revision now requires changes to proceed.Apr 18 2019, 6:03 PM
swh/web/assets/config/mathjax-js-files.js
21–23 ↗(On Diff #4627)

First one is the CDN from which we get the minified js source files of MathJax, second one enables to get the unminified source code and is only used in the WebLabels page.
The goal is to respect LibreJS specification here.

swh/web/assets/config/webpack-plugins/generate-weblabels-webpack-plugin/index.js
180–182

Seems better indeed

swh/web/assets/src/bundles/webapp/notebook-rendering.js
24–26

Tell me about it

61

Of course we do

75–85

To get correct math typesetting when using MathJax.

LaTeX formula containing substrings like { A }_{ 1 } will be turned by showdown to { A }<em>{ 1 }
(even with the literalMidWordUnderscores option set to true) and MathJax will fail to properly
render it afterwards.

Update:

  • rebase
  • address comments
  • more CSS fixes / improvements
  • unescape HTML from text that will be inserted in pre elements
swh/web/assets/src/bundles/webapp/notebook-rendering.js
129–136
swh/web/assets/src/bundles/webapp/notebook.css
107

I adapted https://github.com/jsvine/nbpreview/blob/master/css/vendor/notebook.css for our web application.

The CSS rule

div[style="max-height:1000px;max-width:1500px;overflow:auto;"] {
    max-height: none !important;    
}

was related to pandas dataframe formatting.

I removed it as it effectively seems weird to have such a rule here.

Update: Fix url in notebook.css comment

Update: Check that hljs supports the notebook language prior calling the highlight function

I think we need to have a conversation about how we decide to add support for rendering specific file formats in the webapp.

  • There are a million different file formats out there, why are we rendering Markdown and Jupyter notebooks and not something else?
  • When we have support for a million (or even just a dozen, really) file formats, are we going to test for which-is-which in a long chain of IFs in a template or do we need a proper plugin system?
  • How do we decide if something is a given file format or not, especially considering that getting it wrong might even entail security vulnerabilities, injections, etc.? (because we're rendering content we do not control directly)

In summary, this feature is really cool. But I think we rushed a bit into getting it in. I'm not opposed to actually landing this, but we really need to sit down and think about the process, the architecture, and the policy for deciding how (and if) to add support for other formats in the future.

There are a million different file formats out there, why are we rendering Markdown and Jupyter notebooks and not something else?

It was a user request so I think it is wise to listen to their need, plus the proposed use case is really interesting for showcasing the
scientific knowledge contained into the archive.
Regarding the rendering implementation, that's why I tagged this as easy hack as we already had all the required software components
client side to implement this in a few line of code (I did not see coming the LaTeX escaping hell tough).

When we have support for a million (or even just a dozen, really), are we going to test for which-is-which in a long chained series of IFs in a template? That doesn't seem really wise…

Sure, this calls for clever refactoring.

How do we decide if something is a given file format or not, especially considering that getting it wrong might even entail security vulnerabilities, injections, etc.? (because we're rendering content we do not control directly)

All produced HTML gets filtered for XSS injection before inserting it into the DOM. So apart some external image bytes, nothing sensible will be loaded.
@olasd suggested to have some kind of user input to approve the loading of external resources, this could be added to ensure more safety.

we really need to sit down and think about the process, the architecture, and the policy for deciding how (and if) add support for other formats in the future

I will create a task on this next week and start brainstorming about it.

This revision is now accepted and ready to land.Apr 29 2019, 2:28 PM
This revision was automatically updated to reflect the committed changes.