Page MenuHomeSoftware Heritage

apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
ClosedPublic

Authored by vlorentz on Jul 6 2021, 2:47 PM.

Details

Summary

Motivation:

This commit started as a simple change: I wanted to replace:

`<type> <IRI>`

with:

``<type> <IRI>``

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the process_paragraph method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor <type> <IRI> string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to self.data["description"], because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

  • our custom role code is now neatly restricted in visit_problematic, so it can't overflow, because docutils runs visit_problematic with *only* the role's string as child
  • it detects unexpected nodes, such as the title_reference roles, which is usually produced when accidentally using single-backquotes instead of double-backquotes to wrap inline code blocks (it happens a lot when one is used to markdown)
Test Plan

I checked most of the endpoints' documentation, and it's visually identical.

Diff Detail

Repository
rDWAPPS Web applications
Branch
apidoc-noregexp
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 22493
Build 35043: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 35042: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D5971 (id=21513)

Rebasing onto 36b44ef08e...

Current branch diff-target is up to date.
Changes applied before test
commit 524970b8b33b8a3cdafa5f9acba3fdb18cebe8a5
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:
`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)
Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/928/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/928/console
Harbormaster returned this revision to the author for changes because remote builds failed.Jul 6 2021, 3:06 PM
Harbormaster failed remote builds in B22486: Diff 21513!

fix error on recent mypy versions

Build is green

Patch application report for D5971 (id=21516)

Rebasing onto 36b44ef08e...

Current branch diff-target is up to date.
Changes applied before test
commit a9cd34931968a8e0bdbdcd637827b9b4f5e72f85
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:
`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)
See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/929/ for more details.

Nice ! I guess the remaining docutils error in http://localhost:5004/api/1/raw-extrinsic-metadata/swhid/doc/ is due to the missing target node handling ?

indeed, it was a little tricky to get it right

Build is green

Patch application report for D5971 (id=21520)

Rebasing onto 36b44ef08e...

Current branch diff-target is up to date.
Changes applied before test
commit 3c64a0ed37a2aa99905e9ad2935a932531ce89be
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:
`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)
See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/930/ for more details.

indeed, it was a little tricky to get it right

FYI, there is still a docutils error in http://localhost:5004/api/1/ as only first line of rst document is used in the endpoint list and thus reference context is missing.
Maybe we could render the whole docstring to HTML and extract the first paragraph to integrate in the endpoints template in that case ?

This revision now requires changes to proceed.Jul 6 2021, 5:01 PM

fix links in the endpoint index

anlambert requested changes to this revision.EditedJul 6 2021, 5:42 PM

I still find docutils warnings after last diff update in those pages (sorry):

I think we should add a new test that should render HTML documentation for each endpoint and ensure no docutils warnings are issued.
I got the following output in django logs so there must be a way to detect those kind of errors I guess:

<string>:3: (WARNING/2) Duplicate explicit target name: "extrinsic metadata".
<string>:5: (WARNING/2) Duplicate explicit target name: "software heritage persistent identifier".
<string>:5: (WARNING/2) Duplicate explicit target name: "software heritage persistent identifier".
<string>:5: (WARNING/2) Duplicate explicit target name: "software heritage persistent identifier".
<string>:3: (WARNING/2) Duplicate explicit target name: "extrinsic metadata".

Update: We can force docutils to throw exception on warnings by setting the halt_level setting to 2 (warning), apidoc page will return an error 500 on any warning in that case.

This revision now requires changes to proceed.Jul 6 2021, 5:42 PM

Build is green

Patch application report for D5971 (id=21527)

Rebasing onto 36b44ef08e...

Current branch diff-target is up to date.
Changes applied before test
commit 2ca90586abf325951e4405f8a02810f9b9cec664
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:
`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)
See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/933/ for more details.

ugh, and it's not even the only issue -_-

I'll keep working on it

Build has FAILED

Patch application report for D5971 (id=21527)

Rebasing onto 36b44ef08e...

Current branch diff-target is up to date.
Changes applied before test
commit 2ca90586abf325951e4405f8a02810f9b9cec664
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:
`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)
Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/934/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/934/console

Build was aborted

Patch application report for D5971 (id=21527)

Rebasing onto 36b44ef08e...

Current branch diff-target is up to date.
Changes applied before test
commit 2ca90586abf325951e4405f8a02810f9b9cec664
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:
`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)
Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/935/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/935/console
  • Only produce anonymous target to avoid duplication
  • rebase
  • turn on errors

Update: We can force docutils to throw exception on warnings by setting the halt_level setting to 2 (warning), apidoc page will return an error 500 on any warning in that case.

FYI: I didn't test this, because for some reason I don't get the same warnings you do...

Build is green

Patch application report for D5971 (id=21570)

Rebasing onto 25c639f292...

Current branch diff-target is up to date.
Changes applied before test
commit fd03b5842202ee314ab523d08b94974ac9e7fd7a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:
`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)
See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/959/ for more details.
This revision is now accepted and ready to land.Jul 9 2021, 12:34 PM
This revision was landed with ongoing or failed builds.Jul 9 2021, 12:48 PM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D5971 (id=21574)

Rebasing onto dd2e5e261b...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-960-D5971.
Changes applied before test

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/960/ for more details.