Details

Reviewers

anlambert

Group Reviewers

Reviewers

Commits

rDWAPPSdd2e5e261b86: apidoc: Stop parsing docutils trees with regexps on its pseudo-XML

Summary

Motivation:

This commit started as a simple change: I wanted to replace:

`<type> <IRI>`

with:

``<type> <IRI>``

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the process_paragraph method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor <type> <IRI> string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to self.data["description"], because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

our custom role code is now neatly restricted in visit_problematic, so it can't overflow, because docutils runs visit_problematic with *only* the role's string as child
it detects unexpected nodes, such as the title_reference roles, which is usually produced when accidentally using single-backquotes instead of double-backquotes to wrap inline code blocks (it happens a lot when one is used to markdown)

Test Plan

I checked most of the endpoints' documentation, and it's visually identical.

Diff Detail

Repository

rDWAPPS Web applications

Branch

apidoc-noregexp

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 22545
Build 35136: Phabricator diff pipeline on jenkins	Jenkins console · Jenkins
Build 35135: arc lint + arc unit

Event Timeline

vlorentz created this revision.Jul 6 2021, 2:47 PM

Herald added a reviewer: Reviewers. · View Herald TranscriptJul 6 2021, 2:47 PM

Build has FAILED

Patch application report for D5971 (id=21513)

Rebasing onto 36b44ef08e...

Current branch diff-target is up to date.

Changes applied before test

commit 524970b8b33b8a3cdafa5f9acba3fdb18cebe8a5
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:

`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/928/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/928/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jul 6 2021, 3:06 PM

Harbormaster failed remote builds in B22486: Diff 21513!

fix error on recent mypy versions

Build is green

Patch application report for D5971 (id=21516)

Rebasing onto 36b44ef08e...

Current branch diff-target is up to date.

Changes applied before test

commit a9cd34931968a8e0bdbdcd637827b9b4f5e72f85
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:

`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/929/ for more details.

Harbormaster completed remote builds in B22489: Diff 21516.Jul 6 2021, 3:27 PM

vlorentz requested review of this revision.Jul 6 2021, 3:27 PM

Nice ! I guess the remaining docutils error in http://localhost:5004/api/1/raw-extrinsic-metadata/swhid/doc/ is due to the missing target node handling ?

implement visit_target

indeed, it was a little tricky to get it right

Build is green

Patch application report for D5971 (id=21520)

Rebasing onto 36b44ef08e...

Current branch diff-target is up to date.

Changes applied before test

commit 3c64a0ed37a2aa99905e9ad2935a932531ce89be
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:

`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/930/ for more details.

Harbormaster completed remote builds in B22493: Diff 21520.Jul 6 2021, 4:55 PM

In D5971#153603, @vlorentz wrote:

indeed, it was a little tricky to get it right

FYI, there is still a docutils error in http://localhost:5004/api/1/ as only first line of rst document is used in the endpoint list and thus reference context is missing.
Maybe we could render the whole docstring to HTML and extract the first paragraph to integrate in the endpoints template in that case ?

This revision now requires changes to proceed.Jul 6 2021, 5:01 PM

fix links in the endpoint index

I still find docutils warnings after last diff update in those pages (sorry):

I think we should add a new test that should render HTML documentation for each endpoint and ensure no docutils warnings are issued.
I got the following output in django logs so there must be a way to detect those kind of errors I guess:

<string>:3: (WARNING/2) Duplicate explicit target name: "extrinsic metadata".
<string>:5: (WARNING/2) Duplicate explicit target name: "software heritage persistent identifier".
<string>:5: (WARNING/2) Duplicate explicit target name: "software heritage persistent identifier".
<string>:5: (WARNING/2) Duplicate explicit target name: "software heritage persistent identifier".
<string>:3: (WARNING/2) Duplicate explicit target name: "extrinsic metadata".

Update: We can force docutils to throw exception on warnings by setting the halt_level setting to 2 (warning), apidoc page will return an error 500 on any warning in that case.

This revision now requires changes to proceed.Jul 6 2021, 5:42 PM

Build is green

Patch application report for D5971 (id=21527)

Rebasing onto 36b44ef08e...

Current branch diff-target is up to date.

Changes applied before test

commit 2ca90586abf325951e4405f8a02810f9b9cec664
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:

`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/933/ for more details.

Harbormaster completed remote builds in B22500: Diff 21527.Jul 6 2021, 5:53 PM

ugh, and it's not even the only issue -_-

I'll keep working on it

Build has FAILED

Patch application report for D5971 (id=21527)

Rebasing onto 36b44ef08e...

Current branch diff-target is up to date.

Changes applied before test

commit 2ca90586abf325951e4405f8a02810f9b9cec664
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:

`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/934/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/934/console

Build was aborted

Patch application report for D5971 (id=21527)

Rebasing onto 36b44ef08e...

Current branch diff-target is up to date.

Changes applied before test

commit 2ca90586abf325951e4405f8a02810f9b9cec664
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:

`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/935/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/935/console

Only produce anonymous target to avoid duplication
rebase
turn on errors

In D5971#153650, @anlambert wrote:

Update: We can force docutils to throw exception on warnings by setting the halt_level setting to 2 (warning), apidoc page will return an error 500 on any warning in that case.

FYI: I didn't test this, because for some reason I don't get the same warnings you do...

Build is green

Patch application report for D5971 (id=21570)

Rebasing onto 25c639f292...

Current branch diff-target is up to date.

Changes applied before test

commit fd03b5842202ee314ab523d08b94974ac9e7fd7a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 6 14:33:29 2021 +0200

    apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
    
    Motivation:
    
    This commit started as a simple change: I wanted to replace:

`<type> <IRI>`
```

with:

```
``<type> <IRI>``
```

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the `process_paragraph` method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I'm saying pseudo-XML, because my poor `<type> <IRI>` string was not
escaped with XML entities, so it was in fact undistinguishable from
actual XML tags).

Changes:

Therefore, stops using the XML-like string representation of docutils
trees, and visits tree nodes directly instead.
Conveniently, this is already in a node visit, so we can reuse that;
simply by iterating recursively instead of stopping the recursion
as soon as we see a known node (ie. the visitors actually visited
only nodes very close to the root).

This means that we needed to add methods to handle each node type,
and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to `self.data["description"]`, because handlers of parent
nodes may need to re-indent their children's output.o

This results in cleaner code (and also closer to what we expect from
a visitor transformer), so it's a win too.

This has some other nice side-effects:

* our custom role code is now neatly restricted in `visit_problematic`,
  so it can't overflow, because docutils runs `visit_problematic` with
  *only* the role's string as child
* it detects unexpected nodes, such as the `title_reference` roles,
  which is usually produced when accidentally using single-backquotes
  instead of double-backquotes to wrap inline code blocks (it happens
  a lot when one is used to markdown)

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/959/ for more details.

Harbormaster completed remote builds in B22545: Diff 21570.Jul 9 2021, 11:44 AM

Looks good to me !

This revision is now accepted and ready to land.Jul 9 2021, 12:34 PM

rebase

This revision was landed with ongoing or failed builds.Jul 9 2021, 12:48 PM

Closed by commit rDWAPPSdd2e5e261b86: apidoc: Stop parsing docutils trees with regexps on its pseudo-XML (authored by vlorentz). · Explain Why

This revision was automatically updated to reflect the committed changes.

vlorentz added a commit: rDWAPPSdd2e5e261b86: apidoc: Stop parsing docutils trees with regexps on its pseudo-XML.

Build is green

Patch application report for D5971 (id=21574)

Rebasing onto dd2e5e261b...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-960-D5971.

Changes applied before test

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/960/ for more details.

Harbormaster completed remote builds in B22549: Diff 21574.Jul 9 2021, 1:06 PM

apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Patch application report for D5971 (id=21513)

Changes applied before test

Patch application report for D5971 (id=21516)

Changes applied before test

Patch application report for D5971 (id=21520)

Changes applied before test

Patch application report for D5971 (id=21527)

Changes applied before test

Patch application report for D5971 (id=21527)

Changes applied before test

Patch application report for D5971 (id=21527)

Changes applied before test

Patch application report for D5971 (id=21570)

Changes applied before test

Patch application report for D5971 (id=21574)

Changes applied before test

Revision Contents
Changeset List

Diff 21570

swh/web/api/apidoc.py

swh/web/api/views/content.py

swh/web/api/views/metadata.py

swh/web/api/views/origin.py

swh/web/common/utils.py

swh/web/tests/api/test_apidoc.py

apidoc: Stop parsing docutils trees with regexps on its pseudo-XMLClosedPublicActions

Details

Diff Detail

Event Timeline

Patch application report for D5971 (id=21513)

Changes applied before test

Patch application report for D5971 (id=21516)

Changes applied before test

Patch application report for D5971 (id=21520)

Changes applied before test

Patch application report for D5971 (id=21527)

Changes applied before test

Patch application report for D5971 (id=21527)

Changes applied before test

Patch application report for D5971 (id=21527)

Changes applied before test

Patch application report for D5971 (id=21570)

Changes applied before test

Patch application report for D5971 (id=21574)

Changes applied before test

Revision ContentsChangeset List

Diff 21570

swh/web/api/apidoc.py

swh/web/api/views/content.py

swh/web/api/views/metadata.py

swh/web/api/views/origin.py

swh/web/common/utils.py

swh/web/tests/api/test_apidoc.py

apidoc: Stop parsing docutils trees with regexps on its pseudo-XML
ClosedPublic
Actions

Revision Contents
Changeset List