apidoc: Stop parsing docutils trees with regexps on its pseudo-XML

Description

Motivation:

This commit started as a simple change: I wanted to replace:

`<type> <IRI>`

with:

``<type> <IRI>``

Unfortunately, this syntax looks too much like XML for its own good,
so it was stripped by the process_paragraph method, because it reads
the docutils pseudo-XML representation and strips every tag it doesn't
know about.
(I say pseudo-XML because my poor <type> <IRI> string was not
escaped with XML entities, so it was in fact indistinguishable from
actual XML tags.)
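
A minimal sketch of the failure mode, using only the standard library.
The names KNOWN_TAGS and strip_unknown_tags are hypothetical stand-ins,
not the actual process_paragraph code; the dump string only mimics the
shape of a docutils pseudo-XML dump, where text content is not escaped:

```python
import re

# Hypothetical whitelist: tags the old regexp-based code would keep.
KNOWN_TAGS = {"paragraph", "literal"}


def strip_unknown_tags(pseudo_xml: str) -> str:
    """Drop every tag-looking token whose name is not whitelisted."""

    def keep_or_drop(match: re.Match) -> str:
        return match.group(0) if match.group(1) in KNOWN_TAGS else ""

    return re.sub(r"</?([A-Za-z_][\w-]*)[^>]*>", keep_or_drop, pseudo_xml)


# The literal's text is not entity-escaped, so it looks like two tags.
dump = "<paragraph>\n    <literal>\n        <type> <IRI>\n"
stripped = strip_unknown_tags(dump)

# The real tags survive, but the literal's content is gone.
print(repr(stripped))
```

Nothing in the string tells the regexp that <type> is text rather than
markup, which is why the fix has to work on the tree itself.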

Changes:

Therefore, stop using the XML-like string representation of docutils
trees, and visit tree nodes directly instead.
Conveniently, this code already runs inside a node visitor, so we can
reuse it simply by recursing through the whole tree instead of stopping
the recursion as soon as we see a known node (i.e. the visitors
previously visited only nodes very close to the root).

This means that we need to add a method for each node type to handle
it and produce its ReST output. And since we don't have a global view
anymore, we need to return the produced ReST instead of appending
directly to self.data["description"], because handlers of parent
nodes may need to re-indent their children's output.
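
To illustrate why returning strings matters, here is a rough sketch
with a toy Node class standing in for docutils nodes. The class, the
render function and the handled node names are assumptions made for
illustration, not the code in this commit:

```python
import textwrap
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """Toy stand-in for a docutils node: a tag name plus children."""

    tagname: str
    children: List[Union["Node", str]] = field(default_factory=list)


def render(node: Union[Node, str]) -> str:
    """Return the ReST for `node` instead of appending to shared state."""
    if isinstance(node, str):
        return node
    inner = "".join(render(child) for child in node.children)
    if node.tagname == "paragraph":
        return inner + "\n\n"
    if node.tagname == "literal":
        return "``" + inner + "``"
    if node.tagname == "list_item":
        # The parent re-indents whatever its children returned, then
        # swaps the first line's indentation for the bullet marker.
        body = textwrap.indent(inner.strip(), "  ")
        return "* " + body[2:] + "\n"
    raise ValueError("no handler for node type: " + node.tagname)


item = Node("list_item", [
    Node("paragraph", [
        "first line\nsecond line with ",
        Node("literal", ["<type> <IRI>"]),
    ]),
])
output = render(item)
print(output)
```

If the paragraph handler had appended its text to a shared buffer, the
list_item handler would have had no chance to indent the second line.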

This results in cleaner code that is also closer to what we expect
from a visitor-based transformer, so it's a win too.

This has some other nice side-effects:

  • our custom role code is now neatly confined to visit_problematic, so it can't overflow, because docutils runs visit_problematic with *only* the role's string as child
  • it detects unexpected nodes, such as title_reference nodes, which are usually produced when accidentally using single backquotes instead of double backquotes to wrap inline code (it happens a lot when one is used to Markdown)
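
The second side-effect can be sketched as follows, assuming handlers
are looked up by node type name. Renderer and UnexpectedNode are
hypothetical names, and nodes are modelled as plain (tagname, children)
tuples rather than real docutils nodes:

```python
class UnexpectedNode(Exception):
    """Raised when the tree contains a node type with no handler."""


class Renderer:
    def visit(self, node):
        if isinstance(node, str):
            return node
        tagname, children = node
        # Look up the handler by node type; a missing one is an error
        # instead of being silently skipped.
        handler = getattr(self, "visit_" + tagname, None)
        if handler is None:
            raise UnexpectedNode(tagname)
        return handler(children)

    def _inner(self, children):
        return "".join(self.visit(child) for child in children)

    def visit_paragraph(self, children):
        return self._inner(children) + "\n"

    def visit_literal(self, children):
        return "``" + self._inner(children) + "``"


# Single backquotes in the docstring yield a title_reference node.
doc = ("paragraph", ["Returns ", ("title_reference", ["<type> <IRI>"])])
try:
    Renderer().visit(doc)
    error = None
except UnexpectedNode as exc:
    error = str(exc)
print(error)
```

Failing loudly on the unknown node turns a silent rendering bug into
something the docstring author sees immediately.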
