server: Use xml.etree.ElementTree instead of nested dicts internally
This commit does not touch the external API though; ie. metadata_dict
is still present in the JSON API, and the equivalent jsonb field remains
in the database. They will probably be removed in a future commit
because they are not very useful, though.
Rationale:
I find xmltodict's approach of translating XML tree to native structures
to be intrinsically flawed for non-trivial handling of XML, because the
data structure is:
- implementation-defined (by xmltodict, which is python-only) and it may change across versions
- does not intrinsically store namespaces, and relies on an internal prefix map (though it isn't much of an issue right now, as we do not need composability and all the changed APIs are private)
- not stable; for example, <a><b>foo</b></a> and <a><b>foo</b><b>bar</b></a> are encoded completely differently (the former is a Dict[str, str], the latter is Dict[str, list].
And every operation manipulating this data structure needs to check
presence, number *and* type on every access. Consider this part of this
commit for example:
- swh_deposit = metadata.get("swh:deposit") - if not swh_deposit: - return None - - swh_reference = swh_deposit.get("swh:reference") - if not swh_reference: - return None - - swh_origin = swh_reference.get("swh:origin") - if swh_origin: - url = swh_origin.get("@url") - if url: - return url + ref_origin = metadata.find( + "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES + ) + if ref_origin is not None: + return ref_origin.attrib["url"]
the use of XPath makes it considerably shorter; and the original version
did not even check number/type (ie. it would crash if an element was
duplicated).