Page MenuHomeSoftware Heritage

server: Use xml.etree.ElementTree instead of nested dicts internally
ClosedPublic

Authored by vlorentz on Feb 21 2022, 6:10 PM.

Details

Summary

This commit does not touch the external API though; ie. metadata_dict
is still present in the JSON API, and the equivalent jsonb field remains
in the database. They will probably be removed in a future commit
because they are not very useful, though.

Rationale:

I find xmltodict's approach of translating XML tree to native structures
to be intrinsically flawed for non-trivial handling of XML, because the
data structure is:

  • implementation-defined (by xmltodict, which is python-only) and it may change across versions
  • does not intrinsically store namespaces, and relies on an internal prefix map (though it isn't much of an issue right now, as we do not need composability and all the changed APIs are private)
  • not stable; for example, <a><b>foo</b></a> and <a><b>foo</b><b>bar</b></a> are encoded completely differently (the former is a Dict[str, str], the latter is Dict[str, list].

And every operation manipulating this data structure needs to check
presence, number *and* type on every access. Consider this part of this
commit for example:

-    swh_deposit = metadata.get("swh:deposit")
-    if not swh_deposit:
-        return None
-
-    swh_reference = swh_deposit.get("swh:reference")
-    if not swh_reference:
-        return None
-
-    swh_origin = swh_reference.get("swh:origin")
-    if swh_origin:
-        url = swh_origin.get("@url")
-        if url:
-            return url
+    ref_origin = metadata.find(
+        "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES
+    )
+    if ref_origin is not None:
+        return ref_origin.attrib["url"]

the use of XPath makes it considerably shorter; and the original version
did not even check number/type (ie. it would crash if an element was
duplicated).

Diff Detail

Repository
rDDEP Push deposit
Branch
validate-metadata
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 27030
Build 42267: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 42266: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D7215 (id=26145)

Could not rebase; Attempt merge onto 40adc8c23b...

Updating 40adc8c2..bcfdd332
Fast-forward
 swh/deposit/api/checks.py                          |  26 +--
 swh/deposit/api/common.py                          |  84 ++++---
 swh/deposit/api/edit.py                            |   9 +-
 swh/deposit/api/private/__init__.py                |  24 +-
 swh/deposit/api/private/deposit_check.py           |  22 +-
 swh/deposit/api/private/deposit_read.py            |  58 +++--
 swh/deposit/cli/client.py                          |   6 +-
 swh/deposit/tests/api/test_checks.py               | 248 +++++++++++++--------
 .../tests/api/test_collection_add_to_origin.py     |   4 +-
 .../tests/api/test_collection_post_multipart.py    |   1 -
 .../tests/api/test_deposit_private_check.py        |   9 +-
 .../api/test_deposit_private_read_metadata.py      |  80 +++----
 swh/deposit/tests/cli/test_client.py               |   5 +-
 .../atom/entry-data-with-metadata-provenance.xml   |   2 +-
 swh/deposit/tests/data/atom/entry-data2.xml        |   6 +
 swh/deposit/tests/data/atom/entry-data3.xml        |   8 +-
 swh/deposit/tests/test_utils.py                    | 103 +--------
 swh/deposit/utils.py                               | 126 +++--------
 18 files changed, 378 insertions(+), 443 deletions(-)
Changes applied before test
commit bcfdd3328e8b2cd3f11013eee1bd1a6c82136a11
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 21 17:55:07 2022 +0100

    server: Use xml.etree.ElementTree instead of nested dicts internally
    
    This commit does not touch the external API though; ie. `metadata_dict`
    is still present in the JSON API, and the equivalent `jsonb` field remains
    in the database. They will probably be removed in a future commit
    because they are not very useful, though.
    
    Rationale:
    
    I find xmltodict's approach of translating XML tree to native structures
    to be intrinsically flawed for non-trivial handling of XML, because the
    data structure is:
    
    * implementation-defined (by xmltodict, which is python-only) and it may
      change across versions
    * does not intrinsically store namespaces, and relies on an internal
      prefix map  (though it isn't much of an issue right now, as we do not need
      composability and all the changed APIs are private)
    * not stable; for example, `<a><b>foo</b></a>` and `<a><b>foo</b><b>bar</b></a>`
      are encoded completely differently (the former is a `Dict[str, str]`,
      the latter is `Dict[str, list]`.
    
    And every operation manipulating this data structure needs to check
    presence, number *and* type on every access. Consider this part of this
    commit for example:
  • swh_deposit = metadata.get("swh:deposit")
  • if not swh_deposit:
  • return None -
  • swh_reference = swh_deposit.get("swh:reference")
  • if not swh_reference:
  • return None -
  • swh_origin = swh_reference.get("swh:origin")
  • if swh_origin:
  • url = swh_origin.get("@url")
  • if url:
  • return url + ref_origin = metadata.find( + "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES + ) + if ref_origin is not None: + return ref_origin.attrib["url"] `

    the use of XPath makes it considerably shorter; and the original version did not even check number/type (ie. it would crash if an element was duplicated).

commit efc14e494507c10384fd8509c990e4a3d6623a3e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 17:06:55 2022 +0100

Fix URI of schema.org

Either is valid according to https://schema.org/docs/gs.html ;
but we need to pick one, as they are opaque identifiers.
And codemeta chose http:// (because it was the only one to be
valid back then), so we should stick to this one.

commit a25afe62bd1ea0452ef2e395575d1ec64382f6b4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 16:37:49 2022 +0100

Remove metadata merging; use only the latest document

We don't use that feature at all as far as I am aware.

I also find that it complicates any metadata handling (especially the validation
I would like to add in the near future), and probably does not match semantics
intended by SWORD (merging occurs on PUT requests, as we don't implement PATCH)
See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/717/ for more details.
ardumont added a subscriber: ardumont.

lgtm, request changes for the sake of some suggestions/remarks inline.

You are missing some types on (some) impacted methods and some copyright bumps as well.

swh/deposit/api/checks.py
72–84

less duplication.

swh/deposit/api/common.py
320

yeah, it's more for our maintenance use case now. Reading data immediately without having to juggle with another (metadata storage) backend.
Given the data volumetry, that's not a problem for now though.

662–666

out of order given the parameter definition in the method.

swh/deposit/api/private/__init__.py
40–41

out of sync, suggestion above ^.

swh/deposit/api/private/deposit_read.py
110

or something ^

116

we can drop the types in the docstring if we add the type to the method ;)

201

Relatedly to your question in comment above (about dropping that dict), it's still is used in the deposit loader which consumes this... (i think).
So first, if you still want to drop it, we need to make sure it's not needed there.

This revision now requires changes to proceed.Feb 22 2022, 11:01 AM

Build has FAILED

Patch application report for D7215 (id=26157)

Could not rebase; Attempt merge onto 770cc0f515...

Updating 770cc0f5..20b7768e
Fast-forward
 swh/deposit/api/checks.py                          |  24 +-
 swh/deposit/api/common.py                          |  84 ++++---
 swh/deposit/api/edit.py                            |   9 +-
 swh/deposit/api/private/__init__.py                |  24 +-
 swh/deposit/api/private/deposit_check.py           |  21 +-
 swh/deposit/api/private/deposit_read.py            |  58 +++--
 swh/deposit/cli/client.py                          |   6 +-
 swh/deposit/tests/api/test_checks.py               | 249 +++++++++++++--------
 .../tests/api/test_collection_add_to_origin.py     |   4 +-
 .../tests/api/test_collection_post_multipart.py    |   1 -
 .../tests/api/test_deposit_private_check.py        |  13 +-
 .../api/test_deposit_private_read_metadata.py      |  80 +++----
 swh/deposit/tests/cli/test_client.py               |   5 +-
 .../atom/entry-data-with-metadata-provenance.xml   |   2 +-
 swh/deposit/tests/data/atom/entry-data2.xml        |   6 +
 swh/deposit/tests/data/atom/entry-data3.xml        |   8 +-
 swh/deposit/tests/test_utils.py                    | 103 +--------
 swh/deposit/utils.py                               | 126 +++--------
 18 files changed, 385 insertions(+), 438 deletions(-)
Changes applied before test
commit 20b7768e6d33a4dcb9f89dabca14b701a63d2fdf
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 21 17:55:07 2022 +0100

    server: Use xml.etree.ElementTree instead of nested dicts internally
    
    This commit does not touch the external API though; ie. `metadata_dict`
    is still present in the JSON API, and the equivalent `jsonb` field remains
    in the database. They will probably be removed in a future commit
    because they are not very useful, though.
    
    Rationale:
    
    I find xmltodict's approach of translating XML tree to native structures
    to be intrinsically flawed for non-trivial handling of XML, because the
    data structure is:
    
    * implementation-defined (by xmltodict, which is python-only) and it may
      change across versions
    * does not intrinsically store namespaces, and relies on an internal
      prefix map  (though it isn't much of an issue right now, as we do not need
      composability and all the changed APIs are private)
    * not stable; for example, `<a><b>foo</b></a>` and `<a><b>foo</b><b>bar</b></a>`
      are encoded completely differently (the former is a `Dict[str, str]`,
      the latter is `Dict[str, list]`.
    
    And every operation manipulating this data structure needs to check
    presence, number *and* type on every access. Consider this part of this
    commit for example:
  • swh_deposit = metadata.get("swh:deposit")
  • if not swh_deposit:
  • return None -
  • swh_reference = swh_deposit.get("swh:reference")
  • if not swh_reference:
  • return None -
  • swh_origin = swh_reference.get("swh:origin")
  • if swh_origin:
  • url = swh_origin.get("@url")
  • if url:
  • return url + ref_origin = metadata.find( + "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES + ) + if ref_origin is not None: + return ref_origin.attrib["url"] `

    the use of XPath makes it considerably shorter; and the original version did not even check number/type (ie. it would crash if an element was duplicated).

commit a10ed57bf8e46f404500cb1bec7bd94729eeb513
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 17:06:55 2022 +0100

Fix URI of schema.org

Either is valid according to https://schema.org/docs/gs.html ;
but we need to pick one, as they are opaque identifiers.
And codemeta chose http:// (because it was the only one to be
valid back then), so we should stick to this one.

commit 7727a9c0c8b8d40a8914ae74760a3bf7d35f255b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 16:37:49 2022 +0100

Remove metadata merging; use only the latest document

We don't use that feature at all as far as I am aware.

I also find that it complicates any metadata handling (especially the validation
I would like to add in the near future), and probably does not match semantics
intended by SWORD (merging occurs on PUT requests, as we don't implement PATCH)
Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/723/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/723/console

Build is green

Patch application report for D7215 (id=26158)

Could not rebase; Attempt merge onto 770cc0f515...

Updating 770cc0f5..26fc730b
Fast-forward
 swh/deposit/api/checks.py                          |  24 +-
 swh/deposit/api/common.py                          |  84 ++++---
 swh/deposit/api/edit.py                            |   9 +-
 swh/deposit/api/private/__init__.py                |  24 +-
 swh/deposit/api/private/deposit_check.py           |  21 +-
 swh/deposit/api/private/deposit_read.py            |  58 +++--
 swh/deposit/cli/client.py                          |   6 +-
 swh/deposit/tests/api/test_checks.py               | 249 +++++++++++++--------
 .../tests/api/test_collection_add_to_origin.py     |   4 +-
 .../tests/api/test_collection_post_multipart.py    |   1 -
 .../tests/api/test_deposit_private_check.py        |  19 +-
 .../api/test_deposit_private_read_metadata.py      |  80 +++----
 swh/deposit/tests/cli/test_client.py               |   5 +-
 .../atom/entry-data-with-metadata-provenance.xml   |   2 +-
 swh/deposit/tests/data/atom/entry-data2.xml        |   6 +
 swh/deposit/tests/data/atom/entry-data3.xml        |   8 +-
 swh/deposit/tests/test_utils.py                    | 103 +--------
 swh/deposit/utils.py                               | 126 +++--------
 18 files changed, 386 insertions(+), 443 deletions(-)
Changes applied before test
commit 26fc730b43382f66cd705c0b49ce9ba2e6c3bb11
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 21 17:55:07 2022 +0100

    server: Use xml.etree.ElementTree instead of nested dicts internally
    
    This commit does not touch the external API though; ie. `metadata_dict`
    is still present in the JSON API, and the equivalent `jsonb` field remains
    in the database. They will probably be removed in a future commit
    because they are not very useful, though.
    
    Rationale:
    
    I find xmltodict's approach of translating XML tree to native structures
    to be intrinsically flawed for non-trivial handling of XML, because the
    data structure is:
    
    * implementation-defined (by xmltodict, which is python-only) and it may
      change across versions
    * does not intrinsically store namespaces, and relies on an internal
      prefix map  (though it isn't much of an issue right now, as we do not need
      composability and all the changed APIs are private)
    * not stable; for example, `<a><b>foo</b></a>` and `<a><b>foo</b><b>bar</b></a>`
      are encoded completely differently (the former is a `Dict[str, str]`,
      the latter is `Dict[str, list]`.
    
    And every operation manipulating this data structure needs to check
    presence, number *and* type on every access. Consider this part of this
    commit for example:
  • swh_deposit = metadata.get("swh:deposit")
  • if not swh_deposit:
  • return None -
  • swh_reference = swh_deposit.get("swh:reference")
  • if not swh_reference:
  • return None -
  • swh_origin = swh_reference.get("swh:origin")
  • if swh_origin:
  • url = swh_origin.get("@url")
  • if url:
  • return url + ref_origin = metadata.find( + "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES + ) + if ref_origin is not None: + return ref_origin.attrib["url"] `

    the use of XPath makes it considerably shorter; and the original version did not even check number/type (ie. it would crash if an element was duplicated).

commit a10ed57bf8e46f404500cb1bec7bd94729eeb513
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 17:06:55 2022 +0100

Fix URI of schema.org

Either is valid according to https://schema.org/docs/gs.html ;
but we need to pick one, as they are opaque identifiers.
And codemeta chose http:// (because it was the only one to be
valid back then), so we should stick to this one.

commit 7727a9c0c8b8d40a8914ae74760a3bf7d35f255b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 16:37:49 2022 +0100

Remove metadata merging; use only the latest document

We don't use that feature at all as far as I am aware.

I also find that it complicates any metadata handling (especially the validation
I would like to add in the near future), and probably does not match semantics
intended by SWORD (merging occurs on PUT requests, as we don't implement PATCH)
See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/724/ for more details.
vlorentz marked 3 inline comments as done.

update docstrings and types as requested

Build is green

Patch application report for D7215 (id=26160)

Could not rebase; Attempt merge onto 770cc0f515...

Updating 770cc0f5..0c4afc28
Fast-forward
 swh/deposit/api/checks.py                          |  24 +-
 swh/deposit/api/common.py                          |  84 ++++---
 swh/deposit/api/edit.py                            |   9 +-
 swh/deposit/api/private/__init__.py                |  28 +--
 swh/deposit/api/private/deposit_check.py           |  21 +-
 swh/deposit/api/private/deposit_read.py            |  69 +++---
 swh/deposit/cli/client.py                          |   6 +-
 swh/deposit/tests/api/test_checks.py               | 249 +++++++++++++--------
 .../tests/api/test_collection_add_to_origin.py     |   4 +-
 .../tests/api/test_collection_post_multipart.py    |   1 -
 .../tests/api/test_deposit_private_check.py        |  19 +-
 .../api/test_deposit_private_read_metadata.py      |  80 +++----
 swh/deposit/tests/cli/test_client.py               |   5 +-
 .../atom/entry-data-with-metadata-provenance.xml   |   2 +-
 swh/deposit/tests/data/atom/entry-data2.xml        |   6 +
 swh/deposit/tests/data/atom/entry-data3.xml        |   8 +-
 swh/deposit/tests/test_utils.py                    | 103 +--------
 swh/deposit/utils.py                               | 126 +++--------
 18 files changed, 393 insertions(+), 451 deletions(-)
Changes applied before test
commit 0c4afc28d32cf1c1f5f2984af498ec2d9aec985a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 21 17:55:07 2022 +0100

    server: Use xml.etree.ElementTree instead of nested dicts internally
    
    This commit does not touch the external API though; ie. `metadata_dict`
    is still present in the JSON API, and the equivalent `jsonb` field remains
    in the database. They will probably be removed in a future commit
    because they are not very useful, though.
    
    Rationale:
    
    I find xmltodict's approach of translating XML tree to native structures
    to be intrinsically flawed for non-trivial handling of XML, because the
    data structure is:
    
    * implementation-defined (by xmltodict, which is python-only) and it may
      change across versions
    * does not intrinsically store namespaces, and relies on an internal
      prefix map  (though it isn't much of an issue right now, as we do not need
      composability and all the changed APIs are private)
    * not stable; for example, `<a><b>foo</b></a>` and `<a><b>foo</b><b>bar</b></a>`
      are encoded completely differently (the former is a `Dict[str, str]`,
      the latter is `Dict[str, list]`.
    
    And every operation manipulating this data structure needs to check
    presence, number *and* type on every access. Consider this part of this
    commit for example:
  • swh_deposit = metadata.get("swh:deposit")
  • if not swh_deposit:
  • return None -
  • swh_reference = swh_deposit.get("swh:reference")
  • if not swh_reference:
  • return None -
  • swh_origin = swh_reference.get("swh:origin")
  • if swh_origin:
  • url = swh_origin.get("@url")
  • if url:
  • return url + ref_origin = metadata.find( + "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES + ) + if ref_origin is not None: + return ref_origin.attrib["url"] `

    the use of XPath makes it considerably shorter; and the original version did not even check number/type (ie. it would crash if an element was duplicated).

commit a10ed57bf8e46f404500cb1bec7bd94729eeb513
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 17:06:55 2022 +0100

Fix URI of schema.org

Either is valid according to https://schema.org/docs/gs.html ;
but we need to pick one, as they are opaque identifiers.
And codemeta chose http:// (because it was the only one to be
valid back then), so we should stick to this one.

commit 7727a9c0c8b8d40a8914ae74760a3bf7d35f255b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 16:37:49 2022 +0100

Remove metadata merging; use only the latest document

We don't use that feature at all as far as I am aware.

I also find that it complicates any metadata handling (especially the validation
I would like to add in the near future), and probably does not match semantics
intended by SWORD (merging occurs on PUT requests, as we don't implement PATCH)
See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/725/ for more details.
swh/deposit/api/checks.py
72–84

let's do it in a future diff, this one is already big enough

ardumont added inline comments.
swh/deposit/api/checks.py
72–84

ok

swh/deposit/tests/api/test_deposit_private_check.py
143

?

was the existing test off somehow?

This revision is now accepted and ready to land.Feb 22 2022, 2:27 PM
swh/deposit/tests/api/test_deposit_private_check.py
143

it matches a new error message I introduced when there is no XML to parse at all (which came from merge_metadata being essentially a fold with {} as starting element, so it returned {} when the list of metadata files was empty):

-        metadata_status_ok, details = check_metadata(metadata)
-        # Ensure in case of error, we do have the rejection details
-        assert metadata_status_ok or (not metadata_status_ok and details is not None)
-        # we can have warnings even if checks are ok (e.g. missing suggested field)
-        details_dict.update(details or {})
+        if raw_metadata is None:
+            metadata_status_ok = False
+            details_dict["metadata"] = [{"summary": "Missing Atom document"}]
+        else:
+            metadata_tree = ElementTree.fromstring(raw_metadata)
+            metadata_status_ok, details = check_metadata(metadata_tree)
+            # Ensure in case of error, we do have the rejection details
+            assert metadata_status_ok or (
+                not metadata_status_ok and details is not None
+            )
+            # we can have warnings even if checks are ok (e.g. missing suggested field)
+            details_dict.update(details or {})
swh/deposit/tests/api/test_deposit_private_check.py
143

I mean *the previous lack of error* came from merge_metadata being a fold

swh/deposit/tests/api/test_deposit_private_check.py
143

ack, thx.

i had noticed the change diff you pointed out, i did not relate the 2.

Build has FAILED

Patch application report for D7215 (id=26176)

Rebasing onto b9f565aaa3...

Current branch diff-target is up to date.
Changes applied before test
commit 81849f818a66368c80d9f65590ea839f19e90a63
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 21 17:55:07 2022 +0100

    server: Use xml.etree.ElementTree instead of nested dicts internally
    
    This commit does not touch the external API though; ie. `metadata_dict`
    is still present in the JSON API, and the equivalent `jsonb` field remains
    in the database. They will probably be removed in a future commit
    because they are not very useful, though.
    
    Rationale:
    
    I find xmltodict's approach of translating XML tree to native structures
    to be intrinsically flawed for non-trivial handling of XML, because the
    data structure is:
    
    * implementation-defined (by xmltodict, which is python-only) and it may
      change across versions
    * does not intrinsically store namespaces, and relies on an internal
      prefix map  (though it isn't much of an issue right now, as we do not need
      composability and all the changed APIs are private)
    * not stable; for example, `<a><b>foo</b></a>` and `<a><b>foo</b><b>bar</b></a>`
      are encoded completely differently (the former is a `Dict[str, str]`,
      the latter is `Dict[str, list]`.
    
    And every operation manipulating this data structure needs to check
    presence, number *and* type on every access. Consider this part of this
    commit for example:
  • swh_deposit = metadata.get("swh:deposit")
  • if not swh_deposit:
  • return None -
  • swh_reference = swh_deposit.get("swh:reference")
  • if not swh_reference:
  • return None -
  • swh_origin = swh_reference.get("swh:origin")
  • if swh_origin:
  • url = swh_origin.get("@url")
  • if url:
  • return url + ref_origin = metadata.find( + "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES + ) + if ref_origin is not None: + return ref_origin.attrib["url"] `

    the use of XPath makes it considerably shorter; and the original version did not even check number/type (ie. it would crash if an element was duplicated).
Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/727/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/727/console

Build is green

Patch application report for D7215 (id=26177)

Rebasing onto b9f565aaa3...

Current branch diff-target is up to date.
Changes applied before test
commit 55ae87b13c3926e0feb9cb0696db3fff0134c7ed
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 21 17:55:07 2022 +0100

    server: Use xml.etree.ElementTree instead of nested dicts internally
    
    This commit does not touch the external API though; ie. `metadata_dict`
    is still present in the JSON API, and the equivalent `jsonb` field remains
    in the database. They will probably be removed in a future commit
    because they are not very useful, though.
    
    Rationale:
    
    I find xmltodict's approach of translating XML tree to native structures
    to be intrinsically flawed for non-trivial handling of XML, because the
    data structure is:
    
    * implementation-defined (by xmltodict, which is python-only) and it may
      change across versions
    * does not intrinsically store namespaces, and relies on an internal
      prefix map  (though it isn't much of an issue right now, as we do not need
      composability and all the changed APIs are private)
    * not stable; for example, `<a><b>foo</b></a>` and `<a><b>foo</b><b>bar</b></a>`
      are encoded completely differently (the former is a `Dict[str, str]`,
      the latter is `Dict[str, list]`.
    
    And every operation manipulating this data structure needs to check
    presence, number *and* type on every access. Consider this part of this
    commit for example:
  • swh_deposit = metadata.get("swh:deposit")
  • if not swh_deposit:
  • return None -
  • swh_reference = swh_deposit.get("swh:reference")
  • if not swh_reference:
  • return None -
  • swh_origin = swh_reference.get("swh:origin")
  • if swh_origin:
  • url = swh_origin.get("@url")
  • if url:
  • return url + ref_origin = metadata.find( + "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES + ) + if ref_origin is not None: + return ref_origin.attrib["url"] `

    the use of XPath makes it considerably shorter; and the original version did not even check number/type (ie. it would crash if an element was duplicated).
See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/728/ for more details.