Details

Reviewers

Group Reviewers

Commits

rDDEP55ae87b13c39: server: Use xml.etree.ElementTree instead of nested dicts internally

Summary

This commit does not touch the external API though; ie. metadata_dict
is still present in the JSON API, and the equivalent jsonb field remains
in the database. They will probably be removed in a future commit
because they are not very useful, though.

Rationale:

I find xmltodict's approach of translating XML tree to native structures
to be intrinsically flawed for non-trivial handling of XML, because the
data structure is:

implementation-defined (by xmltodict, which is python-only) and it may change across versions
does not intrinsically store namespaces, and relies on an internal prefix map (though it isn't much of an issue right now, as we do not need composability and all the changed APIs are private)
not stable; for example, <a>foo</a> and <a>foobar</a> are encoded completely differently (the former is a Dict[str, str], the latter is Dict[str, list].

And every operation manipulating this data structure needs to check
presence, number *and* type on every access. Consider this part of this
commit for example:

-    swh_deposit = metadata.get("swh:deposit")
-    if not swh_deposit:
-        return None
-
-    swh_reference = swh_deposit.get("swh:reference")
-    if not swh_reference:
-        return None
-
-    swh_origin = swh_reference.get("swh:origin")
-    if swh_origin:
-        url = swh_origin.get("@url")
-        if url:
-            return url
+    ref_origin = metadata.find(
+        "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES
+    )
+    if ref_origin is not None:
+        return ref_origin.attrib["url"]

the use of XPath makes it considerably shorter; and the original version
did not even check number/type (ie. it would crash if an element was
duplicated).

Diff Detail

Repository

rDDEP Push deposit

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision.Feb 21 2022, 6:10 PM

Herald added a reviewer: Reviewers. · View Herald TranscriptFeb 21 2022, 6:10 PM

Build is green

Patch application report for D7215 (id=26145)

Could not rebase; Attempt merge onto 40adc8c23b...

Updating 40adc8c2..bcfdd332
Fast-forward
 swh/deposit/api/checks.py                          |  26 +--
 swh/deposit/api/common.py                          |  84 ++++---
 swh/deposit/api/edit.py                            |   9 +-
 swh/deposit/api/private/__init__.py                |  24 +-
 swh/deposit/api/private/deposit_check.py           |  22 +-
 swh/deposit/api/private/deposit_read.py            |  58 +++--
 swh/deposit/cli/client.py                          |   6 +-
 swh/deposit/tests/api/test_checks.py               | 248 +++++++++++++--------
 .../tests/api/test_collection_add_to_origin.py     |   4 +-
 .../tests/api/test_collection_post_multipart.py    |   1 -
 .../tests/api/test_deposit_private_check.py        |   9 +-
 .../api/test_deposit_private_read_metadata.py      |  80 +++----
 swh/deposit/tests/cli/test_client.py               |   5 +-
 .../atom/entry-data-with-metadata-provenance.xml   |   2 +-
 swh/deposit/tests/data/atom/entry-data2.xml        |   6 +
 swh/deposit/tests/data/atom/entry-data3.xml        |   8 +-
 swh/deposit/tests/test_utils.py                    | 103 +--------
 swh/deposit/utils.py                               | 126 +++--------
 18 files changed, 378 insertions(+), 443 deletions(-)

Changes applied before test

commit bcfdd3328e8b2cd3f11013eee1bd1a6c82136a11
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 21 17:55:07 2022 +0100

    server: Use xml.etree.ElementTree instead of nested dicts internally
    
    This commit does not touch the external API though; ie. `metadata_dict`
    is still present in the JSON API, and the equivalent `jsonb` field remains
    in the database. They will probably be removed in a future commit
    because they are not very useful, though.
    
    Rationale:
    
    I find xmltodict's approach of translating XML tree to native structures
    to be intrinsically flawed for non-trivial handling of XML, because the
    data structure is:
    
    * implementation-defined (by xmltodict, which is python-only) and it may
      change across versions
    * does not intrinsically store namespaces, and relies on an internal
      prefix map  (though it isn't much of an issue right now, as we do not need
      composability and all the changed APIs are private)
    * not stable; for example, `<a><b>foo</b></a>` and `<a><b>foo</b><b>bar</b></a>`
      are encoded completely differently (the former is a `Dict[str, str]`,
      the latter is `Dict[str, list]`.
    
    And every operation manipulating this data structure needs to check
    presence, number *and* type on every access. Consider this part of this
    commit for example:

swh_deposit = metadata.get("swh:deposit")
if not swh_deposit:
return None -
swh_reference = swh_deposit.get("swh:reference")
if not swh_reference:
return None -
swh_origin = swh_reference.get("swh:origin")
if swh_origin:
url = swh_origin.get("@url")
if url:
return url + ref_origin = metadata.find( + "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES + ) + if ref_origin is not None: + return ref_origin.attrib["url"] `

the use of XPath makes it considerably shorter; and the original version did not even check number/type (ie. it would crash if an element was duplicated).

commit efc14e494507c10384fd8509c990e4a3d6623a3e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 17:06:55 2022 +0100

Fix URI of schema.org

Either is valid according to https://schema.org/docs/gs.html ;
but we need to pick one, as they are opaque identifiers.
And codemeta chose http:// (because it was the only one to be
valid back then), so we should stick to this one.

commit a25afe62bd1ea0452ef2e395575d1ec64382f6b4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 16:37:49 2022 +0100

Remove metadata merging; use only the latest document

We don't use that feature at all as far as I am aware.

I also find that it complicates any metadata handling (especially the validation
I would like to add in the near future), and probably does not match semantics
intended by SWORD (merging occurs on PUT requests, as we don't implement PATCH)

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/717/ for more details.

Harbormaster completed remote builds in B27017: Diff 26145.Feb 21 2022, 6:14 PM

vlorentz requested review of this revision.Feb 21 2022, 6:14 PM

lgtm, request changes for the sake of some suggestions/remarks inline.

You are missing some types on (some) impacted methods and some copyright bumps as well.

swh/deposit/api/checks.py
72–84	less duplication.
swh/deposit/api/common.py
320	yeah, it's more for our maintenance use case now. Reading data immediately without having to juggle with another (metadata storage) backend. Given the data volumetry, that's not a problem for now though.
662–666	out of order given the parameter definition in the method.
swh/deposit/api/private/__init__.py
38–40	out of sync, suggestion above ^.
swh/deposit/api/private/deposit_read.py
110–112	or something ^
115–116	we can drop the types in the docstring if we add the type to the method ;)
204	Relatedly to your question in comment above (about dropping that dict), it's still is used in the deposit loader which consumes this... (i think). So first, if you still want to drop it, we need to make sure it's not needed there.

This revision now requires changes to proceed.Feb 22 2022, 11:01 AM

rebase

Build has FAILED

Patch application report for D7215 (id=26157)

Could not rebase; Attempt merge onto 770cc0f515...

Updating 770cc0f5..20b7768e
Fast-forward
 swh/deposit/api/checks.py                          |  24 +-
 swh/deposit/api/common.py                          |  84 ++++---
 swh/deposit/api/edit.py                            |   9 +-
 swh/deposit/api/private/__init__.py                |  24 +-
 swh/deposit/api/private/deposit_check.py           |  21 +-
 swh/deposit/api/private/deposit_read.py            |  58 +++--
 swh/deposit/cli/client.py                          |   6 +-
 swh/deposit/tests/api/test_checks.py               | 249 +++++++++++++--------
 .../tests/api/test_collection_add_to_origin.py     |   4 +-
 .../tests/api/test_collection_post_multipart.py    |   1 -
 .../tests/api/test_deposit_private_check.py        |  13 +-
 .../api/test_deposit_private_read_metadata.py      |  80 +++----
 swh/deposit/tests/cli/test_client.py               |   5 +-
 .../atom/entry-data-with-metadata-provenance.xml   |   2 +-
 swh/deposit/tests/data/atom/entry-data2.xml        |   6 +
 swh/deposit/tests/data/atom/entry-data3.xml        |   8 +-
 swh/deposit/tests/test_utils.py                    | 103 +--------
 swh/deposit/utils.py                               | 126 +++--------
 18 files changed, 385 insertions(+), 438 deletions(-)

Changes applied before test

commit 20b7768e6d33a4dcb9f89dabca14b701a63d2fdf
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 21 17:55:07 2022 +0100

    server: Use xml.etree.ElementTree instead of nested dicts internally
    
    This commit does not touch the external API though; ie. `metadata_dict`
    is still present in the JSON API, and the equivalent `jsonb` field remains
    in the database. They will probably be removed in a future commit
    because they are not very useful, though.
    
    Rationale:
    
    I find xmltodict's approach of translating XML tree to native structures
    to be intrinsically flawed for non-trivial handling of XML, because the
    data structure is:
    
    * implementation-defined (by xmltodict, which is python-only) and it may
      change across versions
    * does not intrinsically store namespaces, and relies on an internal
      prefix map  (though it isn't much of an issue right now, as we do not need
      composability and all the changed APIs are private)
    * not stable; for example, `<a><b>foo</b></a>` and `<a><b>foo</b><b>bar</b></a>`
      are encoded completely differently (the former is a `Dict[str, str]`,
      the latter is `Dict[str, list]`.
    
    And every operation manipulating this data structure needs to check
    presence, number *and* type on every access. Consider this part of this
    commit for example:

swh_deposit = metadata.get("swh:deposit")
if not swh_deposit:
return None -
swh_reference = swh_deposit.get("swh:reference")
if not swh_reference:
return None -
swh_origin = swh_reference.get("swh:origin")
if swh_origin:
url = swh_origin.get("@url")
if url:
return url + ref_origin = metadata.find( + "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES + ) + if ref_origin is not None: + return ref_origin.attrib["url"] `

the use of XPath makes it considerably shorter; and the original version did not even check number/type (ie. it would crash if an element was duplicated).

commit a10ed57bf8e46f404500cb1bec7bd94729eeb513
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 17:06:55 2022 +0100

Fix URI of schema.org

Either is valid according to https://schema.org/docs/gs.html ;
but we need to pick one, as they are opaque identifiers.
And codemeta chose http:// (because it was the only one to be
valid back then), so we should stick to this one.

commit 7727a9c0c8b8d40a8914ae74760a3bf7d35f255b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 16:37:49 2022 +0100

Remove metadata merging; use only the latest document

We don't use that feature at all as far as I am aware.

I also find that it complicates any metadata handling (especially the validation
I would like to add in the near future), and probably does not match semantics
intended by SWORD (merging occurs on PUT requests, as we don't implement PATCH)

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/723/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/723/console

Harbormaster failed remote builds in B27029: Diff 26157!Feb 22 2022, 11:31 AM

fix lint

Build is green

Patch application report for D7215 (id=26158)

Could not rebase; Attempt merge onto 770cc0f515...

Updating 770cc0f5..26fc730b
Fast-forward
 swh/deposit/api/checks.py                          |  24 +-
 swh/deposit/api/common.py                          |  84 ++++---
 swh/deposit/api/edit.py                            |   9 +-
 swh/deposit/api/private/__init__.py                |  24 +-
 swh/deposit/api/private/deposit_check.py           |  21 +-
 swh/deposit/api/private/deposit_read.py            |  58 +++--
 swh/deposit/cli/client.py                          |   6 +-
 swh/deposit/tests/api/test_checks.py               | 249 +++++++++++++--------
 .../tests/api/test_collection_add_to_origin.py     |   4 +-
 .../tests/api/test_collection_post_multipart.py    |   1 -
 .../tests/api/test_deposit_private_check.py        |  19 +-
 .../api/test_deposit_private_read_metadata.py      |  80 +++----
 swh/deposit/tests/cli/test_client.py               |   5 +-
 .../atom/entry-data-with-metadata-provenance.xml   |   2 +-
 swh/deposit/tests/data/atom/entry-data2.xml        |   6 +
 swh/deposit/tests/data/atom/entry-data3.xml        |   8 +-
 swh/deposit/tests/test_utils.py                    | 103 +--------
 swh/deposit/utils.py                               | 126 +++--------
 18 files changed, 386 insertions(+), 443 deletions(-)

Changes applied before test

commit 26fc730b43382f66cd705c0b49ce9ba2e6c3bb11
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 21 17:55:07 2022 +0100

    server: Use xml.etree.ElementTree instead of nested dicts internally
    
    This commit does not touch the external API though; ie. `metadata_dict`
    is still present in the JSON API, and the equivalent `jsonb` field remains
    in the database. They will probably be removed in a future commit
    because they are not very useful, though.
    
    Rationale:
    
    I find xmltodict's approach of translating XML tree to native structures
    to be intrinsically flawed for non-trivial handling of XML, because the
    data structure is:
    
    * implementation-defined (by xmltodict, which is python-only) and it may
      change across versions
    * does not intrinsically store namespaces, and relies on an internal
      prefix map  (though it isn't much of an issue right now, as we do not need
      composability and all the changed APIs are private)
    * not stable; for example, `<a><b>foo</b></a>` and `<a><b>foo</b><b>bar</b></a>`
      are encoded completely differently (the former is a `Dict[str, str]`,
      the latter is `Dict[str, list]`.
    
    And every operation manipulating this data structure needs to check
    presence, number *and* type on every access. Consider this part of this
    commit for example:

swh_deposit = metadata.get("swh:deposit")
if not swh_deposit:
return None -
swh_reference = swh_deposit.get("swh:reference")
if not swh_reference:
return None -
swh_origin = swh_reference.get("swh:origin")
if swh_origin:
url = swh_origin.get("@url")
if url:
return url + ref_origin = metadata.find( + "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES + ) + if ref_origin is not None: + return ref_origin.attrib["url"] `

the use of XPath makes it considerably shorter; and the original version did not even check number/type (ie. it would crash if an element was duplicated).

commit a10ed57bf8e46f404500cb1bec7bd94729eeb513
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 17:06:55 2022 +0100

Fix URI of schema.org

Either is valid according to https://schema.org/docs/gs.html ;
but we need to pick one, as they are opaque identifiers.
And codemeta chose http:// (because it was the only one to be
valid back then), so we should stick to this one.

commit 7727a9c0c8b8d40a8914ae74760a3bf7d35f255b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 16:37:49 2022 +0100

Remove metadata merging; use only the latest document

We don't use that feature at all as far as I am aware.

I also find that it complicates any metadata handling (especially the validation
I would like to add in the near future), and probably does not match semantics
intended by SWORD (merging occurs on PUT requests, as we don't implement PATCH)

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/724/ for more details.

Harbormaster completed remote builds in B27030: Diff 26158.Feb 22 2022, 11:38 AM

update docstrings and types as requested

Build is green

Patch application report for D7215 (id=26160)

Could not rebase; Attempt merge onto 770cc0f515...

Updating 770cc0f5..0c4afc28
Fast-forward
 swh/deposit/api/checks.py                          |  24 +-
 swh/deposit/api/common.py                          |  84 ++++---
 swh/deposit/api/edit.py                            |   9 +-
 swh/deposit/api/private/__init__.py                |  28 +--
 swh/deposit/api/private/deposit_check.py           |  21 +-
 swh/deposit/api/private/deposit_read.py            |  69 +++---
 swh/deposit/cli/client.py                          |   6 +-
 swh/deposit/tests/api/test_checks.py               | 249 +++++++++++++--------
 .../tests/api/test_collection_add_to_origin.py     |   4 +-
 .../tests/api/test_collection_post_multipart.py    |   1 -
 .../tests/api/test_deposit_private_check.py        |  19 +-
 .../api/test_deposit_private_read_metadata.py      |  80 +++----
 swh/deposit/tests/cli/test_client.py               |   5 +-
 .../atom/entry-data-with-metadata-provenance.xml   |   2 +-
 swh/deposit/tests/data/atom/entry-data2.xml        |   6 +
 swh/deposit/tests/data/atom/entry-data3.xml        |   8 +-
 swh/deposit/tests/test_utils.py                    | 103 +--------
 swh/deposit/utils.py                               | 126 +++--------
 18 files changed, 393 insertions(+), 451 deletions(-)

Changes applied before test

commit 0c4afc28d32cf1c1f5f2984af498ec2d9aec985a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 21 17:55:07 2022 +0100

    server: Use xml.etree.ElementTree instead of nested dicts internally
    
    This commit does not touch the external API though; ie. `metadata_dict`
    is still present in the JSON API, and the equivalent `jsonb` field remains
    in the database. They will probably be removed in a future commit
    because they are not very useful, though.
    
    Rationale:
    
    I find xmltodict's approach of translating XML tree to native structures
    to be intrinsically flawed for non-trivial handling of XML, because the
    data structure is:
    
    * implementation-defined (by xmltodict, which is python-only) and it may
      change across versions
    * does not intrinsically store namespaces, and relies on an internal
      prefix map  (though it isn't much of an issue right now, as we do not need
      composability and all the changed APIs are private)
    * not stable; for example, `<a><b>foo</b></a>` and `<a><b>foo</b><b>bar</b></a>`
      are encoded completely differently (the former is a `Dict[str, str]`,
      the latter is `Dict[str, list]`.
    
    And every operation manipulating this data structure needs to check
    presence, number *and* type on every access. Consider this part of this
    commit for example:

swh_deposit = metadata.get("swh:deposit")
if not swh_deposit:
return None -
swh_reference = swh_deposit.get("swh:reference")
if not swh_reference:
return None -
swh_origin = swh_reference.get("swh:origin")
if swh_origin:
url = swh_origin.get("@url")
if url:
return url + ref_origin = metadata.find( + "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES + ) + if ref_origin is not None: + return ref_origin.attrib["url"] `

the use of XPath makes it considerably shorter; and the original version did not even check number/type (ie. it would crash if an element was duplicated).

commit a10ed57bf8e46f404500cb1bec7bd94729eeb513
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 17:06:55 2022 +0100

Fix URI of schema.org

Either is valid according to https://schema.org/docs/gs.html ;
but we need to pick one, as they are opaque identifiers.
And codemeta chose http:// (because it was the only one to be
valid back then), so we should stick to this one.

commit 7727a9c0c8b8d40a8914ae74760a3bf7d35f255b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Feb 21 16:37:49 2022 +0100

Remove metadata merging; use only the latest document

We don't use that feature at all as far as I am aware.

I also find that it complicates any metadata handling (especially the validation
I would like to add in the near future), and probably does not match semantics
intended by SWORD (merging occurs on PUT requests, as we don't implement PATCH)

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/725/ for more details.

Harbormaster completed remote builds in B27033: Diff 26160.Feb 22 2022, 11:52 AM

vlorentz added inline comments.Feb 22 2022, 2:11 PM

swh/deposit/api/checks.py
72–84	let's do it in a future diff, this one is already big enough

ardumont accepted this revision.Feb 22 2022, 2:27 PM

ardumont added inline comments.

swh/deposit/api/checks.py
72–84	ok
swh/deposit/tests/api/test_deposit_private_check.py
143	? was the existing test off somehow?

This revision is now accepted and ready to land.Feb 22 2022, 2:27 PM

vlorentz added inline comments.Feb 22 2022, 2:55 PM

swh/deposit/tests/api/test_deposit_private_check.py

143

it matches a new error message I introduced when there is no XML to parse at all (which came from merge_metadata being essentially a fold with {} as starting element, so it returned {} when the list of metadata files was empty):

-        metadata_status_ok, details = check_metadata(metadata)
-        # Ensure in case of error, we do have the rejection details
-        assert metadata_status_ok or (not metadata_status_ok and details is not None)
-        # we can have warnings even if checks are ok (e.g. missing suggested field)
-        details_dict.update(details or {})
+        if raw_metadata is None:
+            metadata_status_ok = False
+            details_dict["metadata"] = [{"summary": "Missing Atom document"}]
+        else:
+            metadata_tree = ElementTree.fromstring(raw_metadata)
+            metadata_status_ok, details = check_metadata(metadata_tree)
+            # Ensure in case of error, we do have the rejection details
+            assert metadata_status_ok or (
+                not metadata_status_ok and details is not None
+            )
+            # we can have warnings even if checks are ok (e.g. missing suggested field)
+            details_dict.update(details or {})

vlorentz added inline comments.Feb 22 2022, 3:13 PM

swh/deposit/tests/api/test_deposit_private_check.py
143	I mean the previous lack of error came from merge_metadata being a fold

ardumont added inline comments.Feb 22 2022, 3:15 PM

swh/deposit/tests/api/test_deposit_private_check.py
143	ack, thx. i had noticed the change diff you pointed out, i did not relate the 2.

rebase

Build has FAILED

Patch application report for D7215 (id=26176)

Rebasing onto b9f565aaa3...

Current branch diff-target is up to date.

Changes applied before test

commit 81849f818a66368c80d9f65590ea839f19e90a63
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 21 17:55:07 2022 +0100

    server: Use xml.etree.ElementTree instead of nested dicts internally
    
    This commit does not touch the external API though; ie. `metadata_dict`
    is still present in the JSON API, and the equivalent `jsonb` field remains
    in the database. They will probably be removed in a future commit
    because they are not very useful, though.
    
    Rationale:
    
    I find xmltodict's approach of translating XML tree to native structures
    to be intrinsically flawed for non-trivial handling of XML, because the
    data structure is:
    
    * implementation-defined (by xmltodict, which is python-only) and it may
      change across versions
    * does not intrinsically store namespaces, and relies on an internal
      prefix map  (though it isn't much of an issue right now, as we do not need
      composability and all the changed APIs are private)
    * not stable; for example, `<a><b>foo</b></a>` and `<a><b>foo</b><b>bar</b></a>`
      are encoded completely differently (the former is a `Dict[str, str]`,
      the latter is `Dict[str, list]`.
    
    And every operation manipulating this data structure needs to check
    presence, number *and* type on every access. Consider this part of this
    commit for example:

swh_deposit = metadata.get("swh:deposit")
if not swh_deposit:
return None -
swh_reference = swh_deposit.get("swh:reference")
if not swh_reference:
return None -
swh_origin = swh_reference.get("swh:origin")
if swh_origin:
url = swh_origin.get("@url")
if url:
return url + ref_origin = metadata.find( + "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES + ) + if ref_origin is not None: + return ref_origin.attrib["url"] `

the use of XPath makes it considerably shorter; and the original version did not even check number/type (ie. it would crash if an element was duplicated).

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/727/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/727/console

Harbormaster failed remote builds in B27047: Diff 26176!Feb 22 2022, 3:23 PM

fix rebase

Build is green

Patch application report for D7215 (id=26177)

Rebasing onto b9f565aaa3...

Current branch diff-target is up to date.

Changes applied before test

commit 55ae87b13c3926e0feb9cb0696db3fff0134c7ed
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 21 17:55:07 2022 +0100

    server: Use xml.etree.ElementTree instead of nested dicts internally
    
    This commit does not touch the external API though; ie. `metadata_dict`
    is still present in the JSON API, and the equivalent `jsonb` field remains
    in the database. They will probably be removed in a future commit
    because they are not very useful, though.
    
    Rationale:
    
    I find xmltodict's approach of translating XML tree to native structures
    to be intrinsically flawed for non-trivial handling of XML, because the
    data structure is:
    
    * implementation-defined (by xmltodict, which is python-only) and it may
      change across versions
    * does not intrinsically store namespaces, and relies on an internal
      prefix map  (though it isn't much of an issue right now, as we do not need
      composability and all the changed APIs are private)
    * not stable; for example, `<a><b>foo</b></a>` and `<a><b>foo</b><b>bar</b></a>`
      are encoded completely differently (the former is a `Dict[str, str]`,
      the latter is `Dict[str, list]`.
    
    And every operation manipulating this data structure needs to check
    presence, number *and* type on every access. Consider this part of this
    commit for example:

swh_deposit = metadata.get("swh:deposit")
if not swh_deposit:
return None -
swh_reference = swh_deposit.get("swh:reference")
if not swh_reference:
return None -
swh_origin = swh_reference.get("swh:origin")
if swh_origin:
url = swh_origin.get("@url")
if url:
return url + ref_origin = metadata.find( + "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES + ) + if ref_origin is not None: + return ref_origin.attrib["url"] `

the use of XPath makes it considerably shorter; and the original version did not even check number/type (ie. it would crash if an element was duplicated).

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/728/ for more details.

Harbormaster completed remote builds in B27048: Diff 26177.Feb 22 2022, 3:28 PM

Closed by commit rDDEP55ae87b13c39: server: Use xml.etree.ElementTree instead of nested dicts internally (authored by vlorentz). · Explain WhyFeb 22 2022, 3:40 PM

This revision was automatically updated to reflect the committed changes.

vlorentz added a commit: rDDEP55ae87b13c39: server: Use xml.etree.ElementTree instead of nested dicts internally.

server: Use xml.etree.ElementTree instead of nested dicts internally
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Patch application report for D7215 (id=26145)

Changes applied before test

Patch application report for D7215 (id=26157)

Changes applied before test

Patch application report for D7215 (id=26158)

Changes applied before test

Patch application report for D7215 (id=26160)

Changes applied before test

Patch application report for D7215 (id=26176)

Changes applied before test

Patch application report for D7215 (id=26177)

Changes applied before test

Revision Contents
Changeset List

Diff 26179

swh/deposit/api/checks.py

swh/deposit/api/common.py

swh/deposit/api/edit.py

swh/deposit/api/private/init.py

swh/deposit/api/private/deposit_check.py

swh/deposit/api/private/deposit_read.py

swh/deposit/cli/client.py

swh/deposit/tests/api/test_checks.py

swh/deposit/tests/api/test_collection_add_to_origin.py

swh/deposit/tests/api/test_collection_post_multipart.py

swh/deposit/tests/api/test_deposit_private_check.py

swh/deposit/tests/cli/test_client.py

swh/deposit/tests/test_utils.py

swh/deposit/utils.py

server: Use xml.etree.ElementTree instead of nested dicts internallyClosedPublicActions

Details

Diff Detail

Event Timeline

Patch application report for D7215 (id=26145)

Changes applied before test

Patch application report for D7215 (id=26157)

Changes applied before test

Patch application report for D7215 (id=26158)

Changes applied before test

Patch application report for D7215 (id=26160)

Changes applied before test

Patch application report for D7215 (id=26176)

Changes applied before test

Patch application report for D7215 (id=26177)

Changes applied before test

Revision ContentsChangeset List

Diff 26179

swh/deposit/api/checks.py

swh/deposit/api/common.py

swh/deposit/api/edit.py

swh/deposit/api/private/__init__.py

swh/deposit/api/private/deposit_check.py

swh/deposit/api/private/deposit_read.py

swh/deposit/cli/client.py

swh/deposit/tests/api/test_checks.py

swh/deposit/tests/api/test_collection_add_to_origin.py

swh/deposit/tests/api/test_collection_post_multipart.py

swh/deposit/tests/api/test_deposit_private_check.py

swh/deposit/tests/cli/test_client.py

swh/deposit/tests/test_utils.py

swh/deposit/utils.py

server: Use xml.etree.ElementTree instead of nested dicts internally
ClosedPublic
Actions

Revision Contents
Changeset List

swh/deposit/api/private/init.py