Page MenuHomeSoftware Heritage

Replace xmltodict with ElementTree to handle SWORD documents
ClosedPublic

Authored by vlorentz on Feb 23 2022, 7:03 PM.

Details

Summary

xmltodict was already on the way out for the deposit, and the latest
libexpat security update broke it entirely when dealing with namespaces,
which means we cannot use it until this is addressed.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006317

Functional changes of this commit:

  1. No more writes to the 'metadata' jsonb column in the DB (as it strongly depends on xmltodict)
  2. ServiceDocumentDepositClient always outputs a list of collections, instead of None/dict/List[dict] depending on the number of collections (artefact of using xmltodict, which is replaced by proper parsing)

Diff Detail

Repository
rDDEP Push deposit
Branch
drop-xmltodict
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 27089
Build 42366: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 42365: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D7233 (id=26214)

Could not rebase; Attempt merge onto 5a7d59cb2c...

Merge made by the 'recursive' strategy.
 swh/deposit/api/common.py                          |  39 ++----
 swh/deposit/api/edit.py                            |   7 +-
 swh/deposit/api/private/deposit_read.py            |   2 -
 swh/deposit/cli/client.py                          |   8 +-
 swh/deposit/client.py                              |  80 +++++++++---
 swh/deposit/config.py                              |   1 -
 swh/deposit/parsers.py                             |  10 +-
 swh/deposit/tests/api/conftest.py                  |   3 +-
 .../tests/api/test_collection_add_to_origin.py     |   7 +-
 swh/deposit/tests/api/test_collection_list.py      |  47 ++++---
 swh/deposit/tests/api/test_collection_post_atom.py |  71 +++++-----
 .../tests/api/test_collection_post_binary.py       |  52 +++++---
 .../tests/api/test_collection_post_multipart.py    |  39 +++---
 .../tests/api/test_collection_reuse_slug.py        |  31 +++--
 swh/deposit/tests/api/test_delete.py               |   8 +-
 .../tests/api/test_deposit_private_check.py        |   7 +-
 .../api/test_deposit_private_read_metadata.py      |   7 -
 swh/deposit/tests/api/test_deposit_schedule.py     |  20 ++-
 swh/deposit/tests/api/test_deposit_state.py        |  64 ++++++---
 swh/deposit/tests/api/test_deposit_update_atom.py  |  64 ++++-----
 .../tests/api/test_deposit_update_binary.py        |  19 +--
 swh/deposit/tests/api/test_get_file.py             |  22 ++--
 swh/deposit/tests/api/test_parsers.py              |  97 +++++---------
 swh/deposit/tests/cli/test_client.py               | 144 +++++++++++++++------
 swh/deposit/tests/conftest.py                      |  14 +-
 swh/deposit/tests/test_client_module.py            |   2 +-
 swh/deposit/utils.py                               |  14 --
 27 files changed, 490 insertions(+), 389 deletions(-)
Changes applied before test
commit 443015cc03b5254a539bcd97b96b6a79e426585c
Merge: 5a7d59cb 486eb888
Author: Jenkins user <jenkins@localhost>
Date:   Wed Feb 23 18:03:15 2022 +0000

    Merge branch 'diff-target' into HEAD

commit 486eb888d8f63d28dbeb3a0f549ea37594828bec
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Feb 23 19:00:09 2022 +0100

    Replace xmltodict with ElementTree to handle SWORD documents
    
    xmltodict was already on the way out for the deposit, and the latest
    libexpat security update broke it entirely when dealing with namespaces,
    which means we cannot use it until this is addressed.
    
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006317
    
    Functional changes of this commit:
    
    1. No more writes to the 'metadata' jsonb column in the DB (as it
       strongly depends on xmltodict)
    2. ServiceDocumentDepositClient always outputs a list of collections,
       instead of None/dict/List[dict] depending on the number of
       collections (artefact of using xmltodict, which is replaced by proper
       parsing)

commit 68eb441cb90edbab5b1fadb1b411c106a6c9c70e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Feb 23 13:44:07 2022 +0100

    Remove metadata_dict from the API
    
    No one uses that, and it's redundant, as we provide the original XML

commit 189ba2f7cc9c5b35891b23adf19872b8f230b7e8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Feb 23 13:43:33 2022 +0100

    Remove unnecessary use of BytesIO

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/744/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/744/console

Harbormaster returned this revision to the author for changes because remote builds failed.Feb 23 2022, 7:11 PM
Harbormaster failed remote builds in B27087: Diff 26214!

Build is green

Patch application report for D7233 (id=26215)

Could not rebase; Attempt merge onto 5a7d59cb2c...

Merge made by the 'recursive' strategy.
 swh/deposit/api/common.py                          |  39 ++----
 swh/deposit/api/edit.py                            |   7 +-
 swh/deposit/api/private/deposit_read.py            |   2 -
 swh/deposit/cli/client.py                          |   2 +-
 swh/deposit/client.py                              |  80 +++++++++---
 swh/deposit/config.py                              |   1 -
 swh/deposit/parsers.py                             |  10 +-
 swh/deposit/tests/api/conftest.py                  |   3 +-
 .../tests/api/test_collection_add_to_origin.py     |   7 +-
 swh/deposit/tests/api/test_collection_list.py      |  47 ++++---
 swh/deposit/tests/api/test_collection_post_atom.py |  71 +++++-----
 .../tests/api/test_collection_post_binary.py       |  52 +++++---
 .../tests/api/test_collection_post_multipart.py    |  39 +++---
 .../tests/api/test_collection_reuse_slug.py        |  31 +++--
 swh/deposit/tests/api/test_delete.py               |   8 +-
 .../tests/api/test_deposit_private_check.py        |   7 +-
 .../api/test_deposit_private_read_metadata.py      |   7 -
 swh/deposit/tests/api/test_deposit_schedule.py     |  20 ++-
 swh/deposit/tests/api/test_deposit_state.py        |  64 ++++++---
 swh/deposit/tests/api/test_deposit_update_atom.py  |  64 ++++-----
 .../tests/api/test_deposit_update_binary.py        |  19 +--
 swh/deposit/tests/api/test_get_file.py             |  22 ++--
 swh/deposit/tests/api/test_parsers.py              |  97 +++++---------
 swh/deposit/tests/cli/test_client.py               | 144 +++++++++++++++------
 swh/deposit/tests/conftest.py                      |  14 +-
 swh/deposit/tests/test_client_module.py            |   2 +-
 swh/deposit/utils.py                               |  14 --
 27 files changed, 487 insertions(+), 386 deletions(-)
Changes applied before test
commit abe23026ddccdc59660fc24c25266158a51fbcb6
Merge: 5a7d59cb d3704492
Author: Jenkins user <jenkins@localhost>
Date:   Wed Feb 23 18:16:45 2022 +0000

    Merge branch 'diff-target' into HEAD

commit d370449226af418750903c808a409948a2a4fb71
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Feb 23 19:00:09 2022 +0100

    Replace xmltodict with ElementTree to handle SWORD documents
    
    xmltodict was already on the way out for the deposit, and the latest
    libexpat security update broke it entirely when dealing with namespaces,
    which means we cannot use it until this is addressed.
    
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006317
    
    Functional changes of this commit:
    
    1. No more writes to the 'metadata' jsonb column in the DB (as it
       strongly depends on xmltodict)
    2. ServiceDocumentDepositClient always outputs a list of collections,
       instead of None/dict/List[dict] depending on the number of
       collections (artefact of using xmltodict, which is replaced by proper
       parsing)

commit 68eb441cb90edbab5b1fadb1b411c106a6c9c70e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Feb 23 13:44:07 2022 +0100

    Remove metadata_dict from the API
    
    No one uses that, and it's redundant, as we provide the original XML

commit 189ba2f7cc9c5b35891b23adf19872b8f230b7e8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Feb 23 13:43:33 2022 +0100

    Remove unnecessary use of BytesIO

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/745/ for more details.

Build is green

Patch application report for D7233 (id=26216)

Rebasing onto 5a7d59cb2c...

First, rewinding head to replay your work on top of it...
Applying: Remove unnecessary use of BytesIO
Applying: Remove metadata_dict from the API
Applying: Replace xmltodict with ElementTree to handle SWORD documents
Changes applied before test
commit 11bbdc6e26ef61e43f4ac58997e7b770706eee03
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Feb 23 19:00:09 2022 +0100

    Replace xmltodict with ElementTree to handle SWORD documents
    
    xmltodict was already on the way out for the deposit, and the latest
    libexpat security update broke it entirely when dealing with namespaces,
    which means we cannot use it until this is addressed.
    
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006317
    
    Functional changes of this commit:
    
    1. No more writes to the 'metadata' jsonb column in the DB (as it
       strongly depends on xmltodict)
    2. ServiceDocumentDepositClient always outputs a list of collections,
       instead of None/dict/List[dict] depending on the number of
       collections (artefact of using xmltodict, which is replaced by proper
       parsing)

commit 16ee56352bac1bae78620034800dce9427d4cf3b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Feb 23 13:44:07 2022 +0100

    Remove metadata_dict from the API
    
    No one uses that, and it's redundant, as we provide the original XML

commit 149317dc3c1ecbaaa795547bd5331e0bddfab2f8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Feb 23 13:43:33 2022 +0100

    Remove unnecessary use of BytesIO

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/746/ for more details.

ardumont added a subscriber: ardumont.

lgtm

swh/deposit/client.py
437

i'm gonna believe you on that one ;)

swh/deposit/tests/api/test_deposit_update_atom.py
581

clearer ;)

This revision is now accepted and ready to land.Feb 23 2022, 8:19 PM

This needs a rebase (it's missing the last commit from master but nothing is to be modified after it).

Can you please rebase, update and push so that i can rebase my last [1] 2 deposit diffs [2] on it?

[1] I should be done for the deposit modifications

[2] D7227 D7228

Build is green

Patch application report for D7233 (id=26222)

Could not rebase; Attempt merge onto 5a7d59cb2c...

Updating 5a7d59cb..cf92b5f2
Fast-forward
 swh/deposit/api/common.py                          |  39 ++----
 swh/deposit/api/edit.py                            |   7 +-
 swh/deposit/api/private/deposit_read.py            |   2 -
 swh/deposit/cli/client.py                          |   2 +-
 swh/deposit/client.py                              |  78 ++++++++---
 swh/deposit/config.py                              |   1 -
 swh/deposit/parsers.py                             |  10 +-
 swh/deposit/tests/api/conftest.py                  |   3 +-
 .../tests/api/test_collection_add_to_origin.py     |   7 +-
 swh/deposit/tests/api/test_collection_list.py      |  47 ++++---
 swh/deposit/tests/api/test_collection_post_atom.py |  71 +++++-----
 .../tests/api/test_collection_post_binary.py       |  52 +++++---
 .../tests/api/test_collection_post_multipart.py    |  39 +++---
 .../tests/api/test_collection_reuse_slug.py        |  31 +++--
 swh/deposit/tests/api/test_delete.py               |   8 +-
 .../tests/api/test_deposit_private_check.py        |   7 +-
 .../api/test_deposit_private_read_metadata.py      |   7 -
 swh/deposit/tests/api/test_deposit_schedule.py     |  20 ++-
 swh/deposit/tests/api/test_deposit_state.py        |  64 ++++++---
 swh/deposit/tests/api/test_deposit_update_atom.py  |  64 ++++-----
 .../tests/api/test_deposit_update_binary.py        |  19 +--
 swh/deposit/tests/api/test_get_file.py             |  22 ++--
 swh/deposit/tests/api/test_parsers.py              |  97 +++++---------
 swh/deposit/tests/cli/test_client.py               | 144 +++++++++++++++------
 swh/deposit/tests/conftest.py                      |  14 +-
 swh/deposit/tests/test_client_module.py            |   2 +-
 swh/deposit/utils.py                               |  14 --
 27 files changed, 485 insertions(+), 386 deletions(-)
Changes applied before test
commit cf92b5f2feb6e9ca3e071e98922417fd530761c3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Feb 23 19:00:09 2022 +0100

    Replace xmltodict with ElementTree to handle SWORD documents
    
    xmltodict was already on the way out for the deposit, and the latest
    libexpat security update broke it entirely when dealing with namespaces,
    which means we cannot use it until this is addressed.
    
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006317
    
    Functional changes of this commit:
    
    1. No more writes to the 'metadata' jsonb column in the DB (as it
       strongly depends on xmltodict)
    2. ServiceDocumentDepositClient always outputs a list of collections,
       instead of None/dict/List[dict] depending on the number of
       collections (artefact of using xmltodict, which is replaced by proper
       parsing)

commit 5998e19648f4a987dc024935f20d7e0adae31dc8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Feb 23 13:44:07 2022 +0100

    Remove metadata_dict from the API
    
    No one uses that, and it's redundant, as we provide the original XML

commit 79be8e1c74e15607d4ced35fc2f2de1359c02ecb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Feb 23 13:43:33 2022 +0100

    Remove unnecessary use of BytesIO

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/750/ for more details.