
maven: Use BeautifulSoup instead of xmltodict for parsing pom files
Closed · Public

Authored by anlambert on Aug 8 2022, 4:38 PM.

Details

Summary

xmltodict cannot parse POM files with multi-byte encoding so prefer to
use the XML parser of BeautifulSoup based on lxml instead.

Also drop xmltodict requirement as it is no longer used in swh-lister
codebase.

Fixes SWH-LISTER-69
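
To illustrate the change described in the summary, here is a minimal sketch of reading a POM with the XML parser of BeautifulSoup. It is not the code from this diff: the function name `scm_connection`, the sample POM, and the choice of GBK as the multi-byte encoding are assumptions made for the example. It assumes `beautifulsoup4` and `lxml` are installed.

```python
from typing import Optional

from bs4 import BeautifulSoup  # the "xml" feature relies on lxml being installed

# Hypothetical POM, declared and encoded as GBK (a multi-byte encoding).
POM_BYTES = (
    '<?xml version="1.0" encoding="GBK"?>\n'
    '<project xmlns="http://maven.apache.org/POM/4.0.0">\n'
    "  <scm>\n"
    "    <connection>scm:git:git://example.org/demo.git</connection>\n"
    "  </scm>\n"
    "  <name>示例项目</name>\n"
    "</project>\n"
).encode("gbk")


def scm_connection(pom: bytes) -> Optional[str]:
    """Return the text of <project><scm><connection>, or None if absent."""
    # Pass raw bytes so the parser can honour the encoding declared in the file.
    soup = BeautifulSoup(pom, "xml")
    project = soup.find("project")
    if project is None:
        return None
    scm = project.find("scm")
    if scm is None:
        return None
    connection = scm.find("connection")
    if connection is None:
        return None
    return connection.get_text(strip=True)


print(scm_connection(POM_BYTES))
```

Passing the undecoded bytes, rather than a pre-decoded string, leaves encoding detection to the parser, which is the point of the switch described above.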

Diff Detail

Repository
rDLS Listers
Branch
maven-fix-multi-byte-encoding-pom-parsing
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 30717
Build 48026: Phabricator diff pipeline on Jenkins
Build 48025: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8217 (id=29631)

Rebasing onto 751c3df1b7...

Current branch diff-target is up to date.
Changes applied before test
commit 547825097a914acfcc3e6282da0ebdb12ad16eb9
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Mon Aug 8 16:30:47 2022 +0200

    maven: Use BeautifulSoup instead of xmltodict for parsing pom files
    
    xmltodict cannot parse POM files with multi-byte encoding so prefer to
    use the XML parser of BeautifulSoup based on lxml instead.
    
    Also drop xmltodict requirement as it is no longer used in swh-lister
    codebase.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/575/ for more details.

Bleh, I was about to suggest using lxml directly instead of BeautifulSoup, but it seems plenty of projects have incorrect xmlns declarations, so BeautifulSoup ignoring namespaces is actually a blessing here.
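
The namespace point can be sketched as follows. This is only an illustration under stated assumptions: the sample POM and its wrong `https` namespace URI are hypothetical, and the helper names `find_scm_with_lxml` and `find_scm_with_bs4` are invented for the example. It assumes `beautifulsoup4` and `lxml` are installed.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Namespace URI that POM files are expected to declare.
POM_NS = "http://maven.apache.org/POM/4.0.0"

# Hypothetical POM with an incorrect xmlns (note the "https" scheme).
BAD_NS_POM = (
    b'<project xmlns="https://maven.apache.org/POM/4.0.0">'
    b"<scm><url>https://example.org/demo</url></scm>"
    b"</project>"
)


def find_scm_with_lxml(pom: bytes):
    """Namespace-aware lookup: only matches elements in POM_NS."""
    root = etree.fromstring(pom)
    return root.find(f"{{{POM_NS}}}scm")


def find_scm_with_bs4(pom: bytes):
    """Lookup by tag name only, whatever namespace the file declares."""
    return BeautifulSoup(pom, "xml").find("scm")


lxml_scm = find_scm_with_lxml(BAD_NS_POM)
bs4_scm = find_scm_with_bs4(BAD_NS_POM)

# The qualified lookup misses the element because the declared namespace differs.
print(lxml_scm is None)
# The name-based lookup still finds it.
print(bs4_scm.find("url").get_text())
```

With lxml alone, each lookup would need to account for whatever namespace a given POM happens to declare; matching on the bare tag name sidesteps that.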

swh/lister/maven/lister.py, lines 280–281

Catch the proper exception for lxml parsing errors.

This revision is now accepted and ready to land. (Aug 9 2022, 11:09 AM)

Build is green

Patch application report for D8217 (id=29648)

Rebasing onto d51bce0a1c...

First, rewinding head to replay your work on top of it...
Applying: maven: Use BeautifulSoup instead of xmltodict for parsing pom files
Changes applied before test
commit c01b41f49bf14812fdad5f7c098429f6df43be6b
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Mon Aug 8 16:30:47 2022 +0200


See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/578/ for more details.

Build is green

Patch application report for D8217 (id=29649)

Rebasing onto d51bce0a1c...

Current branch diff-target is up to date.
Changes applied before test
commit cee6bcb514ecbed5a039c4255a97c0533d7c2e9e
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Mon Aug 8 16:30:47 2022 +0200


See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/579/ for more details.