pypi: Use BeautifulSoup for parsing HTML instead of xmltodict
Closed · Public

Authored by anlambert on Feb 5 2021, 2:23 PM.

Details

Summary

Another issue found while retesting the listers locally.

xmltodict now raises an error while trying to parse the HTML content
of the https://pypi.org/simple/ page, see below:

Traceback (most recent call last):
  File "/home/anlambert/.virtualenvs/swh/bin/swh", line 8, in <module>
    sys.exit(main())
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/swh/core/cli/__init__.py", line 135, in main
    return swh(auto_envvar_prefix="SWH")
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/anlambert/swh/swh-environment/swh-lister/swh/lister/cli.py", line 65, in run
    get_lister(lister, **config).run()
  File "/home/anlambert/swh/swh-environment/swh-lister/swh/lister/pattern.py", line 121, in run
    for page in self.get_pages():
  File "/home/anlambert/swh/swh-environment/swh-lister/swh/lister/pypi/lister.py", line 57, in get_pages
    page_xmldict = xmltodict.parse(response.text)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: mismatched tag: line 6, column 4
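
The same class of error can be reproduced in miniature with the standard library alone, since the traceback shows xmltodict delegating to the expat XML parser. The sketch below is illustrative only: the sample markup is made up (it is not the actual PyPI response), the helper name `strict_parse_error` is hypothetical, and the unclosed `<meta>` tag is just one plausible way for valid HTML to be rejected by a strict XML parser.

```python
import xml.parsers.expat
from typing import Optional

# Made-up sample page: valid HTML, but not well-formed XML because
# the <meta> element is never closed.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <meta name="generator" content="example">
    <title>Simple index</title>
  </head>
  <body>
    <a href="/simple/requests/">requests</a>
  </body>
</html>
"""


def strict_parse_error(markup: str) -> Optional[str]:
    """Parse markup with expat; return the error message, or None on success."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(markup, True)
    except xml.parsers.expat.ExpatError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(strict_parse_error(SAMPLE_HTML))
```

Closing the `<meta>` element (for example `<meta ... />`) would make this particular sample parse, which is why such a lister only works for as long as the served page happens to be well-formed XML.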

So use the BeautifulSoup HTML parser instead, as it is already a requirement
of swh-lister and it does not fail when parsing the PyPI HTML page.

Also drop the no longer used xmltodict from the requirements.
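
As a rough illustration of the approach described in this summary, the sketch below extracts package names from a simple-index style page with BeautifulSoup. It is not the code from this diff: the sample markup, the function name `list_package_names`, and the choice of the built-in `html.parser` backend are assumptions made for the example.

```python
from typing import List

from bs4 import BeautifulSoup

# Made-up sample page imitating a "simple index" listing; the unclosed
# <meta> tag is fine for an HTML parser.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <meta name="generator" content="example">
    <title>Simple index</title>
  </head>
  <body>
    <a href="/simple/requests/">requests</a>
    <a href="/simple/beautifulsoup4/">beautifulsoup4</a>
  </body>
</html>
"""


def list_package_names(html: str) -> List[str]:
    """Return the text of every <a> element, in document order."""
    soup = BeautifulSoup(html, features="html.parser")
    return [anchor.text for anchor in soup.find_all("a")]


if __name__ == "__main__":
    for name in list_package_names(SAMPLE_HTML):
        print(name)
```

Because the parser is lenient about HTML constructs such as void elements, the listing keeps working even when the page is not well-formed XML.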

Diff Detail

Repository
rDLS Listers
Branch
pypi-use-bs4
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 19038
Build 29506: Phabricator diff pipeline on jenkins
Build 29505: arc lint + arc unit

Event Timeline

Fix typos in commit message

Build is green

Patch application report for D5027 (id=17918)

Rebasing onto 4245c5046f...

Current branch diff-target is up to date.
Changes applied before test
commit c4fc7fa6e21ca182491273376fa36e6ff699b280
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 5 14:17:32 2021 +0100

    pypi: Use BeautifulSoup for parsing HTML instead of xmltodict
    
    xmltodict now raises an error while trying to parse the HTML content
    of https://pypi.org/simple/ page.
    
    So use BeautifulSoup HTML parser instead as it is aleady a requirement
    of swh-lister and it does fail parsing the PyPI HTML page.
    
    Also drop no longer used xmltdict in requirements.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/243/ for more details.

Build is green

Patch application report for D5027 (id=17919)

Rebasing onto 4245c5046f...

Current branch diff-target is up to date.
Changes applied before test
commit 2461c97bbbc430f5119968fc10c97f7b0cc60417
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 5 14:17:32 2021 +0100

    pypi: Use BeautifulSoup for parsing HTML instead of xmltodict
    
    xmltodict now raises an error while trying to parse the HTML content
    of https://pypi.org/simple/ page.
    
    So use BeautifulSoup HTML parser instead as it is aleady a requirement
    of swh-lister and it does not fail parsing the PyPI HTML page.
    
    Also drop no longer used xmltodict in requirements.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/244/ for more details.

ardumont added a subscriber: ardumont.

lgtm

That does unify with the cgit implem ;)

This revision is now accepted and ready to land. Feb 5 2021, 2:28 PM