
pypi: Use BeautifulSoup for parsing HTML instead of xmltodict
Closed, Public

Authored by anlambert on Feb 5 2021, 2:23 PM.

Details

Summary

Another issue found while retesting the listers locally.

xmltodict now raises an error while trying to parse the HTML content
of the https://pypi.org/simple/ page; see below:

Traceback (most recent call last):
  File "/home/anlambert/.virtualenvs/swh/bin/swh", line 8, in <module>
    sys.exit(main())
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/swh/core/cli/__init__.py", line 135, in main
    return swh(auto_envvar_prefix="SWH")
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/anlambert/swh/swh-environment/swh-lister/swh/lister/cli.py", line 65, in run
    get_lister(lister, **config).run()
  File "/home/anlambert/swh/swh-environment/swh-lister/swh/lister/pattern.py", line 121, in run
    for page in self.get_pages():
  File "/home/anlambert/swh/swh-environment/swh-lister/swh/lister/pypi/lister.py", line 57, in get_pages
    page_xmldict = xmltodict.parse(response.text)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: mismatched tag: line 6, column 4
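
The failure can be reproduced without network access. This is a minimal sketch, assuming the page started to include an unclosed void element such as `<meta>` in its `<head>`; the actual page content is not shown in this report, so `SAMPLE_HTML` and `expat_error_for` are hypothetical stand-ins. It uses the standard library expat parser directly, which is the parser that raises at the bottom of the traceback above.

```python
import xml.parsers.expat
from typing import Optional

# Hypothetical stand-in for the PyPI simple index page (assumption: the real
# page is not reproduced in this report). The <meta> tag is valid HTML but is
# never closed, which is not well-formed XML.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <meta name="pypi:repository-version" content="1.0">
    <title>Simple index</title>
  </head>
  <body>
    <a href="/simple/requests/">requests</a>
  </body>
</html>
"""


def expat_error_for(text: str) -> Optional[str]:
    """Parse text as XML and return the expat error message, or None on success."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # The HTML variant is rejected by the strict XML parser.
    print(expat_error_for(SAMPLE_HTML))
    # Self-closing the tag makes the same document well-formed XML.
    print(expat_error_for(SAMPLE_HTML.replace('content="1.0">', 'content="1.0"/>')))
```

A strict XML parser rejects any document that relies on HTML void elements, so this kind of breakage can appear whenever the served markup changes, even if the links themselves are untouched.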

So use the BeautifulSoup HTML parser instead, as it is already a requirement
of swh-lister and it does not fail when parsing the PyPI HTML page.

Also drop the no longer used xmltodict from the requirements.
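
For illustration, extracting package names with BeautifulSoup could look like the sketch below. It is not the code from this diff: `SAMPLE_HTML` and `list_package_names` are hypothetical, and the page layout (one `<a>` element per package) is an assumption about the simple index format.

```python
from typing import List

from bs4 import BeautifulSoup

# Hypothetical sample of a simple index page (assumption: one anchor per
# package, plus an unclosed <meta> tag that a strict XML parser would reject).
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <meta name="pypi:repository-version" content="1.0">
    <title>Simple index</title>
  </head>
  <body>
    <a href="/simple/requests/">requests</a>
    <a href="/simple/beautifulsoup4/">beautifulsoup4</a>
  </body>
</html>
"""


def list_package_names(html: str) -> List[str]:
    """Return the text of every anchor found in an HTML page."""
    # "html.parser" selects the parser backed by the Python standard library,
    # which tolerates void elements such as <meta>.
    soup = BeautifulSoup(html, features="html.parser")
    return [link.get_text(strip=True) for link in soup.find_all("a")]


if __name__ == "__main__":
    for name in list_package_names(SAMPLE_HTML):
        print(name)
```

Because the HTML parser is lenient about unclosed tags, the lister would no longer depend on the page being well-formed XML.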

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Fix typos in commit message

Build is green

Patch application report for D5027 (id=17918)

Rebasing onto 4245c5046f...

Current branch diff-target is up to date.
Changes applied before test
commit c4fc7fa6e21ca182491273376fa36e6ff699b280
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 5 14:17:32 2021 +0100

    pypi: Use BeautifulSoup for parsing HTML instead of xmltodict
    
    xmltodict now raises an error while trying to parse the HTML content
    of https://pypi.org/simple/ page.
    
    So use BeautifulSoup HTML parser instead as it is aleady a requirement
    of swh-lister and it does fail parsing the PyPI HTML page.
    
    Also drop no longer used xmltdict in requirements.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/243/ for more details.

Build is green

Patch application report for D5027 (id=17919)

Rebasing onto 4245c5046f...

Current branch diff-target is up to date.
Changes applied before test
commit 2461c97bbbc430f5119968fc10c97f7b0cc60417
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 5 14:17:32 2021 +0100

    pypi: Use BeautifulSoup for parsing HTML instead of xmltodict
    
    xmltodict now raises an error while trying to parse the HTML content
    of https://pypi.org/simple/ page.
    
    So use BeautifulSoup HTML parser instead as it is aleady a requirement
    of swh-lister and it does not fail parsing the PyPI HTML page.
    
    Also drop no longer used xmltodict in requirements.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/244/ for more details.

ardumont added a subscriber: ardumont.

lgtm

That does unify with the cgit implem ;)

This revision is now accepted and ready to land. Feb 5 2021, 2:28 PM