pypi.client: Improve tarballs download time
Closed, Public

Authored by anlambert on Nov 29 2018, 4:26 PM.

Details

Summary

While working on the npm loader by adapting the code from the PyPI one,
I noticed that the download of tarballs was very slow.

It turned out that this is due to the use of the iter_content method
from the requests response API [1]. By default, that method iterates
over the response content one byte at a time, hence the slow download.

Setting the chunk_size parameter of that method to None reads data
as it arrives, in whatever size the chunks are received, and greatly
speeds up download time.
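
As an illustration only, the change can be sketched as follows. The helper
names write_response and download_tarball are hypothetical and are not taken
from the loader's code. The sketch assumes the request is made with
stream=True; per the requests documentation, with stream=False the body is
returned as a single chunk when chunk_size is None.

```python
def write_response(response, fileobj, chunk_size=None):
    """Copy a response body into fileobj and return the number of bytes written.

    `response` is any object exposing requests' iter_content() method.
    chunk_size=None yields the data in whatever sizes it arrives, whereas
    the iter_content default (chunk_size=1) yields one byte per iteration.
    """
    written = 0
    for chunk in response.iter_content(chunk_size=chunk_size):
        fileobj.write(chunk)
        written += len(chunk)
    return written


def download_tarball(url, filepath):
    """Hypothetical wrapper: stream `url` to `filepath`, return its size."""
    # Imported here so that write_response can be used and tested
    # without the requests package or network access.
    import requests

    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(filepath, "wb") as f:
            return write_response(response, f)
```

Passing chunk_size=1 to write_response reproduces the slow behaviour
described above, which makes it easy to compare the two settings on the
same tarball.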

For instance, before that fix, loading all Sphinx packages took:

$ time python3 -m swh.loader.pypi.loader sphinx
...
real    53m53,489s
user    53m19,212s
sys     0m11,460s

After that fix, that process now takes:

$ time python3 -m swh.loader.pypi.loader sphinx
...
real    2m21,667s
user    0m55,900s
sys     0m10,416s

[1] http://docs.python-requests.org/en/master/api/#requests.Response.iter_content

Diff Detail

Repository
rDLDPY PyPI loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.