pypi.client: Improve tarballs download time

Authored by anlambert on Nov 29 2018, 4:26 PM.



While working on the npm loader by adapting the code from the PyPI one,
I noticed that downloading tarballs was very slow.

This turned out to be caused by the use of the iter_content method
from the requests response API [1]. By default, that method iterates
over the response content one byte at a time, hence the slow download.

Setting the chunk_size parameter of that method to None makes a
streamed response read data as it arrives, in whatever size the chunks
are received, which greatly speeds up downloads.
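
The change described above can be sketched as follows. This is a minimal
illustration, not the loader's actual code: the function names
write_streamed_response and download_tarball are hypothetical, and the
sketch assumes the request is made with stream=True, since that is the
mode in which chunk_size=None reads data as it arrives.

```python
def write_streamed_response(response, filepath):
    """Write the body of a streamed response to filepath.

    `response` is expected to behave like a requests.Response obtained
    with stream=True. Returns the number of bytes written.
    (Hypothetical helper, named for illustration only.)
    """
    size = 0
    with open(filepath, "wb") as f:
        # chunk_size=None: take chunks in whatever size they are received,
        # instead of the default of one byte per iteration.
        for chunk in response.iter_content(chunk_size=None):
            f.write(chunk)
            size += len(chunk)
    return size


def download_tarball(url, filepath):
    """Download url to filepath (hypothetical wrapper around requests.get)."""
    import requests  # third-party dependency, imported lazily

    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        return write_streamed_response(response, filepath)
```

Keeping the write loop separate from the HTTP call is only a convenience
of this sketch: it lets the loop be exercised with any object exposing
iter_content, without network access.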

For instance, before the fix, loading all Sphinx packages took:

$ time python3 -m swh.loader.pypi.loader sphinx
real    53m53,489s
user    53m19,212s
sys     0m11,460s

After the fix, the same run takes (about 23 times less wall-clock time):

$ time python3 -m swh.loader.pypi.loader sphinx
real    2m21,667s
user    0m55,900s
sys     0m10,416s


Diff Detail

rDLDPY PyPI loader
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

anlambert created this revision. Nov 29 2018, 4:26 PM
ardumont accepted this revision. Nov 29 2018, 4:30 PM

nice, thanks.

This revision is now accepted and ready to land. Nov 29 2018, 4:30 PM
This revision was automatically updated to reflect the committed changes.