Details

Reviewers

Commits

rDLDPY27ea42192186: pypi.client: Compute hashes and write the tarball in one roundtrip
rDLDPY7b6560bd469d: pypi.tasks: Rename to LoadPyPI task

Summary

Given a pypi origin, load its release artifacts (as synthetic
revision).

The first time, the visit results in a snapshot (targetting created
revisions). Further visits with no change results in the same latest
snapshot. Visit with new release artifacts results in a new snapshot
(new snapshot share the same branches as the last one, plus the new
ones)

Functional note

Missing release artifact information are skipped

Release artifacts whose PKG-INFO file is missing are skipped

Modules

swh.loader.pypi.client: Client interface to query pypi.org (and also somewhat manipulating the artifact local representations). It's the PyPiProject's collaborator to fetch missing information on the project.

swh.loader.pypi.model: PyPiProject representation of a pypi origin. It's the loader's collaborator to manipulate the origin (filter releases, etc...)

swh.loader.pypi.loader: The main entry point for the loader (fetch_data fetches, store_data stores... ;)

Technical Note

The client cache is there for local runs and post data analysis (also,
it has been used for tests). That setting should be set to off for
production.

Edge cases expected

Pypi project can resolve in no longer existing origin (404)

Hash checksum divergences stop the loading (maybe this can be improved later)

Possible improvments

Add some retry around the fetch artifacts routine (client)

Add more tests

simplify PyPiProject class (it was created at first to do more than it does now)

Branch is remote on the repository: loader-pypi

Test Plan

make test

Diff Detail

Repository

rDLDPY PyPI loader

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont created this revision.Sep 11 2018, 11:50 AM

Harbormaster completed remote builds in B1380: Diff 1268.Sep 11 2018, 11:50 AM

ardumont edited the summary of this revision. (Show Details)Sep 11 2018, 11:58 AM

Nice work, thanks!

Most of my comments have been put inline, however I have a few general comments :

Please consistently capitalize PyPI as PyPI: it's the Python Package Index.

The use of terminology around PyPI releases in the model submodule is confusing to me. That's not helped by the PyPI API, which returns data of exactly the same shape whether you ask for full project information, or for information for a single release.

Here's what I understand : the PyPI data model has three levels

Projects, which have a set of metadata that are apparently pulled from the latest release, plus some index-specific metadata such as who's registered as an allowed uploader on PyPI;
Releases, which are point in time snapshots of a Project, tagged with a version number (and _only_ a version number);
Files, which are the actual artifacts that are uploaded by the maintainer and materialize the release. Those have a name, a type (sdist, bdist_wheel, etc.), a date of upload, a size and a set of digests.

In this loader (and maybe in the lister, I haven't looked 😶), the "releases" term alternatively talks about a set of releases, or about a list of files associated to a release. I think we need to make the distinction crystal clear, which I think will help make the code smoother.

Ideally this terminology would be documented in the client class as well so we can keep it consistent.

Can you explain why some methods in the loader class have an (override) comment?

I'm a bit worried about the PyPIClient cache mechanism only using filenames; I'd be more comfortable with a digest-based cache, which shouldn't be an issue as all the files we're downloading also have a digest provided by upstream.

I think I understand the client / model split (client is generic, model is SWH-specific), but I'm not 100% convinced by the model submodule name (I'm guessing it's an analogy with swh.model?). I'd be tempted to move the PyPIProject class to client.py as they're fairly intertwined (I still think it's fair to keep the classes split for test mocking purposes).

The "real world" -> "swh data model" conversion functionality could then be moved to a converters submodule analogous to other loaders.

Some tests for the client module would be nice but I understand that's not exactly easy to do, we can add that at a later point.

swh/loader/pypi/client.py
49 ↗	(On Diff #1268)	the argument is not a file, it's a parsed structure
62 ↗	(On Diff #1268)	It's not clear what this does. Can you add an (abridged) example directory listing of what you expect in `dir_path` in the docstring ?
71 ↗	(On Diff #1268)	Please add documentation on what the arguments do.
152–153 ↗	(On Diff #1268)	The status code alone may not be sufficient to debug the problem. Please also provide the body of the response in exceptional cases to make them easier to diagnose.
160 ↗	(On Diff #1268)	Why have the client pass the project_url and the project name? Couldn't the URL be constructed like in the `release` function?
184 ↗	(On Diff #1268)	This base URL should be registered somewhere else. I'd suggest either a class attribute or an instance attribute set by a parameter (there's more than one Python package repository in the wild).
187 ↗	(On Diff #1268)	As this does more than fetching, I'd rename this function `prepare_release_artifact`, and possibly break it down in three distinct "fetch" (also handles caching and integrity control), "uncompress" and "metadata gathering" steps? To be more generally useful, I'd let the caller control where the archive is fetched and uncompressed, rather than needing to build a known directory name between the caller and this, and risking both to get out of sync.
223–244 ↗	(On Diff #1268)	this whole method could benefit from using the `stream=True` argument to `requests.get` : http://docs.python-requests.org/en/master/user/advanced/#body-content-workflow doesn't put the whole tarball in memory allows you to check the size before fetching the first byte This would still use `r.iter_content()`, you just need to use `r.headers['content-length']` (?) instead of `len(r.content)`
254 ↗	(On Diff #1268)	You should compute all hashes when processing the file initially (unfortunately I don't think our hashutil provides this functionality natively), this would save one full file read, and probably the need for `convert_to_hex` as well.
258–259 ↗	(On Diff #1268)	`release.update(artifact)`? or more sensibly `release['artifact'] = artifact`?
swh/loader/pypi/loader.py
37 ↗	(On Diff #1268)	Is the client argument ever used? I don't seem to see it used in this diff.
81 ↗	(On Diff #1268)	This may fail if the tarball contained files with bogus permissions (e.g. directories without write access). We may gain from a permission normalization phase in the decompression stage.
135 ↗	(On Diff #1268)	I wonder why all objects have an underscore in this function?
140–151 ↗	(On Diff #1268)	Creating instance attributes in a method is considered pretty bad form; Please move those initializations to the `__init__` method, or at least to the `prepare` method. Those should also probably be `id` -> `object` dicts to provide some deduplication and memory savings in the following loop (for contents, the id should be a compound key with all hashes).
153–154 ↗	(On Diff #1268)	This loop completely bypasses the incremental loading capabilities implemented in the base loader. This risks the loader using too many resources as it will materialize all objects in memory before sending them to the database, most notably on initial imports. The incremental pattern is currently used in the Debian loader. The idea is that the `prepare` method talks to the database and PyPI, and gathers what work needs to be done in some instance attributes. `fetch_data` and `store_data` get called in a loop, with the provision that `fetch_data` only fetches packages one at a time, and `store_data` only pushes the data for that package in a queue (which gets automatically flushed when it's too full, therefore keeping memory usage limited).
161–187 ↗	(On Diff #1268)	The revision generation should probably be moved to a method of the PyPIProject class, or to a function in the same module.
swh/loader/pypi/model.py
12 ↗	(On Diff #1268)	Where does the data come from, and what schema does the project conform to?
36 ↗	(On Diff #1268)	Where does this data come from? Please be explicit that the return value is in the shape the Software Heritage archive expects.
64 ↗	(On Diff #1268)	permit -> allow
69 ↗	(On Diff #1268)	Isn't PyPIClient doing that? It'd be useful to have some expected usage patterns for this class. Please also add documentation for the arguments for class instantiation.
72 ↗	(On Diff #1268)	client + project and project_metadata_url are possibly redundant
104 ↗	(On Diff #1268)	Possibly re-inline the info function, as it's not being tested separately.
106 ↗	(On Diff #1268)	This really filters the files within a release?
125–136 ↗	(On Diff #1268)	again with the stuck `_` key ;)
151 ↗	(On Diff #1268)	From what I can tell, this uncompresses the multiple files for a single release, is that correct? If that's so, then it should be named accordingly.
152 ↗	(On Diff #1268)	and
155 ↗	(On Diff #1268)	Is it a name or a version? I see it gets packed into a `'name'` entry in the flattened dict, so an example would be useful.
163 ↗	(On Diff #1268)	Is this really useful? AIUI the `releases` list (which should probably be a `release_files` list) comes from the PyPI API directly, so it'd make sense for the PyPIClient class to understand it directly, wouldn't it? If it is really useful, then the functionality should likely be moved to a separate function/staticmethod to ease testing.
164 ↗	(On Diff #1268)	flattened
167 ↗	(On Diff #1268)	Are there other digests that can be checked? If so we should avoid losing them, and check them on the `client` side.
187 ↗	(On Diff #1268)	`known_files`
swh/loader/pypi/tasks.py
11	Task, not Tsk

ardumont marked 4 inline comments as done.Sep 13 2018, 4:42 PM

ardumont added inline comments.

swh/loader/pypi/client.py
62 ↗	(On Diff #1268)	yes.
152–153 ↗	(On Diff #1268)	yes
160 ↗	(On Diff #1268)	Probably, it was only for debug purposes so i did not push further.
swh/loader/pypi/loader.py
37 ↗	(On Diff #1268)	Unfortunately, it's for the tests.
81 ↗	(On Diff #1268)	It's already done in the swh.core.tarball.uncompress call [1] [1] https://forge.softwareheritage.org/source/swh-core/browse/master/swh/core/tarball.py$165-173
135 ↗	(On Diff #1268)	To explicit it's private to that method, i agree it's not needed. I'll keep only those for the methods then. (it's not really only that function, e.g. _known_releases method also).
140–151 ↗	(On Diff #1268)	Creating instance attributes in a method is considered pretty bad form; Please move those initializations to the init method, or at least to the prepare method. I don't like it either. But i like keeping the state local to where it's needed (here fetch_data, store_data). In the prepare method (or init), it's not so local anymore... I'm also doing this because the way the loader core works... I'd prefer to pass along those that state to the store_data method... But i don't feel like changing that now.
153–154 ↗	(On Diff #1268)	Yes, i struggled on this. First implementation broke the fetch_data, store_data pattern as fetch_call did not use any state and directly called the maybe_load_... methods. So store_data was unused. The incremental pattern is currently used in the Debian loader. Will take a closer look there. Thanks.
swh/loader/pypi/model.py
64 ↗	(On Diff #1268)	rahhhhh
69 ↗	(On Diff #1268)	Isn't PyPIClient doing that? Yes, it's pypiclient (as collaborator as this class) which does it. It'd be useful to have some expected usage patterns for this class. Please also add documentation for the arguments for class instantiation. Yes, will do.
167 ↗	(On Diff #1268)	well, only md5 ;)

In D408#7860, @olasd wrote:

Nice work, thanks!

Most of my comments have been put inline, however I have a few general comments :

Please consistently capitalize PyPI as PyPI: it's the Python Package Index.

Yes.

The use of terminology around PyPI releases in the model submodule is confusing to me. That's not helped by the PyPI API, which returns data of exactly the same shape whether you ask for full project information, or for information for a single release.

Here's what I understand : the PyPI data model has three levels

Projects, which have a set of metadata that are apparently pulled from the latest release, plus some index-specific metadata such as who's registered as an allowed uploader on PyPI;

Releases, which are point in time snapshots of a Project, tagged with a version number (and _only_ a version number);

Files, which are the actual artifacts that are uploaded by the maintainer and materialize the release. Those have a name, a type (sdist, bdist_wheel, etc.), a date of upload, a size and a set of digests.

In this loader (and maybe in the lister, I haven't looked 😶), the "releases" term alternatively talks about a set of releases, or about a list of files associated to a release. I think we need to make the distinction crystal clear, which I think will help make the code smoother.
Ideally this terminology would be documented in the client class as well so we can keep it consistent.

Ok.

Can you explain why some methods in the loader class have an (override) comment?

It's just for mentioning it's coming from the base class.
To distinguish between those and the other ones being initialized in that class.

I'm a bit worried about the PyPIClient cache mechanism only using filenames; I'd be more comfortable with a digest-based cache, which shouldn't be an issue as all the files we're downloading also have a digest provided by upstream.

Yes, well, that part is not for production.
It was more for the development phase (as explained in the description of this diff ;). It helped me a lot to post-analyze the json and release artifacts.

I did not plan on sharing the cache between the different workers.

I think I understand the client / model split (client is generic, model is SWH-specific),

Yes, that was the goal.

but I'm not 100% convinced by the model submodule name (I'm guessing it's an analogy with swh.model?).

Yes, i don't like it either.
I did not find anything better then.

I'd be tempted to move the PyPIProject class to client.py as they're fairly intertwined (I still think it's fair to keep the classes split for test mocking purposes).

I'm ok with that.

The "real world" -> "swh data model" conversion functionality could then be moved to a converters submodule analogous to other loaders.

Right.

Some tests for the client module would be nice but I understand that's not exactly easy to do, we can add that at a later point.

I agree.

For the testing, I focused mainly on the loading part.

I'll keep it in mind for later.

ardumont marked 16 inline comments as done.Sep 13 2018, 8:19 PM

ardumont added inline comments.

swh/loader/pypi/client.py
187 ↗	(On Diff #1268)	Well right now, i won't split it yet. But i will do the other improvments proposed.
223–244 ↗	(On Diff #1268)	That sounds good, thanks.
254 ↗	(On Diff #1268)	yeah, ok. (unfortunately I don't think our hashutil provides this functionality natively) I think `hashutil.hash_file` allow that. (Not for the `convert_to_hex` part though, hard-coded to compute the binary format).
258–259 ↗	(On Diff #1268)	release.update(artifact)? Huh, right... or more sensibly release['artifact'] = artifact? No, the release here is actually the artifact... (that's one of your point earlier ;) I'm renaming the release name here.
swh/loader/pypi/loader.py
161–187 ↗	(On Diff #1268)	Yes, but later.
swh/loader/pypi/model.py
155 ↗	(On Diff #1268)	Well, the name is the version.
swh/loader/pypi/tasks.py
11	ymmv, we have some with Task, Tsk (that'd be my fault if we were pointing finger) and nothing... Maybe nothing is good enough (e.g. task_type with the backend as the fqdn, we already see the `.tasks.` module).

Taking care of some remarks

loader.pypi.client: Improve docstrings
pypi.client: Improve error message when failing to fetch artifacts
pypi.client: Extract PyPI instance's base url as instance attribute
pypi.client: Rename prepare_release_artifact method
pypi.client: Clarify the term from release to artifact
loader.pypi: Move converter functions in converters module
test_loader: Fix typo
Move PyPiProject from model to client module
*: Use "PyPI" terminology consistently everywhere
pypi.loader: Remove (override) comment
*: Use less _ in prefix variable names
pypi.client: Fix wrong verb use (allow instead of permit)
pypi.client: Dissociate artifact and release object
test_loader: Fix pep8 violation
pypi.client: Clarify filtering release artifacts method
loader.pypi: Clarify term from release to artifact

Mainly remains the memory optimization alignment with the debian loader

Harbormaster completed remote builds in B1383: Diff 1271.Sep 13 2018, 8:20 PM

pypi.loader: Make loader fully incremental in loading swh objects
pypi.client: Improve PyPIProject docstring

Harbormaster completed remote builds in B1384: Diff 1272.Sep 14 2018, 12:00 AM

ardumont marked 2 inline comments as done.Sep 14 2018, 12:01 AM

ardumont added inline comments.

swh/loader/pypi/loader.py
153–154 ↗	(On Diff #1268)	And done \m/

ardumont marked 12 inline comments as done.Sep 14 2018, 12:06 AM

pypi.tasks: Rename to LoadPyPI task
pypi.client: Compute hashes and write the tarball in one roundtrip

Harbormaster completed remote builds in B1386: Diff 1274.Sep 14 2018, 1:47 AM

This revision was not accepted when it landed; it landed in state Needs Review.Sep 14 2018, 1:47 AM

Closed by commit rDLDPY7b6560bd469d: pypi.tasks: Rename to LoadPyPI task (authored by ardumont). · Explain Why

This revision was automatically updated to reflect the committed changes.

ardumont added inline comments.Sep 14 2018, 1:48 AM

swh/loader/pypi/client.py
254 ↗	(On Diff #1268)	I have opened D410 to open an hash_stream and add an option to require hexdigest (in swh.model.hashutil). And i will update this diff with its usage here. If you have a minute, it'd be nice. Thanks for your time.

/me raises an eyebrow...

I did not intend to close this...

I'm on branch loader-pypi

arc diff --update D408 origin/master

To update the diff which is the way to go.

My mistake seems to be me wanting to push the branch, that closed the diff.
What's confusing is that i already did that (update the branch) but that did not close it before...

And now i just don't understand what is displayed...

ardumont mentioned this in D411: pypi.loader: Finalize pypi loader (take 2).Sep 18 2018, 11:11 AM

Bootstrap pypi loader
ClosedPublic
Actions

Details

Functional note

Modules

Technical Note

Edge cases expected

Possible improvments

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 1275

swh/loader/pypi/tasks.py

Bootstrap pypi loaderClosedPublicActions

Details

Functional note

Modules

Technical Note

Edge cases expected

Possible improvments

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 1275

swh/loader/pypi/tasks.py

Bootstrap pypi loader
ClosedPublic
Actions

Revision Contents
Changeset List