Page MenuHomeSoftware Heritage

Bootstrap pypi loader
ClosedPublic

Authored by ardumont on Sep 11 2018, 11:50 AM.

Details

Summary

Given a pypi origin, load its release artifacts (as synthetic
revision).

The first time, the visit results in a snapshot (targetting created
revisions). Further visits with no change results in the same latest
snapshot. Visit with new release artifacts results in a new snapshot
(new snapshot share the same branches as the last one, plus the new
ones)

Functional note

  • Missing release artifact information are skipped
  • Release artifacts whose PKG-INFO file is missing are skipped

Modules

  • swh.loader.pypi.client: Client interface to query pypi.org (and also somewhat manipulating the artifact local representations). It's the PyPiProject's collaborator to fetch missing information on the project.
  • swh.loader.pypi.model: PyPiProject representation of a pypi origin. It's the loader's collaborator to manipulate the origin (filter releases, etc...)
  • swh.loader.pypi.loader: The main entry point for the loader (fetch_data fetches, store_data stores... ;)

Technical Note

The client cache is there for local runs and post data analysis (also,
it has been used for tests). That setting should be set to off for
production.

Edge cases expected

  • Pypi project can resolve in no longer existing origin (404)
  • Hash checksum divergences stop the loading (maybe this can be improved later)

Possible improvments

  • Add some retry around the fetch artifacts routine (client)
  • Add more tests
  • simplify PyPiProject class (it was created at first to do more than it does now)

Branch is remote on the repository: loader-pypi

Test Plan

make test

Diff Detail

Repository
rDLDPY PyPI loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Nice work, thanks!

Most of my comments have been put inline, however I have a few general comments :

Please consistently capitalize PyPI as PyPI: it's the Python Package Index.

The use of terminology around PyPI releases in the model submodule is confusing to me. That's not helped by the PyPI API, which returns data of exactly the same shape whether you ask for full project information, or for information for a single release.

Here's what I understand : the PyPI data model has three levels

  • Projects, which have a set of metadata that are apparently pulled from the latest release, plus some index-specific metadata such as who's registered as an allowed uploader on PyPI;
  • Releases, which are point in time snapshots of a Project, tagged with a version number (and _only_ a version number);
  • Files, which are the actual artifacts that are uploaded by the maintainer and materialize the release. Those have a name, a type (sdist, bdist_wheel, etc.), a date of upload, a size and a set of digests.

In this loader (and maybe in the lister, I haven't looked ๐Ÿ˜ถ), the "releases" term alternatively talks about a set of releases, or about a list of files associated to a release. I think we need to make the distinction crystal clear, which I think will help make the code smoother.

Ideally this terminology would be documented in the client class as well so we can keep it consistent.

Can you explain why some methods in the loader class have an (override) comment?

I'm a bit worried about the PyPIClient cache mechanism only using filenames; I'd be more comfortable with a digest-based cache, which shouldn't be an issue as all the files we're downloading also have a digest provided by upstream.

I think I understand the client / model split (client is generic, model is SWH-specific), but I'm not 100% convinced by the model submodule name (I'm guessing it's an analogy with swh.model?). I'd be tempted to move the PyPIProject class to client.py as they're fairly intertwined (I still think it's fair to keep the classes split for test mocking purposes).

The "real world" -> "swh data model" conversion functionality could then be moved to a converters submodule analogous to other loaders.

Some tests for the client module would be nice but I understand that's not exactly easy to do, we can add that at a later point.

swh/loader/pypi/client.py
49 โ†—(On Diff #1268)

the argument is not a file, it's a parsed structure

62 โ†—(On Diff #1268)

It's not clear what this does. Can you add an (abridged) example directory listing of what you expect in dir_path in the docstring ?

71 โ†—(On Diff #1268)

Please add documentation on what the arguments do.

152โ€“153 โ†—(On Diff #1268)

The status code alone may not be sufficient to debug the problem. Please also provide the body of the response in exceptional cases to make them easier to diagnose.

160 โ†—(On Diff #1268)

Why have the client pass the project_url *and* the project name? Couldn't the URL be constructed like in the release function?

184 โ†—(On Diff #1268)

This base URL should be registered somewhere else. I'd suggest either a class attribute or an instance attribute set by a parameter (there's more than one Python package repository in the wild).

187 โ†—(On Diff #1268)

As this does more than fetching, I'd rename this function prepare_release_artifact, and possibly break it down in three distinct "fetch" (also handles caching and integrity control), "uncompress" and "metadata gathering" steps?

To be more generally useful, I'd let the caller control where the archive is fetched and uncompressed, rather than needing to build a known directory name between the caller and this, and risking both to get out of sync.

223โ€“244 โ†—(On Diff #1268)

this whole method could benefit from using the stream=True argument to requests.get : http://docs.python-requests.org/en/master/user/advanced/#body-content-workflow

  • doesn't put the whole tarball in memory
  • allows you to check the size before fetching the first byte

This would still use r.iter_content(), you just need to use r.headers['content-length'] (?) instead of len(r.content)

254 โ†—(On Diff #1268)

You should compute all hashes when processing the file initially (unfortunately I don't think our hashutil provides this functionality natively), this would save one full file read, and probably the need for convert_to_hex as well.

258โ€“259 โ†—(On Diff #1268)

release.update(artifact)? or more sensibly release['artifact'] = artifact?

swh/loader/pypi/loader.py
37 โ†—(On Diff #1268)

Is the client argument ever used? I don't seem to see it used in this diff.

81 โ†—(On Diff #1268)

This may fail if the tarball contained files with bogus permissions (e.g. directories without write access). We may gain from a permission normalization phase in the decompression stage.

135 โ†—(On Diff #1268)

I wonder why all objects have an underscore in this function?

140โ€“151 โ†—(On Diff #1268)

Creating instance attributes in a method is considered pretty bad form; Please move those initializations to the __init__ method, or at least to the prepare method.

Those should also probably be id -> object dicts to provide some deduplication and memory savings in the following loop (for contents, the id should be a compound key with all hashes).

153โ€“154 โ†—(On Diff #1268)

This loop completely bypasses the incremental loading capabilities implemented in the base loader. This risks the loader using too many resources as it will materialize all objects in memory before sending them to the database, most notably on initial imports.

The incremental pattern is currently used in the Debian loader.

The idea is that the prepare method talks to the database and PyPI, and gathers what work needs to be done in some instance attributes. fetch_data and store_data get called in a loop, with the provision that fetch_data only fetches packages one at a time, and store_data only pushes the data for that package in a queue (which gets automatically flushed when it's too full, therefore keeping memory usage limited).

161โ€“187 โ†—(On Diff #1268)

The revision generation should probably be moved to a method of the PyPIProject class, or to a function in the same module.

swh/loader/pypi/model.py
12 โ†—(On Diff #1268)

Where does the data come from, and what schema does the project conform to?

36 โ†—(On Diff #1268)

Where does this data come from? Please be explicit that the return value is in the shape the Software Heritage archive expects.

64 โ†—(On Diff #1268)

permit -> allow

69 โ†—(On Diff #1268)

Isn't PyPIClient doing that?

It'd be useful to have some expected usage patterns for this class.

Please also add documentation for the arguments for class instantiation.

72 โ†—(On Diff #1268)

client + project and project_metadata_url are possibly redundant

104 โ†—(On Diff #1268)

Possibly re-inline the info function, as it's not being tested separately.

106 โ†—(On Diff #1268)

This really filters the files within a release?

125โ€“136 โ†—(On Diff #1268)

again with the stuck _ key ;)

151 โ†—(On Diff #1268)

From what I can tell, this uncompresses the multiple files for a single release, is that correct?

If that's so, then it should be named accordingly.

152 โ†—(On Diff #1268)

and

155 โ†—(On Diff #1268)

Is it a name or a version? I see it gets packed into a 'name' entry in the flattened dict, so an example would be useful.

163 โ†—(On Diff #1268)

Is this really useful? AIUI the releases list (which should probably be a release_files list) comes from the PyPI API directly, so it'd make sense for the PyPIClient class to understand it directly, wouldn't it?

If it is really useful, then the functionality should likely be moved to a separate function/staticmethod to ease testing.

164 โ†—(On Diff #1268)

flattened

167 โ†—(On Diff #1268)

Are there other digests that can be checked? If so we should avoid losing them, and check them on the client side.

187 โ†—(On Diff #1268)

known_files

swh/loader/pypi/tasks.py
11

Task, not Tsk

ardumont added inline comments.
swh/loader/pypi/client.py
62 โ†—(On Diff #1268)

yes.

152โ€“153 โ†—(On Diff #1268)

yes

160 โ†—(On Diff #1268)

Probably, it was only for debug purposes so i did not push further.

swh/loader/pypi/loader.py
37 โ†—(On Diff #1268)

Unfortunately, it's for the tests.

81 โ†—(On Diff #1268)

It's already done in the swh.core.tarball.uncompress call [1]

[1] https://forge.softwareheritage.org/source/swh-core/browse/master/swh/core/tarball.py$165-173

135 โ†—(On Diff #1268)

To explicit it's private to that method, i agree it's not needed.
I'll keep only those for the methods then.

(it's not really only that function, e.g. _known_releases method also).

140โ€“151 โ†—(On Diff #1268)

Creating instance attributes in a method is considered pretty bad form; Please move those initializations to the init method, or at least to the prepare method.

I don't like it either.
But i like keeping the state local to where it's needed (here fetch_data, store_data).
In the prepare method (or init), it's not so local anymore...

I'm also doing this because the way the loader core works...
I'd prefer to pass along those that state to the store_data method...
But i don't feel like changing that now.

153โ€“154 โ†—(On Diff #1268)

Yes, i struggled on this.
First implementation broke the fetch_data, store_data pattern as fetch_call did not use any state and directly called the maybe_load_... methods. So store_data was unused.

The incremental pattern is currently used in the Debian loader.

Will take a closer look there. Thanks.

swh/loader/pypi/model.py
64 โ†—(On Diff #1268)

rahhhhh

69 โ†—(On Diff #1268)

Isn't PyPIClient doing that?

Yes, it's pypiclient (as collaborator as this class) which does it.

It'd be useful to have some expected usage patterns for this class.
Please also add documentation for the arguments for class instantiation.

Yes, will do.

167 โ†—(On Diff #1268)

well, only md5 ;)

In D408#7860, @olasd wrote:

Nice work, thanks!

:)

Most of my comments have been put inline, however I have a few general comments :

Please consistently capitalize PyPI as PyPI: it's the Python Package Index.

Yes.

The use of terminology around PyPI releases in the model submodule is confusing to me. That's not helped by the PyPI API, which returns data of exactly the same shape whether you ask for full project information, or for information for a single release.

Here's what I understand : the PyPI data model has three levels

  • Projects, which have a set of metadata that are apparently pulled from the latest release, plus some index-specific metadata such as who's registered as an allowed uploader on PyPI;
  • Releases, which are point in time snapshots of a Project, tagged with a version number (and _only_ a version number);
  • Files, which are the actual artifacts that are uploaded by the maintainer and materialize the release. Those have a name, a type (sdist, bdist_wheel, etc.), a date of upload, a size and a set of digests.

In this loader (and maybe in the lister, I haven't looked ๐Ÿ˜ถ), the "releases" term alternatively talks about a set of releases, or about a list of files associated to a release. I think we need to make the distinction crystal clear, which I think will help make the code smoother.
Ideally this terminology would be documented in the client class as well so we can keep it consistent.

Ok.

Can you explain why some methods in the loader class have an (override) comment?

It's just for mentioning it's coming from the base class.
To distinguish between those and the other ones being initialized in that class.

I'm a bit worried about the PyPIClient cache mechanism only using filenames; I'd be more comfortable with a digest-based cache, which shouldn't be an issue as all the files we're downloading also have a digest provided by upstream.

Yes, well, that part is not for production.
It was more for the development phase (as explained in the description of this diff ;). It helped me a lot to post-analyze the json and release artifacts.

I did not plan on sharing the cache between the different workers.

I think I understand the client / model split (client is generic, model is SWH-specific),

Yes, that was the goal.

but I'm not 100% convinced by the model submodule name (I'm guessing it's an analogy with swh.model?).

Yes, i don't like it either.
I did not find anything better then.

I'd be tempted to move the PyPIProject class to client.py as they're fairly intertwined (I still think it's fair to keep the classes split for test mocking purposes).

I'm ok with that.

The "real world" -> "swh data model" conversion functionality could then be moved to a converters submodule analogous to other loaders.

Right.

Some tests for the client module would be nice but I understand that's not exactly easy to do, we can add that at a later point.

I agree.

For the testing, I focused mainly on the loading part.

I'll keep it in mind for later.

ardumont added inline comments.
swh/loader/pypi/client.py
187 โ†—(On Diff #1268)

Well right now, i won't split it yet.
But i will do the other improvments proposed.

223โ€“244 โ†—(On Diff #1268)

That sounds good, thanks.

254 โ†—(On Diff #1268)

yeah, ok.

(unfortunately I don't think our hashutil provides this functionality natively)

I think hashutil.hash_file allow that.

(Not for the convert_to_hex part though, hard-coded to compute the binary format).

258โ€“259 โ†—(On Diff #1268)

release.update(artifact)?

Huh, right...

or more sensibly release['artifact'] = artifact?

No, the release here is actually the artifact... (that's one of your point earlier ;)
I'm renaming the release name here.

swh/loader/pypi/loader.py
161โ€“187 โ†—(On Diff #1268)

Yes, but later.

swh/loader/pypi/model.py
155 โ†—(On Diff #1268)

Well, the name is the version.

swh/loader/pypi/tasks.py
11

ymmv, we have some with Task, Tsk (that'd be my fault if we were pointing finger) and nothing...

Maybe nothing is good enough (e.g. task_type with the backend as the fqdn, we already see the .tasks. module).

ardumont marked 2 inline comments as done.

Taking care of some remarks

  • loader.pypi.client: Improve docstrings
  • pypi.client: Improve error message when failing to fetch artifacts
  • pypi.client: Extract PyPI instance's base url as instance attribute
  • pypi.client: Rename prepare_release_artifact method
  • pypi.client: Clarify the term from release to artifact
  • loader.pypi: Move converter functions in converters module
  • test_loader: Fix typo
  • Move PyPiProject from model to client module
  • *: Use "PyPI" terminology consistently everywhere
  • pypi.loader: Remove (override) comment
  • *: Use less _ in prefix variable names
  • pypi.client: Fix wrong verb use (allow instead of permit)
  • pypi.client: Dissociate artifact and release object
  • test_loader: Fix pep8 violation
  • pypi.client: Clarify filtering release artifacts method
  • loader.pypi: Clarify term from release to artifact

Mainly remains the memory optimization alignment with the debian loader

  • pypi.loader: Make loader fully incremental in loading swh objects
  • pypi.client: Improve PyPIProject docstring
ardumont added inline comments.
swh/loader/pypi/loader.py
153โ€“154 โ†—(On Diff #1268)

And done \m/

  • pypi.tasks: Rename to LoadPyPI task
  • pypi.client: Compute hashes and write the tarball in one roundtrip
This revision was not accepted when it landed; it landed in state Needs Review.Sep 14 2018, 1:47 AM
This revision was automatically updated to reflect the committed changes.
swh/loader/pypi/client.py
254 โ†—(On Diff #1268)

I have opened D410 to open an hash_stream and add an option to require hexdigest (in swh.model.hashutil).
And i will update this diff with its usage here.

If you have a minute, it'd be nice.

Thanks for your time.

/me raises an eyebrow...

I did not intend to close this...

I'm on branch loader-pypi

arc diff --update D408 origin/master

To update the diff which is the way to go.

My mistake seems to be me wanting to push the branch, that closed the diff.
What's confusing is that i already did that (update the branch) but that did not close it before...

And now i just don't understand what is displayed...