
PyPI loader
Closed, Migrated

Description

We need a PyPI loader that is, in essence, the tarball loader, but capable of extracting upstream metadata that are meaningful (and specific) to PyPI.


Event Timeline

capable of extracting upstream metadata that are meaningful (and specific) to PyPI.

I have sorted out the technical parts (retrieving metadata, tarballs) that make this possible.
What remains is the more functional part: how to represent those origins' data.

We need a PyPI loader that is, in essence, the tarball loader

To clarify, that means, for one PyPI project:

  • 1 snapshot targeting all the following releases...

    for each pypi tarball of project p:

    -> 1 release (not synthetic) targeting...

    -> 1 synthetic revision for the tarball (keeping the original artifact metadata) targeting...

    -> ... the root directory of the uncompressed associated tree

Right?

Cheers,

ardumont changed the task status from Open to Work in Progress. Aug 1 2018, 3:10 PM
ardumont claimed this task.

The basic loader will be the tarball loader, yes. In addition to that there are two aspects to be defined:

  1. the stack of objects to be added to the DAG
  2. the metadata to extract

For (1), I think what we currently do for Debian packages is as you said, i.e., snapshot -> release -> revision -> tarball root dir. Maybe you can check for comparison (or @olasd can chime in?). We should do the same here.

For (2), it depends on what's available as common and easily extractable metadata in PyPI bundles. For Debian source packages we extract a bunch of metadata from debian/changelog and similar files. I'm guessing Python tarballs will have interesting metadata in requirements.txt and possibly other files, but I'm not familiar enough with Python distribution best practices to provide a comprehensive list. @seirl @olasd maybe you can chime in on this?

Once we have determined (2), there's the question of which object to attach the metadata to. If the stack in (1) includes both revision and release, we need to pick one. There again, consistency with what we do for Debian packages would be valuable.

Hope this helps

For (1), I think what we currently do for Debian packages is as you said, i.e., snapshot -> release -> revision -> tarball root dir. Maybe you can check for comparison (or @olasd can chime in?). We should do the same here.

Right, will check this. Thanks for reminding me.

For (2), it depends on what's available as common and easily extractable metadata in PyPI bundles.

The PyPI API already provides quite a lot of information (see P288, P289 for examples).
For now, the current implementation leverages it.

For Debian source packages we extract a bunch of metadata from debian/changelog and similar files.

Yes.
The current PyPI loader (loader-pypi branch) does not, however, extract information from any files (yet?),
since I thought the PyPI metadata provided was enough.

I'm guessing Python tarballs will have interesting metadata in requirements.txt and possibly other files,

I don't think this is something we want to do.
I thought we did this for the Debian loader because there was no other choice (no upstream API providing the information, for example).

My take on this is that if the PyPI API is enough, we should stick to it.

In any case, the question is, what metadata are we interested in?

So far, for the project metadata, I chose a subset of what's provided (home_page, description, license, package_url, project_url, upstream). For the record, I'm strongly considering keeping all the original project information.

And I also keep an 'original_artifact' entry targeting the current release file (i.e. its full URL, filename, size, date, release name, sha1, sha1_git, sha256, blake2s256, ...).
Those are stored in the revision (see below).
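
For illustration, such an entry could look something like this (a hypothetical sketch; field names follow the list above, all values are invented):

original_artifact = {
    'url': 'https://files.pythonhosted.org/packages/.../arrow-0.12.1.tar.gz',
    'filename': 'arrow-0.12.1.tar.gz',
    'name': 'arrow',               # release name
    'version': '0.12.1',
    'date': '2018-01-06T20:22:52',
    'size': 85163,                 # bytes
    'sha1': '...',                 # checksums elided
    'sha1_git': '...',
    'sha256': '...',
    'blake2s256': '...',
}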

but I'm not familiar enough with Python distribution best practices to provide a comprehensive list. @seirl @olasd maybe you can chime in on this?

Oh, that makes me think: I saw some origins [1] whose provided metadata held multiple files under the same PyPI 'release' version (.whl, .egg, .tar.gz so far).
Those are different distribution formats (egg ~2004, wheel ~2012, ...) which I'm not very familiar with yet.
Some other origins provide only tarballs [2].

The current implementation only deals with tarballs at the moment.

Once we have determined (2), there's the question about to which object attach the metadata.

Yes.

If the stack in (1) includes both revision and release, we need to pick one.

Well, in our model so far, only the revision table permits holding metadata...
Not that it is a blocker, but if we were to choose the release (which makes sense here), that would mean adapting the storage.

Today for PyPI, I find the revision redundant, but I create it nonetheless because it's the only object that lets us not lose metadata...
I don't think that is a good enough reason to keep doing it, though.

There again, consistency with what we do for Debian packages would be valuable

Sure. Again, will check.

Hope this helps

Sure. Thanks.

Cheers,

[1] https://pypi.org/pypi/BoardTester/json
[2] https://pypi.org/pypi/arrow/json

The PyPI API already provides quite a lot of information (see P288, P289 for examples).
For now, the current implementation leverages it.

As far as I can tell from those examples, the metadata that PyPI gives you are the most recent ones, probably the ones extracted from the most recent version, so it would be incorrect to associate them with other releases.

As far as I can tell from those examples, the metadata that PyPI gives you are the most recent ones, probably the ones extracted from the most recent version, so it would be incorrect to associate them with other releases.

Oh right, thanks!

Checking the docs again [1]: Returns metadata (info) about an individual project at the latest version...

So that means we need to make one more call to the release API (per release) to get the correct information [2]: Returns metadata about an individual release at a specific version, otherwise identical to /pypi/<project_name>/json. That still means the correct information is reachable from PyPI ;)

[1] https://warehouse.readthedocs.io/api-reference/json/#get--pypi--project_name--json

[2] https://warehouse.readthedocs.io/api-reference/json/#get--pypi--project_name---version--json
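
For illustration, the two-call flow could be sketched like this (a minimal sketch using requests; the endpoints are the ones documented above, the project name is illustrative):

import requests

def project_metadata(project):
    # Latest-version info plus the full 'releases' listing.
    return requests.get('https://pypi.org/pypi/%s/json' % project).json()

def release_metadata(project, version):
    # Info as it was for one specific version: one extra call per release.
    return requests.get(
        'https://pypi.org/pypi/%s/%s/json' % (project, version)).json()

data = project_metadata('arrow')
for version in data['releases']:
    info = release_metadata('arrow', version)['info']
    # 'info' now describes that specific version, not just the latest one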

There remain 3 actions for the current implementation to be complete:

  • 1. Actionable one: be incremental when reading the releases from the PyPI API (for now, for a given origin, it naively asks for everything each time, which is then filtered later).

This can and should be avoided. Implementation-wise, that means resolving the existing snapshot's targeted releases (through their names).
If present, do not ask for those again (still keeping them for the snapshot creation, though).

  • 2. The packaging types we want to support. So far, I have gathered the following intel from the documentation and multiple runs:

Some package teams push multiple package type files:

  • bdist_wininst: .exe (binary)
  • bdist_egg: .egg (archive/directory), .tar.bz2
  • bdist_wheel: .whl (zip archive)
  • bdist_dumb: .tar.gz, .zip
  • sdist: .tar.gz, .zip
  • bdist_msi: .win32.msi (binary)
  • bdist_rpm: .rpm (archive)
  • and possibly more (I did not run the lister on all of PyPI yet)...

I have chosen so far to use only the sdist type, which is the archive format for the source code (zip, tar).
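
As an illustration of that choice, a minimal filtering sketch, assuming the 'releases' mapping returned by the JSON API (each file entry carries a 'packagetype' field):

def sdist_files(releases):
    # releases: version -> list of file entries, as in the JSON API.
    # Keep only source archives, dropping every bdist_* type.
    for version, files in releases.items():
        for f in files:
            if f['packagetype'] == 'sdist':
                yield version, f['filename'], f['url']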

But there are some current limitations, as a few projects:

  • reference the same data in both zip and tar format [1]. So which one do we choose? If both, how do we represent this in our model? (cf. below)
  • do not provide an sdist type (I did see some only pushing wheel files; others probably push only .egg files...).

We could probably have a policy to integrate those by unzipping them (.whl is a zip, .egg can be an archive or a directory). [2]

  • (FIXME: add refs) some provide multiple tarballs, for source code, documentation, and tests, all under the sdist type. The only way to distinguish between those is to rely on heuristics on the filenames (-source-, -doc-, -test-, ...).

So, the actual loader implementation ignores all the other package types mentioned.
I think the binary formats (bdist_wininst, bdist_msi) are not really an issue.
But maybe the others are?

If we were to integrate the other formats, what would be the model representation?

  • 1 revision per uncompressed directory

~> that makes no sense to me

  • 1 upper directory holding the output of uncompressing the different archive types?

~> seems more reasonable

  • 3. Do we keep the metadata at the revision level, or is it reasonable to have it on the release directly (which would require model adaptation)?

This must have been discussed already, but I can't find the discussion (nor recall it ;).

[1] https://pypi.org/pypi/ATpy/json

[2] https://pypi.org/pypi/atquant/json

[3] https://pypi.org/pypi/2311321-das1d31-2131313213213/json

Cheers,

In T421#21639, @zack wrote:

The basic loader will be the tarball loader, yes. In addition to that there are two aspects to be defined:

  1. the stack of objects to be added to the DAG
  2. the metadata to extract

For (1), I think what we currently do for Debian packages is as you said, i.e., snapshot -> release -> revision -> tarball root dir. Maybe you can check for comparison (or @olasd can chime in?). We should do the same here.

The Debian loader doesn't create release objects. Our data model doesn't allow attaching arbitrary structured metadata to release objects (as Git doesn't either), so we've shortcut this level of indirection.

For (2), it depends on what's available as common and easily extractable metadata in PyPI bundles. For Debian source packages we extract a bunch of metadata from debian/changelog and similar files. I'm guessing Python tarballs will have interesting metadata in requirements.txt and possibly other files, but I'm not familiar enough with Python distribution best practices to provide a comprehensive list. @seirl @olasd maybe you can chime in on this?

The sdist should contain the structured metadata in the <package_name>.egg-info/ directory (most interestingly, the PKG-INFO file).
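
PKG-INFO uses an RFC 822-style key/value format, so a minimal extraction sketch (a hypothetical helper; path handling is simplified) only needs the standard library:

import email
import tarfile

def pkg_info(sdist_path):
    # Parse the first PKG-INFO found in an sdist tarball.
    with tarfile.open(sdist_path) as tar:
        for member in tar.getmembers():
            if member.name.endswith('PKG-INFO'):
                raw = tar.extractfile(member).read().decode('utf-8')
                msg = email.message_from_string(raw)
                return {k: msg[k] for k in
                        ('Name', 'Version', 'Summary', 'Home-page', 'License')}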

Once we have determined (2), there's the question of which object to attach the metadata to. If the stack in (1) includes both revision and release, we need to pick one. There again, consistency with what we do for Debian packages would be valuable.

In the swh data model, (intrinsic) metadata can only be attached to revision objects. Release objects only have a date, an author and a message. I think we should skip release objects altogether, to be consistent with the Debian loader.

There remain 3 actions for the current implementation to be complete:

  • 1. Actionable one: be incremental when reading the releases from the PyPI API (for now, for a given origin, it naively asks for everything each time, which is then filtered later).

This can and should be avoided. Implementation-wise, that means resolving the existing snapshot's targeted releases (through their names).
If present, do not ask for those again (still keeping them for the snapshot creation, though).

What the Debian loader does is to always get the full list of versions for all available source packages. As the Debian archive format references checksums for source packages, we keep a cache of already-loaded source packages (and their swh object id) so that we only load the packages that we have never seen before.

Would such an approach be doable for the PyPI loader? I'm pretty sure the API gives checksums of the tarballs in the listing API.
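
A minimal sketch of that approach, assuming the sha256 digests exposed in the listing and a hypothetical known_sha256s cache built from previous loads:

def new_artifacts(listing, known_sha256s):
    # listing: version -> list of file entries from the PyPI JSON API.
    # Yield only artifacts whose digest we have never loaded before.
    for version, files in listing.items():
        for f in files:
            if f['digests']['sha256'] not in known_sha256s:
                yield version, f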

  • 2. The packaging types we want to support. So far, I have gathered the following intel from the documentation and multiple runs:

Some package teams push multiple package type files:

  • bdist_wininst: .exe (binary)
  • bdist_egg: .egg (archive/directory), .tar.bz2
  • bdist_wheel: .whl (zip archive)
  • bdist_dumb: .tar.gz, .zip
  • sdist: .tar.gz, .zip
  • bdist_msi: .win32.msi (binary)
  • bdist_rpm: .rpm (archive)
  • and possibly more (I did not run the lister on all of PyPI yet)...

I have chosen so far to use only the sdist type, which is the archive format for the source code (zip, tar).

Sounds good.

But there are some current limitations, as a few projects:

  • reference the same data in both zip and tar format [1]. So which one do we choose? If both, how do we represent this in our model? (cf. below)

Probably both.

  • do not provide an sdist type (I did see some only pushing wheel files; others probably push only .egg files...).

We could probably have a policy to integrate those by unzipping them (.whl is a zip, .egg can be an archive or a directory). [2]

I'd ignore these. Software Heritage only archives source code, and wheels/eggs are binaries, not source (even though most Python packages only contain plain text source that is indistinguishable from binaries). Do you have concrete numbers, that is, what is the ratio of packages that don't have an sdist?

  • (FIXME: add refs) some provide multiple tarballs, for source code, documentation, and tests, all under the sdist type. The only way to distinguish between those is to rely on heuristics on the filenames (-source-, -doc-, -test-, ...).

Do you have an example of package with such setup? (I guess that's your FIXME :p)

To give an example of how we could handle things: the Debian source package format allows multiple original tarballs, with a suffix, and they all get decompressed with the following heuristic:

  • decompress the <package>_<version>.orig.tar.gz file into a <package>-<version> directory
  • for each extra orig tarball, named <package>_<version>.orig-$foo.tar.gz
    • decompress the tarball in a new, empty <package>-<version>/$foo subdirectory
  • remove the <package>-<version>/debian directory (if it exists)
  • decompress <package>_<version>-<debian_revision>.debian.tar.gz in <package>-<version>/debian

Formally, the unpacked Debian source package is the entirety of the <package>-<version> directory.

The Debian loader loads the whole <package>-<version> directory, as decompressed by dpkg-source, as the root of the synthetic revision it generates.
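
A rough sketch of that heuristic (the untar helper and the extra_tarballs mapping are illustrative; the real loader delegates all of this to dpkg-source):

import shutil
import tarfile

def untar(tarball, dest):
    with tarfile.open(tarball) as tar:
        tar.extractall(dest)

def unpack_debian_source(package, version, debian_rev, extra_tarballs):
    # extra_tarballs maps each $foo suffix to its orig-$foo tarball path.
    dest = '%s-%s' % (package, version)
    untar('%s_%s.orig.tar.gz' % (package, version), dest)
    for foo, tarball in extra_tarballs.items():
        untar(tarball, '%s/%s' % (dest, foo))  # new, empty subdirectory
    shutil.rmtree('%s/debian' % dest, ignore_errors=True)
    untar('%s_%s-%s.debian.tar.gz' % (package, version, debian_rev),
          '%s/debian' % dest)
    return dest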

Example:
https://archive.softwareheritage.org/browse/revision/390bac7426d7e30d4d9b8f0bfeecb9cd1dad0daf/?origin=deb://Debian/packages/firefox&origin_type=deb
https://archive.softwareheritage.org/api/1/revision/390bac7426d7e30d4d9b8f0bfeecb9cd1dad0daf/ shows the original artifacts as a main tarball, alongside a bunch of l10n tarballs.

I don't know of a Python standard to handle split sdists, so concrete examples would be great :)

So, the actual loader implementation ignores all the other package types mentioned.
I think the binary formats (bdist_wininst, bdist_msi) are not really an issue.
But maybe the others are?

bdists are binary distributions and are ok to ignore, IMO. If a package only ships binary distributions, then it doesn't provide source code and can't be archived.

If we were to integrate the other formats, what would be the model representation?

I'm not sure what other formats you're talking about. I'll assume you're talking about, for instance, a package providing two sdists in .zip and .tar.gz format.

  • 1 revision per uncompressed directory

~> that makes no sense to me

Well, I think that's what makes the most sense:

  • Unpack all the sdist formats
  • If all goes well, the contents are identical. In that case, the revision objects would end up with the same id; we can ignore that there were ever multiple formats, and just have a single branch pointing to a single revision for that version of the package in the snapshot
  • If the contents are different, load both and make the snapshot have a branch pointing to each format.
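
A sketch of that collapsing logic (load_one is a hypothetical helper that unpacks one sdist file and returns the resulting synthetic revision's id):

def branches_for(files, load_one):
    # Identical archive contents produce identical revision ids,
    # so duplicate formats collapse into a single branch.
    seen = {}
    for f in files:
        rev_id = load_one(f)
        seen.setdefault(rev_id, f['filename'])  # branch naming illustrative
    return {filename: rev_id for rev_id, filename in seen.items()}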
  • 1 upper directory holding the output of uncompressing the different archive types?

~> seems more reasonable

I think the differences between the contents of different sdist formats should be minimal (or none) and therefore we should try to minimize how much we (externally) show this quirk of the python packaging ecosystem.

  • 3. Do we keep the metadata at the revision level, or is it reasonable to have it on the release directly (which would require model adaptation)?

This must have been discussed already, but I can't find the discussion (nor recall it ;).

I think metadata at the revision level is fine. In my opinion, metadata should be as intrinsic as possible; ideally, it would only come from the contents of the sdist (i.e. parsing the PKG-INFO file after decompression), rather than from the API.

[1] https://pypi.org/pypi/ATpy/json

[2] https://pypi.org/pypi/atquant/json

[3] https://pypi.org/pypi/2311321-das1d31-2131313213213/json

Cheers,

The Debian loader doesn't create release objects. Our data model doesn't allow attaching arbitrary structured metadata to release objects (as Git doesn't either), so we've shortcut this level of indirection.

Right, I did not notice when comparing (/me *grunts*). Thanks.

That makes sense, I'll align then.

In the swh data model, (intrinsic) metadata can only be attached to revision objects. Release objects only have a date, an author and a message. I think we should skip release objects altogether, to be consistent with the Debian loader.

Yes, I will align.

The sdist should contain the structured metadata in the <package_name>.egg-info/ directory (most interestingly, the PKG-INFO file).

Interesting.
Like I implicitly said earlier, I do not read anything from the uncompressed archive yet.
I'm trusting the PyPI API.

I don't know if adding an ad hoc parsing step is necessary yet.
I'd first need to check where PyPI's metadata comes from (so far, I assumed the API populated that data from something extracted during the package upload step).

I think metadata at the revision level is fine.

OK.

In my opinion, metadata should be as intrinsic as possible; ideally, it would only come from the contents of the sdist (i.e. parsing the PKG-INFO file after decompression), rather than from the API.

I'm OK with the intrinsic approach.
I was hoping for the PyPI infrastructure to take care of that (during upload, for example).
Then again, I'll check the PyPI API's documentation. Hopefully, it's explained somewhere ;)
If it's not done by PyPI, then I'll adapt to this instead.

What the Debian loader does is to always get the full list of versions for all available source packages.

Yes, the PyPI loader does that as well, because the first API call (for a project) gives the list of releases.
But that information is missing the associated release metadata (the first listing only gives metadata about the associated release files).
The main listing only gives (implicitly) the latest metadata on the project.

So the loader then queries the PyPI API further, per (package) release, to retrieve the missing information.
What I want to do here is shortcut that step if we have already loaded such a release.

As the Debian archive format references checksums for source packages, we keep a cache of already-loaded source packages (and their swh object id) so that we only load the packages that we have never seen before.

I suppose you mean the swh.storage.schemata model module shared between the Debian lister and the Debian loader (plus the associated db that goes along with it).

Would such an approach be doable for the PyPI loader?

Yes, it's doable.

But I'm wondering if it's necessary.
I could make the loader ask the storage for the existing known revisions for that origin (through the last snapshot),
then dismiss the redundant ones.

I'm pretty sure the API gives checksums of the tarballs in the listing API.

Yes, it does.

bdists are binary distributions and are ok to ignore, IMO.

Great!

If a package only ships binary distributions, then it doesn't provide source code and can't be archived.

Indeed, that's why I ignored the ones I immediately recognized as binary (Windows-related).

I'm not sure what other formats you're talking about. I'll assume you're talking about, for instance, a package providing two sdists in .zip and .tar.gz format.

No, I was speaking about the other bdist* formats.
You then made me realize they were binary formats, so, as you already mentioned, it's no longer an issue.
Thanks.

1 revision per uncompressed directory ~> that makes no sense to me

Well, I think that's what makes the most sense:

Unpack all the sdist formats
If all goes well, the contents are identical. In that case, the revision objects would end up with the same id; we can ignore that there were ever multiple formats, and just have a single branch pointing to a single revision for that version of the package in the snapshot
If the contents are different, load both and make the snapshot have a branch pointing to each format.

Right, now that indeed makes the most sense, thanks!

I think the differences between the contents of different sdist formats should be minimal (or none) and therefore we should try to minimize how much we (externally) show this quirk of the python packaging ecosystem.

Right.

I'd ignore these. Software Heritage only archives source code, and wheels/eggs are binaries, not source (even though most Python packages only contain plain text source that is indistinguishable from binaries).

OK

I don't know of a Python standard to handle split sdists, so concrete examples would be great :)
Do you have an example of package with such setup? (I guess that's your FIXME :p)

It was the [3] link I forgot to reference back ;)
Checking back, out of 6077 projects, I found only 3 others:

  • 4Suite-XML
  • Amara
  • asyncoro (examples in this one, not docs)

Note: I also found some differences in naming conventions (some use multiple sdist files, one named with a dash, another named with an underscore...):

  • approx-dates
  • asyncio-pinger

As we can see, it's also because the naming convention changes (- instead of _).
But generally, I think what you proposed about revisions and snapshots will take care of those small cases.

Do you have concrete numbers, that is, what is the ratio of packages that don't have a sdist?

Well, some without an sdist exist (378, ~6.22%) [5]

During that check, I found some projects without any release (402, ~6.62%) [4]

Mostly, though, they do have an sdist (5297, ~87.16%)

[4] P290

[5] P291

Cheers,

In T421#21693, @olasd wrote:
  • Unpack all the sdist formats
  • If all goes well, the contents are identical. In that case, the revision objects would end up with the same id; we can ignore that there were ever multiple formats, and just have a single branch pointing to a single revision for that version of the package in the snapshot
  • If the contents are different, load both and make the snapshot have a branch pointing to each format.

So, having one branch in the snapshot per distribution format (tar/zip/etc.) is a nice and clean way of handling this. It will also deduplicate out of the box, with all branches pointing to the same directory object, in the (hopefully common) case that all release archives are identical.

Still, we should probably have a "master" branch, to ease navigation, shouldn't we? (What do we do for Debian packages on this?) If so, we need to decide on an order of "favorite" distribution formats, e.g., "tar is better than zip, which is better than foo, etc.".
@ardumont can you investigate what pip does? e.g., what does it download if there are multiple sdist formats? (Caveat: it is possible that this depends on the platform where pip is run…)

So, having one branch in the snapshot per distribution format (tar/zip/etc.) is a nice and clean way of handling this.

I will be using the artifact release filename as the branch name. For example, for the project blitz-ca, release 0.1.1, that would give 2 branches: "blitz-ca-0.1.1.tar.bz2" and "blitz-ca-0.1.1.zip" (same for the other releases...).

That will avoid building names ourselves. For now, the code builds the branch name using the release version (and optionally a rank for the multiple-sdist case).
Something like:

  • branch 0.1.1 targeting blitz-ca-0.1.1.tar.bz2's uncompressed directory.
  • branch 0.1.1_1 targeting blitz-ca-0.1.1.zip's uncompressed directory.

That somewhat matches what we do in the Debian loader.
There, we use a more complete name (without the package name, though) as the branch name.
For example, that gives something like "jessie/updates/main/2.4.10-10+deb8u12" as a branch name [2]
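
For illustration, the resulting snapshot for blitz-ca 0.1.1 could look something like this (loosely following the swh snapshot shape; targets invented, and identical archive contents would make both branches share the same target):

snapshot = {
    'branches': {
        b'blitz-ca-0.1.1.tar.bz2': {'target_type': 'revision',
                                    'target': b'\x01' * 20},
        b'blitz-ca-0.1.1.zip': {'target_type': 'revision',
                                'target': b'\x01' * 20},
    }
}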

It will also deduplicate out of the box, with all branches pointing to the same directory object, in the (hopefully common) case that all release archives are identical.

yes

Still, we should probably have a "master" branch, to ease navigation, shouldn't we?

That does not seem to be necessary [1] [2] [3]

(What do we do for Debian packages on this?)

The Debian loader does not enforce a master branch [2].
As far as I can tell, the branches are created as shown earlier (they are read from the Debian loader task's parameters).

@ardumont can you investigate what pip does? e.g., what does it download if there are multiple sdist formats? (Caveat: it is possible that this depends on the platform where pip is run…)

Yes, adding it to the list of things I need to investigate further (PKG-INFO format, the origin of the PyPI API's metadata, ...).


[1] https://archive.softwareheritage.org/browse/revision/390bac7426d7e30d4d9b8f0bfeecb9cd1dad0daf/?origin=deb://Debian/packages/firefox&origin_type=deb
[2] https://archive.softwareheritage.org/browse/origin/deb/url/deb://Debian/packages/firefox/branches/
[3] https://archive.softwareheritage.org/browse/origin/deb/url/deb://Debian/packages/firefox/directory/

In T421#21696, @zack wrote:

Still, we should probably have a "master" branch, to ease navigation, shouldn't we? (What do we do for Debian packages on this?)

The Debian loader doesn't provide a HEAD, only versioned branches. I have no idea what the UI picks as the default branch (IMO it shouldn't pick one, but I believe that ship has sailed by now)

If so, we need to decide on an order of "favorite" distribution formats, e.g., "tar is better than zip, which is better than foo, etc.".
@ardumont can you investigate what pip does? e.g., what does it download if there are multiple sdist formats? (Caveat: it is possible that this depends on the platform where pip is run…)

When forcing pip to download an sdist, it just downloads the tarball and does no further processing

test command:

pip download --no-binary :all: --no-deps Django==1.11.15

Then again, I'll check the PyPI API's documentation. Hopefully, it's explained somewhere ;)

I'm still looking for that information.


In the meantime, for the potential PKG-INFO parsing, I have checked the subset of sdist files, and so far:

  • all sdist files hold a PKG-INFO at the root of the uncompressed directory
  • most also hold a <uncompressed-dir>/<project-name>.egg-info/PKG-INFO, but some do not:
...
     - -rw-rw-r-- richard/richard  1392 2013-08-24 12:09 Biggus-0.2/PKG-INFO
     - -rw-r--r-- itwl/avd       1392 2014-01-28 14:40 Biggus-0.3/PKG-INFO
     - -rw-r--r-- ithr/avd       2752 2014-02-07 15:35 Biggus-0.4.0/PKG-INFO
     - -rw-r--r-- ithr/avd       2752 2014-02-28 16:54 Biggus-0.5.0/PKG-INFO
     - -rw-r--r-- ithr/avd       2752 2014-08-21 17:29 Biggus-0.6.0/Biggus.egg-info/PKG-INFO
     - -rw-r--r-- ithr/avd       2752 2014-08-21 17:29 Biggus-0.6.0/PKG-INFO

In that example, both hold the same information.

Note: python3-pkginfo can be used for reading PKG-INFO files.
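
A (hypothetical) usage sketch with that library:

from pkginfo import SDist

# Read metadata straight from an sdist archive on disk (path assumed);
# pkginfo exposes the PKG-INFO fields as attributes.
dist = SDist('Biggus-0.6.0.tar.gz')
print(dist.name, dist.version, dist.summary)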