pypi.loader: Filter out null snapshot branches
Closed, Public

Authored by ardumont on Thu, Nov 29, 9:28 PM.

Details

Summary

Related T1396

Test Plan

tox

Diff Detail

Repository
rDLDPY PyPI loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont created this revision. Thu, Nov 29, 9:28 PM
ardumont retitled this revision from "pypi.loader: Filter out null snapshot branch" to "pypi.loader: Filter out null snapshot branches". Thu, Nov 29, 9:32 PM
anlambert accepted this revision. Thu, Nov 29, 9:40 PM

LGTM.

Regarding skipping package versions without the PKG-INFO file (T1396), do we really want to do that?

I mean, there is still code to archive here, even though the package metadata (and thus the revision author) is not available.
This is an open question; I don't know what the proper solution is to handle those cases.

This revision is now accepted and ready to land. Thu, Nov 29, 9:40 PM
This revision was automatically updated to reflect the committed changes.

Regarding skipping package versions without the PKG-INFO file (T1396), do we really want to do that?
I mean, there is still code to archive here, even though the package metadata (and thus the revision author) is not available.

That's a fair question, but the answer is yes, we want to do that.

At one point (the pre-running phase), the implementation relied mainly on the PyPI API to solve the problem (no PKG-INFO parsing).
Then it was pointed out that, as we are only a client of the API, we cannot vouch for it. So it's best to avoid using it for that part.

As an iterative tryout step, it then parsed the PKG-INFO file and fell back to the API to fill in the gaps [2].
As with the previous step though, it was deemed better to skip such packages altogether [3].

So here we are: we now skip the packages without a PKG-INFO file.
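To make the current behavior concrete, here is a minimal sketch, not the loader's actual code. It assumes PKG-INFO can be read with the standard library's RFC 822 style parser, and that a skipped version ends up as a None entry among the snapshot branches. The function names and the branch layout are hypothetical.

```python
import os
from email.parser import Parser


def read_pkginfo(package_dir):
    """Return the PKG-INFO fields as a dict, or None when the file is absent.

    Hypothetical helper: repeated headers (e.g. Classifier) collapse to the
    last value, which is good enough for an illustration.
    """
    path = os.path.join(package_dir, 'PKG-INFO')
    if not os.path.exists(path):
        return None
    with open(path, encoding='utf-8') as f:
        message = Parser().parse(f)
    return dict(message.items())


def filter_null_branches(branches):
    """Drop the snapshot branches whose target is None (skipped versions)."""
    return {name: target
            for name, target in branches.items()
            if target is not None}
```

With this shape, a version without PKG-INFO yields no metadata, its branch target stays None, and the filter removes it before the snapshot is built.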

This is an open question; I don't know what the proper solution is to handle those cases.

The only alternative I see would be to retrieve the information through the API (which provides it).
As pointed out previously, that's a no.

After that, it would be interesting to keep track of those packages without PKG-INFO, to check:

  • how frequent they are
  • whether they are even source packages (IIRC, I saw packages that were not really source ones, but I have no example to provide).

[2] 5b6dbbd871f4b9a17954e380d97a8d6ed95c32a3

[3] b80666fd24b8d2ec98ea829d633da984ca0e317f

The only alternative I see would be to retrieve the information through the API (which provides it).
As pointed out previously, that's a no.

Another option could be to parse the content of the setup.py file, notably the keyword parameters
of the setup function. The PKG-INFO file gets generated from those so that could fill the gap.

Interesting.
If there is no PKG-INFO, I'm wondering what the setup.py looks like (if there is one)
;)

Below is the setup.py of a package released without a generated PKG-INFO file (https://pypi.org/project/configpy/0.2/#files).

from setuptools import setup

long_description = """
A config file parser with variable replacement, variable look-ahead 
and look-behind support.
"""

setup(
    name='configpy',
    description='Python Configuration File Parser',
    url='http://jkeyes.github.com/configpy/',
    long_description=long_description,
    author='John Keyes',
    author_email='configpy@keyes.ie',
    version='0.2',
    license="BSD",
    classifiers = [
        'Development Status :: 4 - Beta',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: BSD License',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
    ],
    packages=['configpy'],
)