diff --git a/README.md b/README.md
index ca0b613..737c01d 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,102 @@
 swh-loader-pypi
 ====================
 
 SWH PyPI loader's source code repository
+
+# What does the loader do?
+
+The PyPI loader visits and loads a PyPI project [1].
+
+Each visit will result in:
+- 1 snapshot (which targets n revisions; 1 per release artifact)
+- 1 revision (which targets 1 directory; the release artifact uncompressed)
+
+[1] https://pypi.org/help/#packages
+
+## First visit
+
+Given a PyPI project (origin), the loader, for the first visit:
+
+- retrieves information for the given project (including releases)
+- then for each associated release
+- for each associated source distribution (type 'sdist') release
+  artifact (possibly many per release)
+- retrieves the associated artifact archive (with checks)
+- uncompresses the archive locally
+- computes the hashes of the uncompressed directory
+- then creates a revision (using the PKG-INFO metadata file)
+  targeting that directory
+- finally, creates a snapshot targeting all seen revisions
+  (uncompressed PyPI artifact and metadata).
+
+## Next visit
+
+The loader starts by checking if something changed since the last
+visit. If nothing changed, the visit's snapshot is left
+unchanged. The new visit targets the same snapshot.
+
+If something changed, the already seen release artifacts are skipped.
+Only the new ones are loaded. In the end, the loader creates a new
+snapshot based on the previous one. Thus, the new snapshot targets
+both the old and new PyPI release artifacts.
+
+## Terminology
+
+- 1 project: a PyPI project (used as swh origin). This is a collection
+  of releases.
+
+- 1 release: a specific version of the (PyPI) project. It's a
+  collection of information and associated source release
+  artifacts (type 'sdist').
+
+- 1 release artifact: a source release artifact (distributed by a PyPI
+  maintainer). In swh, we are specifically
+  interested in the 'sdist' type (source code).
+
+## Edge cases
+
+- If a release provides no release artifact, that release is skipped.
+
+- If a release artifact holds no PKG-INFO file (at the root of the archive),
+  the release artifact is skipped.
+
+- If a problem occurs during a fetch action (e.g. release artifact
+  download), the load fails and the visit is marked as 'partial'.
+
+# Development
+
+## Configuration file
+
+### Location
+
+Either:
+- /etc/softwareheritage/loader/pypi.yml
+- ~/.config/swh/loader/pypi.yml
+- ~/.swh/loader/svn.pypi
+
+### Configuration sample
+
+```
+storage:
+  cls: remote
+  args:
+    url: http://localhost:5002/
+
+```
+
+## Local run
+
+The PyPI loader expects as input:
+- project: a PyPI project name (e.g. arrow)
+- project_url: URI of the PyPI project (HTML page)
+- project_metadata_url: URI of the PyPI metadata information (JSON page)
+
+``` sh
+$ python3
+Python 3.6.6 (default, Jun 27 2018, 14:44:17)
+[GCC 8.1.0] on linux
+Type "help", "copyright", "credits" or "license" for more information.
+>>> import logging; logging.basicConfig(level=logging.DEBUG)
+>>> project='arrow'; from swh.loader.pypi.tasks import LoadPyPI
+>>> LoadPyPI().run(project, 'https://pypi.org/pypi/%s/' % project, 'https://pypi.org/pypi/%s/json' % project)
+```
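
The release and release artifact notions above map onto the JSON document served at `project_metadata_url`. The sketch below is only an illustration of that mapping, not the loader's actual code: `project_urls`, `sdist_artifacts` and `SAMPLE_METADATA` are hypothetical names, the sample data is made up, and the only assumption about PyPI is that its JSON metadata exposes a `releases` mapping whose entries carry `packagetype`, `filename` and `url` fields.

```python
from typing import Any, Dict, Iterator, Tuple


def project_urls(project: str) -> Tuple[str, str]:
    """Build (project_url, project_metadata_url) using the same URL
    pattern as the local-run example."""
    base = 'https://pypi.org/pypi/%s/' % project
    return base, base + 'json'


def sdist_artifacts(metadata: Dict[str, Any]) -> Iterator[Tuple[str, str, str]]:
    """Yield (version, filename, url) for each 'sdist' release artifact.

    Releases without artifacts, and artifacts of any other type
    (e.g. wheels), yield nothing.
    """
    for version, artifacts in metadata.get('releases', {}).items():
        for artifact in artifacts:
            if artifact.get('packagetype') == 'sdist':
                yield version, artifact['filename'], artifact['url']


# Made-up metadata, trimmed to the fields used above (illustrative only).
SAMPLE_METADATA = {
    'releases': {
        '0.1.0': [
            {'packagetype': 'sdist',
             'filename': 'demo-0.1.0.tar.gz',
             'url': 'https://example.org/demo-0.1.0.tar.gz'},
            {'packagetype': 'bdist_wheel',
             'filename': 'demo-0.1.0-py3-none-any.whl',
             'url': 'https://example.org/demo-0.1.0-py3-none-any.whl'},
        ],
        '0.2.0': [],  # a release that provides no release artifact
    },
}

if __name__ == '__main__':
    print(project_urls('arrow'))
    for version, filename, url in sdist_artifacts(SAMPLE_METADATA):
        print(version, filename, url)
```

Filtering on the artifact type before any download keeps the edge cases visible: a release with an empty artifact list simply contributes nothing to the visit.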