diff --git a/PKG-INFO b/PKG-INFO index b04d0da..7d31452 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,136 +1,136 @@ Metadata-Version: 2.1 Name: swh.loader.pypi -Version: 0.0.3 +Version: 0.0.4 Summary: Software Heritage PyPI Loader Home-page: https://forge.softwareheritage.org/source/swh-loader-pypi Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-loader-pypi Description: swh-loader-pypi ==================== SWH PyPI loader's source code repository # What does the loader do? The PyPI loader visits and loads a PyPI project [1]. Each visit will result in: - 1 snapshot (which targets n revisions ; 1 per release artifact) - 1 revision (which targets 1 directory ; the release artifact uncompressed) [1] https://pypi.org/help/#packages ## First visit Given a PyPI project (origin), the loader, for the first visit: - retrieves information for the given project (including releases) - then for each associated release - for each associated source distribution (type 'sdist') release artifact (possibly many per release) - retrieves the associated artifact archive (with checks) - uncompresses locally the archive - computes the hashes of the uncompressed directory - then creates a revision (using PKG-INFO metadata file) targeting such directory - finally, creates a snapshot targeting all seen revisions (uncompressed PyPI artifact and metadata). ## Next visit The loader starts by checking if something changed since the last visit. If nothing changed, the visit's snapshot is left unchanged. The new visit targets the same snapshot. If something changed, the already seen release artifacts are skipped. Only the new ones are loaded. In the end, the loader creates a new snapshot based on the previous one. 
Thus, the new snapshot targets both the old and new PyPI release artifacts. ## Terminology - 1 project: a PyPI project (used as swh origin). This is a collection of releases. - 1 release: a specific version of the (PyPi) project. It's a collection of information and associated source release artifacts (type 'sdist') - 1 release artifact: a source release artifact (distributed by a PyPI maintainer). In swh, we are specifically interested by the 'sdist' type (source code). ## Edge cases - If no release provides release artifacts, those are skipped - If a release artifact holds no PKG-INFO file (root at the archive), the release artifact is skipped. - If a problem occurs during a fetch action (e.g. release artifact download), the load fails and the visit is marked as 'partial'. # Development ## Configuration file ### Location Either: - /etc/softwareheritage/ - ~/.config/swh/ - ~/.swh/ Note: Will call that location $SWH_CONFIG_PATH ### Configuration sample $SWH_CONFIG_PATH/loader/pypi.yml: ``` storage: cls: remote args: url: http://localhost:5002/ ``` ## Local run The built-in command-line will run the loader for a project in the main PyPI archive. For instance, to load arrow: ``` sh python3 -m swh.loader.pypi.loader arrow ``` If you need more control, you can use the loader directly. 
It expects three arguments: - project: a PyPI project name (f.e.: arrow) - project_url: URL of the PyPI project (human-readable html page) - project_metadata_url: URL of the PyPI metadata information (machine-parsable json document) ``` python import logging logging.basicConfig(level=logging.DEBUG) from swh.loader.pypi.tasks import LoadPyPI project='arrow' LoadPyPI().run(project, 'https://pypi.org/pypi/%s/' % project, 'https://pypi.org/pypi/%s/json' % project) ``` Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Description-Content-Type: text/markdown Provides-Extra: testing diff --git a/debian/control b/debian/control index 63d7599..673b7e6 100644 --- a/debian/control +++ b/debian/control @@ -1,29 +1,29 @@ Source: swh-loader-pypi Maintainer: Software Heritage developers Section: python Priority: optional Build-Depends: debhelper (>= 9), dh-python (>= 2), python3-all, python3-arrow, python3-nose, python3-pkginfo, python3-requests, python3-setuptools, python3-swh.core, python3-swh.loader.core (>= 0.0.34~), python3-swh.model (>= 0.0.27~), - python3-swh.storage, + python3-swh.storage (>= 0.0.108~), python3-swh.scheduler, python3-vcversioner Standards-Version: 3.9.6 Homepage: https://forge.softwareheritage.org/source/swh-loader-pypi.git Package: python3-swh.loader.pypi Architecture: all Depends: python3-swh.core, python3-swh.loader.core (>= 0.0.34~), python3-swh.model (>= 0.0.27~), - python3-swh.storage, + python3-swh.storage (>= 0.0.108~), ${misc:Depends}, ${python3:Depends} Description: Software Heritage PyPI Loader diff --git a/requirements-swh.txt b/requirements-swh.txt index 7f8da48..5155478 100644 --- a/requirements-swh.txt +++ b/requirements-swh.txt @@ -1,5 +1,5 @@ swh.core swh.model >= 0.0.27 -swh.storage +swh.storage >= 
0.0.108 swh.scheduler swh.loader.core >= 0.0.34 diff --git a/swh.loader.pypi.egg-info/PKG-INFO b/swh.loader.pypi.egg-info/PKG-INFO index b04d0da..7d31452 100644 --- a/swh.loader.pypi.egg-info/PKG-INFO +++ b/swh.loader.pypi.egg-info/PKG-INFO @@ -1,136 +1,136 @@ Metadata-Version: 2.1 Name: swh.loader.pypi -Version: 0.0.3 +Version: 0.0.4 Summary: Software Heritage PyPI Loader Home-page: https://forge.softwareheritage.org/source/swh-loader-pypi Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-loader-pypi Description: swh-loader-pypi ==================== SWH PyPI loader's source code repository # What does the loader do? The PyPI loader visits and loads a PyPI project [1]. Each visit will result in: - 1 snapshot (which targets n revisions ; 1 per release artifact) - 1 revision (which targets 1 directory ; the release artifact uncompressed) [1] https://pypi.org/help/#packages ## First visit Given a PyPI project (origin), the loader, for the first visit: - retrieves information for the given project (including releases) - then for each associated release - for each associated source distribution (type 'sdist') release artifact (possibly many per release) - retrieves the associated artifact archive (with checks) - uncompresses locally the archive - computes the hashes of the uncompressed directory - then creates a revision (using PKG-INFO metadata file) targeting such directory - finally, creates a snapshot targeting all seen revisions (uncompressed PyPI artifact and metadata). ## Next visit The loader starts by checking if something changed since the last visit. If nothing changed, the visit's snapshot is left unchanged. The new visit targets the same snapshot. If something changed, the already seen release artifacts are skipped. 
Only the new ones are loaded. In the end, the loader creates a new snapshot based on the previous one. Thus, the new snapshot targets both the old and new PyPI release artifacts. ## Terminology - 1 project: a PyPI project (used as swh origin). This is a collection of releases. - 1 release: a specific version of the (PyPi) project. It's a collection of information and associated source release artifacts (type 'sdist') - 1 release artifact: a source release artifact (distributed by a PyPI maintainer). In swh, we are specifically interested by the 'sdist' type (source code). ## Edge cases - If no release provides release artifacts, those are skipped - If a release artifact holds no PKG-INFO file (root at the archive), the release artifact is skipped. - If a problem occurs during a fetch action (e.g. release artifact download), the load fails and the visit is marked as 'partial'. # Development ## Configuration file ### Location Either: - /etc/softwareheritage/ - ~/.config/swh/ - ~/.swh/ Note: Will call that location $SWH_CONFIG_PATH ### Configuration sample $SWH_CONFIG_PATH/loader/pypi.yml: ``` storage: cls: remote args: url: http://localhost:5002/ ``` ## Local run The built-in command-line will run the loader for a project in the main PyPI archive. For instance, to load arrow: ``` sh python3 -m swh.loader.pypi.loader arrow ``` If you need more control, you can use the loader directly. 
It expects three arguments: - project: a PyPI project name (f.e.: arrow) - project_url: URL of the PyPI project (human-readable html page) - project_metadata_url: URL of the PyPI metadata information (machine-parsable json document) ``` python import logging logging.basicConfig(level=logging.DEBUG) from swh.loader.pypi.tasks import LoadPyPI project='arrow' LoadPyPI().run(project, 'https://pypi.org/pypi/%s/' % project, 'https://pypi.org/pypi/%s/json' % project) ``` Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Description-Content-Type: text/markdown Provides-Extra: testing diff --git a/swh.loader.pypi.egg-info/requires.txt b/swh.loader.pypi.egg-info/requires.txt index 78f4d33..b267d74 100644 --- a/swh.loader.pypi.egg-info/requires.txt +++ b/swh.loader.pypi.egg-info/requires.txt @@ -1,13 +1,13 @@ arrow pkginfo requests setuptools swh.core swh.loader.core>=0.0.34 swh.model>=0.0.27 swh.scheduler -swh.storage +swh.storage>=0.0.108 vcversioner [testing] nose diff --git a/swh/loader/pypi/_version.py b/swh/loader/pypi/_version.py index 4fee5ff..d48e243 100644 --- a/swh/loader/pypi/_version.py +++ b/swh/loader/pypi/_version.py @@ -1,5 +1,5 @@ # This file is automatically generated by setup.py. 
-__version__ = '0.0.3' -__sha__ = 'gb237da9' -__revision__ = 'gb237da9' +__version__ = '0.0.4' +__sha__ = 'gc993a89' +__revision__ = 'gc993a89' diff --git a/swh/loader/pypi/loader.py b/swh/loader/pypi/loader.py index 32f5b53..797b787 100644 --- a/swh/loader/pypi/loader.py +++ b/swh/loader/pypi/loader.py @@ -1,307 +1,310 @@ # Copyright (C) 2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import os import shutil from tempfile import mkdtemp import arrow from swh.loader.core.utils import clean_dangling_folders from swh.loader.core.loader import SWHLoader from swh.model.from_disk import Directory from swh.model.identifiers import ( revision_identifier, snapshot_identifier, identifier_to_bytes, normalize_timestamp ) +from swh.storage.algos.snapshot import snapshot_get_all_branches from .client import PyPIClient, PyPIProject TEMPORARY_DIR_PREFIX_PATTERN = 'swh.loader.pypi.' 
DEBUG_MODE = '** DEBUG MODE **' class PyPILoader(SWHLoader): CONFIG_BASE_FILENAME = 'loader/pypi' ADDITIONAL_CONFIG = { 'temp_directory': ('str', '/tmp/swh.loader.pypi/'), 'cache': ('bool', False), 'cache_dir': ('str', ''), 'debug': ('bool', False), # NOT FOR PRODUCTION } def __init__(self, client=None): super().__init__(logging_class='swh.loader.pypi.PyPILoader') self.origin_id = None if not client: temp_directory = self.config['temp_directory'] os.makedirs(temp_directory, exist_ok=True) self.temp_directory = mkdtemp( suffix='-%s' % os.getpid(), prefix=TEMPORARY_DIR_PREFIX_PATTERN, dir=temp_directory) self.pypi_client = PyPIClient( temp_directory=self.temp_directory, cache=self.config['cache'], cache_dir=self.config['cache_dir']) else: self.temp_directory = client.temp_directory self.pypi_client = client self.debug = self.config['debug'] self.done = False def pre_cleanup(self): """To prevent disk explosion if some other workers exploded in mid-air (OOM killed), we try and clean up dangling files. 
""" if self.debug: self.log.warn('%s Will not pre-clean up temp dir %s' % ( DEBUG_MODE, self.temp_directory )) return clean_dangling_folders(self.config['temp_directory'], pattern_check=TEMPORARY_DIR_PREFIX_PATTERN, log=self.log) def cleanup(self): """Clean up temporary disk use """ if self.debug: self.log.warn('%s Will not clean up temp dir %s' % ( DEBUG_MODE, self.temp_directory )) return if os.path.exists(self.temp_directory): self.log.debug('Clean up %s' % self.temp_directory) shutil.rmtree(self.temp_directory) def prepare_origin_visit(self, project_name, project_url, project_metadata_url=None): """Prepare the origin visit information Args: project_name (str): Project's simple name project_url (str): Project's main url project_metadata_url (str): Project's metadata url """ self.origin = { 'url': project_url, 'type': 'pypi', } self.visit_date = None # loader core will populate it def _known_artifacts(self, last_snapshot): """Retrieve the known releases/artifact for the origin_id. Args snapshot (dict): Last snapshot for the visit Returns: list of (filename, sha256) tuples. 
""" if not last_snapshot or 'branches' not in last_snapshot: return {} revs = [rev['target'] for rev in last_snapshot['branches'].values()] known_revisions = self.storage.revision_get(revs) ret = {} for revision in known_revisions: if 'original_artifact' in revision['metadata']: artifact = revision['metadata']['original_artifact'] ret[artifact['filename'], artifact['sha256']] = revision['id'] return ret def _last_snapshot(self): """Retrieve the last snapshot """ - return self.storage.snapshot_get_latest(self.origin_id) + snapshot = self.storage.snapshot_get_latest(self.origin_id) + if snapshot and snapshot.pop('next_branch', None): + return snapshot_get_all_branches(self.storage, snapshot['id']) def prepare(self, project_name, project_url, project_metadata_url=None): """Keep reference to the origin url (project) and the project metadata url Args: project_name (str): Project's simple name project_url (str): Project's main url project_metadata_url (str): Project's metadata url """ self.project_name = project_name self.origin_url = project_url self.project_metadata_url = project_metadata_url self.project = PyPIProject(self.pypi_client, self.project_name, self.project_metadata_url) self._prepare_state() def _prepare_state(self): """Initialize internal state (snapshot, contents, directories, etc...) This is called from `prepare` method. """ last_snapshot = self._last_snapshot() self.known_artifacts = self._known_artifacts(last_snapshot) # and the artifacts # that will be the source of data to retrieve self.new_artifacts = self.project.download_new_releases( self.known_artifacts ) # temporary state self._contents = [] self._directories = [] self._revisions = [] self._load_status = 'uneventful' self._visit_status = 'full' def fetch_data(self): """Called once per release artifact version (can be many for one release). 
This will for each call: - retrieve a release artifact (associated to a release version) - Uncompress it and compute the necessary information - Computes the swh objects Returns: True as long as data to fetch exist """ data = None if self.done: return False try: data = next(self.new_artifacts) self._load_status = 'eventful' except StopIteration: self.done = True return False project_info, author, release, artifact, dir_path = data dir_path = dir_path.encode('utf-8') directory = Directory.from_disk(path=dir_path, data=True) _objects = directory.collect() self._contents = _objects['content'].values() self._directories = _objects['directory'].values() date = normalize_timestamp( int(arrow.get(artifact['date']).timestamp)) name = release['name'].encode('utf-8') message = release['message'].encode('utf-8') if message: message = b'%s: %s' % (name, message) else: message = name _revision = { 'synthetic': True, 'metadata': { 'original_artifact': artifact, 'project': project_info, }, 'author': author, 'date': date, 'committer': author, 'committer_date': date, 'message': message, 'directory': directory.hash, 'parents': [], 'type': 'tar', } _revision['id'] = identifier_to_bytes( revision_identifier(_revision)) self._revisions.append(_revision) artifact_key = artifact['filename'], artifact['sha256'] self.known_artifacts[artifact_key] = _revision['id'] return not self.done def target_from_artifact(self, filename, sha256): target = self.known_artifacts.get((filename, sha256)) if target: return { 'target': target, 'target_type': 'revision', } return None def generate_and_load_snapshot(self): branches = {} for release, artifacts in self.project.all_release_artifacts().items(): default_release = self.project.default_release() if len(artifacts) == 1: # Only one artifact for this release, generate a single branch branch_name = 'releases/%s' % release filename, sha256 = artifacts[0] target = self.target_from_artifact(filename, sha256) branches[branch_name.encode('utf-8')] = target if 
release == default_release: branches[b'HEAD'] = { 'target_type': 'alias', 'target': branch_name.encode('utf-8'), } if not target: self._visit_status = 'partial' else: # Several artifacts for this release, generate a separate # pointer for each of them for filename, sha256 in artifacts: branch_name = 'releases/%s/%s' % (release, filename) target = self.target_from_artifact(filename, sha256) branches[branch_name.encode('utf-8')] = target if not target: self._visit_status = 'partial' snapshot = { 'branches': branches, } snapshot['id'] = identifier_to_bytes( snapshot_identifier(snapshot)) self.maybe_load_snapshot(snapshot) def store_data(self): """(override) This sends collected objects to storage. """ self.maybe_load_contents(self._contents) self.maybe_load_directories(self._directories) self.maybe_load_revisions(self._revisions) if self.done: self.generate_and_load_snapshot() self.flush() def load_status(self): return { 'status': self._load_status, } def visit_status(self): return self._visit_status if __name__ == '__main__': import logging import sys logging.basicConfig(level=logging.DEBUG) if len(sys.argv) != 2: logging.error('Usage: %s ' % sys.argv[0]) sys.exit(1) module_name = sys.argv[1] loader = PyPILoader() loader.load( module_name, 'https://pypi.org/projects/%s/' % module_name, 'https://pypi.org/pypi/%s/json' % module_name, ) diff --git a/version.txt b/version.txt index 9d293ed..906bd94 100644 --- a/version.txt +++ b/version.txt @@ -1 +1 @@ -v0.0.3-0-gb237da9 \ No newline at end of file +v0.0.4-0-gc993a89 \ No newline at end of file
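
Note on the `_last_snapshot` hunk in `swh/loader/pypi/loader.py`: the change handles snapshots whose branch list comes back paginated, which is signalled by a non-empty `next_branch`. As written in the hunk, the function has no return statement for the case where `next_branch` is absent, so it returns `None` even though a complete snapshot was fetched. Below is a minimal sketch of the pattern, assuming the intent is to return the snapshot in both cases. `FakeStorage`, `get_all_branches`, and `last_snapshot` are hypothetical stand-ins written for illustration; they are not the `swh.storage` API, and the paging scheme (`branches_from`, `page_size`) is an assumption.

```python
# Illustrative only: FakeStorage and get_all_branches are hypothetical
# stand-ins, not the real swh.storage API.


class FakeStorage:
    """In-memory storage that returns snapshot branches one page at a time."""

    def __init__(self, snapshot_id, branches, page_size=2):
        self.snapshot_id = snapshot_id
        # Keep branches sorted by name so that paging is deterministic.
        self.branches = dict(sorted(branches.items()))
        self.page_size = page_size

    def snapshot_get_branches(self, snapshot_id, branches_from=b''):
        names = [name for name in self.branches if name >= branches_from]
        page, rest = names[:self.page_size], names[self.page_size:]
        return {
            'id': snapshot_id,
            'branches': {name: self.branches[name] for name in page},
            # Name of the first branch of the next page, or None when done.
            'next_branch': rest[0] if rest else None,
        }

    def snapshot_get_latest(self, origin_id):
        # Only the first page is returned, as with a paginated endpoint.
        return self.snapshot_get_branches(self.snapshot_id)


def get_all_branches(storage, snapshot_id):
    """Follow `next_branch` until every page has been merged."""
    snapshot = storage.snapshot_get_branches(snapshot_id)
    next_branch = snapshot.pop('next_branch', None)
    while next_branch:
        page = storage.snapshot_get_branches(
            snapshot_id, branches_from=next_branch)
        snapshot['branches'].update(page['branches'])
        next_branch = page.pop('next_branch', None)
    return snapshot


def last_snapshot(storage, origin_id):
    """Return the latest snapshot with all its branches, or None."""
    snapshot = storage.snapshot_get_latest(origin_id)
    if snapshot and snapshot.pop('next_branch', None):
        # First page was truncated: fetch the complete branch list.
        snapshot = get_all_branches(storage, snapshot['id'])
    # Returned in both cases, so small snapshots are not lost.
    return snapshot


if __name__ == '__main__':
    branches = {
        ('releases/1.%d' % i).encode('utf-8'): {
            'target': b'rev-%d' % i,
            'target_type': 'revision',
        }
        for i in range(5)
    }
    paged = FakeStorage(b'snap-id', branches, page_size=2)
    print(sorted(last_snapshot(paged, origin_id=1)['branches']))
```

With this shape, `_known_artifacts` would receive every branch whether or not the first response was truncated, so release artifacts already loaded are skipped on the next visit.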