diff --git a/README.md b/README.md index b6ee69e..b54e486 100644 --- a/README.md +++ b/README.md @@ -1,237 +1,251 @@ swh-lister ========== This component from the Software Heritage stack aims to produce listings of software origins and their urls hosted on various public developer platforms or package managers. As these operations are quite similar, it provides a set of Python modules abstracting common software origins listing behaviors. It also provides several lister implementations, contained in the following Python modules: - `swh.lister.bitbucket` - `swh.lister.debian` - `swh.lister.github` - `swh.lister.gitlab` - `swh.lister.gnu` - `swh.lister.pypi` - `swh.lister.npm` - `swh.lister.phabricator` - `swh.lister.cran` - `swh.lister.cgit` +- `swh.lister.packagist` Dependencies ------------ All required dependencies can be found in the `requirements*.txt` files located at the root of the repository. Local deployment ---------------- ## lister configuration Each lister implemented so far by Software Heritage (`github`, `gitlab`, `debian`, `pypi`, `npm`) must be configured by following the instructions below (please note that you have to replace `` by one of the lister name introduced above). ### Preparation steps 1. `mkdir ~/.config/swh/ ~/.cache/swh/lister//` 2. create configuration file `~/.config/swh/lister_.yml` 3. Bootstrap the db instance schema ```lang=bash $ createdb lister- $ python3 -m swh.lister.cli --db-url postgres:///lister- ``` Note: This bootstraps a minimum data set needed for the lister to run. 
### Configuration file sample Minimalistic configuration shared by all listers to add in file `~/.config/swh/lister_.yml`: ```lang=yml storage: cls: 'remote' args: url: 'http://localhost:5002/' scheduler: cls: 'remote' args: url: 'http://localhost:5008/' lister: cls: 'local' args: # see http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls db: 'postgresql:///lister-' credentials: [] cache_responses: True cache_dir: /home/user/.cache/swh/lister// ``` Note: This expects storage (5002) and scheduler (5008) services to run locally ## lister-github Once configured, you can execute a GitHub lister using the following instructions in a `python3` script: ```lang=python import logging from swh.lister.github.tasks import range_github_lister logging.basicConfig(level=logging.DEBUG) range_github_lister(364, 365) ... ``` ## lister-gitlab Once configured, you can execute a GitLab lister using the instructions detailed in the `python3` scripts below: ```lang=python import logging from swh.lister.gitlab.tasks import range_gitlab_lister logging.basicConfig(level=logging.DEBUG) range_gitlab_lister(1, 2, { 'instance': 'debian', 'api_baseurl': 'https://salsa.debian.org/api/v4', 'sort': 'asc', 'per_page': 20 }) ``` ```lang=python import logging from swh.lister.gitlab.tasks import full_gitlab_relister logging.basicConfig(level=logging.DEBUG) full_gitlab_relister({ 'instance': '0xacab', 'api_baseurl': 'https://0xacab.org/api/v4', 'sort': 'asc', 'per_page': 20 }) ``` ```lang=python import logging from swh.lister.gitlab.tasks import incremental_gitlab_lister logging.basicConfig(level=logging.DEBUG) incremental_gitlab_lister({ 'instance': 'freedesktop.org', 'api_baseurl': 'https://gitlab.freedesktop.org/api/v4', 'sort': 'asc', 'per_page': 20 }) ``` ## lister-debian Once configured, you can execute a Debian lister using the following instructions in a `python3` script: ```lang=python import logging from swh.lister.debian.tasks import debian_lister 
logging.basicConfig(level=logging.DEBUG) debian_lister('Debian') ``` ## lister-pypi Once configured, you can execute a PyPI lister using the following instructions in a `python3` script: ```lang=python import logging from swh.lister.pypi.tasks import pypi_lister logging.basicConfig(level=logging.DEBUG) pypi_lister() ``` ## lister-npm Once configured, you can execute a npm lister using the following instructions in a `python3` REPL: ```lang=python import logging from swh.lister.npm.tasks import npm_lister logging.basicConfig(level=logging.DEBUG) npm_lister() ``` ## lister-phabricator Once configured, you can execute a Phabricator lister using the following instructions in a `python3` script: ```lang=python import logging from swh.lister.phabricator.tasks import incremental_phabricator_lister logging.basicConfig(level=logging.DEBUG) incremental_phabricator_lister(forge_url='https://forge.softwareheritage.org', api_token='XXXX') ``` ## lister-gnu Once configured, you can execute a PyPI lister using the following instructions in a `python3` script: ```lang=python import logging from swh.lister.gnu.tasks import gnu_lister logging.basicConfig(level=logging.DEBUG) gnu_lister() ``` ## lister-cran Once configured, you can execute a CRAN lister using the following instructions in a `python3` script: ```lang=python import logging from swh.lister.cran.tasks import cran_lister logging.basicConfig(level=logging.DEBUG) cran_lister() ``` ## lister-cgit Once configured, you can execute a cgit lister using the following instructions in a `python3` script: ```lang=python import logging from swh.lister.cgit.tasks import cgit_lister logging.basicConfig(level=logging.DEBUG) # simple cgit instance cgit_lister(url='https://git.kernel.org/') # cgit instance whose listed repositories differ from the base url cgit_lister(url='https://cgit.kde.org/', url_prefix='https://anongit.kde.org/') ``` +## lister-packagist + +Once configured, you can execute a Packagist lister using the following 
instructions +in a `python3` script: + +```lang=python +import logging +from swh.lister.packagist.tasks import packagist_lister + +logging.basicConfig(level=logging.DEBUG) +packagist_lister() +``` + Licensing --------- This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. See top-level LICENSE file for the full text of the GNU General Public License along with this program. diff --git a/swh/lister/cli.py b/swh/lister/cli.py index 3a6f38f..b8c51b0 100644 --- a/swh/lister/cli.py +++ b/swh/lister/cli.py @@ -1,158 +1,163 @@ # Copyright (C) 2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import logging import click from swh.core.cli import CONTEXT_SETTINGS logger = logging.getLogger(__name__) SUPPORTED_LISTERS = ['github', 'gitlab', 'bitbucket', 'debian', 'pypi', - 'npm', 'phabricator', 'gnu', 'cran', 'cgit'] + 'npm', 'phabricator', 'gnu', 'cran', 'cgit', 'packagist'] @click.group(name='lister', context_settings=CONTEXT_SETTINGS) @click.pass_context def lister(ctx): '''Software Heritage Lister tools.''' pass @lister.command(name='db-init', context_settings=CONTEXT_SETTINGS) @click.option( '--db-url', '-d', default='postgres:///lister-gitlab.com', help='SQLAlchemy DB URL; see ' '') # noqa @click.argument('listers', required=1, nargs=-1, type=click.Choice(SUPPORTED_LISTERS + ['all'])) @click.option('--drop-tables', '-D', is_flag=True, default=False, help='Drop tables before creating 
the database schema') @click.pass_context def cli(ctx, db_url, listers, drop_tables): """Initialize the database model for given listers. """ override_conf = { 'lister': { 'cls': 'local', 'args': {'db': db_url} } } if 'all' in listers: listers = SUPPORTED_LISTERS for lister in listers: logger.info('Initializing lister %s', lister) insert_minimum_data = None if lister == 'github': from .github.models import IndexingModelBase as ModelBase from .github.lister import GitHubLister _lister = GitHubLister( api_baseurl='https://api.github.com', override_config=override_conf) elif lister == 'bitbucket': from .bitbucket.models import IndexingModelBase as ModelBase from .bitbucket.lister import BitBucketLister _lister = BitBucketLister( api_baseurl='https://api.bitbucket.org/2.0', override_config=override_conf) elif lister == 'gitlab': from .gitlab.models import ModelBase from .gitlab.lister import GitLabLister _lister = GitLabLister( api_baseurl='https://gitlab.com/api/v4/', override_config=override_conf) elif lister == 'debian': from .debian.lister import DebianLister ModelBase = DebianLister.MODEL # noqa _lister = DebianLister(override_config=override_conf) def insert_minimum_data(lister): from swh.storage.schemata.distribution import ( Distribution, Area) d = Distribution( name='Debian', type='deb', mirror_uri='http://deb.debian.org/debian/') lister.db_session.add(d) areas = [] for distribution_name in ['stretch']: for area_name in ['main', 'contrib', 'non-free']: areas.append(Area( name='%s/%s' % (distribution_name, area_name), distribution=d, )) lister.db_session.add_all(areas) lister.db_session.commit() elif lister == 'pypi': from .pypi.models import ModelBase from .pypi.lister import PyPILister _lister = PyPILister(override_config=override_conf) elif lister == 'npm': from .npm.models import IndexingModelBase as ModelBase from .npm.models import NpmVisitModel from .npm.lister import NpmLister _lister = NpmLister(override_config=override_conf) if drop_tables: 
NpmVisitModel.metadata.drop_all(_lister.db_engine) NpmVisitModel.metadata.create_all(_lister.db_engine) elif lister == 'phabricator': from .phabricator.models import IndexingModelBase as ModelBase from .phabricator.lister import PhabricatorLister _lister = PhabricatorLister( forge_url='https://forge.softwareheritage.org', api_token='', override_config=override_conf) elif lister == 'gnu': from .gnu.models import ModelBase from .gnu.lister import GNULister _lister = GNULister(override_config=override_conf) elif lister == 'cran': from .cran.models import ModelBase from .cran.lister import CRANLister _lister = CRANLister(override_config=override_conf) elif lister == 'cgit': from .cgit.models import ModelBase from .cgit.lister import CGitLister _lister = CGitLister( url='http://git.savannah.gnu.org/cgit/', url_prefix='http://git.savannah.gnu.org/git/', override_config=override_conf) + elif lister == 'packagist': + from .packagist.models import ModelBase + from .packagist.lister import PackagistLister + _lister = PackagistLister(override_config=override_conf) + else: raise ValueError( 'Invalid lister %s: only supported listers are %s' % (lister, SUPPORTED_LISTERS)) if drop_tables: logger.info('Dropping tables for %s', lister) ModelBase.metadata.drop_all(_lister.db_engine) logger.info('Creating tables for %s', lister) ModelBase.metadata.create_all(_lister.db_engine) if insert_minimum_data: logger.info('Inserting minimal data for %s', lister) try: insert_minimum_data(_lister) except Exception: logger.warning( 'Failed to insert minimum data in %s', lister) if __name__ == '__main__': cli() diff --git a/swh/lister/core/tests/conftest.py b/swh/lister/core/tests/conftest.py index b8dd868..a1f9346 100644 --- a/swh/lister/core/tests/conftest.py +++ b/swh/lister/core/tests/conftest.py @@ -1,18 +1,19 @@ import pytest from swh.scheduler.tests.conftest import * # noqa @pytest.fixture(scope='session') def celery_includes(): return [ 'swh.lister.bitbucket.tasks', 
'swh.lister.cgit.tasks', 'swh.lister.cran.tasks', 'swh.lister.debian.tasks', 'swh.lister.github.tasks', 'swh.lister.gitlab.tasks', 'swh.lister.gnu.tasks', 'swh.lister.npm.tasks', - 'swh.lister.pypi.tasks', + 'swh.lister.packagist.tasks', 'swh.lister.phabricator.tasks', + 'swh.lister.pypi.tasks', ] diff --git a/swh/lister/packagist/__init__.py b/swh/lister/packagist/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/swh/lister/packagist/lister.py b/swh/lister/packagist/lister.py new file mode 100644 index 0000000..29ddbc6 --- /dev/null +++ b/swh/lister/packagist/lister.py @@ -0,0 +1,84 @@ +# Copyright (C) 2019 the Software Heritage developers +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +import random +import json +from .models import PackagistModel + +from swh.scheduler import utils +from swh.lister.core.simple_lister import SimpleLister +from swh.lister.core.lister_transports import ListerOnePageApiTransport + + +class PackagistLister(ListerOnePageApiTransport, SimpleLister): + """List packages available in the Packagist package manager. + + The lister sends the request to the url present in the class + variable `PAGE`, to receive a list of all the package names + present in the Packagist package manager. It iterates over all the + packages, constructs the metadata url of each package from + its name and creates a loading task.
+ + Task: + Type: load-packagist + Policy: recurring + Args: + + + + Example: + Type: load-packagist + Policy: recurring + Args: + 'hypejunction/hypegamemechanics' + 'https://repo.packagist.org/p/hypejunction/hypegamemechanics.json' + + """ + MODEL = PackagistModel + LISTER_NAME = 'packagist' + PAGE = 'https://packagist.org/packages/list.json' + instance = 'packagist' + + def __init__(self, override_config=None): + ListerOnePageApiTransport.__init__(self) + SimpleLister.__init__(self, override_config=override_config) + + def task_dict(self, origin_type, origin_url, **kwargs): + """Return task format dict + + This is overridden from the lister_base as more information is + needed for the ingestion task creation. + + """ + return utils.create_task_dict('load-%s' % origin_type, 'recurring', + kwargs.get('name'), origin_url) + + def list_packages(self, response): + """List the actual packagist origins from the response. + + """ + response = json.loads(response.text) + packages = list(response['packageNames']) + random.shuffle(packages) + return packages + + def get_model_from_repo(self, repo_name): + """Transform from repository representation to model + + """ + url = 'https://repo.packagist.org/p/%s.json' % repo_name + return { + 'uid': repo_name, + 'name': repo_name, + 'full_name': repo_name, + 'html_url': url, + 'origin_url': url, + 'origin_type': 'packagist', + } + + def transport_response_simplified(self, response): + """Transform response to list for model manipulation + + """ + return [self.get_model_from_repo(repo_name) for repo_name in response] diff --git a/swh/lister/packagist/models.py b/swh/lister/packagist/models.py new file mode 100644 index 0000000..36a6333 --- /dev/null +++ b/swh/lister/packagist/models.py @@ -0,0 +1,16 @@ +# Copyright (C) 2019 the Software Heritage developers +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +from sqlalchemy import Column, String +
+from ..core.models import ModelBase + + +class PackagistModel(ModelBase): + """a Packagist repository representation + + """ + __tablename__ = 'packagist_repo' + + uid = Column(String, primary_key=True) diff --git a/swh/lister/packagist/tasks.py b/swh/lister/packagist/tasks.py new file mode 100644 index 0000000..e17e892 --- /dev/null +++ b/swh/lister/packagist/tasks.py @@ -0,0 +1,17 @@ +# Copyright (C) 2019 the Software Heritage developers +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +from swh.scheduler.celery_backend.config import app + +from .lister import PackagistLister + + +@app.task(name=__name__ + '.PackagistListerTask') +def packagist_lister(**lister_args): + PackagistLister(**lister_args).run() + + +@app.task(name=__name__ + '.ping') +def ping(): + return 'OK' diff --git a/swh/lister/packagist/tests/__init__.py b/swh/lister/packagist/tests/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/swh/lister/packagist/tests/api_response.json b/swh/lister/packagist/tests/api_response.json new file mode 100644 index 0000000..2e4843c --- /dev/null +++ b/swh/lister/packagist/tests/api_response.json @@ -0,0 +1,9 @@ +{ + "packageNames": [ + "0.0.0/composer-include-files", + "0.0.0/laravel-env-shim", + "0.0.1/try-make-package", + "0099ff/dialogflowphp", + "00f100/array_dot" + ] +} \ No newline at end of file diff --git a/swh/lister/packagist/tests/conftest.py b/swh/lister/packagist/tests/conftest.py new file mode 100644 index 0000000..507fef9 --- /dev/null +++ b/swh/lister/packagist/tests/conftest.py @@ -0,0 +1 @@ +from swh.lister.core.tests.conftest import * # noqa diff --git a/swh/lister/packagist/tests/test_lister.py b/swh/lister/packagist/tests/test_lister.py new file mode 100644 index 0000000..fb58424 --- /dev/null +++ b/swh/lister/packagist/tests/test_lister.py @@ -0,0 +1,66 @@ +# Copyright (C) 2019 the Software Heritage developers +# License: GNU General Public 
License version 3, or any later version +# See top-level LICENSE file for more information + +import unittest +import requests_mock +from unittest.mock import patch +from swh.lister.packagist.lister import PackagistLister +from swh.lister.core.tests.test_lister import HttpSimpleListerTester + + +expected_packages = ['0.0.0/composer-include-files', '0.0.0/laravel-env-shim', + '0.0.1/try-make-package', '0099ff/dialogflowphp', + '00f100/array_dot'] + +expected_model = { + 'uid': '0099ff/dialogflowphp', + 'name': '0099ff/dialogflowphp', + 'full_name': '0099ff/dialogflowphp', + 'html_url': + 'https://repo.packagist.org/p/0099ff/dialogflowphp.json', + 'origin_url': + 'https://repo.packagist.org/p/0099ff/dialogflowphp.json', + 'origin_type': 'packagist', + } + + +class PackagistListerTester(HttpSimpleListerTester, unittest.TestCase): + Lister = PackagistLister + PAGE = 'https://packagist.org/packages/list.json' + lister_subdir = 'packagist' + good_api_response_file = 'api_response.json' + entries = 5 + + @requests_mock.Mocker() + def test_list_packages(self, http_mocker): + """List packages from simple api page should retrieve all packages within + + """ + http_mocker.get(self.PAGE, text=self.mock_response) + fl = self.get_fl() + packages = fl.list_packages(self.get_api_response(0)) + + for package in expected_packages: + assert package in packages + + def test_transport_response_simplified(self): + """Test model created by the lister + + """ + fl = self.get_fl() + model = fl.transport_response_simplified(['0099ff/dialogflowphp']) + assert len(model) == 1 + for key, values in model[0].items(): + assert values == expected_model[key] + + def test_task_dict(self): + """Test the task creation of lister + + """ + fl = self.get_fl() + with patch('swh.lister.packagist.lister.utils.create_task_dict') as mock_create_tasks: # noqa + fl.task_dict(origin_type='packagist', origin_url='https://abc', + name='test_pack') + mock_create_tasks.assert_called_once_with( + 'load-packagist', 
'recurring', 'test_pack', 'https://abc') diff --git a/swh/lister/packagist/tests/test_tasks.py b/swh/lister/packagist/tests/test_tasks.py new file mode 100644 index 0000000..cbe807d --- /dev/null +++ b/swh/lister/packagist/tests/test_tasks.py @@ -0,0 +1,31 @@ +# Copyright (C) 2019 the Software Heritage developers +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +from unittest.mock import patch + + +def test_ping(swh_app, celery_session_worker): + res = swh_app.send_task( + 'swh.lister.packagist.tasks.ping') + assert res + res.wait() + assert res.successful() + assert res.result == 'OK' + + +@patch('swh.lister.packagist.tasks.PackagistLister') +def test_lister(lister, swh_app, celery_session_worker): + # setup the mocked PackagistLister + lister.return_value = lister + lister.run.return_value = None + + res = swh_app.send_task( + 'swh.lister.packagist.tasks.PackagistListerTask') + assert res + res.wait() + assert res.successful() + + lister.assert_called_once_with() + lister.db_last_index.assert_not_called() + lister.run.assert_called_once_with()
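
Reviewer note: the data flow that `PackagistLister` implements (parse `list.json`, then derive one metadata URL per package name) can be sketched without any Software Heritage dependency. The block below is only an illustration under stated assumptions: the standalone helpers `list_packages` and `model_from_name` are hypothetical stand-ins for the methods in the patch, the sample payload reuses two names from `api_response.json`, and the shuffle done by the real lister is left out so the result stays deterministic.

```python
import json

# URL template copied from get_model_from_repo in the patch above.
METADATA_URL = 'https://repo.packagist.org/p/%s.json'


def list_packages(text):
    """Hypothetical helper: extract package names from a list.json payload."""
    return list(json.loads(text)['packageNames'])


def model_from_name(name):
    """Hypothetical helper: build the model dict for one package name."""
    url = METADATA_URL % name
    return {
        'uid': name,
        'name': name,
        'full_name': name,
        'html_url': url,
        'origin_url': url,
        'origin_type': 'packagist',
    }


# Sample payload shaped like swh/lister/packagist/tests/api_response.json.
sample = json.dumps(
    {'packageNames': ['0099ff/dialogflowphp', '00f100/array_dot']})

models = [model_from_name(name) for name in list_packages(sample)]
for model in models:
    print(model['uid'], '->', model['origin_url'])
```

Each resulting dict carries the same keys that `test_transport_response_simplified` checks against `expected_model`, so the sketch can double as a quick sanity check when the URL template changes.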