diff --git a/README.md b/README.md
index 887b599..4d56957 100644
--- a/README.md
+++ b/README.md
@@ -1,193 +1,205 @@
swh-lister
==========
This component from the Software Heritage stack aims to produce listings
of software origins and their urls hosted on various public developer platforms
or package managers. As these operations are quite similar, it provides a set of
Python modules abstracting common software origins listing behaviors.
It also provides several lister implementations, contained in the
following Python modules:
- `swh.lister.bitbucket`
- `swh.lister.debian`
- `swh.lister.github`
- `swh.lister.gitlab`
- `swh.lister.pypi`
- `swh.lister.npm`
- `swh.lister.phabricator`
Dependencies
------------
All required dependencies can be found in the `requirements*.txt` files located
at the root of the repository.
Local deployment
----------------
## lister configuration
Each lister implemented so far by Software Heritage (`github`, `gitlab`, `debian`, `pypi`, `npm`)
must be configured by following the instructions below (please note that you have to replace
`<lister_name>` by one of the lister name introduced above).
### Preparation steps
1. `mkdir ~/.config/swh/ ~/.cache/swh/lister/<lister_name>/`
2. create configuration file `~/.config/swh/lister_<lister_name>.yml`
3. Bootstrap the db instance schema
```lang=bash
$ createdb lister-<lister_name>
$ python3 -m swh.lister.cli --db-url postgres:///lister-<lister_name> <lister_name>
```
Note: This bootstraps a minimum data set needed for the lister to run.
### Configuration file sample
Minimalistic configuration shared by all listers to add in file `~/.config/swh/lister_<lister_name>.yml`:
```lang=yml
storage:
  cls: 'remote'
  args:
    url: 'http://localhost:5002/'
scheduler:
  cls: 'remote'
  args:
    url: 'http://localhost:5008/'
lister:
  cls: 'local'
  args:
    # see http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls
    db: 'postgresql:///lister-<lister_name>'
credentials: []
cache_responses: True
cache_dir: /home/user/.cache/swh/lister/<lister_name>/
```
Note: This expects storage (5002) and scheduler (5008) services to run locally
## lister-github
Once configured, you can execute a GitHub lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.github.tasks import range_github_lister
logging.basicConfig(level=logging.DEBUG)
range_github_lister(364, 365)
...
```
## lister-gitlab
Once configured, you can execute a GitLab lister using the instructions detailed in the `python3` scripts below:
```lang=python
import logging
from swh.lister.gitlab.tasks import range_gitlab_lister
logging.basicConfig(level=logging.DEBUG)
range_gitlab_lister(1, 2, {
'instance': 'debian',
'api_baseurl': 'https://salsa.debian.org/api/v4',
'sort': 'asc',
'per_page': 20
})
```
```lang=python
import logging
from swh.lister.gitlab.tasks import full_gitlab_relister
logging.basicConfig(level=logging.DEBUG)
full_gitlab_relister({
'instance': '0xacab',
'api_baseurl': 'https://0xacab.org/api/v4',
'sort': 'asc',
'per_page': 20
})
```
```lang=python
import logging
from swh.lister.gitlab.tasks import incremental_gitlab_lister
logging.basicConfig(level=logging.DEBUG)
incremental_gitlab_lister({
'instance': 'freedesktop.org',
'api_baseurl': 'https://gitlab.freedesktop.org/api/v4',
'sort': 'asc',
'per_page': 20
})
```
## lister-debian
Once configured, you can execute a Debian lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.debian.tasks import debian_lister
logging.basicConfig(level=logging.DEBUG)
debian_lister('Debian')
```
## lister-pypi
Once configured, you can execute a PyPI lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.pypi.tasks import pypi_lister
logging.basicConfig(level=logging.DEBUG)
pypi_lister()
```
## lister-npm
Once configured, you can execute a npm lister using the following instructions in a `python3` REPL:
```lang=python
import logging
from swh.lister.npm.tasks import npm_lister
logging.basicConfig(level=logging.DEBUG)
npm_lister()
```
## lister-phabricator
Once configured, you can execute a Phabricator lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.phabricator.tasks import incremental_phabricator_lister
logging.basicConfig(level=logging.DEBUG)
incremental_phabricator_lister(forge_url='https://forge.softwareheritage.org', api_token='XXXX')
```
+## lister-gnu
+
+Once configured, you can execute a GNU lister using the following instructions in a `python3` script:
+
+```lang=python
+import logging
+from swh.lister.gnu.tasks import gnu_lister
+
+logging.basicConfig(level=logging.DEBUG)
+gnu_lister()
+```
+
Licensing
---------
This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
See top-level LICENSE file for the full text of the GNU General Public License
along with this program.
\ No newline at end of file
diff --git a/swh/lister/cli.py b/swh/lister/cli.py
index e6563c9..22b520c 100644
--- a/swh/lister/cli.py
+++ b/swh/lister/cli.py
@@ -1,140 +1,145 @@
 # Copyright (C) 2018 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information
 import logging
 import click
 from swh.core.cli import CONTEXT_SETTINGS
 logger = logging.getLogger(__name__)
 SUPPORTED_LISTERS = ['github', 'gitlab', 'bitbucket', 'debian', 'pypi',
-                     'npm', 'phabricator']
+                     'npm', 'phabricator', 'gnu']
 @click.group(name='lister', context_settings=CONTEXT_SETTINGS)
 @click.pass_context
 def lister(ctx):
     '''Software Heritage Lister tools.'''
     pass
 @lister.command(name='db-init', context_settings=CONTEXT_SETTINGS)
 @click.option(
     '--db-url', '-d', default='postgres:///lister-gitlab.com',
     help='SQLAlchemy DB URL; see '
     '<http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls>')  # noqa
 @click.argument('listers', required=1, nargs=-1,
                 type=click.Choice(SUPPORTED_LISTERS + ['all']))
 @click.option('--drop-tables', '-D', is_flag=True, default=False,
               help='Drop tables before creating the database schema')
 @click.pass_context
 def cli(ctx, db_url, listers, drop_tables):
     """Initialize the database model for given listers.
     """
     override_conf = {
         'lister': {
             'cls': 'local',
             'args': {'db': db_url}
         }
     }
     if 'all' in listers:
         listers = SUPPORTED_LISTERS
     for lister in listers:
         logger.info('Initializing lister %s', lister)
         insert_minimum_data = None
         if lister == 'github':
             from .github.models import IndexingModelBase as ModelBase
             from .github.lister import GitHubLister
             _lister = GitHubLister(
                 api_baseurl='https://api.github.com',
                 override_config=override_conf)
         elif lister == 'bitbucket':
             from .bitbucket.models import IndexingModelBase as ModelBase
             from .bitbucket.lister import BitBucketLister
             _lister = BitBucketLister(
                 api_baseurl='https://api.bitbucket.org/2.0',
                 override_config=override_conf)
         elif lister == 'gitlab':
             from .gitlab.models import ModelBase
             from .gitlab.lister import GitLabLister
             _lister = GitLabLister(
                 api_baseurl='https://gitlab.com/api/v4/',
                 override_config=override_conf)
         elif lister == 'debian':
             from .debian.lister import DebianLister
             ModelBase = DebianLister.MODEL  # noqa
             _lister = DebianLister(override_config=override_conf)
             def insert_minimum_data(lister):
                 from swh.storage.schemata.distribution import (
                     Distribution, Area)
                 d = Distribution(
                     name='Debian',
                     type='deb',
                     mirror_uri='http://deb.debian.org/debian/')
                 lister.db_session.add(d)
                 areas = []
                 for distribution_name in ['stretch']:
                     for area_name in ['main', 'contrib', 'non-free']:
                         areas.append(Area(
                             name='%s/%s' % (distribution_name, area_name),
                             distribution=d,
                         ))
                 lister.db_session.add_all(areas)
                 lister.db_session.commit()
         elif lister == 'pypi':
             from .pypi.models import ModelBase
             from .pypi.lister import PyPILister
             _lister = PyPILister(override_config=override_conf)
         elif lister == 'npm':
             from .npm.models import IndexingModelBase as ModelBase
             from .npm.models import NpmVisitModel
             from .npm.lister import NpmLister
             _lister = NpmLister(override_config=override_conf)
             if drop_tables:
                 NpmVisitModel.metadata.drop_all(_lister.db_engine)
             NpmVisitModel.metadata.create_all(_lister.db_engine)
         elif lister == 'phabricator':
             from .phabricator.models import IndexingModelBase as ModelBase
             from .phabricator.lister import PhabricatorLister
             _lister = PhabricatorLister(
                 forge_url='https://forge.softwareheritage.org',
                 api_token='',
                 override_config=override_conf)
+        elif lister == 'gnu':
+            from .gnu.models import ModelBase
+            from .gnu.lister import GNULister
+            _lister = GNULister(override_config=override_conf)
+
         else:
             raise ValueError(
                 'Invalid lister %s: only supported listers are %s' %
                 (lister, SUPPORTED_LISTERS))
         if drop_tables:
             logger.info('Dropping tables for %s', lister)
             ModelBase.metadata.drop_all(_lister.db_engine)
         logger.info('Creating tables for %s', lister)
         ModelBase.metadata.create_all(_lister.db_engine)
         if insert_minimum_data:
             logger.info('Inserting minimal data for %s', lister)
             try:
                 insert_minimum_data(_lister)
             except Exception:
                 logger.warning(
                     'Failed to insert minimum data in %s', lister)
 if __name__ == '__main__':
     cli()
diff --git a/swh/lister/core/tests/conftest.py b/swh/lister/core/tests/conftest.py
index 17ce8f2..16a9a07 100644
--- a/swh/lister/core/tests/conftest.py
+++ b/swh/lister/core/tests/conftest.py
@@ -1,15 +1,16 @@
 import pytest
 from swh.scheduler.tests.conftest import *  # noqa
 @pytest.fixture(scope='session')
 def celery_includes():
     return [
         'swh.lister.bitbucket.tasks',
         'swh.lister.debian.tasks',
         'swh.lister.github.tasks',
         'swh.lister.gitlab.tasks',
         'swh.lister.npm.tasks',
         'swh.lister.pypi.tasks',
         'swh.lister.phabricator.tasks',
+        'swh.lister.gnu.tasks'
     ]
diff --git a/swh/lister/gnu/__init__.py b/swh/lister/gnu/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/swh/lister/gnu/lister.py b/swh/lister/gnu/lister.py
new file mode 100644
index 0000000..bd821d4
--- /dev/null
+++ b/swh/lister/gnu/lister.py
@@ -0,0 +1,217 @@
+# Copyright (C) 2019 the Software Heritage developers
+# License: GNU General Public License version 3, or any later version
+# See top-level LICENSE file for more information
+
+import random
+import gzip
+import json
+import os
+import requests
+from urllib.parse import urlparse
+
+from .models import GNUModel
+
+from swh.scheduler import utils
+from swh.lister.core.simple_lister import SimpleLister
+from swh.model.hashutil import MultiHash, HASH_BLOCK_SIZE
+
+
+class LocalResponse:
+    """Local Response class with iter_content api
+
+    """
+    def __init__(self, path):
+        self.path = path
+
+    def iter_content(self, chunk_size=None):
+        with open(self.path, 'rb') as f:
+            while True:
+                chunk = f.read(chunk_size)
+                if not chunk:
+                    break
+                yield chunk
+
+
+class ArchiveFetcher:
+    """Http/Local client in charge of downloading archives from a
+    remote/local server.
+
+    Args:
+        temp_directory (str): Path to the temporary disk location used
+            for downloading the release artifacts
+
+    """
+    def __init__(self, temp_directory=None):
+        self.temp_directory = temp_directory or os.getcwd()
+        self.session = requests.session()
+        self.params = {
+            'headers': {
+                'User-Agent': 'Software Heritage Lister ( __devl__)'
+            }
+        }
+
+    def download(self, url):
+        """Download the remote tarball url locally.
+
+        Args:
+            url (str): Url (file or http*)
+
+        Raises:
+            ValueError in case of failing to query
+
+        Returns:
+            Tuple of local (filepath, hashes of filepath)
+
+        """
+        url_parsed = urlparse(url)
+        if url_parsed.scheme == 'file':
+            path = url_parsed.path
+            response = LocalResponse(path)
+            length = os.path.getsize(path)
+        else:
+            response = self.session.get(url, **self.params, stream=True)
+            if response.status_code != 200:
+                raise ValueError("Fail to query '%s'. Reason: %s" % (
+                    url, response.status_code))
+            length = int(response.headers['content-length'])
+
+        filepath = os.path.join(self.temp_directory, os.path.basename(url))
+
+        h = MultiHash(length=length)
+        with open(filepath, 'wb') as f:
+            for chunk in response.iter_content(chunk_size=HASH_BLOCK_SIZE):
+                h.update(chunk)
+                f.write(chunk)
+
+        actual_length = os.path.getsize(filepath)
+        if length != actual_length:
+            raise ValueError('Error when checking size: %s != %s' % (
+                length, actual_length))
+
+        return filepath, h.hexdigest()  # tuple, as documented above
+
+
+class GNULister(SimpleLister, ArchiveFetcher):
+    MODEL = GNUModel
+    LISTER_NAME = 'gnu'
+    TREE_URL = 'https://ftp.gnu.org/tree.json.gz'
+
+    def __init__(self, override_config=None):
+        SimpleLister.__init__(self, override_config=override_config)
+        ArchiveFetcher.__init__(self)  # takes no override_config argument
+
+    def task_dict(self, origin_type, origin_url, **kwargs):
+        """(Override)
+        Return task format dict
+
+        This is overridden from the lister_base as more information is
+        needed for the ingestion task creation.
+
+        """
+        _type = 'load-%s' % origin_type
+        _policy = 'recurring'
+        project_name = kwargs.get('name')
+        project_metadata_url = kwargs.get('html_url')
+        return utils.create_task_dict(
+            _type, _policy, project_name, origin_url,
+            project_metadata_url=project_metadata_url)
+
+    def download_file(self):
+        '''
+        Downloads the tree.json.gz file and returns its location
+
+        Returns
+            File path of the downloaded file
+        '''
+        file_path, hash_dict = self.download(self.TREE_URL)
+        return file_path
+
+    def read_downloaded_file(self, file_path):
+        '''
+        Reads the downloaded file content and converts it into json format
+
+        Returns
+            File content in json format
+        '''
+        with gzip.GzipFile(file_path, 'r') as fin:
+            response = json.loads(fin.read().decode('utf-8'))
+        return response
+
+    def safely_issue_request(self, identifier):
+        '''(Override) Make a network request to download the file which
+        describes the file structure of the GNU website.
+
+        Args:
+            identifier: resource identifier
+        Returns:
+            server response
+        '''
+        file_path = self.download_file()
+        response = self.read_downloaded_file(file_path)
+        return response
+
+    def list_packages(self, response):
+        """(Override) List the actual gnu origins with their names and
+        time last updated from the response.
+
+        """
+        response = clean_up_response(response)
+        _packages = []
+        for directory in response:
+            content = directory['contents']
+            for repo in content:
+                if repo['type'] == 'directory':
+                    repo_details = {
+                        'name': repo['name'],
+                        'url': self._get_project_url(directory['name'],
+                                                     repo['name']),
+                        'time_modified': repo['time']
+                    }
+                    _packages.append(repo_details)
+        random.shuffle(_packages)
+        return _packages
+
+    def _get_project_url(self, dir_name, package_name):
+        """Returns project_url
+
+        """
+        return 'https://ftp.gnu.org/%s/%s/' % (dir_name, package_name)
+
+    def get_model_from_repo(self, repo):
+        """(Override) Transform from repository representation to model
+
+        """
+        return {
+            'uid': repo['name'],
+            'name': repo['name'],
+            'full_name': repo['name'],
+            'html_url': repo['url'],
+            'origin_url': repo['url'],
+            'time_last_upated': repo['time_modified'],
+            'origin_type': 'gnu',
+            'description': None,
+        }
+
+    def transport_response_simplified(self, response):
+        """(Override) Transform response to list for model manipulation
+
+        """
+        return [self.get_model_from_repo(repo) for repo in response]
+
+    def transport_request(self):
+        pass
+
+    def transport_response_to_string(self):
+        pass
+
+    def transport_quota_check(self):
+        pass
+
+
+def clean_up_response(response):
+    final_response = []
+    file_system = response[0]['contents']  # same key as in list_packages
+    for directory in file_system:
+        if directory['name'] in ('gnu', 'mirrors', 'old-gnu'):
+            final_response.append(directory)
+    return final_response
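The listing logic in `swh/lister/gnu/lister.py` can be tried without network access or the Software Heritage stack. The following is a minimal standard-library sketch, not the lister itself: `SAMPLE_TREE`, `read_tree`, `list_packages` and `demo` are hypothetical names, and the sample data only assumes the layout that `list_packages` in the diff reads (a single root entry whose `contents` holds directories, each entry carrying `type`, `name` and `time`). Unlike the lister, the sketch does not shuffle the result, so the output order is deterministic.

```python
import gzip
import json
import os
import tempfile

# Hypothetical miniature of the tree.json.gz layout (assumption: one root
# entry whose 'contents' lists the top-level directories).
SAMPLE_TREE = [{
    'type': 'directory', 'name': '.', 'contents': [
        {'type': 'directory', 'name': 'gnu', 'contents': [
            {'type': 'directory', 'name': 'bash', 'time': '1561000000'},
            {'type': 'file', 'name': 'README', 'time': '1560000000'},
        ]},
        {'type': 'directory', 'name': 'video', 'contents': [
            {'type': 'directory', 'name': 'talks', 'time': '1550000000'},
        ]},
        {'type': 'directory', 'name': 'old-gnu', 'contents': [
            {'type': 'directory', 'name': 'emacs', 'time': '1500000000'},
        ]},
    ],
}]

# Top-level directories kept by clean_up_response in the diff.
KEPT_DIRECTORIES = ('gnu', 'mirrors', 'old-gnu')


def read_tree(file_path):
    """Decompress a gzip file and parse its content as JSON."""
    with gzip.open(file_path, 'rb') as fin:
        return json.loads(fin.read().decode('utf-8'))


def list_packages(tree):
    """Return one dict per sub-directory of the kept top-level directories."""
    packages = []
    for directory in tree[0]['contents']:
        if directory['name'] not in KEPT_DIRECTORIES:
            continue  # e.g. 'video' is not a package directory
        for entry in directory['contents']:
            if entry['type'] == 'directory':  # plain files are skipped
                packages.append({
                    'name': entry['name'],
                    'url': 'https://ftp.gnu.org/%s/%s/' % (
                        directory['name'], entry['name']),
                    'time_modified': entry['time'],
                })
    return packages


def demo():
    """Write the sample tree to a temporary .gz file, then list it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'tree.json.gz')
        with gzip.open(path, 'wb') as fout:
            fout.write(json.dumps(SAMPLE_TREE).encode('utf-8'))
        return list_packages(read_tree(path))


packages = demo()
for package in packages:
    print(package['name'], package['url'])
```

Only `bash` and `emacs` come out of the sample: `README` is a file, and `video` is not one of the kept directories.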
diff --git a/swh/lister/gnu/models.py b/swh/lister/gnu/models.py
new file mode 100644
index 0000000..ebad039
--- /dev/null
+++ b/swh/lister/gnu/models.py
@@ -0,0 +1,17 @@
+# Copyright (C) 2019 the Software Heritage developers
+# License: GNU General Public License version 3, or any later version
+# See top-level LICENSE file for more information
+
+from sqlalchemy import Column, String, Integer
+
+from ..core.models import ModelBase
+
+
+class GNUModel(ModelBase):
+    """a GNU repository representation
+
+    """
+    __tablename__ = 'gnu_repo'
+
+    uid = Column(String, primary_key=True)
+    time_last_upated = Column(Integer)
diff --git a/swh/lister/gnu/tasks.py b/swh/lister/gnu/tasks.py
new file mode 100644
index 0000000..251eccf
--- /dev/null
+++ b/swh/lister/gnu/tasks.py
@@ -0,0 +1,17 @@
+# Copyright (C) 2019 the Software Heritage developers
+# License: GNU General Public License version 3, or any later version
+# See top-level LICENSE file for more information
+
+from swh.scheduler.celery_backend.config import app
+
+from .lister import GNULister
+
+
+@app.task(name=__name__ + '.GNUListerTask')
+def gnu_lister(**lister_args):
+    GNULister(**lister_args).run()
+
+
+@app.task(name=__name__ + '.ping')
+def ping():
+    return 'OK'
diff --git a/swh/lister/gnu/tests/__init__.py b/swh/lister/gnu/tests/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/swh/lister/gnu/tests/conftest.py b/swh/lister/gnu/tests/conftest.py
new file mode 100644
index 0000000..507fef9
--- /dev/null
+++ b/swh/lister/gnu/tests/conftest.py
@@ -0,0 +1 @@
+from swh.lister.core.tests.conftest import * # noqa
diff --git a/swh/lister/gnu/tests/test_tasks.py b/swh/lister/gnu/tests/test_tasks.py
new file mode 100644
index 0000000..4c82f77
--- /dev/null
+++ b/swh/lister/gnu/tests/test_tasks.py
@@ -0,0 +1,27 @@
+from unittest.mock import patch
+
+
+def test_ping(swh_app, celery_session_worker):
+    res = swh_app.send_task(
+        'swh.lister.gnu.tasks.ping')
+    assert res
+    res.wait()
+    assert res.successful()
+    assert res.result == 'OK'
+
+
+@patch('swh.lister.gnu.tasks.GNULister')
+def test_lister(lister, swh_app, celery_session_worker):
+    # setup the mocked GNULister
+    lister.return_value = lister
+    lister.run.return_value = None
+
+    res = swh_app.send_task(
+        'swh.lister.gnu.tasks.GNUListerTask')
+    assert res
+    res.wait()
+    assert res.successful()
+
+    lister.assert_called_once_with()
+    lister.db_last_index.assert_not_called()
+    lister.run.assert_called_once_with()
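The mocking pattern in `test_lister` relies on `lister.return_value = lister`: calling the patched class then returns the same mock, so the assertions on the "class" and on the "instance" target a single object. A minimal sketch of that pattern without Celery or the scheduler fixtures follows. `run_lister` is a hypothetical stand-in for the body of the `gnu_lister` task, and `MagicMock` replaces the `@patch` decorator.

```python
from unittest.mock import MagicMock


def run_lister(lister_class, **lister_args):
    # Hypothetical stand-in for the task body: instantiate, then run.
    lister_class(**lister_args).run()


lister = MagicMock()
lister.return_value = lister    # calling the "class" returns the same mock
lister.run.return_value = None

run_lister(lister)

# Same assertions as in test_lister, made on one shared mock object.
lister.assert_called_once_with()          # instantiated once, without arguments
lister.run.assert_called_once_with()      # run() called once on the "instance"
lister.db_last_index.assert_not_called()  # attribute never called
```

Without the `return_value` assignment, `lister()` would return a separate child mock, and the `run` call would have to be checked on `lister.return_value.run` instead.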