diff --git a/PKG-INFO b/PKG-INFO
index dadd9b2..15cfd4a 100644
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,239 +1,239 @@
Metadata-Version: 2.1
Name: swh.lister
-Version: 0.0.29
+Version: 0.0.30
Summary: Software Heritage lister
Home-page: https://forge.softwareheritage.org/diffusion/DLSGH/
Author: Software Heritage developers
Author-email: swh-devel@inria.fr
License: UNKNOWN
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Source, https://forge.softwareheritage.org/source/swh-lister
Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
Description: swh-lister
==========
This component from the Software Heritage stack aims to produce listings
of software origins and their urls hosted on various public developer platforms
or package managers. As these operations are quite similar, it provides a set of
Python modules abstracting common software origins listing behaviors.
It also provides several lister implementations, contained in the
following Python modules:
- `swh.lister.bitbucket`
- `swh.lister.debian`
- `swh.lister.github`
- `swh.lister.gitlab`
- `swh.lister.gnu`
- `swh.lister.pypi`
- `swh.lister.npm`
- `swh.lister.phabricator`
- `swh.lister.cran`
Dependencies
------------
All required dependencies can be found in the `requirements*.txt` files located
at the root of the repository.
Local deployment
----------------
## lister configuration
Each lister implemented so far by Software Heritage (`github`, `gitlab`, `debian`, `pypi`, `npm`)
must be configured by following the instructions below (please note that you have to replace
`<lister_name>` with one of the lister names introduced above).
### Preparation steps
1. `mkdir ~/.config/swh/ ~/.cache/swh/lister/<lister_name>/`
2. create configuration file `~/.config/swh/lister_<lister_name>.yml`
3. Bootstrap the db instance schema
```lang=bash
$ createdb lister-<lister_name>
$ python3 -m swh.lister.cli --db-url postgres:///lister-<lister_name> <lister_name>
```
Note: This bootstraps a minimum data set needed for the lister to run.
### Configuration file sample
Minimalistic configuration shared by all listers to add in file `~/.config/swh/lister_<lister_name>.yml`:
```lang=yml
storage:
cls: 'remote'
args:
url: 'http://localhost:5002/'
scheduler:
cls: 'remote'
args:
url: 'http://localhost:5008/'
lister:
cls: 'local'
args:
# see http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls
db: 'postgresql:///lister-<lister_name>'
credentials: []
cache_responses: True
cache_dir: /home/user/.cache/swh/lister/<lister_name>/
```
Note: This expects storage (5002) and scheduler (5008) services to run locally
## lister-github
Once configured, you can execute a GitHub lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.github.tasks import range_github_lister
logging.basicConfig(level=logging.DEBUG)
range_github_lister(364, 365)
...
```
## lister-gitlab
Once configured, you can execute a GitLab lister using the instructions detailed in the `python3` scripts below:
```lang=python
import logging
from swh.lister.gitlab.tasks import range_gitlab_lister
logging.basicConfig(level=logging.DEBUG)
range_gitlab_lister(1, 2, {
'instance': 'debian',
'api_baseurl': 'https://salsa.debian.org/api/v4',
'sort': 'asc',
'per_page': 20
})
```
```lang=python
import logging
from swh.lister.gitlab.tasks import full_gitlab_relister
logging.basicConfig(level=logging.DEBUG)
full_gitlab_relister({
'instance': '0xacab',
'api_baseurl': 'https://0xacab.org/api/v4',
'sort': 'asc',
'per_page': 20
})
```
```lang=python
import logging
from swh.lister.gitlab.tasks import incremental_gitlab_lister
logging.basicConfig(level=logging.DEBUG)
incremental_gitlab_lister({
'instance': 'freedesktop.org',
'api_baseurl': 'https://gitlab.freedesktop.org/api/v4',
'sort': 'asc',
'per_page': 20
})
```
## lister-debian
Once configured, you can execute a Debian lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.debian.tasks import debian_lister
logging.basicConfig(level=logging.DEBUG)
debian_lister('Debian')
```
## lister-pypi
Once configured, you can execute a PyPI lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.pypi.tasks import pypi_lister
logging.basicConfig(level=logging.DEBUG)
pypi_lister()
```
## lister-npm
Once configured, you can execute an npm lister using the following instructions in a `python3` REPL:
```lang=python
import logging
from swh.lister.npm.tasks import npm_lister
logging.basicConfig(level=logging.DEBUG)
npm_lister()
```
## lister-phabricator
Once configured, you can execute a Phabricator lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.phabricator.tasks import incremental_phabricator_lister
logging.basicConfig(level=logging.DEBUG)
incremental_phabricator_lister(forge_url='https://forge.softwareheritage.org', api_token='XXXX')
```
## lister-gnu
Once configured, you can execute a GNU lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.gnu.tasks import gnu_lister
logging.basicConfig(level=logging.DEBUG)
gnu_lister()
```
## lister-cran
Once configured, you can execute a CRAN lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.cran.tasks import cran_lister
logging.basicConfig(level=logging.DEBUG)
cran_lister()
```
Licensing
---------
This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
See top-level LICENSE file for the full text of the GNU General Public License
along with this program.
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Description-Content-Type: text/markdown
Provides-Extra: testing
diff --git a/swh.lister.egg-info/PKG-INFO b/swh.lister.egg-info/PKG-INFO
index dadd9b2..15cfd4a 100644
--- a/swh.lister.egg-info/PKG-INFO
+++ b/swh.lister.egg-info/PKG-INFO
@@ -1,239 +1,239 @@
Metadata-Version: 2.1
Name: swh.lister
-Version: 0.0.29
+Version: 0.0.30
Summary: Software Heritage lister
Home-page: https://forge.softwareheritage.org/diffusion/DLSGH/
Author: Software Heritage developers
Author-email: swh-devel@inria.fr
License: UNKNOWN
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Source, https://forge.softwareheritage.org/source/swh-lister
Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
Description: swh-lister
==========
This component from the Software Heritage stack aims to produce listings
of software origins and their urls hosted on various public developer platforms
or package managers. As these operations are quite similar, it provides a set of
Python modules abstracting common software origins listing behaviors.
It also provides several lister implementations, contained in the
following Python modules:
- `swh.lister.bitbucket`
- `swh.lister.debian`
- `swh.lister.github`
- `swh.lister.gitlab`
- `swh.lister.gnu`
- `swh.lister.pypi`
- `swh.lister.npm`
- `swh.lister.phabricator`
- `swh.lister.cran`
Dependencies
------------
All required dependencies can be found in the `requirements*.txt` files located
at the root of the repository.
Local deployment
----------------
## lister configuration
Each lister implemented so far by Software Heritage (`github`, `gitlab`, `debian`, `pypi`, `npm`)
must be configured by following the instructions below (please note that you have to replace
`<lister_name>` with one of the lister names introduced above).
### Preparation steps
1. `mkdir ~/.config/swh/ ~/.cache/swh/lister/<lister_name>/`
2. create configuration file `~/.config/swh/lister_<lister_name>.yml`
3. Bootstrap the db instance schema
```lang=bash
$ createdb lister-<lister_name>
$ python3 -m swh.lister.cli --db-url postgres:///lister-<lister_name> <lister_name>
```
Note: This bootstraps a minimum data set needed for the lister to run.
### Configuration file sample
Minimalistic configuration shared by all listers to add in file `~/.config/swh/lister_<lister_name>.yml`:
```lang=yml
storage:
cls: 'remote'
args:
url: 'http://localhost:5002/'
scheduler:
cls: 'remote'
args:
url: 'http://localhost:5008/'
lister:
cls: 'local'
args:
# see http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls
db: 'postgresql:///lister-<lister_name>'
credentials: []
cache_responses: True
cache_dir: /home/user/.cache/swh/lister/<lister_name>/
```
Note: This expects storage (5002) and scheduler (5008) services to run locally
## lister-github
Once configured, you can execute a GitHub lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.github.tasks import range_github_lister
logging.basicConfig(level=logging.DEBUG)
range_github_lister(364, 365)
...
```
## lister-gitlab
Once configured, you can execute a GitLab lister using the instructions detailed in the `python3` scripts below:
```lang=python
import logging
from swh.lister.gitlab.tasks import range_gitlab_lister
logging.basicConfig(level=logging.DEBUG)
range_gitlab_lister(1, 2, {
'instance': 'debian',
'api_baseurl': 'https://salsa.debian.org/api/v4',
'sort': 'asc',
'per_page': 20
})
```
```lang=python
import logging
from swh.lister.gitlab.tasks import full_gitlab_relister
logging.basicConfig(level=logging.DEBUG)
full_gitlab_relister({
'instance': '0xacab',
'api_baseurl': 'https://0xacab.org/api/v4',
'sort': 'asc',
'per_page': 20
})
```
```lang=python
import logging
from swh.lister.gitlab.tasks import incremental_gitlab_lister
logging.basicConfig(level=logging.DEBUG)
incremental_gitlab_lister({
'instance': 'freedesktop.org',
'api_baseurl': 'https://gitlab.freedesktop.org/api/v4',
'sort': 'asc',
'per_page': 20
})
```
## lister-debian
Once configured, you can execute a Debian lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.debian.tasks import debian_lister
logging.basicConfig(level=logging.DEBUG)
debian_lister('Debian')
```
## lister-pypi
Once configured, you can execute a PyPI lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.pypi.tasks import pypi_lister
logging.basicConfig(level=logging.DEBUG)
pypi_lister()
```
## lister-npm
Once configured, you can execute an npm lister using the following instructions in a `python3` REPL:
```lang=python
import logging
from swh.lister.npm.tasks import npm_lister
logging.basicConfig(level=logging.DEBUG)
npm_lister()
```
## lister-phabricator
Once configured, you can execute a Phabricator lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.phabricator.tasks import incremental_phabricator_lister
logging.basicConfig(level=logging.DEBUG)
incremental_phabricator_lister(forge_url='https://forge.softwareheritage.org', api_token='XXXX')
```
## lister-gnu
Once configured, you can execute a GNU lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.gnu.tasks import gnu_lister
logging.basicConfig(level=logging.DEBUG)
gnu_lister()
```
## lister-cran
Once configured, you can execute a CRAN lister using the following instructions in a `python3` script:
```lang=python
import logging
from swh.lister.cran.tasks import cran_lister
logging.basicConfig(level=logging.DEBUG)
cran_lister()
```
Licensing
---------
This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
See top-level LICENSE file for the full text of the GNU General Public License
along with this program.
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Description-Content-Type: text/markdown
Provides-Extra: testing
diff --git a/swh/lister/_version.py b/swh/lister/_version.py
index 64f59a9..687aa5e 100644
--- a/swh/lister/_version.py
+++ b/swh/lister/_version.py
@@ -1,5 +1,5 @@
# This file is automatically generated by setup.py.
-__version__ = '0.0.29'
-__sha__ = 'ge545315'
-__revision__ = 'ge545315'
+__version__ = '0.0.30'
+__sha__ = 'g52b1de8'
+__revision__ = 'g52b1de8'
diff --git a/swh/lister/cran/lister.py b/swh/lister/cran/lister.py
index 73eeac9..b25ab3b 100644
--- a/swh/lister/cran/lister.py
+++ b/swh/lister/cran/lister.py
@@ -1,118 +1,117 @@
# Copyright (C) 2019 the Software Heritage developers
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
import subprocess
import json
import logging
import pkg_resources
from swh.lister.cran.models import CRANModel
from swh.scheduler.utils import create_task_dict
from swh.core import utils
from swh.lister.core.simple_lister import SimpleLister
class CRANLister(SimpleLister):
MODEL = CRANModel
LISTER_NAME = 'cran'
instance = 'cran'
def task_dict(self, origin_type, origin_url, **kwargs):
"""Return task format dict
This is overridden from the lister_base as more information is
needed for the ingestion task creation.
"""
return create_task_dict(
'load-%s' % origin_type, 'recurring',
- kwargs.get('name'), origin_url, kwargs.get('version'),
- project_metadata=kwargs.get('description'))
+ kwargs.get('name'), origin_url, kwargs.get('version'))
def r_script_request(self):
"""Runs R script which uses inbuilt API to return a json
response containing data about all the R packages
Returns:
List of dictionaries
example
[
{'Package': 'A3',
'Version': '1.0.0',
'Title':
'Accurate, Adaptable, and Accessible Error Metrics for
Predictive\nModels',
'Description':
'Supplies tools for tabulating and analyzing the results
of predictive models. The methods employed are ... '
}
{'Package': 'abbyyR',
'Version': '0.5.4',
'Title':
'Access to Abbyy Optical Character Recognition (OCR) API',
'Description': 'Get text from images of text using Abbyy
Cloud Optical Character\n ...'
}
...
]
"""
file_path = pkg_resources.resource_filename('swh.lister.cran',
'list_all_packages.R')
response = subprocess.run(file_path, stdout=subprocess.PIPE,
shell=False)
return json.loads(response.stdout)
def get_model_from_repo(self, repo):
"""Transform from repository representation to model
"""
project_url = 'https://cran.r-project.org/src/contrib' \
'/%(Package)s_%(Version)s.tar.gz' % repo
return {
'uid': repo["Package"],
'name': repo["Package"],
'full_name': repo["Title"],
'version': repo["Version"],
'html_url': project_url,
'origin_url': project_url,
'origin_type': 'cran',
}
def transport_response_simplified(self, response):
"""Transform response to list for model manipulation
"""
return [self.get_model_from_repo(repo) for repo in response]
def ingest_data(self, identifier, checks=False):
"""Rework the base ingest_data.
Request server endpoint which gives all in one go.
Simplify and filter response list of repositories. Inject
repo information into local db. Queue loader tasks for
linked repositories.
Args:
identifier: Resource identifier (unused)
checks (bool): Additional checks required (unused)
"""
response = self.r_script_request()
if not response:
return response, []
models_list = self.transport_response_simplified(response)
models_list = self.filter_before_inject(models_list)
all_injected = []
for models in utils.grouper(models_list, n=10000):
models = list(models)
logging.debug('models: %s' % len(models))
# inject into local db
injected = self.inject_repo_data_into_db(models)
# queue workers
self.create_missing_origins_and_tasks(models, injected)
all_injected.append(injected)
# flush
self.db_session.commit()
self.db_session = self.mk_session()
return response, all_injected
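Note on the CRAN lister hunk above: the change drops the `project_metadata` argument from the task dict, while the repository-to-model mapping and the batched injection stay as they were. The sketch below restates those two unchanged pieces as standalone functions so they can be tried without a database or scheduler. It is an illustration only: `model_from_repo` copies the mapping shown in `get_model_from_repo`, and `batches` is a hypothetical stand-in for `swh.core.utils.grouper`, whose exact behaviour is not shown in this diff.

```python
from itertools import islice

# URL template taken from CRANLister.get_model_from_repo in the diff above.
CRAN_URL_TEMPLATE = ('https://cran.r-project.org/src/contrib'
                     '/%(Package)s_%(Version)s.tar.gz')


def model_from_repo(repo):
    """Standalone copy of the mapping done by get_model_from_repo."""
    # Extra keys in repo (e.g. 'Description') are ignored by the template.
    project_url = CRAN_URL_TEMPLATE % repo
    return {
        'uid': repo['Package'],
        'name': repo['Package'],
        'full_name': repo['Title'],
        'version': repo['Version'],
        'html_url': project_url,
        'origin_url': project_url,
        'origin_type': 'cran',
    }


def batches(iterable, n):
    """Hypothetical stand-in for swh.core.utils.grouper.

    Yields lists of at most n items until the input is exhausted.
    """
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, n))
        if not chunk:
            return
        yield chunk


if __name__ == '__main__':
    repo = {'Package': 'A3', 'Version': '1.0.0',
            'Title': 'Error Metrics', 'Description': 'Supplies tools ...'}
    print(model_from_repo(repo)['origin_url'])
    for chunk in batches(range(5), 2):
        print(chunk)
```

The model carries no description field, which is consistent with the task dict no longer forwarding one.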
diff --git a/swh/lister/gitlab/lister.py b/swh/lister/gitlab/lister.py
index e463fc4..f8e7ead 100644
--- a/swh/lister/gitlab/lister.py
+++ b/swh/lister/gitlab/lister.py
@@ -1,84 +1,83 @@
# Copyright (C) 2018-2019 the Software Heritage developers
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
import time
from urllib3.util import parse_url
from ..core.page_by_page_lister import PageByPageHttpLister
from .models import GitLabModel
class GitLabLister(PageByPageHttpLister):
# Template path expecting an integer that represents the page id
PATH_TEMPLATE = '/projects?page=%d&order_by=id'
MODEL = GitLabModel
LISTER_NAME = 'gitlab'
def __init__(self, api_baseurl, instance=None,
override_config=None, sort='asc', per_page=20):
super().__init__(api_baseurl=api_baseurl,
override_config=override_config)
if instance is None:
instance = parse_url(api_baseurl).host
self.instance = instance
self.PATH_TEMPLATE = '%s&sort=%s' % (self.PATH_TEMPLATE, sort)
if per_page != 20:
self.PATH_TEMPLATE = '%s&per_page=%s' % (
self.PATH_TEMPLATE, per_page)
def uid(self, repo):
return '%s/%s' % (self.instance, repo['path_with_namespace'])
def get_model_from_repo(self, repo):
return {
'instance': self.instance,
'uid': self.uid(repo),
'name': repo['name'],
'full_name': repo['path_with_namespace'],
'html_url': repo['web_url'],
'origin_url': repo['http_url_to_repo'],
'origin_type': 'git',
- 'description': repo['description'],
}
def transport_quota_check(self, response):
"""Deal with rate limit if any.
"""
# not all gitlab instance have rate limit
if 'RateLimit-Remaining' in response.headers:
reqs_remaining = int(response.headers['RateLimit-Remaining'])
if response.status_code == 403 and reqs_remaining == 0:
reset_at = int(response.headers['RateLimit-Reset'])
delay = min(reset_at - time.time(), 3600)
return True, delay
return False, 0
def _get_int(self, headers, key):
_val = headers.get(key)
if _val:
return int(_val)
def get_next_target_from_response(self, response):
"""Determine the next page identifier.
"""
return self._get_int(response.headers, 'x-next-page')
def get_pages_information(self):
"""Determine pages information.
"""
response = self.transport_head(identifier=1)
if not response.ok:
raise ValueError(
'Problem during information fetch: %s' % response.status_code)
h = response.headers
return (self._get_int(h, 'x-total'),
self._get_int(h, 'x-total-pages'),
self._get_int(h, 'x-per-page'))
def transport_response_simplified(self, response):
repos = response.json()
return [self.get_model_from_repo(repo) for repo in repos]
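Note on the GitLab lister hunk above: only the `description` key is removed from the model; the page-template construction and the rate-limit check are untouched. As a minimal sketch of how those two pieces behave, the functions below reproduce the same logic outside the class. The names `build_path_template` and `quota_check`, and the `now` parameter, are assumptions made for this illustration and are not part of swh.lister. The sketch also uses a plain dict for headers, which is case-sensitive, whereas real HTTP response headers are usually matched case-insensitively.

```python
import time


def build_path_template(sort='asc', per_page=20):
    """Mirror the PATH_TEMPLATE construction in GitLabLister.__init__."""
    template = '/projects?page=%d&order_by=id'
    template = '%s&sort=%s' % (template, sort)
    # per_page is only appended when it differs from the default of 20.
    if per_page != 20:
        template = '%s&per_page=%s' % (template, per_page)
    return template


def quota_check(status_code, headers, now=None):
    """Mirror transport_quota_check: return (must_wait, delay_in_seconds)."""
    if now is None:
        now = time.time()
    # Not every GitLab instance sends rate-limit headers.
    if 'RateLimit-Remaining' in headers:
        remaining = int(headers['RateLimit-Remaining'])
        if status_code == 403 and remaining == 0:
            reset_at = int(headers['RateLimit-Reset'])
            # The wait is capped at one hour, as in the lister.
            return True, min(reset_at - now, 3600)
    return False, 0


if __name__ == '__main__':
    print(build_path_template(per_page=50) % 2)
    print(quota_check(403, {'RateLimit-Remaining': '0',
                            'RateLimit-Reset': '1100'}, now=1000))
```

Passing `now` explicitly keeps the delay computation deterministic when experimenting; the lister itself reads the clock directly.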
diff --git a/version.txt b/version.txt
index 142780a..1054e71 100644
--- a/version.txt
+++ b/version.txt
@@ -1 +1 @@
-v0.0.29-0-ge545315
\ No newline at end of file
+v0.0.30-0-g52b1de8
\ No newline at end of file
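Note on the version hunks: the string in `version.txt` appears to follow the `git describe --long` layout (tag, number of commits since the tag, then `g` plus an abbreviated commit hash), and its parts match `__version__` and `__sha__` in `_version.py`. The sketch below splits such a string under that assumption. `parse_version_txt` is a hypothetical helper written for this note, not the code `setup.py` uses to generate these files.

```python
import re

# Assumed layout: v<version>-<commits since tag>-g<abbreviated sha>
DESCRIBE_RE = re.compile(
    r'^v(?P<version>[^-]+)-(?P<distance>\d+)-(?P<sha>g[0-9a-f]+)$')


def parse_version_txt(text):
    """Split a describe-style version string into its three parts."""
    match = DESCRIBE_RE.match(text.strip())
    if match is None:
        raise ValueError('unexpected version string: %r' % text)
    return {
        'version': match.group('version'),
        'distance': int(match.group('distance')),
        'sha': match.group('sha'),
    }


if __name__ == '__main__':
    print(parse_version_txt('v0.0.30-0-g52b1de8'))
```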