diff --git a/README.md b/README.md
index acc86e2..f5c29d1 100644
--- a/README.md
+++ b/README.md
@@ -1,232 +1,233 @@
swh-lister
==========

This component of the Software Heritage stack aims to produce listings of
software origins and their URLs hosted on various public developer platforms
or package managers. As these operations are quite similar, it provides a set
of Python modules abstracting common software origin listing behaviors.

It also provides several lister implementations, contained in the following
Python modules:

- `swh.lister.bitbucket`
- `swh.lister.debian`
- `swh.lister.github`
- `swh.lister.gitlab`
- `swh.lister.gnu`
- `swh.lister.pypi`
- `swh.lister.npm`
- `swh.lister.phabricator`
- `swh.lister.cran`
+- `swh.lister.cgit`

Dependencies
------------

All required dependencies can be found in the `requirements*.txt` files
located at the root of the repository.

Local deployment
----------------

## lister configuration

Each lister implemented so far by Software Heritage (`github`, `gitlab`,
`debian`, `pypi`, `npm`) must be configured by following the instructions
below (please note that you have to replace `<lister_name>` by one of the
lister names introduced above).

### Preparation steps

1. `mkdir ~/.config/swh/ ~/.cache/swh/lister/<lister_name>/`
2. Create the configuration file `~/.config/swh/lister_<lister_name>.yml`
3. Bootstrap the db instance schema

```lang=bash
$ createdb lister-<lister_name>
$ python3 -m swh.lister.cli --db-url postgres:///lister-<lister_name>
```

Note: This bootstraps a minimum data set needed for the lister to run.
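For instance, an illustrative run of the steps above for the `cgit` lister
(any lister name from the list above can be substituted; a local PostgreSQL
server is assumed) could look like:

```shell
# Illustrative bootstrap for the cgit lister; substitute another lister name
# as needed. Assumes PostgreSQL is running locally.
mkdir -p ~/.config/swh/ ~/.cache/swh/lister/cgit/
createdb lister-cgit
python3 -m swh.lister.cli --db-url postgres:///lister-cgit
```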
### Configuration file sample

Minimal configuration shared by all listers, to add to the file
`~/.config/swh/lister_<lister_name>.yml`:

```lang=yml
storage:
  cls: 'remote'
  args:
    url: 'http://localhost:5002/'

scheduler:
  cls: 'remote'
  args:
    url: 'http://localhost:5008/'

lister:
  cls: 'local'
  args:
    # see http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls
    db: 'postgresql:///lister-<lister_name>'

credentials: []

cache_responses: True

cache_dir: /home/user/.cache/swh/lister/<lister_name>/
```

Note: This expects storage (5002) and scheduler (5008) services to run locally.

## lister-github

Once configured, you can execute a GitHub lister using the following
instructions in a `python3` script:

```lang=python
import logging

from swh.lister.github.tasks import range_github_lister

logging.basicConfig(level=logging.DEBUG)
range_github_lister(364, 365)
...
```

## lister-gitlab

Once configured, you can execute a GitLab lister using the instructions
detailed in the `python3` scripts below:

```lang=python
import logging

from swh.lister.gitlab.tasks import range_gitlab_lister

logging.basicConfig(level=logging.DEBUG)
range_gitlab_lister(1, 2, {
    'instance': 'debian',
    'api_baseurl': 'https://salsa.debian.org/api/v4',
    'sort': 'asc',
    'per_page': 20
})
```

```lang=python
import logging

from swh.lister.gitlab.tasks import full_gitlab_relister

logging.basicConfig(level=logging.DEBUG)
full_gitlab_relister({
    'instance': '0xacab',
    'api_baseurl': 'https://0xacab.org/api/v4',
    'sort': 'asc',
    'per_page': 20
})
```

```lang=python
import logging

from swh.lister.gitlab.tasks import incremental_gitlab_lister

logging.basicConfig(level=logging.DEBUG)
incremental_gitlab_lister({
    'instance': 'freedesktop.org',
    'api_baseurl': 'https://gitlab.freedesktop.org/api/v4',
    'sort': 'asc',
    'per_page': 20
})
```

## lister-debian

Once configured, you can execute a Debian lister using the following
instructions in a `python3` script:

```lang=python
import logging

from swh.lister.debian.tasks import debian_lister

logging.basicConfig(level=logging.DEBUG)
debian_lister('Debian')
```

## lister-pypi

Once configured, you can execute a PyPI lister using the following
instructions in a `python3` script:

```lang=python
import logging

from swh.lister.pypi.tasks import pypi_lister

logging.basicConfig(level=logging.DEBUG)
pypi_lister()
```

## lister-npm

Once configured, you can execute an npm lister using the following
instructions in a `python3` REPL:

```lang=python
import logging

from swh.lister.npm.tasks import npm_lister

logging.basicConfig(level=logging.DEBUG)
npm_lister()
```

## lister-phabricator

Once configured, you can execute a Phabricator lister using the following
instructions in a `python3` script:

```lang=python
import logging

from swh.lister.phabricator.tasks import incremental_phabricator_lister

logging.basicConfig(level=logging.DEBUG)
incremental_phabricator_lister(forge_url='https://forge.softwareheritage.org',
                               api_token='XXXX')
```

## lister-gnu

Once configured, you can execute a GNU lister using the following
instructions in a `python3` script:

```lang=python
import logging

from swh.lister.gnu.tasks import gnu_lister

logging.basicConfig(level=logging.DEBUG)
gnu_lister()
```

## lister-cran

Once configured, you can execute a CRAN lister using the following
instructions in a `python3` script:

```lang=python
import logging

from swh.lister.cran.tasks import cran_lister

logging.basicConfig(level=logging.DEBUG)
cran_lister()
```

## lister-cgit

Once configured, you can execute a cgit lister using the following
instructions in a `python3` script:

```lang=python
import logging

from swh.lister.cgit.tasks import cgit_lister

logging.basicConfig(level=logging.DEBUG)
cgit_lister(base_url='http://git.savannah.gnu.org/cgit/')
```

Licensing
---------

This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

See top-level LICENSE file for the full text of the GNU General Public License
along with this program.
diff --git a/swh/lister/cgit/lister.py b/swh/lister/cgit/lister.py
index a16f922..4f5db9b 100644
--- a/swh/lister/cgit/lister.py
+++ b/swh/lister/cgit/lister.py
@@ -1,180 +1,276 @@
# Copyright (C) 2019 the Software Heritage developers
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

import random
from bs4 import BeautifulSoup
from collections import defaultdict
import requests
-import urllib.parse
+from urllib.parse import urlparse

from .models import CGitModel

from swh.lister.core.simple_lister import SimpleLister
from swh.lister.core.lister_transports import ListerOnePageApiTransport


class CGitLister(ListerOnePageApiTransport, SimpleLister):
    MODEL = CGitModel
    LISTER_NAME = 'cgit'
-    PAGE = ''
+    PAGE = None

    def __init__(self, base_url, instance=None, override_config=None):
-        if not base_url.endswith('/'):
-            base_url = base_url+'/'
-        self.PAGE = base_url
-        # This part removes any suffix from the base url and stores it in
-        # next_url. For example for base_url = https://git.kernel.org/pub/scm/
-        # it will convert it into https://git.kernel.org and then attach
-        # the suffix
-        (part1, part2, next_url) = self.PAGE.split('/', 2)
-        self.next_url = part1 + '//' + next_url
+        self.PAGE = base_url
+        url = urlparse(self.PAGE)
+        self.url_netloc = find_netloc(url)

        if not instance:
-            instance = urllib.parse.urlparse(base_url).hostname
+            instance = url.hostname
        self.instance = instance
        ListerOnePageApiTransport .__init__(self)
        SimpleLister.__init__(self, override_config=override_config)

    def list_packages(self, response):
        """List the actual cgit instance origins from the response.

+        Find the repos in all the pages by parsing over the HTML of
+        the `base_url`. Find the details for all the repos and return
+        them as a list of dictionaries.
+
        """
        repos_details = []
-        soup = BeautifulSoup(response.text, features="html.parser") \
-            .find('div', {"class": "content"})
-        repos = soup.find_all("tr", {"class": ""})
+        repos = get_repo_list(response)
+        soup = make_repo_soup(response)
+        pages = self.get_page(soup)
+        if len(pages) > 1:
+            repos.extend(self.get_all_pages(pages))
+
        for repo in repos:
            repo_name = repo.a.text
            repo_url = self.get_url(repo)
            origin_url = find_origin_url(repo_url)

            try:
                time = repo.span['title']
            except Exception:
                time = None

            if origin_url is not None:
                repos_details.append({
                    'name': repo_name,
                    'time': time,
                    'origin_url': origin_url,
-                    })
+                })

        random.shuffle(repos_details)
        return repos_details

+    def get_page(self, soup):
+        """Find URLs of all pages.
+
+        Finds the URLs of all the pages that are present by parsing over
+        the HTML of the pagination present at the end of the page.
+
+        Args:
+            soup (Beautifulsoup): a beautifulsoup object of the base URL
+
+        Returns:
+            list: URLs of all the pages present for a cgit instance
+
+        """
+        pages = soup.find('div', {"class": "content"}).find_all('li')
+
+        if not pages:
+            return [self.PAGE]
+
+        return [self.get_url(page) for page in pages]
+
+    def get_all_pages(self, pages):
+        """Find repos from all the pages.
+
+        Make the request for all the pages (except the first) present for a
+        particular cgit instance and find the repos that are available on
+        each page.
+
+        Args:
+            pages ([str]): list of urls of all the pages present for a
+                particular cgit instance
+
+        Returns:
+            List of beautifulsoup objects of all the repository (url) rows
+            present in all the pages (except the first).
+
+        """
+        all_repos = []
+        for page in pages[1:]:
+            response = requests.get(page)
+            repos = get_repo_list(response)
+            all_repos.extend(repos)
+
+        return all_repos
+
    def get_url(self, repo):
        """Finds url of a repo page.

        Finds the url of a repo page by parsing over the html of the row of
        that repo present in the base url.

        Args:
-            repo: a beautifulsoup object of the html code of the repo row
-                present in base url.
+            repo (Beautifulsoup): a beautifulsoup object of the repository
+                row present in base url.

        Returns:
            string: The url of a repo.
+
        """
        suffix = repo.a['href']
-        return self.next_url + suffix
+        return self.url_netloc + suffix

    def get_model_from_repo(self, repo):
        """Transform from repository representation to model.

        """
        return {
            'uid': self.PAGE + repo['name'],
            'name': repo['name'],
            'full_name': repo['name'],
            'html_url': repo['origin_url'],
            'origin_url': repo['origin_url'],
            'origin_type': 'git',
            'time_updated': repo['time'],
+            'instance': self.instance,
        }

-    def transport_response_simplified(self, response):
+    def transport_response_simplified(self, repos_details):
        """Transform response to list for model manipulation.

        """
-        return [self.get_model_from_repo(repo) for repo in response]
+        return [self.get_model_from_repo(repo) for repo in repos_details]
+
+
+def find_netloc(url):
+    """Finds the network location from the base_url.
+
+    All the urls in the repo pages are relative to the network location part
+    of the base url, so we need to compute it to reconstruct all the urls.
+
+    Args:
+        url (urllib): urllib object of base_url
+
+    Returns:
+        string: Scheme and network location part of the base URL.
+
+    Example:
+        For base_url = https://git.kernel.org/pub/scm/
+        >>> find_netloc(url)
+        'https://git.kernel.org'
+
+    """
+    return '%s://%s' % (url.scheme, url.netloc)
+
+
+def get_repo_list(response):
+    """Find all the rows with repos for a particular page of the base url.
+
+    Finds all the repos on the page and returns a list of them. Each
+    element of the list is a beautifulsoup object representing a repo.
+
+    Args:
+        response (Response): server response
+
+    Returns:
+        List of all the repos on a page.
+
+    """
+    repo_soup = make_repo_soup(response)
+    return repo_soup \
+        .find('div', {"class": "content"}).find_all("tr", {"class": ""})
+
+
+def make_repo_soup(response):
+    """Makes a BeautifulSoup object of the response.
+
+    """
+    return BeautifulSoup(response.text, features="html.parser")


def find_origin_url(repo_url):
    """Finds origin url for a repo.

    Finds the origin url for a particular repo by parsing over the page of
    that repo.

    Args:
        repo_url: URL of the repo.

    Returns:
        string: Origin url for the repo.

    Examples:

        >>> find_origin_url(
                'http://git.savannah.gnu.org/cgit/fbvbconv-py.git/')
        'https://git.savannah.gnu.org/git/fbvbconv-py.git'

    """
    response = requests.get(repo_url)
-    soup = BeautifulSoup(response.text, features="html.parser")
+    repo_soup = make_repo_soup(response)

-    origin_urls = find_all_origin_url(soup)
+    origin_urls = find_all_origin_url(repo_soup)
    return priority_origin_url(origin_urls)


def find_all_origin_url(soup):
-    """
+    """Finds all possible origin urls for a repo.
+
    Finds all the origin url for a particular repo by parsing over the html
    of repo page.

    Args:
-        soup: a beautifulsoup object of the html code of the repo.
+        soup: a beautifulsoup object of the repo representation.

    Returns:
-        dictionary: All possible origin urls with their protocol as key.
+        dictionary: All possible origin urls for a repository (dict with
+            key 'protocol', value the associated url).

    Examples:
        If soup is a beautifulsoup object of the html code at
        http://git.savannah.gnu.org/cgit/fbvbconv-py.git/

        >>> print(find_all_origin_url(soup))
        { 'https': 'https://git.savannah.gnu.org/git/fbvbconv-py.git',
          'ssh': 'ssh://git.savannah.gnu.org/srv/git/fbvbconv-py.git',
          'git': 'git://git.savannah.gnu.org/fbvbconv-py.git'}

    """
    origin_urls = defaultdict(dict)
    found_clone_word = False

    for i in soup.find_all('tr'):
        if found_clone_word:
            link = i.text
            protocol = link[:link.find(':')]
            origin_urls[protocol] = link
        if i.text == 'Clone':
            found_clone_word = True

    return origin_urls


def priority_origin_url(origin_url):
    """Finds the highest priority link for a particular repo.

    Priority order is https>http>git>ssh.

    Args:
-        origin_urls: A dictionary of origin links with their protocol as key.
+        origin_urls (Dict): All possible origin urls for a repository
+            (key 'protocol', value the associated url)

    Returns:
-        string: URL with the highest priority.
+        Url (str) with the highest priority.

    """
    for protocol in ['https', 'http', 'git', 'ssh']:
        if protocol in origin_url:
            return origin_url[protocol]
diff --git a/swh/lister/cgit/models.py b/swh/lister/cgit/models.py
index 8ecf40f..4e16798 100644
--- a/swh/lister/cgit/models.py
+++ b/swh/lister/cgit/models.py
@@ -1,17 +1,18 @@
# Copyright (C) 2019 the Software Heritage developers
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

from sqlalchemy import Column, String

from ..core.models import ModelBase


class CGitModel(ModelBase):
    """a CGit repository representation

    """
    __tablename__ = 'cgit_repo'

    uid = Column(String, primary_key=True)
    time_updated = Column(String)
+    instance = Column(String, index=True)
diff --git a/swh/lister/cgit/tests/test_lister.py b/swh/lister/cgit/tests/test_lister.py
index 600758a..e3c3610 100644
--- a/swh/lister/cgit/tests/test_lister.py
+++ b/swh/lister/cgit/tests/test_lister.py
@@ -1,40 +1,50 @@
# Copyright (C) 2019 the Software Heritage developers
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

from bs4 import BeautifulSoup
+from urllib.parse import urlparse

from swh.lister.cgit.lister import priority_origin_url, find_all_origin_url
+from swh.lister.cgit.lister import find_netloc


def test_find_all_origin_url():
    f = open('swh/lister/cgit/tests/api_response.html')
    soup = BeautifulSoup(f.read(), features="html.parser")
    expected_output = {'https': 'https://git.savannah.gnu.org/git/'
                                'fbvbconv-py.git',
                       'ssh': 'ssh://git.savannah.gnu.org/srv/git/'
                              'fbvbconv-py.git',
                       'git': 'git://git.savannah.gnu.org/fbvbconv-py.git'}

    output = find_all_origin_url(soup)

    for protocol, url in expected_output.items():
        assert url == output[protocol]


def test_priority_origin_url():
    first_input = {'https': 'https://kernel.googlesource.com/pub/scm/docs/'
                            'man-pages/man-pages.git',
                   'git': 'git://git.kernel.org/pub/scm/docs/man-pages/'
                          'man-pages.git'}
    second_input = {'git': 'git://git.savannah.gnu.org/perl-pesel.git',
                    'ssh': 'ssh://git.savannah.gnu.org/srv/git/perl-pesel.git'}
    third_input = {}

    assert (priority_origin_url(first_input) ==
            'https://kernel.googlesource.com/pub/scm/docs/man-pages/'
            'man-pages.git')
    assert (priority_origin_url(second_input) ==
            'git://git.savannah.gnu.org/perl-pesel.git')
    assert priority_origin_url(third_input) is None
+
+
+def test_find_netloc():
+    first_url = urlparse('http://git.savannah.gnu.org/cgit/')
+    second_url = urlparse('https://cgit.kde.org/')
+
+    assert find_netloc(first_url) == 'http://git.savannah.gnu.org'
+    assert find_netloc(second_url) == 'https://cgit.kde.org'
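
For reviewers, the protocol-priority rule exercised by `test_priority_origin_url` above can be sketched standalone. This is a minimal restatement of the patch's `priority_origin_url`, not an import of the packaged module:

```python
# Standalone sketch of the priority rule: prefer https, then http,
# then git, then ssh; implicitly returns None when no protocol matches.
def priority_origin_url(origin_urls):
    for protocol in ['https', 'http', 'git', 'ssh']:
        if protocol in origin_urls:
            return origin_urls[protocol]

urls = {'git': 'git://git.savannah.gnu.org/perl-pesel.git',
        'ssh': 'ssh://git.savannah.gnu.org/srv/git/perl-pesel.git'}
# https and http are absent, so the git:// url wins over ssh://
print(priority_origin_url(urls))  # prints 'git://git.savannah.gnu.org/perl-pesel.git'
```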