diff --git a/CONTRIBUTORS b/CONTRIBUTORS
index 16792d5..8dfc445 100644
--- a/CONTRIBUTORS
+++ b/CONTRIBUTORS
@@ -1,6 +1,7 @@
Archit Agrawal
Avi Kelman (fiendish)
Léni Gauffier
Yann Gautier
Sushant Sushant
-Hezekiah Maina
\ No newline at end of file
+Hezekiah Maina
+Boris Baldassari
diff --git a/README.md b/README.md
index 79e96e1..5b5b27e 100644
--- a/README.md
+++ b/README.md
@@ -1,100 +1,101 @@
swh-lister
==========
This component of the Software Heritage stack aims to produce listings
of software origins and their URLs, hosted on various public developer platforms
or package managers. As these operations are quite similar, it provides a set of
Python modules abstracting common software origin listing behaviors.
It also provides several lister implementations, contained in the
following Python modules:
- `swh.lister.bitbucket`
- `swh.lister.cgit`
- `swh.lister.cran`
- `swh.lister.debian`
- `swh.lister.gitea`
- `swh.lister.github`
- `swh.lister.gitlab`
- `swh.lister.gnu`
- `swh.lister.launchpad`
- `swh.lister.npm`
- `swh.lister.packagist`
- `swh.lister.phabricator`
- `swh.lister.pypi`
+- `swh.lister.tuleap`
Dependencies
------------
All required dependencies can be found in the `requirements*.txt` files located
at the root of the repository.
Local deployment
----------------
## lister configuration
Each lister implemented so far by Software Heritage (`bitbucket`, `cgit`, `cran`, `debian`,
-`gitea`, `github`, `gitlab`, `gnu`, `launchpad`, `npm`, `packagist`, `phabricator`, `pypi`)
+`gitea`, `github`, `gitlab`, `gnu`, `launchpad`, `npm`, `packagist`, `phabricator`, `pypi`, `tuleap`)
must be configured by following the instructions below (please note that you have to replace
`<lister_name>` by one of the lister names introduced above).
### Preparation steps
1. `mkdir ~/.config/swh/`
2. create configuration file `~/.config/swh/listers.yml`
### Configuration file sample
Minimalistic configuration shared by all listers to add in file `~/.config/swh/listers.yml`:
```lang=yml
scheduler:
cls: 'remote'
args:
url: 'http://localhost:5008/'
credentials: {}
```
Note: this expects the scheduler service to be running locally on port 5008.
## Executing a lister
Once configured, a lister can be executed by using the `swh` CLI tool with the
following options and commands:
```
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister <lister_name> [lister_parameters]
```
Examples:
```
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister bitbucket
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister cran
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitea url=https://codeberg.org/api/v1/
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitlab url=https://salsa.debian.org/api/v4/
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister npm
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister pypi
```
Licensing
---------
This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
See top-level LICENSE file for the full text of the GNU General Public License
along with this program.
diff --git a/docs/tutorial.rst b/docs/tutorial.rst
index f3e6020..65a6db0 100644
--- a/docs/tutorial.rst
+++ b/docs/tutorial.rst
@@ -1,363 +1,382 @@
.. _lister-tutorial:
Tutorial: list the content of your favorite forge in just a few steps
=====================================================================
Overview
--------
The three major phases of work in Software Heritage's preservation process, on the
technical side, are *listing software sources*, *scheduling updates* and *loading the
software artifacts into the archive*.
A previous effort in 2017 consisted in designing a framework to make writing a lister a
straightforward "fill in the blanks" process, based on the experience gained from the
diversity found in the listed services. This is the second iteration of the lister
framework design, comprising a library and an API that is easier to work with and less
"magic" (read: implicit). This new design is part of a larger effort to redesign the
scheduling system for the recurring tasks that update the content of the archive.
.. _fundamentals:
Fundamentals
------------
Fundamentally, a basic lister must follow these steps:
1. Issue a network request for a service endpoint.
2. Convert the response data into a model object.
3. Send the model object to the scheduler.
Steps 1 and 3 are generic problems that are often already solved by helpers or in other
listers. That leaves us mainly with step 2, which is simple when the remote
service provides an API.
.. _prerequisites:
Prerequisites
-------------
Skills:
* object-oriented Python
* requesting remote services through HTTP
* scraping if no API is offered
Analysis of the target service. Prepare the following elements to write the lister:
* instance names and URLs
* requesting scheme: base URL, path, query_string, POST data, headers
* authentication types and which one to support, if any
* rate-limiting: HTTP codes and headers used
* data format: JSON/XML/HTML/...?
* mapping between remote data and needed data (ListedOrigin model, internal state)
We will now walk through the steps to build a new lister.
Please use this template to start with: :download:`new_lister_template.py`
.. _lister-declaration:
Lister declaration
------------------
In order to write a lister, two basic elements are required. These are the
:py:class:`Lister` base class and the :py:class:`ListedOrigin` scheduler model class.
Optionally, for listers that need to keep a state and support incremental listing, an
additional object :py:class:`ListerState` will come into play.
Each lister must subclass :py:class:`Lister <swh.lister.pattern.Lister>` either directly
or through a subclass such as :py:class:`StatelessLister
<swh.lister.pattern.StatelessLister>` for stateless ones.
We extensively type-annotate our listers, as we do for any new code. This makes it
prominent that those lister classes are generic and take the following parameters:
* :py:class:`Lister`: the lister state type, the page type
* :py:class:`StatelessLister`: only the page type
You can start by declaring a stateless lister and leave the implementation of state
for later if the listing turns out to need it. We will see how in :ref:`handling-lister-state`.
Both the lister state type and the page type are user-defined types. However, while the
page type may only exist as a type annotation, the state type for a stateful lister must
be associated with a concrete object. The state type is commonly defined as a dataclass
whereas the page type is often a mere annotation, potentially given a nice alias.
Example lister declaration::
NewForgePage = List[Dict[str, Any]]
@dataclass
class NewForgeListerState:
...
class NewForgeLister(Lister[NewForgeListerState, NewForgePage]):
LISTER_NAME = "My"
...
The new lister must declare a name through the :py:attr:`LISTER_NAME` class attribute.
.. _lister-construction:
Lister construction
-------------------
The lister constructor is only required to ask for a :py:class:`SchedulerInterface`
object to pass to the base class. But that does not mean it is all that is needed for
the lister to be useful. A lister needs information on which remote service to talk to:
it needs a URL.
Some services are centralized and offered by a single organization. Think of GitHub.
Others are offered by many people across the Internet, each using a different hosting
setup, each providing specific data. Think of the many GitLab instances. We need a name
to identify each instance, and even if there is only one, we need its URL to access it
concretely.
Now, you may think of any strategy to infer the information or hardcode it, but the base
class needs a URL and an instance name. In any case, for a multi-instance service, you
had better be explicit and require the URL as a constructor argument. We recommend that
the URL be some form of base URL, to be concatenated with any variable part that appears
either because multiple instances exist or because the URL needs to be recomputed during
the listing process.
If you need any credentials to access a remote service, and do so in our polite but
persistent fashion (remember that we want fresh information), you are encouraged to
provide support for authenticated access. The base class supports handling credentials
as a set of identifier/secret pairs. It knows how to load from a secrets store the right
ones for the current ("lister name", "instance name") setting, if none were originally
provided through the task parameters. You can ask for other types of access tokens in a
separate parameter, but then you lose this advantage.
Example of a typical lister constructor::
def __init__(
self,
scheduler: SchedulerInterface,
url: str,
instance: str,
credentials: CredentialsType = None,
):
super().__init__(
scheduler=scheduler, url=url, instance=instance, credentials=credentials,
)
...
.. _core-lister-functionality:
Core lister functionality
-------------------------
For the lister to contribute data to the archive, you now have to write the logic to
fetch data from the remote service and format it in the canonical form the scheduler
expects, as outlined in :ref:`fundamentals`. To this purpose, the two methods to
implement are::
def get_pages(self) -> Iterator[NewForgePage]:
...
def get_origins_from_page(self, page: NewForgePage) -> Iterator[ListedOrigin]:
...
Those two core functions are called by the principal lister method,
:py:meth:`Lister.run`, found in the base class.
:py:meth:`get_pages` is the guts of the lister. It takes no arguments and must produce
data pages. An iterator is fine here, as the :py:meth:`Lister.run` method only means to
iterate over it in a single pass. This method gets its input from a network request to a
remote service's endpoint to retrieve the data we long for.
Whether the data is adequately structured for our purpose determines how tricky this step is.
Here you may have to show off your data scraping skills, or just consume a well-designed
API. Those aspects are discussed more specifically in the section
:ref:`handling-specific-topics`.
In any case, we want the data we return to be usefully filtered and structured. The
easiest way to create an iterator is to use the `yield` keyword. Yield each data page
you have structured in accordance with the page type you have declared. The page type
exists only for static type checking of data passed from :py:meth:`get_pages` to
:py:meth:`get_origins_from_page`; you can choose whatever fits the bill.
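For instance, a minimal :py:meth:`get_pages` for a hypothetical offset-paginated JSON
service could be sketched as follows (the ``fetch_json`` stand-in and its data are
invented for illustration; a real lister would issue HTTP requests here):

```python
from typing import Any, Dict, Iterator, List

# Page type: here, a list of repository dicts (illustrative alias).
NewForgePage = List[Dict[str, Any]]


def fetch_json(offset: int) -> NewForgePage:
    # Stand-in for a network call, e.g. GET /api/repos?offset=<offset>;
    # a real lister would use an HTTP session here.
    fake_service = [
        [{"name": "repo1"}, {"name": "repo2"}],
        [{"name": "repo3"}],
        [],  # empty page signals the end of the listing
    ]
    return fake_service[offset] if offset < len(fake_service) else []


def get_pages() -> Iterator[NewForgePage]:
    offset = 0
    while True:
        page = fetch_json(offset)
        if not page:
            break
        yield page  # one structured page at a time, single pass
        offset += 1
```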
:py:meth:`get_origins_from_page` is simpler. For each individual software origin you
have received in the page, you convert and yield a :py:class:`ListedOrigin` model
object. This datatype has the following mandatory fields:
* lister id: you generally fill this with the value of :py:attr:`self.lister_obj.id`
* visit type: the type of software distribution format the service provides. For use by
a corresponding loader. It is an identifier, so you have to either use an existing
value or craft a new one if you get off the beaten track and tackle a new software
source. But then you will have to discuss the name with the core developers.
Example: Phabricator is a forge that can handle Git or SVN repositories. The visit
type would be "git" when listing such a repo that provides a Git URL that we can load.
* origin URL: a URL that, combined with the visit type, will serve as the input of a
  loader.
This datatype can also further be detailed with the optional fields:
* last update date: freshness information on this origin, which is useful to the
scheduler for optimizing its scheduling decisions. Fill it if provided by the service,
at no substantial additional runtime cost, e.g. in the same request.
* extra loader arguments: extra parameters to be passed to the loader for it to be
able to load the origin. It is needed for example when additional context is needed
along with the URL to effectively load from the origin.
See the definition of ListedOrigin_.
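As a sketch of this conversion, assuming a page of repository dicts with ``html_url``
and ``updated`` fields (invented names), and using a stand-in dataclass in place of the
real :py:class:`ListedOrigin`:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Iterator, List


@dataclass
class Origin:
    # Stand-in for swh.scheduler.model.ListedOrigin (mandatory fields only).
    lister_id: str
    url: str
    visit_type: str
    last_update: datetime


def get_origins_from_page(page: List[Dict[str, Any]]) -> Iterator[Origin]:
    for repo in page:
        yield Origin(
            lister_id="00000000-0000-0000-0000-000000000000",  # self.lister_obj.id
            url=repo["html_url"],
            visit_type="git",  # must match an existing loader visit type
            last_update=datetime.fromisoformat(repo["updated"]),
        )
```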
Now that we have shown how those two methods operate, let's put it all together by
showing how they fit into the principal :py:meth:`Lister.run` method::
def run(self) -> ListerStats:
full_stats = ListerStats()
try:
for page in self.get_pages():
full_stats.pages += 1
origins = self.get_origins_from_page(page)
full_stats.origins += self.send_origins(origins)
self.commit_page(page)
finally:
self.finalize()
if self.updated:
self.set_state_in_scheduler()
return full_stats
:py:meth:`Lister.send_origins` is the method that sends listed origins to the scheduler.
The :py:class:`ListerState` datastructure, defined along the base lister class, is used
to compute the number of listed pages and origins in a single lister run. It is useful
both for the scheduler that automatically collects this information and to test the
lister.
You see that the bulk of a lister run consists in streaming data gathered from the
remote service to the scheduler. This is done under a ``try...finally`` construct so
that the lister state is reliably recorded in case of an unhandled error. We will
explain the role of the remaining methods and attributes appearing here in the next
section, as they are related to the lister state.
.. _ListedOrigin: https://archive.softwareheritage.org/browse/swh:1:rev:03460207a17d82635ef5a6f12358392143eb9eef/?origin_url=https://forge.softwareheritage.org/source/swh-scheduler.git&path=swh/scheduler/model.py&revision=03460207a17d82635ef5a6f12358392143eb9eef#L134-L177
.. _handling-lister-state:
Handling lister state
---------------------
With what we have covered until now you can write a stateless lister. Unfortunately,
some services provide too much data to efficiently deal with in a one-shot fashion.
Listing a given software source can take several hours or days to process. A lister can
also produce valid output but fail on an unexpected condition, and would then have to
start over. As we want to be able to resume the listing process from a given element,
provided by the remote service and guaranteed to be ordered, such as a date or a numeric
identifier, we need to deal with state.
The remaining part of the lister API is reserved for dealing with lister state.
If the service to list has no pagination, then the data set to handle is small enough to
not require keeping lister state. In the opposite case, you will have to determine which
piece of information should be recorded in the lister state. As said earlier, we
recommend declaring a dataclass for the lister state::
@dataclass
class NewForgeListerState:
current: str = ""
class NewForgeLister(Lister[NewForgeListerState, NewForgePage]):
...
A pair of methods, :py:meth:`state_from_dict` and :py:meth:`state_to_dict`, are used to
respectively import lister state from the scheduler and export lister state to the
scheduler. Some fields, such as dates, may need help to be serialized for the
scheduler, so this is the place to handle them.
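A sketch of such (de)serialization for a state holding a date (the ``last_seen`` field
name is invented for illustration):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass
class NewForgeListerState:
    last_seen: Optional[datetime] = None  # progress marker, e.g. a date


def state_to_dict(state: NewForgeListerState) -> Dict[str, Any]:
    # datetime objects are not directly serializable: export as ISO strings.
    d: Dict[str, Any] = {}
    if state.last_seen is not None:
        d["last_seen"] = state.last_seen.isoformat()
    return d


def state_from_dict(d: Dict[str, Any]) -> NewForgeListerState:
    raw = d.get("last_seen")
    return NewForgeListerState(
        last_seen=datetime.fromisoformat(raw) if raw else None
    )
```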
Where is the state used? Taking the general case of a paginating service, the lister
state is used at the beginning of the :py:meth:`get_pages` method to initialize the
variables associated with the last listing progress. That way we can start from an
arbitrary element, or just the first one if there is no last lister state.
The :py:meth:`commit_page` method is called on each successfully processed page, after
the new origins are sent to the scheduler. Here you should mainly update the lister
state by taking into account the newly processed page, e.g. advance a date or serial
field.
Finally, upon either completion or error, the :py:meth:`finalize` method is called.
There you must set the attribute :py:attr:`updated` to True if you were successful in
advancing the listing process. To do this you will commonly retrieve the latest saved
lister state from the scheduler and compare it with your current lister state. If the
lister state was updated, the current lister state will ultimately be recorded in the
scheduler.
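Put together, the commit/finalize interplay can be sketched in plain Python (a
simplified stand-in with a string serial as state; the real base class loads and saves
state through the scheduler):

```python
from dataclasses import dataclass


@dataclass
class NewForgeListerState:
    current: str = ""  # highest identifier listed so far


class SketchLister:
    def __init__(self, saved_state: NewForgeListerState):
        self.saved_state = saved_state  # what the scheduler last recorded
        self.state = NewForgeListerState(current=saved_state.current)
        self.updated = False

    def commit_page(self, page) -> None:
        # Advance the state using the page just processed.
        last = max(repo["id"] for repo in page)
        if last > self.state.current:
            self.state.current = last

    def finalize(self) -> None:
        # Mark the state as updated only if we made progress.
        if self.state.current > self.saved_state.current:
            self.updated = True
```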
We have now seen the stateful lister API. Note that some listers may implement more
flexibility in the use of lister state. Some allow an `incremental` parameter that
governs whether the listing is stateful or not. It is up to you to support such
additional functionality if it seems relevant.
.. _handling-specific-topics:
Handling specific topics
------------------------
Here is a quick coverage of common topics left out from lister construction and
:py:meth:`get_pages` descriptions.
Sessions
^^^^^^^^
When requesting a web service repeatedly, most parameters, including headers, do not
change and can be set up once initially. We recommend setting up, for example, an HTTP
session as an instance attribute, so that further requesting code can focus on what
really changes.
Some ubiquitous HTTP headers include "Accept", to be set to the service response format,
and "User-Agent", for which we provide a recommended value :py:const:`USER_AGENT`, to be
imported from :py:mod:`swh.lister`. Authentication is also commonly provided through
headers, so you can set it up in the session as well.
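A sketch of such a session setup (the user-agent value and the commented-out token
header are purely illustrative; the real :py:const:`USER_AGENT` comes from
:py:mod:`swh.lister`):

```python
import requests

# In a real lister, USER_AGENT is imported from swh.lister.
USER_AGENT = "Software Heritage example lister"

session = requests.Session()
session.headers.update(
    {
        "Accept": "application/json",  # ask for the JSON representation
        "User-Agent": USER_AGENT,
        # "Authorization": "Bearer <token>",  # if the service uses header auth
    }
)
```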
Transport error handling
^^^^^^^^^^^^^^^^^^^^^^^^
We generally recommend logging every unhandleable error with the response content and
then immediately stopping the listing by doing the equivalent of
:py:meth:`Response.raise_for_status` from the `requests` library. As for rate-limiting
errors, we have a strategy of using a flexible decorator to handle the retrying for us.
It is based on the `tenacity` library and accessible as :py:func:`throttling_retry` from
:py:mod:`swh.lister.utils`.
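The actual decorator is tenacity-based; the retry-on-rate-limit idea it implements can
be sketched in plain Python as follows (simplified: no exponential backoff, and statuses
are checked directly instead of inspecting raised exceptions):

```python
import time


def retry_throttled(fetch, max_attempts: int = 5, wait: float = 0.01):
    # Sketch only: the real throttling_retry decorator also does
    # exponential backoff and retries based on the raised exception.
    for _ in range(max_attempts):
        status, body = fetch()
        if status == 429:  # rate-limited: wait, then retry
            time.sleep(wait)
            continue
        if status >= 400:  # unhandleable: stop, like raise_for_status
            raise RuntimeError(f"HTTP {status}: {body}")
        return body
    raise RuntimeError("still rate-limited after retries, giving up")
```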
Pagination
^^^^^^^^^^
This one is a moving target. You have to understand how the pagination mechanics of the
particular service work. Some guidelines, though: the identifier may be minimal (an id
to pass as a query parameter), compound (a set of such parameters) or complete (a whole
URL). If the service provides the next URL, use it. The piece of information may be
found either in the response body, or in a header. Once identified, you still have to
implement the logic of requesting and extracting it in a loop and quitting the loop when
there is no more data to fetch.
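For the "service provides the next URL" case, the loop can be sketched as follows (the
fake ``fetch`` helper and its URLs are invented stand-ins for real HTTP requests):

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

Page = List[Dict[str, Any]]


def fetch(url: str) -> Tuple[Page, Optional[str]]:
    # Stand-in for an HTTP call; the next URL would normally be read
    # from a "Link" header or a field of the response body.
    fake_service = {
        "/repos?page=1": ([{"id": 1}, {"id": 2}], "/repos?page=2"),
        "/repos?page=2": ([{"id": 3}], None),  # no next page: stop
    }
    return fake_service[url]


def iterate_pages(start_url: str) -> Iterator[Page]:
    url: Optional[str] = start_url
    while url is not None:  # quit when there is no more data to fetch
        page, url = fetch(url)
        yield page
```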
Page results
^^^^^^^^^^^^
First, when retrieving page results, which involves some protocols and parsing logic,
please make sure that any deviation from what was expected results in an informative
error. You also have to simplify the results, both by filtering request parameters if
the service supports it, and by extracting from the response only the information needed
into a structured page. This all makes for easier debugging.
+Misc files
+^^^^^^^^^^
+
+There are also a few files that need to be modified outside of the lister directory, namely:
+
+* `/setup.py` to add your lister to the end of the list in the *setup* section:
+
+ entry_points="""
+ [swh.cli.subcommands]
+ lister=swh.lister.cli
+ [swh.workers]
+ lister.bitbucket=swh.lister.bitbucket:register
+ lister.cgit=swh.lister.cgit:register
+ ..."""
+
+* `/swh/lister/tests/test_cli.py` to get a default set of parameters in scheduler-related tests.
+* `/README.md` to reference the new lister.
+* `/CONTRIBUTORS` to add your name.
+
Testing your lister
-------------------
When developing a new lister, it's important to test it. For this, add tests
(check `swh/lister/*/tests/`) and register the celery tasks in the main
conftest.py (`swh/lister/core/tests/conftest.py`).
Another important step is to actually run it within the docker-dev
(:ref:`run-lister-tutorial`).
More about listers
------------------
See the currently implemented listers as examples (GitHub_, Bitbucket_, CGit_, GitLab_).
.. _GitHub: https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/github/lister.py
.. _Bitbucket: https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/bitbucket/lister.py
.. _CGit: https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/cgit/lister.py
.. _GitLab: https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/gitlab/lister.py
diff --git a/setup.py b/setup.py
index 9a408a5..c8a6898 100755
--- a/setup.py
+++ b/setup.py
@@ -1,86 +1,87 @@
#!/usr/bin/env python3
# Copyright (C) 2015-2020 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
from io import open
from os import path
from setuptools import find_packages, setup
here = path.abspath(path.dirname(__file__))
# Get the long description from the README file
with open(path.join(here, "README.md"), encoding="utf-8") as f:
long_description = f.read()
def parse_requirements(name=None):
if name:
reqf = "requirements-%s.txt" % name
else:
reqf = "requirements.txt"
requirements = []
if not path.exists(reqf):
return requirements
with open(reqf) as f:
for line in f.readlines():
line = line.strip()
if not line or line.startswith("#"):
continue
requirements.append(line)
return requirements
setup(
name="swh.lister",
description="Software Heritage lister",
long_description=long_description,
long_description_content_type="text/markdown",
python_requires=">=3.7",
author="Software Heritage developers",
author_email="swh-devel@inria.fr",
url="https://forge.softwareheritage.org/diffusion/DLSGH/",
packages=find_packages(),
install_requires=parse_requirements() + parse_requirements("swh"),
tests_require=parse_requirements("test"),
setup_requires=["setuptools-scm"],
extras_require={"testing": parse_requirements("test")},
use_scm_version=True,
include_package_data=True,
entry_points="""
[swh.cli.subcommands]
lister=swh.lister.cli
[swh.workers]
lister.bitbucket=swh.lister.bitbucket:register
lister.cgit=swh.lister.cgit:register
lister.cran=swh.lister.cran:register
lister.debian=swh.lister.debian:register
lister.gitea=swh.lister.gitea:register
lister.github=swh.lister.github:register
lister.gitlab=swh.lister.gitlab:register
lister.gnu=swh.lister.gnu:register
lister.launchpad=swh.lister.launchpad:register
lister.npm=swh.lister.npm:register
lister.packagist=swh.lister.packagist:register
lister.phabricator=swh.lister.phabricator:register
lister.pypi=swh.lister.pypi:register
lister.sourceforge=swh.lister.sourceforge:register
+ lister.tuleap=swh.lister.tuleap:register
""",
classifiers=[
"Programming Language :: Python :: 3",
"Intended Audience :: Developers",
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
"Operating System :: OS Independent",
"Development Status :: 5 - Production/Stable",
],
project_urls={
"Bug Reports": "https://forge.softwareheritage.org/maniphest",
"Funding": "https://www.softwareheritage.org/donate",
"Source": "https://forge.softwareheritage.org/source/swh-lister",
"Documentation": "https://docs.softwareheritage.org/devel/swh-lister/",
},
)
diff --git a/swh/lister/tests/test_cli.py b/swh/lister/tests/test_cli.py
index 53ec7f2..4a0bff3 100644
--- a/swh/lister/tests/test_cli.py
+++ b/swh/lister/tests/test_cli.py
@@ -1,42 +1,43 @@
# Copyright (C) 2019-2021 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
import pytest
from swh.lister.cli import SUPPORTED_LISTERS, get_lister
lister_args = {
"cgit": {"url": "https://git.eclipse.org/c/",},
"phabricator": {
"instance": "softwareheritage",
"url": "https://forge.softwareheritage.org/api/diffusion.repository.search",
"api_token": "bogus",
},
"gitea": {"url": "https://try.gitea.io/api/v1/",},
+ "tuleap": {"url": "https://tuleap.net",},
"gitlab": {"url": "https://gitlab.ow2.org/api/v4", "instance": "ow2",},
}
def test_get_lister_wrong_input():
"""Unsupported lister should raise"""
with pytest.raises(ValueError) as e:
get_lister("unknown", "db-url")
assert "Invalid lister" in str(e.value)
def test_get_lister(swh_scheduler_config):
"""Instantiating a supported lister should be ok
"""
# Drop launchpad lister from the lister to check, its test setup is more involved
# than the other listers and it's not currently done here
for lister_name in SUPPORTED_LISTERS:
lst = get_lister(
lister_name,
scheduler={"cls": "local", **swh_scheduler_config},
**lister_args.get(lister_name, {}),
)
assert hasattr(lst, "run")
diff --git a/swh/lister/tuleap/__init__.py b/swh/lister/tuleap/__init__.py
new file mode 100644
index 0000000..49ccbd8
--- /dev/null
+++ b/swh/lister/tuleap/__init__.py
@@ -0,0 +1,12 @@
+# Copyright (C) 2021 the Software Heritage developers
+# License: GNU General Public License version 3, or any later version
+# See top-level LICENSE file for more information
+
+
+def register():
+ from .lister import TuleapLister
+
+ return {
+ "lister": TuleapLister,
+ "task_modules": ["%s.tasks" % __name__],
+ }
diff --git a/swh/lister/tuleap/lister.py b/swh/lister/tuleap/lister.py
new file mode 100644
index 0000000..6145359
--- /dev/null
+++ b/swh/lister/tuleap/lister.py
@@ -0,0 +1,150 @@
+# Copyright (C) 2021 The Software Heritage developers
+# See the AUTHORS file at the top-level directory of this distribution
+# License: GNU General Public License version 3, or any later version
+# See top-level LICENSE file for more information
+
+import logging
+from typing import Any, Dict, Iterator, List, Optional
+from urllib.parse import urljoin
+
+import iso8601
+import requests
+from tenacity.before_sleep import before_sleep_log
+from urllib3.util import parse_url
+
+from swh.lister.utils import throttling_retry
+from swh.scheduler.interface import SchedulerInterface
+from swh.scheduler.model import ListedOrigin
+
+from .. import USER_AGENT
+from ..pattern import CredentialsType, StatelessLister
+
+logger = logging.getLogger(__name__)
+
+RepoPage = Dict[str, Any]
+
+
+class TuleapLister(StatelessLister[RepoPage]):
+ """List origins from Tuleap.
+
+ Tuleap provides SVN and Git repositories hosting.
+
+ Tuleap API getting started:
+ https://tuleap.net/doc/en/user-guide/integration/rest.html
+ Tuleap API reference:
+ https://tuleap.net/api/explorer/
+
+ Using the API we first request a list of projects, and from there request their
+ associated repositories individually. Everything is paginated, code uses throttling
+ at the individual GET call level."""
+
+ LISTER_NAME = "tuleap"
+
+ REPO_LIST_PATH = "/api"
+ REPO_GIT_PATH = "plugins/git/"
+ REPO_SVN_PATH = "plugins/svn/"
+
+ def __init__(
+ self,
+ scheduler: SchedulerInterface,
+ url: str,
+ instance: Optional[str] = None,
+ credentials: CredentialsType = None,
+ ):
+ if instance is None:
+ instance = parse_url(url).host
+
+ super().__init__(
+ scheduler=scheduler, credentials=credentials, url=url, instance=instance,
+ )
+
+ self.session = requests.Session()
+ self.session.headers.update(
+ {"Accept": "application/json", "User-Agent": USER_AGENT,}
+ )
+
+ @throttling_retry(before_sleep=before_sleep_log(logger, logging.WARNING))
+ def page_request(self, url: str, params: Dict[str, Any]) -> requests.Response:
+
+ logger.info("Fetching URL %s with params %s", url, params)
+
+ response = self.session.get(url, params=params)
+ if response.status_code != 200:
+ logger.warning(
+ "Unexpected HTTP status code %s on %s: %s",
+ response.status_code,
+ response.url,
+ response.content,
+ )
+ response.raise_for_status()
+
+ return response
+
+ @classmethod
+ def results_simplified(cls, url: str, repo_type: str, repo: RepoPage) -> RepoPage:
+ if repo_type == "git":
+ prefix_url = TuleapLister.REPO_GIT_PATH
+ else:
+ prefix_url = TuleapLister.REPO_SVN_PATH
+ rep = {
+ "project": repo["name"],
+ "type": repo_type,
+ "uri": urljoin(url, f"{prefix_url}{repo['path']}"),
+ "last_update_date": repo["last_update_date"],
+ }
+ return rep
+
+ def _get_repositories(self, url_repo) -> List[Dict[str, Any]]:
+ ret = self.page_request(url_repo, {})
+ reps_list = ret.json()["repositories"]
+ limit = int(ret.headers["X-PAGINATION-LIMIT-MAX"])
+ offset = int(ret.headers["X-PAGINATION-LIMIT"])
+ size = int(ret.headers["X-PAGINATION-SIZE"])
+ while offset < size:
+ url_offset = url_repo + "?offset=" + str(offset) + "&limit=" + str(limit)
+ ret = self.page_request(url_offset, {}).json()
+ reps_list = reps_list + ret["repositories"]
+ offset += limit
+ return reps_list
+
+ def get_pages(self) -> Iterator[RepoPage]:
+ # base with trailing slash, path without leading slash for urljoin
+ url_api: str = urljoin(self.url, self.REPO_LIST_PATH)
+ url_projects = url_api + "/projects/"
+
+ # Get the list of projects.
+ response = self.page_request(url_projects, {})
+ projects_list = response.json()
+ limit = int(response.headers["X-PAGINATION-LIMIT-MAX"])
+ offset = int(response.headers["X-PAGINATION-LIMIT"])
+ size = int(response.headers["X-PAGINATION-SIZE"])
+ while offset < size:
+ url_offset = (
+ url_projects + "?offset=" + str(offset) + "&limit=" + str(limit)
+ )
+ ret = self.page_request(url_offset, {}).json()
+ projects_list = projects_list + ret
+ offset += limit
+
+ # Get list of repositories for each project.
+ for p in projects_list:
+ p_id = p["id"]
+
+ # Fetch Git repositories for project
+ url_git = url_projects + str(p_id) + "/git"
+ repos = self._get_repositories(url_git)
+ for repo in repos:
+ yield self.results_simplified(url_api, "git", repo)
+
+ def get_origins_from_page(self, page: RepoPage) -> Iterator[ListedOrigin]:
+ """Convert a page of Tuleap repositories into a list of ListedOrigins.
+
+ """
+ assert self.lister_obj.id is not None
+
+ yield ListedOrigin(
+ lister_id=self.lister_obj.id,
+ url=page["uri"],
+ visit_type=page["type"],
+ last_update=iso8601.parse_date(page["last_update_date"]),
+ )
diff --git a/swh/lister/tuleap/tasks.py b/swh/lister/tuleap/tasks.py
new file mode 100644
index 0000000..12b80ce
--- /dev/null
+++ b/swh/lister/tuleap/tasks.py
@@ -0,0 +1,21 @@
+# Copyright (C) 2021 the Software Heritage developers
+# License: GNU General Public License version 3, or any later version
+# See top-level LICENSE file for more information
+
+from typing import Dict
+
+from celery import shared_task
+
+from .lister import TuleapLister
+
+
+@shared_task(name=__name__ + ".FullTuleapLister")
+def list_tuleap_full(**lister_args) -> Dict[str, int]:
+ """Full update of a Tuleap instance"""
+ lister = TuleapLister.from_configfile(**lister_args)
+ return lister.run().dict()
+
+
+@shared_task(name=__name__ + ".ping")
+def _ping() -> str:
+ return "OK"
diff --git a/swh/lister/tuleap/tests/__init__.py b/swh/lister/tuleap/tests/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/swh/lister/tuleap/tests/data/https_tuleap.net/projects b/swh/lister/tuleap/tests/data/https_tuleap.net/projects
new file mode 100644
index 0000000..8092122
--- /dev/null
+++ b/swh/lister/tuleap/tests/data/https_tuleap.net/projects
@@ -0,0 +1 @@
+[{"resources":[{"type":"git","uri":"projects/685/git"},{"type":"trackers","uri":"projects/685/trackers"},{"type":"backlog","uri":"projects/685/backlog"},{"type":"milestones","uri":"projects/685/milestones"},{"type":"plannings","uri":"projects/685/plannings"},{"type":"labeled_items","uri":"projects/685/labeled_items"},{"type":"svn","uri":"projects/685/svn"},{"type":"testmanagement_campaigns","uri":"projects/685/testmanagement_campaigns"},{"type":"testmanagement_definitions","uri":"projects/685/testmanagement_definitions"},{"type":"testmanagement_nodes","uri":"projects/685/testmanagement_nodes"},{"type":"project_services","uri":"projects/685/project_services"},{"type":"user_groups","uri":"projects/685/user_groups"},{"type":"phpwiki","uri":"projects/685/phpwiki"},{"type":"heartbeats","uri":"projects/685/heartbeats"},{"type":"labels","uri":"projects/685/labels"}],"additional_informations":[],"is_member_of":false,"description":"Manjaro Memo Documentation est un projet Sphinx portant sur l'utilisation et la maj de Manjaro (et de ses outils) ainsi que sur Systemd et Journactl. Il comprendra tout un ensemble de commande pour se servir correctement de ce système dérivé d'Archlinux.","additional_fields":[{"name":"project_desc_name:full_desc","value":""},{"name":"project_desc_name:other_comments","value":""}],"id":685,"uri":"projects/685","label":"Manjaro Memo Documentation","shortname":"manjaromemodoc","status":"active","access":"public","is_template":false},{"resources":[{"type":"git","uri":"projects/309/git"},{"type":"trackers","uri":"projects/309/trackers"},{"type":"backlog","uri":"projects/309/backlog"},{"type":"milestones","uri":"projects/309/milestones"},{"type":"plannings","uri":"projects/309/plannings"},{"type":"labeled_items","uri":"projects/309/labeled_items"},{"type":"svn","uri":"projects/309/svn"},{"type":"testmanagement_campaigns","uri":"projects/309/testmanagement_campaigns"},{"type":"testmanagement_definitions","uri":"projects/309/testmanagement_definitions"},{"type":"testmanagement_nodes","uri":"projects/309/testmanagement_nodes"},{"type":"project_services","uri":"projects/309/project_services"},{"type":"user_groups","uri":"projects/309/user_groups"},{"type":"phpwiki","uri":"projects/309/phpwiki"},{"type":"heartbeats","uri":"projects/309/heartbeats"},{"type":"labels","uri":"projects/309/labels"}],"additional_informations":[],"is_member_of":false,"description":"a library for audio and music analysis","additional_fields":[{"name":"project_desc_name:full_desc","value":""},{"name":"project_desc_name:other_comments","value":""}],"id":309,"uri":"projects/309","label":"aubio","shortname":"aubio","status":"active","access":"public","is_template":false},{"resources": [{"type": "git", "uri": "projects/1080/git"}, {"type": "trackers", "uri": "projects/1080/trackers"}, {"type": "backlog", "uri": "projects/1080/backlog"}, {"type": "milestones", "uri": "projects/1080/milestones"}, {"type": "plannings", "uri": "projects/1080/plannings"}, {"type": "labeled_items", "uri": "projects/1080/labeled_items"}, {"type": "svn", "uri": "projects/1080/svn"}, {"type": "testmanagement_campaigns", "uri": "projects/1080/testmanagement_campaigns"}, {"type": "testmanagement_definitions", "uri": "projects/1080/testmanagement_definitions"}, {"type": "testmanagement_nodes", "uri": "projects/1080/testmanagement_nodes"}, {"type": "project_services", "uri": "projects/1080/project_services"}, {"type": "user_groups", "uri": "projects/1080/user_groups"}, {"type": "phpwiki", "uri": "projects/1080/phpwiki"}, {"type": "heartbeats", "uri": "projects/1080/heartbeats"}, {"type": "labels", "uri": "projects/1080/labels"}], "additional_informations": {"agiledashboard": {"root_planning": {"id": 168, "uri": "route-not-yet-implemented", "label": "Sprint Planning", "project": {"id": 1080, "uri": "projects/1080", "label": null}, "milestone_tracker": {"id": 848, "uri": "trackers/848", "label": "Releases", "project": {"id": 1080, "uri": "projects/1080", "label": "CLI generate:stuff"}}, "backlog_trackers": [{"id": 831, "uri": "trackers/831"}, {"id": 846, "uri": "trackers/846"}], "milestones_uri": "plannings/168/milestones"}}}, "is_member_of": false, "description": "A CLI to help functional and performance testing in Tuleap", "additional_fields": [{"name": "project_desc_name:full_desc", "value": ""}, {"name": "project_desc_name:other_comments", "value": ""}], "id": 1080, "uri": "projects/1080", "label": "CLI generate:stuff", "shortname": "service-cleanup", "status": "active", "access": "public", "is_template": false}]
diff --git a/swh/lister/tuleap/tests/data/https_tuleap.net/repo_1 b/swh/lister/tuleap/tests/data/https_tuleap.net/repo_1
new file mode 100644
index 0000000..8f206d3
--- /dev/null
+++ b/swh/lister/tuleap/tests/data/https_tuleap.net/repo_1
@@ -0,0 +1 @@
+{"repositories":[{"id":295,"uri":"git/295","name":"manjaro-memo-documentation","label":"manjaro-memo-documentation","path":"manjaromemodoc/manjaro-memo-documentation.git","path_without_project":"","description":"-- Default description --","last_update_date":"2020-10-03T15:27:02+02:00","permissions":"None","server":"None","html_url":"/plugins/git/manjaromemodoc/manjaro-memo-documentation","additional_information":[]}]}
diff --git a/swh/lister/tuleap/tests/data/https_tuleap.net/repo_2 b/swh/lister/tuleap/tests/data/https_tuleap.net/repo_2
new file mode 100644
index 0000000..36e6f0f
--- /dev/null
+++ b/swh/lister/tuleap/tests/data/https_tuleap.net/repo_2
@@ -0,0 +1 @@
+{"repositories":[{"id":309,"uri":"git/309","name":"myaurora","label":"myaurora","path":"myaurora/myaurora.git","path_without_project":"","description":"-- Default description --","last_update_date":"2021-03-04T08:43:40+01:00","permissions":"None","server":"None","html_url":"/plugins/git/myaurora/myaurora","additional_information":[]}]}
diff --git a/swh/lister/tuleap/tests/data/https_tuleap.net/repo_3 b/swh/lister/tuleap/tests/data/https_tuleap.net/repo_3
new file mode 100644
index 0000000..04bd5c9
--- /dev/null
+++ b/swh/lister/tuleap/tests/data/https_tuleap.net/repo_3
@@ -0,0 +1 @@
+{"repositories":[]}
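The fixtures in `test_lister.py` below pair these data files with `X-PAGINATION-*` response headers. As an illustrative sketch only (assumption, following Tuleap REST conventions: `X-PAGINATION-SIZE` is the total item count and `X-PAGINATION-LIMIT` the per-page limit), the number of pages a client would need to fetch can be derived as:

```python
import math


def expected_pages(headers):
    """Illustrative helper, not part of the lister: derive the page count
    from Tuleap-style pagination headers."""
    size = int(headers["X-PAGINATION-SIZE"])    # total number of items
    limit = int(headers["X-PAGINATION-LIMIT"])  # items returned per page
    # At least one request is always made, even for an empty result set.
    return max(1, math.ceil(size / limit))


print(expected_pages({"X-PAGINATION-SIZE": "2", "X-PAGINATION-LIMIT": "10"}))   # 1
print(expected_pages({"X-PAGINATION-SIZE": "25", "X-PAGINATION-LIMIT": "10"}))  # 3
```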
diff --git a/swh/lister/tuleap/tests/test_lister.py b/swh/lister/tuleap/tests/test_lister.py
new file mode 100644
index 0000000..5e74d35
--- /dev/null
+++ b/swh/lister/tuleap/tests/test_lister.py
@@ -0,0 +1,171 @@
+# Copyright (C) 2021 The Software Heritage developers
+# See the AUTHORS file at the top-level directory of this distribution
+# License: GNU General Public License version 3, or any later version
+# See top-level LICENSE file for more information
+
+import json
+from pathlib import Path
+from typing import Dict, List, Tuple
+
+import pytest
+import requests
+
+from swh.lister.tuleap.lister import RepoPage, TuleapLister
+from swh.scheduler.model import ListedOrigin
+
+TULEAP_URL = "https://tuleap.net/"
+TULEAP_PROJECTS_URL = TULEAP_URL + "api/projects/"
+TULEAP_REPO_1_URL = TULEAP_URL + "api/projects/685/git" # manjaromemodoc
+TULEAP_REPO_2_URL = TULEAP_URL + "api/projects/309/git" # myaurora
+TULEAP_REPO_3_URL = TULEAP_URL + "api/projects/1080/git" # tuleap cleanup module
+
+GIT_REPOS = (
+ "https://tuleap.net/plugins/git/manjaromemodoc/manjaro-memo-documentation.git",
+ "https://tuleap.net/plugins/git/myaurora/myaurora.git",
+)
+
+
+@pytest.fixture
+def tuleap_projects(datadir) -> Tuple[str, Dict[str, str], List[str]]:
+ text = Path(datadir, "https_tuleap.net", "projects").read_text()
+ headers = {
+ "X-PAGINATION-LIMIT-MAX": "50",
+ "X-PAGINATION-LIMIT": "10",
+ "X-PAGINATION-SIZE": "2",
+ }
+ repo_json = json.loads(text)
+ projects = [p["shortname"] for p in repo_json]
+ return text, headers, projects
+
+
+@pytest.fixture
+def tuleap_repo_1(datadir) -> Tuple[str, Dict[str, str], List[RepoPage], List[str]]:
+ text = Path(datadir, "https_tuleap.net", "repo_1").read_text()
+ headers = {
+ "X-PAGINATION-LIMIT-MAX": "50",
+ "X-PAGINATION-LIMIT": "10",
+ "X-PAGINATION-SIZE": "1",
+ }
+ reps = json.loads(text)
+ page_results = []
+ for r in reps["repositories"]:
+ page_results.append(
+ TuleapLister.results_simplified(url=TULEAP_URL, repo_type="git", repo=r)
+ )
+ origin_urls = [r["uri"] for r in page_results]
+ return text, headers, page_results, origin_urls
+
+
+@pytest.fixture
+def tuleap_repo_2(datadir) -> Tuple[str, Dict[str, str], List[RepoPage], List[str]]:
+ text = Path(datadir, "https_tuleap.net", "repo_2").read_text()
+ headers = {
+ "X-PAGINATION-LIMIT-MAX": "50",
+ "X-PAGINATION-LIMIT": "10",
+ "X-PAGINATION-SIZE": "1",
+ }
+ reps = json.loads(text)
+ page_results = []
+ for r in reps["repositories"]:
+ page_results.append(
+ TuleapLister.results_simplified(url=TULEAP_URL, repo_type="git", repo=r)
+ )
+ origin_urls = [r["uri"] for r in page_results]
+ return text, headers, page_results, origin_urls
+
+
+@pytest.fixture
+def tuleap_repo_3(datadir) -> Tuple[str, Dict[str, str], List[RepoPage], List[str]]:
+ text = Path(datadir, "https_tuleap.net", "repo_3").read_text()
+ headers = {
+ "X-PAGINATION-LIMIT-MAX": "50",
+ "X-PAGINATION-LIMIT": "10",
+ "X-PAGINATION-SIZE": "0",
+ }
+ reps = json.loads(text)
+ page_results = []
+ for r in reps["repositories"]:
+ page_results.append(
+ TuleapLister.results_simplified(url=TULEAP_URL, repo_type="git", repo=r)
+ )
+ origin_urls = [r["uri"] for r in page_results]
+ return text, headers, page_results, origin_urls
+
+
+def check_listed_origins(lister_urls: List[str], scheduler_origins: List[ListedOrigin]):
+ """Asserts that the two collections have the same origin URLs.
+
+ Does not test last_update."""
+
+ sorted_lister_urls = list(sorted(lister_urls))
+ sorted_scheduler_origins = list(sorted(scheduler_origins))
+
+ assert len(sorted_lister_urls) == len(sorted_scheduler_origins)
+
+ for l_url, s_origin in zip(sorted_lister_urls, sorted_scheduler_origins):
+ assert l_url == s_origin.url
+
+
+def test_tuleap_full_listing(
+ swh_scheduler,
+ requests_mock,
+ mocker,
+ tuleap_projects,
+ tuleap_repo_1,
+ tuleap_repo_2,
+ tuleap_repo_3,
+):
+    """Covers full listing across multiple pages, rate-limit handling,
+    page sizes (the pagination headers the test relies on), checks on page
+    results and listed origins, and statelessness of the lister."""
+
+ lister = TuleapLister(
+ scheduler=swh_scheduler, url=TULEAP_URL, instance="tuleap.net"
+ )
+
+ p_text, p_headers, p_projects = tuleap_projects
+ r1_text, r1_headers, r1_result, r1_origin_urls = tuleap_repo_1
+ r2_text, r2_headers, r2_result, r2_origin_urls = tuleap_repo_2
+ r3_text, r3_headers, r3_result, r3_origin_urls = tuleap_repo_3
+
+ requests_mock.get(TULEAP_PROJECTS_URL, text=p_text, headers=p_headers)
+ requests_mock.get(TULEAP_REPO_1_URL, text=r1_text, headers=r1_headers)
+ requests_mock.get(
+ TULEAP_REPO_2_URL,
+ [
+ {"status_code": requests.codes.too_many_requests},
+ {"text": r2_text, "headers": r2_headers},
+ ],
+ )
+ requests_mock.get(TULEAP_REPO_3_URL, text=r3_text, headers=r3_headers)
+
+ # end test setup
+
+ stats = lister.run()
+
+ # start test checks
+ assert stats.pages == 2
+ assert stats.origins == 2
+
+ scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
+
+ check_listed_origins(
+ r1_origin_urls + r2_origin_urls + r3_origin_urls, scheduler_origins
+ )
+ check_listed_origins(GIT_REPOS, scheduler_origins)
+
+ assert lister.get_state_from_scheduler() is None
+
+
+@pytest.mark.parametrize("http_code", [400, 500, 502])
+def test_tuleap_list_http_error(swh_scheduler, requests_mock, http_code):
+ """Test handling of some HTTP errors commonly encountered"""
+
+ lister = TuleapLister(scheduler=swh_scheduler, url=TULEAP_URL)
+
+ requests_mock.get(TULEAP_PROJECTS_URL, status_code=http_code)
+
+ with pytest.raises(requests.HTTPError):
+ lister.run()
+
+ scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
+ assert len(scheduler_origins) == 0
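The 429-then-success response list registered for `TULEAP_REPO_2_URL` in `test_tuleap_full_listing` exercises retry on rate limiting. A simplified, standalone sketch of that behaviour (`fetch` is a hypothetical callable standing in for one page request; the real lister delegates retries to shared HTTP helpers):

```python
def fetch_with_retry(fetch, max_retries=1):
    """Retry a page request when the server answers HTTP 429
    (Too Many Requests), up to max_retries extra attempts."""
    attempts = 0
    while True:
        status, body = fetch()
        if status != 429 or attempts >= max_retries:
            return status, body
        attempts += 1


# First call is rate-limited, the retry succeeds -- mirroring the
# two-element response list the test registers with requests_mock.
responses = iter([(429, None), (200, {"repositories": []})])
status, body = fetch_with_retry(lambda: next(responses))
print(status)  # 200
```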
diff --git a/swh/lister/tuleap/tests/test_tasks.py b/swh/lister/tuleap/tests/test_tasks.py
new file mode 100644
index 0000000..a9b3cf2
--- /dev/null
+++ b/swh/lister/tuleap/tests/test_tasks.py
@@ -0,0 +1,50 @@
+# Copyright (C) 2021 The Software Heritage developers
+# See the AUTHORS file at the top-level directory of this distribution
+# License: GNU General Public License version 3, or any later version
+# See top-level LICENSE file for more information
+
+from swh.lister.pattern import ListerStats
+
+
+def test_ping(swh_scheduler_celery_app, swh_scheduler_celery_worker):
+ res = swh_scheduler_celery_app.send_task("swh.lister.tuleap.tasks.ping")
+ assert res
+ res.wait()
+ assert res.successful()
+ assert res.result == "OK"
+
+
+def test_full_listing(swh_scheduler_celery_app, swh_scheduler_celery_worker, mocker):
+ lister = mocker.patch("swh.lister.tuleap.tasks.TuleapLister")
+ lister.from_configfile.return_value = lister
+ lister.run.return_value = ListerStats(pages=10, origins=500)
+
+ kwargs = dict(url="https://tuleap.net")
+ res = swh_scheduler_celery_app.send_task(
+ "swh.lister.tuleap.tasks.FullTuleapLister", kwargs=kwargs,
+ )
+ assert res
+ res.wait()
+ assert res.successful()
+
+ lister.from_configfile.assert_called_once_with(**kwargs)
+ lister.run.assert_called_once_with()
+
+
+def test_full_listing_params(
+ swh_scheduler_celery_app, swh_scheduler_celery_worker, mocker
+):
+ lister = mocker.patch("swh.lister.tuleap.tasks.TuleapLister")
+ lister.from_configfile.return_value = lister
+ lister.run.return_value = ListerStats(pages=10, origins=500)
+
+ kwargs = dict(url="https://tuleap.net", instance="tuleap.net",)
+ res = swh_scheduler_celery_app.send_task(
+ "swh.lister.tuleap.tasks.FullTuleapLister", kwargs=kwargs,
+ )
+ assert res
+ res.wait()
+ assert res.successful()
+
+ lister.from_configfile.assert_called_once_with(**kwargs)
+ lister.run.assert_called_once_with()
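The mocking pattern used in both listing tests above is worth noting: patching the class and pointing `from_configfile` back at the same mock lets a single object record both the construction kwargs and the `run()` call. A stdlib-only sketch of the same trick, outside Celery:

```python
from unittest import mock

# One mock plays both the class and the instance: from_configfile
# returns the mock itself, so all calls land on one recorder.
lister = mock.MagicMock()
lister.from_configfile.return_value = lister
lister.run.return_value = {"pages": 10, "origins": 500}

# Simulate what the task body does with the patched class.
obj = lister.from_configfile(url="https://tuleap.net")
result = obj.run()

lister.from_configfile.assert_called_once_with(url="https://tuleap.net")
lister.run.assert_called_once_with()
print(result)  # {'pages': 10, 'origins': 500}
```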