
diff --git a/PKG-INFO b/PKG-INFO
index 98e33d1..996946d 100644
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,127 +1,127 @@
Metadata-Version: 2.1
Name: swh.lister
-Version: 2.6.2
+Version: 2.6.3
Summary: Software Heritage lister
Home-page: https://forge.softwareheritage.org/diffusion/DLSGH/
Author: Software Heritage developers
Author-email: swh-devel@inria.fr
License: UNKNOWN
Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Source, https://forge.softwareheritage.org/source/swh-lister
Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-lister/
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: testing
License-File: LICENSE
swh-lister
==========
This component of the Software Heritage stack produces listings of software
origins and their URLs as hosted on various public developer platforms or
package managers. As these listing operations are quite similar, it provides a set of
Python modules abstracting common listing behaviors.
It also provides several lister implementations, contained in the
following Python modules:
- `swh.lister.bitbucket`
- `swh.lister.cgit`
- `swh.lister.cran`
- `swh.lister.debian`
- `swh.lister.gitea`
- `swh.lister.github`
- `swh.lister.gitlab`
- `swh.lister.gnu`
- `swh.lister.launchpad`
- `swh.lister.maven`
- `swh.lister.npm`
- `swh.lister.packagist`
- `swh.lister.phabricator`
- `swh.lister.pypi`
- `swh.lister.tuleap`
Dependencies
------------
All required dependencies can be found in the `requirements*.txt` files located
at the root of the repository.
Local deployment
----------------
## lister configuration
Each lister implemented so far by Software Heritage (`bitbucket`, `cgit`, `cran`, `debian`,
`gitea`, `github`, `gitlab`, `gnu`, `launchpad`, `npm`, `packagist`, `phabricator`, `pypi`, `tuleap`, `maven`)
must be configured by following the instructions below (note that `<lister_name>` has to be
replaced with one of the lister names introduced above).
### Preparation steps
1. `mkdir ~/.config/swh/`
2. create configuration file `~/.config/swh/listers.yml`
### Configuration file sample
Minimalistic configuration shared by all listers to add in file `~/.config/swh/listers.yml`:
```lang=yml
scheduler:
  cls: 'remote'
  args:
    url: 'http://localhost:5008/'

credentials: {}
```
Note: this configuration expects the scheduler service to be running locally on port 5008.
## Executing a lister
Once configured, a lister can be executed by using the `swh` CLI tool with the
following options and commands:
```
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister <lister_name> [lister_parameters]
```
Examples:
```
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister bitbucket
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister cran
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitea url=https://codeberg.org/api/v1/
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitlab url=https://salsa.debian.org/api/v4/
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister npm
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister pypi
```
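The `maven` lister modified by this diff additionally takes `url` and `index_url` parameters (documented in its constructor further down). As a minimal sketch only, a lister can also be driven directly from Python rather than through the CLI; this assumes a scheduler service on `localhost:5008` as configured above and a purely hypothetical index export URL:
```lang=python
# Minimal sketch, assuming swh.scheduler's get_scheduler factory and the
# MavenLister constructor shown below; index_url is a hypothetical local export.
from swh.lister.maven.lister import MavenLister
from swh.scheduler import get_scheduler

scheduler = get_scheduler(cls="remote", url="http://localhost:5008/")
lister = MavenLister(
    scheduler=scheduler,
    url="https://repo1.maven.org/maven2/",         # base URL of the Maven repository
    index_url="http://localhost:8000/export.fld",  # hypothetical text index export
    incremental=True,
)
stats = lister.run()  # lists pages and records the discovered origins
```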
Licensing
---------
This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
See top-level LICENSE file for the full text of the GNU General Public License
along with this program.
diff --git a/swh.lister.egg-info/PKG-INFO b/swh.lister.egg-info/PKG-INFO
index 98e33d1..996946d 100644
--- a/swh.lister.egg-info/PKG-INFO
+++ b/swh.lister.egg-info/PKG-INFO
@@ -1,127 +1,127 @@
Metadata-Version: 2.1
Name: swh.lister
-Version: 2.6.2
+Version: 2.6.3
diff --git a/swh/lister/maven/lister.py b/swh/lister/maven/lister.py
index f5601b8..2d57550 100644
--- a/swh/lister/maven/lister.py
+++ b/swh/lister/maven/lister.py
@@ -1,361 +1,361 @@
# Copyright (C) 2021 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
import logging
import re
from typing import Any, Dict, Iterator, Optional
from urllib.parse import urljoin
import requests
from tenacity.before_sleep import before_sleep_log
from urllib3.util import parse_url
import xmltodict
from swh.lister.utils import throttling_retry
from swh.scheduler.interface import SchedulerInterface
from swh.scheduler.model import ListedOrigin
from .. import USER_AGENT
from ..pattern import CredentialsType, Lister
logger = logging.getLogger(__name__)
RepoPage = Dict[str, Any]
@dataclass
class MavenListerState:
"""State of the MavenLister"""
last_seen_doc: int = -1
"""Last doc ID ingested during an incremental pass
"""
last_seen_pom: int = -1
"""Last doc ID related to a pom and ingested during
an incremental pass
"""
class MavenLister(Lister[MavenListerState, RepoPage]):
"""List origins from a Maven repository.
Maven Central provides artifacts for Java builds.
It includes POM files and source archives, which we download to get
the source code of artifacts and links to their scm repository.
This lister yields origins of type git/svn/hg (or whatever repository type the
artifacts declare), plus maven types (tgz, jar) for the maven loader."""
LISTER_NAME = "maven"
def __init__(
self,
scheduler: SchedulerInterface,
url: str,
index_url: str = None,
instance: Optional[str] = None,
credentials: CredentialsType = None,
incremental: bool = True,
):
"""Lister class for Maven repositories.
Args:
url: main URL of the Maven repository, i.e. url of the base index
used to fetch maven artifacts. For Maven central use
https://repo1.maven.org/maven2/
index_url: the URL to download the exported text indexes from.
Would typically be a local host running the export docker image.
See README.md in this directory for more information.
instance: Name of maven instance. Defaults to url's network location
if unset.
incremental: bool, defaults to True. Defines if incremental listing
is activated or not.
"""
self.BASE_URL = url
self.INDEX_URL = index_url
self.incremental = incremental
if instance is None:
instance = parse_url(url).host
super().__init__(
scheduler=scheduler, credentials=credentials, url=url, instance=instance,
)
self.session = requests.Session()
self.session.headers.update(
{"Accept": "application/json", "User-Agent": USER_AGENT,}
)
def state_from_dict(self, d: Dict[str, Any]) -> MavenListerState:
return MavenListerState(**d)
def state_to_dict(self, state: MavenListerState) -> Dict[str, Any]:
return asdict(state)
@throttling_retry(before_sleep=before_sleep_log(logger, logging.WARNING))
def page_request(self, url: str, params: Dict[str, Any]) -> requests.Response:
logger.info("Fetching URL %s with params %s", url, params)
response = self.session.get(url, params=params)
if response.status_code != 200:
logger.warning(
"Unexpected HTTP status code %s on %s: %s",
response.status_code,
response.url,
response.content,
)
response.raise_for_status()
return response
def get_pages(self) -> Iterator[RepoPage]:
""" Retrieve and parse exported maven indexes to
identify all pom files and src archives.
"""
# Example of returned RepoPage's:
# [
# {
# "type": "maven",
# "url": "https://maven.xwiki.org/..-5.4.2-sources.jar",
# "time": 1626109619335,
# "gid": "org.xwiki.platform",
# "aid": "xwiki-platform-wikistream-events-xwiki",
# "version": "5.4.2"
# },
# {
# "type": "scm",
# "url": "scm:git:git://github.com/openengsb/openengsb-framework.git",
# "project": "openengsb-framework",
# },
# ...
# ]
# Download the main text index file.
logger.info("Downloading text index from %s.", self.INDEX_URL)
assert self.INDEX_URL is not None
response = requests.get(self.INDEX_URL, stream=True)
response.raise_for_status()
# Prepare regexes to parse index exports.
# Parse doc id.
# Example line: "doc 13"
re_doc = re.compile(r"^doc (?P<doc>\d+)$")
# Parse gid, aid, version, classifier, extension.
# Example line: " value al.aldi|sprova4j|0.1.0|sources|jar"
re_val = re.compile(
r"^\s{4}value (?P<gid>[^|]+)\|(?P<aid>[^|]+)\|(?P<version>[^|]+)\|"
+ r"(?P<classifier>[^|]+)\|(?P<ext>[^|]+)$"
)
# Parse last modification time.
# Example line: " value jar|1626109619335|14316|2|2|0|jar"
re_time = re.compile(
r"^\s{4}value ([^|]+)\|(?P<mtime>[^|]+)\|([^|]+)\|([^|]+)\|([^|]+)"
+ r"\|([^|]+)\|([^|]+)$"
)
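        # Illustration only, using the example lines quoted in the comments
        # above (not live data):
        #   re_doc.match("doc 13").group("doc") -> "13"
        #   re_val.match("    value al.aldi|sprova4j|0.1.0|sources|jar").groups()
        #     -> ("al.aldi", "sprova4j", "0.1.0", "sources", "jar")
        #   re_time.match("    value jar|1626109619335|14316|2|2|0|jar").group("mtime")
        #     -> "1626109619335"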
# Read file line by line and process it
out_pom: Dict = {}
jar_src: Dict = {}
doc_id: int = 0
jar_src["doc"] = None
url_src = None
iterator = response.iter_lines(chunk_size=1024)
for line_bytes in iterator:
# Read the index text export and get URLs and SCMs.
line = line_bytes.decode(errors="ignore")
m_doc = re_doc.match(line)
if m_doc is not None:
doc_id = int(m_doc.group("doc"))
if (
self.incremental
and self.state
and self.state.last_seen_doc
and self.state.last_seen_doc >= doc_id
):
# jar_src["doc"] contains the id of the current document, whatever
# its type (scm or jar).
jar_src["doc"] = None
else:
jar_src["doc"] = doc_id
else:
# If incremental mode, we don't record any line that is
# before our last recorded doc id.
if self.incremental and jar_src["doc"] is None:
continue
m_val = re_val.match(line)
if m_val is not None:
(gid, aid, version, classifier, ext) = m_val.groups()
ext = ext.strip()
path = "/".join(gid.split("."))
if classifier == "NA" and ext.lower() == "pom":
# If incremental mode, we don't record any line that is
# before our last recorded doc id.
if (
self.incremental
and self.state
and self.state.last_seen_pom
and self.state.last_seen_pom >= doc_id
):
continue
url_path = f"{path}/{aid}/{version}/{aid}-{version}.{ext}"
url_pom = urljoin(self.BASE_URL, url_path,)
out_pom[url_pom] = doc_id
elif (
classifier.lower() == "sources" or ("src" in classifier)
) and ext.lower() in ("zip", "jar"):
url_path = (
f"{path}/{aid}/{version}/{aid}-{version}-{classifier}.{ext}"
)
url_src = urljoin(self.BASE_URL, url_path)
jar_src["gid"] = gid
jar_src["aid"] = aid
jar_src["version"] = version
else:
m_time = re_time.match(line)
if m_time is not None and url_src is not None:
time = m_time.group("mtime")
jar_src["time"] = int(time)
artifact_metadata_d = {
"type": "maven",
"url": url_src,
**jar_src,
}
logger.debug(
"* Yielding jar %s: %s", url_src, artifact_metadata_d
)
yield artifact_metadata_d
url_src = None
logger.info("Found %s poms.", len(out_pom))
# Now fetch pom files and scan them for scm info.
logger.info("Fetching poms..")
for pom in out_pom:
text = self.page_request(pom, {})
try:
project = xmltodict.parse(text.content.decode())
if "scm" in project["project"]:
if "connection" in project["project"]["scm"]:
scm = project["project"]["scm"]["connection"]
gid = project["project"]["groupId"]
aid = project["project"]["artifactId"]
artifact_metadata_d = {
"type": "scm",
"doc": out_pom[pom],
"url": scm,
"project": f"{gid}.{aid}",
}
logger.debug("* Yielding pom %s: %s", pom, artifact_metadata_d)
yield artifact_metadata_d
else:
logger.debug("No scm.connection in pom %s", pom)
else:
logger.debug("No scm in pom %s", pom)
except xmltodict.expat.ExpatError as error:
logger.info("Could not parse POM %s XML: %s. Next.", pom, error)
def get_origins_from_page(self, page: RepoPage) -> Iterator[ListedOrigin]:
"""Convert a page of Maven repositories into a list of ListedOrigins.
"""
assert self.lister_obj.id is not None
scm_types_ok = ("git", "svn", "hg", "cvs", "bzr")
if page["type"] == "scm":
# If origin is a scm url: detect scm type and yield.
# Note that the official format is:
# scm:git:git://github.com/openengsb/openengsb-framework.git
# but many, many projects directly put the repo url, so we have to
# detect the content to match it properly.
m_scm = re.match(r"^scm:(?P<type>[^:]+):(?P<url>.*)$", page["url"])
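            # For example, the official form quoted above,
            # "scm:git:git://github.com/openengsb/openengsb-framework.git",
            # yields type="git" and url="git://github.com/openengsb/openengsb-framework.git".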
if m_scm is not None:
scm_type = m_scm.group("type")
if scm_type in scm_types_ok:
scm_url = m_scm.group("url")
origin = ListedOrigin(
lister_id=self.lister_obj.id, url=scm_url, visit_type=scm_type,
)
yield origin
else:
if page["url"].endswith(".git"):
origin = ListedOrigin(
lister_id=self.lister_obj.id, url=page["url"], visit_type="git",
)
yield origin
else:
# Origin is a source archive:
last_update_dt = None
last_update_iso = ""
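            # The index export stores the modification time in milliseconds
            # (e.g. 1626109619335 in the example above), so the last three
            # digits are dropped to obtain a Unix timestamp in seconds.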
last_update_seconds = str(page["time"])[:-3]
try:
last_update_dt = datetime.fromtimestamp(int(last_update_seconds))
last_update_dt_tz = last_update_dt.astimezone(timezone.utc)
except OverflowError:
logger.warning("- Failed to convert datetime %s.", last_update_seconds)
if last_update_dt:
last_update_iso = last_update_dt_tz.isoformat()
origin = ListedOrigin(
lister_id=self.lister_obj.id,
url=page["url"],
visit_type=page["type"],
- last_update=last_update_dt,
+ last_update=last_update_dt_tz,
extra_loader_arguments={
"artifacts": [
{
"time": last_update_iso,
"gid": page["gid"],
"aid": page["aid"],
"version": page["version"],
"base_url": self.BASE_URL,
}
]
},
)
yield origin
def commit_page(self, page: RepoPage) -> None:
"""Update currently stored state using the latest listed doc.
Note: this is a noop for full listing mode
"""
if self.incremental and self.state:
# We need to differentiate the two state counters according
# to the type of origin.
if page["type"] == "maven" and page["doc"] > self.state.last_seen_doc:
self.state.last_seen_doc = page["doc"]
elif page["type"] == "scm" and page["doc"] > self.state.last_seen_pom:
self.state.last_seen_doc = page["doc"]
self.state.last_seen_pom = page["doc"]
def finalize(self) -> None:
"""Finalize the lister state, set update if any progress has been made.
Note: this is a noop for full listing mode
"""
if self.incremental and self.state:
last_seen_doc = self.state.last_seen_doc
last_seen_pom = self.state.last_seen_pom
scheduler_state = self.get_state_from_scheduler()
if last_seen_doc and last_seen_pom:
if (scheduler_state.last_seen_doc < last_seen_doc) or (
scheduler_state.last_seen_pom < last_seen_pom
):
self.updated = True
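For context on the incremental state handling in `commit_page()` and `finalize()` above, the following standalone sketch (with made-up doc ids) mirrors how the two counters advance:
```lang=python
# Minimal sketch mirroring the commit_page() logic above with illustrative
# doc ids; it only exercises the MavenListerState dataclass from this module.
from swh.lister.maven.lister import MavenListerState

state = MavenListerState()  # last_seen_doc=-1, last_seen_pom=-1
for page in ({"type": "maven", "doc": 3}, {"type": "scm", "doc": 7}):
    if page["type"] == "maven" and page["doc"] > state.last_seen_doc:
        state.last_seen_doc = page["doc"]
    elif page["type"] == "scm" and page["doc"] > state.last_seen_pom:
        state.last_seen_doc = page["doc"]
        state.last_seen_pom = page["doc"]
print(state)  # MavenListerState(last_seen_doc=7, last_seen_pom=7)
```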
