swh/loader/package/jar/loader.py
- This file was added.
# Copyright (C) 2019-2021 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

from datetime import datetime, timezone
import hashlib
import json
import logging
from os import path
import string
from typing import Any, Dict, Iterator, Mapping, Optional, Sequence, Tuple, Union
from urllib.parse import urlparse

import attr

from swh.loader.package.loader import (
    BasePackageInfo,
    PackageLoader,
    PartialExtID,
    RawExtrinsicMetadataCore,
)
from swh.loader.package.utils import release_name
from swh.model.model import (
    MetadataAuthority,
    MetadataAuthorityType,
    Person,
    Revision,
    RevisionType,
    Sha1Git,
    TimestampWithTimezone,
)
from swh.storage.interface import StorageInterface

logger = logging.getLogger(__name__)

SWH_PERSON = Person(
    name=b"Software Heritage",
    fullname=b"Software Heritage",
    email=b"robot@softwareheritage.org",
)

REVISION_MESSAGE = b"swh-loader-package: synthetic revision message"


@attr.s
class JarPackageInfo(BasePackageInfo):
    time = attr.ib(type=Union[str, datetime])
    """Timestamp of the jar file on the server"""
    raw_info = attr.ib(type=Dict[str, Any])
vlorentz: (not for gnu)
Are you sure versions can only have one package, and it will never change? I'd… | |||||
borisbaldassari (Done): Good point for the gnu! Thanks.
Yes, a single jar has a unique set of coordinates (gid, aid, version), and this is part of the maven spec, so I don't think that collisions could happen (or should be allowed). I didn't include the time or url because artifacts can be hosted in several places, and they should still be the same. We don't want to re-archive an artifact (gid, aid, version) that is already ingested or present elsewhere, even if the url has changed. And "time" is the publication time on the server, so it could change on a different server but would nevertheless be the same artifact.
And no, I've never seen an artifact with spaces inside the name. That should not happen.

vlorentz (Not Done):
> and they should still be the same
It does not look like we can be *sure* they are the same. And if they are not, we will miss some source code. It's better to load it twice than risk losing it. The result of the loader is deduplicated in the storage anyway.
Additionally, "$gid $aid $version" is specific to Maven (if I'm not mistaken), so it feels out of place for a generic JAR loader.

borisbaldassari (Done): Hum. I'm a bit lost here, and what you say has huge implications.
> It does not look like we…
I really can't imagine a situation where a source jar is republished with the same gid/aid/version and different content. This goes quite directly against the maven principle of unique coordinates, and if a security fix or a change is needed, then one must republish the jar with an incremented version. The only extremely rare case I can think of is an IP infringement, since it could threaten the hosting entity. Do you have any other situation in mind?
In other words, what information would you add or change in the extid? url, timestamp?
From the maven perspective, the vast majority of artifacts have a single url, which is where they are officially published by their author. People can still mirror any artifact on other repositories, but that would be quite rare. Adding the url to the extid would re-ingest them, which is ok but imho unnecessary since they'll be identical. Nobody wants to create a second artifact with the same coordinates, apart from local testing maybe.
Please bear in mind that I'm not arguing, I'm just trying to understand the requirements correctly.
Maybe I should have used 'maven' rather than 'jar' for the loader name then. It was really intended for the maven lister and ecosystem, and the gid/aid/version parameters, as it is currently, are actually required. Making it a generic jar loader would imply stripping this information (both in the constructor and in the extrinsic metadata), but then it would be closer to the 'archive' loader. The main difference between the archive loader and the jar loader is the metadata. It should be quite easy to add the jar extension to the archive loader, and I can take care of that. WDYT?

vlorentz (Not Done): Sorry, I missed the notification.
> I really can't imagine a situation where a source jar is…
Even on https://repo1.maven.org/maven2/ alone, I found a couple of POMs with the same (gid, aid, version). I can look in my logs to give you an example later, if you want. Additionally, I don't see any guarantee that (gid, aid, version) are unique across all repositories. For example, if Atlassian publishes an artifact with a given (gid, aid, version) in their own repository, what is stopping me from uploading a different one with the exact same triple to https://repo1.maven.org/maven2/ ?
I don't think we can assume this. At scale, if people can do something stupid, you can be sure at least one person will do it. For example, on PyPI we find a ton of packages with tarballs whose names do not match the package name, even though the tooling is supposed to prevent it.
At least the URL, yes. The timestamp would be great too, if possible.
I see. Renaming would be nice, then.
    gid = attr.ib(type=str)
    """Group ID of the maven artifact"""
    aid = attr.ib(type=str)
    """Artifact ID of the maven artifact"""
    version = attr.ib(type=str)
    """Version of the maven artifact"""

    # default format for maven artifacts
    MANIFEST_FORMAT = string.Template("$gid $aid $version")

    def extid(self, manifest_format: Optional[string.Template] = None) -> PartialExtID:
        """Returns a unique intrinsic identifier of this package info

        ``manifest_format`` allows overriding the class' default MANIFEST_FORMAT"""
        manifest_format = manifest_format or self.MANIFEST_FORMAT
        manifest = manifest_format.substitute(
            {"gid": self.gid, "aid": self.aid, "version": self.version}
        )
        return ("maven-jar", hashlib.sha256(manifest.encode()).digest())

    @classmethod
    def from_metadata(cls, a_metadata: Dict[str, Any]) -> "JarPackageInfo":
        url = a_metadata["url"]
        filename = a_metadata.get("filename")
        gid = a_metadata["gid"]
        aid = a_metadata["aid"]
        version = a_metadata["version"]
        meta = ({"gid": gid, "aid": aid, "version": version},)
        return cls(
            url=url,
            filename=filename if filename else path.split(url)[-1],
            raw_info=a_metadata,
            time=a_metadata["time"],
            gid=gid,
            aid=aid,
            version=version,
            directory_extrinsic_metadata=[
                RawExtrinsicMetadataCore(
                    format="maven-pom-json", metadata=json.dumps(meta).encode(),
                )
            ],
        )


class JarLoader(PackageLoader[JarPackageInfo]):
    """Load jar origin's artifact files into swh archive

    """

    visit_type = "jar"

    def __init__(
        self,
        storage: StorageInterface,
        url: str,
        artifacts: Sequence[Dict[str, Any]],
        extid_manifest_format: Optional[str] = None,
        max_content_size: Optional[int] = None,
        snapshot_append: bool = False,
    ):
        """Loader constructor.

        For now, this is the lister's task output.

        Args:
            url: Origin url
            artifacts: List of single artifact information with keys:

               - **time**: the timestamp of the jar file as an int,
                 in milliseconds
               - **url**: the artifact url, also used to derive the filename
                 when none is given
               - **filename**: optionally, the file's name
               - **gid**: artifact's groupId
               - **aid**: artifact's artifactId
               - **version**: artifact's version

            extid_manifest_format: template string used to format a manifest,
                which is hashed to get the extid of a package.
                Defaults to ``JarPackageInfo.MANIFEST_FORMAT``
                (``"$gid $aid $version"``)
            snapshot_append: if :const:`True`, append latest snapshot content to
                the new snapshot created by the loader

        """
        super().__init__(storage=storage, url=url, max_content_size=max_content_size)
        self.artifacts = artifacts  # assume order is enforced in the lister
        self.extid_manifest_format = (
            None
            if extid_manifest_format is None
            else string.Template(extid_manifest_format)
        )
        self.snapshot_append = snapshot_append

    def get_versions(self) -> Sequence[str]:
        versions = []
        for jar in self.artifacts:
            v = jar.get("version")
            if v:
                versions.append(v)
        return versions

    def get_default_version(self) -> str:
        # Returning the last item -- there should be only one version anyway.
        return self.artifacts[-1]["version"]

    def get_metadata_authority(self):
        p_url = urlparse(self.url)
        return MetadataAuthority(
            type=MetadataAuthorityType.FORGE,
            url=f"{p_url.scheme}://{p_url.netloc}/",
            metadata={},
        )

vlorentz (Done): This will crash if `p_info.time` is a `datetime`
borisbaldassari (Done): Right. Fixed.

    def get_package_info(self, version: str) -> Iterator[Tuple[str, JarPackageInfo]]:
        a_metadata = self.artifacts[0]
        yield release_name(a_metadata["version"]), JarPackageInfo.from_metadata(
            a_metadata
        )

    def build_revision(
        self, p_info: JarPackageInfo, uncompressed_path: str, directory: Sha1Git
    ) -> Optional[Revision]:
        time = p_info.time
        if isinstance(time, datetime):
            parsed_time = time
        else:  # assume it's a timestamp (in milliseconds)
            raw_time = int(str(p_info.time))
            parsed_time = datetime.fromtimestamp(raw_time / 1e3)
        parsed_time = parsed_time.astimezone(tz=timezone.utc)
        normalized_time = TimestampWithTimezone.from_datetime(parsed_time)
        return Revision(
            type=RevisionType.TAR,
            message=REVISION_MESSAGE,
            date=normalized_time,
            author=SWH_PERSON,
            committer=SWH_PERSON,
            committer_date=normalized_time,
            parents=(),
            directory=directory,
            synthetic=True,
        )

    def extra_branches(self) -> Dict[bytes, Mapping[str, Any]]:
        if not self.snapshot_append:
            return {}
        last_snapshot = self.last_snapshot()
        return last_snapshot.to_dict()["branches"] if last_snapshot else {}
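
The time handling in `build_revision` (accept either a `datetime` or a millisecond timestamp, then normalize to UTC) can be tried in isolation. The following is only a minimal sketch using the standard library: `normalize_time` is a hypothetical helper name, not part of the diff, and it assumes that any non-`datetime` value is a Unix timestamp in milliseconds.

```python
from datetime import datetime, timezone
from typing import Union


def normalize_time(time: Union[str, int, datetime]) -> datetime:
    """Return a timezone-aware UTC datetime (hypothetical helper)."""
    if isinstance(time, datetime):
        parsed = time
    else:
        # Assumption: anything else is a Unix timestamp in milliseconds,
        # given either as an int or as a string of digits.
        parsed = datetime.fromtimestamp(int(str(time)) / 1e3, tz=timezone.utc)
    # astimezone() interprets a naive datetime as local time, so both
    # branches end up as an aware datetime expressed in UTC.
    return parsed.astimezone(timezone.utc)


# Both input kinds go through the same function.
print(normalize_time("1626109619335").isoformat())
print(normalize_time(datetime(2021, 7, 12, tzinfo=timezone.utc)).isoformat())
```

Passing `tz=timezone.utc` to `fromtimestamp` avoids going through the machine's local timezone for the integer case; the result is the same instant either way.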
(not for gnu)
Are you sure versions can only have one package, and it will never change? I'd rather include $time either way. It should also include a way to identify the Maven instance this is from, or they could collide.
Are the gid, aid, and version guaranteed not to contain spaces? (if they do, they may collide too)
And you should change the EXTID type (e.g. "maven-jar")
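
To make the collision concern above concrete, here is a small sketch using only `hashlib` and `json`. The first function mirrors the `"$gid $aid $version"` manifest from the diff; the second is a hypothetical alternative (not what the diff implements) that also hashes the URL and time and serializes the fields as a JSON list, so that spaces inside a field cannot shift the field boundaries. All coordinates and URLs below are made-up examples.

```python
import hashlib
import json
from typing import Tuple


def space_joined_extid(gid: str, aid: str, version: str) -> bytes:
    """Hash of a space-separated manifest, as in "$gid $aid $version"."""
    manifest = f"{gid} {aid} {version}"
    return hashlib.sha256(manifest.encode()).digest()


def structured_extid(
    url: str, time: str, gid: str, aid: str, version: str
) -> Tuple[str, bytes]:
    """Hypothetical variant: include url and time, with unambiguous fields."""
    # A JSON list quotes every field, so "a b" + "c" and "a" + "b c"
    # serialize differently.
    manifest = json.dumps([url, time, gid, aid, version], separators=(",", ":"))
    return ("maven-jar", hashlib.sha256(manifest.encode()).digest())


# Two different (made-up) coordinates that produce the same joined manifest.
a = ("org.example", "core api", "1.0")
b = ("org.example core", "api", "1.0")
print(space_joined_extid(*a) == space_joined_extid(*b))

url, time = "https://repo.example.org/core-1.0-sources.jar", "1626109619335"
print(structured_extid(url, time, *a) == structured_extid(url, time, *b))
```

With the structured variant, the same coordinates served from two different URLs also get different extids, which is the "load it twice rather than risk losing it" trade-off discussed in the review.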