diff --git a/PKG-INFO b/PKG-INFO index 7c37e58..e04be35 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,71 +1,71 @@ Metadata-Version: 2.1 Name: swh.indexer -Version: 2.3.0 +Version: 2.4.0 Summary: Software Heritage Content Indexer Home-page: https://forge.softwareheritage.org/diffusion/78/ Author: Software Heritage developers Author-email: swh-devel@inria.fr Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-indexer Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-indexer/ Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing License-File: LICENSE License-File: AUTHORS swh-indexer ============ Tools to compute multiple indexes on SWH's raw contents: - content: - mimetype - ctags - language - fossology-license - metadata - revision: - metadata An indexer is in charge of: - looking up objects - extracting information from those objects - store those information in the swh-indexer db There are multiple indexers working on different object types: - content indexer: works with content sha1 hashes - revision indexer: works with revision sha1 hashes - origin indexer: works with origin identifiers Indexation procedure: - receive batch of ids - retrieve the associated data depending on object type - compute for that object some index - store the result to swh's storage Current content indexers: - mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype - language (queue swh_indexer_content_language): detect the programming language - ctags (queue swh_indexer_content_ctags): compute tags information - fossology-license (queue swh_indexer_fossology_license): compute the license - metadata: translate file into translated_metadata dict Current revision indexers: - metadata: detects files containing metadata and retrieves translated_metadata in content_metadata table in storage or run content indexer to translate files. diff --git a/docs/metadata-workflow.rst b/docs/metadata-workflow.rst index 299ea19..4d99106 100644 --- a/docs/metadata-workflow.rst +++ b/docs/metadata-workflow.rst @@ -1,274 +1,274 @@ Metadata workflow ================= Intrinsic metadata ------------------ Indexing :term:`intrinsic metadata` requires extracting information from the lowest levels of the :ref:`Merkle DAG ` (directories, files, and content blobs) and associate them to the highest ones (origins). In order to deduplicate the work between origins, we split this work between multiple indexers, which coordinate with each other and save their results at each step in the indexer storage. Indexer architecture ^^^^^^^^^^^^^^^^^^^^ .. thumbnail:: images/tasks-metadata-indexers.svg Origin-Head Indexer ^^^^^^^^^^^^^^^^^^^ First, the Origin-Head indexer gets called externally, with an origin as argument (or multiple origins, that are handled sequentially). For now, its tasks are scheduled manually via recurring Scheduler tasks; but in the near future, the :term:`journal` will be used to do that. 
It first looks up the last :term:`snapshot` and determines what the main
branch of the origin is (the "Head branch") and what revision it points to
(the "Head"). Intrinsic metadata for that origin will be extracted from that
revision.

It schedules a Directory Metadata Indexer task for the root directory of
that revision.

Directory and Content Metadata Indexers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These two indexers do the hard part of the work. The Directory Metadata
Indexer fetches the root directory associated with a revision, then extracts
the metadata from that directory.

To do so, it lists files in that directory, and looks for known names, such
as :file:`codemeta.json`, :file:`package.json`, or :file:`pom.xml`. If there
are any, it runs the Content Metadata Indexer on them, which in turn fetches
their contents and runs them through extraction dictionaries/mappings. See
below for details.

Their results are saved in a database (the indexer storage), associated with
the content and directory hashes.

Origin Metadata Indexer
^^^^^^^^^^^^^^^^^^^^^^^

The job of this indexer is very simple: it takes an origin identifier, uses
the Origin-Head and Directory indexers to get metadata from the head
directory of that origin, then copies the directory's metadata to a new
table to associate it with the origin.

The reason for this is to be able to perform searches on metadata, and
efficiently find out which origins matched the pattern. Running that search
on the ``directory_metadata`` table would require a costly reverse lookup
from directories to origins.

Translation from ecosystem-specific metadata to CodeMeta
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Intrinsic metadata is extracted from files provided with a project's source
code, and translated using `CodeMeta`_'s `crosswalk table`_.

All input formats supported so far are straightforward dictionaries (eg. JSON)
or can be accessed as such (eg. XML); and the first part of the translation is
to map their keys to a term in the CodeMeta vocabulary. This is done by
parsing the crosswalk table's `CSV file`_ and using it as a map between these
two vocabularies; this does not require any format-specific code in the
indexers.

The second part is to normalize values. As language-specific metadata files
each have their way(s) of formatting these values, we need to turn them into
the data type required by CodeMeta. This normalization makes up most of the
code of :py:mod:`swh.indexer.metadata_dictionary`.

.. _CodeMeta: https://codemeta.github.io/
.. _crosswalk table: https://codemeta.github.io/crosswalk/
.. _CSV file: https://github.com/codemeta/codemeta/blob/master/crosswalk.csv

Extrinsic metadata
------------------

The :term:`extrinsic metadata` indexer works very differently from the
:term:`intrinsic metadata` indexers we saw above. While the latter extract
metadata from software artefacts (files and directories) which are already a
core part of the archive, the former extracts such data from API calls pulled
from forges and package managers, or pushed via the :ref:`SWORD deposit `.

In order to preserve original information verbatim, Software Heritage itself
stores the result of these calls, independently of indexers, in its own
archive as described in the :ref:`extrinsic-metadata-specification`. In this
section, we assume this information is already present in the archive, but in
the "raw extrinsic metadata" form, which needs to be translated to a common
vocabulary to be useful, as with intrinsic metadata.
The common vocabulary we chose is JSON-LD, with both CodeMeta and
`ForgeFed's vocabulary`_ (including `ActivityStream's vocabulary`_).

.. _ForgeFed's vocabulary: https://forgefed.org/vocabulary.html
.. _ActivityStream's vocabulary: https://www.w3.org/TR/activitystreams-vocabulary/

Instead of the four-step architecture above, the extrinsic-metadata indexer
is standalone: it reads "raw extrinsic metadata" from the :ref:`swh-journal`,
and produces new indexed entries in the database as they come.

The caveat is that, while intrinsic metadata are always unambiguously
authoritative (they are contained by their own origin repository, therefore
they were added by the origin's "owners"), extrinsic metadata can be authored
by third parties. Support for third-party authorities is currently not
implemented for this reason, so extrinsic metadata is only indexed when
provided by the same forge/package-repository as the origin the metadata is
about. Metadata on non-origin objects (typically, directories) is also
ignored for this reason, for now.

Assuming the metadata was provided by such an authority, it is then passed to
metadata mappings, which are identified by a mimetype (or custom format name)
they declare, rather than by filenames.

Implementation status
---------------------

Supported intrinsic metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following sources of intrinsic metadata are supported:

* CodeMeta's `codemeta.json`_,
* Maven's `pom.xml`_,
* NPM's `package.json`_,
* Python's `PKG-INFO`_,
* Ruby's `.gemspec`_

.. _codemeta.json: https://codemeta.github.io/terms/
.. _pom.xml: https://maven.apache.org/pom.html
.. _package.json: https://docs.npmjs.com/files/package.json
.. _PKG-INFO: https://www.python.org/dev/peps/pep-0314/
.. _.gemspec: https://guides.rubygems.org/specification-reference/

Supported extrinsic metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following sources of extrinsic metadata are supported:

* GitHub's `"repo" API `__

Supported JSON-LD properties
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following terms may be found in the output of the metadata translation
(other than the `codemeta` mapping, which is the identity function, and
therefore supports all properties):

-.. program-output:: python3 -m swh.indexer.cli mapping list-terms --exclude-mapping codemeta
+.. program-output:: python3 -m swh.indexer.cli mapping list-terms --exclude-mapping codemeta --exclude-mapping json-sword-codemeta --exclude-mapping sword-codemeta
    :nostderr:

Tutorials
---------

The rest of this page consists of two tutorials: one to index
:term:`intrinsic metadata` (ie. from a file in a VCS or in a tarball), and
one to index :term:`extrinsic metadata` (ie. obtained via external means,
such as GitHub's or GitLab's APIs).

Adding support for additional ecosystem-specific intrinsic metadata
-------------------------------------------------------------------

This section will guide you through adding code to the metadata indexer to
detect and translate new metadata formats.

First, pick one of the `CodeMeta crosswalks`_. Then create a new file in
:file:`swh-indexer/swh/indexer/metadata_dictionary/`, which will contain
your code, and create a new class that inherits from helper classes, with
some documentation about your indexer:
.. code-block:: python

    from .base import DictMapping, SingleFileIntrinsicMapping
    from swh.indexer.codemeta import CROSSWALK_TABLE

    class MyMapping(DictMapping, SingleFileIntrinsicMapping):
        """Dedicated class for ..."""
        name = 'my-mapping'
        filename = b'the-filename'
        mapping = CROSSWALK_TABLE['Name of the CodeMeta crosswalk']

.. _CodeMeta crosswalks: https://github.com/codemeta/codemeta/tree/master/crosswalks

And reference it from :const:`swh.indexer.metadata_dictionary.INTRINSIC_MAPPINGS`.

Then, add a ``string_fields`` attribute, which is the list of all keys whose
values are simple text values. For instance, to `translate Python PKG-INFO`_,
it's:

.. code-block:: python

    string_fields = ['name', 'version', 'description', 'summary',
                     'author', 'author-email']

These values will be automatically added to the above list of supported terms.

.. _translate Python PKG-INFO: https://forge.softwareheritage.org/source/swh-indexer/browse/master/swh/indexer/metadata_dictionary/python.py

Last step to get your code working: add a ``translate`` method that takes a
single byte string as argument, turns it into a Python dictionary whose keys
are the ones of the input document, and passes it to ``_translate_dict``.
For instance, if the input document is in JSON, it can be as simple as:

.. code-block:: python

    def translate(self, raw_content):
        raw_content = raw_content.decode()  # bytes to str
        content_dict = json.loads(raw_content)  # str to dict
        return self._translate_dict(content_dict)  # convert to CodeMeta

``_translate_dict`` will do the heavy work of reading the crosswalk table for
each of ``string_fields``, reading the corresponding value in the
``content_dict``, and building a CodeMeta dictionary with the corresponding
names from the crosswalk table.

One last thing to run your code: add it to the list in
:file:`swh-indexer/swh/indexer/metadata_dictionary/__init__.py`, so the rest
of the code is aware of it.

Now, you can run it:

.. code-block:: shell

    python3 -m swh.indexer.metadata_dictionary MyMapping path/to/input/file

and it will (hopefully) return a CodeMeta object.

If it works, well done!

You can now improve your translation code further, by adding methods that do
more advanced conversion. For example, if there is a field named ``license``
containing an SPDX identifier, you must convert it to a URI, like this:

.. code-block:: python

    def normalize_license(self, s):
        if isinstance(s, str):
-           return {"@id": "https://spdx.org/licenses/" + s}
+           return rdflib.URIRef("https://spdx.org/licenses/" + s)

This method will automatically get called by ``_translate_dict`` when it
finds a ``license`` field in ``content_dict``.
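Putting these steps together, a complete mapping might look like the
following minimal sketch. It simply combines the snippets above; the class
name, ``filename``, crosswalk name, and field list are placeholders rather
than a mapping that actually ships with swh-indexer:

.. code-block:: python

    import json

    import rdflib

    from swh.indexer.codemeta import CROSSWALK_TABLE

    from .base import DictMapping, SingleFileIntrinsicMapping


    class MyMapping(DictMapping, SingleFileIntrinsicMapping):
        """Hypothetical mapping for metadata files named ``the-filename``."""

        name = 'my-mapping'
        filename = b'the-filename'
        mapping = CROSSWALK_TABLE['Name of the CodeMeta crosswalk']

        # Keys whose values are plain strings; they are translated
        # automatically using the crosswalk table.
        string_fields = ['name', 'version', 'description']

        def translate(self, raw_content):
            raw_content = raw_content.decode()  # bytes to str
            content_dict = json.loads(raw_content)  # str to dict
            return self._translate_dict(content_dict)  # convert to CodeMeta

        def normalize_license(self, s):
            # Turn an SPDX identifier into a URI node.
            if isinstance(s, str):
                return rdflib.URIRef("https://spdx.org/licenses/" + s)

If the input format is JSON or YAML, it may be simpler to inherit from
``JsonMapping`` or ``YamlMapping`` (see :file:`metadata_dictionary/base.py`),
which already provide this ``translate`` boilerplate.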
Adding support for additional ecosystem-specific extrinsic metadata ------------------------------------------------------------------- [this section is a work in progress] diff --git a/mypy.ini b/mypy.ini index 0df07a7..d63e789 100644 --- a/mypy.ini +++ b/mypy.ini @@ -1,30 +1,33 @@ [mypy] namespace_packages = True warn_unused_ignores = True # 3rd party libraries without stubs (yet) [mypy-celery.*] ignore_missing_imports = True [mypy-confluent_kafka.*] ignore_missing_imports = True [mypy-magic.*] ignore_missing_imports = True [mypy-pkg_resources.*] ignore_missing_imports = True [mypy-psycopg2.*] ignore_missing_imports = True [mypy-pyld.*] ignore_missing_imports = True [mypy-pytest.*] ignore_missing_imports = True +[mypy-rdflib.*] +ignore_missing_imports = True + [mypy-xmltodict.*] ignore_missing_imports = True diff --git a/requirements.txt b/requirements.txt index d9532ee..4dd61a2 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,10 +1,11 @@ python-magic >= 0.4.13 click # frozendict: dependency of pyld # the version 2.1.2 is causing segmentation faults # cf https://forge.softwareheritage.org/T3815 frozendict != 2.1.2 pyld +rdflib sentry-sdk typing-extensions xmltodict diff --git a/swh.indexer.egg-info/PKG-INFO b/swh.indexer.egg-info/PKG-INFO index 7c37e58..e04be35 100644 --- a/swh.indexer.egg-info/PKG-INFO +++ b/swh.indexer.egg-info/PKG-INFO @@ -1,71 +1,71 @@ Metadata-Version: 2.1 Name: swh.indexer -Version: 2.3.0 +Version: 2.4.0 Summary: Software Heritage Content Indexer Home-page: https://forge.softwareheritage.org/diffusion/78/ Author: Software Heritage developers Author-email: swh-devel@inria.fr Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-indexer Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-indexer/ Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing License-File: LICENSE License-File: AUTHORS swh-indexer ============ Tools to compute multiple indexes on SWH's raw contents: - content: - mimetype - ctags - language - fossology-license - metadata - revision: - metadata An indexer is in charge of: - looking up objects - extracting information from those objects - store those information in the swh-indexer db There are multiple indexers working on different object types: - content indexer: works with content sha1 hashes - revision indexer: works with revision sha1 hashes - origin indexer: works with origin identifiers Indexation procedure: - receive batch of ids - retrieve the associated data depending on object type - compute for that object some index - store the result to swh's storage Current content indexers: - mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype - language (queue swh_indexer_content_language): detect the programming language - ctags (queue swh_indexer_content_ctags): compute tags information - fossology-license (queue swh_indexer_fossology_license): compute the license - metadata: translate file into translated_metadata dict Current revision indexers: - metadata: detects files containing metadata and retrieves translated_metadata in content_metadata 
table in storage or run content indexer to translate files. diff --git a/swh.indexer.egg-info/SOURCES.txt b/swh.indexer.egg-info/SOURCES.txt index 5c309da..3dddac5 100644 --- a/swh.indexer.egg-info/SOURCES.txt +++ b/swh.indexer.egg-info/SOURCES.txt @@ -1,161 +1,166 @@ .git-blame-ignore-revs .gitignore .pre-commit-config.yaml AUTHORS CODE_OF_CONDUCT.md CONTRIBUTORS LICENSE MANIFEST.in Makefile Makefile.local README.md codemeta.json conftest.py mypy.ini pyproject.toml pytest.ini requirements-swh.txt requirements-test.txt requirements.txt setup.cfg setup.py tox.ini docs/.gitignore docs/Makefile docs/Makefile.local docs/README.md docs/cli.rst docs/conf.py docs/dev-info.rst docs/index.rst docs/metadata-workflow.rst docs/_static/.placeholder docs/_templates/.placeholder docs/images/.gitignore docs/images/Makefile docs/images/tasks-metadata-indexers.uml sql/bin/db-upgrade sql/bin/dot_add_content sql/doc/json sql/doc/json/.gitignore sql/doc/json/Makefile sql/doc/json/indexer_configuration.tool_configuration.schema.json sql/doc/json/revision_metadata.translated_metadata.json sql/json/.gitignore sql/json/Makefile sql/json/indexer_configuration.tool_configuration.schema.json sql/json/revision_metadata.translated_metadata.json swh/__init__.py swh.indexer.egg-info/PKG-INFO swh.indexer.egg-info/SOURCES.txt swh.indexer.egg-info/dependency_links.txt swh.indexer.egg-info/entry_points.txt swh.indexer.egg-info/requires.txt swh.indexer.egg-info/top_level.txt swh/indexer/__init__.py swh/indexer/cli.py swh/indexer/codemeta.py swh/indexer/fossology_license.py swh/indexer/indexer.py swh/indexer/journal_client.py swh/indexer/metadata.py swh/indexer/metadata_detector.py swh/indexer/mimetype.py +swh/indexer/namespaces.py swh/indexer/origin_head.py swh/indexer/py.typed swh/indexer/rehash.py swh/indexer/tasks.py swh/indexer/data/composer.csv +swh/indexer/data/nuget.csv swh/indexer/data/pubspec.csv swh/indexer/data/codemeta/CITATION swh/indexer/data/codemeta/LICENSE swh/indexer/data/codemeta/codemeta.jsonld swh/indexer/data/codemeta/crosswalk.csv swh/indexer/metadata_dictionary/__init__.py swh/indexer/metadata_dictionary/base.py swh/indexer/metadata_dictionary/cff.py swh/indexer/metadata_dictionary/codemeta.py swh/indexer/metadata_dictionary/composer.py swh/indexer/metadata_dictionary/dart.py swh/indexer/metadata_dictionary/github.py swh/indexer/metadata_dictionary/maven.py swh/indexer/metadata_dictionary/npm.py +swh/indexer/metadata_dictionary/nuget.py swh/indexer/metadata_dictionary/python.py swh/indexer/metadata_dictionary/ruby.py +swh/indexer/metadata_dictionary/utils.py swh/indexer/sql/10-superuser-init.sql swh/indexer/sql/20-enums.sql swh/indexer/sql/30-schema.sql swh/indexer/sql/50-data.sql swh/indexer/sql/50-func.sql swh/indexer/sql/60-indexes.sql swh/indexer/sql/upgrades/115.sql swh/indexer/sql/upgrades/116.sql swh/indexer/sql/upgrades/117.sql swh/indexer/sql/upgrades/118.sql swh/indexer/sql/upgrades/119.sql swh/indexer/sql/upgrades/120.sql swh/indexer/sql/upgrades/121.sql swh/indexer/sql/upgrades/122.sql swh/indexer/sql/upgrades/123.sql swh/indexer/sql/upgrades/124.sql swh/indexer/sql/upgrades/125.sql swh/indexer/sql/upgrades/126.sql swh/indexer/sql/upgrades/127.sql swh/indexer/sql/upgrades/128.sql swh/indexer/sql/upgrades/129.sql swh/indexer/sql/upgrades/130.sql swh/indexer/sql/upgrades/131.sql swh/indexer/sql/upgrades/132.sql swh/indexer/sql/upgrades/133.sql swh/indexer/sql/upgrades/134.sql swh/indexer/sql/upgrades/135.sql swh/indexer/storage/__init__.py swh/indexer/storage/converters.py 
swh/indexer/storage/db.py swh/indexer/storage/exc.py swh/indexer/storage/in_memory.py swh/indexer/storage/interface.py swh/indexer/storage/metrics.py swh/indexer/storage/model.py swh/indexer/storage/writer.py swh/indexer/storage/api/__init__.py swh/indexer/storage/api/client.py swh/indexer/storage/api/serializers.py swh/indexer/storage/api/server.py swh/indexer/tests/__init__.py swh/indexer/tests/conftest.py swh/indexer/tests/tasks.py swh/indexer/tests/test_cli.py swh/indexer/tests/test_codemeta.py swh/indexer/tests/test_fossology_license.py swh/indexer/tests/test_indexer.py swh/indexer/tests/test_journal_client.py swh/indexer/tests/test_metadata.py swh/indexer/tests/test_mimetype.py swh/indexer/tests/test_origin_head.py swh/indexer/tests/test_origin_metadata.py swh/indexer/tests/utils.py swh/indexer/tests/metadata_dictionary/__init__.py swh/indexer/tests/metadata_dictionary/test_cff.py swh/indexer/tests/metadata_dictionary/test_codemeta.py swh/indexer/tests/metadata_dictionary/test_composer.py swh/indexer/tests/metadata_dictionary/test_dart.py swh/indexer/tests/metadata_dictionary/test_github.py swh/indexer/tests/metadata_dictionary/test_maven.py swh/indexer/tests/metadata_dictionary/test_npm.py +swh/indexer/tests/metadata_dictionary/test_nuget.py swh/indexer/tests/metadata_dictionary/test_python.py swh/indexer/tests/metadata_dictionary/test_ruby.py swh/indexer/tests/storage/__init__.py swh/indexer/tests/storage/conftest.py swh/indexer/tests/storage/generate_data_test.py swh/indexer/tests/storage/test_api_client.py swh/indexer/tests/storage/test_converters.py swh/indexer/tests/storage/test_in_memory.py swh/indexer/tests/storage/test_init.py swh/indexer/tests/storage/test_metrics.py swh/indexer/tests/storage/test_model.py swh/indexer/tests/storage/test_server.py swh/indexer/tests/storage/test_storage.py swh/indexer/tests/zz_celery/README swh/indexer/tests/zz_celery/__init__.py swh/indexer/tests/zz_celery/test_tasks.py \ No newline at end of file diff --git a/swh.indexer.egg-info/requires.txt b/swh.indexer.egg-info/requires.txt index a418f0b..462c191 100644 --- a/swh.indexer.egg-info/requires.txt +++ b/swh.indexer.egg-info/requires.txt @@ -1,23 +1,24 @@ python-magic>=0.4.13 click frozendict!=2.1.2 pyld +rdflib sentry-sdk typing-extensions xmltodict swh.core[db,http]>=2.9 swh.model>=0.0.15 swh.objstorage>=0.2.2 swh.scheduler>=0.5.2 swh.storage>=0.22.0 swh.journal>=0.1.0 [testing] confluent-kafka hypothesis>=3.11.0 pytest pytest-mock swh.scheduler[testing]>=0.5.0 swh.storage[testing]>=0.10.0 types-click types-pyyaml diff --git a/swh/indexer/codemeta.py b/swh/indexer/codemeta.py index 6c4ef58..f1d00b1 100644 --- a/swh/indexer/codemeta.py +++ b/swh/indexer/codemeta.py @@ -1,220 +1,189 @@ -# Copyright (C) 2018 The Software Heritage developers +# Copyright (C) 2018-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import collections import csv import itertools import json import os.path import re from typing import Any, List from pyld import jsonld +import rdflib import swh.indexer +from swh.indexer.namespaces import ACTIVITYSTREAMS, CODEMETA, FORGEFED, SCHEMA _DATA_DIR = os.path.join(os.path.dirname(swh.indexer.__file__), "data") CROSSWALK_TABLE_PATH = os.path.join(_DATA_DIR, "codemeta", "crosswalk.csv") CODEMETA_CONTEXT_PATH = os.path.join(_DATA_DIR, "codemeta", "codemeta.jsonld") with open(CODEMETA_CONTEXT_PATH) as 
fd: CODEMETA_CONTEXT = json.load(fd) _EMPTY_PROCESSED_CONTEXT: Any = {"mappings": {}} _PROCESSED_CODEMETA_CONTEXT = jsonld.JsonLdProcessor().process_context( _EMPTY_PROCESSED_CONTEXT, CODEMETA_CONTEXT, None ) CODEMETA_CONTEXT_URL = "https://doi.org/10.5063/schema/codemeta-2.0" CODEMETA_ALTERNATE_CONTEXT_URLS = { ("https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld") } -CODEMETA_URI = "https://codemeta.github.io/terms/" -SCHEMA_URI = "http://schema.org/" -FORGEFED_URI = "https://forgefed.org/ns#" -ACTIVITYSTREAMS_URI = "https://www.w3.org/ns/activitystreams#" PROPERTY_BLACKLIST = { # CodeMeta properties that we cannot properly represent. - SCHEMA_URI + "softwareRequirements", - CODEMETA_URI + "softwareSuggestions", + SCHEMA.softwareRequirements, + CODEMETA.softwareSuggestions, # Duplicate of 'author' - SCHEMA_URI + "creator", + SCHEMA.creator, } _codemeta_field_separator = re.compile(r"\s*[,/]\s*") def make_absolute_uri(local_name): """Parses codemeta.jsonld, and returns the @id of terms it defines. >>> make_absolute_uri("name") 'http://schema.org/name' >>> make_absolute_uri("downloadUrl") 'http://schema.org/downloadUrl' >>> make_absolute_uri("referencePublication") 'https://codemeta.github.io/terms/referencePublication' """ uri = jsonld.JsonLdProcessor.get_context_value( _PROCESSED_CODEMETA_CONTEXT, local_name, "@id" ) - assert uri.startswith(("@", CODEMETA_URI, SCHEMA_URI)), (local_name, uri) + assert uri.startswith(("@", CODEMETA, SCHEMA)), (local_name, uri) return uri def _read_crosstable(fd): reader = csv.reader(fd) try: header = next(reader) except StopIteration: raise ValueError("empty file") data_sources = set(header) - {"Parent Type", "Property", "Type", "Description"} codemeta_translation = {data_source: {} for data_source in data_sources} terms = set() for line in reader: # For each canonical name local_name = dict(zip(header, line))["Property"] if not local_name: continue canonical_name = make_absolute_uri(local_name) - if canonical_name in PROPERTY_BLACKLIST: + if rdflib.URIRef(canonical_name) in PROPERTY_BLACKLIST: continue terms.add(canonical_name) for (col, value) in zip(header, line): # For each cell in the row if col in data_sources: # If that's not the parentType/property/type/description for local_name in _codemeta_field_separator.split(value): # For each of the data source's properties that maps # to this canonical name if local_name.strip(): - codemeta_translation[col][local_name.strip()] = canonical_name + codemeta_translation[col][local_name.strip()] = rdflib.URIRef( + canonical_name + ) return (terms, codemeta_translation) with open(CROSSWALK_TABLE_PATH) as fd: (CODEMETA_TERMS, CROSSWALK_TABLE) = _read_crosstable(fd) def _document_loader(url, options=None): """Document loader for pyld. Reads the local codemeta.jsonld file instead of fetching it from the Internet every single time.""" - if url == CODEMETA_CONTEXT_URL or url in CODEMETA_ALTERNATE_CONTEXT_URLS: + if ( + url.lower() == CODEMETA_CONTEXT_URL.lower() + or url in CODEMETA_ALTERNATE_CONTEXT_URLS + ): return { "contextUrl": None, "documentUrl": url, "document": CODEMETA_CONTEXT, } - elif url == CODEMETA_URI: + elif url == CODEMETA: raise Exception( "{} is CodeMeta's URI, use {} as context url".format( - CODEMETA_URI, CODEMETA_CONTEXT_URL + CODEMETA, CODEMETA_CONTEXT_URL ) ) else: raise Exception(url) def compact(doc, forgefed: bool): """Same as `pyld.jsonld.compact`, but in the context of CodeMeta. Args: forgefed: Whether to add ForgeFed and ActivityStreams as compact URIs. 
This is typically used for extrinsic metadata documents, which frequently use properties from these namespaces. """ contexts: List[Any] = [CODEMETA_CONTEXT_URL] if forgefed: - contexts.append({"as": ACTIVITYSTREAMS_URI, "forge": FORGEFED_URI}) + contexts.append({"as": str(ACTIVITYSTREAMS), "forge": str(FORGEFED)}) return jsonld.compact(doc, contexts, options={"documentLoader": _document_loader}) def expand(doc): """Same as `pyld.jsonld.expand`, but in the context of CodeMeta.""" return jsonld.expand(doc, options={"documentLoader": _document_loader}) -def merge_values(v1, v2): - """If v1 and v2 are of the form `{"@list": l1}` and `{"@list": l2}`, - returns `{"@list": l1 + l2}`. - Otherwise, make them lists (if they are not already) and concatenate - them. - - >>> merge_values('a', 'b') - ['a', 'b'] - >>> merge_values(['a', 'b'], 'c') - ['a', 'b', 'c'] - >>> merge_values({'@list': ['a', 'b']}, {'@list': ['c']}) - {'@list': ['a', 'b', 'c']} - """ - if v1 is None: - return v2 - elif v2 is None: - return v1 - elif isinstance(v1, dict) and set(v1) == {"@list"}: - assert isinstance(v1["@list"], list) - if isinstance(v2, dict) and set(v2) == {"@list"}: - assert isinstance(v2["@list"], list) - return {"@list": v1["@list"] + v2["@list"]} - else: - raise ValueError("Cannot merge %r and %r" % (v1, v2)) - else: - if isinstance(v2, dict) and "@list" in v2: - raise ValueError("Cannot merge %r and %r" % (v1, v2)) - if not isinstance(v1, list): - v1 = [v1] - if not isinstance(v2, list): - v2 = [v2] - return v1 + v2 - - def merge_documents(documents): """Takes a list of metadata dicts, each generated from a different metadata file, and merges them. Removes duplicates, if any.""" documents = list(itertools.chain.from_iterable(map(expand, documents))) merged_document = collections.defaultdict(list) for document in documents: for (key, values) in document.items(): if key == "@id": # @id does not get expanded to a list value = values # Only one @id is allowed, move it to sameAs if "@id" not in merged_document: merged_document["@id"] = value elif value != merged_document["@id"]: - if value not in merged_document[SCHEMA_URI + "sameAs"]: - merged_document[SCHEMA_URI + "sameAs"].append(value) + if value not in merged_document[SCHEMA.sameAs]: + merged_document[SCHEMA.sameAs].append(value) else: for value in values: if isinstance(value, dict) and set(value) == {"@list"}: # Value is of the form {'@list': [item1, item2]} # instead of the usual [item1, item2]. # We need to merge the inner lists (and mostly # preserve order). merged_value = merged_document.setdefault(key, {"@list": []}) for subvalue in value["@list"]: # merged_value must be of the form # {'@list': [item1, item2]}; as it is the same # type as value, which is an @list. if subvalue not in merged_value["@list"]: merged_value["@list"].append(subvalue) elif value not in merged_document[key]: merged_document[key].append(value) # XXX: we should set forgefed=True when merging extrinsic_metadata documents. 
# however, this function is only used to merge multiple files of the same # directory (which is only for intrinsic-metadata), so it is not an issue for now return compact(merged_document, forgefed=False) diff --git a/swh/indexer/data/nuget.csv b/swh/indexer/data/nuget.csv new file mode 100644 index 0000000..2155f10 --- /dev/null +++ b/swh/indexer/data/nuget.csv @@ -0,0 +1,68 @@ +Property,NuGet +codeRepository,repository.url +programmingLanguage, +runtimePlatform, +targetProduct, +applicationCategory, +applicationSubCategory, +downloadUrl, +fileSize, +installUrl, +memoryRequirements, +operatingSystem, +permissions, +processorRequirements, +releaseNotes,releaseNotes +softwareHelp, +softwareRequirements, +softwareVersion, +storageRequirements, +supportingData, +author,authors +citation, +contributor, +copyrightHolder, +copyrightYear, +dateCreated, +dateModified, +datePublished, +editor, +encoding, +fileFormat, +funder, +keywords,tags +license,license/licenseUrl +producer, +provider, +publisher, +sponsor, +version,version +isAccessibleForFree, +isPartOf, +hasPart, +position, +description,description/summary +identifier, +name,name +sameAs, +url,projectUrl +relatedLink, +givenName, +familyName, +email, +affiliation, +identifier,id +name, +address, +type, +id, +softwareSuggestions, +maintainer, +contIntegration, +buildInstructions, +developmentStatus, +embargoDate, +funding, +issueTracker, +referencePublication, +readme, diff --git a/swh/indexer/metadata_dictionary/__init__.py b/swh/indexer/metadata_dictionary/__init__.py index 2d67c15..99c2504 100644 --- a/swh/indexer/metadata_dictionary/__init__.py +++ b/swh/indexer/metadata_dictionary/__init__.py @@ -1,56 +1,59 @@ # Copyright (C) 2017-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import collections from typing import Dict, Type import click -from . import cff, codemeta, composer, dart, github, maven, npm, python, ruby +from . 
import cff, codemeta, composer, dart, github, maven, npm, nuget, python, ruby from .base import BaseExtrinsicMapping, BaseIntrinsicMapping, BaseMapping INTRINSIC_MAPPINGS: Dict[str, Type[BaseIntrinsicMapping]] = { "CffMapping": cff.CffMapping, "CodemetaMapping": codemeta.CodemetaMapping, "GemspecMapping": ruby.GemspecMapping, "MavenMapping": maven.MavenMapping, "NpmMapping": npm.NpmMapping, "PubMapping": dart.PubspecMapping, "PythonPkginfoMapping": python.PythonPkginfoMapping, "ComposerMapping": composer.ComposerMapping, + "NuGetMapping": nuget.NuGetMapping, } EXTRINSIC_MAPPINGS: Dict[str, Type[BaseExtrinsicMapping]] = { "GitHubMapping": github.GitHubMapping, + "JsonSwordCodemetaMapping": codemeta.JsonSwordCodemetaMapping, + "SwordCodemetaMapping": codemeta.SwordCodemetaMapping, } MAPPINGS: Dict[str, Type[BaseMapping]] = {**INTRINSIC_MAPPINGS, **EXTRINSIC_MAPPINGS} def list_terms(): """Returns a dictionary with all supported CodeMeta terms as keys, and the mappings that support each of them as values.""" d = collections.defaultdict(set) for mapping in MAPPINGS.values(): for term in mapping.supported_terms(): d[term].add(mapping) return d @click.command() @click.argument("mapping_name") @click.argument("file_name") def main(mapping_name: str, file_name: str): from pprint import pprint with open(file_name, "rb") as fd: file_content = fd.read() res = MAPPINGS[mapping_name]().translate(file_content) pprint(res) if __name__ == "__main__": main() diff --git a/swh/indexer/metadata_dictionary/base.py b/swh/indexer/metadata_dictionary/base.py index 2ac4adc..657c6a4 100644 --- a/swh/indexer/metadata_dictionary/base.py +++ b/swh/indexer/metadata_dictionary/base.py @@ -1,270 +1,347 @@ # Copyright (C) 2017-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import json import logging from typing import Any, Callable, Dict, List, Optional, Tuple, TypeVar +import uuid +import xml.parsers.expat +from pyld import jsonld +import rdflib from typing_extensions import TypedDict +import xmltodict import yaml -from swh.indexer.codemeta import SCHEMA_URI, compact, merge_values +from swh.indexer.codemeta import _document_loader, compact +from swh.indexer.namespaces import RDF, SCHEMA from swh.indexer.storage.interface import Sha1 class DirectoryLsEntry(TypedDict): target: Sha1 sha1: Sha1 name: bytes type: str TTranslateCallable = TypeVar( - "TTranslateCallable", bound=Callable[[Any, Dict[str, Any], Any], None] + "TTranslateCallable", + bound=Callable[[Any, rdflib.Graph, rdflib.term.BNode, Any], None], ) -def produce_terms( - namespace: str, terms: List[str] -) -> Callable[[TTranslateCallable], TTranslateCallable]: +def produce_terms(*uris: str) -> Callable[[TTranslateCallable], TTranslateCallable]: """Returns a decorator that marks the decorated function as adding the given terms to the ``translated_metadata`` dict""" def decorator(f: TTranslateCallable) -> TTranslateCallable: if not hasattr(f, "produced_terms"): f.produced_terms = [] # type: ignore - f.produced_terms.extend(namespace + term for term in terms) # type: ignore + f.produced_terms.extend(uris) # type: ignore return f return decorator class BaseMapping: """Base class for :class:`BaseExtrinsicMapping` and :class:`BaseIntrinsicMapping`, not to be inherited directly.""" def __init__(self, log_suffix=""): self.log_suffix = log_suffix self.log = logging.getLogger( "%s.%s" % 
(self.__class__.__module__, self.__class__.__name__) ) @property def name(self): """A name of this mapping, used as an identifier in the indexer storage.""" raise NotImplementedError(f"{self.__class__.__name__}.name") - def translate(self, file_content: bytes) -> Optional[Dict]: - """Translates metadata, from the content of a file or of a RawExtrinsicMetadata - object.""" + def translate(self, raw_content: bytes) -> Optional[Dict]: + """ + Translates content by parsing content from a bytestring containing + mapping-specific data and translating with the appropriate mapping + to JSON-LD using the Codemeta and ForgeFed vocabularies. + + Args: + raw_content: raw content to translate + + Returns: + translated metadata in JSON friendly form needed for the content + if parseable, :const:`None` otherwise. + + """ raise NotImplementedError(f"{self.__class__.__name__}.translate") def normalize_translation(self, metadata: Dict[str, Any]) -> Dict[str, Any]: raise NotImplementedError(f"{self.__class__.__name__}.normalize_translation") class BaseExtrinsicMapping(BaseMapping): """Base class for extrinsic_metadata mappings to inherit from To implement a new mapping: - inherit this class - override translate function """ @classmethod def extrinsic_metadata_formats(cls) -> Tuple[str, ...]: """ Returns the list of extrinsic metadata formats which can be translated by this mapping """ raise NotImplementedError(f"{cls.__name__}.extrinsic_metadata_formats") def normalize_translation(self, metadata: Dict[str, Any]) -> Dict[str, Any]: return compact(metadata, forgefed=True) class BaseIntrinsicMapping(BaseMapping): """Base class for intrinsic-metadata mappings to inherit from To implement a new mapping: - inherit this class - override translate function """ @classmethod def detect_metadata_files(cls, file_entries: List[DirectoryLsEntry]) -> List[Sha1]: """ Returns the sha1 hashes of files which can be translated by this mapping """ raise NotImplementedError(f"{cls.__name__}.detect_metadata_files") def normalize_translation(self, metadata: Dict[str, Any]) -> Dict[str, Any]: return compact(metadata, forgefed=False) class SingleFileIntrinsicMapping(BaseIntrinsicMapping): """Base class for all intrinsic metadata mappings that use a single file as input.""" @property def filename(self): """The .json file to extract metadata from.""" raise NotImplementedError(f"{self.__class__.__name__}.filename") @classmethod def detect_metadata_files(cls, file_entries: List[DirectoryLsEntry]) -> List[Sha1]: for entry in file_entries: if entry["name"].lower() == cls.filename: return [entry["sha1"]] return [] class DictMapping(BaseMapping): """Base class for mappings that take as input a file that is mostly a key-value store (eg. 
a shallow JSON dict).""" - string_fields = [] # type: List[str] + string_fields: List[str] = [] """List of fields that are simple strings, and don't need any normalization.""" + uri_fields: List[str] = [] + """List of fields that are simple URIs, and don't need any + normalization.""" + @property def mapping(self): """A translation dict to map dict keys into a canonical name.""" raise NotImplementedError(f"{self.__class__.__name__}.mapping") @staticmethod def _normalize_method_name(name: str) -> str: return name.replace("-", "_") @classmethod def supported_terms(cls): # one-to-one mapping from the original key to a CodeMeta term simple_terms = { - term + str(term) for (key, term) in cls.mapping.items() - if key in cls.string_fields + if key in cls.string_fields + cls.uri_fields or hasattr(cls, "normalize_" + cls._normalize_method_name(key)) } # more complex mapping from the original key to JSON-LD complex_terms = { - term + str(term) for meth_name in dir(cls) if meth_name.startswith("translate_") for term in getattr(getattr(cls, meth_name), "produced_terms", []) } return simple_terms | complex_terms - def _translate_dict( - self, content_dict: Dict, *, normalize: bool = True - ) -> Dict[str, str]: + def _translate_dict(self, content_dict: Dict) -> Dict[str, Any]: """ Translates content by parsing content from a dict object and translating with the appropriate mapping Args: content_dict (dict): content dict to translate Returns: dict: translated metadata in json-friendly form needed for the indexer """ - translated_metadata = {"@type": SCHEMA_URI + "SoftwareSourceCode"} + graph = rdflib.Graph() + + # The main object being described (the SoftwareSourceCode) does not necessarily + # may or may not have an id. + # Either way, we temporarily use this URI to identify it. 
Unfortunately, + # we cannot use a blank node as we need to use it for JSON-LD framing later, + # and blank nodes cannot be used for framing in JSON-LD >= 1.1 + root_id = ( + "https://www.softwareheritage.org/schema/2022/indexer/tmp-node/" + + str(uuid.uuid4()) + ) + root = rdflib.URIRef(root_id) + graph.add((root, RDF.type, SCHEMA.SoftwareSourceCode)) + for k, v in content_dict.items(): # First, check if there is a specific translation # method for this key translation_method = getattr( self, "translate_" + self._normalize_method_name(k), None ) if translation_method: - translation_method(translated_metadata, v) + translation_method(graph, root, v) elif k in self.mapping: # if there is no method, but the key is known from the # crosswalk table codemeta_key = self.mapping[k] - # if there is a normalization method, use it on the value + # if there is a normalization method, use it on the value, + # and add its results to the triples normalization_method = getattr( self, "normalize_" + self._normalize_method_name(k), None ) if normalization_method: v = normalization_method(v) + if v is None: + pass + elif isinstance(v, list): + for item in reversed(v): + graph.add((root, codemeta_key, item)) + else: + graph.add((root, codemeta_key, v)) elif k in self.string_fields and isinstance(v, str): - pass + graph.add((root, codemeta_key, rdflib.Literal(v))) elif k in self.string_fields and isinstance(v, list): - v = [x for x in v if isinstance(x, str)] + for item in v: + graph.add((root, codemeta_key, rdflib.Literal(item))) + elif k in self.uri_fields and isinstance(v, str): + graph.add((root, codemeta_key, rdflib.URIRef(v))) + elif k in self.uri_fields and isinstance(v, list): + for item in v: + graph.add((root, codemeta_key, rdflib.URIRef(item))) else: continue - # set the translation metadata with the normalized value - if codemeta_key in translated_metadata: - translated_metadata[codemeta_key] = merge_values( - translated_metadata[codemeta_key], v - ) - else: - translated_metadata[codemeta_key] = v + self.extra_translation(graph, root, content_dict) - if normalize: - return self.normalize_translation(translated_metadata) - else: - return translated_metadata + # Convert from rdflib's internal graph representation to JSON + s = graph.serialize(format="application/ld+json") + # Load from JSON to a list of Python objects + jsonld_graph = json.loads(s) -class JsonMapping(DictMapping): - """Base class for all mappings that use JSON data as input.""" + # Use JSON-LD framing to turn the graph into a rooted tree + # frame = {"@type": str(SCHEMA.SoftwareSourceCode)} + translated_metadata = jsonld.frame( + jsonld_graph, + {"@id": root_id}, + options={ + "documentLoader": _document_loader, + "processingMode": "json-ld-1.1", + }, + ) - def translate(self, raw_content: bytes) -> Optional[Dict]: + # Remove the temporary id we added at the beginning + if isinstance(translated_metadata["@id"], list): + translated_metadata["@id"].remove(root_id) + else: + del translated_metadata["@id"] + + return self.normalize_translation(translated_metadata) + + def extra_translation( + self, graph: rdflib.Graph, root: rdflib.term.Node, d: Dict[str, Any] + ): + """Called at the end of the translation process, and may add arbitrary triples + to ``graph`` based on the input dictionary (passed as ``d``). 
""" - Translates content by parsing content from a bytestring containing - json data and translating with the appropriate mapping + pass - Args: - raw_content (bytes): raw content to translate - Returns: - dict: translated metadata in json-friendly form needed for - the indexer +class JsonMapping(DictMapping): + """Base class for all mappings that use JSON data as input.""" - """ + def translate(self, raw_content: bytes) -> Optional[Dict]: try: raw_content_string: str = raw_content.decode() except UnicodeDecodeError: self.log.warning("Error unidecoding from %s", self.log_suffix) return None try: content_dict = json.loads(raw_content_string) except json.JSONDecodeError: self.log.warning("Error unjsoning from %s", self.log_suffix) return None if isinstance(content_dict, dict): return self._translate_dict(content_dict) return None +class XmlMapping(DictMapping): + """Base class for all mappings that use XML data as input.""" + + def translate(self, raw_content: bytes) -> Optional[Dict]: + try: + d = xmltodict.parse(raw_content) + except xml.parsers.expat.ExpatError: + self.log.warning("Error parsing XML from %s", self.log_suffix) + return None + except UnicodeDecodeError: + self.log.warning("Error unidecoding XML from %s", self.log_suffix) + return None + except (LookupError, ValueError): + # unknown encoding or multi-byte encoding + self.log.warning("Error detecting XML encoding from %s", self.log_suffix) + return None + if not isinstance(d, dict): + self.log.warning("Skipping ill-formed XML content: %s", raw_content) + return None + return self._translate_dict(d) + + class SafeLoader(yaml.SafeLoader): yaml_implicit_resolvers = { k: [r for r in v if r[0] != "tag:yaml.org,2002:timestamp"] for k, v in yaml.SafeLoader.yaml_implicit_resolvers.items() } class YamlMapping(DictMapping, SingleFileIntrinsicMapping): """Base class for all mappings that use Yaml data as input.""" def translate(self, raw_content: bytes) -> Optional[Dict[str, str]]: raw_content_string: str = raw_content.decode() try: content_dict = yaml.load(raw_content_string, Loader=SafeLoader) except yaml.scanner.ScannerError: return None if isinstance(content_dict, dict): return self._translate_dict(content_dict) return None diff --git a/swh/indexer/metadata_dictionary/cff.py b/swh/indexer/metadata_dictionary/cff.py index 286ec77..12121cc 100644 --- a/swh/indexer/metadata_dictionary/cff.py +++ b/swh/indexer/metadata_dictionary/cff.py @@ -1,53 +1,63 @@ -from typing import Dict, List, Optional, Union +# Copyright (C) 2021-2022 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information -from swh.indexer.codemeta import CROSSWALK_TABLE, SCHEMA_URI +from typing import List + +from rdflib import BNode, Graph, Literal, URIRef +import rdflib.term + +from swh.indexer.codemeta import CROSSWALK_TABLE +from swh.indexer.namespaces import RDF, SCHEMA from .base import YamlMapping +from .utils import add_map + +DOI = URIRef("https://doi.org/") +SPDX = URIRef("https://spdx.org/licenses/") class CffMapping(YamlMapping): """Dedicated class for Citation (CITATION.cff) mapping and translation""" name = "cff" filename = b"CITATION.cff" mapping = CROSSWALK_TABLE["Citation File Format Core (CFF-Core) 1.0.2"] string_fields = ["keywords", "license", "abstract", "version", "doi"] - - def normalize_authors(self, d: List[dict]) -> Dict[str, list]: - result = [] - for author in d: - author_data: 
Dict[str, Optional[Union[str, Dict]]] = { - "@type": SCHEMA_URI + "Person" - } - if "orcid" in author and isinstance(author["orcid"], str): - author_data["@id"] = author["orcid"] - if "affiliation" in author and isinstance(author["affiliation"], str): - author_data[SCHEMA_URI + "affiliation"] = { - "@type": SCHEMA_URI + "Organization", - SCHEMA_URI + "name": author["affiliation"], - } - if "family-names" in author and isinstance(author["family-names"], str): - author_data[SCHEMA_URI + "familyName"] = author["family-names"] - if "given-names" in author and isinstance(author["given-names"], str): - author_data[SCHEMA_URI + "givenName"] = author["given-names"] - - result.append(author_data) - - result_final = {"@list": result} - return result_final - - def normalize_doi(self, s: str) -> Dict[str, str]: - if isinstance(s, str): - return {"@id": "https://doi.org/" + s} - - def normalize_license(self, s: str) -> Dict[str, str]: + uri_fields = ["repository-code"] + + def _translate_author(self, graph: Graph, author: dict) -> rdflib.term.Node: + node: rdflib.term.Node + if "orcid" in author and isinstance(author["orcid"], str): + node = URIRef(author["orcid"]) + else: + node = BNode() + graph.add((node, RDF.type, SCHEMA.Person)) + if "affiliation" in author and isinstance(author["affiliation"], str): + affiliation = BNode() + graph.add((node, SCHEMA.affiliation, affiliation)) + graph.add((affiliation, RDF.type, SCHEMA.Organization)) + graph.add((affiliation, SCHEMA.name, Literal(author["affiliation"]))) + if "family-names" in author and isinstance(author["family-names"], str): + graph.add((node, SCHEMA.familyName, Literal(author["family-names"]))) + if "given-names" in author and isinstance(author["given-names"], str): + graph.add((node, SCHEMA.givenName, Literal(author["given-names"]))) + return node + + def translate_authors( + self, graph: Graph, root: URIRef, authors: List[dict] + ) -> None: + add_map(graph, root, SCHEMA.author, self._translate_author, authors) + + def normalize_doi(self, s: str) -> URIRef: if isinstance(s, str): - return {"@id": "https://spdx.org/licenses/" + s} + return DOI + s - def normalize_repository_code(self, s: str) -> Dict[str, str]: + def normalize_license(self, s: str) -> URIRef: if isinstance(s, str): - return {"@id": s} + return SPDX + s - def normalize_date_released(self, s: str) -> Dict[str, str]: + def normalize_date_released(self, s: str) -> Literal: if isinstance(s, str): - return {"@value": s, "@type": SCHEMA_URI + "Date"} + return Literal(s, datatype=SCHEMA.Date) diff --git a/swh/indexer/metadata_dictionary/codemeta.py b/swh/indexer/metadata_dictionary/codemeta.py index f0f0d09..4da5eb6 100644 --- a/swh/indexer/metadata_dictionary/codemeta.py +++ b/swh/indexer/metadata_dictionary/codemeta.py @@ -1,31 +1,149 @@ -# Copyright (C) 2018-2019 The Software Heritage developers +# Copyright (C) 2018-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information +import collections import json -from typing import Any, Dict, List, Optional +import re +from typing import Any, Dict, List, Optional, Tuple +import xml.etree.ElementTree as ET -from swh.indexer.codemeta import CODEMETA_TERMS, expand +import xmltodict -from .base import SingleFileIntrinsicMapping +from swh.indexer.codemeta import CODEMETA_CONTEXT_URL, CODEMETA_TERMS, compact, expand + +from .base import BaseExtrinsicMapping, 
SingleFileIntrinsicMapping + +ATOM_URI = "http://www.w3.org/2005/Atom" + +_TAG_RE = re.compile(r"\{(?P.*?)\}(?P.*)") +_IGNORED_NAMESPACES = ("http://www.w3.org/2005/Atom",) class CodemetaMapping(SingleFileIntrinsicMapping): """ dedicated class for CodeMeta (codemeta.json) mapping and translation """ name = "codemeta" filename = b"codemeta.json" string_fields = None @classmethod def supported_terms(cls) -> List[str]: return [term for term in CODEMETA_TERMS if not term.startswith("@")] def translate(self, content: bytes) -> Optional[Dict[str, Any]]: try: return self.normalize_translation(expand(json.loads(content.decode()))) except Exception: return None + + +class SwordCodemetaMapping(BaseExtrinsicMapping): + """ + dedicated class for mapping and translation from JSON-LD statements + embedded in SWORD documents, optionally using Codemeta contexts, + as described in the :ref:`deposit-protocol`. + """ + + name = "sword-codemeta" + + @classmethod + def extrinsic_metadata_formats(cls) -> Tuple[str, ...]: + return ( + "sword-v2-atom-codemeta", + "sword-v2-atom-codemeta-v2", + ) + + @classmethod + def supported_terms(cls) -> List[str]: + return [term for term in CODEMETA_TERMS if not term.startswith("@")] + + def xml_to_jsonld(self, e: ET.Element) -> Dict[str, Any]: + doc: Dict[str, List[Dict[str, Any]]] = collections.defaultdict(list) + for child in e: + m = _TAG_RE.match(child.tag) + assert m, f"Tag with no namespace: {child}" + namespace = m.group("namespace") + localname = m.group("localname") + if namespace == ATOM_URI and localname in ("title", "name"): + # Convert Atom to Codemeta name; in case codemeta:name + # is not provided or different + doc["name"].append(self.xml_to_jsonld(child)) + elif namespace == ATOM_URI and localname in ("author", "email"): + # ditto for these author properties (note that author email is also + # covered by the previous test) + doc[localname].append(self.xml_to_jsonld(child)) + elif namespace in _IGNORED_NAMESPACES: + # SWORD-specific namespace that is not interesting to translate + pass + elif namespace.lower() == CODEMETA_CONTEXT_URL: + # It is a term defined by the context; write is as-is and JSON-LD + # expansion will convert it to a full URI based on + # "@context": CODEMETA_CONTEXT_URL + doc[localname].append(self.xml_to_jsonld(child)) + else: + # Otherwise, we already know the URI + doc[f"{namespace}{localname}"].append(self.xml_to_jsonld(child)) + + # The above needed doc values to be list to work; now we allow any type + # of value as key "@value" cannot have a list as value. + doc_: Dict[str, Any] = doc + + text = e.text.strip() if e.text else None + if text: + # TODO: check doc is empty, and raise mixed-content error otherwise? 
+ doc_["@value"] = text + + return doc_ + + def translate(self, content: bytes) -> Optional[Dict[str, Any]]: + # Parse XML + root = ET.fromstring(content) + + # Transform to JSON-LD document + doc = self.xml_to_jsonld(root) + + # Add @context to JSON-LD expansion replaces the "codemeta:" prefix + # hash (which uses the context URL as namespace URI for historical + # reasons) into properties in `http://schema.org/` and + # `https://codemeta.github.io/terms/` namespaces + doc["@context"] = CODEMETA_CONTEXT_URL + + # Normalize as a Codemeta document + return self.normalize_translation(expand(doc)) + + def normalize_translation(self, metadata: Dict[str, Any]) -> Dict[str, Any]: + return compact(metadata, forgefed=False) + + +class JsonSwordCodemetaMapping(SwordCodemetaMapping): + """ + Variant of :class:`SwordCodemetaMapping` that reads the legacy + ``sword-v2-atom-codemeta-v2-in-json`` format and converts it back to + ``sword-v2-atom-codemeta-v2`` XML + """ + + name = "json-sword-codemeta" + + @classmethod + def extrinsic_metadata_formats(cls) -> Tuple[str, ...]: + return ("sword-v2-atom-codemeta-v2-in-json",) + + def translate(self, content: bytes) -> Optional[Dict[str, Any]]: + # ``content`` was generated by calling ``xmltodict.parse()`` on a XML document, + # so ``xmltodict.unparse()`` is guaranteed to return a document that is + # semantically equivalent to the original and pass it to SwordCodemetaMapping. + json_doc = json.loads(content) + + if json_doc.get("@xmlns") != ATOM_URI: + # Technically, non-default XMLNS were allowed, but it does not seem like + # anyone used them, so they do not need to be implemented here. + raise NotImplementedError(f"Unexpected XMLNS set: {json_doc}") + + # Root tag was stripped by swh-deposit + json_doc = {"entry": json_doc} + + return super().translate(xmltodict.unparse(json_doc)) diff --git a/swh/indexer/metadata_dictionary/composer.py b/swh/indexer/metadata_dictionary/composer.py index c02f5d8..a43fc23 100644 --- a/swh/indexer/metadata_dictionary/composer.py +++ b/swh/indexer/metadata_dictionary/composer.py @@ -1,56 +1,61 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import os.path +from typing import Optional -from swh.indexer.codemeta import _DATA_DIR, SCHEMA_URI, _read_crosstable +from rdflib import BNode, Graph, Literal, URIRef + +from swh.indexer.codemeta import _DATA_DIR, _read_crosstable +from swh.indexer.namespaces import RDF, SCHEMA from .base import JsonMapping, SingleFileIntrinsicMapping +from .utils import add_map + +SPDX = URIRef("https://spdx.org/licenses/") + COMPOSER_TABLE_PATH = os.path.join(_DATA_DIR, "composer.csv") with open(COMPOSER_TABLE_PATH) as fd: (CODEMETA_TERMS, COMPOSER_TABLE) = _read_crosstable(fd) class ComposerMapping(JsonMapping, SingleFileIntrinsicMapping): """Dedicated class for Packagist(composer.json) mapping and translation""" name = "composer" mapping = COMPOSER_TABLE["Composer"] filename = b"composer.json" string_fields = [ "name", "description", "version", "keywords", - "homepage", "license", "author", "authors", ] - - def normalize_homepage(self, s): - if isinstance(s, str): - return {"@id": s} + uri_fields = ["homepage"] def normalize_license(self, s): if isinstance(s, str): - return {"@id": "https://spdx.org/licenses/" + s} + return SPDX + s - def normalize_authors(self, author_list): - authors = [] - for author 
in author_list: - author_obj = {"@type": SCHEMA_URI + "Person"} + def _translate_author(self, graph: Graph, author) -> Optional[BNode]: + if not isinstance(author, dict): + return None + node = BNode() + graph.add((node, RDF.type, SCHEMA.Person)) - if isinstance(author, dict): - if isinstance(author.get("name", None), str): - author_obj[SCHEMA_URI + "name"] = author.get("name", None) - if isinstance(author.get("email", None), str): - author_obj[SCHEMA_URI + "email"] = author.get("email", None) + if isinstance(author.get("name"), str): + graph.add((node, SCHEMA.name, Literal(author["name"]))) + if isinstance(author.get("email"), str): + graph.add((node, SCHEMA.email, Literal(author["email"]))) - authors.append(author_obj) + return node - return {"@list": authors} + def translate_authors(self, graph: Graph, root: URIRef, authors) -> None: + add_map(graph, root, SCHEMA.author, self._translate_author, authors) diff --git a/swh/indexer/metadata_dictionary/dart.py b/swh/indexer/metadata_dictionary/dart.py index 26cd7d5..ec6dfb2 100644 --- a/swh/indexer/metadata_dictionary/dart.py +++ b/swh/indexer/metadata_dictionary/dart.py @@ -1,74 +1,75 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import os.path import re -from swh.indexer.codemeta import _DATA_DIR, SCHEMA_URI, _read_crosstable +from rdflib import RDF, BNode, Graph, Literal, URIRef + +from swh.indexer.codemeta import _DATA_DIR, _read_crosstable +from swh.indexer.namespaces import SCHEMA from .base import YamlMapping +from .utils import add_map + +SPDX = URIRef("https://spdx.org/licenses/") PUB_TABLE_PATH = os.path.join(_DATA_DIR, "pubspec.csv") with open(PUB_TABLE_PATH) as fd: (CODEMETA_TERMS, PUB_TABLE) = _read_crosstable(fd) def name_to_person(name): return { - "@type": SCHEMA_URI + "Person", - SCHEMA_URI + "name": name, + "@type": SCHEMA.Person, + SCHEMA.name: name, } class PubspecMapping(YamlMapping): name = "pubspec" filename = b"pubspec.yaml" mapping = PUB_TABLE["Pubspec"] string_fields = [ "repository", "keywords", "description", "name", - "homepage", "issue_tracker", "platforms", "license" # license will only be used with the SPDX Identifier ] + uri_fields = ["homepage"] def normalize_license(self, s): if isinstance(s, str): - return {"@id": "https://spdx.org/licenses/" + s} - - def normalize_homepage(self, s): - if isinstance(s, str): - return {"@id": s} + return SPDX + s - def normalize_author(self, s): - name_email_regex = "(?P.*?)( <(?P.*)>)" - author = {"@type": SCHEMA_URI + "Person"} + def _translate_author(self, graph, s): + name_email_re = re.compile("(?P.*?)( <(?P.*)>)") if isinstance(s, str): - match = re.search(name_email_regex, s) + author = BNode() + graph.add((author, RDF.type, SCHEMA.Person)) + match = name_email_re.search(s) if match: name = match.group("name") email = match.group("email") - author[SCHEMA_URI + "email"] = email + graph.add((author, SCHEMA.email, Literal(email))) else: name = s - author[SCHEMA_URI + "name"] = name + graph.add((author, SCHEMA.name, Literal(name))) - return {"@list": [author]} + return author - def normalize_authors(self, authors_list): - authors = {"@list": []} + def translate_author(self, graph: Graph, root, s) -> None: + add_map(graph, root, SCHEMA.author, self._translate_author, [s]) - if isinstance(authors_list, list): - for s in authors_list: - author = 
self.normalize_author(s)["@list"] - authors["@list"] += author - return authors + def translate_authors(self, graph: Graph, root, authors) -> None: + if isinstance(authors, list): + add_map(graph, root, SCHEMA.author, self._translate_author, authors) diff --git a/swh/indexer/metadata_dictionary/github.py b/swh/indexer/metadata_dictionary/github.py index 020c8d0..fe3b87e 100644 --- a/swh/indexer/metadata_dictionary/github.py +++ b/swh/indexer/metadata_dictionary/github.py @@ -1,130 +1,113 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information -import json -from typing import Any, Dict, Tuple +from typing import Any, Tuple -from swh.indexer.codemeta import ACTIVITYSTREAMS_URI, CROSSWALK_TABLE, FORGEFED_URI +from rdflib import RDF, BNode, Graph, Literal, URIRef -from .base import BaseExtrinsicMapping, JsonMapping, produce_terms +from swh.indexer.codemeta import CROSSWALK_TABLE +from swh.indexer.namespaces import ACTIVITYSTREAMS, FORGEFED, SCHEMA +from .base import BaseExtrinsicMapping, JsonMapping, produce_terms +from .utils import prettyprint_graph # noqa -def _prettyprint(d): - print(json.dumps(d, indent=4)) +SPDX = URIRef("https://spdx.org/licenses/") class GitHubMapping(BaseExtrinsicMapping, JsonMapping): name = "github" mapping = CROSSWALK_TABLE["GitHub"] string_fields = [ "archive_url", "created_at", "updated_at", "description", "full_name", "html_url", "issues_url", ] @classmethod def extrinsic_metadata_formats(cls) -> Tuple[str, ...]: return ("application/vnd.github.v3+json",) - def _translate_dict(self, content_dict: Dict[str, Any], **kwargs) -> Dict[str, Any]: - d = super()._translate_dict(content_dict, **kwargs) - d["type"] = FORGEFED_URI + "Repository" - return d + def extra_translation(self, graph, root, content_dict): + graph.remove((root, RDF.type, SCHEMA.SoftwareSourceCode)) + graph.add((root, RDF.type, FORGEFED.Repository)) - @produce_terms(FORGEFED_URI, ["forks"]) - @produce_terms(ACTIVITYSTREAMS_URI, ["totalItems"]) - def translate_forks_count( - self, translated_metadata: Dict[str, Any], v: Any - ) -> None: + @produce_terms(FORGEFED.forks, ACTIVITYSTREAMS.totalItems) + def translate_forks_count(self, graph: Graph, root: BNode, v: Any) -> None: """ - >>> translated_metadata = {} - >>> GitHubMapping().translate_forks_count(translated_metadata, 42) - >>> _prettyprint(translated_metadata) + >>> graph = Graph() + >>> root = URIRef("http://example.org/test-software") + >>> GitHubMapping().translate_forks_count(graph, root, 42) + >>> prettyprint_graph(graph, root) { - "https://forgefed.org/ns#forks": [ - { - "@type": "https://www.w3.org/ns/activitystreams#OrderedCollection", - "https://www.w3.org/ns/activitystreams#totalItems": 42 - } - ] + "@id": ..., + "https://forgefed.org/ns#forks": { + "@type": "https://www.w3.org/ns/activitystreams#OrderedCollection", + "https://www.w3.org/ns/activitystreams#totalItems": 42 + } } """ if isinstance(v, int): - translated_metadata.setdefault(FORGEFED_URI + "forks", []).append( - { - "@type": ACTIVITYSTREAMS_URI + "OrderedCollection", - ACTIVITYSTREAMS_URI + "totalItems": v, - } - ) - - @produce_terms(ACTIVITYSTREAMS_URI, ["likes"]) - @produce_terms(ACTIVITYSTREAMS_URI, ["totalItems"]) - def translate_stargazers_count( - self, translated_metadata: Dict[str, Any], v: Any - ) -> None: + collection = BNode() + graph.add((root, FORGEFED.forks, collection)) + 
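# A rough sketch of the shape being built here, assuming a ``forks_count`` of
# 42 as in the doctest above: the blank ``collection`` node is typed as an
# ActivityStreams ``OrderedCollection`` and carries the item count, so the
# framed JSON-LD comes out as
#   {"https://forgefed.org/ns#forks": {
#       "@type": "https://www.w3.org/ns/activitystreams#OrderedCollection",
#       "https://www.w3.org/ns/activitystreams#totalItems": 42}}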
graph.add((collection, RDF.type, ACTIVITYSTREAMS.OrderedCollection)) + graph.add((collection, ACTIVITYSTREAMS.totalItems, Literal(v))) + + @produce_terms(ACTIVITYSTREAMS.likes, ACTIVITYSTREAMS.totalItems) + def translate_stargazers_count(self, graph: Graph, root: BNode, v: Any) -> None: """ - >>> translated_metadata = {} - >>> GitHubMapping().translate_stargazers_count(translated_metadata, 42) - >>> _prettyprint(translated_metadata) + >>> graph = Graph() + >>> root = URIRef("http://example.org/test-software") + >>> GitHubMapping().translate_stargazers_count(graph, root, 42) + >>> prettyprint_graph(graph, root) { - "https://www.w3.org/ns/activitystreams#likes": [ - { - "@type": "https://www.w3.org/ns/activitystreams#Collection", - "https://www.w3.org/ns/activitystreams#totalItems": 42 - } - ] + "@id": ..., + "https://www.w3.org/ns/activitystreams#likes": { + "@type": "https://www.w3.org/ns/activitystreams#Collection", + "https://www.w3.org/ns/activitystreams#totalItems": 42 + } } """ if isinstance(v, int): - translated_metadata.setdefault(ACTIVITYSTREAMS_URI + "likes", []).append( - { - "@type": ACTIVITYSTREAMS_URI + "Collection", - ACTIVITYSTREAMS_URI + "totalItems": v, - } - ) - - @produce_terms(ACTIVITYSTREAMS_URI, ["followers"]) - @produce_terms(ACTIVITYSTREAMS_URI, ["totalItems"]) - def translate_watchers_count( - self, translated_metadata: Dict[str, Any], v: Any - ) -> None: + collection = BNode() + graph.add((root, ACTIVITYSTREAMS.likes, collection)) + graph.add((collection, RDF.type, ACTIVITYSTREAMS.Collection)) + graph.add((collection, ACTIVITYSTREAMS.totalItems, Literal(v))) + + @produce_terms(ACTIVITYSTREAMS.followers, ACTIVITYSTREAMS.totalItems) + def translate_watchers_count(self, graph: Graph, root: BNode, v: Any) -> None: """ - >>> translated_metadata = {} - >>> GitHubMapping().translate_watchers_count(translated_metadata, 42) - >>> _prettyprint(translated_metadata) + >>> graph = Graph() + >>> root = URIRef("http://example.org/test-software") + >>> GitHubMapping().translate_watchers_count(graph, root, 42) + >>> prettyprint_graph(graph, root) { - "https://www.w3.org/ns/activitystreams#followers": [ - { - "@type": "https://www.w3.org/ns/activitystreams#Collection", - "https://www.w3.org/ns/activitystreams#totalItems": 42 - } - ] + "@id": ..., + "https://www.w3.org/ns/activitystreams#followers": { + "@type": "https://www.w3.org/ns/activitystreams#Collection", + "https://www.w3.org/ns/activitystreams#totalItems": 42 + } } """ if isinstance(v, int): - translated_metadata.setdefault( - ACTIVITYSTREAMS_URI + "followers", [] - ).append( - { - "@type": ACTIVITYSTREAMS_URI + "Collection", - ACTIVITYSTREAMS_URI + "totalItems": v, - } - ) + collection = BNode() + graph.add((root, ACTIVITYSTREAMS.followers, collection)) + graph.add((collection, RDF.type, ACTIVITYSTREAMS.Collection)) + graph.add((collection, ACTIVITYSTREAMS.totalItems, Literal(v))) def normalize_license(self, d): """ >>> GitHubMapping().normalize_license({'spdx_id': 'MIT'}) - {'@id': 'https://spdx.org/licenses/MIT'} + rdflib.term.URIRef('https://spdx.org/licenses/MIT') """ if isinstance(d, dict) and isinstance(d.get("spdx_id"), str): - return {"@id": "https://spdx.org/licenses/" + d["spdx_id"]} + return SPDX + d["spdx_id"] diff --git a/swh/indexer/metadata_dictionary/maven.py b/swh/indexer/metadata_dictionary/maven.py index 419eb74..a374a5e 100644 --- a/swh/indexer/metadata_dictionary/maven.py +++ b/swh/indexer/metadata_dictionary/maven.py @@ -1,162 +1,159 @@ -# Copyright (C) 2018-2021 The Software Heritage developers +# 
Copyright (C) 2018-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import os -from typing import Any, Dict, Optional -import xml.parsers.expat +from typing import Any, Dict -import xmltodict +from rdflib import Graph, Literal, URIRef -from swh.indexer.codemeta import CROSSWALK_TABLE, SCHEMA_URI +from swh.indexer.codemeta import CROSSWALK_TABLE +from swh.indexer.namespaces import SCHEMA -from .base import DictMapping, SingleFileIntrinsicMapping +from .base import SingleFileIntrinsicMapping, XmlMapping +from .utils import prettyprint_graph # noqa -class MavenMapping(DictMapping, SingleFileIntrinsicMapping): +class MavenMapping(XmlMapping, SingleFileIntrinsicMapping): """ dedicated class for Maven (pom.xml) mapping and translation """ name = "maven" filename = b"pom.xml" mapping = CROSSWALK_TABLE["Java (Maven)"] string_fields = ["name", "version", "description", "email"] - def translate(self, content: bytes) -> Optional[Dict[str, Any]]: - try: - d = xmltodict.parse(content).get("project") or {} - except xml.parsers.expat.ExpatError: - self.log.warning("Error parsing XML from %s", self.log_suffix) - return None - except UnicodeDecodeError: - self.log.warning("Error unidecoding XML from %s", self.log_suffix) - return None - except (LookupError, ValueError): - # unknown encoding or multi-byte encoding - self.log.warning("Error detecting XML encoding from %s", self.log_suffix) - return None - if not isinstance(d, dict): - self.log.warning("Skipping ill-formed XML content: %s", content) - return None - metadata = self._translate_dict(d, normalize=False) - metadata[SCHEMA_URI + "codeRepository"] = self.parse_repositories(d) - metadata[SCHEMA_URI + "license"] = self.parse_licenses(d) - return self.normalize_translation(metadata) - _default_repository = {"url": "https://repo.maven.apache.org/maven2/"} - def parse_repositories(self, d): + def _translate_dict(self, d: Dict[str, Any]) -> Dict[str, Any]: + return super()._translate_dict(d.get("project") or {}) + + def extra_translation(self, graph: Graph, root, d): + self.parse_repositories(graph, root, d) + + def parse_repositories(self, graph: Graph, root, d): """https://maven.apache.org/pom.html#Repositories + >>> import rdflib >>> import xmltodict >>> from pprint import pprint >>> d = xmltodict.parse(''' ... ... ... codehausSnapshots ... Codehaus Snapshots ... http://snapshots.maven.codehaus.org/maven2 ... default ... ... ... ''') - >>> MavenMapping().parse_repositories(d) + >>> MavenMapping().parse_repositories(rdflib.Graph(), rdflib.BNode(), d) """ repositories = d.get("repositories") if not repositories: - results = [self.parse_repository(d, self._default_repository)] + self.parse_repository(graph, root, d, self._default_repository) elif isinstance(repositories, dict): repositories = repositories.get("repository") or [] if not isinstance(repositories, list): repositories = [repositories] - results = [self.parse_repository(d, repo) for repo in repositories] - else: - results = [] - return [res for res in results if res] or None + for repo in repositories: + self.parse_repository(graph, root, d, repo) - def parse_repository(self, d, repo): + def parse_repository(self, graph: Graph, root, d, repo): if not isinstance(repo, dict): return if repo.get("layout", "default") != "default": return # TODO ? 
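# The code that follows assembles the repository URL from the base ``url``, the
# dot-separated ``groupId`` expanded into path components, and the
# ``artifactId``; a minimal sketch, using the default repository and the
# coordinates that appear in the tests below:
#   os.path.join("https://repo.maven.apache.org/maven2/",
#                *"com.mycompany.app".split("."), "my-app")
#   # -> 'https://repo.maven.apache.org/maven2/com/mycompany/app/my-app'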
url = repo.get("url") group_id = d.get("groupId") artifact_id = d.get("artifactId") if ( isinstance(url, str) and isinstance(group_id, str) and isinstance(artifact_id, str) ): repo = os.path.join(url, *group_id.split("."), artifact_id) - return {"@id": repo} + graph.add((root, SCHEMA.codeRepository, URIRef(repo))) def normalize_groupId(self, id_): """https://maven.apache.org/pom.html#Maven_Coordinates >>> MavenMapping().normalize_groupId('org.example') - {'@id': 'org.example'} + rdflib.term.Literal('org.example') """ if isinstance(id_, str): - return {"@id": id_} + return Literal(id_) - def parse_licenses(self, d): + def translate_licenses(self, graph, root, licenses): """https://maven.apache.org/pom.html#Licenses >>> import xmltodict >>> import json >>> d = xmltodict.parse(''' ... ... ... Apache License, Version 2.0 ... https://www.apache.org/licenses/LICENSE-2.0.txt ... ... ... ''') >>> print(json.dumps(d, indent=4)) { "licenses": { "license": { "name": "Apache License, Version 2.0", "url": "https://www.apache.org/licenses/LICENSE-2.0.txt" } } } - >>> MavenMapping().parse_licenses(d) - [{'@id': 'https://www.apache.org/licenses/LICENSE-2.0.txt'}] + >>> graph = Graph() + >>> root = URIRef("http://example.org/test-software") + >>> MavenMapping().translate_licenses(graph, root, d["licenses"]) + >>> prettyprint_graph(graph, root) + { + "@id": ..., + "http://schema.org/license": { + "@id": "https://www.apache.org/licenses/LICENSE-2.0.txt" + } + } or, if there are more than one license: >>> import xmltodict >>> from pprint import pprint >>> d = xmltodict.parse(''' ... ... ... Apache License, Version 2.0 ... https://www.apache.org/licenses/LICENSE-2.0.txt ... ... ... MIT License ... https://opensource.org/licenses/MIT ... ... ... ''') - >>> pprint(MavenMapping().parse_licenses(d)) - [{'@id': 'https://www.apache.org/licenses/LICENSE-2.0.txt'}, - {'@id': 'https://opensource.org/licenses/MIT'}] + >>> graph = Graph() + >>> root = URIRef("http://example.org/test-software") + >>> MavenMapping().translate_licenses(graph, root, d["licenses"]) + >>> pprint(set(graph.triples((root, URIRef("http://schema.org/license"), None)))) + {(rdflib.term.URIRef('http://example.org/test-software'), + rdflib.term.URIRef('http://schema.org/license'), + rdflib.term.URIRef('https://opensource.org/licenses/MIT')), + (rdflib.term.URIRef('http://example.org/test-software'), + rdflib.term.URIRef('http://schema.org/license'), + rdflib.term.URIRef('https://www.apache.org/licenses/LICENSE-2.0.txt'))} """ - licenses = d.get("licenses") if not isinstance(licenses, dict): return licenses = licenses.get("license") if isinstance(licenses, dict): licenses = [licenses] elif not isinstance(licenses, list): return - return [ - {"@id": license["url"]} - for license in licenses - if isinstance(license, dict) and isinstance(license.get("url"), str) - ] or None + for license in licenses: + if isinstance(license, dict) and isinstance(license.get("url"), str): + graph.add((root, SCHEMA.license, URIRef(license["url"]))) diff --git a/swh/indexer/metadata_dictionary/npm.py b/swh/indexer/metadata_dictionary/npm.py index 00231dc..1540ef6 100644 --- a/swh/indexer/metadata_dictionary/npm.py +++ b/swh/indexer/metadata_dictionary/npm.py @@ -1,243 +1,282 @@ # Copyright (C) 2018-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import re import urllib.parse -from 
swh.indexer.codemeta import CROSSWALK_TABLE, SCHEMA_URI +from rdflib import RDF, BNode, Graph, Literal, URIRef + +from swh.indexer.codemeta import CROSSWALK_TABLE +from swh.indexer.namespaces import SCHEMA from .base import JsonMapping, SingleFileIntrinsicMapping +from .utils import add_list, prettyprint_graph # noqa + +SPDX = URIRef("https://spdx.org/licenses/") class NpmMapping(JsonMapping, SingleFileIntrinsicMapping): """ dedicated class for NPM (package.json) mapping and translation """ name = "npm" mapping = CROSSWALK_TABLE["NodeJS"] filename = b"package.json" - string_fields = ["name", "version", "homepage", "description", "email"] + string_fields = ["name", "version", "description", "email"] + uri_fields = ["homepage"] _schema_shortcuts = { "github": "git+https://github.com/%s.git", "gist": "git+https://gist.github.com/%s.git", "gitlab": "git+https://gitlab.com/%s.git", # Bitbucket supports both hg and git, and the shortcut does not # tell which one to use. # 'bitbucket': 'https://bitbucket.org/', } def normalize_repository(self, d): """https://docs.npmjs.com/files/package.json#repository >>> NpmMapping().normalize_repository({ ... 'type': 'git', ... 'url': 'https://example.org/foo.git' ... }) - {'@id': 'git+https://example.org/foo.git'} + rdflib.term.URIRef('git+https://example.org/foo.git') >>> NpmMapping().normalize_repository( ... 'gitlab:foo/bar') - {'@id': 'git+https://gitlab.com/foo/bar.git'} + rdflib.term.URIRef('git+https://gitlab.com/foo/bar.git') >>> NpmMapping().normalize_repository( ... 'foo/bar') - {'@id': 'git+https://github.com/foo/bar.git'} + rdflib.term.URIRef('git+https://github.com/foo/bar.git') """ if ( isinstance(d, dict) and isinstance(d.get("type"), str) and isinstance(d.get("url"), str) ): url = "{type}+{url}".format(**d) elif isinstance(d, str): if "://" in d: url = d elif ":" in d: (schema, rest) = d.split(":", 1) if schema in self._schema_shortcuts: url = self._schema_shortcuts[schema] % rest else: return None else: url = self._schema_shortcuts["github"] % d else: return None - return {"@id": url} + return URIRef(url) def normalize_bugs(self, d): """https://docs.npmjs.com/files/package.json#bugs >>> NpmMapping().normalize_bugs({ ... 'url': 'https://example.org/bugs/', ... 'email': 'bugs@example.org' ... }) - {'@id': 'https://example.org/bugs/'} + rdflib.term.URIRef('https://example.org/bugs/') >>> NpmMapping().normalize_bugs( ... 'https://example.org/bugs/') - {'@id': 'https://example.org/bugs/'} + rdflib.term.URIRef('https://example.org/bugs/') """ if isinstance(d, dict) and isinstance(d.get("url"), str): - return {"@id": d["url"]} + return URIRef(d["url"]) elif isinstance(d, str): - return {"@id": d} + return URIRef(d) else: return None _parse_author = re.compile( r"^ *" r"(?P.*?)" r"( +<(?P.*)>)?" r"( +\((?P.*)\))?" r" *$" ) - def normalize_author(self, d): + def translate_author(self, graph: Graph, root, d): r"""https://docs.npmjs.com/files/package.json#people-fields-author-contributors' >>> from pprint import pprint - >>> pprint(NpmMapping().normalize_author({ + >>> root = URIRef("http://example.org/test-software") + >>> graph = Graph() + >>> NpmMapping().translate_author(graph, root, { ... 'name': 'John Doe', ... 'email': 'john.doe@example.org', ... 'url': 'https://example.org/~john.doe', - ... 
})) - {'@list': [{'@type': 'http://schema.org/Person', - 'http://schema.org/email': 'john.doe@example.org', - 'http://schema.org/name': 'John Doe', - 'http://schema.org/url': {'@id': 'https://example.org/~john.doe'}}]} - >>> pprint(NpmMapping().normalize_author( + ... }) + >>> prettyprint_graph(graph, root) + { + "@id": ..., + "http://schema.org/author": { + "@list": [ + { + "@type": "http://schema.org/Person", + "http://schema.org/email": "john.doe@example.org", + "http://schema.org/name": "John Doe", + "http://schema.org/url": { + "@id": "https://example.org/~john.doe" + } + } + ] + } + } + >>> graph = Graph() + >>> NpmMapping().translate_author(graph, root, ... 'John Doe (https://example.org/~john.doe)' - ... )) - {'@list': [{'@type': 'http://schema.org/Person', - 'http://schema.org/email': 'john.doe@example.org', - 'http://schema.org/name': 'John Doe', - 'http://schema.org/url': {'@id': 'https://example.org/~john.doe'}}]} - >>> pprint(NpmMapping().normalize_author({ + ... ) + >>> prettyprint_graph(graph, root) + { + "@id": ..., + "http://schema.org/author": { + "@list": [ + { + "@type": "http://schema.org/Person", + "http://schema.org/email": "john.doe@example.org", + "http://schema.org/name": "John Doe", + "http://schema.org/url": { + "@id": "https://example.org/~john.doe" + } + } + ] + } + } + >>> graph = Graph() + >>> NpmMapping().translate_author(graph, root, { ... 'name': 'John Doe', ... 'email': 'john.doe@example.org', ... 'url': 'https:\\\\example.invalid/~john.doe', - ... })) - {'@list': [{'@type': 'http://schema.org/Person', - 'http://schema.org/email': 'john.doe@example.org', - 'http://schema.org/name': 'John Doe'}]} + ... }) + >>> prettyprint_graph(graph, root) + { + "@id": ..., + "http://schema.org/author": { + "@list": [ + { + "@type": "http://schema.org/Person", + "http://schema.org/email": "john.doe@example.org", + "http://schema.org/name": "John Doe" + } + ] + } + } """ # noqa - author = {"@type": SCHEMA_URI + "Person"} + author = BNode() + graph.add((author, RDF.type, SCHEMA.Person)) if isinstance(d, dict): name = d.get("name", None) email = d.get("email", None) url = d.get("url", None) elif isinstance(d, str): match = self._parse_author.match(d) if not match: return None name = match.group("name") email = match.group("email") url = match.group("url") else: return None if name and isinstance(name, str): - author[SCHEMA_URI + "name"] = name + graph.add((author, SCHEMA.name, Literal(name))) if email and isinstance(email, str): - author[SCHEMA_URI + "email"] = email + graph.add((author, SCHEMA.email, Literal(email))) if url and isinstance(url, str): # Workaround for https://github.com/digitalbazaar/pyld/issues/91 : drop # URLs that are blatantly invalid early, so PyLD does not crash. parsed_url = urllib.parse.urlparse(url) if parsed_url.netloc: - author[SCHEMA_URI + "url"] = {"@id": url} + graph.add((author, SCHEMA.url, URIRef(url))) - return {"@list": [author]} + add_list(graph, root, SCHEMA.author, [author]) def normalize_description(self, description): r"""Try to re-decode ``description`` as UTF-16, as this is a somewhat common mistake that causes issues in the database because of null bytes in JSON. >>> NpmMapping().normalize_description("foo bar") - 'foo bar' + rdflib.term.Literal('foo bar') >>> NpmMapping().normalize_description( ... "\ufffd\ufffd#\x00 \x00f\x00o\x00o\x00 \x00b\x00a\x00r\x00\r\x00 \x00" ... ) - 'foo bar' + rdflib.term.Literal('foo bar') >>> NpmMapping().normalize_description( ... 
"\ufffd\ufffd\x00#\x00 \x00f\x00o\x00o\x00 \x00b\x00a\x00r\x00\r\x00 " ... ) - 'foo bar' + rdflib.term.Literal('foo bar') >>> NpmMapping().normalize_description( ... # invalid UTF-16 and meaningless UTF-8: ... "\ufffd\ufffd\x00#\x00\x00\x00 \x00\x00\x00\x00f\x00\x00\x00\x00" ... ) is None True >>> NpmMapping().normalize_description( ... # ditto (ut looks like little-endian at first) ... "\ufffd\ufffd#\x00\x00\x00 \x00\x00\x00\x00f\x00\x00\x00\x00\x00" ... ) is None True >>> NpmMapping().normalize_description(None) is None True """ if not isinstance(description, str): return None # XXX: if this function ever need to support more cases, consider # switching to https://pypi.org/project/ftfy/ instead of adding more hacks if description.startswith("\ufffd\ufffd") and "\x00" in description: # 2 unicode replacement characters followed by '# ' encoded as UTF-16 # is a common mistake, which indicates a README.md was saved as UTF-16, # and some NPM tool opened it as UTF-8 and used the first line as # description. description_bytes = description.encode() # Strip the the two unicode replacement characters assert description_bytes.startswith(b"\xef\xbf\xbd\xef\xbf\xbd") description_bytes = description_bytes[6:] # If the following attempts fail to recover the description, discard it # entirely because the current indexer storage backend (postgresql) cannot # store zero bytes in JSON columns. description = None if not description_bytes.startswith(b"\x00"): # try UTF-16 little-endian (the most common) first try: description = description_bytes.decode("utf-16le") except UnicodeDecodeError: pass if description is None: # if it fails, try UTF-16 big-endian try: description = description_bytes.decode("utf-16be") except UnicodeDecodeError: pass if description: if description.startswith("# "): description = description[2:] - return description.rstrip() - return description + return Literal(description.rstrip()) + else: + return None + return Literal(description) def normalize_license(self, s): """https://docs.npmjs.com/files/package.json#license >>> NpmMapping().normalize_license('MIT') - {'@id': 'https://spdx.org/licenses/MIT'} - """ - if isinstance(s, str): - return {"@id": "https://spdx.org/licenses/" + s} - - def normalize_homepage(self, s): - """https://docs.npmjs.com/files/package.json#homepage - - >>> NpmMapping().normalize_homepage('https://example.org/~john.doe') - {'@id': 'https://example.org/~john.doe'} + rdflib.term.URIRef('https://spdx.org/licenses/MIT') """ if isinstance(s, str): - return {"@id": s} + return SPDX + s def normalize_keywords(self, lst): """https://docs.npmjs.com/files/package.json#homepage >>> NpmMapping().normalize_keywords(['foo', 'bar']) - ['foo', 'bar'] + [rdflib.term.Literal('foo'), rdflib.term.Literal('bar')] """ if isinstance(lst, list): - return [x for x in lst if isinstance(x, str)] + return [Literal(x) for x in lst if isinstance(x, str)] diff --git a/swh/indexer/metadata_dictionary/nuget.py b/swh/indexer/metadata_dictionary/nuget.py new file mode 100644 index 0000000..62f7ea9 --- /dev/null +++ b/swh/indexer/metadata_dictionary/nuget.py @@ -0,0 +1,95 @@ +# Copyright (C) 2022 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +import os.path +import re +from typing import Any, Dict, List + +from rdflib import RDF, BNode, Graph, Literal, URIRef + +from swh.indexer.codemeta import _DATA_DIR, 
_read_crosstable +from swh.indexer.namespaces import SCHEMA +from swh.indexer.storage.interface import Sha1 + +from .base import BaseIntrinsicMapping, DirectoryLsEntry, XmlMapping +from .utils import add_list + +NUGET_TABLE_PATH = os.path.join(_DATA_DIR, "nuget.csv") + +with open(NUGET_TABLE_PATH) as fd: + (CODEMETA_TERMS, NUGET_TABLE) = _read_crosstable(fd) + +SPDX = URIRef("https://spdx.org/licenses/") + + +class NuGetMapping(XmlMapping, BaseIntrinsicMapping): + """ + dedicated class for NuGet (.nuspec) mapping and translation + """ + + name = "nuget" + mapping = NUGET_TABLE["NuGet"] + mapping["copyright"] = URIRef("http://schema.org/copyrightNotice") + mapping["language"] = URIRef("http://schema.org/inLanguage") + string_fields = [ + "description", + "version", + "name", + "tags", + "license", + "summary", + "copyright", + "language", + ] + uri_fields = ["projectUrl", "licenseUrl"] + + @classmethod + def detect_metadata_files(cls, file_entries: List[DirectoryLsEntry]) -> List[Sha1]: + for entry in file_entries: + if entry["name"].endswith(b".nuspec"): + return [entry["sha1"]] + return [] + + def _translate_dict(self, d: Dict[str, Any]) -> Dict[str, Any]: + return super()._translate_dict(d.get("package", {}).get("metadata", {})) + + def translate_repository(self, graph, root, v): + if isinstance(v, dict) and isinstance(v["@url"], str): + codemeta_key = URIRef(self.mapping["repository.url"]) + graph.add((root, codemeta_key, URIRef(v["@url"]))) + + def normalize_license(self, v): + if isinstance(v, dict) and v["@type"] == "expression": + license_string = v["#text"] + if not bool( + re.search(r" with |\(|\)| and ", license_string, re.IGNORECASE) + ): + return [ + SPDX + license_type.strip() + for license_type in re.split( + r" or ", license_string, flags=re.IGNORECASE + ) + ] + else: + return None + + def translate_authors(self, graph: Graph, root, s): + if isinstance(s, str): + authors = [] + for author_name in s.split(","): + author_name = author_name.strip() + author = BNode() + graph.add((author, RDF.type, SCHEMA.Person)) + graph.add((author, SCHEMA.name, Literal(author_name))) + authors.append(author) + add_list(graph, root, SCHEMA.author, authors) + + def translate_releaseNotes(self, graph: Graph, root, s): + if isinstance(s, str): + graph.add((root, SCHEMA.releaseNotes, Literal(s))) + + def normalize_tags(self, s): + if isinstance(s, str): + return [Literal(tag) for tag in s.split(" ")] diff --git a/swh/indexer/metadata_dictionary/python.py b/swh/indexer/metadata_dictionary/python.py index 686deed..b16d681 100644 --- a/swh/indexer/metadata_dictionary/python.py +++ b/swh/indexer/metadata_dictionary/python.py @@ -1,76 +1,80 @@ -# Copyright (C) 2018-2019 The Software Heritage developers +# Copyright (C) 2018-2021 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import email.parser import email.policy -import itertools -from swh.indexer.codemeta import CROSSWALK_TABLE, SCHEMA_URI +from rdflib import BNode, Literal, URIRef + +from swh.indexer.codemeta import CROSSWALK_TABLE +from swh.indexer.namespaces import RDF, SCHEMA from .base import DictMapping, SingleFileIntrinsicMapping +from .utils import add_list _normalize_pkginfo_key = str.lower class LinebreakPreservingEmailPolicy(email.policy.EmailPolicy): def header_fetch_parse(self, name, value): if hasattr(value, "name"): return value value = value.replace("\n ", 
"\n") return self.header_factory(name, value) class PythonPkginfoMapping(DictMapping, SingleFileIntrinsicMapping): """Dedicated class for Python's PKG-INFO mapping and translation. https://www.python.org/dev/peps/pep-0314/""" name = "pkg-info" filename = b"PKG-INFO" mapping = { _normalize_pkginfo_key(k): v for (k, v) in CROSSWALK_TABLE["Python PKG-INFO"].items() } string_fields = [ "name", "version", "description", "summary", "author", "author-email", ] _parser = email.parser.BytesHeaderParser(policy=LinebreakPreservingEmailPolicy()) def translate(self, content): msg = self._parser.parsebytes(content) d = {} for (key, value) in msg.items(): key = _normalize_pkginfo_key(key) if value != "UNKNOWN": d.setdefault(key, []).append(value) - metadata = self._translate_dict(d, normalize=False) - if SCHEMA_URI + "author" in metadata or SCHEMA_URI + "email" in metadata: - metadata[SCHEMA_URI + "author"] = { - "@list": [ - { - "@type": SCHEMA_URI + "Person", - SCHEMA_URI - + "name": metadata.pop(SCHEMA_URI + "author", [None])[0], - SCHEMA_URI - + "email": metadata.pop(SCHEMA_URI + "email", [None])[0], - } - ] - } - return self.normalize_translation(metadata) + return self._translate_dict(d) + + def extra_translation(self, graph, root, d): + author_names = list(graph.triples((root, SCHEMA.author, None))) + author_emails = list(graph.triples((root, SCHEMA.email, None))) + graph.remove((root, SCHEMA.author, None)) + graph.remove((root, SCHEMA.email, None)) + if author_names or author_emails: + author = BNode() + graph.add((author, RDF.type, SCHEMA.Person)) + for (_, _, author_name) in author_names: + graph.add((author, SCHEMA.name, author_name)) + for (_, _, author_email) in author_emails: + graph.add((author, SCHEMA.email, author_email)) + add_list(graph, root, SCHEMA.author, [author]) def normalize_home_page(self, urls): - return [{"@id": url} for url in urls] + return [URIRef(url) for url in urls] def normalize_keywords(self, keywords): - return list(itertools.chain.from_iterable(s.split(" ") for s in keywords)) + return [Literal(keyword) for s in keywords for keyword in s.split(" ")] def normalize_license(self, licenses): - return [{"@id": license} for license in licenses] + return [URIRef("https://spdx.org/licenses/" + license) for license in licenses] diff --git a/swh/indexer/metadata_dictionary/ruby.py b/swh/indexer/metadata_dictionary/ruby.py index bdb06aa..71a0b10 100644 --- a/swh/indexer/metadata_dictionary/ruby.py +++ b/swh/indexer/metadata_dictionary/ruby.py @@ -1,135 +1,130 @@ # Copyright (C) 2018-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import ast import itertools import re from typing import List -from swh.indexer.codemeta import CROSSWALK_TABLE, SCHEMA_URI +from rdflib import RDF, BNode, Graph, Literal, URIRef + +from swh.indexer.codemeta import CROSSWALK_TABLE from swh.indexer.metadata_dictionary.base import DirectoryLsEntry +from swh.indexer.namespaces import SCHEMA from swh.indexer.storage.interface import Sha1 from .base import BaseIntrinsicMapping, DictMapping +from .utils import add_map + +SPDX = URIRef("https://spdx.org/licenses/") -def name_to_person(name): - return { - "@type": SCHEMA_URI + "Person", - SCHEMA_URI + "name": name, - } +def name_to_person(graph: Graph, name): + if not isinstance(name, str): + return None + author = BNode() + graph.add((author, RDF.type, SCHEMA.Person)) + 
graph.add((author, SCHEMA.name, Literal(name))) + return author class GemspecMapping(BaseIntrinsicMapping, DictMapping): name = "gemspec" mapping = CROSSWALK_TABLE["Ruby Gem"] string_fields = ["name", "version", "description", "summary", "email"] + uri_fields = ["homepage"] _re_spec_new = re.compile(r".*Gem::Specification.new +(do|\{) +\|.*\|.*") _re_spec_entry = re.compile(r"\s*\w+\.(?P\w+)\s*=\s*(?P.*)") @classmethod def detect_metadata_files(cls, file_entries: List[DirectoryLsEntry]) -> List[Sha1]: for entry in file_entries: if entry["name"].endswith(b".gemspec"): return [entry["sha1"]] return [] def translate(self, raw_content): try: raw_content = raw_content.decode() except UnicodeDecodeError: self.log.warning("Error unidecoding from %s", self.log_suffix) return # Skip lines before 'Gem::Specification.new' lines = itertools.dropwhile( lambda x: not self._re_spec_new.match(x), raw_content.split("\n") ) try: next(lines) # Consume 'Gem::Specification.new' except StopIteration: self.log.warning("Could not find Gem::Specification in %s", self.log_suffix) return content_dict = {} for line in lines: match = self._re_spec_entry.match(line) if match: value = self.eval_ruby_expression(match.group("expr")) if value: content_dict[match.group("key")] = value return self._translate_dict(content_dict) def eval_ruby_expression(self, expr): """Very simple evaluator of Ruby expressions. >>> GemspecMapping().eval_ruby_expression('"Foo bar"') 'Foo bar' >>> GemspecMapping().eval_ruby_expression("'Foo bar'") 'Foo bar' >>> GemspecMapping().eval_ruby_expression("['Foo', 'bar']") ['Foo', 'bar'] >>> GemspecMapping().eval_ruby_expression("'Foo bar'.freeze") 'Foo bar' >>> GemspecMapping().eval_ruby_expression( \ "['Foo'.freeze, 'bar'.freeze]") ['Foo', 'bar'] """ def evaluator(node): if isinstance(node, ast.Str): return node.s elif isinstance(node, ast.List): res = [] for element in node.elts: val = evaluator(element) if not val: return res.append(val) return res expr = expr.replace(".freeze", "") try: # We're parsing Ruby expressions here, but Python's # ast.parse works for very simple Ruby expressions # (mainly strings delimited with " or ', and lists # of such strings). 
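# For instance, matching the doctests above, "['Foo', 'bar']" parses to an
# ast.List of string constants that the evaluator above turns back into a
# plain Python list; input that is not also valid Python raises SyntaxError
# and is discarded by the except clause below.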
tree = ast.parse(expr, mode="eval") except (SyntaxError, ValueError): return if isinstance(tree, ast.Expression): return evaluator(tree.body) - def normalize_homepage(self, s): - if isinstance(s, str): - return {"@id": s} - def normalize_license(self, s): if isinstance(s, str): - return [{"@id": "https://spdx.org/licenses/" + s}] + return SPDX + s def normalize_licenses(self, licenses): if isinstance(licenses, list): - return [ - {"@id": "https://spdx.org/licenses/" + license} - for license in licenses - if isinstance(license, str) - ] + return [SPDX + license for license in licenses if isinstance(license, str)] - def normalize_author(self, author): + def translate_author(self, graph: Graph, root, author): if isinstance(author, str): - return {"@list": [name_to_person(author)]} + add_map(graph, root, SCHEMA.author, name_to_person, [author]) - def normalize_authors(self, authors): + def translate_authors(self, graph: Graph, root, authors): if isinstance(authors, list): - return { - "@list": [ - name_to_person(author) - for author in authors - if isinstance(author, str) - ] - } + add_map(graph, root, SCHEMA.author, name_to_person, authors) diff --git a/swh/indexer/metadata_dictionary/utils.py b/swh/indexer/metadata_dictionary/utils.py new file mode 100644 index 0000000..173b146 --- /dev/null +++ b/swh/indexer/metadata_dictionary/utils.py @@ -0,0 +1,72 @@ +# Copyright (C) 2022 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + + +import json +from typing import Callable, Iterable, Optional, Sequence, TypeVar + +from pyld import jsonld +from rdflib import RDF, Graph, URIRef +import rdflib.term + +from swh.indexer.codemeta import _document_loader + + +def prettyprint_graph(graph: Graph, root: URIRef): + s = graph.serialize(format="application/ld+json") + jsonld_graph = json.loads(s) + translated_metadata = jsonld.frame( + jsonld_graph, + {"@id": str(root)}, + options={ + "documentLoader": _document_loader, + "processingMode": "json-ld-1.1", + }, + ) + print(json.dumps(translated_metadata, indent=4)) + + +def add_list( + graph: Graph, + subject: rdflib.term.Node, + predicate: rdflib.term.Identifier, + objects: Sequence[rdflib.term.Node], +) -> None: + """Adds triples to the ``graph`` so that they are equivalent to this + JSON-LD object:: + + { + "@id": subject, + predicate: {"@list": objects} + } + + This is a naive implementation of + https://json-ld.org/spec/latest/json-ld-api/#list-to-rdf-conversion + """ + # JSON-LD's @list is syntactic sugar for a linked list / chain in the RDF graph, + # which is what we are going to construct, starting from the end: + last_link: rdflib.term.Node + last_link = RDF.nil + for item in reversed(objects): + link = rdflib.BNode() + graph.add((link, RDF.first, item)) + graph.add((link, RDF.rest, last_link)) + last_link = link + graph.add((subject, predicate, last_link)) + + +TValue = TypeVar("TValue") + + +def add_map( + graph: Graph, + subject: rdflib.term.Node, + predicate: rdflib.term.Identifier, + f: Callable[[Graph, TValue], Optional[rdflib.term.Node]], + values: Iterable[TValue], +) -> None: + """Helper for :func:`add_list` that takes a mapper function ``f``.""" + nodes = [f(graph, value) for value in values] + add_list(graph, subject, predicate, [node for node in nodes if node]) diff --git a/swh/indexer/namespaces.py b/swh/indexer/namespaces.py new file mode 100644 index 
0000000..65ab826 --- /dev/null +++ b/swh/indexer/namespaces.py @@ -0,0 +1,12 @@ +# Copyright (C) 2022 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +from rdflib import Namespace as _Namespace +from rdflib import RDF # noqa + +SCHEMA = _Namespace("http://schema.org/") +CODEMETA = _Namespace("https://codemeta.github.io/terms/") +FORGEFED = _Namespace("https://forgefed.org/ns#") +ACTIVITYSTREAMS = _Namespace("https://www.w3.org/ns/activitystreams#") diff --git a/swh/indexer/tests/metadata_dictionary/test_cff.py b/swh/indexer/tests/metadata_dictionary/test_cff.py index f91a689..fb50ba5 100644 --- a/swh/indexer/tests/metadata_dictionary/test_cff.py +++ b/swh/indexer/tests/metadata_dictionary/test_cff.py @@ -1,220 +1,225 @@ -# Copyright (C) 2017-2022 The Software Heritage developers +# Copyright (C) 2021-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from swh.indexer.metadata_dictionary import MAPPINGS def test_compute_metadata_cff(): """ testing CITATION.cff translation """ content = """# YAML 1.2 --- abstract: "Command line program to convert from Citation File \ Format to various other formats such as BibTeX, EndNote, RIS, \ schema.org, CodeMeta, and .zenodo.json." authors: - affiliation: "Netherlands eScience Center" family-names: Klaver given-names: Tom - affiliation: "Humboldt-Universität zu Berlin" family-names: Druskat given-names: Stephan orcid: https://orcid.org/0000-0003-4925-7248 cff-version: "1.0.3" date-released: 2019-11-12 doi: 10.5281/zenodo.1162057 keywords: - "citation" - "bibliography" - "cff" - "CITATION.cff" license: Apache-2.0 message: "If you use this software, please cite it using these metadata." 
repository-code: "https://github.com/citation-file-format/cff-converter-python" title: cffconvert version: "1.4.0-alpha0" """.encode( "utf-8" ) + result = MAPPINGS["CffMapping"]().translate(content) + assert set(result.pop("keywords")) == { + "citation", + "bibliography", + "cff", + "CITATION.cff", + } expected = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "author": [ { "type": "Person", "affiliation": { "type": "Organization", "name": "Netherlands eScience Center", }, "familyName": "Klaver", "givenName": "Tom", }, { "id": "https://orcid.org/0000-0003-4925-7248", "type": "Person", "affiliation": { "type": "Organization", "name": "Humboldt-Universität zu Berlin", }, "familyName": "Druskat", "givenName": "Stephan", }, ], "codeRepository": ( "https://github.com/citation-file-format/cff-converter-python" ), "datePublished": "2019-11-12", "description": """Command line program to convert from \ Citation File Format to various other formats such as BibTeX, EndNote, \ RIS, schema.org, CodeMeta, and .zenodo.json.""", "identifier": "https://doi.org/10.5281/zenodo.1162057", - "keywords": ["citation", "bibliography", "cff", "CITATION.cff"], "license": "https://spdx.org/licenses/Apache-2.0", "version": "1.4.0-alpha0", } - result = MAPPINGS["CffMapping"]().translate(content) assert expected == result def test_compute_metadata_cff_invalid_yaml(): """ test yaml translation for invalid yaml file """ content = """cff-version: 1.0.3 message: To cite the SigMF specification, please include the following: authors: - name: The GNU Radio Foundation, Inc. """.encode( "utf-8" ) expected = None result = MAPPINGS["CffMapping"]().translate(content) assert expected == result def test_compute_metadata_cff_empty(): """ test yaml translation for empty yaml file """ content = """ """.encode( "utf-8" ) expected = None result = MAPPINGS["CffMapping"]().translate(content) assert expected == result def test_compute_metadata_cff_list(): """ test yaml translation for empty yaml file """ content = """ - Foo - Bar """.encode( "utf-8" ) expected = None result = MAPPINGS["CffMapping"]().translate(content) assert expected == result def test_cff_empty_fields(): """ testing CITATION.cff translation """ content = """# YAML 1.2 authors: - affiliation: "Hogwarts" family-names: given-names: Harry - affiliation: "Ministry of Magic" family-names: Weasley orcid: given-names: Arthur """.encode( "utf-8" ) expected = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "author": [ { "type": "Person", "affiliation": { "type": "Organization", "name": "Hogwarts", }, "givenName": "Harry", }, { "type": "Person", "affiliation": { "type": "Organization", "name": "Ministry of Magic", }, "familyName": "Weasley", "givenName": "Arthur", }, ], } result = MAPPINGS["CffMapping"]().translate(content) assert expected == result def test_cff_invalid_fields(): """ testing CITATION.cff translation """ content = """# YAML 1.2 authors: - affiliation: "Hogwarts" family-names: - Potter - James given-names: Harry """.encode( "utf-8" ) expected = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "author": [ { "type": "Person", "affiliation": { "type": "Organization", "name": "Hogwarts", }, "givenName": "Harry", }, ], } result = MAPPINGS["CffMapping"]().translate(content) assert expected == result diff --git a/swh/indexer/tests/metadata_dictionary/test_codemeta.py b/swh/indexer/tests/metadata_dictionary/test_codemeta.py index 383b4a7..21865ee 
100644 --- a/swh/indexer/tests/metadata_dictionary/test_codemeta.py +++ b/swh/indexer/tests/metadata_dictionary/test_codemeta.py @@ -1,175 +1,367 @@ # Copyright (C) 2017-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import json from hypothesis import HealthCheck, given, settings from swh.indexer.codemeta import CODEMETA_TERMS from swh.indexer.metadata_detector import detect_metadata from swh.indexer.metadata_dictionary import MAPPINGS from ..utils import json_document_strategy def test_compute_metadata_valid_codemeta(): raw_content = b"""{ "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "@type": "SoftwareSourceCode", "identifier": "CodeMeta", "description": "CodeMeta is a concept vocabulary that can be used to standardize the exchange of software metadata across repositories and organizations.", "name": "CodeMeta: Minimal metadata schemas for science software and code, in JSON-LD", "codeRepository": "https://github.com/codemeta/codemeta", "issueTracker": "https://github.com/codemeta/codemeta/issues", "license": "https://spdx.org/licenses/Apache-2.0", "version": "2.0", "author": [ { "@type": "Person", "givenName": "Carl", "familyName": "Boettiger", "email": "cboettig@gmail.com", "@id": "http://orcid.org/0000-0002-1642-628X" }, { "@type": "Person", "givenName": "Matthew B.", "familyName": "Jones", "email": "jones@nceas.ucsb.edu", "@id": "http://orcid.org/0000-0003-0077-4738" } ], "maintainer": { "@type": "Person", "givenName": "Carl", "familyName": "Boettiger", "email": "cboettig@gmail.com", "@id": "http://orcid.org/0000-0002-1642-628X" }, "contIntegration": "https://travis-ci.org/codemeta/codemeta", "developmentStatus": "active", "downloadUrl": "https://github.com/codemeta/codemeta/archive/2.0.zip", "funder": { "@id": "https://doi.org/10.13039/100000001", "@type": "Organization", "name": "National Science Foundation" }, "funding":"1549758; Codemeta: A Rosetta Stone for Metadata in Scientific Software", "keywords": [ "metadata", "software" ], "version":"2.0", "dateCreated":"2017-06-05", "datePublished":"2017-06-05", "programmingLanguage": "JSON-LD" }""" # noqa expected_result = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "identifier": "CodeMeta", "description": "CodeMeta is a concept vocabulary that can " "be used to standardize the exchange of software metadata " "across repositories and organizations.", "name": "CodeMeta: Minimal metadata schemas for science " "software and code, in JSON-LD", "codeRepository": "https://github.com/codemeta/codemeta", "issueTracker": "https://github.com/codemeta/codemeta/issues", "license": "https://spdx.org/licenses/Apache-2.0", "version": "2.0", "author": [ { "type": "Person", "givenName": "Carl", "familyName": "Boettiger", "email": "cboettig@gmail.com", "id": "http://orcid.org/0000-0002-1642-628X", }, { "type": "Person", "givenName": "Matthew B.", "familyName": "Jones", "email": "jones@nceas.ucsb.edu", "id": "http://orcid.org/0000-0003-0077-4738", }, ], "maintainer": { "type": "Person", "givenName": "Carl", "familyName": "Boettiger", "email": "cboettig@gmail.com", "id": "http://orcid.org/0000-0002-1642-628X", }, "contIntegration": "https://travis-ci.org/codemeta/codemeta", "developmentStatus": "active", "downloadUrl": "https://github.com/codemeta/codemeta/archive/2.0.zip", "funder": { "id": 
"https://doi.org/10.13039/100000001", "type": "Organization", "name": "National Science Foundation", }, "funding": "1549758; Codemeta: A Rosetta Stone for Metadata " "in Scientific Software", "keywords": ["metadata", "software"], "version": "2.0", "dateCreated": "2017-06-05", "datePublished": "2017-06-05", "programmingLanguage": "JSON-LD", } result = MAPPINGS["CodemetaMapping"]().translate(raw_content) assert result == expected_result def test_compute_metadata_codemeta_alternate_context(): raw_content = b"""{ "@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld", "@type": "SoftwareSourceCode", "identifier": "CodeMeta" }""" # noqa expected_result = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "identifier": "CodeMeta", } result = MAPPINGS["CodemetaMapping"]().translate(raw_content) assert result == expected_result @settings(suppress_health_check=[HealthCheck.too_slow]) @given(json_document_strategy(keys=CODEMETA_TERMS)) def test_codemeta_adversarial(doc): raw = json.dumps(doc).encode() MAPPINGS["CodemetaMapping"]().translate(raw) def test_detect_metadata_codemeta_json_uppercase(): df = [ { "sha1_git": b"abc", "name": b"index.html", "target": b"abc", "length": 897, "status": "visible", "type": "file", "perms": 33188, "dir_id": b"dir_a", "sha1": b"bcd", }, { "sha1_git": b"aab", "name": b"CODEMETA.json", "target": b"aab", "length": 712, "status": "visible", "type": "file", "perms": 33188, "dir_id": b"dir_a", "sha1": b"bcd", }, ] results = detect_metadata(df) expected_results = {"CodemetaMapping": [b"bcd"]} assert expected_results == results + + +def test_sword_default_xmlns(): + content = """ + + My Software + + Author 1 + foo@example.org + + + Author 2 + + + """ + + result = MAPPINGS["SwordCodemetaMapping"]().translate(content) + assert result == { + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + "name": "My Software", + "author": [ + {"name": "Author 1", "email": "foo@example.org"}, + {"name": "Author 2"}, + ], + } + + +def test_sword_basics(): + content = """ + + My Software + + Author 1 + foo@example.org + + + Author 2 + + + Author 3 + bar@example.org + + + """ + + result = MAPPINGS["SwordCodemetaMapping"]().translate(content) + assert result == { + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + "name": "My Software", + "author": [ + {"name": "Author 1", "email": "foo@example.org"}, + {"name": "Author 2"}, + {"name": "Author 3", "email": "bar@example.org"}, + ], + } + + +def test_sword_mixed(): + content = """ + + My Software + blah + 1.2.3 + blih + + """ + + result = MAPPINGS["SwordCodemetaMapping"]().translate(content) + assert result == { + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + "name": "My Software", + "version": "1.2.3", + } + + +def test_sword_schemaorg_in_codemeta(): + content = """ + + My Software + 1.2.3 + + """ + + result = MAPPINGS["SwordCodemetaMapping"]().translate(content) + assert result == { + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + "name": "My Software", + "version": "1.2.3", + } + + +def test_sword_schemaorg_in_codemeta_constrained(): + """Resulting property has the compact URI 'schema:url' instead of just + the term 'url', because term 'url' is defined by the Codemeta schema + has having type '@id'.""" + content = """ + + My Software + http://example.org/my-software + + """ + + result = MAPPINGS["SwordCodemetaMapping"]().translate(content) + assert result == { + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + 
"name": "My Software", + "schema:url": "http://example.org/my-software", + } + + +def test_sword_schemaorg_not_in_codemeta(): + content = """ + + My Software + http://example.org/my-software + + """ + + result = MAPPINGS["SwordCodemetaMapping"]().translate(content) + assert result == { + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + "name": "My Software", + "schema:sameAs": "http://example.org/my-software", + } + + +def test_sword_atom_name(): + content = """ + + My Software + + """ + + result = MAPPINGS["SwordCodemetaMapping"]().translate(content) + assert result == { + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + "name": "My Software", + } + + +def test_sword_multiple_names(): + content = """ + + Atom Name 1 + Atom Name 2 + Atom Title 1 + Atom Title 2 + Codemeta Name 1 + Codemeta Name 2 + + """ + + result = MAPPINGS["SwordCodemetaMapping"]().translate(content) + assert result == { + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + "name": [ + "Atom Name 1", + "Atom Name 2", + "Atom Title 1", + "Atom Title 2", + "Codemeta Name 1", + "Codemeta Name 2", + ], + } + + +def test_json_sword(): + content = """{"id": "hal-01243573", "@xmlns": "http://www.w3.org/2005/Atom", "author": {"name": "Author 1", "email": "foo@example.org"}, "client": "hal", "codemeta:url": "http://example.org/", "codemeta:name": "The assignment problem", "@xmlns:codemeta": "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0", "codemeta:author": {"codemeta:name": "Author 2"}, "codemeta:license": {"codemeta:name": "GNU General Public License v3.0 or later"}}""" # noqa + result = MAPPINGS["JsonSwordCodemetaMapping"]().translate(content) + assert result == { + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + "author": [ + {"name": "Author 1", "email": "foo@example.org"}, + {"name": "Author 2"}, + ], + "license": {"name": "GNU General Public License v3.0 or later"}, + "name": "The assignment problem", + "schema:url": "http://example.org/", + "name": "The assignment problem", + } diff --git a/swh/indexer/tests/metadata_dictionary/test_composer.py b/swh/indexer/tests/metadata_dictionary/test_composer.py index 9513938..809ac01 100644 --- a/swh/indexer/tests/metadata_dictionary/test_composer.py +++ b/swh/indexer/tests/metadata_dictionary/test_composer.py @@ -1,84 +1,89 @@ # Copyright (C) 2017-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from swh.indexer.metadata_dictionary import MAPPINGS def test_compute_metadata_composer(): raw_content = """{ "name": "symfony/polyfill-mbstring", "type": "library", "description": "Symfony polyfill for the Mbstring extension", "keywords": [ "polyfill", "shim", "compatibility", "portable" ], "homepage": "https://symfony.com", "license": "MIT", "authors": [ { "name": "Nicolas Grekas", "email": "p@tchwork.com" }, { "name": "Symfony Community", "homepage": "https://symfony.com/contributors" } ], "require": { "php": ">=7.1" }, "provide": { "ext-mbstring": "*" }, "autoload": { "files": [ "bootstrap.php" ] }, "suggest": { "ext-mbstring": "For best performance" }, "minimum-stability": "dev", "extra": { "branch-alias": { "dev-main": "1.26-dev" }, "thanks": { "name": "symfony/polyfill", "url": "https://github.com/symfony/polyfill" } } } """.encode( "utf-8" ) result = MAPPINGS["ComposerMapping"]().translate(raw_content) + assert set(result.pop("keywords")) == { + "polyfill", + 
"shim", + "compatibility", + "portable", + }, result expected = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "symfony/polyfill-mbstring", - "keywords": ["polyfill", "shim", "compatibility", "portable"], "description": "Symfony polyfill for the Mbstring extension", "url": "https://symfony.com", "license": "https://spdx.org/licenses/MIT", "author": [ { "type": "Person", "name": "Nicolas Grekas", "email": "p@tchwork.com", }, { "type": "Person", "name": "Symfony Community", }, ], } assert result == expected diff --git a/swh/indexer/tests/metadata_dictionary/test_dart.py b/swh/indexer/tests/metadata_dictionary/test_dart.py index 146f7c7..956d088 100644 --- a/swh/indexer/tests/metadata_dictionary/test_dart.py +++ b/swh/indexer/tests/metadata_dictionary/test_dart.py @@ -1,157 +1,160 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information +import pytest + from swh.indexer.metadata_dictionary import MAPPINGS def test_compute_metadata_pubspec(): raw_content = """ --- name: newtify description: >- Have you been turned into a newt? Would you like to be? This package can help. It has all of the newt-transmogrification functionality you have been looking for. keywords: - polyfill - shim - compatibility - portable - mbstring version: 1.2.3 license: MIT homepage: https://example-pet-store.com/newtify documentation: https://example-pet-store.com/newtify/docs environment: sdk: '>=2.10.0 <3.0.0' dependencies: efts: ^2.0.4 transmogrify: ^0.4.0 dev_dependencies: test: '>=1.15.0 <2.0.0' """.encode( "utf-8" ) result = MAPPINGS["PubMapping"]().translate(raw_content) + assert set(result.pop("keywords")) == { + "polyfill", + "shim", + "compatibility", + "portable", + "mbstring", + }, result expected = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "newtify", - "keywords": [ - "polyfill", - "shim", - "compatibility", - "portable", - "mbstring", - ], "description": """Have you been turned into a newt? Would you like to be? \ This package can help. 
It has all of the \ newt-transmogrification functionality you have been looking \ for.""", "url": "https://example-pet-store.com/newtify", "license": "https://spdx.org/licenses/MIT", } assert result == expected def test_normalize_author_pubspec(): raw_content = """ author: Atlee Pine """.encode( "utf-8" ) result = MAPPINGS["PubMapping"]().translate(raw_content) expected = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "author": [ {"type": "Person", "name": "Atlee Pine", "email": "atlee@example.org"}, ], } assert result == expected def test_normalize_authors_pubspec(): raw_content = """ authors: - Vicky Merzown - Ron Bilius Weasley """.encode( "utf-8" ) result = MAPPINGS["PubMapping"]().translate(raw_content) expected = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "author": [ {"type": "Person", "name": "Vicky Merzown", "email": "vmz@example.org"}, { "type": "Person", "name": "Ron Bilius Weasley", }, ], } assert result == expected +@pytest.mark.xfail(reason="https://github.com/w3c/json-ld-api/issues/547") def test_normalize_author_authors_pubspec(): raw_content = """ authors: - Vicky Merzown - Ron Bilius Weasley author: Hermione Granger """.encode( "utf-8" ) result = MAPPINGS["PubMapping"]().translate(raw_content) expected = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "author": [ {"type": "Person", "name": "Vicky Merzown", "email": "vmz@example.org"}, { "type": "Person", "name": "Ron Bilius Weasley", }, { "type": "Person", "name": "Hermione Granger", }, ], } assert result == expected def test_normalize_empty_authors(): raw_content = """ authors: """.encode( "utf-8" ) result = MAPPINGS["PubMapping"]().translate(raw_content) expected = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", } assert result == expected diff --git a/swh/indexer/tests/metadata_dictionary/test_github.py b/swh/indexer/tests/metadata_dictionary/test_github.py index 290d91c..c0592dc 100644 --- a/swh/indexer/tests/metadata_dictionary/test_github.py +++ b/swh/indexer/tests/metadata_dictionary/test_github.py @@ -1,142 +1,142 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from swh.indexer.metadata_dictionary import MAPPINGS CONTEXT = [ "https://doi.org/10.5063/schema/codemeta-2.0", { "as": "https://www.w3.org/ns/activitystreams#", "forge": "https://forgefed.org/ns#", }, ] def test_compute_metadata_none(): """ testing content empty content is empty should return None """ content = b"" # None if no metadata was found or an error occurred declared_metadata = None result = MAPPINGS["GitHubMapping"]().translate(content) assert declared_metadata == result def test_supported_terms(): terms = MAPPINGS["GitHubMapping"].supported_terms() assert { "http://schema.org/name", "http://schema.org/license", "https://forgefed.org/ns#forks", "https://www.w3.org/ns/activitystreams#totalItems", } <= terms def test_compute_metadata_github(): """ testing only computation of metadata with hard_mapping_npm """ content = b""" { "id": 80521091, "node_id": "MDEwOlJlcG9zaXRvcnk4MDUyMTA5MQ==", "name": "swh-indexer", "full_name": "SoftwareHeritage/swh-indexer", "private": false, "owner": { "login": "SoftwareHeritage", "id": 18555939, "node_id": "MDEyOk9yZ2FuaXphdGlvbjE4NTU1OTM5", 
"avatar_url": "https://avatars.githubusercontent.com/u/18555939?v=4", "gravatar_id": "", "url": "https://api.github.com/users/SoftwareHeritage", "type": "Organization", "site_admin": false }, "html_url": "https://github.com/SoftwareHeritage/swh-indexer", "description": "GitHub mirror of Metadata indexer", "fork": false, "url": "https://api.github.com/repos/SoftwareHeritage/swh-indexer", "created_at": "2017-01-31T13:05:39Z", "updated_at": "2022-06-22T08:02:20Z", "pushed_at": "2022-06-29T09:01:08Z", "git_url": "git://github.com/SoftwareHeritage/swh-indexer.git", "ssh_url": "git@github.com:SoftwareHeritage/swh-indexer.git", "clone_url": "https://github.com/SoftwareHeritage/swh-indexer.git", "svn_url": "https://github.com/SoftwareHeritage/swh-indexer", "homepage": "https://forge.softwareheritage.org/source/swh-indexer/", "size": 2713, "stargazers_count": 13, "watchers_count": 12, "language": "Python", "has_issues": false, "has_projects": false, "has_downloads": true, "has_wiki": false, "has_pages": false, "forks_count": 1, "mirror_url": null, "archived": false, "disabled": false, "open_issues_count": 0, "license": { "key": "gpl-3.0", "name": "GNU General Public License v3.0", "spdx_id": "GPL-3.0", "url": "https://api.github.com/licenses/gpl-3.0", "node_id": "MDc6TGljZW5zZTk=" }, "allow_forking": true, "is_template": false, "web_commit_signoff_required": false, "topics": [ ], "visibility": "public", "forks": 1, "open_issues": 0, "watchers": 13, "default_branch": "master", "temp_clone_token": null, "organization": { "login": "SoftwareHeritage", "id": 18555939, "node_id": "MDEyOk9yZ2FuaXphdGlvbjE4NTU1OTM5", "avatar_url": "https://avatars.githubusercontent.com/u/18555939?v=4", "gravatar_id": "", "type": "Organization", "site_admin": false }, "network_count": 1, "subscribers_count": 6 } """ result = MAPPINGS["GitHubMapping"]().translate(content) assert result == { "@context": CONTEXT, - "type": "https://forgefed.org/ns#Repository", + "type": "forge:Repository", "forge:forks": { "as:totalItems": 1, "type": "as:OrderedCollection", }, "as:likes": { "as:totalItems": 13, "type": "as:Collection", }, "as:followers": { "as:totalItems": 12, "type": "as:Collection", }, "license": "https://spdx.org/licenses/GPL-3.0", "name": "SoftwareHeritage/swh-indexer", "description": "GitHub mirror of Metadata indexer", "schema:codeRepository": "https://github.com/SoftwareHeritage/swh-indexer", "schema:dateCreated": "2017-01-31T13:05:39Z", "schema:dateModified": "2022-06-22T08:02:20Z", } diff --git a/swh/indexer/tests/metadata_dictionary/test_maven.py b/swh/indexer/tests/metadata_dictionary/test_maven.py index ea51860..0267e95 100644 --- a/swh/indexer/tests/metadata_dictionary/test_maven.py +++ b/swh/indexer/tests/metadata_dictionary/test_maven.py @@ -1,365 +1,365 @@ # Copyright (C) 2017-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import logging from hypothesis import HealthCheck, given, settings from swh.indexer.metadata_dictionary import MAPPINGS from ..utils import xml_document_strategy def test_compute_metadata_maven(): raw_content = b""" Maven Default Project 4.0.0 com.mycompany.app my-app 1.2.3 central Maven Repository Switchboard default http://repo1.maven.org/maven2 false Apache License, Version 2.0 https://www.apache.org/licenses/LICENSE-2.0.txt repo A business-friendly OSS license """ result = 
MAPPINGS["MavenMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "Maven Default Project", - "identifier": "com.mycompany.app", + "schema:identifier": "com.mycompany.app", "version": "1.2.3", "license": "https://www.apache.org/licenses/LICENSE-2.0.txt", "codeRepository": ("http://repo1.maven.org/maven2/com/mycompany/app/my-app"), } def test_compute_metadata_maven_empty(): raw_content = b""" """ result = MAPPINGS["MavenMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", } def test_compute_metadata_maven_almost_empty(): raw_content = b""" """ result = MAPPINGS["MavenMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", } def test_compute_metadata_maven_invalid_xml(caplog): expected_warning = ( "swh.indexer.metadata_dictionary.maven.MavenMapping", logging.WARNING, "Error parsing XML from foo", ) caplog.at_level(logging.WARNING, logger="swh.indexer.metadata_dictionary") raw_content = b""" """ caplog.clear() result = MAPPINGS["MavenMapping"]("foo").translate(raw_content) assert caplog.record_tuples == [expected_warning], result assert result is None raw_content = b""" """ caplog.clear() result = MAPPINGS["MavenMapping"]("foo").translate(raw_content) assert caplog.record_tuples == [expected_warning], result assert result is None def test_compute_metadata_maven_unknown_encoding(caplog): expected_warning = ( "swh.indexer.metadata_dictionary.maven.MavenMapping", logging.WARNING, "Error detecting XML encoding from foo", ) caplog.at_level(logging.WARNING, logger="swh.indexer.metadata_dictionary") raw_content = b""" """ caplog.clear() result = MAPPINGS["MavenMapping"]("foo").translate(raw_content) assert caplog.record_tuples == [expected_warning], result assert result is None raw_content = b""" """ caplog.clear() result = MAPPINGS["MavenMapping"]("foo").translate(raw_content) assert caplog.record_tuples == [expected_warning], result assert result is None def test_compute_metadata_maven_invalid_encoding(caplog): expected_warning = [ # libexpat1 <= 2.2.10-2+deb11u1 [ ( "swh.indexer.metadata_dictionary.maven.MavenMapping", logging.WARNING, "Error unidecoding XML from foo", ) ], # libexpat1 >= 2.2.10-2+deb11u2 [ ( "swh.indexer.metadata_dictionary.maven.MavenMapping", logging.WARNING, "Error parsing XML from foo", ) ], ] caplog.at_level(logging.WARNING, logger="swh.indexer.metadata_dictionary") raw_content = b""" """ caplog.clear() result = MAPPINGS["MavenMapping"]("foo").translate(raw_content) assert caplog.record_tuples in expected_warning, result assert result is None def test_compute_metadata_maven_minimal(): raw_content = b""" Maven Default Project 4.0.0 com.mycompany.app my-app 1.2.3 """ result = MAPPINGS["MavenMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "Maven Default Project", - "identifier": "com.mycompany.app", + "schema:identifier": "com.mycompany.app", "version": "1.2.3", "codeRepository": ( "https://repo.maven.apache.org/maven2/com/mycompany/app/my-app" ), } def test_compute_metadata_maven_empty_nodes(): raw_content = b""" Maven Default Project 4.0.0 com.mycompany.app my-app 1.2.3 """ result = MAPPINGS["MavenMapping"]().translate(raw_content) assert result == { "@context": 
"https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "Maven Default Project", - "identifier": "com.mycompany.app", + "schema:identifier": "com.mycompany.app", "version": "1.2.3", "codeRepository": ( "https://repo.maven.apache.org/maven2/com/mycompany/app/my-app" ), } raw_content = b""" Maven Default Project 4.0.0 com.mycompany.app my-app """ result = MAPPINGS["MavenMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "Maven Default Project", - "identifier": "com.mycompany.app", + "schema:identifier": "com.mycompany.app", "codeRepository": ( "https://repo.maven.apache.org/maven2/com/mycompany/app/my-app" ), } raw_content = b""" 4.0.0 com.mycompany.app my-app 1.2.3 """ result = MAPPINGS["MavenMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", - "identifier": "com.mycompany.app", + "schema:identifier": "com.mycompany.app", "version": "1.2.3", "codeRepository": ( "https://repo.maven.apache.org/maven2/com/mycompany/app/my-app" ), } raw_content = b""" Maven Default Project 4.0.0 com.mycompany.app my-app 1.2.3 """ result = MAPPINGS["MavenMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "Maven Default Project", - "identifier": "com.mycompany.app", + "schema:identifier": "com.mycompany.app", "version": "1.2.3", "codeRepository": ( "https://repo.maven.apache.org/maven2/com/mycompany/app/my-app" ), } raw_content = b""" 1.2.3 """ result = MAPPINGS["MavenMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "version": "1.2.3", } def test_compute_metadata_maven_invalid_licenses(): raw_content = b""" Maven Default Project 4.0.0 com.mycompany.app my-app 1.2.3 foo """ result = MAPPINGS["MavenMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "Maven Default Project", - "identifier": "com.mycompany.app", + "schema:identifier": "com.mycompany.app", "version": "1.2.3", "codeRepository": ( "https://repo.maven.apache.org/maven2/com/mycompany/app/my-app" ), } def test_compute_metadata_maven_multiple(): """Tests when there are multiple code repos and licenses.""" raw_content = b""" Maven Default Project 4.0.0 com.mycompany.app my-app 1.2.3 central Maven Repository Switchboard default http://repo1.maven.org/maven2 false example Example Maven Repo default http://example.org/maven2 Apache License, Version 2.0 https://www.apache.org/licenses/LICENSE-2.0.txt repo A business-friendly OSS license MIT license https://opensource.org/licenses/MIT """ result = MAPPINGS["MavenMapping"]().translate(raw_content) + assert set(result.pop("license")) == { + "https://www.apache.org/licenses/LICENSE-2.0.txt", + "https://opensource.org/licenses/MIT", + }, result + assert set(result.pop("codeRepository")) == { + "http://repo1.maven.org/maven2/com/mycompany/app/my-app", + "http://example.org/maven2/com/mycompany/app/my-app", + }, result assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "Maven Default Project", - "identifier": "com.mycompany.app", + "schema:identifier": "com.mycompany.app", "version": "1.2.3", - "license": [ - 
"https://www.apache.org/licenses/LICENSE-2.0.txt", - "https://opensource.org/licenses/MIT", - ], - "codeRepository": [ - "http://repo1.maven.org/maven2/com/mycompany/app/my-app", - "http://example.org/maven2/com/mycompany/app/my-app", - ], } @settings(suppress_health_check=[HealthCheck.too_slow]) @given( xml_document_strategy( keys=list(MAPPINGS["MavenMapping"].mapping), # type: ignore root="project", xmlns="http://maven.apache.org/POM/4.0.0", ) ) def test_maven_adversarial(doc): MAPPINGS["MavenMapping"]().translate(doc) diff --git a/swh/indexer/tests/metadata_dictionary/test_npm.py b/swh/indexer/tests/metadata_dictionary/test_npm.py index 781e995..000cb7c 100644 --- a/swh/indexer/tests/metadata_dictionary/test_npm.py +++ b/swh/indexer/tests/metadata_dictionary/test_npm.py @@ -1,318 +1,313 @@ # Copyright (C) 2017-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import json from hypothesis import HealthCheck, given, settings import pytest from swh.indexer.metadata_detector import detect_metadata from swh.indexer.metadata_dictionary import MAPPINGS from swh.indexer.storage.model import ContentMetadataRow from ..test_metadata import TRANSLATOR_TOOL, ContentMetadataTestIndexer from ..utils import ( BASE_TEST_CONFIG, MAPPING_DESCRIPTION_CONTENT_SHA1, json_document_strategy, ) def test_compute_metadata_none(): """ testing content empty content is empty should return None """ content = b"" # None if no metadata was found or an error occurred declared_metadata = None result = MAPPINGS["NpmMapping"]().translate(content) assert declared_metadata == result def test_compute_metadata_npm(): """ testing only computation of metadata with hard_mapping_npm """ content = b""" { "name": "test_metadata", "version": "0.0.2", "description": "Simple package.json test for indexer", "repository": { "type": "git", "url": "https://github.com/moranegg/metadata_test" }, "author": { "email": "moranegg@example.com", "name": "Morane G" } } """ declared_metadata = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "test_metadata", "version": "0.0.2", "description": "Simple package.json test for indexer", "codeRepository": "git+https://github.com/moranegg/metadata_test", "author": [ { "type": "Person", "name": "Morane G", "email": "moranegg@example.com", } ], } result = MAPPINGS["NpmMapping"]().translate(content) assert declared_metadata == result def test_compute_metadata_invalid_description_npm(): """ testing only computation of metadata with hard_mapping_npm """ content = b""" { "name": "test_metadata", "version": "0.0.2", "description": 1234 } """ declared_metadata = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "test_metadata", "version": "0.0.2", } result = MAPPINGS["NpmMapping"]().translate(content) assert declared_metadata == result def test_index_content_metadata_npm(storage, obj_storage): """ testing NPM with package.json - one sha1 uses a file that can't be translated to metadata and should return None in the translated metadata """ sha1s = [ MAPPING_DESCRIPTION_CONTENT_SHA1["json:test-metadata-package.json"], MAPPING_DESCRIPTION_CONTENT_SHA1["json:npm-package.json"], MAPPING_DESCRIPTION_CONTENT_SHA1["python:code"], ] # this metadata indexer computes only metadata for package.json # in npm context with a hard mapping config = 
BASE_TEST_CONFIG.copy() config["tools"] = [TRANSLATOR_TOOL] metadata_indexer = ContentMetadataTestIndexer(config=config) metadata_indexer.run(sha1s, log_suffix="unknown content") results = list(metadata_indexer.idx_storage.content_metadata_get(sha1s)) expected_results = [ ContentMetadataRow( id=sha1s[0], tool=TRANSLATOR_TOOL, metadata={ "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "codeRepository": "git+https://github.com/moranegg/metadata_test", "description": "Simple package.json test for indexer", "name": "test_metadata", "version": "0.0.1", }, ), ContentMetadataRow( id=sha1s[1], tool=TRANSLATOR_TOOL, metadata={ "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "issueTracker": "https://github.com/npm/npm/issues", "author": [ { "type": "Person", "name": "Isaac Z. Schlueter", "email": "i@izs.me", "url": "http://blog.izs.me", } ], "codeRepository": "git+https://github.com/npm/npm", "description": "a package manager for JavaScript", "license": "https://spdx.org/licenses/Artistic-2.0", "version": "5.0.3", "name": "npm", - "keywords": [ - "install", - "modules", - "package manager", - "package.json", - ], "url": "https://docs.npmjs.com/", }, ), ] for result in results: del result.tool["id"] + result.metadata.pop("keywords", None) # The assertion below returns False sometimes because of nested lists assert expected_results == results def test_npm_bugs_normalization(): # valid dictionary package_json = b"""{ "name": "foo", "bugs": { "url": "https://github.com/owner/project/issues", "email": "foo@example.com" } }""" result = MAPPINGS["NpmMapping"]().translate(package_json) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "name": "foo", "issueTracker": "https://github.com/owner/project/issues", "type": "SoftwareSourceCode", } # "invalid" dictionary package_json = b"""{ "name": "foo", "bugs": { "email": "foo@example.com" } }""" result = MAPPINGS["NpmMapping"]().translate(package_json) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "name": "foo", "type": "SoftwareSourceCode", } # string package_json = b"""{ "name": "foo", "bugs": "https://github.com/owner/project/issues" }""" result = MAPPINGS["NpmMapping"]().translate(package_json) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "name": "foo", "issueTracker": "https://github.com/owner/project/issues", "type": "SoftwareSourceCode", } def test_npm_repository_normalization(): # normal package_json = b"""{ "name": "foo", "repository": { "type" : "git", "url" : "https://github.com/npm/cli.git" } }""" result = MAPPINGS["NpmMapping"]().translate(package_json) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "name": "foo", "codeRepository": "git+https://github.com/npm/cli.git", "type": "SoftwareSourceCode", } # missing url package_json = b"""{ "name": "foo", "repository": { "type" : "git" } }""" result = MAPPINGS["NpmMapping"]().translate(package_json) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "name": "foo", "type": "SoftwareSourceCode", } # github shortcut package_json = b"""{ "name": "foo", "repository": "github:npm/cli" }""" result = MAPPINGS["NpmMapping"]().translate(package_json) expected_result = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "name": "foo", "codeRepository": "git+https://github.com/npm/cli.git", "type": "SoftwareSourceCode", } assert result == expected_result # github shortshortcut 
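(Illustrative aside, not part of the diff.) Several assertions in these mapping tests were rewritten to pop list-valued fields such as "keywords" and compare them as sets, because the JSON-LD processing behind the mappings does not guarantee element order. A minimal, standalone sketch of that pattern, using made-up dictionaries rather than real indexer output:

def assert_metadata_equal(result, expected, unordered_keys=("keywords", "license")):
    # Work on copies so the caller's dicts are left untouched.
    result, expected = dict(result), dict(expected)
    for key in unordered_keys:
        # List-valued fields are compared as sets; their order is irrelevant.
        if key in result or key in expected:
            assert set(result.pop(key, [])) == set(expected.pop(key, [])), key
    # Everything else must match exactly.
    assert result == expected

assert_metadata_equal(
    {"name": "example", "keywords": ["shim", "polyfill"]},
    {"name": "example", "keywords": ["polyfill", "shim"]},
)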
package_json = b"""{ "name": "foo", "repository": "npm/cli" }""" result = MAPPINGS["NpmMapping"]().translate(package_json) assert result == expected_result # gitlab shortcut package_json = b"""{ "name": "foo", "repository": "gitlab:user/repo" }""" result = MAPPINGS["NpmMapping"]().translate(package_json) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "name": "foo", "codeRepository": "git+https://gitlab.com/user/repo.git", "type": "SoftwareSourceCode", } @settings(suppress_health_check=[HealthCheck.too_slow]) @given(json_document_strategy(keys=list(MAPPINGS["NpmMapping"].mapping))) # type: ignore def test_npm_adversarial(doc): raw = json.dumps(doc).encode() MAPPINGS["NpmMapping"]().translate(raw) @pytest.mark.parametrize( "filename", [b"package.json", b"Package.json", b"PACKAGE.json", b"PACKAGE.JSON"] ) def test_detect_metadata_package_json(filename): df = [ { "sha1_git": b"abc", "name": b"index.js", "target": b"abc", "length": 897, "status": "visible", "type": "file", "perms": 33188, "dir_id": b"dir_a", "sha1": b"bcd", }, { "sha1_git": b"aab", "name": filename, "target": b"aab", "length": 712, "status": "visible", "type": "file", "perms": 33188, "dir_id": b"dir_a", "sha1": b"cde", }, ] results = detect_metadata(df) expected_results = {"NpmMapping": [b"cde"]} assert expected_results == results diff --git a/swh/indexer/tests/metadata_dictionary/test_nuget.py b/swh/indexer/tests/metadata_dictionary/test_nuget.py new file mode 100644 index 0000000..e83ad6f --- /dev/null +++ b/swh/indexer/tests/metadata_dictionary/test_nuget.py @@ -0,0 +1,172 @@ +# Copyright (C) 2022 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +import pytest + +from swh.indexer.metadata_detector import detect_metadata +from swh.indexer.metadata_dictionary import MAPPINGS + + +def test_compute_metadata_nuget(): + raw_content = b""" + + + sample + 1.2.3 + Kim Abercrombie, Franck Halmaert + Sample exists only to show a sample .nuspec file. + Summary is being deprecated. Use description instead. + http://example.org/ + + MIT + https://raw.github.com/timrwood/moment/master/LICENSE + + + + + + See the [changelog](https://github.com/httpie/httpie/releases/tag/3.2.0). + + python3 java cpp search-tag + + + + + """ + + result = MAPPINGS["NuGetMapping"]().translate(raw_content) + + assert set(result.pop("keywords")) == { + "python3", + "java", + "cpp", + "search-tag", + }, result + + assert set(result.pop("license")) == { + "https://spdx.org/licenses/MIT", + "https://raw.github.com/timrwood/moment/master/LICENSE", + }, result + + assert set(result.pop("description")) == { + "Sample exists only to show a sample .nuspec file.", + "Summary is being deprecated. Use description instead.", + }, result + + expected = { + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + "type": "SoftwareSourceCode", + "author": [ + {"type": "Person", "name": "Kim Abercrombie"}, + {"type": "Person", "name": "Franck Halmaert"}, + ], + "codeRepository": "https://github.com/NuGet/NuGet.Client.git", + "url": "http://example.org/", + "version": "1.2.3", + "schema:releaseNotes": ( + "See the [changelog](https://github.com/httpie/httpie/releases/tag/3.2.0)." 
+ ), + } + + assert result == expected + + +@pytest.mark.parametrize( + "filename", + [b"package_name.nuspec", b"number_5.nuspec", b"CAPS.nuspec", b"\x8anan.nuspec"], +) +def test_detect_metadata_package_nuspec(filename): + df = [ + { + "sha1_git": b"abc", + "name": b"example.json", + "target": b"abc", + "length": 897, + "status": "visible", + "type": "file", + "perms": 33188, + "dir_id": b"dir_a", + "sha1": b"bcd", + }, + { + "sha1_git": b"aab", + "name": filename, + "target": b"aab", + "length": 712, + "status": "visible", + "type": "file", + "perms": 33188, + "dir_id": b"dir_a", + "sha1": b"cde", + }, + ] + results = detect_metadata(df) + + expected_results = {"NuGetMapping": [b"cde"]} + assert expected_results == results + + +def test_normalize_license_multiple_licenses_or_delimiter(): + raw_content = raw_content = b""" + + + BitTorrent-1.0 or GPL-3.0-with-GCC-exception + + + + + """ + result = MAPPINGS["NuGetMapping"]().translate(raw_content) + assert set(result.pop("license")) == { + "https://spdx.org/licenses/BitTorrent-1.0", + "https://spdx.org/licenses/GPL-3.0-with-GCC-exception", + } + expected = { + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + "type": "SoftwareSourceCode", + } + + assert result == expected + + +def test_normalize_license_unsupported_delimiter(): + raw_content = raw_content = b""" + + + (MIT) + + + + + """ + result = MAPPINGS["NuGetMapping"]().translate(raw_content) + expected = { + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + "type": "SoftwareSourceCode", + } + + assert result == expected + + +def test_copyrightNotice_absolute_uri_property(): + raw_content = raw_content = b""" + + + Copyright 2017-2022 + en-us + + + + + """ + result = MAPPINGS["NuGetMapping"]().translate(raw_content) + expected = { + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + "type": "SoftwareSourceCode", + "schema:copyrightNotice": "Copyright 2017-2022", + "schema:inLanguage": "en-us", + } + + assert result == expected diff --git a/swh/indexer/tests/metadata_dictionary/test_python.py b/swh/indexer/tests/metadata_dictionary/test_python.py index 106a9ca..dbbabd1 100644 --- a/swh/indexer/tests/metadata_dictionary/test_python.py +++ b/swh/indexer/tests/metadata_dictionary/test_python.py @@ -1,114 +1,113 @@ # Copyright (C) 2017-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from swh.indexer.metadata_dictionary import MAPPINGS def test_compute_metadata_pkginfo(): raw_content = b"""\ Metadata-Version: 2.1 Name: swh.core Version: 0.0.49 Summary: Software Heritage core utilities Home-page: https://forge.softwareheritage.org/diffusion/DCORE/ Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-core Description: swh-core ======== \x20 core library for swh's modules: - config parser - hash computations - serialization - logging mechanism \x20 Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable 
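(Illustrative aside, not part of the diff.) The NuGet license tests added in this patch expect SPDX expressions joined with an OR delimiter to be split into individual https://spdx.org/licenses/ URIs, while anything more complex, such as a parenthesized expression, is simply not translated. A rough sketch of that behaviour, offered only as a guess at the shape of the real NuGetMapping code:

import re

def normalize_license_expression(expression):
    # Split on the OR delimiter (case-insensitive); bail out on anything that
    # does not look like a bare SPDX identifier.
    parts = [p.strip() for p in re.split(r"\s+or\s+", expression, flags=re.IGNORECASE)]
    if any(not p or "(" in p or " " in p for p in parts):
        return None
    return ["https://spdx.org/licenses/" + p for p in parts]

assert normalize_license_expression("BitTorrent-1.0 OR GPL-3.0-with-GCC-exception") == [
    "https://spdx.org/licenses/BitTorrent-1.0",
    "https://spdx.org/licenses/GPL-3.0-with-GCC-exception",
]
assert normalize_license_expression("(MIT)") is None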
Description-Content-Type: text/markdown Provides-Extra: testing """ # noqa result = MAPPINGS["PythonPkginfoMapping"]().translate(raw_content) - assert result["description"] == [ + assert set(result.pop("description")) == { "Software Heritage core utilities", # note the comma here "swh-core\n" "========\n" "\n" "core library for swh's modules:\n" "- config parser\n" "- hash computations\n" "- serialization\n" "- logging mechanism\n" "", - ], result - del result["description"] + }, result assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "url": "https://forge.softwareheritage.org/diffusion/DCORE/", "name": "swh.core", "author": [ { "type": "Person", "name": "Software Heritage developers", "email": "swh-devel@inria.fr", } ], "version": "0.0.49", } def test_compute_metadata_pkginfo_utf8(): raw_content = b"""\ Metadata-Version: 1.1 Name: snowpyt Description-Content-Type: UNKNOWN Description: foo Hydrology N\xc2\xb083 """ # noqa result = MAPPINGS["PythonPkginfoMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "snowpyt", "description": "foo\nHydrology N°83", } def test_compute_metadata_pkginfo_keywords(): raw_content = b"""\ Metadata-Version: 2.1 Name: foo Keywords: foo bar baz """ # noqa result = MAPPINGS["PythonPkginfoMapping"]().translate(raw_content) + assert set(result.pop("keywords")) == {"foo", "bar", "baz"}, result assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "foo", - "keywords": ["foo", "bar", "baz"], } def test_compute_metadata_pkginfo_license(): raw_content = b"""\ Metadata-Version: 2.1 Name: foo License: MIT """ # noqa result = MAPPINGS["PythonPkginfoMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "foo", - "license": "MIT", + "license": "https://spdx.org/licenses/MIT", } diff --git a/swh/indexer/tests/metadata_dictionary/test_ruby.py b/swh/indexer/tests/metadata_dictionary/test_ruby.py index ba2cc30..53e0a0a 100644 --- a/swh/indexer/tests/metadata_dictionary/test_ruby.py +++ b/swh/indexer/tests/metadata_dictionary/test_ruby.py @@ -1,134 +1,136 @@ # Copyright (C) 2017-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from hypothesis import HealthCheck, given, settings, strategies +import pytest from swh.indexer.metadata_dictionary import MAPPINGS def test_gemspec_base(): raw_content = b""" Gem::Specification.new do |s| s.name = 'example' s.version = '0.1.0' s.licenses = ['MIT'] s.summary = "This is an example!" s.description = "Much longer explanation of the example!" 
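(Illustrative aside, not part of the diff.) The pkg-info license test above now expects the bare "MIT" string to be normalized into an SPDX URI. A minimal sketch of that idea; the KNOWN_SPDX_IDS set is a stand-in, since the real mapping decides elsewhere which identifiers are valid:

KNOWN_SPDX_IDS = {"MIT", "Apache-2.0", "GPL-3.0", "Artistic-2.0"}  # stand-in list

def license_to_uri(value):
    # Known SPDX identifiers become canonical spdx.org URIs; anything else is
    # left as-is (a guess, for illustration only).
    if value in KNOWN_SPDX_IDS:
        return "https://spdx.org/licenses/" + value
    return value

assert license_to_uri("MIT") == "https://spdx.org/licenses/MIT"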
s.authors = ["Ruby Coder"] s.email = 'rubycoder@example.com' s.files = ["lib/example.rb"] s.homepage = 'https://rubygems.org/gems/example' s.metadata = { "source_code_uri" => "https://github.com/example/example" } end""" result = MAPPINGS["GemspecMapping"]().translate(raw_content) assert set(result.pop("description")) == { "This is an example!", "Much longer explanation of the example!", } assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "author": [{"type": "Person", "name": "Ruby Coder"}], "name": "example", "license": "https://spdx.org/licenses/MIT", "codeRepository": "https://rubygems.org/gems/example", "email": "rubycoder@example.com", "version": "0.1.0", } +@pytest.mark.xfail(reason="https://github.com/w3c/json-ld-api/issues/547") def test_gemspec_two_author_fields(): raw_content = b""" Gem::Specification.new do |s| s.authors = ["Ruby Coder1"] s.author = "Ruby Coder2" end""" result = MAPPINGS["GemspecMapping"]().translate(raw_content) assert result.pop("author") in ( [ {"type": "Person", "name": "Ruby Coder1"}, {"type": "Person", "name": "Ruby Coder2"}, ], [ {"type": "Person", "name": "Ruby Coder2"}, {"type": "Person", "name": "Ruby Coder1"}, ], ) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", } def test_gemspec_invalid_author(): raw_content = b""" Gem::Specification.new do |s| s.author = ["Ruby Coder"] end""" result = MAPPINGS["GemspecMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", } raw_content = b""" Gem::Specification.new do |s| s.author = "Ruby Coder1", end""" result = MAPPINGS["GemspecMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", } raw_content = b""" Gem::Specification.new do |s| s.authors = ["Ruby Coder1", ["Ruby Coder2"]] end""" result = MAPPINGS["GemspecMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "author": [{"type": "Person", "name": "Ruby Coder1"}], } def test_gemspec_alternative_header(): raw_content = b""" require './lib/version' Gem::Specification.new { |s| s.name = 'rb-system-with-aliases' s.summary = 'execute system commands with aliases' } """ result = MAPPINGS["GemspecMapping"]().translate(raw_content) assert result == { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "SoftwareSourceCode", "name": "rb-system-with-aliases", "description": "execute system commands with aliases", } @settings(suppress_health_check=[HealthCheck.too_slow]) @given( strategies.dictionaries( # keys strategies.one_of( strategies.text(), *map(strategies.just, MAPPINGS["GemspecMapping"].mapping), # type: ignore ), # values strategies.recursive( strategies.characters(), lambda children: strategies.lists(children, min_size=1), ), ) ) def test_gemspec_adversarial(doc): parts = [b"Gem::Specification.new do |s|\n"] for (k, v) in doc.items(): parts.append(" s.{} = {}\n".format(k, repr(v)).encode()) parts.append(b"end\n") MAPPINGS["GemspecMapping"]().translate(b"".join(parts)) diff --git a/swh/indexer/tests/test_cli.py b/swh/indexer/tests/test_cli.py index bd67a05..6bbab40 100644 --- a/swh/indexer/tests/test_cli.py +++ b/swh/indexer/tests/test_cli.py @@ -1,908 +1,922 @@ # Copyright (C) 2019-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of 
this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import datetime from functools import reduce import re from typing import Any, Dict, List from unittest.mock import patch import attr from click.testing import CliRunner from confluent_kafka import Consumer import pytest from swh.indexer import fossology_license from swh.indexer.cli import indexer_cli_group from swh.indexer.storage.interface import IndexerStorageInterface from swh.indexer.storage.model import ( ContentLicenseRow, ContentMimetypeRow, DirectoryIntrinsicMetadataRow, OriginExtrinsicMetadataRow, OriginIntrinsicMetadataRow, ) from swh.journal.writer import get_journal_writer from swh.model.hashutil import hash_to_bytes from swh.model.model import Content, Origin, OriginVisitStatus from .test_metadata import REMD from .utils import ( DIRECTORY2, RAW_CONTENT_IDS, RAW_CONTENTS, REVISION, SHA1_TO_LICENSES, mock_compute_license, ) def fill_idx_storage(idx_storage: IndexerStorageInterface, nb_rows: int) -> List[int]: tools: List[Dict[str, Any]] = [ { "tool_name": "tool %d" % i, "tool_version": "0.0.1", "tool_configuration": {}, } for i in range(2) ] tools = idx_storage.indexer_configuration_add(tools) origin_metadata = [ OriginIntrinsicMetadataRow( id="file://dev/%04d" % origin_id, from_directory=hash_to_bytes("abcd{:0>36}".format(origin_id)), indexer_configuration_id=tools[origin_id % 2]["id"], metadata={"name": "origin %d" % origin_id}, mappings=["mapping%d" % (origin_id % 10)], ) for origin_id in range(nb_rows) ] directory_metadata = [ DirectoryIntrinsicMetadataRow( id=hash_to_bytes("abcd{:0>36}".format(origin_id)), indexer_configuration_id=tools[origin_id % 2]["id"], metadata={"name": "origin %d" % origin_id}, mappings=["mapping%d" % (origin_id % 10)], ) for origin_id in range(nb_rows) ] idx_storage.directory_intrinsic_metadata_add(directory_metadata) idx_storage.origin_intrinsic_metadata_add(origin_metadata) return [tool["id"] for tool in tools] def _origins_in_task_args(tasks): """Returns the set of origins contained in the arguments of the provided tasks (assumed to be of type index-origin-metadata).""" return reduce( set.union, (set(task["arguments"]["args"][0]) for task in tasks), set() ) def _assert_tasks_for_origins(tasks, origins): expected_kwargs = {} assert {task["type"] for task in tasks} == {"index-origin-metadata"} assert all(len(task["arguments"]["args"]) == 1 for task in tasks) for task in tasks: assert task["arguments"]["kwargs"] == expected_kwargs, task assert _origins_in_task_args(tasks) == set(["file://dev/%04d" % i for i in origins]) @pytest.fixture def cli_runner(): return CliRunner() def test_cli_mapping_list(cli_runner, swh_config): result = cli_runner.invoke( indexer_cli_group, ["-C", swh_config, "mapping", "list"], catch_exceptions=False, ) expected_output = "\n".join( [ "cff", "codemeta", "composer", "gemspec", "github", + "json-sword-codemeta", "maven", "npm", + "nuget", "pkg-info", "pubspec", + "sword-codemeta", "", ] # must be sorted for test to pass ) assert result.exit_code == 0, result.output assert result.output == expected_output def test_cli_mapping_list_terms(cli_runner, swh_config): result = cli_runner.invoke( indexer_cli_group, ["-C", swh_config, "mapping", "list-terms"], catch_exceptions=False, ) assert result.exit_code == 0, result.output assert re.search(r"http://schema.org/url:\n.*npm", result.output) assert re.search(r"http://schema.org/url:\n.*codemeta", result.output) assert re.search( 
r"https://codemeta.github.io/terms/developmentStatus:\n\tcodemeta", result.output, ) def test_cli_mapping_list_terms_exclude(cli_runner, swh_config): result = cli_runner.invoke( indexer_cli_group, - ["-C", swh_config, "mapping", "list-terms", "--exclude-mapping", "codemeta"], + [ + "-C", + swh_config, + "mapping", + "list-terms", + "--exclude-mapping", + "codemeta", + "--exclude-mapping", + "json-sword-codemeta", + "--exclude-mapping", + "sword-codemeta", + ], catch_exceptions=False, ) assert result.exit_code == 0, result.output assert re.search(r"http://schema.org/url:\n.*npm", result.output) assert not re.search(r"http://schema.org/url:\n.*codemeta", result.output) assert not re.search( r"https://codemeta.github.io/terms/developmentStatus:\n\tcodemeta", result.output, ) @patch("swh.scheduler.cli.utils.TASK_BATCH_SIZE", 3) @patch("swh.scheduler.cli_utils.TASK_BATCH_SIZE", 3) def test_cli_origin_metadata_reindex_empty_db( cli_runner, swh_config, indexer_scheduler, idx_storage, storage ): result = cli_runner.invoke( indexer_cli_group, [ "-C", swh_config, "schedule", "reindex_origin_metadata", ], catch_exceptions=False, ) expected_output = "Nothing to do (no origin metadata matched the criteria).\n" assert result.exit_code == 0, result.output assert result.output == expected_output tasks = indexer_scheduler.search_tasks() assert len(tasks) == 0 @patch("swh.scheduler.cli.utils.TASK_BATCH_SIZE", 3) @patch("swh.scheduler.cli_utils.TASK_BATCH_SIZE", 3) def test_cli_origin_metadata_reindex_divisor( cli_runner, swh_config, indexer_scheduler, idx_storage, storage ): """Tests the re-indexing when origin_batch_size*task_batch_size is a divisor of nb_origins.""" fill_idx_storage(idx_storage, 90) result = cli_runner.invoke( indexer_cli_group, [ "-C", swh_config, "schedule", "reindex_origin_metadata", ], catch_exceptions=False, ) # Check the output expected_output = ( "Scheduled 3 tasks (30 origins).\n" "Scheduled 6 tasks (60 origins).\n" "Scheduled 9 tasks (90 origins).\n" "Done.\n" ) assert result.exit_code == 0, result.output assert result.output == expected_output # Check scheduled tasks tasks = indexer_scheduler.search_tasks() assert len(tasks) == 9 _assert_tasks_for_origins(tasks, range(90)) @patch("swh.scheduler.cli.utils.TASK_BATCH_SIZE", 3) @patch("swh.scheduler.cli_utils.TASK_BATCH_SIZE", 3) def test_cli_origin_metadata_reindex_dry_run( cli_runner, swh_config, indexer_scheduler, idx_storage, storage ): """Tests the re-indexing when origin_batch_size*task_batch_size is a divisor of nb_origins.""" fill_idx_storage(idx_storage, 90) result = cli_runner.invoke( indexer_cli_group, [ "-C", swh_config, "schedule", "--dry-run", "reindex_origin_metadata", ], catch_exceptions=False, ) # Check the output expected_output = ( "Scheduled 3 tasks (30 origins).\n" "Scheduled 6 tasks (60 origins).\n" "Scheduled 9 tasks (90 origins).\n" "Done.\n" ) assert result.exit_code == 0, result.output assert result.output == expected_output # Check scheduled tasks tasks = indexer_scheduler.search_tasks() assert len(tasks) == 0 @patch("swh.scheduler.cli.utils.TASK_BATCH_SIZE", 3) @patch("swh.scheduler.cli_utils.TASK_BATCH_SIZE", 3) def test_cli_origin_metadata_reindex_nondivisor( cli_runner, swh_config, indexer_scheduler, idx_storage, storage ): """Tests the re-indexing when neither origin_batch_size or task_batch_size is a divisor of nb_origins.""" fill_idx_storage(idx_storage, 70) result = cli_runner.invoke( indexer_cli_group, [ "-C", swh_config, "schedule", "reindex_origin_metadata", "--batch-size", "20", ], 
catch_exceptions=False, ) # Check the output expected_output = ( "Scheduled 3 tasks (60 origins).\n" "Scheduled 4 tasks (70 origins).\n" "Done.\n" ) assert result.exit_code == 0, result.output assert result.output == expected_output # Check scheduled tasks tasks = indexer_scheduler.search_tasks() assert len(tasks) == 4 _assert_tasks_for_origins(tasks, range(70)) @patch("swh.scheduler.cli.utils.TASK_BATCH_SIZE", 3) @patch("swh.scheduler.cli_utils.TASK_BATCH_SIZE", 3) def test_cli_origin_metadata_reindex_filter_one_mapping( cli_runner, swh_config, indexer_scheduler, idx_storage, storage ): """Tests the re-indexing when origin_batch_size*task_batch_size is a divisor of nb_origins.""" fill_idx_storage(idx_storage, 110) result = cli_runner.invoke( indexer_cli_group, [ "-C", swh_config, "schedule", "reindex_origin_metadata", "--mapping", "mapping1", ], catch_exceptions=False, ) # Check the output expected_output = "Scheduled 2 tasks (11 origins).\nDone.\n" assert result.exit_code == 0, result.output assert result.output == expected_output # Check scheduled tasks tasks = indexer_scheduler.search_tasks() assert len(tasks) == 2 _assert_tasks_for_origins(tasks, [1, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101]) @patch("swh.scheduler.cli.utils.TASK_BATCH_SIZE", 3) @patch("swh.scheduler.cli_utils.TASK_BATCH_SIZE", 3) def test_cli_origin_metadata_reindex_filter_two_mappings( cli_runner, swh_config, indexer_scheduler, idx_storage, storage ): """Tests the re-indexing when origin_batch_size*task_batch_size is a divisor of nb_origins.""" fill_idx_storage(idx_storage, 110) result = cli_runner.invoke( indexer_cli_group, [ "--config-file", swh_config, "schedule", "reindex_origin_metadata", "--mapping", "mapping1", "--mapping", "mapping2", ], catch_exceptions=False, ) # Check the output expected_output = "Scheduled 3 tasks (22 origins).\nDone.\n" assert result.exit_code == 0, result.output assert result.output == expected_output # Check scheduled tasks tasks = indexer_scheduler.search_tasks() assert len(tasks) == 3 _assert_tasks_for_origins( tasks, [ 1, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101, 2, 12, 22, 32, 42, 52, 62, 72, 82, 92, 102, ], ) @patch("swh.scheduler.cli.utils.TASK_BATCH_SIZE", 3) @patch("swh.scheduler.cli_utils.TASK_BATCH_SIZE", 3) def test_cli_origin_metadata_reindex_filter_one_tool( cli_runner, swh_config, indexer_scheduler, idx_storage, storage ): """Tests the re-indexing when origin_batch_size*task_batch_size is a divisor of nb_origins.""" tool_ids = fill_idx_storage(idx_storage, 110) result = cli_runner.invoke( indexer_cli_group, [ "-C", swh_config, "schedule", "reindex_origin_metadata", "--tool-id", str(tool_ids[0]), ], catch_exceptions=False, ) # Check the output expected_output = ( "Scheduled 3 tasks (30 origins).\n" "Scheduled 6 tasks (55 origins).\n" "Done.\n" ) assert result.exit_code == 0, result.output assert result.output == expected_output # Check scheduled tasks tasks = indexer_scheduler.search_tasks() assert len(tasks) == 6 _assert_tasks_for_origins(tasks, [x * 2 for x in range(55)]) def now(): return datetime.datetime.now(tz=datetime.timezone.utc) def test_cli_journal_client_schedule( cli_runner, swh_config, indexer_scheduler, kafka_prefix: str, kafka_server, consumer: Consumer, ): """Test the 'swh indexer journal-client' cli tool.""" journal_writer = get_journal_writer( "kafka", brokers=[kafka_server], prefix=kafka_prefix, client_id="test producer", value_sanitizer=lambda object_type, value: value, flush_timeout=3, # fail early if something is going wrong ) visit_statuses = [ 
OriginVisitStatus( origin="file:///dev/zero", visit=1, date=now(), status="full", snapshot=None, ), OriginVisitStatus( origin="file:///dev/foobar", visit=2, date=now(), status="full", snapshot=None, ), OriginVisitStatus( origin="file:///tmp/spamegg", visit=3, date=now(), status="full", snapshot=None, ), OriginVisitStatus( origin="file:///dev/0002", visit=6, date=now(), status="full", snapshot=None, ), OriginVisitStatus( # will be filtered out due to its 'partial' status origin="file:///dev/0000", visit=4, date=now(), status="partial", snapshot=None, ), OriginVisitStatus( # will be filtered out due to its 'ongoing' status origin="file:///dev/0001", visit=5, date=now(), status="ongoing", snapshot=None, ), ] journal_writer.write_additions("origin_visit_status", visit_statuses) visit_statuses_full = [vs for vs in visit_statuses if vs.status == "full"] result = cli_runner.invoke( indexer_cli_group, [ "-C", swh_config, "journal-client", "--broker", kafka_server, "--prefix", kafka_prefix, "--group-id", "test-consumer", "--stop-after-objects", len(visit_statuses), "--origin-metadata-task-type", "index-origin-metadata", ], catch_exceptions=False, ) # Check the output expected_output = "Done.\n" assert result.exit_code == 0, result.output assert result.output == expected_output # Check scheduled tasks tasks = indexer_scheduler.search_tasks(task_type="index-origin-metadata") # This can be split into multiple tasks but no more than the origin-visit-statuses # written in the journal assert len(tasks) <= len(visit_statuses_full) actual_origins = [] for task in tasks: actual_task = dict(task) assert actual_task["type"] == "index-origin-metadata" scheduled_origins = actual_task["arguments"]["args"][0] actual_origins.extend(scheduled_origins) assert set(actual_origins) == {vs.origin for vs in visit_statuses_full} def test_cli_journal_client_without_brokers( cli_runner, swh_config, kafka_prefix: str, kafka_server, consumer: Consumer ): """Without brokers configuration, the cli fails.""" with pytest.raises(ValueError, match="brokers"): cli_runner.invoke( indexer_cli_group, [ "-C", swh_config, "journal-client", ], catch_exceptions=False, ) @pytest.mark.parametrize("indexer_name", ["origin_intrinsic_metadata", "*"]) def test_cli_journal_client_index__origin_intrinsic_metadata( cli_runner, swh_config, kafka_prefix: str, kafka_server, consumer: Consumer, idx_storage, storage, mocker, swh_indexer_config, indexer_name: str, ): """Test the 'swh indexer journal-client' cli tool.""" journal_writer = get_journal_writer( "kafka", brokers=[kafka_server], prefix=kafka_prefix, client_id="test producer", value_sanitizer=lambda object_type, value: value, flush_timeout=3, # fail early if something is going wrong ) visit_statuses = [ OriginVisitStatus( origin="file:///dev/zero", visit=1, date=now(), status="full", snapshot=None, ), OriginVisitStatus( origin="file:///dev/foobar", visit=2, date=now(), status="full", snapshot=None, ), OriginVisitStatus( origin="file:///tmp/spamegg", visit=3, date=now(), status="full", snapshot=None, ), OriginVisitStatus( origin="file:///dev/0002", visit=6, date=now(), status="full", snapshot=None, ), OriginVisitStatus( # will be filtered out due to its 'partial' status origin="file:///dev/0000", visit=4, date=now(), status="partial", snapshot=None, ), OriginVisitStatus( # will be filtered out due to its 'ongoing' status origin="file:///dev/0001", visit=5, date=now(), status="ongoing", snapshot=None, ), ] journal_writer.write_additions("origin_visit_status", visit_statuses) visit_statuses_full = 
[vs for vs in visit_statuses if vs.status == "full"] storage.revision_add([REVISION]) mocker.patch( "swh.indexer.metadata.get_head_swhid", return_value=REVISION.swhid(), ) mocker.patch( "swh.indexer.metadata.DirectoryMetadataIndexer.index", return_value=[ DirectoryIntrinsicMetadataRow( id=DIRECTORY2.id, indexer_configuration_id=1, mappings=["cff"], metadata={"foo": "bar"}, ) ], ) result = cli_runner.invoke( indexer_cli_group, [ "-C", swh_config, "journal-client", indexer_name, "--broker", kafka_server, "--prefix", kafka_prefix, "--group-id", "test-consumer", "--stop-after-objects", len(visit_statuses), ], catch_exceptions=False, ) # Check the output expected_output = "Done.\n" assert result.exit_code == 0, result.output assert result.output == expected_output results = idx_storage.origin_intrinsic_metadata_get( [status.origin for status in visit_statuses] ) expected_results = [ OriginIntrinsicMetadataRow( id=status.origin, from_directory=DIRECTORY2.id, tool={"id": 1, **swh_indexer_config["tools"]}, mappings=["cff"], metadata={"foo": "bar"}, ) for status in sorted(visit_statuses_full, key=lambda r: r.origin) ] assert sorted(results, key=lambda r: r.id) == expected_results @pytest.mark.parametrize("indexer_name", ["extrinsic_metadata", "*"]) def test_cli_journal_client_index__origin_extrinsic_metadata( cli_runner, swh_config, kafka_prefix: str, kafka_server, consumer: Consumer, idx_storage, storage, mocker, swh_indexer_config, indexer_name: str, ): """Test the 'swh indexer journal-client' cli tool.""" journal_writer = get_journal_writer( "kafka", brokers=[kafka_server], prefix=kafka_prefix, client_id="test producer", value_sanitizer=lambda object_type, value: value, flush_timeout=3, # fail early if something is going wrong ) origin = Origin("http://example.org/repo.git") storage.origin_add([origin]) raw_extrinsic_metadata = attr.evolve(REMD, target=origin.swhid()) raw_extrinsic_metadata = attr.evolve( raw_extrinsic_metadata, id=raw_extrinsic_metadata.compute_hash() ) journal_writer.write_additions("raw_extrinsic_metadata", [raw_extrinsic_metadata]) result = cli_runner.invoke( indexer_cli_group, [ "-C", swh_config, "journal-client", indexer_name, "--broker", kafka_server, "--prefix", kafka_prefix, "--group-id", "test-consumer", "--stop-after-objects", 1, ], catch_exceptions=False, ) # Check the output expected_output = "Done.\n" assert result.exit_code == 0, result.output assert result.output == expected_output results = idx_storage.origin_extrinsic_metadata_get([origin.url]) expected_results = [ OriginExtrinsicMetadataRow( id=origin.url, from_remd_id=raw_extrinsic_metadata.id, tool={"id": 1, **swh_indexer_config["tools"]}, mappings=["github"], metadata={ "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "type": "https://forgefed.org/ns#Repository", "name": "test software", }, ) ] assert sorted(results, key=lambda r: r.id) == expected_results def test_cli_journal_client_index__content_mimetype( cli_runner, swh_config, kafka_prefix: str, kafka_server, consumer: Consumer, idx_storage, obj_storage, storage, mocker, swh_indexer_config, ): """Test the 'swh indexer journal-client' cli tool.""" journal_writer = get_journal_writer( "kafka", brokers=[kafka_server], prefix=kafka_prefix, client_id="test producer", value_sanitizer=lambda object_type, value: value, flush_timeout=3, # fail early if something is going wrong ) contents = [] expected_results = [] content_ids = [] for content_id, (raw_content, mimetypes, encoding) in RAW_CONTENTS.items(): content = Content.from_data(raw_content) 
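(Illustrative aside, not part of the diff.) This mimetype test cannot pin a single expected value because different libmagic releases classify the same bytes differently, so it enumerates every acceptable row and only checks membership. A tiny standalone illustration of that tolerance pattern, with made-up values:

acceptable_mimetypes = {"text/plain", "text/x-python"}  # made-up example values
expected_rows = [
    {"id": b"abc", "mimetype": mimetype, "encoding": "us-ascii"}
    for mimetype in sorted(acceptable_mimetypes)
]
actual_rows = [{"id": b"abc", "mimetype": "text/plain", "encoding": "us-ascii"}]
# Each actual result only has to match one of the acceptable expected rows.
assert all(row in expected_rows for row in actual_rows)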
assert content_id == content.sha1 contents.append(content) content_ids.append(content_id) # Older libmagic versions (e.g. buster: 1:5.35-4+deb10u2, bullseye: 1:5.39-3) # returns different results. This allows to deal with such a case when executing # tests on different environments machines (e.g. ci tox, ci debian, dev machine, # ...) all_mimetypes = mimetypes if isinstance(mimetypes, tuple) else [mimetypes] expected_results.extend( [ ContentMimetypeRow( id=content.sha1, tool={"id": 1, **swh_indexer_config["tools"]}, mimetype=mimetype, encoding=encoding, ) for mimetype in all_mimetypes ] ) assert len(contents) == len(RAW_CONTENTS) journal_writer.write_additions("content", contents) result = cli_runner.invoke( indexer_cli_group, [ "-C", swh_config, "journal-client", "content_mimetype", "--broker", kafka_server, "--prefix", kafka_prefix, "--group-id", "test-consumer", "--stop-after-objects", len(contents), ], catch_exceptions=False, ) # Check the output expected_output = "Done.\n" assert result.exit_code == 0, result.output assert result.output == expected_output results = idx_storage.content_mimetype_get(content_ids) assert len(results) == len(contents) for result in results: assert result in expected_results def test_cli_journal_client_index__fossology_license( cli_runner, swh_config, kafka_prefix: str, kafka_server, consumer: Consumer, idx_storage, obj_storage, storage, mocker, swh_indexer_config, ): """Test the 'swh indexer journal-client' cli tool.""" # Patch fossology_license.compute_license = mock_compute_license journal_writer = get_journal_writer( "kafka", brokers=[kafka_server], prefix=kafka_prefix, client_id="test producer", value_sanitizer=lambda object_type, value: value, flush_timeout=3, # fail early if something is going wrong ) tool = {"id": 1, **swh_indexer_config["tools"]} id0, id1, id2 = RAW_CONTENT_IDS contents = [] content_ids = [] expected_results = [] for content_id, (raw_content, _, _) in RAW_CONTENTS.items(): content = Content.from_data(raw_content) assert content_id == content.sha1 contents.append(content) content_ids.append(content_id) expected_results.extend( [ ContentLicenseRow(id=content_id, tool=tool, license=license) for license in SHA1_TO_LICENSES[content_id] ] ) assert len(contents) == len(RAW_CONTENTS) journal_writer.write_additions("content", contents) result = cli_runner.invoke( indexer_cli_group, [ "-C", swh_config, "journal-client", "content_fossology_license", "--broker", kafka_server, "--prefix", kafka_prefix, "--group-id", "test-consumer", "--stop-after-objects", len(contents), ], catch_exceptions=False, ) # Check the output expected_output = "Done.\n" assert result.exit_code == 0, result.output assert result.output == expected_output results = idx_storage.content_fossology_license_get(content_ids) assert len(results) == len(expected_results) for result in results: assert result in expected_results diff --git a/swh/indexer/tests/test_codemeta.py b/swh/indexer/tests/test_codemeta.py index 1829a70..6d394d4 100644 --- a/swh/indexer/tests/test_codemeta.py +++ b/swh/indexer/tests/test_codemeta.py @@ -1,298 +1,270 @@ # Copyright (C) 2018-2020 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information -import pytest - -from swh.indexer.codemeta import CROSSWALK_TABLE, merge_documents, merge_values +from swh.indexer.codemeta import CROSSWALK_TABLE, merge_documents def test_crosstable(): - assert 
CROSSWALK_TABLE["NodeJS"] == { + assert {k: str(v) for (k, v) in CROSSWALK_TABLE["NodeJS"].items()} == { "repository": "http://schema.org/codeRepository", "os": "http://schema.org/operatingSystem", "cpu": "http://schema.org/processorRequirements", "engines": "http://schema.org/runtimePlatform", "author": "http://schema.org/author", "author.email": "http://schema.org/email", "author.name": "http://schema.org/name", "contributors": "http://schema.org/contributor", "keywords": "http://schema.org/keywords", "license": "http://schema.org/license", "version": "http://schema.org/version", "description": "http://schema.org/description", "name": "http://schema.org/name", "bugs": "https://codemeta.github.io/terms/issueTracker", "homepage": "http://schema.org/url", } -def test_merge_values(): - assert merge_values("a", "b") == ["a", "b"] - assert merge_values(["a", "b"], "c") == ["a", "b", "c"] - assert merge_values("a", ["b", "c"]) == ["a", "b", "c"] - - assert merge_values({"@list": ["a"]}, {"@list": ["b"]}) == {"@list": ["a", "b"]} - assert merge_values({"@list": ["a", "b"]}, {"@list": ["c"]}) == { - "@list": ["a", "b", "c"] - } - - with pytest.raises(ValueError): - merge_values({"@list": ["a"]}, "b") - with pytest.raises(ValueError): - merge_values("a", {"@list": ["b"]}) - with pytest.raises(ValueError): - merge_values({"@list": ["a"]}, ["b"]) - with pytest.raises(ValueError): - merge_values(["a"], {"@list": ["b"]}) - - assert merge_values("a", None) == "a" - assert merge_values(["a", "b"], None) == ["a", "b"] - assert merge_values(None, ["b", "c"]) == ["b", "c"] - assert merge_values({"@list": ["a"]}, None) == {"@list": ["a"]} - assert merge_values(None, {"@list": ["a"]}) == {"@list": ["a"]} - - def test_merge_documents(): """ Test the creation of a coherent minimal metadata set """ # given metadata_list = [ { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "name": "test_1", "version": "0.0.2", "description": "Simple package.json test for indexer", "codeRepository": "git+https://github.com/moranegg/metadata_test", }, { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "name": "test_0_1", "version": "0.0.2", "description": "Simple package.json test for indexer", "codeRepository": "git+https://github.com/moranegg/metadata_test", }, { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "name": "test_metadata", "version": "0.0.2", "author": { "type": "Person", "name": "moranegg", }, }, ] # when results = merge_documents(metadata_list) # then expected_results = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "version": "0.0.2", "description": "Simple package.json test for indexer", "name": ["test_1", "test_0_1", "test_metadata"], "author": [{"type": "Person", "name": "moranegg"}], "codeRepository": "git+https://github.com/moranegg/metadata_test", } assert results == expected_results def test_merge_documents_ids(): # given metadata_list = [ { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "id": "http://example.org/test1", "name": "test_1", }, { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "id": "http://example.org/test2", "name": "test_2", }, ] # when results = merge_documents(metadata_list) # then expected_results = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "id": "http://example.org/test1", "schema:sameAs": "http://example.org/test2", "name": ["test_1", "test_2"], } assert results == expected_results def test_merge_documents_duplicate_ids(): # given metadata_list = [ { "@context": 
"https://doi.org/10.5063/schema/codemeta-2.0", "id": "http://example.org/test1", "name": "test_1", }, { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "id": "http://example.org/test1", "name": "test_1b", }, { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "id": "http://example.org/test2", "name": "test_2", }, ] # when results = merge_documents(metadata_list) # then expected_results = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "id": "http://example.org/test1", "schema:sameAs": "http://example.org/test2", "name": ["test_1", "test_1b", "test_2"], } assert results == expected_results def test_merge_documents_lists(): """Tests merging two @list elements.""" # given metadata_list = [ { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "author": { "@list": [ {"name": "test_1"}, ] }, }, { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "author": { "@list": [ {"name": "test_2"}, ] }, }, ] # when results = merge_documents(metadata_list) # then expected_results = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "author": [ {"name": "test_1"}, {"name": "test_2"}, ], } assert results == expected_results def test_merge_documents_lists_duplicates(): """Tests merging two @list elements with a duplicate subelement.""" # given metadata_list = [ { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "author": { "@list": [ {"name": "test_1"}, ] }, }, { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "author": { "@list": [ {"name": "test_2"}, {"name": "test_1"}, ] }, }, ] # when results = merge_documents(metadata_list) # then expected_results = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "author": [ {"name": "test_1"}, {"name": "test_2"}, ], } assert results == expected_results def test_merge_documents_list_left(): """Tests merging a singleton with an @list.""" # given metadata_list = [ { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "author": {"name": "test_1"}, }, { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "author": { "@list": [ {"name": "test_2"}, ] }, }, ] # when results = merge_documents(metadata_list) # then expected_results = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "author": [ {"name": "test_1"}, {"name": "test_2"}, ], } assert results == expected_results def test_merge_documents_list_right(): """Tests merging an @list with a singleton.""" # given metadata_list = [ { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "author": { "@list": [ {"name": "test_1"}, ] }, }, { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "author": {"name": "test_2"}, }, ] # when results = merge_documents(metadata_list) # then expected_results = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "author": [ {"name": "test_1"}, {"name": "test_2"}, ], } assert results == expected_results diff --git a/swh/indexer/tests/test_origin_metadata.py b/swh/indexer/tests/test_origin_metadata.py index 4f6df9a..567f479 100644 --- a/swh/indexer/tests/test_origin_metadata.py +++ b/swh/indexer/tests/test_origin_metadata.py @@ -1,356 +1,356 @@ # Copyright (C) 2018-2020 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import copy from unittest.mock import patch import pytest from swh.indexer.metadata import OriginMetadataIndexer from swh.indexer.storage.interface import 
IndexerStorageInterface from swh.indexer.storage.model import ( DirectoryIntrinsicMetadataRow, OriginIntrinsicMetadataRow, ) from swh.model.model import Origin from swh.storage.interface import StorageInterface from .test_metadata import TRANSLATOR_TOOL from .utils import DIRECTORY2, YARN_PARSER_METADATA @pytest.fixture def swh_indexer_config(swh_indexer_config): """Override the default configuration to override the tools entry""" cfg = copy.deepcopy(swh_indexer_config) cfg["tools"] = TRANSLATOR_TOOL return cfg def test_origin_metadata_indexer_release( swh_indexer_config, idx_storage: IndexerStorageInterface, storage: StorageInterface, obj_storage, ) -> None: indexer = OriginMetadataIndexer(config=swh_indexer_config) origin = "https://npm.example.org/yarn-parser" indexer.run([origin]) tool = swh_indexer_config["tools"] dir_id = DIRECTORY2.id dir_metadata = DirectoryIntrinsicMetadataRow( id=dir_id, tool=tool, metadata=YARN_PARSER_METADATA, mappings=["npm"], ) origin_metadata = OriginIntrinsicMetadataRow( id=origin, tool=tool, from_directory=dir_id, metadata=YARN_PARSER_METADATA, mappings=["npm"], ) dir_results = list(idx_storage.directory_intrinsic_metadata_get([dir_id])) for dir_result in dir_results: assert dir_result.tool del dir_result.tool["id"] assert dir_results == [dir_metadata] orig_results = list(idx_storage.origin_intrinsic_metadata_get([origin])) for orig_result in orig_results: assert orig_result.tool del orig_result.tool["id"] assert orig_results == [origin_metadata] def test_origin_metadata_indexer_revision( swh_indexer_config, idx_storage: IndexerStorageInterface, storage: StorageInterface, obj_storage, ) -> None: indexer = OriginMetadataIndexer(config=swh_indexer_config) origin = "https://github.com/librariesio/yarn-parser" indexer.run([origin]) tool = swh_indexer_config["tools"] dir_id = DIRECTORY2.id dir_metadata = DirectoryIntrinsicMetadataRow( id=dir_id, tool=tool, metadata=YARN_PARSER_METADATA, mappings=["npm"], ) origin_metadata = OriginIntrinsicMetadataRow( id=origin, tool=tool, from_directory=dir_id, metadata=YARN_PARSER_METADATA, mappings=["npm"], ) dir_results = list(idx_storage.directory_intrinsic_metadata_get([dir_id])) for dir_result in dir_results: assert dir_result.tool del dir_result.tool["id"] assert dir_results == [dir_metadata] orig_results = list(idx_storage.origin_intrinsic_metadata_get([origin])) for orig_result in orig_results: assert orig_result.tool del orig_result.tool["id"] assert orig_results == [origin_metadata] def test_origin_metadata_indexer_duplicate_origin( swh_indexer_config, idx_storage: IndexerStorageInterface, storage: StorageInterface, obj_storage, ) -> None: indexer = OriginMetadataIndexer(config=swh_indexer_config) indexer.storage = storage indexer.idx_storage = idx_storage indexer.run(["https://github.com/librariesio/yarn-parser"]) indexer.run(["https://github.com/librariesio/yarn-parser"] * 2) origin = "https://github.com/librariesio/yarn-parser" dir_id = DIRECTORY2.id dir_results = list(indexer.idx_storage.directory_intrinsic_metadata_get([dir_id])) assert len(dir_results) == 1 orig_results = list(indexer.idx_storage.origin_intrinsic_metadata_get([origin])) assert len(orig_results) == 1 def test_origin_metadata_indexer_missing_head( swh_indexer_config, idx_storage: IndexerStorageInterface, storage: StorageInterface, obj_storage, ) -> None: storage.origin_add([Origin(url="https://example.com")]) indexer = OriginMetadataIndexer(config=swh_indexer_config) indexer.run(["https://example.com"]) origin = "https://example.com" results = 
list(indexer.idx_storage.origin_intrinsic_metadata_get([origin])) assert results == [] def test_origin_metadata_indexer_partial_missing_head( swh_indexer_config, idx_storage: IndexerStorageInterface, storage: StorageInterface, obj_storage, ) -> None: origin1 = "https://example.com" origin2 = "https://github.com/librariesio/yarn-parser" storage.origin_add([Origin(url=origin1)]) indexer = OriginMetadataIndexer(config=swh_indexer_config) indexer.run([origin1, origin2]) dir_id = DIRECTORY2.id dir_results = list(indexer.idx_storage.directory_intrinsic_metadata_get([dir_id])) assert dir_results == [ DirectoryIntrinsicMetadataRow( id=dir_id, metadata=YARN_PARSER_METADATA, mappings=["npm"], tool=dir_results[0].tool, ) ] orig_results = list( indexer.idx_storage.origin_intrinsic_metadata_get([origin1, origin2]) ) for orig_result in orig_results: assert orig_results == [ OriginIntrinsicMetadataRow( id=origin2, from_directory=dir_id, metadata=YARN_PARSER_METADATA, mappings=["npm"], tool=orig_results[0].tool, ) ] def test_origin_metadata_indexer_duplicate_directory( swh_indexer_config, idx_storage: IndexerStorageInterface, storage: StorageInterface, obj_storage, ) -> None: indexer = OriginMetadataIndexer(config=swh_indexer_config) indexer.storage = storage indexer.idx_storage = idx_storage indexer.catch_exceptions = False origin1 = "https://github.com/librariesio/yarn-parser" origin2 = "https://github.com/librariesio/yarn-parser.git" indexer.run([origin1, origin2]) dir_id = DIRECTORY2.id dir_results = list(indexer.idx_storage.directory_intrinsic_metadata_get([dir_id])) assert len(dir_results) == 1 orig_results = list( indexer.idx_storage.origin_intrinsic_metadata_get([origin1, origin2]) ) assert len(orig_results) == 2 def test_origin_metadata_indexer_no_metadata_file( swh_indexer_config, idx_storage: IndexerStorageInterface, storage: StorageInterface, obj_storage, ) -> None: indexer = OriginMetadataIndexer(config=swh_indexer_config) origin = "https://github.com/librariesio/yarn-parser" with patch("swh.indexer.metadata_dictionary.npm.NpmMapping.filename", b"foo.json"): indexer.run([origin]) dir_id = DIRECTORY2.id dir_results = list(indexer.idx_storage.directory_intrinsic_metadata_get([dir_id])) assert dir_results == [] orig_results = list(indexer.idx_storage.origin_intrinsic_metadata_get([origin])) assert orig_results == [] def test_origin_metadata_indexer_no_metadata( swh_indexer_config, idx_storage: IndexerStorageInterface, storage: StorageInterface, obj_storage, ) -> None: indexer = OriginMetadataIndexer(config=swh_indexer_config) origin = "https://github.com/librariesio/yarn-parser" with patch( "swh.indexer.metadata.DirectoryMetadataIndexer" ".translate_directory_intrinsic_metadata", return_value=(["npm"], {"@context": "foo"}), ): indexer.run([origin]) dir_id = DIRECTORY2.id dir_results = list(indexer.idx_storage.directory_intrinsic_metadata_get([dir_id])) assert dir_results == [] orig_results = list(indexer.idx_storage.origin_intrinsic_metadata_get([origin])) assert orig_results == [] @pytest.mark.parametrize("catch_exceptions", [True, False]) def test_origin_metadata_indexer_directory_error( swh_indexer_config, idx_storage: IndexerStorageInterface, storage: StorageInterface, obj_storage, sentry_events, catch_exceptions, ) -> None: indexer = OriginMetadataIndexer(config=swh_indexer_config) origin = "https://github.com/librariesio/yarn-parser" indexer.catch_exceptions = catch_exceptions with patch( "swh.indexer.metadata.DirectoryMetadataIndexer" ".translate_directory_intrinsic_metadata", 
return_value=None, ): indexer.run([origin]) assert len(sentry_events) == 1 sentry_event = sentry_events.pop() assert sentry_event.get("tags") == { "swh-indexer-origin-head-swhid": ( - "swh:1:rev:179fd041d75edab00feba8e4439897422f3bdfa1" + "swh:1:rev:a78410ce2f78f5078fd4ee7edb8c82c02a4a712c" ), "swh-indexer-origin-url": origin, } assert "'TypeError'" in str(sentry_event) dir_id = DIRECTORY2.id dir_results = list(indexer.idx_storage.directory_intrinsic_metadata_get([dir_id])) assert dir_results == [] orig_results = list(indexer.idx_storage.origin_intrinsic_metadata_get([origin])) assert orig_results == [] @pytest.mark.parametrize("catch_exceptions", [True, False]) def test_origin_metadata_indexer_content_exception( swh_indexer_config, idx_storage: IndexerStorageInterface, storage: StorageInterface, obj_storage, sentry_events, catch_exceptions, ) -> None: indexer = OriginMetadataIndexer(config=swh_indexer_config) origin = "https://github.com/librariesio/yarn-parser" indexer.catch_exceptions = catch_exceptions class TestException(Exception): pass with patch( "swh.indexer.metadata.ContentMetadataRow", side_effect=TestException(), ): indexer.run([origin]) assert len(sentry_events) == 1 sentry_event = sentry_events.pop() assert sentry_event.get("tags") == { - "swh-indexer-content-sha1": "d8f40c3ca9cc30ddaca25c55b5dff18271ff030e", + "swh-indexer-content-sha1": "df9d3bcc0158faa446bd1af225f8e2e4afa576d7", "swh-indexer-origin-head-swhid": ( - "swh:1:rev:179fd041d75edab00feba8e4439897422f3bdfa1" + "swh:1:rev:a78410ce2f78f5078fd4ee7edb8c82c02a4a712c" ), "swh-indexer-origin-url": origin, } assert ".TestException'" in str(sentry_event), sentry_event dir_id = DIRECTORY2.id dir_results = list(indexer.idx_storage.directory_intrinsic_metadata_get([dir_id])) assert dir_results == [] orig_results = list(indexer.idx_storage.origin_intrinsic_metadata_get([origin])) assert orig_results == [] def test_origin_metadata_indexer_unknown_origin( swh_indexer_config, idx_storage: IndexerStorageInterface, storage: StorageInterface, obj_storage, ) -> None: indexer = OriginMetadataIndexer(config=swh_indexer_config) result = indexer.index_list([Origin("https://unknown.org/foo")]) assert not result diff --git a/swh/indexer/tests/utils.py b/swh/indexer/tests/utils.py index db0ee95..7938cdc 100644 --- a/swh/indexer/tests/utils.py +++ b/swh/indexer/tests/utils.py @@ -1,774 +1,761 @@ # Copyright (C) 2017-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import abc import datetime import functools from typing import Any, Dict, List, Tuple import unittest from hypothesis import strategies from swh.core.api.classes import stream_results from swh.indexer.storage import INDEXER_CFG_KEY from swh.model.hashutil import hash_to_bytes from swh.model.model import ( Content, Directory, DirectoryEntry, ObjectType, Origin, OriginVisit, OriginVisitStatus, Person, Release, Revision, RevisionType, Snapshot, SnapshotBranch, TargetType, TimestampWithTimezone, ) from swh.storage.utils import now BASE_TEST_CONFIG: Dict[str, Dict[str, Any]] = { "storage": {"cls": "memory"}, "objstorage": {"cls": "memory"}, INDEXER_CFG_KEY: {"cls": "memory"}, } ORIGIN_VISITS = [ {"type": "git", "origin": "https://github.com/SoftwareHeritage/swh-storage"}, {"type": "ftp", "origin": "rsync://ftp.gnu.org/gnu/3dldf"}, { "type": "deposit", "origin": 
"https://forge.softwareheritage.org/source/jesuisgpl/", }, { "type": "pypi", "origin": "https://old-pypi.example.org/project/limnoria/", }, # with rev head {"type": "pypi", "origin": "https://pypi.org/project/limnoria/"}, # with rel head {"type": "svn", "origin": "http://0-512-md.googlecode.com/svn/"}, {"type": "git", "origin": "https://github.com/librariesio/yarn-parser"}, {"type": "git", "origin": "https://github.com/librariesio/yarn-parser.git"}, {"type": "git", "origin": "https://npm.example.org/yarn-parser"}, ] ORIGINS = [Origin(url=visit["origin"]) for visit in ORIGIN_VISITS] OBJ_STORAGE_RAW_CONTENT: Dict[str, bytes] = { "text:some": b"this is some text", "text:another": b"another text", "text:yet": b"yet another text", "python:code": b""" import unittest import logging from swh.indexer.mimetype import MimetypeIndexer from swh.indexer.tests.test_utils import MockObjStorage class MockStorage(): def content_mimetype_add(self, mimetypes): self.state = mimetypes def indexer_configuration_add(self, tools): return [{ 'id': 10, }] """, "c:struct": b""" #ifndef __AVL__ #define __AVL__ typedef struct _avl_tree avl_tree; typedef struct _data_t { int content; } data_t; """, "lisp:assertion": b""" (should 'pygments (recognize 'lisp 'easily)) """, "json:test-metadata-package.json": b""" { "name": "test_metadata", "version": "0.0.1", "description": "Simple package.json test for indexer", "repository": { "type": "git", "url": "https://github.com/moranegg/metadata_test" } } """, "json:npm-package.json": b""" { "version": "5.0.3", "name": "npm", "description": "a package manager for JavaScript", - "keywords": [ - "install", - "modules", - "package manager", - "package.json" - ], "preferGlobal": true, "config": { "publishtest": false }, "homepage": "https://docs.npmjs.com/", "author": "Isaac Z. 
Schlueter (http://blog.izs.me)", "repository": { "type": "git", "url": "https://github.com/npm/npm" }, "bugs": { "url": "https://github.com/npm/npm/issues" }, "dependencies": { "JSONStream": "~1.3.1", "abbrev": "~1.1.0", "ansi-regex": "~2.1.1", "ansicolors": "~0.3.2", "ansistyles": "~0.1.3" }, "devDependencies": { "tacks": "~1.2.6", "tap": "~10.3.2" }, "license": "Artistic-2.0" } """, "text:carriage-return": b""" """, "text:empty": b"", # was 626364 / b'bcd' "text:unimportant": b"unimportant content for bcd", # was 636465 / b'cde' now yarn-parser package.json "json:yarn-parser-package.json": b""" { "name": "yarn-parser", "version": "1.0.0", "description": "Tiny web service for parsing yarn.lock files", "main": "index.js", "scripts": { "start": "node index.js", "test": "mocha" }, "engines": { "node": "9.8.0" }, "repository": { "type": "git", "url": "git+https://github.com/librariesio/yarn-parser.git" }, - "keywords": [ - "yarn", - "parse", - "lock", - "dependencies" - ], "author": "Andrew Nesbitt", "license": "AGPL-3.0", "bugs": { "url": "https://github.com/librariesio/yarn-parser/issues" }, "homepage": "https://github.com/librariesio/yarn-parser#readme", "dependencies": { "@yarnpkg/lockfile": "^1.0.0", "body-parser": "^1.15.2", "express": "^4.14.0" }, "devDependencies": { "chai": "^4.1.2", "mocha": "^5.2.0", "request": "^2.87.0", "test": "^0.6.0" } } """, } MAPPING_DESCRIPTION_CONTENT_SHA1GIT: Dict[str, bytes] = {} MAPPING_DESCRIPTION_CONTENT_SHA1: Dict[str, bytes] = {} OBJ_STORAGE_DATA: Dict[bytes, bytes] = {} for key_description, data in OBJ_STORAGE_RAW_CONTENT.items(): content = Content.from_data(data) MAPPING_DESCRIPTION_CONTENT_SHA1GIT[key_description] = content.sha1_git MAPPING_DESCRIPTION_CONTENT_SHA1[key_description] = content.sha1 OBJ_STORAGE_DATA[content.sha1] = data RAW_CONTENT_METADATA = [ ( "du français".encode(), "text/plain", "utf-8", ), ( b"def __init__(self):", ("text/x-python", "text/x-script.python"), "us-ascii", ), ( b"\xff\xfe\x00\x00\x00\x00\xff\xfe\xff\xff", "application/octet-stream", "", ), ] RAW_CONTENTS: Dict[bytes, Tuple] = {} RAW_CONTENT_IDS: List[bytes] = [] for index, raw_content_d in enumerate(RAW_CONTENT_METADATA): raw_content = raw_content_d[0] content = Content.from_data(raw_content) RAW_CONTENTS[content.sha1] = raw_content_d RAW_CONTENT_IDS.append(content.sha1) # and write it to objstorage data so it's flushed in the objstorage OBJ_STORAGE_DATA[content.sha1] = raw_content SHA1_TO_LICENSES: Dict[bytes, List[str]] = { RAW_CONTENT_IDS[0]: ["GPL"], RAW_CONTENT_IDS[1]: ["AGPL"], RAW_CONTENT_IDS[2]: [], } DIRECTORY = Directory( entries=( DirectoryEntry( name=b"index.js", type="file", target=MAPPING_DESCRIPTION_CONTENT_SHA1GIT["text:some"], perms=0o100644, ), DirectoryEntry( name=b"package.json", type="file", target=MAPPING_DESCRIPTION_CONTENT_SHA1GIT[ "json:test-metadata-package.json" ], perms=0o100644, ), DirectoryEntry( name=b".github", type="dir", target=Directory(entries=()).id, perms=0o040000, ), ), ) DIRECTORY2 = Directory( entries=( DirectoryEntry( name=b"package.json", type="file", target=MAPPING_DESCRIPTION_CONTENT_SHA1GIT["json:yarn-parser-package.json"], perms=0o100644, ), ), ) _utc_plus_2 = datetime.timezone(datetime.timedelta(minutes=120)) REVISION = Revision( message=b"Improve search functionality", author=Person( name=b"Andrew Nesbitt", fullname=b"Andrew Nesbitt ", email=b"andrewnez@gmail.com", ), committer=Person( name=b"Andrew Nesbitt", fullname=b"Andrew Nesbitt ", email=b"andrewnez@gmail.com", ), 
committer_date=TimestampWithTimezone.from_datetime( datetime.datetime(2013, 10, 4, 12, 50, 49, tzinfo=_utc_plus_2) ), type=RevisionType.GIT, synthetic=False, date=TimestampWithTimezone.from_datetime( datetime.datetime(2017, 2, 20, 16, 14, 16, tzinfo=_utc_plus_2) ), directory=DIRECTORY2.id, parents=(), ) REVISIONS = [REVISION] RELEASE = Release( name=b"v0.0.0", message=None, author=Person( name=b"Andrew Nesbitt", fullname=b"Andrew Nesbitt ", email=b"andrewnez@gmail.com", ), synthetic=False, date=TimestampWithTimezone.from_datetime( datetime.datetime(2017, 2, 20, 16, 14, 16, tzinfo=_utc_plus_2) ), target_type=ObjectType.DIRECTORY, target=DIRECTORY2.id, ) RELEASES = [RELEASE] SNAPSHOTS = [ # https://github.com/SoftwareHeritage/swh-storage Snapshot( branches={ b"refs/heads/add-revision-origin-cache": SnapshotBranch( target=b'L[\xce\x1c\x88\x8eF\t\xf1"\x19\x1e\xfb\xc0s\xe7/\xe9l\x1e', target_type=TargetType.REVISION, ), b"refs/head/master": SnapshotBranch( target=b"8K\x12\x00d\x03\xcc\xe4]bS\xe3\x8f{\xd7}\xac\xefrm", target_type=TargetType.REVISION, ), b"HEAD": SnapshotBranch( target=b"refs/head/master", target_type=TargetType.ALIAS ), b"refs/tags/v0.0.103": SnapshotBranch( target=b'\xb6"Im{\xfdLb\xb0\x94N\xea\x96m\x13x\x88+\x0f\xdd', target_type=TargetType.RELEASE, ), }, ), # rsync://ftp.gnu.org/gnu/3dldf Snapshot( branches={ b"3DLDF-1.1.4.tar.gz": SnapshotBranch( target=b'dJ\xfb\x1c\x91\xf4\x82B%]6\xa2\x90|\xd3\xfc"G\x99\x11', target_type=TargetType.REVISION, ), b"3DLDF-2.0.2.tar.gz": SnapshotBranch( target=b"\xb6\x0e\xe7\x9e9\xac\xaa\x19\x9e=\xd1\xc5\x00\\\xc6\xfc\xe0\xa6\xb4V", # noqa target_type=TargetType.REVISION, ), b"3DLDF-2.0.3-examples.tar.gz": SnapshotBranch( target=b"!H\x19\xc0\xee\x82-\x12F1\xbd\x97\xfe\xadZ\x80\x80\xc1\x83\xff", # noqa target_type=TargetType.REVISION, ), b"3DLDF-2.0.3.tar.gz": SnapshotBranch( target=b"\x8e\xa9\x8e/\xea}\x9feF\xf4\x9f\xfd\xee\xcc\x1a\xb4`\x8c\x8by", # noqa target_type=TargetType.REVISION, ), b"3DLDF-2.0.tar.gz": SnapshotBranch( target=b"F6*\xff(?\x19a\xef\xb6\xc2\x1fv$S\xe3G\xd3\xd1m", target_type=TargetType.REVISION, ), }, ), # https://forge.softwareheritage.org/source/jesuisgpl/", Snapshot( branches={ b"master": SnapshotBranch( target=b"\xe7n\xa4\x9c\x9f\xfb\xb7\xf76\x11\x08{\xa6\xe9\x99\xb1\x9e]q\xeb", # noqa target_type=TargetType.REVISION, ) }, ), # https://old-pypi.example.org/project/limnoria/ Snapshot( branches={ b"HEAD": SnapshotBranch( target=b"releases/2018.09.09", target_type=TargetType.ALIAS ), b"releases/2018.09.01": SnapshotBranch( target=b"<\xee1(\xe8\x8d_\xc1\xc9\xa6rT\xf1\x1d\xbb\xdfF\xfdw\xcf", target_type=TargetType.REVISION, ), b"releases/2018.09.09": SnapshotBranch( target=b"\x83\xb9\xb6\xc7\x05\xb1%\xd0\xfem\xd8kA\x10\x9d\xc5\xfa2\xf8t", # noqa target_type=TargetType.REVISION, ), }, ), # https://pypi.org/project/limnoria/ Snapshot( branches={ b"HEAD": SnapshotBranch( target=b"releases/2018.09.09", target_type=TargetType.ALIAS ), b"releases/2018.09.01": SnapshotBranch( target=b"<\xee1(\xe8\x8d_\xc1\xc9\xa6rT\xf1\x1d\xbb\xdfF\xfdw\xcf", target_type=TargetType.RELEASE, ), b"releases/2018.09.09": SnapshotBranch( target=b"\x83\xb9\xb6\xc7\x05\xb1%\xd0\xfem\xd8kA\x10\x9d\xc5\xfa2\xf8t", # noqa target_type=TargetType.RELEASE, ), }, ), # http://0-512-md.googlecode.com/svn/ Snapshot( branches={ b"master": SnapshotBranch( target=b"\xe4?r\xe1,\x88\xab\xec\xe7\x9a\x87\xb8\xc9\xad#.\x1bw=\x18", target_type=TargetType.REVISION, ) }, ), # https://github.com/librariesio/yarn-parser Snapshot( branches={ b"HEAD": SnapshotBranch( 
target=REVISION.id, target_type=TargetType.REVISION, ) }, ), # https://github.com/librariesio/yarn-parser.git Snapshot( branches={ b"HEAD": SnapshotBranch( target=REVISION.id, target_type=TargetType.REVISION, ) }, ), # https://npm.example.org/yarn-parser Snapshot( branches={ b"HEAD": SnapshotBranch( target=RELEASE.id, target_type=TargetType.RELEASE, ) }, ), ] assert len(SNAPSHOTS) == len(ORIGIN_VISITS) YARN_PARSER_METADATA = { "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "url": "https://github.com/librariesio/yarn-parser#readme", "codeRepository": "git+git+https://github.com/librariesio/yarn-parser.git", "author": [{"type": "Person", "name": "Andrew Nesbitt"}], "license": "https://spdx.org/licenses/AGPL-3.0", "version": "1.0.0", "description": "Tiny web service for parsing yarn.lock files", "issueTracker": "https://github.com/librariesio/yarn-parser/issues", "name": "yarn-parser", - "keywords": ["yarn", "parse", "lock", "dependencies"], "type": "SoftwareSourceCode", } json_dict_keys = strategies.one_of( strategies.characters(), strategies.just("type"), strategies.just("url"), strategies.just("name"), strategies.just("email"), strategies.just("@id"), strategies.just("@context"), strategies.just("repository"), strategies.just("license"), strategies.just("repositories"), strategies.just("licenses"), ) """Hypothesis strategy that generates strings, with an emphasis on those that are often used as dictionary keys in metadata files.""" generic_json_document = strategies.recursive( strategies.none() | strategies.booleans() | strategies.floats() | strategies.characters(), lambda children: ( strategies.lists(children, min_size=1) | strategies.dictionaries(json_dict_keys, children, min_size=1) ), ) """Hypothesis strategy that generates possible values for values of JSON metadata files.""" def json_document_strategy(keys=None): """Generates an hypothesis strategy that generates metadata files for a JSON-based format that uses the given keys.""" if keys is None: keys = strategies.characters() else: keys = strategies.one_of(map(strategies.just, keys)) return strategies.dictionaries(keys, generic_json_document, min_size=1) def _tree_to_xml(root, xmlns, data): def encode(s): "Skips unpaired surrogates generated by json_document_strategy" return s.encode("utf8", "replace") def to_xml(data, indent=b" "): if data is None: return b"" elif isinstance(data, (bool, str, int, float)): return indent + encode(str(data)) elif isinstance(data, list): return b"\n".join(to_xml(v, indent=indent) for v in data) elif isinstance(data, dict): lines = [] for (key, value) in data.items(): lines.append(indent + encode("<{}>".format(key))) lines.append(to_xml(value, indent=indent + b" ")) lines.append(indent + encode("".format(key))) return b"\n".join(lines) else: raise TypeError(data) return b"\n".join( [ '<{} xmlns="{}">'.format(root, xmlns).encode(), to_xml(data), "".format(root).encode(), ] ) class TreeToXmlTest(unittest.TestCase): def test_leaves(self): self.assertEqual( _tree_to_xml("root", "http://example.com", None), b'\n\n', ) self.assertEqual( _tree_to_xml("root", "http://example.com", True), b'\n True\n', ) self.assertEqual( _tree_to_xml("root", "http://example.com", "abc"), b'\n abc\n', ) self.assertEqual( _tree_to_xml("root", "http://example.com", 42), b'\n 42\n', ) self.assertEqual( _tree_to_xml("root", "http://example.com", 3.14), b'\n 3.14\n', ) def test_dict(self): self.assertIn( _tree_to_xml("root", "http://example.com", {"foo": "bar", "baz": "qux"}), [ b'\n' b" \n bar\n \n" b" \n qux\n \n" b"", 
b'\n' b" \n qux\n \n" b" \n bar\n \n" b"", ], ) def test_list(self): self.assertEqual( _tree_to_xml( "root", "http://example.com", [ {"foo": "bar"}, {"foo": "baz"}, ], ), b'\n' b" \n bar\n \n" b" \n baz\n \n" b"", ) def xml_document_strategy(keys, root, xmlns): """Generates an hypothesis strategy that generates metadata files for an XML format that uses the given keys.""" return strategies.builds( functools.partial(_tree_to_xml, root, xmlns), json_document_strategy(keys) ) def filter_dict(d, keys): "return a copy of the dict with keys deleted" if not isinstance(keys, (list, tuple)): keys = (keys,) return dict((k, v) for (k, v) in d.items() if k not in keys) def fill_obj_storage(obj_storage): """Add some content in an object storage.""" for obj_id, content in OBJ_STORAGE_DATA.items(): obj_storage.add(content, obj_id) def fill_storage(storage): """Fill in storage with consistent test dataset.""" storage.content_add([Content.from_data(data) for data in OBJ_STORAGE_DATA.values()]) storage.directory_add([DIRECTORY, DIRECTORY2]) storage.revision_add(REVISIONS) storage.release_add(RELEASES) storage.snapshot_add(SNAPSHOTS) storage.origin_add(ORIGINS) for visit, snapshot in zip(ORIGIN_VISITS, SNAPSHOTS): assert snapshot.id is not None visit = storage.origin_visit_add( [OriginVisit(origin=visit["origin"], date=now(), type=visit["type"])] )[0] visit_status = OriginVisitStatus( origin=visit.origin, visit=visit.visit, date=now(), status="full", snapshot=snapshot.id, ) storage.origin_visit_status_add([visit_status]) class CommonContentIndexerTest(metaclass=abc.ABCMeta): def get_indexer_results(self, ids): """Override this for indexers that don't have a mock storage.""" return self.indexer.idx_storage.state def assert_results_ok(self, sha1s, expected_results=None): sha1s = [hash_to_bytes(sha1) for sha1 in sha1s] actual_results = list(self.get_indexer_results(sha1s)) if expected_results is None: expected_results = self.expected_results # expected results may contain slightly duplicated results assert 0 < len(actual_results) <= len(expected_results) for result in actual_results: assert result in expected_results def test_index(self): """Known sha1 have their data indexed""" sha1s = [self.id0, self.id1, self.id2] # when self.indexer.run(sha1s) self.assert_results_ok(sha1s) # 2nd pass self.indexer.run(sha1s) self.assert_results_ok(sha1s) def test_index_one_unknown_sha1(self): """Unknown sha1s are not indexed""" sha1s = [ self.id1, "799a5ef812c53907562fe379d4b3851e69c7cb15", # unknown "800a5ef812c53907562fe379d4b3851e69c7cb15", # unknown ] # unknown # when self.indexer.run(sha1s) # then expected_results = [res for res in self.expected_results if res.id in sha1s] self.assert_results_ok(sha1s, expected_results) class CommonContentIndexerPartitionTest: """Allows to factorize tests on range indexer.""" def setUp(self): self.contents = sorted(OBJ_STORAGE_DATA) def assert_results_ok(self, partition_id, nb_partitions, actual_results): expected_ids = [ c.sha1 for c in stream_results( self.indexer.storage.content_get_partition, partition_id=partition_id, nb_partitions=nb_partitions, ) ] actual_results = list(actual_results) for indexed_data in actual_results: _id = indexed_data.id assert _id in expected_ids _tool_id = indexed_data.indexer_configuration_id assert _tool_id == self.indexer.tool["id"] def test__index_contents(self): """Indexing contents without existing data results in indexed data""" partition_id = 0 nb_partitions = 4 actual_results = list( self.indexer._index_contents(partition_id, nb_partitions, 
indexed={}) ) self.assert_results_ok(partition_id, nb_partitions, actual_results) def test__index_contents_with_indexed_data(self): """Indexing contents with existing data results in less indexed data""" partition_id = 3 nb_partitions = 4 # first pass actual_results = list( self.indexer._index_contents(partition_id, nb_partitions, indexed={}), ) self.assert_results_ok(partition_id, nb_partitions, actual_results) indexed_ids = {res.id for res in actual_results} actual_results = list( self.indexer._index_contents( partition_id, nb_partitions, indexed=indexed_ids ) ) # already indexed, so nothing new assert actual_results == [] def test_generate_content_get(self): """Optimal indexing should result in indexed data""" partition_id = 0 nb_partitions = 1 actual_results = self.indexer.run( partition_id, nb_partitions, skip_existing=False ) assert actual_results["status"] == "eventful", actual_results def test_generate_content_get_no_result(self): """No result indexed returns False""" actual_results = self.indexer.run(1, 2**512, incremental=False) assert actual_results == {"status": "uneventful"} def mock_compute_license(path): """path is the content identifier""" if isinstance(path, bytes): path = path.decode("utf-8") # path is something like /tmp/tmpXXX/<sha1> so we keep only the sha1 part id_ = path.split("/")[-1] return {"licenses": SHA1_TO_LICENSES.get(hash_to_bytes(id_), [])}
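For reference, a minimal usage sketch of the mock_compute_license helper above (not part of the diff itself): it relies only on fixtures defined in this same file (RAW_CONTENT_IDS, SHA1_TO_LICENSES) plus swh.model.hashutil.hash_to_hex, and shows how a scanner temp path is mapped back to the expected license list.

# Illustrative sketch only -- not part of swh/indexer/tests/utils.py.
from swh.model.hashutil import hash_to_hex

from swh.indexer.tests.utils import (
    RAW_CONTENT_IDS,
    SHA1_TO_LICENSES,
    mock_compute_license,
)

sha1 = RAW_CONTENT_IDS[0]  # sha1 of the first raw test content
# The license scanner is normally handed a temporary file named after the sha1;
# mock_compute_license keeps only that last path component to look up the fixture.
path = "/tmp/tmpXXX/" + hash_to_hex(sha1)
assert mock_compute_license(path) == {"licenses": SHA1_TO_LICENSES[sha1]}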