diff --git a/PKG-INFO b/PKG-INFO index 838ac17..968ea38 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,69 +1,69 @@ Metadata-Version: 2.1 Name: swh.indexer -Version: 0.0.59 +Version: 0.0.60 Summary: Software Heritage Content Indexer Home-page: https://forge.softwareheritage.org/diffusion/78/ Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-indexer Description: swh-indexer ============ Tools to compute multiple indexes on SWH's raw contents: - content: - mimetype - ctags - language - fossology-license - metadata - revision: - metadata An indexer is in charge of: - looking up objects - extracting information from those objects - store those information in the swh-indexer db There are multiple indexers working on different object types: - content indexer: works with content sha1 hashes - revision indexer: works with revision sha1 hashes - origin indexer: works with origin identifiers Indexation procedure: - receive batch of ids - retrieve the associated data depending on object type - compute for that object some index - store the result to swh's storage Current content indexers: - mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype - language (queue swh_indexer_content_language): detect the programming language - ctags (queue swh_indexer_content_ctags): compute tags information - fossology-license (queue swh_indexer_fossology_license): compute the license - metadata: translate file into translated_metadata dict Current revision indexers: - metadata: detects files containing metadata and retrieves translated_metadata in content_metadata table in storage or run content indexer to translate files. 
Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Description-Content-Type: text/markdown Provides-Extra: testing diff --git a/debian/changelog b/debian/changelog index 4b07cdf..5c2c28a 100644 --- a/debian/changelog +++ b/debian/changelog @@ -1,479 +1,481 @@ -swh-indexer (0.0.59-1~swh1~bpo9+1) stretch-swh; urgency=medium +swh-indexer (0.0.60-1~swh1) unstable-swh; urgency=medium - * Rebuild for stretch-backports. + * v0.0.60 + * origin_head: Make next step optional + * tests: Increase coverage - -- Antoine R. Dumont (@ardumont) Tue, 20 Nov 2018 14:27:20 +0100 + -- Antoine R. Dumont (@ardumont) Wed, 21 Nov 2018 12:33:13 +0100 swh-indexer (0.0.59-1~swh1) unstable-swh; urgency=medium * v0.0.59 * fossology license: Fix issue on license computation * Improve docstrings * Fix pep8 violations * Increase coverage on content indexers -- Antoine R. Dumont (@ardumont) Tue, 20 Nov 2018 14:27:20 +0100 swh-indexer (0.0.58-1~swh1) unstable-swh; urgency=medium * v0.0.58 * Add missing default configuration for fossology license indexer * tests: Remove dead code -- Antoine R. Dumont (@ardumont) Tue, 20 Nov 2018 12:06:56 +0100 swh-indexer (0.0.57-1~swh1) unstable-swh; urgency=medium * v0.0.57 * storage: Open new endpoint on fossology license range retrieval * indexer: Open new fossology license range indexer -- Antoine R. Dumont (@ardumont) Tue, 20 Nov 2018 11:44:57 +0100 swh-indexer (0.0.56-1~swh1) unstable-swh; urgency=medium * v0.0.56 * storage.api: Open new endpoints (mimetype range, fossology range) * content indexers: Open mimetype and fossology range indexers * Remove orchestrator modules * tests: Improve coverage -- Antoine R. 
Dumont (@ardumont) Mon, 19 Nov 2018 11:56:06 +0100 swh-indexer (0.0.55-1~swh1) unstable-swh; urgency=medium * v0.0.55 * swh.indexer: Let task reschedule itself through the scheduler * Use swh.scheduler instead of celery leaking all around * swh.indexer.orchestrator: Fix orchestrator initialization step * swh.indexer.tasks: Fix type error when no result or list result -- Antoine R. Dumont (@ardumont) Mon, 29 Oct 2018 10:41:54 +0100 swh-indexer (0.0.54-1~swh1) unstable-swh; urgency=medium * v0.0.54 * swh.indexer.tasks: Fix task to use the scheduler's -- Antoine R. Dumont (@ardumont) Thu, 25 Oct 2018 20:13:51 +0200 swh-indexer (0.0.53-1~swh1) unstable-swh; urgency=medium * v0.0.53 * swh.indexer.rehash: Migrate to latest swh.model.hashutil.MultiHash * indexer: Add the origin intrinsic metadata indexer * indexer: Add OriginIndexer and OriginHeadIndexer. * indexer.storage: Add the origin intrinsic metadata storage database * indexer.storage: Autogenerate the Indexer Storage HTTP API. * setup: prepare for pypi upload * tests: Add a tox file * tests: migrate to pytest * tests: Add tests around celery stack * docs: Improve documentation and reuse README in generated documentation -- Antoine R. Dumont (@ardumont) Thu, 25 Oct 2018 19:03:56 +0200 swh-indexer (0.0.52-1~swh1) unstable-swh; urgency=medium * v0.0.52 * swh.indexer.storage: Refactor fossology license get (first external * contribution, cf. /CONTRIBUTORS) * swh.indexer.storage: Fix typo in invariable name metadata * swh.indexer.storage: No longer use temp table when reading data * swh.indexer.storage: Clean up unused import * swh.indexer.storage: Remove dead entry points origin_metadata* * swh.indexer.storage: Update docstrings information and format -- Antoine R. 
Dumont (@ardumont) Wed, 13 Jun 2018 11:20:40 +0200 swh-indexer (0.0.51-1~swh1) unstable-swh; urgency=medium * Release swh.indexer v0.0.51 * Update for new db_transaction{,_generator} -- Nicolas Dandrimont Tue, 05 Jun 2018 14:10:39 +0200 swh-indexer (0.0.50-1~swh1) unstable-swh; urgency=medium * v0.0.50 * swh.indexer.api.client: Permit to specify the query timeout option -- Antoine R. Dumont (@ardumont) Thu, 24 May 2018 12:19:06 +0200 swh-indexer (0.0.49-1~swh1) unstable-swh; urgency=medium * v0.0.49 * test_storage: Instantiate the tools during tests' setUp phase * test_storage: Deallocate storage during teardown step * test_storage: Make storage test fixture connect to postgres itself * storage.api.server: Only instantiate storage backend once per import * Use thread-aware psycopg2 connection pooling for database access -- Antoine R. Dumont (@ardumont) Mon, 14 May 2018 11:09:30 +0200 swh-indexer (0.0.48-1~swh1) unstable-swh; urgency=medium * Release swh.indexer v0.0.48 * Update for new swh.storage -- Nicolas Dandrimont Sat, 12 May 2018 18:30:10 +0200 swh-indexer (0.0.47-1~swh1) unstable-swh; urgency=medium * v0.0.47 * d/control: Fix runtime typo in packaging dependency -- Antoine R. Dumont (@ardumont) Thu, 07 Dec 2017 16:54:49 +0100 swh-indexer (0.0.46-1~swh1) unstable-swh; urgency=medium * v0.0.46 * Split swh-indexer packages in 2 python3-swh.indexer.storage and * python3-swh.indexer -- Antoine R. Dumont (@ardumont) Thu, 07 Dec 2017 16:18:04 +0100 swh-indexer (0.0.45-1~swh1) unstable-swh; urgency=medium * v0.0.45 * Fix usual error raised when deploying -- Antoine R. Dumont (@ardumont) Thu, 07 Dec 2017 15:01:01 +0100 swh-indexer (0.0.44-1~swh1) unstable-swh; urgency=medium * v0.0.44 * swh.indexer: Make indexer use their own storage -- Antoine R. Dumont (@ardumont) Thu, 07 Dec 2017 13:20:44 +0100 swh-indexer (0.0.43-1~swh1) unstable-swh; urgency=medium * v0.0.43 * swh.indexer.mimetype: Work around problem in detection -- Antoine R. 
Dumont (@ardumont) Wed, 29 Nov 2017 10:26:11 +0100 swh-indexer (0.0.42-1~swh1) unstable-swh; urgency=medium * v0.0.42 * swh.indexer: Make indexers register tools in prepare method -- Antoine R. Dumont (@ardumont) Fri, 24 Nov 2017 11:26:03 +0100 swh-indexer (0.0.41-1~swh1) unstable-swh; urgency=medium * v0.0.41 * mimetype: Use magic library api instead of parsing `file` cli output -- Antoine R. Dumont (@ardumont) Mon, 20 Nov 2017 13:05:29 +0100 swh-indexer (0.0.39-1~swh1) unstable-swh; urgency=medium * v0.0.39 * swh.indexer.producer: Fix argument to match the abstract definition -- Antoine R. Dumont (@ardumont) Thu, 19 Oct 2017 10:03:44 +0200 swh-indexer (0.0.38-1~swh1) unstable-swh; urgency=medium * v0.0.38 * swh.indexer.indexer: Fix argument to match the abstract definition -- Antoine R. Dumont (@ardumont) Wed, 18 Oct 2017 19:57:47 +0200 swh-indexer (0.0.37-1~swh1) unstable-swh; urgency=medium * v0.0.37 * swh.indexer.indexer: Fix argument to match the abstract definition -- Antoine R. Dumont (@ardumont) Wed, 18 Oct 2017 18:59:42 +0200 swh-indexer (0.0.36-1~swh1) unstable-swh; urgency=medium * v0.0.36 * packaging: Cleanup * codemeta: Adding codemeta.json file to document metadata * swh.indexer.mimetype: Fix edge case regarding empty raw content * docs: sanitize docstrings for sphinx documentation generation * swh.indexer.metadata: Add RevisionMetadataIndexer * swh.indexer.metadata: Add ContentMetadataIndexer * swh.indexer: Refactor base class to improve inheritance * swh.indexer.metadata: First draft of the metadata content indexer * for npm (package.json) * swh.indexer.tests: Added tests for language indexer -- Antoine R. 
Dumont (@ardumont) Wed, 18 Oct 2017 16:24:24 +0200 swh-indexer (0.0.35-1~swh1) unstable-swh; urgency=medium * Release swh.indexer 0.0.35 * Update tasks to new swh.scheduler API -- Nicolas Dandrimont Mon, 12 Jun 2017 18:02:04 +0200 swh-indexer (0.0.34-1~swh1) unstable-swh; urgency=medium * v0.0.34 * Fix unbound local error on edge case -- Antoine R. Dumont (@ardumont) Wed, 07 Jun 2017 11:23:29 +0200 swh-indexer (0.0.33-1~swh1) unstable-swh; urgency=medium * v0.0.33 * language indexer: Improve edge case policy -- Antoine R. Dumont (@ardumont) Wed, 07 Jun 2017 11:02:47 +0200 swh-indexer (0.0.32-1~swh1) unstable-swh; urgency=medium * v0.0.32 * Update fossology license to use the latest swh-storage * Improve language indexer to deal with potential error on bad * chunking -- Antoine R. Dumont (@ardumont) Tue, 06 Jun 2017 18:13:40 +0200 swh-indexer (0.0.31-1~swh1) unstable-swh; urgency=medium * v0.0.31 * Reduce log verbosity on language indexer -- Antoine R. Dumont (@ardumont) Fri, 02 Jun 2017 19:08:52 +0200 swh-indexer (0.0.30-1~swh1) unstable-swh; urgency=medium * v0.0.30 * Fix wrong default configuration -- Antoine R. Dumont (@ardumont) Fri, 02 Jun 2017 18:01:27 +0200 swh-indexer (0.0.29-1~swh1) unstable-swh; urgency=medium * v0.0.29 * Update indexer to resolve indexer configuration identifier * Adapt language indexer to use partial raw content -- Antoine R. Dumont (@ardumont) Fri, 02 Jun 2017 16:21:27 +0200 swh-indexer (0.0.28-1~swh1) unstable-swh; urgency=medium * v0.0.28 * Add error resilience to fossology indexer -- Antoine R. Dumont (@ardumont) Mon, 22 May 2017 12:57:55 +0200 swh-indexer (0.0.27-1~swh1) unstable-swh; urgency=medium * v0.0.27 * swh.indexer.language: Incremental encoding detection -- Antoine R. 
Dumont (@ardumont) Wed, 17 May 2017 18:04:27 +0200 swh-indexer (0.0.26-1~swh1) unstable-swh; urgency=medium * v0.0.26 * swh.indexer.orchestrator: Add batch size option per indexer * Log caught exception in a unified manner * Add rescheduling option (not by default) on rehash + indexers -- Antoine R. Dumont (@ardumont) Wed, 17 May 2017 14:08:07 +0200 swh-indexer (0.0.25-1~swh1) unstable-swh; urgency=medium * v0.0.25 * Add reschedule on error parameter for indexers -- Antoine R. Dumont (@ardumont) Fri, 12 May 2017 12:13:15 +0200 swh-indexer (0.0.24-1~swh1) unstable-swh; urgency=medium * v0.0.24 * Make rehash indexer more resilient to errors by rescheduling contents * in error (be it reading or updating problems) -- Antoine R. Dumont (@ardumont) Thu, 04 May 2017 14:22:43 +0200 swh-indexer (0.0.23-1~swh1) unstable-swh; urgency=medium * v0.0.23 * Improve producer to optionally make it synchroneous -- Antoine R. Dumont (@ardumont) Wed, 03 May 2017 15:29:44 +0200 swh-indexer (0.0.22-1~swh1) unstable-swh; urgency=medium * v0.0.22 * Improve mimetype indexer implementation * Make the chaining option in the mimetype indexer -- Antoine R. Dumont (@ardumont) Tue, 02 May 2017 16:31:14 +0200 swh-indexer (0.0.21-1~swh1) unstable-swh; urgency=medium * v0.0.21 * swh.indexer.rehash: Actually make the worker log -- Antoine R. Dumont (@ardumont) Tue, 02 May 2017 14:28:55 +0200 swh-indexer (0.0.20-1~swh1) unstable-swh; urgency=medium * v0.0.20 * swh.indexer.rehash: * Improve reading from objstorage only when needed * Fix empty file use case (which was skipped) * Add logging -- Antoine R. Dumont (@ardumont) Fri, 28 Apr 2017 09:39:09 +0200 swh-indexer (0.0.19-1~swh1) unstable-swh; urgency=medium * v0.0.19 * Fix rehash indexer's default configuration file -- Antoine R. Dumont (@ardumont) Thu, 27 Apr 2017 19:17:20 +0200 swh-indexer (0.0.18-1~swh1) unstable-swh; urgency=medium * v0.0.18 * Add new rehash indexer -- Antoine R. 
Dumont (@ardumont) Wed, 26 Apr 2017 15:23:02 +0200 swh-indexer (0.0.17-1~swh1) unstable-swh; urgency=medium * v0.0.17 * Add information on indexer tools (T610) -- Antoine R. Dumont (@ardumont) Fri, 02 Dec 2016 18:32:54 +0100 swh-indexer (0.0.16-1~swh1) unstable-swh; urgency=medium * v0.0.16 * bug fixes -- Antoine R. Dumont (@ardumont) Tue, 15 Nov 2016 19:31:52 +0100 swh-indexer (0.0.15-1~swh1) unstable-swh; urgency=medium * v0.0.15 * Improve message producer -- Antoine R. Dumont (@ardumont) Tue, 15 Nov 2016 18:16:42 +0100 swh-indexer (0.0.14-1~swh1) unstable-swh; urgency=medium * v0.0.14 * Update package dependency on fossology-nomossa -- Antoine R. Dumont (@ardumont) Tue, 15 Nov 2016 14:13:41 +0100 swh-indexer (0.0.13-1~swh1) unstable-swh; urgency=medium * v0.0.13 * Add new license indexer * ctags indexer: align behavior with other indexers regarding the * conflict update policy -- Antoine R. Dumont (@ardumont) Mon, 14 Nov 2016 14:13:34 +0100 swh-indexer (0.0.12-1~swh1) unstable-swh; urgency=medium * v0.0.12 * Add runtime dependency on universal-ctags -- Antoine R. Dumont (@ardumont) Fri, 04 Nov 2016 13:59:59 +0100 swh-indexer (0.0.11-1~swh1) unstable-swh; urgency=medium * v0.0.11 * Remove dependency on exuberant-ctags -- Antoine R. Dumont (@ardumont) Thu, 03 Nov 2016 16:13:26 +0100 swh-indexer (0.0.10-1~swh1) unstable-swh; urgency=medium * v0.0.10 * Add ctags indexer -- Antoine R. Dumont (@ardumont) Thu, 20 Oct 2016 16:12:42 +0200 swh-indexer (0.0.9-1~swh1) unstable-swh; urgency=medium * v0.0.9 * d/control: Bump dependency to latest python3-swh.storage api * mimetype: Use the charset to filter out data * orchestrator: Separate 2 distincts orchestrators (one for all * contents, one for text contents) * mimetype: once index computed, send text contents to text orchestrator -- Antoine R. 
Dumont (@ardumont) Thu, 13 Oct 2016 15:28:17 +0200 swh-indexer (0.0.8-1~swh1) unstable-swh; urgency=medium * v0.0.8 * Separate configuration file per indexer (no need for language) * Rename module file_properties to mimetype consistently with other * layers -- Antoine R. Dumont (@ardumont) Sat, 08 Oct 2016 11:46:29 +0200 swh-indexer (0.0.7-1~swh1) unstable-swh; urgency=medium * v0.0.7 * Adapt indexer language and mimetype to store result in storage. * Clean up obsolete code -- Antoine R. Dumont (@ardumont) Sat, 08 Oct 2016 10:26:08 +0200 swh-indexer (0.0.6-1~swh1) unstable-swh; urgency=medium * v0.0.6 * Fix multiple issues on production -- Antoine R. Dumont (@ardumont) Fri, 30 Sep 2016 17:00:11 +0200 swh-indexer (0.0.5-1~swh1) unstable-swh; urgency=medium * v0.0.5 * Fix debian/control dependency issue -- Antoine R. Dumont (@ardumont) Fri, 30 Sep 2016 16:06:20 +0200 swh-indexer (0.0.4-1~swh1) unstable-swh; urgency=medium * v0.0.4 * Upgrade dependencies issues -- Antoine R. Dumont (@ardumont) Fri, 30 Sep 2016 16:01:52 +0200 swh-indexer (0.0.3-1~swh1) unstable-swh; urgency=medium * v0.0.3 * Add encoding detection * Use encoding to improve language detection * bypass language detection for binary files * bypass ctags for binary files or decoding failure file -- Antoine R. Dumont (@ardumont) Fri, 30 Sep 2016 12:30:11 +0200 swh-indexer (0.0.2-1~swh1) unstable-swh; urgency=medium * v0.0.2 * Provide one possible sha1's name for the multiple tools to ease * information extrapolation * Fix debian package dependency issue -- Antoine R. Dumont (@ardumont) Thu, 29 Sep 2016 21:45:44 +0200 swh-indexer (0.0.1-1~swh1) unstable-swh; urgency=medium * Initial release * v0.0.1 * First implementation on poc -- Antoine R. 
Dumont (@ardumont) Wed, 28 Sep 2016 23:40:13 +0200 diff --git a/swh.indexer.egg-info/PKG-INFO b/swh.indexer.egg-info/PKG-INFO index 838ac17..968ea38 100644 --- a/swh.indexer.egg-info/PKG-INFO +++ b/swh.indexer.egg-info/PKG-INFO @@ -1,69 +1,69 @@ Metadata-Version: 2.1 Name: swh.indexer -Version: 0.0.59 +Version: 0.0.60 Summary: Software Heritage Content Indexer Home-page: https://forge.softwareheritage.org/diffusion/78/ Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-indexer Description: swh-indexer ============ Tools to compute multiple indexes on SWH's raw contents: - content: - mimetype - ctags - language - fossology-license - metadata - revision: - metadata An indexer is in charge of: - looking up objects - extracting information from those objects - store those information in the swh-indexer db There are multiple indexers working on different object types: - content indexer: works with content sha1 hashes - revision indexer: works with revision sha1 hashes - origin indexer: works with origin identifiers Indexation procedure: - receive batch of ids - retrieve the associated data depending on object type - compute for that object some index - store the result to swh's storage Current content indexers: - mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype - language (queue swh_indexer_content_language): detect the programming language - ctags (queue swh_indexer_content_ctags): compute tags information - fossology-license (queue swh_indexer_fossology_license): compute the license - metadata: translate file into translated_metadata dict Current revision indexers: - metadata: detects files containing metadata and retrieves translated_metadata in content_metadata table in storage or run content indexer to 
translate files. Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Description-Content-Type: text/markdown Provides-Extra: testing diff --git a/swh/indexer/origin_head.py b/swh/indexer/origin_head.py index 54123ac..6a1ca96 100644 --- a/swh/indexer/origin_head.py +++ b/swh/indexer/origin_head.py @@ -1,217 +1,221 @@ # Copyright (C) 2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import re import click import logging from swh.scheduler import get_scheduler from swh.scheduler.utils import create_task_dict from swh.indexer.indexer import OriginIndexer class OriginHeadIndexer(OriginIndexer): """Origin-level indexer. This indexer is in charge of looking up the revision that acts as the "head" of an origin. In git, this is usually the commit pointed to by the 'master' branch.""" ADDITIONAL_CONFIG = { 'tools': ('dict', { 'name': 'origin-metadata', 'version': '0.0.1', 'configuration': {}, }), + 'tasks': ('dict', { + 'revision_metadata': 'revision_metadata', + 'origin_intrinsic_metadata': 'origin_metadata', + }) } CONFIG_BASE_FILENAME = 'indexer/origin_head' - revision_metadata_task = 'revision_metadata' - origin_intrinsic_metadata_task = 'origin_metadata' - def filter(self, ids): yield from ids def persist_index_computations(self, results, policy_update): """Do nothing. 
The indexer's results are not persistent, they should only be piped to another indexer.""" pass def next_step(self, results, task): """Once the head is found, call the RevisionMetadataIndexer on these revisions, then call the OriginMetadataIndexer with both the origin_id and the revision metadata, so it can copy the revision metadata to the origin's metadata. Args: results (Iterable[dict]): Iterable of return values from `index`. """ super().next_step(results, task) - if self.revision_metadata_task is None and \ - self.origin_intrinsic_metadata_task is None: + revision_metadata_task = self.config['tasks']['revision_metadata'] + origin_intrinsic_metadata_task = self.config['tasks'][ + 'origin_intrinsic_metadata'] + if revision_metadata_task is None and \ + origin_intrinsic_metadata_task is None: return - assert self.revision_metadata_task is not None - assert self.origin_intrinsic_metadata_task is not None + assert revision_metadata_task is not None + assert origin_intrinsic_metadata_task is not None # Second task to run after this one: copy the revision's metadata # to the origin sub_task = create_task_dict( - self.origin_intrinsic_metadata_task, + origin_intrinsic_metadata_task, 'oneshot', origin_head={ str(result['origin_id']): result['revision_id'].decode() for result in results}, policy_update='update-dups', ) del sub_task['next_run'] # Not json-serializable # First task to run after this one: index the metadata of the # revision task = create_task_dict( - self.revision_metadata_task, + revision_metadata_task, 'oneshot', ids=[res['revision_id'].decode() for res in results], policy_update='update-dups', next_step={ **sub_task, 'result_name': 'revisions_metadata'}, ) if getattr(self, 'scheduler', None): scheduler = self.scheduler else: scheduler = get_scheduler(**self.config['scheduler']) scheduler.create_tasks([task]) # Dispatch def index(self, origin): origin_id = origin['id'] latest_snapshot = self.storage.snapshot_get_latest(origin_id) method = getattr(self, 
'_try_get_%s_head' % origin['type'], None) if method is None: method = self._try_get_head_generic rev_id = method(latest_snapshot) if rev_id is None: return None result = { 'origin_id': origin_id, 'revision_id': rev_id, } return result # VCSs def _try_get_vcs_head(self, snapshot): try: if isinstance(snapshot, dict): branches = snapshot['branches'] if branches[b'HEAD']['target_type'] == 'revision': return branches[b'HEAD']['target'] except KeyError: return None _try_get_hg_head = _try_get_git_head = _try_get_vcs_head # Tarballs _archive_filename_re = re.compile( rb'^' rb'(?P<pkgname>.*)[-_]' rb'(?P<version>[0-9]+(\.[0-9])*)' rb'(?P<preversion>[-+][a-zA-Z0-9.~]+?)?' rb'(?P<extension>(\.[a-zA-Z0-9]+)+)' rb'$') @classmethod def _parse_version(cls, filename): """Extracts the release version from an archive filename, to get an ordering whose maximum is likely to be the last version of the software >>> OriginHeadIndexer._parse_version(b'foo') (-inf,) >>> OriginHeadIndexer._parse_version(b'foo.tar.gz') (-inf,) >>> OriginHeadIndexer._parse_version(b'gnu-hello-0.0.1.tar.gz') (0, 0, 1, 0) >>> OriginHeadIndexer._parse_version(b'gnu-hello-0.0.1-beta2.tar.gz') (0, 0, 1, -1, 'beta2') >>> OriginHeadIndexer._parse_version(b'gnu-hello-0.0.1+foobar.tar.gz') (0, 0, 1, 1, 'foobar') """ res = cls._archive_filename_re.match(filename) if res is None: return (float('-infinity'),) version = [int(n) for n in res.group('version').decode().split('.')] if res.group('preversion') is None: version.append(0) else: preversion = res.group('preversion').decode() if preversion.startswith('-'): version.append(-1) version.append(preversion[1:]) elif preversion.startswith('+'): version.append(1) version.append(preversion[1:]) else: assert False, res.group('preversion') return tuple(version) def _try_get_ftp_head(self, snapshot): archive_names = list(snapshot['branches']) max_archive_name = max(archive_names, key=self._parse_version) r = self._try_resolve_target(snapshot['branches'], max_archive_name) return r # Generic def 
_try_get_head_generic(self, snapshot): # Works on 'deposit', 'svn', and 'pypi'. try: if isinstance(snapshot, dict): branches = snapshot['branches'] except KeyError: return None else: return ( self._try_resolve_target(branches, b'HEAD') or self._try_resolve_target(branches, b'master') ) def _try_resolve_target(self, branches, target_name): try: target = branches[target_name] while target['target_type'] == 'alias': target = branches[target['target']] if target['target_type'] == 'revision': return target['target'] elif target['target_type'] == 'content': return None # TODO elif target['target_type'] == 'directory': return None # TODO elif target['target_type'] == 'release': return None # TODO else: assert False except KeyError: return None @click.command() @click.option('--origins', '-i', help='Origins to lookup, in the "type+url" format', multiple=True) def main(origins): rev_metadata_indexer = OriginHeadIndexer() rev_metadata_indexer.run(origins, 'update-dups', parse_ids=True) if __name__ == '__main__': logging.basicConfig(level=logging.INFO) main() diff --git a/swh/indexer/tests/test_ctags.py b/swh/indexer/tests/test_ctags.py index ae45338..21939d7 100644 --- a/swh/indexer/tests/test_ctags.py +++ b/swh/indexer/tests/test_ctags.py @@ -1,104 +1,153 @@ # Copyright (C) 2017-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import unittest import logging -from swh.indexer.ctags import CtagsIndexer + +from unittest.mock import patch +from swh.indexer.ctags import ( + CtagsIndexer, run_ctags +) + from swh.indexer.tests.test_utils import ( BasicMockIndexerStorage, MockObjStorage, CommonContentIndexerTest, CommonIndexerWithErrorsTest, CommonIndexerNoTool, SHA1_TO_CTAGS, NoDiskIndexer ) +class BasicTest(unittest.TestCase): + @patch('swh.indexer.ctags.subprocess') + def test_run_ctags(self, 
mock_subprocess): + """Computing ctags from raw content should return results + + """ + output0 = """ +{"name":"defun","kind":"function","line":1,"language":"scheme"} +{"name":"name","kind":"symbol","line":5,"language":"else"}""" + output1 = """ +{"name":"let","kind":"var","line":10,"language":"something"}""" + + expected_result0 = [ + { + 'name': 'defun', + 'kind': 'function', + 'line': 1, + 'lang': 'scheme' + }, + { + 'name': 'name', + 'kind': 'symbol', + 'line': 5, + 'lang': 'else' + } + ] + + expected_result1 = [ + { + 'name': 'let', + 'kind': 'var', + 'line': 10, + 'lang': 'something' + } + ] + for path, lang, intermediary_result, expected_result in [ + (b'some/path', 'lisp', output0, expected_result0), + (b'some/path/2', 'markdown', output1, expected_result1) + ]: + mock_subprocess.check_output.return_value = intermediary_result + actual_result = list(run_ctags(path, lang=lang)) + self.assertEqual(actual_result, expected_result) + + class InjectCtagsIndexer: """Override ctags computations. """ def compute_ctags(self, path, lang): """Inject fake ctags given path (sha1 identifier). """ return { 'lang': lang, **SHA1_TO_CTAGS.get(path) } class CtagsIndexerTest(NoDiskIndexer, InjectCtagsIndexer, CtagsIndexer): """Specific language whose configuration is enough to satisfy the indexing tests. 
""" def prepare(self): self.config = { 'tools': { 'name': 'universal-ctags', 'version': '~git7859817b', 'configuration': { 'command_line': '''ctags --fields=+lnz --sort=no ''' ''' --links=no ''', 'max_content_size': 1000, }, }, 'languages': { 'python': 'python', 'haskell': 'haskell', 'bar': 'bar', } } self.idx_storage = BasicMockIndexerStorage() self.log = logging.getLogger('swh.indexer') self.objstorage = MockObjStorage() self.tool_config = self.config['tools']['configuration'] self.max_content_size = self.tool_config['max_content_size'] self.tools = self.register_tools(self.config['tools']) self.tool = self.tools[0] self.language_map = self.config['languages'] class TestCtagsIndexer(CommonContentIndexerTest, unittest.TestCase): """Ctags indexer test scenarios: - Known sha1s in the input list have their data indexed - Unknown sha1 in the input list are not indexed """ def setUp(self): self.indexer = CtagsIndexerTest() # Prepare test input self.id0 = '01c9379dfc33803963d07c1ccc748d3fe4c96bb5' self.id1 = 'd4c647f0fc257591cc9ba1722484229780d1c607' self.id2 = '688a5ef812c53907562fe379d4b3851e69c7cb15' tool_id = self.indexer.tool['id'] self.expected_results = { self.id0: { 'id': self.id0, 'indexer_configuration_id': tool_id, 'ctags': SHA1_TO_CTAGS[self.id0], }, self.id1: { 'id': self.id1, 'indexer_configuration_id': tool_id, 'ctags': SHA1_TO_CTAGS[self.id1], }, self.id2: { 'id': self.id2, 'indexer_configuration_id': tool_id, 'ctags': SHA1_TO_CTAGS[self.id2], } } class CtagsIndexerUnknownToolTestStorage( CommonIndexerNoTool, CtagsIndexerTest): """Fossology license indexer with wrong configuration""" class TestCtagsIndexersErrors( CommonIndexerWithErrorsTest, unittest.TestCase): """Test the indexer raise the right errors when wrongly initialized""" Indexer = CtagsIndexerUnknownToolTestStorage diff --git a/swh/indexer/tests/test_origin_head.py b/swh/indexer/tests/test_origin_head.py index 335ced7..f7e07a1 100644 --- a/swh/indexer/tests/test_origin_head.py +++ 
b/swh/indexer/tests/test_origin_head.py @@ -1,91 +1,91 @@ -# Copyright (C) 2017 The Software Heritage developers +# Copyright (C) 2017-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import unittest import logging from swh.indexer.origin_head import OriginHeadIndexer from swh.indexer.tests.test_utils import MockIndexerStorage, MockStorage class OriginHeadTestIndexer(OriginHeadIndexer): """Specific indexer whose configuration is enough to satisfy the indexing tests. """ - - revision_metadata_task = None - origin_intrinsic_metadata_task = None - def prepare(self): self.config = { 'tools': { 'name': 'origin-metadata', 'version': '0.0.1', 'configuration': {}, }, + 'tasks': { + 'revision_metadata': None, + 'origin_intrinsic_metadata': None, + } } self.storage = MockStorage() self.idx_storage = MockIndexerStorage() self.log = logging.getLogger('swh.indexer') self.objstorage = None self.tools = self.register_tools(self.config['tools']) self.tool = self.tools[0] self.results = None def persist_index_computations(self, results, policy_update): self.results = results class OriginHead(unittest.TestCase): def test_git(self): indexer = OriginHeadTestIndexer() indexer.run( ['git+https://github.com/SoftwareHeritage/swh-storage'], 'update-dups', parse_ids=True) self.assertEqual(indexer.results, [{ 'revision_id': b'8K\x12\x00d\x03\xcc\xe4]bS\xe3\x8f{' b'\xd7}\xac\xefrm', 'origin_id': 52189575}]) def test_ftp(self): indexer = OriginHeadTestIndexer() indexer.run( ['ftp+rsync://ftp.gnu.org/gnu/3dldf'], 'update-dups', parse_ids=True) self.assertEqual(indexer.results, [{ 'revision_id': b'\x8e\xa9\x8e/\xea}\x9feF\xf4\x9f\xfd\xee' b'\xcc\x1a\xb4`\x8c\x8by', 'origin_id': 4423668}]) def test_deposit(self): indexer = OriginHeadTestIndexer() indexer.run( ['deposit+https://forge.softwareheritage.org/source/' 
                  'jesuisgpl/'],
                 'update-dups', parse_ids=True)
         self.assertEqual(indexer.results, [{
             'revision_id': b'\xe7n\xa4\x9c\x9f\xfb\xb7\xf76\x11\x08{'
                            b'\xa6\xe9\x99\xb1\x9e]q\xeb',
             'origin_id': 77775770}])

     def test_pypi(self):
         indexer = OriginHeadTestIndexer()
         indexer.run(
                 ['pypi+https://pypi.org/project/limnoria/'],
                 'update-dups', parse_ids=True)
         self.assertEqual(indexer.results, [{
             'revision_id': b'\x83\xb9\xb6\xc7\x05\xb1%\xd0\xfem\xd8k'
                            b'A\x10\x9d\xc5\xfa2\xf8t',
             'origin_id': 85072327}])

     def test_svn(self):
         indexer = OriginHeadTestIndexer()
         indexer.run(
                 ['svn+http://0-512-md.googlecode.com/svn/'],
                 'update-dups', parse_ids=True)
         self.assertEqual(indexer.results, [{
             'revision_id': b'\xe4?r\xe1,\x88\xab\xec\xe7\x9a\x87\xb8'
                            b'\xc9\xad#.\x1bw=\x18',
             'origin_id': 49908349}])
diff --git a/swh/indexer/tests/test_origin_metadata.py b/swh/indexer/tests/test_origin_metadata.py
index 1ed3024..7166434 100644
--- a/swh/indexer/tests/test_origin_metadata.py
+++ b/swh/indexer/tests/test_origin_metadata.py
@@ -1,126 +1,130 @@
 # Copyright (C) 2018 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information

 import time
 import logging
 import unittest

 from celery import task

 from swh.indexer.metadata import OriginMetadataIndexer
 from swh.indexer.tests.test_utils import MockObjStorage, MockStorage
 from swh.indexer.tests.test_utils import MockIndexerStorage
 from swh.indexer.tests.test_origin_head import OriginHeadTestIndexer
 from swh.indexer.tests.test_metadata import RevisionMetadataTestIndexer
 from swh.scheduler.tests.scheduler_testing import SchedulerTestFixture


 class OriginMetadataTestIndexer(OriginMetadataIndexer):
     def prepare(self):
         self.config = {
             'storage': {
                 'cls': 'remote',
                 'args': {
                     'url': 'http://localhost:9999',
                 }
             },
             'tools': {
                 'name': 'origin-metadata',
                 'version': '0.0.1',
                 'configuration': {}
             }
         }
         self.storage = MockStorage()
         self.idx_storage = MockIndexerStorage()
         self.log = logging.getLogger('swh.indexer')
         self.objstorage = MockObjStorage()
         self.tools = self.register_tools(self.config['tools'])
         self.tool = self.tools[0]
         self.results = []


 @task
 def revision_metadata_test_task(*args, **kwargs):
     indexer = RevisionMetadataTestIndexer()
     indexer.run(*args, **kwargs)
     return indexer.results


 @task
 def origin_intrinsic_metadata_test_task(*args, **kwargs):
     indexer = OriginMetadataTestIndexer()
     indexer.run(*args, **kwargs)
     return indexer.results


 class OriginHeadTestIndexer(OriginHeadTestIndexer):
-    revision_metadata_task = 'revision_metadata_test_task'
-    origin_intrinsic_metadata_task = 'origin_intrinsic_metadata_test_task'
+    def prepare(self):
+        super().prepare()
+        self.config['tasks'] = {
+            'revision_metadata': 'revision_metadata_test_task',
+            'origin_intrinsic_metadata': 'origin_intrinsic_metadata_test_task',
+        }


 class TestOriginMetadata(SchedulerTestFixture, unittest.TestCase):
     def setUp(self):
         super().setUp()
         self.maxDiff = None
         MockIndexerStorage.added_data = []
         self.add_scheduler_task_type(
             'revision_metadata_test_task',
             'swh.indexer.tests.test_origin_metadata.'
             'revision_metadata_test_task')
         self.add_scheduler_task_type(
             'origin_intrinsic_metadata_test_task',
             'swh.indexer.tests.test_origin_metadata.'
             'origin_intrinsic_metadata_test_task')
         RevisionMetadataTestIndexer.scheduler = self.scheduler

     def tearDown(self):
         del RevisionMetadataTestIndexer.scheduler
         super().tearDown()

     def test_pipeline(self):
         indexer = OriginHeadTestIndexer()
         indexer.scheduler = self.scheduler
         indexer.run(
                 ["git+https://github.com/librariesio/yarn-parser"],
                 policy_update='update-dups',
                 parse_ids=True)

         self.run_ready_tasks()  # Run the first task
         time.sleep(0.1)  # Give it time to complete and schedule the 2nd one
         self.run_ready_tasks()  # Run the second task

         metadata = {
             '@context': 'https://doi.org/10.5063/schema/codemeta-2.0',
             'url':
                 'https://github.com/librariesio/yarn-parser#readme',
             'schema:codeRepository':
                 'git+https://github.com/librariesio/yarn-parser.git',
             'schema:author': 'Andrew Nesbitt',
             'license': 'AGPL-3.0',
             'version': '1.0.0',
             'description':
                 'Tiny web service for parsing yarn.lock files',
             'codemeta:issueTracker':
                 'https://github.com/librariesio/yarn-parser/issues',
             'name': 'yarn-parser',
             'keywords': ['yarn', 'parse', 'lock', 'dependencies'],
         }
         rev_metadata = {
             'id': '8dbb6aeb036e7fd80664eb8bfd1507881af1ba9f',
             'translated_metadata': metadata,
             'indexer_configuration_id': 7,
         }
         origin_metadata = {
             'origin_id': 54974445,
             'from_revision': '8dbb6aeb036e7fd80664eb8bfd1507881af1ba9f',
             'metadata': metadata,
             'indexer_configuration_id': 7,
         }
         expected_results = [
                 ('origin_intrinsic_metadata', True, [origin_metadata]),
                 ('revision_metadata', True, [rev_metadata])]

         results = list(indexer.idx_storage.added_data)
         self.assertCountEqual(expected_results, results)
diff --git a/version.txt b/version.txt
index d1c0402..52f6aed 100644
--- a/version.txt
+++ b/version.txt
@@ -1 +1 @@
-v0.0.59-0-g45c8f94
\ No newline at end of file
+v0.0.60-0-ga1332dd
\ No newline at end of file