diff --git a/PKG-INFO b/PKG-INFO index a2cb919..20eaf96 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,69 +1,69 @@ Metadata-Version: 2.1 Name: swh.indexer -Version: 0.0.137 +Version: 0.0.139 Summary: Software Heritage Content Indexer Home-page: https://forge.softwareheritage.org/diffusion/78/ Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN -Project-URL: Source, https://forge.softwareheritage.org/source/swh-indexer Project-URL: Funding, https://www.softwareheritage.org/donate +Project-URL: Source, https://forge.softwareheritage.org/source/swh-indexer Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Description: swh-indexer ============ Tools to compute multiple indexes on SWH's raw contents: - content: - mimetype - ctags - language - fossology-license - metadata - revision: - metadata An indexer is in charge of: - looking up objects - extracting information from those objects - store those information in the swh-indexer db There are multiple indexers working on different object types: - content indexer: works with content sha1 hashes - revision indexer: works with revision sha1 hashes - origin indexer: works with origin identifiers Indexation procedure: - receive batch of ids - retrieve the associated data depending on object type - compute for that object some index - store the result to swh's storage Current content indexers: - mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype - language (queue swh_indexer_content_language): detect the programming language - ctags (queue swh_indexer_content_ctags): compute tags information - fossology-license (queue swh_indexer_fossology_license): compute the license - metadata: translate file into translated_metadata dict Current revision indexers: - metadata: detects files containing metadata and retrieves translated_metadata in content_metadata table in storage or run content indexer to translate files. Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Description-Content-Type: text/markdown Provides-Extra: testing diff --git a/debian/changelog b/debian/changelog index ca98333..93e0708 100644 --- a/debian/changelog +++ b/debian/changelog @@ -1,697 +1,709 @@ -swh-indexer (0.0.137-1~swh1~bpo9+1) stretch-swh; urgency=medium +swh-indexer (0.0.139-1~swh1) unstable-swh; urgency=medium - * Rebuild for stretch-swh + * New upstream release 0.0.139 - (tagged by Antoine R. Dumont + (@ardumont) on 2019-02-22 15:53:22 + +0100) + * Upstream changes: - v0.0.139 - Clean up no longer used tasks + + -- Software Heritage autobuilder (on jenkins-debian1) Fri, 22 Feb 2019 14:59:40 +0000 + +swh-indexer (0.0.138-1~swh1) unstable-swh; urgency=medium + + * New upstream release 0.0.138 - (tagged by Valentin Lorentz + on 2019-02-22 15:30:30 +0100) + * Upstream changes: - Make the 'config' argument of + OriginMetadaIndexer optional again. - -- Software Heritage autobuilder (on jenkins-debian1) Fri, 22 Feb 2019 10:12:57 +0000 + -- Software Heritage autobuilder (on jenkins-debian1) Fri, 22 Feb 2019 14:37:35 +0000 swh-indexer (0.0.137-1~swh1) unstable-swh; urgency=medium * New upstream release 0.0.137 - (tagged by Antoine R. 
Dumont (@ardumont) on 2019-02-22 10:59:53 +0100) * Upstream changes: - v0.0.137 - swh.indexer.storage.api.wsgi: Open production wsgi entrypoint - swh.indexer.cli: Move dev app entrypoint in dedicated cli - indexer.storage: Make server load explicit configuration and check - config: use already loaded swh config, if any, when instantiating an Indexer - api: Add support for filtering by tool_id to origin_intrinsic_metadata_search_by_producer. - api: Add storage endpoint to search metadata by mapping. - runtime: Remove implicit configuration from the metadata indexers. - debian: Remove debian packaging from master branch - docs: Update missing documentation -- Software Heritage autobuilder (on jenkins-debian1) Fri, 22 Feb 2019 10:11:29 +0000 swh-indexer (0.0.136-1~swh1) unstable-swh; urgency=medium * New upstream release 0.0.136 - (tagged by Valentin Lorentz on 2019-02-14 17:09:00 +0100) * Upstream changes: - Don't send 'None' as a revision id to storage.revision_get. - This error wasn't caught before because the in-mem storage - accepts None values, but the pg storage doesn't. -- Software Heritage autobuilder (on jenkins-debian1) Thu, 14 Feb 2019 16:22:41 +0000 swh-indexer (0.0.135-1~swh1) unstable-swh; urgency=medium * New upstream release 0.0.135 - (tagged by Valentin Lorentz on 2019-02-14 14:45:24 +0100) * Upstream changes: - Fix deduplication of origins when persisting origin intrinsic metadata. -- Software Heritage autobuilder (on jenkins-debian1) Thu, 14 Feb 2019 14:32:55 +0000 swh-indexer (0.0.134-1~swh1) unstable-swh; urgency=medium * New upstream release 0.0.134 - (tagged by Antoine R. Dumont (@ardumont) on 2019-02-13 23:46:44 +0100) * Upstream changes: - v0.0.134 - package: Break dependency of swh.indexer.storage on swh.indexer. - api/server: Do not read configuration at each request - metadata: Fix gemspec test - metadata: Prevent OriginMetadataIndexer from sending duplicate - revisions to revision_metadata_add. - test: Fix bugs found by hypothesis. - test: Use hypothesis to generate adversarial inputs. - Add more type checks in metadata dictionary. - Add checks in the idx_storage that the same content/rev/orig is not - present twice in the new data. -- Software Heritage autobuilder (on jenkins-debian1) Thu, 14 Feb 2019 09:16:15 +0000 swh-indexer (0.0.133-1~swh1) unstable-swh; urgency=medium * New upstream release 0.0.133 - (tagged by Antoine R. Dumont (@ardumont) on 2019-02-12 10:28:01 +0100) * Upstream changes: - v0.0.133 - Migrate BaseDB api calls from core to storage - Improve storage api calls using latest storage api - OriginIndexer: Refactoring - tests: Refactoring - metadata search: Use index - indexer metadata: Provide stats per origin - indexer metadata: Update mapping column - indexer metadata: Improve and fix issues -- Software Heritage autobuilder (on jenkins-debian1) Tue, 12 Feb 2019 09:34:43 +0000 swh-indexer (0.0.132-1~swh1) unstable-swh; urgency=medium * New upstream release 0.0.132 - (tagged by Antoine R. Dumont (@ardumont) on 2019-01-30 15:03:14 +0100) * Upstream changes: - v0.0.132 - swh/indexer/tasks: Fix range indexer tasks - Maven: Add support for empty XML nodes. - Add support for alternative call format for Gem::Specification.new. -- Software Heritage autobuilder (on jenkins-debian1) Wed, 30 Jan 2019 14:09:48 +0000 swh-indexer (0.0.131-1~swh1) unstable-swh; urgency=medium * New upstream release 0.0.131 - (tagged by Antoine R. 
Dumont (@ardumont) on 2019-01-30 10:56:43 +0100) * Upstream changes: - v0.0.131 - fix pep8 violations - fix misspellings -- Software Heritage autobuilder (on jenkins-debian1) Wed, 30 Jan 2019 10:01:47 +0000 swh-indexer (0.0.129-1~swh1) unstable-swh; urgency=medium * New upstream release 0.0.129 - (tagged by Valentin Lorentz on 2019-01-29 14:11:22 +0100) * Upstream changes: - Fix missing config file name change. -- Software Heritage autobuilder (on jenkins-debian1) Tue, 29 Jan 2019 13:34:17 +0000 swh-indexer (0.0.128-1~swh1) unstable-swh; urgency=medium * New upstream release 0.0.128 - (tagged by Valentin Lorentz on 2019-01-25 15:22:52 +0100) * Upstream changes: - Make metadata indexers store the mappings used to translate metadata. -- Software Heritage autobuilder (on jenkins-debian1) Tue, 29 Jan 2019 12:18:16 +0000 swh-indexer (0.0.127-1~swh1) unstable-swh; urgency=medium * New upstream release 0.0.127 - (tagged by Valentin Lorentz on 2019-01-15 15:56:49 +0100) * Upstream changes: - Prevent repository normalization from crashing on malformed input. -- Software Heritage autobuilder (on jenkins-debian1) Tue, 15 Jan 2019 16:20:32 +0000 swh-indexer (0.0.126-1~swh1) unstable-swh; urgency=medium * New upstream release 0.0.126 - (tagged by Valentin Lorentz on 2019-01-14 11:42:52 +0100) * Upstream changes: - Don't call OriginHeadIndexer.next_step when there is no revision. -- Software Heritage autobuilder (on jenkins-debian1) Mon, 14 Jan 2019 10:57:34 +0000 swh-indexer (0.0.125-1~swh1) unstable-swh; urgency=medium * New upstream release 0.0.125 - (tagged by Antoine R. Dumont (@ardumont) on 2019-01-11 12:01:42 +0100) * Upstream changes: - v0.0.125 - Add journal client that listens for origin visits and schedules - OriginHead - Fix tests to work with the new version of swh.storage -- Software Heritage autobuilder (on jenkins-debian1) Fri, 11 Jan 2019 11:08:51 +0000 swh-indexer (0.0.124-1~swh1) unstable-swh; urgency=medium * New upstream release 0.0.124 - (tagged by Antoine R. Dumont (@ardumont) on 2019-01-08 14:09:32 +0100) * Upstream changes: - v0.0.124 - indexer: Fix type check on indexing result -- Software Heritage autobuilder (on jenkins-debian1) Thu, 10 Jan 2019 17:12:07 +0000 swh-indexer (0.0.118-1~swh1) unstable-swh; urgency=medium * v0.0.118 * metadata-indexer: Fix setup initialization * tests: Refactoring -- Antoine R. Dumont (@ardumont) Fri, 30 Nov 2018 14:50:52 +0100 swh-indexer (0.0.67-1~swh1) unstable-swh; urgency=medium * v0.0.67 * mimetype: Migrate to indexed data as text -- Antoine R. Dumont (@ardumont) Wed, 28 Nov 2018 11:35:37 +0100 swh-indexer (0.0.66-1~swh1) unstable-swh; urgency=medium * v0.0.66 * range-indexer: Stream indexing range computations -- Antoine R. Dumont (@ardumont) Tue, 27 Nov 2018 11:48:24 +0100 swh-indexer (0.0.65-1~swh1) unstable-swh; urgency=medium * v0.0.65 * Fix revision metadata indexer -- Antoine R. Dumont (@ardumont) Mon, 26 Nov 2018 19:30:48 +0100 swh-indexer (0.0.64-1~swh1) unstable-swh; urgency=medium * v0.0.64 * indexer: Fix mixed identifier encodings issues * Add missing config filename for origin intrinsic metadata indexer. -- Antoine R. Dumont (@ardumont) Mon, 26 Nov 2018 12:20:01 +0100 swh-indexer (0.0.63-1~swh1) unstable-swh; urgency=medium * v0.0.63 * Make the OriginMetadataIndexer fetch rev metadata from the storage * instead of getting them via the scheduler. * Make the 'result_name' key of 'next_step' optional. * Add missing return. * doc: update index to match new swh-doc format -- Antoine R. 
Dumont (@ardumont) Fri, 23 Nov 2018 17:56:10 +0100 swh-indexer (0.0.62-1~swh1) unstable-swh; urgency=medium * v0.0.62 * metadata indexer: Add empty tool configuration * Add fulltext search on origin intrinsic metadata -- Antoine R. Dumont (@ardumont) Fri, 23 Nov 2018 14:25:55 +0100 swh-indexer (0.0.61-1~swh1) unstable-swh; urgency=medium * v0.0.61 * indexer: Fix origin indexer's default arguments -- Antoine R. Dumont (@ardumont) Wed, 21 Nov 2018 16:01:50 +0100 swh-indexer (0.0.60-1~swh1) unstable-swh; urgency=medium * v0.0.60 * origin_head: Make next step optional * tests: Increase coverage -- Antoine R. Dumont (@ardumont) Wed, 21 Nov 2018 12:33:13 +0100 swh-indexer (0.0.59-1~swh1) unstable-swh; urgency=medium * v0.0.59 * fossology license: Fix issue on license computation * Improve docstrings * Fix pep8 violations * Increase coverage on content indexers -- Antoine R. Dumont (@ardumont) Tue, 20 Nov 2018 14:27:20 +0100 swh-indexer (0.0.58-1~swh1) unstable-swh; urgency=medium * v0.0.58 * Add missing default configuration for fossology license indexer * tests: Remove dead code -- Antoine R. Dumont (@ardumont) Tue, 20 Nov 2018 12:06:56 +0100 swh-indexer (0.0.57-1~swh1) unstable-swh; urgency=medium * v0.0.57 * storage: Open new endpoint on fossology license range retrieval * indexer: Open new fossology license range indexer -- Antoine R. Dumont (@ardumont) Tue, 20 Nov 2018 11:44:57 +0100 swh-indexer (0.0.56-1~swh1) unstable-swh; urgency=medium * v0.0.56 * storage.api: Open new endpoints (mimetype range, fossology range) * content indexers: Open mimetype and fossology range indexers * Remove orchestrator modules * tests: Improve coverage -- Antoine R. Dumont (@ardumont) Mon, 19 Nov 2018 11:56:06 +0100 swh-indexer (0.0.55-1~swh1) unstable-swh; urgency=medium * v0.0.55 * swh.indexer: Let task reschedule itself through the scheduler * Use swh.scheduler instead of celery leaking all around * swh.indexer.orchestrator: Fix orchestrator initialization step * swh.indexer.tasks: Fix type error when no result or list result -- Antoine R. Dumont (@ardumont) Mon, 29 Oct 2018 10:41:54 +0100 swh-indexer (0.0.54-1~swh1) unstable-swh; urgency=medium * v0.0.54 * swh.indexer.tasks: Fix task to use the scheduler's -- Antoine R. Dumont (@ardumont) Thu, 25 Oct 2018 20:13:51 +0200 swh-indexer (0.0.53-1~swh1) unstable-swh; urgency=medium * v0.0.53 * swh.indexer.rehash: Migrate to latest swh.model.hashutil.MultiHash * indexer: Add the origin intrinsic metadata indexer * indexer: Add OriginIndexer and OriginHeadIndexer. * indexer.storage: Add the origin intrinsic metadata storage database * indexer.storage: Autogenerate the Indexer Storage HTTP API. * setup: prepare for pypi upload * tests: Add a tox file * tests: migrate to pytest * tests: Add tests around celery stack * docs: Improve documentation and reuse README in generated documentation -- Antoine R. Dumont (@ardumont) Thu, 25 Oct 2018 19:03:56 +0200 swh-indexer (0.0.52-1~swh1) unstable-swh; urgency=medium * v0.0.52 * swh.indexer.storage: Refactor fossology license get (first external * contribution, cf. /CONTRIBUTORS) * swh.indexer.storage: Fix typo in invariable name metadata * swh.indexer.storage: No longer use temp table when reading data * swh.indexer.storage: Clean up unused import * swh.indexer.storage: Remove dead entry points origin_metadata* * swh.indexer.storage: Update docstrings information and format -- Antoine R. 
Dumont (@ardumont) Wed, 13 Jun 2018 11:20:40 +0200 swh-indexer (0.0.51-1~swh1) unstable-swh; urgency=medium * Release swh.indexer v0.0.51 * Update for new db_transaction{,_generator} -- Nicolas Dandrimont Tue, 05 Jun 2018 14:10:39 +0200 swh-indexer (0.0.50-1~swh1) unstable-swh; urgency=medium * v0.0.50 * swh.indexer.api.client: Permit to specify the query timeout option -- Antoine R. Dumont (@ardumont) Thu, 24 May 2018 12:19:06 +0200 swh-indexer (0.0.49-1~swh1) unstable-swh; urgency=medium * v0.0.49 * test_storage: Instantiate the tools during tests' setUp phase * test_storage: Deallocate storage during teardown step * test_storage: Make storage test fixture connect to postgres itself * storage.api.server: Only instantiate storage backend once per import * Use thread-aware psycopg2 connection pooling for database access -- Antoine R. Dumont (@ardumont) Mon, 14 May 2018 11:09:30 +0200 swh-indexer (0.0.48-1~swh1) unstable-swh; urgency=medium * Release swh.indexer v0.0.48 * Update for new swh.storage -- Nicolas Dandrimont Sat, 12 May 2018 18:30:10 +0200 swh-indexer (0.0.47-1~swh1) unstable-swh; urgency=medium * v0.0.47 * d/control: Fix runtime typo in packaging dependency -- Antoine R. Dumont (@ardumont) Thu, 07 Dec 2017 16:54:49 +0100 swh-indexer (0.0.46-1~swh1) unstable-swh; urgency=medium * v0.0.46 * Split swh-indexer packages in 2 python3-swh.indexer.storage and * python3-swh.indexer -- Antoine R. Dumont (@ardumont) Thu, 07 Dec 2017 16:18:04 +0100 swh-indexer (0.0.45-1~swh1) unstable-swh; urgency=medium * v0.0.45 * Fix usual error raised when deploying -- Antoine R. Dumont (@ardumont) Thu, 07 Dec 2017 15:01:01 +0100 swh-indexer (0.0.44-1~swh1) unstable-swh; urgency=medium * v0.0.44 * swh.indexer: Make indexer use their own storage -- Antoine R. Dumont (@ardumont) Thu, 07 Dec 2017 13:20:44 +0100 swh-indexer (0.0.43-1~swh1) unstable-swh; urgency=medium * v0.0.43 * swh.indexer.mimetype: Work around problem in detection -- Antoine R. Dumont (@ardumont) Wed, 29 Nov 2017 10:26:11 +0100 swh-indexer (0.0.42-1~swh1) unstable-swh; urgency=medium * v0.0.42 * swh.indexer: Make indexers register tools in prepare method -- Antoine R. Dumont (@ardumont) Fri, 24 Nov 2017 11:26:03 +0100 swh-indexer (0.0.41-1~swh1) unstable-swh; urgency=medium * v0.0.41 * mimetype: Use magic library api instead of parsing `file` cli output -- Antoine R. Dumont (@ardumont) Mon, 20 Nov 2017 13:05:29 +0100 swh-indexer (0.0.39-1~swh1) unstable-swh; urgency=medium * v0.0.39 * swh.indexer.producer: Fix argument to match the abstract definition -- Antoine R. Dumont (@ardumont) Thu, 19 Oct 2017 10:03:44 +0200 swh-indexer (0.0.38-1~swh1) unstable-swh; urgency=medium * v0.0.38 * swh.indexer.indexer: Fix argument to match the abstract definition -- Antoine R. Dumont (@ardumont) Wed, 18 Oct 2017 19:57:47 +0200 swh-indexer (0.0.37-1~swh1) unstable-swh; urgency=medium * v0.0.37 * swh.indexer.indexer: Fix argument to match the abstract definition -- Antoine R. 
Dumont (@ardumont) Wed, 18 Oct 2017 18:59:42 +0200 swh-indexer (0.0.36-1~swh1) unstable-swh; urgency=medium * v0.0.36 * packaging: Cleanup * codemeta: Adding codemeta.json file to document metadata * swh.indexer.mimetype: Fix edge case regarding empty raw content * docs: sanitize docstrings for sphinx documentation generation * swh.indexer.metadata: Add RevisionMetadataIndexer * swh.indexer.metadata: Add ContentMetadataIndexer * swh.indexer: Refactor base class to improve inheritance * swh.indexer.metadata: First draft of the metadata content indexer * for npm (package.json) * swh.indexer.tests: Added tests for language indexer -- Antoine R. Dumont (@ardumont) Wed, 18 Oct 2017 16:24:24 +0200 swh-indexer (0.0.35-1~swh1) unstable-swh; urgency=medium * Release swh.indexer 0.0.35 * Update tasks to new swh.scheduler API -- Nicolas Dandrimont Mon, 12 Jun 2017 18:02:04 +0200 swh-indexer (0.0.34-1~swh1) unstable-swh; urgency=medium * v0.0.34 * Fix unbound local error on edge case -- Antoine R. Dumont (@ardumont) Wed, 07 Jun 2017 11:23:29 +0200 swh-indexer (0.0.33-1~swh1) unstable-swh; urgency=medium * v0.0.33 * language indexer: Improve edge case policy -- Antoine R. Dumont (@ardumont) Wed, 07 Jun 2017 11:02:47 +0200 swh-indexer (0.0.32-1~swh1) unstable-swh; urgency=medium * v0.0.32 * Update fossology license to use the latest swh-storage * Improve language indexer to deal with potential error on bad * chunking -- Antoine R. Dumont (@ardumont) Tue, 06 Jun 2017 18:13:40 +0200 swh-indexer (0.0.31-1~swh1) unstable-swh; urgency=medium * v0.0.31 * Reduce log verbosity on language indexer -- Antoine R. Dumont (@ardumont) Fri, 02 Jun 2017 19:08:52 +0200 swh-indexer (0.0.30-1~swh1) unstable-swh; urgency=medium * v0.0.30 * Fix wrong default configuration -- Antoine R. Dumont (@ardumont) Fri, 02 Jun 2017 18:01:27 +0200 swh-indexer (0.0.29-1~swh1) unstable-swh; urgency=medium * v0.0.29 * Update indexer to resolve indexer configuration identifier * Adapt language indexer to use partial raw content -- Antoine R. Dumont (@ardumont) Fri, 02 Jun 2017 16:21:27 +0200 swh-indexer (0.0.28-1~swh1) unstable-swh; urgency=medium * v0.0.28 * Add error resilience to fossology indexer -- Antoine R. Dumont (@ardumont) Mon, 22 May 2017 12:57:55 +0200 swh-indexer (0.0.27-1~swh1) unstable-swh; urgency=medium * v0.0.27 * swh.indexer.language: Incremental encoding detection -- Antoine R. Dumont (@ardumont) Wed, 17 May 2017 18:04:27 +0200 swh-indexer (0.0.26-1~swh1) unstable-swh; urgency=medium * v0.0.26 * swh.indexer.orchestrator: Add batch size option per indexer * Log caught exception in a unified manner * Add rescheduling option (not by default) on rehash + indexers -- Antoine R. Dumont (@ardumont) Wed, 17 May 2017 14:08:07 +0200 swh-indexer (0.0.25-1~swh1) unstable-swh; urgency=medium * v0.0.25 * Add reschedule on error parameter for indexers -- Antoine R. Dumont (@ardumont) Fri, 12 May 2017 12:13:15 +0200 swh-indexer (0.0.24-1~swh1) unstable-swh; urgency=medium * v0.0.24 * Make rehash indexer more resilient to errors by rescheduling contents * in error (be it reading or updating problems) -- Antoine R. Dumont (@ardumont) Thu, 04 May 2017 14:22:43 +0200 swh-indexer (0.0.23-1~swh1) unstable-swh; urgency=medium * v0.0.23 * Improve producer to optionally make it synchroneous -- Antoine R. Dumont (@ardumont) Wed, 03 May 2017 15:29:44 +0200 swh-indexer (0.0.22-1~swh1) unstable-swh; urgency=medium * v0.0.22 * Improve mimetype indexer implementation * Make the chaining option in the mimetype indexer -- Antoine R. 
Dumont (@ardumont) Tue, 02 May 2017 16:31:14 +0200 swh-indexer (0.0.21-1~swh1) unstable-swh; urgency=medium * v0.0.21 * swh.indexer.rehash: Actually make the worker log -- Antoine R. Dumont (@ardumont) Tue, 02 May 2017 14:28:55 +0200 swh-indexer (0.0.20-1~swh1) unstable-swh; urgency=medium * v0.0.20 * swh.indexer.rehash: * Improve reading from objstorage only when needed * Fix empty file use case (which was skipped) * Add logging -- Antoine R. Dumont (@ardumont) Fri, 28 Apr 2017 09:39:09 +0200 swh-indexer (0.0.19-1~swh1) unstable-swh; urgency=medium * v0.0.19 * Fix rehash indexer's default configuration file -- Antoine R. Dumont (@ardumont) Thu, 27 Apr 2017 19:17:20 +0200 swh-indexer (0.0.18-1~swh1) unstable-swh; urgency=medium * v0.0.18 * Add new rehash indexer -- Antoine R. Dumont (@ardumont) Wed, 26 Apr 2017 15:23:02 +0200 swh-indexer (0.0.17-1~swh1) unstable-swh; urgency=medium * v0.0.17 * Add information on indexer tools (T610) -- Antoine R. Dumont (@ardumont) Fri, 02 Dec 2016 18:32:54 +0100 swh-indexer (0.0.16-1~swh1) unstable-swh; urgency=medium * v0.0.16 * bug fixes -- Antoine R. Dumont (@ardumont) Tue, 15 Nov 2016 19:31:52 +0100 swh-indexer (0.0.15-1~swh1) unstable-swh; urgency=medium * v0.0.15 * Improve message producer -- Antoine R. Dumont (@ardumont) Tue, 15 Nov 2016 18:16:42 +0100 swh-indexer (0.0.14-1~swh1) unstable-swh; urgency=medium * v0.0.14 * Update package dependency on fossology-nomossa -- Antoine R. Dumont (@ardumont) Tue, 15 Nov 2016 14:13:41 +0100 swh-indexer (0.0.13-1~swh1) unstable-swh; urgency=medium * v0.0.13 * Add new license indexer * ctags indexer: align behavior with other indexers regarding the * conflict update policy -- Antoine R. Dumont (@ardumont) Mon, 14 Nov 2016 14:13:34 +0100 swh-indexer (0.0.12-1~swh1) unstable-swh; urgency=medium * v0.0.12 * Add runtime dependency on universal-ctags -- Antoine R. Dumont (@ardumont) Fri, 04 Nov 2016 13:59:59 +0100 swh-indexer (0.0.11-1~swh1) unstable-swh; urgency=medium * v0.0.11 * Remove dependency on exuberant-ctags -- Antoine R. Dumont (@ardumont) Thu, 03 Nov 2016 16:13:26 +0100 swh-indexer (0.0.10-1~swh1) unstable-swh; urgency=medium * v0.0.10 * Add ctags indexer -- Antoine R. Dumont (@ardumont) Thu, 20 Oct 2016 16:12:42 +0200 swh-indexer (0.0.9-1~swh1) unstable-swh; urgency=medium * v0.0.9 * d/control: Bump dependency to latest python3-swh.storage api * mimetype: Use the charset to filter out data * orchestrator: Separate 2 distincts orchestrators (one for all * contents, one for text contents) * mimetype: once index computed, send text contents to text orchestrator -- Antoine R. Dumont (@ardumont) Thu, 13 Oct 2016 15:28:17 +0200 swh-indexer (0.0.8-1~swh1) unstable-swh; urgency=medium * v0.0.8 * Separate configuration file per indexer (no need for language) * Rename module file_properties to mimetype consistently with other * layers -- Antoine R. Dumont (@ardumont) Sat, 08 Oct 2016 11:46:29 +0200 swh-indexer (0.0.7-1~swh1) unstable-swh; urgency=medium * v0.0.7 * Adapt indexer language and mimetype to store result in storage. * Clean up obsolete code -- Antoine R. Dumont (@ardumont) Sat, 08 Oct 2016 10:26:08 +0200 swh-indexer (0.0.6-1~swh1) unstable-swh; urgency=medium * v0.0.6 * Fix multiple issues on production -- Antoine R. Dumont (@ardumont) Fri, 30 Sep 2016 17:00:11 +0200 swh-indexer (0.0.5-1~swh1) unstable-swh; urgency=medium * v0.0.5 * Fix debian/control dependency issue -- Antoine R. 
Dumont (@ardumont) Fri, 30 Sep 2016 16:06:20 +0200 swh-indexer (0.0.4-1~swh1) unstable-swh; urgency=medium * v0.0.4 * Upgrade dependencies issues -- Antoine R. Dumont (@ardumont) Fri, 30 Sep 2016 16:01:52 +0200 swh-indexer (0.0.3-1~swh1) unstable-swh; urgency=medium * v0.0.3 * Add encoding detection * Use encoding to improve language detection * bypass language detection for binary files * bypass ctags for binary files or decoding failure file -- Antoine R. Dumont (@ardumont) Fri, 30 Sep 2016 12:30:11 +0200 swh-indexer (0.0.2-1~swh1) unstable-swh; urgency=medium * v0.0.2 * Provide one possible sha1's name for the multiple tools to ease * information extrapolation * Fix debian package dependency issue -- Antoine R. Dumont (@ardumont) Thu, 29 Sep 2016 21:45:44 +0200 swh-indexer (0.0.1-1~swh1) unstable-swh; urgency=medium * Initial release * v0.0.1 * First implementation on poc -- Antoine R. Dumont (@ardumont) Wed, 28 Sep 2016 23:40:13 +0200 diff --git a/debian/control b/debian/control index 7ccda20..3e7ea07 100644 --- a/debian/control +++ b/debian/control @@ -1,50 +1,52 @@ Source: swh-indexer Maintainer: Software Heritage developers Section: python Priority: optional Build-Depends: debhelper (>= 9), dh-python (>= 2), + postgresql-contrib, python3-all, python3-chardet (>= 2.3.0~), python3-click, python3-hypothesis (>= 3.11.0~), python3-pytest, + python3-pytest-postgresql, python3-pygments, python3-magic, python3-pyld, python3-setuptools, - python3-swh.core (>= 0.0.44~), + python3-swh.core (>= 0.0.53~), python3-swh.model (>= 0.0.15~), python3-swh.objstorage (>= 0.0.28~), python3-swh.scheduler (>= 0.0.35~), - python3-swh.storage (>= 0.0.113~), + python3-swh.storage (>= 0.0.123~), python3-vcversioner, - python3-xmltodict + python3-xmltodict, Standards-Version: 3.9.6 Homepage: https://forge.softwareheritage.org/diffusion/78/ Package: python3-swh.indexer.storage Architecture: all -Depends: python3-swh.core (>= 0.0.44~), +Depends: python3-swh.core (>= 0.0.53~), python3-swh.model (>= 0.0.15~), python3-swh.objstorage (>= 0.0.28~), python3-swh.scheduler (>= 0.0.35~), - python3-swh.storage (>= 0.0.113~), + python3-swh.storage (>= 0.0.123~), ${misc:Depends}, ${python3:Depends} Description: Software Heritage Content Indexer Storage Package: python3-swh.indexer Architecture: all Recommends: universal-ctags (>= 0.8~), fossology-nomossa (>= 3.1~) Depends: python3-swh.scheduler (>= 0.0.14~), - python3-swh.core (>= 0.0.44~), + python3-swh.core (>= 0.0.53~), python3-swh.model (>= 0.0.15~), python3-swh.objstorage (>= 0.0.28~), python3-swh.scheduler (>= 0.0.35~), - python3-swh.storage (>= 0.0.113~), + python3-swh.storage (>= 0.0.123~), python3-swh.indexer.storage (= ${binary:Version}), ${misc:Depends}, ${python3:Depends} Description: Software Heritage Content Indexer diff --git a/requirements-swh.txt b/requirements-swh.txt index e390a08..be61102 100644 --- a/requirements-swh.txt +++ b/requirements-swh.txt @@ -1,6 +1,6 @@ swh.core >= 0.0.53 swh.model >= 0.0.15 swh.objstorage >= 0.0.28 -swh.scheduler >= 0.0.39 +swh.scheduler >= 0.0.47 swh.storage >= 0.0.123 swh.journal >= 0.0.6 diff --git a/swh.indexer.egg-info/PKG-INFO b/swh.indexer.egg-info/PKG-INFO index a2cb919..20eaf96 100644 --- a/swh.indexer.egg-info/PKG-INFO +++ b/swh.indexer.egg-info/PKG-INFO @@ -1,69 +1,69 @@ Metadata-Version: 2.1 Name: swh.indexer -Version: 0.0.137 +Version: 0.0.139 Summary: Software Heritage Content Indexer Home-page: https://forge.softwareheritage.org/diffusion/78/ Author: Software Heritage developers Author-email: swh-devel@inria.fr 
License: UNKNOWN -Project-URL: Source, https://forge.softwareheritage.org/source/swh-indexer Project-URL: Funding, https://www.softwareheritage.org/donate +Project-URL: Source, https://forge.softwareheritage.org/source/swh-indexer Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Description: swh-indexer ============ Tools to compute multiple indexes on SWH's raw contents: - content: - mimetype - ctags - language - fossology-license - metadata - revision: - metadata An indexer is in charge of: - looking up objects - extracting information from those objects - store those information in the swh-indexer db There are multiple indexers working on different object types: - content indexer: works with content sha1 hashes - revision indexer: works with revision sha1 hashes - origin indexer: works with origin identifiers Indexation procedure: - receive batch of ids - retrieve the associated data depending on object type - compute for that object some index - store the result to swh's storage Current content indexers: - mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype - language (queue swh_indexer_content_language): detect the programming language - ctags (queue swh_indexer_content_ctags): compute tags information - fossology-license (queue swh_indexer_fossology_license): compute the license - metadata: translate file into translated_metadata dict Current revision indexers: - metadata: detects files containing metadata and retrieves translated_metadata in content_metadata table in storage or run content indexer to translate files. Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Description-Content-Type: text/markdown Provides-Extra: testing diff --git a/swh.indexer.egg-info/SOURCES.txt b/swh.indexer.egg-info/SOURCES.txt index d7ce799..3974b12 100644 --- a/swh.indexer.egg-info/SOURCES.txt +++ b/swh.indexer.egg-info/SOURCES.txt @@ -1,83 +1,84 @@ MANIFEST.in Makefile README.md requirements-swh.txt requirements.txt setup.py version.txt sql/bin/db-upgrade sql/bin/dot_add_content sql/doc/json/.gitignore sql/doc/json/Makefile sql/doc/json/indexer_configuration.tool_configuration.schema.json sql/doc/json/revision_metadata.translated_metadata.json sql/json/.gitignore sql/json/Makefile sql/json/indexer_configuration.tool_configuration.schema.json sql/json/revision_metadata.translated_metadata.json sql/upgrades/115.sql sql/upgrades/116.sql sql/upgrades/117.sql sql/upgrades/118.sql sql/upgrades/119.sql sql/upgrades/120.sql sql/upgrades/121.sql sql/upgrades/122.sql swh/__init__.py swh.indexer.egg-info/PKG-INFO swh.indexer.egg-info/SOURCES.txt swh.indexer.egg-info/dependency_links.txt swh.indexer.egg-info/entry_points.txt swh.indexer.egg-info/requires.txt swh.indexer.egg-info/top_level.txt swh/indexer/__init__.py swh/indexer/cli.py swh/indexer/codemeta.py swh/indexer/ctags.py swh/indexer/fossology_license.py swh/indexer/indexer.py swh/indexer/journal_client.py swh/indexer/language.py swh/indexer/metadata.py swh/indexer/metadata_detector.py swh/indexer/metadata_dictionary.py swh/indexer/mimetype.py swh/indexer/origin_head.py swh/indexer/rehash.py swh/indexer/tasks.py swh/indexer/data/codemeta/CITATION swh/indexer/data/codemeta/LICENSE swh/indexer/data/codemeta/codemeta.jsonld swh/indexer/data/codemeta/crosswalk.csv 
swh/indexer/sql/10-swh-init.sql swh/indexer/sql/20-swh-enums.sql swh/indexer/sql/30-swh-schema.sql swh/indexer/sql/40-swh-func.sql swh/indexer/sql/50-swh-data.sql swh/indexer/sql/60-swh-indexes.sql swh/indexer/storage/__init__.py swh/indexer/storage/converters.py swh/indexer/storage/db.py swh/indexer/storage/in_memory.py swh/indexer/storage/api/__init__.py swh/indexer/storage/api/client.py swh/indexer/storage/api/server.py swh/indexer/storage/api/wsgi.py swh/indexer/tests/__init__.py swh/indexer/tests/conftest.py swh/indexer/tests/tasks.py +swh/indexer/tests/test_cli.py swh/indexer/tests/test_ctags.py swh/indexer/tests/test_fossology_license.py swh/indexer/tests/test_language.py swh/indexer/tests/test_metadata.py swh/indexer/tests/test_mimetype.py swh/indexer/tests/test_origin_head.py swh/indexer/tests/test_origin_metadata.py swh/indexer/tests/utils.py swh/indexer/tests/storage/__init__.py swh/indexer/tests/storage/generate_data_test.py swh/indexer/tests/storage/test_api_client.py swh/indexer/tests/storage/test_converters.py swh/indexer/tests/storage/test_in_memory.py swh/indexer/tests/storage/test_server.py swh/indexer/tests/storage/test_storage.py \ No newline at end of file diff --git a/swh.indexer.egg-info/requires.txt b/swh.indexer.egg-info/requires.txt index 04368bc..9d81572 100644 --- a/swh.indexer.egg-info/requires.txt +++ b/swh.indexer.egg-info/requires.txt @@ -1,18 +1,18 @@ vcversioner pygments click chardet file_magic pyld xmltodict swh.core>=0.0.53 swh.model>=0.0.15 swh.objstorage>=0.0.28 -swh.scheduler>=0.0.39 +swh.scheduler>=0.0.47 swh.storage>=0.0.123 swh.journal>=0.0.6 [testing] pytest<4 pytest-postgresql hypothesis>=3.11.0 diff --git a/swh/indexer/cli.py b/swh/indexer/cli.py index 78b0737..cb82793 100644 --- a/swh/indexer/cli.py +++ b/swh/indexer/cli.py @@ -1,25 +1,186 @@ -# Copyright (C) 2015-2019 The Software Heritage developers +# Copyright (C) 2019 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import click +from swh.core import config +from swh.scheduler import get_scheduler +from swh.scheduler.utils import create_task_dict +from swh.storage import get_storage + +from swh.indexer.metadata_dictionary import MAPPINGS +from swh.indexer.storage import get_indexer_storage from swh.indexer.storage.api.server import load_and_check_config, app -@click.command() +CONTEXT_SETTINGS = dict(help_option_names=['-h', '--help']) + +TASK_BATCH_SIZE = 1000 # Number of tasks per query to the scheduler + + +@click.group(context_settings=CONTEXT_SETTINGS) +@click.option('--config-file', '-C', default=None, + type=click.Path(exists=True, dir_okay=False,), + help="Configuration file.") +@click.pass_context +def cli(ctx, config_file): + """Software Heritage Indexer CLI interface + """ + ctx.ensure_object(dict) + + conf = config.read(config_file) + ctx.obj['config'] = conf + + +def _get_api(getter, config, config_key, url): + if url: + config[config_key] = { + 'cls': 'remote', + 'args': {'url': url} + } + elif config_key not in config: + raise click.ClickException( + 'Missing configuration for {}'.format(config_key)) + return getter(**config[config_key]) + + +@cli.group('mapping') +def mapping(): + pass + + +@mapping.command('list') +def mapping_list(): + """Prints the list of known mappings.""" + mapping_names = [mapping.name for mapping in MAPPINGS.values()] + mapping_names.sort() + for mapping_name in mapping_names: 
+ click.echo(mapping_name) + + +@cli.group('schedule') +@click.option('--scheduler-url', '-s', default=None, + help="URL of the scheduler API") +@click.option('--indexer-storage-url', '-i', default=None, + help="URL of the indexer storage API") +@click.option('--storage-url', '-g', default=None, + help="URL of the (graph) storage API") +@click.option('--dry-run/--no-dry-run', is_flag=True, + default=False, + help='Default to list only what would be scheduled.') +@click.pass_context +def schedule(ctx, scheduler_url, storage_url, indexer_storage_url, + dry_run): + """Manipulate indexer tasks via SWH Scheduler's API.""" + ctx.obj['indexer_storage'] = _get_api( + get_indexer_storage, + ctx.obj['config'], + 'indexer_storage', + indexer_storage_url + ) + ctx.obj['storage'] = _get_api( + get_storage, + ctx.obj['config'], + 'storage', + storage_url + ) + ctx.obj['scheduler'] = _get_api( + get_scheduler, + ctx.obj['config'], + 'scheduler', + scheduler_url + ) + if dry_run: + ctx.obj['scheduler'] = None + + +def list_origins_by_producer(idx_storage, mappings, tool_ids): + start = 0 + limit = 10000 + while True: + origins = list( + idx_storage.origin_intrinsic_metadata_search_by_producer( + start=start, limit=limit, ids_only=True, + mappings=mappings or None, tool_ids=tool_ids or None)) + if not origins: + break + start = origins[-1]+1 + yield from origins + + +@schedule.command('reindex_origin_metadata') +@click.option('--batch-size', '-b', 'origin_batch_size', + default=10, show_default=True, type=int, + help="Number of origins per task") +@click.option('--tool-id', '-t', 'tool_ids', type=int, multiple=True, + help="Restrict search of old metadata to this/these tool ids.") +@click.option('--mapping', '-m', 'mappings', multiple=True, + help="Mapping(s) that should be re-scheduled (eg. 'npm', " + "'gemspec', 'maven')") +@click.option('--task-type', + default='indexer_origin_metadata', show_default=True, + help="Name of the task type to schedule.") +@click.pass_context +def schedule_origin_metadata_reindex( + ctx, origin_batch_size, mappings, tool_ids, task_type): + """Schedules indexing tasks for origins that were already indexed.""" + idx_storage = ctx.obj['indexer_storage'] + scheduler = ctx.obj['scheduler'] + + origins = list_origins_by_producer(idx_storage, mappings, tool_ids) + kwargs = {"policy_update": "update-dups", "parse_ids": False} + nb_origins = 0 + nb_tasks = 0 + + while True: + task_batch = [] + for _ in range(TASK_BATCH_SIZE): + # Group origins + origin_batch = [] + for (_, origin) in zip(range(origin_batch_size), origins): + origin_batch.append(origin) + nb_origins += len(origin_batch) + if not origin_batch: + break + + # Create a task for these origins + args = [origin_batch] + task_dict = create_task_dict(task_type, 'oneshot', *args, **kwargs) + task_batch.append(task_dict) + + # Schedule a batch of tasks + if not task_batch: + break + nb_tasks += len(task_batch) + if scheduler: + scheduler.create_tasks(task_batch) + click.echo('Scheduled %d tasks (%d origins).' % (nb_tasks, nb_origins)) + + # Print final status. 
+ if nb_tasks: + click.echo('Done.') + else: + click.echo('Nothing to do (no origin metadata matched the criteria).') + + +@cli.command('api-server') @click.argument('config-path', required=1) @click.option('--host', default='0.0.0.0', help="Host to run the server") @click.option('--port', default=5007, type=click.INT, help="Binding port of the server") @click.option('--debug/--nodebug', default=True, help="Indicates if the server should run in debug mode") -def main(config_path, host, port, debug): +def api_server(config_path, host, port, debug): api_cfg = load_and_check_config(config_path, type='any') app.config.update(api_cfg) app.run(host, port=int(port), debug=bool(debug)) +def main(): + return cli(auto_envvar_prefix='SWH_INDEXER') + + if __name__ == '__main__': main() diff --git a/swh/indexer/metadata.py b/swh/indexer/metadata.py index 672afa0..83f5a93 100644 --- a/swh/indexer/metadata.py +++ b/swh/indexer/metadata.py @@ -1,310 +1,310 @@ # Copyright (C) 2017-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from copy import deepcopy from swh.indexer.indexer import ContentIndexer, RevisionIndexer, OriginIndexer from swh.indexer.origin_head import OriginHeadIndexer from swh.indexer.metadata_dictionary import MAPPINGS from swh.indexer.metadata_detector import detect_metadata from swh.indexer.metadata_detector import extract_minimal_metadata_dict from swh.indexer.storage import INDEXER_CFG_KEY from swh.model import hashutil class ContentMetadataIndexer(ContentIndexer): """Content-level indexer This indexer is in charge of: - filtering out content already indexed in content_metadata - reading content from objstorage with the content's id sha1 - computing translated_metadata by given context - using the metadata_dictionary as the 'swh-metadata-translator' tool - store result in content_metadata table """ def filter(self, ids): """Filter out known sha1s and return only missing ones. """ yield from self.idx_storage.content_metadata_missing(( { 'id': sha1, 'indexer_configuration_id': self.tool['id'], } for sha1 in ids )) def index(self, id, data, log_suffix='unknown revision'): """Index sha1s' content and store result. Args: id (bytes): content's identifier data (bytes): raw content in bytes Returns: dict: dictionary representing a content_metadata. If the translation wasn't successful the translated_metadata keys will be returned as None """ result = { 'id': id, 'indexer_configuration_id': self.tool['id'], 'translated_metadata': None } try: mapping_name = self.tool['tool_configuration']['context'] log_suffix += ', content_id=%s' % hashutil.hash_to_hex(id) result['translated_metadata'] = \ MAPPINGS[mapping_name](log_suffix).translate(data) except Exception: self.log.exception( "Problem during metadata translation " "for content %s" % hashutil.hash_to_hex(id)) if result['translated_metadata'] is None: return None return result def persist_index_computations(self, results, policy_update): """Persist the results in storage. 
Args: results ([dict]): list of content_metadata, dict with the following keys: - id (bytes): content's identifier (sha1) - translated_metadata (jsonb): detected metadata policy_update ([str]): either 'update-dups' or 'ignore-dups' to respectively update duplicates or ignore them """ self.idx_storage.content_metadata_add( results, conflict_update=(policy_update == 'update-dups')) class RevisionMetadataIndexer(RevisionIndexer): """Revision-level indexer This indexer is in charge of: - filtering revisions already indexed in revision_metadata table with defined computation tool - retrieve all entry_files in root directory - use metadata_detector for file_names containing metadata - compute metadata translation if necessary and possible (depends on tool) - send sha1s to content indexing if possible - store the results for revision """ ADDITIONAL_CONFIG = { 'tools': ('dict', { 'name': 'swh-metadata-detector', 'version': '0.0.2', 'configuration': { 'type': 'local', 'context': list(MAPPINGS), }, }), } def filter(self, sha1_gits): """Filter out known sha1s and return only missing ones. """ yield from self.idx_storage.revision_metadata_missing(( { 'id': sha1_git, 'indexer_configuration_id': self.tool['id'], } for sha1_git in sha1_gits )) def index(self, rev): """Index rev by processing it and organizing result. use metadata_detector to iterate on filenames - if one filename detected -> sends file to content indexer - if multiple file detected -> translation needed at revision level Args: rev (dict): revision artifact from storage Returns: dict: dictionary representing a revision_metadata, with keys: - id (str): rev's identifier (sha1_git) - indexer_configuration_id (bytes): tool used - translated_metadata: dict of retrieved metadata """ result = { 'id': rev['id'], 'indexer_configuration_id': self.tool['id'], 'mappings': None, 'translated_metadata': None } try: root_dir = rev['directory'] dir_ls = self.storage.directory_ls(root_dir, recursive=False) files = [entry for entry in dir_ls if entry['type'] == 'file'] detected_files = detect_metadata(files) (mappings, metadata) = self.translate_revision_metadata( detected_files, log_suffix='revision=%s' % hashutil.hash_to_hex(rev['id'])) result['mappings'] = mappings result['translated_metadata'] = metadata except Exception as e: self.log.exception( 'Problem when indexing rev: %r', e) return result def persist_index_computations(self, results, policy_update): """Persist the results in storage. 
Args: results ([dict]): list of content_mimetype, dict with the following keys: - id (bytes): content's identifier (sha1) - mimetype (bytes): mimetype in bytes - encoding (bytes): encoding in bytes policy_update ([str]): either 'update-dups' or 'ignore-dups' to respectively update duplicates or ignore them """ # TODO: add functions in storage to keep data in revision_metadata self.idx_storage.revision_metadata_add( results, conflict_update=(policy_update == 'update-dups')) def translate_revision_metadata(self, detected_files, log_suffix): """ Determine plan of action to translate metadata when containing one or multiple detected files: Args: detected_files (dict): dictionary mapping context names (e.g., "npm", "authors") to list of sha1 Returns: (List[str], dict): list of mappings used and dict with translated metadata according to the CodeMeta vocabulary """ used_mappings = [MAPPINGS[context].name for context in detected_files] translated_metadata = [] tool = { 'name': 'swh-metadata-translator', 'version': '0.0.2', 'configuration': { 'type': 'local', 'context': None }, } # TODO: iterate on each context, on each file # -> get raw_contents # -> translate each content config = { k: self.config[k] for k in [INDEXER_CFG_KEY, 'objstorage', 'storage'] } config['tools'] = [tool] for context in detected_files.keys(): cfg = deepcopy(config) cfg['tools'][0]['configuration']['context'] = context c_metadata_indexer = ContentMetadataIndexer(config=cfg) # sha1s that are in content_metadata table sha1s_in_storage = [] metadata_generator = self.idx_storage.content_metadata_get( detected_files[context]) for c in metadata_generator: # extracting translated_metadata sha1 = c['id'] sha1s_in_storage.append(sha1) local_metadata = c['translated_metadata'] # local metadata is aggregated if local_metadata: translated_metadata.append(local_metadata) sha1s_filtered = [item for item in detected_files[context] if item not in sha1s_in_storage] if sha1s_filtered: # content indexing try: c_metadata_indexer.run(sha1s_filtered, policy_update='ignore-dups', log_suffix=log_suffix) # on the fly possibility: for result in c_metadata_indexer.results: local_metadata = result['translated_metadata'] translated_metadata.append(local_metadata) except Exception: self.log.exception( "Exception while indexing metadata on contents") # transform translated_metadata into min set with swh-metadata-detector min_metadata = extract_minimal_metadata_dict(translated_metadata) return (used_mappings, min_metadata) class OriginMetadataIndexer(OriginIndexer): ADDITIONAL_CONFIG = RevisionMetadataIndexer.ADDITIONAL_CONFIG USE_TOOLS = False - def __init__(self, config, **kwargs): + def __init__(self, config=None, **kwargs): super().__init__(config=config, **kwargs) self.origin_head_indexer = OriginHeadIndexer(config=config) self.revision_metadata_indexer = RevisionMetadataIndexer(config=config) def index_list(self, origins): head_rev_ids = [] origins_with_head = [] for origin in origins: head_result = self.origin_head_indexer.index(origin) if head_result: origins_with_head.append(origin) head_rev_ids.append(head_result['revision_id']) head_revs = list(self.storage.revision_get(head_rev_ids)) assert len(head_revs) == len(head_rev_ids) results = [] for (origin, rev) in zip(origins_with_head, head_revs): if not rev: self.log.warning('Missing head revision of origin %r', origin) continue rev_metadata = self.revision_metadata_indexer.index(rev) orig_metadata = { 'from_revision': rev_metadata['id'], 'origin_id': origin['id'], 'metadata': 
rev_metadata['translated_metadata'], 'mappings': rev_metadata['mappings'], 'indexer_configuration_id': rev_metadata['indexer_configuration_id'], } results.append((orig_metadata, rev_metadata)) return results def persist_index_computations(self, results, policy_update): conflict_update = (policy_update == 'update-dups') # Deduplicate revisions rev_metadata = [] orig_metadata = [] for (orig_item, rev_item) in results: if rev_item not in rev_metadata: rev_metadata.append(rev_item) if orig_item not in orig_metadata: orig_metadata.append(orig_item) self.idx_storage.revision_metadata_add( rev_metadata, conflict_update=conflict_update) self.idx_storage.origin_intrinsic_metadata_add( orig_metadata, conflict_update=conflict_update) diff --git a/swh/indexer/tasks.py b/swh/indexer/tasks.py index f123500..97f921c 100644 --- a/swh/indexer/tasks.py +++ b/swh/indexer/tasks.py @@ -1,79 +1,64 @@ # Copyright (C) 2016-2019 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from celery import current_app as app from .mimetype import MimetypeIndexer, MimetypeRangeIndexer from .language import LanguageIndexer from .ctags import CtagsIndexer from .fossology_license import ( FossologyLicenseIndexer, FossologyLicenseRangeIndexer ) from .rehash import RecomputeChecksums -from .metadata import ( - RevisionMetadataIndexer, OriginMetadataIndexer -) -from .origin_head import OriginHeadIndexer - - -@app.task(name=__name__ + '.RevisionMetadata') -def revision_metadata(*args, **kwargs): - results = RevisionMetadataIndexer().run(*args, **kwargs) - return getattr(results, 'results', results) +from .metadata import OriginMetadataIndexer @app.task(name=__name__ + '.OriginMetadata') def origin_metadata(*args, **kwargs): results = OriginMetadataIndexer().run(*args, **kwargs) return getattr(results, 'results', results) -@app.task(name=__name__ + '.OriginHead') -def origin_head(*args, **kwargs): - results = OriginHeadIndexer().run(*args, **kwargs) - return getattr(results, 'results', results) - - @app.task(name=__name__ + '.ContentLanguage') def content_language(*args, **kwargs): results = LanguageIndexer().run(*args, **kwargs) return getattr(results, 'results', results) @app.task(name=__name__ + '.Ctags') def ctags(*args, **kwargs): results = CtagsIndexer().run(*args, **kwargs) return getattr(results, 'results', results) @app.task(name=__name__ + '.ContentFossologyLicense') def fossology_license(*args, **kwargs): results = FossologyLicenseIndexer().run(*args, **kwargs) return getattr(results, 'results', results) @app.task(name=__name__ + '.RecomputeChecksums') def recompute_checksums(*args, **kwargs): results = RecomputeChecksums().run(*args, **kwargs) return getattr(results, 'results', results) @app.task(name=__name__ + '.ContentMimetype') def mimetype(*args, **kwargs): results = MimetypeIndexer().run(*args, **kwargs) return {'status': 'eventful' if results else 'uneventful'} @app.task(name=__name__ + '.ContentRangeMimetype') def range_mimetype(*args, **kwargs): results = MimetypeRangeIndexer().run(*args, **kwargs) return {'status': 'eventful' if results else 'uneventful'} @app.task(name=__name__ + '.ContentRangeFossologyLicense') def range_license(*args, **kwargs): results = FossologyLicenseRangeIndexer().run(*args, **kwargs) return {'status': 'eventful' if results else 'uneventful'} diff --git a/swh/indexer/tests/test_cli.py b/swh/indexer/tests/test_cli.py 
new file mode 100644 index 0000000..d14b186 --- /dev/null +++ b/swh/indexer/tests/test_cli.py @@ -0,0 +1,289 @@ +# Copyright (C) 2019 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +from functools import reduce +import tempfile +from unittest.mock import patch + +from click.testing import CliRunner + +from swh.model.hashutil import hash_to_bytes + +from swh.indexer.cli import cli + + +CLI_CONFIG = ''' +scheduler: + cls: foo + args: {} +storage: + cls: memory + args: {} +indexer_storage: + cls: memory + args: {} +''' + + +def fill_idx_storage(idx_storage, nb_rows): + tools = [ + { + 'tool_name': 'tool %d' % i, + 'tool_version': '0.0.1', + 'tool_configuration': {}, + } + for i in range(2) + ] + tools = idx_storage.indexer_configuration_add(tools) + + origin_metadata = [ + { + 'origin_id': origin_id, + 'from_revision': hash_to_bytes('abcd{:0>4}'.format(origin_id)), + 'indexer_configuration_id': tools[origin_id % 2]['id'], + 'metadata': {'name': 'origin %d' % origin_id}, + 'mappings': ['mapping%d' % (origin_id % 10)] + } + for origin_id in range(nb_rows) + ] + revision_metadata = [ + { + 'id': hash_to_bytes('abcd{:0>4}'.format(origin_id)), + 'indexer_configuration_id': tools[origin_id % 2]['id'], + 'metadata': {'name': 'origin %d' % origin_id}, + 'mappings': ['mapping%d' % (origin_id % 10)] + } + for origin_id in range(nb_rows) + ] + + idx_storage.revision_metadata_add(revision_metadata) + idx_storage.origin_intrinsic_metadata_add(origin_metadata) + + return [tool['id'] for tool in tools] + + +def _origins_in_task_args(tasks): + """Returns the set of origins contained in the arguments of the + provided tasks (assumed to be of type indexer_origin_metadata).""" + return reduce( + set.union, + (set(task['arguments']['args'][0]) for task in tasks), + set() + ) + + +def _assert_tasks_for_origins(tasks, origins): + expected_kwargs = {"policy_update": "update-dups", "parse_ids": False} + assert {task['type'] for task in tasks} == {'indexer_origin_metadata'} + assert all(len(task['arguments']['args']) == 1 for task in tasks) + assert all(task['arguments']['kwargs'] == expected_kwargs + for task in tasks) + assert _origins_in_task_args(tasks) == set(origins) + + +def invoke(scheduler, catch_exceptions, args): + runner = CliRunner() + with patch('swh.indexer.cli.get_scheduler') as get_scheduler_mock, \ + tempfile.NamedTemporaryFile('a', suffix='.yml') as config_fd: + config_fd.write(CLI_CONFIG) + config_fd.seek(0) + get_scheduler_mock.return_value = scheduler + result = runner.invoke(cli, ['-C' + config_fd.name] + args) + if not catch_exceptions and result.exception: + print(result.output) + raise result.exception + return result + + +def test_mapping_list(indexer_scheduler): + result = invoke(indexer_scheduler, False, [ + 'mapping', 'list', + ]) + expected_output = '\n'.join([ + 'codemeta', 'gemspec', 'maven', 'npm', 'pkg-info', '', + ]) + assert result.exit_code == 0, result.output + assert result.output == expected_output + + +@patch('swh.indexer.cli.TASK_BATCH_SIZE', 3) +def test_origin_metadata_reindex_empty_db( + indexer_scheduler, idx_storage, storage): + result = invoke(indexer_scheduler, False, [ + 'schedule', 'reindex_origin_metadata', + ]) + expected_output = ( + 'Nothing to do (no origin metadata matched the criteria).\n' + ) + assert result.exit_code == 0, result.output + assert result.output == expected_output 
+ tasks = indexer_scheduler.search_tasks() + assert len(tasks) == 0 + + +@patch('swh.indexer.cli.TASK_BATCH_SIZE', 3) +def test_origin_metadata_reindex_divisor( + indexer_scheduler, idx_storage, storage): + """Tests the re-indexing when origin_batch_size*task_batch_size is a + divisor of nb_origins.""" + fill_idx_storage(idx_storage, 90) + + result = invoke(indexer_scheduler, False, [ + 'schedule', 'reindex_origin_metadata', + ]) + + # Check the output + expected_output = ( + 'Scheduled 3 tasks (30 origins).\n' + 'Scheduled 6 tasks (60 origins).\n' + 'Scheduled 9 tasks (90 origins).\n' + 'Done.\n' + ) + assert result.exit_code == 0, result.output + assert result.output == expected_output + + # Check scheduled tasks + tasks = indexer_scheduler.search_tasks() + assert len(tasks) == 9 + _assert_tasks_for_origins(tasks, range(90)) + + +@patch('swh.indexer.cli.TASK_BATCH_SIZE', 3) +def test_origin_metadata_reindex_dry_run( + indexer_scheduler, idx_storage, storage): + """Tests the re-indexing when origin_batch_size*task_batch_size is a + divisor of nb_origins.""" + fill_idx_storage(idx_storage, 90) + + result = invoke(indexer_scheduler, False, [ + 'schedule', '--dry-run', 'reindex_origin_metadata', + ]) + + # Check the output + expected_output = ( + 'Scheduled 3 tasks (30 origins).\n' + 'Scheduled 6 tasks (60 origins).\n' + 'Scheduled 9 tasks (90 origins).\n' + 'Done.\n' + ) + assert result.exit_code == 0, result.output + assert result.output == expected_output + + # Check scheduled tasks + tasks = indexer_scheduler.search_tasks() + assert len(tasks) == 0 + + +@patch('swh.indexer.cli.TASK_BATCH_SIZE', 3) +def test_origin_metadata_reindex_nondivisor( + indexer_scheduler, idx_storage, storage): + """Tests the re-indexing when neither origin_batch_size or + task_batch_size is a divisor of nb_origins.""" + fill_idx_storage(idx_storage, 70) + + result = invoke(indexer_scheduler, False, [ + 'schedule', 'reindex_origin_metadata', + '--batch-size', '20', + ]) + + # Check the output + expected_output = ( + 'Scheduled 3 tasks (60 origins).\n' + 'Scheduled 4 tasks (70 origins).\n' + 'Done.\n' + ) + assert result.exit_code == 0, result.output + assert result.output == expected_output + + # Check scheduled tasks + tasks = indexer_scheduler.search_tasks() + assert len(tasks) == 4 + _assert_tasks_for_origins(tasks, range(70)) + + +@patch('swh.indexer.cli.TASK_BATCH_SIZE', 3) +def test_origin_metadata_reindex_filter_one_mapping( + indexer_scheduler, idx_storage, storage): + """Tests the re-indexing when origin_batch_size*task_batch_size is a + divisor of nb_origins.""" + fill_idx_storage(idx_storage, 110) + + result = invoke(indexer_scheduler, False, [ + 'schedule', 'reindex_origin_metadata', + '--mapping', 'mapping1', + ]) + + # Check the output + expected_output = ( + 'Scheduled 2 tasks (11 origins).\n' + 'Done.\n' + ) + assert result.exit_code == 0, result.output + assert result.output == expected_output + + # Check scheduled tasks + tasks = indexer_scheduler.search_tasks() + assert len(tasks) == 2 + _assert_tasks_for_origins( + tasks, + [1, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101]) + + +@patch('swh.indexer.cli.TASK_BATCH_SIZE', 3) +def test_origin_metadata_reindex_filter_two_mappings( + indexer_scheduler, idx_storage, storage): + """Tests the re-indexing when origin_batch_size*task_batch_size is a + divisor of nb_origins.""" + fill_idx_storage(idx_storage, 110) + + result = invoke(indexer_scheduler, False, [ + 'schedule', 'reindex_origin_metadata', + '--mapping', 'mapping1', '--mapping', 'mapping2', + ]) + 
+ # Check the output + expected_output = ( + 'Scheduled 3 tasks (22 origins).\n' + 'Done.\n' + ) + assert result.exit_code == 0, result.output + assert result.output == expected_output + + # Check scheduled tasks + tasks = indexer_scheduler.search_tasks() + assert len(tasks) == 3 + _assert_tasks_for_origins( + tasks, + [1, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101, + 2, 12, 22, 32, 42, 52, 62, 72, 82, 92, 102]) + + +@patch('swh.indexer.cli.TASK_BATCH_SIZE', 3) +def test_origin_metadata_reindex_filter_one_tool( + indexer_scheduler, idx_storage, storage): + """Tests the re-indexing when origin_batch_size*task_batch_size is a + divisor of nb_origins.""" + tool_ids = fill_idx_storage(idx_storage, 110) + + result = invoke(indexer_scheduler, False, [ + 'schedule', 'reindex_origin_metadata', + '--tool-id', str(tool_ids[0]), + ]) + + # Check the output + expected_output = ( + 'Scheduled 3 tasks (30 origins).\n' + 'Scheduled 6 tasks (55 origins).\n' + 'Done.\n' + ) + assert result.exit_code == 0, result.output + assert result.output == expected_output + + # Check scheduled tasks + tasks = indexer_scheduler.search_tasks() + assert len(tasks) == 6 + _assert_tasks_for_origins( + tasks, + [x*2 for x in range(55)]) diff --git a/version.txt b/version.txt index b302a09..791e0c6 100644 --- a/version.txt +++ b/version.txt @@ -1 +1 @@ -v0.0.137-0-g8de1d6a \ No newline at end of file +v0.0.139-0-g8e12a00 \ No newline at end of file
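
Usage sketch (illustrative only, not part of the patch): the new `schedule reindex_origin_metadata` subcommand can be driven from Python with click's CliRunner, following the same pattern as the invoke() helper added in swh/indexer/tests/test_cli.py. The memory storage and indexer_storage backends below mirror that test's CLI_CONFIG; the remote scheduler URL is a placeholder assumption and, because --dry-run is passed, it is never actually called.

    import tempfile

    from click.testing import CliRunner

    from swh.indexer.cli import cli

    # Same layout as CLI_CONFIG in swh/indexer/tests/test_cli.py; the scheduler
    # URL is a placeholder assumption (dropped by the CLI under --dry-run).
    EXAMPLE_CONFIG = '''
    scheduler:
      cls: remote
      args: {url: 'http://localhost:5008/'}
    storage:
      cls: memory
      args: {}
    indexer_storage:
      cls: memory
      args: {}
    '''

    def dry_run_reindex(extra_args=()):
        """Invoke the CLI with a temporary config file; return the click result."""
        runner = CliRunner()
        with tempfile.NamedTemporaryFile('a', suffix='.yml') as config_fd:
            config_fd.write(EXAMPLE_CONFIG)
            config_fd.seek(0)
            args = ['-C' + config_fd.name, 'schedule', '--dry-run',
                    'reindex_origin_metadata', *extra_args]
            return runner.invoke(cli, args)

    if __name__ == '__main__':
        result = dry_run_reindex(['--mapping', 'npm', '--batch-size', '20'])
        # With an empty in-memory indexer storage this prints:
        # Nothing to do (no origin metadata matched the criteria).
        print(result.output)

Dropping --dry-run makes the command call scheduler.create_tasks() on batches of up to TASK_BATCH_SIZE tasks, each task covering --batch-size origins.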
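
For completeness, a minimal sketch of what the subcommand builds per batch, using only the scheduler calls it already imports (create_task_dict and create_tasks); the origin ids and the scheduler URL below are made-up placeholders, not values from the patch.

    from swh.scheduler import get_scheduler
    from swh.scheduler.utils import create_task_dict

    # Placeholder URL: point this at a real swh-scheduler API before running.
    scheduler = get_scheduler(cls='remote', args={'url': 'http://localhost:5008/'})

    # One oneshot task over a single batch of origin ids, with the same kwargs
    # the CLI hard-codes for the 'indexer_origin_metadata' task type.
    task = create_task_dict(
        'indexer_origin_metadata', 'oneshot',
        [42, 43, 44],
        policy_update='update-dups', parse_ids=False,
    )
    scheduler.create_tasks([task])

The 'update-dups' policy matches persist_index_computations(), which translates it into conflict_update=True so re-indexed origins overwrite their existing metadata rows.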