diff --git a/PKG-INFO b/PKG-INFO
index 53d9ce3..e8fd3b6 100644
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,114 +1,114 @@
 Metadata-Version: 2.1
 Name: swh.indexer
-Version: 0.0.53
+Version: 0.0.54
 Summary: Software Heritage Content Indexer
 Home-page: https://forge.softwareheritage.org/diffusion/78/
 Author: Software Heritage developers
 Author-email: swh-devel@inria.fr
 License: UNKNOWN
 Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
 Project-URL: Funding, https://www.softwareheritage.org/donate
 Project-URL: Source, https://forge.softwareheritage.org/source/swh-indexer
 Description: swh-indexer
         ============

         Tools to compute multiple indexes on SWH's raw contents:
         - content:
           - mimetype
           - ctags
           - language
           - fossology-license
           - metadata
         - revision:
           - metadata

         ## Context

         SWH has currently stored around 5B contents. The table `content` holds their checksums. Those contents are physically stored in an object storage (using disks) and replicated in another. Those object storages are not destined for reading yet. We are in the process to copy those contents over to azure's blob storages. As such, we will use that opportunity to trigger the computations on these contents once those have been copied over.

         ## Workers

         There are two types of workers:
         - orchestrators (orchestrator, orchestrator-text)
         - indexer (mimetype, language, ctags, fossology-license)

         ### Orchestrator

         The orchestrator is in charge of dispatching a batch of sha1 hashes to different indexers.

         Orchestration procedure:
         - receive batch of sha1s
         - split those batches into groups (according to setup)
         - broadcast those group to indexers

         There are two types of orchestrators:

         - orchestrator (swh_indexer_orchestrator_content_all): Receives and broadcast sha1 ids (of contents) to indexers (currently only the mimetype indexer)

         - orchestrator-text (swh_indexer_orchestrator_content_text): Receives batch of sha1 ids (of textual contents) and broadcast those to indexers (currently language, ctags, and fossology-license indexers).

         ### Indexers

         An indexer is in charge of the content retrieval and indexation of the extracted information in the swh-indexer db.

         There are two types of indexers:
         - content indexer: works with content sha1 hashes
         - revision indexer: works with revision sha1 hashes

         Indexation procedure:
         - receive batch of ids
         - retrieve the associated data depending on object type
         - compute for that object some index
         - store the result to swh's storage
         - (and possibly do some broadcast itself)

         Current content indexers:

         - mimetype (queue swh_indexer_content_mimetype): compute the mimetype, filter out the textual contents and broadcast the list to the orchestrator-text

         - language (queue swh_indexer_content_language): detect the programming language

         - ctags (queue swh_indexer_content_ctags): try and compute tags information

         - fossology-license (queue swh_indexer_fossology_license): try and compute the license

         - metadata : translate file into translated_metadata dict

         Current revision indexers:

         - metadata: detects files containing metadata and retrieves translated_metadata in content_metadata table in storage or run content indexer to translate files.

 Platform: UNKNOWN
 Classifier: Programming Language :: Python :: 3
 Classifier: Intended Audience :: Developers
 Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
 Classifier: Operating System :: OS Independent
 Classifier: Development Status :: 5 - Production/Stable
 Description-Content-Type: text/markdown
 Provides-Extra: testing
diff --git a/swh.indexer.egg-info/PKG-INFO b/swh.indexer.egg-info/PKG-INFO
index 53d9ce3..e8fd3b6 100644
--- a/swh.indexer.egg-info/PKG-INFO
+++ b/swh.indexer.egg-info/PKG-INFO
@@ -1,114 +1,114 @@
 Metadata-Version: 2.1
 Name: swh.indexer
-Version: 0.0.53
+Version: 0.0.54
 Summary: Software Heritage Content Indexer
 Home-page: https://forge.softwareheritage.org/diffusion/78/
 Author: Software Heritage developers
 Author-email: swh-devel@inria.fr
 License: UNKNOWN
 Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
 Project-URL: Funding, https://www.softwareheritage.org/donate
 Project-URL: Source, https://forge.softwareheritage.org/source/swh-indexer
 Description: swh-indexer
         ============

         Tools to compute multiple indexes on SWH's raw contents:
         - content:
           - mimetype
           - ctags
           - language
           - fossology-license
           - metadata
         - revision:
           - metadata

         ## Context

         SWH has currently stored around 5B contents. The table `content` holds their checksums. Those contents are physically stored in an object storage (using disks) and replicated in another. Those object storages are not destined for reading yet. We are in the process to copy those contents over to azure's blob storages. As such, we will use that opportunity to trigger the computations on these contents once those have been copied over.

         ## Workers

         There are two types of workers:
         - orchestrators (orchestrator, orchestrator-text)
         - indexer (mimetype, language, ctags, fossology-license)

         ### Orchestrator

         The orchestrator is in charge of dispatching a batch of sha1 hashes to different indexers.

         Orchestration procedure:
         - receive batch of sha1s
         - split those batches into groups (according to setup)
         - broadcast those group to indexers

         There are two types of orchestrators:

         - orchestrator (swh_indexer_orchestrator_content_all): Receives and broadcast sha1 ids (of contents) to indexers (currently only the mimetype indexer)

         - orchestrator-text (swh_indexer_orchestrator_content_text): Receives batch of sha1 ids (of textual contents) and broadcast those to indexers (currently language, ctags, and fossology-license indexers).

         ### Indexers

         An indexer is in charge of the content retrieval and indexation of the extracted information in the swh-indexer db.

         There are two types of indexers:
         - content indexer: works with content sha1 hashes
         - revision indexer: works with revision sha1 hashes

         Indexation procedure:
         - receive batch of ids
         - retrieve the associated data depending on object type
         - compute for that object some index
         - store the result to swh's storage
         - (and possibly do some broadcast itself)

         Current content indexers:

         - mimetype (queue swh_indexer_content_mimetype): compute the mimetype, filter out the textual contents and broadcast the list to the orchestrator-text

         - language (queue swh_indexer_content_language): detect the programming language

         - ctags (queue swh_indexer_content_ctags): try and compute tags information

         - fossology-license (queue swh_indexer_fossology_license): try and compute the license

         - metadata : translate file into translated_metadata dict

         Current revision indexers:

         - metadata: detects files containing metadata and retrieves translated_metadata in content_metadata table in storage or run content indexer to translate files.

 Platform: UNKNOWN
 Classifier: Programming Language :: Python :: 3
 Classifier: Intended Audience :: Developers
 Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
 Classifier: Operating System :: OS Independent
 Classifier: Development Status :: 5 - Production/Stable
 Description-Content-Type: text/markdown
 Provides-Extra: testing
diff --git a/swh/indexer/tasks.py b/swh/indexer/tasks.py
index 1be3bed..fb136a3 100644
--- a/swh/indexer/tasks.py
+++ b/swh/indexer/tasks.py
@@ -1,105 +1,105 @@
 # Copyright (C) 2016-2017 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information

 import logging

-import celery
+from swh.scheduler.task import Task as SchedulerTask

 from .orchestrator import OrchestratorAllContentsIndexer
 from .orchestrator import OrchestratorTextContentsIndexer
 from .mimetype import ContentMimetypeIndexer
 from .language import ContentLanguageIndexer
 from .ctags import CtagsIndexer
 from .fossology_license import ContentFossologyLicenseIndexer
 from .rehash import RecomputeChecksums
 from .metadata import RevisionMetadataIndexer, OriginMetadataIndexer

 logging.basicConfig(level=logging.INFO)


-class Task(celery.Task):
+class Task(SchedulerTask):
     def run_task(self, *args, **kwargs):
         indexer = self.Indexer().run(*args, **kwargs)
         return indexer.results


 class OrchestratorAllContents(Task):
     """Main task in charge of reading batch contents (of any type) and
     broadcasting them back to other tasks.

     """
     task_queue = 'swh_indexer_orchestrator_content_all'

     Indexer = OrchestratorAllContentsIndexer


 class OrchestratorTextContents(Task):
     """Main task in charge of reading batch contents (of type text) and
     broadcasting them back to other tasks.

     """
     task_queue = 'swh_indexer_orchestrator_content_text'

     Indexer = OrchestratorTextContentsIndexer


 class RevisionMetadata(Task):
     task_queue = 'swh_indexer_revision_metadata'

     serializer = 'msgpack'

     Indexer = RevisionMetadataIndexer


 class OriginMetadata(Task):
     task_queue = 'swh_indexer_origin_intrinsic_metadata'

     Indexer = OriginMetadataIndexer


 class ContentMimetype(Task):
     """Task which computes the mimetype, encoding from the sha1's content.

     """
     task_queue = 'swh_indexer_content_mimetype'

     Indexer = ContentMimetypeIndexer


 class ContentLanguage(Task):
     """Task which computes the language from the sha1's content.

     """
     task_queue = 'swh_indexer_content_language'

     def run_task(self, *args, **kwargs):
         ContentLanguageIndexer().run(*args, **kwargs)


 class Ctags(Task):
     """Task which computes ctags from the sha1's content.

     """
     task_queue = 'swh_indexer_content_ctags'

     Indexer = CtagsIndexer


 class ContentFossologyLicense(Task):
     """Task which computes licenses from the sha1's content.

     """
     task_queue = 'swh_indexer_content_fossology_license'

     Indexer = ContentFossologyLicenseIndexer


 class RecomputeChecksums(Task):
     """Task which recomputes hashes and possibly new ones.

     """
     task_queue = 'swh_indexer_content_rehash'

     Indexer = RecomputeChecksums
diff --git a/version.txt b/version.txt
index bee75cf..3003c66 100644
--- a/version.txt
+++ b/version.txt
@@ -1 +1 @@
-v0.0.53-0-gad017c8
\ No newline at end of file
+v0.0.54-0-g4b44f3a
\ No newline at end of file