diff --git a/PKG-INFO b/PKG-INFO index 1887cc9f..e87727ab 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,153 +1,153 @@ Metadata-Version: 2.1 Name: swh.storage -Version: 0.0.116 +Version: 0.0.117 Summary: Software Heritage storage manager Home-page: https://forge.softwareheritage.org/diffusion/DSTO/ Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-storage Description: swh-storage =========== Abstraction layer over the archive, allowing to access all stored source code artifacts as well as their metadata. See the [documentation](https://docs.softwareheritage.org/devel/swh-storage/index.html) for more details. Tests ----- Python tests for this module include tests that cannot be run without a local Postgres database. You are not obliged to run those tests though: - `make test`: will run all tests - `make test-nodb`: will run only tests that do not need a local DB - `make test-db`: will run only tests that do need a local DB If you do want to run DB-related tests, you should ensure you have access zith sufficient privileges to a Postgresql database. ### Using your system database You need to ensure that your user is authorized to create and drop DBs, and in particular DBs named "softwareheritage-test" and "softwareheritage-dev" Note: the testdata repository (swh-storage-testdata) is not required any more. ### Using pifpaf [pifpaf](https://github.com/jd/pifpaf) is a suite of fixtures and a command-line tool that allows to start and stop daemons for a quick throw-away usage. 
It can be used to run tests that need a Postgres database without any other configuration reauired nor the need to have special access to a running database: ```bash $ pifpaf run postgresql make test-db [snip] ---------------------------------------------------------------------- Ran 124 tests in 56.203s OK ``` Note that pifpaf is not yet available as a Debian package, so you may have to install it in a venv. Development ----------- A test server could locally be running for tests. ### Sample configuration In either /etc/softwareheritage/storage/storage.yml, ~/.config/swh/storage.yml or ~/.swh/storage.yml: ``` storage: cls: local args: db: "dbname=softwareheritage-dev user=" objstorage: cls: pathslicing args: root: /home/storage/swh-storage/ slicing: 0:2/2:4/4:6 ``` which means, this uses: - a local storage instance whose db connection is to softwareheritage-dev local instance - the objstorage uses a local objstorage instance whose: - root path is /home/storage/swh-storage - slicing scheme is 0:2/2:4/4:6. This means that the identifier of the content (sha1) which will be stored on disk at first level with the first 2 hex characters, the second level with the next 2 hex characters and the third level with the next 2 hex characters. And finally the complete hash file holding the raw content. For example: 00062f8bd330715c4f819373653d97b3cd34394c will be stored at 00/06/2f/00062f8bd330715c4f819373653d97b3cd34394c Note that the 'root' path should exist on disk. ### Run server Command: ``` python3 -m swh.storage.api.server ~/.config/swh/storage.yml ``` This runs a local swh-storage api at 5002 port. ### And then what? In your upper layer (loader-git, loader-svn, etc...), you can define a remote storage with this snippet of yaml configuration. 
``` storage: cls: remote args: url: http://localhost:5002/ ``` You could directly define a local storage with the following snippet: ``` storage: cls: local args: db: service=swh-dev objstorage: cls: pathslicing args: root: /home/storage/swh-storage/ slicing: 0:2/2:4/4:6 ``` Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Description-Content-Type: text/markdown Provides-Extra: listener Provides-Extra: schemata Provides-Extra: testing diff --git a/swh.storage.egg-info/PKG-INFO b/swh.storage.egg-info/PKG-INFO index 1887cc9f..e87727ab 100644 --- a/swh.storage.egg-info/PKG-INFO +++ b/swh.storage.egg-info/PKG-INFO @@ -1,153 +1,153 @@ Metadata-Version: 2.1 Name: swh.storage -Version: 0.0.116 +Version: 0.0.117 Summary: Software Heritage storage manager Home-page: https://forge.softwareheritage.org/diffusion/DSTO/ Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-storage Description: swh-storage =========== Abstraction layer over the archive, allowing to access all stored source code artifacts as well as their metadata. See the [documentation](https://docs.softwareheritage.org/devel/swh-storage/index.html) for more details. Tests ----- Python tests for this module include tests that cannot be run without a local Postgres database. 
You are not obliged to run those tests though: - `make test`: will run all tests - `make test-nodb`: will run only tests that do not need a local DB - `make test-db`: will run only tests that do need a local DB If you do want to run DB-related tests, you should ensure you have access zith sufficient privileges to a Postgresql database. ### Using your system database You need to ensure that your user is authorized to create and drop DBs, and in particular DBs named "softwareheritage-test" and "softwareheritage-dev" Note: the testdata repository (swh-storage-testdata) is not required any more. ### Using pifpaf [pifpaf](https://github.com/jd/pifpaf) is a suite of fixtures and a command-line tool that allows to start and stop daemons for a quick throw-away usage. It can be used to run tests that need a Postgres database without any other configuration reauired nor the need to have special access to a running database: ```bash $ pifpaf run postgresql make test-db [snip] ---------------------------------------------------------------------- Ran 124 tests in 56.203s OK ``` Note that pifpaf is not yet available as a Debian package, so you may have to install it in a venv. Development ----------- A test server could locally be running for tests. ### Sample configuration In either /etc/softwareheritage/storage/storage.yml, ~/.config/swh/storage.yml or ~/.swh/storage.yml: ``` storage: cls: local args: db: "dbname=softwareheritage-dev user=" objstorage: cls: pathslicing args: root: /home/storage/swh-storage/ slicing: 0:2/2:4/4:6 ``` which means, this uses: - a local storage instance whose db connection is to softwareheritage-dev local instance - the objstorage uses a local objstorage instance whose: - root path is /home/storage/swh-storage - slicing scheme is 0:2/2:4/4:6. 
This means that the identifier of the content (sha1) which will be stored on disk at first level with the first 2 hex characters, the second level with the next 2 hex characters and the third level with the next 2 hex characters. And finally the complete hash file holding the raw content. For example: 00062f8bd330715c4f819373653d97b3cd34394c will be stored at 00/06/2f/00062f8bd330715c4f819373653d97b3cd34394c Note that the 'root' path should exist on disk. ### Run server Command: ``` python3 -m swh.storage.api.server ~/.config/swh/storage.yml ``` This runs a local swh-storage api at 5002 port. ### And then what? In your upper layer (loader-git, loader-svn, etc...), you can define a remote storage with this snippet of yaml configuration. ``` storage: cls: remote args: url: http://localhost:5002/ ``` You could directly define a local storage with the following snippet: ``` storage: cls: local args: db: service=swh-dev objstorage: cls: pathslicing args: root: /home/storage/swh-storage/ slicing: 0:2/2:4/4:6 ``` Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Description-Content-Type: text/markdown Provides-Extra: listener Provides-Extra: schemata Provides-Extra: testing diff --git a/swh.storage.egg-info/SOURCES.txt b/swh.storage.egg-info/SOURCES.txt index 6d551056..61085424 100644 --- a/swh.storage.egg-info/SOURCES.txt +++ b/swh.storage.egg-info/SOURCES.txt @@ -1,195 +1,196 @@ MANIFEST.in Makefile Makefile.local README.md requirements-swh.txt requirements.txt setup.py version.txt bin/swh-storage-add-dir sql/.gitignore sql/Makefile sql/TODO sql/clusters.dot sql/bin/db-upgrade sql/bin/dot_add_content sql/doc/json/.gitignore sql/doc/json/Makefile sql/doc/json/entity.lister_metadata.schema.json sql/doc/json/entity.metadata.schema.json 
sql/doc/json/entity_history.lister_metadata.schema.json sql/doc/json/entity_history.metadata.schema.json sql/doc/json/fetch_history.result.schema.json sql/doc/json/list_history.result.schema.json sql/doc/json/listable_entity.list_params.schema.json sql/doc/json/origin_visit.metadata.json sql/doc/json/tool.tool_configuration.schema.json sql/json/.gitignore sql/json/Makefile sql/json/entity.lister_metadata.schema.json sql/json/entity.metadata.schema.json sql/json/entity_history.lister_metadata.schema.json sql/json/entity_history.metadata.schema.json sql/json/fetch_history.result.schema.json sql/json/list_history.result.schema.json sql/json/listable_entity.list_params.schema.json sql/json/origin_visit.metadata.json sql/json/tool.tool_configuration.schema.json sql/upgrades/015.sql sql/upgrades/016.sql sql/upgrades/017.sql sql/upgrades/018.sql sql/upgrades/019.sql sql/upgrades/020.sql sql/upgrades/021.sql sql/upgrades/022.sql sql/upgrades/023.sql sql/upgrades/024.sql sql/upgrades/025.sql sql/upgrades/026.sql sql/upgrades/027.sql sql/upgrades/028.sql sql/upgrades/029.sql sql/upgrades/030.sql sql/upgrades/032.sql sql/upgrades/033.sql sql/upgrades/034.sql sql/upgrades/035.sql sql/upgrades/036.sql sql/upgrades/037.sql sql/upgrades/038.sql sql/upgrades/039.sql sql/upgrades/040.sql sql/upgrades/041.sql sql/upgrades/042.sql sql/upgrades/043.sql sql/upgrades/044.sql sql/upgrades/045.sql sql/upgrades/046.sql sql/upgrades/047.sql sql/upgrades/048.sql sql/upgrades/049.sql sql/upgrades/050.sql sql/upgrades/051.sql sql/upgrades/052.sql sql/upgrades/053.sql sql/upgrades/054.sql sql/upgrades/055.sql sql/upgrades/056.sql sql/upgrades/057.sql sql/upgrades/058.sql sql/upgrades/059.sql sql/upgrades/060.sql sql/upgrades/061.sql sql/upgrades/062.sql sql/upgrades/063.sql sql/upgrades/064.sql sql/upgrades/065.sql sql/upgrades/066.sql sql/upgrades/067.sql sql/upgrades/068.sql sql/upgrades/069.sql sql/upgrades/070.sql sql/upgrades/071.sql sql/upgrades/072.sql sql/upgrades/073.sql 
sql/upgrades/074.sql sql/upgrades/075.sql sql/upgrades/076.sql sql/upgrades/077.sql sql/upgrades/078.sql sql/upgrades/079.sql sql/upgrades/080.sql sql/upgrades/081.sql sql/upgrades/082.sql sql/upgrades/083.sql sql/upgrades/084.sql sql/upgrades/085.sql sql/upgrades/086.sql sql/upgrades/087.sql sql/upgrades/088.sql sql/upgrades/089.sql sql/upgrades/090.sql sql/upgrades/091.sql sql/upgrades/092.sql sql/upgrades/093.sql sql/upgrades/094.sql sql/upgrades/095.sql sql/upgrades/096.sql sql/upgrades/097.sql sql/upgrades/098.sql sql/upgrades/099.sql sql/upgrades/100.sql sql/upgrades/101.sql sql/upgrades/102.sql sql/upgrades/103.sql sql/upgrades/104.sql sql/upgrades/105.sql sql/upgrades/106.sql sql/upgrades/107.sql sql/upgrades/108.sql sql/upgrades/109.sql sql/upgrades/110.sql sql/upgrades/111.sql sql/upgrades/112.sql sql/upgrades/113.sql sql/upgrades/114.sql sql/upgrades/115.sql sql/upgrades/116.sql sql/upgrades/117.sql sql/upgrades/118.sql sql/upgrades/119.sql sql/upgrades/120.sql sql/upgrades/121.sql sql/upgrades/122.sql sql/upgrades/123.sql sql/upgrades/124.sql sql/upgrades/125.sql sql/upgrades/126.sql sql/upgrades/127.sql sql/upgrades/128.sql sql/upgrades/129.sql swh/__init__.py swh.storage.egg-info/PKG-INFO swh.storage.egg-info/SOURCES.txt swh.storage.egg-info/dependency_links.txt swh.storage.egg-info/requires.txt swh.storage.egg-info/top_level.txt swh/storage/__init__.py swh/storage/common.py swh/storage/converters.py swh/storage/db.py swh/storage/db_utils.py swh/storage/exc.py swh/storage/in_memory.py swh/storage/listener.py swh/storage/storage.py swh/storage/algos/__init__.py swh/storage/algos/diff.py swh/storage/algos/dir_iterators.py swh/storage/algos/revisions_walker.py swh/storage/algos/snapshot.py swh/storage/api/__init__.py swh/storage/api/client.py swh/storage/api/server.py swh/storage/schemata/__init__.py swh/storage/schemata/distribution.py swh/storage/sql/10-swh-init.sql swh/storage/sql/20-swh-enums.sql swh/storage/sql/30-swh-schema.sql 
swh/storage/sql/40-swh-func.sql swh/storage/sql/60-swh-indexes.sql swh/storage/sql/70-swh-triggers.sql swh/storage/tests/__init__.py swh/storage/tests/generate_data_test.py swh/storage/tests/storage_testing.py swh/storage/tests/test_api_client.py swh/storage/tests/test_converters.py swh/storage/tests/test_db.py swh/storage/tests/test_in_memory.py +swh/storage/tests/test_listener.py swh/storage/tests/test_storage.py swh/storage/tests/algos/__init__.py swh/storage/tests/algos/test_diff.py swh/storage/tests/algos/test_dir_iterator.py swh/storage/tests/algos/test_revisions_walker.py swh/storage/tests/algos/test_snapshot.py \ No newline at end of file diff --git a/swh/storage/listener.py b/swh/storage/listener.py index e86c8d0d..52b1ef49 100644 --- a/swh/storage/listener.py +++ b/swh/storage/listener.py @@ -1,112 +1,137 @@ -# Copyright (C) 2016-2017 The Software Heritage developers +# Copyright (C) 2016-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import json import logging import kafka import msgpack import swh.storage.db from swh.core.config import load_named_config +from swh.model import hashutil CONFIG_BASENAME = 'storage/listener' DEFAULT_CONFIG = { 'database': ('str', 'service=softwareheritage'), 'brokers': ('list[str]', ['getty.internal.softwareheritage.org']), 'topic_prefix': ('str', 'swh.tmp_journal.new'), 'poll_timeout': ('int', 10), } -def decode_sha(value): - """Decode the textual representation of a SHA hash""" - if isinstance(value, str): - return bytes.fromhex(value) - return value +def decode(object_type, obj): + """Decode a JSON obj of nature object_type. Depending on the nature of + the object, this can contain hex hashes + (cf. `/swh/storage/sql/70-swh-triggers.sql`). 
+ Args: + object_type (str): Nature of the object + obj (str): json dict representation whose values might be hex + identifier. -def decode_json(value): - """Decode a JSON value containing hashes and other types""" - value = json.loads(value) + Returns: + dict representation ready for journal serialization - return {k: decode_sha(v) for k, v in value.items()} + """ + value = json.loads(obj) + + if object_type in ('origin', 'origin_visit'): + result = value + else: + result = {} + for k, v in value.items(): + result[k] = hashutil.hash_to_bytes(v) + return result OBJECT_TYPES = { 'content', 'skipped_content', 'directory', 'revision', 'release', 'snapshot', 'origin_visit', 'origin', } def register_all_notifies(db): """Register to notifications for all object types listed in OBJECT_TYPES""" with db.transaction() as cur: for object_type in OBJECT_TYPES: db.register_listener('new_%s' % object_type, cur) + logging.debug('Registered to notify events %s' % object_type) def dispatch_notify(topic_prefix, producer, notify): """Dispatch a notification to the proper topic""" + logging.debug('topic_prefix: %s, producer: %s, notify: %s' % ( + topic_prefix, producer, notify)) channel = notify.channel if not channel.startswith('new_') or channel[4:] not in OBJECT_TYPES: logging.warn("Got unexpected notify %s" % notify) return object_type = channel[4:] - topic = '%s.%s' % (topic_prefix, object_type) - data = decode_json(notify.payload) - producer.send(topic, value=data) + producer.send(topic, value=decode(object_type, notify.payload)) def run_from_config(config): """Run the Software Heritage listener from configuration""" db = swh.storage.db.Db.connect(config['database']) def key_to_kafka(key): """Serialize a key, possibly a dict, in a predictable way. 
Duplicated from swh.journal to avoid a cyclic dependency.""" p = msgpack.Packer(use_bin_type=True) if isinstance(key, dict): return p.pack_map_pairs(sorted(key.items())) else: return p.pack(key) producer = kafka.KafkaProducer( bootstrap_servers=config['brokers'], value_serializer=key_to_kafka, ) register_all_notifies(db) topic_prefix = config['topic_prefix'] poll_timeout = config['poll_timeout'] try: while True: for notify in db.listen_notifies(poll_timeout): + logging.debug('Notified by event %s' % notify) dispatch_notify(topic_prefix, producer, notify) producer.flush() except Exception: logging.exception("Caught exception") producer.flush() if __name__ == '__main__': - logging.basicConfig( - level=logging.INFO, - format='%(asctime)s %(process)d %(levelname)s %(message)s' - ) - config = load_named_config(CONFIG_BASENAME, DEFAULT_CONFIG) - run_from_config(config) + import click + + @click.command() + @click.option('--verbose', is_flag=True, default=False, + help='Be verbose if asked.') + def main(verbose): + logging.basicConfig( + level=logging.DEBUG if verbose else logging.INFO, + format='%(asctime)s %(process)d %(levelname)s %(message)s' + ) + _log = logging.getLogger('kafka') + _log.setLevel(logging.INFO) + + config = load_named_config(CONFIG_BASENAME, DEFAULT_CONFIG) + run_from_config(config) + + main() diff --git a/swh/storage/tests/test_listener.py b/swh/storage/tests/test_listener.py new file mode 100644 index 00000000..4b32ea04 --- /dev/null +++ b/swh/storage/tests/test_listener.py @@ -0,0 +1,46 @@ +# Copyright (C) 2018 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +import json +import unittest + +from swh.storage.listener import decode + + +class ListenerUtils(unittest.TestCase): + def test_decode(self): + inputs = [ + ('content', json.dumps({ + 'sha1': 
'34973274ccef6ab4dfaaf86599792fa9c3fe4689', + })), + ('origin', json.dumps({ + 'url': 'https://some/origin', + 'type': 'svn', + })), + ('origin_visit', json.dumps({ + 'visit': 2, + 'origin': { + 'url': 'https://some/origin', + 'type': 'hg', + } + })) + ] + + expected_inputs = [{ + 'sha1': bytes.fromhex('34973274ccef6ab4dfaaf86599792fa9c3fe4689'), + }, { + 'url': 'https://some/origin', + 'type': 'svn', + }, { + 'visit': 2, + 'origin': { + 'url': 'https://some/origin', + 'type': 'hg' + }, + }] + + for i, (object_type, obj) in enumerate(inputs): + actual_value = decode(object_type, obj) + self.assertEqual(actual_value, expected_inputs[i]) diff --git a/version.txt b/version.txt index b90e25db..e26fde55 100644 --- a/version.txt +++ b/version.txt @@ -1 +1 @@ -v0.0.116-0-gd19a0d1 \ No newline at end of file +v0.0.117-0-gfc7c534 \ No newline at end of file
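
Review note on the new `decode(object_type, obj)` in `swh/storage/listener.py`: the behaviour exercised by `test_listener.py` can be reproduced in isolation with the sketch below. It is a minimal stand-in, not the patched function itself. `decode_sketch` and `PLAIN_TYPES` are hypothetical names, and `bytes.fromhex` is assumed to be an adequate substitute for `swh.model.hashutil.hash_to_bytes` when the input is a hex string.

```python
import json

# Object types whose payloads are passed through unchanged
# (mirrors the ('origin', 'origin_visit') check in the patch).
PLAIN_TYPES = ('origin', 'origin_visit')


def decode_sketch(object_type, obj):
    """Decode a JSON payload, converting hex hash strings to bytes.

    Assumption: bytes.fromhex stands in for hashutil.hash_to_bytes.
    """
    value = json.loads(obj)
    if object_type in PLAIN_TYPES:
        return value
    # Every value of a hash-bearing object is treated as a hex string.
    return {k: bytes.fromhex(v) for k, v in value.items()}


payload = json.dumps({'sha1': '34973274ccef6ab4dfaaf86599792fa9c3fe4689'})
decoded = decode_sketch('content', payload)
origin = decode_sketch(
    'origin', json.dumps({'url': 'https://some/origin', 'type': 'svn'}))
```

In this sketch a non-string value on a hash-bearing object raises `TypeError` from `bytes.fromhex`, whereas the removed `decode_sha` helper returned non-string values unchanged. It may be worth checking that the trigger payloads for such object types only ever carry hex strings.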
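
Review note on the `0:2/2:4/4:6` slicing scheme described in the README text carried by `PKG-INFO`: the mapping from a content identifier to its on-disk path can be sketched as follows. `sliced_path` is a hypothetical helper written for illustration, not the objstorage implementation, and it assumes every slice is given with explicit `start:end` bounds.

```python
import posixpath


def sliced_path(root, hex_id, slicing='0:2/2:4/4:6'):
    """Return the path of hex_id under root for a slicing scheme.

    Hypothetical helper: each 'start:end' pair selects the characters
    of the identifier that name one directory level.
    """
    parts = []
    for bounds in slicing.split('/'):
        start, end = (int(b) for b in bounds.split(':'))
        parts.append(hex_id[start:end])
    # The file itself is named after the complete identifier.
    return posixpath.join(root, *parts, hex_id)


path = sliced_path('/home/storage/swh-storage',
                   '00062f8bd330715c4f819373653d97b3cd34394c')
```

With the README's example identifier, the three directory levels are `00`, `06` and `2f`, which matches the `00/06/2f/00062f8bd330715c4f819373653d97b3cd34394c` layout given there.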