diff --git a/README.md b/README.md
index 99cf12d..cb37633 100644
--- a/README.md
+++ b/README.md
@@ -1,122 +1,131 @@
 # SWH Tarball Loader
 
 The Software Heritage Tarball Loader is a tool and a library to uncompress a local tarball and inject its tree representation into the SWH dataset.
 
 ## Configuration
 
 This is the loader's (or task's) configuration file.
 
 *`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/loader/tar.yml*:
 
 ```YAML
 extraction_dir: /home/storage/tmp/
 storage:
   cls: remote
   args:
     url: http://localhost:5002/
 ```
 
 ## API
 
 Load a tarball directly from code or from the python3 toplevel:
 
 ``` Python
-    from swh.loader.tar.tasks import LoadTarRepository
-
-    # Fill in those
-    tarpath = '/some/path/to/blah-7.8.3.tgz'
-    origin = {'url': 'some-origin', 'type': 'dir'}
-    visit_date = 'Tue, 3 May 2017 17:16:32 +0200'
-    revision = {}
-    occurrence = {}
-
-    # Send message to the task queue
-    LoadTarRepository().run((tarpath, origin, visit_date, revision, [occurrence]))
+# Fill in those
+repo = 'loader-tar.tgz'
+tarpath = '/home/storage/tar/%s' % repo
+origin = {'url': 'ftp://%s' % repo, 'type': 'tar'}
+visit_date = 'Tue, 3 May 2017 17:16:32 +0200'
+revision = {
+    'author': {'name': 'some', 'fullname': 'one', 'email': 'something'},
+    'committer': {'name': 'some', 'fullname': 'one', 'email': 'something'},
+    'message': '1.0 Released',
+    'date': None,
+    'committer_date': None,
+    'type': 'tar',
+}
+import logging
+logging.basicConfig(level=logging.DEBUG)
+
+from swh.loader.tar.tasks import LoadTarRepository
+l = LoadTarRepository()
+l.run_task(tar_path=tarpath, origin=origin, visit_date=visit_date,
+           revision=revision, branch_name='master')
 ```
 
 ## Celery
 
 Load a tarball using celery.
 
 Provided you have a properly configured celery up and running, the celery worker configuration file needs to be updated:
 
 *`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/worker.yml*:
 
 ``` YAML
 task_modules:
   - swh.loader.tar.tasks
 task_queues:
   - swh_loader_tar
 ```
 
 cf.
 https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md for more details
 
 ## Tar Producer
 
 Its job is to compile a list of existing tarballs from a file or a folder, then compute from this list the corresponding messages to send to the broker.
 
 ### Configuration
 
 Message producer's configuration file (`tar.yml`):
 
 ``` YAML
 # Mirror's root directory holding tarballs to load into swh
 mirror_root_directory: /srv/storage/space/mirrors/gnu.org/gnu/
 # Url scheme prefix used to create the origin url
 url_scheme: http://ftp.gnu.org/gnu/
 type: ftp
 
 # File containing a subset list of tarballs from mirror_root_directory to load.
 # The file's format is one absolute path name to a tarball per line.
 # NOTE:
 # - This file must contain data consistent with the mirror_root_directory
 # - if this option is not provided, the mirror_root_directory is scanned
 #   completely as usual
 # mirror_subset_archives: /home/storage/missing-archives
 
 # Randomize blocks of messages and send for consumption
 block_messages: 250
 ```
 
 ### Run
 
 Trigger the message computations:
 
 ```Shell
 python3 -m swh.loader.tar.producer --config ~/.swh/producer/tar.yml
 ```
 
 This will walk the `mirror_root_directory` folder and send the messages for the tarballs it encounters, so that swh-loader-tar can uncompress them (through celery).
 
 If `mirror_subset_archives` is provided, the tarball messages will be computed from that file (the `mirror_root_directory` is still used, so please be consistent).
 
 If a problem arises during tarball message computation, a message is output naming the tarball that caused it.
 
 The number of tarball messages sent is displayed at the end.
 
 ### Dry run
 
 ``` Shell
 python3 -m swh.loader.tar.producer --config-file ~/.swh/producer/tar.yml --dry-run
 ```
 
 This will do the same as previously described but only display the number of potential tarball messages computed.
 
 ### Help
 
 ``` Shell
 python3 -m swh.loader.tar.producer --help
 ```
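The producer flow described under "Run" (walk `mirror_root_directory`, compute one message per tarball, send messages in randomized blocks) could be sketched roughly as follows. This is an illustration only, not the code of `swh.loader.tar.producer`: the helper names `list_tarballs`, `compute_message` and `random_blocks`, the message layout, and the rule that the origin url is `url_scheme` plus the path relative to the mirror root are all assumptions.

```python
import os
import random
import tarfile


def list_tarballs(mirror_root_directory):
    """Yield the paths of the tarballs found under the mirror root."""
    for dirpath, _dirnames, filenames in os.walk(mirror_root_directory):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Keep only files that the tarfile module recognizes as archives
            if tarfile.is_tarfile(path):
                yield path


def compute_message(tarpath, mirror_root_directory, url_scheme, origin_type):
    """Build a hypothetical message: the tarball path and its origin.

    Assumption: the origin url is url_scheme followed by the tarball's
    path relative to the mirror root.
    """
    relative = os.path.relpath(tarpath, mirror_root_directory)
    url = url_scheme.rstrip('/') + '/' + relative.replace(os.sep, '/')
    return {'tar_path': tarpath,
            'origin': {'url': url, 'type': origin_type}}


def random_blocks(messages, block_size):
    """Yield blocks of at most block_size messages, shuffled in place."""
    block = []
    for message in messages:
        block.append(message)
        if len(block) == block_size:
            random.shuffle(block)
            yield block
            block = []
    if block:
        # Last, possibly shorter, block
        random.shuffle(block)
        yield block
```

With the configuration shown earlier, `random_blocks(messages, 250)` would correspond to `block_messages: 250`. A dry run would only count the messages instead of sending them.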