diff --git a/README b/README.md
similarity index 86%
rename from README
rename to README.md
index 4ef40c7..99cf12d 100644
--- a/README
+++ b/README.md
@@ -1,141 +1,122 @@
 # SWH Tarball Loader
 
 The Software Heritage Tarball Loader is a tool and a library to
 uncompress a local tarball and inject its tree representation into
 the SWH dataset.
 
 ## Configuration
 
 This is the loader's (or task's) configuration file.
 
 *`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/loader/tar.yml*:
 
 ```YAML
 extraction_dir: /home/storage/tmp/
 storage:
-  cls: local
+  cls: remote
   args:
-    db: service=swh-dev
-    objstorage:
-      cls: pathslicing
-      args:
-        root: /home/storage/swh-storage
-        slicing: 0:2/2:4/4:6
-
-send_contents: True
-send_directories: True
-send_revisions: True
-send_releases: True
-send_occurrences: True
-
-content_packet_size: 10000
-content_packet_block_size_bytes: 104857600
-content_packet_size_bytes: 1073741824
-directory_packet_size: 25000
-revision_packet_size: 100000
-release_packet_size: 100000
-occurrence_packet_size: 100000
+    url: http://localhost:5002/
 ```
 
 ## API
 
 Load tarball directly from code or python3's toplevel:
 
 ``` Python
 from swh.loader.tar.tasks import LoadTarRepository
 
 # Fill in those
 tarpath = '/some/path/to/blah-7.8.3.tgz'
 origin = {'url': 'some-origin', 'type': 'dir'}
 visit_date = 'Tue, 3 May 2017 17:16:32 +0200'
 revision = {}
 occurrence = {}
 
 # Send message to the task queue
 LoadTarRepository().run((tarpath, origin, visit_date, revision, [occurrence]))
 ```
 
 ## Celery
 
 Load tarball using celery.
 
 Provided you have a properly configured celery up and running, the
 celery worker configuration file needs to be updated:
 
 *`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/worker.yml*:
 
 ``` YAML
 task_modules:
   - swh.loader.tar.tasks
 task_queues:
   - swh_loader_tar
 ```
 
 cf. https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md
 for more details
 
 ## Tar Producer
 
 Its job is to compile a list of existing tarballs from a file or a folder.
 From this list, compute the corresponding messages to send to the broker.
 
 ### Configuration
 
 Message producer's configuration file (`tar.yml`):
 
 ``` YAML
 # Mirror's root directory holding tarballs to load into swh
 mirror_root_directory: /srv/storage/space/mirrors/gnu.org/gnu/
 # Url scheme prefix used to create the origin url
 url_scheme: http://ftp.gnu.org/gnu/
 type: ftp
 
 # File containing a subset list of tarballs from mirror_root_directory to load.
 # The file's format is one absolute path name to a tarball per line.
 # NOTE:
 # - This file must contain data consistent with the mirror_root_directory
 # - if this option is not provided, the mirror_root_directory is scanned
 # completely as usual
 # mirror_subset_archives: /home/storage/missing-archives
 
 # Randomize blocks of messages and send for consumption
 block_messages: 250
 ```
 
 ### Run
 
 Trigger the message computations:
 
 ```Shell
 python3 -m swh.loader.tar.producer --config ~/.swh/producer/tar.yml
 ```
 
 This will walk the `mirror_root_directory` folder and send a message for
 each tarball encountered, for the swh-loader-tar to uncompress (through
 celery).
 
 If `mirror_subset_archives` is provided, the tarball messages will be
 computed from that file instead (the `mirror_root_directory` is still
 used, so please keep the two consistent).
 
 If a problem arises during tarball message computation, a message naming
 the offending tarball will be output.
 
 At the end, the number of tarball messages sent will be displayed.
 
 ### Dry run
 
 ``` Shell
 python3 -m swh.loader.tar.producer --config-file ~/.swh/producer/tar.yml --dry-run
 ```
 
 This will do the same as previously described but only display the
 number of potential tarball messages computed.
 
 ### Help
 
 ``` Shell
 python3 -m swh.loader.tar.producer --help
 ```
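
As a rough illustration of the producer behaviour described under "Run" (walk the mirror, build one message per tarball, hand the messages over in shuffled blocks of `block_messages`), the sketch below shows one way this could look. It is not taken from `swh.loader.tar.producer`: the function names, the suffix list, the message layout, and the way the origin url is derived from `url_scheme` are all assumptions made for the example.

```python
import os
import random
from typing import Dict, Iterable, Iterator, List, Optional

# Assumption: suffixes treated as tarballs; the real producer may differ.
TARBALL_SUFFIXES = ('.tar', '.tar.gz', '.tgz', '.tar.bz2', '.tar.xz')


def list_tarballs(root: str) -> Iterator[str]:
    """Walk `root` recursively and yield the path of every tarball found."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(TARBALL_SUFFIXES):
                yield os.path.join(dirpath, name)


def compute_message(tarpath: str, root: str,
                    url_scheme: str, origin_type: str) -> Dict:
    """Build a hypothetical message for one tarball.

    Assumption: the origin url is `url_scheme` followed by the tarball's
    path relative to the mirror root.
    """
    relative = os.path.relpath(tarpath, root).replace(os.sep, '/')
    return {
        'tarpath': tarpath,
        'origin': {'url': url_scheme + relative, 'type': origin_type},
    }


def randomized_blocks(messages: Iterable, block_size: int,
                      rng: Optional[random.Random] = None) -> Iterator[List]:
    """Group messages into blocks of `block_size`, shuffling each block."""
    rng = rng or random.Random()
    block: List = []
    for message in messages:
        block.append(message)
        if len(block) >= block_size:
            rng.shuffle(block)
            yield block
            block = []
    if block:  # last, possibly smaller, block
        rng.shuffle(block)
        yield block
```

With the configuration shown earlier, a dry run would amount to counting the messages produced by `compute_message` for every path from `list_tarballs(mirror_root_directory)`, without sending any block to the broker.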