diff --git a/README b/README
index 33a34b2..45586c0 100644
--- a/README
+++ b/README
@@ -1,197 +1,195 @@
 SWH-loader-tar
 ==============

 The Software Heritage Tarball Loader is a tool and a library to
 uncompress a local tarball and inject into the SWH dataset all unknown
 contained files.

 Tarball loader
 ==============

 Its job is to uncompress a tarball and load its content in swh storage.

 ### Configuration

 This is the loader's (or task's) configuration file.

 loader/tar.yml:

 ```
 extraction_dir: /home/storage/tmp/
 storage:
   cls: local
   args:
     db: service=swh-dev
     objstorage:
       cls: pathslicing
       args:
         root: /home/storage/swh-storage
         slicing: 0:2/2:4/4:6
 send_contents: True
 send_directories: True
 send_revisions: True
 send_releases: True
 send_occurrences: True
 content_packet_size: 10000
 content_packet_block_size_bytes: 104857600
 content_packet_size_bytes: 1073741824
 directory_packet_size: 25000
 revision_packet_size: 100000
 release_packet_size: 100000
 occurrence_packet_size: 100000
 extraction_dir = /home/storage/tmp/
 ```

 Present in possible locations:
 - ~/.config/swh/loader/tar.ini
 - ~/.swh/loader/tar.ini
 - /etc/softwareheritage/loader/tar.ini

 ### API

 Load tarball directly from code or toplevel:

-    from swh.loader.tar.tasks import LoadTarRepository
+    from swh.loader.tar.loader import TarLoader
+    tarpath = '/some/path/to/blah-7.8.3.tgz'
     # Fill in those
-    origin = {}
+    origin = {'url': 'some-origin', 'type': 'dir'}
+    visit_date = 'Tue, 3 May 2017 17:16:32 +0200'
     revision = {}
     occurrence = {}
-    LoadTarRepository().load('/some/path/to/blah-7.8.3.tgz',
-                             origin,
-                             revision,
-                             [occurrence])
+    TarLoader().load(tarpath, origin, visit_date, revision, [occurrence])

 ### Celery

 Load tarball using celery.

 Providing you have a properly configured celery up and running
 worker.ini needs to be updated with the following keys:

     task_modules = swh.loader.tar.tasks
     task_queues = swh_loader_tar

 cf. https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md
 for more details

 #### Toplevel

 You can send the following message to the task queue:

     from swh.loader.tar.tasks import LoadTarRepository

     # Fill in those
-    origin = {}
+    tarpath = '/some/path/to/blah-7.8.3.tgz'
+    origin = {'url': 'some-origin', 'type': 'dir'}
+    visit_date = 'Tue, 3 May 2017 17:16:32 +0200'
     revision = {}
     occurrence = {}

     # Send message to the task queue
-    LoadTarRepository().load(('/some/path/to/blah-7.8.3.tgz',
-                              origin,
-                              revision,
-                              [occurrence]))
+    LoadTarRepository().run((tarpath, origin, visit_date, revision, [occurrence]))

 Tar Producer
 ============

 Its job is to compulse from a file or a folder a list of existing
 tarballs. From this list, compute the corresponding messages to send
 to the broker.

 #### Configuration

 Message producer's configuration file:

     [main]

     # Mirror's root directory holding tarballs to load into swh
     mirror_root_directory = /srv/storage/space/mirrors/gnu.org/gnu/
     # mirror_root_directory = /srv/storage/space/mirrors/gnu.org/old-gnu/

     # Url scheme prefix used to create the origin url
     url_scheme = http://ftp.gnu.org/gnu/
     # url_scheme = rsync://ftp.gnu.org/old-gnu/

     # Origin type used for tarballs
     type = ftp

     # File containing a subset list tarballs from mirror_root_directory to load.
     # The file's format is one absolute path name to a tarball per line.
     # NOTE:
     # - This file must contain data consistent with the mirror_root_directory
     # - if this option is not provided, the mirror_root_directory is scanned
     #   completely as usual
     # mirror_subset_archives = /home/storage/missing-archives

     # Randomize blocks of messages and send for consumption
     block_messages = 250

 #### Run

 Trigger the message computations:

     python3 -m swh.loader.tar.producer --config ~/.swh/producer/tar.ini

 This will walk the `mirror_root_directory` folder and send encountered
 tarball messages for the swh-loader-tar to uncompress (through celery).

 If the `mirror_subset_archives` is provided, the tarball messages will
 be computed from such file (the mirror_root_directory is still used so
 please be consistent). If problem arises during tarball message
 computation, a message will be outputed with the tarball that present a
 problem. It will displayed the number of tarball messages sent at the
 end.

 Dry run:

     python3 -m swh.loader.tar.producer --config-file ~/.swh/producer/tar.ini --dry-run

 This will do the same as previously described but only display the
 number of potential tarball messages computed.

 Help:

     python3 -m swh.loader.tar.producer --help

 diff-db-mirror
 ==============

 Utility to compute the difference between the `occurrence_history`
 table (column branch) and the actual mirror path on disk.

 This will output the path to the tarballs not injected in db (for any
 reason). This output is to be consumed by the swh-loader-tar-producer
 in replay mode.

 Sample use:

     ./bin/swh-diff-db-mirror \
       --db-url 'host= dbname= user= password=' \
       --mirror-root-directory /path/to/mirrors/gnu.org/old-gnu

 Here is a sample output:

     ...
     /home/storage/space/mirrors/gnu.org/gnu/miscfiles/miscfiles-1.4.2.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.4.5.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.10.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.4.8.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.5.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.7.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.14.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.2.59.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.9.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.11.tar.gz
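
Note on the diff-db-mirror section: the comparison it describes amounts to a set difference between the tarballs found on disk and the paths already recorded in the database. The following is only a minimal sketch of that idea, not the utility's actual code. It assumes the known paths have already been fetched from the database as absolute path strings; the names `list_tarballs`, `missing_tarballs` and `TARBALL_SUFFIXES` are hypothetical, and so is the list of suffixes.

```python
import os

# Assumed set of tarball suffixes; the real utility may recognise others.
TARBALL_SUFFIXES = ('.tar', '.tar.gz', '.tgz', '.tar.bz2', '.tar.xz')


def list_tarballs(mirror_root):
    """Yield the path of every file under mirror_root that looks like a tarball."""
    for dirpath, _dirnames, filenames in os.walk(mirror_root):
        for name in filenames:
            if name.endswith(TARBALL_SUFFIXES):
                yield os.path.join(dirpath, name)


def missing_tarballs(mirror_root, known_paths):
    """Return, sorted, the tarballs on disk that are absent from known_paths.

    known_paths is assumed to hold the paths already injected in the db,
    in the same absolute form as the paths produced by list_tarballs.
    """
    return sorted(set(list_tarballs(mirror_root)) - set(known_paths))
```

Printing each element of `missing_tarballs(...)` on its own line would give a list in the one-absolute-path-per-line format that `mirror_subset_archives` expects.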