diff --git a/README b/README index 47d02ec..ca9ca8d 100644 --- a/README +++ b/README @@ -1,204 +1,197 @@ SWH-loader-tar ============== -The Software Heritage Tarball Loader is a tool and a library to uncompress a local -tarball and inject into the SWH dataset all unknown contained files. +The Software Heritage Tarball Loader is a tool and a library to +uncompress a local tarball and inject into the SWH dataset all unknown +contained files. Tarball loader ============== -Its job is to uncompress a tarball and load its content in swh storage. +Its job is to uncompress a tarball and load its content in swh +storage. ### Configuration This is the loader's (or task's) configuration file. -loader/tar.ini: - - [main] - - # the path where to extract the tarball before loading it into swh +loader/tar.yml: + +``` + extraction_dir: /home/storage/tmp/ + storage: + cls: local + args: + db: service=swh-dev + objstorage: + cls: pathslicing + args: + root: /home/storage/swh-storage + slicing: 0:2/2:4/4:6 + send_contents: True + send_directories: True + send_revisions: True + send_releases: True + send_occurrences: True + + content_packet_size: 10000 + content_packet_block_size_bytes: 104857600 + content_packet_size_bytes: 1073741824 + directory_packet_size: 25000 + revision_packet_size: 100000 + release_packet_size: 100000 + occurrence_packet_size: 100000 extraction_dir = /home/storage/tmp/ - - # access to swh's storage - storage_class = remote_storage - storage_args = http://localhost:5000/ - - # parameters to condition loading into swh storage - send_contents = True - send_directories = True - send_revisions = True - send_releases = True - send_occurrences = True - content_packet_size = 10000 - content_packet_size_bytes = 1073741824 - directory_packet_size = 25000 - revision_packet_size = 100000 - release_packet_size = 100000 - occurrence_packet_size = 100000 +``` Present in possible locations: - ~/.config/swh/loader/tar.ini - ~/.swh/loader/tar.ini - 
/etc/softwareheritage/loader/tar.ini ### API Load tarball directly from code or toplevel: from swh.loader.tar.tasks import LoadTarRepository # Fill in those origin = {} - release = None revision = {} occurrence = {} - LoadTarRepository().run('/some/path/to/blah-7.8.3.tgz', - origin, - revision, - release, - [occurrence]) + LoadTarRepository().prepare_and_load('/some/path/to/blah-7.8.3.tgz', + origin, + revision, + [occurrence]) ### Celery Load tarball using celery. Providing you have a properly configured celery up and running worker.ini needs to be updated with the following keys: task_modules = swh.loader.tar.tasks task_queues = swh_loader_tar cf. https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md for more details #### Toplevel You can send the following message to the task queue: from swh.loader.tar.tasks import LoadTarRepository # Fill in those origin = {} - release = None revision = {} occurrence = {} # Send message to the task queue - LoadTarRepository().apply_async(('/some/path/to/blah-7.8.3.tgz', - origin, - revision, - release, - [occurrence])) + LoadTarRepository().prepare_and_load(('/some/path/to/blah-7.8.3.tgz', + origin, + revision, + [occurrence])) Tar Producer ============ Its job is to compulse from a file or a folder a list of existing -tarball. From this list, compute the corresponding messages to -send to the broker. +tarballs. From this list, compute the corresponding messages to send +to the broker. #### Configuration Message producer's configuration file: [main] # Mirror's root directory holding tarballs to load into swh mirror_root_directory = /srv/storage/space/mirrors/gnu.org/gnu/ # mirror_root_directory = /srv/storage/space/mirrors/gnu.org/old-gnu/ # Url scheme prefix used to create the origin url url_scheme = http://ftp.gnu.org/gnu/ # url_scheme = rsync://ftp.gnu.org/old-gnu/ # Origin type used for tarballs type = ftp # File containing a subset list tarballs from mirror_root_directory to load. 
# The file's format is one absolute path name to a tarball per line. # NOTE: # - This file must contain data consistent with the mirror_root_directory # - if this option is not provided, the mirror_root_directory is scanned # completely as usual # mirror_subset_archives = /home/storage/missing-archives - # Authorities - gnu_authority = 4706c92a-8173-45d9-93d7-06523f249398 - swh_authority = 5f4d4c51-498a-4e28-88b3-b3e4e8396cba - # Randomize blocks of messages and send for consumption block_messages = 250 - # DEV options - - # Tryouts purposes (no limit if not specified) - # limit = 10 - - #### Run Trigger the message computations: - swh-loader-tar-producer --config ~/.swh/producer/tar.ini + python3 -m swh.loader.tar.producer --config ~/.swh/producer/tar.ini This will walk the `mirror_root_directory` folder and send encountered tarball messages for the swh-loader-tar to uncompress (through celery). If the `mirror_subset_archives` is provided, the tarball messages will be computed from such file (the mirror_root_directory is still used so please be consistent). If a problem arises during tarball message computation, a message will be output with the tarball that presents a problem. It will display the number of tarball messages sent at the end. Dry run: - swh-loader-tar-producer --config ~/.swh/producer/tar.ini --dry-run + python3 -m swh.loader.tar.producer --config-file ~/.swh/producer/tar.ini --dry-run This will do the same as previously described but only display the number of potential tarball messages computed. Help: - swh-loader-tar-producer -h + python3 -m swh.loader.tar.producer --help diff-db-mirror ============== Utility to compute the difference between the `occurrence_history` table (column branch) and the actual mirror path on disk. This will output the path to the tarballs not injected in db (for any reason). This output is to be consumed by the swh-loader-tar-producer in replay mode.
Sample use: ./bin/swh-diff-db-mirror \ --db-url 'host= dbname= user= password=' \ --mirror-root-directory /path/to/mirrors/gnu.org/old-gnu Here is a sample output: ... /home/storage/space/mirrors/gnu.org/gnu/miscfiles/miscfiles-1.4.2.tar.gz /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.4.5.tar.gz /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.10.tar.gz /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.4.8.tar.gz /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.5.tar.gz /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.7.tar.gz /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.14.tar.gz /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.2.59.tar.gz /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.9.tar.gz /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.11.tar.gz
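The comparison that diff-db-mirror performs can be pictured as a set difference between the tarball paths found on disk and the paths already known to the database. The following is a minimal sketch of that idea only, not the actual `swh-diff-db-mirror` code: the function names, the `known_paths` argument (standing in for the values read from `occurrence_history`) and the list of tarball suffixes are all assumptions made for illustration.

```python
import os

# Assumption: illustrative suffixes, not the tool's actual list.
TARBALL_SUFFIXES = ('.tar', '.tar.gz', '.tgz', '.tar.bz2', '.tar.xz')


def list_mirror_tarballs(mirror_root_directory):
    """Yield the absolute path of every tarball found under the mirror root."""
    for dirpath, _dirnames, filenames in os.walk(mirror_root_directory):
        for filename in filenames:
            if filename.endswith(TARBALL_SUFFIXES):
                yield os.path.join(os.path.abspath(dirpath), filename)


def missing_tarballs(mirror_root_directory, known_paths):
    """Return, sorted, the tarballs on disk that are absent from known_paths.

    known_paths is a hypothetical stand-in for the paths recorded in the
    `occurrence_history` table (column branch); fetching them from the
    database is deliberately left out of this sketch.
    """
    known = set(known_paths)
    return sorted(path
                  for path in list_mirror_tarballs(mirror_root_directory)
                  if path not in known)
```

Printing each returned path on its own line would give a listing of the same shape as the sample output above, one absolute tarball path per line, which is also the format expected by `mirror_subset_archives`.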