SWH-loader-tar
==============

The Software Heritage Tarball Loader is a tool and a library to uncompress a local tarball and inject into the SWH dataset all the files it contains that are not yet known.

Tarball loader
==============

Its job is to uncompress a tarball and load its content into swh storage.

### Configuration

This is the loader's (or task's) configuration file.

loader/tar.ini:

    [main]

    # the path where to extract the tarball before loading it into swh
    extraction_dir = /home/storage/tmp/

    # access to swh's storage
    storage_class = remote_storage
    storage_args = http://localhost:5000/

    # parameters to condition loading into swh storage
    send_contents = True
    send_directories = True
    send_revisions = True
    send_releases = True
    send_occurrences = True
    content_packet_size = 10000
    content_packet_size_bytes = 1073741824
    directory_packet_size = 25000
    revision_packet_size = 100000
    release_packet_size = 100000
    occurrence_packet_size = 100000

The file can be placed in any of the following locations:

- ~/.config/swh/loader/tar.ini
- ~/.swh/loader/tar.ini
- /etc/softwareheritage/loader/tar.ini

### API

Load a tarball directly from code or from the Python toplevel:

    from swh.loader.tar.tasks import LoadTarRepository

    # Fill in these
    origin = {}
    release = None
    revision = {}
    occurrence = {}

    LoadTarRepository().run('/some/path/to/blah-7.8.3.tgz', origin, revision, release, [occurrence])

### Celery

Load a tarball using celery.

Provided you have a properly configured celery up and running, worker.ini needs to be updated with the following keys:

    task_modules = swh.loader.tar.tasks
    task_queues = swh_loader_tar

cf.
https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md for more details.

#### Toplevel

You can send the following message to the task queue:

    from swh.loader.tar.tasks import LoadTarRepository

    # Fill in these
    origin = {}
    release = None
    revision = {}
    occurrence = {}

    # Send message to the task queue
    LoadTarRepository().apply_async(('/some/path/to/blah-7.8.3.tgz', origin, revision, release, [occurrence]))

Tar Producer
============

Its job is to compile, from a file or a folder, a list of existing tarballs. From this list, it computes the celery messages to send for the loader tar worker to consume.

#### Configuration

Message producer's configuration file:

    [main]

    # mirror's root directory holding tarballs to load into swh
    mirror_root_directory = /home/storage/space/mirrors/gnu.org/gnu/

    # url scheme prefix used to create the origin url
    url_scheme = http://ftp.gnu.org/gnu/

    # origin type used for those tarballs
    type = ftp

    # File listing a subset of tarballs from mirror_root_directory to load.
    # The file's format is one absolute path name to a tarball per line.
    # NOTE:
    # - This file must contain data consistent with the mirror_root_directory
    # - if this option is not provided, the mirror_root_directory is scanned
    #   completely as usual
    # mirror_subset_archives = /home/storage/missing-archives

    # For tryout purposes (no limit if commented or omitted)
    # limit = 1

#### Run

Trigger the message computations:

    swh-loader-tar-producer --config ~/.swh/producer/tar.ini

This will walk the `mirror_root_directory` folder and, for each tarball encountered, send a message (through celery) for swh-loader-tar to uncompress it.

If `mirror_subset_archives` is provided, the tarball messages will be computed from that file (the `mirror_root_directory` is still used, so keep the two consistent).

If a problem arises during tarball message computation, a message naming the offending tarball is output.

The number of tarball messages sent is displayed at the end.
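The walk described above can be sketched in a few lines of Python. This is only an illustration, not the producer's actual code: the helper names `list_tarballs` and `origin_url`, the set of recognized file suffixes, and the rule that the origin url is `url_scheme` followed by the path relative to the mirror root are all assumptions made for the example.

```python
import os

# Assumed set of tarball suffixes; the real producer may recognize others.
TARBALL_SUFFIXES = ('.tar', '.tar.gz', '.tgz', '.tar.bz2', '.tar.xz')


def list_tarballs(mirror_root_directory, mirror_subset_archives=None, limit=None):
    """Yield tarball paths (hypothetical helper).

    Paths come from the subset file (one absolute path per line) when it
    is given, otherwise from a full walk of the mirror root.
    """
    if mirror_subset_archives:
        with open(mirror_subset_archives) as f:
            paths = [line.strip() for line in f if line.strip()]
    else:
        paths = (os.path.join(dirpath, name)
                 for dirpath, _, names in os.walk(mirror_root_directory)
                 for name in sorted(names))

    count = 0
    for path in paths:
        if not path.endswith(TARBALL_SUFFIXES):
            continue  # not a tarball, skip it
        if limit is not None and count >= limit:
            return  # tryout limit reached
        count += 1
        yield path


def origin_url(tarball_path, mirror_root_directory, url_scheme):
    """Build an origin url for a tarball (assumed mapping, see lead-in)."""
    relative = os.path.relpath(tarball_path, mirror_root_directory)
    return url_scheme.rstrip('/') + '/' + relative.replace(os.sep, '/')
```

Under these assumptions, a tarball at `<mirror_root_directory>/blah/blah-7.8.3.tgz` with `url_scheme = http://ftp.gnu.org/gnu/` maps to `http://ftp.gnu.org/gnu/blah/blah-7.8.3.tgz`. Each yielded path would then be turned into one message for the loader.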
Dry run:

    swh-loader-tar-producer --config ~/.swh/producer/tar.ini --dry-run

This will do the same as previously described but only display the number of potential tarball messages computed.

Help:

    swh-loader-tar-producer -h
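To check a producer configuration file before passing it to `--config`, it can be parsed with Python's standard `configparser` module. The sketch below is a hypothetical helper (`read_producer_config` is not part of swh-loader-tar) and it assumes the keys live in a `[main]` section, with `mirror_subset_archives` and `limit` optional.

```python
import configparser


def read_producer_config(text):
    """Parse producer settings from the text of an ini file (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    main = parser['main']
    return {
        # Required keys: a missing one raises KeyError.
        'mirror_root_directory': main['mirror_root_directory'],
        'url_scheme': main['url_scheme'],
        'type': main['type'],
        # Optional keys: None means "scan the whole mirror" / "no limit".
        'mirror_subset_archives': main.get('mirror_subset_archives'),
        'limit': main.getint('limit', fallback=None),
    }
```

For a file on disk, pass its content, for instance `read_producer_config(open(path).read())`. Lines starting with `#`, such as the commented-out `limit`, are ignored by the parser, so the corresponding values come back as `None`.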