diff --git a/README b/README
index 2626c2a..47d02ec 100644
--- a/README
+++ b/README
@@ -1,198 +1,204 @@
 SWH-loader-tar
 ==============

 The Software Heritage Tarball Loader is a tool and a library to uncompress a local tarball and inject into the SWH dataset all unknown contained files.

 Tarball loader
 ==============

 Its job is to uncompress a tarball and load its content in swh storage.

 ### Configuration

 This is the loader's (or task's) configuration file.

 loader/tar.ini:

     [main]

     # the path where to extract the tarball before loading it into swh
     extraction_dir = /home/storage/tmp/

     # access to swh's storage
     storage_class = remote_storage
     storage_args = http://localhost:5000/

     # parameters to condition loading into swh storage
     send_contents = True
     send_directories = True
     send_revisions = True
     send_releases = True
     send_occurrences = True
     content_packet_size = 10000
     content_packet_size_bytes = 1073741824
     directory_packet_size = 25000
     revision_packet_size = 100000
     release_packet_size = 100000
     occurrence_packet_size = 100000

 Present in possible locations:
 - ~/.config/swh/loader/tar.ini
 - ~/.swh/loader/tar.ini
 - /etc/softwareheritage/loader/tar.ini

 ### API

 Load tarball directly from code or toplevel:

     from swh.loader.tar.tasks import LoadTarRepository

     # Fill in those
     origin = {}
     release = None
     revision = {}
     occurrence = {}

     LoadTarRepository().run('/some/path/to/blah-7.8.3.tgz',
                             origin, revision, release, [occurrence])

 ### Celery

 Load tarball using celery.

 Providing you have a properly configured celery up and running
 worker.ini needs to be updated with the following keys:

     task_modules = swh.loader.tar.tasks
     task_queues = swh_loader_tar

 cf. https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md for more details

 #### Toplevel

 You can send the following message to the task queue:

     from swh.loader.tar.tasks import LoadTarRepository

     # Fill in those
     origin = {}
     release = None
     revision = {}
     occurrence = {}

     # Send message to the task queue
     LoadTarRepository().apply_async(('/some/path/to/blah-7.8.3.tgz',
                                      origin, revision, release, [occurrence]))

 Tar Producer
 ============

 Its job is to compulse from a file or a folder a list of existing tarball.
 From this list, compute the corresponding messages to send to the broker.

 #### Configuration

 Message producer's configuration file:

     [main]

-    # mirror's root directory holding tarballs to load into swh
+    # Mirror's root directory holding tarballs to load into swh
     mirror_root_directory = /srv/storage/space/mirrors/gnu.org/gnu/
     # mirror_root_directory = /srv/storage/space/mirrors/gnu.org/old-gnu/

-    # url scheme prefix used to create the origin url
+    # Url scheme prefix used to create the origin url
     url_scheme = http://ftp.gnu.org/gnu/
     # url_scheme = rsync://ftp.gnu.org/old-gnu/

-    # origin type used for those tarballs
+    # Origin type used for tarballs
     type = ftp

     # File containing a subset list tarballs from mirror_root_directory to load.
     # The file's format is one absolute path name to a tarball per line.
     # NOTE:
     # - This file must contain data consistent with the mirror_root_directory
     # - if this option is not provided, the mirror_root_directory is scanned
     #   completely as usual
     # mirror_subset_archives = /home/storage/missing-archives

-    # For tryouts purposes (no limit if commented or omitted)
-    # limit = 1
-
+    # Authorities
     gnu_authority = 4706c92a-8173-45d9-93d7-06523f249398
     swh_authority = 5f4d4c51-498a-4e28-88b3-b3e4e8396cba

+    # Randomize blocks of messages and send for consumption
+    block_messages = 250
+
+    # DEV options
+
+    # Tryouts purposes (no limit if not specified)
+    # limit = 10
+
 #### Run

 Trigger the message computations:

     swh-loader-tar-producer --config ~/.swh/producer/tar.ini

 This will walk the `mirror_root_directory` folder and send encountered tarball messages for the swh-loader-tar to uncompress (through celery).

 If the `mirror_subset_archives` is provided, the tarball messages will be computed from such file (the mirror_root_directory is still used so please be consistent).

 If problem arises during tarball message computation, a message will be outputed with the tarball that present a problem.

 It will displayed the number of tarball messages sent at the end.

 Dry run:

     swh-loader-tar-producer --config ~/.swh/producer/tar.ini --dry-run

 This will do the same as previously described but only display the number of potential tarball messages computed.

 Help:

     swh-loader-tar-producer -h

 diff-db-mirror
 ==============

 Utility to compute the difference between the `occurrence_history` table (column branch) and the actual mirror path on disk.

 This will output the path to the tarballs not injected in db (for any reason).

 This output is to be consumed by the swh-loader-tar-producer in replay mode.

 Sample use:

     ./bin/swh-diff-db-mirror \
         --db-url 'host= dbname= user= password=' \
         --mirror-root-directory /path/to/mirrors/gnu.org/old-gnu

 Here is a sample output:

     ...
     /home/storage/space/mirrors/gnu.org/gnu/miscfiles/miscfiles-1.4.2.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.4.5.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.10.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.4.8.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.5.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.7.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.14.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.2.59.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.9.tar.gz
     /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.11.tar.gz
diff --git a/resources/producer/tar-gnu.ini b/resources/producer/tar-gnu.ini
index efc7a41..a13f954 100644
--- a/resources/producer/tar-gnu.ini
+++ b/resources/producer/tar-gnu.ini
@@ -1,31 +1,31 @@
 [main]

 # Mirror's root directory holding tarballs to load into swh
 mirror_root_directory = /home/storage/space/mirrors/gnu.org/gnu/

 # Origin setup's possible scheme url
 url_scheme = rsync://ftp.gnu.org/gnu/

 # Origin type used for tarballs
 type = ftp

 # File containing a subset list tarballs from mirror_root_directory to load.
 # The file's format is one absolute path name to a tarball per line.
 # NOTE:
 # - This file must contain data consistent with the mirror_root_directory
 # - if this option is not provided, the mirror_root_directory is scanned
 #   completely as usual
 # mirror_subset_archives = /home/storage/missing-archives

 # Authorities
 gnu_authority = 4706c92a-8173-45d9-93d7-06523f249398
 swh_authority = 5f4d4c51-498a-4e28-88b3-b3e4e8396cba

 # Randomize blocks of messages and send for consumption
-# block_messages = 100
+block_messages = 250

 # DEV options

 # Tryouts purposes (no limit if not specified)
 # limit = 10
diff --git a/resources/producer/tar-old-gnu.ini b/resources/producer/tar-old-gnu.ini
index 9b848f6..4906232 100644
--- a/resources/producer/tar-old-gnu.ini
+++ b/resources/producer/tar-old-gnu.ini
@@ -1,31 +1,31 @@
 [main]

 # Mirror's root directory holding tarballs to load into swh
 mirror_root_directory = /home/storage/space/mirrors/gnu.org/old-gnu/

 # Origin setup's possible scheme url
 url_scheme = rsync://ftp.gnu.org/old-gnu/

 # Origin type used for tarballs
 type = ftp

 # File containing a subset list tarballs from mirror_root_directory to load.
 # The file's format is one absolute path name to a tarball per line.
 # NOTE:
 # - This file must contain data consistent with the mirror_root_directory
 # - if this option is not provided, the mirror_root_directory is scanned
 #   completely as usual
 # mirror_subset_archives = /home/storage/missing-archives

 # Authorities
 gnu_authority = 4706c92a-8173-45d9-93d7-06523f249398
 swh_authority = 5f4d4c51-498a-4e28-88b3-b3e4e8396cba

 # Randomize blocks of messages and send for consumption
-block_messages = 100
+block_messages = 250

 # DEV options

 # Tryouts purposes (no limit if not specified)
 # limit = 10
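
Review note: the diff describes `block_messages` only as "Randomize blocks of messages and send for consumption", and `limit` as a tryout cap. The sketch below shows one plausible reading of those two options, not the producer's actual implementation. It assumes messages are grouped into blocks of `block_messages` items and shuffled within each block, and that `limit` caps the total number of messages. `read_producer_options`, `shuffled_blocks`, `SAMPLE_INI`, and the sample tarball paths are hypothetical names introduced for illustration.

```python
import configparser
import itertools
import random

# Trimmed copy of the options shown in the diff (hypothetical sample).
SAMPLE_INI = """
[main]
mirror_root_directory = /home/storage/space/mirrors/gnu.org/gnu/
url_scheme = rsync://ftp.gnu.org/gnu/
type = ftp
block_messages = 250
# limit = 10
"""


def read_producer_options(ini_text):
    """Read the two tuning options; a commented-out option yields None."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    main = parser["main"]
    return {
        "block_messages": main.getint("block_messages", fallback=None),
        "limit": main.getint("limit", fallback=None),
    }


def shuffled_blocks(messages, block_size, limit=None, rng=random):
    """Yield lists of at most block_size messages, shuffled within each list.

    Assumption: randomization happens per block, so the whole message
    stream never has to be held in memory at once.
    """
    if limit is not None:
        messages = itertools.islice(messages, limit)
    iterator = iter(messages)
    while True:
        block = list(itertools.islice(iterator, block_size))
        if not block:
            return
        rng.shuffle(block)
        yield block


if __name__ == "__main__":
    options = read_producer_options(SAMPLE_INI)
    tarballs = ("/mirror/pkg-%d.tar.gz" % i for i in range(600))
    for block in shuffled_blocks(tarballs, options["block_messages"],
                                 limit=options["limit"]):
        # A real producer would send each message to the broker here.
        print(len(block))
```

Under this reading, raising `block_messages` from 100 to 250 widens the window over which tarballs from the same directory are interleaved, at the cost of buffering more messages before each send.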