diff --git a/README b/README
index 45586c0..4ef40c7 100644
--- a/README
+++ b/README
@@ -1,195 +1,141 @@
-SWH-loader-tar
-==============
+# SWH Tarball Loader
 
 The Software Heritage Tarball Loader is a tool and a library to
-uncompress a local tarball and inject into the SWH dataset all unknown
-contained files.
+uncompress a local tarball and inject into the SWH dataset its tree
+representation.
 
-Tarball loader
-==============
-
-Its job is to uncompress a tarball and load its content in swh
-storage.
-
-### Configuration
+## Configuration
 
 This is the loader's (or task's) configuration file.
 
-loader/tar.yml:
+*`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/loader/tar.yml*:
 
-```
-    extraction_dir: /home/storage/tmp/
-    storage:
-      cls: local
+```YAML
+extraction_dir: /home/storage/tmp/
+storage:
+  cls: local
+  args:
+    db: service=swh-dev
+    objstorage:
+      cls: pathslicing
       args:
-        db: service=swh-dev
-        objstorage:
-          cls: pathslicing
-          args:
-            root: /home/storage/swh-storage
-            slicing: 0:2/2:4/4:6
-    send_contents: True
-    send_directories: True
-    send_revisions: True
-    send_releases: True
-    send_occurrences: True
-
-    content_packet_size: 10000
-    content_packet_block_size_bytes: 104857600
-    content_packet_size_bytes: 1073741824
-    directory_packet_size: 25000
-    revision_packet_size: 100000
-    release_packet_size: 100000
-    occurrence_packet_size: 100000
-    extraction_dir = /home/storage/tmp/
+        root: /home/storage/swh-storage
+        slicing: 0:2/2:4/4:6
+
+send_contents: True
+send_directories: True
+send_revisions: True
+send_releases: True
+send_occurrences: True
+
+content_packet_size: 10000
+content_packet_block_size_bytes: 104857600
+content_packet_size_bytes: 1073741824
+directory_packet_size: 25000
+revision_packet_size: 100000
+release_packet_size: 100000
+occurrence_packet_size: 100000
 ```
 
-Present in possible locations:
-- ~/.config/swh/loader/tar.ini
-- ~/.swh/loader/tar.ini
-- /etc/softwareheritage/loader/tar.ini
+## API
 
+Load tarball directly from code or python3's toplevel:
 
-### API
-
-Load tarball directly from code or toplevel:
-
-    from swh.loader.tar.loader import TarLoader
+``` Python
+    from swh.loader.tar.tasks import LoadTarRepository
 
-    tarpath = '/some/path/to/blah-7.8.3.tgz' # Fill in those
+    tarpath = '/some/path/to/blah-7.8.3.tgz'
     origin = {'url': 'some-origin', 'type': 'dir'}
     visit_date = 'Tue, 3 May 2017 17:16:32 +0200'
     revision = {}
     occurrence = {}
 
-    TarLoader().load(tarpath, origin, visit_date, revision, [occurrence])
-
+    # Send message to the task queue
+    LoadTarRepository().run((tarpath, origin, visit_date, revision, [occurrence]))
+```
 
-### Celery
+## Celery
 
 Load tarball using celery.
 
-Providing you have a properly configured celery up and running
+Providing you have a properly configured celery up and running, the
+celery worker configuration file needs to be updated:
 
-worker.ini needs to be updated with the following keys:
+*`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/worker.yml*:
 
-    task_modules = swh.loader.tar.tasks
-    task_queues = swh_loader_tar
+``` YAML
+task_modules:
+  - swh.loader.tar.tasks
+task_queues:
+  - swh_loader_tar
+```
 
 cf. https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md
 for more details
 
-#### Toplevel
-
-You can send the following message to the task queue:
-
-    from swh.loader.tar.tasks import LoadTarRepository
-
-    # Fill in those
-    tarpath = '/some/path/to/blah-7.8.3.tgz'
-    origin = {'url': 'some-origin', 'type': 'dir'}
-    visit_date = 'Tue, 3 May 2017 17:16:32 +0200'
-    revision = {}
-    occurrence = {}
-
-    # Send message to the task queue
-    LoadTarRepository().run((tarpath, origin, visit_date, revision, [occurrence]))
-
-
-Tar Producer
-============
+## Tar Producer
 
 Its job is to compulse from a file or a folder a list of existing
 tarballs. From this list, compute the corresponding messages to send
 to the broker.
 
-#### Configuration
-
-Message producer's configuration file:
-
-    [main]
-
-    # Mirror's root directory holding tarballs to load into swh
-    mirror_root_directory = /srv/storage/space/mirrors/gnu.org/gnu/
-    # mirror_root_directory = /srv/storage/space/mirrors/gnu.org/old-gnu/
-
-    # Url scheme prefix used to create the origin url
-    url_scheme = http://ftp.gnu.org/gnu/
-    # url_scheme = rsync://ftp.gnu.org/old-gnu/
-
-    # Origin type used for tarballs
-    type = ftp
-
-    # File containing a subset list tarballs from mirror_root_directory to load.
-    # The file's format is one absolute path name to a tarball per line.
-    # NOTE:
-    # - This file must contain data consistent with the mirror_root_directory
-    # - if this option is not provided, the mirror_root_directory is scanned
-    #   completely as usual
-    # mirror_subset_archives = /home/storage/missing-archives
+### Configuration
 
-    # Randomize blocks of messages and send for consumption
-    block_messages = 250
+Message producer's configuration file (`tar.yml`):
+
+``` YAML
+# Mirror's root directory holding tarballs to load into swh
+mirror_root_directory: /srv/storage/space/mirrors/gnu.org/gnu/
+# Url scheme prefix used to create the origin url
+url_scheme: http://ftp.gnu.org/gnu/
+type: ftp
+
+# File containing a subset list tarballs from mirror_root_directory to load.
+# The file's format is one absolute path name to a tarball per line.
+# NOTE:
+# - This file must contain data consistent with the mirror_root_directory
+# - if this option is not provided, the mirror_root_directory is scanned
+#   completely as usual
+# mirror_subset_archives: /home/storage/missing-archives
+
+# Randomize blocks of messages and send for consumption
+block_messages: 250
+```
 
-#### Run
+### Run
 
 Trigger the message computations:
 
-    python3 -m swh.loader.tar.producer --config ~/.swh/producer/tar.ini
+```Shell
+python3 -m swh.loader.tar.producer --config ~/.swh/producer/tar.yml
+```
 
 This will walk the `mirror_root_directory` folder and send encountered
 tarball messages for the swh-loader-tar to uncompress (through
 celery).
 
 If the `mirror_subset_archives` is provided, the tarball messages will
-be computed from such file (the mirror_root_directory is still used so
-please be consistent).
+be computed from such file (the `mirror_root_directory` is still used
+so please be consistent).
 
-If problem arises during tarball message computation, a message will be
-outputed with the tarball that present a problem.
+If problem arises during tarball message computation, a message will
+be outputed with the tarball that present a problem.
 
 It will displayed the number of tarball messages sent at the end.
 
-Dry run:
+### Dry run
 
-    python3 -m swh.loader.tar.producer --config-file ~/.swh/producer/tar.ini --dry-run
+``` Shell
+python3 -m swh.loader.tar.producer --config-file ~/.swh/producer/tar.yml --dry-run
+```
 
 This will do the same as previously described but only display the
 number of potential tarball messages computed.
 
-Help:
-
-    python3 -m swh.loader.tar.producer --help
-
-
-diff-db-mirror
-==============
-
-Utility to compute the difference between the `occurrence_history` table
-(column branch) and the actual mirror path on disk.
-This will output the path to the tarballs not injected in db (for any reason).
+### Help
 
-This output is to be consumed by the swh-loader-tar-producer in replay mode.
-
-
-Sample use:
-
-    ./bin/swh-diff-db-mirror \
-      --db-url 'host= dbname= user= password=' \
-      --mirror-root-directory /path/to/mirrors/gnu.org/old-gnu
-
-Here is a sample output:
-
-    ...
-    /home/storage/space/mirrors/gnu.org/gnu/miscfiles/miscfiles-1.4.2.tar.gz
-    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.4.5.tar.gz
-    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.10.tar.gz
-    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.4.8.tar.gz
-    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.5.tar.gz
-    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.7.tar.gz
-    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.14.tar.gz
-    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.2.59.tar.gz
-    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.9.tar.gz
-    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.11.tar.gz
+``` Shell
+python3 -m swh.loader.tar.producer --help
+```
diff --git a/docs/.gitignore b/docs/.gitignore
index 58a761e..f6b5c55 100644
--- a/docs/.gitignore
+++ b/docs/.gitignore
@@ -1,3 +1,4 @@
 _build/
 apidoc/
 *-stamp
+README.md
diff --git a/docs/Makefile b/docs/Makefile
index c30c50a..c491218 100644
--- a/docs/Makefile
+++ b/docs/Makefile
@@ -1 +1,6 @@
 include ../../swh-docs/Makefile.sphinx
+
+html: copy_md
+
+copy_md:
+	cp ../README README.md
diff --git a/docs/index.rst b/docs/index.rst
index 8b64117..2e88ed8 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,15 +1,15 @@
 Software Heritage - Development Documentation
 =============================================
 
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
 
-
+   README.md
 
 Indices and tables
 ==================
 
 * :ref:`genindex`
 * :ref:`modindex`
 * :ref:`search`
diff --git a/resources/loader/tar.ini b/resources/loader/tar.ini
deleted file mode 100644
index c0aaf45..0000000
--- a/resources/loader/tar.ini
+++ /dev/null
@@ -1,39 +0,0 @@
-[main]
-
-# NOT FOR PRODUCTION
-tar_path = /home/tony/work/inria/repo/linux.tgz
-dir_path = /tmp/linux/
-
-# synthetic origin
-origin_url = file:///dev/null
-origin_type = tar
-
-# occurrence
-occurrence_branch = master2
-occurrence_authority = 1
-occurrence_validity = 2015-01-01 00:00:00+00
-
-# occurrence 2
-occurrence2_branch = dev
-occurrence2_authority = 2
-occurrence2_validity = 2015-01-01 00:00:00+00
-
-# synthetic revision
-revision_author_name = swh author
-revision_author_email = swh@inria.fr
-revision_author_date = 1444054085
-revision_author_offset = +0200
-revision_committer_name = swh committer
-revision_committer_email = swh@inria.fr
-revision_committer_date = 1444054085
-revision_committer_offset = +0200
-revision_type = tar
-revision_message = synthetic revision message
-
-# synthetic release
-release_name = v0.0.1
-release_date = 1444054085
-release_offset = +0200
-release_author_name = swh author
-release_author_email = swh@inria.fr
-release_comment = synthetic release
diff --git a/resources/producer/tar-gnu.ini b/resources/producer/tar-gnu.yml
similarity index 69%
rename from resources/producer/tar-gnu.ini
rename to resources/producer/tar-gnu.yml
index a1660bc..fc920f4 100644
--- a/resources/producer/tar-gnu.ini
+++ b/resources/producer/tar-gnu.yml
@@ -1,24 +1,22 @@
-[main]
-
 # Mirror's root directory holding tarballs to load into swh
-mirror_root_directory = /home/storage/space/mirrors/gnu.org/gnu/
+mirror_root_directory: /home/storage/space/mirrors/gnu.org/gnu/
 
 # Origin setup's possible scheme url
-url_scheme = rsync://ftp.gnu.org/gnu/
+url_scheme: rsync://ftp.gnu.org/gnu/
 
 # Origin type used for tarballs
-type = ftp
+type: ftp
 
 # File containing a subset list tarballs from mirror_root_directory to load.
 # The file's format is one absolute path name to a tarball per line.
 # NOTE:
 # - This file must contain data consistent with the mirror_root_directory
 # - if this option is not provided, the mirror_root_directory is scanned
 #   completely as usual
-# mirror_subset_archives = /home/storage/missing-archives
+# mirror_subset_archives: /home/storage/missing-archives
 
 # Retrieval date information (rsync, etc...)
-date = Fri, 28 Aug 2015 13:13:26 +0200
+date: Fri, 28 Aug 2015 13:13:26 +0200
 
 # Randomize blocks of messages and send for consumption
-block_messages = 250
+block_messages: 250
diff --git a/resources/producer/tar-old-gnu.ini b/resources/producer/tar-old-gnu.yml
similarity index 65%
rename from resources/producer/tar-old-gnu.ini
rename to resources/producer/tar-old-gnu.yml
index 54b78ff..bf4e5fe 100644
--- a/resources/producer/tar-old-gnu.ini
+++ b/resources/producer/tar-old-gnu.yml
@@ -1,24 +1,22 @@
-[main]
-
 # Mirror's root directory holding tarballs to load into swh
-mirror_root_directory = /home/storage/space/mirrors/gnu.org/old-gnu/
+mirror_root_directory: /home/storage/space/mirrors/gnu.org/old-gnu/
 
 # Origin setup's possible scheme url
-url_scheme = rsync://ftp.gnu.org/old-gnu/
+url_scheme: rsync://ftp.gnu.org/old-gnu/
 
 # Origin type used for tarballs
-type = ftp
+type: ftp
 
 # File containing a subset list tarballs from mirror_root_directory to load.
 # The file's format is one absolute path name to a tarball per line.
 # NOTE:
 # - This file must contain data consistent with the mirror_root_directory
 # - if this option is not provided, the mirror_root_directory is scanned
 #   completely as usual
-# mirror_subset_archives = /home/tony/work/inria/repo/swh-environment/swh-loader-tar/old-gnu-missing
+# mirror_subset_archives: /home/tony/work/inria/repo/swh-environment/swh-loader-tar/old-gnu-missing
 
 # Retrieval date information (rsync, etc...)
-date = Fri, 28 Aug 2015 13:13:26 +0200
+date: Fri, 28 Aug 2015 13:13:26 +0200
 
 # Randomize blocks of messages and send for consumption
-block_messages = 100
+block_messages: 100
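The two producer renames in this change (`tar-gnu.ini` and `tar-old-gnu.ini` to `.yml`) drop the `[main]` section header and turn each `key = value` line into `key: value`. The following is a minimal sketch of that mechanical conversion, not code from this commit: the helper name `ini_main_to_yaml` is hypothetical, it only uses Python's standard `configparser`, and it assumes flat scalar values that are valid unquoted YAML (as in the files above). Comments are not preserved, since `configparser` discards them.

```python
import configparser


def ini_main_to_yaml(ini_text: str) -> str:
    """Render the [main] section of an INI file as flat `key: value` lines.

    Hypothetical helper for illustration; values are emitted verbatim, so
    anything that needs YAML quoting would have to be handled separately.
    """
    parser = configparser.ConfigParser(interpolation=None)
    # configparser lowercases keys by default; keep them as written.
    parser.optionxform = str
    parser.read_string(ini_text)
    lines = [f"{key}: {value}" for key, value in parser["main"].items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = (
        "[main]\n"
        "mirror_root_directory = /home/storage/space/mirrors/gnu.org/gnu/\n"
        "url_scheme = rsync://ftp.gnu.org/gnu/\n"
        "type = ftp\n"
        "date = Fri, 28 Aug 2015 13:13:26 +0200\n"
        "block_messages = 250\n"
    )
    print(ini_main_to_yaml(sample), end="")
```

Emitting the lines by hand rather than through a YAML library keeps the sketch dependency-free; a real migration would more likely load and dump the data with a YAML library so that quoting and types are handled properly.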