README.md
# SWH Tarball Loader | # SWH Tarball Loader | ||||
The Software Heritage Tarball Loader is a tool and a library to | The Software Heritage Tarball Loader is in charge of ingesting the | ||||
uncompress a local tarball and inject into the SWH dataset its tree | directory representation of the tarball into the Software Heritage | ||||
representation. | archive. | ||||
## Configuration | ## Configuration | ||||
This is the loader's (or task's) configuration file. | This is the loader's (or task's) configuration file. | ||||
*`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/loader/tar.yml*: | *`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/loader/tar.yml*: | ||||
```YAML | ```YAML | ||||
extraction_dir: /home/storage/tmp/ | working_dir: /home/storage/tmp/ | ||||
storage: | storage: | ||||
cls: remote | cls: remote | ||||
args: | args: | ||||
url: http://localhost:5002/ | url: http://localhost:5002/ | ||||
``` | ``` | ||||
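The configuration path above names three candidate directories. A minimal sketch of locating the file is shown here; the helper names (`CONFIG_DIRS`, `candidate_config_paths`, `find_config`) are hypothetical, and treating the listed order as lookup precedence is an assumption, not something the README states. The sketch only locates the file and does not parse the YAML.

```python
from pathlib import Path

# Candidate base directories, in the order the README lists them.
# Using that order as lookup precedence is an assumption.
CONFIG_DIRS = ("/etc/softwareheritage", "~/.config/swh", "~/.swh")


def candidate_config_paths(relative="loader/tar.yml", dirs=CONFIG_DIRS):
    """Return every path where the loader configuration could live."""
    return [Path(d).expanduser() / relative for d in dirs]


def find_config(relative="loader/tar.yml", dirs=CONFIG_DIRS):
    """Return the first candidate that exists as a file, or None."""
    for path in candidate_config_paths(relative, dirs):
        if path.is_file():
            return path
    return None
```

Calling `find_config()` with no arguments checks the three documented locations; passing `relative="worker.yml"` would check the same directories for a differently named file.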
## API | ## API | ||||
Load tarball directly from code or python3's toplevel: | ### local | ||||
Load local tarball directly from code or python3's toplevel: | |||||
``` Python | ``` Python | ||||
# Fill in those | # Fill in those | ||||
repo = 'loader-tar.tgz' | repo = 'convert-tryout.tgz' | ||||
tarpath = '/home/storage/tar/%s' % repo | tarpath = '/home/storage/tar/%s' % repo | ||||
origin = {'url': 'ftp://%s' % repo, 'type': 'tar'} | origin = {'url': 'file://%s' % repo, 'type': 'tar'} | ||||
visit_date = 'Tue, 3 May 2017 17:16:32 +0200' | visit_date = 'Tue, 3 May 2017 17:16:32 +0200' | ||||
revision = { | last_modified = 'Tue, 10 May 2016 16:16:32 +0200' | ||||
'author': {'name': 'some', 'fullname': 'one', 'email': 'something'}, | |||||
'committer': {'name': 'some', 'fullname': 'one', 'email': 'something'}, | |||||
'message': '1.0 Released', | |||||
'date': None, | |||||
'committer_date': None, | |||||
'type': 'tar', | |||||
} | |||||
import logging | import logging | ||||
logging.basicConfig(level=logging.DEBUG) | logging.basicConfig(level=logging.DEBUG) | ||||
from swh.loader.tar.tasks import LoadTarRepository | from swh.loader.tar.tasks import LoadTarRepository | ||||
l = LoadTarRepository() | l = LoadTarRepository() | ||||
l.run_task(tar_path=tarpath, origin=origin, visit_date=visit_date, | l.run_task(origin=origin, visit_date=visit_date, | ||||
revision=revision, branch_name='master') | last_modified=last_modified) | ||||
``` | |||||
## Celery | |||||
Load tarball using celery. | |||||
Provided you have a properly configured celery up and running, the | |||||
celery worker configuration file needs to be updated: | |||||
*`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/worker.yml*: | |||||
``` YAML | |||||
task_modules: | |||||
- swh.loader.tar.tasks | |||||
task_queues: | |||||
- swh_loader_tar | |||||
``` | |||||
cf. https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md | |||||
for more details | |||||
## Tar Producer | |||||
Its job is to compile a list of existing tarballs from a file or a | |||||
folder and, from this list, compute the corresponding messages to send | |||||
to the broker. | |||||
### Configuration | |||||
Message producer's configuration file (`tar.yml`): | |||||
``` YAML | |||||
# Mirror's root directory holding tarballs to load into swh | |||||
mirror_root_directory: /srv/storage/space/mirrors/gnu.org/gnu/ | |||||
# Url scheme prefix used to create the origin url | |||||
url_scheme: http://ftp.gnu.org/gnu/ | |||||
type: ftp | |||||
# File containing a subset list of tarballs from mirror_root_directory to load. | |||||
# The file's format is one absolute path name to a tarball per line. | |||||
# NOTE: | |||||
# - This file must contain data consistent with the mirror_root_directory | |||||
# - if this option is not provided, the mirror_root_directory is scanned | |||||
# completely as usual | |||||
# mirror_subset_archives: /home/storage/missing-archives | |||||
# Randomize blocks of messages and send for consumption | |||||
block_messages: 250 | |||||
``` | |||||
### Run | |||||
Trigger the message computations: | |||||
```Shell | |||||
python3 -m swh.loader.tar.producer --config ~/.swh/producer/tar.yml | |||||
``` | ``` | ||||
This will walk the `mirror_root_directory` folder and send encountered | ### remote | ||||
tarball messages for the swh-loader-tar to uncompress (through | |||||
celery). | |||||
If `mirror_subset_archives` is provided, the tarball messages will | Loading a remote tarball uses the same sample: | ||||
be computed from that file (the `mirror_root_directory` is still used, | |||||
so please be consistent). | |||||
If a problem arises during tarball message computation, a message will | ```Python | ||||
be output naming the tarball that presents the problem. | url = 'https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz' | ||||
origin = {'url': url, 'type': 'tar'} | |||||
It will display the number of tarball messages sent at the end. | visit_date = 'Tue, 3 May 2017 17:16:32 +0200' | ||||
last_modified = '2016-04-22 16:35' | |||||
### Dry run | import logging | ||||
logging.basicConfig(level=logging.DEBUG) | |||||
``` Shell | |||||
python3 -m swh.loader.tar.producer --config-file ~/.swh/producer/tar.yml --dry-run | |||||
``` | |||||
This will do the same as previously described but only display the | |||||
number of potential tarball messages computed. | |||||
### Help | |||||
``` Shell | from swh.loader.tar.tasks import LoadTarRepository | ||||
python3 -m swh.loader.tar.producer --help | l = LoadTarRepository() | ||||
l.run_task(origin=origin, visit_date=visit_date, | |||||
last_modified=last_modified) | |||||
``` | ``` |
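The local and remote samples on the new side differ only in how the `origin` dict is built. A minimal sketch of a single helper covering both cases follows; `make_origin` is a hypothetical name that is not part of `swh.loader.tar`, and the set of schemes treated as remote is an assumption.

```python
import os
from urllib.parse import urlparse

# Schemes treated as "remote" in this sketch (an assumption).
REMOTE_SCHEMES = ("http", "https", "ftp")


def make_origin(location):
    """Build an origin dict shaped like the ones in the two samples.

    `location` is either a URL (remote tarball) or a filesystem path
    (local tarball). For a path, the origin URL is 'file://' followed
    by the file name only, mirroring the local sample.
    """
    if urlparse(location).scheme in REMOTE_SCHEMES:
        url = location
    else:
        url = "file://%s" % os.path.basename(location)
    return {"url": url, "type": "tar"}
```

The returned dict can then be passed as the `origin` argument of `run_task`, exactly as both samples do.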