SWH-loader-tar
==============
The Software Heritage Tarball Loader is a tool and a library that uncompresses a
local tarball and injects into the SWH dataset any files it contains that are
not yet known.
Tarball loader
==============
Its job is to uncompress a tarball and load its contents into swh storage.
### Configuration
This is the loader's (or task's) configuration file.

loader/tar.ini:

    [main]
    # the path where to extract the tarball before loading it into swh
    extraction_dir = /home/storage/tmp/
    # access to swh's storage
    storage_class = remote_storage
    storage_args = http://localhost:5000/
    # parameters to condition loading into swh storage
    send_contents = True
    send_directories = True
    send_revisions = True
    send_releases = True
    send_occurrences = True
    content_packet_size = 10000
    content_packet_size_bytes = 1073741824
    directory_packet_size = 25000
    revision_packet_size = 100000
    release_packet_size = 100000
    occurrence_packet_size = 100000

The file can be placed in any of the following locations:
- ~/.config/swh/loader/tar.ini
- ~/.swh/loader/tar.ini
- /etc/softwareheritage/loader/tar.ini
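
As an illustration of how such a file could be located and parsed, here is a minimal sketch using only Python's standard `configparser`. It is not the loader's own configuration code. The helper names `find_config` and `read_config` are hypothetical, and the "first existing file wins" lookup order is an assumption, since the README does not say how the locations are prioritised.

```python
import configparser
import os

# Candidate locations, in the order they are listed above.
# Assumption: the first existing file wins; the real loader may behave differently.
CONFIG_PATHS = [
    "~/.config/swh/loader/tar.ini",
    "~/.swh/loader/tar.ini",
    "/etc/softwareheritage/loader/tar.ini",
]


def find_config(paths=CONFIG_PATHS):
    """Return the first existing configuration path, or None (hypothetical helper)."""
    for path in paths:
        expanded = os.path.expanduser(path)
        if os.path.isfile(expanded):
            return expanded
    return None


def read_config(path):
    """Parse the [main] section into a dict of typed values (hypothetical helper)."""
    parser = configparser.ConfigParser()
    with open(path) as f:
        parser.read_file(f)
    main = parser["main"]
    conf = {}
    for key in main:
        if key.startswith("send_"):
            # send_* flags are written as True/False in the sample file
            conf[key] = main.getboolean(key)
        elif "packet_size" in key:
            # *_packet_size and *_packet_size_bytes are integers
            conf[key] = main.getint(key)
        else:
            # extraction_dir, storage_class, storage_args stay as strings
            conf[key] = main.get(key)
    return conf
```

A caller would typically do `path = find_config()` and, if a path is returned, pass it to `read_config(path)`.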
### API
Load a tarball directly from code or from the Python toplevel:

    from swh.loader.tar.tasks import LoadTarRepository

    # Fill these in
    origin = {}
    release = None
    revision = {}
    occurrence = {}

    LoadTarRepository().run('/some/path/to/blah-7.8.3.tgz',
                            origin,
                            revision,
                            release,
                            [occurrence])

### Celery
Load a tarball using Celery.

Provided you have a properly configured Celery worker up and running,
worker.ini needs to be updated with the following keys:

    task_modules = swh.loader.tar.tasks
    task_queues = swh_loader_tar

See https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md
for more details.
#### Toplevel
You can send the following message to the task queue:

    from swh.loader.tar.tasks import LoadTarRepository

    # Fill these in
    origin = {}
    release = None
    revision = {}
    occurrence = {}

    # Send message to the task queue
    LoadTarRepository().apply_async(('/some/path/to/blah-7.8.3.tgz',
                                     origin,
                                     revision,
                                     release,
                                     [occurrence]))

Tar Producer
============
Its job is to compile, from a file or a folder, a list of existing
tarballs. From this list, it computes the Celery messages to send for the
loader tar worker to consume.
#### Configuration
Message producer's configuration file:

    [main]
    # mirror's root directory holding tarballs to load into swh
    mirror_root_directory = /home/storage/space/mirrors/gnu.org/gnu/
    # url scheme prefix used to create the origin url
    url_scheme = http://ftp.gnu.org/gnu/
    # origin type used for those tarballs
    type = ftp
    # File listing a subset of the tarballs from mirror_root_directory to load.
    # The file's format is one absolute path name to a tarball per line.
    # NOTE:
    # - This file must contain data consistent with the mirror_root_directory
    # - if this option is not provided, the mirror_root_directory is scanned
    #   completely as usual
    # mirror_subset_archives = /home/storage/missing-archives
    # For tryout purposes (no limit if commented out or omitted)
    # limit = 1

#### Run
Trigger the message computations:

    swh-loader-tar-producer --config ~/.swh/producer/tar.ini

This will walk the `mirror_root_directory` folder and send a message for
each tarball encountered, for swh-loader-tar to uncompress (through
Celery).

If `mirror_subset_archives` is provided, the tarball messages are
computed from that file instead (`mirror_root_directory` is still used, so
the two must be consistent).

If a problem arises while computing a tarball message, a message naming
the problematic tarball is printed.

The number of tarball messages sent is displayed at the end.
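
The walk described above can be sketched roughly as follows. This is an illustrative approximation, not the producer's actual code. The function names, the list of tarball suffixes, the shape of each message (a `(path, origin)` pair) and the rule that the origin URL is `url_scheme` plus the path relative to `mirror_root_directory` are all assumptions made for the example.

```python
import os

# Assumption: illustrative suffixes only; the real producer may accept others.
TARBALL_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz")


def list_tarballs(mirror_root_directory, mirror_subset_archives=None):
    """Yield tarball paths from the subset file if given,
    otherwise by walking mirror_root_directory."""
    if mirror_subset_archives:
        with open(mirror_subset_archives) as f:
            for line in f:
                path = line.strip()
                if path:
                    yield path
        return
    for dirpath, _dirnames, filenames in os.walk(mirror_root_directory):
        for name in sorted(filenames):
            if name.endswith(TARBALL_SUFFIXES):
                yield os.path.join(dirpath, name)


def origin_url(tarball_path, mirror_root_directory, url_scheme):
    """Map a path under the mirror root to a URL under url_scheme (assumed rule)."""
    relative = os.path.relpath(tarball_path, mirror_root_directory)
    if relative == os.pardir or relative.startswith(os.pardir + os.sep):
        raise ValueError("%s is not under %s" % (tarball_path, mirror_root_directory))
    return url_scheme.rstrip("/") + "/" + relative.replace(os.sep, "/")


def compute_messages(conf, limit=None):
    """Return (messages, errors). Each message is a hypothetical
    (tarball_path, origin) pair, not the loader's real message format."""
    messages, errors = [], []
    root = conf["mirror_root_directory"]
    for path in list_tarballs(root, conf.get("mirror_subset_archives")):
        if limit is not None and len(messages) >= limit:
            break
        try:
            url = origin_url(path, root, conf["url_scheme"])
        except ValueError as e:
            # Report the problematic tarball and keep going
            errors.append((path, str(e)))
            continue
        messages.append((path, {"url": url, "type": conf["type"]}))
    return messages, errors
```

In this sketch, a dry run amounts to calling `compute_messages` and printing `len(messages)` without sending anything. It also shows why the subset file must be consistent with `mirror_root_directory`: a path outside the root cannot be mapped to an origin URL and ends up in `errors`.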
Dry run:

    swh-loader-tar-producer --config ~/.swh/producer/tar.ini --dry-run

This will do the same as previously described but only display the
number of potential tarball messages computed.

Help:

    swh-loader-tar-producer -h
