SWH-loader-tar
==============
The Software Heritage Tarball Loader is a tool and a library that uncompresses a
local tarball and injects into the SWH dataset all the files it contains that are
not yet known.
Tarball loader
==============
Its job is to uncompress a tarball and load its contents into swh storage.
### Configuration
This is the loader's (or task's) configuration file.
loader/tar.ini:

    [main]

    # the path where to extract the tarball before loading it into swh
    extraction_dir = /home/storage/tmp/

    # access to swh's storage
    storage_class = remote_storage
    storage_args = http://localhost:5000/

    # parameters to condition loading into swh storage
    send_contents = True
    send_directories = True
    send_revisions = True
    send_releases = True
    send_occurrences = True
    content_packet_size = 10000
    content_packet_size_bytes = 1073741824
    directory_packet_size = 25000
    revision_packet_size = 100000
    release_packet_size = 100000
    occurrence_packet_size = 100000
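
The `*_packet_size` keys read like batching thresholds. As a minimal sketch only, assuming `content_packet_size` caps the number of contents per batch and `content_packet_size_bytes` caps their cumulative size (this README does not spell out the semantics), batching could look like this. The function name `batch_contents` and the `length` key are hypothetical, not part of the loader's API.

```python
def batch_contents(contents, max_count, max_bytes):
    """Group contents into packets.

    Assumption: a packet is flushed as soon as it holds `max_count` items
    or its cumulative size reaches `max_bytes`, whichever comes first.
    Each content is a dict with a hypothetical "length" key (size in bytes).
    """
    packet, packet_bytes = [], 0
    for content in contents:
        packet.append(content)
        packet_bytes += content["length"]
        if len(packet) >= max_count or packet_bytes >= max_bytes:
            yield packet
            packet, packet_bytes = [], 0
    if packet:
        # Flush whatever is left once the input is exhausted.
        yield packet
```

With the values above, this would flush a packet of contents after 10000 items or 1 GiB, whichever limit is hit first.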
The configuration file can be placed in any of the following locations:
- ~/.config/swh/loader/tar.ini
- ~/.swh/loader/tar.ini
- /etc/softwareheritage/loader/tar.ini
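
To check which file would be picked up on a given machine, a small standalone sketch along these lines may help. It uses only the Python standard library and is not the loader's own lookup code; in particular, the precedence (first existing file in the order listed) is an assumption, and `find_config` and `read_main_section` are hypothetical helper names.

```python
import configparser
import os

# Candidate locations, in the order listed in this README.
# Assumption: the first existing file wins.
CANDIDATE_PATHS = [
    "~/.config/swh/loader/tar.ini",
    "~/.swh/loader/tar.ini",
    "/etc/softwareheritage/loader/tar.ini",
]


def find_config(paths=CANDIDATE_PATHS):
    """Return the first candidate path that exists as a file, or None."""
    for path in paths:
        expanded = os.path.expanduser(path)
        if os.path.isfile(expanded):
            return expanded
    return None


def read_main_section(path):
    """Parse an ini file and return its [main] section."""
    parser = configparser.ConfigParser()
    with open(path) as f:
        parser.read_file(f)
    return parser["main"]


if __name__ == "__main__":
    config_path = find_config()
    if config_path is None:
        print("no tar.ini found in the candidate locations")
    else:
        main = read_main_section(config_path)
        print(config_path, main.get("extraction_dir"))
```

The returned section offers typed accessors such as `main.getboolean("send_contents")` and `main.getint("content_packet_size")`.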
### API
Load a tarball directly from code or from a Python toplevel:

    from swh.loader.tar.tasks import LoadTarRepository

    # Fill these in
    origin = {}
    release = None
    revision = {}
    occurrence = {}

    LoadTarRepository().run('/some/path/to/blah-7.8.3.tgz',
                            origin,
                            revision,
                            release,
                            [occurrence])
### Celery
Load a tarball using celery.
Provided you have a properly configured celery instance up and running,
update worker.ini with the following keys:

    task_modules = swh.loader.tar.tasks
    task_queues = swh_loader_tar

See https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md
for more details.
#### Toplevel
You can send the following message to the task queue:

    from swh.loader.tar.tasks import LoadTarRepository

    # Fill these in
    origin = {}
    release = None
    revision = {}
    occurrence = {}

    # Send message to the task queue
    LoadTarRepository().apply_async(('/some/path/to/blah-7.8.3.tgz',
                                     origin,
                                     revision,
                                     release,
                                     [occurrence]))
Tar Producer
============
Its job is to compile, from a file or a folder, a list of existing
tarballs. From this list, it computes the celery messages to send to the
loader tar worker for consumption.
#### Configuration
Message producer's configuration file:

    [main]

    # mirror's root directory holding tarballs to load into swh
    mirror_root_directory = /home/storage/space/mirrors/gnu.org/gnu/

    # url scheme prefix used to create the origin url
    url_scheme = http://ftp.gnu.org/gnu/

    # origin type used for those tarballs
    type = ftp

    # File containing a subset list of tarballs from mirror_root_directory to load.
    # The file's format is one absolute path name to a tarball per line.
    # NOTE:
    # - This file must contain data consistent with the mirror_root_directory
    # - If this option is not provided, the mirror_root_directory is scanned
    #   completely as usual
    # mirror_subset_archives = /home/storage/missing-archives

    # For tryout purposes (no limit if commented out or omitted)
    # limit = 1
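
To illustrate how `mirror_root_directory`, `url_scheme` and `type` might combine into an origin, here is a minimal sketch. The mapping is an assumption, not taken from the producer's source: it strips the mirror root from the tarball path and appends the remaining directory to `url_scheme`. The function name `origin_for` and the shape of the returned dict are hypothetical.

```python
import os


def origin_for(tarpath, mirror_root_directory, url_scheme, origin_type):
    """Build an origin dict for a tarball found under the mirror root.

    Assumption: the origin url is url_scheme followed by the tarball's
    directory, relative to mirror_root_directory.
    """
    rel = os.path.relpath(tarpath, mirror_root_directory)
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError("%s is not under %s" % (tarpath, mirror_root_directory))
    project_dir = os.path.dirname(rel).replace(os.sep, "/")
    return {"url": url_scheme + project_dir, "type": origin_type}


if __name__ == "__main__":
    print(origin_for(
        "/home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.4.5.tar.gz",
        "/home/storage/space/mirrors/gnu.org/gnu/",
        "http://ftp.gnu.org/gnu/",
        "ftp",
    ))
```

The `ValueError` branch reflects the note in the configuration comments: paths listed in `mirror_subset_archives` have to be consistent with `mirror_root_directory`.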
#### Run
Trigger the message computations:

    swh-loader-tar-producer --config ~/.swh/producer/tar.ini

This will walk the `mirror_root_directory` folder and send one message
per tarball encountered, for swh-loader-tar to uncompress (through
celery).
If `mirror_subset_archives` is provided, the tarball messages are
computed from that file instead (`mirror_root_directory` is still used,
so the two must be consistent).
If a problem arises during tarball message computation, a message is
output naming the tarball that caused the problem.
At the end, the number of tarball messages sent is displayed.
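
The selection logic described above can be sketched as follows. This is an illustration, not the producer's implementation: the list of tarball suffixes is an assumption, the function names are hypothetical, and no celery message is actually sent.

```python
import os

# Assumption: which file name suffixes count as tarballs.
TARBALL_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz")


def iter_candidates(root, subset_file=None):
    """Yield candidate paths from the subset file if given, else by walking root."""
    if subset_file:
        with open(subset_file) as f:
            for line in f:
                path = line.strip()
                if path:
                    yield path
    else:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                yield os.path.join(dirpath, name)


def select_tarballs(root, subset_file=None, limit=None):
    """Return (selected, problems): usable tarballs and the paths that failed."""
    root = os.path.abspath(root)
    selected, problems = [], []
    for path in iter_candidates(root, subset_file):
        if limit is not None and len(selected) >= limit:
            break
        if not path.endswith(TARBALL_SUFFIXES):
            continue
        # A path is usable only if it lies under root and exists as a file.
        inside = os.path.commonpath([root, os.path.abspath(path)]) == root
        if inside and os.path.isfile(path):
            selected.append(path)
        else:
            problems.append(path)
    return selected, problems
```

A caller would report each entry of `problems` and print `len(selected)` at the end, mirroring the behaviour described above.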
Dry run:

    swh-loader-tar-producer --config ~/.swh/producer/tar.ini --dry-run

This does the same as described above, but only displays the number of
tarball messages that would be sent.

Help:

    swh-loader-tar-producer -h
diff-db-mirror
==============
Utility to compute the difference between the `occurrence_history` table
(column `branch`) and the actual mirror paths on disk.
It outputs the paths of the tarballs not injected into the db (for any reason).
This output is meant to be consumed by swh-loader-tar-producer in replay mode.
Sample use:

    ./bin/diff-db-mirror.py \
      --db-url 'host=<host> dbname=<db> user=<user> password=<pass>' \
      --mirror-root-directory /path/to/mirrors/gnu.org/old-gnu

Here is a sample output:

    ...
    /home/storage/space/mirrors/gnu.org/gnu/miscfiles/miscfiles-1.4.2.tar.gz
    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.4.5.tar.gz
    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.10.tar.gz
    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.4.8.tar.gz
    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.5.tar.gz
    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.7.tar.gz
    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.14.tar.gz
    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.2.59.tar.gz
    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.9.tar.gz
    /home/storage/space/mirrors/gnu.org/gnu/zile/zile-2.3.11.tar.gz
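
Conceptually, the comparison is a set difference between the paths on disk and the paths already recorded. A minimal sketch, assuming the `branch` values have already been fetched from the database as plain path strings (the query itself is left out, and `missing_tarballs` is a hypothetical name, not the script's actual function):

```python
import os


def missing_tarballs(mirror_root_directory, known_branches):
    """Return the sorted paths found on disk that are absent from known_branches.

    Assumption: known_branches holds absolute paths comparable to the
    ones produced by walking mirror_root_directory.
    """
    known = set(known_branches)
    missing = []
    for dirpath, _dirnames, filenames in os.walk(mirror_root_directory):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if path not in known:
                missing.append(path)
    return sorted(missing)


if __name__ == "__main__":
    # Print one path per line, the format expected by mirror_subset_archives.
    for path in missing_tarballs("/path/to/mirrors/gnu.org/old-gnu", []):
        print(path)
```

This sketch lists every file under the mirror root; filtering on tarball extensions is left out for brevity.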