diff --git a/README b/README
index 7f740f9..c89aa32 100644
--- a/README
+++ b/README
@@ -1,95 +1,98 @@
 SWH-loader-dir
 ==============
 
 The Software Heritage Directory Loader is a tool and a library to walk a local
 directory and inject into the SWH dataset all unknown contained files.
 
 Directory loader
 ================
 
 ### Configuration
 
 This is the loader's (or task's) configuration file.
 
-loader/dir.ini:
-
-    [main]
-
-    # access to swh's storage
-    storage_class = remote_storage
-    storage_args = http://localhost:5002/
-
-    # parameters to condition loading into swh storage
-    send_contents = True
-    send_directories = True
-    send_revisions = True
-    send_releases = True
-    send_occurrences = True
-    content_packet_size = 10000
-    content_packet_size_bytes = 1073741824
-    directory_packet_size = 25000
-    revision_packet_size = 100000
-    release_packet_size = 100000
-    occurrence_packet_size = 100000
+loader/dir.yml:
+
+    storage:
+      cls: remote
+      args:
+        url: http://localhost:5002/
+
+    send_contents: True
+    send_directories: True
+    send_revisions: True
+    send_releases: True
+    send_occurrences: True
+    # nb of max contents to send for storage
+    content_packet_size: 100
+    # 100 Mib of content data
+    content_packet_block_size_bytes: 104857600
+    # limit for swh content storage for one blob (beyond that limit, the
+    # content's data is not sent for storage)
+    content_packet_size_bytes: 1073741824
+    directory_packet_size: 250
+    revision_packet_size: 100
+    release_packet_size: 100
+    occurrence_packet_size: 100
 
 Present in possible locations:
 
 - ~/.config/swh/loader/dir.ini
 - ~/.swh/loader/dir.ini
 - /etc/softwareheritage/loader/dir.ini
 
 #### Toplevel
 
 Load directory directly from code or toplevel:
 
     from swh.loader.dir.tasks import LoadDirRepository
 
     dir_path = '/path/to/directory
 
     # Fill in those
     origin = {}
     release = None
     revision = {}
     occurrence = {}
 
-    LoadDirRepository().run(dir_path, origin, revision, release, [occurrence])
+    LoaderDir().load(dir_path, origin, revision, release, [occurrence])
 
 #### Celery
 
 Load directory using celery.
 
 Providing you have a properly configured celery up and running
 worker.ini needs to be updated with the following keys:
 
     task_modules = swh.loader.dir.tasks
     task_queues = swh_loader_dir
 
 cf. https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md for more details
 
 You can send the following message to the task queue:
 
     from swh.loader.dir.tasks import LoadDirRepository
 
     # Fill in those
     origin = {}
     release = None
     revision = {}
     occurrence = {}
 
     # Send message to the task queue
-    LoadDirRepository().apply_async(('/path/to/dir,
-                                     origin,
-                                     revision,
-                                     release,
-                                     [occurrence]))
+    LoaderDir().load(('/path/to/dir,
+                      origin,
+                      revision,
+                      release,
+                      [occurrence]))
 
 Directory producer
 ==================
 
 None