The S3 bucket containing objects is very outdated. We need to fill the gap and then keep it up to date. This will be done with the content-replayer, which reads new objects from Kafka and copies them to S3 from the object storages on Banco and Uffizi.
To speed up the replay, it will use a 60GB file containing the hashes of the objects that are already on S3. (It is a sorted list of hashes, so the replayer only needs to do short random reads.)
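For intuition, checking whether a hash is already on S3 amounts to a binary search over fixed-width records in that sorted file, which is why the reads are short and random. Below is a minimal sketch (not the replayer's actual code), assuming the file is a concatenation of raw 20-byte sha1 digests in ascending order:

# Membership test against the sorted hash inventory (sketch).
# Assumption: the file is a sorted concatenation of raw 20-byte sha1 digests.
import os

RECORD_SIZE = 20  # length of a raw sha1 digest (assumed record width)

def sha1_in_inventory(path: str, digest: bytes) -> bool:
    """Binary search: O(log n) short random reads instead of loading 60GB."""
    assert len(digest) == RECORD_SIZE
    with open(path, "rb") as f:
        lo = 0
        hi = os.fstat(f.fileno()).st_size // RECORD_SIZE
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_SIZE)
            candidate = f.read(RECORD_SIZE)
            if candidate == digest:
                return True
            elif candidate < digest:
                lo = mid + 1
            else:
                hi = mid
    return False

# e.g. sha1_in_inventory(
#     "/srv/softwareheritage/cassandra-test-0/scratch/sorted_inventory.bin",
#     bytes.fromhex("da39a3ee5e6b4b0d3255bfef95601890afd80709"))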
The content replayer is single-threaded and has high latency, so we should run many instances of it, at least to fill the initial gap (I tried up to 100 on my desktop; the speedup was linear).
Example systemd unit to run it:
[Unit]
Description=Content replayer Rocq to S3 (service %i)
After=network.target

[Service]
Type=simple
ExecStart=/bin/bash -c 'sleep $(( RANDOM % 60 )); /home/dev/.local/bin/swh --log-level=INFO journal --config-file ~/replay_content_rocq_to_s3.yml content-replay --exclude-sha1-file /srv/softwareheritage/cassandra-test-0/scratch/sorted_inventory.bin'
Restart=on-failure
SyslogIdentifier=content-replayer-%i
Nice=10

[Install]
WantedBy=default.target
(The random sleep at the beginning is to work around a crash that happens when too many Kafka clients start at the same time.)
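To run many replayers as suggested above, the template unit can simply be instantiated N times. A minimal sketch, assuming the unit is installed as a user unit named content-replayer@.service (the actual unit file name is not shown above):

# Start N instances of the templated content-replayer unit.
# Assumption: the unit above is installed as content-replayer@.service
# in the user's systemd directory (WantedBy=default.target suggests a user unit).
import subprocess

N_INSTANCES = 100  # the speedup was linear up to this count on one desktop

for i in range(1, N_INSTANCES + 1):
    subprocess.run(
        ["systemctl", "--user", "start", f"content-replayer@{i}.service"],
        check=True,
    )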
Example config file:
objstorage_src:
  cls: multiplexer
  args:
    objstorages:
      - cls: filtered
        args:
          storage_conf:
            cls: remote
            args:
              url: http://banco.internal.softwareheritage.org:5003/
          filters_conf:
            - type: readonly
      - cls: filtered
        args:
          storage_conf:
            cls: remote
            args:
              url: http://uffizi.internal.softwareheritage.org:5003/
          filters_conf:
            - type: readonly
objstorage_dst:
  cls: s3
  args:
    container_name: NAME_OF_THE_S3_BUCKET
    key: KEY_OF_THE_S3_USER
    secret: SECRET_OF_THE_S3_USER
journal:
  brokers:
    - esnode1.internal.softwareheritage.org
    - esnode2.internal.softwareheritage.org
    - esnode3.internal.softwareheritage.org
  group_id: vlorentz-test-replay-rocq-to-s3
  max_poll_records: 100
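Before fanning out to dozens of instances, it can help to sanity-check that the config parses and points at the expected endpoints. A small sketch using PyYAML, assuming the file is saved at the path used in the unit above:

# Quick sanity check of the replayer configuration.
import os
import yaml

with open(os.path.expanduser("~/replay_content_rocq_to_s3.yml")) as f:
    config = yaml.safe_load(f)

# The replayer needs a source objstorage, a destination objstorage,
# and journal (Kafka) settings.
for key in ("objstorage_src", "objstorage_dst", "journal"):
    assert key in config, f"missing top-level key: {key}"

print("source objstorage:", config["objstorage_src"]["cls"])
print("destination objstorage:", config["objstorage_dst"]["cls"])
print("brokers:", ", ".join(config["journal"]["brokers"]))
print("consumer group:", config["journal"]["group_id"])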