
Up-to-date objstorage mirror on S3
Open, High, Public

Description

The S3 bucket containing objects is very outdated. We need to fill the gap and then keep it up to date. This will be done with the content replayer, which reads new objects from Kafka and then copies them from the object storages on Banco and Uffizi.

To speed up the replay, it will use a 60GB file containing the hashes of files that are already on S3. (It's a sorted list of hashes, so the replayer will do short random reads.)
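
To illustrate why a sorted list only needs short random reads, here is a minimal sketch of a binary search over such a file. It assumes the file is a flat sequence of raw 20-byte SHA1 digests with no header or separators; that layout and the name `is_excluded` are assumptions for illustration, not necessarily what the replayer does.

```python
import os
from typing import BinaryIO

SHA1_SIZE = 20  # assumption: fixed-width raw SHA1 digests, sorted ascending


def is_excluded(f: BinaryIO, sha1: bytes) -> bool:
    """Return True if sha1 is present in the sorted digest file f."""
    # Number of records, derived from the file size.
    lo, hi = 0, f.seek(0, os.SEEK_END) // SHA1_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        # One short read at a computed offset per iteration.
        f.seek(mid * SHA1_SIZE)
        record = f.read(SHA1_SIZE)
        if record == sha1:
            return True
        if record < sha1:
            lo = mid + 1
        else:
            hi = mid
    return False
```

Under that assumption, a 60GB file holds about 3 billion digests, so each lookup costs roughly 32 reads of 20 bytes each.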

The content replayer is single-threaded and has high latency, so we should run many instances of it, at least to fill the initial gap. (I tried up to 100 on my desktop; the speedup is linear.)
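
One possible way to start many instances at once is sketched below. It assumes the replayer is installed as a systemd user template unit; the file name `content-replayer@.service` is hypothetical, as are the helper names.

```python
import subprocess

# Hypothetical template unit name; adjust to the actual unit file name.
UNIT_TEMPLATE = "content-replayer@{}.service"


def instance_names(count: int) -> list[str]:
    """Unit names for instances 1..count."""
    return [UNIT_TEMPLATE.format(i) for i in range(1, count + 1)]


def start_instances(count: int, dry_run: bool = True) -> list[str]:
    """Build (and, unless dry_run, execute) the systemctl command."""
    cmd = ["systemctl", "--user", "start", *instance_names(count)]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `start_instances(100, dry_run=False)` would then start instances 1 through 100 in a single `systemctl` invocation.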

Example systemd unit to run it:

[Unit]
Description=Content replayer Rocq to S3 (service %i)
After=network.target

[Service]
Type=simple
ExecStart=/bin/bash -c 'sleep $(( RANDOM % 60 )); /home/dev/.local/bin/swh --log-level=INFO journal --config-file ~/replay_content_rocq_to_s3.yml content-replay --exclude-sha1-file /srv/softwareheritage/cassandra-test-0/scratch/sorted_inventory.bin'
Restart=on-failure
SyslogIdentifier=content-replayer-%i

Nice=10

[Install]
WantedBy=default.target

(The random sleep at the beginning works around a crash that happens if too many Kafka clients start at the same time.)

Example config file:

objstorage_src:
  cls: multiplexer
  args:
    objstorages:
    - cls: filtered
      args:
        storage_conf:
          cls: remote
          args:
            url: http://banco.internal.softwareheritage.org:5003/
        filters_conf:
        - type: readonly
    - cls: filtered
      args:
        storage_conf:
          cls: remote
          args:
            url: http://uffizi.internal.softwareheritage.org:5003/
        filters_conf:
        - type: readonly

objstorage_dst:
  cls: s3
  args:
    container_name: NAME_OF_THE_S3_BUCKET
    key: KEY_OF_THE_S3_USER
    secret: SECRET_OF_THE_S3_USER

journal:
  brokers:
  - esnode1.internal.softwareheritage.org
  group_id: vlorentz-test-replay-rocq-to-s3
  max_poll_records: 100
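
As a rough illustration of what this configuration is meant to achieve, the sketch below reads an object from the first source that has it and writes it to the destination. Plain dicts stand in for the object storages, and `NotFound`, `read_from_first`, and `copy_object` are hypothetical names; this is not the swh objstorage API, and the real multiplexer may behave differently.

```python
class NotFound(Exception):
    """Raised when no source holds the requested object (hypothetical)."""


def read_from_first(sources, obj_id):
    """Return the object from the first source that contains it."""
    for source in sources:
        try:
            return source[obj_id]
        except KeyError:
            continue  # not in this source, try the next one
    raise NotFound(obj_id)


def copy_object(sources, destination, obj_id):
    """Copy one object from the sources to the destination."""
    destination[obj_id] = read_from_first(sources, obj_id)


# Usage with in-memory stand-ins for Banco, Uffizi, and the S3 bucket:
banco = {"a": b"content-a"}
uffizi = {"b": b"content-b"}
s3 = {}
copy_object([banco, uffizi], s3, "b")
```

In this sketch the sources are only ever read, which mirrors the intent of the `readonly` filters in the configuration.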

Event Timeline

vlorentz triaged this task as High priority. (Aug 19 2019, 11:44 AM)
vlorentz created this task.
vlorentz updated the task description.
vlorentz updated the task description. (Aug 19 2019, 11:47 AM)