
Up-to-date objstorage mirror on S3
Work in Progress, High priority, Public

Description

The S3 bucket containing objects is very out of date. We need to fill the gap and then keep it up to date. This will be done with the content replayer, which reads new object hashes from Kafka and copies the corresponding objects from the object storages on Banco and Uffizi.

To speed up the replay, it will use a 60GB file containing the hashes of objects that are already on S3. (It is a sorted list of fixed-size hashes, so the replayer can check membership with a few random short reads.)
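The lookup described above can be sketched as a binary search directly over the sorted hash file, so the 60GB inventory never has to be loaded into memory. This is a hypothetical re-implementation, not the actual swh code; the file layout (raw 20-byte SHA1 digests, concatenated in sorted order) is an assumption based on the description:

```python
import hashlib
import os
import tempfile

HASH_LEN = 20  # size in bytes of a raw SHA1 digest (assumption)

class SortedHashFile:
    """Membership test over a file of sorted, fixed-size raw hashes.
    Each lookup is a binary search doing O(log n) short random reads."""

    def __init__(self, path):
        self.f = open(path, "rb")
        self.n = os.path.getsize(path) // HASH_LEN

    def _hash_at(self, index):
        self.f.seek(index * HASH_LEN)
        return self.f.read(HASH_LEN)

    def __contains__(self, digest):
        lo, hi = 0, self.n
        while lo < hi:
            mid = (lo + hi) // 2
            h = self._hash_at(mid)
            if h < digest:
                lo = mid + 1
            elif h > digest:
                hi = mid
            else:
                return True
        return False

# Build a tiny inventory file and query it:
digests = sorted(hashlib.sha1(s).digest() for s in [b"a", b"b", b"c"])
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"".join(digests))
inventory = SortedHashFile(tmp.name)
print(hashlib.sha1(b"a").digest() in inventory)  # True
print(hashlib.sha1(b"z").digest() in inventory)  # False
```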

The content replayer is single-threaded and latency-bound, so we should run many instances of it, at least to fill the initial gap (I tried up to 100 on my desktop; the speedup is linear).

Example systemd unit to run it:

[Unit]
Description=Content replayer Rocq to S3 (service %i)
After=network.target

[Service]
Type=simple
ExecStart=/bin/bash -c 'sleep $(( RANDOM % 60 )); /home/dev/.local/bin/swh --log-level=INFO journal --config-file ~/replay_content_rocq_to_s3.yml content-replay --exclude-sha1-file /srv/softwareheritage/cassandra-test-0/scratch/sorted_inventory.bin'
Restart=on-failure
SyslogIdentifier=content-replayer-%i

Nice=10

[Install]
WantedBy=default.target

(The random sleep at the beginning works around a crash that happens when too many Kafka clients start at the same time.)

Example config file:

objstorage_src:
  cls: multiplexer
  args:
    objstorages:
    - cls: filtered
      args:
        storage_conf:
          cls: remote
          args:
            url: http://banco.internal.softwareheritage.org:5003/
        filters_conf:
        - type: readonly
    - cls: filtered
      args:
        storage_conf:
          cls: remote
          args:
            url: http://uffizi.internal.softwareheritage.org:5003/
        filters_conf:
        - type: readonly

objstorage_dst:
  cls: s3
  args:
    container_name: NAME_OF_THE_S3_BUCKET
    key: KEY_OF_THE_S3_USER
    secret: SECRET_OF_THE_S3_USER

journal:
  brokers:
  - esnode1.internal.softwareheritage.org
  - esnode2.internal.softwareheritage.org
  - esnode3.internal.softwareheritage.org
  group_id: vlorentz-test-replay-rocq-to-s3
  max_poll_records: 100
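The `objstorage_src` above is a multiplexer over two read-only remote backends, so conceptually a read falls through the backends in order until one has the object. A minimal sketch of that read path (hypothetical re-implementation; the real classes live in swh.objstorage, and `DictBackend` here is just a stand-in for a remote objstorage):

```python
class ReadOnlyMultiplexer:
    """Try each backend in order; return the first copy found.
    Backends are any objects exposing get(obj_id)."""

    def __init__(self, backends):
        self.backends = backends

    def get(self, obj_id):
        for backend in self.backends:
            try:
                return backend.get(obj_id)
            except KeyError:
                continue  # not in this backend, fall through to the next
        raise KeyError(obj_id)

class DictBackend:
    """Stand-in for a remote objstorage, backed by a dict."""

    def __init__(self, objects):
        self.objects = objects

    def get(self, obj_id):
        return self.objects[obj_id]

banco = DictBackend({"id1": b"content-1"})
uffizi = DictBackend({"id2": b"content-2"})
mux = ReadOnlyMultiplexer([banco, uffizi])
print(mux.get("id2"))  # b'content-2': found in the second backend
```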

Event Timeline

vlorentz triaged this task as High priority. Aug 19 2019, 11:44 AM
vlorentz created this task.
vlorentz updated the task description. Aug 19 2019, 11:47 AM
vlorentz updated the task description. Tue, Oct 29, 11:34 AM
olasd changed the task status from Open to Work in Progress. Fri, Nov 8, 11:41 AM

So I've deployed this (by hand for now) on uffizi and it seems to be doing its job.

Deployment steps:

  • create an IAM policy in the AWS management console with only read/write access to the softwareheritage/contents bucket
  • create an IAM account with that policy enabled
  • get access credentials for that IAM account
  • notice that the objstorage doesn't implement having contents in a subdirectory; fix that and release the objstorage
  • retrieve vlorentz's exclude file
  • tweak the config and the unit file (separate user, proper objstorage config with compression, ...)
  • systemctl start content-replayer-s3@{01..20}
  • notice that we process 2 objects per second per client
  • systemctl start content-replayer-s3@{21..40}
  • notice that we still process 2 objects per second per client and that the loadavg is < 15
  • systemctl start content-replayer-s3@{41..60}
  • notice that we still process 2 objects per second per client and that the loadavg is < 15
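Given the per-client rate observed in the steps above, a quick back-of-the-envelope estimate (assuming the scaling stays linear, which the steps suggest but don't guarantee):

```python
# Numbers observed above; linear scaling is an assumption.
clients = 60
rate_per_client = 2                      # objects per second per client
total_rate = clients * rate_per_client   # objects/s across all clients
per_day = total_rate * 86400             # seconds in a day
print(total_rate, per_day)               # 120 10368000
```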

I'll probably start more workers if things stay stable.

I've also patched swh.journal by hand to process batches of 1000 objects instead of 20 to reduce log spam. I'm not 100% sure how to handle that properly.
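The batching change amounts to grouping the consumed messages into larger chunks before processing and logging them. A generic sketch of that grouping (hypothetical; this is not the actual swh.journal patch):

```python
import itertools

def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

# With size=1000 instead of 20, the replayer emits one log line per
# 1000 objects rather than one per 20, cutting log volume by 50x.
print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```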

olasd added a comment. Fri, Nov 8, 7:12 PM

I've added a metric with the S3 objects to https://grafana.softwareheritage.org/d/jScG7g6mk/objstorage-object-counts. There's... "some" work to do still.