
Up-to-date objstorage mirror on S3
Closed, Migrated

Description

The S3 bucket containing objects is very outdated. We need to fill the gap and keep it up to date. This will be done with the content-replayer, which reads the ids of new objects from Kafka and then copies the corresponding objects from the object storages on Banco and Uffizi.

To speed up the replay, it will use a 60GB file containing hashes of files that are already on S3. (It's a sorted list of hashes, so the replayer will do random short reads).
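
For illustration, a membership test against such a file can be a plain binary search over fixed-width records; this is a minimal sketch assuming raw 20-byte SHA1 digests sorted lexicographically (the actual exclude-file format and lookup code may differ):

# Conceptual sketch: membership test against a sorted file of raw 20-byte
# SHA1 digests. The record size and layout are assumptions for illustration;
# the replayer's real exclude-file handling may differ.
import os

RECORD_SIZE = 20  # one raw SHA1 digest per record

def sha1_in_sorted_file(f, digest: bytes) -> bool:
    """Binary-search `digest` in an open binary file of sorted records."""
    assert len(digest) == RECORD_SIZE
    nb_records = os.fstat(f.fileno()).st_size // RECORD_SIZE
    lo, hi = 0, nb_records
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD_SIZE)    # random short read, as described above
        record = f.read(RECORD_SIZE)
        if record < digest:
            lo = mid + 1
        elif record > digest:
            hi = mid
        else:
            return True
    return False

# Usage: skip the copy when the hash is already on S3, e.g.
#   with open("sorted_inventory.bin", "rb") as f:
#       already_there = sha1_in_sorted_file(f, content_sha1)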

The content replayer is single-threaded and has a high latency, so we should run lots of instances of it, at least to fill the initial gap (I tried up to 100 on my desktop, the speedup is linear).

Example systemd unit to run it:

[Unit]
Description=Content replayer Rocq to S3 (service %i)
After=network.target

[Service]
Type=simple
ExecStart=/bin/bash -c 'sleep $(( RANDOM % 60 )); /home/dev/.local/bin/swh --log-level=INFO journal --config-file ~/replay_content_rocq_to_s3.yml content-replay --exclude-sha1-file /srv/softwareheritage/cassandra-test-0/scratch/sorted_inventory.bin'
Restart=on-failure
SyslogIdentifier=content-replayer-%i

Nice=10

[Install]
WantedBy=default.target

(The random sleep at the beginning is to work around a crash that happens if too many kafka clients start at the same time.)

Example config file:

objstorage_src:
  cls: multiplexer
  args:
    objstorages:
    - cls: filtered
      args:
        storage_conf:
          cls: remote
          args:
            url: http://banco.internal.softwareheritage.org:5003/
        filters_conf:
        - type: readonly
    - cls: filtered
      args:
        storage_conf:
          cls: remote
          args:
            url: http://uffizi.internal.softwareheritage.org:5003/
        filters_conf:
        - type: readonly

objstorage_dst:
  cls: s3
  args:
    container_name: NAME_OF_THE_S3_BUCKET
    key: KEY_OF_THE_S3_USER
    secret: SECRET_OF_THE_S3_USER

journal:
  brokers:
  - esnode1.internal.softwareheritage.org
  - esnode2.internal.softwareheritage.org
  - esnode3.internal.softwareheritage.org
  group_id: vlorentz-test-replay-rocq-to-s3
  max_poll_records: 100
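
For what it's worth, the semantics of the source side above are: try banco first, fall back to uffizi, and never write back to either (that's the readonly filter). A rough conceptual sketch of that behaviour, not the actual swh.objstorage code:

# Rough conceptual sketch of the source-side config above: read from the
# backends in order, refuse writes. This is NOT the swh.objstorage
# implementation, just an illustration of what the config means.
from typing import Sequence

class ObjNotFound(Exception):
    pass

class ReadOnlyMultiplexer:
    def __init__(self, backends: Sequence):
        # e.g. the two remote objstorages on banco and uffizi
        self.backends = backends

    def get(self, obj_id: bytes) -> bytes:
        for backend in self.backends:
            try:
                return backend.get(obj_id)
            except ObjNotFound:
                continue  # fall back to the next source
        raise ObjNotFound(obj_id.hex())

    def add(self, *args, **kwargs):
        # mirrors the `readonly` filter: the sources are never written to
        raise PermissionError("source objstorages are read-only")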

Event Timeline

vlorentz created this task.
vlorentz updated the task description. (Show Details)
olasd changed the task status from Open to Work in Progress. Nov 8 2019, 11:41 AM

So I've deployed this (by hand for now) on uffizi and it seems to be doing its job.

Deployment steps:

  • create an IAM policy in the AWS management console with only read/write access to the softwareheritage/contents bucket
  • create an IAM account with that policy enabled
  • get access credentials for that IAM account
  • notice that the objstorage doesn't implement having contents in a subdirectory; fix that and release the objstorage
  • retrieve vlorentz's exclude file
  • tweak the config and the unit file (separate user, proper objstorage config with compression, ...)
  • systemctl start content-replayer-s3@{01..20}
  • notice that we process 2 objects per second per client
  • systemctl start content-replayer-s3@{21..40}
  • notice that we still process 2 objects per second per client and that the loadavg is < 15
  • systemctl start content-replayer-s3@{41..60}
  • notice that we still process 2 objects per second per client and that the loadavg is < 15

I'll probably start more workers if things stay stable.

I've also patched swh.journal by hand to process batches of 1000 objects instead of 20 to reduce log spam. I'm not 100% sure how to handle that properly.
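
For reference, consuming in larger batches with confluent-kafka looks roughly like the sketch below; the topic name, batch size and callback are assumptions, and the actual swh.journal patch is structured differently:

# Sketch of batched consumption with confluent-kafka. Broker and group id
# are taken from the config above; the topic name, batch size and
# process_batch() callback are placeholders.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "esnode1.internal.softwareheritage.org",
    "group.id": "vlorentz-test-replay-rocq-to-s3",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["swh.journal.objects.content"])  # assumed topic name

while True:
    # Pull up to 1000 messages at once instead of 20, so progress is
    # logged once per (much larger) batch.
    messages = consumer.consume(num_messages=1000, timeout=1.0)
    batch = [m.value() for m in messages if m.error() is None]
    if batch:
        process_batch(batch)  # placeholder for the actual replay logic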

I've added a metric with the S3 objects to https://grafana.softwareheritage.org/d/jScG7g6mk/objstorage-object-counts. There's... "some" work to do still.

So, the amount of contents on S3 went up fairly quickly between Nov 10th and Nov 20th, but then it stopped again; is that expected/normal?

(thanks for the metric, it really helps)

In T1954#39027, @zack wrote:

So, the amount of contents on S3 went up fairly quickly between Nov 10th and Nov 20th, but then it stopped again; is that expected/normal?

(thanks for the metric, it really helps)

I haven't touched the replayers since last week; they're suffering from T2034 and end up hanging one by one. I'll follow up on the other task once a diagnosis happens.

I've grown tired of babysitting this, so I've added systemd notify calls to the journal replayer, allowing us to just use the systemd watchdog to restart hung processes.
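
For context, the watchdog pattern is just: declare Type=notify plus a WatchdogSec= in the unit, then ping systemd from the main loop. A minimal sketch using the python-systemd bindings (the real change lives in swh.journal and may look different):

# Minimal watchdog sketch using the python-systemd bindings. Assumes the
# unit sets Type=notify and WatchdogSec=, so systemd restarts the service
# if the pings stop arriving; process_one_batch() is a placeholder.
from systemd.daemon import notify

def replay_loop(process_one_batch):
    notify("READY=1")  # startup finished (required with Type=notify)
    while True:
        process_one_batch()    # the actual copy work goes here
        notify("WATCHDOG=1")   # heartbeat; a hung loop stops sending these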

I've also bumped up the parallelism to the maximum of 128 (which is the number of partitions on the content topic).

I've updated the exclusion file with data from the 2020-04-19 s3 inventory.

The throughput is now 50% higher than what it used to be.

https://grafana.softwareheritage.org/d/d3l2oqXWz/s3-object-copy?orgId=1&from=now-31d&to=now is a grafana dashboard to monitor the copy.

The s3 object copy is now completely caught up with where kafka was when the backfilling of all objects from postgresql ended. This means we're now copying the "newer" objects, and there are pretty much no hits at all on the inventory file anymore.

I'll leave this going for a while (over the weekend), then I'll remove the inventory file option to see whether we see a change in throughput.

The journal clients copying objects to S3 are blocked on being unable to read messages from kafka.

It seems that rdkafka is unable to decompress the messages because they're too large for its decompression buffer. But even bumping the size up a lot didn't fix that issue. I haven't had time to investigate the specific messages the client has issues with.
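
The librdkafka knobs that bound the fetch/receive buffers are fetch.message.max.bytes and receive.message.max.bytes; a hedged sketch of bumping them on a plain confluent-kafka consumer (whether and how swh.journal exposes these settings is not shown here, and the values and group id are examples only):

# Sketch of raising the consumer-side librdkafka buffer limits; the values
# below are examples only, and the group id is a placeholder.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "esnode1.internal.softwareheritage.org",
    "group.id": "content-replayer-s3",
    # maximum size of a fetched message batch, per partition:
    "fetch.message.max.bytes": 100 * 1024 * 1024,
    # maximum size of a whole protocol response the client will accept;
    # must be large enough to hold the biggest fetch response:
    "receive.message.max.bytes": 200 * 1024 * 1024,
})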

This warrants a separate task, because we should fix that message size issue before we start filling up our new kafka cluster.

The process has been restarted and is well underway (we have 800 million objects left to copy, at around 500 objects per second, so the ETA until reaching the tail of the log is around 3 weeks now: 800,000,000 / 500 ≈ 18.5 days).

We've now swapped the order of operations in swh.storage so all objects are written to the object storage (on saam and azure) before kafka gets a reference to the object. This should make the copy process fully work even when reaching the tail of the kafka topics (we're not there yet).
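
In outline, the new ordering is "bytes first, reference second", so a journal reader can never see a hash it cannot fetch. A conceptual sketch (names are illustrative, not the actual swh.storage code):

# Conceptual ordering sketch, not the actual swh.storage code: write the
# object bytes to every primary object storage before publishing the
# reference to kafka.
def add_content(content, objstorages, journal_writer):
    # 1. make the bytes durable on the object storages (saam, azure, ...)
    for objstorage in objstorages:
        objstorage.add(content.data, obj_id=content.sha1)
    # 2. only then let kafka (and thus the replayers) know about it
    journal_writer.write_addition("content", content)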

For the last few days, we've been seeing a small uptick in 503 errors from S3 (around 2500 objects failed to copy in the last week). We'll need to investigate these a bit further, and we'll need to process the backlog of errors to copy the missing objects.
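
Processing that backlog could be as simple as re-driving the failed object ids with a bit of backoff on transient errors; a hedged sketch (copy_to_s3 and the source of failed ids are placeholders, not existing swh helpers):

# Hedged sketch: retry a failed copy with exponential backoff on transient
# errors such as the S3 503s above. copy_to_s3 is a placeholder callback.
import time

def retry_copy(obj_id: bytes, copy_to_s3, max_attempts: int = 5) -> bool:
    for attempt in range(max_attempts):
        try:
            copy_to_s3(obj_id)
            return True
        except Exception:
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ... then retry
    return False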

Unless I'm mistaken, this task can be closed now; it looks to have reached a steady state where the lag is near 0.

We should probably add monitoring alerts (if we don't already have them) before closing the task

well this task should be closed, and a new subtask could be added for the alerting

T3477