
extid topic is misconfigured in staging and production
Closed, Migrated. Edits Locked.

Description

The extid topic has the default configuration in staging and production:

  • staging:
/opt/kafka/bin/kafka-topics.sh --bootstrap-server journal0.internal.staging.swh.network:9092 --describe --topic swh.journal.objects.extid 
Topic: swh.journal.objects.extid	PartitionCount: 1	ReplicationFactor: 1	Configs: max.message.bytes=104857600
  • production:
/opt/kafka/bin/kafka-topics.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 --describe --topic swh.journal.objects.extid 
Topic: swh.journal.objects.extid	PartitionCount: 1	ReplicationFactor: 1	Configs: max.message.bytes=104857600

The cleanup policy needs to be set to compact, and the partition count increased to 64 in staging and 256 in production. The replication factor also needs to be increased to 2 in production.

For staging:

/opt/kafka/bin/kafka-topics.sh --zookeeper $ZK --alter --topic swh.journal.objects.extid --config cleanup.policy=compact --partitions 64

For production:

/opt/kafka/bin/kafka-topics.sh --zookeeper $ZK --alter --topic swh.journal.objects.extid --config cleanup.policy=compact --partitions 256 --replication-factor 2

The content of the topic needs to be backfilled, so the previous content will be cleaned up after the next compaction.

Event Timeline

vsellier changed the task status from Open to Work in Progress. Aug 17 2021, 10:57 AM
vsellier triaged this task as High priority.
vsellier created this task.

Staging

vsellier@journal0 ~ % /opt/kafka/bin/kafka-topics.sh --zookeeper $ZK  --alter --topic swh.journal.objects.extid --config cleanup.policy=compact --partitions 64
WARNING: Altering topic configuration from this script has been deprecated and may be removed in future releases.
         Going forward, please use kafka-configs.sh for this functionality
Updated config for topic swh.journal.objects.extid.
WARNING: If partitions are increased for a topic that has a key, the partition logic or ordering of the messages will be affected
Adding partitions succeeded!
vsellier@journal0 ~ % /opt/kafka/bin/kafka-topics.sh  --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep ReplicationFactor      
Topic: swh.journal.objects.extid	PartitionCount: 64	ReplicationFactor: 1	Configs: cleanup.policy=compact,max.message.bytes=104857600

The backfill is running in staging (launched with the P1121 and P1122 scripts on storage1.staging on 2021-08-17 at 11:20 UTC):

swhstorage@storage1:~$ ./backfill.sh | tee output.log
swhstorage@storage1:~$ grep Starting output.log
Starting  swh storage backfill  extid --start-object 000000 --end-object 080000 
Starting  swh storage backfill  extid --start-object 080001 --end-object 100000 
Starting  swh storage backfill  extid --start-object 100001 --end-object 180000 
Starting  swh storage backfill  extid --start-object 180001 --end-object 200000 
Starting  swh storage backfill  extid --start-object 200001 --end-object 280000 
Starting  swh storage backfill  extid --start-object 280001 --end-object 300000 
Starting  swh storage backfill  extid --start-object 300001 --end-object 380000 
Starting  swh storage backfill  extid --start-object 380001 --end-object 400000 
Starting  swh storage backfill  extid --start-object 400001 --end-object 480000 
Starting  swh storage backfill  extid --start-object 480001 --end-object 500000 
Starting  swh storage backfill  extid --start-object 500001 --end-object 580000 
Starting  swh storage backfill  extid --start-object 580001 --end-object 600000 
Starting  swh storage backfill  extid --start-object 600001 --end-object 680000 
Starting  swh storage backfill  extid --start-object 680001 --end-object 700000 
Starting  swh storage backfill  extid --start-object 700001 --end-object 780000 
Starting  swh storage backfill  extid --start-object 780001 --end-object 800000 
Starting  swh storage backfill  extid --start-object 800001 --end-object 880000 
Starting  swh storage backfill  extid --start-object 880001 --end-object 900000 
Starting  swh storage backfill  extid --start-object 900001 --end-object 980000 
Starting  swh storage backfill  extid --start-object 980001 --end-object a00000 
Starting  swh storage backfill  extid --start-object a00001 --end-object a80000 
Starting  swh storage backfill  extid --start-object a80001 --end-object b00000 
Starting  swh storage backfill  extid --start-object b00001 --end-object b80000 
Starting  swh storage backfill  extid --start-object b80001 --end-object c00000 
Starting  swh storage backfill  extid --start-object c00001 --end-object c80000 
Starting  swh storage backfill  extid --start-object c80001 --end-object d00000 
Starting  swh storage backfill  extid --start-object d00001 --end-object d80000 
Starting  swh storage backfill  extid --start-object d80001 --end-object e00000 
Starting  swh storage backfill  extid --start-object e00001 --end-object e80000 
Starting  swh storage backfill  extid --start-object e80001 --end-object f00000 
Starting  swh storage backfill  extid --start-object f00001 --end-object f80000 
Starting  swh storage backfill  extid --start-object f80001
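The P1121/P1122 scripts themselves are not reproduced here; as a sketch, the range layout visible in the log above (32 slices of 0x80000 hex prefixes each, the last one open-ended) can be generated like this:

```shell
#!/bin/bash
# Sketch only: the boundaries are inferred from the log output above,
# not copied from the actual P1121/P1122 scripts.
ranges=()
for i in $(seq 0 31); do
  # First slice starts at 000000; the others start just past the previous end.
  start=$(printf '%06x' $(( i == 0 ? 0 : i * 0x80000 + 1 )))
  if (( i < 31 )); then
    end=$(printf '%06x' $(( (i + 1) * 0x80000 )))
    ranges+=("--start-object $start --end-object $end")
  else
    ranges+=("--start-object $start")  # the last slice is open-ended
  fi
done
# Each range would then be handed to one parallel worker, e.g.:
#   swh storage backfill extid ${ranges[0]} &
printf '%s\n' "${ranges[@]}"
```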

Production

Unfortunately, the replication factor can't be changed directly; the partition assignment must be reconfigured to change it.
This was done before increasing the number of partitions, to limit the number of moves to perform.

  1. Generate the current mapping:
kafka1 ~ % cat topic-to-move.json 
{
  "topics": [
    {
      "topic": "swh.journal.objects.extid"
    }
  ],
  "version": 1
}

kafka1 ~ % /opt/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server $SERVER --generate --broker-list 1,2,3,4 --topics-to-move-json-file topic-to-move.json 2>&1 | tee output.log
Current partition replica assignment
{"version":1,"partitions":[{"topic":"swh.journal.objects.extid","partition":0,"replicas":[3],"log_dirs":["any"]}]}

Proposed partition reassignment configuration
{"version":1,"partitions":[{"topic":"swh.journal.objects.extid","partition":0,"replicas":[4],"log_dirs":["any"]}]}
  2. Generate the new assignment with 2 replicas

The new assignment can be written manually since there is currently only one partition; with more partitions, online tools such as https://kafka-optimizer.sqooba.io/#kafka-partitions-assignment-optimizer can help.

kafka1 ~ % cat reassign.json 
{
  "version": 1,
  "partitions": [
    {
      "topic": "swh.journal.objects.extid",
      "partition": 0,
      "replicas": [
        3,
        2
      ]
    }
  ]
}
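With a single partition the JSON above is easy to write by hand. Purely as an illustration (the broker ids 1-4 and an 8-partition example count are assumptions), a round-robin assignment with 2 replicas per partition could be generated with a small script:

```shell
#!/bin/bash
# Hypothetical generator for a reassignment JSON: spread 2 replicas per
# partition round-robin over the brokers. Not the tool used in this task.
topic=swh.journal.objects.extid
brokers=(1 2 3 4)        # production broker ids (assumed)
n=${#brokers[@]}
parts=""
for p in $(seq 0 7); do  # 8 partitions, for the sake of the example
  r1=${brokers[p % n]}
  r2=${brokers[(p + 1) % n]}
  parts+="{\"topic\":\"$topic\",\"partition\":$p,\"replicas\":[$r1,$r2]},"
done
# Strip the trailing comma and wrap in the reassignment envelope.
echo "{\"version\":1,\"partitions\":[${parts%,}]}"
```

The resulting file would then be fed to kafka-reassign-partitions.sh as the --reassignment-json-file.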
  3. Apply the new configuration
kafka1 ~ % /opt/kafka/bin/kafka-reassign-partitions.sh --zookeeper $ZK --reassignment-json-file reassign.json --execute
Warning: --zookeeper is deprecated, and will be removed in a future version of Kafka.
Current partition replica assignment

{"version":1,"partitions":[{"topic":"swh.journal.objects.extid","partition":0,"replicas":[3],"log_dirs":["any"]}]}

Save this to use as the --reassignment-json-file option during rollback
Successfully started partition reassignment for swh.journal.objects.extid-0

kafka1 ~ % /opt/kafka/bin/kafka-topics.sh  --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep "^Topic"
Topic: swh.journal.objects.extid	PartitionCount: 1	ReplicationFactor: 2	Configs: max.message.bytes=104857600
  4. Increase the number of partitions and change the cleanup policy
kafka1 ~ % /opt/kafka/bin/kafka-topics.sh --zookeeper $ZK  --alter --topic swh.journal.objects.extid --config cleanup.policy=compact --partitions 256
WARNING: Altering topic configuration from this script has been deprecated and may be removed in future releases.
         Going forward, please use kafka-configs.sh for this functionality
Updated config for topic swh.journal.objects.extid.
WARNING: If partitions are increased for a topic that has a key, the partition logic or ordering of the messages will be affected
Adding partitions succeeded!

kafka1 ~ % /opt/kafka/bin/kafka-topics.sh  --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep "^Topic"                   
Topic: swh.journal.objects.extid	PartitionCount: 256	ReplicationFactor: 2	Configs: cleanup.policy=compact,max.message.bytes=104857600

Production backfill in progress:

root@getty:~/T3485# ./backfill.sh | tee output.log
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 000000 --end-object 080000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 080001 --end-object 100000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 100001 --end-object 180000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 180001 --end-object 200000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 200001 --end-object 280000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 280001 --end-object 300000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 300001 --end-object 380000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 380001 --end-object 400000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 400001 --end-object 480000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 480001 --end-object 500000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 500001 --end-object 580000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 580001 --end-object 600000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 600001 --end-object 680000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 680001 --end-object 700000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 700001 --end-object 780000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 780001 --end-object 800000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 800001 --end-object 880000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 880001 --end-object 900000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 900001 --end-object 980000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 980001 --end-object a00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object a00001 --end-object a80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object a80001 --end-object b00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object b00001 --end-object b80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object b80001 --end-object c00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object c00001 --end-object c80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object c80001 --end-object d00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object d00001 --end-object d80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object d80001 --end-object e00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object e00001 --end-object e80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object e80001 --end-object f00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object f00001 --end-object f80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object f80001

The backfill process was interrupted by a restart of kafka on kafka1 (!).

2021-08-18T09:20:05 ERROR    swh.journal.writer.kafka FAIL [swh.storage.journal_writer.getty#producer-1] [thrd:kafka1.internal.softwareheritage.org:9092/bootstrap]: kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 12 identical error(s) suppressed)
2021-08-18T09:20:05 INFO     swh.journal.writer.kafka Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 12 identical error(s) suppressed)"}
2021-08-18T09:20:05 ERROR    swh.journal.writer.kafka FAIL [swh.storage.journal_writer.getty#producer-1] [thrd:kafka1.internal.softwareheritage.org:9092/bootstrap]: kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 5 identical error(s) suppressed)
2021-08-18T09:20:05 INFO     swh.journal.writer.kafka Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 5 identical error(s) suppressed)"}
2021-08-18T09:20:07 INFO     swh.journal.writer.kafka PARTCNT [swh.storage.journal_writer.getty#producer-1] [thrd:main]: Topic swh.journal.objects.extid partition count changed from 256 to 128
2021-08-18T09:20:07 WARNING  swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [128] is unknown (partition_cnt 128): ignoring leader (-1) update
2021-08-18T09:20:07 WARNING  swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [130] is unknown (partition_cnt 128): ignoring leader (-1) update
2021-08-18T09:20:07 WARNING  swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [132] is unknown (partition_cnt 128): ignoring leader (-1) update
...
2021-08-18T09:20:07 WARNING  swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [253] is unknown (partition_cnt 128): ignoring leader (-1) update
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.13.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 145, in backfill
    dry_run=dry_run,
  File "/usr/lib/python3/dist-packages/swh/storage/backfill.py", line 637, in run
    writer.write_additions(object_type, objects)
  File "/usr/lib/python3/dist-packages/swh/storage/writer.py", line 67, in write_additions
    self.journal.write_additions(object_type, values)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 249, in write_additions
    self.flush()
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 215, in flush
    raise self.delivery_error("Failed deliveries after flush()")
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(Failed deliveries after flush(), [extid 344a2795951fabbf1f898b1a5fc54c4b57293cd5 (Local: Unknown partition)])
2021-08-18T09:20:07 INFO     swh.journal.writer.kafka PARTCNT [swh.storage.journal_writer.getty#producer-1] [thrd:main]: Topic swh.journal.objects.extid partition count changed from 256 to 128
...
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(flush() exceeded timeout (120s), [extid 6e1a1317c35b971ef88e052a8b1b78d57bc71a2e (No delivery before flush() timeout), extid a5052a247a0af7926b8e33224ecf7ab12c148eb5 (No delivery before flush() timeout), extid 4f5ed974e8691d340724782b01bc9bb63781176f (No delivery before flush() timeout)])
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.13.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 145, in backfill
    dry_run=dry_run,
  File "/usr/lib/python3/dist-packages/swh/storage/backfill.py", line 637, in run
    writer.write_additions(object_type, objects)
  File "/usr/lib/python3/dist-packages/swh/storage/writer.py", line 67, in write_additions
    self.journal.write_additions(object_type, values)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 249, in write_additions
    self.flush()
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 212, in flush
    "flush() exceeded timeout (%ss)" % self.flush_timeout,
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(flush() exceeded timeout (120s), [extid b3f5a81891b2be4bf487ff1f8418110fd87d1042 (No delivery before flush() timeout), extid 5c165ffa4bb15bde37d0652cee9e19c5f0cda09b (No delivery before flush() timeout)])

The backfill will be restarted from the last positions (we need to figure out how to do that without taking too much time).
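The actual resume logic ended up in the script pasted in P1124 (not reproduced here); a hypothetical way to recover each worker's last position from its partial log:

```shell
#!/bin/bash
# Hypothetical helper (not the actual P1124 script): extract the last
# "Processing extid range X to Y" start value from a worker's log, so the
# backfill for that worker can be resumed from this object prefix.
last_range() {
  grep 'Processing extid range' "$1" | tail -n 1 \
    | sed -E 's/.*range ([0-9a-f]+) to .*/\1/'
}
```

For example, `last_range 0.log` would print the start of the last range worker 0 reported.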

The backfill was relaunched using the script pasted in P1124.

After ~40h, the backfill is only ~5% done for staging and less than 1% for production.

The backfill was stopped as the performance was (much) lower than expected (worked around with D6127).

To allow cleaning up the gigantic partition, and considering that we're going to backfill the whole topic, we can temporarily set a time-based retention policy on the topic with the retention.ms setting (e.g., one day = 86400000 ms), adding delete to cleanup.policy.
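The retention value is simply one day expressed in milliseconds:

```shell
# 24 h in milliseconds, the value used for retention.ms
echo $(( 24 * 60 * 60 * 1000 ))
```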

olasd@journal0:~$ /opt/kafka/bin/kafka-configs.sh --bootstrap-server journal0.internal.staging.swh.network:9092 --alter  --add-config 'cleanup.policy=[compact,delete],retention.ms=86400000' --entity-type=topics --entity-name swh.journal.objects.extid
  • the cleanup policy was restored to compact on staging:
vsellier@journal0 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server journal0.internal.staging.swh.network:9092 --alter  --delete-config 'cleanup.policy' --entity-type=topics --entity-name swh.journal.objects.extid
% /opt/kafka/bin/kafka-configs.sh --bootstrap-server journal0.internal.staging.swh.network:9092 --alter  --add-config 'cleanup.policy=compact' --entity-type=topics --entity-name swh.journal.objects.extid
Completed updating config for topic swh.journal.objects.extid.
% /opt/kafka/bin/kafka-topics.sh  --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep "^Topic"
Topic: swh.journal.objects.extid	PartitionCount: 64	ReplicationFactor: 1	Configs: cleanup.policy=compact,max.message.bytes=104857600,min.cleanable.dirty.ratio=0.01
  • the backfill was restarted with 16 clients and the new backfill query.

Based on the import rate over the first 5 minutes, performance is better: the backfill should be done in ~3 days.

  • on production:
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-topics.sh  --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep "^Topic"
Topic: swh.journal.objects.extid	PartitionCount: 256	ReplicationFactor: 2	Configs: cleanup.policy=compact,max.message.bytes=104857600
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server ${SERVER} --alter  --add-config 'cleanup.policy=[compact,delete],retention.ms=86400000' --entity-type=topics --entity-name swh.journal.objects.extid
Completed updating config for topic swh.journal.objects.extid.

In the kafka logs:

...
[2021-08-25 14:56:19,495] INFO [Log partition=swh.journal.objects.extid-162, dir=/srv/kafka/logdir] Found deletable segments with base offsets [0] due to retention time 86400000ms breach (kafka.log.Log)
[2021-08-25 14:56:19,495] INFO [Log partition=swh.journal.objects.extid-162, dir=/srv/kafka/logdir] Scheduling segments for deletion LogSegment(baseOffset=0, size=2720767, lastModifiedTime=1629815520833, largestTime=1629815520702) (kafka.log.Log)
[2021-08-25 14:56:19,495] INFO [Log partition=swh.journal.objects.extid-162, dir=/srv/kafka/logdir] Incremented log start offset to 20623 due to segment deletion (kafka.log.Log)
....
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server ${SERVER} --alter  --delete-config 'cleanup.policy' --entity-type=topics --entity-name swh.journal.objects.extid                                                
Completed updating config for topic swh.journal.objects.extid.
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server ${SERVER} --alter  --delete-config 'retention.ms' --entity-type=topics --entity-name swh.journal.objects.extid 
Completed updating config for topic swh.journal.objects.extid.
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server ${SERVER} --alter  --add-config 'cleanup.policy=compact' --entity-type=topics --entity-name swh.journal.objects.extid
Completed updating config for topic swh.journal.objects.extid.
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-topics.sh  --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep "^Topic"                                          
Topic: swh.journal.objects.extid	PartitionCount: 256	ReplicationFactor: 2	Configs: cleanup.policy=compact,max.message.bytes=104857600
  • backfill restarted with 16 workers
  • after a couple of minutes, the ETA is ~10 days

It was much faster than expected in staging; the backfill is already done:

real    71m34.776s
user    154m33.794s
sys     38m51.614s
vsellier@journal0 ~ % /opt/kafka/bin/kafka-console-consumer.sh  --bootstrap-server $SERVER --topic swh.journal.objects.extid --group vse-test6 --from-beginning > /dev/null                              
^CProcessed a total of 11422638 messages
swh=> select count(*) from extid;
  count   
----------
 11421463
(1 row)

The difference is because I stopped the consumer before running the SQL query.

vsellier moved this task from in-progress to done on the System administration board.

The backfill is also done for production.
It took less than 4h30.

...
2021-08-25T19:25:25 INFO     swh.storage.backfill Processing extid range 700000 to 700001

real    261m39.720s
user    222m4.947s
sys     79m2.717s

There were no errors in the logs:

root@getty:~/T3485/logs# ls -al
total 92424
drwxr-xr-x 2 root root    4096 Aug 26 09:54 .
drwxr-xr-x 5 root root    4096 Aug 25 15:03 ..
-rw-r--r-- 1 root root 5907432 Aug 25 19:23 0.log.gz
-rw-r--r-- 1 root root 5907273 Aug 25 19:24 1.log.gz
-rw-r--r-- 1 root root 5907242 Aug 25 19:24 10.log.gz
-rw-r--r-- 1 root root 5907442 Aug 25 19:24 11.log.gz
-rw-r--r-- 1 root root 5907561 Aug 25 19:24 12.log.gz
-rw-r--r-- 1 root root 5907345 Aug 25 19:24 13.log.gz
-rw-r--r-- 1 root root 5907293 Aug 25 19:23 14.log.gz
-rw-r--r-- 1 root root 5907788 Aug 25 19:24 15.log.gz
-rw-r--r-- 1 root root 5907620 Aug 25 19:23 2.log.gz
-rw-r--r-- 1 root root 5907589 Aug 25 19:23 3.log.gz
-rw-r--r-- 1 root root 5907551 Aug 25 19:24 4.log.gz
-rw-r--r-- 1 root root 5907123 Aug 25 19:23 5.log.gz
-rw-r--r-- 1 root root 5908002 Aug 25 19:25 6.log.gz
-rw-r--r-- 1 root root 5907235 Aug 25 19:24 7.log.gz
-rw-r--r-- 1 root root 5907348 Aug 25 19:22 8.log.gz
-rw-r--r-- 1 root root 5907488 Aug 25 19:24 9.log.gz
root@getty:~/T3485/logs# zcat *.log.gz  | grep -v Processing
root@getty:~/T3485/logs#