ok roger that :).
I will increase it to 524288 in the diff.
Aug 25 2021
All the loaders were restarted on worker01 and worker02; the cluster seems OK.
The open file limit was manually increased to stabilize the cluster:
# puppet agent --disable T3501
# diff -U3 /tmp/kafka.service kafka.service
--- /tmp/kafka.service  2021-08-25 07:32:28.068928972 +0000
+++ kafka.service       2021-08-25 07:32:31.384955246 +0000
@@ -15,7 +15,7 @@
 Environment='LOG_DIR=/var/log/kafka'
 Type=simple
 ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
-LimitNOFILE=65536
+LimitNOFILE=131072
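As a sketch of how the planned bump to 524288 could be applied without editing the unit file in place, a systemd drop-in would also work (assuming the unit is named kafka.service; in our setup the value is normally managed by puppet):

mkdir -p /etc/systemd/system/kafka.service.d
cat > /etc/systemd/system/kafka.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=524288
EOF
systemctl daemon-reload
systemctl restart kafka
# check the limit actually applied to the running process
grep 'open files' /proc/$(systemctl show -p MainPID --value kafka)/limits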
- Incident created on status.io
- Loaders disabled:
root@pergamon:~# clush -b -w @swh-workers 'puppet agent --disable "Kafka incident T3501"; systemctl stop cron; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@loader_*; do systemctl disable $unit; done; systemctl stop "swh-worker@loader_*"'
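For the record, the reverse procedure once the incident is over would be roughly the following (a sketch: re-enabling puppet and triggering a run should restore the enabled/running state of the units it manages):

root@pergamon:~# clush -b -w @swh-workers 'puppet agent --enable; puppet agent --test; systemctl start cron'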
Aug 24 2021
Some live data from a git loader with a batch size of 1000 for each object type (with D6118 applied):
"object type";"input count";"missing_id duration (s)";"_missing_id count","_add duration(s)" content;1000;0.4928;999;35.3384 content;1000;0.4095;1000;34.1440 content;1000;0.4374;998;35.6249 content;492;0.2960;488;16.7028 directory;1000;0.3978;999;71.2518 directory;1000;0.4484;1000;39.6845 directory;1000;0.4356;1000;54.0077 directory;1000;0.3833;1000;36.1437 directory;1000;0.4319;1000;30.5690 directory;402;0.1718;402;19.2335 revision;1000;0.8671;1000;10.3417 revision;575;0.4639;575;4.0819
The performance is now OK for the read part, with a batch size of 1000 for content, directory, and revision.
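To turn the raw measurements into aggregate throughput, a quick awk over the data works (a sketch, assuming the table above is saved as timings.csv):

# objects/s for the _add step, per object type
awk -F';' 'NR > 1 { count[$1] += $2; add[$1] += $5 }
           END { for (t in count) printf "%s: %.1f objects/s\n", t, count[t] / add[t] }' timings.csv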
An alert was sent by email on 2021-05-22 at 05:30 AM, so the monitoring correctly detected the issue ;) :
This message was generated by the smartd daemon running on:
On hypervisor3 and branly:
- A new LVM volume was created and mounted on /var/lib/vz (40G on hypervisor3 / 100G on branly)
- The 'local' storage type was activated on Proxmox via the UI (Datacenter / Storage / local, check Enable)
- The pushkin and glytotek disks were moved via the UI to the local storage (<vm> / Hardware / click on the disk / Move disk button / target storage 'local')
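The same moves can also be scripted from the Proxmox CLI instead of the UI; a sketch (the VM id 123 and disk slot scsi0 are placeholders, check them with qm config first):

pvesm status                   # confirm the 'local' storage is now enabled
qm config 123 | grep -E '^(scsi|virtio|sata)'   # find the disk slot to move
qm move_disk 123 scsi0 local   # move it to the 'local' storage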
LGTM (double checked with @olasd ;) )
Aug 23 2021
It seems the problem is no longer present (tested with several origins).
root@parasilo-19:~/swh-environment/docker# docker exec -ti docker_swh-loader_1 bash
swh@8e68948366b7:/$ swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
{'status': 'uneventful'}
swh@8e68948366b7:/$ swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
{'status': 'uneventful'}
swh@8e68948366b7:/$ swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
{'status': 'uneventful'}
The origin_visit topic was replayed with your diff during the weekend. Let's now test whether the worker behavior is more deterministic.
Aug 20 2021
Aug 19 2021
The gros cluster at Nancy[1] has a lot of nodes (124) with small reservable SSDs of 960 GB. It can be a good candidate for creating the second cluster. It will also allow checking the performance with data (and commit logs) on SSDs.
Based on the main cluster, a minimum of 8 nodes is necessary to handle the volume of data (7.3 TB and growing; 8 × 960 GB ≈ 7.7 TB). Starting with 10 nodes will leave some headroom.
It seems more precise information can be logged by activating the full query logs, without a big performance impact: https://cassandra.apache.org/doc/latest/cassandra/new/fqllogging.html
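A sketch of what enabling it on a node could look like (the log path is a placeholder; see the linked documentation for the matching cassandra.yaml options):

# start recording all queries to binary logs under the given path
nodetool enablefullquerylog --path /var/log/cassandra/fql
# ... let the workload run for a while ...
nodetool disablefullquerylog
# inspect the captured queries with the fqltool shipped with Cassandra 4.0
fqltool dump /var/log/cassandra/fql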
Should be fixed by T3482
After ~40h, the backfill is ~5% done for staging and less than 1% for production.
Aug 18 2021
The backfill was relaunched using the script pasted in P1124.
The backfill process was interrupted by a restart of kafka on kafka1 (!).
2021-08-18T09:20:05 ERROR swh.journal.writer.kafka FAIL [swh.storage.journal_writer.getty#producer-1] [thrd:kafka1.internal.softwareheritage.org:9092/bootstrap]: kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 12 identical error(s) suppressed)
2021-08-18T09:20:05 INFO swh.journal.writer.kafka Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 12 identical error(s) suppressed)"}
2021-08-18T09:20:05 ERROR swh.journal.writer.kafka FAIL [swh.storage.journal_writer.getty#producer-1] [thrd:kafka1.internal.softwareheritage.org:9092/bootstrap]: kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 5 identical error(s) suppressed)
2021-08-18T09:20:05 INFO swh.journal.writer.kafka Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 5 identical error(s) suppressed)"}
2021-08-18T09:20:07 INFO swh.journal.writer.kafka PARTCNT [swh.storage.journal_writer.getty#producer-1] [thrd:main]: Topic swh.journal.objects.extid partition count changed from 256 to 128
2021-08-18T09:20:07 WARNING swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [128] is unknown (partition_cnt 128): ignoring leader (-1) update
2021-08-18T09:20:07 WARNING swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [130] is unknown (partition_cnt 128): ignoring leader (-1) update
2021-08-18T09:20:07 WARNING swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [132] is unknown (partition_cnt 128): ignoring leader (-1) update
...
2021-08-18T09:20:07 WARNING swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [253] is unknown (partition_cnt 128): ignoring leader (-1) update
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.13.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 145, in backfill
    dry_run=dry_run,
  File "/usr/lib/python3/dist-packages/swh/storage/backfill.py", line 637, in run
    writer.write_additions(object_type, objects)
  File "/usr/lib/python3/dist-packages/swh/storage/writer.py", line 67, in write_additions
    self.journal.write_additions(object_type, values)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 249, in write_additions
    self.flush()
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 215, in flush
    raise self.delivery_error("Failed deliveries after flush()")
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(Failed deliveries after flush(), [extid 344a2795951fabbf1f898b1a5fc54c4b57293cd5 (Local: Unknown partition)])
2021-08-18T09:20:07 INFO swh.journal.writer.kafka PARTCNT [swh.storage.journal_writer.getty#producer-1] [thrd:main]: Topic swh.journal.objects.extid partition count changed from 256 to 128
...
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(flush() exceeded timeout (120s), [extid 6e1a1317c35b971ef88e052a8b1b78d57bc71a2e (No delivery before flush() timeout), extid a5052a247a0af7926b8e33224ecf7ab12c148eb5 (No delivery before flush() timeout), extid 4f5ed974e8691d340724782b01bc9bb63781176f (No delivery before flush() timeout)])
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.13.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 145, in backfill
    dry_run=dry_run,
  File "/usr/lib/python3/dist-packages/swh/storage/backfill.py", line 637, in run
    writer.write_additions(object_type, objects)
  File "/usr/lib/python3/dist-packages/swh/storage/writer.py", line 67, in write_additions
    self.journal.write_additions(object_type, values)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 249, in write_additions
    self.flush()
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 212, in flush
    "flush() exceeded timeout (%ss)" % self.flush_timeout,
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(flush() exceeded timeout (120s), [extid b3f5a81891b2be4bf487ff1f8418110fd87d1042 (No delivery before flush() timeout), extid 5c165ffa4bb15bde37d0652cee9e19c5f0cda09b (No delivery before flush() timeout)])
The backfill will be restarted from the last positions (need to figure out how to do that without taking too much time).
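One possible way to find the restart position (a sketch, assuming the ranges are launched sequentially and output.log contains the same "Starting ..." lines as shown below for the 17th; this is not the actual procedure used):

# last range that had been started before the crash
last_start=$(grep -o -- '--start-object [0-9a-f]*' output.log | tail -1 | awk '{print $2}')
echo "resuming from ${last_start}"
swh storage backfill extid --start-object "${last_start}"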
Aug 17 2021
One very important thing to get right is the Build-Depends line in the source package stanza. setuptools/distribute-based packages have the nasty habit of downloading dependencies from PyPI if they are needed at python setup.py build time. If the package is available from the system (as would be the case when Build-Depends is up-to-date), then distribute will not try to download the package, otherwise it will try to download it. This is a huge no-no, and pybuild internally sets the http_proxy and https_proxy environment variables (to 127.0.0.1:9) to prevent this from happening.
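This also gives a cheap way to check Build-Depends completeness locally: pointing pip/setuptools at the same dead proxy pybuild uses makes any PyPI download attempt fail immediately (a sketch):

# port 9 (discard) refuses connections, so downloads fail fast
export http_proxy=http://127.0.0.1:9
export https_proxy=http://127.0.0.1:9
python3 setup.py build   # must succeed using only system-installed build deps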
The PyPI build is still working well with the last 2 diffs.
Now there is a new error during the Debian builds:
dh: warning: Compatibility levels before 10 are deprecated (level 9 in use)
   dh_auto_clean -O--buildsystem=pybuild
dh_auto_clean: warning: Compatibility levels before 10 are deprecated (level 9 in use)
I: pybuild base:232: python3.9 setup.py clean
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101bcd0>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101beb0>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101b850>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101b730>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101b610>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
ERROR: Could not find a version that satisfies the requirement tree-sitter==0.19.0
ERROR: No matching distribution found for tree-sitter==0.19.0
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/setuptools/installer.py", line 75, in fetch_build_egg
    subprocess.check_call(cmd)
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3.9', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmpdrbws3hq', '--quiet', 'tree-sitter==0.19.0']' returned non-zero exit status 1.
Current status:
Production backfill in progress:
root@getty:~/T3485# ./backfill.sh | tee output.log
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 000000 --end-object 080000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 080001 --end-object 100000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 100001 --end-object 180000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 180001 --end-object 200000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 200001 --end-object 280000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 280001 --end-object 300000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 300001 --end-object 380000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 380001 --end-object 400000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 400001 --end-object 480000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 480001 --end-object 500000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 500001 --end-object 580000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 580001 --end-object 600000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 600001 --end-object 680000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 680001 --end-object 700000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 700001 --end-object 780000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 780001 --end-object 800000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 800001 --end-object 880000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 880001 --end-object 900000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 900001 --end-object 980000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 980001 --end-object a00000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object a00001 --end-object a80000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object a80001 --end-object b00000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object b00001 --end-object b80000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object b80001 --end-object c00000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object c00001 --end-object c80000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object c80001 --end-object d00000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object d00001 --end-object d80000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object d80001 --end-object e00000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object e00001 --end-object e80000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object e80001 --end-object f00000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object f00001 --end-object f80000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object f80001
Production
Unfortunately, the replication factor can't be changed directly; the partition assignment must be reconfigured to change it.
This was done before increasing the number of partitions, to limit the number of moves to perform.
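For reference, such a reassignment goes through kafka-reassign-partitions.sh with an explicit replica list per partition; a minimal sketch for the first two partitions only (the broker ids are placeholders, and newer Kafka versions use --bootstrap-server instead of --zookeeper):

cat > /tmp/extid-rf2.json <<'EOF'
{"version": 1,
 "partitions": [
  {"topic": "swh.journal.objects.extid", "partition": 0, "replicas": [1, 2]},
  {"topic": "swh.journal.objects.extid", "partition": 1, "replicas": [2, 3]}
 ]}
EOF
/opt/kafka/bin/kafka-reassign-partitions.sh --zookeeper $ZK --reassignment-json-file /tmp/extid-rf2.json --execute
# later, check that the data moves are finished
/opt/kafka/bin/kafka-reassign-partitions.sh --zookeeper $ZK --reassignment-json-file /tmp/extid-rf2.json --verify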
The backfill is running in staging (launched with the P1121 and P1122 scripts on storage1.staging on 2021-08-17 at 11:20 UTC):
swhstorage@storage1:~$ ./backfill.sh | tee output.log
swhstorage@storage1:~$ grep Starting output.log
Starting swh storage backfill extid --start-object 000000 --end-object 080000
Starting swh storage backfill extid --start-object 080001 --end-object 100000
Starting swh storage backfill extid --start-object 100001 --end-object 180000
Starting swh storage backfill extid --start-object 180001 --end-object 200000
Starting swh storage backfill extid --start-object 200001 --end-object 280000
Starting swh storage backfill extid --start-object 280001 --end-object 300000
Starting swh storage backfill extid --start-object 300001 --end-object 380000
Starting swh storage backfill extid --start-object 380001 --end-object 400000
Starting swh storage backfill extid --start-object 400001 --end-object 480000
Starting swh storage backfill extid --start-object 480001 --end-object 500000
Starting swh storage backfill extid --start-object 500001 --end-object 580000
Starting swh storage backfill extid --start-object 580001 --end-object 600000
Starting swh storage backfill extid --start-object 600001 --end-object 680000
Starting swh storage backfill extid --start-object 680001 --end-object 700000
Starting swh storage backfill extid --start-object 700001 --end-object 780000
Starting swh storage backfill extid --start-object 780001 --end-object 800000
Starting swh storage backfill extid --start-object 800001 --end-object 880000
Starting swh storage backfill extid --start-object 880001 --end-object 900000
Starting swh storage backfill extid --start-object 900001 --end-object 980000
Starting swh storage backfill extid --start-object 980001 --end-object a00000
Starting swh storage backfill extid --start-object a00001 --end-object a80000
Starting swh storage backfill extid --start-object a80001 --end-object b00000
Starting swh storage backfill extid --start-object b00001 --end-object b80000
Starting swh storage backfill extid --start-object b80001 --end-object c00000
Starting swh storage backfill extid --start-object c00001 --end-object c80000
Starting swh storage backfill extid --start-object c80001 --end-object d00000
Starting swh storage backfill extid --start-object d00001 --end-object d80000
Starting swh storage backfill extid --start-object d80001 --end-object e00000
Starting swh storage backfill extid --start-object e00001 --end-object e80000
Starting swh storage backfill extid --start-object e80001 --end-object f00000
Starting swh storage backfill extid --start-object f00001 --end-object f80000
Starting swh storage backfill extid --start-object f80001
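The P1121/P1122 pastes are not reproduced here; a sketch of a script generating the same 0x80000-wide ranges as in the output above:

#!/bin/bash
# walk the hex id space in 0x80000-wide slices
start=000000
for end in $(printf '%06x\n' $(seq $((16#080000)) $((16#080000)) $((16#f80000)))); do
    echo "Starting swh storage backfill extid --start-object ${start} --end-object ${end}"
    swh storage backfill extid --start-object "${start}" --end-object "${end}"
    start=$(printf '%06x' $((16#${end} + 1)))
done
# last, open-ended range
echo "Starting swh storage backfill extid --start-object ${start}"
swh storage backfill extid --start-object "${start}"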
To use with P1122:
vsellier@journal0 ~ % /opt/kafka/bin/kafka-topics.sh --zookeeper $ZK --alter --topic swh.journal.objects.extid --config cleanup.policy=compact --partitions 64
WARNING: Altering topic configuration from this script has been deprecated and may be removed in future releases.
         Going forward, please use kafka-configs.sh for this functionality
Updated config for topic swh.journal.objects.extid.
WARNING: If partitions are increased for a topic that has a key, the partition logic or ordering of the messages will be affected
Adding partitions succeeded!
vsellier@journal0 ~ % /opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep ReplicationFactor
Topic: swh.journal.objects.extid    PartitionCount: 64    ReplicationFactor: 1    Configs: cleanup.policy=compact,max.message.bytes=104857600
Aug 16 2021
We tested with @vlorentz: it's OK if yarn's build target is updated to not call the build-so and build-wasm targets, and if the tree-sitter module is kept in the Docker image.
Ceph status went back to OK with these actions:
- Clean up the crash history:
- To check the status:
ceph crash ls
ceph crash info <id>
- To clean up:
ceph crash archive-all
- Remove the "mons are allowing insecure global_id reclaim" error:
- Apply this configuration: https://ceph.io/releases/v14-2-20-nautilus-released/
ceph config set mon mon_warn_on_insecure_global_id_reclaim false
ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false
Aug 13 2021
Not sure if I'm missing something or if something is missing in this diff (I can't find its parent in my repo), but I applied it and the build is still failing when the yarn command is launched, which sounds logical as the yarn config still launches the tree-sitter command directly.
fix version selection
Current import status before this weekend's run: