After ~40h, the backfill is ~5% done for staging and less than 1% for production.
Aug 19 2021
Aug 18 2021
The backfill was relaunched using the script pasted in P1124.
The backfill process was interrupted by a restart of Kafka on kafka1 (!):
2021-08-18T09:20:05 ERROR swh.journal.writer.kafka FAIL [swh.storage.journal_writer.getty#producer-1] [thrd:kafka1.internal.softwareheritage.org:9092/bootstrap]: kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 12 identical error(s) suppressed)
2021-08-18T09:20:05 INFO swh.journal.writer.kafka Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 12 identical error(s) suppressed)"}
2021-08-18T09:20:05 ERROR swh.journal.writer.kafka FAIL [swh.storage.journal_writer.getty#producer-1] [thrd:kafka1.internal.softwareheritage.org:9092/bootstrap]: kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 5 identical error(s) suppressed)
2021-08-18T09:20:05 INFO swh.journal.writer.kafka Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 5 identical error(s) suppressed)"}
2021-08-18T09:20:07 INFO swh.journal.writer.kafka PARTCNT [swh.storage.journal_writer.getty#producer-1] [thrd:main]: Topic swh.journal.objects.extid partition count changed from 256 to 128
2021-08-18T09:20:07 WARNING swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [128] is unknown (partition_cnt 128): ignoring leader (-1) update
2021-08-18T09:20:07 WARNING swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [130] is unknown (partition_cnt 128): ignoring leader (-1) update
2021-08-18T09:20:07 WARNING swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [132] is unknown (partition_cnt 128): ignoring leader (-1) update
...
2021-08-18T09:20:07 WARNING swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [253] is unknown (partition_cnt 128): ignoring leader (-1) update
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.13.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 145, in backfill
    dry_run=dry_run,
  File "/usr/lib/python3/dist-packages/swh/storage/backfill.py", line 637, in run
    writer.write_additions(object_type, objects)
  File "/usr/lib/python3/dist-packages/swh/storage/writer.py", line 67, in write_additions
    self.journal.write_additions(object_type, values)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 249, in write_additions
    self.flush()
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 215, in flush
    raise self.delivery_error("Failed deliveries after flush()")
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(Failed deliveries after flush(), [extid 344a2795951fabbf1f898b1a5fc54c4b57293cd5 (Local: Unknown partition)])
2021-08-18T09:20:07 INFO swh.journal.writer.kafka PARTCNT [swh.storage.journal_writer.getty#producer-1] [thrd:main]: Topic swh.journal.objects.extid partition count changed from 256 to 128
...
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(flush() exceeded timeout (120s), [extid 6e1a1317c35b971ef88e052a8b1b78d57bc71a2e (No delivery before flush() timeout), extid a5052a247a0af7926b8e33224ecf7ab12c148eb5 (No delivery before flush() timeout), extid 4f5ed974e8691d340724782b01bc9bb63781176f (No delivery before flush() timeout)])
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.13.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 145, in backfill
    dry_run=dry_run,
  File "/usr/lib/python3/dist-packages/swh/storage/backfill.py", line 637, in run
    writer.write_additions(object_type, objects)
  File "/usr/lib/python3/dist-packages/swh/storage/writer.py", line 67, in write_additions
    self.journal.write_additions(object_type, values)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 249, in write_additions
    self.flush()
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 212, in flush
    "flush() exceeded timeout (%ss)" % self.flush_timeout,
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(flush() exceeded timeout (120s), [extid b3f5a81891b2be4bf487ff1f8418110fd87d1042 (No delivery before flush() timeout), extid 5c165ffa4bb15bde37d0652cee9e19c5f0cda09b (No delivery before flush() timeout)])
The backfill will be restarted from the last positions (we still need to figure out how to do that without it taking too much time).
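One possible approach (a rough sketch only, not the procedure actually used): scan output.log for the last range boundary each job reported and regenerate the `--start-object` arguments from there. The progress-line format matched below is an assumption for illustration; the real output of `swh storage backfill` may differ.

```python
# Sketch: recover per-job resume positions from output.log.
# ASSUMPTION: progress lines look like "... range <start> to <end>";
# the actual backfiller log format may differ.
import re

range_re = re.compile(r"range ([0-9a-f]{6}) to ([0-9a-f]{6})")
last_start = {}  # end bound of the job -> last range start seen

with open("output.log") as f:
    for line in f:
        m = range_re.search(line)
        if m:
            start, end = m.groups()
            # fixed-width lowercase hex, so string order == numeric order
            last_start[end] = max(last_start.get(end, start), start)

for end, start in sorted(last_start.items()):
    print(f"swh storage backfill extid --start-object {start} --end-object {end}")
```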
Aug 17 2021
One very important thing to get right is the Build-Depends line in the source package stanza. setuptools/distribute-based packages have the nasty habit of downloading dependencies from PyPI if they are needed at python setup.py build time. If the package is available from the system (as would be the case when Build-Depends is up-to-date), then distribute will not try to download the package, otherwise it will try to download it. This is a huge no-no, and pybuild internally sets the http_proxy and https_proxy environment variables (to 127.0.0.1:9) to prevent this from happening.
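For reference, the same guard can be reproduced by hand: port 9 (the discard port) refuses connections immediately, so any attempt by setuptools to fetch a missing build dependency fails fast instead of silently downloading it. A minimal sketch:

```python
# Minimal sketch of the pybuild network guard: run a build with the proxy
# variables pointed at 127.0.0.1:9 so any PyPI download attempt fails fast.
import os
import subprocess

env = dict(
    os.environ,
    http_proxy="http://127.0.0.1:9",
    https_proxy="http://127.0.0.1:9",
)
subprocess.run(["python3", "setup.py", "build"], env=env, check=True)
```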
The PyPI build is still working well with the last two diffs.
Now there is a new error during the Debian builds:
dh: warning: Compatibility levels before 10 are deprecated (level 9 in use)
   dh_auto_clean -O--buildsystem=pybuild
dh_auto_clean: warning: Compatibility levels before 10 are deprecated (level 9 in use)
I: pybuild base:232: python3.9 setup.py clean
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101bcd0>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101beb0>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101b850>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101b730>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101b610>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
ERROR: Could not find a version that satisfies the requirement tree-sitter==0.19.0
ERROR: No matching distribution found for tree-sitter==0.19.0
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/setuptools/installer.py", line 75, in fetch_build_egg
    subprocess.check_call(cmd)
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3.9', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmpdrbws3hq', '--quiet', 'tree-sitter==0.19.0']' returned non-zero exit status 1.
Current status:
Production backfill in progress:
root@getty:~/T3485# ./backfill.sh | tee output.log
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 000000 --end-object 080000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 080001 --end-object 100000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 100001 --end-object 180000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 180001 --end-object 200000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 200001 --end-object 280000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 280001 --end-object 300000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 300001 --end-object 380000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 380001 --end-object 400000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 400001 --end-object 480000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 480001 --end-object 500000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 500001 --end-object 580000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 580001 --end-object 600000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 600001 --end-object 680000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 680001 --end-object 700000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 700001 --end-object 780000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 780001 --end-object 800000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 800001 --end-object 880000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 880001 --end-object 900000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 900001 --end-object 980000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object 980001 --end-object a00000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object a00001 --end-object a80000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object a80001 --end-object b00000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object b00001 --end-object b80000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object b80001 --end-object c00000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object c00001 --end-object c80000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object c80001 --end-object d00000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object d00001 --end-object d80000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object d80001 --end-object e00000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object e00001 --end-object e80000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object e80001 --end-object f00000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object f00001 --end-object f80000
Starting swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill extid --start-object f80001
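backfill.sh itself is the script in the paste bin, but the ranges above follow a regular pattern: the 24-bit hex prefix space is cut into 0x80000-wide slices, one job per slice. A hypothetical reconstruction of the range generation (sequential here; the real script presumably runs jobs in parallel and adds the --log-config flag):

```python
# Hypothetical reconstruction of the backfill.sh range generation
# (the actual script is the one pasted in P1124).
import subprocess

STEP = 0x80000
start = 0
for end in range(STEP, 0x1000000, STEP):
    args = [
        "swh", "storage", "backfill", "extid",
        "--start-object", f"{start:06x}",
        "--end-object", f"{end:06x}",
    ]
    print("Starting", " ".join(args))
    subprocess.run(args, check=True)
    start = end + 1

# last, open-ended range
last = ["swh", "storage", "backfill", "extid", "--start-object", f"{start:06x}"]
print("Starting", " ".join(last))
subprocess.run(last, check=True)
```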
Production
Unfortunately, the replication factor can't be changed directly; the partition assignment must be reconfigured to change it.
This was done before increasing the number of partitions, to limit the number of partition moves to perform.
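For the record, such a reassignment goes through kafka-reassign-partitions.sh with a JSON plan file; something like the sketch below can generate one. The broker ids and the round-robin placement are illustrative, not the actual assignment that was applied:

```python
# Sketch: build a reassignment plan raising swh.journal.objects.extid from
# replication factor 1 to 2. Broker ids and placement are illustrative.
import json

TOPIC = "swh.journal.objects.extid"
BROKERS = [1, 2, 3, 4]  # assumed broker ids
PARTITIONS = 64
REPLICATION = 2

plan = {
    "version": 1,
    "partitions": [
        {
            "topic": TOPIC,
            "partition": p,
            "replicas": [BROKERS[(p + i) % len(BROKERS)] for i in range(REPLICATION)],
        }
        for p in range(PARTITIONS)
    ],
}
with open("plan.json", "w") as f:
    json.dump(plan, f, indent=2)
# then apply it with:
#   kafka-reassign-partitions.sh --zookeeper $ZK \
#     --reassignment-json-file plan.json --execute
```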
The backfill is running in staging (launched with the P1121 and P1122 scripts on storage1.staging on 2021-08-17 at 11:20 UTC):
swhstorage@storage1:~$ ./backfill.sh | tee output.log
swhstorage@storage1:~$ grep Starting output.log
Starting swh storage backfill extid --start-object 000000 --end-object 080000
Starting swh storage backfill extid --start-object 080001 --end-object 100000
Starting swh storage backfill extid --start-object 100001 --end-object 180000
Starting swh storage backfill extid --start-object 180001 --end-object 200000
Starting swh storage backfill extid --start-object 200001 --end-object 280000
Starting swh storage backfill extid --start-object 280001 --end-object 300000
Starting swh storage backfill extid --start-object 300001 --end-object 380000
Starting swh storage backfill extid --start-object 380001 --end-object 400000
Starting swh storage backfill extid --start-object 400001 --end-object 480000
Starting swh storage backfill extid --start-object 480001 --end-object 500000
Starting swh storage backfill extid --start-object 500001 --end-object 580000
Starting swh storage backfill extid --start-object 580001 --end-object 600000
Starting swh storage backfill extid --start-object 600001 --end-object 680000
Starting swh storage backfill extid --start-object 680001 --end-object 700000
Starting swh storage backfill extid --start-object 700001 --end-object 780000
Starting swh storage backfill extid --start-object 780001 --end-object 800000
Starting swh storage backfill extid --start-object 800001 --end-object 880000
Starting swh storage backfill extid --start-object 880001 --end-object 900000
Starting swh storage backfill extid --start-object 900001 --end-object 980000
Starting swh storage backfill extid --start-object 980001 --end-object a00000
Starting swh storage backfill extid --start-object a00001 --end-object a80000
Starting swh storage backfill extid --start-object a80001 --end-object b00000
Starting swh storage backfill extid --start-object b00001 --end-object b80000
Starting swh storage backfill extid --start-object b80001 --end-object c00000
Starting swh storage backfill extid --start-object c00001 --end-object c80000
Starting swh storage backfill extid --start-object c80001 --end-object d00000
Starting swh storage backfill extid --start-object d00001 --end-object d80000
Starting swh storage backfill extid --start-object d80001 --end-object e00000
Starting swh storage backfill extid --start-object e00001 --end-object e80000
Starting swh storage backfill extid --start-object e80001 --end-object f00000
Starting swh storage backfill extid --start-object f00001 --end-object f80000
Starting swh storage backfill extid --start-object f80001
To use with P1122:
vsellier@journal0 ~ % /opt/kafka/bin/kafka-topics.sh --zookeeper $ZK --alter --topic swh.journal.objects.extid --config cleanup.policy=compact --partitions 64
WARNING: Altering topic configuration from this script has been deprecated and may be removed in future releases.
         Going forward, please use kafka-configs.sh for this functionality
Updated config for topic swh.journal.objects.extid.
WARNING: If partitions are increased for a topic that has a key, the partition logic or ordering of the messages will be affected
Adding partitions succeeded!
vsellier@journal0 ~ % /opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep ReplicationFactor
Topic: swh.journal.objects.extid  PartitionCount: 64  ReplicationFactor: 1  Configs: cleanup.policy=compact,max.message.bytes=104857600
Aug 16 2021
We have tested with @vlorentz: it's OK if yarn's build target is updated to not call the build-so and build-wasm targets, and if the tree-sitter module is kept in the Docker image.
Ceph status went back to OK with these actions:
- Clean up the crash history
- To check the status:
ceph crash ls
ceph crash info <id>
- To clean up:
ceph crash archive-all
- Remove the "mons are allowing insecure global_id reclaim" error:
- Apply this configuration: https://ceph.io/releases/v14-2-20-nautilus-released/
ceph config set mon mon_warn_on_insecure_global_id_reclaim false
ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false
Aug 13 2021
Not sure if I'm missing something or if something is missing from this diff (I can't find its parent in my repo), but I have applied it and the build is still failing when the yarn command is launched, which sounds logical as the yarn config still launches the tree-sitter command directly.
fix version selection
Current import status before this weekend's run:
There are no more errors. The fix will be deployed in production with swh-search:v0.11.0 (T3433).
Aug 12 2021
Remove unused import of Set
Thanks, looks good; some minor formatting suggestions inline.
Aug 11 2021
LGTM
LGTM
The complete import has been running almost continuously with 5 Cassandra nodes since Monday.
Aug 10 2021
A Prometheus exporter for Proxmox is available at https://github.com/prometheus-pve/prometheus-pve-exporter
An interesting read: https://blog.zwindler.fr/2020/01/06/proxmox-ve-prometheus/
LGTM
As expected, there is an increase in the number of OOM kills on the workers [1]:
Another example in production: during the stop phase of a worker, the loader was alone on the server (with 12GB of RAM) and was OOM-killed:
Aug 10 08:53:24 worker05 python3[871]: [2021-08-10 08:53:24,745: INFO/ForkPoolWorker-1] Load origin 'https://github.com/evands/Specs' with type 'git'
Aug 10 08:54:17 worker05 python3[871]: [62B blob data]
Aug 10 08:54:17 worker05 python3[871]: [586B blob data]
Aug 10 08:54:17 worker05 python3[871]: [473B blob data]
Aug 10 08:54:29 worker05 python3[871]: Total 782419 (delta 6), reused 5 (delta 5), pack-reused 782401
Aug 10 08:54:29 worker05 python3[871]: [2021-08-10 08:54:29,044: INFO/ForkPoolWorker-1] Listed 6 refs for repo https://github.com/evands/Specs
Aug 10 08:59:21 worker05 kernel: [ 871] 1004 871 247194 161634 1826816 46260 0 python3
Aug 10 09:08:29 worker05 systemd[1]: swh-worker@loader_git.service: Unit process 871 (python3) remains running after unit stopped.
Aug 10 09:15:29 worker05 kernel: [ 871] 1004 871 412057 372785 3145728 0 0 python3
Aug 10 09:16:57 worker05 kernel: [ 871] 1004 871 823648 784496 6443008 0 0 python3
Aug 10 09:24:44 worker05 kernel: CPU: 2 PID: 871 Comm: python3 Not tainted 5.10.0-0.bpo.7-amd64 #1 Debian 5.10.40-1~bpo10+1
Aug 10 09:24:44 worker05 kernel: [ 871] 1004 871 2800000 2760713 22286336 0 0 python3
Aug 10 09:24:44 worker05 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-2,oom_memcg=/system.slice/system-swh\x2dworker.slice,task_memcg=/system.slice/system-swh\x2dworker.slice/swh-worker@loader_git.service,task=python3,pid=871,uid=1004
Aug 10 09:24:44 worker05 kernel: Memory cgroup out of memory: Killed process 871 (python3) total-vm:11200000kB, anon-rss:11038844kB, file-rss:4008kB, shmem-rss:0kB, UID:1004 pgtables:21764kB oom_score_adj:0
Aug 10 09:24:45 worker05 kernel: oom_reaper: reaped process 871 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
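To quantify how often each service gets killed, a quick sketch (assuming journalctl access to the kernel log on the workers) that tallies cgroup OOM kills per target memcg:

```python
# Sketch: count memory-cgroup OOM kills per target cgroup from the kernel
# journal, e.g. to watch the swh-worker services.
import collections
import re
import subprocess

out = subprocess.run(
    ["journalctl", "-k", "--no-pager", "-o", "cat"],
    capture_output=True, text=True, check=True,
).stdout

counts = collections.Counter()
for line in out.splitlines():
    m = re.search(r"oom-kill:.*task_memcg=([^,]+)", line)
    if m:
        counts[m.group(1)] += 1

for memcg, n in counts.most_common():
    print(f"{n:4d}  {memcg}")
```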
Aug 9 2021
Aug 6 2021
The cleanup of the old counters is done, so this can be closed.