- Fix the database password resolution after the database name update
- Restore the profile::grafana::objects call to manage the orgs and database declarations. It's not ideal as it introduces a dependency on the reverse proxy
- Queries
- All Stories
- Search
- Advanced Search
- Transactions
- Transaction Logs
Advanced Search
Jan 7 2022
Jan 6 2022
Thanks for the validation. I have some pending changes in progress and need to reply to olasd's remarks, so I'm changing the status to planned changes
- fix database name (not directly used by the configuration)
- fix the prometheus snippets configuration
The fix was released in version v0.23.0 and deployed in staging and production.
Everything looks good.
Jan 5 2022
Update according to olasd's feedback
Upgrade bullseye template to 11.2
Jan 4 2022
A dirty fix in the code to force the table sampling looks efficient
The initial index was an attempt to improve the query for the origins_without_last_update policy. It's not used by the other policies (never_visited_oldest_update_first / already_visited_order_by_lag), so it does not help them
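For context, the kind of sampling the dirty fix forces can be reproduced by hand against the scheduler database; a minimal sketch, assuming a PostgreSQL database and using placeholder connection, table and column names rather than the real swh-scheduler schema:

# Sketch only: compare a sampled scan with a full scan of the origin table.
# The "swh-scheduler" connection service and the table/column names below are
# placeholders, not necessarily the real schema.
psql "service=swh-scheduler" -c "
  EXPLAIN ANALYZE
  SELECT id, url
  FROM listed_origins TABLESAMPLE SYSTEM (1)  -- scan ~1% of the table pages
  WHERE last_update IS NULL
  LIMIT 1000;
"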
Jan 3 2022
The first diagnosis is that it seems related to the origin selection query taking a long time to respond (usually > 30min)
good catches, thanks
rebase
Dec 23 2021
The diffs are ready to be reviewed.
The migration will be performed at the beginning of January.
allow the database monitoring
install the auto-generated dashboards with puppet
add the grafana-piechart-panel plugin installation
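For reference, what the puppet change automates is roughly equivalent to the following manual steps (a sketch, not the puppet code itself):

# Roughly what the puppet-managed plugin installation boils down to (sketch).
grafana-cli plugins install grafana-piechart-panel
systemctl restart grafana-server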
rebase
Dec 22 2021
Thanks, the build is now green \o/
Thanks
Thanks for creating the diff and submitting the issue on the frozen dict repo.
It seems the rancher network issue is fixed in version 2.6.3, which is quite good news
swhworker@poc-rancher:~$ ./test-network.sh
=> Start network overlay test
poc-rancher-sw0 can reach poc-rancher-sw0
poc-rancher-sw0 can reach poc-rancher-sw1
poc-rancher-sw1 can reach poc-rancher-sw0
poc-rancher-sw1 can reach poc-rancher-sw1
=> End network overlay test
It seems the network issue is fixed in version 2.6.3, which is quite good news
Closing due to inactivity. Feel free to reopen if needed.
Closing, I guess we can live with this ;)
Feel free to reopen if you disagree
Package upgraded in T3705.
Dec 21 2021
Closing this task as all the possible upgrades are done.
The delayed upgrades will be followed up in a dedicated task, as they will be integrated into a more global task related to the elastic infrastructure or to the pergamon splitting task
root@kelvingrove:~# task=T3807
root@kelvingrove:/etc# puppet agent --disable "$task: dist-upgrade to bullseye"
root@kelvingrove:/etc# sed -i -e 's/buster/bullseye/;s,bullseye/updates,bullseye-security,' /etc/apt/sources.list.d/*
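The rest of the dist-upgrade follows the usual Debian procedure; a sketch of the typical continuation (not the exact command history captured in this log):

# Typical continuation of a buster -> bullseye dist-upgrade (sketch, not the
# exact commands run on kelvingrove).
apt update
apt full-upgrade
systemctl reboot
# once the host is back up, re-enable and run puppet
puppet agent --enable
puppet agent --test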
The tests with vagrant are not showing any issue with puppet / keycloak after the upgrade, so let's proceed with the upgrade.
In D6866#178393, @olasd wrote:
This deserves an upstream bug on frozendict 2.1.2, if you've managed to track it down...
fix a typo
A memory alert is logged on the iDRAC
Correctable memory error logging disabled for a memory device at location DIMM_A9. Fri 17 Dec 2021 16:15:39
We will have to monitor it in the future to check whether this memory DIMM has some weaknesses
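A possible way to keep an eye on it is to check the iDRAC system event log periodically, assuming racadm is available for this host:

# Check the iDRAC System Event Log for new memory-related entries
# (assumes racadm is installed or reachable for this host).
racadm getsel | grep -i -B1 -A2 memory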
on moma:
- puppet disabled
root@moma:/etc/softwareheritage/storage# puppet agent --disable 'T3801 upgrade database servers'
- storage configuration updated to use the belvedere database and the service restarted
Dec 20 2021
It seems the problem is related to the new version 2.1.2 of the frozendict library, released on December 18th.
Pinning the version to the previous 2.1.1 solved the problem.
For the segfault, I suspect an issue due to the OS difference between the docker container and the host (Debian 10 / Debian 11)
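To apply the pin mentioned above while waiting for an upstream fix, something like the following can be used (a sketch; the actual place where the constraint ended up, e.g. a requirements file or the Docker image, is not shown in this log):

# Pin frozendict to the last known-good release (sketch; the real fix may
# live in a requirements file rather than an ad-hoc install).
pip install 'frozendict==2.1.1'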
root@e35f7a024575:/home/jenkins/swh-environment/swh-indexer# gdb python3 core
(gdb) where
#0 raise (sig=11) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 <signal handler called>
#2 0x00007f6548d70d46 in frozendict_new_barebone (type=0x7f6548d800e0 <PyFrozenDict_Type>) at /project/frozendict/src/3_7/frozendictobject.c:2214
#3 _frozendict_new (use_empty_frozendict=1, kwds=0x0, args=<optimized out>, type=0x7f6548d800e0 <PyFrozenDict_Type>) at /project/frozendict/src/3_7/frozendictobject.c:2255
#4 frozendict_new (type=0x7f6548d800e0 <PyFrozenDict_Type>, args=<optimized out>, kwds=0x0) at /project/frozendict/src/3_7/frozendictobject.c:2290
#5 0x00000000005d9bd7 in _PyObject_FastCallKeywords ()
#136 0x000000000065468e in _Py_UnixMain ()
#137 0x00007f654efe109b in __libc_start_main (main=0x4bc560 <main>, argc=9, argv=0x7ffe6f651488, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe6f651478) at ../csu/libc-start.c:308
#138 0x00000000005e0e8a in _start ()
(gdb)
I'm trying to reproduce the problem locally in a VM to check if a workaround can be found.
Dec 17 2021
Fact installed on the staging nodes:
root@pergamon:/etc/clustershell# clush -b -w @staging 'if [ -e /etc/systemd/system/cloud-init.target.wants/cloud-init.service ]; then echo "cloud-init installed"; echo cloudinit_enabled=true > /etc/facter/facts.d/cloud-init.txt; else echo "cloud-init not installed"; fi'
---------------
counters0.internal.staging.swh.network,deposit.internal.staging.swh.network,objstorage0.internal.staging.swh.network,poc-rancher-sw[0-1].internal.staging.swh.network,poc-rancher.internal.staging.swh.network,rp0.internal.staging.swh.network,scheduler0.internal.staging.swh.network,search0.internal.staging.swh.network,vault.internal.staging.swh.network,webapp.internal.staging.swh.network,worker[0-3].internal.staging.swh.network (15)
---------------
cloud-init installed
---------------
db1.internal.staging.swh.network,storage1.internal.staging.swh.network (2)
---------------
cloud-init not installed
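Once the external fact file is in place, it can be checked on a node before being used in the puppet manifests (assuming facter picks up external facts from /etc/facter/facts.d, as it does by default):

# Verify that the external fact is resolved (facter reads external facts
# from /etc/facter/facts.d by default).
facter cloudinit_enabled
# expected output on nodes where cloud-init is installed: true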
rebase
During the week, only one request took more than 1s.
As it looks rare, it seems related to the load on the server during the build, so I'm not sure it's worth investigating further.
workers:
Before the migration
root@pergamon:~# clush -b -w @staging-workers 'set -e; puppet agent --disable "T3812"; puppet agent --disable T3771; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop --no-block swh-worker@*; sleep 300; systemctl kill swh-worker@* -s 9'
fix a typo in the commit message
Testing with this config file:
#cloud-config-jsonp
[{ "op": "replace", "path": "/manage_etc_hosts", "value": "False"}]
gives this error:
2021-12-16 22:35:11,471 - __init__.py[DEBUG]: Calling handler CloudConfigPartHandler: [['text/cloud-config', 'text/cloud-config-jsonp']] (text/cloud-config-jsonp, part-001, 3) with frequency always
2021-12-16 22:35:11,472 - cloud_config.py[DEBUG]: Merging by applying json patch [{"op": "replace", "path": "/manage_etc_hosts", "value": "False"}]
2021-12-16 22:35:11,472 - util.py[WARNING]: Failed at merging in cloud config part from part-001
2021-12-16 22:35:11,474 - util.py[DEBUG]: Failed at merging in cloud config part from part-001
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/handlers/cloud_config.py", line 138, in handle_part
    self._merge_patch(payload)
  File "/usr/lib/python3/dist-packages/cloudinit/handlers/cloud_config.py", line 113, in _merge_patch
    self.cloud_buf = patch.apply(self.cloud_buf, in_place=False)
  File "/usr/lib/python3/dist-packages/jsonpatch.py", line 312, in apply
    obj = operation.apply(obj)
  File "/usr/lib/python3/dist-packages/jsonpatch.py", line 483, in apply
    raise JsonPatchConflict(msg)
jsonpatch.JsonPatchConflict: can't replace non-existent object 'manage_etc_hosts'
2021-12-16 22:35:11,475 - __init__.py[DEBUG]: Calling handler CloudConfigPartHandler: [['text/cloud-config', 'text/cloud-config-jsonp']] (__end__, None, 3) with frequency always
Dec 16 2021
It seems cloud-init does not support overriding a property defined in the user-data configuration:
thanks