It seems that since December 8th, there have been no requests > 1s in the builds.
I will monitor it during the current week; if it does not occur again, I will change the status to resolved.
- Queries
- All Stories
- Search
- Advanced Search
- Transactions
- Transaction Logs
Advanced Search
Dec 14 2021
Dec 13 2021
Upgrade done following the T3799 procedure.
After several tests in Vagrant, the upgrade looks OK, even though I couldn't manage to set up a complete local DNS environment.
LGTM thanks
If not defined, this variable is set by the elasticsearch launch script: https://github.com/elastic/elasticsearch/pull/80699/files#diff-ddfc3a6ea1404997e56f2e771adede06b173f0fea37b4779d827c85d6cc52897R35
I guess that since the fixture is not starting elasticsearch[1] through the startup script, the variable is not defined.
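A minimal sketch of what a fixture spawning elasticsearch directly would have to replicate: the launch script defaults the variable to the bundled JDK when it is unset. The helper name, fixture shape, and paths below are assumptions for illustration, not the actual swh test code.

```python
import os

def elasticsearch_env(es_home, base_env=None):
    """Build the environment to spawn elasticsearch with.

    Mirrors the launch-script behaviour: ES_JAVA_HOME defaults to the
    JDK bundled under the elasticsearch home, but an already-set value
    is preserved. (Hypothetical helper; paths are assumptions.)
    """
    env = dict(base_env if base_env is not None else os.environ)
    env.setdefault("ES_JAVA_HOME", os.path.join(es_home, "jdk"))
    return env

print(elasticsearch_env("/opt/elasticsearch", base_env={})["ES_JAVA_HOME"])
```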
olasd: I'm transferring ownership of this task to you since you are handling the subject. Feel free to close the task if the installation can be considered done.
Dec 10 2021
All the hypervisors have been migrated and the services restored.
root@pergamon:/usr/local/sbin# ./swh-puppet-master-decommission louvre.internal.softwareheritage.org
+ puppet node deactivate louvre.internal.softwareheritage.org
Submitted 'deactivate node' for louvre.internal.softwareheritage.org with UUID edca37d0-0976-4598-aadd-aef13a033a34
+ puppet node clean louvre.internal.softwareheritage.org
Notice: Revoked certificate with serial 156
Notice: Removing file Puppet::SSL::Certificate louvre.internal.softwareheritage.org at '/var/lib/puppet/ssl/ca/signed/louvre.internal.softwareheritage.org.pem'
louvre.internal.softwareheritage.org
+ puppet cert clean louvre.internal.softwareheritage.org
Warning: `puppet cert` is deprecated and will be removed in a future release.
(location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
Notice: Revoked certificate with serial 156
+ systemctl restart apache2
- vm 108 removed
The ceph packages also need to be updated on the proxmox nodes even if they are not in the ceph cluster (from the output of pve6to7).
Dec 9 2021
It's fine by me to close it.
No request took more than 1s during last night's build.
I will continue to monitor the builds and try to diagnose the problem more precisely.
A couple of remarks; sorry in advance if they only arise because this is a bootstrap and everything is not yet finalized.
Output of the pve6to7 script on uffizi:
Preconditions checklist from the proxmox upgrade guide:
- Upgraded to the latest version of Proxmox VE 6.4 (check correct package repository configuration)
On all nodes:
root@pergamon:/etc/clustershell# clush -b -w @hypervisors "pveversion"
---------------
branly,pompidou,uffizi (3)
---------------
pve-manager/6.4-13/9f411e79 (running kernel: 5.4.103-1-pve)
---------------
beaubourg
---------------
pve-manager/6.4-13/9f411e79 (running kernel: 5.4.143-1-pve)
---------------
hypervisor3
---------------
pve-manager/6.4-13/9f411e79 (running kernel: 5.4.128-1-pve)
- TODO Hyper-converged Ceph: upgrade the Ceph Nautilus cluster to Ceph 15.2 Octopus before you start the Proxmox VE upgrade to 7.0. Follow the guide Ceph Nautilus to Octopus
- No backup server (checklist item: Co-installed Proxmox Backup Server: see the Proxmox Backup Server 1.1 to 2.x upgrade how-to)
- Reliable access to the node (through ssh, iKVM/IPMI or physical access)
- A healthy cluster
- Valid and tested backup of all VMs and CTs (in case something goes wrong)
- At least 4 GiB free disk space on the root mount point
- Check known upgrade issues
- From further down in the doc: test with the pve6to7 migration checklist
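The free-disk-space precondition in the checklist above can be verified mechanically. A minimal sketch (the 4 GiB threshold comes from the upgrade guide; the helper name is mine):

```python
import shutil

GIB = 1024 ** 3

def root_has_free_space(path="/", required_bytes=4 * GIB):
    """Check the upgrade-guide precondition: enough free space on the
    root mount point (at least 4 GiB by default)."""
    return shutil.disk_usage(path).free >= required_bytes

print(root_has_free_space())
```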
Dec 8 2021
WDYT about putting this value in production/common.yaml to also align webapp1 with it?
The lister was fixed with the deployment of the swh-scheduler v0.22.0.
Deployment of version v0.22.0 in production
Deployment of version v0.22.0 in staging
The timeout occurs after 1s on the swh-web side on a directory/ls call.
04:11:10 nginx_1 | 172.23.0.1 - - [08/Dec/2021:03:11:09 +0000] "GET /api/1/directory/877df54c7dda406e9ad56ca09f793799aedbb26b/ HTTP/1.1" 500 4996 "-" "curl/7.64.0" 1.013
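Slow calls like this can be spotted by parsing the trailing request-time field of the access log, as in the line above where the request took 1.013s. A sketch, assuming the last whitespace-separated field is the request time in seconds (the 1.0s threshold matches the swh-web timeout being investigated):

```python
def slow_requests(log_lines, threshold=1.0):
    """Return (line, duration) pairs for requests exceeding threshold.

    Assumes the nginx log format ends with the request time in seconds;
    lines without a trailing float are skipped.
    """
    slow = []
    for line in log_lines:
        last_field = line.rsplit(" ", 1)[-1]
        try:
            duration = float(last_field)
        except ValueError:
            continue  # not a timed access-log line
        if duration > threshold:
            slow.append((line, duration))
    return slow

line = ('172.23.0.1 - - [08/Dec/2021:03:11:09 +0000] '
        '"GET /api/1/directory/877df54c7dda406e9ad56ca09f793799aedbb26b/ '
        'HTTP/1.1" 500 4996 "-" "curl/7.64.0" 1.013')
print(slow_requests([line]))
```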
Dec 7 2021
Thanks
More info here: https://www.jenkins.io/doc/book/managing/built-in-node-migration/
The last builds were successful and do not indicate any overly long response times.
Let's see tomorrow whether the response times are slower at the usual build time.
The dependency is already declared in the docker-compose file, but it only ensures the container is started before launching swh-web-cron, not that the service is fully initialized and responding.
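A generic readiness wait fills that gap: poll until a check succeeds instead of assuming the service is usable as soon as its container is up. A sketch; the check callable is a placeholder that would in practice hit an swh-web health endpoint:

```python
import time

def wait_until_ready(check, timeout=30.0, interval=0.5):
    """Poll check() until it returns True or timeout elapses.

    check is any zero-argument callable, e.g. a function probing an
    HTTP endpoint of the service (placeholder here).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

Run before starting the dependent service (here, swh-web-cron), this turns "container started" into "service responding".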
Thanks @anlambert for the tips; it works as expected.
It's related to this change because if it's left at DEBUG, the access log is logged twice.
LGTM thanks
LGTM
On saatchi:
root@saatchi:/etc/systemd/system# rm -v swh-scheduler-updater* ssh-ghtorrent.service
removed 'swh-scheduler-updater-consumer-ghtorrent.service'
removed 'swh-scheduler-updater-writer.service'
removed 'swh-scheduler-updater-writer.timer'
removed 'ssh-ghtorrent.service'
On scheduler0:
root@scheduler0:/etc/systemd/system# systemctl stop ssh-ghtorrent
root@scheduler0:/etc/systemd/system# systemctl disable ssh-ghtorrent
root@scheduler0:/etc/systemd/system# systemctl stop swh-scheduler-updater*
root@scheduler0:/etc/systemd/system# systemctl disable swh-scheduler-updater*
root@scheduler0:/etc/systemd/system# rm -v swh-scheduler-updater*
removed 'swh-scheduler-updater-consumer-ghtorrent.service'
removed 'swh-scheduler-updater-writer.service'
removed 'swh-scheduler-updater-writer.timer'
root@scheduler0:/etc/systemd/system# rm ssh-ghtorrent.service
removed 'ssh-ghtorrent.service'
root@scheduler0:/etc/systemd/system# systemctl reset-failed
Version v0.21.0 deployed in staging:
root@scheduler0:~# apt list --upgradable 2>/dev/null | grep swh | cut -f1 -d'/' | xargs -t apt install
apt install python3-swh.core python3-swh.counters python3-swh.journal python3-swh.lister python3-swh.loader.core python3-swh.model python3-swh.objstorage python3-swh.scheduler python3-swh.storage
...
root@scheduler0:~# systemctl reload gunicorn-swh-scheduler.service
The problem was reproduced in staging before the deployment:
swhworker@worker1:~$ swh lister -C /etc/softwareheritage/lister.yml run -l npm
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.15.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/lister/cli.py", line 65, in run
    get_lister(lister, **config).run()
  File "/usr/lib/python3/dist-packages/swh/lister/pattern.py", line 130, in run
    full_stats.origins += self.send_origins(origins)
  File "/usr/lib/python3/dist-packages/swh/lister/pattern.py", line 234, in send_origins
    ret = self.scheduler.record_listed_origins(batch_origins)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 181, in meth_
    return self.post(meth._endpoint_path, post_data)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 278, in post
    return self._decode_response(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 354, in _decode_response
    self.raise_for_status(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 344, in raise_for_status
    raise exception from None
swh.core.api.RemoteException: <RemoteException 500 CardinalityViolation: ['ON CONFLICT DO UPDATE command cannot affect row a second time\nHINT: Ensure that no rows proposed for insertion within the same command have duplicate constrained values.\n']>
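For context: Postgres raises this CardinalityViolation when a single INSERT ... ON CONFLICT DO UPDATE statement carries two rows hitting the same constrained key. Deduplicating the batch client-side before sending it avoids the error. A sketch only; the (lister_id, url) key mirrors the listed-origins case but is an assumption about the actual schema, and the helper name is mine:

```python
def dedupe_batch(rows, key=lambda row: (row["lister_id"], row["url"])):
    """Keep one row per constrained key, letting later duplicates win
    (the same net effect a sequence of plain UPDATEs would have)."""
    deduped = {}
    for row in rows:
        deduped[key(row)] = row  # later occurrence overwrites earlier one
    return list(deduped.values())

batch = [
    {"lister_id": 1, "url": "https://npmjs.com/p", "visits": 1},
    {"lister_id": 1, "url": "https://npmjs.com/p", "visits": 2},
]
print(dedupe_batch(batch))
```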
Dec 6 2021
Keep the dict-comprehension version.
The last run failed with this error:
Unstuck the task scheduling:
softwareheritage-scheduler=> begin; update task set next_run=now(), status='next_run_not_scheduled' where id=153874548;
BEGIN
UPDATE 1
softwareheritage-scheduler=*> commit;
COMMIT