There are no more errors. The fix will be deployed in production with the deployment of swh-search:v0.11.0 (T3433).
Aug 13 2021
Aug 12 2021
Remove unused import of Set
Thanks, looks good; a few minor formatting suggestions inline
Aug 11 2021
LGTM
LGTM
The complete import has been running almost continuously with 5 cassandra nodes since Monday.
Aug 10 2021
A prometheus exporter for proxmox is available at https://github.com/prometheus-pve/prometheus-pve-exporter
An interesting read: https://blog.zwindler.fr/2020/01/06/proxmox-ve-prometheus/
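For reference, a minimal sketch of how the exporter could be tried out on a Proxmox node, assuming a pip installation and the upstream defaults (a pve.yml file holding the API credentials, default listen port 9221); the command names and paths are illustrative and should be checked against the upstream README:
pip3 install prometheus-pve-exporter
# pve.yml holds the Proxmox API user/token (format described in the upstream README)
pve_exporter /etc/prometheus/pve.yml &
# the exporter serves per-target metrics on /pve?target=<pve host>
curl -s 'http://localhost:9221/pve?target=localhost' | head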
LGTM
As expected, there is an increase in the number of OOM kills on the workers [1]:
Another example in production: during the stop phase of a worker, the loader was alone on the server (with 12GB of RAM) and was OOM killed:
Aug 10 08:53:24 worker05 python3[871]: [2021-08-10 08:53:24,745: INFO/ForkPoolWorker-1] Load origin 'https://github.com/evands/Specs' with type 'git'
Aug 10 08:54:17 worker05 python3[871]: [62B blob data]
Aug 10 08:54:17 worker05 python3[871]: [586B blob data]
Aug 10 08:54:17 worker05 python3[871]: [473B blob data]
Aug 10 08:54:29 worker05 python3[871]: Total 782419 (delta 6), reused 5 (delta 5), pack-reused 782401
Aug 10 08:54:29 worker05 python3[871]: [2021-08-10 08:54:29,044: INFO/ForkPoolWorker-1] Listed 6 refs for repo https://github.com/evands/Specs
Aug 10 08:59:21 worker05 kernel: [ 871] 1004 871 247194 161634 1826816 46260 0 python3
Aug 10 09:08:29 worker05 systemd[1]: swh-worker@loader_git.service: Unit process 871 (python3) remains running after unit stopped.
Aug 10 09:15:29 worker05 kernel: [ 871] 1004 871 412057 372785 3145728 0 0 python3
Aug 10 09:16:57 worker05 kernel: [ 871] 1004 871 823648 784496 6443008 0 0 python3
Aug 10 09:24:44 worker05 kernel: CPU: 2 PID: 871 Comm: python3 Not tainted 5.10.0-0.bpo.7-amd64 #1 Debian 5.10.40-1~bpo10+1
Aug 10 09:24:44 worker05 kernel: [ 871] 1004 871 2800000 2760713 22286336 0 0 python3
Aug 10 09:24:44 worker05 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-2,oom_memcg=/system.slice/system-swh\x2dworker.slice,task_memcg=/system.slice/system-swh\x2dworker.slice/swh-worker@loader_git.service,task=python3,pid=871,uid=1004
Aug 10 09:24:44 worker05 kernel: Memory cgroup out of memory: Killed process 871 (python3) total-vm:11200000kB, anon-rss:11038844kB, file-rss:4008kB, shmem-rss:0kB, UID:1004 pgtables:21764kB oom_score_adj:0
Aug 10 09:24:45 worker05 kernel: oom_reaper: reaped process 871 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
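The kill comes from the memory cgroup limit of the worker slice (CONSTRAINT_MEMCG above). A hedged sketch of how the effective limits could be checked on a worker with standard systemd commands (the values actually configured on the workers are not shown here):
# memory accounting/limits of the unit that was killed and of its parent slice
systemctl show swh-worker@loader_git.service -p MemoryMax -p MemoryHigh -p MemoryCurrent
systemctl show 'system-swh\x2dworker.slice' -p MemoryMax -p MemoryCurrent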
Aug 9 2021
Aug 6 2021
The cleanup of the old counters is done, so this can be closed.
- D6064 landed
- manual cleanup:
- the apache vhost was removed by puppet
- /var/www/stats.export.softwareheritage.org directory removed
- the crontab was removed by puppet
- /usr/local/bin/export_archive_counters.py file removed
- /usr/local/share/swh-data directory removed
- the refresh of the database counters is now scheduled each Monday at 6:29 AM (one counter per run, the least recently updated one):
postgres@belvedere:~$ crontab -l | grep counter
29 6 * * mon /usr/bin/chronic /usr/bin/flock -xn /srv/softwareheritage/postgres/swh-update-counter.lock /usr/bin/psql -p 5433 softwareheritage -c "select swh_update_counter(object_type) from object_counts where single_update = true order by last_update limit 1"
rebase
The db server prometheus configuration needs some adaptation, as scylla comes with its own prometheus node exporter (and removes the default package :()
root@parasilo-2:/opt# apt install scylla-node-exporter
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libio-pty-perl libipc-run-perl moreutils
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  prometheus-node-exporter
The following NEW packages will be installed:
  scylla-node-exporter
0 upgraded, 1 newly installed, 1 to remove and 7 not upgraded.
Need to get 0 B/4,076 kB of archives.
After this operation, 3,243 kB of additional disk space will be used.
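A quick hedged check after the package swap, to confirm what now exposes the node metrics and on which port (the service name is assumed to match the package name; scylla also exposes its own metrics separately):
systemctl status scylla-node-exporter
ss -tlnp | grep -i exporter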
There are also a lot of errors in the scylla logs related to read timeouts (with no activity on the database except the monitoring):
Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]: [shard 5] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]: [shard 6] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
After some struggle to configure and correctly start the scylla servers (different binding, configuration adaptations), the schema was correctly created (I needed to add SWH_USE_SCYLLADB=1 in the initialisation script).
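For the record, a sketch of one way to pass the flag; the initialisation script itself is not reproduced here and its name below is hypothetical:
export SWH_USE_SCYLLADB=1
./init_cassandra_schema.sh   # hypothetical name for the initialisation script mentioned above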
Compared to cassandra, it seems the nodetool command doesn't correctly return the data repartition on the cluster, because the system keyspaces don't have the same replication factor as the swh one:
vsellier@parasilo-2:~$ nodetool status
Using /etc/scylla/scylla.yaml as the config file
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load     Tokens  Owns  Host ID                               Rack
UN  172.16.97.2  2.36 MB  256     ?     866bbcc4-d496-4ebb-ab3b-12ef4942beaa  rack1
UN  172.16.97.3  3.37 MB  256     ?     21fdd0a9-15cd-473f-814c-c8ac24870aca  rack1
UN  172.16.97.4  3.48 MB  256     ?     1ed61715-01a0-4c15-a4bc-f9972f575437  rack1
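The ownership column should become meaningful when the keyspace is passed explicitly, since the effective ownership can only be computed per keyspace, e.g.:
nodetool status swh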
scylladb test
run7 results - cassandra heap from 16g to 32g
run6 results - commitlog on a HDD
Aug 5 2021
Pergamon manual cleanup after D6064 is applied:
- Remove /var/www/stats.export.softwareheritage.org directory
- Remove apache vhosts:
- /etc/apache2/sites-enabled/25-stats.export.softwareheritage.org_non-ssl.conf
- /etc/apache2/sites-enabled/25-stats.export.softwareheritage.org_ssl.conf
- /etc/apache2/sites-available/25-stats.export.softwareheritage.org_non-ssl.conf
- /etc/apache2/sites-available/25-stats.export.softwareheritage.org_ssl.conf
- check crontab removal: export_archive_counters
- remove '/usr/local/bin/export_archive_counters.py'
- remove '/usr/local/share/swh-data' directory
- louvre configuration reverted
- pushkin configuration reverted
- access documented: D6063
Well, I misinterpreted the routing issue.
It seems it's only because the openvpn service is also started on pushkin (which is normal, so it is ready in case of a primary/secondary switch).
The route for the openvpn traffic is also declared on the secondary, so a packet coming from the VPN falls into a black hole:
Pushkin's IP was changed and the new IP was declared in puppet.
It seems the firewall is still not reachable with the new IP.
I'm trying to diagnose the problem.
- network configuration manually changed on louvre:
root@louvre:~# diff -u3 /tmp/interfaces /etc/network/interfaces
--- /tmp/interfaces     2021-08-05 12:44:20.213896058 +0000
+++ /etc/network/interfaces     2021-08-05 12:37:29.480805493 +0000
@@ -5,7 +5,7 @@
Some news about the tests running since the beginning of the week:
- The data retention of the federated prometheus had the default value, so all the data expired after 15 days. A new reference run was performed to be able to compare with the default scenario.
- The first try failed because it was the first time there were adaptations of the zfs configuration, and they were not correctly deployed via the ansible scripts. It was solved by completely cleaning up the zfs configuration and relaunching the deployment. Unfortunately, this has to be launched manually before launching a test with zfs changes.
- With the usage of best-effort jobs, it's possible to perform tests during the day without exceeding the quota (see the sketch below).
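For reference, a hedged sketch of a Grid'5000 best-effort submission (the resource spec and script name are purely illustrative):
oarsub -t besteffort -l nodes=4,walltime=8:00:00 './run-test.sh'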
Aug 4 2021
With @ardumont, we looked for anything relevant related to the incident.
The osd logs were rotated and removed from the servers, so there is nothing that can help diagnose the problem.
This shows it's important to ship all the logs to a third-party system like ELK.
Aug 3 2021
LGTM
It seems the systemd module upgrade adds an internal change:
*******************************************
  Systemd::Service_limits[rabbitmq-server.service] =>
   parameters =>
     selinux_ignore_defaults =>
      + false
but as we checked together, it seems it has no impact on the service configuration / the content of the drop-in files
It looks like it solves the concurrency issue and will allow keeping the logs
I just saw the paste, so I have the answer to my first question ;)
Is there any log output? Is it possible to monitor the duration of the command and, most importantly, to avoid it running several times in parallel?
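For the record, a generic sketch of one pattern addressing the output and parallel-run concerns (chronic only emits output when the command fails, flock -xn refuses to start if a previous run still holds the lock), as used in the counters crontab shown above; the command and lock path here are hypothetical:
/usr/bin/chronic /usr/bin/flock -xn /run/lock/my-command.lock /usr/local/bin/my-command.sh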