There are no more errors. The fix will be deployed in production with the deployment of swh-search:v0.11.0 (T3433).
Aug 13 2021
Aug 12 2021
Remove unused import of Set
Thanks, looks good; a few minor formatting suggestions inline
Aug 11 2021
LGTM
LGTM
The complete import has been running almost continuously with 5 cassandra nodes since Monday.
Aug 10 2021
A prometheus exporter for proxmox is available at https://github.com/prometheus-pve/prometheus-pve-exporter
An interesting read: https://blog.zwindler.fr/2020/01/06/proxmox-ve-prometheus/
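For reference, a minimal sketch of how the exporter could be tried out on a Proxmox node, assuming a pip installation and the upstream defaults (a pve.yml file holding the API credentials, default listen port 9221); the command names and paths are illustrative and should be checked against the upstream README:
pip3 install prometheus-pve-exporter
# pve.yml holds the Proxmox API user/token (format described in the upstream README)
pve_exporter /etc/prometheus/pve.yml &
# the exporter serves per-target metrics on /pve?target=<pve host>
curl -s 'http://localhost:9221/pve?target=localhost' | head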
LGTM
As expected, there is an increase in the number of OOM kills on the workers [1]:
Another example in production: during the stop phase of a worker, the loader was alone on the server (with 12GB of RAM) and was OOM killed:
Aug 10 08:53:24 worker05 python3[871]: [2021-08-10 08:53:24,745: INFO/ForkPoolWorker-1] Load origin 'https://github.com/evands/Specs' with type 'git'
Aug 10 08:54:17 worker05 python3[871]: [62B blob data]
Aug 10 08:54:17 worker05 python3[871]: [586B blob data]
Aug 10 08:54:17 worker05 python3[871]: [473B blob data]
Aug 10 08:54:29 worker05 python3[871]: Total 782419 (delta 6), reused 5 (delta 5), pack-reused 782401
Aug 10 08:54:29 worker05 python3[871]: [2021-08-10 08:54:29,044: INFO/ForkPoolWorker-1] Listed 6 refs for repo https://github.com/evands/Specs
Aug 10 08:59:21 worker05 kernel: [ 871] 1004 871 247194 161634 1826816 46260 0 python3
Aug 10 09:08:29 worker05 systemd[1]: swh-worker@loader_git.service: Unit process 871 (python3) remains running after unit stopped.
Aug 10 09:15:29 worker05 kernel: [ 871] 1004 871 412057 372785 3145728 0 0 python3
Aug 10 09:16:57 worker05 kernel: [ 871] 1004 871 823648 784496 6443008 0 0 python3
Aug 10 09:24:44 worker05 kernel: CPU: 2 PID: 871 Comm: python3 Not tainted 5.10.0-0.bpo.7-amd64 #1 Debian 5.10.40-1~bpo10+1
Aug 10 09:24:44 worker05 kernel: [ 871] 1004 871 2800000 2760713 22286336 0 0 python3
Aug 10 09:24:44 worker05 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-2,oom_memcg=/system.slice/system-swh\x2dworker.slice,task_memcg=/system.slice/system-swh\x2dworker.slice/swh-worker@loader_git.service,task=python3,pid=871,uid=1004
Aug 10 09:24:44 worker05 kernel: Memory cgroup out of memory: Killed process 871 (python3) total-vm:11200000kB, anon-rss:11038844kB, file-rss:4008kB, shmem-rss:0kB, UID:1004 pgtables:21764kB oom_score_adj:0
Aug 10 09:24:45 worker05 kernel: oom_reaper: reaped process 871 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
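The kill comes from the memory cgroup limit of the worker slice (CONSTRAINT_MEMCG above). A hedged sketch of how the effective limits could be checked on a worker with standard systemd commands (the values actually configured on the workers are not shown here):
# memory accounting/limits of the unit that was killed and of its parent slice
systemctl show swh-worker@loader_git.service -p MemoryMax -p MemoryHigh -p MemoryCurrent
systemctl show 'system-swh\x2dworker.slice' -p MemoryMax -p MemoryCurrent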
Aug 9 2021
Aug 6 2021
The cleanup of the old counters is done, so this can be closed.
- D6064 landed
- manual cleanup:
- the apache vhost was removed by puppet
- /var/www/stats.export.softwareheritage.org directory removed
- the crontab was removed by puppet
- /usr/local/bin/export_archive_counters.py file removed
- /usr/local/share/swh-data directory removed
- the refresh of the database counters is now scheduled each Monday at 6:29 AM (one counter per run, the least recently updated one):
postgres@belvedere:~$ crontab -l | grep counter
29 6 * * mon /usr/bin/chronic /usr/bin/flock -xn /srv/softwareheritage/postgres/swh-update-counter.lock /usr/bin/psql -p 5433 softwareheritage -c "select swh_update_counter(object_type) from object_counts where single_update = true order by last_update limit 1"
rebase
The db server prometheus configuration needs some adaptation, as scylla comes with its own prometheus node exporter (and removes the default package :()
root@parasilo-2:/opt# apt install scylla-node-exporter
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libio-pty-perl libipc-run-perl moreutils
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  prometheus-node-exporter
The following NEW packages will be installed:
  scylla-node-exporter
0 upgraded, 1 newly installed, 1 to remove and 7 not upgraded.
Need to get 0 B/4,076 kB of archives.
After this operation, 3,243 kB of additional disk space will be used.
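A quick hedged check after the package swap, to confirm what now exposes the node metrics and on which port (the service name is assumed to match the package name; scylla also exposes its own metrics separately):
systemctl status scylla-node-exporter
ss -tlnp | grep -i exporter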
There are also a lot of errors in the scylla logs related to read timeouts (with no activity on the database except the monitoring):
Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]: [shard 5] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]: [shard 6] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
After some struggle to configure and correctly start the scylla servers (different binding, configuration adaptations), the schema was correctly created (I needed to add SWH_USE_SCYLLADB=1 in the initialisation script).
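For the record, a sketch of one way to pass the flag; the initialisation script itself is not reproduced here and its name below is hypothetical:
export SWH_USE_SCYLLADB=1
./init_cassandra_schema.sh   # hypothetical name for the initialisation script mentioned above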
Compared to cassandra, it seems the nodetool command doesn't correctly return the data repartition on the cluster, because the system keyspaces don't have the same replication factor as the swh one:
vsellier@parasilo-2:~$ nodetool status
Using /etc/scylla/scylla.yaml as the config file
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load     Tokens  Owns  Host ID                               Rack
UN  172.16.97.2  2.36 MB  256     ?     866bbcc4-d496-4ebb-ab3b-12ef4942beaa  rack1
UN  172.16.97.3  3.37 MB  256     ?     21fdd0a9-15cd-473f-814c-c8ac24870aca  rack1
UN  172.16.97.4  3.48 MB  256     ?     1ed61715-01a0-4c15-a4bc-f9972f575437  rack1
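The ownership column should become meaningful when the keyspace is passed explicitly, since the effective ownership can only be computed per keyspace, e.g.:
nodetool status swh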
scylladb test
run7 results - cassandra heap from 16g to 32g
run6 results - commitlog on a HDD
Aug 5 2021
Pergamon manual cleanup after D6064 is applied:
- Remove /var/www/stats.export.softwareheritage.org directory
- Remove apache vhosts:
- /etc/apache2/sites-enabled/25-stats.export.softwareheritage.org_non-ssl.conf
- /etc/apache2/sites-enabled/25-stats.export.softwareheritage.org_ssl.conf
- /etc/apache2/sites-available/25-stats.export.softwareheritage.org_non-ssl.conf
- /etc/apache2/sites-available/25-stats.export.softwareheritage.org_ssl.conf
- check crontab removal: export_archive_counters
- remove '/usr/local/bin/export_archive_counters.py'
- remove '/usr/local/share/swh-data' directory
- louvre configuration reverted
- pushkin configuration reverted
- access documented: D6063
Well, I misinterpreted the routing issue.
It seems it's only because the openvpn service is also started on pushkin (which is normal, so it is ready in case of a primary/secondary switch).
The route for the openvpn traffic is also declared on the secondary, so a packet coming from the VPN falls into a black hole:
Pushkin's IP was changed and the new IP was declared in puppet.
It seems the firewall is still not reachable with the new IP.
I'm trying to diagnose the problem.
- network configuration manually changed on louvre:
root@louvre:~# diff -u3 /tmp/interfaces /etc/network/interfaces
--- /tmp/interfaces     2021-08-05 12:44:20.213896058 +0000
+++ /etc/network/interfaces     2021-08-05 12:37:29.480805493 +0000
@@ -5,7 +5,7 @@
Some news about the tests running since the beginning of the week:
- The data retention of the federated prometheus had the default value, so all the data expired after 15 days. A new reference run was performed to be able to compare with the default scenario.
- The first try failed because it was the first time there were adaptations of the zfs configuration, and they were not correctly deployed via the ansible scripts. It was solved by completely cleaning up the zfs configuration and relaunching the deployment. Unfortunately, this has to be launched manually before launching a test with zfs changes.
- With the usage of best-effort jobs, it's possible to perform tests during the day without exceeding the quota (see the sketch below).
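For reference, a hedged sketch of a Grid'5000 best-effort submission (the resource spec and script name are purely illustrative):
oarsub -t besteffort -l nodes=4,walltime=8:00:00 './run-test.sh'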
Aug 4 2021
With @ardumont, we looked for anything relevant related to the incident.
The osd logs were rotated and removed from the servers, so there is nothing that can help diagnose the problem.
This shows it's important to ship all the logs to a third-party system like ELK.
Aug 3 2021
LGTM
It seems the systemd module upgrade adds an internal change:
*******************************************
  Systemd::Service_limits[rabbitmq-server.service] =>
   parameters =>
     selinux_ignore_defaults =>
      + false
but as we checked together, it seems it has no impact on the service configuration / the content of the drop-in files
It looks like it solves the concurrency issue and will allow keeping the logs
I just saw the paste, so I have the answer to my first question ;)
Is there any log output? Is it possible to monitor the duration of the command and, most importantly, to avoid it running several times in parallel?
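For the record, a generic sketch of one pattern addressing the output and parallel-run concerns (chronic only emits output when the command fails, flock -xn refuses to start if a previous run still holds the lock), as used in the counters crontab shown above; the command and lock path here are hypothetical:
/usr/bin/chronic /usr/bin/flock -xn /run/lock/my-command.lock /usr/local/bin/my-command.sh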