Aug 6 2021

vsellier added a comment to T3417: Cleanup the old counters environment.

D6064 landed
manual cleanup:
- the apache vhost was removed by puppet
- /var/www/stats.export.softwareheritage.org directory removed
- the crontab was removed by puppet
- /usr/local/bin/export_archive_counters.py file removed
- /usr/local/share/swh-data directory removed
The database counter refresh is now scheduled every Monday at 6:29 AM:

postgres@belvedere:~$ crontab -l | grep counter
29 6  *   *  mon     /usr/bin/chronic /usr/bin/flock -xn /srv/softwareheritage/postgres/swh-update-counter.lock /usr/bin/psql -p 5433 softwareheritage -c "select swh_update_counter(object_type) from object_counts where single_update = true order by last_update limit 1"

Aug 6 2021, 6:31 PM · System administration, Monitoring
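
The crontab entry in the comment above relies on `flock -xn` so that a new run exits immediately if a previous refresh still holds the lock. The same exclusive, non-blocking behaviour can be sketched in Python with the standard `fcntl` module. This is only an illustration, not the deployed script: the helper name `try_exclusive_lock` and the temporary lock path are hypothetical.

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(path):
    """Try to take an exclusive, non-blocking lock on `path`.

    Returns the open file descriptor on success, or None when the lock
    is already held through another open file description.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_EX | LOCK_NB mirrors the -x and -n options of flock(1).
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


# Hypothetical lock file for the demo (the real job uses the path from the crontab).
lock_path = os.path.join(tempfile.mkdtemp(), "swh-update-counter.lock")

first = try_exclusive_lock(lock_path)
second = try_exclusive_lock(lock_path)  # separate open(), so it conflicts with `first`
print("first run got the lock:", first is not None)
print("second run was refused:", second is None)

os.close(first)  # closing the descriptor releases the lock
third = try_exclusive_lock(lock_path)
print("lock available again:", third is not None)
os.close(third)
```

Failing fast rather than waiting for the lock means a slow counter update simply causes the next scheduled run to be skipped instead of queuing behind it.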

vsellier closed D6064: Clean counter statistic scripts, data and vhosts on pergamon.

Aug 6 2021, 6:11 PM

vsellier committed rSPSITE278cfe53d5fe: Clean counter statistic scripts, data and vhosts on pergamon (authored by vsellier).

Clean counter statistic scripts, data and vhosts on pergamon

Aug 6 2021, 6:11 PM

vsellier updated the diff for D6064: Clean counter statistic scripts, data and vhosts on pergamon.

rebase

Aug 6 2021, 6:08 PM

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

It seems D6067 solves the issue with the partition key cartesian product size. @vlorentz Do you think a run with cassandra is necessary to evaluate a potential performance impact?

Aug 6 2021, 5:33 PM · System administration, Storage manager

vsellier committed rDSNIP326a9e39812a: grid5000/cassandra: save the temporary scylla configuration (authored by vsellier).

grid5000/cassandra: save the temporary scylla configuration

Aug 6 2021, 5:18 PM

vsellier committed rDSNIP78b96c8f1a2b: grid5000/cassandra: add dashboards dedicated to scylla (authored by vsellier).

grid5000/cassandra: add dashboards dedicated to scylla

Aug 6 2021, 4:35 PM

vsellier committed rDSNIP52b4eea9e3ad: grid5000/cassandra: add a dashboard to monitor the concurrent client connections (authored by vsellier).

grid5000/cassandra: add a dashboard to monitor the concurrent client connections

Aug 6 2021, 4:35 PM

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

The db server prometheus configuration needs some adaptation, as scylla ships with its own prometheus node exporter (and removes the default package :()

root@parasilo-2:/opt# apt install scylla-node-exporter
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libio-pty-perl libipc-run-perl moreutils
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  prometheus-node-exporter
The following NEW packages will be installed:
  scylla-node-exporter
0 upgraded, 1 newly installed, 1 to remove and 7 not upgraded.
Need to get 0 B/4,076 kB of archives.
After this operation, 3,243 kB of additional disk space will be used.

Aug 6 2021, 3:17 PM · System administration, Storage manager

vsellier updated subscribers of T3357: Perform some tests of the cassandra storage on Grid5000.

Thanks @vlorentz for D6067, I will test the fix once the cluster is more stable.

Aug 6 2021, 3:06 PM · System administration, Storage manager

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

There are also a lot of errors in the scylla logs related to read timeouts (with no activity on the database except the monitoring):

Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]:  [shard 5] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]:  [shard 6] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)

Aug 6 2021, 2:53 PM · System administration, Storage manager
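
To see how widespread such read timeouts are, log lines like the two quoted above can be grouped by peer and table. The following is a minimal sketch that assumes the log format is exactly the one shown in the comment; the regular expression and the function name `count_read_timeouts` are illustrative and not part of any scylla tooling.

```python
import re
from collections import Counter

# Matches the storage_proxy read-timeout lines quoted in the comment above.
LOG_LINE = re.compile(
    r"\[shard (?P<shard>\d+)\] storage_proxy - Exception when communicating with "
    r"(?P<peer>[\d.]+), to read from (?P<table>[\w.]+): "
    r"seastar::named_semaphore_timed_out"
)


def count_read_timeouts(lines):
    """Count semaphore read timeouts per (peer, table) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            counts[(match.group("peer"), match.group("table"))] += 1
    return counts


sample = [
    "Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]:  [shard 5] "
    "storage_proxy - Exception when communicating with 172.16.97.4, to read from "
    "swh.object_count: seastar::named_semaphore_timed_out "
    "(Semaphore timed out: _read_concurrency_sem)",
    "Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]:  [shard 6] "
    "storage_proxy - Exception when communicating with 172.16.97.4, to read from "
    "swh.object_count: seastar::named_semaphore_timed_out "
    "(Semaphore timed out: _read_concurrency_sem)",
]

for (peer, table), total in count_read_timeouts(sample).items():
    print(f"{peer} {table}: {total} timeout(s)")
```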

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

After some difficulty configuring and correctly starting the scylla servers (different binding, configuration adaptation), the schema was correctly created (I needed to add SWH_USE_SCYLLADB=1 to the initialisation script).
Compared to cassandra, it seems the nodetool command does not correctly report the data distribution across the cluster, because the system keyspaces do not have the same replication factor as the swh one:

vsellier@parasilo-2:~$  nodetool status
Using /etc/scylla/scylla.yaml as the config file
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns    Host ID                               Rack
UN  172.16.97.2  2.36 MB    256          ?       866bbcc4-d496-4ebb-ab3b-12ef4942beaa  rack1
UN  172.16.97.3  3.37 MB    256          ?       21fdd0a9-15cd-473f-814c-c8ac24870aca  rack1
UN  172.16.97.4  3.48 MB    256          ?       1ed61715-01a0-4c15-a4bc-f9972f575437  rack1

Aug 6 2021, 1:09 PM · System administration, Storage manager
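
The `?` in the Owns column of the output above is what signals the missing ownership information. A small parser makes that check scriptable. This is a sketch that assumes the column layout shown in the comment; `parse_nodetool_status` is a hypothetical helper, not a nodetool feature. On Cassandra, `nodetool status` also accepts a keyspace name to report effective ownership for that keyspace; whether the scylla wrapper behaves identically is not verified here.

```python
def parse_nodetool_status(text):
    """Extract node rows from `nodetool status` output as a list of dicts."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        # Node rows have 8 whitespace-separated fields and start with a
        # two-letter status/state code such as "UN" (Up/Normal).
        if len(fields) != 8 or len(fields[0]) != 2 or fields[0][0] not in "UD":
            continue
        state, address, load, unit, tokens, owns, host_id, rack = fields
        nodes.append({
            "state": state,
            "address": address,
            "load": f"{load} {unit}",
            "tokens": int(tokens),
            "owns": None if owns == "?" else owns,  # "?" means unknown ownership
            "host_id": host_id,
            "rack": rack,
        })
    return nodes


SAMPLE = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns    Host ID                               Rack
UN  172.16.97.2  2.36 MB    256          ?       866bbcc4-d496-4ebb-ab3b-12ef4942beaa  rack1
UN  172.16.97.3  3.37 MB    256          ?       21fdd0a9-15cd-473f-814c-c8ac24870aca  rack1
UN  172.16.97.4  3.48 MB    256          ?       1ed61715-01a0-4c15-a4bc-f9972f575437  rack1
"""

nodes = parse_nodetool_status(SAMPLE)
unknown = [node["address"] for node in nodes if node["owns"] is None]
print("nodes without ownership information:", unknown)
```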

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

scylladb test

Aug 6 2021, 11:56 AM · System administration, Storage manager

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

run7 results - cassandra heap from 16g to 32g

Aug 6 2021, 10:49 AM · System administration, Storage manager

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

run6 results - commitlog on a HDD

Aug 6 2021, 10:28 AM · System administration, Storage manager

Aug 5 2021

vsellier added a comment to T3417: Cleanup the old counters environment.

Pergamon manual cleanup after D6064 is applied:

Remove /var/www/stats.export.softwareheritage.org directory
Remove apache vhosts:
- /etc/apache2/sites-enabled/25-stats.export.softwareheritage.org_non-ssl.conf
- /etc/apache2/sites-enabled/25-stats.export.softwareheritage.org_ssl.conf
- /etc/apache2/sites-available/25-stats.export.softwareheritage.org_non-ssl.conf
- /etc/apache2/sites-available/25-stats.export.softwareheritage.org_ssl.conf
check crontab removal: export_archive_counters
remove '/usr/local/bin/export_archive_counters.py'
remove '/usr/local/share/swh-data' directory

Aug 5 2021, 7:43 PM · System administration, Monitoring

vsellier requested review of D6064: Clean counter statistic scripts, data and vhosts on pergamon.

Aug 5 2021, 7:33 PM

vsellier added a revision to T3417: Cleanup the old counters environment: D6064: Clean counter statistic scripts, data and vhosts on pergamon.

Aug 5 2021, 7:33 PM · System administration, Monitoring

vsellier closed T3460: Restore access to the gui of the passive firewall as Resolved.

Aug 5 2021, 5:49 PM · System administration

vsellier closed T3460: Restore access to the gui of the passive firewall, a subtask of T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem, as Resolved.

Aug 5 2021, 5:49 PM · System administration

vsellier closed D6063: infrastructure: document how to access the second firewall through the VPN.

Aug 5 2021, 5:48 PM

vsellier committed rDDOCdde95608d439: infrastructure: document how to access the second firewall through the VPN (authored by vsellier).

infrastructure: document how to access the second firewall through the VPN

Aug 5 2021, 5:48 PM

vsellier changed the status of T3417: Cleanup the old counters environment, a subtask of T2912: Next generation archive counters, from Open to Work in Progress.

Aug 5 2021, 5:37 PM · Roadmap 2021, System administration, Monitoring, Web app

vsellier changed the status of T3417: Cleanup the old counters environment from Open to Work in Progress.

Aug 5 2021, 5:37 PM · System administration, Monitoring

vsellier added a comment to T3460: Restore access to the gui of the passive firewall.

louvre configuration reverted
pushkin configuration reverted
access documented: D6063

Aug 5 2021, 5:29 PM · System administration

vsellier requested review of D6063: infrastructure: document how to access the second firewall through the VPN.

Aug 5 2021, 5:28 PM

vsellier added a revision to T3460: Restore access to the gui of the passive firewall: D6063: infrastructure: document how to access the second firewall through the VPN.

Aug 5 2021, 5:28 PM · System administration

vsellier added a reverting change for D6060: Change pushkin ip address: rSPSITE10abbbd71e19: Revert "Change pushkin ip address".

Aug 5 2021, 5:05 PM

vsellier added a reverting change for rSPSITE76e4f3fba539: Change pushkin ip address: rSPSITE10abbbd71e19: Revert "Change pushkin ip address".

Aug 5 2021, 5:05 PM

vsellier committed rSPSITE10abbbd71e19: Revert "Change pushkin ip address" (authored by vsellier).

Revert "Change pushkin ip address"

Aug 5 2021, 5:05 PM

vsellier accepted D6062: staging: Deploy opam loader service.

LGTM

Aug 5 2021, 5:02 PM

vsellier added a comment to T3460: Restore access to the gui of the passive firewall.

Well, I misinterpreted the routing issue.
It seems it's only because the openvpn service is also started on pushkin (which is normal, so that it is ready in case of a primary/secondary switch).
The route for the openvpn traffic is also declared on the secondary, so a packet from the VPN falls into a black hole:

Aug 5 2021, 4:44 PM · System administration

vsellier accepted D6061: staging: Deploy opam lister.

LGTM

Aug 5 2021, 4:31 PM

vsellier added a comment to T3460: Restore access to the gui of the passive firewall.

Pushkin's ip was changed and the new ip was declared in puppet.
It seems the firewall is still not reachable with the new ip.
I'm trying to diagnose the problem.

Aug 5 2021, 4:17 PM · System administration

vsellier closed D6060: Change pushkin ip address.

Aug 5 2021, 3:45 PM

vsellier committed rSPSITE76e4f3fba539: Change pushkin ip address (authored by vsellier).

Change pushkin ip address

Aug 5 2021, 3:45 PM

vsellier added a revision to T3460: Restore access to the gui of the passive firewall: D6060: Change pushkin ip address.

Aug 5 2021, 3:32 PM · System administration

vsellier requested review of D6060: Change pushkin ip address.

Aug 5 2021, 3:32 PM

vsellier added a comment to T3460: Restore access to the gui of the passive firewall.

network configuration manually changed on louvre:

root@louvre:~# diff -u3 /tmp/interfaces /etc/network/interfaces
--- /tmp/interfaces	2021-08-05 12:44:20.213896058 +0000
+++ /etc/network/interfaces	2021-08-05 12:37:29.480805493 +0000
@@ -5,7 +5,7 @@

Aug 5 2021, 2:45 PM · System administration

vsellier changed the status of T3460: Restore access to the gui of the passive firewall from Open to Work in Progress.

Aug 5 2021, 2:34 PM · System administration

vsellier changed the status of T3460: Restore access to the gui of the passive firewall, a subtask of T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem, from Open to Work in Progress.

Aug 5 2021, 2:34 PM · System administration

vsellier triaged T3465: Test multidatacenter replication as Normal priority.

Aug 5 2021, 12:31 PM · System administration, Storage manager

vsellier triaged T3464: Prepare a quote for the cassandra servers as Normal priority.

Aug 5 2021, 12:20 PM · System administration, Storage manager

vsellier updated the task description for T3357: Perform some tests of the cassandra storage on Grid5000.

Aug 5 2021, 12:18 PM · System administration, Storage manager

vsellier updated the task description for T3357: Perform some tests of the cassandra storage on Grid5000.

Aug 5 2021, 12:17 PM · System administration, Storage manager

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

Some news about the tests running since the beginning of the week:

The data retention of the federated prometheus was left at the default value, so all the data expired after 15 days. A new reference run was performed to be able to compare with the default scenario.
The first try failed because it was the first time the zfs configuration was adapted, and the change was not correctly deployed via the ansible scripts. It was solved by completely cleaning up the zfs configuration and relaunching the deployment. Unfortunately, the cleanup needs to be launched manually before launching a test with zfs changes.
With the usage of the best effort jobs, it's possible to perform tests during the day without exceeding the quota.

Aug 5 2021, 12:16 PM · System administration, Storage manager

Aug 4 2021

vsellier triaged T3463: Ingest proxmox and ceph logs in elk as High priority.

Aug 4 2021, 6:44 PM · System administration

vsellier triaged T3462: Add proxmox / ceph monitoring as High priority.

Aug 4 2021, 6:42 PM · System administration

vsellier added a subtask for T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem: T3461: Prepare a quote for bare metal servers for the firewalls.

Aug 4 2021, 6:39 PM · System administration

vsellier added a parent task for T3461: Prepare a quote for bare metal servers for the firewalls: T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem.

Aug 4 2021, 6:39 PM · System administration

vsellier triaged T3461: Prepare a quote for bare metal servers for the firewalls as Normal priority.

Aug 4 2021, 6:39 PM · System administration

vsellier triaged T3460: Restore access to the gui of the passive firewall as Normal priority.

Aug 4 2021, 6:36 PM · System administration

vsellier added a comment to T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem.

We looked with @ardumont to see whether we could find anything relevant to the incident.
The osd logs were rotated and removed from the servers, so there is nothing that can help diagnose the problem.
This shows it's important to send all the logs to a third party like elk.

Aug 4 2021, 4:19 PM · System administration

vsellier edited P1099 First crash logs.

Aug 4 2021, 12:28 PM

Aug 3 2021

vsellier accepted D6055: Install save code now update routine as a service called every minute.

LGTM

Aug 3 2021, 6:01 PM

vsellier accepted D6052: Install update-metrics as a service called daily.

It seems the systemd module upgrade adds an internal change:

*******************************************
  Systemd::Service_limits[rabbitmq-server.service] =>
   parameters =>
     selinux_ignore_defaults =>
      + false

but as we checked together, it seems to have no impact on the service configuration / content of the drop-in files

Aug 3 2021, 4:46 PM

vsellier added a comment to D6052: Install update-metrics as a service called daily.

It looks like it solves the concurrency issue and will allow keeping the logs.

Aug 3 2021, 2:53 PM

vsellier added a comment to D6052: Install update-metrics as a service called daily.

I have just seen the paste, so I have the answer to my first question ;)

Aug 3 2021, 2:40 PM

vsellier added a comment to D6052: Install update-metrics as a service called daily.

Is there any log output? Is it possible to monitor the duration of the command and, most importantly, to avoid running it several times in parallel?

Aug 3 2021, 2:38 PM

vsellier committed rDSNIPa96df2f91e47: grid5000/cassandra: add a script to cleanup the zfs pools (authored by vsellier).

grid5000/cassandra: add a script to cleanup the zfs pools

Aug 3 2021, 2:29 PM

vsellier committed rDSNIPc56684acf284: grid5000/cassandra: quick and dirty script to manage best effort nodes (authored by vsellier).

grid5000/cassandra: quick and dirty script to manage best effort nodes

Aug 3 2021, 2:29 PM

vsellier closed T3452: Replication lag between the dbs should raise icinga alerts as Resolved.

The probe is deployed on the monitoring: https://icinga.softwareheritage.org/dashboard#!/monitoring/service/show?host=belvedere.internal.softwareheritage.org&service=Postgresql%20replication%20lag%20%28belvedere%20-%3E%20somerset%29

Aug 3 2021, 11:32 AM · Monitoring, System administration

vsellier closed D6050: monitor postgresql replication lag through prometheus data.

Aug 3 2021, 11:24 AM

vsellier committed rSPSITE9483287d6c9a: monitor postgresql replication lag through prometheus data (authored by vsellier).

monitor postgresql replication lag through prometheus data

Aug 3 2021, 11:24 AM

vsellier updated the test plan for D6050: monitor postgresql replication lag through prometheus data.

Aug 3 2021, 9:35 AM

vsellier requested review of D6050: monitor postgresql replication lag through prometheus data.

Aug 3 2021, 9:03 AM

vsellier added a revision to T3452: Replication lag between the dbs should raise icinga alerts: D6050: monitor postgresql replication lag through prometheus data.

Aug 3 2021, 9:03 AM · Monitoring, System administration

Aug 2 2021

vsellier committed rDSNIP1e056818f103: grid5000/cassandra: extend the prometheus data retention (authored by vsellier).

grid5000/cassandra: extend the prometheus data retention

Aug 2 2021, 11:18 PM

vsellier committed rDSNIP7801b4dac744: grid5000/cassandra: remove a pre-configured debian repo with an expired key (authored by vsellier).

grid5000/cassandra: remove a pre-configured debian repo with an expired key

Aug 2 2021, 11:18 PM

vsellier created P1112 federate your prometheus.

Aug 2 2021, 4:29 PM

Jul 9 2021

vsellier added a comment to T3408: Provide read-only access to production servers.

Documentation listing the urls of the staging and production services is still missing; it should be added somewhere before closing.

Jul 9 2021, 4:24 PM · System administration

vsellier closed T1526: Install a new VPN endpoint at Rocquencourt as Resolved.

Jul 9 2021, 4:23 PM · System administration

vsellier committed rDSNIP466a651fe2c4: Increase the max commit log segment size to allow the import of big revisions (authored by vsellier).

Increase the max commit log segment size to allow the import of big revisions

Jul 9 2021, 4:13 PM

vsellier committed rSPSITE4ddcd5a1349a: Allow the indexer journal client to access its configuration (authored by vsellier).

Allow the indexer journal client to access its configuration

Jul 9 2021, 3:15 PM

vsellier added a comment to T1526: Install a new VPN endpoint at Rocquencourt.

It seems better: there has been at most 1 error per day since the update.

Jul 9 2021, 12:29 PM · System administration

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

The run with 8 nodes was faster[1] than the previous ones, as expected, but it seems it could have been even faster because the bottleneck is now the 6 replayers, which have a really high load.
Performance is 60% to 100% better, depending on the object type.

Jul 9 2021, 9:59 AM · System administration, Storage manager

vsellier closed T3396: cassandra - allow to configure the consistency level used by the queries, a subtask of T3357: Perform some tests of the cassandra storage on Grid5000, as Resolved.

Jul 9 2021, 9:46 AM · System administration, Storage manager

vsellier closed T3396: cassandra - allow to configure the consistency level used by the queries as Resolved.

The version was used during the last tests on grid5000. The consistency level was correctly configured

Jul 9 2021, 9:46 AM · System administration, Storage manager

Jul 8 2021

vsellier updated the task description for T3408: Provide read-only access to production servers.

Jul 8 2021, 6:44 PM · System administration

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

News from the last incomplete run with 4 nodes[1]: it seems it's 25% faster than with 3 nodes, which is great.
The next run will be tonight with 8 cassandra nodes.

Jul 8 2021, 6:29 PM · System administration, Storage manager

vsellier committed rDSNIPf4ac8222de37: grid5000/cassandra: avoid oversize mutations on revisions (authored by vsellier).

grid5000/cassandra: avoid oversize mutations on revisions

Jul 8 2021, 1:58 PM

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

During the last run, I discovered there were cassandra logs[1] about oversized mutations on this run and all the previous ones.
It means some changes were committed but ignored when the commit log was flushed, which is absolutely wrong.

Jul 8 2021, 1:54 PM · System administration, Storage manager
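
For context on the oversized mutations mentioned above: the comments shipped in Cassandra's `cassandra.yaml` state that the maximum mutation size (`max_mutation_size_in_kb`) defaults to half of `commitlog_segment_size_in_mb`, which is why raising the segment size lets larger revisions through. The sketch below only illustrates that arithmetic. The 20 MiB revision is a made-up figure, the helper names are hypothetical, and the exact comparison Cassandra performs is an assumption.

```python
def default_max_mutation_size_kb(commitlog_segment_size_mb):
    """Default max mutation size: half of the commit log segment size, in KB."""
    return commitlog_segment_size_mb * 1024 // 2


def is_oversized(mutation_bytes, commitlog_segment_size_mb):
    """True when a mutation exceeds the default limit for this segment size."""
    limit_bytes = default_max_mutation_size_kb(commitlog_segment_size_mb) * 1024
    return mutation_bytes > limit_bytes


# Hypothetical 20 MiB revision, checked against two segment sizes.
big_revision = 20 * 1024 * 1024

for segment_mb in (32, 64):
    limit_kb = default_max_mutation_size_kb(segment_mb)
    verdict = "rejected" if is_oversized(big_revision, segment_mb) else "accepted"
    print(f"segment={segment_mb} MB -> limit={limit_kb} KB -> {verdict}")
```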

vsellier closed D5979: webapp: install read-only services on all frontend servers.

Jul 8 2021, 9:32 AM

vsellier committed rSPSITEca0e5a0e7961: webapp: install read-only services on all frontend servers (authored by vsellier).

webapp: install read-only services on all frontend servers

Jul 8 2021, 9:32 AM