
Aug 6 2021

vsellier added a comment to T3417: Cleanup the old counters environment.
  • D6064 landed
  • manual cleanup:
    • the apache vhost was removed by puppet
    • /var/www/stats.export.softwareheritage.org directory removed
    • the crontab was removed by puppet
    • /usr/local/bin/export_archive_counters.py file removed
    • /usr/local/share/swh-data directory removed
  • the refresh of the database counter is now scheduled every Monday at 6:29 AM
postgres@belvedere:~$ crontab -l | grep counter
29 6  *   *  mon     /usr/bin/chronic /usr/bin/flock -xn /srv/softwareheritage/postgres/swh-update-counter.lock /usr/bin/psql -p 5433 softwareheritage -c "select swh_update_counter(object_type) from object_counts where single_update = true order by last_update limit 1"
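
For readers unfamiliar with the `flock -xn` part of this crontab entry: it takes an exclusive, non-blocking lock on the lock file, so a new run gives up immediately if a previous refresh is still going. A minimal Python sketch of the same idea, using the standard `fcntl` module (the function names are hypothetical and this is not the script used on belvedere):

```python
import fcntl
import os


def try_exclusive_lock(path):
    """Try to take an exclusive, non-blocking lock on `path`.

    Returns the open file descriptor when the lock is acquired, or None
    when another holder already has it (the behaviour of `flock -xn`).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Somebody else holds the lock: give up instead of waiting.
        os.close(fd)
        return None
    return fd


def release_lock(fd):
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A caller would run the refresh only when `try_exclusive_lock` returns a descriptor, and call `release_lock` afterwards.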
Aug 6 2021, 6:31 PM · System administration, Monitoring
vsellier closed D6064: Clean counter statistic scripts, data and vhosts on pergamon.
Aug 6 2021, 6:11 PM
vsellier committed rSPSITE278cfe53d5fe: Clean counter statistic scripts, data and vhosts on pergamon (authored by vsellier).
Clean counter statistic scripts, data and vhosts on pergamon
Aug 6 2021, 6:11 PM
vsellier updated the diff for D6064: Clean counter statistic scripts, data and vhosts on pergamon.

rebase

Aug 6 2021, 6:08 PM
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

It seems D6067 solves the issue with the partition key cartesian product size. @vlorentz Do you think a run with cassandra is necessary to evaluate a potential performance impact?

Aug 6 2021, 5:33 PM · System administration, Storage manager
vsellier committed rDSNIP326a9e39812a: grid5000/cassandra: save the temporary scylla configuration (authored by vsellier).
grid5000/cassandra: save the temporary scylla configuration
Aug 6 2021, 5:18 PM
vsellier committed rDSNIP78b96c8f1a2b: grid5000/cassandra: add dashboards dedicated to scylla (authored by vsellier).
grid5000/cassandra: add dashboards dedicated to scylla
Aug 6 2021, 4:35 PM
vsellier committed rDSNIP52b4eea9e3ad: grid5000/cassandra: add a dashboard to monitor the concurrent client connections (authored by vsellier).
grid5000/cassandra: add a dashboard to monitor the concurrent client connections
Aug 6 2021, 4:35 PM
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

The prometheus configuration of the db servers needs some adaptation, as scylla ships its own prometheus node exporter (and removes the default package :()

root@parasilo-2:/opt# apt install scylla-node-exporter
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libio-pty-perl libipc-run-perl moreutils
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  prometheus-node-exporter
The following NEW packages will be installed:
  scylla-node-exporter
0 upgraded, 1 newly installed, 1 to remove and 7 not upgraded.
Need to get 0 B/4,076 kB of archives.
After this operation, 3,243 kB of additional disk space will be used.
Aug 6 2021, 3:17 PM · System administration, Storage manager
vsellier updated subscribers of T3357: Perform some tests of the cassandra storage on Grid5000.

Thanks @vlorentz for D6067, I will test the fix once the cluster is more stable

Aug 6 2021, 3:06 PM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

There are also a lot of read timeout errors in the scylla logs (with no activity on the database except the monitoring):

Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]:  [shard 5] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]:  [shard 6] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
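
To get a feeling for how widespread these timeouts are, the log lines can be grouped by peer, table and shard. The following is only a sketch: the regular expression is derived from the two lines quoted above and may not match other variants of the message, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern built from the two sample lines above; other scylla versions
# may word the message differently.
TIMEOUT_RE = re.compile(
    r"\[shard (?P<shard>\d+)\] storage_proxy - Exception when communicating "
    r"with (?P<peer>[^,\s]+), to read from (?P<table>[^:\s]+): "
    r"seastar::named_semaphore_timed_out"
)


def count_read_timeouts(lines):
    """Count semaphore read timeouts per (peer, table, shard)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            key = (match["peer"], match["table"], int(match["shard"]))
            counts[key] += 1
    return counts
```

The input can be any iterable of lines, for example the output of journalctl saved to a file.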
Aug 6 2021, 2:53 PM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

After having a hard time configuring and correctly starting the scylla servers (different bindings, configuration adaptations), the schema was correctly created (I needed to add SWH_USE_SCYLLADB=1 to the initialisation script).
Compared to cassandra, the nodetool command does not seem to report the data distribution on the cluster correctly, because the system keyspaces do not have the same replication factor as the swh one:

vsellier@parasilo-2:~$  nodetool status
Using /etc/scylla/scylla.yaml as the config file
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns    Host ID                               Rack
UN  172.16.97.2  2.36 MB    256          ?       866bbcc4-d496-4ebb-ab3b-12ef4942beaa  rack1
UN  172.16.97.3  3.37 MB    256          ?       21fdd0a9-15cd-473f-814c-c8ac24870aca  rack1
UN  172.16.97.4  3.48 MB    256          ?       1ed61715-01a0-4c15-a4bc-f9972f575437  rack1
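
The `?` in the Owns column is how nodetool reports that it cannot compute an effective ownership; passing a keyspace name to `nodetool status` is the usual way to get a value for that keyspace. As a rough sketch, the output above can be parsed to spot nodes without ownership information. The column layout is assumed from this transcript only, and the function name is hypothetical.

```python
def parse_nodetool_status(output):
    """Parse node rows from `nodetool status` output.

    Assumes the 8-field layout of the transcript above:
    status, address, load value, load unit, tokens, owns, host id, rack.
    """
    nodes = []
    for line in output.splitlines():
        fields = line.split()
        # Node rows start with a two-letter code such as "UN" or "DN";
        # the header row starts with "--" and is skipped.
        if len(fields) == 8 and len(fields[0]) == 2 and fields[0][0] in "UD":
            status, address, load, unit, tokens, owns, host_id, rack = fields
            nodes.append({
                "status": status,
                "address": address,
                "load": (float(load), unit),
                "tokens": int(tokens),
                # "?" means nodetool could not compute the ownership.
                "owns": None if owns == "?" else owns,
                "host_id": host_id,
                "rack": rack,
            })
    return nodes
```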
Aug 6 2021, 1:09 PM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

scylladb test

Aug 6 2021, 11:56 AM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

run7 results - cassandra heap from 16g to 32g

Aug 6 2021, 10:49 AM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

run6 results - commitlog on a HDD

Aug 6 2021, 10:28 AM · System administration, Storage manager

Aug 5 2021

vsellier added a comment to T3417: Cleanup the old counters environment.

Pergamon manual cleanup after D6064 is applied:

  • Remove /var/www/stats.export.softwareheritage.org directory
  • Remove apache vhosts:
    • /etc/apache2/sites-enabled/25-stats.export.softwareheritage.org_non-ssl.conf
    • /etc/apache2/sites-enabled/25-stats.export.softwareheritage.org_ssl.conf
    • /etc/apache2/sites-available/25-stats.export.softwareheritage.org_non-ssl.conf
    • /etc/apache2/sites-available/25-stats.export.softwareheritage.org_ssl.conf
  • check crontab removal: export_archive_counters
  • remove '/usr/local/bin/export_archive_counters.py'
  • remove '/usr/local/share/swh-data' directory
Aug 5 2021, 7:43 PM · System administration, Monitoring
vsellier requested review of D6064: Clean counter statistic scripts, data and vhosts on pergamon.
Aug 5 2021, 7:33 PM
vsellier added a revision to T3417: Cleanup the old counters environment: D6064: Clean counter statistic scripts, data and vhosts on pergamon.
Aug 5 2021, 7:33 PM · System administration, Monitoring
vsellier closed T3460: Restore access to the gui of the passive firewall as Resolved.
Aug 5 2021, 5:49 PM · System administration
vsellier closed T3460: Restore access to the gui of the passive firewall , a subtask of T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem, as Resolved.
Aug 5 2021, 5:49 PM · System administration
vsellier closed D6063: infrastructure: document how to access the second firewall through the VPN.
Aug 5 2021, 5:48 PM
vsellier committed rDDOCdde95608d439: infrastructure: document how to access the second firewall through the VPN (authored by vsellier).
infrastructure: document how to access the second firewall through the VPN
Aug 5 2021, 5:48 PM
vsellier changed the status of T3417: Cleanup the old counters environment, a subtask of T2912: Next generation archive counters, from Open to Work in Progress.
Aug 5 2021, 5:37 PM · Roadmap 2021, System administration, Monitoring, Web app
vsellier changed the status of T3417: Cleanup the old counters environment from Open to Work in Progress.
Aug 5 2021, 5:37 PM · System administration, Monitoring
vsellier added a comment to T3460: Restore access to the gui of the passive firewall .
  • louvre configuration reverted
  • pushkin configuration reverted
  • access documented: D6063
Aug 5 2021, 5:29 PM · System administration
vsellier requested review of D6063: infrastructure: document how to access the second firewall through the VPN.
Aug 5 2021, 5:28 PM
vsellier added a revision to T3460: Restore access to the gui of the passive firewall : D6063: infrastructure: document how to access the second firewall through the VPN.
Aug 5 2021, 5:28 PM · System administration
vsellier added a reverting change for D6060: Change pushkin ip address: rSPSITE10abbbd71e19: Revert "Change pushkin ip address".
Aug 5 2021, 5:05 PM
vsellier added a reverting change for rSPSITE76e4f3fba539: Change pushkin ip address: rSPSITE10abbbd71e19: Revert "Change pushkin ip address".
Aug 5 2021, 5:05 PM
vsellier committed rSPSITE10abbbd71e19: Revert "Change pushkin ip address" (authored by vsellier).
Revert "Change pushkin ip address"
Aug 5 2021, 5:05 PM
vsellier accepted D6062: staging: Deploy opam loader service.

LGTM

Aug 5 2021, 5:02 PM
vsellier added a comment to T3460: Restore access to the gui of the passive firewall .

Well, I misinterpreted the routing issue.
It seems it's only because the openvpn service is also started on pushkin (which is normal, so that it is ready in case of a primary/secondary switch).
The route for the openvpn traffic is also declared on the secondary, so a packet from the VPN falls into a black hole:

Aug 5 2021, 4:44 PM · System administration
vsellier accepted D6061: staging: Deploy opam lister.

LGTM

Aug 5 2021, 4:31 PM
vsellier added a comment to T3460: Restore access to the gui of the passive firewall .

pushkin's ip was changed and the new ip was declared in puppet.
It seems the firewall is still not reachable with the new ip.
I'm trying to diagnose the problem.

Aug 5 2021, 4:17 PM · System administration
vsellier closed D6060: Change pushkin ip address.
Aug 5 2021, 3:45 PM
vsellier committed rSPSITE76e4f3fba539: Change pushkin ip address (authored by vsellier).
Change pushkin ip address
Aug 5 2021, 3:45 PM
vsellier added a revision to T3460: Restore access to the gui of the passive firewall : D6060: Change pushkin ip address.
Aug 5 2021, 3:32 PM · System administration
vsellier requested review of D6060: Change pushkin ip address.
Aug 5 2021, 3:32 PM
vsellier added a comment to T3460: Restore access to the gui of the passive firewall .
  • network configuration manually changed on louvre:
root@louvre:~# diff -u3 /tmp/interfaces /etc/network/interfaces
--- /tmp/interfaces	2021-08-05 12:44:20.213896058 +0000
+++ /etc/network/interfaces	2021-08-05 12:37:29.480805493 +0000
@@ -5,7 +5,7 @@
Aug 5 2021, 2:45 PM · System administration
vsellier changed the status of T3460: Restore access to the gui of the passive firewall from Open to Work in Progress.
Aug 5 2021, 2:34 PM · System administration
vsellier changed the status of T3460: Restore access to the gui of the passive firewall , a subtask of T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem, from Open to Work in Progress.
Aug 5 2021, 2:34 PM · System administration
vsellier triaged T3465: Test multidatacenter replication as Normal priority.
Aug 5 2021, 12:31 PM · System administration, Storage manager
vsellier triaged T3464: Prepare a quote for the cassandra servers as Normal priority.
Aug 5 2021, 12:20 PM · System administration, Storage manager
vsellier updated the task description for T3357: Perform some tests of the cassandra storage on Grid5000.
Aug 5 2021, 12:18 PM · System administration, Storage manager
vsellier updated the task description for T3357: Perform some tests of the cassandra storage on Grid5000.
Aug 5 2021, 12:17 PM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

Some news about the tests running since the beginning of the week:

  • The data retention of the federated prometheus had the default value, so all the data expired after 15 days. A new reference run was performed to be able to compare with the default scenario.
  • The first try failed because it was the first time the zfs configuration was adapted, and the change was not correctly deployed via the ansible scripts. It was solved by completely cleaning up the zfs configuration and relaunching the deployment. Unfortunately, the cleanup needs to be launched manually before launching a test with zfs changes.
  • With the use of best-effort jobs, it's possible to perform tests during the day without exceeding the quota.
Aug 5 2021, 12:16 PM · System administration, Storage manager

Aug 4 2021

vsellier triaged T3463: Ingest proxmox and ceph logs in elk as High priority.
Aug 4 2021, 6:44 PM · System administration
vsellier triaged T3462: Add proxmox / ceph monitoring as High priority.
Aug 4 2021, 6:42 PM · System administration
vsellier added a subtask for T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem: T3461: Prepare a quote for bare metal servers for the firewalls.
Aug 4 2021, 6:39 PM · System administration
vsellier added a parent task for T3461: Prepare a quote for bare metal servers for the firewalls: T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem.
Aug 4 2021, 6:39 PM · System administration
vsellier triaged T3461: Prepare a quote for bare metal servers for the firewalls as Normal priority.
Aug 4 2021, 6:39 PM · System administration
vsellier triaged T3460: Restore access to the gui of the passive firewall as Normal priority.
Aug 4 2021, 6:36 PM · System administration
vsellier added a comment to T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem.

We looked with @ardumont to see if we could find anything relevant to the incident.
The osd logs were rotated and removed from the servers, so there is nothing that can help diagnose the problem.
This shows it's important to send all the logs to a third party like elk.

Aug 4 2021, 4:19 PM · System administration
vsellier edited P1099 First crash logs.
Aug 4 2021, 12:28 PM

Aug 3 2021

vsellier accepted D6055: Install save code now update routine as a service called every minute.

LGTM

Aug 3 2021, 6:01 PM
vsellier accepted D6052: Install update-metrics as a service called daily.

It seems the systemd module upgrade adds an internal change:

*******************************************
  Systemd::Service_limits[rabbitmq-server.service] =>
   parameters =>
     selinux_ignore_defaults =>
      + false

but as we checked together, it seems to have no impact on the service configuration / content of the drop-in files

Aug 3 2021, 4:46 PM
vsellier added a comment to D6052: Install update-metrics as a service called daily.

It looks like it solves the concurrency issue and will allow keeping the logs

Aug 3 2021, 2:53 PM
vsellier added a comment to D6052: Install update-metrics as a service called daily.

I have just seen the paste, so I have the answer to my first question ;)

Aug 3 2021, 2:40 PM
vsellier added a comment to D6052: Install update-metrics as a service called daily.

Is there any log output? Is it possible to monitor the duration of the command and, most importantly, to prevent it from running several times in parallel?

Aug 3 2021, 2:38 PM
vsellier committed rDSNIPa96df2f91e47: grid5000/cassandra: add a script to cleanup the zfs pools (authored by vsellier).
grid5000/cassandra: add a script to cleanup the zfs pools
Aug 3 2021, 2:29 PM
vsellier committed rDSNIPc56684acf284: grid5000/cassandra: quick and dirty script to manage best effort nodes (authored by vsellier).
grid5000/cassandra: quick and dirty script to manage best effort nodes
Aug 3 2021, 2:29 PM
vsellier closed T3452: Replication lag between the dbs should raise icinga alerts as Resolved.

The probe is deployed on the monitoring: https://icinga.softwareheritage.org/dashboard#!/monitoring/service/show?host=belvedere.internal.softwareheritage.org&service=Postgresql%20replication%20lag%20%28belvedere%20-%3E%20somerset%29

Aug 3 2021, 11:32 AM · Monitoring, System administration
vsellier closed D6050: monitor postgresql replication lag through prometheus data.
Aug 3 2021, 11:24 AM
vsellier committed rSPSITE9483287d6c9a: monitor postgresql replication lag through prometheus data (authored by vsellier).
monitor postgresql replication lag through prometheus data
Aug 3 2021, 11:24 AM
vsellier updated the test plan for D6050: monitor postgresql replication lag through prometheus data.
Aug 3 2021, 9:35 AM
vsellier requested review of D6050: monitor postgresql replication lag through prometheus data.
Aug 3 2021, 9:03 AM
vsellier added a revision to T3452: Replication lag between the dbs should raise icinga alerts: D6050: monitor postgresql replication lag through prometheus data.
Aug 3 2021, 9:03 AM · Monitoring, System administration

Aug 2 2021

vsellier committed rDSNIP1e056818f103: grid5000/cassandra: extend the prometheus data retention (authored by vsellier).
grid5000/cassandra: extend the prometheus data retention
Aug 2 2021, 11:18 PM
vsellier committed rDSNIP7801b4dac744: grid5000/cassandra: remove a pre-configured debian repo with an expired key (authored by vsellier).
grid5000/cassandra: remove a pre-configured debian repo with an expired key
Aug 2 2021, 11:18 PM
vsellier created P1112 federate your prometheus.
Aug 2 2021, 4:29 PM

Jul 9 2021

vsellier added a comment to T3408: Provide read-only access to production servers.

Documentation listing the urls of the services for staging and production is still missing somewhere; it should be added before closing

Jul 9 2021, 4:24 PM · System administration
vsellier closed T1526: Install a new VPN endpoint at Rocquencourt as Resolved.
Jul 9 2021, 4:23 PM · System administration
vsellier committed rDSNIP466a651fe2c4: Increase the max commit log segment size to allow the import of big revisions (authored by vsellier).
Increase the max commit log segment size to allow the import of big revisions
Jul 9 2021, 4:13 PM
vsellier committed rSPSITE4ddcd5a1349a: Allow the indexer journal client to access its configuration (authored by vsellier).
Allow the indexer journal client to access its configuration
Jul 9 2021, 3:15 PM
vsellier added a comment to T1526: Install a new VPN endpoint at Rocquencourt.

It seems better: there has been at most 1 error per day since the update.

Jul 9 2021, 12:29 PM · System administration
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

The run with 8 nodes was faster[1] than the previous ones, as expected, but it seems it could have been even faster because the bottleneck is now the 6 replayers, which have a really high load.
The performance is 60% to 100% better, depending on the object type.

Jul 9 2021, 9:59 AM · System administration, Storage manager
vsellier closed T3396: cassandra - allow to configure the consistency level used by the queries, a subtask of T3357: Perform some tests of the cassandra storage on Grid5000, as Resolved.
Jul 9 2021, 9:46 AM · System administration, Storage manager
vsellier closed T3396: cassandra - allow to configure the consistency level used by the queries as Resolved.

The version was used during the last tests on grid5000. The consistency level was correctly configured
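
As background on why the consistency level matters: with a replication factor RF, a read is guaranteed to see the latest acknowledged write when the number of replicas acknowledging the write plus the number of replicas answering the read exceeds RF. The sketch below only illustrates this arithmetic. The replication factor of 3 used in the usage note is an assumption, and the levels actually chosen for the grid5000 runs are not stated here.

```python
def quorum(replication_factor):
    """Number of replicas needed for a QUORUM operation."""
    return replication_factor // 2 + 1


def is_strongly_consistent(replication_factor, write_replicas, read_replicas):
    """True when every read overlaps the latest acknowledged write."""
    return write_replicas + read_replicas > replication_factor
```

For example, with an assumed RF of 3, writing and reading at QUORUM (2 replicas each) satisfies the condition, while writing and reading at ONE does not.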

Jul 9 2021, 9:46 AM · System administration, Storage manager

Jul 8 2021

vsellier updated the task description for T3408: Provide read-only access to production servers.
Jul 8 2021, 6:44 PM · System administration
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

News from the last, incomplete run with 4 nodes[1]: it seems to be 25% faster than with 3 nodes, which is great.
The next run will be tonight with 8 cassandra nodes.

Jul 8 2021, 6:29 PM · System administration, Storage manager
vsellier committed rDSNIPf4ac8222de37: grid5000/cassandra: avoid oversize mutations on revisions (authored by vsellier).
grid5000/cassandra: avoid oversize mutations on revisions
Jul 8 2021, 1:58 PM
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

During the last run, I discovered there were cassandra logs[1] about oversized mutations, on this run and on all the previous ones.
This means some changes were committed but then ignored when the commit log was flushed, which is absolutely wrong.
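
For context, cassandra rejects a mutation larger than `max_mutation_size_in_kb`, which by default is half of `commitlog_segment_size_in_mb` (32 MB by default). The following sketch shows this arithmetic only. The function names are hypothetical, and the segment size actually configured on the grid5000 cluster is not stated in this comment.

```python
MIB = 1024 * 1024


def default_max_mutation_size_bytes(commitlog_segment_size_mb):
    """Default mutation size limit: half of the commit log segment size."""
    return commitlog_segment_size_mb * MIB // 2


def is_oversized(mutation_size_bytes, commitlog_segment_size_mb=32):
    """True when a mutation exceeds the default limit for this segment size."""
    limit = default_max_mutation_size_bytes(commitlog_segment_size_mb)
    return mutation_size_bytes > limit
```

With the default 32 MB segment, a 20 MiB revision would be oversized; doubling the segment size to 64 MB raises the limit to 32 MiB and lets it through.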

Jul 8 2021, 1:54 PM · System administration, Storage manager
vsellier closed D5979: webapp: install read-only services on all frontend servers.
Jul 8 2021, 9:32 AM
vsellier committed rSPSITEca0e5a0e7961: webapp: install read-only services on all frontend servers (authored by vsellier).
webapp: install read-only services on all frontend servers
Jul 8 2021, 9:32 AM

Jul 7 2021

vsellier committed rDSNIP5a50b5dfb0f9: grid5000/cassandra: Restart the monitoring containers (authored by vsellier).
grid5000/cassandra: Restart the monitoring containers
Jul 7 2021, 7:32 PM
vsellier committed rDSNIPbb9854fba524: grid5000/cassandra: small updates on grafana dashboards (authored by vsellier).
grid5000/cassandra: small updates on grafana dashboards
Jul 7 2021, 7:32 PM
vsellier committed rDSNIPa397ec1c7335: grid5000/cassandra: Use a more secure consistency level (authored by vsellier).
grid5000/cassandra: Use a more secure consistency level
Jul 7 2021, 7:32 PM
vsellier added a comment to T3396: cassandra - allow to configure the consistency level used by the queries.

released in swh-storage:v0.34.0

Jul 7 2021, 6:46 PM · System administration, Storage manager
vsellier closed D5974: cassandra: Allow to configure the consistency level to use.
Jul 7 2021, 6:21 PM
vsellier committed rDSTO9747aed6cbf1: cassandra: Allow to configure the consistency level to use (authored by vsellier).
cassandra: Allow to configure the consistency level to use
Jul 7 2021, 6:21 PM
vsellier requested review of D5979: webapp: install read-only services on all frontend servers.
Jul 7 2021, 5:38 PM
vsellier added a revision to T3408: Provide read-only access to production servers: D5979: webapp: install read-only services on all frontend servers.
Jul 7 2021, 5:38 PM · System administration
vsellier added a comment to T3380: staging - Disk errors on storage1.

This is the disk position according to a picture of the server taken by Christophe:

Jul 7 2021, 4:22 PM · System administration, Staging environment
vsellier updated the diff for D5974: cassandra: Allow to configure the consistency level to use.
  • use an f-string to build the error message
  • check the error message content
Jul 7 2021, 2:30 PM
vsellier added inline comments to D5974: cassandra: Allow to configure the consistency level to use.
Jul 7 2021, 12:59 PM
vsellier added inline comments to D5974: cassandra: Allow to configure the consistency level to use.
Jul 7 2021, 11:40 AM
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

A run was launched with the patched storage, which allows configuring the consistency levels.
The results are on the dedicated hedgedoc document[1]

Jul 7 2021, 11:21 AM · System administration, Storage manager

Jul 6 2021

vsellier requested review of D5974: cassandra: Allow to configure the consistency level to use.
Jul 6 2021, 5:21 PM
vsellier added a revision to T3396: cassandra - allow to configure the consistency level used by the queries: D5974: cassandra: Allow to configure the consistency level to use.
Jul 6 2021, 4:59 PM · System administration, Storage manager
vsellier closed T3395: cassandra - Timeouts during revision import, a subtask of T3357: Perform some tests of the cassandra storage on Grid5000, as Resolved.
Jul 6 2021, 12:44 PM · System administration, Storage manager