LGTM according to the journal content
Dec 10 2020
LGTM
LGTM
LGTM
LGTM
LGTM
Dec 9 2020
The search RPC backend and the journal client listening on the origin and origin_visit topics are deployed.
The inventory is up to date for both hosts [1][2]
Dec 8 2020
Changing the status to "Resolved" as it seems there is nothing more to do on this task, since the counters have started to be updated again.
LGTM
ack, let's try the current version of the journal client :)
As discussed, we will need several instances of the journal client to be able to use different prefixes.
They should be managed by puppet too.
My previous comment was not for this diff but for D4668 :)
As discussed, we will need several instances of the journal client to be able to use different prefixes.
They should be managed by puppet too.
A dashboard to monitor the ES cluster behavior has been created on Grafana [1]
It will be improved during the swh-search tests
Dec 7 2020
Interesting note about how to size the shards of an index: https://www.elastic.co/guide/en/elasticsearch/reference/7.x//size-your-shards.html
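As a quick check related to that note, the current shard sizes can be listed directly from the cluster (read-only query; $ES_NODE as used elsewhere in this journal):
# list shards sorted by store size, biggest first
curl -s "http://$ES_NODE:9200/_cat/shards?v&s=store:desc" | head -20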
- an increment of 250GB was added via the Proxmox UI (for a total of 500GB now)
- the disk was resized on the OS side:
root@pergamon:~# parted /dev/vdc
GNU Parted 3.2
Using /dev/vdc
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: Virtio Block Device (virtblk)
Disk /dev/vdc: 537GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
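For reference, the remaining steps to make the new space usable look roughly like this (a sketch only; the partition number and the filesystem type are assumptions to be adjusted to the actual layout):
# grow the partition to the end of the disk, then the filesystem on top of it
parted /dev/vdc resizepart 1 100%
resize2fs /dev/vdc1      # for ext4; use xfs_growfs on the mountpoint for XFS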
puppet was disabled on the production nodes to prevent this diff from being applied. We will perform a rolling restart of the production cluster after the next scheduled kernel upgrade.
The disk will be resized to avoid a service disruption in the short term, before looking at T1362.
Dec 4 2020
esnode3 was restarted and updated.
~/src/swh/puppet-environment master* ❯ curl -XPUT -H "Content-Type: application/json" http://$ES_NODE:9200/_cluster/settings -d '{ "persistent": { "cluster.routing.allocation.enable": "primaries" } }'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"enable":"primaries"}}}},"transient":{}}
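Once the restarted node has rejoined the cluster, allocation can be re-enabled by resetting the setting, following the standard Elasticsearch rolling-restart procedure:
# reset cluster.routing.allocation.enable to its default (all)
curl -XPUT -H "Content-Type: application/json" http://$ES_NODE:9200/_cluster/settings -d '{ "persistent": { "cluster.routing.allocation.enable": null } }'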
The puppet configuration is applied on esnode1 and esnode2, but we should have taken the opportunity to perform a system update.
A naive upgrade was started but the cluster collapsed, with a node running out of memory during the shard rebalancing.
fake LGTM to be able to change the status
Dec 3 2020
According to the description and to the latest discussion, we have decided not to continue in this direction.
Dec 2 2020
This is not viable as the OPNsense API doesn't allow retrieving the network rules.
After T2828, it's clearer what must be deployed to have the counters working on staging:
- The counters can be initialized via the /stat/refresh endpoint of the storage api (Note: it will create more counters than in production, as directory_entry_* and revision_history are not counted in production)
- Add a script/service to execute `swh_update_counter_bucketed` in an infinite loop (see the sketch after this list)
- Create the buckets in the object_counts_bucketed table
- per object type: identifier|bucket_start|bucket_end. value and last_update will be updated by the stored procedures.
- Configure the prometheus sql exporter for db1.staging [1]
- Configure profile_exporter on pergamon
- Update the script to ensure the data are filtered by environment (to avoid staging data being included in the production counts [2])
- Configure a new cron
- Load an empty file for the historical data
- Create a new export_file
- Update the webapp so the origin of the counters can be configured
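A minimal sketch of the update loop mentioned above, assuming psql can reach db1.staging with suitable credentials and that swh_update_counter_bucketed takes no argument (both are assumptions):
# hypothetical loop driving the bucketed counters update on db1.staging
while true; do
    psql -h db1.staging -U swh -d swh -c "SELECT swh_update_counter_bucketed();"
    sleep 60    # arbitrary pause between two bucket updates
done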
I took the time to create a diagram of the pipeline to help me summarize the subject.
It should help to deploy the counters in staging.
SVG:
Dec 1 2020
I was right until the last point :). My mistake was to base my reasoning on an old version of the stored procedure.
Thanks again for the explanation, it's crystal clear now.
Thanks for the clarification.
I missed those counters; I was only focused on the sql_swh_archive_object_count metrics. Could you give some pointers or information on how it's called? I could only find the stored procedure declaration in storage [1].
All the stopped workers were restarted:
vsellier@pergamon ~ % sudo clush -b -w @swh-workers16 'puppet agent --enable; systemctl default'
Erratum: the counters are not yet visible on the "Object added by time period" dashboard due to the aggregation per day.
In T2828#53740, @rdicosmo wrote: Thanks for looking into this. It would be great to make sure that statistics are collected only once at a time (every X hours), and cached, so to avoid rerunning expensive queries regularly.
The PostgreSQL statistics came back online [1].
The "Object added by time period" dashboard [2] also has data to display.
D4635 is landed.
The installation doesn't look too complicated [1]; we need to install a new firewall dedicated to admin/internal tools.
The unknown is the SSO part, but as a quick win, we can try to plug it into the current softwareheritage Keycloak setup with a dedicated group.
In D4635#115756, @olasd wrote: I was also somewhat concerned that this filtering would, in fact, show a partial picture of what's happening on the database, as we do a lot of processing / I/Os in temp tables. But in practice the aggregate value of i/os on temp tables is orders of magnitude lower than that of any actual table, so I guess this is fine...
Exclude only pg_temp schemas
This is the complete result of the statistics with the temporary tables: P886
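To illustrate the "exclude only pg_temp schemas" filtering above, a hedged sketch of the kind of query involved (using pg_statio_all_tables; the exact query used by the exporter may differ):
psql -c "
SELECT schemaname, relname, heap_blks_read, heap_blks_hit
FROM pg_statio_all_tables
WHERE schemaname !~ '^pg_temp'
ORDER BY heap_blks_read DESC
LIMIT 10;"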
As the slowness of the monitoring requests doesn't seem to be related to the direct load on the database, the indexers were restarted:
vsellier@pergamon ~ % sudo clush -b -w @azure-workers 'cd /etc/systemd/system/multi-user.target.wants; for unit in "swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service"; do systemctl enable $unit; done; systemctl start swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service; puppet agent --enable'
Update description
D4635 is a proposal to solve the performance issues with the statistics queries.
Nov 30 2020
@olasd has stopped the backfilling with:
pkill -2 -u swhstorage -f revision
(this allows the logs to be flushed before exiting)
Half of the workers were stopped:
root@pergamon:~# sudo clush -b -w @swh-workers16 'puppet agent --disable "Reduce load of belvedere"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker11: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker10: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker09: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@lister.service.
...
It seems there is no other solution than reducing the load on belvedere.
There is an aggressive backfill in progress from getty (192.168.100.102):
postgres=# select client_addr, count(datid) from pg_stat_activity where state != 'idle' group by client_addr;
   client_addr   | count
-----------------+-------
                 |     3
 192.168.100.18  |     0
 ::1             |     1
 192.168.100.210 |    60
 192.168.100.102 |    64
(5 rows)
I don't want to kill a job that has been running for several days (since 2020-11-27) to avoid losing any work. The temporary solution is to reduce the number of workers to relieve the load on belvedere.
In T2828#53672, @rdicosmo wrote: Hmmm... there is definitely no need to update the counters more than once a day
Let's try a temporary workaround:
root@belvedere:/etc/prometheus-sql-exporter# puppet agent --disable "Diagnose prometheus-exporter timeout"
root@belvedere:/etc/prometheus-sql-exporter# mv swh-scheduler.yml ~
root@belvedere:/etc/prometheus-sql-exporter# systemctl restart prometheus-sql-exporter
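Once the slow query is fixed, the workaround can be reverted by simply inverting the commands above (a sketch):
# put the scheduler exporter config back and let puppet manage the host again
mv ~/swh-scheduler.yml /etc/prometheus-sql-exporter/
systemctl restart prometheus-sql-exporter
puppet agent --enable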
It seems some queries are executed on the database each time the metrics are requested.
This one is too long (on the swh-scheduler instance):
After retracing the counter computation pipeline, it seems they are computed from the values stored in Prometheus.
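For example, the metric feeding them can be queried directly from the Prometheus HTTP API (the metric name comes from this journal; the host and port are assumptions):
# query the current value of the archive object count metric
curl -s "http://pergamon:9090/api/v1/query?query=sql_swh_archive_object_count"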