
Dec 7 2020

vsellier added a comment to T2859: Out of disk space on prometheus storage.
  • an increment of 250 GB was added via the Proxmox UI (for a total of 500 GB now)
  • the disk was resized on the OS side:
root@pergamon:~# parted /dev/vdc
GNU Parted 3.2
Using /dev/vdc
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print                                                            
Model: Virtio Block Device (virtblk)
Disk /dev/vdc: 537GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Dec 7 2020, 2:38 PM · Monitoring, System administration
vsellier added a comment to D4674: monitoring: gather metrics into prometheus.

Puppet was disabled on production nodes to prevent this diff from being applied. We will perform a rolling restart of the production cluster after the next scheduled kernel upgrade.

Dec 7 2020, 2:17 PM
vsellier added a comment to T2859: Out of disk space on prometheus storage.

The disk will be resized to avoid a service disruption in the short term, before looking at T1362.

Dec 7 2020, 12:41 PM · Monitoring, System administration
vsellier changed the status of T2859: Out of disk space on prometheus storage from Open to Work in Progress.
Dec 7 2020, 12:38 PM · Monitoring, System administration
vsellier claimed T2859: Out of disk space on prometheus storage.
Dec 7 2020, 12:38 PM · Monitoring, System administration
vsellier triaged T2859: Out of disk space on prometheus storage as High priority.
Dec 7 2020, 12:24 PM · Monitoring, System administration
vsellier created D4674: monitoring: gather metrics into prometheus.
Dec 7 2020, 12:14 PM
vsellier committed rSENVf54b4fc6183b: Update octocatalog-diff facts (authored by vsellier).
Update octocatalog-diff facts
Dec 7 2020, 12:02 PM

Dec 4 2020

vsellier created P893 (An Untitled Masterwork).
Dec 4 2020, 5:23 PM
vsellier updated subscribers of T2852: Take back control on elasticsearch puppet manifests.

esnode3 was restarted and updated.

~/src/swh/puppet-environment master* ❯ curl -XPUT -H "Content-Type: application/json" http://$ES_NODE:9200/_cluster/settings -d '{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"enable":"primaries"}}}},"transient":{}}%
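
The request above restricts shard allocation to primaries, which is the usual first step before restarting an Elasticsearch node. A minimal Python sketch of the same call, using only the standard library, might look like the following. The function names and the example URL are illustrative assumptions and are not taken from the puppet-environment repository; passing `None` resets the setting to its default once the node has rejoined.

```python
import json
import urllib.request
from typing import Optional


def allocation_settings(mode: Optional[str]) -> dict:
    """Build the body for PUT /_cluster/settings.

    mode="primaries" matches the curl command above; mode=None serializes
    to JSON null, which Elasticsearch treats as "reset to default".
    """
    return {"persistent": {"cluster.routing.allocation.enable": mode}}


def put_allocation(es_url: str, mode: Optional[str]) -> dict:
    """Send the setting to a node.

    es_url is a hypothetical base URL such as "http://esnode3:9200".
    """
    request = urllib.request.Request(
        f"{es_url}/_cluster/settings",
        data=json.dumps(allocation_settings(mode)).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Under these assumptions, `put_allocation(url, "primaries")` would be called before the restart and `put_allocation(url, None)` afterwards.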
Dec 4 2020, 3:49 PM · System administration
vsellier added a comment to T2852: Take back control on elasticsearch puppet manifests.

The puppet configuration is applied on esnode1 and esnode2, but we should have taken the opportunity to perform a system update.

Dec 4 2020, 2:25 PM · System administration
vsellier added a comment to T2852: Take back control on elasticsearch puppet manifests.

A naive upgrade was started, but the cluster collapsed when a node ran out of memory (OutOfMemory) during the shard rebalancing.

Dec 4 2020, 2:23 PM · System administration
vsellier added a revision to T2852: Take back control on elasticsearch puppet manifests: D4651: Puppetize elasticsearch nodes.
Dec 4 2020, 2:20 PM · System administration
vsellier added a task to D4651: Puppetize elasticsearch nodes: T2852: Take back control on elasticsearch puppet manifests.
Dec 4 2020, 2:20 PM
vsellier changed the status of T2852: Take back control on elasticsearch puppet manifests from Open to Work in Progress.
Dec 4 2020, 2:20 PM · System administration
vsellier accepted D4651: Puppetize elasticsearch nodes.

fake LGTM to be able to change the status

Dec 4 2020, 11:47 AM
vsellier added a comment to T2817: Enable the swh-search environment in staging.

A dedicated ES node for staging was deployed (search-esnode0.internal.staging.swh.network) with D4658 and D4651.

Dec 4 2020, 11:46 AM · System administrators, Staging environment, Journal, Archive search
vsellier updated the task description for T2817: Enable the swh-search environment in staging.
Dec 4 2020, 11:44 AM · System administrators, Staging environment, Journal, Archive search
vsellier accepted D4658: staging: Add search-esnode0.

LGTM

Dec 4 2020, 11:34 AM

Dec 3 2020

vsellier added inline comments to D4651: Puppetize elasticsearch nodes.
Dec 3 2020, 4:47 PM
vsellier abandoned D4654: -wip- Switch to the official elasticsearch plugin.

According to the description and the latest discussion, we have decided not to continue in this direction.

Dec 3 2020, 2:12 PM
vsellier added a comment to D4651: Puppetize elasticsearch nodes.

@olasd We have tested the official elasticsearch puppet plugin (D4654).
There are several issues with using it. WDYT?

Dec 3 2020, 12:25 PM
vsellier added a revision to T2817: Enable the swh-search environment in staging: D4654: -wip- Switch to the official elasticsearch plugin.
Dec 3 2020, 12:21 PM · System administrators, Staging environment, Journal, Archive search
vsellier created D4654: -wip- Switch to the official elasticsearch plugin.
Dec 3 2020, 12:21 PM

Dec 2 2020

vsellier committed rSENVde7c399ee557: add the dependency needed by elasticsearch plugin (authored by vsellier).
add the dependency needed by elasticsearch plugin
Dec 2 2020, 6:16 PM
vsellier committed rSENV3470c887ff6b: Remove db0.staging (authored by vsellier).
Remove db0.staging
Dec 2 2020, 11:29 AM
vsellier abandoned D4308: wip - poc network configuration in markdown.

This is not viable, as the OPNsense API doesn't allow retrieving the network rules.

Dec 2 2020, 11:25 AM
vsellier added a comment to T2761: Install webapp counters in the staging webapp/storage.

After T2828, it's clearer what must be deployed to have the counters working on staging:

  • the counters can be initialized via the /stat/refresh endpoint of the storage api (Note: It will create more counters than production as directory_entry_* and revision_history are not counted in production)
  • Add a script/service to execute the `swh_update_counter_bucketed` in an infinite loop
  • Create the buckets in the object_counts_bucketed
    • per object type: identifier|bucket_start|bucket_end. value and last_update will be updated by the stored procedures.
  • configure prometheus sql exporter for db1.staging [1]
  • configure profile_exporter on pergamon
    • Update the script to ensure the data are filtered by environments (to prevent staging data from being included in production counts [2])
    • Configure a new cron
      • loading an empty file for historical data
      • creating a new export_file
  • update webapp to be able to configure the counter origin
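
The script/service item in the list above (running `swh_update_counter_bucketed` in an infinite loop) could be outlined roughly as follows. This is only a sketch: `run_update_loop` and its parameters are hypothetical names, and `update_once` stands in for whatever actually invokes the stored procedure, for example a `select swh_update_counter_bucketed()` issued through a database driver (that exact invocation is an assumption).

```python
import time
from typing import Callable, Optional


def run_update_loop(
    update_once: Callable[[], None],
    pause_seconds: float = 1.0,
    max_iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call update_once repeatedly, pausing between calls.

    max_iterations=None loops forever, as a service would; a finite value
    makes the loop testable. Returns the number of calls made.
    """
    done = 0
    while max_iterations is None or done < max_iterations:
        update_once()  # hypothetical wrapper around the stored procedure
        done += 1
        sleep(pause_seconds)
    return done
```

Keeping the database call behind a plain callable means the loop can be exercised without a staging database.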
Dec 2 2020, 9:55 AM · Storage manager, Web app, Staging environment
vsellier edited P888 Counters pipeline.
Dec 2 2020, 9:22 AM
vsellier added a comment to T2828: Archive counters are no longer updated in production.

I took the time to create a schema of the pipeline to help me summarize the subject.
It should help to deploy the counters in staging.
SVG :

Dec 2 2020, 9:18 AM · Monitoring, Web app, System administration
vsellier created P888 Counters pipeline.
Dec 2 2020, 9:15 AM

Dec 1 2020

vsellier added a comment to T2828: Archive counters are no longer updated in production.

I was right until the last point :). My mistake was basing my reasoning on an old version of the stored procedure.
Thanks again for the explanation, it's crystal clear now.

Dec 1 2020, 3:18 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

Thanks for the clarification.
I missed those counters; I was only focused on the sql_swh_archive_object_count metrics. Could you give some pointers or information on how it's called? I could only find the stored procedure declaration in storage [1].

Dec 1 2020, 2:47 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

All stopped workers have been restarted:

vsellier@pergamon ~ % sudo clush -b -w @swh-workers16 'puppet agent --enable; systemctl default'
Dec 1 2020, 12:48 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

Erratum: the counters are not yet visible on the "Object added by time period" dashboard due to the aggregation per day.

Dec 1 2020, 12:26 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

Thanks for looking into this. It would be great to make sure that statistics are collected only once at a time (every X hours), and cached, so to avoid rerunning expensive queries regularly.
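
One way to read the suggestion above (collect the statistics at most once every X hours and serve a cached value in between) is a small time-based cache. The sketch below is an illustration only: `TimedCache`, `compute`, and `ttl_seconds` are hypothetical names, and nothing here reflects how the exporter is actually implemented.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TimedCache(Generic[T]):
    """Recompute a value only when the previous result is older than the TTL."""

    def __init__(
        self,
        compute: Callable[[], T],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute  # the expensive query, injected by the caller
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Optional[float] = None
        self._value: Optional[T] = None

    def get(self) -> T:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # First call, or cached result expired: run the expensive query once.
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value  # type: ignore[return-value]
```

With `ttl_seconds=6 * 3600`, for instance, repeated metric scrapes would trigger the expensive query at most once every six hours.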

Dec 1 2020, 12:22 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

The PostgreSQL statistics are back online [1].
The "Object added by time period" dashboard[2] also has data to display.

Dec 1 2020, 12:06 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

D4635 has landed.

Dec 1 2020, 11:48 AM · Monitoring, Web app, System administration
vsellier updated the test plan for D4635: exclude temporary schemas from the statistics.
Dec 1 2020, 11:46 AM
vsellier committed rSPSITEe0e677dca3ff: exclude temporary schemas from the statistics (authored by vsellier).
exclude temporary schemas from the statistics
Dec 1 2020, 11:42 AM
vsellier closed D4635: exclude temporary schemas from the statistics.
Dec 1 2020, 11:42 AM
vsellier added a comment to T2827: Deploy an instance of hedgedoc.

The installation doesn't look too complicated [1]; we need to install a new firewall dedicated to admin/internal tools.
The unknown is the SSO part, but as a quick win, we can try to plug it into the current softwareheritage keycloak scheme with a dedicated group.

Dec 1 2020, 11:24 AM · System administration
vsellier added a comment to D4635: exclude temporary schemas from the statistics.
In D4635#115756, @olasd wrote:

I was also somewhat concerned that this filtering would, in fact, show a partial picture of what's happening on the database, as we do a lot of processing / I/Os in temp tables. But in practice the aggregate value of i/os on temp tables is orders of magnitude lower than that of any actual table, so I guess this is fine...

Dec 1 2020, 10:28 AM
vsellier added a comment to D4635: exclude temporary schemas from the statistics.
In D4635#115756, @olasd wrote:

I was also somewhat concerned that this filtering would, in fact, show a partial picture of what's happening on the database, as we do a lot of processing / I/Os in temp tables. But in practice the aggregate value of i/os on temp tables is orders of magnitude lower than that of any actual table, so I guess this is fine...

Dec 1 2020, 10:22 AM
vsellier updated the diff for D4635: exclude temporary schemas from the statistics.

Exclude only pg_temp schemas

Dec 1 2020, 10:19 AM
vsellier added a comment to D4635: exclude temporary schemas from the statistics.

This is the complete result of the statistics with the temporary tables : P886

Dec 1 2020, 10:06 AM
vsellier created P886 Complete statistics with temporary tables.
Dec 1 2020, 10:05 AM
vsellier added a comment to T2828: Archive counters are no longer updated in production.

As the slowness of the monitoring requests doesn't seem to be related to the direct load on the database, the indexers were restarted:

vsellier@pergamon ~ % sudo clush -b -w @azure-workers 'cd /etc/systemd/system/multi-user.target.wants; for unit in "swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service"; do systemctl enable $unit; done; systemctl start swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service; puppet agent --enable'
Dec 1 2020, 9:37 AM · Monitoring, Web app, System administration
vsellier updated the summary of D4635: exclude temporary schemas from the statistics.
Dec 1 2020, 9:31 AM
vsellier updated the diff for D4635: exclude temporary schemas from the statistics.

Update description

Dec 1 2020, 9:30 AM
vsellier added a comment to T2828: Archive counters are no longer updated in production.

D4635 is a proposal to solve the performance issues with the statistics queries.

Dec 1 2020, 9:22 AM · Monitoring, Web app, System administration
vsellier added a revision to T2828: Archive counters are no longer updated in production: D4635: exclude temporary schemas from the statistics.
Dec 1 2020, 9:21 AM · Monitoring, Web app, System administration
vsellier created D4635: exclude temporary schemas from the statistics.
Dec 1 2020, 9:21 AM

Nov 30 2020

vsellier updated subscribers of T2828: Archive counters are no longer updated in production.

@olasd has stopped the backfilling with:

pkill -2 -u swhstorage -f revision

(signal 2, SIGINT, allows the logs to be flushed before exiting)
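
To illustrate why an interrupt signal (rather than a hard kill) gives a process the chance to flush its logs, here is a minimal Python sketch. It relies on Python's default behaviour of turning SIGINT into `KeyboardInterrupt`; `process_batches` and its arguments are hypothetical, and how the actual backfiller handles the signal is not shown in this feed.

```python
import logging
from typing import Callable, Iterable


def process_batches(
    batches: Iterable[int],
    handle: Callable[[int], None],
    logger: logging.Logger,
) -> int:
    """Process batches until done or until SIGINT arrives.

    Python raises KeyboardInterrupt on SIGINT by default, so the
    except/finally clauses below run before the process exits.
    """
    processed = 0
    try:
        for batch in batches:
            handle(batch)
            processed += 1
    except KeyboardInterrupt:
        logger.warning("interrupted after %d batches", processed)
    finally:
        # Flush every handler attached to this logger so buffered records
        # are written out before returning.
        for handler in logger.handlers:
            handler.flush()
    return processed
```

A SIGKILL (`pkill -9`) would skip both clauses, which is why the gentler signal was preferred here.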

Nov 30 2020, 7:49 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

Half of the workers were stopped:

root@pergamon:~# sudo clush -b -w @swh-workers16 'puppet agent --disable "Reduce load of belvedere"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker11: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker10: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker09: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@lister.service.
...
Nov 30 2020, 6:09 PM · Monitoring, Web app, System administration
vsellier triaged T2831: sql exporter is failing to retrieve the number of running queries as Normal priority.
Nov 30 2020, 6:05 PM · Monitoring, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

It seems there is no other solution than reducing the load on belvedere.
There is an aggressive backfill in progress from getty (192.168.100.102):

postgres=# select client_addr, count(datid) from pg_stat_activity where state != 'idle' group by client_addr;
   client_addr   | count 
-----------------+-------
                 |     3
 192.168.100.18  |     0
 ::1             |     1
 192.168.100.210 |    60
 192.168.100.102 |    64
(5 rows)

I don't want to kill the job, which has been running for several days (since 2020-11-27), to avoid losing any work. The temporary solution is to reduce the number of workers to relieve the load on belvedere.

Nov 30 2020, 5:36 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

Hmmm... there is definitely no need to update the counters more than once a day

Nov 30 2020, 5:19 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

Let's try a temporary workaround:

root@belvedere:/etc/prometheus-sql-exporter# puppet agent --disable "Diagnose prometheus-exporter timeout" 
root@belvedere:/etc/prometheus-sql-exporter# mv swh-scheduler.yml ~
root@belvedere:/etc/prometheus-sql-exporter# systemctl restart prometheus-sql-exporter
Nov 30 2020, 4:18 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

It seems some queries are executed on the database each time the metrics are requested.
This one takes too long (on the swh-scheduler instance):

Nov 30 2020, 4:04 PM · Monitoring, Web app, System administration
vsellier added a comment to T2828: Archive counters are no longer updated in production.

After retracing the counter computation pipeline, it seems the counters are computed from the values stored in Prometheus.

Nov 30 2020, 3:34 PM · Monitoring, Web app, System administration
vsellier changed the status of T2828: Archive counters are no longer updated in production from Open to Work in Progress.
Nov 30 2020, 3:20 PM · Monitoring, Web app, System administration
vsellier closed T2790: [staging] deploy the journal infrastructure as Resolved.
Nov 30 2020, 10:47 AM · System administration, Staging environment
vsellier closed T2790: [staging] deploy the journal infrastructure, a subtask of T2682: Deploy a small publicly available kafka server (with some content) on a staging (+ the related objstorage), as Resolved.
Nov 30 2020, 10:47 AM · Staging environment, System administration

Nov 27 2020

vsellier closed T2816: Enable the journal-writer for the swh-idx-storage in staging, a subtask of T2590: Finish the indexer -> swh-search pipeline, as Resolved.
Nov 27 2020, 6:20 PM · Journal, Archive search
vsellier closed T2816: Enable the journal-writer for the swh-idx-storage in staging as Resolved.

The swh-indexer stack is deployed on staging and the initial loading is done.
The volumes are quite low:

Nov 27 2020, 6:20 PM · System administrators, Staging environment, Journal, Archive search
vsellier created P885 indexer error.
Nov 27 2020, 5:21 PM
vsellier committed rSPSITE2e1a65a3e33b: staging: Fix object storage configuration for indexers (authored by vsellier).
staging: Fix object storage configuration for indexers
Nov 27 2020, 3:47 PM
vsellier closed D4625: staging: Fix object storage configuration for indexers.
Nov 27 2020, 3:47 PM
vsellier committed rSPSITEa2a84c2efb3e: staging: configure idx-storage to write to kafka (authored by vsellier).
staging: configure idx-storage to write to kafka
Nov 27 2020, 3:47 PM
vsellier closed D4620: staging: configure idx-storage to write to kafka.
Nov 27 2020, 3:47 PM
vsellier added a revision to T2816: Enable the journal-writer for the swh-idx-storage in staging: D4625: staging: Fix object storage configuration for indexers.
Nov 27 2020, 3:20 PM · System administrators, Staging environment, Journal, Archive search
vsellier created D4625: staging: Fix object storage configuration for indexers.
Nov 27 2020, 3:20 PM
vsellier created P884 (An Untitled Masterwork).
Nov 27 2020, 12:53 PM
vsellier added a revision to T2816: Enable the journal-writer for the swh-idx-storage in staging: D4620: staging: configure idx-storage to write to kafka.
Nov 27 2020, 10:43 AM · System administrators, Staging environment, Journal, Archive search
vsellier created D4620: staging: configure idx-storage to write to kafka.
Nov 27 2020, 10:43 AM
vsellier added a comment to T2590: Finish the indexer -> swh-search pipeline.

This is a description of the pipeline to clarify the interactions between the components (source: P883):

Nov 27 2020, 10:14 AM · Journal, Archive search
vsellier created P883 Plantuml diagram for origin visits to swh search pipeline.
Nov 27 2020, 10:13 AM

Nov 26 2020

vsellier changed the status of T2817: Enable the swh-search environment in staging, a subtask of T2590: Finish the indexer -> swh-search pipeline, from Open to Work in Progress.
Nov 26 2020, 5:59 PM · Journal, Archive search
vsellier renamed T2817: Enable the swh-search environment in staging from Enable the swh-search in staging to Enable the swh-search environment in staging.
Nov 26 2020, 5:59 PM · System administrators, Staging environment, Journal, Archive search
vsellier triaged T2817: Enable the swh-search environment in staging as Normal priority.
Nov 26 2020, 5:58 PM · System administrators, Staging environment, Journal, Archive search
vsellier added a comment to T2816: Enable the journal-writer for the swh-idx-storage in staging.

T2814 needs to be released first.

Nov 26 2020, 5:46 PM · System administrators, Staging environment, Journal, Archive search
vsellier triaged T2816: Enable the journal-writer for the swh-idx-storage in staging as Normal priority.
Nov 26 2020, 5:40 PM · System administrators, Staging environment, Journal, Archive search
vsellier committed rSPRE2e93ac6e6534: Remove deprecation warning (authored by vsellier).
Remove deprecation warning
Nov 26 2020, 5:17 PM
vsellier closed D4614: Remove deprecation warning.
Nov 26 2020, 5:16 PM
vsellier created D4614: Remove deprecation warning.
Nov 26 2020, 5:16 PM
vsellier committed rSPREdf717a7dd2fa: Reflect manual changes applied on journal0 (authored by vsellier).
Reflect manual changes applied on journal0
Nov 26 2020, 5:13 PM
vsellier closed D4613: Reflect manual changes applied on journal0.
Nov 26 2020, 5:13 PM
vsellier created D4613: Reflect manual changes applied on journal0.
Nov 26 2020, 5:13 PM
vsellier added a revision to T2790: [staging] deploy the journal infrastructure: D4613: Reflect manual changes applied on journal0.
Nov 26 2020, 5:13 PM · System administration, Staging environment
vsellier created P882 swh-indexer-journal-client error on origin_visit.
Nov 26 2020, 3:17 PM
vsellier committed rDCIDXd92c241980db: swh.indexer.cli.journal_client: ensure the minimal configuration exists (authored by vsellier).
swh.indexer.cli.journal_client: ensure the minimal configuration exists
Nov 26 2020, 2:56 PM
vsellier closed D4599: swh.indexer.cli.journal_client: fix config use.
Nov 26 2020, 2:56 PM
vsellier updated the diff for D4599: swh.indexer.cli.journal_client: fix config use.

Improve test coverage and change mandatory configuration validation

Nov 26 2020, 2:44 PM
vsellier updated the diff for D4599: swh.indexer.cli.journal_client: fix config use.

Fix tests

Nov 26 2020, 2:20 PM
vsellier added a comment to T2790: [staging] deploy the journal infrastructure.

The backfilling is complete (except for the metadata). We will now focus on some clients to ensure all the local configuration is correct (T2814, for example), and then on exposing kafka to the outside.

Nov 26 2020, 12:49 PM · System administration, Staging environment
vsellier added a comment to D4599: swh.indexer.cli.journal_client: fix config use.

The diff fixes the configuration issue, but it seems there is another problem with the visits:

Nov 26 2020, 12:27 PM
vsellier added a revision to T2814: Fix swh indexer journal client service: D4599: swh.indexer.cli.journal_client: fix config use.
Nov 26 2020, 12:22 PM · Journal, Indexer
vsellier created D4599: swh.indexer.cli.journal_client: fix config use.
Nov 26 2020, 12:22 PM
vsellier accepted D4582: storage.backfill: Allow cli run for origin_visit_status as well.

LGTM, tested on staging

Nov 26 2020, 9:26 AM