LGTM according to the journal content
Dec 10 2020
LGTM
LGTM
LGTM
LGTM
LGTM
Dec 9 2020
The search RPC backend and the journal client listening on the origin and origin_visit topics are deployed.
The inventory is up to date for both hosts [1][2]
Dec 8 2020
Changing the status to "Resolved" as it seems there is nothing more to do on this task, since the counters have started to be updated again.
LGTM
ack, let's try the current version of the journal client :)
As discussed, we will need several instances of the journal client to be able to use different prefixes.
They should be managed by puppet too.
My previous comment was not for this diff but for D4668 :)
As discussed, we will need several instances of the journal client to be able to use different prefixes.
They should be managed by puppet too.
A dashboard to monitor the ES cluster behavior has been created on Grafana [1]
It will be improved during the swh-search tests
Dec 7 2020
Interesting note about how to size the shards of an index: https://www.elastic.co/guide/en/elasticsearch/reference/7.x//size-your-shards.html
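As a quick check related to that note, the current shard sizes can be listed directly from the cluster (read-only query; $ES_NODE as used elsewhere in this journal):
# list shards sorted by store size, biggest first
curl -s "http://$ES_NODE:9200/_cat/shards?v&s=store:desc" | head -20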
- an increment of 250GB was added via the Proxmox UI (for a total of 500GB now)
- the disk was resized on the OS side:
root@pergamon:~# parted /dev/vdc
GNU Parted 3.2
Using /dev/vdc
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: Virtio Block Device (virtblk)
Disk /dev/vdc: 537GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
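For reference, the remaining steps to make the new space usable look roughly like this (a sketch only; the partition number and the filesystem type are assumptions to be adjusted to the actual layout):
# grow the partition to the end of the disk, then the filesystem on top of it
parted /dev/vdc resizepart 1 100%
resize2fs /dev/vdc1      # for ext4; use xfs_growfs on the mountpoint for XFS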
puppet was disabled on the production nodes to prevent this diff from being applied. We will perform a rolling restart of the production cluster after the next scheduled kernel upgrade.
The disk will be resized to avoid a service disruption in the short term, before looking at T1362.
Dec 4 2020
esnode3 was restarted and updated.
~/src/swh/puppet-environment master* ❯ curl -XPUT -H "Content-Type: application/json" http://$ES_NODE:9200/_cluster/settings -d '{ "persistent": { "cluster.routing.allocation.enable": "primaries" } }'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"enable":"primaries"}}}},"transient":{}}
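Once the restarted node has rejoined the cluster, allocation can be re-enabled by resetting the setting, following the standard Elasticsearch rolling-restart procedure:
# reset cluster.routing.allocation.enable to its default (all)
curl -XPUT -H "Content-Type: application/json" http://$ES_NODE:9200/_cluster/settings -d '{ "persistent": { "cluster.routing.allocation.enable": null } }'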
The puppet configuration is applied on esnode1 and esnode2, but we should have taken the opportunity to perform a system update.
A naive upgrade was started but the cluster collapsed, with a node running out of memory during the shard rebalancing.
fake LGTM to be able to change the status
Dec 3 2020
According to the description and to the latest discussion, we have decided not to continue in this direction.
Dec 2 2020
This is not viable as the OPNsense API doesn't allow retrieving the network rules.
After T2828, it's clearer what must be deployed to have the counters working on staging:
- The counters can be initialized via the /stat/refresh endpoint of the storage api (Note: it will create more counters than in production, as directory_entry_* and revision_history are not counted in production)
- Add a script/service to execute `swh_update_counter_bucketed` in an infinite loop (see the sketch after this list)
- Create the buckets in the object_counts_bucketed table
- per object type: identifier|bucket_start|bucket_end. value and last_update will be updated by the stored procedures.
- Configure the prometheus sql exporter for db1.staging [1]
- Configure profile_exporter on pergamon
- Update the script to ensure the data are filtered by environment (to avoid staging data being included in the production counts [2])
- Configure a new cron
- Load an empty file for the historical data
- Create a new export_file
- Update the webapp so the origin of the counters can be configured
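A minimal sketch of the update loop mentioned above, assuming psql can reach db1.staging with suitable credentials and that swh_update_counter_bucketed takes no argument (both are assumptions):
# hypothetical loop driving the bucketed counters update on db1.staging
while true; do
    psql -h db1.staging -U swh -d swh -c "SELECT swh_update_counter_bucketed();"
    sleep 60    # arbitrary pause between two bucket updates
done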
I took the time to create a diagram of the pipeline to help me summarize the subject.
It should help to deploy the counters in staging.
SVG:
Dec 1 2020
I was right until the last point :). My mistake was to base my reasoning on an old version of the stored procedure.
Thanks again for the explanation, it's crystal clear now.
Thanks for the clarification.
I missed those counters; I was only focused on the sql_swh_archive_object_count metrics. Could you give some pointers or information on how it's called? I could only find the stored procedure declaration in storage [1].
All the stopped workers were restarted:
vsellier@pergamon ~ % sudo clush -b -w @swh-workers16 'puppet agent --enable; systemctl default'
Erratum: the counters are not yet visible on the "Object added by time period" dashboard due to the aggregation per day.
In T2828#53740, @rdicosmo wrote: Thanks for looking into this. It would be great to make sure that statistics are collected only once at a time (every X hours), and cached, so to avoid rerunning expensive queries regularly.
The PostgreSQL statistics came back online [1].
The "Object added by time period" dashboard [2] also has data to display.
D4635 is landed.
The installation doesn't look too complicated [1]; we need to install a new firewall dedicated to admin/internal tools.
The unknown is the SSO part, but as a quick win, we can try to plug it into the current softwareheritage Keycloak setup with a dedicated group.
In D4635#115756, @olasd wrote: I was also somewhat concerned that this filtering would, in fact, show a partial picture of what's happening on the database, as we do a lot of processing / I/Os in temp tables. But in practice the aggregate value of i/os on temp tables is orders of magnitude lower than that of any actual table, so I guess this is fine...
Exclude only pg_temp schemas
This is the complete result of the statistics with the temporary tables: P886
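To illustrate the "exclude only pg_temp schemas" filtering above, a hedged sketch of the kind of query involved (using pg_statio_all_tables; the exact query used by the exporter may differ):
psql -c "
SELECT schemaname, relname, heap_blks_read, heap_blks_hit
FROM pg_statio_all_tables
WHERE schemaname !~ '^pg_temp'
ORDER BY heap_blks_read DESC
LIMIT 10;"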
As the slowness of the monitoring requests doesn't seem to be related to the direct load on the database, the indexers were restarted:
vsellier@pergamon ~ % sudo clush -b -w @azure-workers 'cd /etc/systemd/system/multi-user.target.wants; for unit in "swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service"; do systemctl enable $unit; done; systemctl start swh-worker@indexer_origin_intrinsic_metadata.service swh-worker@indexer_fossology_license.service swh-worker@indexer_content_mimetype.service; puppet agent --enable'
Update description
D4635 is a proposal to solve the performance issues with the statistics queries.
Nov 30 2020
@olasd has stopped the backfilling with:
pkill -2 -u swhstorage -f revision
(this allows the logs to be flushed before exiting)
Half of the workers were stopped:
root@pergamon:~# sudo clush -b -w @swh-workers16 'puppet agent --disable "Reduce load of belvedere"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker11: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker10: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker09: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@checker_deposit.service.
worker12: Removed /etc/systemd/system/multi-user.target.wants/swh-worker@lister.service.
...
It seems there is no other solution than reducing the load on belvedere.
There is an aggressive backfill in progress from getty (192.168.100.102):
postgres=# select client_addr, count(datid) from pg_stat_activity where state != 'idle' group by client_addr;
   client_addr   | count
-----------------+-------
                 |     3
 192.168.100.18  |     0
 ::1             |     1
 192.168.100.210 |    60
 192.168.100.102 |    64
(5 rows)
I don't want to kill a job that has been running for several days (since 2020-11-27) to avoid losing any work. The temporary solution is to reduce the number of workers to relieve the load on belvedere.
In T2828#53672, @rdicosmo wrote: Hmmm... there is definitely no need to update the counters more than once a day
Let's try a temporary workaround:
root@belvedere:/etc/prometheus-sql-exporter# puppet agent --disable "Diagnose prometheus-exporter timeout"
root@belvedere:/etc/prometheus-sql-exporter# mv swh-scheduler.yml ~
root@belvedere:/etc/prometheus-sql-exporter# systemctl restart prometheus-sql-exporter
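Once the slow query is fixed, the workaround can be reverted by simply inverting the commands above (a sketch):
# put the scheduler exporter config back and let puppet manage the host again
mv ~/swh-scheduler.yml /etc/prometheus-sql-exporter/
systemctl restart prometheus-sql-exporter
puppet agent --enable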
It seems some queries are executed on the database each time the metrics are requested.
This one is too long (on the swh-scheduler instance):
After retracing the counter computation pipeline, it seems they are computed from the values stored in Prometheus.
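For example, the metric feeding them can be queried directly from the Prometheus HTTP API (the metric name comes from this journal; the host and port are assumptions):
# query the current value of the archive object count metric
curl -s "http://pergamon:9090/api/v1/query?query=sql_swh_archive_object_count"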