Page MenuHomeSoftware Heritage

Storage metrics not refreshed
Closed, MigratedEdits Locked

Description

Since the 2022-03-23, it seems the statsd storage metrics are not updated anymore.

It could be a regression of 284a4ab3066956dc403c4ada700e8421990c0528 (released in 1.1.0) as the date match with the upgrade from version 1.0.0 to version 1.2.0 of the python3-swh.storage package.

Start-Date: 2022-03-23  17:01:23
Commandline: apt dist-upgrade
Requested-By: olasd (1001)
Install: linux-headers-5.10.0-0.bpo.12-common:amd64 (5.10.103-1~bpo10+1, automatic), linux-headers-5.10.0-0.bpo.12-amd64:amd64 (5.10.103-1~bpo10+1, automatic), li
nux-image-5.10.0-0.bpo.12-amd64:amd64 (5.10.103-1~bpo10+1, automatic)
Upgrade: linux-kbuild-5.10:amd64 (5.10.92-1~bpo10+1, 5.10.103-1~bpo10+1), linux-image-amd64:amd64 (5.10.70-1~bpo10+1, 5.10.103-1~bpo10+1), python3-swh.model:amd64
 (4.4.0-1~swh1~bpo10+1, 6.0.0-1~swh1~bpo10+1), linux-headers-amd64:amd64 (5.10.70-1~bpo10+1, 5.10.103-1~bpo10+1), python3-swh.storage:amd64 (1.0.0-1~swh1~bpo10+1,
 1.2.0-1~swh1~bpo10+1), python3-swh.core:amd64 (2.2.1-1~swh1~bpo10+1, 2.2.2-1~swh1~bpo10+1)
End-Date: 2022-03-23  17:03:12

The monitoring stack seems to work correctly as some statistics are still updated.
for example, the stats of the content replayer:

 # HELP swh_content_replayer_retries_total Metric autogenerated by statsd_exporter.
 # TYPE swh_content_replayer_retries_total counter
 swh_content_replayer_retries_total{attempt="1",operation="copy"} 1.2108223e+07
-swh_content_replayer_retries_total{attempt="1",operation="get_object"} 3.20125529e+08
-swh_content_replayer_retries_total{attempt="1",operation="put_object"} 3.17732857e+08
+swh_content_replayer_retries_total{attempt="1",operation="get_object"} 3.20156286e+08
+swh_content_replayer_retries_total{attempt="1",operation="put_object"} 3.17763416e+08
 swh_content_replayer_retries_total{attempt="2",operation="copy"} 950575
-swh_content_replayer_retries_total{attempt="2",operation="put_object"} 2.3881e+06
+swh_content_replayer_retries_total{attempt="2",operation="put_object"} 2.388296e+06
 swh_content_replayer_retries_total{attempt="3",operation="copy"} 9220
 swh_content_replayer_retries_total{attempt="3",operation="put_object"} 446

The storage duration are still updated but not the operation counts:

 # HELP swh_objstorage_request_duration_seconds_error_count Metric autogenerated by statsd_exporter.
 # TYPE swh_objstorage_request_duration_seconds_error_count counter
 swh_objstorage_request_duration_seconds_error_count{endpoint="get_bytes"} 1166
@@ -1154,33 +1154,33 @@
 swh_storage_request_duration_seconds_bucket{endpoint="extid_get_from_target",le="+Inf"} 3.956518e+06
 swh_storage_request_duration_seconds_sum{endpoint="extid_get_from_target"} 39741.5300646238
 swh_storage_request_duration_seconds_count{endpoint="extid_get_from_target"} 3.956518e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.005"} 1.084838e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.01"} 1.084912e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.025"} 1.084965e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.05"} 1.084974e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.1"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.25"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.5"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.75"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="1"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="2"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="5"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="10"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="15"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="30"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="45"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="60"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="120"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="300"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="600"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="900"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="1800"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="2700"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="3600"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="7200"} 1.084975e+06
-swh_storage_request_duration_seconds_bucket{endpoint="index",le="+Inf"} 1.084975e+06
-swh_storage_request_duration_seconds_sum{endpoint="index"} 11.127004264484611
-swh_storage_request_duration_seconds_count{endpoint="index"} 1.084975e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.005"} 1.084857e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.01"} 1.084931e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.025"} 1.084984e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.05"} 1.084993e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.1"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.25"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.5"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="0.75"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="1"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="2"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="5"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="10"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="15"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="30"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="45"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="60"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="120"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="300"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="600"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="900"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="1800"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="2700"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="3600"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="7200"} 1.084994e+06
+swh_storage_request_duration_seconds_bucket{endpoint="index",le="+Inf"} 1.084994e+06
+swh_storage_request_duration_seconds_sum{endpoint="index"} 11.127209654639927
+swh_storage_request_duration_seconds_count{endpoint="index"} 1.084994e+06

Event Timeline

vsellier renamed this task from Storage metrics not updated to Storage metrics not refreshed.Mar 30 2022, 10:01 AM
vsellier changed the task status from Open to Work in Progress.
vsellier triaged this task as High priority.
vsellier created this task.

What exact metric are you saying isn't updating? In your diff, the swh_storage_request_duration_seconds_count{endpoint="index"} metric seems to be increasing normally

swh.storage 1.2.0 increased a bunch of timeouts and made some queries smarter so it's entirely plausible that the number of errors has dropped drastically.

Sorry the description was not completely clear
The durations metrics are still updated, but not the operations count ones
In 2 metrics at 10 minutes of interval:

vsellier@saam ~ % ls -al /tmp/m[12]              
-rw-r--r-- 1 vsellier vsellier 176850 Mar 30 07:25 /tmp/m1
-rw-r--r-- 1 vsellier vsellier 176846 Mar 30 07:34 /tmp/m2

Only the operation count of the indexers are update:

vsellier@saam ~ % diff -u3 /tmp/m1 /tmp/m2 | grep operations
 # HELP swh_content_replayer_operations_total Metric autogenerated by statsd_exporter.
 # TYPE swh_content_replayer_operations_total counter
-swh_content_replayer_operations_total{decision="copied"} 3.691309537e+09
+swh_content_replayer_operations_total{decision="copied"} 3.691340292e+09
 swh_content_replayer_operations_total{decision="excluded"} 3.42409768e+09
 swh_content_replayer_operations_total{decision="failed"} 5504
 swh_content_replayer_operations_total{decision="not_in_src"} 165
 # HELP swh_indexer_storage_operations_total Metric autogenerated by statsd_exporter.
 # TYPE swh_indexer_storage_operations_total counter
 swh_indexer_storage_operations_total{endpoint="content_fossology_license_add",object_type="content_fossology_license",operation="add"} 8.02107655e+08
 swh_indexer_storage_operations_total{endpoint="content_metadata_add",object_type="content_metadata",operation="add"} 577453
-swh_indexer_storage_operations_total{endpoint="content_mimetype_add",object_type="content_mimetype",operation="add"} 4.013740961e+09
+swh_indexer_storage_operations_total{endpoint="content_mimetype_add",object_type="content_mimetype",operation="add"} 4.013748961e+09
 swh_indexer_storage_operations_total{endpoint="indexer_configuration_add",object_type="indexer_configuration",operation="add"} 7.120871e+07
 swh_indexer_storage_operations_total{endpoint="origin_intrinsic_metadata_add",object_type="origin_intrinsic_metadata",operation="add"} 5.3757244e+07
 swh_indexer_storage_operations_total{endpoint="revision_intrinsic_metadata_add",object_type="revision_intrinsic_metadata",operation="add"} 5.3743287e+07
vsellier@saam ~ % grep swh_storage_operation /tmp/m1
# HELP swh_storage_operations_bytes_total Metric autogenerated by statsd_exporter.
# TYPE swh_storage_operations_bytes_total counter
swh_storage_operations_bytes_total{endpoint="content_add",object_type="content",operation="add"} 2.93119486862783e+14
# HELP swh_storage_operations_total Metric autogenerated by statsd_exporter.
# TYPE swh_storage_operations_total counter
swh_storage_operations_total{endpoint="content_add",object_type="content",operation="add"} 2.218783146e+09
swh_storage_operations_total{endpoint="directory_add",object_type="directory",operation="add"} 1.817779486e+09
swh_storage_operations_total{endpoint="directory_metadata_add",object_type="dir_metadata",operation="add"} 2.3578921e+07
swh_storage_operations_total{endpoint="metadata_authority",object_type="metadata_authority",operation="add"} 3.055405e+06
swh_storage_operations_total{endpoint="metadata_authority_add",object_type="metadata_authority",operation="add"} 1.0029911e+07
swh_storage_operations_total{endpoint="metadata_fetcher",object_type="metadata_fetcher",operation="add"} 3.05542e+06
swh_storage_operations_total{endpoint="metadata_fetcher_add",object_type="metadata_fetcher",operation="add"} 1.0029917e+07
swh_storage_operations_total{endpoint="origin_add",object_type="origin",operation="add"} 1.8290367e+07
swh_storage_operations_total{endpoint="origin_metadata_add",object_type="ori_metadata",operation="add"} 162
swh_storage_operations_total{endpoint="origin_visit",object_type="origin_visit",operation="add"} 1.34871326e+08
swh_storage_operations_total{endpoint="origin_visit_status",object_type="origin_visit_status",operation="add"} 1.37391811e+08
swh_storage_operations_total{endpoint="origin_visit_status_add",object_type="origin_visit_status",operation="add"} 6.6002981e+07
swh_storage_operations_total{endpoint="raw_extrinsic_metadata_add",object_type="dir_metadata",operation="add"} 3.0642109e+07
swh_storage_operations_total{endpoint="raw_extrinsic_metadata_add",object_type="emd_metadata",operation="add"} 349
swh_storage_operations_total{endpoint="raw_extrinsic_metadata_add",object_type="ori_metadata",operation="add"} 13054
swh_storage_operations_total{endpoint="raw_extrinsic_metadata_add",object_type="snp_metadata",operation="add"} 417
swh_storage_operations_total{endpoint="raw_extrinsic_metadata_metadata_add",object_type="emd_metadata",operation="add"} 56
swh_storage_operations_total{endpoint="release_add",object_type="release",operation="add"} 1.5497131e+07
swh_storage_operations_total{endpoint="revision_add",object_type="revision",operation="add"} 4.75891848e+08
swh_storage_operations_total{endpoint="skipped_content_add",object_type="skipped_content",operation="add"} 74894
swh_storage_operations_total{endpoint="snapshot_add",object_type="snapshot",operation="add"} 2.9549466e+07
swh_storage_operations_total{endpoint="snapshot_metadata_add",object_type="snp_metadata",operation="add"} 110
vsellier@saam ~ % grep swh_storage_operation /tmp/m2
# HELP swh_storage_operations_bytes_total Metric autogenerated by statsd_exporter.
# TYPE swh_storage_operations_bytes_total counter
swh_storage_operations_bytes_total{endpoint="content_add",object_type="content",operation="add"} 2.93119486862783e+14
# HELP swh_storage_operations_total Metric autogenerated by statsd_exporter.
# TYPE swh_storage_operations_total counter
swh_storage_operations_total{endpoint="content_add",object_type="content",operation="add"} 2.218783146e+09
swh_storage_operations_total{endpoint="directory_add",object_type="directory",operation="add"} 1.817779486e+09
swh_storage_operations_total{endpoint="directory_metadata_add",object_type="dir_metadata",operation="add"} 2.3578921e+07
swh_storage_operations_total{endpoint="metadata_authority",object_type="metadata_authority",operation="add"} 3.055405e+06
swh_storage_operations_total{endpoint="metadata_authority_add",object_type="metadata_authority",operation="add"} 1.0029911e+07
swh_storage_operations_total{endpoint="metadata_fetcher",object_type="metadata_fetcher",operation="add"} 3.05542e+06
swh_storage_operations_total{endpoint="metadata_fetcher_add",object_type="metadata_fetcher",operation="add"} 1.0029917e+07
swh_storage_operations_total{endpoint="origin_add",object_type="origin",operation="add"} 1.8290367e+07
swh_storage_operations_total{endpoint="origin_metadata_add",object_type="ori_metadata",operation="add"} 162
swh_storage_operations_total{endpoint="origin_visit",object_type="origin_visit",operation="add"} 1.34871326e+08
swh_storage_operations_total{endpoint="origin_visit_status",object_type="origin_visit_status",operation="add"} 1.37391811e+08
swh_storage_operations_total{endpoint="origin_visit_status_add",object_type="origin_visit_status",operation="add"} 6.6002981e+07
swh_storage_operations_total{endpoint="raw_extrinsic_metadata_add",object_type="dir_metadata",operation="add"} 3.0642109e+07
swh_storage_operations_total{endpoint="raw_extrinsic_metadata_add",object_type="emd_metadata",operation="add"} 349
swh_storage_operations_total{endpoint="raw_extrinsic_metadata_add",object_type="ori_metadata",operation="add"} 13054
swh_storage_operations_total{endpoint="raw_extrinsic_metadata_add",object_type="snp_metadata",operation="add"} 417
swh_storage_operations_total{endpoint="raw_extrinsic_metadata_metadata_add",object_type="emd_metadata",operation="add"} 56
swh_storage_operations_total{endpoint="release_add",object_type="release",operation="add"} 1.5497131e+07
swh_storage_operations_total{endpoint="revision_add",object_type="revision",operation="add"} 4.75891848e+08
swh_storage_operations_total{endpoint="skipped_content_add",object_type="skipped_content",operation="add"} 74894
swh_storage_operations_total{endpoint="snapshot_add",object_type="snapshot",operation="add"} 2.9549466e+07
swh_storage_operations_total{endpoint="snapshot_metadata_add",object_type="snp_metadata",operation="add"} 110
vsellier claimed this task.

Thanks olasd for restarting the service following this documentation: https://docs.gunicorn.org/en/stable/signals.html#upgrading-to-a-new-binary-on-the-fly

First, replace the old binary with a new one, then send a USR2 signal to the current master process. It executes a new binary whose PID file is postfixed with .2 (e.g. /var/run/gunicorn.pid.2), which in turn starts a new master process and new worker processes:
At this point, two instances of Gunicorn are running, handling the incoming requests together. To phase the old instance out, you have to send a WINCH signal to the old master process, and its worker processes will start to gracefully shut down.

Everything is working well again now.