diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -237,3 +237,4 @@
    api-reference
    archive-changelog
    journal
+   statsd
diff --git a/docs/statsd.rst b/docs/statsd.rst
new file mode 100644
--- /dev/null
+++ b/docs/statsd.rst
@@ -0,0 +1,253 @@
+.. _swh_statsd_metrics:
+
+Statsd metrics and Grafana dashboards
+=====================================
+
+This page lists all statsd metrics reported by Software Heritage's components,
+and other metrics commonly used to monitor them.
+
+.. _swh_statsd_metrics_archive:
+
+Archive
+-------
+
+* ``sql_swh_archive_object_count``
+* ``sql_swh_scheduler_delay``
+* ``swh_archive_object_total``
+
+.. _swh_statsd_metrics_journal:
+
+Journal
+-------
+
+* ``swh_journal_client_handle_message_total``
+* ``swh_journal_client_status``
+
+Client progress and status are monitored using the `Kafka estimated time to completion
+`
+dashboard for a loader-specific view, and `Kafka consumer lags
+` to show all
+consumers at once.
+
+.. _swh_statsd_metrics_indexers:
+
+Indexers
+--------
+
+See :ref:`swh_statsd_metrics_rpc`.
+
+.. _swh_statsd_metrics_loaders:
+
+Loaders
+-------
+
+Filtered objects, i.e. objects received by the loader that the archive
+already has (currently only reported by the Git loader):
+
+* ``swh_loader_filtered_objects_percent_bucket``
+* ``swh_loader_filtered_objects_percent_count``
+* ``swh_loader_filtered_objects_percent_sum``
+* ``swh_loader_filtered_objects_total_count``
+* ``swh_loader_filtered_objects_total_sum``
+
+Git references which are not loaded:
+
+* ``swh_loader_git_ignored_refs_percent_bucket``
+* ``swh_loader_git_ignored_refs_percent_count``
+* ``swh_loader_git_ignored_refs_percent_sum``
+* ``swh_loader_git_known_refs_percent_bucket``
+* ``swh_loader_git_known_refs_percent_count``
+* ``swh_loader_git_known_refs_percent_sum``
+* ``swh_loader_git_total``
+
+Metadata loading:
+
+* ``swh_loader_metadata_fetchers_count`` and ``swh_loader_metadata_fetchers_sum``: the ratio of sum to count is the average number of fetchers used per visit
+* ``swh_loader_metadata_objects_count``: total number of metadata objects loaded
+* ``swh_loader_metadata_objects_sum``
+* ``swh_loader_metadata_parent_origins_count`` and ``swh_loader_metadata_parent_origins_sum``: the ratio of sum to count is the average number of origins this origin is a fork of
+
+Performance (all labeled with the name of an operation and, for the Git loader,
+by whether they are incremental):
+
+* ``swh_loader_operation_duration_seconds_bucket``
+* ``swh_loader_operation_duration_seconds_count``
+* ``swh_loader_operation_duration_seconds_error_count``
+* ``swh_loader_operation_duration_seconds_sum``
+
+Loader status is monitored through the `Ingestion status`_ and `Loader metrics`_
+dashboards, which focus on loaded objects and on the loaders themselves, respectively.
+
+.. _Ingestion status: https://grafana.softwareheritage.org/d/Cgi8dR8Wz/ingestion-status
+.. _Loader metrics: https://grafana.softwareheritage.org/d/FqGC4zu7z/vlorentz-loader-metrics
+
+.. _swh_statsd_metrics_objstorage:
+
+Object storage
+--------------
+
+In addition to :ref:`swh_statsd_metrics_rpc`:
+
+* ``swh_objstorage_in_bytes_total``
+* ``swh_objstorage_out_bytes_total``
+
+.. _swh_statsd_metrics_provenance:
+
+Provenance
+----------
+
+* ``swh_provenance_archive_direct_duration_seconds_bucket``
+* ``swh_provenance_archive_direct_duration_seconds_count``
+* ``swh_provenance_archive_direct_duration_seconds_error_count``
+* ``swh_provenance_archive_direct_duration_seconds_sum``
+* ``swh_provenance_archive_graph_duration_seconds_bucket``
+* ``swh_provenance_archive_graph_duration_seconds_count``
+* ``swh_provenance_archive_graph_duration_seconds_sum``
+* ``swh_provenance_archive_multiplexed_duration_seconds_bucket``
+* ``swh_provenance_archive_multiplexed_duration_seconds_count``
+* ``swh_provenance_archive_multiplexed_duration_seconds_error_count``
+* ``swh_provenance_archive_multiplexed_duration_seconds_sum``
+* ``swh_provenance_archive_multiplexed_per_backend_count``
+* ``swh_provenance_backend_duration_seconds_bucket``
+* ``swh_provenance_backend_duration_seconds_count``
+* ``swh_provenance_backend_duration_seconds_error_count``
+* ``swh_provenance_backend_duration_seconds_sum``
+* ``swh_provenance_backend_operations_total``
+* ``swh_provenance_graph_duration_seconds_bucket``
+* ``swh_provenance_graph_duration_seconds_count``
+* ``swh_provenance_graph_duration_seconds_error_count``
+* ``swh_provenance_graph_duration_seconds_sum``
+* ``swh_provenance_origin_revision_layer_duration_seconds_bucket``
+* ``swh_provenance_origin_revision_layer_duration_seconds_count``
+* ``swh_provenance_origin_revision_layer_duration_seconds_error_count``
+* ``swh_provenance_origin_revision_layer_duration_seconds_sum``
+* ``swh_provenance_storage_postgresql_duration_seconds_bucket``
+* ``swh_provenance_storage_postgresql_duration_seconds_count``
+* ``swh_provenance_storage_postgresql_duration_seconds_error_count``
+* ``swh_provenance_storage_postgresql_duration_seconds_sum``
+* ``swh_provenance_storage_rabbitmq_duration_seconds_bucket``
+* ``swh_provenance_storage_rabbitmq_duration_seconds_count``
+* ``swh_provenance_storage_rabbitmq_duration_seconds_error_count``
+* ``swh_provenance_storage_rabbitmq_duration_seconds_sum``
+
+`Index of Provenance dashboards
+`_
+
+.. _swh_statsd_metrics_replayers:
+
+Content and graph replayers
+---------------------------
+
+* ``swh_content_replayer_bytes``
+* ``swh_content_replayer_duration_seconds_bucket``
+* ``swh_content_replayer_duration_seconds_count``
+* ``swh_content_replayer_duration_seconds_error_count``
+* ``swh_content_replayer_duration_seconds_sum``
+* ``swh_content_replayer_operations_total``
+* ``swh_content_replayer_retries_total``
+* ``swh_graph_replayer_duration_seconds_bucket``
+* ``swh_graph_replayer_duration_seconds_count``
+* ``swh_graph_replayer_duration_seconds_sum``
+* ``swh_graph_replayer_operations_total``
+
+Dashboards:
+
+* `Cassandra `__
+* `S3 `__
+
+.. _swh_statsd_metrics_rpc:
+
+RPC servers
+-----------
+
+``indexer_storage``, ``objstorage``, ``storage``, and ``search``
+each report this set of metrics:
+
+* ``swh__request_duration_seconds_bucket``
+* ``swh__request_duration_seconds_count``
+* ``swh__request_duration_seconds_error_count``
+* ``swh__request_duration_seconds_sum``
+
+``indexer_storage`` and ``search`` also have:
+
+* ``swh__operations_total``
+
+.. _swh_statsd_metrics_scheduler:
+
+Scheduler
+---------
+
+* ``swh_scheduler_listener_handled_event_total``
+* ``swh_scheduler_origins_enabled``
+* ``swh_scheduler_origins_known``
+* ``swh_scheduler_origins_last_update``
+* ``swh_scheduler_origins_never_visited``
+* ``swh_scheduler_origins_with_pending_changes``
+* ``swh_scheduler_runner_scheduled_task_total``
+* ``swh_task_called_count``
+* ``swh_task_duration_seconds_bucket``
+* ``swh_task_duration_seconds_count``
+* ``swh_task_duration_seconds_error_count``
+* ``swh_task_duration_seconds_sum``
+* ``swh_task_end_ts``
+* ``swh_task_failure_count``
+* ``swh_task_start_ts``
+* ``swh_task_success_count``
+
+.. _swh_statsd_metrics_search:
+
+Search
+------
+
+See :ref:`swh_statsd_metrics_rpc`.
+
+.. _swh_statsd_metrics_scrubber:
+
+Scrubber
+--------
+
+Performance:
+
+* ``swh_scrubber_batch_duration_seconds_bucket``
+* ``swh_scrubber_batch_duration_seconds_count``
+* ``swh_scrubber_batch_duration_seconds_error_count``
+* ``swh_scrubber_batch_duration_seconds_sum``
+* ``swh_scrubber_objects_hashed_total``
+
+Corruptions found:
+
+* ``swh_scrubber_hash_mismatch_total``
+* ``swh_scrubber_missing_object_total``
+
+.. _swh_statsd_metrics_storage:
+
+Storage
+-------
+
+In addition to :ref:`swh_statsd_metrics_rpc`:
+
+* ``swh_storage_operations_bytes_total``, which reports the total number of content bytes
+  going through the RPC server
+
+.. _swh_statsd_metrics_webapp:
+
+Webapp
+------
+
+* ``swh_web_accepted_save_requests``
+* ``swh_web_save_requests_delay_seconds``
+* ``swh_web_submitted_save_requests``
+* ``swh_web_submitted_save_requests_from_webhooks``
+
+Dashboard: `Save Code Now
+`_
+
+.. _swh_statsd_metrics_misc:
+
+Other metrics
+-------------
+
+Performance of end-to-end tests:
+
+* ``swh_e2e_duration_seconds``
+* ``swh_e2e_status``