Metrics/monitoring (Tag)
Active, Public

Members

  • This project does not have any members.

Watchers

  • This project does not have any watchers.

Recent Activity

Wed, Feb 10

ardumont moved T2787: Improve access_logs parsing from in-progress to done on the System administration board.
Wed, Feb 10, 7:06 PM · System administration, Metrics/monitoring

Fri, Feb 5

vsellier closed T2787: Improve access_logs parsing as Resolved.

It seems there were some huge queries over the last few days, so the script had to be adapted to use Long instead of Integer:

apache_logs-2021.01.14:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "script_exception",
        "reason" : "runtime error",
        "script_stack" : [
          "java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:68)",
          "java.base/java.lang.Integer.parseInt(Integer.java:652)",
          "java.base/java.lang.Integer.parseInt(Integer.java:770)",
          "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ",
          "                                                                                                ^---- HERE"
        ],
        "script" : "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ctx._source.response = ctx._source.response instanceof java.lang.String ? Integer.parseInt(ctx._source.response) : ctx._source.response;",
        "lang" : "painless",
        "position" : {
          "offset" : 96,
          "start" : 0,
          "end" : 125
        }
      }
    ],
    "type" : "script_exception",
    "reason" : "runtime error",
    "script_stack" : [
      "java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:68)",
      "java.base/java.lang.Integer.parseInt(Integer.java:652)",
      "java.base/java.lang.Integer.parseInt(Integer.java:770)",
      "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ",
      "                                                                                                ^---- HERE"
    ],
    "script" : "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ctx._source.response = ctx._source.response instanceof java.lang.String ? Integer.parseInt(ctx._source.response) : ctx._source.response;",
    "lang" : "painless",
    "position" : {
      "offset" : 96,
      "start" : 0,
      "end" : 125
    },
    "caused_by" : {
      "type" : "number_format_exception",
      "reason" : "For input string: \"4633815064\""
    }
  },
  "status" : 400
}
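
The failure above can be reproduced outside Elasticsearch: the reported value 4633815064 is larger than Integer.MAX_VALUE (2147483647), so Integer.parseInt rejects it while Long.parseLong accepts it. The sketch below is only an illustration in plain Java; the class and method names are hypothetical, and the updated Painless script itself is not shown in this thread.

```java
public class BytesFieldParsing {

    // Hypothetical helper mirroring the ternary in the failing script,
    // but converting string values to Long instead of Integer.
    static Object toLongIfString(Object value) {
        if (value instanceof String) {
            return Long.valueOf((String) value);
        }
        return value; // already numeric (or null): leave untouched
    }

    // Returns true when the decimal string fits in a 32-bit int.
    static boolean fitsInInt(String value) {
        try {
            Integer.parseInt(value);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String bytes = "4633815064"; // value taken from the error report above

        // The original script used Integer.parseInt, which overflows here.
        System.out.println("fits in int: " + fitsInInt(bytes));

        // Parsing as long handles the same input.
        System.out.println("as long: " + toLongIfString(bytes));
    }
}
```

Assuming the fix keeps the same shape as the original script, replacing Integer.parseInt with Long.parseLong on the `bytes` field would be enough; the `response` field holds an HTTP status code and is not at risk of overflow.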
Fri, Feb 5, 9:09 AM · System administration, Metrics/monitoring

Thu, Feb 4

vsellier added a comment to T2787: Improve access_logs parsing.

The open Apache indexes are currently being migrated with the P940 script.

Thu, Feb 4, 8:12 PM · System administration, Metrics/monitoring
vsellier added a comment to T2787: Improve access_logs parsing.

The log parsing is OK.
An Elasticsearch datasource was created in Grafana, so we can now build graphs from the logs stored in Elasticsearch.
A simple dashboard displaying some statistics based on the Apache logs was started[1]. Its design is not as simple as in Kibana and it has some limitations, but it still allows basic information to be centralized in Grafana.

Thu, Feb 4, 10:42 AM · System administration, Metrics/monitoring

Tue, Feb 2

vsellier added a revision to T2787: Improve access_logs parsing: D5000: deposit: add request duration on access logs.
Tue, Feb 2, 7:05 PM · System administration, Metrics/monitoring
vsellier added a comment to T2787: Improve access_logs parsing.

Configuration deployed for the webapp on all servers. The logs now include the request duration, which is parsed into the Elasticsearch entries.
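
As a rough illustration of what parsing a duration out of an access log involves, here is a minimal sketch in plain Java. Everything about the line layout is an assumption: it takes the Apache combined log format with one extra trailing integer for the duration, which may not match the format actually deployed, and the class name and sample line are hypothetical. The real parsing is done by logstash, not by code like this.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccessLogDuration {

    // Assumed layout: combined log format followed by a trailing integer duration.
    // Captured groups: 1 = status, 2 = bytes (or "-"), 3 = duration.
    private static final Pattern LINE = Pattern.compile(
        "^\\S+ \\S+ \\S+ \\[[^\\]]+\\] \"[^\"]*\" (\\d{3}) (\\d+|-) \"[^\"]*\" \"[^\"]*\" (\\d+)$");

    // Returns {status, bytes, duration}; all parsed as long so large sizes do not overflow.
    static long[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("unrecognized log line: " + line);
        }
        long status = Long.parseLong(m.group(1));
        long bytes = "-".equals(m.group(2)) ? 0L : Long.parseLong(m.group(2));
        long duration = Long.parseLong(m.group(3));
        return new long[] {status, bytes, duration};
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from the real logs.
        String sample = "127.0.0.1 - - [02/Feb/2021:15:39:00 +0000] "
            + "\"GET /api/1/ HTTP/1.1\" 200 4633815064 \"-\" \"curl/7.64.0\" 1234";
        long[] fields = parse(sample);
        System.out.println("status=" + fields[0] + " bytes=" + fields[1] + " duration=" + fields[2]);
    }
}
```

The unit of the duration depends on the format directive chosen in the server configuration (for Apache, `%D` reports microseconds and `%T` seconds), so a consumer of these fields would need to check which one is in use.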

Tue, Feb 2, 3:39 PM · System administration, Metrics/monitoring
vsellier added a revision to T2787: Improve access_logs parsing: D4989: Add request durations in access logs and improve logstash's integer parsing.
Tue, Feb 2, 9:55 AM · System administration, Metrics/monitoring

Jan 29 2021

vsellier added a revision to T2787: Improve access_logs parsing: D4974: logstash: fix first puppet run and configuration updates.
Jan 29 2021, 5:05 PM · System administration, Metrics/monitoring
vsellier changed the status of T2787: Improve access_logs parsing from Open to Work in Progress.
Jan 29 2021, 2:34 PM · System administration, Metrics/monitoring
vsellier added a project to T2787: Improve access_logs parsing: System administration.
Jan 29 2021, 2:33 PM · System administration, Metrics/monitoring

Nov 17 2020

vsellier added a comment to T2733: Explore / install a varnish prometheus probe.

The Varnish logs should also be ingested into Elasticsearch to get fine-grained statistics.

Nov 17 2020, 2:42 PM · Metrics/monitoring, System administration
vsellier triaged T2787: Improve access_logs parsing as Normal priority.
Nov 17 2020, 12:36 PM · System administration, Metrics/monitoring
vsellier added a project to T2733: Explore / install a varnish prometheus probe: Metrics/monitoring.
Nov 17 2020, 11:54 AM · Metrics/monitoring, System administration

Nov 3 2020

ardumont moved T1490: Use origin url on external-id attribute on deposit admin page from Backlog to Archived on the SWORD deposit board.
Nov 3 2020, 4:07 PM · Metrics/monitoring, SWORD deposit

Oct 26 2020

douardda closed T1370: Report key code metrics in prometheus as Resolved.
Oct 26 2020, 12:30 PM · Metrics/monitoring, Restricted Project, Continuous Integration, System administration

Oct 16 2020

ardumont added a comment to T2087: Create script to test SWORD deposit on SWH.

This can be closed now.

Oct 16 2020, 11:57 AM · Metrics/monitoring, SWORD deposit

Sep 22 2020

olasd added a comment to T1461: Add loader-related metrics to swh-loader-core.

I think the second point mostly happened: the storage is returning statistics to the loader, but the loaders don't generally collect them.

Sep 22 2020, 6:13 PM · Core Loader, Metrics/monitoring
olasd updated the task description for T1461: Add loader-related metrics to swh-loader-core.
Sep 22 2020, 6:11 PM · Core Loader, Metrics/monitoring
olasd updated the task description for T1461: Add loader-related metrics to swh-loader-core.
Sep 22 2020, 6:10 PM · Core Loader, Metrics/monitoring
olasd placed T1461: Add loader-related metrics to swh-loader-core up for grabs.
Sep 22 2020, 6:10 PM · Core Loader, Metrics/monitoring
olasd closed T1435: Improve swh-scheduler prometheus metrics, a subtask of T1408: More/better Metrics, as Resolved.
Sep 22 2020, 6:09 PM · Metrics/monitoring, Sprint 2018 12
olasd closed T1435: Improve swh-scheduler prometheus metrics as Resolved.
Sep 22 2020, 6:09 PM · Metrics/monitoring, Sprint 2018 12
olasd closed T1438: Add labels to prometheus metrics to help queries as Resolved.

We've definitely improved on this (notably using proper hostnames for the instance label on prom metrics). I think we should make this task more actionable if we want to keep it open.

Sep 22 2020, 6:08 PM · Metrics/monitoring, Sprint 2018 12
olasd closed T1438: Add labels to prometheus metrics to help queries, a subtask of T1408: More/better Metrics, as Resolved.
Sep 22 2020, 6:08 PM · Metrics/monitoring, Sprint 2018 12

Apr 21 2020

olasd closed T1270: Investigate an application monitoring tool to automate error detection in our workers as Resolved.

I'm pretty sure this is done now ;p

Apr 21 2020, 11:36 AM · Metrics/monitoring, Development environment

Feb 15 2020

vlorentz moved T2175: Deploy swh-icinga-plugins from Backlog to deployed on the Sprint 2019/12 (Monitor and Conquer) board.
Feb 15 2020, 8:18 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
vlorentz moved T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services from Backlog to deployed on the Sprint 2019/12 (Monitor and Conquer) board.
Feb 15 2020, 8:18 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration

Jan 27 2020

vlorentz added a comment to T1365: Archive coverage metrics in prometheus.

https://grafana.softwareheritage.org/d/3SAW_JEmk/software-heritage-archive-counters

Jan 27 2020, 4:44 PM · Metrics/monitoring, Restricted Project
vlorentz closed T1365: Archive coverage metrics in prometheus, a subtask of T1364: Have production metrics in prometheus or kibana, as Resolved.
Jan 27 2020, 4:44 PM · Metrics/monitoring, Restricted Project
vlorentz closed T1365: Archive coverage metrics in prometheus as Resolved.
Jan 27 2020, 4:44 PM · Metrics/monitoring, Restricted Project

Jan 23 2020

ardumont closed T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services as Resolved.

Deployed.

Jan 23 2020, 12:09 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
ardumont added a parent task for T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services: T2238: Configure Sentry environments.
Jan 23 2020, 11:13 AM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration

Jan 22 2020

ardumont added a revision to T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services: D2576: sentry: Define setup for swh services (servers, workers, ...).
Jan 22 2020, 6:50 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
vlorentz added a project to T2228: Metrics and monitoring: Metrics/monitoring.
Jan 22 2020, 4:27 PM · Metrics/monitoring, Restricted Project
ardumont claimed T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services.

Adapting the puppet manifest so we can discriminate issues per environment in sentry.

Jan 22 2020, 4:13 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
ardumont closed T2175: Deploy swh-icinga-plugins, a subtask of T1011: Enable continuous monitoring of deposit, as Resolved.
Jan 22 2020, 3:29 PM · Metrics/monitoring, SWORD deposit
ardumont closed T2175: Deploy swh-icinga-plugins as Resolved.
Jan 22 2020, 3:29 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
ardumont added a comment to T2175: Deploy swh-icinga-plugins.

Vault check deployed!

Jan 22 2020, 3:28 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
ardumont added a comment to T2175: Deploy swh-icinga-plugins.

Deposit check deployed!

Jan 22 2020, 2:12 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
ardumont added a comment to T2175: Deploy swh-icinga-plugins.

debian package this

Jan 22 2020, 2:12 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
vlorentz updated the task description for T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services.
Jan 22 2020, 2:11 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
vlorentz renamed T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services from Set SWH_MAIN_PACKAGE for all services to Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services.
Jan 22 2020, 2:10 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration

Jan 20 2020

ardumont added a comment to T2175: Deploy swh-icinga-plugins.

debian package this

Jan 20 2020, 12:04 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring

Jan 17 2020

ardumont claimed T2175: Deploy swh-icinga-plugins.

As far as I can tell so far:

  • debian package this
  • update puppet configuration to add the checks [1]
Jan 17 2020, 5:56 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring

Jan 15 2020

vlorentz renamed T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services from Set SWH_MAIN_PACKAGE for all SWH services to Set SWH_MAIN_PACKAGE for all services.
Jan 15 2020, 2:59 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
vlorentz triaged T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services as Normal priority.
Jan 15 2020, 2:59 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
vlorentz updated subscribers of T2180: Configure Jenkins to publish releases to Sentry.
Jan 15 2020, 2:58 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring
vlorentz created T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services.
Jan 15 2020, 2:58 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
vlorentz updated the task description for T2180: Configure Jenkins to publish releases to Sentry.
Jan 15 2020, 2:56 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring
vlorentz triaged T2180: Configure Jenkins to publish releases to Sentry as Normal priority.
Jan 15 2020, 2:56 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring