The file /var/lib/puppet/state/agent_disabled.lock can be checked to detect whether puppet is disabled or not.
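For the record, a quick check from the shell (assuming the default agent state directory; the exact path can differ depending on the puppet version):
if [ -f /var/lib/puppet/state/agent_disabled.lock ]; then
    # the lock file also carries the optional message given to `puppet agent --disable "<reason>"`
    echo "puppet agent is disabled:"
    cat /var/lib/puppet/state/agent_disabled.lock
else
    echo "puppet agent is enabled"
fi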
Feb 9 2021
Feb 8 2021
A precision about these disk replacements: even though the disks are in error, there are still spares in the zfs pool:
- db1 :
root@db1:~# zpool status data
  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:16:01 with 0 errors on Sun Jan 10 00:40:03 2021
config:
Feb 5 2021
I started to throw some ideas into this document: https://hedgedoc.softwareheritage.org/Fi2pq7zkSw6aVAJwk9Xhqw
Nice, thanks for confirming this at the source.
It seems there were some huge queries in the last few days [1]; the script needed to be adapted to use Long instead of Integer :
apache_logs-2021.01.14:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "script_exception",
        "reason" : "runtime error",
        "script_stack" : [
          "java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:68)",
          "java.base/java.lang.Integer.parseInt(Integer.java:652)",
          "java.base/java.lang.Integer.parseInt(Integer.java:770)",
          "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ",
          "                                                                                 ^---- HERE"
        ],
        "script" : "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ctx._source.response = ctx._source.response instanceof java.lang.String ? Integer.parseInt(ctx._source.response) : ctx._source.response;",
        "lang" : "painless",
        "position" : { "offset" : 96, "start" : 0, "end" : 125 }
      }
    ],
    "type" : "script_exception",
    "reason" : "runtime error",
    "script_stack" : [
      "java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:68)",
      "java.base/java.lang.Integer.parseInt(Integer.java:652)",
      "java.base/java.lang.Integer.parseInt(Integer.java:770)",
      "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ",
      "                                                                                 ^---- HERE"
    ],
    "script" : "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ctx._source.response = ctx._source.response instanceof java.lang.String ? Integer.parseInt(ctx._source.response) : ctx._source.response;",
    "lang" : "painless",
    "position" : { "offset" : 96, "start" : 0, "end" : 125 },
    "caused_by" : {
      "type" : "number_format_exception",
      "reason" : "For input string: \"4633815064\""
    }
  },
  "status" : 400
}
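For reference, a sketch of what the adapted update-by-query call could look like (the real script is the one in P940; here only the parsing is switched from Integer to Long for the fields that can overflow):
❯ curl -H "Content-Type: application/json" -XPOST http://${ES_NODE}/apache_logs-2021.01.14/_update_by_query\?pretty -d '{
  "script": {
    "lang": "painless",
    "source": "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Long.parseLong(ctx._source.bytes) : ctx._source.bytes; ctx._source.response = ctx._source.response instanceof java.lang.String ? Long.parseLong(ctx._source.response) : ctx._source.response;"
  }
}'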
Feb 4 2021
The open apache indexes are currently being migrated with the script from P940.
\o/ thanks,
We should be able to properly format the common file now
The log parsing is ok.
An elasticsearch datasource was created in grafana, so we can now create some graphs based on the logs stored in elasticsearch.
A simple dashboard displaying some statistics based on the apache logs was initiated [1]. The design is not as simple as in kibana and there are some limitations, but it still allows basic information to be centralized in grafana.
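For the record, such a datasource can also be declared through the grafana HTTP API. A minimal sketch, assuming an apache_logs-* index pattern; the grafana url, credentials and esVersion value are placeholders, the actual configuration may differ:
# hypothetical values: grafana credentials/url, index pattern, esVersion
❯ curl -u admin:admin -H "Content-Type: application/json" -XPOST ${GRAFANA_URL}/api/datasources -d '{
  "name": "elasticsearch-apache-logs",
  "type": "elasticsearch",
  "access": "proxy",
  "url": "http://192.168.100.61:9200",
  "database": "apache_logs-*",
  "jsonData": { "timeField": "@timestamp", "esVersion": 70 }
}'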
The question is not an abstract one: there are implementations of HyperLogLog that are monotonic; maybe the Redis one already is, we just need to know.
So far so good, the SMART test is done and didn't find any errors :
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%                9  -
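For reference, this is the usual extended offline self-test; roughly (assuming the disk is /dev/sdb as in the smartctl output further down):
# start the extended (long) offline self-test
smartctl -t long /dev/sdb
# once it completes, display the self-test log
smartctl -l selftest /dev/sdb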
Feb 2 2021
- configure filebeat to check the right file based on the vhost
- add some additional fields to help the message filtering (a configuration sketch follows below)
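A rough sketch of what this could look like on the filebeat side; the paths, field names and inputs directory are assumptions (loading inputs from a directory requires filebeat.config.inputs to point there), the real configuration may differ:
# hypothetical example: one input per vhost, tagging every event with the vhost it comes from
cat > /etc/filebeat/inputs.d/apache-archive.yml <<'EOF'
- type: log
  paths:
    - /var/log/apache2/archive.softwareheritage.org_access.log
  fields:
    vhost: archive.softwareheritage.org
  fields_under_root: true
EOF
# sanity check of the resulting configuration
filebeat test config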
- partition recreated :
# sfdisk -d /dev/sda | sfdisk -f /dev/sdb
- zfs pool recreated with the wwn ids :
root@esnode1:/etc/zfs# zpool create -f elasticsearch-data -m /srv/elasticsearch/nodes -O atime=off -O relatime=on $(ls /dev/disk/by-id/wwn-*part4)
root@esnode1:/etc/zfs# zpool list
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
elasticsearch-data     7T   152K  7.00T        -         -    0%   0%  1.00x  ONLINE  -
- server restarted to check everything is ok
- allocation reactivated :
❯ export ES_NODE=192.168.100.61:9200
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : null
  }
}'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : { }
}
- and in progress :
❯ curl -s http://$ES_NODE/_cat/health\?v; echo; curl -s http://$ES_NODE/_cat/allocation\?v\&s=node
epoch      timestamp cluster          status node.total node.data shards pri  relo init unassign pending_tasks max_task_wait_time active_shards_percent
1612285969 17:12:49  swh-logging-prod green           3         3   8974 4487    2    0        0             0                  -                100.0%
The disk is replaced :
# smartctl -a /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Configuration deployed for the webapp on all servers; the logs now include the request duration, which is parsed into the elasticsearch entries :
In D4989#125683, @ardumont wrote:
lgtm
Please also update the deposit.pp which can benefit from this as well ;)
Remove wrong float conversion on grok pattern
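For reference, regarding the duration parsing and the grok pattern above, a minimal sketch of the kind of expression involved, written here as an elasticsearch ingest pipeline; the pipeline and field names are illustrative (the real pattern lives in the deployed log shipping configuration), and it assumes the duration comes from apache's %D, i.e. microseconds:
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_ingest/pipeline/apache-duration\?pretty -d '{
  "description" : "illustrative: parse a combined apache log line followed by the request duration",
  "processors" : [
    {
      "grok" : {
        "field" : "message",
        "patterns" : [ "%{COMBINEDAPACHELOG} %{NUMBER:duration:int}" ]
      }
    }
  ]
}'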
Feb 1 2021
esnode1 is ready to be stopped :
❯ curl -s http://$ES_NODE/_cat/allocation\?v\&s=node
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  1482                                                                                         UNASSIGNED
     0           0b     1.7tb        5tb      6.7tb           25 192.168.100.61 192.168.100.61 esnode1
  3767        3.7tb     3.7tb        3tb      6.7tb           55 192.168.100.62 192.168.100.62 esnode2
  3713        3.6tb     3.6tb      3.1tb      6.7tb           54 192.168.100.63 192.168.100.63 esnode3
It will be left in the cluster until the work starts, to keep 3 voting nodes in case of a problem on the other nodes in the interval.
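Quick way to double-check the node roles before the intervention (all three esnodes are master-eligible, hence the 3 voting nodes):
❯ curl -s http://$ES_NODE/_cat/nodes\?v\&h=name,node.role,master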
esnode1 unallocation started :
❯ export ES_NODE=192.168.100.61:9200
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "192.168.100.61"
  }
}'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_ip" : "192.168.100.61"
          }
        }
      }
    }
  }
}
The backfill is done.
These are the results for the counts of the directories and revisions (the count on contents is still running, so some fresh statistics are still to come) :
Jan 29 2021
The journal_client has almost finished ingesting the topics [1] it listens to. It took some more time because a backfill of origin_visit_status was launched for T2993.
It should be done by the end of the day.
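For the record, the remaining lag per topic can be followed with the standard kafka tooling; a sketch (the broker address and consumer group name below are placeholders):
kafka-consumer-groups.sh --bootstrap-server kafka1:9092 \
  --describe --group swh-counters-journal-client
# the LAG column should converge to 0 once the topics are fully ingested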
- Inventory updated to ensure all the components are associated with the staging environment
- Staging page on the intranet updated [1]
- Staging section on the network page [2] on the intranet updated
I'm not sure I understand: the hyperloglog function is precisely used to deduplicate the messages based on their keys (at least in the poc).
For information, the poc was launched on the content topic of production. The results seem acceptable, with a slightly higher count on the redis counter, probably due to some messages sent to kafka but not persisted in the database.
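To illustrate the deduplication on the redis side: adding the same key twice does not change the HyperLogLog estimate (the key and member names below are made up):
❯ redis-cli PFADD counters:content sha1:aaaa sha1:bbbb
(integer) 1
❯ redis-cli PFADD counters:content sha1:aaaa
(integer) 0
❯ redis-cli PFCOUNT counters:content
(integer) 2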
Jan 28 2021
Ticket opened via the Dell support.
The disk should be delivered on Monday, 1st February 2021; the DSI is informed.
The fix is deployed on webapp1 and solved the problem.
The storage version v0.21.1 is deployed in staging and the problem looks fixed :
❯ curl -s https://webapp.staging.swh.network/api/1/origin/https://gitlab.com/miwc/miwc.github.io.git/visit/latest/\?require_snapshot\=true | jq ''
{
  "origin": "https://gitlab.com/miwc/miwc.github.io.git",
  "date": "2020-12-07T18:21:58.967952+00:00",
  "type": "git",
  "visit": 1,
  "status": "full",
  "snapshot": "759b36e0e3e81e8cbf601181829571daa645b5d2",
  "metadata": {},
  "origin_url": "https://webapp.staging.swh.network/api/1/origin/https://gitlab.com/miwc/miwc.github.io.git/get/",
  "snapshot_url": "https://webapp.staging.swh.network/api/1/snapshot/759b36e0e3e81e8cbf601181829571daa645b5d2/"
}
Jan 27 2021
This is an attempt to generate a global diagram of the staging environment (P929):
It seems to be ok :)