Feb 9 2021

vsellier added a revision to T2566: Add an icinga check on whether the puppet agent is enabled: D5043: icinga: monitor puppet agent activation.
Feb 9 2021, 9:37 AM · System administration
vsellier requested review of D5043: icinga: monitor puppet agent activation.
Feb 9 2021, 9:37 AM
vsellier claimed T2566: Add an icinga check on whether the puppet agent is enabled.
Feb 9 2021, 9:26 AM · System administration

Feb 8 2021

vsellier added a comment to T2566: Add an icinga check on whether the puppet agent is enabled.

The file /var/lib/puppet/state/agent_disabled.lock can be checked to detect whether puppet is disabled or not.
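
A minimal sketch of such a check, assuming the usual Nagios/Icinga plugin convention of exit code 0 for OK and 2 for CRITICAL. The function name and the choice of CRITICAL as the severity are illustrative assumptions; this is not the implementation proposed in D5043.

```python
import os

# Lock file path quoted in the comment above.
LOCK_FILE = "/var/lib/puppet/state/agent_disabled.lock"

# Conventional Nagios/Icinga plugin exit codes.
OK = 0
CRITICAL = 2


def check_puppet_agent(lock_file=LOCK_FILE):
    """Return (exit_code, message) depending on whether the lock file exists.

    Reporting CRITICAL (rather than WARNING) is an assumption of this sketch.
    """
    if os.path.exists(lock_file):
        return CRITICAL, f"CRITICAL - puppet agent is disabled ({lock_file} exists)"
    return OK, "OK - puppet agent is enabled"
```

To run it as a plugin, print the message and pass the returned code to `sys.exit()`.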

Feb 8 2021, 3:15 PM · System administration
vsellier changed the status of T2566: Add an icinga check on whether the puppet agent is enabled from Open to Work in Progress.
Feb 8 2021, 2:20 PM · System administration
vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

A clarification about these disk replacements: even though the disks are in error, there are still spares in the zfs pool:

  • db1 :
root@db1:~# zpool status data
  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:16:01 with 0 errors on Sun Jan 10 00:40:03 2021
config:
Feb 8 2021, 12:57 PM · System administration
vsellier moved T3009: Manage backfiller configuration in puppet from Backlog to Weekly backlog on the System administration board.
Feb 8 2021, 12:51 PM · System administration
vsellier moved T3015: Sentry should have two different projects for swh-indexer and swh-indexer-storage from Backlog to Weekly backlog on the System administration board.
Feb 8 2021, 12:50 PM · System administration, Sentry
vsellier moved T2566: Add an icinga check on whether the puppet agent is enabled from Backlog to Weekly backlog on the System administration board.
Feb 8 2021, 12:50 PM · System administration
vsellier moved T2960: Add disk health monitoring from Backlog to Weekly backlog on the System administration board.
Feb 8 2021, 12:50 PM · System administration
vsellier changed the status of T3033: Replace first disk on storage1.staging, a subtask of T2939: Replace out of order disks on db1.staging and storage1.staging, from Open to Work in Progress.
Feb 8 2021, 12:50 PM · System administration
vsellier changed the status of T3033: Replace first disk on storage1.staging from Open to Work in Progress.
Feb 8 2021, 12:50 PM · System administration
vsellier triaged T3033: Replace first disk on storage1.staging as Normal priority.
Feb 8 2021, 12:49 PM · System administration

Feb 5 2021

vsellier added a comment to T2231: Continuous deployment.

I started to throw some ideas into this document: https://hedgedoc.softwareheritage.org/Fi2pq7zkSw6aVAJwk9Xhqw

Feb 5 2021, 5:48 PM · meta-task, Roadmap 2022, Staging environment, Roadmap 2020
vsellier updated the task description for T3030: Improve loaders to deal with new visit status events.
Feb 5 2021, 2:30 PM · Core Loader
vsellier updated the summary of D5024: package: Mark visit status as failed when relevant.
Feb 5 2021, 2:29 PM
vsellier added a comment to T2912: Next generation archive counters.

Nice, thanks for confirming this at the source.

Feb 5 2021, 10:03 AM · Roadmap 2021, System administration, Monitoring, Web app
vsellier committed rDSNIP931603f05078: add a script to convert es indexes fields (authored by vsellier).
add a script to convert es indexes fields
Feb 5 2021, 9:21 AM
vsellier committed rDSNIP7718a0ca86ea: Support unreadable messages (authored by vsellier).
Support unreadable messages
Feb 5 2021, 9:21 AM
vsellier edited P940 Elasticsearch field conversion from string to integer.
Feb 5 2021, 9:10 AM · System administration
vsellier closed T2787: Improve access_logs parsing as Resolved.

It seems there were some huge queries over the last few days [1]; the script needed to be adapted to use Long instead of Integer:

apache_logs-2021.01.14:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "script_exception",
        "reason" : "runtime error",
        "script_stack" : [
          "java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:68)",
          "java.base/java.lang.Integer.parseInt(Integer.java:652)",
          "java.base/java.lang.Integer.parseInt(Integer.java:770)",
          "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ",
          "                                                                                                ^---- HERE"
        ],
        "script" : "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ctx._source.response = ctx._source.response instanceof java.lang.String ? Integer.parseInt(ctx._source.response) : ctx._source.response;",
        "lang" : "painless",
        "position" : {
          "offset" : 96,
          "start" : 0,
          "end" : 125
        }
      }
    ],
    "type" : "script_exception",
    "reason" : "runtime error",
    "script_stack" : [
      "java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:68)",
      "java.base/java.lang.Integer.parseInt(Integer.java:652)",
      "java.base/java.lang.Integer.parseInt(Integer.java:770)",
      "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ",
      "                                                                                                ^---- HERE"
    ],
    "script" : "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ctx._source.response = ctx._source.response instanceof java.lang.String ? Integer.parseInt(ctx._source.response) : ctx._source.response;",
    "lang" : "painless",
    "position" : {
      "offset" : 96,
      "start" : 0,
      "end" : 125
    },
    "caused_by" : {
      "type" : "number_format_exception",
      "reason" : "For input string: \"4633815064\""
    }
  },
  "status" : 400
}
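
The failure above comes down to a range check: the value 4633815064 is larger than the biggest 32-bit signed integer, so Integer.parseInt rejects it while a 64-bit Long can hold it. The sketch below only illustrates that arithmetic in Python; the helper names are hypothetical and it is not the P940 script.

```python
# Bounds of Java's 32-bit Integer and 64-bit Long types.
INT_MIN, INT_MAX = -(2**31), 2**31 - 1
LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1


def fits_java_int(text):
    """True if the decimal string would be accepted by a 32-bit signed parse."""
    return INT_MIN <= int(text) <= INT_MAX


def fits_java_long(text):
    """True if the decimal string would be accepted by a 64-bit signed parse."""
    return LONG_MIN <= int(text) <= LONG_MAX


value = "4633815064"  # the input string reported in the error above
print(fits_java_int(value), fits_java_long(value))
```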
Feb 5 2021, 9:09 AM · System administration, Metrics/monitoring

Feb 4 2021

vsellier added a comment to T2787: Improve access_logs parsing.

The open apache indexes are currently being migrated with the P940 script.

Feb 4 2021, 8:12 PM · System administration, Metrics/monitoring
vsellier created P940 Elasticsearch field conversion from string to integer.
Feb 4 2021, 7:46 PM · System administration
vsellier accepted D5007: Update decomissioning script with necessary instruction.
Feb 4 2021, 11:52 AM
vsellier accepted D5005: Decomission storage02.euwest.
Feb 4 2021, 11:48 AM
vsellier accepted D5006: hiera: Move cassandra configuration to its own yaml config file.

\o/ thanks,
We should be able to properly format the common file now

Feb 4 2021, 11:46 AM
vsellier committed rCDFPd15519eb8450: Add a current status (authored by vsellier).
Add a current status
Feb 4 2021, 11:27 AM
vsellier committed rCDFP5e116e91517d: WIP - POC kubernetes (authored by vsellier).
WIP - POC kubernetes
Feb 4 2021, 11:23 AM
vsellier added a comment to T2787: Improve access_logs parsing.

The log parsing is ok.
An elasticsearch datasource was created in grafana, so we can now create graphs based on the logs stored in elasticsearch.
A simple dashboard displaying some statistics based on the apache logs was started [1]. The design is not as simple as in kibana and has some limitations, but it still allows basic information to be centralized in grafana.

Feb 4 2021, 10:42 AM · System administration, Metrics/monitoring
vsellier added a comment to T2912: Next generation archive counters.

The question is not an abstract one: there are implementations of HyperLogLog that are monotonic, maybe the Redis one is already, we just need to know.

Feb 4 2021, 9:48 AM · Roadmap 2021, System administration, Monitoring, Web app
vsellier closed T2975: Disk replacement on esnode1 as Resolved.

So far so good: the SMART test is done and didn't find any errors:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         9         -
Feb 4 2021, 9:15 AM · System administration

Feb 2 2021

vsellier committed rSPSITEf59776a25cd5: logstash: clearly identify the applications (authored by vsellier).
logstash: clearly identify the applications
Feb 2 2021, 8:21 PM
vsellier committed rSPSITEb84027e16513: logstash: Allow to perform numerical operations on bytes (authored by vsellier).
logstash: Allow to perform numerical operations on bytes
Feb 2 2021, 8:21 PM
vsellier closed D5000: deposit: add request duration on access logs.
Feb 2 2021, 8:21 PM
vsellier committed rSPSITE3c7194e85922: deposit: add request duration on access logs (authored by vsellier).
deposit: add request duration on access logs
Feb 2 2021, 8:21 PM
vsellier updated the diff for D5000: deposit: add request duration on access logs.

configure filebeat to check the right file based on the vhost
add some additional fields to help the message filtering

Feb 2 2021, 8:20 PM
vsellier requested review of D5000: deposit: add request duration on access logs.
Feb 2 2021, 7:05 PM
vsellier added a revision to T2787: Improve access_logs parsing: D5000: deposit: add request duration on access logs.
Feb 2 2021, 7:05 PM · System administration, Metrics/monitoring
vsellier added a comment to T2975: Disk replacement on esnode1.
  • partition recreated :
# sfdisk -d /dev/sda | sfdisk -f /dev/sdb
  • zfs pool recreated with the wwn ids :
root@esnode1:/etc/zfs# zpool create -f elasticsearch-data -m /srv/elasticsearch/nodes -O atime=off -O relatime=on $(ls /dev/disk/by-id/wwn-*part4)
root@esnode1:/etc/zfs# zpool list
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
elasticsearch-data     7T   152K  7.00T        -         -     0%     0%  1.00x    ONLINE  -
  • server restarted to check everything is ok
  • allocation reactivated :
❯ export ES_NODE=192.168.100.61:9200 
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
    "transient" : {
        "cluster.routing.allocation.exclude._ip" : null
    }
}'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : { }
}
  • and in progress :
❯ curl -s http://$ES_NODE/_cat/health\?v; echo; curl -s http://$ES_NODE/_cat/allocation\?v\&s=node
epoch      timestamp cluster          status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1612285969 17:12:49  swh-logging-prod green           3         3   8974 4487    2    0        0             0                  -                100.0%
Feb 2 2021, 6:15 PM · System administration
vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.
Feb 2 2021, 6:14 PM · System administration
vsellier added a comment to T2975: Disk replacement on esnode1.

The disk is replaced :

# smartctl -a /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Feb 2 2021, 5:52 PM · System administration
vsellier added a comment to T2787: Improve access_logs parsing.

The configuration is deployed for the webapp on all servers; the logs now include the duration, which is parsed into the elasticsearch entries:

Feb 2 2021, 3:39 PM · System administration, Metrics/monitoring
vsellier committed rSPSITE249f747e9c35: apache: Add the request duration on access logs (authored by vsellier).
apache: Add the request duration on access logs
Feb 2 2021, 3:12 PM
vsellier committed rSPSITEcc35baf50c73: logstash: Add support an optional duration on apache logs (authored by vsellier).
logstash: Add support an optional duration on apache logs
Feb 2 2021, 3:12 PM
vsellier committed rSPSITE8e5ca3287738: webapp: improve access log parsing (authored by vsellier).
webapp: improve access log parsing
Feb 2 2021, 3:12 PM
vsellier closed D4989: Add request durations in access logs and improve logstash's integer parsing.
Feb 2 2021, 3:12 PM
vsellier committed rSPSITE908a635fff3d: webapp: code format (authored by vsellier).
webapp: code format
Feb 2 2021, 3:12 PM
vsellier closed D4974: logstash: fix first puppet run and configuration updates.
Feb 2 2021, 3:12 PM
vsellier committed rSPSITE2cf48d29a464: logstash: fix first puppet run and configuration updates (authored by vsellier).
logstash: fix first puppet run and configuration updates
Feb 2 2021, 3:12 PM
vsellier added a comment to D4989: Add request durations in access logs and improve logstash's integer parsing.

lgtm

Please also update the deposit.pp which can benefit from this as well ;)

Feb 2 2021, 10:19 AM
vsellier updated the diff for D4989: Add request durations in access logs and improve logstash's integer parsing.

Remove wrong float conversion on grok pattern

Feb 2 2021, 10:15 AM
vsellier requested review of D4989: Add request durations in access logs and improve logstash's integer parsing.
Feb 2 2021, 9:55 AM
vsellier added a revision to T2787: Improve access_logs parsing: D4989: Add request durations in access logs and improve logstash's integer parsing.
Feb 2 2021, 9:55 AM · System administration, Metrics/monitoring

Feb 1 2021

vsellier added a comment to T2975: Disk replacement on esnode1.

esnode1 is ready to be stopped :

❯ curl -s http://$ES_NODE/_cat/allocation\?v\&s=node
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  1482                                                                                         UNASSIGNED
     0           0b     1.7tb        5tb      6.7tb           25 192.168.100.61 192.168.100.61 esnode1
  3767        3.7tb     3.7tb        3tb      6.7tb           55 192.168.100.62 192.168.100.62 esnode2
  3713        3.6tb     3.6tb      3.1tb      6.7tb           54 192.168.100.63 192.168.100.63 esnode3

It will be left in the cluster until the work starts, to keep 3 voting nodes in case a problem occurs on the other nodes in the meantime.
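
The "ready to be stopped" condition can be read off the first column of the table above: the node holds 0 shards. A small sketch of that check follows, assuming the output was produced with the `?v` flag so that the first row is a header. The function names are hypothetical.

```python
def shards_per_node(cat_allocation_output):
    """Map each node name (last column) to its shard count (first column)."""
    counts = {}
    lines = cat_allocation_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row printed by ?v
        fields = line.split()
        if len(fields) < 2:
            continue
        # The UNASSIGNED row only has two fields: the count and the label.
        counts[fields[-1]] = int(fields[0])
    return counts


def is_drained(counts, node):
    """True when the node is listed and no longer holds any shard."""
    return counts.get(node) == 0


# Sample copied from the output above.
SAMPLE = """\
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  1482                                                                                         UNASSIGNED
     0           0b     1.7tb        5tb      6.7tb           25 192.168.100.61 192.168.100.61 esnode1
  3767        3.7tb     3.7tb        3tb      6.7tb           55 192.168.100.62 192.168.100.62 esnode2
  3713        3.6tb     3.6tb      3.1tb      6.7tb           54 192.168.100.63 192.168.100.63 esnode3
"""

print(is_drained(shards_per_node(SAMPLE), "esnode1"))
```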

Feb 1 2021, 6:10 PM · System administration
vsellier closed D4987: cgit: remove the repository urls's trailing /.
Feb 1 2021, 5:50 PM
vsellier committed rDLS8e4dd178f1df: cgit: remove the repository urls's trailing / (authored by vsellier).
cgit: remove the repository urls's trailing /
Feb 1 2021, 5:50 PM
vsellier requested review of D4987: cgit: remove the repository urls's trailing /.
Feb 1 2021, 5:37 PM
vsellier added a revision to T3013: Deploy remaining next-gen listers on staging: D4987: cgit: remove the repository urls's trailing /.
Feb 1 2021, 5:34 PM · System administration, Lister
vsellier committed rSENVc66bb65143a9: vagrant: increase staging.webapp memory (authored by vsellier).
vagrant: increase staging.webapp memory
Feb 1 2021, 3:45 PM
vsellier committed rSENV05d1a18442cc: vagrant: allow network communications between all vms (authored by vsellier).
vagrant: allow network communications between all vms
Feb 1 2021, 3:45 PM
vsellier updated the task description for T3009: Manage backfiller configuration in puppet.
Feb 1 2021, 12:10 PM · System administration
vsellier triaged T3009: Manage backfiller configuration in puppet as Normal priority.
Feb 1 2021, 12:09 PM · System administration
vsellier added a comment to T2975: Disk replacement on esnode1.

esnode1 unallocation started :

❯ export ES_NODE=192.168.100.61:9200
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{ 
    "transient" : {
        "cluster.routing.allocation.exclude._ip" : "192.168.100.61"
    }
}'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_ip" : "192.168.100.61"
          }
        }
      }
    }
  }
}
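
The same settings call can be prepared from Python with the standard library. This is only a sketch mirroring the curl command above: the function name is hypothetical, and passing `None` as the address produces the JSON `null` that clears the exclusion again.

```python
import json
import urllib.request


def allocation_exclude_request(es_node, ip):
    """Build the PUT request that excludes `ip` from shard allocation.

    `ip=None` is serialized as JSON null, which resets the transient setting.
    """
    body = {"transient": {"cluster.routing.allocation.exclude._ip": ip}}
    return urllib.request.Request(
        url=f"http://{es_node}/_cluster/settings?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Building the request does not contact the cluster; sending it would be
# done with urllib.request.urlopen(request).
request = allocation_exclude_request("192.168.100.61:9200", "192.168.100.61")
print(request.get_method(), request.full_url)
```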
Feb 1 2021, 11:40 AM · System administration
vsellier closed T2944: Deploy swh-search v0.4.1, a subtask of T2936: Update the swh-search journal client to only set "has_visit" on "full" status of the visit, as Resolved.
Feb 1 2021, 10:06 AM · Journal, Archive search
vsellier closed T2944: Deploy swh-search v0.4.1 as Resolved.

The backfill is done.

Feb 1 2021, 10:06 AM · System administration, Journal, Archive search
vsellier added a comment to T2912: Next generation archive counters.

These are the results for the directory and revision counts (the content count is still running, so there are some fresh statistics):

Feb 1 2021, 10:02 AM · Roadmap 2021, System administration, Monitoring, Web app

Jan 29 2021

vsellier requested review of D4974: logstash: fix first puppet run and configuration updates.
Jan 29 2021, 5:06 PM
vsellier added a revision to T2787: Improve access_logs parsing: D4974: logstash: fix first puppet run and configuration updates.
Jan 29 2021, 5:05 PM · System administration, Metrics/monitoring
vsellier committed rSENV8f095c62cb58: vagrant: declare logstash node (authored by vsellier).
vagrant: declare logstash node
Jan 29 2021, 4:38 PM
vsellier added a comment to T2944: Deploy swh-search v0.4.1.

The journal_client has almost finished ingesting the topics [1] it listens to. It took more time than expected because a backfill of origin_visit_status was launched for T2993.
It should be done by the end of the day.

Jan 29 2021, 2:44 PM · System administration, Journal, Archive search
vsellier moved T2939: Replace out of order disks on db1.staging and storage1.staging from Weekly backlog to Backlog on the System administration board.
Jan 29 2021, 2:34 PM · System administration
vsellier changed the status of T2787: Improve access_logs parsing from Open to Work in Progress.
Jan 29 2021, 2:34 PM · System administration, Metrics/monitoring
vsellier added a project to T2787: Improve access_logs parsing: System administration.
Jan 29 2021, 2:33 PM · System administration, Metrics/monitoring
vsellier moved T2958: Use all the disks on esnode2 and esnode3 from deployed/landed/monitoring to done on the System administration board.
Jan 29 2021, 12:21 PM · System administration
vsellier moved T2903: Test different disk configuration on esnode1 from deployed/landed/monitoring to done on the System administration board.
Jan 29 2021, 12:21 PM · System administration
vsellier moved T2905: Deploy swh-search for production from deployed/landed/monitoring to done on the System administration board.
Jan 29 2021, 12:21 PM · System administration, Journal, Archive search
vsellier moved T2920: Document staging infrastructure from in-progress to done on the System administration board.
Jan 29 2021, 12:21 PM · Documentation, System administration, Staging environment
vsellier closed T2920: Document staging infrastructure as Resolved.
  • Inventory updated to ensure all the components are associated to the staging environment
  • Staging page on the intranet updated [1]
  • Staging section on the network page [2] on the intranet updated
Jan 29 2021, 12:20 PM · Documentation, System administration, Staging environment
vsellier added a comment to T2912: Next generation archive counters.

I'm not sure I understand: the hyperloglog function is used precisely to deduplicate the messages based on their keys (at least in the poc).
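
To illustrate the deduplication property, here is a toy HyperLogLog in Python. It is a simplified sketch written for this explanation, not the Redis implementation (which is reached through PFADD and PFCOUNT) and not the poc code; the class name, the SHA-1 hash and the register count are all assumptions.

```python
import hashlib
import math


class ToyHyperLogLog:
    """Simplified HyperLogLog: 2**p registers, each keeping a maximum rank."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, key):
        # Use the first 64 bits of a SHA-1 digest as the hash value.
        h = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
        index = h >> (64 - self.p)               # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        # Only the maximum is kept, so re-adding a key changes nothing.
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


hll = ToyHyperLogLog()
keys = [f"key-{i}".encode() for i in range(1000)]
for k in keys:
    hll.add(k)
first = hll.count()
for k in keys:  # replay the same messages
    hll.add(k)
print(first == hll.count())
```

Because each register only stores a maximum, replaying a message with an already seen key leaves every register untouched, which is why duplicates do not inflate the estimate.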

Jan 29 2021, 12:10 PM · Roadmap 2021, System administration, Monitoring, Web app
vsellier added a comment to T2912: Next generation archive counters.

For information, the poc was launched on the production content topic. The results seem acceptable, with a slightly higher count on the redis counter, probably due to some messages sent to kafka but not persisted in the database.

Jan 29 2021, 11:12 AM · Roadmap 2021, System administration, Monitoring, Web app

Jan 28 2021

vsellier added a comment to T2975: Disk replacement on esnode1.

Ticket opened via Dell support.
The disk should be delivered on Monday, February 1st 2021; the DSI has been informed.

Jan 28 2021, 5:55 PM · System administration
vsellier changed the status of T2975: Disk replacement on esnode1 from Open to Work in Progress.
Jan 28 2021, 3:44 PM · System administration
vsellier closed T3001: Webapp is not displaying the origin type on the search results as Resolved.

The fix is deployed on webapp1 and solved the problem.

Jan 28 2021, 3:33 PM · Storage manager, Web app
vsellier closed D4963: webapp1: use the same deployment pattern than moma.
Jan 28 2021, 3:19 PM
vsellier committed rSPSITEb82b0d93c2ec: webapp1: use the same deployment pattern than moma (authored by vsellier).
webapp1: use the same deployment pattern than moma
Jan 28 2021, 3:18 PM
vsellier requested review of D4963: webapp1: use the same deployment pattern than moma.
Jan 28 2021, 3:10 PM
vsellier added a comment to T3001: Webapp is not displaying the origin type on the search results.

Storage version v0.21.1 is deployed in staging; the problem looks fixed:

❯ curl -s  https://webapp.staging.swh.network/api/1/origin/https://gitlab.com/miwc/miwc.github.io.git/visit/latest/\?require_snapshot\=true | jq ''
{
  "origin": "https://gitlab.com/miwc/miwc.github.io.git",
  "date": "2020-12-07T18:21:58.967952+00:00",
  "type": "git",
  "visit": 1,
  "status": "full",
  "snapshot": "759b36e0e3e81e8cbf601181829571daa645b5d2",
  "metadata": {},
  "origin_url": "https://webapp.staging.swh.network/api/1/origin/https://gitlab.com/miwc/miwc.github.io.git/get/",
  "snapshot_url": "https://webapp.staging.swh.network/api/1/snapshot/759b36e0e3e81e8cbf601181829571daa645b5d2/"
}
Jan 28 2021, 2:36 PM · Storage manager, Web app
vsellier closed T2988: Improve cgit lister to add last modification date of the repos as Resolved.
Jan 28 2021, 2:10 PM · CGit lister, Lister
vsellier closed D4960: Correctly return origin_visit_status.type value everywhere.
Jan 28 2021, 2:01 PM
vsellier committed rDSTO76de53cb261f: Correctly return origin_visit_status.type value everywhere (authored by vsellier).
Correctly return origin_visit_status.type value everywhere
Jan 28 2021, 2:01 PM
vsellier requested review of D4960: Correctly return origin_visit_status.type value everywhere.
Jan 28 2021, 12:23 PM
vsellier added a revision to T3001: Webapp is not displaying the origin type on the search results: D4960: Correctly return origin_visit_status.type value everywhere.
Jan 28 2021, 12:12 PM · Storage manager, Web app
vsellier added projects to T3001: Webapp is not displaying the origin type on the search results: Web app, Storage manager.
Jan 28 2021, 12:11 PM · Storage manager, Web app
vsellier changed the status of T3001: Webapp is not displaying the origin type on the search results from Open to Work in Progress.
Jan 28 2021, 12:11 PM · Storage manager, Web app
vsellier created P930 (An Untitled Masterwork).
Jan 28 2021, 10:30 AM

Jan 27 2021

vsellier added a comment to T2920: Document staging infrastructure.

This is an attempt to generate a global diagram of the staging environment (P929):

Jan 27 2021, 6:09 PM · Documentation, System administration, Staging environment
vsellier created P929 Staging infrastructure.
Jan 27 2021, 6:07 PM
vsellier accepted D4956: launchpad: Actually mock the anonymous login to launchpad.

It seems to be ok :)

Jan 27 2021, 4:32 PM
vsellier committed rDSNIP0fe3238bdabf: counters: batch redis calls (authored by vsellier).
counters: batch redis calls
Jan 27 2021, 3:38 PM
vsellier committed rDSNIPe1076146c645: counters: add local counter to follow the message count (authored by vsellier).
counters: add local counter to follow the message count
Jan 27 2021, 3:38 PM