The file /var/lib/puppet/state/agent_disabled.lock can be checked to detect whether puppet is disabled or not.
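For the record, a quick check from the shell (assuming the default agent state directory; the exact path can differ depending on the puppet version):
if [ -f /var/lib/puppet/state/agent_disabled.lock ]; then
    # the lock file also carries the optional message given to `puppet agent --disable "<reason>"`
    echo "puppet agent is disabled:"
    cat /var/lib/puppet/state/agent_disabled.lock
else
    echo "puppet agent is enabled"
fi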
Feb 9 2021
Feb 8 2021
A precision about these disk replacements: even though the disks are in error, there are still spares in the zfs pool:
- db1 :
root@db1:~# zpool status data
  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:16:01 with 0 errors on Sun Jan 10 00:40:03 2021
config:
Feb 5 2021
I started to throw some ideas into this document: https://hedgedoc.softwareheritage.org/Fi2pq7zkSw6aVAJwk9Xhqw
Nice, thanks for confirming this at the source.
It seems there were some huge queries in the last few days [1]; the script needed to be adapted to use Long instead of Integer :
apache_logs-2021.01.14:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "script_exception",
        "reason" : "runtime error",
        "script_stack" : [
          "java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:68)",
          "java.base/java.lang.Integer.parseInt(Integer.java:652)",
          "java.base/java.lang.Integer.parseInt(Integer.java:770)",
          "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ",
          "                                                                                 ^---- HERE"
        ],
        "script" : "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ctx._source.response = ctx._source.response instanceof java.lang.String ? Integer.parseInt(ctx._source.response) : ctx._source.response;",
        "lang" : "painless",
        "position" : { "offset" : 96, "start" : 0, "end" : 125 }
      }
    ],
    "type" : "script_exception",
    "reason" : "runtime error",
    "script_stack" : [
      "java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:68)",
      "java.base/java.lang.Integer.parseInt(Integer.java:652)",
      "java.base/java.lang.Integer.parseInt(Integer.java:770)",
      "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ",
      "                                                                                 ^---- HERE"
    ],
    "script" : "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ctx._source.response = ctx._source.response instanceof java.lang.String ? Integer.parseInt(ctx._source.response) : ctx._source.response;",
    "lang" : "painless",
    "position" : { "offset" : 96, "start" : 0, "end" : 125 },
    "caused_by" : {
      "type" : "number_format_exception",
      "reason" : "For input string: \"4633815064\""
    }
  },
  "status" : 400
}
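For reference, a sketch of what the adapted update-by-query call could look like (the real script is the one in P940; here only the parsing is switched from Integer to Long for the fields that can overflow):
❯ curl -H "Content-Type: application/json" -XPOST http://${ES_NODE}/apache_logs-2021.01.14/_update_by_query\?pretty -d '{
  "script": {
    "lang": "painless",
    "source": "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Long.parseLong(ctx._source.bytes) : ctx._source.bytes; ctx._source.response = ctx._source.response instanceof java.lang.String ? Long.parseLong(ctx._source.response) : ctx._source.response;"
  }
}'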
Feb 4 2021
The open apache indexes are currently being migrated with the script from P940.
\o/ thanks,
We should be able to properly format the common file now
The log parsing is ok.
An elasticsearch datasource was created in grafana, so we can now create some graphs based on the logs stored in elasticsearch.
A simple dashboard displaying some statistics based on the apache logs was initiated [1]. The design is not as simple as in kibana and there are some limitations, but it still allows basic information to be centralized in grafana.
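For the record, such a datasource can also be declared through the grafana HTTP API. A minimal sketch, assuming an apache_logs-* index pattern; the grafana url, credentials and esVersion value are placeholders, the actual configuration may differ:
# hypothetical values: grafana credentials/url, index pattern, esVersion
❯ curl -u admin:admin -H "Content-Type: application/json" -XPOST ${GRAFANA_URL}/api/datasources -d '{
  "name": "elasticsearch-apache-logs",
  "type": "elasticsearch",
  "access": "proxy",
  "url": "http://192.168.100.61:9200",
  "database": "apache_logs-*",
  "jsonData": { "timeField": "@timestamp", "esVersion": 70 }
}'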
The question is not an abstract one: there are implementations of HyperLogLog that are monotonic; maybe the Redis one already is, we just need to know.
So far so good, the SMART test is done and didn't find any errors :
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%                9  -
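For reference, this is the usual extended offline self-test; roughly (assuming the disk is /dev/sdb as in the smartctl output further down):
# start the extended (long) offline self-test
smartctl -t long /dev/sdb
# once it completes, display the self-test log
smartctl -l selftest /dev/sdb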
Feb 2 2021
- configure filebeat to check the right file based on the vhost
- add some additional fields to help the message filtering (a configuration sketch follows below)
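A rough sketch of what this could look like on the filebeat side; the paths, field names and inputs directory are assumptions (loading inputs from a directory requires filebeat.config.inputs to point there), the real configuration may differ:
# hypothetical example: one input per vhost, tagging every event with the vhost it comes from
cat > /etc/filebeat/inputs.d/apache-archive.yml <<'EOF'
- type: log
  paths:
    - /var/log/apache2/archive.softwareheritage.org_access.log
  fields:
    vhost: archive.softwareheritage.org
  fields_under_root: true
EOF
# sanity check of the resulting configuration
filebeat test config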
- partition recreated :
# sfdisk -d /dev/sda | sfdisk -f /dev/sdb
- zfs pool recreated with the wwn ids :
root@esnode1:/etc/zfs# zpool create -f elasticsearch-data -m /srv/elasticsearch/nodes -O atime=off -O relatime=on $(ls /dev/disk/by-id/wwn-*part4)
root@esnode1:/etc/zfs# zpool list
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
elasticsearch-data     7T   152K  7.00T        -         -    0%   0%  1.00x  ONLINE  -
- server restarted to check everything is ok
- allocation reactivated :
❯ export ES_NODE=192.168.100.61:9200
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : null
  }
}'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : { }
}
- and in progress :
❯ curl -s http://$ES_NODE/_cat/health\?v; echo; curl -s http://$ES_NODE/_cat/allocation\?v\&s=node
epoch      timestamp cluster          status node.total node.data shards pri  relo init unassign pending_tasks max_task_wait_time active_shards_percent
1612285969 17:12:49  swh-logging-prod green           3         3   8974 4487    2    0        0             0                  -                100.0%
The disk is replaced :
# smartctl -a /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Configuration deployed for the webapp on all servers; the logs now include the request duration, which is parsed into the elasticsearch entries :
In D4989#125683, @ardumont wrote:
lgtm
Please also update the deposit.pp which can benefit from this as well ;)
Remove wrong float conversion on grok pattern
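For reference, regarding the duration parsing and the grok pattern above, a minimal sketch of the kind of expression involved, written here as an elasticsearch ingest pipeline; the pipeline and field names are illustrative (the real pattern lives in the deployed log shipping configuration), and it assumes the duration comes from apache's %D, i.e. microseconds:
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_ingest/pipeline/apache-duration\?pretty -d '{
  "description" : "illustrative: parse a combined apache log line followed by the request duration",
  "processors" : [
    {
      "grok" : {
        "field" : "message",
        "patterns" : [ "%{COMBINEDAPACHELOG} %{NUMBER:duration:int}" ]
      }
    }
  ]
}'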
Feb 1 2021
esnode1 is ready to be stopped :
❯ curl -s http://$ES_NODE/_cat/allocation\?v\&s=node
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  1482                                                                                         UNASSIGNED
     0           0b     1.7tb        5tb      6.7tb           25 192.168.100.61 192.168.100.61 esnode1
  3767        3.7tb     3.7tb        3tb      6.7tb           55 192.168.100.62 192.168.100.62 esnode2
  3713        3.6tb     3.6tb      3.1tb      6.7tb           54 192.168.100.63 192.168.100.63 esnode3
It will be left in the cluster until the work starts, to keep 3 voting nodes in case of a problem on the other nodes in the interval.
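Quick way to double-check the node roles before the intervention (all three esnodes are master-eligible, hence the 3 voting nodes):
❯ curl -s http://$ES_NODE/_cat/nodes\?v\&h=name,node.role,master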
esnode1 unallocation started :
❯ export ES_NODE=192.168.100.61:9200
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "192.168.100.61"
  }
}'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_ip" : "192.168.100.61"
          }
        }
      }
    }
  }
}
The backfill is done.
These are the results for the counts of the directories and revisions (the count on contents is still running, so some fresh statistics are still to come) :
Jan 29 2021
The journal_client has almost finished ingesting the topics [1] it listens to. It took some more time because a backfill of origin_visit_status was launched for T2993.
It should be done by the end of the day.
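For the record, the remaining lag per topic can be followed with the standard kafka tooling; a sketch (the broker address and consumer group name below are placeholders):
kafka-consumer-groups.sh --bootstrap-server kafka1:9092 \
  --describe --group swh-counters-journal-client
# the LAG column should converge to 0 once the topics are fully ingested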
- Inventory updated to ensure all the components are associated with the staging environment
- Staging page on the intranet updated [1]
- Staging section on the network page [2] on the intranet updated
I'm not sure I understand: the hyperloglog function is precisely used to deduplicate the messages based on their keys (at least in the poc).
For information, the poc was launched on the content topic of production. The results seem acceptable, with a slightly higher count on the redis counter, probably due to some messages sent to kafka but not persisted in the database.
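To illustrate the deduplication on the redis side: adding the same key twice does not change the HyperLogLog estimate (the key and member names below are made up):
❯ redis-cli PFADD counters:content sha1:aaaa sha1:bbbb
(integer) 1
❯ redis-cli PFADD counters:content sha1:aaaa
(integer) 0
❯ redis-cli PFCOUNT counters:content
(integer) 2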
Jan 28 2021
Ticket opened via the Dell support.
The disk should be delivered on Monday, 1st February 2021; the DSI is informed.
The fix is deployed on webapp1 and solved the problem.
The storage version v0.21.1 is deployed in staging and the problem looks fixed :
❯ curl -s https://webapp.staging.swh.network/api/1/origin/https://gitlab.com/miwc/miwc.github.io.git/visit/latest/\?require_snapshot\=true | jq ''
{
  "origin": "https://gitlab.com/miwc/miwc.github.io.git",
  "date": "2020-12-07T18:21:58.967952+00:00",
  "type": "git",
  "visit": 1,
  "status": "full",
  "snapshot": "759b36e0e3e81e8cbf601181829571daa645b5d2",
  "metadata": {},
  "origin_url": "https://webapp.staging.swh.network/api/1/origin/https://gitlab.com/miwc/miwc.github.io.git/get/",
  "snapshot_url": "https://webapp.staging.swh.network/api/1/snapshot/759b36e0e3e81e8cbf601181829571daa645b5d2/"
}
Jan 27 2021
This is an attempt to generate a global diagram of the staging environment (P929):
It seems to be ok :)