May 4 2021
new permissions updated by puppet
May 3 2021
fix a typo in the commit message
The order seems to have been delivered; I will check with the DSI how we can proceed with the installation.
Apr 23 2021
Logstash now exposes an API server[1] which seems to return some interesting metrics on plugin behavior.
For example, there is a section for the elasticsearch output plugin:
"outputs": [ { "id": "62d11c4234b8981da77a97955da92ac9de92b9a6dcd4582f407face31fd5c664", "events": { "duration_in_millis": 160089636, "in": 72818126, "out": 72818046 }, "bulk_requests": { "responses": { "200": 3860888 }, "successes": 3860888 }, "documents": { "successes": 72818046 }, "name": "elasticsearch" } ] },
As a first step, I'll try to implement a small Python script checking whether there are response codes other than 200, to identify the behavior (a rough sketch follows below).
Perhaps it will also be interesting to check other properties like the queue size:
"queue": { "type": "memory", "events_count": 0, "queue_size_in_bytes": 0, "max_queue_size_in_bytes": 0 },
I checked the icinga_logstash plugin[1] to see if it could be helpful, but it's more oriented towards Logstash instances used to ingest data from log files. There are no options to check the number of events received/sent, for example.
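A first sketch of the check described above. This is only an outline: it assumes the Logstash monitoring API is reachable on the default localhost:9600 with the pipeline stats under /_node/stats/pipelines, and the "anything other than 200 or a non-empty queue is critical" rule is a placeholder, not a final threshold.

import sys
import requests

LOGSTASH_API = "http://localhost:9600/_node/stats/pipelines"  # default monitoring port

def check_logstash():
    stats = requests.get(LOGSTASH_API, timeout=5).json()
    problems = []
    for name, pipeline in stats.get("pipelines", {}).items():
        # non-200 bulk responses reported by the elasticsearch output plugin
        for output in pipeline.get("plugins", {}).get("outputs", []):
            responses = output.get("bulk_requests", {}).get("responses", {})
            bad = {code: count for code, count in responses.items() if code != "200"}
            if bad:
                problems.append("%s/%s: non-200 bulk responses %s" % (name, output.get("name"), bad))
        # events stuck in the pipeline queue
        queue = pipeline.get("queue", {})
        if queue.get("events_count", 0) > 0:
            problems.append("%s: %s events waiting in the queue" % (name, queue["events_count"]))
    if problems:
        print("CRITICAL - " + "; ".join(problems))
        return 2
    print("OK - all bulk requests returned 200")
    return 0

if __name__ == "__main__":
    sys.exit(check_logstash())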
According to the tracking page, the order left the factory on Apr 22, 2021. The ETA is May 28, 2021.
The authors are now displayed on staging and production (webapp1).
The lag for the production can be followed here: https://grafana.softwareheritage.org/goto/Di2H3z9Gk
(staging has already recovered)
swh-counters is now deployed in production too:
- upgrade the swh-counters package and restart the swh-counters backend and journal client
root@counters1:~# apt dist-upgrade
...
Setting up python3-swh.counters (0.7.0-1~swh1~bpo10+1) ...
root@counters1:~# systemctl stop swh-counters-journal-client.service
root@counters1:~# systemctl restart gunicorn-swh-counters.service
root@counters1:~# systemctl start swh-counters-journal-client.service
root@counters1:~# redis-cli pfcount person
(integer) 7
The person count has already started.
- stop the journal client to be able to reset the release and revision offsets
root@counters1:~# systemctl stop swh-counters-journal-client.service
- reset the offsets
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --all-topics --to-current --dry-run --export --group swh.counters.journal_client 2>&1 > ~/counters_journal_client_offsets.csv
# revision reset
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --group swh.counters.journal_client --to-earliest --execute --topic swh.journal.objects.revision
# release reset
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --group swh.counters.journal_client --to-earliest --execute --topic swh.journal.objects.release
# checks
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --all-topics --to-current --dry-run --export --group swh.counters.journal_client 2>&1 > ~/counters_journal_client_offsets-backfill.csv
vsellier@kafka1 ~ % diff ~/counters_journal_client_offsets.csv ~/counters_journal_client_offsets-backfill.csv | less
1c1
< "swh.journal.objects.revision",25,8275180
---
> "swh.journal.objects.revision",25,0
8c8
< "swh.journal.objects.release",128,78484
---
> "swh.journal.objects.release",128,0
16c16
...
- journal client restarted
root@counters1:~# systemctl start swh-counters-journal-client.service
- the person counter is growing quickly (see the polling sketch below)
root@counters1:~# date;redis-cli pfcount person
Fri 23 Apr 2021 10:55:54 AM UTC
(integer) 72358
root@counters1:~# date;redis-cli pfcount person
Fri 23 Apr 2021 10:55:57 AM UTC
(integer) 80618
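The manual polling above can be scripted. A minimal sketch, assuming redis-py is installed and Redis listens on the default localhost:6379 as on counters1, that prints the approximate growth rate of the person counter during the backfill:

import time
import redis

r = redis.Redis()  # assumes Redis on localhost:6379, as on counters1

previous = r.pfcount("person")
while True:
    time.sleep(10)
    current = r.pfcount("person")
    # pfcount is an approximate distinct count, so the rate is approximate too
    print("person ~= %d (+%d/s)" % (current, (current - previous) // 10))
    previous = current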
- just keep the topic configuration as the journal split is not needed anymore
- fix the typo in the commit message
I hesitated to do it, but as it should not change anymore now that everything is reactivated, I chose to keep it as it is.
We'll see if there is any movement on this list in the near future.
- version 0.7.0 released with the latest improvement (D5576) from vlorentz (thanks)
- deployment done in staging
- the person counting has started on the live messages:
root@counters0:~# redis-cli
127.0.0.1:6379> pfcount person
(integer) 7
- now let's reset the consumer offsets for the release and revision topics to backfill the person counter (see the note after the commands on why re-reading already-counted messages is safe):
# offsets backup
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --all-topics --to-current --dry-run --export --group swh.counters.journal_client 2>&1 > ~/counters_journal_client_offsets.csv
# revision reset
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --group swh.counters.journal_client --to-earliest --execute --topic swh.journal.objects.revision
# release reset
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --group swh.counters.journal_client --to-earliest --execute --topic swh.journal.objects.release
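A note on why resetting the offsets to the earliest and re-reading revisions/releases that were already counted live is safe: the person counter is a Redis HyperLogLog, so adding the same author twice does not change the estimate. A tiny illustration using redis-py against a local Redis and a throwaway key:

import redis

r = redis.Redis()  # local Redis, throwaway key only

r.delete("person:demo")
r.pfadd("person:demo", "Jane Doe <jane@example.com>")
r.pfadd("person:demo", "Jane Doe <jane@example.com>")  # duplicate: the HLL ignores it
r.pfadd("person:demo", "John Doe <john@example.com>")
print(r.pfcount("person:demo"))  # 2: re-reading already-counted authors does not inflate the count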
The VM is configured and the new database schema has been created on the staging database.
psql is also configured with several services (an illustrative pg_service.conf sketch follows the list):
- swh-mirror : read-only connection on the swh-mirror schema
- admin-swh-mirror: r/w connection
- swh: read-only connection on the archive database (staging)
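For reference, these psql services are entries in a pg_service.conf file. The sketch below only illustrates the shape of such entries; the hosts, ports, database names, and roles are hypothetical, not the actual staging values.

# hypothetical ~/.pg_service.conf on mirror-test
[swh-mirror]
host=db1.internal.staging.swh.network
port=5432
dbname=swh-mirror
user=guest

[admin-swh-mirror]
host=db1.internal.staging.swh.network
port=5432
dbname=swh-mirror
user=swh-mirror-admin

[swh]
host=db1.internal.staging.swh.network
port=5432
dbname=swh
user=guest

With such entries in place, running "psql service=swh-mirror" opens the corresponding connection directly.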
In D5576#141670, @vlorentz wrote:
> In D5576#141650, @vsellier wrote:
>> I suppose the message.value() is returning a copy of the content
> (Py_INCREF only increments the reference counter)
limit the pre-configured databases to the main database + swh-mirror
Apr 22 2021
I suppose this implementation will be less efficient in terms of memory consumption, as we will keep a copy of the message contents in the dict (I suppose message.value() returns a copy of the content).
It also makes the module aware of how the objects are serialized in Kafka, which looks quite low level.
VM created by terraform:
mirror-tests_summary =
hostname: mirror-test
fqdn: mirror-test.internal.staging.swh.network
network: ip=192.168.130.160/24,gw=192.168.130.1 macaddrs=E6:3C:8A:B7:26:5D
VM declared in the inventory: https://inventory.internal.softwareheritage.org/virtualization/virtual-machines/103/
The future IP will be 192.168.130.160.
Thanks for the diff you will propose. I will land this one in the meantime.
It's for performance considerations only: for most of the counters, counting the keys is enough, as the key is the object's unique identifier in Kafka.
The KeyOrientedJournalClient[1] bypasses the object deserialization when a message is received, so a more classical client is needed for this specific Person case.
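To make the trade-off concrete, here is a rough sketch of the two counting styles. This is not the actual swh.counters code: the serialization format (msgpack), field names, and Redis layout are assumptions. Key-oriented counting never decodes the message, while person counting has to look inside the value.

import msgpack  # the journal messages are assumed to be msgpack-encoded here
import redis

r = redis.Redis()  # hypothetical Redis instance holding the counters

def count_from_keys(object_type, keys):
    # Key-oriented counting: the Kafka key is the object's unique identifier,
    # so feeding it to a HyperLogLog is enough and the value is never decoded.
    r.pfadd(object_type, *keys)

def count_persons(values):
    # Person counting: the author/committer only exists inside the message
    # value, so each value has to be deserialized before counting.
    for value in values:
        obj = msgpack.unpackb(value, raw=False)
        for field in ("author", "committer"):
            person = obj.get(field)
            if person and person.get("fullname"):
                r.pfadd("person", person["fullname"])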
add missing doc strings
Update according to the review feedback
Apr 21 2021
puppet resources cleaned:
root@pergamon:~# /usr/local/sbin/swh-puppet-master-decomission clearly-defined.internal.staging.swh.network
+ puppet node deactivate clearly-defined.internal.staging.swh.network
Submitted 'deactivate node' for clearly-defined.internal.staging.swh.network with UUID 26eb9a73-add9-4745-b068-6106ab2b20b4
+ puppet node clean clearly-defined.internal.staging.swh.network
Notice: Revoked certificate with serial 256
Notice: Removing file Puppet::SSL::Certificate clearly-defined.internal.staging.swh.network at '/var/lib/puppet/ssl/ca/signed/clearly-defined.internal.staging.swh.network.pem'
clearly-defined.internal.staging.swh.network
+ puppet cert clean clearly-defined.internal.staging.swh.network
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
Notice: Revoked certificate with serial 256
+ systemctl restart apache2
- vm destroyed
- configuration removed for terraform
- database schemas cleared:
- before:
root@db1:~# zpool list
NAME   SIZE   ALLOC   FREE   CKPOINT   EXPANDSZ   FRAG   CAP   DEDUP   HEALTH   ALTROOT
data  27.3T    623G  26.7T         -          -    16%    2%   1.00x   ONLINE   -
lgtm
Apr 20 2021
The 2 disks were removed from the server and packaged to be sent to Seagate.
The order was received and confirmed by Dell. ETA: May 28th.
The details were sent to the sysadm mailing list.