Dec 17 2020

vsellier added a comment to T2897: [staging] kafka data dir over 80%.

After one week, the kafka data directory was about 85% full:

root@journal0:/tmp# df -h /srv/kafka/logdir
Filesystem      Size  Used Avail Use% Mounted on
kafka-volume    481G  409G   73G  85% /srv/kafka/logdir

Unlike in production, compression was not enabled on the zfs pool:

root@kafka1:~#  zfs get all data/kafka  | grep compress
data/kafka  compressratio         1.55x                  -
data/kafka  compression           lz4                    inherited from data
data/kafka  refcompressratio      1.55x                  -
root@journal0:/tmp# zfs get all  | grep compress
kafka-volume  compressratio         1.00x                  -
kafka-volume  compression           off                    default
kafka-volume  refcompressratio      1.00x                  -

So compression was enabled:

root@journal0:/tmp# zfs set compression=lz4 kafka-volume
root@journal0:/tmp# zfs get all  | grep compress
kafka-volume  compressratio         1.00x                  -
kafka-volume  compression           lz4                    local
kafka-volume  refcompressratio      1.00x                  -
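
As a rough order of magnitude only: assuming the staging topics compress about as well as the production ones (the 1.55x ratio reported for data/kafka above, which is an assumption for staging and not a measurement), the space needed once all the data has been rewritten can be estimated as follows. The function name is illustrative.

```python
def estimated_size_after_compression(used_gib: float, ratio: float) -> float:
    """Estimate the on-disk size once all existing data is stored compressed."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return used_gib / ratio


# 409G is the usage reported by df on journal0; 1.55 is the production
# compressratio, assumed (not measured) to carry over to staging.
estimate = estimated_size_after_compression(409, 1.55)
print(f"about {estimate:.0f}G once everything is rewritten compressed")
```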

As this setting only applies to newly written data, we forced a compaction of the biggest topics: `directory, revision and content`

 % ./kafka-topics.sh --zookeeper $ZK  --alter --topic swh.journal.objects.revision --config min.cleanable.dirty.ratio=0.01
WARNING: Altering topic configuration from this script has been deprecated and may be removed in future releases.
         Going forward, please use kafka-configs.sh for this functionality
Updated config for topic swh.journal.objects.revision.
vsellier@journal0 /opt/kafka/bin
 % ./kafka-topics.sh --zookeeper $ZK  --alter --topic swh.journal.objects_privileged.revision --config min.cleanable.dirty.ratio=0.01
WARNING: Altering topic configuration from this script has been deprecated and may be removed in future releases.
         Going forward, please use kafka-configs.sh for this functionality
Updated config for topic swh.journal.objects_privileged.revision.
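
The deprecation warning points to kafka-configs.sh. A minimal sketch of the equivalent invocation, built as an argument list in Python: the helper name and the ZooKeeper address are placeholders, and it assumes a Kafka version whose kafka-configs.sh still accepts --zookeeper, as the commands above do.

```python
import shlex
from typing import List


def kafka_configs_alter_cmd(zookeeper: str, topic: str, config: str) -> List[str]:
    """Build the kafka-configs.sh call that sets one topic-level config."""
    return [
        "./kafka-configs.sh",
        "--zookeeper", zookeeper,
        "--entity-type", "topics",
        "--entity-name", topic,
        "--alter",
        "--add-config", config,
    ]


cmd = kafka_configs_alter_cmd(
    "zk.example:2181",  # placeholder standing in for $ZK
    "swh.journal.objects.revision",
    "min.cleanable.dirty.ratio=0.01",
)
# Print a copy-pastable command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```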
Dec 17 2020, 10:00 AM · System administration, Staging environment
vsellier changed the status of T2897: [staging] kafka data dir over 80% from Open to Work in Progress.
Dec 17 2020, 9:58 AM · System administration, Staging environment

Dec 16 2020

vsellier accepted D4754: staging: Add clearly-defined node.

LGTM

Dec 16 2020, 3:43 PM
vsellier closed T2629: Recycle ceph-mon1 as a hypervisor integrated in the proxmox cluster as Resolved.

changing the status to resolved as everything looks good \o/

Dec 16 2020, 3:34 PM · System administration
vsellier closed T2629: Recycle ceph-mon1 as a hypervisor integrated in the proxmox cluster, a subtask of T2501: Proxmox reliability improvements (Summer 2020), as Resolved.
Dec 16 2020, 3:34 PM · System administration
vsellier added a comment to T2629: Recycle ceph-mon1 as a hypervisor integrated in the proxmox cluster.

After a new test, a vm deployed on pompidou can reach the network without any issue.
There were some glitches (kernel dump) after the migration; perhaps a reboot after the first migration test would have fixed the network problem.

Dec 16 2020, 3:33 PM · System administration
vsellier merged T2868: Integrate former ceph-mon1 server to the proxmox cluster into T2629: Recycle ceph-mon1 as a hypervisor integrated in the proxmox cluster.
Dec 16 2020, 3:31 PM · System administration
vsellier merged task T2868: Integrate former ceph-mon1 server to the proxmox cluster into T2629: Recycle ceph-mon1 as a hypervisor integrated in the proxmox cluster.
Dec 16 2020, 3:31 PM · System administration
vsellier updated subscribers of D4747: Decomission kafka from esnodes.

We should also check with @olasd whether zookeeper[1-3] can be decommissioned if we remove these brokers

Dec 16 2020, 3:08 PM
vsellier accepted D4753: Add clearly-defined vm role to access the staging clearly defined db instance.

LGTM
as tested together, removing the deep option of lookup seems to result in a more predictable behavior :)

Dec 16 2020, 3:04 PM
vsellier updated the task description for T2888: Elasticsearch cluster failure during a rolling restart.
Dec 16 2020, 10:22 AM · System administration
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.
  • smartctl extended tests are running on all the esnode* disks to detect possible defects. The results will be available in a few hours

All the smartctl tests are done and no additional faulty disks were detected

Dec 16 2020, 10:22 AM · System administration

Dec 15 2020

vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

Remark regarding the extension of the storage via the addition of a new data directory [1], so not sure it's the best way to do it:

Dec 15 2020, 6:52 PM · System administration
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.
  • smartctl extended tests are running on all the esnode* disks to detect possible defects. The results will be available in a few hours
Dec 15 2020, 6:29 PM · System administration
vsellier updated the task description for T2888: Elasticsearch cluster failure during a rolling restart.
Dec 15 2020, 6:16 PM · System administration
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

We tried to temporarily restart esnode1 to reallocate the shards of the red indices for which esnode1 was the primary.
Actions:

  • Mount/Remount the xfs partition to flush the xfs journal
  • Run xfs_repair to ensure the fs is ok
  • configure elasticsearch to deallocate the shards managed by esnode1
  • start esnode1
  • wait for the shard redistribution (swh_workers-2020.09.03 was quickly recovered, and the remaining systemlogs.2018 deleted)
  • stop esnode1
  • disable puppet to avoid a restart of elasticsearch on esnode1
Dec 15 2020, 6:16 PM · System administration
vsellier accepted D4745: staging: Add clearly-defined postgresql instance.

LGTM

Dec 15 2020, 5:18 PM
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

Disk usage is back to around 85% on esnode3 (~79% on esnode2).
The systemlogs.*2020.01.* indices were removed.

Dec 15 2020, 3:28 PM · System administration
vsellier accepted D4743: Onboard Tushar Goel as tg1999.
Dec 15 2020, 2:59 PM
vsellier added a comment to D4743: Onboard Tushar Goel as tg1999.

LGTM

Dec 15 2020, 2:59 PM
vsellier updated the task description for T2888: Elasticsearch cluster failure during a rolling restart.
Dec 15 2020, 11:37 AM · System administration
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

Shard allocation is reactivated; there should be enough free disk space to replicate all the shards on the 2 nodes.

Dec 15 2020, 11:35 AM · System administration
vsellier updated the task description for T2888: Elasticsearch cluster failure during a rolling restart.
Dec 15 2020, 11:32 AM · System administration
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.
  1. cleanup of systemlogs index before 2020 (2018/2019)
Dec 15 2020, 11:31 AM · System administration
vsellier raised the priority of T2888: Elasticsearch cluster failure during a rolling restart from Normal to Unbreak Now!.
Dec 15 2020, 10:54 AM · System administration
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

Short term plan :

  • Remove systemlogs indexes older than 1 year to start, but we can go down to 3 months if necessary
  • Reactivate the shard allocation to have 1 replica for all the shards in case of a second node failure
  • Launch a long smartctl test on all the disks of each esnode* server
  • Contact DELL support to replace the 2 failing disks (under warranty(?)) [1]
  • Try to recover the 16 red indexes if possible, if not, delete them as they are not critical
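
A possible way to select the indexes covered by the first item, sketched in Python. It assumes the daily naming scheme systemlogs-YYYY.MM.DD seen in the cluster; the function name and the sample index names are hypothetical, and the actual deletion is left out.

```python
from datetime import date, timedelta
from typing import Iterable, List


def indices_older_than(names: Iterable[str], today: date, max_age_days: int,
                       prefix: str = "systemlogs-") -> List[str]:
    """Return the daily indexes whose date suffix is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    old = []
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            year, month, day = (int(p) for p in name[len(prefix):].split("."))
            index_day = date(year, month, day)
        except ValueError:
            continue  # not a YYYY.MM.DD suffix, leave the index alone
        if index_day < cutoff:
            old.append(name)
    return old


# Hypothetical index names, for illustration only.
names = ["systemlogs-2018.05.01", "systemlogs-2019.12.14",
         "systemlogs-2020.08.30", "swh_workers-2020.09.03"]
print(indices_older_than(names, today=date(2020, 12, 15), max_age_days=365))
```

Lowering max_age_days to about 90 gives the 3-month fallback mentioned in the plan.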
Dec 15 2020, 10:52 AM · System administration

Dec 14 2020

vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

xfs has shut down the partition, so ES is lost.

Dec 14 2020, 10:49 PM · System administration
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

It seems only a limited number of indices are impacted by the corruption:

❯ curl -s  http://${ES_NODE}/_cat/indices | grep red
red    open  systemlogs-2020.08.30               o_gpFSjQRBuQBvWqaqA_dA 1 1                                   
red    open  systemlogs-2020.08.27               U4fKujQhTXmbGsx7zzLiPw 1 1                                   
red    open  systemlogs-2020.08.28               JVz-yhe4SeSow1TQPT61Jg 1 1                                   
red    open  systemlogs-2020.08.29               6avrSP3bRW2ZiwSlTpN0tA 1 1                                   
red    open  systemlogs-2020.08.22               jY7nPiXDS6a6aBnTDHNd1A 1 1                                   
red    open  systemlogs-2020.08.16               AK8wyDFQQ2KOgbzIdLvPqQ 1 1                                   
red    open  systemlogs-2020.08.13               o6OowHj-TMCBSglETaTj4w 1 1                                   
red    open  systemlogs-2020.08.10               NN0H_eaXQJW_20lsIMmg0Q 1 1                                   
red    open  systemlogs-2020.08.08               pkJVICAdSbqn3JgHU1h5Yw 1 1                                   
red    open  systemlogs-2020.09.07               naRyJEkZRCeOY5h_2avRyg 1 1                                   
red    open  systemlogs-2020.09.03               wb0DMaeqT2-Lh4nx8rafgQ 1 1                                   
red    open  systemlogs-2020.09.01               jelq1Ij5SGWQAKDqdbCYlQ 1 1                                   
red    open  swh_workers-2020.09.03              c1ZiRR8HS9W44T3nVd7f9Q 2 1  2733325        0   1.6gb    1.6gb
red    open  systemlogs-2020.07.24               743a1usWSw-whONPLhcKrA 1 1                                   
red    open  systemlogs-2020.07.25               zFkfn6l5SA-sby3A0SOAtw 1 1                                   
red    open  systemlogs-2020.07.17               PxL7sBrUQ8SXtbOEG5v_3A 1 1
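
For follow-up checks, the same listing can be summarised per health state. A minimal sketch, assuming the default `_cat/indices` column order (health, status, index name, ...); the function name and the green sample line are made up for the example.

```python
from typing import Dict, List


def indices_by_health(cat_output: str) -> Dict[str, List[str]]:
    """Group index names by health from the text output of _cat/indices."""
    grouped: Dict[str, List[str]] = {}
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or truncated line
        health, _status, name = fields[:3]
        grouped.setdefault(health, []).append(name)
    return grouped


sample = """\
red    open  systemlogs-2020.08.30               o_gpFSjQRBuQBvWqaqA_dA 1 1
red    open  swh_workers-2020.09.03              c1ZiRR8HS9W44T3nVd7f9Q 2 1  2733325        0   1.6gb    1.6gb
green  open  example-index                       AAAAAAAAAAAAAAAAAAAAAA 1 1       10        0     1kb      1kb
"""
by_health = indices_by_health(sample)
for health, index_names in by_health.items():
    print(health, len(index_names), index_names)
```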
Dec 14 2020, 10:21 PM · System administration
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

sdb and sdc on esnode1 have serious issues.
(there are no other disks with errors on the other servers)

root@esnode1:~# smartctl -a /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-13-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Dec 14 2020, 10:19 PM · System administration
vsellier changed the status of T2888: Elasticsearch cluster failure during a rolling restart from Open to Work in Progress.
Dec 14 2020, 10:15 PM · System administration
vsellier accepted D4732: Add admin tools to default packages.

LGTM no more apt install dstat \o/ :P

Dec 14 2020, 12:02 PM
vsellier added a comment to T2817: Enable the swh-search environment in staging.

With the "optimized" configuration, the import is noticeably faster:

root@search-esnode0:~# curl -XPOST -H "Content-Type: application/json" http://${ES_SERVER}/_reindex\?pretty\&refresh=true\&requests_per_second=-1\&\&wait_for_completion=true -d @/tmp/reindex-production.json    
{
  "took" : 10215280,
  "timed_out" : false,
  "total" : 91517657,
  "updated" : 0,
  "created" : 91517657,
  "deleted" : 0,
  "batches" : 91518,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

"took" : 10215280 (milliseconds) => about 2h50

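The `took` field of the reindex response is expressed in milliseconds. A small sketch converting it to a readable duration and deriving the average indexing rate from the `created` count; the function name is illustrative.

```python
def format_duration_ms(took_ms: int) -> str:
    """Format a duration given in milliseconds as XhYYmZZs."""
    total_seconds = took_ms // 1000
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h{minutes:02d}m{seconds:02d}s"


took_ms = 10215280   # "took" from the response above
created = 91517657   # "created" from the response above

docs_per_second = created / (took_ms / 1000)
print(format_duration_ms(took_ms))  # 2h50m15s
print(f"{docs_per_second:.0f} documents/s on average")
```
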
Dec 14 2020, 9:47 AM · System administrators, Staging environment, Journal, Archive search

Dec 11 2020

vsellier added a comment to T2682: Deploy a small publicly available kafka server (with some content) on a staging (+ the related objstorage).
  • diff landed and applied on the server
  • VIP 128.93.166.40 configured on the firewall
  • NAT Port forward of port 9093 from public ip to internal journal0 declared on the firewall
  • DNS declaration of broker0.journal.staging.swh.network in gandi
  • Ask the DSI to apply the kafka firewall profile to 128.93.166.40
  • Configure a user to test the pipeline
Dec 11 2020, 6:11 PM · Staging environment, System administration
vsellier committed rSPSITE5c693c5cf08b: kafka: activate the authentication on the public network (authored by vsellier).
kafka: activate the authentication on the public network
Dec 11 2020, 5:19 PM
vsellier closed D4726: kafka: activate the authentication on the public network.
Dec 11 2020, 5:19 PM
vsellier updated the diff for D4726: kafka: activate the authentication on the public network.

rebase

Dec 11 2020, 5:18 PM
vsellier committed rSENV84531f26646d: Upgrade the journal0 to have the first CN matching broker1.journal.staging.swh. (authored by vsellier).
Upgrade the journal0 to have the first CN matching broker1.journal.staging.swh.
Dec 11 2020, 4:24 PM
vsellier created D4726: kafka: activate the authentication on the public network.
Dec 11 2020, 3:17 PM
vsellier added a revision to T2682: Deploy a small publicly available kafka server (with some content) on a staging (+ the related objstorage): D4726: kafka: activate the authentication on the public network.
Dec 11 2020, 3:17 PM · Staging environment, System administration
vsellier committed rSPSITE5128dbb7b8d3: varnish: Correctly handle the vhost when the port number is included (authored by vsellier).
varnish: Correctly handle the vhost when the port number is included
Dec 11 2020, 2:36 PM
vsellier closed D4719: varnish: Correctly handle the vhost when the port number is included.
Dec 11 2020, 2:36 PM
vsellier added a comment to T2877: Investigate spurious deposit logs.

I agree for the default site, but several legitimate requests from the monitoring are not correctly routed, so the configuration needs to be adapted.

Dec 11 2020, 11:46 AM · System administration, Staging environment, SWORD deposit
vsellier added a revision to T2877: Investigate spurious deposit logs: D4719: varnish: Correctly handle the vhost when the port number is included.
Dec 11 2020, 11:42 AM · System administration, Staging environment, SWORD deposit
vsellier created D4719: varnish: Correctly handle the vhost when the port number is included.
Dec 11 2020, 11:42 AM
vsellier added a comment to T2817: Enable the swh-search environment in staging.

The production index origin was correctly copied from the production cluster, but apparently without the configuration meant to optimize the copy.
We keep this one and try a new optimized copy to check whether the server still crashes with an OOM with the new cpu and memory settings.

Dec 11 2020, 10:15 AM · System administrators, Staging environment, Journal, Archive search

Dec 10 2020

vsellier changed the status of T2682: Deploy a small publicly available kafka server (with some content) on a staging (+ the related objstorage) from Open to Work in Progress.
Dec 10 2020, 5:41 PM · Staging environment, System administration
vsellier committed rSPRE56974c0407c2: staging: Increase cpu, memory and disk of search-esnode0 (authored by vsellier).
staging: Increase cpu, memory and disk of search-esnode0
Dec 10 2020, 3:59 PM
vsellier added a comment to T2817: Enable the swh-search environment in staging.

FYI: The origin index was recreated with the "official" mapping and a backfill was performed (necessary after the test of the flattened mapping)

Dec 10 2020, 3:42 PM · System administrators, Staging environment, Journal, Archive search
vsellier accepted D4716: Deactivate swh-search-journal-client@indexed service.
Dec 10 2020, 3:33 PM
vsellier closed T2817: Enable the swh-search environment in staging, a subtask of T2590: Finish the indexer -> swh-search pipeline, as Resolved.
Dec 10 2020, 3:29 PM · Journal, Archive search
vsellier closed T2817: Enable the swh-search environment in staging as Resolved.

The deployment manifests are ok and deployed in staging, so this task can be resolved.
We will work on reactivating search-journal-client for the metadata in another task once T2876 is resolved.

Dec 10 2020, 3:29 PM · System administrators, Staging environment, Journal, Archive search
vsellier accepted D4712: staging: Increase elasticsearch jvm heap size to half its memory.
Dec 10 2020, 3:22 PM
vsellier updated the task description for T2817: Enable the swh-search environment in staging.
Dec 10 2020, 3:19 PM · System administrators, Staging environment, Journal, Archive search
vsellier added a comment to T2876: metadata indexation : ES' dynamic mapping creation fails for field values that are of varying types.

We tried to change the mapping type of the field intrinsic_metadata from nested to flattened, as you suggested. We now have a new error related to the huge size of a description.
ES can be configured to accept bigger fields but I'm not sure it's relevant regarding the description field content.

Dec 10 2020, 3:18 PM · Intrinsic metadata, Indexer, Archive search
vsellier added a subtask for T2590: Finish the indexer -> swh-search pipeline: T2876: metadata indexation : ES' dynamic mapping creation fails for field values that are of varying types.
Dec 10 2020, 12:31 PM · Journal, Archive search
vsellier added a parent task for T2876: metadata indexation : ES' dynamic mapping creation fails for field values that are of varying types: T2590: Finish the indexer -> swh-search pipeline.
Dec 10 2020, 12:31 PM · Intrinsic metadata, Indexer, Archive search
vsellier triaged T2876: metadata indexation : ES' dynamic mapping creation fails for field values that are of varying types as Normal priority.
Dec 10 2020, 12:31 PM · Intrinsic metadata, Indexer, Archive search
vsellier accepted D4699: search: Deploy multiple search journal client instances.
Dec 10 2020, 11:36 AM
vsellier added a comment to T2817: Enable the swh-search environment in staging.

The copy of the production index is restarted.
To improve the speed of the copy, the index was tuned to reduce the disk pressure (this is a temporary configuration; it is not safe and should not be used in normal operation):

cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : "60s",
    "translog.durability": "async",
    "refresh_interval": "60s"
  }
}
EOF
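
The feed does not show how this file was applied. One plausible way, sketched here as an assumption, is the index update-settings endpoint (`PUT /<index>/_settings`); sending `null` for a setting reverts it to its default once the copy is done. The ES address and the helper name are placeholders, and `origin` is used as the index name only because it is the index discussed in this task.

```python
import json
import urllib.request


def build_settings_request(es_url: str, index: str, settings: dict) -> urllib.request.Request:
    """Prepare (without sending) a PUT <es_url>/<index>/_settings request."""
    body = json.dumps({"index": settings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Temporary, unsafe settings used to speed up the copy (same as the file above).
BULK_IMPORT_SETTINGS = {
    "translog.sync_interval": "60s",
    "translog.durability": "async",
    "refresh_interval": "60s",
}
# A null value asks elasticsearch to restore the default for that setting.
RESTORE_DEFAULTS = {key: None for key in BULK_IMPORT_SETTINGS}

tune = build_settings_request("http://localhost:9200", "origin", BULK_IMPORT_SETTINGS)
restore = build_settings_request("http://localhost:9200", "origin", RESTORE_DEFAULTS)
# urllib.request.urlopen(tune) would send the request; it is not called here.
print(tune.get_method(), tune.full_url)
print(restore.data.decode("utf-8"))
```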
Dec 10 2020, 11:14 AM · System administrators, Staging environment, Journal, Archive search
vsellier added a comment to T2817: Enable the swh-search environment in staging.
  • Partition and memory extended with terraform.
  • The disk resize needed some console actions to be extended :
Dec 10 2020, 10:39 AM · System administrators, Staging environment, Journal, Archive search
vsellier added a comment to T2817: Enable the swh-search environment in staging.

The production index import failed because the limit of 90% disk usage was reached at some point; usage then fell back to around 60G after a compaction.
The import had reached 80M documents out of 91M.

Dec 10 2020, 9:59 AM · System administrators, Staging environment, Journal, Archive search
vsellier accepted D4711: test_journal_client: Migrate to pytest.
Dec 10 2020, 9:48 AM
vsellier accepted D4703: docker-compose.search.yml: Upgrade elasticsearch container.
Dec 10 2020, 9:46 AM
vsellier accepted D4702: docker-compose.search.yml: Specify the search journal client config.
Dec 10 2020, 9:46 AM
vsellier accepted D4708: swh-indexer: Fix configuration to add the tools configuration entry.
Dec 10 2020, 9:46 AM
vsellier accepted D4709: indexer_storage: Publish indexer computation to journal topics.
Dec 10 2020, 9:45 AM
vsellier accepted D4710: search.journal_client: Fix key error.
Dec 10 2020, 9:45 AM
vsellier accepted D4710: search.journal_client: Fix key error.

LGTM according to the journal content

Dec 10 2020, 9:44 AM
vsellier added a comment to D4709: indexer_storage: Publish indexer computation to journal topics.

LGTM

Dec 10 2020, 9:43 AM
vsellier added a comment to D4708: swh-indexer: Fix configuration to add the tools configuration entry.

LGTM

Dec 10 2020, 9:41 AM
vsellier added a comment to D4702: docker-compose.search.yml: Specify the search journal client config.

LGTM

Dec 10 2020, 9:39 AM
vsellier added a comment to D4703: docker-compose.search.yml: Upgrade elasticsearch container.

LGTM

Dec 10 2020, 9:37 AM
vsellier accepted D4704: docker-compose.search.yml: Add journal client for indexed values.

LGTM

Dec 10 2020, 9:36 AM

Dec 9 2020

vsellier committed rDSEAe72a785757fb: Allow configuration through cli or config file (authored by vsellier).
Allow configuration through cli or config file
Dec 9 2020, 6:15 PM
vsellier closed D4701: Allow configuration through cli or config file.
Dec 9 2020, 6:15 PM
vsellier added a revision to T2817: Enable the swh-search environment in staging: D4701: Allow configuration through cli or config file.
Dec 9 2020, 5:57 PM · System administrators, Staging environment, Journal, Archive search
vsellier created D4701: Allow configuration through cli or config file.
Dec 9 2020, 5:57 PM
vsellier added a project to T2594: production: Running nixguix on guix sources: Archive coverage.
Dec 9 2020, 4:25 PM · Archive coverage, System administration
vsellier added a project to T2608: Deploy launchpad and gitea listers on production: Archive coverage.
Dec 9 2020, 4:22 PM · Archive coverage, System administration
vsellier committed rSPSITE87af77517b68: Allow staging network to request internal dns (authored by vsellier).
Allow staging network to request internal dns
Dec 9 2020, 10:54 AM
vsellier closed D4693: Allow staging network to request internal dns.
Dec 9 2020, 10:54 AM
vsellier created D4693: Allow staging network to request internal dns.
Dec 9 2020, 10:48 AM
vsellier added a comment to T2817: Enable the swh-search environment in staging.

The search rpc backend and the journal client listening on origin and origin_visit topics are deployed.
The inventory is up to date for both hosts [1][2]

Dec 9 2020, 9:51 AM · System administrators, Staging environment, Journal, Archive search
vsellier updated the task description for T2817: Enable the swh-search environment in staging.
Dec 9 2020, 9:35 AM · System administrators, Staging environment, Journal, Archive search

Dec 8 2020

vsellier closed T2828: Archive counters are no longer updated in production as Resolved.

changing the status to "Resolved": there seems to be nothing more to do on this task now that the counters are being updated again.

Dec 8 2020, 7:30 PM · Monitoring, Web app, System administration
vsellier renamed T2868: Integrate former ceph-mon1 server to the proxmox cluster from Integrate former ceph-mon1 server in the proxmox cluster to Integrate former ceph-mon1 server to the proxmox cluster.
Dec 8 2020, 6:11 PM · System administration
vsellier triaged T2868: Integrate former ceph-mon1 server to the proxmox cluster as Normal priority.
Dec 8 2020, 6:09 PM · System administration
vsellier triaged T2866: Integrate former Uffizi server to the proxmox cluster as Normal priority.
Dec 8 2020, 6:00 PM · System administration
vsellier triaged T2865: Prepare an environment to test the ClearlyDefined integration as Normal priority.
Dec 8 2020, 5:57 PM · System administration
vsellier accepted D4687: search: Add initialization step on install or upgrade.

LGTM

Dec 8 2020, 5:13 PM
vsellier accepted D4668: Add swh-search-journal-client to swh_search_with_journal_client role.

ack, let's try the current version of the journal client :)

Dec 8 2020, 12:11 PM
vsellier requested changes to D4668: Add swh-search-journal-client to swh_search_with_journal_client role.

As discussed together, we will need several instances of the journal to be able to use different prefixes.
It should be managed by puppet too

Dec 8 2020, 11:36 AM
vsellier accepted D4666: staging: Deploy swh-search rpc backend on search0.

My previous comment was not for this diff but for D4668 :)

Dec 8 2020, 11:31 AM
vsellier requested changes to D4666: staging: Deploy swh-search rpc backend on search0.

As discussed together, we will need several instances of the journal to be able to use different prefixes.
It should be managed by puppet too

Dec 8 2020, 11:17 AM
vsellier accepted D4664: search0: Add swh-search rpc backend node.

LGTM

Dec 8 2020, 11:14 AM
vsellier added a comment to T2817: Enable the swh-search environment in staging.

A dashboard to monitor the ES cluster behavior has been created on grafana [1]
It will be improved during the swh-search tests

Dec 8 2020, 10:49 AM · System administrators, Staging environment, Journal, Archive search

Dec 7 2020

vsellier added a comment to T2817: Enable the swh-search environment in staging.

Interesting note about how to size the shards of an index : https://www.elastic.co/guide/en/elasticsearch/reference/7.x//size-your-shards.html

Dec 7 2020, 6:15 PM · System administrators, Staging environment, Journal, Archive search
vsellier committed rSPSITE262e122fa89b: monitoring: gather metrics into prometheus (authored by vsellier).
monitoring: gather metrics into prometheus
Dec 7 2020, 2:41 PM
vsellier closed D4674: monitoring: gather metrics into prometheus.
Dec 7 2020, 2:41 PM
vsellier closed T2859: Out of disk space on prometheus storage as Resolved.
Dec 7 2020, 2:38 PM · Monitoring, System administration