
Jan 11 2021

vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

The model number to use in the request is ST6000NM0115.
An obscure message limits the number of returns per country per year to 3 (!):

Jan 11 2021, 8:14 PM · System administration
vsellier updated the task description for T2939: Replace out of order disks on db1.staging and storage1.staging.
Jan 11 2021, 1:53 PM · System administration
vsellier accepted D4831: hedgedoc: Fix reverse proxy configuration.

LGTM

Jan 11 2021, 9:49 AM
vsellier updated the task description for T2939: Replace out of order disks on db1.staging and storage1.staging.
Jan 11 2021, 9:42 AM · System administration
vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

The test of /dev/sdb finally ended... in error:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       80%     26004         2559298584

So we have 2 disks to replace on each server. What's weird is that the 2 disks to replace are at the same position on each server...
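
A failed entry like the one above can be picked out of the self-test log automatically. This is a minimal sketch, assuming the column layout shown in the log (columns separated by two or more spaces); the function name `failed_selftests` is made up for illustration.

```shell
#!/bin/sh
# failed_selftests (hypothetical helper): read a SMART self-test log on stdin
# and print one line per entry whose status contains "failure".
# Assumes columns are separated by at least two spaces, as in the log above.
failed_selftests() {
  awk -F'  +' '
    # $1 = "# <num>", $2 = test type, $3 = status, $6 = LBA of first error
    $1 ~ /^# *[0-9]+$/ && $3 ~ /failure/ {
      print $2 ": " $3 " (first error at LBA " $6 ")"
    }
  '
}
```

It could be fed with something like `smartctl -l selftest /dev/sdb | failed_selftests`.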

Jan 11 2021, 9:39 AM · System administration

Jan 8 2021

vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

The test is still running on one disk of storage1 (sdb). No new errors were found on any of the other disks.

Jan 8 2021, 2:17 PM · System administration
vsellier accepted D4822: admin: Provision new rp0.internal.admin.swh.network.

LGTM

Jan 8 2021, 11:16 AM
vsellier accepted D4823: admin: Add rp0.internal.admin.swh.network reverse proxy for admin nodes.

LGTM

Jan 8 2021, 10:14 AM

Jan 7 2021

vsellier triaged T2944: Deploy swh-search v0.4.1 as Normal priority.
Jan 7 2021, 6:39 PM · System administration, Journal, Archive search
vsellier added a comment to T2936: Update the swh-search journal client to only set "has_visit" on "full" status of the visit.

version v0.4.1 created with the last commit (rDSEA47db624364d4e781f8fa157b2d72d0eb9929b7a0)

Jan 7 2021, 4:16 PM · Journal, Archive search
vsellier accepted D4818: Do not set 'has_visit' when receiving a visit from the journal.

LGTM
thanks for the query to fix the index

Jan 7 2021, 2:16 PM
vsellier updated the task description for T2939: Replace out of order disks on db1.staging and storage1.staging.
Jan 7 2021, 12:36 PM · System administration
vsellier updated the task description for T2939: Replace out of order disks on db1.staging and storage1.staging.
Jan 7 2021, 12:35 PM · System administration
vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

Tests launched :

root@db1:~# echo /dev/sd{a..n} | xargs -t -n1 smartctl -t long
smartctl -t long /dev/sda 
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Jan 7 2021, 12:21 PM · System administration
vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

Complete disk statuses :

  • db1.staging:
root@db1:~# ls  /dev/sd{a..n} | xargs -t -n1 smartctl -a | grep -e "/dev/sd?" -e Reallocated_Sector_Ct -e "Model Family" -e "Serial Number" -e "Reported_Uncorrect" -e lifetime -e "Extended offline" -e "Offline_Uncorrectable"
smartctl -a /dev/sda 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27CCS
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdb 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27C4P
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
# 1  Extended offline    Completed: read failure       70%     25421         4131034152
smartctl -a /dev/sdc 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27DW0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdd 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27A44
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sde 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27BA5
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdf 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27DCG
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdg 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD270KS
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25431         -
smartctl -a /dev/sdh 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27A4P
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25431         -
smartctl -a /dev/sdi 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27E48
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25431         -
smartctl -a /dev/sdj 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD26YN2
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdk 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD279XY
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdl 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD279ZX
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25427         -
smartctl -a /dev/sdm 
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Serial Number:    PHDV71810017150MGN
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
# 1  Extended offline    Completed without error       00%     25415         -
smartctl -a /dev/sdn 
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Serial Number:    PHDV718004DM150MGN
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
# 1  Extended offline    Completed without error       00%     25415         -
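
The listing above is long, and only a few devices actually stand out. A sketch of a filter that keeps only the suspicious ones, assuming the exact line shapes shown above (the `smartctl -a /dev/sdX` lines echoed by `xargs -t`, followed by attribute rows and a self-test row); `suspect_disks` is a hypothetical name.

```shell
#!/bin/sh
# suspect_disks (hypothetical helper): read the filtered smartctl output on
# stdin and print only the devices with a non-zero error counter or a failed
# self-test.
suspect_disks() {
  awk '
    # Command line echoed by "xargs -t": remember the current device.
    $1 == "smartctl" && $2 == "-a" { dev = $3; next }
    # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
    ($2 == "Reallocated_Sector_Ct" || $2 == "Reported_Uncorrect" || $2 == "Offline_Uncorrectable") && $NF + 0 > 0 {
      print dev ": " $2 "=" $NF
    }
    # Self-test log rows whose status mentions a failure.
    /^# *[0-9]+ / && /failure/ { print dev ": self-test failure" }
  '
}
```

Since `xargs -t` writes the echoed command lines to stderr, the pipeline would need `2>&1` before the filter so that the device names reach it.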
Jan 7 2021, 12:18 PM · System administration
vsellier lowered the priority of T2888: Elasticsearch cluster failure during a rolling restart from High to Normal.

Reducing the priority to Normal as there is no longer any risk for the data.

Jan 7 2021, 12:04 PM · System administration
vsellier moved T2939: Replace out of order disks on db1.staging and storage1.staging from Backlog to in-progress on the System administration board.
Jan 7 2021, 12:03 PM · System administration
vsellier changed the status of T2939: Replace out of order disks on db1.staging and storage1.staging from Open to Work in Progress.
Jan 7 2021, 12:02 PM · System administration
vsellier added a comment to T2905: Deploy swh-search for production.

It depends on what is implemented in T2936, but a new reindex will probably be needed to fix the search. It will be an opportunity to think about how to do it without taking down the whole search.

Jan 7 2021, 11:36 AM · System administration, Journal, Archive search
vsellier updated subscribers of T2905: Deploy swh-search for production.

@vlorentz I was checking some differences between swh-search and the current search. Does the journal client have to listen to the `origin_visit` topic? It seems that `origin_visit_status` should be enough to match the behavior of the search in the webapp.

Jan 7 2021, 10:14 AM · System administration, Journal, Archive search

Jan 6 2021

vsellier committed rSPRE556448d54882: align search-esnode* configuration with the real number (authored by vsellier).
align search-esnode* configuration with the real number
Jan 6 2021, 4:05 PM
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

previous comment moved to T2903#56023

Jan 6 2021, 3:45 PM · System administration
vsellier added a comment to T2903: Test different disk configuration on esnode1.

The benchmark is done:

Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
esnode1-zfs-arc 63G  312k  99  478m  49  200m  43  640k  93  445m  53 400.8  31
Latency             31118us   58579us     748ms     231ms   78052us     275ms
Version  1.98       ------Sequential Create------ --------Random Create--------
esnode1-zfs-arc-lim -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 16384  24 +++++ +++ 16384   7 16384  35 +++++ +++ 16384   6
Latency               145ms    2012us     826ms     105ms      21us     842ms
1.98,1.98,esnode1-zfs-arc-limited,1,1609729287,63G,,8192,5,312,99,489919,49,204649,43,640,93,455669,53,400.8,31,16,,,,,4059,24,+++++,+++,3023,7,11686,35,+++++,+++,2398,6,31118us,58579us,748ms,231ms,78052us,275ms,145ms,2012us,826ms,105ms,21us,842ms

(sorry for the formatting, I didn't find how to make it better)
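
The CSV line at the end of the benchmark output is easier to condense than the table. The sketch below assumes the field positions inferred by matching the CSV against the table above (fields 12, 14 and 18 for block write, rewrite and block read, field 20 for seeks) and assumes those throughputs are in KiB/s; `bonnie_summary` is a hypothetical name.

```shell
#!/bin/sh
# bonnie_summary (hypothetical helper): condense a bonnie++ CSV result line
# read on stdin into one readable line.
# Field positions are an assumption inferred from the table above:
#   $3 name, $6 file size, $12 block write, $14 rewrite, $18 block read, $20 seeks/s
bonnie_summary() {
  awk -F, '{
    printf "%s (%s): write %.0f MiB/s, rewrite %.0f MiB/s, read %.0f MiB/s, %s seeks/s\n",
      $3, $6, $12 / 1024, $14 / 1024, $18 / 1024, $20
  }'
}
```

Applied to the CSV line above, this gives 478, 200 and 445 MiB/s, which agrees with the 478m, 200m and 445m columns of the table.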

Jan 6 2021, 3:43 PM · System administration
vsellier updated the task description for T2888: Elasticsearch cluster failure during a rolling restart.
Jan 6 2021, 3:43 PM · System administration
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.
Jan 6 2021, 3:42 PM · System administration
vsellier updated the task description for T2905: Deploy swh-search for production.
Jan 6 2021, 11:06 AM · System administration, Journal, Archive search
vsellier added a comment to T2905: Deploy swh-search for production.

webapp1 is now plugged into the live production index.
Let's monitor the behavior with real searches.
First observation: the search retrieves all the documents and is not as progressive as the random search script.
The response times are longer than expected:

Jan 06 09:59:46 search1 python3[813]: 2021-01-06 09:59:46 [813] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:3.399s]
Jan 06 10:06:18 search1 python3[848]: 2021-01-06 10:06:18 [848] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:7.422s]
Jan 06 10:06:21 search1 python3[813]: 2021-01-06 10:06:21 [813] elasticsearch:INFO GET http://search-esnode3.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:5.077s]
Jan 06 10:07:32 search1 python3[813]: 2021-01-06 10:07:32 [813] elasticsearch:INFO GET http://search-esnode2.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:4.819s]
Jan 06 10:08:06 search1 python3[813]: 2021-01-06 10:08:06 [813] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:2.700s]
Jan 06 10:08:15 search1 python3[813]: 2021-01-06 10:08:15 [813] elasticsearch:INFO GET http://search-esnode3.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:2.414s]
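
To follow these response times over a longer period, the `request:` field can be aggregated. A minimal sketch, assuming the log line format shown above (`[status:... request:<seconds>s]`); `request_stats` is a hypothetical name.

```shell
#!/bin/sh
# request_stats (hypothetical helper): read elasticsearch client log lines on
# stdin and print the count, average and maximum of the "request:<n>s" times.
request_stats() {
  awk '
    match($0, /request:[0-9.]+s\]/) {
      # Skip the 8 characters of "request:" and drop the trailing "s]".
      t = substr($0, RSTART + 8, RLENGTH - 10) + 0
      n++
      sum += t
      if (t > max) max = t
    }
    END { if (n) printf "n=%d avg=%.3fs max=%.3fs\n", n, sum / n, max }
  '
}
```

For the six requests logged above, the times sum to 25.831s, so this reports an average of about 4.305s and a maximum of 7.422s.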
Jan 6 2021, 11:01 AM · System administration, Journal, Archive search
vsellier closed D4809: Plug webapp1 on the swh-search with live production data.
Jan 6 2021, 10:28 AM
vsellier committed rSPSITE57b96d45f705: Plug webapp1 on the swh-search with live production data (authored by vsellier).
Plug webapp1 on the swh-search with live production data
Jan 6 2021, 10:28 AM
vsellier added a comment to T2905: Deploy swh-search for production.

The performance looks acceptable as is for a small number of parallel searches (~10). Let's now try with real searches; it will also help to adapt the cluster configuration and validate the behavior.

Jan 6 2021, 9:59 AM · System administration, Journal, Archive search
vsellier updated the task description for T2905: Deploy swh-search for production.
Jan 6 2021, 9:56 AM · System administration, Journal, Archive search
vsellier requested review of D4809: Plug webapp1 on the swh-search with live production data.
Jan 6 2021, 9:55 AM
vsellier added a revision to T2905: Deploy swh-search for production: D4809: Plug webapp1 on the swh-search with live production data.
Jan 6 2021, 9:55 AM · System administration, Journal, Archive search
vsellier committed rSENV69103055ea85: Update octocatalog-diff facts (authored by vsellier).
Update octocatalog-diff facts
Jan 6 2021, 9:53 AM

Jan 5 2021

vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

The new disks are OK according to the SMART test:

root@esnode1:~# echo /dev/sd{b,c} | xargs -n1 smartctl -a | grep -A2 "Self-test log"
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         3         -
--
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         3         -
Jan 5 2021, 7:52 PM · System administration
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

The 2 disks were replaced :

root@esnode1:~# smartctl -a /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Jan 5 2021, 3:50 PM · System administration
vsellier added a comment to T1414: Set up an inventory app.

Can this task be closed since the subject was addressed in T2620 ?

Jan 5 2021, 2:43 PM · System administration, Sprint 2018 12
vsellier moved T2905: Deploy swh-search for production from Backlog to in-progress on the System administration board.
Jan 5 2021, 2:39 PM · System administration, Journal, Archive search
vsellier updated the task description for T2905: Deploy swh-search for production.
Jan 5 2021, 2:37 PM · System administration, Journal, Archive search
vsellier added a comment to T2905: Deploy swh-search for production.

With the new configuration, after some time without any search, the first searches take some time before stabilizing at the old values:

❯ ./random_search.sh                                                                                        12:36:37
Jan 5 2021, 2:16 PM · System administration, Journal, Archive search
vsellier added a comment to T2905: Deploy swh-search for production.

The index configuration was reset to its defaults:

cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : null,
    "translog.durability" : null,
    "refresh_interval" : null
  }
}
EOF
❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty
{
  "origin" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "60s",
        "number_of_shards" : "90",
        "translog" : {
          "sync_interval" : "60s",
          "durability" : "async"
        },
        "provided_name" : "origin",
        "creation_date" : "1608761881782",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        }
      }
    }
  }
}
❯ curl -s -H "Content-Type: application/json" -XPUT http://192.168.100.81:9200/origin/_settings\?pretty -d @/tmp/config.json
{
  "acknowledged" : true
}
❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty
{
  "origin" : {
    "settings" : {
      "index" : {
        "creation_date" : "1608761881782",
        "number_of_shards" : "90",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        },
        "provided_name" : "origin"
      }
    }
  }
}

A *simple* search doesn't look impacted (this is not a real benchmark):

❯ ./random_search.sh
Jan 5 2021, 9:47 AM · System administration, Journal, Archive search

Jan 4 2021

vsellier accepted D4802: Decomission webapp0 node.

LGTM

Jan 4 2021, 2:09 PM
vsellier closed T2682: Deploy a small publicly available kafka server (with some content) on a staging (+ the related objstorage) as Resolved.

Closing this task as all the direct work is done.
The documentation will be addressed in T2920

Jan 4 2021, 12:33 PM · Staging environment, System administration
vsellier triaged T2920: Document staging infrastructure as Normal priority.
Jan 4 2021, 12:32 PM · Documentation, System administration, Staging environment
vsellier added a comment to T2905: Deploy swh-search for production.

The backfill was done in a couple of days.

Jan 4 2021, 9:41 AM · System administration, Journal, Archive search
vsellier created P913 random es_search with word typing simulation.
Jan 4 2021, 9:40 AM · Archive search

Dec 23 2020

vsellier added a comment to T2905: Deploy swh-search for production.

search1.internal.softwareheritage.org vm deployed.
The configuration of the index was automatically performed by puppet during the initial provisioning.

Dec 23 2020, 11:34 PM · System administration, Journal, Archive search
vsellier committed rSPRE09ed19b27ecb: Declare production node search1 (authored by vsellier).
Declare production node search1
Dec 23 2020, 11:25 PM
vsellier committed rSPSITEadd22e6d09bf: swh-search: configure search1 backend and journal client (authored by vsellier).
swh-search: configure search1 backend and journal client
Dec 23 2020, 10:46 PM
vsellier committed rSPREd4b2f147188f: Declare search-node[1-3] (authored by vsellier).
Declare search-node[1-3]
Dec 23 2020, 10:05 PM
vsellier committed rSENV0dd616846051: Declare search-esnode[1-3] nodes (authored by vsellier).
Declare search-esnode[1-3] nodes
Dec 23 2020, 10:05 PM
vsellier added a comment to T2905: Deploy swh-search for production.

Index template created in elasticsearch with 1 replica and 90 shards to have the same number of shards on each node:

export ES_SERVER=192.168.100.81:9200
curl -XPUT -H "Content-Type: application/json" http://$ES_SERVER/_index_template/origin\?pretty -d '{"index_patterns": "origin", "template": {"settings": { "index": { "number_of_replicas":1, "number_of_shards": 90 } } } } '
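
The choice of 90 shards can be checked with a little arithmetic: every primary has one replica, so the cluster holds 180 shard copies, which divide evenly across the 3 search-esnode machines. A sketch of that check, assuming the cluster spreads the copies evenly; `shards_per_node` is a hypothetical name.

```shell
#!/bin/sh
# shards_per_node (hypothetical helper): given the number of primary shards,
# the number of replicas per primary and the number of nodes, print the total
# number of shard copies and how they divide across the nodes.
shards_per_node() {
  primaries=$1
  replicas=$2
  nodes=$3
  total=$(( primaries * (1 + replicas) ))
  echo "total=$total per_node=$(( total / nodes )) remainder=$(( total % nodes ))"
}
```

For the values used here, `shards_per_node 90 1 3` prints `total=180 per_node=60 remainder=0`.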
Dec 23 2020, 8:55 PM · System administration, Journal, Archive search
vsellier added a comment to T2905: Deploy swh-search for production.

search-esnode[1-3] installed with zfs configured :

apt update && apt install linux-image-amd64 linux-headers-amd64 
# reboot to upgrade the kernel
apt install libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
systemctl stop elasticsearch
rm -rf /srv/elasticsearch/nodes/0
zpool create -O atime=off -m /srv/elasticsearch/nodes elasticsearch-data /dev/vdb
chown elasticsearch: /srv/elasticsearch/nodes
Dec 23 2020, 8:48 PM · System administration, Journal, Archive search
vsellier committed rSPSITEeb61c65f918a: swh-search/elasticsearch: use a memory configuration matching the vm size (authored by vsellier).
swh-search/elasticsearch: use a memory configuration matching the vm size
Dec 23 2020, 8:27 PM
vsellier committed rSPSITE839ed57e32c5: swh-search: Use the right seed hosts (authored by vsellier).
swh-search: Use the right seed hosts
Dec 23 2020, 8:23 PM
vsellier committed rSPSITEfa8080844c2c: swh-search: Deploy an elasticsearch cluster for productin (authored by vsellier).
swh-search: Deploy an elasticsearch cluster for productin
Dec 23 2020, 7:01 PM
vsellier added a comment to T2905: Deploy swh-search for production.

The inventory was updated to reserve the elasticsearch VMs:

  • search-esnode[1-3].internal.softwareheritage.org
  • ips : 192.168.100.8[1-3]/24
Dec 23 2020, 6:20 PM · System administration, Journal, Archive search
vsellier committed rSPSITE68ba4af7f1c7: webapp1: declare it on the vagrant environment (authored by vsellier).
webapp1: declare it on the vagrant environment
Dec 23 2020, 6:10 PM
vsellier changed the status of T2905: Deploy swh-search for production, a subtask of T2904: Create a new production webapp using the frozen index on the staging ES, from Open to Work in Progress.
Dec 23 2020, 5:53 PM · System administrators, Journal, Archive search
vsellier changed the status of T2905: Deploy swh-search for production from Open to Work in Progress.
Dec 23 2020, 5:53 PM · System administration, Journal, Archive search
vsellier committed rSPREd14c2628127e: Declare webapp1.internal.softwareheritage.org (authored by vsellier).
Declare webapp1.internal.softwareheritage.org
Dec 23 2020, 5:49 PM
vsellier closed T2904: Create a new production webapp using the frozen index on the staging ES, a subtask of T2590: Finish the indexer -> swh-search pipeline, as Resolved.
Dec 23 2020, 5:42 PM · Journal, Archive search
vsellier closed T2904: Create a new production webapp using the frozen index on the staging ES as Resolved.
Dec 23 2020, 5:42 PM · System administrators, Journal, Archive search
vsellier added a comment to T2904: Create a new production webapp using the frozen index on the staging ES.

The webapp is available at https://webapp1.internal.softwareheritage.org

Dec 23 2020, 5:42 PM · System administrators, Journal, Archive search
vsellier committed rSPSITEc33c839bd7d3: webapp1: fix storage configuration (authored by vsellier).
webapp1: fix storage configuration
Dec 23 2020, 5:32 PM
vsellier committed rSENV181f6cdeb427: Add a new VM to test webapp1 (authored by vsellier).
Add a new VM to test webapp1
Dec 23 2020, 4:52 PM
vsellier committed rSPSITEfcc7b923acf3: Add a new webapp to test swh-search (authored by vsellier).
Add a new webapp to test swh-search
Dec 23 2020, 4:46 PM
vsellier added a comment to T2904: Create a new production webapp using the frozen index on the staging ES.

In preparation for the deployment, the production index present on the staging elasticsearch was renamed from origin-production2 to production_origin (a clone operation will be used [1]; the original index will be left in place)
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-clone-index.html

Dec 23 2020, 4:18 PM · System administrators, Journal, Archive search
vsellier committed rDSEA5560ba0ad621: Allow to prefix the index name(s) (authored by vsellier).
Allow to prefix the index name(s)
Dec 23 2020, 3:55 PM
vsellier closed D4785: Allow to configure the index to use on elasticsearch.
Dec 23 2020, 3:55 PM
vsellier updated the diff for D4785: Allow to configure the index to use on elasticsearch.

Remove useless fixture declaration

Dec 23 2020, 3:25 PM
vsellier added a comment to D4785: Allow to configure the index to use on elasticsearch.

Thanks, I'll change that.

Dec 23 2020, 3:15 PM
vsellier updated the diff for D4785: Allow to configure the index to use on elasticsearch.

Use a prefix instead of changing the index name.
Make it optional to avoid having to rename the index on the instances already deployed.

Dec 23 2020, 3:04 PM
vsellier added a revision to T2904: Create a new production webapp using the frozen index on the staging ES: D4785: Allow to configure the index to use on elasticsearch.
Dec 23 2020, 1:03 PM · System administrators, Journal, Archive search
vsellier created D4785: Allow to configure the index to use on elasticsearch.
Dec 23 2020, 1:03 PM
vsellier changed the status of T2904: Create a new production webapp using the frozen index on the staging ES, a subtask of T2590: Finish the indexer -> swh-search pipeline, from Open to Work in Progress.
Dec 23 2020, 10:04 AM · Journal, Archive search
vsellier changed the status of T2904: Create a new production webapp using the frozen index on the staging ES from Open to Work in Progress.
Dec 23 2020, 10:04 AM · System administrators, Journal, Archive search
vsellier added a comment to T2903: Test different disk configuration on esnode1.

The shard reallocation is still in progress:

~ ❯ curl -s http://esnode3.internal.softwareheritage.org:9200/_cat/shards\?h\=prirep,node | sort | uniq -c                                                                                                09:40:21
   1216 p esnode1
   1183 p esnode2
      1 p esnode2 -> 192.168.100.61 t4iSb7f1RZmEwpH4O_OoGw esnode1
   1840 p esnode3
      1 p esnode3 -> 192.168.100.61 t4iSb7f1RZmEwpH4O_OoGw esnode1
   1208 r esnode1
   1845 r esnode2
   1188 r esnode3

p: primary shard
r: replica shard

Dec 23 2020, 9:47 AM · System administration

Dec 22 2020

vsellier accepted D4783: Define the facts deployment/subnet per deployment types.
Dec 22 2020, 7:16 PM
vsellier accepted D4782: Add subnet for sesi_rocquencourt_admin.
Dec 22 2020, 7:08 PM
vsellier added a comment to T2903: Test different disk configuration on esnode1.

atime was enabled by default. I switched to relatime:

root@esnode1:~# zfs get all  | grep time
elasticsearch-data  atime                 on                        default
elasticsearch-data  relatime              off                       default
Dec 22 2020, 6:01 PM · System administration
vsellier committed rSPSITE609c9833d0d4: Use an inria email to bypass the gandi antispam (authored by vsellier).
Use an inria email to bypass the gandi antispam
Dec 22 2020, 5:52 PM
vsellier committed rSPSITEd8ff8797eac0: Allow admin network to query internal dns (authored by vsellier).
Allow admin network to query internal dns
Dec 22 2020, 5:19 PM
vsellier committed rSPSITEd002388b94bb: esnode1: add necessary packages for zfs (authored by vsellier).
esnode1: add necessary packages for zfs
Dec 22 2020, 4:17 PM
vsellier added a comment to T2903: Test different disk configuration on esnode1.
  • puppet executed
  • esnode1 is back in the cluster but is still not selected to receive shards because of a configuration rule:
~ ❯ curl -s http://esnode3.internal.softwareheritage.org:9200/_cat/nodes\?v; echo; curl -s http://esnode3.internal.softwareheritage.org:9200/_cat/health\?v                                               16:02:37
ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.100.61            3          57   0    0.35    0.25     0.12 dilmrt    -      esnode1
192.168.100.63           35          97   1    0.68    0.65     0.70 dilmrt    *      esnode3
192.168.100.62           35          96   2    0.66    0.75     0.82 dilmrt    -      esnode2
Dec 22 2020, 4:14 PM · System administration
vsellier added a comment to T2903: Test different disk configuration on esnode1.

As puppet can't be restarted yet (to prevent elasticsearch from restarting before zfs is configured), zfs was installed manually:

Dec 22 2020, 3:57 PM · System administration
vsellier added a comment to T2903: Test different disk configuration on esnode1.

Replicate disk sda partitioning on all disks

root@esnode1:~# sfdisk -l /dev/sda
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: HGST HUS726020AL
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 543964DA-9ECA-4222-952D-BA8A90FAB2B9
Dec 22 2020, 2:54 PM · System administration
vsellier added a comment to T2903: Test different disk configuration on esnode1.

old raid cleanup

root@esnode1:~# umount /srv/elasticsearch 
root@esnode1:~# diff -U3 /tmp/fstab /etc/fstab
--- /tmp/fstab	2020-12-22 11:37:17.318967701 +0000
+++ /etc/fstab	2020-12-22 11:37:28.687049499 +0000
@@ -11,5 +11,3 @@
 UUID=AE23-D5B8  /boot/efi       vfat    umask=0077      0       1
 # swap was on /dev/sda3 during installation
 UUID=3eaaa22d-e1d2-4dde-9a45-d2fa22696cdf none            swap    sw              0       0
-UUID=6adb1e63-e709-4efb-8be1-76818b1b4751 /srv/kafka	ext4	errors=remount-ro	0 0
-/dev/md127	/srv/elasticsearch	xfs	defaults,noatime	0 0
Dec 22 2020, 2:52 PM · System administration
vsellier accepted D4768: production: Add new hedgedoc instance.
Dec 22 2020, 12:22 PM
vsellier changed the status of T2903: Test different disk configuration on esnode1, a subtask of T2888: Elasticsearch cluster failure during a rolling restart, from Open to Work in Progress.
Dec 22 2020, 12:08 PM · System administration
vsellier changed the status of T2903: Test different disk configuration on esnode1 from Open to Work in Progress.
Dec 22 2020, 12:08 PM · System administration
vsellier added a comment to T2908: sentry does not log tracebacks in the swh-deposit server.

The fix is deployed in staging and production

Dec 22 2020, 12:02 PM · System administration, SWORD deposit
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

The disks can't be replaced before the beginning of January because the logistics service is closed.
Dell was notified about the delay for the disk replacement. The next package retrieval attempt by UPS is scheduled for *2021-01-11*.

Dec 22 2020, 11:51 AM · System administration
vsellier accepted D4772: deposit: initialize Sentry when gunicorn starts.

Tested in staging with a manual change in the code to force an assertion; it works well.

Dec 22 2020, 11:26 AM
vsellier added a comment to T2682: Deploy a small publicly available kafka server (with some content) on a staging (+ the related objstorage).

Everything looks good, let's try to add some documentation before closing the issue

Dec 22 2020, 9:56 AM · Staging environment, System administration
vsellier updated the task description for T2682: Deploy a small publicly available kafka server (with some content) on a staging (+ the related objstorage).
Dec 22 2020, 9:54 AM · Staging environment, System administration
vsellier accepted D4768: production: Add new hedgedoc instance.

Tested locally, it looks good. I just added a small comment about the installation directory, which is usually in /opt rather than in the user's home directory.

Dec 22 2020, 9:53 AM

Dec 21 2020

vsellier added a comment to T2682: Deploy a small publicly available kafka server (with some content) on a staging (+ the related objstorage).
  • A new VM, objstorage0.internal.staging.swh.network, is configured with a read-only object storage service
  • It's exposed to the internet via the reverse proxy at https://objstorage.staging.swh.network (this is quite different from the usual objstorage:5003 URL, but it allows exposing the service without new network configuration)
  • DNS entry added on gandi
  • Inventory updated
Dec 21 2020, 7:32 PM · Staging environment, System administration
vsellier closed T2910: Sentry: Increase disk space, a subtask of T2899: Sentry doesn't react to new errors, as Resolved.
Dec 21 2020, 7:05 PM · Sentry, System administration
vsellier closed T2910: Sentry: Increase disk space as Resolved.
Dec 21 2020, 7:05 PM · Sentry, System administration