
Deploy swh-search for production
Closed, Migrated

Description

Deploy the swh-search stack for production. This index will be used by the webapp deployed on T2904

It needs:

  • a new ES cluster hosted on the Proxmox node
  • a new node with the swh-search backend and the swh-search journal client
  • a new indexation of all the production origins
  • basic performance tests
  • wiring webapp1 to this cluster


Event Timeline

vsellier triaged this task as Normal priority. Dec 21 2020, 10:03 AM
vsellier created this task.
vsellier changed the task status from Open to Work in Progress. Dec 23 2020, 5:53 PM

The inventory was updated to reserve the Elasticsearch VMs:

  • search-esnode[1-3].internal.softwareheritage.org
  • IPs: 192.168.100.8[1-3]/24

search-esnode[1-3] were installed with ZFS configured:

# upgrade the kernel so the ZFS DKMS module can build against the current headers
apt update && apt install linux-image-amd64 linux-headers-amd64
# reboot to upgrade the kernel
apt install libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
# move the Elasticsearch data directory onto a dedicated ZFS pool
systemctl stop elasticsearch
rm -rf /srv/elasticsearch/nodes/0
zpool create -O atime=off -m /srv/elasticsearch/nodes elasticsearch-data /dev/vdb
chown elasticsearch: /srv/elasticsearch/nodes
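
A couple of sanity checks (a sketch, not commands recorded here) can confirm the pool is healthy and mounted where Elasticsearch expects its data before restarting the service:

# pool health, and confirmation that atime is off and the mountpoint is right
zpool status elasticsearch-data
zfs get atime,mountpoint elasticsearch-data
# restart Elasticsearch now that its data directory lives on ZFS
systemctl start elasticsearch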

An index template was created in Elasticsearch with 1 replica and 90 shards, so that each node holds the same number of shards:

export ES_SERVER=192.168.100.81:9200
curl -XPUT -H "Content-Type: application/json" http://$ES_SERVER/_index_template/origin\?pretty -d '{"index_patterns": "origin", "template": {"settings": { "index": { "number_of_replicas":1, "number_of_shards": 90 } } } } '
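
To double-check that the template was registered, it can be read back with the same API (a quick verification, not recorded in the task):

curl -s http://$ES_SERVER/_index_template/origin\?pretty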

The search1.internal.softwareheritage.org VM was deployed.
The configuration of the index was automatically performed by Puppet during the initial provisioning.

The "backfill" of the elasticsearch index is in progress by the journal client deployed on the same node.
It's a naive import with a single thread. The duration is estimated to less than 1 week in the grafana dashboard but it progressively increase [1]

~ ❯ curl -s http://192.168.100.82:9200/_cat/indices\?v                                                                                                                                                    23:32:33
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin Mq8dnlpuRXO4yYoC6CTuQw  90   1     758903            0    726.8mb        651.6mb

[1] https://grafana.softwareheritage.org/d/rmJHuD9Wk/kafka-estimated-time-to-completion?orgId=1&var-group=swh.search.journal_client&from=1608760721429&to=1608762665803
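
Besides the Grafana dashboard, the remaining lag of the journal client can also be checked directly against Kafka; a sketch, with the broker address left as a placeholder:

kafka-consumer-groups.sh \
  --bootstrap-server <kafka-broker>:9092 \
  --describe \
  --group swh.search.journal_client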

Once the backfill is finished, we will be able to switch webapp1 to this Elasticsearch cluster.

The backfill was done in a couple of days.


Source: https://grafana.softwareheritage.org/goto/O37Bgv-Mz
The long stable curve is for the origin messages.
The big peak is for the origin_visit messages.

In the middle of the graph, the ES cluster was reconfigured to improve ingestion:

  • 32 GB of memory per VM
  • 16 GB of heap for Elasticsearch
  • index tuning to improve indexing (refresh_interval and translog configuration)
❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty                                                                                      
{
  "origin" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "60s",
        "number_of_shards" : "90",
        "translog" : {
          "sync_interval" : "60s",
          "durability" : "async"
        },
        "provided_name" : "origin",
        "creation_date" : "1608761881782",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        }
      }
    }
  }
}
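
For reference, such values can be set on the live index through the dynamic index settings API; a sketch of the kind of call used (the exact command is not recorded here):

curl -s -H "Content-Type: application/json" -XPUT http://192.168.100.81:9200/origin/_settings\?pretty -d '{ "index": { "refresh_interval": "60s", "translog.sync_interval": "60s", "translog.durability": "async" } }'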

The search looks quite responsive, with response times under a second (P913):

~/wip 8s ❯ ./random_search.sh                                                                                                                                                                             09:37:40

Searching for 1/5 pinpointing
Searching for p : took=364ms nb_results=10000
Searching for pi : took=217ms nb_results=10000
Searching for pin : took=168ms nb_results=10000
Searching for pinp : took=238ms nb_results=6857
Searching for pinpo : took=375ms nb_results=5163
Searching for pinpoi : took=88ms nb_results=4862
Searching for pinpoin : took=79ms nb_results=4854
Searching for pinpoint : took=97ms nb_results=4848
Searching for pinpointi : took=35ms nb_results=8
Searching for pinpointin : took=18ms nb_results=6
Searching for pinpointing : took=118ms nb_results=6

Searching for 2/5 Szechuan
Searching for S : took=305ms nb_results=10000
Searching for Sz : took=260ms nb_results=10000
Searching for Sze : took=226ms nb_results=10000
Searching for Szec : took=89ms nb_results=517
Searching for Szech : took=185ms nb_results=312
Searching for Szechu : took=47ms nb_results=108
Searching for Szechua : took=21ms nb_results=85
Searching for Szechuan : took=27ms nb_results=85

Searching for 3/5 indecent
Searching for i : took=465ms nb_results=10000
Searching for in : took=351ms nb_results=10000
Searching for ind : took=202ms nb_results=10000
Searching for inde : took=251ms nb_results=10000
Searching for indec : took=422ms nb_results=3415
Searching for indece : took=75ms nb_results=83
Searching for indecen : took=63ms nb_results=61
Searching for indecent : took=22ms nb_results=59

Searching for 4/5 sparsity
Searching for s : took=263ms nb_results=10000
Searching for sp : took=242ms nb_results=10000
Searching for spa : took=197ms nb_results=10000
Searching for spar : took=133ms nb_results=10000
Searching for spars : took=245ms nb_results=10000
Searching for sparsi : took=61ms nb_results=518
Searching for sparsit : took=39ms nb_results=221
Searching for sparsity : took=38ms nb_results=221

Searching for 5/5 Foucault
Searching for F : took=573ms nb_results=10000
Searching for Fo : took=278ms nb_results=10000
Searching for Fou : took=296ms nb_results=10000
Searching for Fouc : took=113ms nb_results=693
Searching for Fouca : took=55ms nb_results=188
Searching for Foucau : took=56ms nb_results=144
Searching for Foucaul : took=28ms nb_results=140
Searching for Foucault : took=35ms nb_results=133
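
For reference, a minimal sketch of what a script like random_search.sh could do, assuming it queries the Elasticsearch index directly with a prefix query on the url field (the real script is in P913 and may differ):

#!/usr/bin/env bash
# For each prefix of a word, run a search and report the ES timing and hit count.
ES=${ES_SERVER:-192.168.100.81:9200}
word=${1:-pinpointing}
for ((i=1; i<=${#word}; i++)); do
  prefix=${word:0:i}
  resp=$(curl -s -H 'Content-Type: application/json' \
    "http://$ES/origin/_search?size=0" \
    -d "{\"query\": {\"match_phrase_prefix\": {\"url\": \"$prefix\"}}}")
  echo "Searching for $prefix : took=$(jq .took <<<"$resp")ms nb_results=$(jq .hits.total.value <<<"$resp")"
done

The nb_results values capped at 10000 correspond to Elasticsearch's default track_total_hits limit.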

The cluster will be reconfigured with the nominal parameters to check the search performance.

The index configuration was reset to its defaults:

cat >/tmp/config.json <<EOF
{
  "index" : {
"translog.sync_interval" : null,
"translog.durability": null,
"refresh_interval": null
  }
}
EOF
❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty
{
  "origin" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "60s",
        "number_of_shards" : "90",
        "translog" : {
          "sync_interval" : "60s",
          "durability" : "async"
        },
        "provided_name" : "origin",
        "creation_date" : "1608761881782",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        }
      }
    }
  }
}
❯ curl -s -H "Content-Type: application/json" -XPUT http://192.168.100.81:9200/origin/_settings\?pretty -d @/tmp/config.json
{
  "acknowledged" : true
}
❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty
{
  "origin" : {
    "settings" : {
      "index" : {
        "creation_date" : "1608761881782",
        "number_of_shards" : "90",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        },
        "provided_name" : "origin"
      }
    }
  }
}

A *simple* search doesn't look impacted (this is not a real benchmark):

❯ ./random_search.sh

Searching for 1/5 headsets
Searching for h : took=868ms nb_results=10000
Searching for he : took=358ms nb_results=10000
Searching for hea : took=216ms nb_results=10000
Searching for head : took=173ms nb_results=10000
Searching for heads : took=256ms nb_results=5434
Searching for headse : took=101ms nb_results=632
Searching for headset : took=37ms nb_results=612
Searching for headsets : took=38ms nb_results=28

Searching for 2/5 sullenness's
Searching for s : took=283ms nb_results=10000
Searching for su : took=323ms nb_results=10000
Searching for sul : took=206ms nb_results=10000
Searching for sull : took=231ms nb_results=7768
Searching for sulle : took=163ms nb_results=943
Searching for sullen : took=55ms nb_results=348
Searching for sullenn : took=47ms nb_results=1
Searching for sullenne : took=13ms nb_results=0
Searching for sullennes : took=36ms nb_results=0
Searching for sullenness : took=22ms nb_results=0
Searching for sullenness' : took=69ms nb_results=0
Searching for sullenness's : took=61ms nb_results=0

Searching for 3/5 referrals
Searching for r : took=455ms nb_results=10000
Searching for re : took=300ms nb_results=10000
Searching for ref : took=315ms nb_results=10000
Searching for refe : took=187ms nb_results=10000
Searching for refer : took=197ms nb_results=10000
Searching for referr : took=205ms nb_results=3152
Searching for referra : took=283ms nb_results=1869
Searching for referral : took=118ms nb_results=1859
Searching for referrals : took=37ms nb_results=354

Searching for 4/5 zealous
Searching for z : took=345ms nb_results=10000
Searching for ze : took=271ms nb_results=10000
Searching for zea : took=228ms nb_results=10000
Searching for zeal : took=182ms nb_results=8500
Searching for zealo : took=212ms nb_results=2790
Searching for zealou : took=316ms nb_results=1050
Searching for zealous : took=72ms nb_results=1047

Searching for 5/5 cousins
Searching for c : took=993ms nb_results=10000
Searching for co : took=897ms nb_results=10000
Searching for cou : took=343ms nb_results=10000
Searching for cous : took=189ms nb_results=8907
Searching for cousi : took=225ms nb_results=1254
Searching for cousin : took=89ms nb_results=1209
Searching for cousins : took=63ms nb_results=362

In the new configuration, after some time without searches, the first ones take a while before stabilizing back to the previous values:

❯ ./random_search.sh                                                                                        12:36:37

Searching for 1/5 castoffs
Searching for c : took=10302ms nb_results=10000
Searching for ca : took=4566ms nb_results=10000
Searching for cas : took=1119ms nb_results=10000
Searching for cast : took=1885ms nb_results=10000
Searching for casto : took=798ms nb_results=2483
Searching for castof : took=499ms nb_results=37
Searching for castoff : took=703ms nb_results=23
Searching for castoffs : took=17ms nb_results=0

Searching for 2/5 nonstandard
Searching for n : took=1506ms nb_results=10000
Searching for no : took=664ms nb_results=10000
Searching for non : took=525ms nb_results=10000
Searching for nons : took=489ms nb_results=4711
Searching for nonst : took=193ms nb_results=1246
Searching for nonsta : took=78ms nb_results=220
Searching for nonstan : took=93ms nb_results=72
Searching for nonstand : took=87ms nb_results=70
Searching for nonstanda : took=20ms nb_results=69
Searching for nonstandar : took=54ms nb_results=69
Searching for nonstandard : took=20ms nb_results=64

Searching for 3/5 engraver
Searching for e : took=324ms nb_results=10000
Searching for en : took=234ms nb_results=10000
Searching for eng : took=314ms nb_results=10000
Searching for engr : took=354ms nb_results=6780
Searching for engra : took=182ms nb_results=1453
Searching for engrav : took=201ms nb_results=551
Searching for engrave : took=40ms nb_results=341
Searching for engraver : took=67ms nb_results=174

Searching for 4/5 predilections
Searching for p : took=556ms nb_results=10000
Searching for pr : took=533ms nb_results=10000
Searching for pre : took=391ms nb_results=10000
Searching for pred : took=346ms nb_results=10000
Searching for predi : took=375ms nb_results=10000
Searching for predil : took=27ms nb_results=29
Searching for predile : took=54ms nb_results=12
Searching for predilec : took=49ms nb_results=9
Searching for predilect : took=69ms nb_results=9
Searching for predilecti : took=91ms nb_results=4
Searching for predilectio : took=15ms nb_results=4
Searching for predilection : took=9ms nb_results=4
Searching for predilections : took=10ms nb_results=1

Searching for 5/5 sprinkling
Searching for s : took=486ms nb_results=10000
Searching for sp : took=557ms nb_results=10000
Searching for spr : took=508ms nb_results=10000
Searching for spri : took=404ms nb_results=10000
Searching for sprin : took=118ms nb_results=10000
Searching for sprink : took=322ms nb_results=2480
Searching for sprinkl : took=127ms nb_results=2249
Searching for sprinkli : took=53ms nb_results=8
Searching for sprinklin : took=38ms nb_results=8
Searching for sprinkling : took=15ms nb_results=7

The memory of the servers will be set to 24 GB to check whether giving more system memory to the ZFS/file cache improves the situation.
The CPU doesn't look like the bottleneck for the moment: the thread pools are not queuing anything with a single search (see the check below). We will probably have to add more CPUs if we want to support a lot of parallel searches, but we'll see that when needed.
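
The thread pool state can be checked at any time with the cat thread pool API, for instance:

curl -s "http://192.168.100.81:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed"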

The performance looks acceptable as-is for a small number of parallel searches (~10). Let's now try with real searches; it will also help adapt the cluster configuration and validate the behavior.

webapp1 is now plugged into the live production index.
Let's monitor the behavior with real searches.
First observation: the search retrieves all the documents and is not as progressive as the random search script.
The response times are longer than expected:

Jan 06 09:59:46 search1 python3[813]: 2021-01-06 09:59:46 [813] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:3.399s]
Jan 06 10:06:18 search1 python3[848]: 2021-01-06 10:06:18 [848] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:7.422s]
Jan 06 10:06:21 search1 python3[813]: 2021-01-06 10:06:21 [813] elasticsearch:INFO GET http://search-esnode3.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:5.077s]
Jan 06 10:07:32 search1 python3[813]: 2021-01-06 10:07:32 [813] elasticsearch:INFO GET http://search-esnode2.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:4.819s]
Jan 06 10:08:06 search1 python3[813]: 2021-01-06 10:08:06 [813] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:2.700s]
Jan 06 10:08:15 search1 python3[813]: 2021-01-06 10:08:15 [813] elasticsearch:INFO GET http://search-esnode3.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:2.414s]
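
One way to investigate these slow requests further (a suggestion, not something done in this task) would be to enable the search slow log on the index:

curl -s -H "Content-Type: application/json" -XPUT http://192.168.100.81:9200/origin/_settings\?pretty -d '{ "index.search.slowlog.threshold.query.warn": "2s", "index.search.slowlog.threshold.query.info": "1s" }'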

@vlorentz I was checking some differences between swh-search and the current search. Does the journal client have to listen to the origin_visit topic? It seems that `origin_visit_status` should be enough to match the behavior of the search in the webapp.

For example, consider this repo, which was recently but only partially indexed:

softwareheritage=> select * from origin_visit_status join origin on origin=id where url='https://github.com/kantacky/DoReMi';
  origin   | visit |             date              | status  | metadata | snapshot |    id     |                url                 
-----------+-------+-------------------------------+---------+----------+----------+-----------+------------------------------------
 150810263 |     1 | 2021-01-07 08:57:45.856561+00 | created |          |          | 150810263 | https://github.com/kantacky/DoReMi
 150810263 |     1 | 2021-01-07 08:57:46.280359+00 | partial |          |          | 150810263 | https://github.com/kantacky/DoReMi
(2 rows)

softwareheritage=> select * from origin_visit join origin on origin=id where url='https://github.com/kantacky/DoReMi';
  origin   | visit |             date              | type |    id     |                url                 
-----------+-------+-------------------------------+------+-----------+------------------------------------
 150810263 |     1 | 2021-01-07 08:57:45.856561+00 | git  | 150810263 | https://github.com/kantacky/DoReMi
(1 row)

The has_visit flag is set to true via the origin_visit messages: https://forge.softwareheritage.org/source/swh-search/browse/master/swh/search/journal_client.py$43
but not via the origin_visit_status messages: https://forge.softwareheritage.org/source/swh-search/browse/master/swh/search/journal_client.py$61

In the old legacy search, the repo is not found whether or not 'filter out origins with no archived content' is checked; in the new search, it is found in both cases.

I'm not sure about the "functional" meaning of the has_visit flag and how it is interpreted by the 'filter out origins with no archived content' checkbox.

Yes indeed. swh-search was written before we had origin visit statuses, and I forgot to update it.

Thanks :)

It depends on what will be implemented in T2936, but a new reindex will probably have to be done to fix the search. It will be an opportunity to think about how to do it without taking down the search entirely.

how to do it without taking down the search entirely

What do you mean? We'll just need to update the journal client and restart it.

Oh right, they were wrongly set to True. I guess we can write a small script to set them all to False before we re-consume the statuses.
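
Such a script could boil down to a single update-by-query call on the index; a sketch, assuming the flag is stored as has_visit on the origin documents:

curl -s -H "Content-Type: application/json" -XPOST "http://192.168.100.81:9200/origin/_update_by_query?conflicts=proceed&pretty" -d '{ "script": { "source": "ctx._source.has_visit = false", "lang": "painless" }, "query": { "term": { "has_visit": true } } }'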

I'm closing this issue as there are no more actions to perform at the moment.
Diagnosis and eventual fixes will be followed up in dedicated issues.