
Deploy swh-search for production
Closed, Migrated

Description

Deploy the swh-search stack for production. This index will be used by the webapp deployed on T2904

It needs:

  • a new ES cluster hosted on the Proxmox node
  • a new node with the swh-search backend and the swh-search journal client
  • a new indexation of all the production origins
  • basic performance tests
  • wiring webapp1 to this cluster


Event Timeline

vsellier triaged this task as Normal priority. Dec 21 2020, 10:03 AM
vsellier created this task.
vsellier changed the task status from Open to Work in Progress. Dec 23 2020, 5:53 PM

The inventory was updated to reserve the Elasticsearch VMs:

  • search-esnode[1-3].internal.softwareheritage.org
  • IPs: 192.168.100.8[1-3]/24

search-esnode[1-3] were installed with ZFS configured:

# upgrade the kernel so the ZFS DKMS module can build against the current headers
apt update && apt install linux-image-amd64 linux-headers-amd64
# reboot to upgrade the kernel
apt install libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
# move the Elasticsearch data directory onto a dedicated ZFS pool
systemctl stop elasticsearch
rm -rf /srv/elasticsearch/nodes/0
zpool create -O atime=off -m /srv/elasticsearch/nodes elasticsearch-data /dev/vdb
chown elasticsearch: /srv/elasticsearch/nodes
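
A couple of sanity checks (a sketch, not commands recorded here) can confirm the pool is healthy and mounted where Elasticsearch expects its data before restarting the service:

# pool health, and confirmation that atime is off and the mountpoint is right
zpool status elasticsearch-data
zfs get atime,mountpoint elasticsearch-data
# restart Elasticsearch now that its data directory lives on ZFS
systemctl start elasticsearch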

An index template was created in Elasticsearch with 1 replica and 90 shards, so that each node holds the same number of shards:

export ES_SERVER=192.168.100.81:9200
curl -XPUT -H "Content-Type: application/json" http://$ES_SERVER/_index_template/origin\?pretty -d '{"index_patterns": "origin", "template": {"settings": { "index": { "number_of_replicas":1, "number_of_shards": 90 } } } } '
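
To double-check that the template was registered, it can be read back with the same API (a quick verification, not recorded in the task):

curl -s http://$ES_SERVER/_index_template/origin\?pretty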

The search1.internal.softwareheritage.org VM was deployed.
The configuration of the index was automatically performed by Puppet during the initial provisioning.

The "backfill" of the elasticsearch index is in progress by the journal client deployed on the same node.
It's a naive import with a single thread. The duration is estimated to less than 1 week in the grafana dashboard but it progressively increase [1]

~ ❯ curl -s http://192.168.100.82:9200/_cat/indices\?v                                                                                                                                                    23:32:33
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin Mq8dnlpuRXO4yYoC6CTuQw  90   1     758903            0    726.8mb        651.6mb

[1] https://grafana.softwareheritage.org/d/rmJHuD9Wk/kafka-estimated-time-to-completion?orgId=1&var-group=swh.search.journal_client&from=1608760721429&to=1608762665803
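
Besides the Grafana dashboard, the remaining lag of the journal client can also be checked directly against Kafka; a sketch, with the broker address left as a placeholder:

kafka-consumer-groups.sh \
  --bootstrap-server <kafka-broker>:9092 \
  --describe \
  --group swh.search.journal_client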

Once the backfill is finished, we will be able to switch webapp1 to this Elasticsearch cluster.

The backfill was done in a couple of days.


Source: https://grafana.softwareheritage.org/goto/O37Bgv-Mz
The long stable curve is for the origin messages.
The big peak is for the origin_visit messages.

In the middle of the graph, the ES cluster was reconfigured to improve ingestion:

  • 32 GB of memory per VM
  • 16 GB of heap for Elasticsearch
  • index tuning to improve indexing (refresh_interval and translog configuration)
❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty                                                                                      
{
  "origin" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "60s",
        "number_of_shards" : "90",
        "translog" : {
          "sync_interval" : "60s",
          "durability" : "async"
        },
        "provided_name" : "origin",
        "creation_date" : "1608761881782",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        }
      }
    }
  }
}
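
For reference, such values can be set on the live index through the dynamic index settings API; a sketch of the kind of call used (the exact command is not recorded here):

curl -s -H "Content-Type: application/json" -XPUT http://192.168.100.81:9200/origin/_settings\?pretty -d '{ "index": { "refresh_interval": "60s", "translog.sync_interval": "60s", "translog.durability": "async" } }'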

The search looks quite responsive, with response times under a second (P913):

~/wip 8s ❯ ./random_search.sh                                                                                                                                                                             09:37:40

Searching for 1/5 pinpointing
Searching for p : took=364ms nb_results=10000
Searching for pi : took=217ms nb_results=10000
Searching for pin : took=168ms nb_results=10000
Searching for pinp : took=238ms nb_results=6857
Searching for pinpo : took=375ms nb_results=5163
Searching for pinpoi : took=88ms nb_results=4862
Searching for pinpoin : took=79ms nb_results=4854
Searching for pinpoint : took=97ms nb_results=4848
Searching for pinpointi : took=35ms nb_results=8
Searching for pinpointin : took=18ms nb_results=6
Searching for pinpointing : took=118ms nb_results=6

Searching for 2/5 Szechuan
Searching for S : took=305ms nb_results=10000
Searching for Sz : took=260ms nb_results=10000
Searching for Sze : took=226ms nb_results=10000
Searching for Szec : took=89ms nb_results=517
Searching for Szech : took=185ms nb_results=312
Searching for Szechu : took=47ms nb_results=108
Searching for Szechua : took=21ms nb_results=85
Searching for Szechuan : took=27ms nb_results=85

Searching for 3/5 indecent
Searching for i : took=465ms nb_results=10000
Searching for in : took=351ms nb_results=10000
Searching for ind : took=202ms nb_results=10000
Searching for inde : took=251ms nb_results=10000
Searching for indec : took=422ms nb_results=3415
Searching for indece : took=75ms nb_results=83
Searching for indecen : took=63ms nb_results=61
Searching for indecent : took=22ms nb_results=59

Searching for 4/5 sparsity
Searching for s : took=263ms nb_results=10000
Searching for sp : took=242ms nb_results=10000
Searching for spa : took=197ms nb_results=10000
Searching for spar : took=133ms nb_results=10000
Searching for spars : took=245ms nb_results=10000
Searching for sparsi : took=61ms nb_results=518
Searching for sparsit : took=39ms nb_results=221
Searching for sparsity : took=38ms nb_results=221

Searching for 5/5 Foucault
Searching for F : took=573ms nb_results=10000
Searching for Fo : took=278ms nb_results=10000
Searching for Fou : took=296ms nb_results=10000
Searching for Fouc : took=113ms nb_results=693
Searching for Fouca : took=55ms nb_results=188
Searching for Foucau : took=56ms nb_results=144
Searching for Foucaul : took=28ms nb_results=140
Searching for Foucault : took=35ms nb_results=133
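
For reference, a minimal sketch of what a script like random_search.sh could do, assuming it queries the Elasticsearch index directly with a prefix query on the url field (the real script is in P913 and may differ):

#!/usr/bin/env bash
# For each prefix of a word, run a search and report the ES timing and hit count.
ES=${ES_SERVER:-192.168.100.81:9200}
word=${1:-pinpointing}
for ((i=1; i<=${#word}; i++)); do
  prefix=${word:0:i}
  resp=$(curl -s -H 'Content-Type: application/json' \
    "http://$ES/origin/_search?size=0" \
    -d "{\"query\": {\"match_phrase_prefix\": {\"url\": \"$prefix\"}}}")
  echo "Searching for $prefix : took=$(jq .took <<<"$resp")ms nb_results=$(jq .hits.total.value <<<"$resp")"
done

The nb_results values capped at 10000 correspond to Elasticsearch's default track_total_hits limit.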

The cluster will be reconfigured with the nominal parameters to check the search performance.

The index configuration was reset to its defaults:

cat >/tmp/config.json <<EOF
{
  "index" : {
"translog.sync_interval" : null,
"translog.durability": null,
"refresh_interval": null
  }
}
EOF
❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty
{
  "origin" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "60s",
        "number_of_shards" : "90",
        "translog" : {
          "sync_interval" : "60s",
          "durability" : "async"
        },
        "provided_name" : "origin",
        "creation_date" : "1608761881782",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        }
      }
    }
  }
}
❯ curl -s -H "Content-Type: application/json" -XPUT http://192.168.100.81:9200/origin/_settings\?pretty -d @/tmp/config.json
{
  "acknowledged" : true
}
❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty
{
  "origin" : {
    "settings" : {
      "index" : {
        "creation_date" : "1608761881782",
        "number_of_shards" : "90",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        },
        "provided_name" : "origin"
      }
    }
  }
}

A *simple* search doesn't look impacted (this is not a real benchmark):

❯ ./random_search.sh

Searching for 1/5 headsets
Searching for h : took=868ms nb_results=10000
Searching for he : took=358ms nb_results=10000
Searching for hea : took=216ms nb_results=10000
Searching for head : took=173ms nb_results=10000
Searching for heads : took=256ms nb_results=5434
Searching for headse : took=101ms nb_results=632
Searching for headset : took=37ms nb_results=612
Searching for headsets : took=38ms nb_results=28

Searching for 2/5 sullenness's
Searching for s : took=283ms nb_results=10000
Searching for su : took=323ms nb_results=10000
Searching for sul : took=206ms nb_results=10000
Searching for sull : took=231ms nb_results=7768
Searching for sulle : took=163ms nb_results=943
Searching for sullen : took=55ms nb_results=348
Searching for sullenn : took=47ms nb_results=1
Searching for sullenne : took=13ms nb_results=0
Searching for sullennes : took=36ms nb_results=0
Searching for sullenness : took=22ms nb_results=0
Searching for sullenness' : took=69ms nb_results=0
Searching for sullenness's : took=61ms nb_results=0

Searching for 3/5 referrals
Searching for r : took=455ms nb_results=10000
Searching for re : took=300ms nb_results=10000
Searching for ref : took=315ms nb_results=10000
Searching for refe : took=187ms nb_results=10000
Searching for refer : took=197ms nb_results=10000
Searching for referr : took=205ms nb_results=3152
Searching for referra : took=283ms nb_results=1869
Searching for referral : took=118ms nb_results=1859
Searching for referrals : took=37ms nb_results=354

Searching for 4/5 zealous
Searching for z : took=345ms nb_results=10000
Searching for ze : took=271ms nb_results=10000
Searching for zea : took=228ms nb_results=10000
Searching for zeal : took=182ms nb_results=8500
Searching for zealo : took=212ms nb_results=2790
Searching for zealou : took=316ms nb_results=1050
Searching for zealous : took=72ms nb_results=1047

Searching for 5/5 cousins
Searching for c : took=993ms nb_results=10000
Searching for co : took=897ms nb_results=10000
Searching for cou : took=343ms nb_results=10000
Searching for cous : took=189ms nb_results=8907
Searching for cousi : took=225ms nb_results=1254
Searching for cousin : took=89ms nb_results=1209
Searching for cousins : took=63ms nb_results=362

In the new configuration, after some time without searches, the first ones take a while before stabilizing back to the previous values:

❯ ./random_search.sh                                                                                        12:36:37

Searching for 1/5 castoffs
Searching for c : took=10302ms nb_results=10000
Searching for ca : took=4566ms nb_results=10000
Searching for cas : took=1119ms nb_results=10000
Searching for cast : took=1885ms nb_results=10000
Searching for casto : took=798ms nb_results=2483
Searching for castof : took=499ms nb_results=37
Searching for castoff : took=703ms nb_results=23
Searching for castoffs : took=17ms nb_results=0

Searching for 2/5 nonstandard
Searching for n : took=1506ms nb_results=10000
Searching for no : took=664ms nb_results=10000
Searching for non : took=525ms nb_results=10000
Searching for nons : took=489ms nb_results=4711
Searching for nonst : took=193ms nb_results=1246
Searching for nonsta : took=78ms nb_results=220
Searching for nonstan : took=93ms nb_results=72
Searching for nonstand : took=87ms nb_results=70
Searching for nonstanda : took=20ms nb_results=69
Searching for nonstandar : took=54ms nb_results=69
Searching for nonstandard : took=20ms nb_results=64

Searching for 3/5 engraver
Searching for e : took=324ms nb_results=10000
Searching for en : took=234ms nb_results=10000
Searching for eng : took=314ms nb_results=10000
Searching for engr : took=354ms nb_results=6780
Searching for engra : took=182ms nb_results=1453
Searching for engrav : took=201ms nb_results=551
Searching for engrave : took=40ms nb_results=341
Searching for engraver : took=67ms nb_results=174

Searching for 4/5 predilections
Searching for p : took=556ms nb_results=10000
Searching for pr : took=533ms nb_results=10000
Searching for pre : took=391ms nb_results=10000
Searching for pred : took=346ms nb_results=10000
Searching for predi : took=375ms nb_results=10000
Searching for predil : took=27ms nb_results=29
Searching for predile : took=54ms nb_results=12
Searching for predilec : took=49ms nb_results=9
Searching for predilect : took=69ms nb_results=9
Searching for predilecti : took=91ms nb_results=4
Searching for predilectio : took=15ms nb_results=4
Searching for predilection : took=9ms nb_results=4
Searching for predilections : took=10ms nb_results=1

Searching for 5/5 sprinkling
Searching for s : took=486ms nb_results=10000
Searching for sp : took=557ms nb_results=10000
Searching for spr : took=508ms nb_results=10000
Searching for spri : took=404ms nb_results=10000
Searching for sprin : took=118ms nb_results=10000
Searching for sprink : took=322ms nb_results=2480
Searching for sprinkl : took=127ms nb_results=2249
Searching for sprinkli : took=53ms nb_results=8
Searching for sprinklin : took=38ms nb_results=8
Searching for sprinkling : took=15ms nb_results=7

The memory of the servers will be set to 24 GB to check whether giving more system memory to the ZFS/file cache improves the situation.
The CPU doesn't look like the bottleneck for the moment: the thread pools are not queuing anything with a single search (see the check below). We will probably have to add more CPUs if we want to support a lot of parallel searches, but we'll see that when needed.
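
The thread pool state can be checked at any time with the cat thread pool API, for instance:

curl -s "http://192.168.100.81:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed"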

The performance looks acceptable as-is for a small number of parallel searches (~10). Let's now try with real searches; it will also help adapt the cluster configuration and validate the behavior.

webapp1 is now plugged into the live production index.
Let's monitor the behavior with real searches.
First observation: the search retrieves all the documents and is not as progressive as the random search script.
The response times are longer than expected:

Jan 06 09:59:46 search1 python3[813]: 2021-01-06 09:59:46 [813] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:3.399s]
Jan 06 10:06:18 search1 python3[848]: 2021-01-06 10:06:18 [848] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:7.422s]
Jan 06 10:06:21 search1 python3[813]: 2021-01-06 10:06:21 [813] elasticsearch:INFO GET http://search-esnode3.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:5.077s]
Jan 06 10:07:32 search1 python3[813]: 2021-01-06 10:07:32 [813] elasticsearch:INFO GET http://search-esnode2.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:4.819s]
Jan 06 10:08:06 search1 python3[813]: 2021-01-06 10:08:06 [813] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:2.700s]
Jan 06 10:08:15 search1 python3[813]: 2021-01-06 10:08:15 [813] elasticsearch:INFO GET http://search-esnode3.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:2.414s]
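
One way to investigate these slow requests further (a suggestion, not something done in this task) would be to enable the search slow log on the index:

curl -s -H "Content-Type: application/json" -XPUT http://192.168.100.81:9200/origin/_settings\?pretty -d '{ "index.search.slowlog.threshold.query.warn": "2s", "index.search.slowlog.threshold.query.info": "1s" }'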

@vlorentz I was checking some differences between swh-search and the current search. Does the journal client have to listen to the origin_visit topic? It seems that `origin_visit_status` should be enough to match the behavior of the search in the webapp.

For example, consider this repo, which was recently but only partially indexed:

softwareheritage=> select * from origin_visit_status join origin on origin=id where url='https://github.com/kantacky/DoReMi';
  origin   | visit |             date              | status  | metadata | snapshot |    id     |                url                 
-----------+-------+-------------------------------+---------+----------+----------+-----------+------------------------------------
 150810263 |     1 | 2021-01-07 08:57:45.856561+00 | created |          |          | 150810263 | https://github.com/kantacky/DoReMi
 150810263 |     1 | 2021-01-07 08:57:46.280359+00 | partial |          |          | 150810263 | https://github.com/kantacky/DoReMi
(2 rows)

softwareheritage=> select * from origin_visit join origin on origin=id where url='https://github.com/kantacky/DoReMi';
  origin   | visit |             date              | type |    id     |                url                 
-----------+-------+-------------------------------+------+-----------+------------------------------------
 150810263 |     1 | 2021-01-07 08:57:45.856561+00 | git  | 150810263 | https://github.com/kantacky/DoReMi
(1 row)

The has_visit flag is set to true via the origin_visit messages: https://forge.softwareheritage.org/source/swh-search/browse/master/swh/search/journal_client.py$43
but not via the origin_visit_status messages: https://forge.softwareheritage.org/source/swh-search/browse/master/swh/search/journal_client.py$61

In the old legacy search, the repo is not found whether or not 'filter out origins with no archived content' is checked; in the new search, it is found in both cases.

I'm not sure about the "functional" meaning of the has_visit flag and how it is interpreted by the 'filter out origins with no archived content' checkbox.

Yes indeed. swh-search was written before we had origin visit statuses, and I forgot to update it.

Thanks :)

It depends on what will be implemented in T2936, but a new reindex will probably have to be done to fix the search. It will be an opportunity to think about how to do it without taking down the search entirely.

how to do it without taking down the search entirely

What do you mean? We'll just need to update the journal client and restart it.

Oh right, they were wrongly set to True. I guess we can write a small script to set them all to False before we re-consume the statuses.
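
Such a script could boil down to a single update-by-query call on the index; a sketch, assuming the flag is stored as has_visit on the origin documents:

curl -s -H "Content-Type: application/json" -XPOST "http://192.168.100.81:9200/origin/_update_by_query?conflicts=proceed&pretty" -d '{ "script": { "source": "ctx._source.has_visit = false", "lang": "painless" }, "query": { "term": { "has_visit": true } } }'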

I'm closing this issue as there are no more actions to perform at the moment.
Diagnosis and eventual fixes will be followed up in dedicated issues.