
Deploy visit-stats journal client on production
Closed, ResolvedPublic

Description

  • Stop the loaders (so they won't fail and consume too many messages once the storage is stopped)
  • Stop the storage service
  • Upgrade the storage server to the latest versions of swh-storage/swh-model/swh-journal
  • Upgrade the storage database model
  • Start the storage service
  • Upgrade the replica db model so replication continues
  • Upgrade and start the loaders
  • Upgrade the scheduler stack to the latest versions created during the sprint
  • Stop swh-scheduler-runner to avoid blocking queries
  • Upgrade the scheduler database model
  • Restart scheduler service (gunicorn-swh-scheduler, swh-scheduler-runner, ...)
  • Create consumer group swh.scheduler.journal_client so it starts at the end of the topics
  • Deploy a new scheduler journal-client service on saatchi (which must start at the end of the topic)
  • Launch a backfill of the origin_visit_status topic

Event Timeline

ardumont created this task.
ardumont moved this task from Backlog to Weekly backlog on the System administration board.
ardumont added a subscriber: vsellier.
  • Stop the workers:
$ clush -b -w @swh-workers 'puppet agent --disable "Deploy new storage version"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
  • Upgrade saam backend to latest storage/model/scheduler versions:
# apt ...
# dpkg -l python3-swh.model python3-swh.storage python3-swh.scheduler python3-swh.journal
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                  Version               Architecture Description
+++-=====================-=====================-============-===================================
ii  python3-swh.journal   0.6.2-1~swh1~bpo10+1  all          Software Heritage Journal utilities
ii  python3-swh.model     0.11.0-1~swh1~bpo10+1 all          Software Heritage data model
ii  python3-swh.scheduler 0.9.2-1~swh1~bpo10+1  all          Software Heritage Scheduler
ii  python3-swh.storage   0.21.0-1~swh1~bpo10+1 all          Software Heritage storage utilities
  • Upgrade db from 164 to 166:
$ psql service=swh
softwareheritage=> \conninfo
You are connected to database "softwareheritage" as user "swhstorage" on host "belvedere.internal.softwareheritage.org" (address "192.168.100.210") at port "5432".
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
softwareheritage=> select dbversion from dbversion order by dbversion desc limit 3;
                        dbversion
----------------------------------------------------------
 (166,"2021-01-26 09:28:10.658088+00","Work In Progress")
 (165,"2021-01-26 09:28:10.658088+00","Work In Progress")
 (164,"2020-11-04 14:00:07.29092+00","Work In Progress")
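The actual migration commands are not captured above. As a hedged sketch, the upgrade from 164 to 166 amounts to applying the incremental SQL scripts shipped with swh-storage; the `sql/upgrades/NNN.sql` layout and the `psql` invocation are assumptions here, not taken from this log:

```shell
# Hypothetical sketch of the migration step: list the upgrade scripts needed
# to go from the current dbversion to the target one, then apply them in order.
pending_upgrades() {
    local v
    for ((v = $1 + 1; v <= $2; v++)); do
        echo "sql/upgrades/${v}.sql"
    done
}

# For 164 -> 166 this would apply 165.sql then 166.sql (illustrative only):
# for f in $(pending_upgrades 164 166); do
#     psql service=swh -f "$f"   # each script also inserts the new dbversion row
# done
```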
  • Ensure replication is fine
$ psql service=mirror-swh
softwareheritage=> \conninfo
You are connected to database "softwareheritage" as user "guest" on host "somerset.internal.softwareheritage.org" (address "192.168.100.103") at port "5432".
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)

softwareheritage=> select dbversion from dbversion order by dbversion desc limit 3;
                        dbversion
---------------------------------------------------------
 (166,"2021-01-26 09:43:57.94437+00","Work In Progress")
 (165,"2021-01-26 09:43:10.95685+00","Work In Progress")
 (164,"2020-12-02 10:17:09.47653+00","Work In Progress")
  • Upgrade swh-workers to the latest swh versions
  • Restart them

(break, deposit meeting ;)

  • Stop the scheduler runner to avoid blocking queries:
systemctl stop swh-scheduler-runner
  • Upgrade the scheduler stack:
# apt ...
  • Upgrade scheduler db model:
softwareheritage-scheduler=> \conninfo
You are connected to database "softwareheritage-scheduler" as user "swhscheduler" on host "belvedere.internal.softwareheritage.org" (address "192.168.100.210") at port "5432".
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
softwareheritage-scheduler=> select dbversion from dbversion order by dbversion desc limit 8;
                        dbversion
---------------------------------------------------------
 (25,"2021-01-26 11:42:29.918304+00","Work In Progress")
 (24,"2021-01-26 11:42:15.917849+00","Work In Progress")
 (23,"2021-01-26 11:42:05.066391+00","Work In Progress")
 (20,"2021-01-26 11:41:02.493991+00","Work In Progress")
 (19,"2021-01-26 11:40:39.74381+00","Work In Progress")
 (18,"2021-01-26 11:40:32.628425+00","Work In Progress")
 (17,"2021-01-26 11:30:31.575419+00","Work In Progress")
 (16,"2020-06-22 18:00:33.852039+00","Work In Progress")
(8 rows)
  • Restart scheduler services:
# systemctl restart gunicorn-swh-scheduler swh-scheduler-runner swh-scheduler-listener

(break food time ¯\_(ツ)_/¯)

ardumont changed the task status from Open to Work in Progress. Jan 26 2021, 12:54 PM
ardumont updated the task description.
  • Create consumer group swh.scheduler.journal_client so it starts at the end of the topics
root@kafka1# cd /opt/kafka/bin
root@kafka1# export SERVER=kafka1.internal.softwareheritage.org:9092
root@kafka1:/opt/kafka/bin# ./kafka-consumer-groups.sh --bootstrap-server $SERVER --group swh.scheduler.journal_client \
>   --topic swh.journal.objects.origin_visit_status --reset-offsets --to-latest --execute

GROUP                          TOPIC                          PARTITION  NEW-OFFSET
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 109        5087970
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 26         5080073
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 201        5079810
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 238        5076767
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 12         5080050
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 231        5085073
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 140        5078655
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 35         5079412
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 10         5080078
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 115        5083860
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 243        5080515
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 99         5086378
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 215        5080377
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 208        5082540
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 143        5082577
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 178        5081556
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 39         5079159
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 63         5082302
swh.scheduler.journal_client   swh.journal.objects.origin_visit_status 194        5081268
...

root@kafka1:/opt/kafka/bin# ./kafka-consumer-groups.sh --bootstrap-server $SERVER --list
swh.search.journal_client
KMOffsetCache-getty
swh.indexer.journal_client
swh.scheduler.journal_client

root@kafka1:/opt/kafka/bin# ./kafka-consumer-groups.sh --bootstrap-server $SERVER --describe --group swh.scheduler.journal_client

Consumer group 'swh.scheduler.journal_client' has no active members.

GROUP                        TOPIC                                   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 134        5082134         5082144         10              -               -               -
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 167        5082487         5082499         12              -               -               -
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 200        5082963         5082976         13              -               -               -
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 233        5080371         5080384         13              -               -               -
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 18         5080655         5080667         12              -               -               -
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 51         5086240         5086252         12              -               -               -
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 84         5080830         5080839         9               -               -               -
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 117        5078954         5078967         13              -               -               -
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 150        5081221         5081237         16              -               -               -
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 183        5079272         5079282         10              -               -               -
...
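To watch the lag without eyeballing every partition, the `--describe` output can be summed. This helper is an illustrative addition, not part of the original procedure:

```shell
# Sum the LAG column (field 6) of kafka-consumer-groups.sh --describe output,
# skipping the header line and anything else non-numeric in that column.
total_lag() {
    awk '$6 ~ /^[0-9]+$/ { sum += $6 } END { print sum + 0 }'
}

# Usage (illustrative):
# ./kafka-consumer-groups.sh --bootstrap-server $SERVER --describe \
#     --group swh.scheduler.journal_client | total_lag
```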
  • Deploy a new scheduler journal-client service on saatchi (which must start at the end of the topic)
root@saatchi:~# puppet agent --enable; puppet agent --test
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for saatchi.internal.softwareheritage.org
Info: Applying configuration version '1611667435'
Notice: /Stage[main]/Profile::Swh::Deploy::Journal/File[/etc/softwareheritage/journal]/ensure: created
Notice: /Stage[main]/Profile::Swh::Deploy::Scheduler::Journal_client/File[/etc/softwareheritage/scheduler/journal-client.yml]/ensure: defined content as '{md5}b007df0fe6dd8fc60558eeb60c71c83d'
Info: /Stage[main]/Profile::Swh::Deploy::Scheduler::Journal_client/File[/etc/softwareheritage/scheduler/journal-client.yml]: Scheduling refresh of Service[swh-scheduler-journal-client]
Notice: /Stage[main]/Profile::Swh::Deploy::Scheduler::Journal_client/Systemd::Unit_file[swh-scheduler-journal-client.service]/File[/etc/systemd/system/swh-scheduler-journal-client.service]/ensure: defined content as '{md5}7958a9db226f3b0d774df2ebb512f350'
Info: /Stage[main]/Profile::Swh::Deploy::Scheduler::Journal_client/Systemd::Unit_file[swh-scheduler-journal-client.service]/File[/etc/systemd/system/swh-scheduler-journal-client.service]: Scheduling refresh of Class[Systemd::Systemctl::Daemon_reload]
Info: Systemd::Unit_file[swh-scheduler-journal-client.service]: Scheduling refresh of Service[swh-scheduler-journal-client]
Notice: /Stage[main]/Profile::Swh::Deploy::Scheduler::Journal_client/Service[swh-scheduler-journal-client]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Profile::Swh::Deploy::Scheduler::Journal_client/Service[swh-scheduler-journal-client]: Unscheduling refresh on Service[swh-scheduler-journal-client]
Info: Class[Systemd::Systemctl::Daemon_reload]: Scheduling refresh of Exec[systemctl-daemon-reload]
Notice: /Stage[main]/Systemd::Systemctl::Daemon_reload/Exec[systemctl-daemon-reload]: Triggered 'refresh' from 1 event
Notice: Applied catalog in 14.32 seconds
root@saatchi:~#
  • Check the lag subsides thanks to the newly deployed (scheduler) journal client consuming the topic:
root@kafka1:/opt/kafka/bin# ./kafka-consumer-groups.sh --bootstrap-server $SERVER --describe --group swh.scheduler.journal_client | head

GROUP                        TOPIC                                   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST             CLIENT-ID
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 109        5088064         5088064         0               rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 26         5080170         5080170         0               rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 201        5079903         5079903         0               rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 238        5076872         5076872         0               rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 12         5080135         5080135         0               rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 231        5085161         5085161         0               rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 140        5078748         5078748         0               rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 35         5079492         5079493         1               rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
...
  • Check the new visit-stats table is populated along the way
$ psql service=swh-scheduler -c 'select now() , count(*) from origin_visit_stats';
              now              | count
-------------------------------+-------
 2021-01-26 13:27:37.053621+00 | 12682
(1 row)

$ psql service=swh-scheduler -c 'select now() , count(*) from origin_visit_stats';
             now              | count
------------------------------+-------
 2021-01-26 13:27:40.89528+00 | 12719
(1 row)
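Two samples a few seconds apart give a rough live consumption rate at this point, before the backfill starts. The numbers are taken from the counts above; the arithmetic is only an illustration:

```shell
# Rough consumption rate from the two samples above:
# (12719 - 12682) rows over ~3.84 seconds of wall clock.
awk 'BEGIN { printf "~%.0f rows/s\n", (12719 - 12682) / 3.84 }'
```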
  • getty: Update swh dependencies (storage upgrade for the backfiller)
apt ...
  • Install backfill.sh with the right configuration backfill.yml and logging.yml (P927)
  • Trigger backfill
root@getty:/srv/softwareheritage/backfill-2021-01# ./backfill.sh
Starting origin_visit_status backfill for range 0 -> 10000000
Starting origin_visit_status backfill for range 10000000 -> 20000000
Starting origin_visit_status backfill for range 20000000 -> 30000000
Starting origin_visit_status backfill for range 30000000 -> 40000000
Starting origin_visit_status backfill for range 40000000 -> 50000000
Starting origin_visit_status backfill for range 50000000 -> 60000000
Starting origin_visit_status backfill for range 60000000 -> 70000000
Starting origin_visit_status backfill for range 70000000 -> 80000000
Starting origin_visit_status backfill for range 80000000 -> 90000000
Starting origin_visit_status backfill for range 90000000 -> 100000000
Starting origin_visit_status backfill for range 100000000 -> 110000000
Starting origin_visit_status backfill for range 110000000 -> 120000000
Starting origin_visit_status backfill for range 120000000 -> 130000000
Starting origin_visit_status backfill for range 130000000 -> 140000000
Starting origin_visit_status backfill for range 140000000 -> 150000000
Starting origin_visit_status backfill for range 150000000 -> 160000000
2021-01-26T13:47:28 INFO     swh.storage.backfill Processing origin_visit_status range 100000000 to 100001000
2021-01-26T13:47:28 INFO     swh.storage.backfill Processing origin_visit_status range 140000000 to 140001000
2021-01-26T13:47:28 INFO     swh.storage.backfill Processing origin_visit_status range 120000000 to 120001000
2021-01-26T13:47:28 INFO     swh.storage.backfill Processing origin_visit_status range 110000000 to 110001000
...
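The contents of backfill.sh are in P927 and not reproduced here. As a hypothetical reconstruction consistent with the log lines above and the ps output further down, it splits the id space into 10M-object slices and launches one backfiller per slice:

```shell
#!/bin/bash
# Hypothetical reconstruction of backfill.sh (the real script lives in P927).
STEP=10000000
LAST=160000000

backfill_ranges() {
    local start
    for ((start = 0; start < LAST; start += STEP)); do
        echo "Starting origin_visit_status backfill for range ${start} -> $((start + STEP))"
        # Real invocation (visible in the ps output below), one process per slice:
        # swh --log-config logging.yml storage --config-file backfill.yml \
        #     backfill --start-object ${start} --end-object $((start + STEP)) \
        #     origin_visit_status &
    done
}

backfill_ranges
# wait   # block until every slice is done
```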
  • Backfiller running:
root@kafka1:/opt/kafka/bin# ./kafka-consumer-groups.sh --bootstrap-server $SERVER --describe --group swh.scheduler.journal_client | head
GROUP                        TOPIC                                   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST             CLIENT-ID
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 109        5089421         5093873         4452            rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 26         5085041         5085832         791             rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 201        5081258         5085564         4306            rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 238        5081351         5082517         1166            rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 12         5081466         5085931         4465            rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 231        5089665         5090877         1212            rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 140        5080086         5084443         4357            rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 35         5083996         5085273         1277            rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
...

root@kafka1:/opt/kafka/bin# ./kafka-consumer-groups.sh --bootstrap-server $SERVER --describe --group swh.scheduler.journal_client | head

GROUP                        TOPIC                                   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST             CLIENT-ID
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 109        5098373         5112888         14515           rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 26         5095076         5104761         9685            rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 201        5081258         5104579         23321           rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 238        5083285         5101595         18310           rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 12         5094831         5104895         10064           rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 231        5091602         5110168         18566           rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 140        5089005         5103470         14465           rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 35         5094654         5104363         9709            rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka

Everything looks fine and visit-stats is growing gently:

softwareheritage-scheduler=> select now(), count(*) from origin_visit_stats;
              now              |  count
-------------------------------+----------
 2021-01-26 18:07:40.803783+00 | 25672292
(1 row)

status:

  • backfill almost done on getty [1] [2]

board: https://grafana.softwareheritage.org/d/KvQqUhsWz/kafka-consumers-lag?orgId=1&refresh=30s&from=now-24h&to=now

root@getty:/srv/softwareheritage/backfill-2021-01# date; ps -ef | grep swh
Wed 27 Jan 2021 11:34:12 AM UTC
swhstor+ 2260701       1 22 Jan26 ?        05:24:36 /usr/bin/python3 /usr/bin/swh indexer --config-file /etc/softwareheritage/indexer/journal_client.yml journal-client
root     2268957 2268954 36 Jan26 pts/1    07:58:42 /usr/bin/python3 /usr/bin/swh --log-config logging.yml storage --config-file backfill.yml backfill --start-object 0 --end-object 10000000 origin_visit_status
root     2349020 2269720  0 11:34 pts/2    00:00:00 grep swh
  • visit-stats keeps growing (out of ~151M origins)
softwareheritage-scheduler=> select now(), count(*) from origin_visit_stats;
              now              |  count
-------------------------------+----------
 2021-01-27 11:32:12.871945+00 | 83049411
(1 row)
  • ETA until visit-stats is completely populated (with regard to the backfill): 18 hours [1]

[1] https://grafana.softwareheritage.org/goto/WF8NbGLGk
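The Grafana estimate can be sanity-checked from the two row counts recorded above (25,672,292 at 18:07 UTC and 83,049,411 at 11:32 UTC the next day, roughly 62,672 s apart) against the ~151M origin total. The arithmetic below is only a back-of-the-envelope cross-check; all inputs come from this task log:

```shell
# Average ingestion rate over the sampled interval, and the remaining time
# at that rate.
awk 'BEGIN {
    rows   = 83049411 - 25672292      # rows ingested between the two samples
    secs   = 62672                    # ~17h25m between the two samples
    rate   = rows / secs              # average rows per second
    remain = 151000000 - 83049411     # rows still missing out of ~151M origins
    printf "%.0f rows/s, ~%.0f hours remaining\n", rate, remain / (rate * 3600)
}'
```

At ~916 rows/s this gives roughly 21 hours, the same ballpark as the 18-hour Grafana figure; the live journal traffic on top of the backfill easily accounts for the difference.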

Status update: the main part of the scheduler journal client work is done [1]; 98M origins
are referenced in the cache table.

So this task is completed. The pipeline is deployed.

But we hit multiple issues, which will be tracked in at least one dedicated task (maybe
more if need be): T3000.

[1]

softwareheritage-scheduler=> select now(), count(*) from origin_visit_stats;
              now              |  count
-------------------------------+----------
 2021-01-28 08:34:40.152554+00 | 98231002
ardumont claimed this task.
ardumont moved this task from deployed/landed to done on the System administration board.