- Stop the loaders (so they won't fail and consume too many messages once the storage is stopped)
- Stop the storage service
- Upgrade the storage server to the latest versions of swh-storage/swh-model/swh-journal
- Upgrade the storage database model
- Start the storage service
- Upgrade the replica db model so replication continues
- Upgrade and start the loaders
- Upgrade the scheduler stack to the latest versions created during the sprint
- Stop swh-scheduler-runner to avoid blocking queries
- Upgrade the scheduler database model
- Restart scheduler service (gunicorn-swh-scheduler, swh-scheduler-runner, ...)
- Create consumer group swh.scheduler.journal_client so it starts at the end of the topics
- Deploy a new scheduler journal-client service on saatchi (which must start at the end of the topic)
- Launch a backfill of the origin_visit_status topic
Description
Revisions and Commits
rSPSITE puppet-swh-site
D4946 | rSPSITEfc34778071b7 Install scheduler journal client to saatchi
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T2345 Improve handling of recurrent loading tasks in scheduler
Migrated | gitlab-migration | T2454 Stop creating tasks directly in listers
Migrated | gitlab-migration | T2444 Implement the scheduling policy for the recurrent visit scheduler
Migrated | gitlab-migration | T2993 Deploy visit-stats journal client on production
Event Timeline
Comment Actions
- Stop the workers:
$ clush -b -w @swh-workers 'puppet agent --disable "Deploy new storage version"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
- Upgrade saam backend to latest storage/model/scheduler versions:
    # apt ...
    # dpkg -l python3-swh.model python3-swh.storage python3-swh.scheduler python3-swh.journal
    Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name                  Version               Architecture Description
    +++-=====================-=====================-============-===================================
    ii  python3-swh.journal   0.6.2-1~swh1~bpo10+1  all          Software Heritage Journal utilities
    ii  python3-swh.model     0.11.0-1~swh1~bpo10+1 all          Software Heritage data model
    ii  python3-swh.scheduler 0.9.2-1~swh1~bpo10+1  all          Software Heritage Scheduler
    ii  python3-swh.storage   0.21.0-1~swh1~bpo10+1 all          Software Heritage storage utilities
- Upgrade db from 164 to 166:
    $ psql service=swh
    softwareheritage=> \conninfo
    You are connected to database "softwareheritage" as user "swhstorage" on host "belvedere.internal.softwareheritage.org" (address "192.168.100.210") at port "5432".
    SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
    softwareheritage=> select dbversion from dbversion order by dbversion desc limit 3;
                            dbversion
    ----------------------------------------------------------
     (166,"2021-01-26 09:28:10.658088+00","Work In Progress")
     (165,"2021-01-26 09:28:10.658088+00","Work In Progress")
     (164,"2020-11-04 14:00:07.29092+00","Work In Progress")
- Ensure replication is fine
    $ psql service=mirror-swh
    softwareheritage=> \conninfo
    You are connected to database "softwareheritage" as user "guest" on host "somerset.internal.softwareheritage.org" (address "192.168.100.103") at port "5432".
    SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
    softwareheritage=> select dbversion from dbversion order by dbversion desc limit 3;
                            dbversion
    ---------------------------------------------------------
     (166,"2021-01-26 09:43:57.94437+00","Work In Progress")
     (165,"2021-01-26 09:43:10.95685+00","Work In Progress")
     (164,"2020-12-02 10:17:09.47653+00","Work In Progress")
- Upgrade swh-workers with latest swh versions.
- Restart those
(break, deposit meeting ;)
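The replication check in the steps above boils down to comparing the `dbversion` rows seen on the primary (belvedere) with those seen on the replica (somerset). A minimal Python sketch of that comparison, assuming the version numbers have already been fetched from each database; the helper name `missing_on_replica` is hypothetical and not part of any swh package:

```python
from typing import Iterable, List


def missing_on_replica(primary_versions: Iterable[int],
                       replica_versions: Iterable[int]) -> List[int]:
    """Return schema versions present on the primary but absent on the replica."""
    return sorted(set(primary_versions) - set(replica_versions))


# Version numbers copied from the two psql sessions in the log.
primary = [166, 165, 164]
replica = [166, 165, 164]

# An empty list means the replica has every schema version the primary has.
print(missing_on_replica(primary, replica))
```

This only compares version numbers; it says nothing about replication lag on the data itself.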
Comment Actions
- stop scheduler runner to avoid blocking queries:
systemctl stop swh-scheduler-runner
- upgrade scheduler stack
# apt ...
- Upgrade scheduler db model:
    softwareheritage-scheduler=> \conninfo
    You are connected to database "softwareheritage-scheduler" as user "swhscheduler" on host "belvedere.internal.softwareheritage.org" (address "192.168.100.210") at port "5432".
    SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
    softwareheritage-scheduler=> select dbversion from dbversion order by dbversion desc limit 8;
                            dbversion
    ---------------------------------------------------------
     (25,"2021-01-26 11:42:29.918304+00","Work In Progress")
     (24,"2021-01-26 11:42:15.917849+00","Work In Progress")
     (23,"2021-01-26 11:42:05.066391+00","Work In Progress")
     (20,"2021-01-26 11:41:02.493991+00","Work In Progress")
     (19,"2021-01-26 11:40:39.74381+00","Work In Progress")
     (18,"2021-01-26 11:40:32.628425+00","Work In Progress")
     (17,"2021-01-26 11:30:31.575419+00","Work In Progress")
     (16,"2020-06-22 18:00:33.852039+00","Work In Progress")
    (8 rows)
- restart scheduler services
# systemctl restart gunicorn-swh-scheduler swh-scheduler-runner swh-scheduler-listener
(break food time ¯\_(ツ)_/¯)
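The scheduler `dbversion` listing above goes from 16 to 25 but shows only 8 rows. A small Python sketch for spotting which version numbers are not recorded in such a listing; the function name `unrecorded_versions` is hypothetical, and the log does not say why some numbers are absent, so a gap is only something to look into, not necessarily an error:

```python
from typing import Iterable, List


def unrecorded_versions(applied: Iterable[int]) -> List[int]:
    """Return version numbers between min and max that are not in `applied`."""
    seen = set(applied)
    return [v for v in range(min(seen), max(seen) + 1) if v not in seen]


# Version numbers copied from the scheduler dbversion query in the log.
applied = [25, 24, 23, 20, 19, 18, 17, 16]
print(unrecorded_versions(applied))
```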
Comment Actions
- Create consumer group swh.scheduler.journal_client so it starts at the end of the topics
    root@kafka1# cd /opt/kafka/bin
    root@kafka1# export SERVER=kafka1.internal.softwareheritage.org:9092
    root@kafka1:/opt/kafka/bin# ./kafka-consumer-groups.sh --bootstrap-server $SERVER --group swh.scheduler.journal_client \
    > --topic swh.journal.objects.origin_visit_status --reset-offsets --to-latest --execute
    GROUP TOPIC PARTITION NEW-OFFSET
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 109 5087970
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 26 5080073
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 201 5079810
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 238 5076767
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 12 5080050
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 231 5085073
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 140 5078655
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 35 5079412
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 10 5080078
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 115 5083860
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 243 5080515
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 99 5086378
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 215 5080377
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 208 5082540
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 143 5082577
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 178 5081556
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 39 5079159
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 63 5082302
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 194 5081268
    ...
    root@kafka1:/opt/kafka/bin# ./kafka-consumer-groups.sh --bootstrap-server $SERVER --list
    swh.search.journal_client
    KMOffsetCache-getty
    swh.indexer.journal_client
    swh.scheduler.journal_client
    root@kafka1:/opt/kafka/bin# ./kafka-consumer-groups.sh --bootstrap-server $SERVER --describe --group swh.scheduler.journal_client
    Consumer group 'swh.scheduler.journal_client' has no active members.
    GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 134 5082134 5082144 10 - - -
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 167 5082487 5082499 12 - - -
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 200 5082963 5082976 13 - - -
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 233 5080371 5080384 13 - - -
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 18 5080655 5080667 12 - - -
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 51 5086240 5086252 12 - - -
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 84 5080830 5080839 9 - - -
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 117 5078954 5078967 13 - - -
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 150 5081221 5081237 16 - - -
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 183 5079272 5079282 10 - - -
    ...
- Deploy a new scheduler journal-client service on saatchi (which must start at the end of the topic)
    root@saatchi:~# puppet agent --enable; puppet agent --test
    Info: Using configured environment 'production'
    Info: Retrieving pluginfacts
    Info: Retrieving plugin
    Info: Retrieving locales
    Info: Loading facts
    Info: Caching catalog for saatchi.internal.softwareheritage.org
    Info: Applying configuration version '1611667435'
    Notice: /Stage[main]/Profile::Swh::Deploy::Journal/File[/etc/softwareheritage/journal]/ensure: created
    Notice: /Stage[main]/Profile::Swh::Deploy::Scheduler::Journal_client/File[/etc/softwareheritage/scheduler/journal-client.yml]/ensure: defined content as '{md5}b007df0fe6dd8fc60558eeb60c71c83d'
    Info: /Stage[main]/Profile::Swh::Deploy::Scheduler::Journal_client/File[/etc/softwareheritage/scheduler/journal-client.yml]: Scheduling refresh of Service[swh-scheduler-journal-client]
    Notice: /Stage[main]/Profile::Swh::Deploy::Scheduler::Journal_client/Systemd::Unit_file[swh-scheduler-journal-client.service]/File[/etc/systemd/system/swh-scheduler-journal-client.service]/ensure: defined content as '{md5}7958a9db226f3b0d774df2ebb512f350'
    Info: /Stage[main]/Profile::Swh::Deploy::Scheduler::Journal_client/Systemd::Unit_file[swh-scheduler-journal-client.service]/File[/etc/systemd/system/swh-scheduler-journal-client.service]: Scheduling refresh of Class[Systemd::Systemctl::Daemon_reload]
    Info: Systemd::Unit_file[swh-scheduler-journal-client.service]: Scheduling refresh of Service[swh-scheduler-journal-client]
    Notice: /Stage[main]/Profile::Swh::Deploy::Scheduler::Journal_client/Service[swh-scheduler-journal-client]/ensure: ensure changed 'stopped' to 'running'
    Info: /Stage[main]/Profile::Swh::Deploy::Scheduler::Journal_client/Service[swh-scheduler-journal-client]: Unscheduling refresh on Service[swh-scheduler-journal-client]
    Info: Class[Systemd::Systemctl::Daemon_reload]: Scheduling refresh of Exec[systemctl-daemon-reload]
    Notice: /Stage[main]/Systemd::Systemctl::Daemon_reload/Exec[systemctl-daemon-reload]: Triggered 'refresh' from 1 event
    Notice: Applied catalog in 14.32 seconds
    root@saatchi:~#
- Check that the lag subsides now that the newly deployed scheduler journal client is consuming the topic:
    root@kafka1:/opt/kafka/bin# ./kafka-consumer-groups.sh --bootstrap-server $SERVER --describe --group swh.scheduler.journal_client | head
    GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 109 5088064 5088064 0 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 26 5080170 5080170 0 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 201 5079903 5079903 0 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 238 5076872 5076872 0 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 12 5080135 5080135 0 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 231 5085161 5085161 0 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 140 5078748 5078748 0 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 35 5079492 5079493 1 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    ...
- Check the new visit-stats table is populated along the way
    $ psql service=swh-scheduler -c 'select now() , count(*) from origin_visit_stats';
                  now              | count
    -------------------------------+-------
     2021-01-26 13:27:37.053621+00 | 12682
    (1 row)
    $ psql service=swh-scheduler -c 'select now() , count(*) from origin_visit_stats';
                 now              | count
    ------------------------------+-------
     2021-01-26 13:27:40.89528+00 | 12719
    (1 row)
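The two `count(*)` samples above are enough to estimate how fast the journal client fills `origin_visit_stats`. A minimal Python sketch of that arithmetic, using the timestamps and counts from the log; the helper name `rows_per_second` is hypothetical, and two samples a few seconds apart give only a rough figure:

```python
from datetime import datetime


def rows_per_second(t0: datetime, n0: int, t1: datetime, n1: int) -> float:
    """Average insertion rate between two (timestamp, row count) samples."""
    return (n1 - n0) / (t1 - t0).total_seconds()


# Samples copied from the two psql invocations in the log
# (fractional seconds padded to six digits for datetime.fromisoformat).
t0 = datetime.fromisoformat("2021-01-26 13:27:37.053621+00:00")
t1 = datetime.fromisoformat("2021-01-26 13:27:40.895280+00:00")

rate = rows_per_second(t0, 12682, t1, 12719)
print(f"{rate:.1f} rows/s")
```

At this point only live traffic was being consumed (the backfill had not started yet), so this rate reflects regular loader activity rather than backfill throughput.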
Comment Actions
- getty: Update swh dependencies (storage upgrade for the backfiller)
apt ...
- Install backfill.sh with the right configuration backfill.yml and logging.yml (P927)
- Trigger backfill
    root@getty:/srv/softwareheritage/backfill-2021-01# ./backfill.sh
    Starting origin_visit_status backfill for range 0 -> 10000000
    Starting origin_visit_status backfill for range 10000000 -> 20000000
    Starting origin_visit_status backfill for range 20000000 -> 30000000
    Starting origin_visit_status backfill for range 30000000 -> 40000000
    Starting origin_visit_status backfill for range 40000000 -> 50000000
    Starting origin_visit_status backfill for range 50000000 -> 60000000
    Starting origin_visit_status backfill for range 60000000 -> 70000000
    Starting origin_visit_status backfill for range 70000000 -> 80000000
    Starting origin_visit_status backfill for range 80000000 -> 90000000
    Starting origin_visit_status backfill for range 90000000 -> 100000000
    Starting origin_visit_status backfill for range 100000000 -> 110000000
    Starting origin_visit_status backfill for range 110000000 -> 120000000
    Starting origin_visit_status backfill for range 120000000 -> 130000000
    Starting origin_visit_status backfill for range 130000000 -> 140000000
    Starting origin_visit_status backfill for range 140000000 -> 150000000
    Starting origin_visit_status backfill for range 150000000 -> 160000000
    2021-01-26T13:47:28 INFO swh.storage.backfill Processing origin_visit_status range 100000000 to 100001000
    2021-01-26T13:47:28 INFO swh.storage.backfill Processing origin_visit_status range 140000000 to 140001000
    2021-01-26T13:47:28 INFO swh.storage.backfill Processing origin_visit_status range 120000000 to 120001000
    2021-01-26T13:47:28 INFO swh.storage.backfill Processing origin_visit_status range 110000000 to 110001000
    ...
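The output above shows the backfill split into 16 ranges of 10M objects each. The actual `backfill.sh` (P927) is not reproduced in this task, so the following Python sketch only illustrates how such ranges and per-range command lines could be generated. The helper names are hypothetical; the command layout mirrors the `swh ... backfill` invocation visible in the `ps` output later in this task:

```python
from typing import List, Tuple


def backfill_ranges(end: int, step: int) -> List[Tuple[int, int]]:
    """Split [0, end) into consecutive (start, stop) ranges of at most `step` objects."""
    return [(start, min(start + step, end)) for start in range(0, end, step)]


def backfill_command(start: int, stop: int,
                     object_type: str = "origin_visit_status") -> str:
    """Build one backfill command line; config file names are those seen in the log."""
    return (
        "swh --log-config logging.yml storage --config-file backfill.yml "
        f"backfill --start-object {start} --end-object {stop} {object_type}"
    )


ranges = backfill_ranges(160_000_000, 10_000_000)
for start, stop in ranges:
    # Printed only; launching the processes is left to the operator.
    print(backfill_command(start, stop))
```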
- Backfiller running:
    root@kafka1:/opt/kafka/bin# ./kafka-consumer-groups.sh --bootstrap-server $SERVER --describe --group swh.scheduler.journal_client | head
    GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 109 5089421 5093873 4452 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 26 5085041 5085832 791 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 201 5081258 5085564 4306 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 238 5081351 5082517 1166 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 12 5081466 5085931 4465 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 231 5089665 5090877 1212 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 140 5080086 5084443 4357 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 35 5083996 5085273 1277 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    ...
    root@kafka1:/opt/kafka/bin# ./kafka-consumer-groups.sh --bootstrap-server $SERVER --describe --group swh.scheduler.journal_client | head
    GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 109 5098373 5112888 14515 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 26 5095076 5104761 9685 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 201 5081258 5104579 23321 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 238 5083285 5101595 18310 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 12 5094831 5104895 10064 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 231 5091602 5110168 18566 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 140 5089005 5103470 14465 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
    swh.scheduler.journal_client swh.journal.objects.origin_visit_status 35 5094654 5104363 9709 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
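Reading lag off the `--describe` output by eye gets tedious across many partitions. A minimal Python sketch that parses the whitespace-separated columns shown above and compares two snapshots; the function name `lag_by_partition` is hypothetical, the sample text is the first two partition rows of each snapshot from the log, and rows whose offsets are not numeric are skipped as an assumption about how unset offsets are printed:

```python
from typing import Dict


def lag_by_partition(describe_output: str) -> Dict[int, int]:
    """Map partition -> LAG from `kafka-consumer-groups.sh --describe` text."""
    lags: Dict[int, int] = {}
    for line in describe_output.splitlines():
        fields = line.split()
        # Columns: GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG ...
        # The header row, blank lines and non-numeric rows are ignored.
        if len(fields) < 6 or not fields[2].isdigit() or not fields[5].isdigit():
            continue
        lags[int(fields[2])] = int(fields[5])
    return lags


FIRST = """\
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 109 5089421 5093873 4452 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 26 5085041 5085832 791 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
"""

SECOND = """\
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 109 5098373 5112888 14515 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
swh.scheduler.journal_client swh.journal.objects.origin_visit_status 26 5095076 5104761 9685 rdkafka-39edd9ab-33ce-4de2-9dad-9cac7fa0e0fe /192.168.100.104 rdkafka
"""

before, after = lag_by_partition(FIRST), lag_by_partition(SECOND)
# Positive growth means the backfill produces faster than the client consumes.
growth = {p: after[p] - before[p] for p in before if p in after}
print(sum(before.values()), sum(after.values()), growth)
```

On these two partitions the lag grows between snapshots, which is expected while the backfill is writing to the topic faster than the single journal client can consume.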
Comment Actions
Everything looks fine and visit-stats is growing gently:
    softwareheritage-scheduler=> select now(), count(*) from origin_visit_stats;
                  now              |  count
    -------------------------------+----------
     2021-01-26 18:07:40.803783+00 | 25672292
    (1 row)
Comment Actions
status:
- backfill almost done on getty [1] [2]
    root@getty:/srv/softwareheritage/backfill-2021-01# date; ps -ef | grep swh
    Wed 27 Jan 2021 11:34:12 AM UTC
    swhstor+ 2260701 1 22 Jan26 ? 05:24:36 /usr/bin/python3 /usr/bin/swh indexer --config-file /etc/softwareheritage/indexer/journal_client.yml journal-client
    root 2268957 2268954 36 Jan26 pts/1 07:58:42 /usr/bin/python3 /usr/bin/swh --log-config logging.yml storage --config-file backfill.yml backfill --start-object 0 --end-object 10000000 origin_visit_status
    root 2349020 2269720 0 11:34 pts/2 00:00:00 grep swh
- visit-stats keeps growing (out of ~151M origins)
    softwareheritage-scheduler=> select now(), count(*) from origin_visit_stats;
                  now              |  count
    -------------------------------+----------
     2021-01-27 11:32:12.871945+00 | 83049411
    (1 row)
- ETA until visit-stats is completely populated (with regard to the backfill): 18 hours [1]
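An ETA like the one above can be cross-checked by linear extrapolation from two `origin_visit_stats` counts. A minimal Python sketch, assuming a constant fill rate and taking the "~151M origins" figure from this comment as the target; the helper name `eta_hours` is hypothetical, and the method actually used for the 18-hour figure is not stated in the log:

```python
from datetime import datetime


def eta_hours(t0: datetime, n0: int, t1: datetime, n1: int, target: int) -> float:
    """Hours left to reach `target` rows, assuming the average rate between samples holds."""
    rate = (n1 - n0) / (t1 - t0).total_seconds()  # rows per second
    return (target - n1) / rate / 3600


# Counts and timestamps copied from the psql outputs in this task.
t0 = datetime.fromisoformat("2021-01-26 18:07:40.803783+00:00")
t1 = datetime.fromisoformat("2021-01-27 11:32:12.871945+00:00")

remaining = eta_hours(t0, 25_672_292, t1, 83_049_411, target=151_000_000)
print(f"about {remaining:.0f} hours")
```

With these inputs the extrapolation lands in the same ballpark as the 18 hours quoted above (roughly 20 hours); the difference is unsurprising given that the target is approximate and the rate is not constant.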
Comment Actions
Status update: the main part of the scheduler journal client work is done [1], with 98M origins
referenced in the cache table.
So this task is completed. The pipeline is deployed.
But we hit multiple issues, which will be tracked in at least one dedicated issue (more
if need be): T3000
[1]
    softwareheritage-scheduler=> select now(), count(*) from origin_visit_stats;
                  now              |  count
    -------------------------------+----------
     2021-01-28 08:34:40.152554+00 | 98231002