Page MenuHomeSoftware Heritage

Deploy swh-scheduler v0.22.0
Closed, MigratedEdits Locked

Description

v0.21.0 : To fix the conflict issue when duplicated origins are inserted in the same batch
v0.22.0: To fix the datetime overflow relative to the next_visit_queue_position field

Event Timeline

vsellier changed the task status from Open to Work in Progress.Dec 7 2021, 8:59 AM
vsellier triaged this task as High priority.
vsellier created this task.

The problem is reproduced in staging before the deployment

swhworker@worker1:~$ swh lister -C /etc/softwareheritage/lister.yml run -l npm 
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.15.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/lister/cli.py", line 65, in run
    get_lister(lister, **config).run()
  File "/usr/lib/python3/dist-packages/swh/lister/pattern.py", line 130, in run
    full_stats.origins += self.send_origins(origins)
  File "/usr/lib/python3/dist-packages/swh/lister/pattern.py", line 234, in send_origins
    ret = self.scheduler.record_listed_origins(batch_origins)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 181, in meth_
    return self.post(meth._endpoint_path, post_data)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 278, in post
    return self._decode_response(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 354, in _decode_response
    self.raise_for_status(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 344, in raise_for_status
    raise exception from None
swh.core.api.RemoteException: <RemoteException 500 CardinalityViolation: ['ON CONFLICT DO UPDATE command cannot affect row a second time\nHINT:  Ensure that no rows proposed for insertion within the same command have duplicate constrained values.\n']>

Version v0.21.0 deployed in staging:

root@scheduler0:~# apt list --upgradable 2>/dev/null | grep swh | cut -f1 -d'/' | xargs -t apt install
apt install python3-swh.core python3-swh.counters python3-swh.journal python3-swh.lister python3-swh.loader.core python3-swh.model python3-swh.objstorage python3-swh.scheduler python3-swh.storage 
...
root@scheduler0:~# systemctl reload gunicorn-swh-scheduler.service

Loading test is ok, the process was oom killed but it run far longer than previously:

swhworker@worker1:~$ /usr/bin/time swh lister -C /etc/softwareheritage/lister.yml run -l npm                                                                     
Command terminated by signal 9
407.67user 81.62system 3:13:08elapsed 4%CPU (0avgtext+0avgdata 8394852maxresident)k
36216inputs+0outputs (240major+11017567minor)pagefaults 0swaps
vsellier renamed this task from Deploy swh-scheduler v0.21.0 to Deploy swh-scheduler v0.22.0.Dec 7 2021, 6:43 PM
vsellier updated the task description. (Show Details)

Deployment of the version v0.22.0 in staging

Current database version:

swh-scheduler=> \conninfo
You are connected to database "swh-scheduler" as user "guest" on host "db1.internal.staging.swh.network" (address "192.168.130.11") at port "5432".
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
swh-scheduler=> select * from dbversion order by version desc limit 1;
 version |            release            |   description    
---------+-------------------------------+------------------
      30 | 2021-08-06 09:24:17.902284+00 | Work In Progress
(1 row)
  • Preparing the migration (stop puppet, swh-scheduler-journal-client.service, swh-scheduler-schedule-recurrent.service)

From irc:

20:12:27     +olasd | vsellier: you'll want swh-scheduler-journal-client.service and swh-scheduler-schedule-recurrent.service stopped for the migration
root@scheduler0:~# puppet agent --disable "T3773"

root@scheduler0:~# systemctl stop swh-scheduler-journal-client
root@scheduler0:~# systemctl stop swh-scheduler-schedule-recurrent

root@scheduler0:~# systemctl status swh-scheduler-journal-client
● swh-scheduler-journal-client.service - Software Heritage Scheduler Journal Client
     Loaded: loaded (/etc/systemd/system/swh-scheduler-journal-client.service; enabled; vendor preset: enabled)
     Active: inactive (dead) (Result: exit-code) since Wed 2021-12-08 10:19:16 UTC; 15s ago
    Process: 3982600 ExecStart=/usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/journal-client.yml journal-client (code=exited, status=1/FAILU
   Main PID: 3982600 (code=exited, status=1/FAILURE)
        CPU: 805ms
Dec 08 10:18:45 scheduler0 systemd[1]: swh-scheduler-journal-client.service: Failed with result 'exit-code'.
Dec 08 10:19:16 scheduler0 systemd[1]: Stopped Software Heritage Scheduler Journal Client.
● swh-scheduler-journal-client.service - Software Heritage Scheduler Journal Client
     Loaded: loaded (/etc/systemd/system/swh-scheduler-journal-client.service; enabled; vendor preset: enabled)
     Active: inactive (dead) (Result: exit-code) since Wed 2021-12-08 10:19:16 UTC; 15s ago
    Process: 3982600 ExecStart=/usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/journal-client.yml journal-client (code=exited, status=1/FAILU
   Main PID: 3982600 (code=exited, status=1/FAILURE)
        CPU: 805ms

Dec 08 10:18:45 scheduler0 systemd[1]: swh-scheduler-journal-client.service: Failed with result 'exit-code'.
Dec 08 10:19:16 scheduler0 systemd[1]: Stopped Software Heritage Scheduler Journal Client.

root@scheduler0:~# systemctl status swh-scheduler-schedule-recurrent.service 
● swh-scheduler-schedule-recurrent.service - Software Heritage scheduler for recurrent visits
     Loaded: loaded (/etc/systemd/system/swh-scheduler-schedule-recurrent.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Wed 2021-12-08 10:19:24 UTC; 17s ago
    Process: 3826024 ExecStart=/usr/bin/swh --log-level INFO scheduler --config-file /etc/softwareheritage/scheduler/listener-runner.yml schedule-recurrent (code=
   Main PID: 3826024 (code=exited, status=0/SUCCESS)
        CPU: 2min 4.694s

Dec 08 10:07:29 scheduler0 swh[3826024]: INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type git with policy already_visited_order_by_l
Dec 08 10:07:29 scheduler0 swh[3826024]: INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type git with policy never_visited_oldest_updat
Dec 08 10:07:29 scheduler0 swh[3826024]: INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type git with policy origins_without_last_updat
Dec 08 10:07:29 scheduler0 swh[3826024]: INFO:swh.scheduler.celery_backend.recurrent_visits:git: 125 visits scheduled in queue swh.loader.git.tasks.UpdateGitRepos
Dec 08 10:19:24 scheduler0 systemd[1]: Stopping Software Heritage scheduler for recurrent visits...
Dec 08 10:19:24 scheduler0 swh[3826024]: INFO:swh.scheduler.celery_backend.recurrent_visits:Termination requested...
Dec 08 10:19:24 scheduler0 swh[3826024]: INFO:swh.scheduler.celery_backend.recurrent_visits:Terminating visit scheduling threads: archive-files, cran, deb, deb-pa
Dec 08 10:19:24 scheduler0 systemd[1]: swh-scheduler-schedule-recurrent.service: Succeeded.
Dec 08 10:19:24 scheduler0 systemd[1]: Stopped Software Heritage scheduler for recurrent visits.
Dec 08 10:19:24 scheduler0 systemd[1]: swh-scheduler-schedule-recurrent.service: Consumed 2min 4.694s CPU time.
  • upgrade the packages
root@scheduler0:~# apt list --upgradable 2>/dev/null | grep python3-swh
python3-swh.loader.core/unknown 2.0.0-1~swh1~bpo10+1 all [upgradable from: 1.2.1-1~swh1~bpo10+1]
python3-swh.scheduler/unknown 0.22.0-1~swh1~bpo10+1 all [upgradable from: 0.21.0-1~swh1~bpo10+1]
root@scheduler0:~# apt install python3-swh.scheduler python3-swh.loader.core
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages will be upgraded:
  python3-swh.loader.core python3-swh.scheduler
2 upgraded, 0 newly installed, 0 to remove and 7 not upgraded.
Need to get 4940 kB of archives.
After this operation, 74.8 kB of additional disk space will be used.
Get:1 https://debian.softwareheritage.org buster-swh/main amd64 python3-swh.scheduler all 0.22.0-1~swh1~bpo10+1 [85.1 kB]
Get:2 https://debian.softwareheritage.org buster-swh/main amd64 python3-swh.loader.core all 2.0.0-1~swh1~bpo10+1 [4855 kB]
Fetched 4940 kB in 0s (45.0 MB/s)             
Reading changelogs... Done
(Reading database ... 78138 files and directories currently installed.)
Preparing to unpack .../python3-swh.scheduler_0.22.0-1~swh1~bpo10+1_all.deb ...
Unpacking python3-swh.scheduler (0.22.0-1~swh1~bpo10+1) over (0.21.0-1~swh1~bpo10+1) ...
Preparing to unpack .../python3-swh.loader.core_2.0.0-1~swh1~bpo10+1_all.deb ...
Unpacking python3-swh.loader.core (2.0.0-1~swh1~bpo10+1) over (1.2.1-1~swh1~bpo10+1) ...
Setting up python3-swh.scheduler (0.22.0-1~swh1~bpo10+1) ...
Setting up python3-swh.loader.core (2.0.0-1~swh1~bpo10+1) ...

# restart the services not impacted by the database migration
root@scheduler0:~# systemctl restart swh-scheduler-runner gunicorn-swh-scheduler.service swh-scheduler-listener.service swh-scheduler-runner-priority.service
root@scheduler0:~# systemctl status swh-scheduler-runner gunicorn-swh-scheduler.service swh-scheduler-listener.service swh-scheduler-runner-priority.service
<they are all running>
  • Database migration to v31
root@db1:~/swh-scheduler/sql/updates# git pull
Already up to date.
root@db1:~/swh-scheduler/sql/updates# cp 31.sql /tmp
root@db1:~/swh-scheduler/sql/updates# sudo -u postgres psql swh-scheduler
could not change directory to "/root/swh-scheduler/sql/updates": Permission denied
psql (13.4 (Debian 13.4-1.pgdg100+1), server 12.8 (Debian 12.8-1.pgdg100+1))
Type "help" for help.

swh-scheduler=# \timing
Timing is on.
swh-scheduler=# \i /tmp/31.sql 
INSERT 0 1
Time: 19.725 ms
ALTER TABLE
Time: 38065.123 ms (00:38.065)
COMMENT
Time: 65.156 ms
ALTER TABLE
Time: 95.791 ms
swh-scheduler=# \d origin_visit_stats
                           Table "public.origin_visit_stats"
          Column           |           Type           | Collation | Nullable | Default 
---------------------------+--------------------------+-----------+----------+---------
 url                       | text                     |           | not null | 
 visit_type                | text                     |           | not null | 
 last_snapshot             | bytea                    |           |          | 
 last_scheduled            | timestamp with time zone |           |          | 
 next_visit_queue_position | bigint                   |           |          | 
 next_position_offset      | integer                  |           | not null | 4
 successive_visits         | integer                  |           | not null | 1
 last_successful           | timestamp with time zone |           |          | 
 last_visit                | timestamp with time zone |           |          | 
 last_visit_status         | last_visit_status        |           |          | 
Indexes:
    "origin_visit_stats_pkey" PRIMARY KEY, btree (url, visit_type)
  • Restarting the services
root@scheduler0:~# systemctl start swh-scheduler-journal-client
root@scheduler0:~# systemctl status swh-scheduler-journal-client
● swh-scheduler-journal-client.service - Software Heritage Scheduler Journal Client
     Loaded: loaded (/etc/systemd/system/swh-scheduler-journal-client.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2021-12-08 10:48:33 UTC; 34s ago
   Main PID: 3986437 (swh)
      Tasks: 6 (limit: 9537)
     Memory: 46.3M
        CPU: 4.867s
     CGroup: /system.slice/swh-scheduler-journal-client.service
             └─3986437 /usr/bin/python3 /usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/journal-client.yml journal-client

Dec 08 10:48:33 scheduler0 systemd[1]: Started Software Heritage Scheduler Journal Client.



root@scheduler0:~# systemctl start swh-scheduler-schedule-recurrent.service
root@scheduler0:~# systemctl status swh-scheduler-schedule-recurrent.service
● swh-scheduler-schedule-recurrent.service - Software Heritage scheduler for recurrent visits
     Loaded: loaded (/etc/systemd/system/swh-scheduler-schedule-recurrent.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2021-12-08 10:50:13 UTC; 16s ago
   Main PID: 3986611 (swh)
      Tasks: 15 (limit: 9537)
     Memory: 58.0M
        CPU: 1.297s
     CGroup: /system.slice/swh-scheduler-schedule-recurrent.service
             └─3986611 /usr/bin/python3 /usr/bin/swh --log-level INFO scheduler --config-file /etc/softwareheritage/scheduler/listener-runner.yml schedule-recurre

Dec 08 10:50:23 scheduler0 swh[3986611]: INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type svn with policy origins_without_last_updat
Dec 08 10:50:23 scheduler0 swh[3986611]: INFO:swh.scheduler.celery_backend.recurrent_visits:svn: 27 visits scheduled in queue swh.loader.svn.tasks.DumpMountAndLoa
Dec 08 10:50:27 scheduler0 swh[3986611]: INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type git with policy already_visited_order_by_l
Dec 08 10:50:27 scheduler0 swh[3986611]: INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type git with policy never_visited_oldest_updat
Dec 08 10:50:27 scheduler0 swh[3986611]: INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type git with policy origins_without_last_updat
Dec 08 10:50:27 scheduler0 swh[3986611]: INFO:swh.scheduler.celery_backend.recurrent_visits:git: 149 visits scheduled in queue swh.loader.git.tasks.UpdateGitRepos
Dec 08 10:50:29 scheduler0 swh[3986611]: INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type svn with policy already_visited_order_by_l
Dec 08 10:50:29 scheduler0 swh[3986611]: INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type svn with policy never_visited_oldest_updat
Dec 08 10:50:29 scheduler0 swh[3986611]: INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type svn with policy origins_without_last_updat
Dec 08 10:50:30 scheduler0 swh[3986611]: INFO:swh.scheduler.celery_backend.recurrent_visits:svn: 26 visits scheduled in queue swh.loader.svn.tasks.DumpMountAndLoa

deployment of version v0.22.0 in production

  • database migration preparation
root@belvedere:~/swh-scheduler/sql/updates# git pull
Already up to date.
root@belvedere:~/swh-scheduler/sql/updates# cp 31.sql /tmp
root@belvedere:~/swh-scheduler/sql/updates# sudo -u postgres psql -p 5434 softwareheritage-scheduler
could not change directory to "/root/swh-scheduler/sql/updates": Permission denied
psql (12.8 (Debian 12.8-1.pgdg100+1))
Type "help" for help.

softwareheritage-scheduler=# \conninfo
You are connected to database "softwareheritage-scheduler" as user "postgres" via socket in "/var/run/postgresql" at port "5434".
softwareheritage-scheduler=# select * from dbversion order by version desc limit 1;
 version |            release            |   description    
---------+-------------------------------+------------------
      30 | 2021-08-12 06:25:11.755147+00 | Work In Progress
(1 row)
  • deployment on saatchi
    • Stop the schedulers running on the tmux session
    • stop puppet and the impacted services
root@saatchi:~# puppet agent --disable "T3773"
root@saatchi:~# systemctl stop swh-scheduler-journal-client
root@saatchi:~# systemctl stop swh-scheduler-schedule-recurrent
<finished in timeout but effectively stopped>
  • check the upgradable packages, there is lot of package to upgrade
root@saatchi:~# apt list --upgradable 2>/dev/null | grep python3-swh
python3-swh.core/unknown 1.0.0-2~swh1~bpo10+1 all [upgradable from: 0.15.0-1~swh1~bpo10+1]
python3-swh.counters/unknown 0.9.0-1~swh1~bpo10+1 all [upgradable from: 0.8.0-1~swh1~bpo10+1]
python3-swh.deposit.client/unknown 0.16.0-1~swh1~bpo10+1 all [upgradable from: 0.14.3-1~swh1~bpo10+1]
python3-swh.deposit.loader/unknown 0.16.0-1~swh1~bpo10+1 all [upgradable from: 0.14.3-1~swh1~bpo10+1]
python3-swh.journal/unknown 0.9.1-1~swh1~bpo10+1 all [upgradable from: 0.8.0-1~swh1~bpo10+1]
python3-swh.lister/unknown 2.6.1-1~swh1~bpo10+1 all [upgradable from: 2.5.0-1~swh1~bpo10+1]
python3-swh.loader.core/unknown 2.0.0-1~swh1~bpo10+1 all [upgradable from: 1.2.0-1~swh1~bpo10+1]
python3-swh.loader.git/unknown 1.2.0-1~swh1~bpo10+1 all [upgradable from: 1.0.1-1~swh1~bpo10+1]
python3-swh.loader.mercurial/unknown 3.0.0-1~swh1~bpo10+1 all [upgradable from: 2.3.1-1~swh1~bpo10+1]
python3-swh.loader.svn/unknown 0.10.2-1~swh1~bpo10+1 all [upgradable from: 0.7.1-1~swh1~bpo10+1]
python3-swh.model/unknown 3.1.0-1~swh1~bpo10+1 all [upgradable from: 3.0.0-1~swh1~bpo10+1]
python3-swh.objstorage/unknown 0.3.0-1~swh2~bpo10+1 all [upgradable from: 0.2.3-1~swh1~bpo10+1]
python3-swh.scheduler/unknown 0.22.0-1~swh1~bpo10+1 all [upgradable from: 0.20.0-1~swh1~bpo10+1]
python3-swh.storage/unknown 0.40.0-2~swh1~bpo10+1 all [upgradable from: 0.37.0-1~swh2~bpo10+1]
python3-swh.vault/unknown 1.3.0-1~swh1~bpo10+1 all [upgradable from: 1.2.0-1~swh1~bpo10+1]
  • upgrade the packages
root@saatchi:~# apt list --upgradable 2>/dev/null | grep python3-swh | cut -f1 -d"/" | xargs -t apt install
apt install python3-swh.core python3-swh.counters python3-swh.deposit.client python3-swh.deposit.loader python3-swh.journal python3-swh.lister python3-swh.loader.core python3-swh.loader.git python3-swh.loader.mercurial python3-swh.loader.svn python3-swh.model python3-swh.objstorage python3-swh.scheduler python3-swh.storage python3-swh.vault
  • restart the services
root@saatchi:~# systemctl restart swh-scheduler-runner gunicorn-swh-scheduler.service swh-scheduler-listener.service swh-scheduler-runner-priority.service
root@saatchi:~# systemctl status swh-scheduler-runner gunicorn-swh-scheduler.service swh-scheduler-listener.service swh-scheduler-runner-priority.service
<they are all running>
  • database migration to v31
softwareheritage-scheduler=# \timing
Timing is on.
softwareheritage-scheduler=# \i /tmp/31.sql 
INSERT 0 1
Time: 4.983 ms
ALTER TABLE
Time: 2346962.104 ms (39:06.962)
COMMENT
Time: 7.763 ms
ALTER TABLE
Time: 49.457 ms
  • restarting services and puppet
root@saatchi:~# systemctl start swh-scheduler-journal-client
root@saatchi:~# systemctl start swh-scheduler-schedule-recurrent

root@saatchi:~# systemctl status swh-scheduler-journal-client
* swh-scheduler-journal-client.service - Software Heritage Scheduler Journal Client
     Loaded: loaded (/etc/systemd/system/swh-scheduler-journal-client.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2021-12-08 13:14:44 UTC; 30s ago
   Main PID: 997137 (swh)
      Tasks: 8 (limit: 24037)
     Memory: 46.7M
     CGroup: /system.slice/swh-scheduler-journal-client.service
             `-997137 /usr/bin/python3 /usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/journal-client.yml journal-client

Dec 08 13:14:44 saatchi systemd[1]: Started Software Heritage Scheduler Journal Client.


root@saatchi:~# systemctl status swh-scheduler-schedule-recurrent
* swh-scheduler-schedule-recurrent.service - Software Heritage scheduler for recurrent visits
     Loaded: loaded (/etc/systemd/system/swh-scheduler-schedule-recurrent.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2021-12-08 13:15:04 UTC; 21s ago
   Main PID: 997182 (swh)
      Tasks: 18 (limit: 24037)
     Memory: 71.9M
     CGroup: /system.slice/swh-scheduler-schedule-recurrent.service
             `-997182 /usr/bin/python3 /usr/bin/swh --log-level INFO scheduler --config-file /etc/softwareheritage/scheduler/listener-runner.yml schedule-recurren

Dec 08 13:15:04 saatchi systemd[1]: Started Software Heritage scheduler for recurrent visits.

root@saatchi:~# puppet agent --enable
  • schedulers running on a tmux restarted
vsellier moved this task from Backlog to done on the System administration board.