Reduce Azure cost: switch workers to 'b2ms' VMs (the current 'ds2v2' VMs are underused and costly)
Plan:
- Reasoning: https://hedgedoc.softwareheritage.org/0_eK1R3iSFmMWxwHDQfqOw?edit
- Provision vault-worker[01-02] as b2ms (terraform)
- Decommission worker13
- Check that the vault workers are doing their job [1]
- Decommission worker[11-12]
- Adapt puppet manifest to the fqdn changes ^ and deploy
- Provision indexer-worker[01-02] as b2ms (terraform)
- Check everything is fine ^ (a firewall rule must be edited to allow the connection)
- Decommission ds2v2 worker[07-10]
- Provision indexer-worker[03-06] as b2ms (terraform)
- Decommission remaining ds2v2 worker[03-06]
- Update firewall rule + alias
- Update inventory with vms and network interfaces according to ^
- Kept worker[01-02] for now (so they finish their current job consuming old queue messages) [2]
- Clean up old oneshot tasks related to ^ [4]
Note:
- This concerns the worker*.euwest.azure nodes
- Decommissioning means deleting the node, then removing references to it from the puppet master, then updating the inventory
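The bracket notation used in the plan (e.g. worker[07-10]) can be expanded into individual node names before decommissioning. This is a minimal sketch, not part of the actual tooling: the helper name `expand_hosts` is hypothetical, and the domain suffix is an assumption taken from the vault-worker FQDNs visible in the logs of [1].

```python
from __future__ import annotations

import re

# Assumption: old workers share the suffix seen on the vault-worker FQDNs in [1].
DOMAIN = "euwest.azure.internal.softwareheritage.org"

# Matches names such as "worker[07-10]" or "vault-worker[01-02]".
_RANGE = re.compile(r"^(?P<prefix>[\w-]+)\[(?P<start>\d+)-(?P<end>\d+)\]$")


def expand_hosts(pattern: str, domain: str = DOMAIN) -> list[str]:
    """Expand a bracket range into FQDNs; a plain name yields a single FQDN."""
    match = _RANGE.match(pattern)
    if match is None:
        return [f"{pattern}.{domain}"]
    start, end = match["start"], match["end"]
    width = len(start)  # preserve zero padding, e.g. "07"
    return [
        f"{match['prefix']}{n:0{width}d}.{domain}"
        for n in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    # Decommission order taken from the plan above.
    for pattern in ("worker13", "worker[11-12]", "worker[07-10]", "worker[03-06]"):
        for fqdn in expand_hosts(pattern):
            print(fqdn)
```

The resulting list can then be checked off against the three decommission steps (delete the node, clean the puppet master, update the inventory).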
[1]
Jul 18 14:45:59 vault-worker01 python3[2648]: [2022-07-18 14:45:59,239: INFO/MainProcess] vault_cooker@vault-worker01.euwest.azure.internal.softwareheritage.org ready.
Jul 18 14:58:49 vault-worker01 python3[2648]: [2022-07-18 14:58:49,852: INFO/MainProcess] Received task: swh.vault.cooking_tasks.SWHCookingTask[a3c95ae7-4256-4231-bca7-d3224a9149ce]
Jul 18 14:58:54 vault-worker01 python3[2670]: [2022-07-18 14:58:54,821: INFO/ForkPoolWorker-16] Task swh.vault.cooking_tasks.SWHCookingTask[a3c95ae7-4256-4231-bca7-d3224a9149ce] succeeded in 4.852631129999963s: None
Jul 18 15:01:58 vault-worker02 python3[617]: [2022-07-18 15:01:58,023: INFO/MainProcess] Connected to amqp://swhconsumer:**@rabbitmq:5672//
Jul 18 15:01:58 vault-worker02 python3[617]: [2022-07-18 15:01:58,293: INFO/MainProcess] vault_cooker@vault-worker02.euwest.azure.internal.softwareheritage.org ready.
Jul 18 15:02:59 vault-worker02 python3[617]: [2022-07-18 15:02:59,734: INFO/MainProcess] Received task: swh.vault.cooking_tasks.SWHCookingTask[e3649dcc-9d53-4d88-8245-2543e97d584a]
Jul 18 15:03:19 vault-worker02 python3[997]: [2022-07-18 15:03:19,915: INFO/ForkPoolWorker-16] Task swh.vault.cooking_tasks.SWHCookingTask[e3649dcc-9d53-4d88-8245-2543e97d584a] succeeded in 20.0749026s: None
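The check in [1] (each received cooking task is later reported as succeeded on the same host) can be sketched as a small log scan. This is only an illustration under stated assumptions: the regexes are tailored to the journal lines shown above, and the names `summarize` and `pending` are hypothetical helpers, not part of swh or Celery.

```python
import re
from collections import defaultdict

# Patterns fitted to the journal lines in [1]; other log formats would need adjusting.
_PREFIX = r"^\w{3} +\d+ [\d:]+ (?P<host>\S+) .*"
RECEIVED = re.compile(_PREFIX + r"Received task: [\w.]+\[(?P<id>[0-9a-f-]+)\]")
SUCCEEDED = re.compile(_PREFIX + r"Task [\w.]+\[(?P<id>[0-9a-f-]+)\] succeeded in [\d.]+s")

# Two lines per host, copied from [1].
SAMPLE = """\
Jul 18 14:58:49 vault-worker01 python3[2648]: [2022-07-18 14:58:49,852: INFO/MainProcess] Received task: swh.vault.cooking_tasks.SWHCookingTask[a3c95ae7-4256-4231-bca7-d3224a9149ce]
Jul 18 14:58:54 vault-worker01 python3[2670]: [2022-07-18 14:58:54,821: INFO/ForkPoolWorker-16] Task swh.vault.cooking_tasks.SWHCookingTask[a3c95ae7-4256-4231-bca7-d3224a9149ce] succeeded in 4.852631129999963s: None
Jul 18 15:02:59 vault-worker02 python3[617]: [2022-07-18 15:02:59,734: INFO/MainProcess] Received task: swh.vault.cooking_tasks.SWHCookingTask[e3649dcc-9d53-4d88-8245-2543e97d584a]
Jul 18 15:03:19 vault-worker02 python3[997]: [2022-07-18 15:03:19,915: INFO/ForkPoolWorker-16] Task swh.vault.cooking_tasks.SWHCookingTask[e3649dcc-9d53-4d88-8245-2543e97d584a] succeeded in 20.0749026s: None
""".splitlines()


def summarize(lines):
    """Collect received and succeeded task ids per host."""
    summary = defaultdict(lambda: {"received": set(), "succeeded": set()})
    for line in lines:
        for key, pattern in (("received", RECEIVED), ("succeeded", SUCCEEDED)):
            match = pattern.search(line)
            if match:
                summary[match["host"]][key].add(match["id"])
    return dict(summary)


def pending(summary):
    """Task ids received but not (yet) reported as succeeded, per host."""
    return {host: s["received"] - s["succeeded"] for host, s in summary.items()}


if __name__ == "__main__":
    for host, task_ids in sorted(pending(summarize(SAMPLE)).items()):
        print(host, "pending:", sorted(task_ids) or "none")
```

A non-empty pending set is not necessarily a failure, since a task may simply still be running when the log is read.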
[2] There is too much lag, and it would take a long time to subside with only 2 VMs.
Instead, since the new VMs work on the reset topics and will cover the missing data [3],
we can simply scrap those in the end.
[3] T4282#88364
[4]
11:50:47 softwareheritage-scheduler@belvedere:5432=> select now(), status, count(*) from task where type = 'index-origin-metadata' group by status;
+-------------------------------+------------------------+---------+
|              now              |         status         |  count  |
+-------------------------------+------------------------+---------+
| 2022-07-19 09:50:55.403248+00 | next_run_not_scheduled | 9802941 |
| 2022-07-19 09:50:55.403248+00 | next_run_scheduled     |    5263 |
| 2022-07-19 09:50:55.403248+00 | completed              | 3225591 |
| 2022-07-19 09:50:55.403248+00 | disabled               |    5736 |
+-------------------------------+------------------------+---------+
(4 rows)
Time: 27451.213 ms (00:27.451)

softwareheritage-scheduler=# update task set status='disabled' where type = 'index-origin-metadata' and status in ('next_run_scheduled', 'next_run_not_scheduled');
UPDATE 9808204

12:28:16 softwareheritage-scheduler@belvedere:5432=> select now(), status, count(*) from task where type = 'index-origin-metadata' group by status;
+-------------------------------+-----------+---------+
|              now              |  status   |  count  |
+-------------------------------+-----------+---------+
| 2022-07-19 10:28:26.489037+00 | completed | 3225591 |
| 2022-07-19 10:28:26.489037+00 | disabled  | 9813940 |
+-------------------------------+-----------+---------+
(2 rows)
Time: 32793.481 ms (00:32.793)
(ongoing ^)
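The counts in [4] can be cross-checked with a little arithmetic: the number of rows touched by the UPDATE should equal the two scheduled statuses combined, and the total number of tasks should be unchanged. A minimal sketch using the figures from the psql session (the function name `check_disable` is hypothetical):

```python
# Figures copied from the psql session in [4].
BEFORE = {
    "next_run_not_scheduled": 9802941,
    "next_run_scheduled": 5263,
    "completed": 3225591,
    "disabled": 5736,
}
AFTER = {"completed": 3225591, "disabled": 9813940}
UPDATED_ROWS = 9808204  # from "UPDATE 9808204"
TARGETS = ("next_run_scheduled", "next_run_not_scheduled")


def check_disable(before, after, updated_rows, targets=TARGETS):
    """Return a list of inconsistencies; an empty list means the counts add up."""
    problems = []
    moved = sum(before.get(status, 0) for status in targets)
    if moved != updated_rows:
        problems.append(f"UPDATE touched {updated_rows} rows, expected {moved}")
    expected_disabled = before.get("disabled", 0) + moved
    if after.get("disabled", 0) != expected_disabled:
        problems.append(
            f"disabled is {after.get('disabled', 0)}, expected {expected_disabled}"
        )
    if sum(before.values()) != sum(after.values()):
        problems.append("total number of tasks changed")
    leftover = [status for status in targets if after.get(status, 0)]
    if leftover:
        problems.append(f"statuses still present: {leftover}")
    return problems


if __name__ == "__main__":
    issues = check_disable(BEFORE, AFTER, UPDATED_ROWS)
    print("consistent" if not issues else "\n".join(issues))
```

With the numbers above, 9802941 + 5263 = 9808204 matches the UPDATE count, and 5736 + 9808204 = 9813940 matches the disabled count afterwards.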