
Deploy new origin intrinsic metadata journal client indexer > v1.1
Closed, Resolved · Public

Description

Plan, from diff [1]:

  • scheduler0.staging, workers.staging: Stop puppet [2]
  • scheduler0.staging: Stop the old journal client [3]
  • workers.staging: Wait for all tasks to finish
  • Stop swh-worker@indexer_origin_intrinsic_metadata [4]
  • D7928: Rework puppet manifest to drop old services + update indexer service (as journal client)
  • scheduler-nodes: Clean up old service (previous diff ^ does it)
  • Remove celery workers and queues (point and click in the RabbitMQ UI on scheduler0.staging [5])
  • pergamon: Deploy diff [6]
  • scheduler0.staging: Apply puppet change (drop old journal client service)
  • workers.staging: Upgrade python3-swh.indexer to v1.1.0
  • P1370: Issue with that version [7] ^
  • w/ vlorentz: Package new python3-swh.indexer to v1.2.0
  • workers.staging: Upgrade python3-swh.indexer to v1.2.0
  • Unstick the next problem (the configuration is now wrong) [9]
  • workers.staging: Apply puppet change (drop old service, deploy new journal client service) [10]
  • T4282#86233: Backing out: it's not ready, so reverting the current deployment
  • Blocked by T4274
  • D7951: Actual deployment when it's ready
  • Follow journal consumption (from current offsets) [12]
  • T4282#88364: Reindex everything from scratch (reset offsets [11])
  • Follow journal consumption [12]

[1] D7899

[2]

root@pergamon:~# clush -b -w scheduler0.internal.staging.swh.network -w @staging-workers 'puppet agent --disable "T4282: Migrate to origin intrinsic meta indexer as journal client"'

[3]

root@pergamon:~# clush -b -w scheduler0.internal.staging.swh.network systemctl stop swh-indexer-journal-client.service

[4]

root@pergamon:~# clush -b -w @staging-workers systemctl stop swh-worker@indexer_origin_intrinsic_metadata.service

[5] http://scheduler0.internal.staging.swh.network:15672/#/queues
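
The queue removal in [5] was done by hand in the management UI; the same can be scripted with `rabbitmqadmin` (shipped with the rabbitmq management plugin). A minimal sketch only: the queue name below is a placeholder, not the actual one — list the real queues first with `rabbitmqadmin list queues`:

```shell
#!/bin/sh
# Sketch: delete obsolete indexer queues via rabbitmqadmin instead of the
# web UI. Defaults to a dry run that only prints the command it would run.
RABBITMQ_HOST=${RABBITMQ_HOST:-scheduler0.internal.staging.swh.network}
DRY_RUN=${DRY_RUN:-1}

delete_queue() {
    # $1: queue name; in dry-run mode, echo the command instead of running it
    cmd="rabbitmqadmin --host=$RABBITMQ_HOST delete queue name=$1"
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $cmd"
    else
        $cmd
    fi
}

delete_queue swh.indexer.tasks.OriginMetadata   # placeholder queue name
```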

[6]

root@pergamon:~# /usr/local/bin/deploy.sh
HEAD is now at eff3f30 Add snyk-stg-01 credentials
Already up to date.
HEAD is now at eff3f30 Add snyk-stg-01 credentials
Already up to date.

[7]

root@scheduler0:~# puppet agent --enable; puppet agent --test
...  # it passed anyway and applied the changes, beyond me why ¯\_(ツ)_/¯
root@scheduler0:~# systemctl list-units | grep swh-indexer-journal-client.service
root@scheduler0:~# # no longer present here ^, unlike on saatchi (prod, [8])

[8] prod (untouched for now)

root@saatchi:~# systemctl list-units | grep swh-indexer-journal-client.service
  swh-indexer-journal-client.service                                                          loaded active running   Software Heritage Indexer Journal Client

[9]

swhworker@worker0:~$ /usr/bin/swh indexer --config-file $SWH_CONFIG_FILENAME journal-client '*'
Traceback (most recent call last):
  File "/usr/bin/swh", line 33, in <module>
    sys.exit(load_entry_point('swh.core==2.8.0', 'console_scripts', 'swh')())
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 184, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/indexer/cli.py", line 310, in journal_client
    idx = OriginMetadataIndexer()
  File "/usr/lib/python3/dist-packages/swh/indexer/metadata.py", line 325, in __init__
    self.revision_metadata_indexer = RevisionMetadataIndexer(config=config)
  File "/usr/lib/python3/dist-packages/swh/indexer/metadata.py", line 163, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 167, in __init__
    self.check()
  File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 202, in check
    raise ValueError("Tools %s is unknown, cannot continue" % self.tools)
ValueError: Tools [] is unknown, cannot continue
swhworker@worker0:~$ /usr/bin/swh indexer --config-file $SWH_CONFIG_FILENAME journal-client '*'
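
The `ValueError: Tools [] is unknown` above is the indexer's `check()` refusing to start without a registered tool in its configuration. For reference, a hedged sketch of the kind of `tools` entry the config was missing — the name and version values here are illustrative assumptions, not the deployed ones:

```yaml
# Hypothetical excerpt of $SWH_CONFIG_FILENAME; values are illustrative only.
tools:
  name: swh-metadata-detector
  version: 0.0.2
  configuration: {}
```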

[10]

root@pergamon:~# clush -b -w @staging-workers "systemctl status swh-indexer-journal-client" | grep -c running
4

[11] The journal client can either reuse the existing group_id to avoid re-indexing, or
use a new one to reindex everything (thus also retrying old temporary failures).
Alternatively, the group's offsets on the topics can be reset to reindex everything.
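
The "reset the offsets" option from [11] maps onto a `kafka-consumer-groups.sh` invocation. A sketch that only builds the command string (the `$SERVER` default and group id follow the conventions used elsewhere in this task; verify both before running), previewing with `--dry-run` before switching to `--execute`:

```shell
#!/bin/sh
# Sketch: build (not run) the offset-reset command for the journal-client
# consumer group. $SERVER is a placeholder for the kafka bootstrap address.
SERVER=${SERVER:-kafka1.internal.staging.swh.network:9092}

build_reset_cmd() {
    # $1: consumer group id, $2: --dry-run or --execute
    echo "/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER" \
         "--group $1 --all-topics --reset-offsets --to-earliest $2"
}

# Always preview first, then rerun with --execute:
build_reset_cmd swh.indexer.journal_client --dry-run
```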

[12] https://grafana.softwareheritage.org/goto/P4UllFR4z?orgId=1


Event Timeline

ardumont triaged this task as Normal priority. May 30 2022, 1:16 PM
ardumont created this task.
ardumont renamed this task from staging: Deploy new origin intrinsic metadata journal client indexer to staging: Deploy new origin intrinsic metadata journal client indexer v1.1. May 31 2022, 3:27 PM

Should be ready to be deployed now.

ardumont changed the task status from Open to Work in Progress. Jun 1 2022, 11:56 AM
ardumont moved this task from Backlog to in-progress on the System administration board.
ardumont renamed this task from staging: Deploy new origin intrinsic metadata journal client indexer v1.1 to staging: Deploy new origin intrinsic metadata journal client indexer > v1.1. Jun 1 2022, 5:09 PM

This needs to be reverted while waiting for [1] to be resolved.
I'll attend to it tomorrow.

[1] T4274 (added as a "blocking" subtask)

Reverting:

  • Stopping and disabling journal client services [1]
  • D7950: Revert puppet manifest changes
  • scheduler0.staging: Deploy manifest changes [2]
  • workers.staging: Deploy manifest changes [3]
  • Check everything is back to normal [4]

[1]

root@pergamon:~# clush -b -w @staging-workers 'puppet agent --disable "T4282: revert deployment"; systemctl stop cron; systemctl stop swh-indexer-journal-client.service; systemctl disable swh-indexer-journal-client.service'
worker0.internal.staging.swh.network: Removed /etc/systemd/system/multi-user.target.wants/swh-indexer-journal-client.service.
worker3.internal.staging.swh.network: Removed /etc/systemd/system/multi-user.target.wants/swh-indexer-journal-client.service.
worker2.internal.staging.swh.network: Removed /etc/systemd/system/multi-user.target.wants/swh-indexer-journal-client.service.
worker1.internal.staging.swh.network: Removed /etc/systemd/system/multi-user.target.wants/swh-indexer-journal-client.service.

[2]

root@scheduler0:~# puppet agent --test
Info: Using configured environment 'staging'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for scheduler0.internal.staging.swh.network
Info: Applying configuration version '1654184968'
Notice: /Stage[main]/Profile::Swh::Deploy::Indexer_journal_client/Service[swh-indexer-journal-client]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Profile::Swh::Deploy::Indexer_journal_client/Service[swh-indexer-journal-client]: Unscheduling refresh on Service[swh-indexer-journal-client]
Notice: Applied catalog in 23.75 seconds

[3]

root@worker0:~# puppet agent --enable; puppet agent --test
Info: Using configured environment 'staging'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for worker0.internal.staging.swh.network
Info: Applying configuration version '1654184981'
Notice: /Stage[main]/Profile::Swh::Deploy::Worker::Indexer_origin_intrinsic_metadata/Profile::Swh::Deploy::Worker::Instance[indexer_origin_intrinsic_metadata]/File[/etc/softwareheritage/indexer_origin_intrinsic_metadata.yml]/ensure: defined content as '{md5}23d53fad94eb956a49b1c1ca282fbed2'
Notice: /Stage[main]/Profile::Swh::Deploy::Worker::Indexer_origin_intrinsic_metadata/Profile::Swh::Deploy::Worker::Instance[indexer_origin_intrinsic_metadata]/Service[swh-worker@indexer_origin_intrinsic_metadata]/enable: enable changed 'false' to 'true'
Notice: /Stage[main]/Profile::Swh::Deploy::Worker::Indexer_origin_intrinsic_metadata/Profile::Swh::Deploy::Worker::Instance[indexer_origin_intrinsic_metadata]/Systemd::Dropin_file[swh-worker@indexer_origin_intrinsic_metadata/parameters.conf]/File[/etc/systemd/system/swh-worker@indexer_origin_intrinsic_metadata.service.d/parameters.conf]/ensure: defined content as '{md5}ed74b358a7bbe3680dcc00ebbcaf6857'
Info: /Stage[main]/Profile::Swh::Deploy::Worker::Indexer_origin_intrinsic_metadata/Profile::Swh::Deploy::Worker::Instance[indexer_origin_intrinsic_metadata]/Systemd::Dropin_file[swh-worker@indexer_origin_intrinsic_metadata/parameters.conf]/File[/etc/systemd/system/swh-worker@indexer_origin_intrinsic_metadata.service.d/parameters.conf]: Scheduling refresh of Class[Systemd::Systemctl::Daemon_reload]
Info: Class[Systemd::Systemctl::Daemon_reload]: Scheduling refresh of Exec[systemctl-daemon-reload]
Notice: /Stage[main]/Systemd::Systemctl::Daemon_reload/Exec[systemctl-daemon-reload]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Profile::Swh::Deploy::Worker::Base/Profile::Cron::D[cleanup-workers-tmp]/Profile::Cron::File[swh-worker]/File[/etc/puppet-cron.d/swh-worker]/content:
--- /etc/puppet-cron.d/swh-worker       2022-06-01 14:21:42.041809432 +0000
+++ /tmp/puppet-file20220602-4140753-17ba10v    2022-06-02 15:50:08.093603620 +0000
@@ -8,6 +8,8 @@
 11-56/15 * * * * root chronic /usr/local/sbin/swh-worker-ping-restart indexer_content_mimetype@worker0.internal.staging.swh.network indexer_content_mimetype
 # Cron snippet swh-worker-indexer_fossology_license-autorestart
 9-54/15 * * * * root chronic /usr/local/sbin/swh-worker-ping-restart indexer_fossology_license@worker0.internal.staging.swh.network indexer_fossology_license
+# Cron snippet swh-worker-indexer_origin_intrinsic_metadata-autorestart
+13-58/15 * * * * root chronic /usr/local/sbin/swh-worker-ping-restart indexer_origin_intrinsic_metadata@worker0.internal.staging.swh.network indexer_origin_intrinsic_metadata
 # Cron snippet swh-worker-lister-autorestart
 4-49/15 * * * * root chronic /usr/local/sbin/swh-worker-ping-restart lister@worker0.internal.staging.swh.network lister
 # Cron snippet swh-worker-loader_archive-autorestart

Info: Computing checksum on file /etc/puppet-cron.d/swh-worker
Info: /Stage[main]/Profile::Swh::Deploy::Worker::Base/Profile::Cron::D[cleanup-workers-tmp]/Profile::Cron::File[swh-worker]/File[/etc/puppet-cron.d/swh-worker]: Filebucketed /etc/puppet-cron.d/swh-worker to puppet with sum 7a469799b5d9998c3994744df37d0a18
Notice: /Stage[main]/Profile::Swh::Deploy::Worker::Base/Profile::Cron::D[cleanup-workers-tmp]/Profile::Cron::File[swh-worker]/File[/etc/puppet-cron.d/swh-worker]/content: content changed '{md5}7a469799b5d9998c3994744df37d0a18' to '{md5}b2592f2c9df13676c63c223c14862287'
Notice: Applied catalog in 16.13 seconds

[4]

root@pergamon:~# clush -b -w @staging-workers 'systemctl status swh-worker@indexer_origin_intrinsic_metadata | grep "running\|succeeded"'
---------------
worker0.internal.staging.swh.network
---------------
     Active: active (running) since Thu 2022-06-02 15:51:59 UTC; 6min ago
Jun 02 15:56:40 worker0 python3[4141528]: [2022-06-02 15:56:40,635: INFO/ForkPoolWorker-4] Task swh.indexer.tasks.OriginMetadata[f1c1ad8b-403c-4638-a91e-ec2f6e38953e] succeeded in 5.756153889931738s: {'status': 'uneventful'}
Jun 02 15:57:04 worker0 python3[4141528]: [2022-06-02 15:57:04,078: INFO/ForkPoolWorker-4] Task swh.indexer.tasks.OriginMetadata[00657e08-660f-4eea-8bc0-9b13534a237d] succeeded in 17.980035815853626s: {'status': 'uneventful'}
Jun 02 15:57:14 worker0 python3[4141528]: [2022-06-02 15:57:14,163: INFO/ForkPoolWorker-4] Task swh.indexer.tasks.OriginMetadata[891afdd2-bf69-48c6-97fb-82300a4712e5] succeeded in 6.308635601773858s: {'status': 'uneventful'}
Jun 02 15:57:28 worker0 python3[4141928]: [2022-06-02 15:57:28,981: INFO/ForkPoolWorker-5] Task swh.indexer.tasks.OriginMetadata[96d89845-0c37-4949-8b3d-e0ff781d6373] succeeded in 9.877047970890999s: {'status': 'uneventful'}
Jun 02 15:57:50 worker0 python3[4141928]: [2022-06-02 15:57:50,370: INFO/ForkPoolWorker-5] Task swh.indexer.tasks.OriginMetadata[24b5850c-5d17-4949-ad7a-e239ef68166a] succeeded in 7.8305811220780015s: {'status': 'uneventful'}
---------------
worker1.internal.staging.swh.network
---------------
     Active: active (running) since Thu 2022-06-02 15:57:19 UTC; 52s ago
---------------
worker2.internal.staging.swh.network
---------------
     Active: active (running) since Thu 2022-06-02 15:57:18 UTC; 53s ago
---------------
worker3.internal.staging.swh.network
---------------
     Active: active (running) since Thu 2022-06-02 15:57:16 UTC; 55s ago
Jun 02 15:57:37 worker3 python3[3697645]: [2022-06-02 15:57:37,149: INFO/ForkPoolWorker-1] Task swh.indexer.tasks.OriginMetadata[8a3e51a3-905a-4d70-91b8-53c861c1db74] succeeded in 6.815938267856836s: {'status': 'uneventful'}
ardumont changed the task status from Work in Progress to Open. Jun 2 2022, 6:00 PM
ardumont changed the task status from Open to Work in Progress. Jul 18 2022, 6:23 PM
ardumont changed the status of subtask T4395: Migrate azure worker vms to cheaper and more efficient vms from Open to Work in Progress.
ardumont moved this task from Weekly backlog to in-progress on the System administration board.
ardumont renamed this task from staging: Deploy new origin intrinsic metadata journal client indexer > v1.1 to Deploy new origin intrinsic metadata journal client indexer > v1.1. Jul 18 2022, 6:35 PM
ardumont updated the task description.

Reindex

  • Stop journal client [1] (exporting the offsets requires the consumer group to be inactive [1'])
  • Keep current offset dump just in case [2]
  • Reset topics to earliest
  • Restart journal client

[1]

root@pergamon:~# clush -b -w @indexer-workers "systemctl stop swh-indexer-journal-client"

[1']

Error: Assignments can only be reset if the group 'swh.indexer.journal_client' is inactive, but the current state is Stable.
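
A quick way to check that precondition is the consumer-group state: it must read "Empty" rather than "Stable" once all clients are stopped. Another command-string-only sketch, using the standard `--describe --state` flags of `kafka-consumer-groups.sh` (same `$SERVER` placeholder as above):

```shell
#!/bin/sh
# Sketch: build the command that shows the consumer group state.
# "Stable" means members are still connected; it must be "Empty" before a reset.
SERVER=${SERVER:-kafka1.internal.staging.swh.network:9092}

group_state_cmd() {
    # $1: consumer group id
    echo "/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER" \
         "--describe --group $1 --state"
}

group_state_cmd swh.indexer.journal_client
```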

[2]

root@kafka1:~# /opt/kafka/bin/kafka-consumer-groups.sh   --bootstrap-server $SERVER   --reset-offsets   --all-topics   --to-current   --dry-run   --export   --group $GROUP_ID 2>&1 > indexer-journal-client-offsets-$(date +%Y%m%d-%H%M).csv
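
The dump can be sanity-checked offline before resetting anything. A sketch assuming the usual `topic,partition,offset` export format (one line per partition); the helper name and the example file name are illustrative:

```shell
#!/bin/sh
# Sketch: summarize an exported offsets CSV (topic,partition,offset) as one
# offset total per topic, to eyeball the dump before any reset.
summarize_offsets() {
    # $1: path to the exported CSV
    awk -F, '{ sum[$1] += $3 } END { for (t in sum) print t, sum[t] }' "$1"
}

# Usage (file name is illustrative):
# summarize_offsets indexer-journal-client-offsets-20220718-1830.csv
```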

Follow journal consumption [12]

Current ETA is ~1h according to the dashboard link [12].

ardumont claimed this task.
ardumont updated the task description.
ardumont moved this task from deployed/landed/monitoring to done on the System administration board.