
Deploy swh-indexer > v2.6 on staging then production
Closed, Migrated

Description

Staging:

  • Upgrade package on workers
  • Restart workers
  • Reset journal client on the swh.journal.objects.raw_extrinsic_metadata topic (for the new SWORD metadata mapping) [1]
  • Reset journal client on the swh.journal.objects.origin_visit_status topic (for the new Nuget metadata mapping by @VickyMerzOwn) [2]
  • Wait ~10 minutes to make sure they don't crash because of the refactorings

Production:

  • Upgrade package on workers
  • Restart workers so previous puppet changes can be applied
  • Reset journal client on the swh.journal.objects.raw_extrinsic_metadata topic (for the new SWORD metadata mapping)
  • Reset journal client on the swh.journal.objects.origin_visit_status topic (for the new Nuget metadata mapping by @VickyMerzOwn)

No change to swh-indexer-storage since v2.3.0.
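
For reference, a minimal sketch of the upgrade/restart steps on the workers, assuming the @indexer-workers clustershell group and the swh-indexer-journal-client@* unit naming used later in this task (adjust the node group for staging):

root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "apt-get update && apt-get -y upgrade"
root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "systemctl restart swh-indexer-journal-client@*"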

[1]

root@storage1:~# GROUP_ID=swh.indexer.journal_client.extrinsic_metadata
root@storage1:~# /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --all-topics --to-earliest --group $GROUP_ID --execute
...

[2]

root@storage1:~# GROUP_ID=swh.indexer.journal_client.origin_intrinsic_metadata
root@storage1:~# /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --all-topics --to-earliest --group $GROUP_ID --execute
...
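
To double-check that a reset took effect, the same kafka-consumer-groups.sh tool can describe the group's current offsets and lag (same $SERVER and $GROUP_ID as above):

root@storage1:~# /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --describe --group $GROUP_ID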

Event Timeline

vlorentz created this task.
ardumont changed the task status from Open to Work in Progress. Aug 29 2022, 4:35 PM
ardumont updated the task description.
ardumont moved this task from Backlog to in-progress on the System administration board.

Consumer lag has been steadily increasing since yesterday [1]. I believe the workers are hit by issue [2].
I've opened [3] to try to unstick it.

[1] https://grafana.softwareheritage.org/goto/C9lpYwZVz?orgId=1

[2] https://sentry.softwareheritage.org/share/issue/1d3de3b47c234408889bff5c4f4b0d20/

[3] D8340

ardumont raised the priority of this task from Low to Normal. Aug 30 2022, 11:36 AM

Workers refuse to upgrade to the current v2.4.3 release [1]. I had not realized that my previous
upgrade from yesterday stopped at v2.3.0.

It seems related to the new dependency version constraint on rdflib introduced recently
[2]. That rdflib version is not available on the indexer workers [3].

[1]

root@indexer-worker02:~# apt-get upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
The following packages have been kept back:
  python3-swh.indexer python3-swh.indexer.storage
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.

[2] rDCIDXb0e448e963bb57f49f7f326f557520886010f7dc

[3]

$ rmadison python3-rdflib
python3-rdflib | 4.1.2-3       | oldoldoldstable | amd64, armel, armhf, i386
python3-rdflib | 4.2.1-2       | oldoldstable    | all
python3-rdflib | 4.2.2-2       | oldstable       | all
python3-rdflib | 5.0.0-1.1     | stable          | all
python3-rdflib | 6.1.1-1       | unstable        | all
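
For the record, a quick way to surface which dependency keeps a package back is to simulate its installation; apt should then report the unmet python3-rdflib constraint. A sketch:

root@indexer-worker02:~# apt-get install -s python3-swh.indexer
root@indexer-worker02:~# apt-cache policy python3-rdflib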

Dropped the Debian constraint on python3-rdflib, triggered a rebuild, and upgraded the package
again. Also added an unconditional dependency on python3-rdflib-jsonld (not needed on the
latest Debian release, but not a blocker there either).

After that, the journal client restarted.
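
A quick verification sketch on a worker once the rebuilt packages are published (package and unit names as used elsewhere in this task):

root@indexer-worker02:~# dpkg -l python3-swh.indexer python3-rdflib python3-rdflib-jsonld
root@indexer-worker02:~# systemctl status 'swh-indexer-journal-client@*' --no-pager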

@vlorentz fixed another error directly within the model to deal with old versioned objects.
This meant new releases of swh.model and swh.indexer.

Unfortunately, now the indexer debian build is broken due to the objstorage debian build being broken...

objstorage build unstuck [1]
Triggered back the build for indexer.

[1] https://jenkins.softwareheritage.org/view/swh-debian%20(draft)/job/debian/job/packages/job/DOBJS/job/gbp-buildpackage/

Build ok for indexer as well.
https://jenkins.softwareheritage.org/view/swh-debian%20(draft)/job/debian/job/packages/job/DCIDX/job/gbp-buildpackage/

ardumont renamed this task from Deploy swh-indexer v2.4.2 on production and staging to Deploy swh-indexer v2.4 on production and staging. Aug 31 2022, 4:39 PM
ardumont renamed this task from Deploy swh-indexer v2.4 on production and staging to Deploy swh-indexer > v2.5 on production and staging. Sep 8 2022, 11:16 AM
ardumont moved this task from in-progress to Weekly backlog on the System administration board.
ardumont renamed this task from Deploy swh-indexer > v2.5 on production and staging to Deploy swh-indexer > v2.6 on staging then production. Sep 12 2022, 5:33 PM
ardumont updated the task description.

There are a few issues with the configuration of these indexer clients:

  • the traffic should not be going through the IPSec VPN. They need to use the public, authenticated kafka endpoints. The IPSec load is making all Azure communication struggle.
  • It seems that some old services on the Azure hosts have not been disabled and keep restarting because of a missing configuration file.
  • There are also a bunch of services trying to schedule tasks on the scheduler backend (and failing, because that's firewalled).

@vsellier has stopped everything to avoid getting spammed by traffic issues all night, until someone can properly investigate.

All the indexers were stopped at 20:00 (FR time) because something was consuming all the bandwidth of the VPN between Azure and our infrastructure.

root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "puppet agent --disable 'stop indexer to avoid bandwith consumption'"
root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "systemctl stop swh-indexer-journal-client@*"

I'm guessing that's the extrinsic metadata indexer; the others need to do plenty of random accesses to the storage, but that one consumes very quickly from Kafka. On the bright side, it consumes the entire topic within hours, so parallelism could be reduced as a quick fix.
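
If reducing parallelism is the quick fix, one possibility, once things are unblocked, is to restart the journal clients on only a subset of the workers, e.g. (only indexer-worker02 is mentioned earlier in this task; adjust the node list as needed):

root@pergamon:/etc/clustershell# clush -b -w indexer-worker02 "systemctl start swh-indexer-journal-client@*"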

There are a few issues with the configuration of these indexer clients:

the traffic should not be going through the IPSec VPN. They need to use the public, authenticated kafka endpoints. The IPSec load is making all Azure communication struggle.

ack, that should be "simple" enough to adapt [1]

[1] https://docs.softwareheritage.org/sysadm/mirror-operations/onboard.html?highlight=credential#how-to-create-the-journal-credentials

It seems that some old services on the Azure hosts have not been disabled and keep restarting because of a missing configuration file.

That's surprising, as those are new nodes...
One thing I can think of is that a wrong clush command started some services on all indexer nodes (even services that are not installed...).
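
A sketch of how the stale units could be spotted and cleaned up on those hosts (the actual unit names are unknown here, hence the placeholder):

root@indexer-worker02:~# systemctl list-units --state=failed,activating --no-pager
root@indexer-worker02:~# systemctl disable --now <stale-unit>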

There are also a bunch of services trying to schedule tasks on the scheduler backend (and failing, because that's firewalled).

That must be a side effect of the previous points, as the "new" indexer journal client services no longer do that.

In any case, thanks for the heads-up; I'll investigate and clean up once I get a go-ahead from @vsellier.

There are a few issues with the configuration of these indexer clients:
the traffic should not be going through the IPSec VPN. They need to use the public, authenticated kafka endpoints. The IPSec load is making all Azure communication struggle.

ack, that should be "simple" enough to adapt [1]
[1] https://docs.softwareheritage.org/sysadm/mirror-operations/onboard.html?highlight=credential#how-to-create-the-journal-credentials

As usual, not so simple, but here is the diff [2] to update our puppet manifests to allow
such a journal configuration.

[2] D8492
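
Once the credentials exist, connectivity to the public, authenticated endpoint can be sanity-checked from a worker with kcat; a sketch (broker address, SASL mechanism and credentials below are illustrative placeholders):

root@indexer-worker02:~# kcat -L -b broker1.journal.softwareheritage.org:9093 \
    -X security.protocol=SASL_SSL -X sasl.mechanisms=SCRAM-SHA-512 \
    -X sasl.username=<journal-username> -X sasl.password=<journal-password>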

ardumont claimed this task.
ardumont moved this task from deployed/landed/monitoring to done on the System administration board.