
Deploy swh-indexer > v2.6 on staging then production
Closed, Migrated

Description

Staging:

  • Upgrade package on workers
  • Restart workers
  • Reset journal client on the swh.journal.objects.raw_extrinsic_metadata topic (for the new SWORD metadata mapping) [1]
  • Reset journal client on the swh.journal.objects.origin_visit_status topic (for the new Nuget metadata mapping by @VickyMerzOwn) [2]
  • Wait ~10 minutes to make sure they don't crash because of the refactorings

Production:

  • Upgrade package on workers
  • Restart workers so previous puppet changes can be applied
  • Reset journal client on the swh.journal.objects.raw_extrinsic_metadata topic (for the new SWORD metadata mapping)
  • Reset journal client on the swh.journal.objects.origin_visit_status topic (for the new Nuget metadata mapping by @VickyMerzOwn)

No change to swh-indexer-storage since v2.3.0.
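
For reference, a minimal sketch of the upgrade/restart steps on the workers, assuming the @indexer-workers clustershell group and the swh-indexer-journal-client@* unit naming used later in this task (adjust the node group for staging):

root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "apt-get update && apt-get -y upgrade"
root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "systemctl restart swh-indexer-journal-client@*"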

[1]

root@storage1:~# GROUP_ID=swh.indexer.journal_client.extrinsic_metadata
root@storage1:~# /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --all-topics --to-earliest --group $GROUP_ID --execute
...

[2]

root@storage1:~# GROUP_ID=swh.indexer.journal_client.origin_intrinsic_metadata
root@storage1:~# /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --all-topics --to-earliest --group $GROUP_ID --execute
...
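
To double-check that a reset took effect, the same kafka-consumer-groups.sh tool can describe the group's current offsets and lag (same $SERVER and $GROUP_ID as above):

root@storage1:~# /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --describe --group $GROUP_ID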

Event Timeline

vlorentz created this task.
ardumont changed the task status from Open to Work in Progress. Aug 29 2022, 4:35 PM
ardumont updated the task description.
ardumont moved this task from Backlog to in-progress on the System administration board.

Consumer lag has been steadily increasing since yesterday [1]. I believe the workers are hit by issue [2].
I've opened [3] to try to unstick it.

[1] https://grafana.softwareheritage.org/goto/C9lpYwZVz?orgId=1

[2] https://sentry.softwareheritage.org/share/issue/1d3de3b47c234408889bff5c4f4b0d20/

[3] D8340

ardumont raised the priority of this task from Low to Normal. Aug 30 2022, 11:36 AM

Workers refuse to upgrade to the current v2.4.3 release [1]. I had not realized that my previous
upgrade from yesterday stopped at v2.3.0.

It seems related to the new dependency version constraint on rdflib introduced recently
[2]. That rdflib version is not available on the indexer workers [3].

[1]

root@indexer-worker02:~# apt-get upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
The following packages have been kept back:
  python3-swh.indexer python3-swh.indexer.storage
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.

[2] rDCIDXb0e448e963bb57f49f7f326f557520886010f7dc

[3]

$ rmadison python3-rdflib
python3-rdflib | 4.1.2-3       | oldoldoldstable | amd64, armel, armhf, i386
python3-rdflib | 4.2.1-2       | oldoldstable    | all
python3-rdflib | 4.2.2-2       | oldstable       | all
python3-rdflib | 5.0.0-1.1     | stable          | all
python3-rdflib | 6.1.1-1       | unstable        | all
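
For the record, a quick way to surface which dependency keeps a package back is to simulate its installation; apt should then report the unmet python3-rdflib constraint. A sketch:

root@indexer-worker02:~# apt-get install -s python3-swh.indexer
root@indexer-worker02:~# apt-cache policy python3-rdflib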

Dropped the Debian constraint on python3-rdflib, triggered a rebuild, and upgraded the package
again. Also added an unconditional dependency on python3-rdflib-jsonld (not needed on the
latest Debian release, but not a blocker there either).

After that, the journal client restarted.
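
A quick verification sketch on a worker once the rebuilt packages are published (package and unit names as used elsewhere in this task):

root@indexer-worker02:~# dpkg -l python3-swh.indexer python3-rdflib python3-rdflib-jsonld
root@indexer-worker02:~# systemctl status 'swh-indexer-journal-client@*' --no-pager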

@vlorentz fixed another error directly within the model to deal with old versioned objects.
This meant new releases of swh.model and swh.indexer.

Unfortunately, now the indexer debian build is broken due to the objstorage debian build being broken...

objstorage build unstuck [1]
Triggered back the build for indexer.

[1] https://jenkins.softwareheritage.org/view/swh-debian%20(draft)/job/debian/job/packages/job/DOBJS/job/gbp-buildpackage/

Build ok for indexer as well.
https://jenkins.softwareheritage.org/view/swh-debian%20(draft)/job/debian/job/packages/job/DCIDX/job/gbp-buildpackage/

ardumont renamed this task from Deploy swh-indexer v2.4.2 on production and staging to Deploy swh-indexer v2.4 on production and staging. Aug 31 2022, 4:39 PM
ardumont renamed this task from Deploy swh-indexer v2.4 on production and staging to Deploy swh-indexer > v2.5 on production and staging. Sep 8 2022, 11:16 AM
ardumont moved this task from in-progress to Weekly backlog on the System administration board.
ardumont renamed this task from Deploy swh-indexer > v2.5 on production and staging to Deploy swh-indexer > v2.6 on staging then production. Sep 12 2022, 5:33 PM
ardumont updated the task description.

There are a few issues with the configuration of these indexer clients:

  • the traffic should not be going through the IPSec VPN. They need to use the public, authenticated kafka endpoints. The IPSec load is making all Azure communication struggle.
  • It seems that some old services on the Azure hosts have not been disabled and keep restarting because of a missing configuration file.
  • There are also a bunch of services trying to schedule tasks on the scheduler backend (and failing, because that's firewalled).

@vsellier has stopped everything to avoid getting spammed by traffic issues all night, until someone can properly investigate.

All the indexers were stopped at 20:00 (FR time) because something was consuming all the bandwidth of the VPN between Azure and our infrastructure.

root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "puppet agent --disable 'stop indexer to avoid bandwith consumption'"
root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "systemctl stop swh-indexer-journal-client@*"

I'm guessing that's the extrinsic metadata indexer; the others need to do plenty of random accesses to the storage, but that one consumes very quickly from Kafka. On the bright side, it consumes the entire topic within hours, so parallelism could be reduced as a quick fix.
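
If reducing parallelism is the quick fix, one possibility, once things are unblocked, is to restart the journal clients on only a subset of the workers, e.g. (only indexer-worker02 is mentioned earlier in this task; adjust the node list as needed):

root@pergamon:/etc/clustershell# clush -b -w indexer-worker02 "systemctl start swh-indexer-journal-client@*"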

There are a few issues with the configuration of these indexer clients:

the traffic should not be going through the IPSec VPN. They need to use the public, authenticated kafka endpoints. The IPSec load is making all Azure communication struggle.

ack, that should be "simple" enough to adapt [1]

[1] https://docs.softwareheritage.org/sysadm/mirror-operations/onboard.html?highlight=credential#how-to-create-the-journal-credentials

It seems that some old services on the Azure hosts have not been disabled and keep restarting because of a missing configuration file.

That's surprising, as those are new nodes...
One thing I can think of is that a wrong clush command started some services on all indexer nodes (even services that are not installed...).
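
A sketch of how the stale units could be spotted and cleaned up on those hosts (the actual unit names are unknown here, hence the placeholder):

root@indexer-worker02:~# systemctl list-units --state=failed,activating --no-pager
root@indexer-worker02:~# systemctl disable --now <stale-unit>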

There are also a bunch of services trying to schedule tasks on the scheduler backend (and failing, because that's firewalled).

That must be a side effect of the previous points, as the "new" indexer journal client services no longer do that.

In any case, thanks for the heads-up; I'll investigate and clean up once I get a go-ahead from @vsellier.

There are a few issues with the configuration of these indexer clients:
the traffic should not be going through the IPSec VPN. They need to use the public, authenticated kafka endpoints. The IPSec load is making all Azure communication struggle.

ack, that should be "simple" enough to adapt [1]
[1] https://docs.softwareheritage.org/sysadm/mirror-operations/onboard.html?highlight=credential#how-to-create-the-journal-credentials

As usual, not so simple, but here is the diff [2] to update our puppet manifests to allow
such a journal configuration.

[2] D8492
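
Once the credentials exist, connectivity to the public, authenticated endpoint can be sanity-checked from a worker with kcat; a sketch (broker address, SASL mechanism and credentials below are illustrative placeholders):

root@indexer-worker02:~# kcat -L -b broker1.journal.softwareheritage.org:9093 \
    -X security.protocol=SASL_SSL -X sasl.mechanisms=SCRAM-SHA-512 \
    -X sasl.username=<journal-username> -X sasl.password=<journal-password>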

ardumont claimed this task.
ardumont moved this task from deployed/landed/monitoring to done on the System administration board.