Page MenuHomeSoftware Heritage

staging origin intrinsic metadata indexer are stuck
Closed, MigratedEdits Locked

Description

There is no consumer for that indexer in staging.
After a systemctl status on the dedicated journal client service, they are all showing the same warning [2].

[1] http://getty.internal.softwareheritage.org:9000/clusters/staging/consumers

[2]

root@pergamon:~# clush -b -w @staging-workers 'systemctl status swh-indexer-journal-client@origin_intrinsic_metadata_1'
---------------
worker0.internal.staging.swh.network
---------------
● swh-indexer-journal-client@origin_intrinsic_metadata_1.service - Software Heritage Indexer Journal Client origin_intrinsic_metadata_1
     Loaded: loaded (/etc/systemd/system/swh-indexer-journal-client@.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/swh-indexer-journal-client@origin_intrinsic_metadata_1.service.d
             └─parameters.conf
     Active: active (running) since Wed 2022-08-31 14:05:35 UTC; 33min ago
   Main PID: 2748153 (swh)
      Tasks: 6 (limit: 28838)
     Memory: 140.7M
     CGroup: /system.slice/system-swh\x2dindexer\x2djournal\x2dclient.slice/swh-indexer-journal-client@origin_intrinsic_metadata_1.service
             └─2748153 /usr/bin/python3 /usr/bin/swh --log-level INFO --log-level azure.core.pipeline.policies.http_logging_policy:WARNING indexer --config-file /etc/softwareheritage/indexer/origin_intrinsic_metadata.yml journal-client origin_intrinsic_metadata

Aug 31 14:05:35 worker0 systemd[1]: swh-indexer-journal-client@origin_intrinsic_metadata_1.service: Scheduled restart job, restart counter is at 1.
Aug 31 14:05:35 worker0 systemd[1]: Stopped Software Heritage Indexer Journal Client origin_intrinsic_metadata_1.
Aug 31 14:05:35 worker0 systemd[1]: Started Software Heritage Indexer Journal Client origin_intrinsic_metadata_1.
Aug 31 14:20:58 worker0 swh[2748153]: WARNING:swh.journal.client.rdkafka:MAXPOLL [rdkafka#consumer-1] [thrd:main]: Application maximum poll interval (300000ms) exceeded by 313ms (adjust max.poll.interval.ms for long-running message processing): leaving group
Aug 31 14:20:58 worker0 swh[2748153]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Broker: Unknown member"}
Aug 31 14:20:59 worker0 swh[2748153]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_MAX_POLL_EXCEEDED,val=-147,str="Application maximum poll interval (300000ms) exceeded by 313ms"}
Aug 31 14:32:15 worker0 swh[2748153]: WARNING:swh.journal.client.rdkafka:MAXPOLL [rdkafka#consumer-1] [thrd:main]: Application maximum poll interval (300000ms) exceeded by 396ms (adjust max.poll.interval.ms for long-running message processing): leaving group
Aug 31 14:32:15 worker0 swh[2748153]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Broker: Unknown member"}
Aug 31 14:32:16 worker0 swh[2748153]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_MAX_POLL_EXCEEDED,val=-147,str="Application maximum poll interval (300000ms) exceeded by 396ms"}
---------------
worker1.internal.staging.swh.network
---------------
● swh-indexer-journal-client@origin_intrinsic_metadata_1.service - Software Heritage Indexer Journal Client origin_intrinsic_metadata_1
     Loaded: loaded (/etc/systemd/system/swh-indexer-journal-client@.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/swh-indexer-journal-client@origin_intrinsic_metadata_1.service.d
             └─parameters.conf
     Active: active (running) since Wed 2022-08-31 14:03:27 UTC; 35min ago
   Main PID: 2743429 (swh)
      Tasks: 6 (limit: 28838)
     Memory: 134.7M
     CGroup: /system.slice/system-swh\x2dindexer\x2djournal\x2dclient.slice/swh-indexer-journal-client@origin_intrinsic_metadata_1.service
             └─2743429 /usr/bin/python3 /usr/bin/swh --log-level INFO --log-level azure.core.pipeline.policies.http_logging_policy:WARNING indexer --config-file /etc/softwareheritage/indexer/origin_intrinsic_metadata.yml journal-client origin_intrinsic_metadata

Aug 31 14:03:27 worker1 systemd[1]: Started Software Heritage Indexer Journal Client origin_intrinsic_metadata_1.
Aug 31 14:16:38 worker1 swh[2743429]: WARNING:swh.journal.client.rdkafka:MAXPOLL [rdkafka#consumer-1] [thrd:main]: Application maximum poll interval (300000ms) exceeded by 462ms (adjust max.poll.interval.ms for long-running message processing): leaving group
Aug 31 14:16:38 worker1 swh[2743429]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Broker: Unknown member"}
Aug 31 14:16:39 worker1 swh[2743429]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_MAX_POLL_EXCEEDED,val=-147,str="Application maximum poll interval (300000ms) exceeded by 462ms"}
Aug 31 14:32:50 worker1 swh[2743429]: WARNING:swh.journal.client.rdkafka:MAXPOLL [rdkafka#consumer-1] [thrd:main]: Application maximum poll interval (300000ms) exceeded by 418ms (adjust max.poll.interval.ms for long-running message processing): leaving group
Aug 31 14:32:50 worker1 swh[2743429]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Broker: Unknown member"}
Aug 31 14:32:51 worker1 swh[2743429]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_MAX_POLL_EXCEEDED,val=-147,str="Application maximum poll interval (300000ms) exceeded by 418ms"}
---------------
worker2.internal.staging.swh.network
---------------
● swh-indexer-journal-client@origin_intrinsic_metadata_1.service - Software Heritage Indexer Journal Client origin_intrinsic_metadata_1
     Loaded: loaded (/etc/systemd/system/swh-indexer-journal-client@.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/swh-indexer-journal-client@origin_intrinsic_metadata_1.service.d
             └─parameters.conf
     Active: active (running) since Wed 2022-08-31 14:03:27 UTC; 35min ago
   Main PID: 2758980 (swh)
      Tasks: 6 (limit: 28840)
     Memory: 62.2M
     CGroup: /system.slice/system-swh\x2dindexer\x2djournal\x2dclient.slice/swh-indexer-journal-client@origin_intrinsic_metadata_1.service
             └─2758980 /usr/bin/python3 /usr/bin/swh --log-level INFO --log-level azure.core.pipeline.policies.http_logging_policy:WARNING indexer --config-file /etc/softwareheritage/indexer/origin_intrinsic_metadata.yml journal-client origin_intrinsic_metadata

Aug 31 14:03:27 worker2 systemd[1]: Started Software Heritage Indexer Journal Client origin_intrinsic_metadata_1.
Aug 31 14:16:28 worker2 swh[2758980]: WARNING:swh.journal.client.rdkafka:MAXPOLL [rdkafka#consumer-1] [thrd:main]: Application maximum poll interval (300000ms) exceeded by 490ms (adjust max.poll.interval.ms for long-running message processing): leaving group
Aug 31 14:16:29 worker2 swh[2758980]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Broker: Unknown member"}
Aug 31 14:16:29 worker2 swh[2758980]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_MAX_POLL_EXCEEDED,val=-147,str="Application maximum poll interval (300000ms) exceeded by 490ms"}
Aug 31 14:29:24 worker2 swh[2758980]: WARNING:swh.journal.client.rdkafka:MAXPOLL [rdkafka#consumer-1] [thrd:main]: Application maximum poll interval (300000ms) exceeded by 337ms (adjust max.poll.interval.ms for long-running message processing): leaving group
Aug 31 14:29:24 worker2 swh[2758980]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Broker: Unknown member"}
Aug 31 14:29:25 worker2 swh[2758980]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_MAX_POLL_EXCEEDED,val=-147,str="Application maximum poll interval (300000ms) exceeded by 337ms"}
Aug 31 14:35:44 worker2 swh[2758980]: WARNING:swh.journal.client.rdkafka:MAXPOLL [rdkafka#consumer-1] [thrd:main]: Application maximum poll interval (300000ms) exceeded by 286ms (adjust max.poll.interval.ms for long-running message processing): leaving group
Aug 31 14:35:45 worker2 swh[2758980]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Broker: Unknown member"}
Aug 31 14:35:45 worker2 swh[2758980]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_MAX_POLL_EXCEEDED,val=-147,str="Application maximum poll interval (300000ms) exceeded by 286ms"}
---------------
worker3.internal.staging.swh.network
---------------
● swh-indexer-journal-client@origin_intrinsic_metadata_1.service - Software Heritage Indexer Journal Client origin_intrinsic_metadata_1
     Loaded: loaded (/etc/systemd/system/swh-indexer-journal-client@.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/swh-indexer-journal-client@origin_intrinsic_metadata_1.service.d
             └─parameters.conf
     Active: active (running) since Wed 2022-08-31 14:03:27 UTC; 35min ago
   Main PID: 2562808 (swh)
      Tasks: 6 (limit: 28889)
     Memory: 134.7M
     CGroup: /system.slice/system-swh\x2dindexer\x2djournal\x2dclient.slice/swh-indexer-journal-client@origin_intrinsic_metadata_1.service
             └─2562808 /usr/bin/python3 /usr/bin/swh --log-level INFO --log-level azure.core.pipeline.policies.http_logging_policy:WARNING indexer --config-file /etc/softwareheritage/indexer/origin_intrinsic_metadata.yml journal-client origin_intrinsic_metadata

Aug 31 14:03:27 worker3 systemd[1]: Started Software Heritage Indexer Journal Client origin_intrinsic_metadata_1.
Aug 31 14:03:28 worker3 swh[2562808]: WARNING:swh.core.cli:Could not load subcommand graph: DistributionNotFound(Requirement.parse('py4j'), None)
Aug 31 14:16:48 worker3 swh[2562808]: WARNING:swh.journal.client.rdkafka:MAXPOLL [rdkafka#consumer-1] [thrd:main]: Application maximum poll interval (300000ms) exceeded by 20ms (adjust max.poll.interval.ms for long-running message processing): leaving group
Aug 31 14:16:48 worker3 swh[2562808]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Broker: Unknown member"}
Aug 31 14:16:49 worker3 swh[2562808]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_MAX_POLL_EXCEEDED,val=-147,str="Application maximum poll interval (300000ms) exceeded by 20ms"}
Aug 31 14:32:44 worker3 swh[2562808]: WARNING:swh.journal.client.rdkafka:MAXPOLL [rdkafka#consumer-1] [thrd:main]: Application maximum poll interval (300000ms) exceeded by 461ms (adjust max.poll.interval.ms for long-running message processing): leaving group
Aug 31 14:32:44 worker3 swh[2562808]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Broker: Unknown member"}
Aug 31 14:32:45 worker3 swh[2562808]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_MAX_POLL_EXCEEDED,val=-147,str="Application maximum poll interval (300000ms) exceeded by 461ms"}

Event Timeline

ardumont triaged this task as Normal priority.Aug 31 2022, 4:41 PM
ardumont created this task.

Activating debug log [1]

It's possible the default batch_size of 200 is too much for staging.
Hence the timeout.
A current tryout to decrease that batch_size by half is ongoing.

[1] P1441

After further investigation w/ @vsellier, it's also related to our storage and indexer storage having too few gunicorn workers serving the journal clients (among other things).
So decreasing the batch size to something like 100 and fixing that should fairly help ^.

The lag is subsiding now, slowly because only 1 journal client:

root@storage1:~# $KAFKA_CONSUMER_GROUPS --bootstrap-server $SERVER   --describe --group $GROUP_ID | grep -v " 0 "

Warning: Consumer group 'swh.indexer.journal_client.origin_intrinsic_metadata' is rebalancing.

GROUP                                                TOPIC                                   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
swh.indexer.journal_client.origin_intrinsic_metadata swh.journal.objects.origin_visit_status 43         400             1547777         1547377         -               -               -
swh.indexer.journal_client.origin_intrinsic_metadata swh.journal.objects.origin_visit_status 22         200             1546526         1546326         -               -               -
swh.indexer.journal_client.origin_intrinsic_metadata swh.journal.objects.origin_visit_status 48         600             1548433         1547833         -               -               -
swh.indexer.journal_client.origin_intrinsic_metadata swh.journal.objects.origin_visit_status 16         380             1546978         1546598         -               -               -
swh.indexer.journal_client.origin_intrinsic_metadata swh.journal.objects.origin_visit_status 32         830             1549152         1548322         -               -               -
ardumont changed the task status from Open to Work in Progress.Aug 31 2022, 6:49 PM
ardumont closed this task as Resolved.
ardumont claimed this task.
ardumont moved this task from Backlog to in-progress on the System administration board.
ardumont moved this task from deployed/landed/monitoring to done on the System administration board.