
Count authors from revisions and releases
Closed, Resolved (Public)

Description

The default counters use only the journal's message keys, for performance reasons and to stay completely agnostic regarding the kind of messages on a topic.

But sometimes detailed information stored in the messages needs to be counted, such as the authors in this case.

This will need a less generic journal client, in charge of deserializing the messages and counting other specific fields.
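
As a rough illustration of what such a less generic client could do, here is a minimal sketch. It assumes the messages have already been deserialized into dictionaries, that a revision carries `author` and `committer` entries and a release an `author` entry, and that each person has a `fullname` value. These field names, as well as `PERSON_FIELDS`, `extract_persons` and `process_batch`, are assumptions made for the example, not the actual swh-counters code. A plain Python set stands in for the real counter backend.

```python
from typing import Any, Dict, Iterable, List, Set

# Assumption: which fields of a deserialized message hold a person.
PERSON_FIELDS = {
    "revision": ("author", "committer"),
    "release": ("author",),
}


def extract_persons(object_type: str, messages: Iterable[Dict[str, Any]]) -> Set[bytes]:
    """Collect the distinct person identifiers found in a batch of messages."""
    persons: Set[bytes] = set()
    for message in messages:
        for field in PERSON_FIELDS.get(object_type, ()):
            person = message.get(field)
            # Some objects may have no person in a given field: skip those.
            if person and person.get("fullname"):
                persons.add(person["fullname"])
    return persons


def process_batch(batch: Dict[str, List[Dict[str, Any]]], counters: Dict[str, Set[bytes]]) -> None:
    """Feed every person of a batch (object type -> messages) into the 'person' counter."""
    for object_type, messages in batch.items():
        counters.setdefault("person", set()).update(extract_persons(object_type, messages))


# Illustrative data only, not real archive content.
batch = {
    "revision": [{"author": {"fullname": b"Alice <alice@example.org>"},
                  "committer": {"fullname": b"Bob <bob@example.org>"}}],
    "release": [{"author": {"fullname": b"Alice <alice@example.org>"}}, {"author": None}],
}
counters: Dict[str, Set[bytes]] = {}
process_batch(batch, counters)
print(len(counters["person"]))
```

The same person appearing as a revision author and as a release author is counted once, which is the behaviour wanted for a distinct-person counter.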

Event Timeline

vsellier triaged this task as Normal priority. (Apr 15 2021, 1:13 PM)
vsellier created this task.

don't forget to count committers too

vsellier changed the task status from Open to Work in Progress. (Mon, Apr 19, 3:52 PM)
  • version 0.7.0 released with the latest improvement (D5576) from vlorentz (thanks)
  • deployment done in staging
  • the person counting has started on the live messages:
root@counters0:~# redis-cli
127.0.0.1:6379> pfcount person
(integer) 7
  • now let's reset the consumer offsets for the release and revision topics to backfill the person counter:
# offsets backup
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --all-topics --to-current --dry-run  --export --group swh.counters.journal_client 2>&1 > ~/counters_journal_client_offsets.csv
# revision reset
 /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets  --group swh.counters.journal_client --to-earliest --execute --topic swh.journal.objects.revision
# release reset
 /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets  --group swh.counters.journal_client --to-earliest --execute --topic swh.journal.objects.release

# checks
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --all-topics --to-current --dry-run  --export --group swh.counters.journal_client 2>&1 > ~/counters_journal_client_offsets-backfill.csv  

# diff ~/counters_journal_client_offsets.csv ~/counters_journal_client_offsets-backfill.csv | less
2c2
< "swh.journal.objects.revision",25,543932
---
> "swh.journal.objects.revision",25,0
5c5
< "swh.journal.objects.release",57,2874
---
> "swh.journal.objects.release",57,0
12c12
< "swh.journal.objects.revision",28,543953
...
  • journal client restarted
  • person count is increasing:
root@counters0:~# redis-cli pfcount person
(integer) 33807
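
To get a rough idea of how much there is to replay after such a reset, the two exported CSV files can be compared programmatically rather than with `diff`. The sketch below assumes each line of the export is `"topic",partition,offset`, as in the diff output above. `load_offsets` and `messages_to_replay` are hypothetical helper names, and the result only matches the number of messages to re-read if the earliest available offset of each partition is the one written by the reset.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Tuple

Offsets = Dict[Tuple[str, int], int]


def load_offsets(text: str) -> Offsets:
    """Parse an exported offsets CSV into {(topic, partition): offset}."""
    offsets: Offsets = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            continue  # ignore blank or unexpected lines
        topic, partition, offset = row
        try:
            offsets[(topic, int(partition))] = int(offset)
        except ValueError:
            continue  # ignore a possible header line
    return offsets


def messages_to_replay(before: Offsets, after: Offsets) -> Dict[str, int]:
    """Sum, per topic, how far the offsets were moved back."""
    replay: Dict[str, int] = defaultdict(int)
    for (topic, partition), old in before.items():
        new = after.get((topic, partition), old)
        if new < old:
            replay[topic] += old - new
    return dict(replay)


# The two partitions fully shown in the diff above.
before = load_offsets('"swh.journal.objects.revision",25,543932\n'
                      '"swh.journal.objects.release",57,2874\n')
after = load_offsets('"swh.journal.objects.revision",25,0\n'
                     '"swh.journal.objects.release",57,0\n')
print(messages_to_replay(before, after))
```

In practice the text would come from `~/counters_journal_client_offsets.csv` and `~/counters_journal_client_offsets-backfill.csv`, read with `open(...).read()`.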

Awesome!

I think you can close D5573 which is obsolete now with the latest change.

Also, see [1] to follow the journal client consumption (it has data now ;)

[1] https://grafana.softwareheritage.org/goto/X999Uz9Mz

swh-counters is deployed in production too:

  • upgrade the swh-counters package and restart the swh-counters backend and journal client
root@counters1:~# apt dist-upgrade
...
Setting up python3-swh.counters (0.7.0-1~swh1~bpo10+1) ...
root@counters1:~# systemctl stop swh-counters-journal-client.service 
root@counters1:~# systemctl restart gunicorn-swh-counters.service 
root@counters1:~# systemctl start swh-counters-journal-client.service 
root@counters1:~# redis-cli pfcount person
(integer) 7

The person count has already started.
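
For context, `pfcount` reads a Redis HyperLogLog: an estimate of the number of distinct values added, not an exact count (Redis documents a standard error of about 0.81%). The toy implementation below is only a sketch of that technique, to show why adding the same person twice does not change the count. `TinyHyperLogLog` is a made-up name, and its hash function and register count are arbitrary choices for the example; they do not match what Redis uses internally.

```python
import hashlib
import math


class TinyHyperLogLog:
    """Toy HyperLogLog: approximate distinct count in fixed memory."""

    def __init__(self, precision: int = 12) -> None:
        self.precision = precision
        self.m = 1 << precision          # number of registers
        self.registers = [0] * self.m

    def add(self, item: bytes) -> None:
        # 64-bit hash: the first `precision` bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(item).digest()[:8], "big")
        rest_bits = 64 - self.precision
        index = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1   # leading zeros in `rest`, plus one
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction (linear counting) while many registers are still empty.
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


hll = TinyHyperLogLog()
for i in range(10000):
    person = b"person-%d" % i
    hll.add(person)
    hll.add(person)  # duplicates leave the registers unchanged
print(hll.count())   # close to 10000, not exactly 10000
```

This is also why the value reported on the web application is an approximation of the number of distinct persons.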

  • stop the journal client to be able to reset the release and revision offsets
root@counters1:~# systemctl stop swh-counters-journal-client.service
  • reset the offsets
# offsets backup
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --all-topics --to-current --dry-run  --export --group swh.counters.journal_client 2>&1 > ~/counters_journal_client_offsets.csv
# revision reset
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets  --group swh.counters.journal_client --to-earliest --execute --topic swh.journal.objects.revision
# release reset
vsellier@kafka1 ~ %  /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets  --group swh.counters.journal_client --to-earliest --execute --topic swh.journal.objects.release 
# checks
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --all-topics --to-current --dry-run  --export --group swh.counters.journal_client 2>&1 > ~/counters_journal_client_offsets-backfill.csv 
vsellier@kafka1 ~ % diff ~/counters_journal_client_offsets.csv ~/counters_journal_client_offsets-backfill.csv | less 
1c1
< "swh.journal.objects.revision",25,8275180
---
> "swh.journal.objects.revision",25,0
8c8
< "swh.journal.objects.release",128,78484
---
> "swh.journal.objects.release",128,0
16c16
...
  • journal client restarted
root@counters1:~# systemctl start swh-counters-journal-client.service
  • the person counter is growing fast
root@counters1:~# date;redis-cli pfcount person
Fri 23 Apr 2021 10:55:54 AM UTC
(integer) 72358
root@counters1:~# date;redis-cli pfcount person
Fri 23 Apr 2021 10:55:57 AM UTC
(integer) 80618
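
The two samples above are enough for a back-of-the-envelope growth rate. The sketch below parses the `date` output as printed on this host and divides the count difference by the elapsed time. `growth_rate` and `DATE_FORMAT` are hypothetical names for the example. Note that this measures how fast the distinct-person estimate grows, not how many messages are consumed per second.

```python
from datetime import datetime
from typing import List, Tuple

# Matches lines such as "Fri 23 Apr 2021 10:55:54 AM UTC" (English locale assumed).
DATE_FORMAT = "%a %d %b %Y %I:%M:%S %p UTC"


def growth_rate(samples: List[Tuple[str, int]]) -> float:
    """Return the average increase per second between the first and last sample."""
    (first_date, first_count), (last_date, last_count) = samples[0], samples[-1]
    elapsed = (datetime.strptime(last_date, DATE_FORMAT)
               - datetime.strptime(first_date, DATE_FORMAT)).total_seconds()
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order and not simultaneous")
    return (last_count - first_count) / elapsed


samples = [
    ("Fri 23 Apr 2021 10:55:54 AM UTC", 72358),
    ("Fri 23 Apr 2021 10:55:57 AM UTC", 80618),
]
rate = growth_rate(samples)
print(f"{rate:.0f} new persons per second")
```

With only three seconds between the samples this is a very noisy figure (roughly 2,750 per second here), and the rate drops as more of the persons seen have already been counted.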

The lag for the production can be followed here: https://grafana.softwareheritage.org/goto/Di2H3z9Gk
(staging has already recovered)

The authors are now displayed on staging and production (webapp1).

We just have to wait for the backfill to finish to have realistic values displayed on webapp1.

Hear hear, it has caught up now:

ardumont@counters1:~% date;redis-cli pfcount person
Sat 24 Apr 2021 05:31:18 PM UTC
(integer) 42190221

as can be seen on webapp1 [1]

I'll take care of deploying on the main archive on Monday.

[1] https://webapp1.internal.softwareheritage.org/