Page MenuHomeSoftware Heritage

Implement the jounal client counting an internal property of an object
ClosedPublic

Authored by vsellier on Apr 22 2021, 10:36 AM.

Details

Summary

and use it to count the persons inside the releases and revisions.

Via an new argument 'keys'/'messages' supported by the cli, different
journal client implemtations and worker functions are used.
'keys' is the old behavior counting only the different message keys
of a given topic.

Related to T3251

Somme adaptations on the puppet code will be necessary to be able to deploy
this diff

Test Plan
  • tox
  • containers

Diff Detail

Repository
rDCNT Archive counters
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D5572 (id=19919)

Rebasing onto 0e5d551ff5...

Current branch diff-target is up to date.
Changes applied before test
commit d11a8c649a21f10a6060038e9497a47e2ce5eef5
Author: Vincent SELLIER <vincent.sellier@softwareheritage.org>
Date:   Thu Apr 22 10:29:08 2021 +0200

    Implement the jounal client counting an internal property of an object
    
    and use it to count the persons inside the releases and revisions.
    
    Via an new argument 'keys'/'messages' supported by the cli, different
    journal client implemtations and worker functions are used.
    'keys' is the old behavior counting only the different message keys
    of a given topic.
    
    Related to T3251

See https://jenkins.softwareheritage.org/job/DCNT/job/tests-on-diff/42/ for more details.

ardumont added inline comments.
swh/counters/journal_client.py
31

Add a docstring to explain a bit you introspect revision to extract the author/committer.
It may not be immediately apparent for all.

32–33

To avoid 2 iterations on the same revision list.

I also wondered if we needed to check for None value in there but no need.
The person's fullname attribute is not optional (I checked the class swh.model.model.Person.;)

38

same, please add a docstring.

looks good to me, request changes for the sake of suggestions though.

swh/counters/journal_client.py
45–46

adding an empty list should be a noop, shouldn't it?

This revision now requires changes to proceed.Apr 22 2021, 10:56 AM
swh/counters/tests/test_journal_client.py
33

That will avoid any model desynchronization if at some point that changes...

39–45
48–51
57–69
vlorentz added inline comments.
swh/counters/cli.py
47

could you add an help= argument to document it?

swh/counters/journal_client.py
32–33

@ardumont But your variant does 2*n inserts, instead of directly adding n objects twice.

If this is a performance issue, it needs a benchmark.

It's true however that iterating twice over revisions won't work if it's a generator and not a sequence. But we don't need to support generators anyway, do we? Let's just make the argument Sequence[Dict]

swh/counters/journal_client.py
32–33

@ardumont But your variant does 2*n inserts, instead of directly adding n objects twice.

so what's better?

But if the first implementation is fine already fine ;)
Instinctively (which may be wrong in python then), iterating twice on something is wrong to me.

But the question from val about generator is true enough (in that case, the first reading will
consume the generator, making the second read renders nothing so making the person count wrong).

And I don't know what's the actual type of the input sequence is. If that's a list already it's fine.

swh/counters/cli.py
47

sure

swh/counters/journal_client.py
31

ok

32–33

I agree it's better to avoid the double iteration. I will change that. If it raises a performance issue (which I doubt), I will investigate later on it

45–46

it's a nohup in terms of count, but it was to avoid an unecessary remote call

But after checking the data, the number of releases without an author in production is 531 for a total of 17359351.
So it is useless ;)

swh/counters/tests/test_journal_client.py
39–45

ack for both suggestions

swh/counters/journal_client.py
32–33

so what's better?

I don't know, that's why we would need a benchmark if it's an issue.

swh/counters/journal_client.py
45–46

it's a nohup in terms of count, but it was to avoid an unecessary remote call

right about the remote call, i did not realize.

swh/counters/cli.py
47

finally it's a nope as it seems an argument doesn't support an help parameter

swh/counters/cli.py
47

then docstring

Update according the reviews' feedbacks

Build is green

Patch application report for D5572 (id=19922)

Rebasing onto 0e5d551ff5...

Current branch diff-target is up to date.
Changes applied before test
commit 9e1d234a56bd3a5fdf141f452d20f77f53149581
Author: Vincent SELLIER <vincent.sellier@softwareheritage.org>
Date:   Thu Apr 22 10:29:08 2021 +0200

    Implement the jounal client counting an internal property of an object
    
    and use it to count the persons inside the releases and revisions.
    
    Via an new argument 'keys'/'messages' supported by the cli, different
    journal client implemtations and worker functions are used.
    'keys' is the old behavior counting only the different message keys
    of a given topic.
    
    Related to T3251

See https://jenkins.softwareheritage.org/job/DCNT/job/tests-on-diff/43/ for more details.

ok for me

Note that you forgot to add the docstring asked you to though ;)

swh/counters/tests/test_journal_client.py
50

name and email are not by default thus why i did not mention them but as you wish.

This revision is now accepted and ready to land.Apr 22 2021, 1:25 PM

Still missing documentation of the journal_type CLI argument AFAICT

This revision now requires changes to proceed.Apr 22 2021, 1:37 PM

Build is green

Patch application report for D5572 (id=19923)

Rebasing onto 0e5d551ff5...

Current branch diff-target is up to date.
Changes applied before test
commit 5cae9b7cfe301d2460b9892c971e740bf974bcbd
Author: Vincent SELLIER <vincent.sellier@softwareheritage.org>
Date:   Thu Apr 22 10:29:08 2021 +0200

    Implement the jounal client counting an internal property of an object
    
    and use it to count the persons inside the releases and revisions.
    
    Via an new argument 'keys'/'messages' supported by the cli, different
    journal client implemtations and worker functions are used.
    'keys' is the old behavior counting only the different message keys
    of a given topic.
    
    Related to T3251

See https://jenkins.softwareheritage.org/job/DCNT/job/tests-on-diff/44/ for more details.

So we will need two journal clients, one that reads only keys, and one that reads only messages? Why not a single one to do both?

swh/counters/cli.py
52–58

It's for performance considerations only, for most of the counters, counting the keys is enough as it's the unique identifier in kafka.
The KeyOrientedJournalClient[1] is bypassing the object deserialization when a message is received, so a more classical client is needed for this specific Person case.

Having several journal clients is not a big deal, it's already the case for swh-search, for example.

[1] https://forge.softwareheritage.org/source/swh-counters/browse/master/swh/counters/kafka_client.py$14

It's for performance considerations only, for most of the counters, counting the keys is enough as it's the unique identifier in kafka.
The KeyOrientedJournalClient[1] is bypassing the object deserialization when a message is received, so a more classical client is needed for this specific Person case.

But why a CLI argument, instead of always selecting the right one for each topic (ie. message for revisions and releases, keys for other objects)?

Having several journal clients is not a big deal, it's already the case for swh-search, for example.

But swh-search's journal clients read different topics.

I'll send a diff with my idea, it will be clearer

This revision is now accepted and ready to land.Apr 22 2021, 3:36 PM

thanks for the diff you will propose. I will land this one in the interval.

The goal I had initially in mind, was to separate the concerns: one client counting all the messages of all the topics topics without any login inside, and another one for specific use cases that can take longer to compute.