Allows enforcing consumption of Kafka topics from the beginning rather than
from the stored offsets.
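A minimal sketch of how such a flag could be wired up (illustrative only: the real tool's CLI lives in swh/dataset/cli.py and uses click; the names below are hypothetical):

```python
import argparse

# Hypothetical sketch of the CLI wiring for a --reset flag; the actual
# export_graph tool is built with click, not argparse.
def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="export_graph")
    parser.add_argument(
        "--reset",
        action="store_true",
        help="consume Kafka topics from the beginning instead of the "
             "offsets stored for the consumer group",
    )
    return parser

args = make_parser().parse_args(["--reset"])
print(args.reset)  # True when the flag is given, False otherwise
```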
Depends on D6233
Differential D6234
Add a --reset option to export_graph cli tool
Authored by douardda on Sep 9 2021, 5:57 PM.
Event Timeline

Comment Actions
Build is green. Patch application report for D6234 (id=22556): could not rebase; attempted merge onto 002ee70b99...

  Updating 002ee70..4d70697
  Fast-forward
   swh/dataset/cli.py              |  9 +++++-
   swh/dataset/journalprocessor.py | 66 +++++++++++++++++++++++++----------------
   2 files changed, 49 insertions(+), 26 deletions(-)

Changes applied before test:

commit 4d70697c63955e90efeecbd90f1da7994c89cdab
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Sep 9 14:20:26 2021 +0200

    Add a --reset option to export_graph cli tool

    Allows enforcing consumption of kafka topics from the beginning rather
    than from the stored offsets.

commit 3f331e1823e3329085f01f073fe8a6bd6f43473a
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Sep 9 14:30:25 2021 +0200

    Reduce the size of the progress bar

    so we get a chance to actually have a visible progress bar:
    - reduce the label size (shorter desc),
    - use a single 'workers' postfix (like "workers=n/m").

commit 48d246f178851dfe06e47b6c12555fcd095f5641
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Sep 9 11:54:15 2021 +0200

    Make sure the progress bar for the export reaches 100%

    - ensure the last offset is sent to the queue,
    - fix the computation of the progress value (off-by-one).

commit 3a2f5076dcbf791d1ef43982b70551f048ee7c3e
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Sep 9 11:47:57 2021 +0200

    Explicitly close the temporary kafka consumer in `get_offsets`

    used to retrieve partitions and lo/hi offsets. It could sometimes
    cause a dead-lock/long-timeout situation (especially in the developer
    docker environment).
commit 45126fd621e8b75c592d7c6cd3d8d1337f95c97e
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Sep 9 11:39:44 2021 +0200

    Simplify the lo/high partition offset computation

    The computation of lo and high offsets used to be done in 2 steps:
    - first get the watermark offsets (i.e. the absolute min and max
      offsets of the whole partition),
    - then, as a "hook" in `process()`, retrieve the last committed offset
      for the partition and "push" these current offsets in the progress
      queue.
    Instead, this simplifies the process a bit by querying the committed
    offsets while computing the hi/low offsets.

commit e47a3db1287b3f6ada32c3afb3270ef0947a7659
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Sep 9 14:22:37 2021 +0200

    Use proper signature for JournalClientOffsetRanges.process()

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/6/ for more details.

Comment Actions
Can we keep the reset stuff outside the journalprocessor.py logic? It's already complex enough.
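The single-pass lo/high computation described in the "Simplify" commit might look roughly like the sketch below. The consumer is a stub so the snippet runs without a broker; the real code uses confluent-kafka's `get_watermark_offsets()` and `committed()` calls, and `get_offsets` here is only an illustration, not the actual journalprocessor.py implementation:

```python
# Pure-Python sketch: fetch the watermark offsets and the committed
# offset for each partition in a single pass, instead of the earlier
# two-step (watermarks first, committed offsets later in a hook).

class StubConsumer:
    """Stand-in for a Kafka consumer, for illustration only."""
    def __init__(self, watermarks, committed):
        self._watermarks = watermarks  # partition -> (low, high)
        self._committed = committed    # partition -> committed offset

    def get_watermark_offsets(self, partition):
        return self._watermarks[partition]

    def committed(self, partition):
        return self._committed.get(partition)

def get_offsets(consumer, partitions):
    """Return {partition: (start, high)}: start is the committed offset
    when one exists, else the low watermark."""
    offsets = {}
    for p in partitions:
        low, high = consumer.get_watermark_offsets(p)
        committed = consumer.committed(p)
        start = committed if committed is not None else low
        offsets[p] = (start, high)
    return offsets

consumer = StubConsumer({0: (0, 100), 1: (0, 50)}, {0: 42})
print(get_offsets(consumer, [0, 1]))  # {0: (42, 100), 1: (0, 50)}
```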
Comment Actions
Build is green. Patch application report for D6234 (id=22582): could not rebase; attempted merge onto 002ee70b99...

  Updating 002ee70..40c2446
  Fast-forward
   swh/dataset/cli.py              | 12 ++++++-
   swh/dataset/journalprocessor.py | 76 +++++++++++++++++++++++++++--------------
   2 files changed, 61 insertions(+), 27 deletions(-)

Changes applied before test (the same commits as above, rebased):

commit 40c24464da3a2af5918fc309d282e72b08b10c60: Add a --reset option to export_graph cli tool
commit 5881ae06f636a74e7fb0addca04127bfe18b687d: Reduce the size of the progress bar
commit 47713ee38c9498a0548535e5b8361d8158ee3e09: Make sure the progress bar for the export reaches 100%
commit d07b2a632256da4e7778bf7b1f4a02acd03f9ca0: Explicitly close the temporary kafka consumer in `get_offsets`
commit 2760e322af7c5862e0329198671b49d2755491ef: Simplify the lo/high partition offset computation
commit e47a3db1287b3f6ada32c3afb3270ef0947a7659: Use proper signature for JournalClientOffsetRanges.process()

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/10/ for more details.

Comment Actions
Actually I am not sure I can easily do that: in order for a consumer to start consuming from the beginning, it must declare so during the assignment of partitions (and then commit after having started consuming). There is no Consumer API (that I could find) allowing to "reset stored offsets" and then start consuming from there (see https://github.com/confluentinc/confluent-kafka-dotnet/issues/995#issuecomment-521320691 for example).

So either I kill this diff or it stays "intricate" with the setup of the consumer (and thus the whole journalprocessor.py).

Note: this feature is mainly useful for testing purposes IMHO, so I suppose it's not that critical to keep it; I just find it handy when "playing" with swh dataset export.

Comment Actions
Build is green. Patch application report for D6234 (id=22606): could not rebase; attempted merge onto 002ee70b99...
  Updating 002ee70..ee72213
  Fast-forward
   swh/dataset/cli.py              | 12 ++++-
   swh/dataset/journalprocessor.py | 97 +++++++++++++++++++++++++++------------
   2 files changed, 79 insertions(+), 30 deletions(-)

Changes applied before test:

commit ee72213c8b1d7ff31b3eab71fac62368840cd406
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon Sep 13 15:06:53 2021 +0200

    Add a --reset option to export_graph cli tool

    Allows enforcing consumption of kafka topics from the beginning rather
    than from the current position of the consumer group (aka the stored
    offsets).

commit c8db7698bc82f1472729fe4148803004b6c93b93
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Sep 9 14:24:32 2021 +0200

    Commit kafka messages whose offset has reached the high limit

    This is necessary to ensure these messages are committed in kafka:
    since the (considered) empty partition is unsubscribed from, it never
    gets committed in `JournalClient.handle_messages()` (the latter only
    commits assigned partitions).

    Ensure offsets are committed only after worker_fn has executed without
    error. This requires overloading the `JournalClient.handle_messages()`
    method in `JournalClientOffsetRanges` to make sure "pending" messages
    are committed after the proper execution of `worker_fn`. Doing so, we
    can both unsubscribe from "eof" partitions on the fly ("eof" meaning
    the partition has been consumed up to the high watermark offset
    recorded at the beginning of the export), and commit ALL offsets that
    need to be, but only after proper execution of the `worker_fn`
    callback. This should guarantee proper and consistent behavior (famous
    last words...).

commit bd888a6c75d216bdef4653d9cbfdf7147a32ee05
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon Sep 13 15:04:51 2021 +0200

    Add a JournalClientOffsetRanges.unsubscribe() method

    to make the code a bit clearer.
commit 5881ae06f636a74e7fb0addca04127bfe18b687d: Reduce the size of the progress bar
commit 47713ee38c9498a0548535e5b8361d8158ee3e09: Make sure the progress bar for the export reaches 100%
commit d07b2a632256da4e7778bf7b1f4a02acd03f9ca0: Explicitly close the temporary kafka consumer in `get_offsets`
commit 2760e322af7c5862e0329198671b49d2755491ef: Simplify the lo/high partition offset computation
commit e47a3db1287b3f6ada32c3afb3270ef0947a7659: Use proper signature for JournalClientOffsetRanges.process()

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/13/ for more details.

Comment Actions
Meh. How much easier does it make testing, compared to using Kafka's CLI (from the linked comment)?
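As noted in an earlier comment, a consumer can only start from the beginning by declaring so while partitions are being assigned. A minimal sketch of that on_assign pattern follows; only the `OFFSET_BEGINNING` value and the callback shape mirror confluent-kafka's Python API (where the callback is passed to `Consumer.subscribe()`), while the consumer and partition classes are stand-ins so the snippet runs without a broker:

```python
# Sketch of the "reset during partition assignment" pattern.
OFFSET_BEGINNING = -2  # value of confluent_kafka.OFFSET_BEGINNING

class TopicPartition:
    """Stand-in for confluent_kafka.TopicPartition."""
    def __init__(self, topic, partition, offset=-1001):
        self.topic = topic
        self.partition = partition
        self.offset = offset

def on_assign(consumer, partitions):
    # Force every assigned partition back to the earliest offset, then
    # hand the modified assignment back to the consumer.
    for p in partitions:
        p.offset = OFFSET_BEGINNING
    consumer.assign(partitions)

class StubConsumer:
    """Stand-in consumer that just records its assignment."""
    def __init__(self):
        self.assigned = None
    def assign(self, partitions):
        self.assigned = partitions

c = StubConsumer()
on_assign(c, [TopicPartition("swh.journal.objects.origin", i) for i in range(3)])
print([p.offset for p in c.assigned])  # [-2, -2, -2]
```

With the real library this would be `consumer.subscribe(topics, on_assign=on_assign)`; the point is that the reset has to happen inside the assignment callback, which is why it is hard to keep out of journalprocessor.py.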
Comment Actions
Build is green. Patch application report for D6234 (id=22623): could not rebase; attempted merge onto 358d84938d...

  Updating 358d849..b039bc4
  Fast-forward
   swh/dataset/cli.py              | 12 ++++++++-
   swh/dataset/journalprocessor.py | 58 ++++++++++++++++++++++++++++++++++------
   2 files changed, 59 insertions(+), 11 deletions(-)

Changes applied before test (the same commits as above, rebased):

commit b039bc4fb55170f600fbaf77789e30db8da8df4e: Add a --reset option to export_graph cli tool
commit fb6c2e7246c58ff91c1cc75613635447c3397785: Commit kafka messages whose offset has reached the high limit
commit a507024c5ff4989720044aa3fc42ebbc341659e0: Add a JournalClientOffsetRanges.unsubscribe() method
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/18/ for more details.

Comment Actions
Now I remember, that's one reason I wrote this: in my docker env, I could not execute this command (it generated an error about a port already being in use), but that's a valid remark indeed.

Comment Actions
You could also add a command in swh-dataset's entrypoint.sh that calls whatever Kafka's script does.

Comment Actions
Well, it works just fine when executed from another container:

  docker-compose run kafka kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
      --group swh-dataset-export-test20 --topic swh.journal.objects.origin \
      --reset-offsets --to-earliest --execute

  Starting docker_zookeeper_1 ... done

  GROUP                      TOPIC                       PARTITION  NEW-OFFSET
  swh-dataset-export-test20  swh.journal.objects.origin  4          0
  swh-dataset-export-test20  swh.journal.objects.origin  14         0
  swh-dataset-export-test20  swh.journal.objects.origin  9          0
  swh-dataset-export-test20  swh.journal.objects.origin  8          0
  swh-dataset-export-test20  swh.journal.objects.origin  10         0
  swh-dataset-export-test20  swh.journal.objects.origin  13         0
  swh-dataset-export-test20  swh.journal.objects.origin  12         0
  swh-dataset-export-test20  swh.journal.objects.origin  1          0
  swh-dataset-export-test20  swh.journal.objects.origin  11         0
  swh-dataset-export-test20  swh.journal.objects.origin  6          0
  swh-dataset-export-test20  swh.journal.objects.origin  3          0
  swh-dataset-export-test20  swh.journal.objects.origin  2          0
  swh-dataset-export-test20  swh.journal.objects.origin  15         0
  swh-dataset-export-test20  swh.journal.objects.origin  7          0
  swh-dataset-export-test20  swh.journal.objects.origin  5          0
  swh-dataset-export-test20  swh.journal.objects.origin  0          0

So let's forget about this diff (and maybe add some documentation somewhere).
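For reference, the commit-after-worker_fn behaviour described in the "Commit kafka messages whose offset has reached the high limit" commit message can be sketched in pure Python. All names and stub classes below are illustrative, not the actual journalprocessor.py code:

```python
# Sketch: offsets are committed only once worker_fn has run without
# raising, and partitions consumed up to their high watermark ("eof")
# are reported so the caller can unsubscribe from them.

class StubConsumer:
    """Stand-in consumer that records committed (partition, offset) pairs."""
    def __init__(self):
        self.committed = []
    def commit(self, message):
        self.committed.append(message)

def handle_messages(consumer, messages, worker_fn, high_watermarks):
    """Run worker_fn over messages, then commit; return eof partitions."""
    pending = list(messages)        # messages as (partition, offset) pairs
    worker_fn(pending)              # may raise: then nothing is committed
    eof = set()
    for partition, offset in pending:
        consumer.commit((partition, offset))
        if offset + 1 >= high_watermarks[partition]:
            eof.add(partition)      # consumed up to the high limit
    return eof

c = StubConsumer()
eof = handle_messages(c, [(0, 9), (1, 3)], lambda msgs: None, {0: 10, 1: 50})
print(eof)               # {0}: partition 0 reached its high watermark
print(len(c.committed))  # 2: both offsets were committed after worker_fn
```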