Improve the progress reporting look and feel
Closed, Public

Authored by douardda on Mar 29 2022, 5:45 PM.

Details

Summary
  • use shortened values in the progress bar (e.g. '11.3M/566M' instead of something like '11310201/566123456')
  • shorten the description strings and align them.

The result looks like:

Exporting release:

  • Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
  • Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]
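
The shortened counters can be sketched without any progress-bar library. The helpers below are hypothetical (`shorten` and `progress_label` are illustrative names, not functions from swh.dataset) and assume decimal, 1000-based suffixes with about three significant digits. The actual change may instead rely on the progress-bar library's own scaling option.

```python
def shorten(value: int) -> str:
    """Return value with a decimal suffix, e.g. 11310201 -> '11.3M'."""
    if abs(value) < 1000:
        # Small counts are shown unchanged.
        return str(value)
    number = float(value)
    for suffix in ("k", "M", "G", "T"):
        number /= 1000.0
        if abs(number) < 999.5:
            # Keep about three significant digits.
            return f"{number:.3g}{suffix}"
    return f"{number / 1000.0:.1f}P"


def progress_label(description: str, done: int, total: int, width: int = 6) -> str:
    """Left-align the description so that successive bars line up."""
    return f"- {description:<{width}}: {shorten(done)}/{shorten(total)}"


print(progress_label("Offset", 64, 64))
print(progress_label("Export", 11310201, 566123456))
```

Padding every description to the same width is what keeps the bars aligned when several of them are stacked.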

Depends on D7463

Diff Detail

Repository: rDDATASET Datasets
Lint: Automatic diff as part of commit; lint not applicable.
Unit: Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D7464 (id=27049)

Could not rebase; Attempt merge onto 5a8a8a7847...

Updating 5a8a8a7..2b92bff
Fast-forward
 swh/dataset/exporters/orc.py    |  98 +++++++++++++++++++++++++++++++++------
 swh/dataset/journalprocessor.py |  28 +++++++++--
 swh/dataset/relational.py       |  10 ++++
 swh/dataset/test/test_orc.py    | 100 ++++++++++++++++++++++++++++++++++++----
 4 files changed, 208 insertions(+), 28 deletions(-)
Changes applied before test
commit 2b92bffb38b57391a9fd7e04492b9f7a5b69bb35
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel
    
    - use shortened values in the progress bar (eg. '11.3M/566M' instead of
      something like '11310201/566123456')
    - reduce the description strings and align them.
    
    The result looks like:
    
    Exporting release:
      - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
      - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]

commit bc0a7ce6986eb5242bd32054db13277ca0f1e1bb
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages
    
    to prevent some possible race condition leading to kafka errors like:
    
      rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
    
    This could occur when, at the time of unsubscribing from a partition,
    another partition is also depleted. Since unsubscribing consists of
    resubscribing to all partitions except the unsubscribed ones, we could
    try to subscribe to such a depleted partition, leading to the error message
    listed above.
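
The idea behind this fix can be illustrated with a small self-contained sketch. `FakeConsumer` and this simplified `handle_messages` are hypothetical stand-ins: they are neither the code in swh/dataset/journalprocessor.py nor the Kafka client API. The sketch only shows the ordering: depleted partitions are remembered while the batch is processed, and the assignment is changed once, at the end, from the complete set.

```python
class FakeConsumer:
    """Stand-in for a Kafka consumer; it only records assignment changes."""

    def __init__(self, partitions):
        self.assigned = set(partitions)
        self.assign_calls = []

    def assign(self, partitions):
        self.assigned = set(partitions)
        self.assign_calls.append(sorted(partitions))


def handle_messages(consumer, messages, end_offsets):
    """Process one batch, then drop every depleted partition in one step.

    messages is a list of (partition, offset) pairs; end_offsets maps each
    partition to the last offset that should be exported.
    """
    depleted = set()
    for partition, offset in messages:
        # ... the message would be exported here ...
        if offset >= end_offsets[partition]:
            # Only remember the partition; do not touch the assignment yet.
            depleted.add(partition)
    if depleted:
        # One reassignment at the end of the batch, computed from the
        # complete set of depleted partitions.
        consumer.assign(consumer.assigned - depleted)
    return depleted


consumer = FakeConsumer([0, 1, 2])
done = handle_messages(
    consumer,
    [(0, 5), (1, 9), (2, 3), (0, 6)],
    {0: 6, 1: 9, 2: 10},
)
print(sorted(done), consumer.assign_calls)
```

Because the new assignment is computed after the whole batch has been seen, a partition that became depleted in the same batch cannot be part of it.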

commit ebb5a89f95d73f52e87c456b872ec6c529d80fe3
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections
    
    e.g. config entries specific to the orc exporter are now under the
    'orc' section, like:
    
      journal:
        brokers: [...]
    
      orc:
        remove_pull_requests: true
        max_rows:
          revision: 100000
          directory: 10000
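
As a rough illustration of the layout above, an exporter could look up its own section by name. This is a sketch on a plain dictionary, not the swh.dataset configuration loader. `exporter_config` is a hypothetical helper, and the broker list is left empty because it is elided in the commit message.

```python
def exporter_config(config: dict, exporter_name: str) -> dict:
    """Return the section dedicated to one exporter, or {} if it is absent."""
    section = config.get(exporter_name) or {}
    if not isinstance(section, dict):
        raise ValueError(f"'{exporter_name}' section must be a mapping")
    return section


config = {
    "journal": {"brokers": []},  # broker list elided in the commit message
    "orc": {
        "remove_pull_requests": True,
        "max_rows": {"revision": 100000, "directory": 10000},
    },
}

orc = exporter_config(config, "orc")
print(orc["max_rows"]["revision"])
```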

commit e8ccb166a6aaa82f5917388f9b995c830499170a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files
    
    Make it possible to specify a maximum number of rows a table can store
    in a single ORC file. The limit can only be set on main tables for now
    (i.e. cannot be specified for tables like revision_history or
    directory_entry).
    
    This can be set by configuration only (no extra cli options).
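
The row limit can be pictured with the following sketch. It keeps rows in memory instead of writing real ORC files, and the class name `RowLimitedWriter` and the `<table>-<index>.orc` naming are assumptions made for the example, not the scheme used by the exporter.

```python
from typing import Dict, List, Optional, Tuple


class RowLimitedWriter:
    """Collects rows for one table and starts a new file when the limit is hit."""

    def __init__(self, table: str, max_rows: Optional[int] = None):
        self.table = table
        self.max_rows = max_rows  # None means no limit
        self.files: Dict[str, List[Tuple]] = {}
        self._current: Optional[str] = None

    def _new_file(self) -> str:
        # Hypothetical naming scheme: one numbered file per chunk.
        name = f"{self.table}-{len(self.files)}.orc"
        self.files[name] = []
        return name

    def write(self, row: Tuple) -> None:
        full = (
            self._current is not None
            and self.max_rows is not None
            and len(self.files[self._current]) >= self.max_rows
        )
        if self._current is None or full:
            self._current = self._new_file()
        self.files[self._current].append(row)


writer = RowLimitedWriter("revision", max_rows=2)
for i in range(5):
    writer.write((i,))
print({name: len(rows) for name, rows in writer.files.items()})
```

With `max_rows=None` the writer never rotates, which matches the tables for which no limit is configured.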

commit fd3f9aa61de374655fd4bc4920d5047eb7d0c4ca
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:24:31 2022 +0100

    Add the raw_manifest column for revision, release and directory ORC files

commit 5c652bb058e2c1b59bafefd6817f392fdc171a20
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:22:47 2022 +0100

    Export revision extra headers in a dedicated ORC file

commit 45c8124b7a310963a868eb6602ea24e240d761e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:20:20 2022 +0100

    Add the type fields for revision and origin_visit_status ORC table

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/101/ for more details.

vlorentz added a subscriber: vlorentz.

I don't understand why unit_scale doesn't default to True...

This revision is now accepted and ready to land. Mar 29 2022, 6:01 PM

Build has FAILED

Patch application report for D7464 (id=27068)

Could not rebase; Attempt merge onto fd3f9aa61d...

Updating fd3f9aa..7a1d494
Fast-forward
 swh/dataset/exporters/orc.py    | 99 ++++++++++++++++++++++++++++++++---------
 swh/dataset/journalprocessor.py | 28 +++++++++---
 swh/dataset/relational.py       | 58 ++++++++++++++----------
 swh/dataset/test/test_orc.py    | 96 ++++++++++++++++++++++++++++++++++++---
 4 files changed, 225 insertions(+), 56 deletions(-)
Changes applied before test
commit 7a1d494b6d5c65e63d79f7d2e900f4f66670c8e8
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel
    
    - use shortened values in the progress bar (eg. '11.3M/566M' instead of
      something like '11310201/566123456')
    - reduce the description strings and align them.
    
    The result looks like:
    
    Exporting release:
      - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
      - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]

commit 58493130ee7c13591d26069f91c3cd4ec1ccdc37
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages
    
    to prevent some possible race condition leading to kafka errors like:
    
      rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
    
    This could occur when, at the time of unsubscribing from a partition,
    another partition is also depleted. Since unsubscribing consists of
    resubscribing to all partitions except the unsubscribed ones, we could
    try to subscribe to such a depleted partition, leading to the error message
    listed above.

commit c728c05e630cf02f35081e810279a5e5b24ebf98
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections
    
    e.g. config entries specific to the orc exporter are now under the
    'orc' section, like:
    
      journal:
        brokers: [...]
    
      orc:
        remove_pull_requests: true
        max_rows:
          revision: 100000
          directory: 10000

commit 850ee3be47cf3b1e0ab53f8820cc5e4c86b94f38
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files
    
    Make it possible to specify a maximum number of rows a table can store
    in a single ORC file. The limit can only be set on main tables for now
    (i.e. cannot be specified for tables like revision_history or
    directory_entry).
    
    This can be set by configuration only (no extra cli options).

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/107/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/107/console

Build is green

Patch application report for D7464 (id=27077)

Could not rebase; Attempt merge onto fd3f9aa61d...

Updating fd3f9aa..830b2b7
Fast-forward
 swh/dataset/exporters/orc.py    | 99 ++++++++++++++++++++++++++++++++---------
 swh/dataset/journalprocessor.py | 28 +++++++++---
 swh/dataset/relational.py       | 58 ++++++++++++++----------
 swh/dataset/test/test_orc.py    | 96 ++++++++++++++++++++++++++++++++++++---
 4 files changed, 225 insertions(+), 56 deletions(-)
Changes applied before test
commit 830b2b721420f604de4134ed409c52ee775126df
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel
    
    - use shortened values in the progress bar (eg. '11.3M/566M' instead of
      something like '11310201/566123456')
    - reduce the description strings and align them.
    
    The result looks like:
    
    Exporting release:
      - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
      - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]

commit 9f7c128eb8aba748e24eaa682269d74dcefb98c6
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages
    
    to prevent some possible race condition leading to kafka errors like:
    
      rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
    
    This could occur when, at the time of unsubscribing from a partition,
    another partition is also depleted. Since unsubscribing consists of
    resubscribing to all partitions except the unsubscribed ones, we could
    try to subscribe to such a depleted partition, leading to the error message
    listed above.

commit e01daba4d601733a86ce7401fe54247908d03e5c
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections
    
    e.g. config entries specific to the orc exporter are now under the
    'orc' section, like:
    
      journal:
        brokers: [...]
    
      orc:
        remove_pull_requests: true
        max_rows:
          revision: 100000
          directory: 10000

commit 3df08fd71759487e963e6569c8dfd0c502b060de
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files
    
    Make it possible to specify a maximum number of rows a table can store
    in a single ORC file. The limit can only be set on main tables for now
    (i.e. cannot be specified for tables like revision_history or
    directory_entry).
    
    This can be set by configuration only (no extra cli options).

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/112/ for more details.

Build is green

Patch application report for D7464 (id=27081)

Could not rebase; Attempt merge onto fd3f9aa61d...

Updating fd3f9aa..58b9f92
Fast-forward
 swh/dataset/exporters/orc.py    | 99 ++++++++++++++++++++++++++++++++---------
 swh/dataset/journalprocessor.py | 36 ++++++++++++---
 swh/dataset/relational.py       | 58 ++++++++++++++----------
 swh/dataset/test/test_orc.py    | 96 ++++++++++++++++++++++++++++++++++++---
 4 files changed, 233 insertions(+), 56 deletions(-)
Changes applied before test
commit 58b9f92e1d4df026c2f82e8d139c36a78d79449f
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel
    
    - use shortened values in the progress bar (eg. '11.3M/566M' instead of
      something like '11310201/566123456')
    - reduce the description strings and align them.
    
    The result looks like:
    
    Exporting release:
      - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
      - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]

commit be3c5da2bb44c73fe614e93073727ffd50665a9c
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages
    
    to prevent some possible race condition leading to kafka errors like:
    
      rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
    
    This could occur when, at the time of unsubscribing from a partition,
    another partition is also depleted. Since unsubscribing consists of
    resubscribing to all partitions except the unsubscribed ones, we could
    try to subscribe to such a depleted partition, leading to the error message
    listed above.

commit e01daba4d601733a86ce7401fe54247908d03e5c
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections
    
    e.g. config entries specific to the orc exporter are now under the
    'orc' section, like:
    
      journal:
        brokers: [...]
    
      orc:
        remove_pull_requests: true
        max_rows:
          revision: 100000
          directory: 10000

commit 3df08fd71759487e963e6569c8dfd0c502b060de
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files
    
    Make it possible to specify a maximum number of rows a table can store
    in a single ORC file. The limit can only be set on main tables for now
    (i.e. cannot be specified for tables like revision_history or
    directory_entry).
    
    This can be set by configuration only (no extra cli options).

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/115/ for more details.