
Add support for blob in content export
Closed, Public

Authored by douardda on Mar 29 2022, 5:45 PM.

Details

Summary

This feature requires the config parameter `with_data=true` and an objstorage
configuration.
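
As a rough illustration of the requirement above, here is a minimal sketch of a check for the two settings. The helper name `check_blob_export_config`, the placement of `with_data` under the `orc` section, the top-level `objstorage` key and the example values are all assumptions made for this example; they are not taken from the diff itself.

```python
from typing import Any, Mapping


def check_blob_export_config(config: Mapping[str, Any]) -> bool:
    """Return True when blob export is requested and an objstorage is set.

    Assumption: `with_data` sits in the `orc` exporter section and the
    objstorage settings live under a top-level `objstorage` key.
    """
    orc_section = config.get("orc", {})
    if not orc_section.get("with_data", False):
        # Blob export is not requested, so no objstorage is needed.
        return False
    if not config.get("objstorage"):
        raise ValueError("with_data=true requires an objstorage configuration")
    return True


# Hypothetical configuration, written as the dict a YAML loader would produce.
example_config = {
    "journal": {"brokers": ["localhost:9092"]},
    "orc": {"with_data": True},
    "objstorage": {"cls": "memory"},
}
```

With `example_config`, the check returns `True`; dropping the `objstorage` entry while keeping `with_data` enabled raises a `ValueError`.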

Depends on D7464.

Diff Detail

Repository
rDDATASET Datasets
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D7465 (id=27051)

Could not rebase; Attempt merge onto 5a8a8a7847...

Updating 5a8a8a7..4139383
Fast-forward
 swh/dataset/exporters/orc.py    | 116 ++++++++++++++++++++---
 swh/dataset/journalprocessor.py |  28 +++++-
 swh/dataset/relational.py       |  11 +++
 swh/dataset/test/test_orc.py    | 199 +++++++++++++++++++++++++++++++---------
 4 files changed, 289 insertions(+), 65 deletions(-)
Changes applied before test
commit 4139383faf3db6ed902a5be917e80fadd3260885
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export
    
    this feature requires a config parameter `with_data=true` and an objstorage
    configuration.

commit 2b92bffb38b57391a9fd7e04492b9f7a5b69bb35
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel
    
    - use shortened values in the progress bar (eg. '11.3M/566M' instead of
      something like '11310201/566123456')
    - reduce the description strings and align them.
    
    The result looks like:
    
    Exporting release:
      - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
      - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]

commit bc0a7ce6986eb5242bd32054db13277ca0f1e1bb
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages
    
    to prevent some possible race condition leading to kafka errors like:
    
      rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
    
    This could occur when, at the time of unsubscribing from a partition,
    another partition is also depleted. Since the unsubscription consists in
    resubscribing to all partitions except unsubscribed ones, we could try
    to subscribe to such a depleted partition, leading to the error message
    listed above.

commit ebb5a89f95d73f52e87c456b872ec6c529d80fe3
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections
    
    eg. orc exporter specific exporter config entries are now under the
    'orc' section, like:
    
      journal:
        brokers: [...]
    
      orc:
        remove_pull_requests: true
        max_rows:
          revision: 100000
          directory: 10000

commit e8ccb166a6aaa82f5917388f9b995c830499170a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files
    
    Make it possible to specify a maximum number of rows a table can store
    in a single ORC file. The limit can only be set on main tables for now
    (i.e. cannot be specified for tables like revision_history or
    directory_entry).
    
    This can be set by configuration only (no extra cli options).

commit fd3f9aa61de374655fd4bc4920d5047eb7d0c4ca
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:24:31 2022 +0100

    Add the raw_manifest column for revision, release and directory ORC files

commit 5c652bb058e2c1b59bafefd6817f392fdc171a20
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:22:47 2022 +0100

    Export revision extra headers in a dedicated ORC file

commit 45c8124b7a310963a868eb6602ea24e240d761e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:20:20 2022 +0100

    Add the type fields for revision and origin_visit_status ORC table

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/102/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/102/console

Harbormaster returned this revision to the author for changes because remote builds failed. (Mar 29 2022, 5:46 PM)
Harbormaster failed remote builds in B27966: Diff 27051!

Build is green

Patch application report for D7465 (id=27061)

Could not rebase; Attempt merge onto fd3f9aa61d...

Updating fd3f9aa..86f2b03
Fast-forward
 requirements-swh.txt            |   1 +
 swh/dataset/exporters/orc.py    | 101 +++++++++++++++----
 swh/dataset/journalprocessor.py |  28 +++++-
 swh/dataset/relational.py       |   1 +
 swh/dataset/test/test_orc.py    | 214 ++++++++++++++++++++++++++++++----------
 5 files changed, 271 insertions(+), 74 deletions(-)
Changes applied before test
commit 86f2b03632320eb44868a2a74ec17cceee32058b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export
    
    this feature requires a config parameter `with_data=true` and an objstorage
    configuration.

commit 2b92bffb38b57391a9fd7e04492b9f7a5b69bb35
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel
    
    - use shortened values in the progress bar (eg. '11.3M/566M' instead of
      something like '11310201/566123456')
    - reduce the description strings and align them.
    
    The result looks like:
    
    Exporting release:
      - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
      - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]

commit bc0a7ce6986eb5242bd32054db13277ca0f1e1bb
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages
    
    to prevent some possible race condition leading to kafka errors like:
    
      rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
    
    This could occur when, at the time of unsubscribing from a partition,
    another partition is also depleted. Since the unsubscription consists in
    resubscribing to all partitions except unsubscribed ones, we could try
    to subscribe to such a depleted partition, leading to the error message
    listed above.

commit ebb5a89f95d73f52e87c456b872ec6c529d80fe3
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections
    
    eg. orc exporter specific exporter config entries are now under the
    'orc' section, like:
    
      journal:
        brokers: [...]
    
      orc:
        remove_pull_requests: true
        max_rows:
          revision: 100000
          directory: 10000

commit e8ccb166a6aaa82f5917388f9b995c830499170a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files
    
    Make it possible to specify a maximum number of rows a table can store
    in a single ORC file. The limit can only be set on main tables for now
    (i.e. cannot be specified for tables like revision_history or
    directory_entry).
    
    This can be set by configuration only (no extra cli options).

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/103/ for more details.

Build has FAILED

Patch application report for D7465 (id=27069)

Could not rebase; Attempt merge onto fd3f9aa61d...

Updating fd3f9aa..f31a113
Fast-forward
 requirements-swh.txt            |   1 +
 swh/dataset/exporters/orc.py    | 115 +++++++++++++++++----
 swh/dataset/journalprocessor.py |  28 +++++-
 swh/dataset/relational.py       |  59 ++++++-----
 swh/dataset/test/test_orc.py    | 216 ++++++++++++++++++++++++++++++----------
 5 files changed, 319 insertions(+), 100 deletions(-)
Changes applied before test
commit f31a11337fa25f620533325229ca30f7fb6c9a50
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export
    
    this feature requires a config parameter `with_data=true` and an objstorage
    configuration.

commit 7a1d494b6d5c65e63d79f7d2e900f4f66670c8e8
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel
    
    - use shortened values in the progress bar (eg. '11.3M/566M' instead of
      something like '11310201/566123456')
    - reduce the description strings and align them.
    
    The result looks like:
    
    Exporting release:
      - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
      - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]

commit 58493130ee7c13591d26069f91c3cd4ec1ccdc37
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages
    
    to prevent some possible race condition leading to kafka errors like:
    
      rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
    
    This could occur when, at the time of unsubscribing from a partition,
    another partition is also depleted. Since the unsubscription consists in
    resubscribing to all partitions except unsubscribed ones, we could try
    to subscribe to such a depleted partition, leading to the error message
    listed above.

commit c728c05e630cf02f35081e810279a5e5b24ebf98
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections
    
    eg. orc exporter specific exporter config entries are now under the
    'orc' section, like:
    
      journal:
        brokers: [...]
    
      orc:
        remove_pull_requests: true
        max_rows:
          revision: 100000
          directory: 10000

commit 850ee3be47cf3b1e0ab53f8820cc5e4c86b94f38
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files
    
    Make it possible to specify a maximum number of rows a table can store
    in a single ORC file. The limit can only be set on main tables for now
    (i.e. cannot be specified for tables like revision_history or
    directory_entry).
    
    This can be set by configuration only (no extra cli options).

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/108/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/108/console

Build is green

Patch application report for D7465 (id=27078)

Could not rebase; Attempt merge onto fd3f9aa61d...

Updating fd3f9aa..54c4ed2
Fast-forward
 requirements-swh.txt            |   1 +
 swh/dataset/exporters/orc.py    | 115 +++++++++++++++++----
 swh/dataset/journalprocessor.py |  28 +++++-
 swh/dataset/relational.py       |  59 ++++++-----
 swh/dataset/test/test_orc.py    | 216 ++++++++++++++++++++++++++++++----------
 5 files changed, 319 insertions(+), 100 deletions(-)
Changes applied before test
commit 54c4ed2804d6800a114003866852471769916875
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export
    
    this feature requires a config parameter `with_data=true` and an objstorage
    configuration.

commit 830b2b721420f604de4134ed409c52ee775126df
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel
    
    - use shortened values in the progress bar (eg. '11.3M/566M' instead of
      something like '11310201/566123456')
    - reduce the description strings and align them.
    
    The result looks like:
    
    Exporting release:
      - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
      - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]

commit 9f7c128eb8aba748e24eaa682269d74dcefb98c6
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages
    
    to prevent some possible race condition leading to kafka errors like:
    
      rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
    
    This could occur when, at the time of unsubscribing from a partition,
    another partition is also depleted. Since the unsubscription consists in
    resubscribing to all partitions except unsubscribed ones, we could try
    to subscribe to such a depleted partition, leading to the error message
    listed above.

commit e01daba4d601733a86ce7401fe54247908d03e5c
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections
    
    eg. orc exporter specific exporter config entries are now under the
    'orc' section, like:
    
      journal:
        brokers: [...]
    
      orc:
        remove_pull_requests: true
        max_rows:
          revision: 100000
          directory: 10000

commit 3df08fd71759487e963e6569c8dfd0c502b060de
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files
    
    Make it possible to specify a maximum number of rows a table can store
    in a single ORC file. The limit can only be set on main tables for now
    (i.e. cannot be specified for tables like revision_history or
    directory_entry).
    
    This can be set by configuration only (no extra cli options).

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/113/ for more details.

Build is green

Patch application report for D7465 (id=27082)

Could not rebase; Attempt merge onto fd3f9aa61d...

Updating fd3f9aa..681b451
Fast-forward
 requirements-swh.txt            |   1 +
 swh/dataset/exporters/orc.py    | 115 +++++++++++++++++----
 swh/dataset/journalprocessor.py |  36 ++++++-
 swh/dataset/relational.py       |  59 ++++++-----
 swh/dataset/test/test_orc.py    | 216 ++++++++++++++++++++++++++++++----------
 5 files changed, 327 insertions(+), 100 deletions(-)
Changes applied before test
commit 681b45135f1d54e9fb62a609fb45e66e602b006b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export
    
    this feature requires a config parameter `with_data=true` and an objstorage
    configuration.

commit 58b9f92e1d4df026c2f82e8d139c36a78d79449f
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel
    
    - use shortened values in the progress bar (eg. '11.3M/566M' instead of
      something like '11310201/566123456')
    - reduce the description strings and align them.
    
    The result looks like:
    
    Exporting release:
      - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
      - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]

commit be3c5da2bb44c73fe614e93073727ffd50665a9c
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages
    
    to prevent some possible race condition leading to kafka errors like:
    
      rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
    
    This could occur when, at the time of unsubscribing from a partition,
    another partition is also depleted. Since the unsubscription consists in
    resubscribing to all partitions except unsubscribed ones, we could try
    to subscribe to such a depleted partition, leading to the error message
    listed above.

commit e01daba4d601733a86ce7401fe54247908d03e5c
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections
    
    eg. orc exporter specific exporter config entries are now under the
    'orc' section, like:
    
      journal:
        brokers: [...]
    
      orc:
        remove_pull_requests: true
        max_rows:
          revision: 100000
          directory: 10000

commit 3df08fd71759487e963e6569c8dfd0c502b060de
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files
    
    Make it possible to specify a maximum number of rows a table can store
    in a single ORC file. The limit can only be set on main tables for now
    (i.e. cannot be specified for tables like revision_history or
    directory_entry).
    
    This can be set by configuration only (no extra cli options).

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/116/ for more details.

Build is green

Patch application report for D7465 (id=27132)

Rebasing onto 76eba6595b...

Current branch diff-target is up to date.
Changes applied before test
commit 880f4b4af339a4188827befc09524f68ebf4276d
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export
    
    this feature requires a config parameter `with_data=true` and an objstorage
    configuration.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/120/ for more details.

vlorentz added a subscriber: vlorentz.

LGTM, but we should eventually replace self.objstorage.get() with self.objstorage.get_batch(); it is much faster with the Azure backend (and hopefully more backends in the future, including Winery!)

This revision is now accepted and ready to land. (Apr 5 2022, 4:01 PM)

> LGTM, but we should eventually replace self.objstorage.get() with self.objstorage.get_batch(); it is much faster with the Azure backend (and hopefully more backends in the future, including Winery!)

Unfortunately, with the way the dataset exporter currently works, it is not easy to make batch queries: objects are handled one at a time (from the Kafka journal).
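
To make the trade-off discussed above concrete, here is a minimal sketch of how a one-at-a-time stream of object ids could be buffered so that blobs are fetched in batches. `FakeObjStorage` is a hypothetical in-memory stand-in, and its `get()` and `get_batch()` signatures are assumptions made for this example; the real objstorage API may differ. `fetch_in_batches` is likewise an illustrative helper, not code from this diff.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


class FakeObjStorage:
    """Hypothetical in-memory stand-in exposing get() and get_batch()."""

    def __init__(self, blobs: Dict[bytes, bytes]) -> None:
        self.blobs = blobs
        self.calls = 0  # number of simulated round trips to the backend

    def get(self, obj_id: bytes) -> bytes:
        self.calls += 1
        return self.blobs[obj_id]

    def get_batch(self, obj_ids: List[bytes]) -> Iterator[bytes]:
        self.calls += 1
        return (self.blobs[obj_id] for obj_id in obj_ids)


def fetch_in_batches(
    objstorage: FakeObjStorage, obj_ids: Iterable[bytes], batch_size: int
) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (obj_id, blob) pairs, querying the storage once per batch."""
    buffer: List[bytes] = []
    for obj_id in obj_ids:
        buffer.append(obj_id)
        if len(buffer) >= batch_size:
            yield from zip(buffer, objstorage.get_batch(buffer))
            buffer = []
    if buffer:
        # Flush the last, possibly incomplete, batch.
        yield from zip(buffer, objstorage.get_batch(buffer))
```

The cost of this approach is that a row cannot be written until its batch is full or the stream ends, so an exporter that handles one journal message at a time would need to hold pending objects and flush them explicitly.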

Build is green

Patch application report for D7465 (id=27278)

Rebasing onto f282e88a90...

Current branch diff-target is up to date.
Changes applied before test
commit 9d97f0c0826dc4be4505e79f7f1b5eed18a05dff
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export
    
    this feature requires a config parameter `with_data=true` and an objstorage
    configuration.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/123/ for more details.