This feature requires a config parameter with_data=true and an objstorage
configuration.
Depends on D7464.
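As a rough illustration of the requirement stated above, the following sketch checks that an objstorage configuration is present whenever with_data is enabled. Only the with_data parameter and the need for an objstorage configuration come from this revision; placing both keys under an 'orc' section, the objstorage fields, the broker address and the helper name check_blob_export_config are assumptions made for the example, not the exporter's actual code.

```python
from typing import Any, Dict

# Hypothetical exporter configuration, written as the dict a YAML loader
# would produce. The placement of `with_data` and `objstorage` under "orc"
# and the objstorage fields are assumptions for illustration only.
EXAMPLE_CONFIG: Dict[str, Any] = {
    "journal": {"brokers": ["kafka.example.org:9092"]},
    "orc": {
        "with_data": True,
        "objstorage": {"cls": "memory"},
    },
}


def check_blob_export_config(config: Dict[str, Any]) -> bool:
    """Return True when blob export is enabled and an objstorage is configured.

    Returns False when blob export is disabled, and raises ValueError when
    with_data is set but no objstorage configuration is present.
    """
    orc = config.get("orc", {})
    if not orc.get("with_data", False):
        return False  # blob export disabled: no objstorage needed
    if not orc.get("objstorage"):
        raise ValueError("with_data=true requires an objstorage configuration")
    return True


if __name__ == "__main__":
    print(check_blob_export_config(EXAMPLE_CONFIG))
```

Failing early on a missing objstorage section keeps a misconfigured export from starting and then aborting on the first content object.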
Differential D7465
Add support for blob in content export
Authored by douardda on Mar 29 2022, 5:45 PM.
Event Timeline

Build has FAILED
Patch application report for D7465 (id=27051)
Could not rebase; Attempt merge onto 5a8a8a7847...
Updating 5a8a8a7..4139383
Fast-forward
 swh/dataset/exporters/orc.py    | 116 ++++++++++++++++++++---
 swh/dataset/journalprocessor.py |  28 +++++-
 swh/dataset/relational.py       |  11 +++
 swh/dataset/test/test_orc.py    | 199 +++++++++++++++++++++++++++++++---------
 4 files changed, 289 insertions(+), 65 deletions(-)
Changes applied before test
commit 4139383faf3db6ed902a5be917e80fadd3260885
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export

    this feature requires a config parameter `with_data=true` and an objstorage
    configuration.

commit 2b92bffb38b57391a9fd7e04492b9f7a5b69bb35
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel

    - use shortened values in the progress bar (eg. '11.3M/566M' instead of
      something like '11310201/566123456')
    - reduce the description strings and align them.

    The result looks like:

    Exporting release:
    - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
    - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]

commit bc0a7ce6986eb5242bd32054db13277ca0f1e1bb
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages

    to prevent some possible race condition leading to kafka errors like:

      rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed

    This could occur when, at the time of unsubscribing from a partition,
    another partition is also depleted. Since the unsubscription consist in
    resubscribing to all partitions except unsubscribes ones, we could try
    to subscribe to such a depleted parition, leading to the error message
    listed above.

commit ebb5a89f95d73f52e87c456b872ec6c529d80fe3
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections

    eg. orc exporter specific exporter config entries are now under the 'orc'
    section, like:

      journal:
        brokers: [...]
      orc:
        remove_pull_requests: true
        max_rows:
          revision: 100000
          directory: 10000

commit e8ccb166a6aaa82f5917388f9b995c830499170a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files

    Make it possible to specify a maximum number of rows a table can store
    in a single ORC file. The limit can only be set on main tables for
    now (i.e. cannot be specified for tables like revision_history or
    directory_entry).

    This can be set by configuration only (no extra cli options).

commit fd3f9aa61de374655fd4bc4920d5047eb7d0c4ca
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:24:31 2022 +0100

    Add the raw_manifest column for revision, release and directory ORC files

commit 5c652bb058e2c1b59bafefd6817f392fdc171a20
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:22:47 2022 +0100

    Export revision extra headers in a dedicated ORC file

commit 45c8124b7a310963a868eb6602ea24e240d761e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:20:20 2022 +0100

    Add the type fields for revision and origin_visit_status ORC table

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/102/

Build is green
Patch application report for D7465 (id=27061)
Could not rebase; Attempt merge onto fd3f9aa61d...
Updating fd3f9aa..86f2b03
Fast-forward
 requirements-swh.txt            |   1 +
 swh/dataset/exporters/orc.py    | 101 +++++++++++++++----
 swh/dataset/journalprocessor.py |  28 +++++-
 swh/dataset/relational.py       |   1 +
 swh/dataset/test/test_orc.py    | 214 ++++++++++++++++++++++++++++++----------
 5 files changed, 271 insertions(+), 74 deletions(-)
Changes applied before test
commit 86f2b03632320eb44868a2a74ec17cceee32058b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export

commit 2b92bffb38b57391a9fd7e04492b9f7a5b69bb35
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel

commit bc0a7ce6986eb5242bd32054db13277ca0f1e1bb
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages

commit ebb5a89f95d73f52e87c456b872ec6c529d80fe3
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections

commit e8ccb166a6aaa82f5917388f9b995c830499170a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/103/ for more details.

Build has FAILED
Patch application report for D7465 (id=27069)
Could not rebase; Attempt merge onto fd3f9aa61d...
Updating fd3f9aa..f31a113
Fast-forward
 requirements-swh.txt            |   1 +
 swh/dataset/exporters/orc.py    | 115 +++++++++++++++++----
 swh/dataset/journalprocessor.py |  28 +++++-
 swh/dataset/relational.py       |  59 ++++++-----
 swh/dataset/test/test_orc.py    | 216 ++++++++++++++++++++++++++++++----------
 5 files changed, 319 insertions(+), 100 deletions(-)
Changes applied before test
commit f31a11337fa25f620533325229ca30f7fb6c9a50
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export

commit 7a1d494b6d5c65e63d79f7d2e900f4f66670c8e8
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel

commit 58493130ee7c13591d26069f91c3cd4ec1ccdc37
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages

commit c728c05e630cf02f35081e810279a5e5b24ebf98
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections

commit 850ee3be47cf3b1e0ab53f8820cc5e4c86b94f38
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/108/

Build is green
Patch application report for D7465 (id=27078)
Could not rebase; Attempt merge onto fd3f9aa61d...
Updating fd3f9aa..54c4ed2
Fast-forward
 requirements-swh.txt            |   1 +
 swh/dataset/exporters/orc.py    | 115 +++++++++++++++++----
 swh/dataset/journalprocessor.py |  28 +++++-
 swh/dataset/relational.py       |  59 ++++++-----
 swh/dataset/test/test_orc.py    | 216 ++++++++++++++++++++++++++++++----------
 5 files changed, 319 insertions(+), 100 deletions(-)
Changes applied before test
commit 54c4ed2804d6800a114003866852471769916875
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export

commit 830b2b721420f604de4134ed409c52ee775126df
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel

commit 9f7c128eb8aba748e24eaa682269d74dcefb98c6
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages

commit e01daba4d601733a86ce7401fe54247908d03e5c
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections

commit 3df08fd71759487e963e6569c8dfd0c502b060de
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/113/ for more details.

Build is green
Patch application report for D7465 (id=27082)
Could not rebase; Attempt merge onto fd3f9aa61d...
Updating fd3f9aa..681b451
Fast-forward
 requirements-swh.txt            |   1 +
 swh/dataset/exporters/orc.py    | 115 +++++++++++++++++----
 swh/dataset/journalprocessor.py |  36 ++++++-
 swh/dataset/relational.py       |  59 ++++++-----
 swh/dataset/test/test_orc.py    | 216 ++++++++++++++++++++++++++++++----------
 5 files changed, 327 insertions(+), 100 deletions(-)
Changes applied before test
commit 681b45135f1d54e9fb62a609fb45e66e602b006b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export

commit 58b9f92e1d4df026c2f82e8d139c36a78d79449f
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:58:55 2022 +0100

    Improve the progress reporting look and feel

commit be3c5da2bb44c73fe614e93073727ffd50665a9c
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:52:26 2022 +0100

    Delay the unsubscribe to the end of handle_messages

commit e01daba4d601733a86ce7401fe54247908d03e5c
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 25 15:43:18 2022 +0100

    Move exporter config entries in dedicated sections

commit 3df08fd71759487e963e6569c8dfd0c502b060de
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/116/ for more details.

Build is green
Patch application report for D7465 (id=27132)
Rebasing onto 76eba6595b...
Current branch diff-target is up to date.
Changes applied before test
commit 880f4b4af339a4188827befc09524f68ebf4276d
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/120/ for more details.

LGTM, but we should eventually replace self.objstorage.get() with self.objstorage.get_batch(), it is much faster with the Azure backend (and hopefully more backends in the future, inc. Winery!)

unfortunately, with the way the dataset exporter currently works, it's not easy to make batch queries; objects are handled one at a time (from the kafka journal).

Build is green
Patch application report for D7465 (id=27278)
Rebasing onto f282e88a90...
Current branch diff-target is up to date.
Changes applied before test
commit 9d97f0c0826dc4be4505e79f7f1b5eed18a05dff
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 17:01:19 2022 +0100

    Add support for blob in content export

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/123/ for more details.
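To make the get() versus get_batch() trade-off from the review comments concrete, here is a minimal sketch of one way objects that arrive one at a time could be buffered and fetched in batches. It is not how the dataset exporter works and is not proposed in this revision. FakeObjStorage, BatchingFetcher, the sink callback and the batch size are all hypothetical, and the stand-in get()/get_batch() signatures are assumptions, since the real objstorage API is not shown on this page.

```python
from typing import Callable, Dict, Iterable, List


class FakeObjStorage:
    """Hypothetical in-memory stand-in for an object storage backend."""

    def __init__(self, blobs: Dict[bytes, bytes]) -> None:
        self._blobs = blobs
        self.calls = 0  # number of round trips made to the backend

    def get(self, obj_id: bytes) -> bytes:
        self.calls += 1
        return self._blobs[obj_id]

    def get_batch(self, obj_ids: Iterable[bytes]) -> List[bytes]:
        self.calls += 1  # one round trip for the whole batch
        return [self._blobs[obj_id] for obj_id in obj_ids]


class BatchingFetcher:
    """Buffer object ids received one at a time and fetch them in batches."""

    def __init__(
        self,
        objstorage: FakeObjStorage,
        batch_size: int,
        sink: Callable[[bytes, bytes], None],
    ) -> None:
        self.objstorage = objstorage
        self.batch_size = batch_size
        self.sink = sink
        self._pending: List[bytes] = []

    def add(self, obj_id: bytes) -> None:
        """Queue one id; fetch as soon as a full batch is pending."""
        self._pending.append(obj_id)
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Fetch all pending ids in a single batch call, in arrival order."""
        if not self._pending:
            return
        ids, self._pending = self._pending, []
        for obj_id, data in zip(ids, self.objstorage.get_batch(ids)):
            self.sink(obj_id, data)


# Five ids handled one at a time, fetched in batches of two.
blobs = {bytes([i]): b"data-%d" % i for i in range(5)}
storage = FakeObjStorage(blobs)
rows: List[tuple] = []
fetcher = BatchingFetcher(storage, batch_size=2, sink=lambda i, d: rows.append((i, d)))
for obj_id in blobs:
    fetcher.add(obj_id)
fetcher.flush()  # do not forget the final, partial batch
```

The cost of such buffering is that a row can only be written once its batch has been fetched, which is part of why the reply above describes batch queries as not easy to fit into one-at-a-time journal processing.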