This feature requires a config parameter `with_data=true` and an objstorage configuration.
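For illustration, a minimal sketch of what such a configuration could look like, following the sectioned config layout shown further down this page; the exact section holding `with_data` and the objstorage settings, as well as the backend chosen here, are assumptions rather than something stated in this diff:

journal:
  brokers: [...]
orc:
  with_data: true
  # hypothetical objstorage backend; any swh.objstorage configuration should work
  objstorage:
    cls: remote
    url: http://objstorage.internal:5003/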
Depends on D7464.
Differential D7465
Add support for blob in content export
Authored by douardda on Mar 29 2022, 5:45 PM.
Details
This feature requires a config parameter `with_data=true` and an objstorage configuration.
Depends on D7464.
Diff Detail
Event Timeline

Comment Actions
Build has FAILED
Patch application report for D7465 (id=27051)
Could not rebase; Attempt merge onto 5a8a8a7847...
Updating 5a8a8a7..4139383
Fast-forward
 swh/dataset/exporters/orc.py    | 116 ++++++++++++++++++++---
 swh/dataset/journalprocessor.py |  28 +++++-
 swh/dataset/relational.py       |  11 +++
 swh/dataset/test/test_orc.py    | 199 +++++++++++++++++++++++++++++++---------
 4 files changed, 289 insertions(+), 65 deletions(-)
Changes applied before test:
commit 4139383faf3db6ed902a5be917e80fadd3260885
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Mar 23 17:01:19 2022 +0100
Add support for blob in content export
this feature requires a config parameter `with_data=true` and an objstorage
configuration.
commit 2b92bffb38b57391a9fd7e04492b9f7a5b69bb35
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:58:55 2022 +0100
Improve the progress reporting look and feel
- use shortened values in the progress bar (e.g. '11.3M/566M' instead of
something like '11310201/566123456')
- reduce the description strings and align them.
The result looks like:
Exporting release:
- Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
- Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]
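The shortened '11.3M/566M' style above is what tqdm produces when unit scaling is enabled. A minimal sketch, assuming the progress bars are tqdm-based (the sample output suggests this, but the diff itself is not quoted here):

from tqdm import tqdm

# unit_scale=True makes tqdm render the counters with SI prefixes
# ('11.3M/566M' instead of '11310201/566123456'); a short, fixed
# description string keeps several bars vertically aligned.
with tqdm(total=566_123_456, desc="- Export", unit_scale=True) as progress:
    progress.set_postfix(workers="4/4")  # extra info shown after the rate
    progress.update(11_310_201)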
commit bc0a7ce6986eb5242bd32054db13277ca0f1e1bb
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:52:26 2022 +0100
Delay the unsubscribe to the end of handle_messages
to prevent a possible race condition leading to Kafka errors like:
rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
This could occur when, at the time of unsubscribing from a partition,
another partition is also depleted. Since unsubscribing consists of
resubscribing to all partitions except the unsubscribed ones, we could try
to subscribe to such a depleted partition, leading to the error message
listed above.
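A minimal sketch of this deferred-unsubscribe idea, written against the confluent-kafka consumer API; the helper names (process, is_partition_depleted), the batching parameters, and the overall shape are hypothetical, not the actual swh/dataset/journalprocessor.py code:

def handle_messages(consumer, assignment, process, is_partition_depleted):
    """Consume one batch of messages, deferring partition unsubscription
    to the end of the batch instead of doing it as soon as a partition
    is found to be depleted."""
    depleted = set()
    for message in consumer.consume(num_messages=20, timeout=1.0):
        if message.error():
            continue
        process(message)
        partition = (message.topic(), message.partition())
        # Only record depleted partitions here: re-assigning inside the
        # loop while another partition is also being stopped is what can
        # trigger the rd_kafka_assignment_partition_stopped assertion.
        if is_partition_depleted(partition):
            depleted.add(partition)
    if depleted:
        # "Unsubscribing" is done by re-assigning everything that is not
        # depleted yet, once, after the whole batch has been handled.
        remaining = [tp for tp in assignment
                     if (tp.topic, tp.partition) not in depleted]
        consumer.assign(remaining)
    return depleted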
commit ebb5a89f95d73f52e87c456b872ec6c529d80fe3
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:43:18 2022 +0100
Move exporter config entries in dedicated sections
e.g. ORC exporter-specific config entries are now under the
'orc' section, like:
journal:
  brokers: [...]
orc:
  remove_pull_requests: true
  max_rows:
    revision: 100000
    directory: 10000
commit e8ccb166a6aaa82f5917388f9b995c830499170a
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Mar 23 16:35:52 2022 +0100
Add support for limited row numbers in ORC files
Make it possible to specify a maximum number of rows a table can store
in a single ORC file. The limit can only be set on main tables for now
(i.e. cannot be specified for tables like revision_history or
directory_entry).
This can be set by configuration only (no extra cli options).
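A minimal sketch of one way such a per-table limit can be enforced, by rotating output files once the limit is reached; the class name and the open_writer callback are illustrative, not the actual swh/dataset/exporters/orc.py implementation:

class RotatingTableWriter:
    """Write rows for one table, starting a new ORC file once max_rows
    rows have been written to the current one (no limit if max_rows is None)."""

    def __init__(self, open_writer, max_rows=None):
        # open_writer is a zero-argument callable returning a fresh writer
        # object exposing write(row) and close() (e.g. wrapping an ORC writer).
        self.open_writer = open_writer
        self.max_rows = max_rows
        self.writer = None
        self.rows_in_current_file = 0

    def write(self, row):
        need_new_file = self.writer is None or (
            self.max_rows is not None
            and self.rows_in_current_file >= self.max_rows
        )
        if need_new_file:
            if self.writer is not None:
                self.writer.close()
            self.writer = self.open_writer()
            self.rows_in_current_file = 0
        self.writer.write(row)
        self.rows_in_current_file += 1

    def close(self):
        if self.writer is not None:
            self.writer.close()

With a configuration such as max_rows: {revision: 100000, directory: 10000}, each main table's writer would be wrapped this way, while auxiliary tables like revision_history or directory_entry keep a single unbounded writer.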
commit fd3f9aa61de374655fd4bc4920d5047eb7d0c4ca
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 18 12:24:31 2022 +0100
Add the raw_manifest column for revision, release and directory ORC files
commit 5c652bb058e2c1b59bafefd6817f392fdc171a20
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 18 12:22:47 2022 +0100
Export revision extra headers in a dedicated ORC file
commit 45c8124b7a310963a868eb6602ea24e240d761e4
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 18 12:20:20 2022 +0100
Add the type fields for revision and origin_visit_status ORC table
Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/102/

Comment Actions
Build is green
Patch application report for D7465 (id=27061)
Could not rebase; Attempt merge onto fd3f9aa61d...
Updating fd3f9aa..86f2b03
Fast-forward
 requirements-swh.txt            |   1 +
 swh/dataset/exporters/orc.py    | 101 +++++++++++++++----
 swh/dataset/journalprocessor.py |  28 +++++-
 swh/dataset/relational.py       |   1 +
 swh/dataset/test/test_orc.py    | 214 ++++++++++++++++++++++++++++++----------
 5 files changed, 271 insertions(+), 74 deletions(-)
Changes applied before test:
commit 86f2b03632320eb44868a2a74ec17cceee32058b
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Mar 23 17:01:19 2022 +0100
Add support for blob in content export
this feature requires a config parameter `with_data=true` and an objstorage
configuration.
commit 2b92bffb38b57391a9fd7e04492b9f7a5b69bb35
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:58:55 2022 +0100
Improve the progress reporting look and feel
- use shortened values in the progress bar (e.g. '11.3M/566M' instead of
something like '11310201/566123456')
- reduce the description strings and align them.
The result looks like:
Exporting release:
- Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
- Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]
commit bc0a7ce6986eb5242bd32054db13277ca0f1e1bb
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:52:26 2022 +0100
Delay the unsubscribe to the end of handle_messages
to prevent a possible race condition leading to Kafka errors like:
rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
This could occur when, at the time of unsubscribing from a partition,
another partition is also depleted. Since unsubscribing consists of
resubscribing to all partitions except the unsubscribed ones, we could try
to subscribe to such a depleted partition, leading to the error message
listed above.
commit ebb5a89f95d73f52e87c456b872ec6c529d80fe3
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:43:18 2022 +0100
Move exporter config entries in dedicated sections
e.g. ORC exporter-specific config entries are now under the
'orc' section, like:
journal:
  brokers: [...]
orc:
  remove_pull_requests: true
  max_rows:
    revision: 100000
    directory: 10000
commit e8ccb166a6aaa82f5917388f9b995c830499170a
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Mar 23 16:35:52 2022 +0100
Add support for limited row numbers in ORC files
Make it possible to specify a maximum number of rows a table can store
in a single ORC file. The limit can only be set on main tables for now
(i.e. cannot be specified for tables like revision_history or
directory_entry).
This can be set by configuration only (no extra cli options).
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/103/ for more details.

Comment Actions
Build has FAILED
Patch application report for D7465 (id=27069)
Could not rebase; Attempt merge onto fd3f9aa61d...
Updating fd3f9aa..f31a113
Fast-forward
 requirements-swh.txt            |   1 +
 swh/dataset/exporters/orc.py    | 115 +++++++++++++++++----
 swh/dataset/journalprocessor.py |  28 +++++-
 swh/dataset/relational.py       |  59 ++++++-----
 swh/dataset/test/test_orc.py    | 216 ++++++++++++++++++++++++++++++----------
 5 files changed, 319 insertions(+), 100 deletions(-)
Changes applied before test:
commit f31a11337fa25f620533325229ca30f7fb6c9a50
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Mar 23 17:01:19 2022 +0100
Add support for blob in content export
this feature requires a config parameter `with_data=true` and an objstorage
configuration.
commit 7a1d494b6d5c65e63d79f7d2e900f4f66670c8e8
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:58:55 2022 +0100
Improve the progress reporting look and feel
- use shortened values in the progress bar (e.g. '11.3M/566M' instead of
something like '11310201/566123456')
- reduce the description strings and align them.
The result looks like:
Exporting release:
- Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
- Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]
commit 58493130ee7c13591d26069f91c3cd4ec1ccdc37
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:52:26 2022 +0100
Delay the unsubscribe to the end of handle_messages
to prevent a possible race condition leading to Kafka errors like:
rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
This could occur when, at the time of unsubscribing from a partition,
another partition is also depleted. Since unsubscribing consists of
resubscribing to all partitions except the unsubscribed ones, we could try
to subscribe to such a depleted partition, leading to the error message
listed above.
commit c728c05e630cf02f35081e810279a5e5b24ebf98
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:43:18 2022 +0100
Move exporter config entries in dedicated sections
e.g. ORC exporter-specific config entries are now under the
'orc' section, like:
journal:
  brokers: [...]
orc:
  remove_pull_requests: true
  max_rows:
    revision: 100000
    directory: 10000
commit 850ee3be47cf3b1e0ab53f8820cc5e4c86b94f38
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Mar 23 16:35:52 2022 +0100
Add support for limited row numbers in ORC files
Make it possible to specify a maximum number of rows a table can store
in a single ORC file. The limit can only be set on main tables for now
(i.e. cannot be specified for tables like revision_history or
directory_entry).
This can be set by configuration only (no extra cli options).
Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/108/

Comment Actions
Build is green
Patch application report for D7465 (id=27078)
Could not rebase; Attempt merge onto fd3f9aa61d...
Updating fd3f9aa..54c4ed2
Fast-forward
 requirements-swh.txt            |   1 +
 swh/dataset/exporters/orc.py    | 115 +++++++++++++++++----
 swh/dataset/journalprocessor.py |  28 +++++-
 swh/dataset/relational.py       |  59 ++++++-----
 swh/dataset/test/test_orc.py    | 216 ++++++++++++++++++++++++++++++----------
 5 files changed, 319 insertions(+), 100 deletions(-)
Changes applied before test:
commit 54c4ed2804d6800a114003866852471769916875
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Mar 23 17:01:19 2022 +0100
Add support for blob in content export
this feature requires a config parameter `with_data=true` and an objstorage
configuration.
commit 830b2b721420f604de4134ed409c52ee775126df
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:58:55 2022 +0100
Improve the progress reporting look and feel
- use shortened values in the progress bar (e.g. '11.3M/566M' instead of
something like '11310201/566123456')
- reduce the description strings and align them.
The result looks like:
Exporting release:
- Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
- Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]
commit 9f7c128eb8aba748e24eaa682269d74dcefb98c6
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:52:26 2022 +0100
Delay the unsubscribe to the end of handle_messages
to prevent a possible race condition leading to Kafka errors like:
rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
This could occur when, at the time of unsubscribing from a partition,
another partition is also depleted. Since unsubscribing consists of
resubscribing to all partitions except the unsubscribed ones, we could try
to subscribe to such a depleted partition, leading to the error message
listed above.
commit e01daba4d601733a86ce7401fe54247908d03e5c
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:43:18 2022 +0100
Move exporter config entries in dedicated sections
e.g. ORC exporter-specific config entries are now under the
'orc' section, like:
journal:
  brokers: [...]
orc:
  remove_pull_requests: true
  max_rows:
    revision: 100000
    directory: 10000
commit 3df08fd71759487e963e6569c8dfd0c502b060de
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Mar 23 16:35:52 2022 +0100
Add support for limited row numbers in ORC files
Make it possible to specify a maximum number of rows a table can store
in a single ORC file. The limit can only be set on main tables for now
(i.e. cannot be specified for tables like revision_history or
directory_entry).
This can be set by configuration only (no extra cli options).
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/113/ for more details.

Comment Actions
Build is green
Patch application report for D7465 (id=27082)
Could not rebase; Attempt merge onto fd3f9aa61d...
Updating fd3f9aa..681b451
Fast-forward
 requirements-swh.txt            |   1 +
 swh/dataset/exporters/orc.py    | 115 +++++++++++++++++----
 swh/dataset/journalprocessor.py |  36 ++++++-
 swh/dataset/relational.py       |  59 ++++++-----
 swh/dataset/test/test_orc.py    | 216 ++++++++++++++++++++++++++++++----------
 5 files changed, 327 insertions(+), 100 deletions(-)
Changes applied before test:
commit 681b45135f1d54e9fb62a609fb45e66e602b006b
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Mar 23 17:01:19 2022 +0100
Add support for blob in content export
this feature requires a config parameter `with_data=true` and an objstorage
configuration.
commit 58b9f92e1d4df026c2f82e8d139c36a78d79449f
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:58:55 2022 +0100
Improve the progress reporting look and feel
- use shortened values in the progress bar (e.g. '11.3M/566M' instead of
something like '11310201/566123456')
- reduce the description strings and align them.
The result looks like:
Exporting release:
- Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
- Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]
commit be3c5da2bb44c73fe614e93073727ffd50665a9c
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:52:26 2022 +0100
Delay the unsubscribe to the end of handle_messages
to prevent a possible race condition leading to Kafka errors like:
rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
This could occur when, at the time of unsubscribing from a partition,
another partition is also depleted. Since unsubscribing consists of
resubscribing to all partitions except the unsubscribed ones, we could try
to subscribe to such a depleted partition, leading to the error message
listed above.
commit e01daba4d601733a86ce7401fe54247908d03e5c
Author: David Douard <david.douard@sdfa3.org>
Date: Fri Mar 25 15:43:18 2022 +0100
Move exporter config entries in dedicated sections
e.g. ORC exporter-specific config entries are now under the
'orc' section, like:
journal:
  brokers: [...]
orc:
  remove_pull_requests: true
  max_rows:
    revision: 100000
    directory: 10000
commit 3df08fd71759487e963e6569c8dfd0c502b060de
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Mar 23 16:35:52 2022 +0100
Add support for limited row numbers in ORC files
Make it possible to specify a maximum number of rows a table can store
in a single ORC file. The limit can only be set on main tables for now
(i.e. cannot be specified for tables like revision_history or
directory_entry).
This can be set by configuration only (no extra cli options).
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/116/ for more details.

Comment Actions
Build is green
Patch application report for D7465 (id=27132)
Rebasing onto 76eba6595b... Current branch diff-target is up to date.
Changes applied before test:
commit 880f4b4af339a4188827befc09524f68ebf4276d
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Mar 23 17:01:19 2022 +0100
Add support for blob in content export
this feature requires a config parameter `with_data=true` and an objstorage
configuration.
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/120/ for more details.

Comment Actions
LGTM, but we should eventually replace self.objstorage.get() with self.objstorage.get_batch(); it is much faster with the Azure backend (and hopefully more backends in the future, inc. Winery!).

Comment Actions
Unfortunately, with the way the dataset exporter currently works, it is not easy to make batch queries; objects are handled one at a time (from the kafka journal).

Comment Actions
Build is green
Patch application report for D7465 (id=27278)
Rebasing onto f282e88a90... Current branch diff-target is up to date.
Changes applied before test:
commit 9d97f0c0826dc4be4505e79f7f1b5eed18a05dff
Author: David Douard <david.douard@sdfa3.org>
Date: Wed Mar 23 17:01:19 2022 +0100
Add support for blob in content export
this feature requires a config parameter `with_data=true` and an objstorage
configuration.
See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/123/ for more details.
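Regarding the get()/get_batch() exchange above, a small sketch of the two access patterns against swh.objstorage; the batching wrapper, its parameters, and the backend configuration are illustrative, not something the exporter does as merged:

from swh.objstorage.factory import get_objstorage

# Hypothetical backend configuration; any swh.objstorage backend works here.
objstorage = get_objstorage(cls="remote", url="http://objstorage.internal:5003/")

def fetch_one_by_one(objstorage, obj_ids):
    # Current exporter pattern: one objstorage round-trip per content object,
    # which fits the one-message-at-a-time journal processing.
    for obj_id in obj_ids:
        yield obj_id, objstorage.get(obj_id)

def fetch_batched(objstorage, obj_ids, batch_size=100):
    # Batched pattern: fewer round-trips (notably faster on the Azure
    # backend), but it requires buffering object ids instead of handling
    # journal messages strictly one at a time.
    batch = []
    for obj_id in obj_ids:
        batch.append(obj_id)
        if len(batch) >= batch_size:
            yield from zip(batch, objstorage.get_batch(batch))
            batch = []
    if batch:
        yield from zip(batch, objstorage.get_batch(batch))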