Details

Reviewers

vlorentz

Group Reviewers

Reviewers

Commits

rDDATASETfd3f9aa61de3: Add the raw_manifest column for revision, release and directory ORC files

Summary

Depends on D7388.

Diff Detail

Repository

rDDATASET Datasets

Branch

orc

Lint

Lint Skipped

Unit

Unit Tests Skipped

Build Status

Buildable 27610
Build 43212: Phabricator diff pipeline on Jenkins
Build 43211: arc lint + arc unit

Event Timeline

douardda created this revision. · Mar 18 2022, 2:13 PM

Herald added a reviewer: Reviewers. · Mar 18 2022, 2:13 PM

Build is green

Patch application report for D7389 (id=26691)

Could not rebase; Attempt merge onto 68f9bd2028...

Updating 68f9bd2..38ea24a
Fast-forward
 mypy.ini                        |  6 +++
 requirements-swh.txt            |  6 +--
 requirements.txt                |  3 +-
 swh/dataset/cli.py              | 10 ++++-
 swh/dataset/exporters/orc.py    | 96 +++++++++++++++++++++++++++++------------
 swh/dataset/journalprocessor.py | 32 +++++++++-----
 swh/dataset/relational.py       | 29 +++++++++----
 swh/dataset/test/test_orc.py    | 90 +++++++++++++++++++++-----------------
 8 files changed, 180 insertions(+), 92 deletions(-)

Changes applied before test

commit 38ea24afd21a3bc313cbc43ab9f434556ca291d6
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:24:31 2022 +0100

    Add the raw_manifest column for revision, release and directory ORC files

commit b2a24fe5809b72e8f92c06629a5adb97814003a1
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:22:47 2022 +0100

    Export revision extra headers in a dedicated ORC file

commit 1b055088e2822e640b07f3c16f54184123e9ad5e
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:20:20 2022 +0100

    Add the type fields for revision and origin_visit_status ORC table

commit 1b6f52e5df07963fa058994fe4f39b7054b0a620
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:16:40 2022 +0100

    Use the same 'id' column name everywhere in ORC files
    
    namely rename the 'snapshot_id' and 'directory_id' columns to 'id'.

commit 4c90956495f50c7fafc2d123378e5fe82d37b65b
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:09:57 2022 +0100

    Write related ORC files in the same directory using the same UUID
    
    related ORC files being ORC files involved in the serialization of a given
    object type, namely:
    
    - snapshot and snapshot_branches,
    - revision and revision_history,
    - directory and directory_entry.
    
    Also include the object_type in the generated file name (in place of the
    static 'graph').
    
    So the result will typically be like:
    
      output/orc/snapshot/
        snapshot-18a575cb-3a92-4753-9267-e3475fa30857.orc
        snapshot_branch-18a575cb-3a92-4753-9267-e3475fa30857.orc
        snapshot-1f41d206-994a-49bb-917f-e096e40c2856.orc
        snapshot_branch-1f41d206-994a-49bb-917f-e096e40c2856.orc

commit 66db13eab592c399350c129bcc03c71ace7196a7
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:00:41 2022 +0100

    Add some user metadata in generated ORC files
    
    add:
    - object type
    - uuid
    - version of swh.model used at file generation time,
    - version of swh.dataset used at file generation time.

commit a05df6abe5cad5cf6b3e207fa9b317ebc0a1c32f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 16 10:46:37 2022 +0100

    Implement test_orc exporter as a simple function instead of a fixture
    
    and split it in 2 parts (needed for changes to come).

commit 986885abd6d3ab95df76f02746c016e00a53ba2b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:50:15 2021 +0100

    Make the kafka group_id prefix configurable in the config file
    
    rather than hardcoding it to 'swh-dataset-export-', use the 'group_id'
    value from the 'journal' section of the config file as prefix, if given
    (otherwise default to the former value).
    
    This is needed because the current auth policy of the swh kafka cluster only
    allows group_id to start with the actual login for authenticated connections.
    So we need to be able to specify this group_id prefix.

commit 048f273838cdba3c8dfbebfb9768fc675f205d74
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:41:28 2021 +0100

    Use a named logger for journalprocessor.py
    
    and add a few more debug logging statements.

commit 8cae6adb5c63af22bc798ebe7072a33978f3037e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:44:49 2021 +0100

    Update JournalClientOffsetRanges for swh.journal 0.9
    
    deserialize_message() now takes an optional 'object_type' argument.

commit 8c2b5e951c1a1195c9ec3e700cb9da60711a96ab
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (sec, usec, offset) in ORC file
    
    instead of using the ORC Timestamp format, since we cannot always encode
    them in this format.
    
    The offset is encoded as binary (byte string), following recent evolutions
    of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.
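
As a rough illustration of the last commit above, the sketch below flattens a date into the three ORC columns (seconds, microseconds, offset), keeping the offset as a raw byte string. The `Timestamp` and `TimestampWithTimezone` classes are simplified stand-ins written for this example, not the real swh.model classes, and `to_orc_columns` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Timestamp:
    # Stand-in for the model's timestamp: whole seconds plus microseconds.
    seconds: int
    microseconds: int


@dataclass(frozen=True)
class TimestampWithTimezone:
    # Stand-in: the offset is kept as the raw bytes found in the object.
    timestamp: Timestamp
    offset_bytes: bytes


def to_orc_columns(
    date: Optional[TimestampWithTimezone],
) -> Tuple[Optional[int], Optional[int], Optional[bytes]]:
    """Flatten a date into (sec, usec, offset); a missing date gives three nulls."""
    if date is None:
        return (None, None, None)
    return (date.timestamp.seconds, date.timestamp.microseconds, date.offset_bytes)


# Hypothetical usage: the offset bytes are passed through untouched.
row = to_orc_columns(TimestampWithTimezone(Timestamp(1647602671, 0), b"+0100"))
```

Storing plain integers avoids depending on what the ORC Timestamp type can represent, which is the limitation the commit message mentions.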

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/39/ for more details.

Harbormaster completed remote builds in B27571: Diff 26691. · Mar 18 2022, 2:15 PM

douardda requested review of this revision. · Mar 18 2022, 2:15 PM

vlorentz added a subscriber: vlorentz. · Mar 18 2022, 2:50 PM

vlorentz added inline comments.

swh/dataset/exporters/orc.py
212–215	base64 isn't supported?

rebase

Build is green

Patch application report for D7389 (id=26714)

Could not rebase; Attempt merge onto 68f9bd2028...

Updating 68f9bd2..4681843
Fast-forward
 requirements-swh.txt            |   6 +-
 swh/dataset/cli.py              |  10 +++-
 swh/dataset/exporters/orc.py    |  99 +++++++++++++++++++++++----------
 swh/dataset/journalprocessor.py |  34 ++++++++----
 swh/dataset/relational.py       |  29 +++++++---
 swh/dataset/test/test_orc.py    | 118 ++++++++++++++++++++++++++--------------
 6 files changed, 205 insertions(+), 91 deletions(-)

Changes applied before test

commit 4681843947798e115f74700c65b4adbc0ae005ef
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:24:31 2022 +0100

    Add the raw_manifest column for revision, release and directory ORC files

commit c1f8d3fc165d343fb789f568f44c62182d954707
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:22:47 2022 +0100

    Export revision extra headers in a dedicated ORC file

commit 9cb43f003c3ae9fadfd6df42d5f257e5c23a8029
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:20:20 2022 +0100

    Add the type fields for revision and origin_visit_status ORC table

commit 6ca992376a004aeaf0a3a7be9d7f99d553c6b013
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:16:40 2022 +0100

    Use the same 'id' column name everywhere in ORC files
    
    namely rename the 'snapshot_id' and 'directory_id' columns to 'id'.

commit 9a96367ae0c0c664dd4415e80479107300dea1bf
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:09:57 2022 +0100

    Write related ORC files in the same directory using the same UUID
    
    related ORC files being ORC files involved in the serialization of a given
    object type, namely:
    
    - snapshot and snapshot_branches,
    - revision and revision_history,
    - directory and directory_entry.
    
    Also include the object_type in the generated file name (in place of the
    static 'graph').
    
    So the result will typically be like:
    
      output/orc/snapshot/
        snapshot-18a575cb-3a92-4753-9267-e3475fa30857.orc
        snapshot_branch-18a575cb-3a92-4753-9267-e3475fa30857.orc
        snapshot-1f41d206-994a-49bb-917f-e096e40c2856.orc
        snapshot_branch-1f41d206-994a-49bb-917f-e096e40c2856.orc

commit 0bab36224ff89b37c539fff4956b830be46eef81
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:00:41 2022 +0100

    Add some user metadata in generated ORC files
    
    add:
    - object type
    - uuid
    - version of swh.model used at file generation time,
    - version of swh.dataset used at file generation time.

commit eb135d55876346b6381e9c0a7965fe2827872322
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 16 10:46:37 2022 +0100

    Implement test_orc exporter as a simple function instead of a fixture
    
    and split it in 2 parts (needed for changes to come).

commit f7fe626c2db82dc25b58d01e84c6f017867da7c9
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:50:15 2021 +0100

    Make the kafka group_id prefix configurable in the config file
    
    rather than hardcoding it to 'swh-dataset-export-', use the 'group_id'
    value from the 'journal' section of the config file as prefix, if given
    (otherwise default to the former value).
    
    This is needed because the current auth policy of the swh kafka cluster only
    allows group_id to start with the actual login for authenticated connections.
    So we need to be able to specify this group_id prefix.

commit 4f14a95aaadabc1d2036a9c31c18e6a78befb44d
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:41:28 2021 +0100

    Use a named logger for journalprocessor.py
    
    and add a few more debug logging statements.

commit 316d51b6da36719bca767c78ad04402c609d5abe
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:44:49 2021 +0100

    Update JournalClientOffsetRanges for swh.journal 0.9
    
    deserialize_message() now takes an optional 'object_type' argument.

commit ae440431049470ecac6aca0e8cbed4a51cde0c09
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (sec, usec, offset) in ORC file
    
    instead of using the ORC Timestamp format, since we cannot always encode
    them in this format.
    
    The offset is encoded as binary (byte string), following recent evolutions
    of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/51/ for more details.
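
The file layout described in the commit "Write related ORC files in the same directory using the same UUID" can be sketched as follows. This is only an illustration under stated assumptions: `RELATED_TABLES` and `related_orc_paths` are hypothetical names, the table names are taken from the commit message and its example listing, and the real exporter may organise this differently.

```python
import uuid
from pathlib import Path
from typing import Dict, Optional

# Tables written together for a given object type (names from the commit message).
RELATED_TABLES = {
    "snapshot": ("snapshot", "snapshot_branch"),
    "revision": ("revision", "revision_history"),
    "directory": ("directory", "directory_entry"),
}


def related_orc_paths(
    export_root: Path, object_type: str, file_uuid: Optional[str] = None
) -> Dict[str, Path]:
    """Return one path per related table, all sharing a directory and a UUID."""
    file_uuid = file_uuid or str(uuid.uuid4())
    # Object types without related tables get a single file named after themselves.
    tables = RELATED_TABLES.get(object_type, (object_type,))
    obj_dir = export_root / "orc" / object_type
    return {table: obj_dir / f"{table}-{file_uuid}.orc" for table in tables}


# Hypothetical usage, reusing a UUID from the commit message's example listing.
paths = related_orc_paths(
    Path("output"), "snapshot", "18a575cb-3a92-4753-9267-e3475fa30857"
)
```

Sharing the UUID makes it possible to pair a `snapshot` file with its `snapshot_branch` file by name alone.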

Harbormaster completed remote builds in B27610: Diff 26714. · Mar 18 2022, 3:58 PM

douardda added inline comments. · Mar 18 2022, 4:11 PM

swh/dataset/exporters/orc.py
212–215	> base64 isn't supported?

What do you mean?

vlorentz accepted this revision. · Mar 21 2022, 10:21 AM

vlorentz added inline comments.

swh/dataset/exporters/orc.py
212–215	I don't know what I meant. I must have misread the code.

This revision is now accepted and ready to land. · Mar 21 2022, 10:21 AM

rebase

Build has FAILED

Patch application report for D7389 (id=26782)

Could not rebase; Attempt merge onto 68f9bd2028...

Updating 68f9bd2..03a7fa1
Fast-forward
 requirements-swh.txt            |   6 +-
 swh/dataset/cli.py              |  10 ++-
 swh/dataset/exporters/orc.py    | 142 ++++++++++++++++++++++++++++++++--------
 swh/dataset/journalprocessor.py |  34 ++++++----
 swh/dataset/relational.py       |  29 +++++---
 swh/dataset/test/test_orc.py    | 131 ++++++++++++++++++++++++------------
 6 files changed, 260 insertions(+), 92 deletions(-)

Changes applied before test

commit 03a7fa18754f3cbf8f576a1ec41eceb0a40a0f31
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:24:31 2022 +0100

    Add the raw_manifest column for revision, release and directory ORC files

commit 858554ba86027b0d3228c616ba97e7e79e6aea23
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:22:47 2022 +0100

    Export revision extra headers in a dedicated ORC file

commit d73070f5dd42329a238412e9c3776c97fe466b3c
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:20:20 2022 +0100

    Add the type fields for revision and origin_visit_status ORC table

commit 9d08fc0a8e4011925fdce7aec003df8de85a221a
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:16:40 2022 +0100

    Use the same 'id' column name everywhere in ORC files
    
    namely rename the 'snapshot_id' and 'directory_id' columns to 'id'.

commit ef90997443aaff8a085a61c0cdde6a4f1f4f681e
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:09:57 2022 +0100

    Write related ORC files in the same directory using the same UUID
    
    related ORC files being ORC files involved in the serialization of a given
    object type, namely:
    
    - snapshot and snapshot_branches,
    - revision and revision_history,
    - directory and directory_entry.
    
    Also include the object_type in the generated file name (in place of the
    static 'graph').
    
    So the result will typically be like:
    
      output/orc/snapshot/
        snapshot-18a575cb-3a92-4753-9267-e3475fa30857.orc
        snapshot_branch-18a575cb-3a92-4753-9267-e3475fa30857.orc
        snapshot-1f41d206-994a-49bb-917f-e096e40c2856.orc
        snapshot_branch-1f41d206-994a-49bb-917f-e096e40c2856.orc

commit fc59bf8cff8722af17652f5c76961e9ef920d5fc
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:00:41 2022 +0100

    Add some user metadata in generated ORC files
    
    add:
    - object type
    - uuid
    - version of swh.model used at file generation time,
    - version of swh.dataset used at file generation time.

commit e120502b262b190364a65d6f1af9f23f2f19b8ed
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 16 10:46:37 2022 +0100

    Implement test_orc exporter as a simple function instead of a fixture
    
    and split it in 2 parts (needed for changes to come).

commit c7cc72902575271d5f1dbef77f94faf255f86868
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:50:15 2021 +0100

    Make the kafka group_id prefix configurable in the config file
    
    rather than hardcoding it to 'swh-dataset-export-', use the 'group_id'
    value from the 'journal' section of the config file as prefix, if given
    (otherwise default to the former value).
    
    This is needed because the current auth policy of the swh kafka cluster only
    allows group_id to start with the actual login for authenticated connections.
    So we need to be able to specify this group_id prefix.

commit c508c673043458a419f2eeb0d5a2fb60b12aa007
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:41:28 2021 +0100

    Use a named logger for journalprocessor.py
    
    and add a few more debug logging statements.

commit d49db10f0bf7174ea4f2742d5d5ac8c8e25b707a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:44:49 2021 +0100

    Update JournalClientOffsetRanges for swh.journal 0.9
    
    deserialize_message() now takes an optional 'object_type' argument.

commit 69e806698bbb6df42bfa3520681e0203f91d8a65
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (timestamp, offset, raw_offset_bytes) in ORC file
    
    ie. use the standard ORC Timestamp format (aka a couple
    (seconds, nanoseconds)) with 2 extra fields for the offset.
    
    The offset is stored as an integer (in minutes), but the raw offset
    value is also present as a binary string representation, following
    recent evolutions of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.
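
The revised encoding in the last commit above (an ORC-style (seconds, nanoseconds) timestamp plus an integer offset in minutes and the raw offset bytes) might look like the sketch below. The helper names are hypothetical, and returning `None` for an offset that does not match `±HHMM` is an assumption of this example, not something the commit message specifies.

```python
import re
from typing import Optional, Tuple

# Matches offsets such as b"+0100" or b"-0530".
_OFFSET_RE = re.compile(rb"([+-])(\d{2})(\d{2})")


def offset_minutes(raw_offset: bytes) -> Optional[int]:
    """Convert raw offset bytes to minutes; None if the bytes are not ±HHMM."""
    match = _OFFSET_RE.fullmatch(raw_offset)
    if match is None:
        return None
    sign = -1 if match.group(1) == b"-" else 1
    return sign * (int(match.group(2)) * 60 + int(match.group(3)))


def to_orc_date(
    seconds: int, microseconds: int, raw_offset: bytes
) -> Tuple[Tuple[int, int], Optional[int], bytes]:
    """Build (timestamp, offset, raw_offset_bytes) for one date value."""
    timestamp = (seconds, microseconds * 1000)  # (seconds, nanoseconds)
    return (timestamp, offset_minutes(raw_offset), raw_offset)


# Hypothetical usage.
encoded = to_orc_date(1647602671, 5, b"+0100")
```

Keeping the raw bytes alongside the integer means values such as `b"-0000"`, which collapse to `0` minutes, can still be told apart from `b"+0000"`.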

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/62/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/62/console

Harbormaster failed remote builds in B27703: Diff 26782! · Mar 22 2022, 3:51 PM

rebase

Build is green

Patch application report for D7389 (id=26804)

Could not rebase; Attempt merge onto 68f9bd2028...

Updating 68f9bd2..34fe494
Fast-forward
 requirements-swh.txt            |   6 +-
 swh/dataset/cli.py              |  10 ++-
 swh/dataset/exporters/orc.py    | 144 ++++++++++++++++++++++++++++++++--------
 swh/dataset/journalprocessor.py |  34 +++++++---
 swh/dataset/relational.py       |  29 +++++---
 swh/dataset/test/test_orc.py    | 129 +++++++++++++++++++++++------------
 6 files changed, 261 insertions(+), 91 deletions(-)

Changes applied before test

commit 34fe49430c3dd316ee575f04fbc97735d4463165
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:24:31 2022 +0100

    Add the raw_manifest column for revision, release and directory ORC files

commit 809c6df5c5b0067c83232dbb540414d5edaef2c2
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:22:47 2022 +0100

    Export revision extra headers in a dedicated ORC file

commit fea81cce879955218fdf02d1342b68073d428b3b
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:20:20 2022 +0100

    Add the type fields for revision and origin_visit_status ORC table

commit 7ff473cba0113b8480e45421ba165024b5b27b34
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:16:40 2022 +0100

    Use the same 'id' column name everywhere in ORC files
    
    namely rename the 'snapshot_id' and 'directory_id' columns to 'id'.

commit 4aa860deb508f6388c6e46877097435c944a5909
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:09:57 2022 +0100

    Write related ORC files in the same directory using the same UUID
    
    related ORC files being ORC files involved in the serialization of a given
    object type, namely:
    
    - snapshot and snapshot_branches,
    - revision and revision_history,
    - directory and directory_entry.
    
    Also include the object_type in the generated file name (in place of the
    static 'graph').
    
    So the result will typically be like:
    
      output/orc/snapshot/
        snapshot-18a575cb-3a92-4753-9267-e3475fa30857.orc
        snapshot_branch-18a575cb-3a92-4753-9267-e3475fa30857.orc
        snapshot-1f41d206-994a-49bb-917f-e096e40c2856.orc
        snapshot_branch-1f41d206-994a-49bb-917f-e096e40c2856.orc

commit 85d95ac3701a90f7ab7d12de5b7ff3af9bf8f45f
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:00:41 2022 +0100

    Add some user metadata in generated ORC files
    
    add:
    - object type
    - uuid
    - version of swh.model used at file generation time,
    - version of swh.dataset used at file generation time.

commit a92a271fec1c8f99ea8b47fed89d0c3e447de1dc
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 16 10:46:37 2022 +0100

    Implement test_orc exporter as a simple function instead of a fixture
    
    and split it in 2 parts (needed for changes to come).

commit 7c775ce88ebb840d32aa70dbe6feb1a6835db518
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:50:15 2021 +0100

    Make the kafka group_id prefix configurable in the config file
    
    rather than hardcoding it to 'swh-dataset-export-', use the 'group_id'
    value from the 'journal' section of the config file as prefix, if given
    (otherwise default to the former value).
    
    This is needed because the current auth policy of the swh kafka cluster only
    allows group_id to start with the actual login for authenticated connections.
    So we need to be able to specify this group_id prefix.

commit 55cf5ac5cc0cb818ae23b5df4af416e57794469c
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:41:28 2021 +0100

    Use a named logger for journalprocessor.py
    
    and add a few more debug logging statements.

commit 70d9d3182de1420ba545f2f507ade8f59b2c2f33
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:44:49 2021 +0100

    Update JournalClientOffsetRanges for swh.journal 0.9
    
    deserialize_message() now takes an optional 'object_type' argument.

commit 09d2840dbd4db6e1a3dd976c44b3c628b9174741
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (timestamp, offset, raw_offset_bytes) in ORC file
    
    ie. use the standard ORC Timestamp format (aka a couple
    (seconds, nanoseconds)) with 2 extra fields for the offset.
    
    The offset is stored as an integer (in minutes), but the raw offset
    value is also present as a binary string representation, following
    recent evolutions of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/74/ for more details.
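
For the commit "Add some user metadata in generated ORC files" listed above, the four metadata entries could be assembled roughly as follows. ORC user metadata maps string keys to byte-string values, so the sketch encodes everything to bytes. The key names, the `orc_user_metadata` helper and the version strings in the usage line are placeholders chosen for this example; the names actually used by the exporter are not shown in this revision.

```python
from typing import Dict


def orc_user_metadata(
    object_type: str, file_uuid: str, model_version: str, dataset_version: str
) -> Dict[str, bytes]:
    """Collect the four metadata entries as byte strings (key names are hypothetical)."""
    fields = {
        "swh_object_type": object_type,
        "swh_uuid": file_uuid,
        "swh_model_version": model_version,
        "swh_dataset_version": dataset_version,
    }
    return {key: value.encode("utf-8") for key, value in fields.items()}


# Hypothetical usage; both version strings are placeholders.
metadata = orc_user_metadata(
    "snapshot", "18a575cb-3a92-4753-9267-e3475fa30857", "5.0.0", "0.0.0"
)
```

Recording the generating versions in each file lets a reader of the dataset work out which schema revision a given file follows.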

Harbormaster completed remote builds in B27725: Diff 26804. · Mar 22 2022, 5:35 PM

rebase

Build is green

Patch application report for D7389 (id=27002)

Could not rebase; Attempt merge onto f588e20a41...

Updating f588e20..4400a4d
Fast-forward
 requirements-swh.txt            |  2 +-
 swh/dataset/cli.py              | 10 ++++-
 swh/dataset/exporters/orc.py    | 63 ++++++++++++++++++++++++++----
 swh/dataset/journalprocessor.py | 34 ++++++++++------
 swh/dataset/relational.py       | 14 ++++++-
 swh/dataset/test/test_orc.py    | 86 ++++++++++++++++++++++-------------------
 6 files changed, 147 insertions(+), 62 deletions(-)

Changes applied before test

commit 4400a4d150458acfb29fe6ca2c3b615ccc80c2cc
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:24:31 2022 +0100

    Add the raw_manifest column for revision, release and directory ORC files

commit 24795192159a6c7afbec0a6cb8ca5177e9c97b04
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:22:47 2022 +0100

    Export revision extra headers in a dedicated ORC file

commit 8bf91113b3c3f0096c5be2298e467d3552c896d8
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:20:20 2022 +0100

    Add the type fields for revision and origin_visit_status ORC table

commit 959e85c741106dea4cfaeb378e2a26cc82bd9eca
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:16:40 2022 +0100

    Use the same 'id' column name everywhere in ORC files
    
    namely rename the 'snapshot_id' and 'directory_id' columns to 'id'.

commit 225f87573d52cfb13a876899bb59ef004fd1af84
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:09:57 2022 +0100

    Write related ORC files in the same directory using the same UUID
    
    related ORC files being ORC files involved in the serialization of a given
    object type, namely:
    
    - snapshot and snapshot_branches,
    - revision and revision_history,
    - directory and directory_entry.
    
    Also include the object_type in the generated file name (in place of the
    static 'graph').
    
    So the result will typically be like:
    
      output/orc/snapshot/
        snapshot-18a575cb-3a92-4753-9267-e3475fa30857.orc
        snapshot_branch-18a575cb-3a92-4753-9267-e3475fa30857.orc
        snapshot-1f41d206-994a-49bb-917f-e096e40c2856.orc
        snapshot_branch-1f41d206-994a-49bb-917f-e096e40c2856.orc

commit 0a593e3478cc982b95e0ec6ac23a0ba52063ae73
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:00:41 2022 +0100

    Add some user metadata in generated ORC files
    
    add:
    - object type
    - uuid
    - version of swh.model used at file generation time,
    - version of swh.dataset used at file generation time.

commit f8211b934774ded5ea83948a02140af446788e9a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 16 10:46:37 2022 +0100

    Implement test_orc exporter as a simple function instead of a fixture
    
    and split it in 2 parts (needed for changes to come).

commit 4508da3a91ba634e3b5def318f4509a688168c9e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:50:15 2021 +0100

    Make the kafka group_id prefix configurable in the config file
    
    rather than hardcoding it to 'swh-dataset-export-', use the 'group_id'
    value from the 'journal' section of the config file as prefix, if given
    (otherwise default to the former value).
    
    This is needed because the current auth policy of the swh kafka cluster only
    allows group_id to start with the actual login for authenticated connections.
    So we need to be able to specify this group_id prefix.

commit de114c20f105c0b888eb92625f4e073f97a94ae8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:41:28 2021 +0100

    Use a named logger for journalprocessor.py
    
    and add a few more debug logging statements.

commit a8442bcf7c4311a28bea0898a01dc9475889efc7
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:44:49 2021 +0100

    Update JournalClientOffsetRanges for swh.journal 0.9
    
    deserialize_message() now takes an optional 'object_type' argument.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/86/ for more details.
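
The configurable Kafka `group_id` prefix from the commit "Make the kafka group_id prefix configurable in the config file" can be sketched like this. The default prefix and the `journal` / `group_id` config keys come from the commit message; the `group_id_for` helper and the way the export identifier is appended are assumptions of this example.

```python
from typing import Any, Dict

# Former hardcoded value, now only a fallback.
DEFAULT_GROUP_ID_PREFIX = "swh-dataset-export-"


def group_id_for(config: Dict[str, Any], export_id: str) -> str:
    """Build a consumer group id, preferring the prefix from the journal config."""
    prefix = config.get("journal", {}).get("group_id") or DEFAULT_GROUP_ID_PREFIX
    return f"{prefix}{export_id}"


# Hypothetical usage: an authenticated user whose login is "myuser".
default_id = group_id_for({}, "2022-03-18")
custom_id = group_id_for({"journal": {"group_id": "myuser-"}}, "2022-03-18")
```

With a cluster policy that only accepts group ids starting with the login, the second form is the one that would be allowed to connect.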

Harbormaster completed remote builds in B27918: Diff 27002. · Mar 29 2022, 3:25 PM

rebase

Build is green

Patch application report for D7389 (id=27045)

Could not rebase; Attempt merge onto 5a8a8a7847...

Updating 5a8a8a7..fd3f9aa
Fast-forward
 swh/dataset/exporters/orc.py | 21 ++++++++++++++++++++-
 swh/dataset/relational.py    | 10 ++++++++++
 swh/dataset/test/test_orc.py |  6 +++++-
 3 files changed, 35 insertions(+), 2 deletions(-)

Changes applied before test

commit fd3f9aa61de374655fd4bc4920d5047eb7d0c4ca
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:24:31 2022 +0100

    Add the raw_manifest column for revision, release and directory ORC files

commit 5c652bb058e2c1b59bafefd6817f392fdc171a20
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:22:47 2022 +0100

    Export revision extra headers in a dedicated ORC file

commit 45c8124b7a310963a868eb6602ea24e240d761e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:20:20 2022 +0100

    Add the type fields for revision and origin_visit_status ORC table

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/97/ for more details.
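
The change this revision lands, a `raw_manifest` column for revision, release and directory files, amounts to adding one nullable binary column to each table and passing the value through when building a row. The sketch below shows the idea for a release only. The column list is deliberately shortened and illustrative, `Release` is a stand-in dataclass rather than the swh.model class, and `release_row` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Illustrative, shortened schema: (column name, ORC type). Not the real table.
RELEASE_COLUMNS: List[Tuple[str, str]] = [
    ("id", "string"),
    ("name", "binary"),
    ("message", "binary"),
    ("raw_manifest", "binary"),  # new nullable column
]


@dataclass(frozen=True)
class Release:
    # Stand-in object carrying only the fields used in this sketch.
    id: bytes
    name: bytes
    message: Optional[bytes]
    raw_manifest: Optional[bytes] = None


def release_row(release: Release) -> Tuple[str, bytes, Optional[bytes], Optional[bytes]]:
    """Build one row whose fields line up with RELEASE_COLUMNS."""
    return (release.id.hex(), release.name, release.message, release.raw_manifest)


# Hypothetical usage: most objects have no raw manifest, so the column is null.
row = release_row(Release(id=b"\x01\x02", name=b"v1.0", message=b"first release"))
```

Appending the column at the end keeps the positions of the existing columns unchanged in this sketch; whether the real schema does the same is not visible from this page.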

Harbormaster completed remote builds in B27960: Diff 27045. · Mar 29 2022, 5:43 PM

douardda added a child revision: D7461: Add support for limited row numbers in ORC files. · Mar 29 2022, 5:43 PM

Closed by commit rDDATASETfd3f9aa61de3: Add the raw_manifest column for revision, release and directory ORC files (authored by douardda). · Mar 29 2022, 5:46 PM

This revision was automatically updated to reflect the committed changes.

douardda added a commit: rDDATASETfd3f9aa61de3: Add the raw_manifest column for revision, release and directory ORC files.

