Add some user metadata in generated ORC files
Closed, Public

Authored by douardda on Mar 18 2022, 2:11 PM.

Details

Summary

Add the following user metadata to generated ORC files:

  • object type
  • uuid
  • version of swh.model used at file generation time
  • version of swh.dataset used at file generation time

Depends on D7383.
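
The four entries listed in the summary could be assembled along these lines. This is only a sketch, not the code from the diff: the key names, the helper build_user_metadata, the choice to encode values as bytes, and the placeholder version strings are all assumptions made for illustration. The version lookup is passed in as a callable so the example runs without the swh packages installed.

```python
import uuid
from typing import Callable, Dict


def build_user_metadata(
    object_type: str, get_version: Callable[[str], str]
) -> Dict[str, bytes]:
    """Return the metadata entries to attach to one generated file.

    All key names below are hypothetical, chosen for this sketch only.
    """
    return {
        "swh_object_type": object_type.encode(),
        # A fresh random identifier for each generated file.
        "swh_uuid": str(uuid.uuid4()).encode(),
        "swh_model_version": get_version("swh.model").encode(),
        "swh_dataset_version": get_version("swh.dataset").encode(),
    }


# Placeholder version table so the sketch is self-contained.
fake_versions = {"swh.model": "5.0.0", "swh.dataset": "0.0.0"}
metadata = build_user_metadata("revision", fake_versions.__getitem__)
```

With setuptools available, the lookup could instead be something like `lambda name: pkg_resources.get_distribution(name).version`, in line with the pkg_resources suggestion made in the review.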

Event Timeline

Build is green

Patch application report for D7384 (id=26686)

Could not rebase; Attempt merge onto 68f9bd2028...

Updating 68f9bd2..66db13e
Fast-forward
 mypy.ini                        |  6 +++
 requirements-swh.txt            |  6 +--
 requirements.txt                |  3 +-
 swh/dataset/cli.py              | 10 ++++-
 swh/dataset/exporters/orc.py    | 41 ++++++++++----------
 swh/dataset/journalprocessor.py | 32 ++++++++++------
 swh/dataset/relational.py       | 15 +++++---
 swh/dataset/test/test_orc.py    | 83 ++++++++++++++++++++++-------------------
 8 files changed, 115 insertions(+), 81 deletions(-)
Changes applied before test
commit 66db13eab592c399350c129bcc03c71ace7196a7
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:00:41 2022 +0100

    Add some user metadata in generated ORC files
    
    add:
    - object type
    - uuid
    - version of swh.model used at file generation time,
    - version of swh.dataset used at file generation time.

commit a05df6abe5cad5cf6b3e207fa9b317ebc0a1c32f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 16 10:46:37 2022 +0100

    Implement test_orc exporter as a simple function instead of a fixture
    
    and split it in 2 parts (needed for changes to come).

commit 986885abd6d3ab95df76f02746c016e00a53ba2b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:50:15 2021 +0100

    Make the kafka group_id prefix configurable in the config file
    
    rather than hardcoding it to 'swh-dataset-export-', use the 'group_id'
    value from the 'journal' section of the config file as prefix, if given
    (otherwise default to the former value).
    
    This is needed because current auth policy of swh kafka cluster only allows
    group_id to start with the actual login for authenticated connection.
    So we need to be able to specify this group_id prefix.

commit 048f273838cdba3c8dfbebfb9768fc675f205d74
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:41:28 2021 +0100

    Use a named logger for journalprocessor.py
    
    and add a few more debug logging statements.

commit 8cae6adb5c63af22bc798ebe7072a33978f3037e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:44:49 2021 +0100

    Update JournalClientOffsetRanges for swh.journal 0.9
    
    deserialize_message() now takes an optional 'object_type' argument.

commit 8c2b5e951c1a1195c9ec3e700cb9da60711a96ab
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (sec, usec, offset) in ORC file
    
    instead of using the ORC Timestamp format, since we cannot always encode
    them in this format.
    
    The offset is encoded as binary (byte string), following recent evolutions
    of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/34/ for more details.
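
The group_id prefix behaviour described in the Kafka commit of the report above can be pictured roughly as follows. Only the default prefix 'swh-dataset-export-' and the 'group_id' key of the 'journal' config section come from the commit message; the function name make_group_id, the export_id suffix, and the plain-dict config are assumptions for illustration.

```python
from typing import Any, Mapping

# Former hardcoded value, now used only as a fallback.
DEFAULT_GROUP_ID_PREFIX = "swh-dataset-export-"


def make_group_id(config: Mapping[str, Any], export_id: str) -> str:
    """Build a consumer group id from the configured prefix (hypothetical helper)."""
    journal_cfg = config.get("journal", {})
    # Use the 'group_id' value as prefix if given, else the former default.
    prefix = journal_cfg.get("group_id", DEFAULT_GROUP_ID_PREFIX)
    return f"{prefix}{export_id}"


default_id = make_group_id({"journal": {}}, "2022-03-18")
custom_id = make_group_id({"journal": {"group_id": "alice-"}}, "2022-03-18")
```

An authenticated user whose login is "alice" would then set the prefix so that it starts with that login, which is what the auth policy mentioned in the commit requires.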

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/dataset/exporters/orc.py, lines 138–139

pkg_resources is provided by setuptools, which we already depend on.

This revision now requires changes to proceed. Mar 18 2022, 2:41 PM

use pkg_resources instead of pkginfo

douardda added inline comments.
swh/dataset/exporters/orc.py, lines 138–139

In reply to "pkg_resources is provided by setuptools, which we already depend on.":

Thanks for the tip.

Build is green

Patch application report for D7384 (id=26694)

Could not rebase; Attempt merge onto 68f9bd2028...

Updating 68f9bd2..0ec0e7c
Fast-forward
 requirements-swh.txt            |  6 +--
 swh/dataset/cli.py              | 10 ++++-
 swh/dataset/exporters/orc.py    | 39 ++++++++++---------
 swh/dataset/journalprocessor.py | 32 ++++++++++------
 swh/dataset/relational.py       | 15 +++++---
 swh/dataset/test/test_orc.py    | 83 ++++++++++++++++++++++-------------------
 6 files changed, 105 insertions(+), 80 deletions(-)
Changes applied before test
commit 0ec0e7ccc6f7de2b6ac8731d178414345002215b
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:00:41 2022 +0100

    Add some user metadata in generated ORC files
    
    add:
    - object type
    - uuid
    - version of swh.model used at file generation time,
    - version of swh.dataset used at file generation time.

commit a05df6abe5cad5cf6b3e207fa9b317ebc0a1c32f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 16 10:46:37 2022 +0100

    Implement test_orc exporter as a simple function instead of a fixture
    
    and split it in 2 parts (needed for changes to come).

commit 986885abd6d3ab95df76f02746c016e00a53ba2b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:50:15 2021 +0100

    Make the kafka group_id prefix configurable in the config file
    
    rather than hardcoding it to 'swh-dataset-export-', use the 'group_id'
    value from the 'journal' section of the config file as prefix, if given
    (otherwise default to the former value).
    
    This is needed because current auth policy of swh kafka cluster only allows
    group_id to start with the actual login for authenticated connection.
    So we need to be able to specify this group_id prefix.

commit 048f273838cdba3c8dfbebfb9768fc675f205d74
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:41:28 2021 +0100

    Use a named logger for journalprocessor.py
    
    and add a few more debug logging statements.

commit 8cae6adb5c63af22bc798ebe7072a33978f3037e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:44:49 2021 +0100

    Update JournalClientOffsetRanges for swh.journal 0.9
    
    deserialize_message() now takes an optional 'object_type' argument.

commit 8c2b5e951c1a1195c9ec3e700cb9da60711a96ab
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (sec, usec, offset) in ORC file
    
    instead of using the ORC Timestamp format, since we cannot always encode
    them in this format.
    
    The offset is encoded as binary (byte string), following recent evolutions
    of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/40/ for more details.
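
As a rough illustration of the (sec, usec, offset) encoding named in the TimestampWithTimezone commit of the report above, a date can be flattened into three plain columns. The dataclass below is a hypothetical stand-in, not the swh.model class, and the function name to_sec_usec_offset is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FakeTimestampWithTimezone:
    """Hypothetical stand-in for the real model object."""

    seconds: int
    microseconds: int
    offset_bytes: bytes  # raw offset kept as a byte string, e.g. b"+0100"


def to_sec_usec_offset(
    ts: Optional[FakeTimestampWithTimezone],
) -> Tuple[Optional[int], Optional[int], Optional[bytes]]:
    """Flatten a date into (sec, usec, offset) column values."""
    if ts is None:
        # A missing date becomes three null columns.
        return (None, None, None)
    return (ts.seconds, ts.microseconds, ts.offset_bytes)


fields = to_sec_usec_offset(FakeTimestampWithTimezone(1647601241, 0, b"+0100"))
```

Keeping the offset as bytes means the value is carried through unchanged rather than being parsed and re-serialized.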

This revision is now accepted and ready to land. Mar 18 2022, 3:00 PM
douardda marked an inline comment as done.

rebase

Build is green

Patch application report for D7384 (id=26707)

Could not rebase; Attempt merge onto 68f9bd2028...

Updating 68f9bd2..0bab362
Fast-forward
 requirements-swh.txt            |   6 +--
 swh/dataset/cli.py              |  10 +++-
 swh/dataset/exporters/orc.py    |  44 ++++++++--------
 swh/dataset/journalprocessor.py |  34 ++++++++----
 swh/dataset/relational.py       |  15 +++---
 swh/dataset/test/test_orc.py    | 111 ++++++++++++++++++++++++++--------------
 6 files changed, 140 insertions(+), 80 deletions(-)
Changes applied before test
commit 0bab36224ff89b37c539fff4956b830be46eef81
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:00:41 2022 +0100

    Add some user metadata in generated ORC files
    
    add:
    - object type
    - uuid
    - version of swh.model used at file generation time,
    - version of swh.dataset used at file generation time.

commit eb135d55876346b6381e9c0a7965fe2827872322
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 16 10:46:37 2022 +0100

    Implement test_orc exporter as a simple function instead of a fixture
    
    and split it in 2 parts (needed for changes to come).

commit f7fe626c2db82dc25b58d01e84c6f017867da7c9
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:50:15 2021 +0100

    Make the kafka group_id prefix configurable in the config file
    
    rather than hardcoding it to 'swh-dataset-export-', use the 'group_id'
    value from the 'journal' section of the config file as prefix, if given
    (otherwise default to the former value).
    
    This is needed because current auth policy of swh kafka cluster only allows
    group_id to start with the actual login for authenticated connection.
    So we need to be able to specify this group_id prefix.

commit 4f14a95aaadabc1d2036a9c31c18e6a78befb44d
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:41:28 2021 +0100

    Use a named logger for journalprocessor.py
    
    and add a few more debug logging statements.

commit 316d51b6da36719bca767c78ad04402c609d5abe
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:44:49 2021 +0100

    Update JournalClientOffsetRanges for swh.journal 0.9
    
    deserialize_message() now takes an optional 'object_type' argument.

commit ae440431049470ecac6aca0e8cbed4a51cde0c09
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (sec, usec, offset) in ORC file
    
    instead of using the ORC Timestamp format, since we cannot always encode
    them in this format.
    
    The offset is encoded as binary (byte string), following recent evolutions
    of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/46/ for more details.

Build has FAILED

Patch application report for D7384 (id=26777)

Could not rebase; Attempt merge onto 68f9bd2028...

Updating 68f9bd2..fc59bf8
Fast-forward
 requirements-swh.txt            |   6 +-
 swh/dataset/cli.py              |  10 +++-
 swh/dataset/exporters/orc.py    |  87 ++++++++++++++++++++++------
 swh/dataset/journalprocessor.py |  34 +++++++----
 swh/dataset/relational.py       |  15 +++--
 swh/dataset/test/test_orc.py    | 124 ++++++++++++++++++++++++++--------------
 6 files changed, 195 insertions(+), 81 deletions(-)
Changes applied before test
commit fc59bf8cff8722af17652f5c76961e9ef920d5fc
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:00:41 2022 +0100

    Add some user metadata in generated ORC files
    
    add:
    - object type
    - uuid
    - version of swh.model used at file generation time,
    - version of swh.dataset used at file generation time.

commit e120502b262b190364a65d6f1af9f23f2f19b8ed
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 16 10:46:37 2022 +0100

    Implement test_orc exporter as a simple function instead of a fixture
    
    and split it in 2 parts (needed for changes to come).

commit c7cc72902575271d5f1dbef77f94faf255f86868
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:50:15 2021 +0100

    Make the kafka group_id prefix configurable in the config file
    
    rather than hardcoding it to 'swh-dataset-export-', use the 'group_id'
    value from the 'journal' section of the config file as prefix, if given
    (otherwise default to the former value).
    
    This is needed because current auth policy of swh kafka cluster only allows
    group_id to start with the actual login for authenticated connection.
    So we need to be able to specify this group_id prefix.

commit c508c673043458a419f2eeb0d5a2fb60b12aa007
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:41:28 2021 +0100

    Use a named logger for journalprocessor.py
    
    and add a few more debug logging statements.

commit d49db10f0bf7174ea4f2742d5d5ac8c8e25b707a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:44:49 2021 +0100

    Update JournalClientOffsetRanges for swh.journal 0.9
    
    deserialize_message() now takes an optional 'object_type' argument.

commit 69e806698bbb6df42bfa3520681e0203f91d8a65
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (timestamp, offset, raw_offset_bytes) in ORC file
    
    ie. use the standard ORC Timestamp format (aka a couple
    (seconds, nanoseconds)) with 2 extra fields for the offset.
    
    The offset is stored as an integer (in minutes), but the raw offset
    value is also present as a binary string representation, following
    recent evolutions of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/57/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/57/console
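
The revised encoding in the TimestampWithTimezone commit of the report above, (timestamp, offset, raw_offset_bytes), might look like this in outline. The function name encode_date and its parameters are assumptions for illustration; the only details taken from the commit message are the (seconds, nanoseconds) couple, the integer offset in minutes, and the raw offset as a byte string.

```python
from typing import Tuple

OrcTimestamp = Tuple[int, int]  # (seconds, nanoseconds)


def encode_date(
    seconds: int, microseconds: int, offset_minutes: int, offset_bytes: bytes
) -> Tuple[OrcTimestamp, int, bytes]:
    """Return (timestamp, offset, raw_offset_bytes) for one date (hypothetical helper)."""
    # Convert microseconds to nanoseconds for the (seconds, nanoseconds) couple.
    timestamp = (seconds, microseconds * 1000)
    return (timestamp, offset_minutes, offset_bytes)


row = encode_date(1647601241, 250, 60, b"+0100")
```

Storing both the integer offset and the raw bytes lets readers use the convenient numeric value while the original representation stays available.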

Build is green

Patch application report for D7384 (id=26799)

Could not rebase; Attempt merge onto 68f9bd2028...

Updating 68f9bd2..85d95ac
Fast-forward
 requirements-swh.txt            |   6 +-
 swh/dataset/cli.py              |  10 +++-
 swh/dataset/exporters/orc.py    |  89 +++++++++++++++++++++++------
 swh/dataset/journalprocessor.py |  34 +++++++----
 swh/dataset/relational.py       |  15 +++--
 swh/dataset/test/test_orc.py    | 122 ++++++++++++++++++++++++++--------------
 6 files changed, 196 insertions(+), 80 deletions(-)
Changes applied before test
commit 85d95ac3701a90f7ab7d12de5b7ff3af9bf8f45f
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:00:41 2022 +0100

    Add some user metadata in generated ORC files
    
    add:
    - object type
    - uuid
    - version of swh.model used at file generation time,
    - version of swh.dataset used at file generation time.

commit a92a271fec1c8f99ea8b47fed89d0c3e447de1dc
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 16 10:46:37 2022 +0100

    Implement test_orc exporter as a simple function instead of a fixture
    
    and split it in 2 parts (needed for changes to come).

commit 7c775ce88ebb840d32aa70dbe6feb1a6835db518
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:50:15 2021 +0100

    Make the kafka group_id prefix configurable in the config file
    
    rather than hardcoding it to 'swh-dataset-export-', use the 'group_id'
    value from the 'journal' section of the config file as prefix, if given
    (otherwise default to the former value).
    
    This is needed because current auth policy of swh kafka cluster only allows
    group_id to start with the actual login for authenticated connection.
    So we need to be able to specify this group_id prefix.

commit 55cf5ac5cc0cb818ae23b5df4af416e57794469c
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:41:28 2021 +0100

    Use a named logger for journalprocessor.py
    
    and add a few more debug logging statements.

commit 70d9d3182de1420ba545f2f507ade8f59b2c2f33
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:44:49 2021 +0100

    Update JournalClientOffsetRanges for swh.journal 0.9
    
    deserialize_message() now takes an optional 'object_type' argument.

commit 09d2840dbd4db6e1a3dd976c44b3c628b9174741
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (timestamp, offset, raw_offset_bytes) in ORC file
    
    ie. use the standard ORC Timestamp format (aka a couple
    (seconds, nanoseconds)) with 2 extra fields for the offset.
    
    The offset is stored as an integer (in minutes), but the raw offset
    value is also present as a binary string representation, following
    recent evolutions of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/69/ for more details.

Build is green

Patch application report for D7384 (id=26995)

Could not rebase; Attempt merge onto 68f9bd2028...

Updating 68f9bd2..0a593e3
Fast-forward
 requirements-swh.txt            |   6 +-
 swh/dataset/cli.py              |  10 +++-
 swh/dataset/exporters/orc.py    |  89 +++++++++++++++++++++++------
 swh/dataset/journalprocessor.py |  34 +++++++----
 swh/dataset/relational.py       |   3 +
 swh/dataset/test/test_orc.py    | 122 ++++++++++++++++++++++++++--------------
 6 files changed, 190 insertions(+), 74 deletions(-)
Changes applied before test
commit 0a593e3478cc982b95e0ec6ac23a0ba52063ae73
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:00:41 2022 +0100

    Add some user metadata in generated ORC files
    
    add:
    - object type
    - uuid
    - version of swh.model used at file generation time,
    - version of swh.dataset used at file generation time.

commit f8211b934774ded5ea83948a02140af446788e9a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 16 10:46:37 2022 +0100

    Implement test_orc exporter as a simple function instead of a fixture
    
    and split it in 2 parts (needed for changes to come).

commit 4508da3a91ba634e3b5def318f4509a688168c9e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:50:15 2021 +0100

    Make the kafka group_id prefix configurable in the config file
    
    rather than hardcoding it to 'swh-dataset-export-', use the 'group_id'
    value from the 'journal' section of the config file as prefix, if given
    (otherwise default to the former value).
    
    This is needed because current auth policy of swh kafka cluster only allows
    group_id to start with the actual login for authenticated connection.
    So we need to be able to specify this group_id prefix.

commit de114c20f105c0b888eb92625f4e073f97a94ae8
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:41:28 2021 +0100

    Use a named logger for journalprocessor.py
    
    and add a few more debug logging statements.

commit a8442bcf7c4311a28bea0898a01dc9475889efc7
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:44:49 2021 +0100

    Update JournalClientOffsetRanges for swh.journal 0.9
    
    deserialize_message() now takes an optional 'object_type' argument.

commit f588e20a41af4b1b8042f9b5f0e88a1f1dc91e59
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (timestamp, offset, raw_offset_bytes) in ORC file
    
    ie. use the standard ORC Timestamp format (aka a couple
    (seconds, nanoseconds)) with 2 extra fields for the offset.
    
    The offset is stored as an integer (in minutes), but the raw offset
    value is also present as a binary string representation, following
    recent evolutions of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/81/ for more details.

Build is green

Patch application report for D7384 (id=27032)

Could not rebase; Attempt merge onto 31081e4121...

Updating 31081e4..729ae64
Fast-forward
 requirements-swh.txt            |  2 +-
 swh/dataset/cli.py              | 10 +++++-
 swh/dataset/exporters/orc.py    |  8 +++++
 swh/dataset/journalprocessor.py | 34 ++++++++++++------
 swh/dataset/test/test_orc.py    | 79 +++++++++++++++++++++--------------------
 5 files changed, 82 insertions(+), 51 deletions(-)
Changes applied before test
commit 729ae64f36cd4f2d78bbdd0952a801b5cba5b462
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:00:41 2022 +0100

    Add some user metadata in generated ORC files
    
    add:
    - object type
    - uuid
    - version of swh.model used at file generation time,
    - version of swh.dataset used at file generation time.

commit 2298fb3422804688bd7bbcef3155cd0e9a80a00e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 16 10:46:37 2022 +0100

    Implement test_orc exporter as a simple function instead of a fixture
    
    and split it in 2 parts (needed for changes to come).

commit 68899901c7e596471cb1a9e769504919c6a19881
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:50:15 2021 +0100

    Make the kafka group_id prefix configurable in the config file
    
    rather than hardcoding it to 'swh-dataset-export-', use the 'group_id'
    value from the 'journal' section of the config file as prefix, if given
    (otherwise default to the former value).
    
    This is needed because current auth policy of swh kafka cluster only allows
    group_id to start with the actual login for authenticated connection.
    So we need to be able to specify this group_id prefix.

commit 769b6a77d250123ee25d8576bc1fe4a9340616f4
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:41:28 2021 +0100

    Use a named logger for journalprocessor.py
    
    and add a few more debug logging statements.

commit d7c332e4e7e1d5ee531a914b302f98c11503663e
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Dec 15 16:44:49 2021 +0100

    Update JournalClientOffsetRanges for swh.journal 0.9
    
    deserialize_message() now takes an optional 'object_type' argument.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/92/ for more details.