Page MenuHomeSoftware Heritage

Encode TimestampWithTimezone as (sec, usec, offset) in ORC file
ClosedPublic

Authored by douardda on Mar 18 2022, 2:09 PM.

Details

Summary

instead of using the ORC Timestamp format, since we cannot always encode
them in this format.

The offset is encoded as binary (byte string), following recent evolutions
of swh-model.

This makes swh-dataset compatible with swh-model 5.

Diff Detail

Repository
rDDATASET Datasets
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D7379 (id=26681)

Rebasing onto 68f9bd2028...

Current branch diff-target is up to date.
Changes applied before test
commit 8c2b5e951c1a1195c9ec3e700cb9da60711a96ab
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (sec, usec, offset) in ORC file
    
    instead of using the ORC Timestamp format, since we cannot always encode
    them in this format.
    
    The offset is encoded as binary (byte string), following recent evolutions
    of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/29/ for more details.

vlorentz added inline comments.
requirements-swh.txt
4

FWIW, swh.model >= 4.3.0 should work

swh/dataset/exporters/orc.py
57–62

I assume the lack of swh.model objects below is for performance reasons? If so, you can also reimplement this function by inlining https://forge.softwareheritage.org/source/swh-model/browse/master/swh/model/model.py;bc3831fddcd309789a3f5e81eba37d4f28219201$407-412 in case offset_bytes is missing from the dict, so you don't have to build two temporary timestamp objects.

swh/dataset/test/test_orc.py
114

just in case we ever make the misguided decision of making TimestampWithTimezone falsy on 0 dates.

117

ditto

requirements-swh.txt
4

yeah I was not sure (and lazy to double check) but (v5 definitely breaks swh.dataset :-) )

swh/dataset/exporters/orc.py
57–62

I assume the lack of swh.model objects below is for performance reasons? If so, you can also reimplement this function by inlining https://forge.softwareheritage.org/source/swh-model/browse/master/swh/model/model.py;bc3831fddcd309789a3f5e81eba37d4f28219201$407-412 in case offset_bytes is missing from the dict, so you don't have to build two temporary timestamp objects.

In my understanding, this code consumes directly dicts from the kafka journal (without instanciating model objects), probably for performance reasons yes.

swh/dataset/test/test_orc.py
114

just in case we ever make the misguided decision of making TimestampWithTimezone falsy on 0 dates.

that would be bad...

Do not use TimestampWithTimezone object in swh_date_to_tuple()

as suggested by vlorentz

Build is green

Patch application report for D7379 (id=26702)

Rebasing onto 68f9bd2028...

Current branch diff-target is up to date.
Changes applied before test
commit ae440431049470ecac6aca0e8cbed4a51cde0c09
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (sec, usec, offset) in ORC file
    
    instead of using the ORC Timestamp format, since we cannot always encode
    them in this format.
    
    The offset is encoded as binary (byte string), following recent evolutions
    of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/41/ for more details.

Update the encoding of TimestampWithTimezone to keep the timestamp part as an ORC timestamp

Build has FAILED

Patch application report for D7379 (id=26772)

Rebasing onto 68f9bd2028...

Current branch diff-target is up to date.
Changes applied before test
commit 69e806698bbb6df42bfa3520681e0203f91d8a65
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (timestamp, offset, raw_offset_bytes) in ORC file
    
    ie. use the standard ORC Timestamp format (aka a couple
    (seconds, nanoseconds)) with 2 extra fields for the offset.
    
    The offset is stored as an integer (in minutes), but the raw offset
    value is also present as a binary string representation, following
    recent evolutions of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/52/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/52/console

Build has FAILED

Patch application report for D7379 (id=26787)

Rebasing onto 68f9bd2028...

Current branch diff-target is up to date.
Changes applied before test
commit 103b16546273e068131e0fb61644a90bdafd652b
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (timestamp, offset, raw_offset_bytes) in ORC file
    
    ie. use the standard ORC Timestamp format (aka a couple
    (seconds, nanoseconds)) with 2 extra fields for the offset.
    
    The offset is stored as an integer (in minutes), but the raw offset
    value is also present as a binary string representation, following
    recent evolutions of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/63/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/63/console

attempt to please mypy

the ugly way...

Build is green

Patch application report for D7379 (id=26794)

Rebasing onto 68f9bd2028...

Current branch diff-target is up to date.
Changes applied before test
commit 09d2840dbd4db6e1a3dd976c44b3c628b9174741
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (timestamp, offset, raw_offset_bytes) in ORC file
    
    ie. use the standard ORC Timestamp format (aka a couple
    (seconds, nanoseconds)) with 2 extra fields for the offset.
    
    The offset is stored as an integer (in minutes), but the raw offset
    value is also present as a binary string representation, following
    recent evolutions of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/64/ for more details.

add a couple of missing 'is not None' as asked by volrentz

Build is green

Patch application report for D7379 (id=26809)

Rebasing onto 68f9bd2028...

Current branch diff-target is up to date.
Changes applied before test
commit ac1ec1667d27ac6893a0d81e09c293f3cd04a2ee
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (timestamp, offset, raw_offset_bytes) in ORC file
    
    ie. use the standard ORC Timestamp format (aka a couple
    (seconds, nanoseconds)) with 2 extra fields for the offset.
    
    The offset is stored as an integer (in minutes), but the raw offset
    value is also present as a binary string representation, following
    recent evolutions of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/75/ for more details.

This revision is now accepted and ready to land.Mar 24 2022, 1:25 PM

Rename date fields back to their previous names and use smallint for offsets

Build is green

Patch application report for D7379 (id=26990)

Rebasing onto 68f9bd2028...

Current branch diff-target is up to date.
Changes applied before test
commit f588e20a41af4b1b8042f9b5f0e88a1f1dc91e59
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 11:46:37 2022 +0100

    Encode TimestampWithTimezone as (timestamp, offset, raw_offset_bytes) in ORC file
    
    ie. use the standard ORC Timestamp format (aka a couple
    (seconds, nanoseconds)) with 2 extra fields for the offset.
    
    The offset is stored as an integer (in minutes), but the raw offset
    value is also present as a binary string representation, following
    recent evolutions of swh-model.
    
    This makes swh-dataset compatible with swh-model 5.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/76/ for more details.