Consider switching timestamp offset storage to strings/byte arrays
Closed, Migrated. Edits Locked.

Description

Our current TimestampWithTimezone data type, which has three fields:

  • a timestamp
  • a timezone offset in minutes
  • a boolean to support "negative utc" -0000 timezone offsets

is a recurrent cause of grief:

  • The latest example of such grief is the discussion around D3263.
  • Some (legacy) timezone offsets are not a whole number of minutes and can't be stored.
  • Some buggy data overflows the capacity of the current (smallint) field and is rejected solely because of that storage constraint.
  • Some objects we're importing don't have timezone information at all, which forces us to add bogus data.
  • Analysis of timezone-related data is more of a curiosity, and wouldn't be much hampered by relaxing constraints on the field.

I propose that we turn the timezone offset and negative utc boolean fields into a unified, nullable, free-form bytestring.

The recommended format for the bytestring would be ASCII [+-]HHMM, where HH and MM are zero-padded integers for the hours and minutes of the timezone offset. Other values would still be accepted, with their interpretation left to end users; this allows SWHID-preserving imports of data from VCSs with lax validation, such as Git.
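To make the recommended format concrete, here is a minimal sketch of how the current (offset in minutes, negative utc) pair could be rendered as such a bytestring, and how a stored value could be checked against the recommended shape. The names legacy_offset_to_bytes and is_recommended_format are hypothetical and used only for illustration; they are not existing functions of the codebase.

```python
import re
from typing import Optional

# Recommended shape from the proposal: a sign, then zero-padded HH and MM.
RECOMMENDED_OFFSET = re.compile(rb"[+-][0-9]{2}[0-9]{2}")


def legacy_offset_to_bytes(offset_minutes: int, negative_utc: bool = False) -> bytes:
    """Render the legacy pair as an ASCII [+-]HHMM bytestring (hypothetical helper)."""
    # A zero offset is only rendered with "-" when the negative utc flag is set.
    if offset_minutes < 0 or (offset_minutes == 0 and negative_utc):
        sign = "-"
    else:
        sign = "+"
    hours, minutes = divmod(abs(offset_minutes), 60)
    return f"{sign}{hours:02d}{minutes:02d}".encode("ascii")


def is_recommended_format(offset: Optional[bytes]) -> bool:
    """Tell whether a stored offset follows the recommended [+-]HHMM shape."""
    # None (no timezone data) and free-form values are valid to store,
    # they simply do not match the recommended shape.
    return offset is not None and RECOMMENDED_OFFSET.fullmatch(offset) is not None
```

Under this sketch, a value such as b"+051730" (an offset with seconds) or a missing offset would be stored as-is and simply reported as not following the recommended shape, rather than being rejected.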

This is fully backwards-compatible with the current SWHID computation, which turns the combination of boolean/int into a string with the given format for identifier computation. The computation of SWHIDs would be modified so that null values of the field just trim the space after the timestamp at the end of the "authorship" line. Objects generated from a single timestamp with no timezone data would be stored as such.
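The null-handling rule above can be sketched as follows, assuming a git-style "authorship" line of the form header, full name, timestamp, offset, separated by single spaces. The function format_authorship_line is a hypothetical illustration, not the actual identifier computation code, and it ignores sub-second precision.

```python
from typing import Optional


def format_authorship_line(
    header: bytes, fullname: bytes, seconds: int, offset: Optional[bytes]
) -> bytes:
    """Build a git-style authorship line (hypothetical sketch).

    When offset is None, the line ends right after the timestamp,
    with no trailing space.
    """
    parts = [header, fullname, str(seconds).encode("ascii")]
    if offset is not None:
        # The offset bytes are emitted verbatim, whatever their shape.
        parts.append(offset)
    return b" ".join(parts)
```

Because the offset bytes are written out verbatim, a value already in [+-]HHMM form yields the same line as before, which is what keeps existing identifiers unchanged under this assumption.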

Event Timeline

olasd triaged this task as Low priority. Jun 12 2020, 1:00 PM
olasd created this task.

(ping @zack who has done some actual analysis on the timezone-related data in the archive)

Yeah, having played with it quite a bit recently, I can say the current state of timestamp offsets isn't great. I'm fine with the idea of switching them to bytestrings as proposed.

As part of this, we should revamp the doc around how to interpret our time[stamp] values, because it is really hard to grok for outsiders. I've revamped some table column comments recently, but it's nowhere near enough. Happy to review/give a hand on that.