Handle malformed author and committer dates
Open, Normal, Public

Description

All errors of type psycopg2.InternalError: current transaction is aborted, commands ignored until end of transaction block [1] reported by the git loader correspond to the processing of malformed dates.

This is usually due to a revision whose author or committer date lies far in the future, see for instance:

The timezone offset computed from such a date is invalid and its value overflows the PostgreSQL smallint type,
which causes the following exception to be thrown in swh-storage:

Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/antoine/swh/swh-environment/swh-storage/swh/storage/db.py", line 201, in writer
    tblname, ', '.join(columns)), f)
psycopg2.DataError: ERROR:  value "24193125" is out of range for type smallint
CONTEXT:  COPY tmp_revision, line 19448, column date_offset: "24193125"

We should handle these corner cases. The simplest solution would be to check whether the computed timezone offset lies within the valid bounds [UTC−14:00, UTC+14:00]
and set it to 0 if it does not. This could be handled directly in swh-storage [2], in case other loaders encounter a similar issue.
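
As a rough illustration of the bounds check described above, here is a minimal sketch. It assumes the offset is expressed in minutes relative to UTC; the function and constant names are hypothetical and not taken from swh-storage. The smallint limits are those of PostgreSQL's two-byte signed integer type.

```python
# Hypothetical helpers for illustration only; not part of swh-storage.

# Assumption: offsets are expressed in minutes relative to UTC.
MIN_OFFSET_MINUTES = -14 * 60  # UTC-14:00
MAX_OFFSET_MINUTES = 14 * 60   # UTC+14:00

# Range of PostgreSQL's smallint (2-byte signed integer).
SMALLINT_MIN = -2**15
SMALLINT_MAX = 2**15 - 1


def offset_in_bounds(offset_minutes: int) -> bool:
    """Return True if the offset lies within [UTC-14:00, UTC+14:00]."""
    return MIN_OFFSET_MINUTES <= offset_minutes <= MAX_OFFSET_MINUTES


def fits_smallint(value: int) -> bool:
    """Return True if the value can be stored in a smallint column."""
    return SMALLINT_MIN <= value <= SMALLINT_MAX
```

With the value from the traceback, both `offset_in_bounds(24193125)` and `fits_smallint(24193125)` are false, whereas an offset such as 120 (UTC+02:00) passes both checks.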

[1] http://kibana0.internal.softwareheritage.org:5601/app/kibana#/dashboard/22195930-d36e-11e8-913b-077937c6a5ef?_g=(refreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-60d%2Cmode%3Aquick%2Cto%3Anow))

[2] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/converters.py$125

Event Timeline

anlambert triaged this task as Normal priority.
zack added a subscriber: zack. Nov 14 2018, 12:03 PM

The simplest solution would be to check if the computed timezone offset lies in the adequate bounds [UTC−14:00, UTC+14:00] and set it to 0 if not.

Unless I'm missing something, if we do that we would lose information wrt the repos to archive and hence also the ability to check the integrity of persistent IDs wrt the archived content.

We should rather extend/generalize the underlying SQL-based implementation to make sure we can represent this, well, shitty data that exist in the world :-)

Indeed, you're right: the timezone offset is used to compute a revision identifier, so even if its value is incorrect it should be stored anyway.
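
To make that concrete, here is a small sketch of how git hashes a commit object: the SHA-1 covers the author and committer lines, which include the timezone offset as written in the repository. It assumes the revision identifier is derived the same way; the helper name and the sample author are made up for illustration, and the commit is reduced to a parentless one.

```python
import hashlib

# Well-known SHA-1 of git's empty tree object.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def git_commit_id(tree: str, ident: str, timestamp: int, offset: str, message: str) -> str:
    """Compute the SHA-1 of a minimal, parentless git commit object.

    Hypothetical helper: `offset` is the raw offset string from the
    commit header, e.g. "+0100".
    """
    who = "{} {} {}".format(ident, timestamp, offset)
    body = (
        "tree {}\n"
        "author {}\n"
        "committer {}\n"
        "\n"
        "{}\n"
    ).format(tree, who, who, message).encode("utf-8")
    # Git prefixes the content with "commit <size>\0" before hashing.
    header = b"commit %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


ident = "Jane Doe <jane@example.org>"  # made-up author
original = git_commit_id(EMPTY_TREE, ident, 1542196980, "+0100", "example")
zeroed = git_commit_id(EMPTY_TREE, ident, 1542196980, "+0000", "example")
```

Here `original` and `zeroed` differ, so an identifier recomputed from a stored offset of 0 would no longer match the one in the source repository.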

Maybe using a dedicated table to store those bogus timezone values is the simplest solution here.
In that case, the date_offset columns in the revision table could be set to None to indicate that
the values should be fetched from the bogus timezone values table instead.
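
A minimal sketch of that idea, using a plain dict as a stand-in for the dedicated table. All names here are hypothetical, and the criterion for "bogus" (not fitting in a smallint) is an assumption; this is not how swh-storage is implemented.

```python
from typing import Dict, Optional, Tuple

# Range of PostgreSQL's smallint (2-byte signed integer).
SMALLINT_MIN, SMALLINT_MAX = -2**15, 2**15 - 1


def split_offset(offset: int) -> Tuple[Optional[int], Optional[int]]:
    """Return (value for revision.date_offset, value for the side table).

    Exactly one of the two is None.
    """
    if SMALLINT_MIN <= offset <= SMALLINT_MAX:
        return offset, None
    # Too large for smallint: keep the original value in the side table.
    return None, offset


def restore_offset(
    revision_id: str,
    date_offset: Optional[int],
    bogus_offsets: Dict[str, int],
) -> int:
    """Read the offset back, falling back to the side table on None."""
    if date_offset is not None:
        return date_offset
    return bogus_offsets[revision_id]


# Stand-in for the dedicated table, keyed by a hypothetical revision id.
bogus_offsets = {}  # type: Dict[str, int]
stored, overflow = split_offset(24193125)
if overflow is not None:
    bogus_offsets["rev-1"] = overflow
```

After this, `stored` is None and `restore_offset("rev-1", stored, bogus_offsets)` gives back the original 24193125, so no information is lost.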

anlambert updated the task description. Nov 14 2018, 3:39 PM