
document DB encoding requirements
Closed, Migrated

Description

In a freshly created SWH DB with SQL_ASCII encoding and C ctype/collate, Git loading failed for me at the first revision ingestion like this:

2018-01-06 19:19:35,719 9439 Sending 100000 revisions
Exception in thread Thread-2417:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-storage/swh/storage/db.py", line 185, in writer
    tblname, ', '.join(columns)), f)
psycopg2.DataError: unsupported Unicode escape sequence
DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
CONTEXT:  JSON data, line 1: {"extra_headers": [["mergetag",...
COPY tmp_revision, line 540, column metadata: "{"extra_headers": [["mergetag", "object 7333b5aca412d6ad02667b5a513485838a91b136\ntype commit\ntag p..."


2018-01-06 19:19:40,757 9439 Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 896, in load
    self.store_data()
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 1001, in store_data
    self.send_batch_revisions(self.get_revisions())
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 681, in send_batch_revisions
    send_in_packets(revisions, self.send_revisions, packet_size)
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 42, in send_in_packets
    sender(formatted_objects)
  File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/lib/python3/dist-packages/retrying.py", line 206, in call
    return attempt.get(self._wrap_exception)
  File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 450, in send_revisions
    self.storage.revision_add(revision_list)
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-storage/swh/storage/storage.py", line 550, in revision_add
    db.revision_add_from_temp(cur)
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-storage/swh/storage/db.py", line 38, in _meth
    self._cursor(cur).execute('SELECT %s()' % stored_proc)
psycopg2.InternalError: current transaction is aborted, commands ignored until end of transaction block

2018-01-06 19:19:40,766 9439 Updating origin_visit for origin 1 with status partial
2018-01-06 19:19:40,768 9439 Done updating origin_visit for origin 1 with status partial
{'status': 'failed'}

For comparison, the in-production DB has encoding UTF8 and C.UTF8 ctype/collate.
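The encoding of a given database can be compared against production with a quick check. The following is only a sketch: `require_utf8` is a hypothetical helper, not part of swh-storage, and it assumes a DB-API cursor (for example one from psycopg2) connected to the database in question. `SHOW server_encoding` is a standard PostgreSQL command.

```python
def require_utf8(cur):
    """Raise if the connected PostgreSQL database is not UTF8-encoded.

    `cur` is assumed to be a DB-API cursor (e.g. from psycopg2).
    This helper is hypothetical and only illustrates the check.
    """
    # PostgreSQL reports the database encoding as a read-only setting.
    cur.execute("SHOW server_encoding")
    (encoding,) = cur.fetchone()
    if encoding != "UTF8":
        raise RuntimeError(
            "database encoding is %s, expected UTF8" % encoding
        )
    return encoding
```

On the database described above, this would raise with `SQL_ASCII` in the message.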

Do we actually require a UTF8-encoded DB or, at least, a non-ASCII one?

If so, I'd like to update sql/bin/db-init accordingly and document this requirement.

Event Timeline

To use the full features of jsonb, we indeed need the database encoding to be UTF8.
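To see where the rejected input comes from, here is a small sketch that needs no database. It assumes the metadata is serialized with something like Python's `json.dumps`, which by default writes non-ASCII characters as `\uXXXX` escapes. Per the error above, those are the sequences PostgreSQL refuses for code points above 007F when the server encoding is not UTF8. The helper name `escapes_above_ascii` and the sample metadata are made up for illustration.

```python
import json
from typing import List


def escapes_above_ascii(json_text: str) -> List[str]:
    """Return the \\uXXXX escapes in valid JSON text whose code point is > 0x7F."""
    found = []
    i = 0
    while i < len(json_text):
        if json_text[i] == "\\":
            if json_text[i + 1 : i + 2] == "u":
                code = json_text[i + 2 : i + 6]
                if int(code, 16) > 0x7F:
                    found.append("\\u" + code)
                i += 6  # skip the whole \uXXXX escape
            else:
                i += 2  # skip other escapes such as \n or \\
        else:
            i += 1
    return found


# Hypothetical metadata with a non-ASCII character in a header value.
metadata = {"extra_headers": [["mergetag", "tagger Zo\u00e9"]]}
encoded = json.dumps(metadata)  # ensure_ascii=True is the default
print(escapes_above_ascii(encoded))
```

Pure-ASCII metadata yields an empty list, which would explain why the load only fails once a revision with non-ASCII header content is reached.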

db-init was updated a while ago to force UTF8 encoding, and it is now the documented way to initialize the DB, so there is no need to specify DB encoding requirements anywhere else.