Page MenuHomeSoftware Heritage

converters: Work on model objects instead of dicts on the "not-db" side.
ClosedPublic

Authored by vlorentz on Aug 12 2020, 10:56 AM.

Details

Summary

This is a change internal to the pg storage, that will be needed to make
revision_get, revision_log, and release_get return model objects.

Diff Detail

Repository
rDSTO Storage manager
Branch
converters-typed
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 14507
Build 22313: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 22312: arc lint + arc unit

Unit TestsFailed

TimeTest
20,284 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_backfill::test_backfiller
swh_storage_backend_config = {'cls': 'local', 'db': 'postgresql://postgres@127.0.0.1:12458/tests', 'journal_writer': {'brokers': ['127.0.0.1:49109'], 'client_id': 'kafka_writer-1', 'cls': 'kafka', 'prefix': 'qtkovorxux-1'}, 'objstorage': {'args': {}, 'cls': 'memory'}} kafka_prefix = 'qtkovorxux', kafka_consumer_group = 'test-consumer-qtkovorxux' kafka_server = '127.0.0.1:49109'
1 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_converters::test_date_to_db
def test_date_to_db(): date_to_db = converters.date_to_db assert date_to_db(None) == {"timestamp": None, "offset": 0, "neg_utc_offset": None}
1 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_converters::test_db_to_author
def test_db_to_author(): # when actual_author = converters.db_to_author(b"fullname", b"name", b"email")
1 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_converters::test_db_to_release
def test_db_to_release(): # when actual_release = converters.db_to_release(
1 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_converters::test_db_to_revision
ValueError: 'rev' is not a valid RevisionType During handling of the above exception, another exception occurred:
View Full Test Results (5 Failed · 710 Passed · 9 Skipped)

Event Timeline

Build has FAILED

Patch application report for D3763 (id=13243)

Rebasing onto 6675286c4f...

Current branch diff-target is up to date.
Changes applied before test
commit f073aaf2d31bb12c06e9ebceaf6adc509dbac9ab
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Aug 12 10:55:06 2020 +0200

    converters: Work on model objects instead of dicts on the "not-db" side.
    
    This is a change internal to the pg storage, that will be needed to make
    revision_get, revision_log, and release_get return model objects.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/734/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/734/console

ardumont added inline comments.
swh/storage/converters.py
35

Make that a Person object.

41

Make that a TimestampWithTimezone...
(and rework the function below to only manipulate our model objects, it's more consistent).

What do you think?

57

Drop the indirection now or make it consistent so that it returns a Person object?

I don't understand. _to_db() functions should return dicts, not model objects, it's the whole point.

And DEFAULT_AUTHOR and DEFAULT_DATE are only used as return values of these functions.

swh/storage/converters.py
176

here you are manipulating both dict and model objects...
That now becomes inconsistent and annoying... (well for me ;)

I don't understand. _to_db() functions should return dicts, not model objects, it's the whole point.

It's not the whole point.

All this stuff was when we had only dicts to manipulate because dicts, at that time, represented our data model...

Now that we are using model objects for this representation (dict no more), my take is to manipulate those...
Our model objects should be used instead.

Following this reasoning, in the long run, i foresee the the converters module(s) should subside...
And the backend should simply know how to manipulate model objets to serialize into the backend... (and read from it).

And DEFAULT_AUTHOR and DEFAULT_DATE are only used as return values of these functions.

yeah, i know, see me previous point...


I'm merely suggesting to go that way as i found it makes more sense.
But that could be done at another time if at all ;)

Cheers,

It still doesn't make sense. What you're saying is that db.py should take model objects instead of dictionaries as arguments.

But if we do that, we still need to convert between model objects and postgresql-compatible types somewhere; and you're just moving converters calls from storage.py to db.py, but converters.py still needs to convert between model objects and database types. And database types do not fit in model objects.

It still doesn't make sense. What you're saying is that db.py should take model objects instead of dictionaries as arguments.

isn't it kinda what's done already with cassandra/cql?

You just added types very/extremely similar to the swh.model but slightly differentl (for some reason which escape me right now) within cql.
Granted for 'read' as far as i remember. nethertheless, it could be consistently done both ways.

...
but meh, like i said "if at all".
...

This revision is now accepted and ready to land.Aug 13 2020, 12:07 AM

It still doesn't make sense. What you're saying is that db.py should take model objects instead of dictionaries as arguments.

isn't it kinda what's done already with cassandra/cql?

You just added types very/extremely similar to the swh.model but slightly differentl (for some reason which escape me right now) within cql.

Yes, that could be done; but not with the model objects themselves. For Cassandra the differences are:

  1. no enum
  2. serialized JSON
  3. directory entries / snapshot branches / revision parents in the main objects (because separate tables)

for postgres, point 1 would apply and possibly point 3. Additionally for postgres:

  1. no TimestampWithTimezone object, we'll have to flatten them
  2. persons will have to be joined too

Build has FAILED

Patch application report for D3763 (id=13383)

Could not rebase; Attempt merge onto 3a713dd74a...

Updating 3a713dd7..1b033e4b
Fast-forward
 swh/storage/converters.py            | 201 +++++++++++++++++------------------
 swh/storage/storage.py               |  18 ++--
 swh/storage/tests/test_converters.py | 112 ++++++++++---------
 3 files changed, 169 insertions(+), 162 deletions(-)
Changes applied before test
commit 1b033e4b118df8ae5d92592474ad0e626daab6f1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Aug 12 10:55:06 2020 +0200

    converters: Work on model objects instead of dicts on the "not-db" side.
    
    This is a change internal to the pg storage, that will be needed to make
    revision_get, revision_log, and release_get return model objects.

commit 89656c9b1e6885ee158956d4138e98eb7c52e896
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 17 09:45:09 2020 +0200

    test_converters: Fix test data to match actual values.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/815/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/815/console

convert back in backfill.py

Build is green

Patch application report for D3763 (id=13384)

Rebasing onto 3a713dd74a...

Current branch diff-target is up to date.
Changes applied before test
commit 038a219f84d6b8a4f02b48f9ad3c5d823d097790
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Aug 12 10:55:06 2020 +0200

    converters: Work on model objects instead of dicts on the "not-db" side.
    
    This is a change internal to the pg storage, that will be needed to make
    revision_get, revision_log, and release_get return model objects.

commit 89656c9b1e6885ee158956d4138e98eb7c52e896
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 17 09:45:09 2020 +0200

    test_converters: Fix test data to match actual values.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/816/ for more details.