Page MenuHomeSoftware Heritage

backend: Make origin_visit_stats_upsert a batch api
ClosedPublic

Authored by ardumont on Jan 15 2021, 8:52 AM.

Details

Summary

Depends on D4866

Test Plan

tox

Diff Detail

Repository
rDSCH Scheduling utilities
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 18391
Build 28418: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 28417: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D4868 (id=17242)

Could not rebase; Attempt merge onto ca45d40f2a...

Updating ca45d40..254b5af
Fast-forward
 sql/updates/19.sql                         |   4 +
 swh/scheduler/backend.py                   |  74 +++---
 swh/scheduler/interface.py                 |   4 +-
 swh/scheduler/journal_client.py            |  89 ++++++++
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |   6 +-
 swh/scheduler/tests/test_journal_client.py | 349 +++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py      |  93 +++++++-
 8 files changed, 593 insertions(+), 36 deletions(-)
 create mode 100644 swh/scheduler/journal_client.py
 create mode 100644 swh/scheduler/tests/test_journal_client.py
Changes applied before test
commit 254b5af2a44e96ac65eb939dfd5dece9f5d54f20
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jan 14 18:38:05 2021 +0100

    backend: Make origin_visit_stats_upsert a batch api

commit 7f4d22a42213378b1e5e85e921c61bfa62be2868
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/126/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/126/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jan 15 2021, 8:54 AM
Harbormaster failed remote builds in B18384: Diff 17242!

you can filter out duplicates in the python code

Work python side on the same origin_visit_stats so upsert is happy

Build is green

Patch application report for D4868 (id=17246)

Could not rebase; Attempt merge onto ca45d40f2a...

Updating ca45d40..737432a
Fast-forward
 sql/updates/19.sql                         |   4 +
 swh/scheduler/backend.py                   |  74 +++---
 swh/scheduler/interface.py                 |   4 +-
 swh/scheduler/journal_client.py            |  95 ++++++++
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |   6 +-
 swh/scheduler/tests/test_journal_client.py | 349 +++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py      |  93 +++++++-
 8 files changed, 599 insertions(+), 36 deletions(-)
 create mode 100644 swh/scheduler/journal_client.py
 create mode 100644 swh/scheduler/tests/test_journal_client.py
Changes applied before test
commit 737432a58b3a09b01c2039beb5b8140e6dbc4618
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jan 14 18:38:05 2021 +0100

    backend: Make origin_visit_stats_upsert a batch api
    
    Related to T2967

commit 7f4d22a42213378b1e5e85e921c61bfa62be2868
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/127/ for more details.

vlorentz added inline comments.
swh/scheduler/backend.py
807–813

TIL execute_values, this is great

swh/scheduler/journal_client.py
60–94

similar comment as on the other diff; could you use attr.evolve directly on the object, instead of converting to dict then back to the object?

This revision now requires changes to proceed.Jan 15 2021, 11:30 AM

Thanks for handling this! I hope this will make the simulator happier!

I don't see any tests directly calling the upsert function with a list of visit_stats. I guess that's handled through the journal client tests? The straight port of the tests that used the "unary" form of the function is definitely useful, but it might make sense to have some small tests calling that function with multiple arguments directly, ideally with data that would trigger the cardinality violation.

Thanks for handling this! I hope this will make the simulator happier!

Yes, that's what we are aiming at ;)

I don't see any tests directly calling the upsert function with a list of visit_stats. I guess that's handled through the journal client tests? The straight port of the tests that used the "unary" form of the function is definitely useful, but it might make sense to have some small tests calling that function with multiple arguments directly, ideally with data that would trigger the cardinality violation.

Indeed, i just upgraded the existing ones.
Thanks for pointing it out.

I foresee 2 new tests now.
One passing to deal with multiple inserts in one call.
Another failing with the cardinality thing.

Thanks again.

swh/scheduler/journal_client.py
60–94

Those dicts are intermediary objects.

It's not clear whether adding creation of intermediary objects (attr.evolve) is beneficial here instead of simply mutating dicts.

ardumont added inline comments.
swh/scheduler/journal_client.py
60–94

jsyk @douardda is trying this out

His basic tryouts shows that the time order is microsecond par dict mutation
vs "ms" for object "evolution" so it's more costly.

Add the 2 missing tests discussed earlier.

failing test on the cardinality exception is a PITA, i'm currently stuck on and i'm forced to do ugly stuff [1]

E           swh.core.api.RemoteException: <RemoteException 500 CardinalityViolation: ['ON CONFLICT DO UPDATE command cannot affect row a second time\nHINT:  Ensure that no rows proposed for insertion within the same command have duplicate constrained values.\n']>

[1] D4868#inline-33256

ardumont added inline comments.
swh/scheduler/api/client.py
41

ugly stuff.

Build is green

Patch application report for D4868 (id=17249)

Could not rebase; Attempt merge onto ca45d40f2a...

Updating ca45d40..affad0c
Fast-forward
 sql/updates/19.sql                         |   4 +
 swh/scheduler/api/client.py                |  26 ++-
 swh/scheduler/backend.py                   |  74 +++---
 swh/scheduler/interface.py                 |   4 +-
 swh/scheduler/journal_client.py            |  95 ++++++++
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |   6 +-
 swh/scheduler/tests/test_journal_client.py | 349 +++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py      | 157 ++++++++++++-
 9 files changed, 685 insertions(+), 40 deletions(-)
 create mode 100644 swh/scheduler/journal_client.py
 create mode 100644 swh/scheduler/tests/test_journal_client.py
Changes applied before test
commit affad0c73e7bba77aaee65687cda235b9d9dc911
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jan 14 18:38:05 2021 +0100

    backend: Make origin_visit_stats_upsert a batch api
    
    Related to T2967

commit 7f4d22a42213378b1e5e85e921c61bfa62be2868
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/128/ for more details.

swh/scheduler/api/client.py
37
ardumont edited the test plan for this revision. (Show Details)

Rebase

Build is green

Patch application report for D4868 (id=17255)

Could not rebase; Attempt merge onto ca45d40f2a...

Updating ca45d40..2c2b0d1
Fast-forward
 sql/updates/19.sql                         |   4 +
 swh/scheduler/api/client.py                |  26 ++-
 swh/scheduler/backend.py                   |  73 +++---
 swh/scheduler/interface.py                 |   4 +-
 swh/scheduler/journal_client.py            |  98 ++++++++
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |   6 +-
 swh/scheduler/tests/test_journal_client.py | 356 +++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py      | 157 ++++++++++++-
 9 files changed, 694 insertions(+), 40 deletions(-)
 create mode 100644 swh/scheduler/journal_client.py
 create mode 100644 swh/scheduler/tests/test_journal_client.py
Changes applied before test
commit 2c2b0d1478d2d9a1b32da86b80af093e43081aad
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jan 14 18:38:05 2021 +0100

    backend: Make origin_visit_stats_upsert a batch api
    
    Related to T2967

commit a916b4a8d3fbca565a9ec5abead5bdeb0e346845
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/130/ for more details.

Build is green

Patch application report for D4868 (id=17256)

Could not rebase; Attempt merge onto ca45d40f2a...

Updating ca45d40..ae5d743
Fast-forward
 sql/updates/19.sql                         |   4 +
 swh/scheduler/api/client.py                |  24 +-
 swh/scheduler/backend.py                   |  73 +++---
 swh/scheduler/interface.py                 |   4 +-
 swh/scheduler/journal_client.py            |  98 ++++++++
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |   6 +-
 swh/scheduler/tests/test_journal_client.py | 356 +++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py      | 157 ++++++++++++-
 9 files changed, 692 insertions(+), 40 deletions(-)
 create mode 100644 swh/scheduler/journal_client.py
 create mode 100644 swh/scheduler/tests/test_journal_client.py
Changes applied before test
commit ae5d7434eee201df689353c3bf6fe1dd59e68ee7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jan 14 18:38:05 2021 +0100

    backend: Make origin_visit_stats_upsert a batch api
    
    Related to T2967

commit a916b4a8d3fbca565a9ec5abead5bdeb0e346845
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/131/ for more details.

Drop unwanted indentation change

Build is green

Patch application report for D4868 (id=17257)

Could not rebase; Attempt merge onto ca45d40f2a...

Updating ca45d40..2ab2aa9
Fast-forward
 sql/updates/19.sql                         |   4 +
 swh/scheduler/api/client.py                |  24 +-
 swh/scheduler/backend.py                   |  73 +++---
 swh/scheduler/interface.py                 |   4 +-
 swh/scheduler/journal_client.py            |  98 ++++++++
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |   6 +-
 swh/scheduler/tests/test_journal_client.py | 356 +++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py      | 155 ++++++++++++-
 9 files changed, 691 insertions(+), 39 deletions(-)
 create mode 100644 swh/scheduler/journal_client.py
 create mode 100644 swh/scheduler/tests/test_journal_client.py
Changes applied before test
commit 2ab2aa95674d624791f42fc6e12b8c30ce52282b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jan 14 18:38:05 2021 +0100

    backend: Make origin_visit_stats_upsert a batch api
    
    Related to T2967

commit a916b4a8d3fbca565a9ec5abead5bdeb0e346845
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/132/ for more details.

vlorentz added inline comments.
swh/scheduler/api/client.py
23–42

It's not enough because this code is missing from swh/scheduler/api/server.py:

@app.errorhandler(CardinalityViolation)                          
def argument_error_handler(exception):                           
    return error_handler(exception, encode_data, status_code=400)

However, CardinalityViolation is an SQL-specific error, so it should not leak through the API.

You should make the backend raise a subclass of SchedulerException instead.

This revision now requires changes to proceed.Jan 15 2021, 1:15 PM
swh/scheduler/api/client.py
23–42

to be more specific, reraise_exceptions is only read if it's a 4xx HTTP status; but without the above code, the RPC server returns a 5xx error because it's an unexpected exception from its point of view.

swh/scheduler/api/client.py
23–42

Thanks for the heads up for the first part i was missing.

Your last suggestion sounds better to my ears indeed.

I'll adapt.

Thanks

adapt according to review

Raise SchedulerException instead of leaking a postgres exception.

This voids the needs to tamper with api.client code.

Build is green

Patch application report for D4868 (id=17258)

Could not rebase; Attempt merge onto ca45d40f2a...

Updating ca45d40..83b3bca
Fast-forward
 sql/updates/19.sql                         |   4 +
 swh/scheduler/backend.py                   |  85 ++++---
 swh/scheduler/interface.py                 |   4 +-
 swh/scheduler/journal_client.py            |  98 ++++++++
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |   6 +-
 swh/scheduler/tests/test_journal_client.py | 356 +++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py      | 156 ++++++++++++-
 8 files changed, 679 insertions(+), 40 deletions(-)
 create mode 100644 swh/scheduler/journal_client.py
 create mode 100644 swh/scheduler/tests/test_journal_client.py
Changes applied before test
commit 83b3bcacfa65204117fbc9b877e7d7d52cd86644
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jan 14 18:38:05 2021 +0100

    backend: Make origin_visit_stats_upsert a batch api
    
    Related to T2967

commit a916b4a8d3fbca565a9ec5abead5bdeb0e346845
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/133/ for more details.

swh/scheduler/backend.py
825

;)

Use simpler repr call

swh/scheduler/backend.py
825

yes, i thought of it.
I waited for some feedback to decide which to use ¯\_(ツ)_/¯.

I guess the thought process must be, if a standard known already exists, then go with it ;)

Thanks.

Build is green

Patch application report for D4868 (id=17262)

Could not rebase; Attempt merge onto ca45d40f2a...

Updating ca45d40..5a1600e
Fast-forward
 sql/updates/19.sql                         |   4 +
 swh/scheduler/backend.py                   |  85 ++++---
 swh/scheduler/interface.py                 |   4 +-
 swh/scheduler/journal_client.py            |  98 ++++++++
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |   6 +-
 swh/scheduler/tests/test_journal_client.py | 356 +++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py      | 156 ++++++++++++-
 8 files changed, 679 insertions(+), 40 deletions(-)
 create mode 100644 swh/scheduler/journal_client.py
 create mode 100644 swh/scheduler/tests/test_journal_client.py
Changes applied before test
commit 5a1600e22032082765b9624c647c96a268f5c14e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jan 14 18:38:05 2021 +0100

    backend: Make origin_visit_stats_upsert a batch api
    
    Related to T2967

commit a916b4a8d3fbca565a9ec5abead5bdeb0e346845
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/134/ for more details.

Build is green

Patch application report for D4868 (id=17270)

Could not rebase; Attempt merge onto ca45d40f2a...

Updating ca45d40..5a1600e
Fast-forward
 sql/updates/19.sql                         |   4 +
 swh/scheduler/backend.py                   |  85 ++++---
 swh/scheduler/interface.py                 |   4 +-
 swh/scheduler/journal_client.py            |  98 ++++++++
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |   6 +-
 swh/scheduler/tests/test_journal_client.py | 356 +++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py      | 156 ++++++++++++-
 8 files changed, 679 insertions(+), 40 deletions(-)
 create mode 100644 swh/scheduler/journal_client.py
 create mode 100644 swh/scheduler/tests/test_journal_client.py
Changes applied before test
commit 5a1600e22032082765b9624c647c96a268f5c14e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jan 14 18:38:05 2021 +0100

    backend: Make origin_visit_stats_upsert a batch api
    
    Related to T2967

commit a916b4a8d3fbca565a9ec5abead5bdeb0e346845
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/136/ for more details.

Build is green

Patch application report for D4868 (id=17271)

Could not rebase; Attempt merge onto ca45d40f2a...

Updating ca45d40..a5fb291
Fast-forward
 sql/updates/19.sql                         |   4 +
 swh/scheduler/backend.py                   |  85 ++++---
 swh/scheduler/interface.py                 |   4 +-
 swh/scheduler/journal_client.py            |  98 ++++++++
 swh/scheduler/model.py                     |  10 +
 swh/scheduler/sql/30-schema.sql            |   6 +-
 swh/scheduler/tests/test_journal_client.py | 356 +++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py      | 155 ++++++++++++-
 8 files changed, 678 insertions(+), 40 deletions(-)
 create mode 100644 swh/scheduler/journal_client.py
 create mode 100644 swh/scheduler/tests/test_journal_client.py
Changes applied before test
commit a5fb291703d2f5de52450c326eebe1df07639714
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jan 14 18:38:05 2021 +0100

    backend: Make origin_visit_stats_upsert a batch api
    
    Related to T2967

commit 608aa2075619bbc9f5c7dab56d1c37a560f13e64
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/137/ for more details.

This revision is now accepted and ready to land.Jan 15 2021, 3:17 PM