Page MenuHomeSoftware Heritage

Implement basic storage of lister information
ClosedPublic

Authored by olasd on Jun 12 2020, 11:22 AM.

Details

Summary

This adds a pair a functions to the backend:

  • get_or_create_lister pulls the record for a given lister from the database
  • update_lister updates the record for a given lister in the database

This is one of the basic building blocks for the integration of lister
information directly in the scheduler database.

Related to T2442.

Test Plan

new tox tests for both new functions

Diff Detail

Repository
rDSCH Scheduling utilities
Branch
feature/origin-scheduler-overhaul
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 12856
Build 19571: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 19570: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D3271 (id=11595)

Could not rebase; Attempt merge onto 4c0c37bab0...

Updating 4c0c37b..38b41f6
Fast-forward
 requirements.txt                       |  2 ++
 swh/scheduler/api/client.py            |  4 +++
 swh/scheduler/api/serializers.py       | 28 +++++++++++++++++
 swh/scheduler/api/server.py            |  4 ++-
 swh/scheduler/backend.py               | 29 ++++++++++++++++++
 swh/scheduler/interface.py             |  9 ++++++
 swh/scheduler/model.py                 | 55 ++++++++++++++++++++++++++++++++++
 swh/scheduler/sql/10-swh-init.sql      |  1 +
 swh/scheduler/sql/30-swh-schema.sql    | 14 +++++++++
 swh/scheduler/sql/60-swh-indexes.sql   |  3 ++
 swh/scheduler/tests/common.py          | 18 +++++++++++
 swh/scheduler/tests/test_api_client.py |  1 +
 swh/scheduler/tests/test_scheduler.py  | 17 ++++++++++-
 13 files changed, 183 insertions(+), 2 deletions(-)
 create mode 100644 swh/scheduler/api/serializers.py
 create mode 100644 swh/scheduler/model.py
 create mode 100644 swh/scheduler/sql/10-swh-init.sql
Changes applied before test
commit 38b41f6d39c6171c2679270830245aead487e49f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jun 12 11:11:46 2020 +0200

    Implement basic storage of lister information
    
    This adds a `get_or_create_lister` function, which stores basic information
    about the state of lister instances in the scheduler database.
    
    This is one of the basic building blocks for the integration of lister
    information directly in the scheduler database (ref. T2442).

commit 2eb7a2e07197faa5423b47d7f08111e553db0c71
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jun 12 11:03:26 2020 +0200

    Introduce some scaffolding for an attrs-based BaseSchedulerModel
    
    Alongside swh.model.model, this allows us to define data models for the objects
    the scheduler is working with, and to serialize/deserialize these objects
    transparently at the RPC layer.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/19/ for more details.

Why are the default values of current_state and updated set to None in the model, but {} and now() in SQL?

And do we need default values in SQL, anyway?

ardumont added a subscriber: ardumont.

Looks good to me.

I got a couple of questions.

Don't we want to add a migration script as well?

swh/scheduler/tests/test_scheduler.py
630

why not directly

assert lister == lister_get_again

?

This revision is now accepted and ready to land.Jun 12 2020, 11:46 AM

Improve on top of updated D3270 and apply some comments

Improve on top of updated D3270 and apply some comments

Build is green

Patch application report for D3271 (id=11605)

Could not rebase; Attempt merge onto 4c0c37bab0...

Updating 4c0c37b..12afd85
Fast-forward
 requirements.txt                       |  2 +
 swh/scheduler/api/client.py            |  4 ++
 swh/scheduler/api/serializers.py       | 28 ++++++++++
 swh/scheduler/api/server.py            |  4 +-
 swh/scheduler/backend.py               | 33 ++++++++++++
 swh/scheduler/interface.py             |  9 ++++
 swh/scheduler/model.py                 | 95 ++++++++++++++++++++++++++++++++++
 swh/scheduler/sql/10-swh-init.sql      |  1 +
 swh/scheduler/sql/30-swh-schema.sql    | 16 ++++++
 swh/scheduler/sql/60-swh-indexes.sql   |  3 ++
 swh/scheduler/tests/common.py          | 18 +++++++
 swh/scheduler/tests/test_api_client.py |  1 +
 swh/scheduler/tests/test_model.py      | 87 +++++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py  | 17 +++++-
 14 files changed, 316 insertions(+), 2 deletions(-)
 create mode 100644 swh/scheduler/api/serializers.py
 create mode 100644 swh/scheduler/model.py
 create mode 100644 swh/scheduler/sql/10-swh-init.sql
 create mode 100644 swh/scheduler/tests/test_model.py
Changes applied before test
commit 12afd85a3b0202819462e61be1176c5fcfa45e24
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jun 12 12:24:20 2020 +0200

    Implement basic storage of lister information
    
    This adds a `get_or_create_lister` function, which stores basic information
    about the state of lister instances in the scheduler database.
    
    This is one of the basic building blocks for the integration of lister
    information directly in the scheduler database (ref. T2442).

commit c6054ae5f029f1c5c73120bb7252e7968f81b9fc
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jun 12 11:03:26 2020 +0200

    Introduce some scaffolding for an attrs-based BaseSchedulerModel
    
    Alongside swh.model.model, this allows us to define data models for the objects
    the scheduler is working with, and to serialize/deserialize these objects
    transparently at the RPC layer.
    
    This also introduces some mild ORM-like logic so we can keep the actual SQL a
    little DRYer.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/21/ for more details.

Build is green

Patch application report for D3271 (id=11606)

Could not rebase; Attempt merge onto 4c0c37bab0...

Updating 4c0c37b..027d482
Fast-forward
 requirements.txt                       |  2 +
 swh/scheduler/api/client.py            |  4 ++
 swh/scheduler/api/serializers.py       | 28 ++++++++++
 swh/scheduler/api/server.py            |  4 +-
 swh/scheduler/backend.py               | 33 ++++++++++++
 swh/scheduler/interface.py             |  9 ++++
 swh/scheduler/model.py                 | 95 ++++++++++++++++++++++++++++++++++
 swh/scheduler/sql/10-swh-init.sql      |  1 +
 swh/scheduler/sql/30-swh-schema.sql    | 16 ++++++
 swh/scheduler/sql/60-swh-indexes.sql   |  3 ++
 swh/scheduler/tests/common.py          | 18 +++++++
 swh/scheduler/tests/test_api_client.py |  1 +
 swh/scheduler/tests/test_model.py      | 87 +++++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py  | 17 +++++-
 14 files changed, 316 insertions(+), 2 deletions(-)
 create mode 100644 swh/scheduler/api/serializers.py
 create mode 100644 swh/scheduler/model.py
 create mode 100644 swh/scheduler/sql/10-swh-init.sql
 create mode 100644 swh/scheduler/tests/test_model.py
Changes applied before test
commit 027d48231dde5172445744eabd960f2ba02c6cf6
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jun 12 12:24:20 2020 +0200

    Implement basic storage of lister information
    
    This adds a `get_or_create_lister` function, which stores basic information
    about the state of lister instances in the scheduler database.
    
    This is one of the basic building blocks for the integration of lister
    information directly in the scheduler database (ref. T2442).

commit c6054ae5f029f1c5c73120bb7252e7968f81b9fc
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jun 12 11:03:26 2020 +0200

    Introduce some scaffolding for an attrs-based BaseSchedulerModel
    
    Alongside swh.model.model, this allows us to define data models for the objects
    the scheduler is working with, and to serialize/deserialize these objects
    transparently at the RPC layer.
    
    This also introduces some mild ORM-like logic so we can keep the actual SQL a
    little DRYer.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/22/ for more details.

use plain equality between attr.s objects

olasd requested review of this revision.Jun 12 2020, 1:12 PM
olasd marked an inline comment as done.

Build is green

Patch application report for D3271 (id=11607)

Could not rebase; Attempt merge onto 4c0c37bab0...

Updating 4c0c37b..9985b0c
Fast-forward
 requirements.txt                       |  2 +
 swh/scheduler/api/client.py            |  4 ++
 swh/scheduler/api/serializers.py       | 28 ++++++++++
 swh/scheduler/api/server.py            |  4 +-
 swh/scheduler/backend.py               | 33 ++++++++++++
 swh/scheduler/interface.py             |  9 ++++
 swh/scheduler/model.py                 | 95 ++++++++++++++++++++++++++++++++++
 swh/scheduler/sql/10-swh-init.sql      |  1 +
 swh/scheduler/sql/30-swh-schema.sql    | 16 ++++++
 swh/scheduler/sql/60-swh-indexes.sql   |  3 ++
 swh/scheduler/tests/common.py          | 18 +++++++
 swh/scheduler/tests/test_api_client.py |  1 +
 swh/scheduler/tests/test_model.py      | 87 +++++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py  | 17 +++++-
 14 files changed, 316 insertions(+), 2 deletions(-)
 create mode 100644 swh/scheduler/api/serializers.py
 create mode 100644 swh/scheduler/model.py
 create mode 100644 swh/scheduler/sql/10-swh-init.sql
 create mode 100644 swh/scheduler/tests/test_model.py
Changes applied before test
commit 9985b0c332fb70edf6781425e721d5f431c4880f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jun 12 12:24:20 2020 +0200

    Implement basic storage of lister information
    
    This adds a `get_or_create_lister` function, which stores basic information
    about the state of lister instances in the scheduler database.
    
    This is one of the basic building blocks for the integration of lister
    information directly in the scheduler database (ref. T2442).

commit c6054ae5f029f1c5c73120bb7252e7968f81b9fc
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jun 12 11:03:26 2020 +0200

    Introduce some scaffolding for an attrs-based BaseSchedulerModel
    
    Alongside swh.model.model, this allows us to define data models for the objects
    the scheduler is working with, and to serialize/deserialize these objects
    transparently at the RPC layer.
    
    This also introduces some mild ORM-like logic so we can keep the actual SQL a
    little DRYer.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/23/ for more details.

douardda added inline comments.
swh/scheduler/backend.py
137

why a single 'get_or_create' function here?

At first sight, this looks a weird API design. I understand we need to cover a variety of cases (especially github-style v.s. gitlab-style, i.e. singleton vs. multiple instance), but it looks weird to me to handle this like that.

Do we expect to have several gitlab listers with the same "human readable identifier" (namely name + arguments["instance"])?

swh/scheduler/backend.py
137

The intent is to let listers call this function once on every run so they can retrieve their UUID and "stored state", from their name and relevant arguments.

In the case of github / gitlab, the "stored state" would be, for instance, the numeric id of the latest repository lister (so an incremental lister could pick it up).

Having two separate columns for "lister name" and "lister arguments" felt cleaner, as it makes it easier to, e.g., count the instances of a given lister "family".

The long(er) term goal is to get rid of the lister-specific databases, at least for listers that only send simple loading tasks to the scheduler.

What alternative API design would you be proposing?

I'm not sure I understand the gitlab lister question, but no, we don't expect several instances of listers with the same (name, arguments) tuple.

swh/scheduler/backend.py
137

(basically, the API would have one "get or create" called at the start of listing, and one "update" called at the end of listing)

Don't we want to add a migration script as well?

I'll add migration scripts once I have a full set of schema changes I'll want to use in prod; as I don't intend to touch existing tables, only add a set of new ones, this shouldn't be an issue.

Add update of lister information to the commit; rebase on top of D3283.

Build is green

Patch application report for D3271 (id=11644)

Could not rebase; Attempt merge onto 4c0c37bab0...

Updating 4c0c37b..0e5bd23
Fast-forward
 requirements.txt                       |  2 +
 swh/scheduler/api/client.py            |  7 +++
 swh/scheduler/api/serializers.py       | 28 ++++++++++
 swh/scheduler/api/server.py            | 10 +++-
 swh/scheduler/backend.py               | 63 +++++++++++++++++++++-
 swh/scheduler/exc.py                   | 17 ++++++
 swh/scheduler/interface.py             | 17 ++++++
 swh/scheduler/model.py                 | 95 ++++++++++++++++++++++++++++++++++
 swh/scheduler/sql/10-swh-init.sql      |  1 +
 swh/scheduler/sql/30-swh-schema.sql    | 16 ++++++
 swh/scheduler/sql/60-swh-indexes.sql   |  3 ++
 swh/scheduler/tests/common.py          | 18 +++++++
 swh/scheduler/tests/test_api_client.py |  2 +
 swh/scheduler/tests/test_model.py      | 77 +++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py  | 38 +++++++++++++-
 15 files changed, 391 insertions(+), 3 deletions(-)
 create mode 100644 swh/scheduler/api/serializers.py
 create mode 100644 swh/scheduler/exc.py
 create mode 100644 swh/scheduler/model.py
 create mode 100644 swh/scheduler/sql/10-swh-init.sql
 create mode 100644 swh/scheduler/tests/test_model.py
Changes applied before test
commit 0e5bd23269a4c898f111c8de16e09304deb0f2a6
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jun 12 12:24:20 2020 +0200

    Implement basic storage and retrieval of lister information
    
    This adds a pair a functions to the backend:
     - `get_or_create_lister` pulls the record for a given lister from the database
     - `update_lister` updates the record for a given lister in the database
    
    This is one of the basic building blocks for the integration of lister
    information directly in the scheduler database.
    
    Related to T2442.

commit 466ac5912e347b9b2da4254849b2d3192819fd2a
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jun 15 14:46:28 2020 +0200

    Introduce a SchedulerException base class
    
    This allows us to automatically serialize/deserialize exceptions under this base
    class within our RPC framework.

commit c509a12df5f1c25be3e492c693be11eae382862a
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jun 12 11:03:26 2020 +0200

    Introduce some scaffolding for an attrs-based BaseSchedulerModel
    
    Alongside swh.model.model, this allows us to define data models for the objects
    the scheduler is working with, and to serialize/deserialize these objects
    transparently at the RPC layer.
    
    This also introduces some mild ORM-like logic so we can keep the actual SQL a
    little DRYer.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/26/ for more details.

Update after conversation with @douardda.

We only pass a distinguishing "instance_name" string to get_or_create_lister,
instead of fully free-form arguments; This avoids a potential conflation between
listing task arguments (which could change even if the instance hasn't changed,
e.g. if the API has moved), and identification of different instances of the
same lister.

Build is green

Patch application report for D3271 (id=11648)

Could not rebase; Attempt merge onto 4c0c37bab0...

Updating 4c0c37b..1c93e55
Fast-forward
 requirements.txt                       |  2 +
 swh/scheduler/api/client.py            |  7 +++
 swh/scheduler/api/serializers.py       | 28 +++++++++++
 swh/scheduler/api/server.py            | 10 +++-
 swh/scheduler/backend.py               | 69 +++++++++++++++++++++++++-
 swh/scheduler/exc.py                   | 17 +++++++
 swh/scheduler/interface.py             | 24 +++++++++
 swh/scheduler/model.py                 | 90 ++++++++++++++++++++++++++++++++++
 swh/scheduler/sql/10-swh-init.sql      |  1 +
 swh/scheduler/sql/30-swh-schema.sql    | 16 ++++++
 swh/scheduler/sql/60-swh-indexes.sql   |  3 ++
 swh/scheduler/tests/common.py          |  9 ++++
 swh/scheduler/tests/test_api_client.py |  2 +
 swh/scheduler/tests/test_model.py      | 77 +++++++++++++++++++++++++++++
 swh/scheduler/tests/test_scheduler.py  | 39 ++++++++++++++-
 15 files changed, 391 insertions(+), 3 deletions(-)
 create mode 100644 swh/scheduler/api/serializers.py
 create mode 100644 swh/scheduler/exc.py
 create mode 100644 swh/scheduler/model.py
 create mode 100644 swh/scheduler/sql/10-swh-init.sql
 create mode 100644 swh/scheduler/tests/test_model.py
Changes applied before test
commit 1c93e553a16aa093f9611dc9b8842bab038f1d29
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jun 12 12:24:20 2020 +0200

    Implement basic storage and retrieval of lister information
    
    This adds a pair a functions to the backend:
     - `get_or_create_lister` pulls the record for a given lister from the database
     - `update_lister` updates the record for a given lister in the database
    
    This is one of the basic building blocks for the integration of lister
    information directly in the scheduler database.
    
    Related to T2442.

commit 466ac5912e347b9b2da4254849b2d3192819fd2a
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jun 15 14:46:28 2020 +0200

    Introduce a SchedulerException base class
    
    This allows us to automatically serialize/deserialize exceptions under this base
    class within our RPC framework.

commit c509a12df5f1c25be3e492c693be11eae382862a
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jun 12 11:03:26 2020 +0200

    Introduce some scaffolding for an attrs-based BaseSchedulerModel
    
    Alongside swh.model.model, this allows us to define data models for the objects
    the scheduler is working with, and to serialize/deserialize these objects
    transparently at the RPC layer.
    
    This also introduces some mild ORM-like logic so we can keep the actual SQL a
    little DRYer.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/27/ for more details.

This revision is now accepted and ready to land.Jun 15 2020, 4:13 PM