
Make the grab_next_visits sql query modular
Closed, Public

Authored by vlorentz on Jan 20 2021, 5:44 PM.

Details

Summary

This will allow us to easily plug new scheduling policies in that
function.
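
As a rough illustration of what "modular" could mean here, the sketch below assembles the query from independent lists of WHERE and ORDER BY clauses, so that a policy only has to append its own fragments. It is not the code from this diff: the function name, the policy names, and the table and column names (`origins`, `url`, `last_scheduled`, `last_visited`) are all assumptions made for the example. Only the `enabled AND visit_type=%s` shape comes from the review discussion.

```python
from typing import Any, List, Tuple


def build_grab_next_visits_query(
    visit_type: str, count: int, policy: str
) -> Tuple[str, Tuple[Any, ...]]:
    """Assemble a parameterized SELECT from clause lists (illustrative only)."""
    # Clauses shared by every policy; "enabled" is the boolean column
    # discussed in the inline comments of this review.
    where_clauses: List[str] = ["enabled", "visit_type = %s"]
    query_args: List[Any] = [visit_type]
    order_by_clauses: List[str] = []

    # Each policy only contributes its own fragments.
    # Both policy names below are hypothetical.
    if policy == "oldest_scheduled_first":
        order_by_clauses.append("last_scheduled NULLS FIRST")
    elif policy == "never_visited_only":
        where_clauses.append("last_visited IS NULL")
    else:
        raise ValueError(f"Unknown scheduling policy: {policy}")

    # Hypothetical table and column names.
    query = "SELECT url FROM origins WHERE " + " AND ".join(where_clauses)
    if order_by_clauses:
        query += " ORDER BY " + ", ".join(order_by_clauses)
    query += " LIMIT %s"
    query_args.append(count)
    return query, tuple(query_args)
```

With this layout, adding a policy means adding one branch that appends to `where_clauses` or `order_by_clauses`; the shared clauses and the final string assembly stay untouched. Values are still passed as query parameters rather than interpolated into the SQL string.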

Diff Detail

Repository
rDSCH Scheduling utilities
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D4896 (id=17408)

Could not rebase; Attempt merge onto 7905a6bea4...

Updating 7905a6b..8bab1ba
Fast-forward
 .pre-commit-config.yaml                     |   1 +
 docs/index.rst                              |   1 +
 docs/simulator.rst                          |  65 +++++++++++
 mypy.ini                                    |   6 +
 requirements-simulator.txt                  |   2 +
 setup.py                                    |  34 +++---
 swh/scheduler/backend.py                    |  48 +++++---
 swh/scheduler/cli/__init__.py               |   2 +-
 swh/scheduler/cli/simulator.py              |  68 ++++++++++++
 swh/scheduler/simulator/__init__.py         | 163 ++++++++++++++++++++++++++++
 swh/scheduler/simulator/common.py           | 132 ++++++++++++++++++++++
 swh/scheduler/simulator/origin_scheduler.py |  68 ++++++++++++
 swh/scheduler/simulator/origins.py          | 128 ++++++++++++++++++++++
 swh/scheduler/simulator/task_scheduler.py   |  76 +++++++++++++
 swh/scheduler/tests/test_simulator.py       |  53 +++++++++
 15 files changed, 812 insertions(+), 35 deletions(-)
 create mode 100644 docs/simulator.rst
 create mode 100644 requirements-simulator.txt
 create mode 100644 swh/scheduler/cli/simulator.py
 create mode 100644 swh/scheduler/simulator/__init__.py
 create mode 100644 swh/scheduler/simulator/common.py
 create mode 100644 swh/scheduler/simulator/origin_scheduler.py
 create mode 100644 swh/scheduler/simulator/origins.py
 create mode 100644 swh/scheduler/simulator/task_scheduler.py
 create mode 100644 swh/scheduler/tests/test_simulator.py
Changes applied before test
commit 8bab1ba37aebbb9921e73ffbb17a9cb25a94c264
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 20 17:17:17 2021 +0100

    Make the grab_next_visits sql query modular
    
    This will allow us to easily plug new scheduling policies in that
    function.

commit 898820fac52cf6fcfb5d2770aad49f131370a5a6
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 20 12:11:05 2021 +0100

    simulator: collect and plot scheduler metrics over time
    
    For now, only plot the known_origins and origins_never_visited metrics.

commit 9ce68f8d0e0ea69bd6672a50687079b5b1ea460c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jan 19 18:36:53 2021 +0100

    simulator: stop using get_scheduler directly
    
    This reuses the scheduler instantiated by the cli instead of hardcoding
    our own using the PG* variables.

commit 88e0b42805011bc3886f77ce5c91b3450351a16f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jan 19 16:32:27 2021 +0100

    simulator: Add documentation.

commit 62c6d90867bccb17ae076e1b5ee4db6fd350ad1b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jan 19 16:17:24 2021 +0100

    simulator: Make min_batch_size a parameter defined in the setup.

commit 9468bb9384f14e5fa0548b7d985f66fb3e36c85a
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jan 18 13:51:35 2021 +0100

    simulator: add basic tests for fill_test_data and run

commit ead7b347db9d8852b4c347729d7e6d32b72d9058
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 16:33:43 2021 +0100

    simulator: implement a simulator for the "old" task-based scheduler
    
    We extend the Task object with an autogenerated uuid allowing us to
    track the task lifetime between its creation and the generation of visit
    statuses, as the task-based scheduler does.

commit aecd27eee06aaa46d350e9d5b3f86ccc36a5446c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 16:31:42 2021 +0100

    Move the simulator cli to the main cli module

commit 05067e3ecc888271507505112b48ebc9f755f5e7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 15:37:59 2021 +0100

    simulator: Replace attrs with dataclasses for consistency

commit 24922fe2d995ca3ffa6c3c5a19c1f5f5531db4c8
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 15:31:41 2021 +0100

    simulator: wrap tasks and task events in typechecked objects
    
    This allows us to extend these objects without redefining a bunch of
    type annotations.

commit d5318aea0a93a94c80f8d743ce1de63592161f5a
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 14:47:33 2021 +0100

    simulator: also fill data for the task-based scheduler

commit 22ebb7a9a4bc6639e6f52d71c2b727537baf5019
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 15 14:41:05 2021 +0100

    simulator: Split into smaller files in the same package

commit ad7bfbe731da64cc6d1ddaa3f5ae1ef1e3350f60
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:50:00 2021 +0100

    simulator: Make the run time a CLI argument

commit df34db0bfc61df418f00338345b4b46a86340f62
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:40:16 2021 +0100

    simulator: tweak simulation environment constants

commit 21ce2c88dddce081bfd525d08454ca09bbf521c6
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:37:00 2021 +0100

    simulator: generate more origins in fill_data

commit 29204199774b40bea4d3d23ffe9407a5d090f8fa
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:35:01 2021 +0100

    simulator: add typing for Environment.scheduler

commit 6433266106dda007d1e5304a0dcb01706c8acb42
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:00:21 2021 +0100

    simulator: add support for a basic SimulationReport
    
    For now, this collects the runtime of tasks that have run, and gets
    printed at the end of the simulation.

commit c474a825336a4e4132e83982e180451b02d8f54d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 11:45:23 2021 +0100

    simulator: refine origin model to follow an exponential distribution
    
    This models origins using a consistent characteristic "time between
    commits" that follows an exponential distribution between 1 second and
    10 years.
    
    From this characteristic time, and feedback from the OriginVisitStats,
    we can generate the expected run time and output status of the next
    visit of that origin.

commit 2459badf0c05bf2cb663e66b9deabf1150638bb1
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 11:43:20 2021 +0100

    simulator: Remove some debug statements and lower log level

commit cb12449e8f57e59ec4c7953a3c4a52c9193d202e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:17:11 2021 +0100

    simulator: simulate the scheduler journal client

commit 20b7f9c68f831839f4be1cae4b9ae2dce0fc2d96
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:12:38 2021 +0100

    simulator: generate OriginVisitStatus objects in modeled visits
    
    To be able to generate uneventful visits, we would need to store
    the last snapshot seen for a given origin. Instead of storing this
    within the simulator, which would be a concern for large scale
    simulations, we use the scheduler visit cache directly.

commit 39ad47de2e753033c4b7114a64b5c3144b6ea821
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:09:58 2021 +0100

    simulator: Move scheduler into the simulation environment object
    
    The scheduler is used by a lot of the simulated actors, it makes sense
    to share it all the time.

commit 31967fa850c3afe29fc37e41cfcd53ff5408e7b9
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:07:56 2021 +0100

    simulator: Use datetimes instead of a floating point simulated time

commit fc3f06bd1d77c76bfba4c05efcd62abcb5c46eea
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 13 16:13:01 2021 +0100

    Introduce scaffolding for a scheduler simulator
    
    This simulator will allow us to compare the behavior of the old and new
    schedulers, as well as to test the impact of scheduler policies and their
    parameters on the performance of the Software Heritage archival
    infrastructure as a whole.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/195/ for more details.

One question about 'enabled', left inline,

but otherwise, LGTM.

swh/scheduler/backend.py, line 331

What does 'enabled' mean here?

I gather that it ends up in the query as something like "WHERE enabled AND visit_type=%s",
but I don't know exactly what that means.

This revision is now accepted and ready to land. (Jan 20 2021, 7:18 PM)
swh/scheduler/backend.py, line 331

It indicates whether this origin was seen during the last listing, and so whether visits should be scheduled for it.

swh/scheduler/backend.py, line 331

I don't think the enabled field is ever updated currently. But we will update it, eventually.

Obviously this would deserve a comment rather than being snuck in.

Build is green

Patch application report for D4896 (id=17453)

Rebasing onto 9fb0dd6c7c...

First, rewinding head to replay your work on top of it...
Applying: Make the grab_next_visits sql query modular
Changes applied before test
commit f82680a448910a059878ea91e71715a6b9697be9
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 20 17:17:17 2021 +0100

    Make the grab_next_visits sql query modular
    
    This will allow us to easily plug new scheduling policies in that
    function.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/209/ for more details.

Build is green

Patch application report for D4896 (id=17457)

Rebasing onto 9fb0dd6c7c...

Current branch diff-target is up to date.
Changes applied before test
commit b641ac83ebbf0b4d4166034467efa7c591793d50
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 20 17:17:17 2021 +0100

    Make the grab_next_visits sql query modular
    
    This will allow us to easily plug new scheduling policies in that
    function.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/213/ for more details.

This revision was automatically updated to reflect the committed changes.