Implement some basic aggregated metrics on listed origins
Closed, Public

Authored by vlorentz on Jan 19 2021, 2:34 PM.

Details

Summary

Metrics are computed and cached database-side by the `update_metrics`
function. The `get_metrics` function only retrieves the cached data.
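
The split between the two functions can be illustrated with a minimal, self-contained sketch. This is not the code from this diff: it uses an in-memory SQLite database instead of the scheduler's PostgreSQL backend, and the table names, column names and the `make_db` helper are hypothetical. Only the names `update_metrics` and `get_metrics` come from the summary above.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE listed_origins (
    lister_id TEXT NOT NULL,
    visit_type TEXT NOT NULL,
    url TEXT NOT NULL,
    enabled INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (lister_id, visit_type, url)
);
CREATE TABLE scheduler_metrics (
    lister_id TEXT NOT NULL,
    visit_type TEXT NOT NULL,
    last_update TEXT NOT NULL,
    origins_known INTEGER NOT NULL,
    origins_enabled INTEGER NOT NULL,
    PRIMARY KEY (lister_id, visit_type)
);
"""


def make_db():
    """Create an empty in-memory database with the sketch schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def update_metrics(db, now=None):
    """Recompute the aggregates inside the database and cache the result."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    with db:  # single transaction: commit on success, roll back on error
        db.execute(
            """
            INSERT OR REPLACE INTO scheduler_metrics
                (lister_id, visit_type, last_update,
                 origins_known, origins_enabled)
            SELECT lister_id, visit_type, ?, count(*), sum(enabled)
            FROM listed_origins
            GROUP BY lister_id, visit_type
            """,
            (timestamp,),
        )


def get_metrics(db):
    """Return the cached rows only; listed_origins is never read here."""
    cursor = db.execute(
        "SELECT lister_id, visit_type, origins_known, origins_enabled "
        "FROM scheduler_metrics ORDER BY lister_id, visit_type"
    )
    return cursor.fetchall()
```

In this sketch, `get_metrics` keeps returning the values from the last `update_metrics` call even after new origins are inserted, and calling `update_metrics` twice in a row overwrites the cached rows rather than duplicating them, because the cache is keyed on `(lister_id, visit_type)`.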

Test Plan

Basic tests added for each metric.

Event Timeline

Build is green

Patch application report for D4880 (id=17337)

Could not rebase; Attempt merge onto 5e609d5205...

Updating 5e609d5..826094b
Fast-forward
 .pre-commit-config.yaml                     |   1 +
 docs/index.rst                              |   1 +
 docs/simulator.rst                          |  55 ++++++++++
 mypy.ini                                    |   6 ++
 requirements-simulator.txt                  |   2 +
 setup.py                                    |  34 +++---
 sql/updates/24.sql                          |  56 ++++++++++
 swh/scheduler/backend.py                    |  62 +++++++++++
 swh/scheduler/cli/__init__.py               |   2 +-
 swh/scheduler/cli/simulator.py              |  57 ++++++++++
 swh/scheduler/interface.py                  |  31 ++++++
 swh/scheduler/model.py                      |  32 ++++++
 swh/scheduler/simulator/__init__.py         | 126 ++++++++++++++++++++++
 swh/scheduler/simulator/common.py           | 102 ++++++++++++++++++
 swh/scheduler/simulator/origin_scheduler.py |  69 +++++++++++++
 swh/scheduler/simulator/origins.py          | 119 +++++++++++++++++++++
 swh/scheduler/simulator/task_scheduler.py   |  77 ++++++++++++++
 swh/scheduler/sql/30-schema.sql             |  24 ++++-
 swh/scheduler/sql/40-func.sql               |  33 ++++++
 swh/scheduler/tests/test_api_client.py      |   2 +
 swh/scheduler/tests/test_scheduler.py       | 155 +++++++++++++++++++++++++++-
 swh/scheduler/tests/test_simulator.py       |  45 ++++++++
 22 files changed, 1071 insertions(+), 20 deletions(-)
 create mode 100644 docs/simulator.rst
 create mode 100644 requirements-simulator.txt
 create mode 100644 sql/updates/24.sql
 create mode 100644 swh/scheduler/cli/simulator.py
 create mode 100644 swh/scheduler/simulator/__init__.py
 create mode 100644 swh/scheduler/simulator/common.py
 create mode 100644 swh/scheduler/simulator/origin_scheduler.py
 create mode 100644 swh/scheduler/simulator/origins.py
 create mode 100644 swh/scheduler/simulator/task_scheduler.py
 create mode 100644 swh/scheduler/tests/test_simulator.py
Changes applied before test
commit 826094bf8a4be4b4295a6b024f6b85aafad871bb
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jan 19 14:23:32 2021 +0100

    Implement some basic aggregated metrics on listed origins
    
    Metrics are computed and cached database-side by the `update_metrics`
    function. The `get_metrics` function only retrieves the cached data.

commit 77362633b7485bfe3944d8c278d509eb60f0d664
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jan 18 13:51:35 2021 +0100

    simulator: add basic tests for fill_test_data and run

commit eb7676ea2e8dcc5fa92067ad7858e5069ccc8db1
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 16:33:43 2021 +0100

    simulator: implement a simulator for the "old" task-based scheduler
    
    We extend the Task object with an autogenerated uuid allowing us to
    track the task lifetime between its creation and the generation of visit
    statuses, as the task-based scheduler does.

commit 687e6f007cb4943ef19ff87b87953607c6f206b7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 16:31:42 2021 +0100

    Move the simulator cli to the main cli module

commit c3f520abc55c7355dbef0d2fed1102cc30040176
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 15:37:59 2021 +0100

    simulator: Replace attrs with dataclasses for consistency

commit 58042267faeaaab656c1e459b14fcfa24f300795
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 15:31:41 2021 +0100

    simulator: wrap tasks and task events in typechecked objects
    
    This allows us to extend these objects without redefining a bunch of
    type annotations.

commit b58eb740e9d407b95a1df632eaef91bbe6c3ff8b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 14:47:33 2021 +0100

    simulator: also fill data for the task-based scheduler

commit 85df218106a0c29dd79900321572e87a7c90a5bd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 15 14:41:05 2021 +0100

    simulator: Split into smaller files in the same package

commit 2bc5187c76657b00d54b61f993aeeb2de25acf18
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:50:00 2021 +0100

    simulator: Make the run time a CLI argument

commit edf406dc961c8d0a77a34e73b0e19fcd511bd27d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:40:16 2021 +0100

    simulator: tweak simulation environment constants

commit e7d60a996249b6827332e17e2977bec1b69eab83
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:37:00 2021 +0100

    simulator: generate more origins in fill_data

commit b663e5414a09bc0b5a22c111894433d71c77f42c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:35:01 2021 +0100

    simulator: add typing for Environment.scheduler

commit c3e8380e1aa140c8823ef76ba6d384474f160c9b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:00:21 2021 +0100

    simulator: add support for a basic SimulationReport
    
    For now, this collects the runtime of tasks that have run, and gets
    printed at the end of the simulation.

commit b4b20ad406d8925cb4aa96828dbc5af14e0bda8d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 11:45:23 2021 +0100

    simulator: refine origin model to follow an exponential distribution
    
    This models origins using a consistent characteristic "time between
    commits" that follows an exponential distribution between 1 second and
    10 years.
    
    From this characteristic time, and feedback from the OriginVisitStats,
    we can generate the expected run time and output status of the next
    visit of that origin.

commit cce6ce250ee0e73cc2b486c32cae8c05265a9974
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 11:43:20 2021 +0100

    simulator: Remove some debug statements and lower log level

commit 7e5f99837487c3785dfa96ed28ce9fecdf25bad8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:17:11 2021 +0100

    simulator: simulate the scheduler journal client

commit 63c3beea168e2f41ff0cbd71fe53af95e062748a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:12:38 2021 +0100

    simulator: generate OriginVisitStatus objects in modeled visits
    
    To be able to generate uneventful visits, we would need to store
    the last snapshot seen for a given origin. Instead of storing this
    within the simulator, which would be a concern for large scale
    simulations, we use the scheduler visit cache directly.

commit dd06b1bd428c15cf8ebb89873f24ee372ff363eb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:09:58 2021 +0100

    simulator: Move scheduler into the simulation environment object
    
    The scheduler is used by a lot of the simulated actors, it makes sense
    to share it all the time.

commit 524ec4a50a60eb45815faf49d8d675a86756955b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:07:56 2021 +0100

    simulator: Use datetimes instead of a floating point simulated time

commit 11263f58a02c9f1aa485df5ea4ac5131998f3d69
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 13 16:13:01 2021 +0100

    Introduce scaffolding for a scheduler simulator
    
    This simulator will allow us to compare the behavior of the old and new
    schedulers, as well as to test the impact of scheduler policies and their
    parameters on the performance of the Software Heritage archival
    infrastructure as a whole.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/156/ for more details.

Rebase without pulling in the full branch

Build is green

Patch application report for D4880 (id=17357)

Rebasing onto 0a32a31195...

First, rewinding head to replay your work on top of it...
Applying: Import the journal subcommand in the main swh.scheduler cli
Applying: Introduce scaffolding for a scheduler simulator
Applying: simulator: Use datetimes instead of a floating point simulated time
Applying: simulator: Move scheduler into the simulation environment object
Applying: simulator: generate OriginVisitStatus objects in modeled visits
Applying: simulator: simulate the scheduler journal client
Applying: simulator: Remove some debug statements and lower log level
Applying: simulator: refine origin model to follow an exponential distribution
Applying: simulator: add support for a basic SimulationReport
Applying: simulator: add typing for Environment.scheduler
Applying: simulator: generate more origins in fill_data
Applying: simulator: tweak simulation environment constants
Applying: simulator: Make the run time a CLI argument
Applying: simulator: Split into smaller files in the same package
Applying: simulator: also fill data for the task-based scheduler
Applying: simulator: wrap tasks and task events in typechecked objects
Applying: simulator: Replace attrs with dataclasses for consistency
Applying: Move the simulator cli to the main cli module
Applying: simulator: implement a simulator for the "old" task-based scheduler
Applying: simulator: add basic tests for fill_test_data and run
Applying: simulator: Make min_batch_size a parameter defined in the setup.
Applying: simulator: Add documentation.
Applying: Implement some basic aggregated metrics on listed origins
Changes applied before test
commit dc1591850962e6ffe995bcc9bc4f7001f244c4a2
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jan 19 14:23:32 2021 +0100

    Implement some basic aggregated metrics on listed origins
    
    Metrics are computed and cached database-side by the `update_metrics`
    function. The `get_metrics` function only retrieves the cached data.

commit 20eac21568d723c0ae724f4bf440057de3a3ab65
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jan 19 16:32:27 2021 +0100

    simulator: Add documentation.

commit 2d65ccfdd75a5f6e11e2d1fdee747c84703686a8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jan 19 16:17:24 2021 +0100

    simulator: Make min_batch_size a parameter defined in the setup.

commit 4c43a30903ff32fbfd8c51f2c6bd7701a7b548b9
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jan 18 13:51:35 2021 +0100

    simulator: add basic tests for fill_test_data and run

commit ca0a9f4ccee9bc3305b476b13fcdaaf00e763d6d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 16:33:43 2021 +0100

    simulator: implement a simulator for the "old" task-based scheduler
    
    We extend the Task object with an autogenerated uuid allowing us to
    track the task lifetime between its creation and the generation of visit
    statuses, as the task-based scheduler does.

commit 643b4d56d1a6acdcad753747a0fa7275456753b2
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 16:31:42 2021 +0100

    Move the simulator cli to the main cli module

commit 7895d2d7deb930baea73a44f5102c260f4aba0ea
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 15:37:59 2021 +0100

    simulator: Replace attrs with dataclasses for consistency

commit 0007537c8738e392b16c4147e51e4108cc8249a2
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 15:31:41 2021 +0100

    simulator: wrap tasks and task events in typechecked objects
    
    This allows us to extend these objects without redefining a bunch of
    type annotations.

commit 8d62b86da3e9ed728251fe8d1a7b032b6b64c726
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 14:47:33 2021 +0100

    simulator: also fill data for the task-based scheduler

commit c8d1fd7f79b770dbd0f2981b6c682a3049a898be
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 15 14:41:05 2021 +0100

    simulator: Split into smaller files in the same package

commit ef07de8d32cdb464545482390828fb0577514b82
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:50:00 2021 +0100

    simulator: Make the run time a CLI argument

commit 49b8b086c25c8fd683d7089095cd2cfaa1b61cd5
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:40:16 2021 +0100

    simulator: tweak simulation environment constants

commit 5cc155fdf6292e652cfa786406d7869e4a23e871
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:37:00 2021 +0100

    simulator: generate more origins in fill_data

commit 060a000cc3fd918645c01531e1f82bf53410cf17
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:35:01 2021 +0100

    simulator: add typing for Environment.scheduler

commit d0abbe887d202622d20cf9110fc863e0eea7aad6
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:00:21 2021 +0100

    simulator: add support for a basic SimulationReport
    
    For now, this collects the runtime of tasks that have run, and gets
    printed at the end of the simulation.

commit 5c21f7d7aea332232823f3e68547a7a1c40f6122
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 11:45:23 2021 +0100

    simulator: refine origin model to follow an exponential distribution
    
    This models origins using a consistent characteristic "time between
    commits" that follows an exponential distribution between 1 second and
    10 years.
    
    From this characteristic time, and feedback from the OriginVisitStats,
    we can generate the expected run time and output status of the next
    visit of that origin.

commit 5274886994683a0ae2971d587a7e5cf9ae8800b4
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 11:43:20 2021 +0100

    simulator: Remove some debug statements and lower log level

commit 66dfb010cc4eeea072f58adf75d0d7602ad52064
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:17:11 2021 +0100

    simulator: simulate the scheduler journal client

commit c6aa5d0bc3d185694ca1ea20173b01530ad15118
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:12:38 2021 +0100

    simulator: generate OriginVisitStatus objects in modeled visits
    
    To be able to generate uneventful visits, we would need to store
    the last snapshot seen for a given origin. Instead of storing this
    within the simulator, which would be a concern for large scale
    simulations, we use the scheduler visit cache directly.

commit dc3ab75fb4da1e63ca98281b57abddbd40e3c3af
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:09:58 2021 +0100

    simulator: Move scheduler into the simulation environment object
    
    The scheduler is used by a lot of the simulated actors, it makes sense
    to share it all the time.

commit ce78c550ea99d1ee6933c93188f8352faf772d53
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:07:56 2021 +0100

    simulator: Use datetimes instead of a floating point simulated time

commit 3d3058d49d116fb753456572b457dd026cd278cb
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 13 16:13:01 2021 +0100

    Introduce scaffolding for a scheduler simulator
    
    This simulator will allow us to compare the behavior of the old and new
    schedulers, as well as to test the impact of scheduler policies and their
    parameters on the performance of the Software Heritage archival
    infrastructure as a whole.

commit e32a3e63782d2aab5b894fa1c83122aeb199500d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jan 19 17:56:44 2021 +0100

    Import the journal subcommand in the main swh.scheduler cli
    
    This issue was masked by tox.ini using pytest with --doctest-modules,
    which imports all modules during test collection, and therefore executing
    the side-effects of swh.scheduler.cli.journal.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/166/ for more details.

Build is green

Patch application report for D4880 (id=17358)

Could not rebase; Attempt merge onto 0a32a31195...

Auto-merging setup.py
Merge made by the 'recursive' strategy.
 .pre-commit-config.yaml                     |   1 +
 docs/index.rst                              |   1 +
 docs/simulator.rst                          |  65 ++++++++++++
 mypy.ini                                    |   6 ++
 requirements-simulator.txt                  |   2 +
 setup.py                                    |  34 +++---
 sql/updates/24.sql                          |  56 ++++++++++
 swh/scheduler/backend.py                    |  62 +++++++++++
 swh/scheduler/cli/__init__.py               |   2 +-
 swh/scheduler/cli/simulator.py              |  61 +++++++++++
 swh/scheduler/interface.py                  |  31 ++++++
 swh/scheduler/model.py                      |  32 ++++++
 swh/scheduler/simulator/__init__.py         | 144 ++++++++++++++++++++++++++
 swh/scheduler/simulator/common.py           | 102 ++++++++++++++++++
 swh/scheduler/simulator/origin_scheduler.py |  68 ++++++++++++
 swh/scheduler/simulator/origins.py          | 128 +++++++++++++++++++++++
 swh/scheduler/simulator/task_scheduler.py   |  76 ++++++++++++++
 swh/scheduler/sql/30-schema.sql             |  24 ++++-
 swh/scheduler/sql/40-func.sql               |  33 ++++++
 swh/scheduler/tests/test_api_client.py      |   2 +
 swh/scheduler/tests/test_scheduler.py       | 155 +++++++++++++++++++++++++++-
 swh/scheduler/tests/test_simulator.py       |  45 ++++++++
 22 files changed, 1110 insertions(+), 20 deletions(-)
 create mode 100644 docs/simulator.rst
 create mode 100644 requirements-simulator.txt
 create mode 100644 sql/updates/24.sql
 create mode 100644 swh/scheduler/cli/simulator.py
 create mode 100644 swh/scheduler/simulator/__init__.py
 create mode 100644 swh/scheduler/simulator/common.py
 create mode 100644 swh/scheduler/simulator/origin_scheduler.py
 create mode 100644 swh/scheduler/simulator/origins.py
 create mode 100644 swh/scheduler/simulator/task_scheduler.py
 create mode 100644 swh/scheduler/tests/test_simulator.py
Changes applied before test
commit c4a3a5444c7171b66f67212dd8a03581b39c1d57
Merge: 0a32a31 b0a3369
Author: Jenkins user <jenkins@localhost>
Date:   Tue Jan 19 17:12:16 2021 +0000

    Merge branch 'diff-target' into HEAD

commit b0a3369157c83dbb5e3dab0e7ef9e2803edbdefe
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jan 19 14:23:32 2021 +0100

    Implement some basic aggregated metrics on listed origins
    
    Metrics are computed and cached database-side by the `update_metrics`
    function. The `get_metrics` function only retrieves the cached data.

commit ed044415e625080cb4bc67b2656743d92ed4c884
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jan 19 16:32:27 2021 +0100

    simulator: Add documentation.

commit 186aebeb12905dc98cc370b360a8b3f5c4db3186
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jan 19 16:17:24 2021 +0100

    simulator: Make min_batch_size a parameter defined in the setup.

commit 6150c764616d3c25ee13eb08bea6b4c9d1c2bc0d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jan 18 13:51:35 2021 +0100

    simulator: add basic tests for fill_test_data and run

commit 5bee207dca74bb2c70611b3308c93bc522d48247
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 16:33:43 2021 +0100

    simulator: implement a simulator for the "old" task-based scheduler
    
    We extend the Task object with an autogenerated uuid allowing us to
    track the task lifetime between its creation and the generation of visit
    statuses, as the task-based scheduler does.

commit 6ec79c18b7b0e8be7b086aa79e87de81a8dbd06a
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 16:31:42 2021 +0100

    Move the simulator cli to the main cli module

commit b25874a7066c95460f7d24c132f32f4dabf055a7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 15:37:59 2021 +0100

    simulator: Replace attrs with dataclasses for consistency

commit cb0bc27be55cf384c68b834ae3c89dd93434fbba
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 15:31:41 2021 +0100

    simulator: wrap tasks and task events in typechecked objects
    
    This allows us to extend these objects without redefining a bunch of
    type annotations.

commit 947aecb14cdb4c6dd2da178f53599b4a41c8245b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 14:47:33 2021 +0100

    simulator: also fill data for the task-based scheduler

commit ff6cd0669e0d75afbd2c63424db66bf8d1e91bee
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 15 14:41:05 2021 +0100

    simulator: Split into smaller files in the same package

commit b232135cb982f4fc8e5fb6242a88012d732e252d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:50:00 2021 +0100

    simulator: Make the run time a CLI argument

commit 24e93d8aa72107bf953f884df4c9b15ea9cbeb2c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:40:16 2021 +0100

    simulator: tweak simulation environment constants

commit 9885d12cd708a26878cd9aa70ab590223589e8d7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:37:00 2021 +0100

    simulator: generate more origins in fill_data

commit a1d80fec0f5760d136857fb893232b1baec35b64
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:35:01 2021 +0100

    simulator: add typing for Environment.scheduler

commit 6a9ec5f38133fe232da1ca98ff30ef44b12a4c12
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:00:21 2021 +0100

    simulator: add support for a basic SimulationReport
    
    For now, this collects the runtime of tasks that have run, and gets
    printed at the end of the simulation.

commit 5d0e2aee4182df9476934349ad20da5dafc8b61f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 11:45:23 2021 +0100

    simulator: refine origin model to follow an exponential distribution
    
    This models origins using a consistent characteristic "time between
    commits" that follows an exponential distribution between 1 second and
    10 years.
    
    From this characteristic time, and feedback from the OriginVisitStats,
    we can generate the expected run time and output status of the next
    visit of that origin.

commit 7934c2f90191615db69b50dc27744ec73704f896
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 11:43:20 2021 +0100

    simulator: Remove some debug statements and lower log level

commit d0ed751eca9f2ff0464b795edb9e9bb2a0305649
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:17:11 2021 +0100

    simulator: simulate the scheduler journal client

commit c9b0728955e683748f8b03a22f91d501b64aad67
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:12:38 2021 +0100

    simulator: generate OriginVisitStatus objects in modeled visits
    
    To be able to generate uneventful visits, we would need to store
    the last snapshot seen for a given origin. Instead of storing this
    within the simulator, which would be a concern for large scale
    simulations, we use the scheduler visit cache directly.

commit 56c8d1dd66d8a993c8bc7c7bcc4e3fb3704f6864
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:09:58 2021 +0100

    simulator: Move scheduler into the simulation environment object
    
    The scheduler is used by a lot of the simulated actors, it makes sense
    to share it all the time.

commit 1659aa17fe0510030fb24d3b7867d2c4a366b5dd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:07:56 2021 +0100

    simulator: Use datetimes instead of a floating point simulated time

commit ef241dd84c400f9be0d92396867587d47216e385
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 13 16:13:01 2021 +0100

    Introduce scaffolding for a scheduler simulator
    
    This simulator will allow us to compare the behavior of the old and new
    schedulers, as well as to test the impact of scheduler policies and their
    parameters on the performance of the Software Heritage archival
    infrastructure as a whole.

commit 49a14792b0329049b51cbc6ed9c48006e9ff1a73
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jan 19 17:56:44 2021 +0100

    Import the journal subcommand in the main swh.scheduler cli
    
    This issue was masked by tox.ini using pytest with --doctest-modules,
    which imports all modules during test collection, and therefore executing
    the side-effects of swh.scheduler.cli.journal.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/167/ for more details.

Add test updating metrics twice in a row

Build is green

Patch application report for D4880 (id=17360)

Could not rebase; Attempt merge onto 0a32a31195...

Auto-merging setup.py
Merge made by the 'recursive' strategy.
 .pre-commit-config.yaml                     |   1 +
 docs/index.rst                              |   1 +
 docs/simulator.rst                          |  65 +++++++++++
 mypy.ini                                    |   6 +
 requirements-simulator.txt                  |   2 +
 setup.py                                    |  34 +++---
 sql/updates/24.sql                          |  64 +++++++++++
 swh/scheduler/backend.py                    |  62 +++++++++++
 swh/scheduler/cli/__init__.py               |   2 +-
 swh/scheduler/cli/simulator.py              |  61 ++++++++++
 swh/scheduler/interface.py                  |  31 ++++++
 swh/scheduler/model.py                      |  32 ++++++
 swh/scheduler/simulator/__init__.py         | 144 ++++++++++++++++++++++++
 swh/scheduler/simulator/common.py           | 102 +++++++++++++++++
 swh/scheduler/simulator/origin_scheduler.py |  68 ++++++++++++
 swh/scheduler/simulator/origins.py          | 128 +++++++++++++++++++++
 swh/scheduler/simulator/task_scheduler.py   |  76 +++++++++++++
 swh/scheduler/sql/30-schema.sql             |  24 +++-
 swh/scheduler/sql/40-func.sql               |  40 +++++++
 swh/scheduler/tests/test_api_client.py      |   2 +
 swh/scheduler/tests/test_scheduler.py       | 166 +++++++++++++++++++++++++++-
 swh/scheduler/tests/test_simulator.py       |  45 ++++++++
 22 files changed, 1136 insertions(+), 20 deletions(-)
 create mode 100644 docs/simulator.rst
 create mode 100644 requirements-simulator.txt
 create mode 100644 sql/updates/24.sql
 create mode 100644 swh/scheduler/cli/simulator.py
 create mode 100644 swh/scheduler/simulator/__init__.py
 create mode 100644 swh/scheduler/simulator/common.py
 create mode 100644 swh/scheduler/simulator/origin_scheduler.py
 create mode 100644 swh/scheduler/simulator/origins.py
 create mode 100644 swh/scheduler/simulator/task_scheduler.py
 create mode 100644 swh/scheduler/tests/test_simulator.py
Changes applied before test
commit a3865fa4dad3ff3b8d4f0feff3757a0f46b0512f
Merge: 0a32a31 071671a
Author: Jenkins user <jenkins@localhost>
Date:   Tue Jan 19 17:18:25 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 071671aae22e726bfce5a00ce2e1c6ddfe850d33
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jan 19 14:23:32 2021 +0100

    Implement some basic aggregated metrics on listed origins
    
    Metrics are computed and cached database-side by the `update_metrics`
    function. The `get_metrics` function only retrieves the cached data.

commit ed044415e625080cb4bc67b2656743d92ed4c884
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jan 19 16:32:27 2021 +0100

    simulator: Add documentation.

commit 186aebeb12905dc98cc370b360a8b3f5c4db3186
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jan 19 16:17:24 2021 +0100

    simulator: Make min_batch_size a parameter defined in the setup.

commit 6150c764616d3c25ee13eb08bea6b4c9d1c2bc0d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jan 18 13:51:35 2021 +0100

    simulator: add basic tests for fill_test_data and run

commit 5bee207dca74bb2c70611b3308c93bc522d48247
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 16:33:43 2021 +0100

    simulator: implement a simulator for the "old" task-based scheduler
    
    We extend the Task object with an autogenerated uuid allowing us to
    track the task lifetime between its creation and the generation of visit
    statuses, as the task-based scheduler does.

commit 6ec79c18b7b0e8be7b086aa79e87de81a8dbd06a
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 16:31:42 2021 +0100

    Move the simulator cli to the main cli module

commit b25874a7066c95460f7d24c132f32f4dabf055a7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 15:37:59 2021 +0100

    simulator: Replace attrs with dataclasses for consistency

commit cb0bc27be55cf384c68b834ae3c89dd93434fbba
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 15:31:41 2021 +0100

    simulator: wrap tasks and task events in typechecked objects
    
    This allows us to extend these objects without redefining a bunch of
    type annotations.

commit 947aecb14cdb4c6dd2da178f53599b4a41c8245b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 14:47:33 2021 +0100

    simulator: also fill data for the task-based scheduler

commit ff6cd0669e0d75afbd2c63424db66bf8d1e91bee
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 15 14:41:05 2021 +0100

    simulator: Split into smaller files in the same package

commit b232135cb982f4fc8e5fb6242a88012d732e252d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:50:00 2021 +0100

    simulator: Make the run time a CLI argument

commit 24e93d8aa72107bf953f884df4c9b15ea9cbeb2c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:40:16 2021 +0100

    simulator: tweak simulation environment constants

commit 9885d12cd708a26878cd9aa70ab590223589e8d7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:37:00 2021 +0100

    simulator: generate more origins in fill_data

commit a1d80fec0f5760d136857fb893232b1baec35b64
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:35:01 2021 +0100

    simulator: add typing for Environment.scheduler

commit 6a9ec5f38133fe232da1ca98ff30ef44b12a4c12
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:00:21 2021 +0100

    simulator: add support for a basic SimulationReport
    
    For now, this collects the runtime of tasks that have run, and gets
    printed at the end of the simulation.

commit 5d0e2aee4182df9476934349ad20da5dafc8b61f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 11:45:23 2021 +0100

    simulator: refine origin model to follow an exponential distribution
    
    This models origins using a consistent characteristic "time between
    commits" that follows an exponential distribution between 1 second and
    10 years.
    
    From this characteristic time, and feedback from the OriginVisitStats,
    we can generate the expected run time and output status of the next
    visit of that origin.

commit 7934c2f90191615db69b50dc27744ec73704f896
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 11:43:20 2021 +0100

    simulator: Remove some debug statements and lower log level

commit d0ed751eca9f2ff0464b795edb9e9bb2a0305649
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:17:11 2021 +0100

    simulator: simulate the scheduler journal client

commit c9b0728955e683748f8b03a22f91d501b64aad67
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:12:38 2021 +0100

    simulator: generate OriginVisitStatus objects in modeled visits
    
    To be able to generate uneventful visits, we would need to store
    the last snapshot seen for a given origin. Instead of storing this
    within the simulator, which would be a concern for large scale
    simulations, we use the scheduler visit cache directly.

commit 56c8d1dd66d8a993c8bc7c7bcc4e3fb3704f6864
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:09:58 2021 +0100

    simulator: Move scheduler into the simulation environment object
    
    The scheduler is used by a lot of the simulated actors, it makes sense
    to share it all the time.

commit 1659aa17fe0510030fb24d3b7867d2c4a366b5dd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:07:56 2021 +0100

    simulator: Use datetimes instead of a floating point simulated time

commit ef241dd84c400f9be0d92396867587d47216e385
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 13 16:13:01 2021 +0100

    Introduce scaffolding for a scheduler simulator
    
    This simulator will allow us to compare the behavior of the old and new
    schedulers, as well as to test the impact of scheduler policies and their
    parameters on the performance of the Software Heritage archival
    infrastructure as a whole.

commit 49a14792b0329049b51cbc6ed9c48006e9ff1a73
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jan 19 17:56:44 2021 +0100

    Import the journal subcommand in the main swh.scheduler cli
    
    This issue was masked by tox.ini using pytest with --doctest-modules,
    which imports all modules during test collection, and therefore executing
    the side-effects of swh.scheduler.cli.journal.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/169/ for more details.

Build is green

Patch application report for D4880 (id=17363)

Could not rebase; Attempt merge onto 0a32a31195...

Auto-merging setup.py
Merge made by the 'recursive' strategy.
 .pre-commit-config.yaml                     |   1 +
 docs/index.rst                              |   1 +
 docs/simulator.rst                          |  65 +++++++++++
 mypy.ini                                    |   6 +
 requirements-simulator.txt                  |   2 +
 setup.py                                    |  34 +++---
 sql/updates/24.sql                          |  64 +++++++++++
 swh/scheduler/backend.py                    |  62 +++++++++++
 swh/scheduler/cli/__init__.py               |   2 +-
 swh/scheduler/cli/simulator.py              |  68 ++++++++++++
 swh/scheduler/interface.py                  |  31 ++++++
 swh/scheduler/model.py                      |  32 ++++++
 swh/scheduler/simulator/__init__.py         | 147 ++++++++++++++++++++++++
 swh/scheduler/simulator/common.py           | 102 +++++++++++++++++
 swh/scheduler/simulator/origin_scheduler.py |  68 ++++++++++++
 swh/scheduler/simulator/origins.py          | 128 +++++++++++++++++++++
 swh/scheduler/simulator/task_scheduler.py   |  76 +++++++++++++
 swh/scheduler/sql/30-schema.sql             |  24 +++-
 swh/scheduler/sql/40-func.sql               |  40 +++++++
 swh/scheduler/tests/test_api_client.py      |   2 +
 swh/scheduler/tests/test_scheduler.py       | 166 +++++++++++++++++++++++++++-
 swh/scheduler/tests/test_simulator.py       |  53 +++++++++
 22 files changed, 1154 insertions(+), 20 deletions(-)
 create mode 100644 docs/simulator.rst
 create mode 100644 requirements-simulator.txt
 create mode 100644 sql/updates/24.sql
 create mode 100644 swh/scheduler/cli/simulator.py
 create mode 100644 swh/scheduler/simulator/__init__.py
 create mode 100644 swh/scheduler/simulator/common.py
 create mode 100644 swh/scheduler/simulator/origin_scheduler.py
 create mode 100644 swh/scheduler/simulator/origins.py
 create mode 100644 swh/scheduler/simulator/task_scheduler.py
 create mode 100644 swh/scheduler/tests/test_simulator.py
Changes applied before test
commit 459a5b2a3bd9974e3b1f9d0eb8d79a5cc076c798
Merge: 0a32a31 2bf1109
Author: Jenkins user <jenkins@localhost>
Date:   Tue Jan 19 17:44:49 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 2bf1109c21ac77b7f03ad993f4cf0650390b929b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jan 19 14:23:32 2021 +0100

    Implement some basic aggregated metrics on listed origins
    
    Metrics are computed and cached database-side by the `update_metrics`
    function. The `get_metrics` function only retrieves the cached data.

commit e12a4f13386cdb25d366f5e2ee81044cb8e30169
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jan 19 18:36:53 2021 +0100

    simulator: stop using get_scheduler directly
    
    This reuses the scheduler instantiated by the cli instead of hardcoding
    our own using the PG* variables.

commit ed044415e625080cb4bc67b2656743d92ed4c884
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jan 19 16:32:27 2021 +0100

    simulator: Add documentation.

commit 186aebeb12905dc98cc370b360a8b3f5c4db3186
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jan 19 16:17:24 2021 +0100

    simulator: Make min_batch_size a parameter defined in the setup.

commit 6150c764616d3c25ee13eb08bea6b4c9d1c2bc0d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jan 18 13:51:35 2021 +0100

    simulator: add basic tests for fill_test_data and run

commit 5bee207dca74bb2c70611b3308c93bc522d48247
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 16:33:43 2021 +0100

    simulator: implement a simulator for the "old" task-based scheduler
    
    We extend the Task object with an autogenerated uuid allowing us to
    track the task lifetime between its creation and the generation of visit
    statuses, as the task-based scheduler does.

commit 6ec79c18b7b0e8be7b086aa79e87de81a8dbd06a
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 16:31:42 2021 +0100

    Move the simulator cli to the main cli module

commit b25874a7066c95460f7d24c132f32f4dabf055a7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 15:37:59 2021 +0100

    simulator: Replace attrs with dataclasses for consistency

commit cb0bc27be55cf384c68b834ae3c89dd93434fbba
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 15:31:41 2021 +0100

    simulator: wrap tasks and task events in typechecked objects
    
    This allows us to extend these objects without redefining a bunch of
    type annotations.

commit 947aecb14cdb4c6dd2da178f53599b4a41c8245b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 14:47:33 2021 +0100

    simulator: also fill data for the task-based scheduler

commit ff6cd0669e0d75afbd2c63424db66bf8d1e91bee
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 15 14:41:05 2021 +0100

    simulator: Split into smaller files in the same package

commit b232135cb982f4fc8e5fb6242a88012d732e252d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:50:00 2021 +0100

    simulator: Make the run time a CLI argument

commit 24e93d8aa72107bf953f884df4c9b15ea9cbeb2c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:40:16 2021 +0100

    simulator: tweak simulation environment constants

commit 9885d12cd708a26878cd9aa70ab590223589e8d7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:37:00 2021 +0100

    simulator: generate more origins in fill_data

commit a1d80fec0f5760d136857fb893232b1baec35b64
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:35:01 2021 +0100

    simulator: add typing for Environment.scheduler

commit 6a9ec5f38133fe232da1ca98ff30ef44b12a4c12
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 12:00:21 2021 +0100

    simulator: add support for a basic SimulationReport
    
    For now, this collects the runtime of tasks that have run, and gets
    printed at the end of the simulation.

commit 5d0e2aee4182df9476934349ad20da5dafc8b61f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 11:45:23 2021 +0100

    simulator: refine origin model to follow an exponential distribution
    
    This models origins using a consistent characteristic "time between
    commits" that follows an exponential distribution between 1 second and
    10 years.
    
    From this characteristic time, and feedback from the OriginVisitStats,
    we can generate the expected run time and output status of the next
    visit of that origin.

commit 7934c2f90191615db69b50dc27744ec73704f896
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jan 15 11:43:20 2021 +0100

    simulator: Remove some debug statements and lower log level

commit d0ed751eca9f2ff0464b795edb9e9bb2a0305649
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:17:11 2021 +0100

    simulator: simulate the scheduler journal client

commit c9b0728955e683748f8b03a22f91d501b64aad67
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:12:38 2021 +0100

    simulator: generate OriginVisitStatus objects in modeled visits
    
    To be able to generate uneventful visits, we would need to store
    the last snapshot seen for a given origin. Instead of storing this
    within the simulator, which would be a concern for large scale
    simulations, we use the scheduler visit cache directly.

commit 56c8d1dd66d8a993c8bc7c7bcc4e3fb3704f6864
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:09:58 2021 +0100

    simulator: Move scheduler into the simulation environment object
    
    The scheduler is used by a lot of the simulated actors, it makes sense
    to share it all the time.

commit 1659aa17fe0510030fb24d3b7867d2c4a366b5dd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 14 15:07:56 2021 +0100

    simulator: Use datetimes instead of a floating point simulated time

commit ef241dd84c400f9be0d92396867587d47216e385
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 13 16:13:01 2021 +0100

    Introduce scaffolding for a scheduler simulator
    
    This simulator will allow us to compare the behavior of the old and new
    schedulers, as well as to test the impact of scheduler policies and their
    parameters on the performance of the Software Heritage archival
    infrastructure as a whole.

commit 49a14792b0329049b51cbc6ed9c48006e9ff1a73
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jan 19 17:56:44 2021 +0100

    Import the journal subcommand in the main swh.scheduler cli
    
    This issue was masked by tox.ini using pytest with --doctest-modules,
    which imports all modules during test collection, and therefore executing
    the side-effects of swh.scheduler.cli.journal.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/172/ for more details.
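
The origin model commit in the log above describes a characteristic "time between commits" spread between 1 second and 10 years. The log does not show how that value is sampled. One possible reading, assumed here, is that the exponent is drawn uniformly, so that values are spread evenly across orders of magnitude (a log-uniform draw, which is not the same thing as a textbook exponential distribution). The function and constant names below are hypothetical and are not taken from the simulator code.

```python
import math
import random
from datetime import timedelta

# Bounds quoted in the commit message; 10 years is approximated as 3650 days.
MIN_INTERVAL = timedelta(seconds=1)
MAX_INTERVAL = timedelta(days=3650)


def sample_commit_interval(rng: random.Random) -> timedelta:
    """Draw a characteristic time between commits (assumed log-uniform)."""
    low = MIN_INTERVAL.total_seconds()
    high = MAX_INTERVAL.total_seconds()
    # Sample uniformly in log space, then map back to seconds.
    seconds = math.exp(rng.uniform(math.log(low), math.log(high)))
    # Clamp to guard against floating point rounding at the edges.
    seconds = min(max(seconds, low), high)
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    for _ in range(5):
        print(sample_commit_interval(rng))
```

Sampling in log space means an origin is about as likely to commit every few seconds as every few years, which is one way to cover both very active and nearly dormant origins in a single simulation.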

douardda added a subscriber: douardda.

Looks OK to me. However, I'd like to have a description of the implemented metrics in the commit message (and in the documentation, but that can come later).

sql/updates/24.sql, line 16:

I would not call this a cache, but meh

sql/updates/24.sql, line 22:

"never been successfully visited"

swh/scheduler/sql/30-schema.sql, line 200:

see comment on 24.sql

swh/scheduler/sql/30-schema.sql, line 206:

see comment on 24.sql

This revision is now accepted and ready to land. Jan 20 2021, 10:31 AM

Rebase between @douardda's changes and the simulator implementation; apply @douardda's comments.

sql/updates/24.sql, line 16:

It's a snapshot of said metrics, which we could compute on the fly but don't because that takes a long time. I'm not sure what else to call it.
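
The split discussed in this thread, where `update_metrics` does the expensive computation and `get_metrics` only reads back the stored result, can be sketched as follows. This is a minimal in-memory illustration, not the database-side implementation from the diff: the `MetricsSnapshotStore` class, its `compute` callback, and the shape of the stored data are all assumptions made for the example.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Optional, Tuple

Snapshot = Tuple[datetime, Dict[str, int]]


class MetricsSnapshotStore:
    """Hypothetical in-memory stand-in for the database-side storage."""

    def __init__(self, compute: Callable[[], Dict[str, int]]) -> None:
        self._compute = compute  # the slow aggregation (assumed callback)
        self._snapshot: Optional[Snapshot] = None

    def update_metrics(self, now: Optional[datetime] = None) -> Snapshot:
        """Run the expensive aggregation and store its result with a timestamp."""
        timestamp = now or datetime.now(timezone.utc)
        self._snapshot = (timestamp, dict(self._compute()))
        return self._snapshot

    def get_metrics(self) -> Optional[Snapshot]:
        """Return the last stored snapshot without recomputing anything."""
        return self._snapshot


if __name__ == "__main__":
    store = MetricsSnapshotStore(lambda: {"origins_known": 3})
    print(store.get_metrics())  # nothing stored before the first update
    store.update_metrics()
    print(store.get_metrics())
```

Keeping the timestamp next to the values lets readers of `get_metrics` judge how stale the numbers are, which is the main difference between a snapshot and a value computed on the fly.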

Build is green

Patch application report for D4880 (id=17378)

Rebasing onto 98526539a8...

Current branch diff-target is up to date.
Changes applied before test
commit 114ed952e513c7ad3dbb038a640e80bf079d0780
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jan 19 14:23:32 2021 +0100

    Implement some basic aggregated metrics on listed origins
    
    Metrics are computed and cached database-side by the `update_metrics`
    function. The `get_metrics` function only retrieves the cached data.
    
    The metrics are aggregated for each lister instance and visit type
    (allowing complete reaggregation by visit type for cross-cutting statistics).
    
    The following metrics have been implemented:
     - number of known origins overall
     - number of enabled origins (origins seen in the last listing)
     - number of enabled origins that have never been successfully visited
     - number of enabled origins with known activity since our last successful visit

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/178/ for more details.
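
The four metrics listed in the commit message above, grouped by lister instance and visit type, can be illustrated with a short sketch. The real computation is done in SQL on the database side; the Python below only mirrors the described logic. The `ListedOriginRow` and `Metrics` classes and their field names are hypothetical, and the rule that "known activity since our last successful visit" only counts origins that already have a successful visit is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ListedOriginRow:
    """Hypothetical, simplified view of one listed origin."""
    lister_id: str
    visit_type: str
    enabled: bool                               # seen in the last listing
    last_update: Optional[datetime]             # last known upstream activity
    last_successful_visit: Optional[datetime]   # None if never visited


@dataclass
class Metrics:
    origins_known: int = 0
    origins_enabled: int = 0
    origins_never_visited: int = 0
    origins_with_pending_changes: int = 0


def aggregate_metrics(
    rows: Iterable[ListedOriginRow],
) -> Dict[Tuple[str, str], Metrics]:
    """Aggregate the four metrics per (lister instance, visit type)."""
    result: Dict[Tuple[str, str], Metrics] = defaultdict(Metrics)
    for row in rows:
        metrics = result[(row.lister_id, row.visit_type)]
        metrics.origins_known += 1
        if not row.enabled:
            continue  # the remaining metrics only cover enabled origins
        metrics.origins_enabled += 1
        if row.last_successful_visit is None:
            metrics.origins_never_visited += 1
        elif (
            row.last_update is not None
            and row.last_update > row.last_successful_visit
        ):
            # Assumption: only origins with a prior successful visit count here.
            metrics.origins_with_pending_changes += 1
    return dict(result)


def reaggregate_by_visit_type(
    per_lister: Dict[Tuple[str, str], Metrics],
) -> Dict[str, Metrics]:
    """Sum the per-lister metrics into cross-cutting totals per visit type."""
    totals: Dict[str, Metrics] = defaultdict(Metrics)
    for (_, visit_type), m in per_lister.items():
        t = totals[visit_type]
        t.origins_known += m.origins_known
        t.origins_enabled += m.origins_enabled
        t.origins_never_visited += m.origins_never_visited
        t.origins_with_pending_changes += m.origins_with_pending_changes
    return dict(totals)
```

Because every metric is a plain count, the per-lister rows can be summed by visit type without going back to the origins, which is what makes the reaggregation mentioned in the commit message cheap.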