simulator: add lister simulation
Closed, Public

Authored by vlorentz on Jan 21 2021, 6:06 PM.

Details

Reviewers

olasd
douardda

Group Reviewers

Reviewers

Commits

rDSCHea068b46a89e: simulator: add simple lister simulation
rDSCH7af98e2bc048: Factor out ListedOrigin generation to use the OriginModel

Summary

Factor out ListedOrigin generation to use the OriginModel, and add a
simple lister simulation process, generating some new origins over time.

Diff Detail

Repository

rDSCH Scheduling utilities

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision. Jan 21 2021, 6:06 PM

Build is green

Patch application report for D4909 (id=17488)

Could not rebase; Attempt merge onto 03460207a1...

Updating 0346020..72070b7
Fast-forward
 swh/scheduler/backend.py              | 46 +++++++++++++----
 swh/scheduler/interface.py            | 30 +++++++----
 swh/scheduler/model.py                | 33 +------------
 swh/scheduler/simulator/__init__.py   | 18 +++----
 swh/scheduler/simulator/origins.py    | 83 +++++++++++++++++++++++++++++--
 swh/scheduler/tests/test_scheduler.py | 93 ++++++++++++++++++++++++++++++-----
 swh/scheduler/tests/test_simulator.py |  6 +--
 7 files changed, 230 insertions(+), 79 deletions(-)

Changes applied before test

commit 72070b7bf628788b6872e90a3f8ac8f0c01b70d9
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:57:42 2021 +0100

    simulator: add simple lister simulation

commit 1f1aad459c4b0740ecbe96e9809e4b31f66bf999
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:54:53 2021 +0100

    Factor out ListedOrigin generation to use the OriginModel
    
    This generates consistent last_update values according to the model and
    simulated time.

commit b93aa5be2c2d5dc2130e1027698f3e1255052d8d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 13:01:53 2021 +0100

    Make PaginatedListedOriginList a concretization of PagedResult
    
    1. consistent with swh-storage and swh-indexer-storage
    2. we can use swh.core.api.classes.stream_results on scheduler.get_listed_origins.

commit 2f47936731cf438a5195978a2af3250597b693b5
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 20 17:29:16 2021 +0100

    Add scheduling policy for already visited origins with known last update
    
    This policy schedules origins by decreasing order of "visit lag" (that
    is, origins with the most lag are scheduled first).

commit acad712ad3f71f88f99e45e9b4f571ad751945dc
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 20 17:25:46 2021 +0100

    Add scheduling policy for never visited origins
    
    This policy orders never visited origins by increasing date of last
    update (scheduling the "oldest" never visited origins first).

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/231/ for more details.
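
The two scheduling policies named in the commits above amount to two sort orders. Here is a minimal sketch of them, not the scheduler's actual implementation: the `OriginStats` record and both function names are hypothetical, and "visit lag" is assumed here to mean how far the last known update is ahead of the last visit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class OriginStats:
    """Hypothetical record; not the scheduler's real schema."""

    url: str
    last_update: datetime
    last_visit: Optional[datetime] = None


def never_visited_oldest_first(origins: List[OriginStats]) -> List[OriginStats]:
    # Keep only origins that were never visited, oldest last_update first.
    candidates = [o for o in origins if o.last_visit is None]
    return sorted(candidates, key=lambda o: o.last_update)


def visited_most_lag_first(origins: List[OriginStats]) -> List[OriginStats]:
    # Keep only visited origins; lag is assumed to be last_update - last_visit.
    candidates = [o for o in origins if o.last_visit is not None]
    return sorted(
        candidates, key=lambda o: o.last_update - o.last_visit, reverse=True
    )
```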

Harbormaster completed remote builds in B18622: Diff 17488. Jan 21 2021, 6:08 PM

vlorentz requested review of this revision. Jan 21 2021, 6:08 PM

douardda added a subscriber: douardda. Jan 22 2021, 11:05 AM

douardda added inline comments.

swh/scheduler/simulator/origins.py
37	why not call it `last_update` then, instead of `now`?
140	Does this update all existing origins? Shouldn't it be a fixed number (or a percentage) of existing origins?

douardda added inline comments. Jan 22 2021, 11:06 AM

swh/scheduler/simulator/origins.py
97	I don't understand why this "first commit at EPOCH" assumption is needed here

vlorentz added inline comments. Jan 25 2021, 10:22 AM

swh/scheduler/simulator/origins.py
97	The origin model is that there are commits every `x` seconds, so there has to be a first commit at some time `t0` if we want to know the date of each commit. We just picked EPOCH as `t0` because it's easy.
140	yes. that can be tweaked later, though.

vlorentz added inline comments. Jan 25 2021, 10:35 AM

swh/scheduler/simulator/origins.py
37	`now` is the time of the listing, while `last_update` is given by the API that would be used by a lister.
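
The origin model described in these replies can be sketched as follows. This is only an illustration under stated assumptions, not the code under review: the class name `SimpleOriginModel`, its `commit_interval` field, and both method names are made up for this sketch. It assumes a first commit at EPOCH and one commit every `commit_interval` seconds, and derives the `last_update` value a lister would report when it runs at time `now`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# t0 of the model: the first commit of every origin is assumed to happen here.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class SimpleOriginModel:
    """Hypothetical stand-in for the origin model discussed in the review."""

    commit_interval: int  # seconds between two consecutive commits

    def commit_count(self, now: datetime) -> int:
        # Number of commits made up to `now`, counting the one at EPOCH.
        elapsed = int((now - EPOCH).total_seconds())
        return elapsed // self.commit_interval + 1

    def last_update(self, now: datetime) -> datetime:
        # Date of the most recent commit at listing time `now`.
        intervals = self.commit_count(now) - 1
        return EPOCH + timedelta(seconds=intervals * self.commit_interval)
```

In this sketch `now` is the listing time and `last_update` is derived from it, which mirrors the distinction made in the reply about line 37.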

I'm really not sure I understand what the simulated model looks like in the end. Do I get it right that, including this diff:

every origin "generates" revisions at a fixed (yet somewhat random for each origin) interval.
every origin has its first commit at EPOCH
the loading time is a constant factor of the number of commits (if so, is this constant the same for all origins, or is it more or less randomly generated per origin?)
100 new origins are created each hour (yet with the first commit at EPOCH)
all origins generated by this lister_process are recorded as updated each hour (note that, contrary to what the docstring says, it does not "update existing ones", but only those that exist AND were created by this lister_process simulation task).

It looks to me that this model is pretty rough and I'd really like to get an idea whether this can be used to understand the actual behavior of a given scheduling policy...
Also it should be described in the simulator's doc. I want to be able to understand what this simulator does without having to read the code.

In D4909#123796, @douardda wrote:

I'm really not sure I understand what the simulated model looks like in the end. Do I get it right that, including this diff:

every origin "generates" revisions at a fixed (yet somewhat random for each origin) interval.

every origin has its first commit at EPOCH

the loading time is a constant factor of the number of commits (if so, is this constant the same for all origins, or is it more or less randomly generated per origin?)

Yes to all this.

100 new origins are created each hour (yet with the first commit at EPOCH)

Yes, but that's not inconsistent as we can discover origins that we didn't know about.

all origins generated by this lister_process are recorded as updated each hour (note that, contrary to what the docstring says, it does not "update existing ones", but only those that exist AND were created by this lister_process simulation task).

Indeed

It looks to me that this model is pretty rough

It is.

and I'd really like to get an idea whether this can be used to understand the actual behavior of a given scheduling policy...
Also it should be described in the simulator's doc. I want to be able to understand what this simulator does without having to read the code.

It's a WIP, we're likely to change it in the short term.
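
The lister behaviour confirmed in this exchange can be sketched as a plain loop. This is an illustration only: `simulate_lister` and the example URL scheme are hypothetical, and the real simulator's process and its scheduler calls are not reproduced here. Each simulated hour, the sketch creates 100 new origins and reports every origin it has created so far as listed again.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, List, Tuple


def simulate_lister(
    start: datetime, hours: int, new_per_hour: int = 100
) -> Iterator[Tuple[datetime, List[str]]]:
    """Yield (listing time, origin URLs recorded at that time), once per hour."""
    known: List[str] = []  # only origins created by this process; never shrinks
    now = start
    for _ in range(hours):
        first = len(known)
        # Discover `new_per_hour` new origins (hypothetical URL scheme).
        known.extend(
            f"https://example.org/origin/{i}"
            for i in range(first, first + new_per_hour)
        )
        # Every origin created so far is recorded as listed again.
        yield now, list(known)
        now += timedelta(hours=1)
```

Note that `known` grows by `new_per_hour` entries on every iteration and is never pruned, so memory use in this sketch increases linearly with simulated time.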

In D4909#123805, @vlorentz wrote:

Yes, but that's not inconsistent as we can discover origins that we didn't know about.

Sure, but that's a (possibly serious) bias. Just because it can happen does not mean it always happens!

and I'd really like to get an idea whether this can be used to understand the actual behavior of a given scheduling policy...
Also it should be described in the simulator's doc. I want to be able to understand what this simulator does without having to read the code.

It's a WIP, we're likely to change it in the short term.

Sure, but having this simulation model description/documentation would also make code review much easier (i.e. not having to "reverse engineer" the simulation model).

rebase + apply comments

In D4909#123806, @douardda wrote:

In D4909#123805, @vlorentz wrote:

Yes, but that's not inconsistent as we can discover origins that we didn't know about.

Sure, but that's a (possibly serious) bias. Just because it can happen does not mean it always happens!

We're not claiming this is a realistic model. We only tried to do something that isn't completely naive, and exercises simple edge cases. Making it realistic is hard, and will probably be most of @olasd's work this week.

Sure, but having this simulation model description/documentation would also make code review much easier (i.e. not having to "reverse engineer" the simulation model).

done

Build is green

Patch application report for D4909 (id=17567)

Rebasing onto 2906b4e8a0...

Current branch diff-target is up to date.

Changes applied before test

commit e5709214b4917a5fe3634d040da7a061f5978f66
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:57:42 2021 +0100

    simulator: add simple lister simulation

commit 7af98e2bc048c6946679e7d95cf8620e4a0ee4bf
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:54:53 2021 +0100

    Factor out ListedOrigin generation to use the OriginModel
    
    This generates consistent last_update values according to the model and
    simulated time.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/271/ for more details.

Harbormaster completed remote builds in B18702: Diff 17567. Jan 25 2021, 2:42 PM

In D4909#123949, @vlorentz wrote:

We're not claiming this is a realistic model. We only tried to do something that isn't completely naive, and exercises simple edge cases. Making it realistic is hard, and will probably be most of @olasd's work this week.

Which is perfectly fine to me, just make it clear and documented :-)

This revision is now accepted and ready to land. Jan 26 2021, 9:27 AM

Note that I still think there should be something in docs/simulator.rst also...

Isn't there some inherent limitation with this lister_process (gradually eating RAM) that should be documented (maybe)?

add doc on the origin model

Build is green

Patch application report for D4909 (id=17607)

Rebasing onto 2906b4e8a0...

Current branch diff-target is up to date.

Changes applied before test

commit ea068b46a89e07c60ad1233afd36afc6bb29031e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:57:42 2021 +0100

    simulator: add simple lister simulation

commit 7af98e2bc048c6946679e7d95cf8620e4a0ee4bf
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:54:53 2021 +0100

    Factor out ListedOrigin generation to use the OriginModel
    
    This generates consistent last_update values according to the model and
    simulated time.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/280/ for more details.

Harbormaster completed remote builds in B18745: Diff 17607. Jan 26 2021, 1:25 PM

Closed by commit rDSCH7af98e2bc048: Factor out ListedOrigin generation to use the OriginModel (authored by vlorentz). Jan 29 2021, 10:00 AM

This revision was automatically updated to reflect the committed changes.

vlorentz added a commit: rDSCH7af98e2bc048: Factor out ListedOrigin generation to use the OriginModel.

vlorentz added a commit: rDSCHea068b46a89e: simulator: add simple lister simulation.