
simulator: add lister simulation
Closed, Public

Authored by vlorentz on Jan 21 2021, 6:06 PM.

Details

Summary

Factor out ListedOrigin generation to use the OriginModel, and add a
simple lister simulation process, generating some new origins over time.

Event Timeline

Build is green

Patch application report for D4909 (id=17488)

Could not rebase; Attempt merge onto 03460207a1...

Updating 0346020..72070b7
Fast-forward
 swh/scheduler/backend.py              | 46 +++++++++++++----
 swh/scheduler/interface.py            | 30 +++++++----
 swh/scheduler/model.py                | 33 +------------
 swh/scheduler/simulator/__init__.py   | 18 +++----
 swh/scheduler/simulator/origins.py    | 83 +++++++++++++++++++++++++++++--
 swh/scheduler/tests/test_scheduler.py | 93 ++++++++++++++++++++++++++++++-----
 swh/scheduler/tests/test_simulator.py |  6 +--
 7 files changed, 230 insertions(+), 79 deletions(-)
Changes applied before test
commit 72070b7bf628788b6872e90a3f8ac8f0c01b70d9
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:57:42 2021 +0100

    simulator: add simple lister simulation

commit 1f1aad459c4b0740ecbe96e9809e4b31f66bf999
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:54:53 2021 +0100

    Factor out ListedOrigin generation to use the OriginModel
    
    This generates consistent last_update values according to the model and
    simulated time.

commit b93aa5be2c2d5dc2130e1027698f3e1255052d8d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 13:01:53 2021 +0100

    Make PaginatedListedOriginList a concretization of PagedResult
    
    1. consistent with swh-storage and swh-indexer-storage
    2. we can use swh.core.api.classes.stream_results on scheduler.get_listed_origins.

commit 2f47936731cf438a5195978a2af3250597b693b5
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 20 17:29:16 2021 +0100

    Add scheduling policy for already visited origins with known last update
    
    This policy schedules origins by decreasing order of "visit lag" (that
    is, origins with the most lag are scheduled first).

commit acad712ad3f71f88f99e45e9b4f571ad751945dc
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 20 17:25:46 2021 +0100

    Add scheduling policy for never visited origins
    
    This policy orders never visited origins by increasing date of last
    update (scheduling the "oldest" never visited origins first).

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/231/ for more details.
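For readers following the two scheduling-policy commits listed in this report, here is a minimal sketch of the orderings they describe. It is not the scheduler's actual implementation: `OriginStats` and both function names are hypothetical, and "visit lag" is assumed here to mean `last_update - last_visit`, which the commit messages do not spell out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class OriginStats:
    """Hypothetical record; not the scheduler's real data model."""

    url: str
    last_update: datetime
    last_visit: Optional[datetime] = None  # None means "never visited"


def never_visited_oldest_first(origins: List[OriginStats]) -> List[OriginStats]:
    """Never visited origins, by increasing date of last update."""
    never_visited = [o for o in origins if o.last_visit is None]
    return sorted(never_visited, key=lambda o: o.last_update)


def visited_most_lag_first(origins: List[OriginStats]) -> List[OriginStats]:
    """Already visited origins, by decreasing visit lag.

    Assumption: visit lag = last_update - last_visit.
    """
    visited = [o for o in origins if o.last_visit is not None]
    return sorted(
        visited, key=lambda o: o.last_update - o.last_visit, reverse=True
    )


def day(n: int) -> datetime:
    """Helper for the example data: the n-th day of January 2021 (UTC)."""
    return datetime(2021, 1, n, tzinfo=timezone.utc)


origins = [
    OriginStats("a", last_update=day(10)),
    OriginStats("b", last_update=day(5)),
    OriginStats("c", last_update=day(20), last_visit=day(1)),
    OriginStats("d", last_update=day(20), last_visit=day(15)),
]
print([o.url for o in never_visited_oldest_first(origins)])
print([o.url for o in visited_most_lag_first(origins)])
```

With this sample data, "b" comes before "a" in the first ordering (older last update), and "c" comes before "d" in the second (19 days of lag against 5).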

douardda added inline comments.
swh/scheduler/simulator/origins.py
44

why not call it last_update then, instead of now?

147

Does this update all existing origins? Shouldn't it be a fixed number (or a percentage) of existing origins?

swh/scheduler/simulator/origins.py
104

I don't understand why this "first commit at EPOCH" assumption is needed here

swh/scheduler/simulator/origins.py
104

The origin model is that there are commits every x seconds, so there has to be a first commit at some time t0 if we want to know the date of each commit.

We just picked EPOCH as t0 because it's easy.
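To make the "commits every x seconds since t0" reasoning concrete, here is a rough sketch of how the commit count and last_update could be derived from the simulated time. The function names and the `seconds_between_commits` parameter are made up for this example, and EPOCH is assumed to be the Unix epoch in UTC; the real OriginModel may differ.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EPOCH is the Unix epoch in UTC; the thread only says "EPOCH".
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def commit_count(now: datetime, seconds_between_commits: int) -> int:
    """Number of commits made up to `now`, counting the first one at EPOCH."""
    elapsed = int((now - EPOCH).total_seconds())
    return elapsed // seconds_between_commits + 1


def last_update(now: datetime, seconds_between_commits: int) -> datetime:
    """Date of the most recent commit at or before `now`."""
    elapsed = int((now - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=elapsed - elapsed % seconds_between_commits)


# An origin committing every 100 seconds, listed 250 seconds after EPOCH:
# commits happened at t0, t0 + 100s and t0 + 200s.
now = EPOCH + timedelta(seconds=250)
print(commit_count(now, 100))
print(last_update(now, 100) - EPOCH)
```

This also shows why the listing time (now) and last_update are distinct values: last_update is the date of the latest commit at or before now, so it only equals now when a commit lands exactly at listing time.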

147

yes. that can be tweaked later, though.

swh/scheduler/simulator/origins.py
44

now is the time of the listing, while last_update is given by the API that would be used by a lister.

I'm really not sure I understand what the simulated model looks like in the end. Do I get it right that, including this diff:

  • every origin "generates" revisions at a fixed (yet somewhat random for each origin) interval.
  • every origin has its first commit at EPOCH
  • the loading time is a constant factor of the number of commits (if so, is this constant the same for all origins, or is it more or less randomly generated per origin?)
  • 100 new origins are created each hour (yet with the first commit at EPOCH)
  • all origins generated by this lister_process are recorded as updated each hour (note that, contrary to what the docstring says, it does not update all existing origins, only the ones created by this lister_process simulation task).

It looks to me that this model is pretty rough and I'd really like to get an idea whether this can be used to understand the actual behavior of a given scheduling policy...
Also it should be described in the simulator's doc. I want to be able to understand what this simulator does without having to read the code.

I'm really not sure I understand what the simulated model looks like in the end. Do I get it right that, including this diff:

  • every origin "generates" revisions at a fixed (yet somewhat random for each origin) interval.
  • every origin has its first commit at EPOCH
  • the loading time is a constant factor of the number of commits (if so, is this constant the same for all origins, or is it more or less randomly generated per origin?)

Yes to all this.

  • 100 new origins are created each hour (yet with the first commit at EPOCH)

Yes, but that's not inconsistent as we can discover origins that we didn't know about.

  • all origins generated by this lister_process are recorded as updated each hour (note that, contrary to what the docstring says, it does not update all existing origins, only the ones created by this lister_process simulation task).

Indeed

It looks to me that this model is pretty rough

It is.

and I'd really like to get an idea whether this can be used to understand the actual behavior of a given scheduling policy...
Also it should be described in the simulator's doc. I want to be able to understand what this simulator does without having to read the code.

It's a WIP, we're likely to change it in the short term.
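For reference, the lister behaviour confirmed in this exchange can be sketched in plain Python as below. This is only an illustration under assumptions: `SimulatedOrigin`, `lister_rounds`, the example URLs and the deterministic commit intervals are all invented for the sketch (the thread says intervals are somewhat random per origin), and the real simulator's process may be structured quite differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterator, List, Tuple

# Assumption: EPOCH is the Unix epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass
class SimulatedOrigin:
    """Hypothetical stand-in for an origin following the model above."""

    url: str
    seconds_between_commits: int

    def last_update(self, now: datetime) -> datetime:
        """Date of the latest commit at or before `now` (first one at EPOCH)."""
        elapsed = int((now - EPOCH).total_seconds())
        return EPOCH + timedelta(
            seconds=elapsed - elapsed % self.seconds_between_commits
        )


def lister_rounds(
    start: datetime,
    rounds: int,
    new_per_round: int = 100,
    period: timedelta = timedelta(hours=1),
) -> Iterator[Tuple[datetime, List[Tuple[str, datetime]]]]:
    """Yield (now, listing) once per period.

    Each round creates `new_per_round` origins, then records every origin
    created so far by this lister with its last_update at listing time.
    """
    origins: List[SimulatedOrigin] = []  # never pruned: grows every round
    now = start
    for _ in range(rounds):
        first = len(origins)
        origins.extend(
            SimulatedOrigin(
                url=f"https://example.org/origin/{i}",  # made-up URLs
                # Deterministic intervals to keep the sketch reproducible.
                seconds_between_commits=3600 * (1 + i % 24),
            )
            for i in range(first, first + new_per_round)
        )
        yield now, [(o.url, o.last_update(now)) for o in origins]
        now += period


start = datetime(2021, 1, 21, tzinfo=timezone.utc)
for now, listing in lister_rounds(start, rounds=3):
    print(now.isoformat(), len(listing))
```

Each round lists 100 more origins than the previous one, because the process only knows about, and re-lists, the origins it created itself.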

Yes, but that's not inconsistent as we can discover origins that we didn't know about.

Sure, but that's a (possibly serious) bias. Just because it can happen does not mean it always happens!

and I'd really like to get an idea whether this can be used to understand the actual behavior of a given scheduling policy...
Also it should be described in the simulator's doc. I want to be able to understand what this simulator does without having to read the code.

It's a WIP, we're likely to change it in the short term.

Sure, but having this simulation model description/documentation would also make code review much easier (i.e. not having to "reverse engineer" the simulation model).

Yes, but that's not inconsistent as we can discover origins that we didn't know about.

Sure, but that's a (possibly serious) bias. Just because it can happen does not mean it always happens!

We're not claiming this is a realistic model. We only tried to do something that isn't completely naive, and exercises simple edge cases. Making it realistic is hard, and will probably be most of @olasd's work this week.

Sure, but having this simulation model description/documentation would also make code review much easier (i.e. not having to "reverse engineer" the simulation model).

done

Build is green

Patch application report for D4909 (id=17567)

Rebasing onto 2906b4e8a0...

Current branch diff-target is up to date.
Changes applied before test
commit e5709214b4917a5fe3634d040da7a061f5978f66
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:57:42 2021 +0100

    simulator: add simple lister simulation

commit 7af98e2bc048c6946679e7d95cf8620e4a0ee4bf
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:54:53 2021 +0100

    Factor out ListedOrigin generation to use the OriginModel
    
    This generates consistent last_update values according to the model and
    simulated time.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/271/ for more details.

We're not claiming this is a realistic model. We only tried to do something that isn't completely naive, and exercises simple edge cases. Making it realistic is hard, and will probably be most of @olasd's work this week.

Which is perfectly fine to me, just make it clear and documented :-)

This revision is now accepted and ready to land. Jan 26 2021, 9:27 AM

Note that I still think there should be something in docs/simulator.rst also...

Isn't there some inherent limitation with this lister_process (gradually eating RAM) that should be documented (maybe)?

add doc on the origin model

Build is green

Patch application report for D4909 (id=17607)

Rebasing onto 2906b4e8a0...

Current branch diff-target is up to date.
Changes applied before test
commit ea068b46a89e07c60ad1233afd36afc6bb29031e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:57:42 2021 +0100

    simulator: add simple lister simulation

commit 7af98e2bc048c6946679e7d95cf8620e4a0ee4bf
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:54:53 2021 +0100

    Factor out ListedOrigin generation to use the OriginModel
    
    This generates consistent last_update values according to the model and
    simulated time.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/280/ for more details.