Factor out ListedOrigin generation to use the OriginModel, and add a
simple lister simulation process, generating some new origins over time.
Details
Diff Detail
- Repository
  rDSCH Scheduling utilities
- Lint
  No Linters Available
- Unit
  No Unit Test Coverage
- Build Status
  Buildable 18622
  Build 28803: Phabricator diff pipeline on jenkins
  Build 28802: arc lint + arc unit
Event Timeline
Build is green
Patch application report for D4909 (id=17488)
Could not rebase; Attempt merge onto 03460207a1...
Updating 0346020..72070b7
Fast-forward
 swh/scheduler/backend.py              | 46 +++++++++++++----
 swh/scheduler/interface.py            | 30 +++++++----
 swh/scheduler/model.py                | 33 +------------
 swh/scheduler/simulator/__init__.py   | 18 +++----
 swh/scheduler/simulator/origins.py    | 83 +++++++++++++++++++++++++++++--
 swh/scheduler/tests/test_scheduler.py | 93 ++++++++++++++++++++++++++++++-----
 swh/scheduler/tests/test_simulator.py |  6 +--
 7 files changed, 230 insertions(+), 79 deletions(-)
Changes applied before test
commit 72070b7bf628788b6872e90a3f8ac8f0c01b70d9
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:57:42 2021 +0100

    simulator: add simple lister simulation

commit 1f1aad459c4b0740ecbe96e9809e4b31f66bf999
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:54:53 2021 +0100

    Factor out ListedOrigin generation to use the OriginModel

    This generates consistent last_update values according to the model
    and simulated time.

commit b93aa5be2c2d5dc2130e1027698f3e1255052d8d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 13:01:53 2021 +0100

    Make PaginatedListedOriginList a concretization of PagedResult

    1. consistent with swh-storage and swh-indexer-storage
    2. we can use swh.core.api.classes.stream_results on
       scheduler.get_listed_origins.

commit 2f47936731cf438a5195978a2af3250597b693b5
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 20 17:29:16 2021 +0100

    Add scheduling policy for already visited origins with known last update

    This policy schedules origins by decreasing order of "visit lag"
    (that is, origins with the most lag are scheduled first).

commit acad712ad3f71f88f99e45e9b4f571ad751945dc
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jan 20 17:25:46 2021 +0100

    Add scheduling policy for never visited origins

    This policy orders never visited origins by increasing date of last
    update (scheduling the "oldest" never visited origins first).
See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/231/ for more details.
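The two scheduling policies named in the commit messages above can be sketched in plain Python. This is only an illustration of the orderings they describe, not the code from swh/scheduler/backend.py: the `OriginStats` record, its field names, and both function names are hypothetical, and "visit lag" is assumed here to mean `last_update - last_visited`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class OriginStats:
    """Hypothetical record for this sketch; not the swh.scheduler model."""

    url: str
    last_update: datetime                    # last upstream change reported by a lister
    last_visited: Optional[datetime] = None  # None means the origin was never visited


def never_visited_oldest_first(origins: List[OriginStats]) -> List[OriginStats]:
    """Never visited origins, by increasing date of last update."""
    candidates = [o for o in origins if o.last_visited is None]
    return sorted(candidates, key=lambda o: o.last_update)


def visited_most_lag_first(origins: List[OriginStats]) -> List[OriginStats]:
    """Already visited origins, by decreasing visit lag.

    Assumption: lag is the time between the last visit and the last
    known upstream update.
    """
    candidates = [o for o in origins if o.last_visited is not None]
    return sorted(
        candidates,
        key=lambda o: o.last_update - o.last_visited,
        reverse=True,
    )


def day(d: int) -> datetime:
    """Helper for the usage example: a UTC date in January 2021."""
    return datetime(2021, 1, d, tzinfo=timezone.utc)


example_origins = [
    OriginStats("a", last_update=day(10)),
    OriginStats("b", last_update=day(5)),
    OriginStats("c", last_update=day(20), last_visited=day(1)),
    OriginStats("d", last_update=day(8), last_visited=day(7)),
]
```

With `example_origins`, the first policy would pick "b" before "a", and the second would pick "c" (19 days of lag) before "d" (1 day of lag).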
swh/scheduler/simulator/origins.py, line 98: I don't understand why this "first commit at EPOCH" assumption is needed here.
swh/scheduler/simulator/origins.py, line 38: now is the time of the listing, while last_update is given by the API that would be used by a lister.
I'm really not sure I understand what the simulated model looks like in the end. Do I get it right that, including this diff:
- every origin "generates" revisions at a fixed (yet somewhat random for each origin) interval.
- every origin has its first commit at EPOCH
- the loading time is a constant factor of the number of commits (if so, is this constant the same for all origins, or is it more or less randomly generated per origin?)
- 100 new origins are created each hour (yet with the first commit at EPOCH)
- all origins generated by this lister_process are recorded as updated each hour (note that contrary to what the docstring says, it does not "update existing ones" in general, but only the existing ones that were created by this lister_process simulation task).
It looks to me that this model is pretty rough and I'd really like to get an idea whether this can be used to understand the actual behavior of a given scheduling policy...
Also it should be described in the simulator's doc. I want to be able to understand what this simulator does without having to read the code.
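For readers of this thread, the model described in the list above can be sketched as follows. This is a reconstruction from the review comments only, not the code in swh/scheduler/simulator/origins.py: `SketchOriginModel`, `run_lister`, the `EPOCH` value, the example.org URLs, and the range of commit intervals are all assumptions made for illustration, and loading time is not modelled.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple

# Assumption: placeholder value; the simulator defines its own EPOCH.
EPOCH = datetime(2015, 9, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class SketchOriginModel:
    """Hypothetical stand-in for the OriginModel discussed in this diff."""

    url: str
    seconds_between_commits: int  # fixed per origin, drawn at random on creation

    def commit_count(self, now: datetime) -> int:
        """Number of commits at `now`, the first one being at EPOCH."""
        elapsed = int((now - EPOCH).total_seconds())
        return elapsed // self.seconds_between_commits + 1

    def last_update(self, now: datetime) -> datetime:
        """Date of the latest commit at `now` (what a forge API would report)."""
        intervals = self.commit_count(now) - 1
        return EPOCH + timedelta(seconds=intervals * self.seconds_between_commits)


def run_lister(
    start: datetime, hours: int, new_per_hour: int = 100, seed: int = 0
) -> Tuple[List[SketchOriginModel], Dict[str, datetime]]:
    """Simulate hourly listings: create new origins, then re-record the known ones."""
    rng = random.Random(seed)
    known: List[SketchOriginModel] = []  # only origins created by this process
    listed: Dict[str, datetime] = {}     # url -> last_update seen at listing time
    for hour in range(hours):
        now = start + timedelta(hours=hour)  # `now` is the time of the listing
        for _ in range(new_per_hour):
            known.append(
                SketchOriginModel(
                    url=f"https://example.org/origin/{len(known)}",
                    # Assumed range: one commit every hour to every 30 days.
                    seconds_between_commits=rng.randint(3600, 30 * 86400),
                )
            )
        for origin in known:
            listed[origin.url] = origin.last_update(now)
    return known, listed
```

In this sketch the `known` list only ever grows, by `new_per_hour` entries per simulated hour, and origins recorded by anything other than `run_lister` are never refreshed.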
Yes to all this.
- 100 new origins are created each hour (yet with the first commit at EPOCH)
Yes, but that's not inconsistent as we can discover origins that we didn't know about.
- all origins generated by this lister_process are recorded as updated each hour (note that contrary to what the docstring says, it does not "update existing ones" in general, but only the existing ones that were created by this lister_process simulation task).
Indeed
It looks to me that this model is pretty rough
It is.
and I'd really like to get an idea whether this can be used to understand the actual behavior of a given scheduling policy...
Also it should be described in the simulator's doc. I want to be able to understand what this simulator does without having to read the code.
It's a WIP, we're likely to change it in the short term.
Sure, but that's a (possibly serious) bias. Just because it can happen does not mean it always happens!
and I'd really like to get an idea whether this can be used to understand the actual behavior of a given scheduling policy...
Also it should be described in the simulator's doc. I want to be able to understand what this simulator does without having to read the code.
It's a WIP, we're likely to change it in the short term.
Sure, but having this simulation model description/documentation would also make code review much easier (i.e. not having to "reverse engineer" the simulation model).
We're not claiming this is a realistic model. We only tried to do something that isn't completely naive, and exercises simple edge cases. Making it realistic is hard, and will probably be most of @olasd's work this week.
Sure, but having this simulation model description/documentation would also make code review much easier (i.e. not having to "reverse engineer" the simulation model).
done
Build is green
Patch application report for D4909 (id=17567)
Rebasing onto 2906b4e8a0...
Current branch diff-target is up to date.
Changes applied before test
commit e5709214b4917a5fe3634d040da7a061f5978f66
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:57:42 2021 +0100

    simulator: add simple lister simulation

commit 7af98e2bc048c6946679e7d95cf8620e4a0ee4bf
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:54:53 2021 +0100

    Factor out ListedOrigin generation to use the OriginModel

    This generates consistent last_update values according to the model
    and simulated time.
See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/271/ for more details.
Isn't there some inherent limitation with this lister_process (gradually eating RAM) that should be documented (maybe)?
Build is green
Patch application report for D4909 (id=17607)
Rebasing onto 2906b4e8a0...
Current branch diff-target is up to date.
Changes applied before test
commit ea068b46a89e07c60ad1233afd36afc6bb29031e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:57:42 2021 +0100

    simulator: add simple lister simulation

commit 7af98e2bc048c6946679e7d95cf8620e4a0ee4bf
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 21 14:54:53 2021 +0100

    Factor out ListedOrigin generation to use the OriginModel

    This generates consistent last_update values according to the model
    and simulated time.
See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/280/ for more details.