
Make scheduling policy used in schedule_recurrent configurable
ClosedPublic

Authored by douardda on Apr 15 2022, 6:19 PM.

Details

Summary

Add support for a configuration option "scheduling_policy" in the config
file loaded by the 'swh scheduler schedule-recurrent' command. This
config entry specifies the scheduling policies used by the
schedule-recurrent tool, so they are no longer hardcoded in the source
code.

A visit type policy config entry should have at least a 'weight' value
for each policy.

Default values are unchanged.

E.g.:

scheduling_policy:
  git:
    - policy: already_visited_order_by_lag
      weight: 55
      tablesample: 0.5
    - policy: never_visited_oldest_update_first
      weight: 45
      tablesample: 0.5

Note: configuration entries need not exist for all visit types, but if
a visit type policy is configured, its config entry must be complete
(in other words, the configuration is merged with the default values
at the first config level only).
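
The note above can be illustrated with a minimal Python sketch of a first-level merge. The names DEFAULT_POLICY and merge_policy_config, and the default weights, are hypothetical and only serve the illustration; the actual defaults and merge code live in the scheduler source and are not reproduced here.

```python
from typing import Any, Dict, List

PolicyConfig = Dict[str, List[Dict[str, Any]]]

# Hypothetical defaults, for illustration only (not the real default values).
DEFAULT_POLICY: PolicyConfig = {
    "default": [
        {"policy": "already_visited_order_by_lag", "weight": 50},
        {"policy": "never_visited_oldest_update_first", "weight": 50},
    ],
}


def merge_policy_config(defaults: PolicyConfig, configured: PolicyConfig) -> PolicyConfig:
    """Merge at the first level only.

    A visit type present in `configured` replaces its default list wholesale;
    individual list entries are never merged with the defaults.
    """
    return {**defaults, **configured}


configured: PolicyConfig = {
    "git": [
        {"policy": "already_visited_order_by_lag", "weight": 55, "tablesample": 0.5},
        {"policy": "never_visited_oldest_update_first", "weight": 45, "tablesample": 0.5},
    ],
}

merged = merge_policy_config(DEFAULT_POLICY, configured)
```

With this kind of merge, a visit type that is configured with a single policy entry ends up with that single entry only, which is why a configured entry has to be complete.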

Diff Detail

Repository
rDSCH Scheduling utilities
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D7591 (id=27488)

Rebasing onto 5302efdafd...

Current branch diff-target is up to date.
Changes applied before test
commit e68135cb3dd2f3d87b4b1f7ffa3fe41ee91c71cb
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Apr 15 18:08:49 2022 +0200

    Make scheduling policy used in schedule_recurrent configurable
    
    Add support for a configuration option "scheduling_policy" in the config
    file loaded by the 'swh scheduler schedule-recurrent' command. This
    config entry allows to specify the scheduling policies used by the
    schedule-recurrent tool, instead of having them hardcoded in the source
    code.
    
    A visit type policy config entry should have at least a 'weight' value
    for each policy.
    
    Default values are unchanged.
    
    Eg.:
    
      scheduling_policy:
        git:
          already_visited_order_by_lag:
            weight: 55
            tablesample: 0.5
          never_visited_oldest_update_first:
            weight: 45
            tablesample: 0.5
    
    Note: there may not be configuration entries for all visit types, but if
          a visit type policy is configured, the config entry should be complete
          (in other words, the merging of the configuration with the default
          values is only done at first config level).

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/526/ for more details.

Sounds good! I would suggest making the policies for a given visit type a list:

scheduling_policy:
  git:
    - weight: 55
      policy: already_visited_order_by_lag
      tablesample: 0.5
    - weight: 45
      policy: never_visited_oldest_update_first
      tablesample: 0.5

Each list entry would have two mandatory keys (policy and weight).

This makes the structure a bit flatter and allows us to repeat the same policy with different "other" parameters, if we so choose.

Use a flatter config structure

douardda retitled this revision from "[wip] Make scheduling policy used in schedule_recurrent configurable" to "Make scheduling policy used in schedule_recurrent configurable". Apr 20 2022, 9:33 AM
douardda edited the summary of this revision.
douardda edited the summary of this revision.
douardda edited the summary of this revision.

Build is green

Patch application report for D7591 (id=27530)

Rebasing onto 5302efdafd...

Current branch diff-target is up to date.
Changes applied before test
commit 98ee5f91d5d5248c2ccebe54cc1dfc7dd185db88
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Apr 15 18:08:49 2022 +0200

    Make scheduling policy used in schedule_recurrent configurable
    
    Add support for a configuration option "scheduling_policy" in the config
    file loaded by the 'swh scheduler schedule-recurrent' command. This
    config entry allows to specify the scheduling policies used by the
    schedule-recurrent tool, instead of having them hardcoded in the source
    code.
    
    A visit type policy config entry should have at least a 'weight' value
    for each policy.
    
    Default values are unchanged.
    
    Eg.:
    
      scheduling_policy:
        git:
          - policy: already_visited_order_by_lag
            weight: 55
            tablesample: 0.5
          - policy: never_visited_oldest_update_first
            weight: 45
            tablesample: 0.5
    
    Note: there may not be configuration entries for all visit types, but if
          a visit type policy is configured, the config entry should be complete
          (in other words, the merging of the configuration with the default
          values is only done at first config level).

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/527/ for more details.

This revision is now accepted and ready to land. Apr 20 2022, 12:05 PM

Looks good, except for a small issue in the logic with repeated policies (which probably warrants adding a test for this use case too).

swh/scheduler/celery_backend/recurrent_visits.py
83–86

I think that policy_cfg is a List[Dict[str, Any]], not a Dict.

106

If the policy gets repeated with different arguments, the computation will break as the last entry in the list will win.

I think we want to keep lists throughout this function instead of dicts now (and use zip in the final for loop for the ratio).
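
To make the point concrete, here is a minimal sketch of the list-based approach. It is not the code of grab_next_visits_policy_weights(); the helper name policy_ratios and the sample entries (including the repeated policy with a different tablesample) are assumptions made for the example.

```python
from typing import Any, Dict, List


def policy_ratios(policy_cfg: List[Dict[str, Any]]) -> List[float]:
    """Return each entry's share of the total weight, in list order."""
    weights = [entry["weight"] for entry in policy_cfg]
    total = sum(weights)
    if total <= 0:
        raise ValueError("the sum of policy weights must be positive")
    return [weight / total for weight in weights]


# Hypothetical config where the same policy appears twice with different arguments.
policy_cfg = [
    {"policy": "already_visited_order_by_lag", "weight": 50, "tablesample": 0.5},
    {"policy": "already_visited_order_by_lag", "weight": 30, "tablesample": 0.1},
    {"policy": "never_visited_oldest_update_first", "weight": 20},
]

# Keyed by policy name, the first repeated entry is silently overwritten.
by_name = {entry["policy"]: entry["weight"] for entry in policy_cfg}

# Kept as parallel lists and paired with zip, every entry keeps its own ratio.
for entry, ratio in zip(policy_cfg, policy_ratios(policy_cfg)):
    print(f"{entry['policy']}: {ratio:.2f}")
```

The dict built from the same entries has only two keys, so one of the three weights is lost, whereas the zip over the two lists visits all three entries.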

175

Same comment here, that's a List.

swh/scheduler/celery_backend/recurrent_visits.py
279

I think this warrants a ValueError with an explicit error message ;-)
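
As an illustration of that suggestion, a validation step could look like the sketch below. The code at the commented line is not shown on this page, so this does not reproduce it; the function name validate_policy_cfg and the wording of the messages are hypothetical.

```python
from typing import Any, Dict, List

# The two mandatory keys named earlier in this review.
MANDATORY_KEYS = ("policy", "weight")


def validate_policy_cfg(visit_type: str, policy_cfg: Any) -> List[Dict[str, Any]]:
    """Check one visit type's policy list, raising ValueError with an explicit message."""
    if not isinstance(policy_cfg, list) or not policy_cfg:
        raise ValueError(
            f"scheduling_policy for {visit_type!r} must be a non-empty list"
        )
    for index, entry in enumerate(policy_cfg):
        if not isinstance(entry, dict):
            raise ValueError(
                f"scheduling_policy for {visit_type!r}, entry {index}: "
                f"expected a mapping, got {type(entry).__name__}"
            )
        missing = [key for key in MANDATORY_KEYS if key not in entry]
        if missing:
            raise ValueError(
                f"scheduling_policy for {visit_type!r}, entry {index}: "
                f"missing mandatory key(s): {', '.join(missing)}"
            )
    return policy_cfg
```

Unlike a bare assert, the ValueError survives running Python with -O and tells the operator which visit type and which entry of the config file is at fault.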

douardda edited the summary of this revision.

Improve docstrings, better config validation, use lists in grab_next_visits_policy_weights()

Build is green

Patch application report for D7591 (id=27554)

Rebasing onto 5302efdafd...

Current branch diff-target is up to date.
Changes applied before test
commit a76bb02f0e94bf1c61124c9133d48a03f3d1a05f
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Apr 15 18:08:49 2022 +0200

    Make scheduling policy used in schedule_recurrent configurable
    

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/528/ for more details.

swh/scheduler/celery_backend/recurrent_visits.py
83–86

thx mypy...

279

I was pretty sure this would not pass... I tried :-)

olasd added inline comments.
swh/scheduler/celery_backend/recurrent_visits.py
108–112

We may want to explicitly allow that, at some point: if we make the existing scheduling policies more generic (e.g. by adding an argument to allow reversing the sort order, ...), repeating the same policy multiple times would make sense. I agree that it's not really needed now.