Implement the scheduling policy for the recurrent visit scheduler
Closed, Migrated

Assigned To

Authored By

	olasd
	Jun 9 2020, 5:09 PM

Description

When both the lister API and the recent visit cache have been seeded, we should be able to implement the actual scheduling policy for the new scheduler.

This involves:

- generate the list of the "next" origin URLs to load from the scheduler tables (according to the scheduling policy);
- take a list of URLs and generate "legacy" one-shot tasks;
- a "visit simulator", which updates the scheduler database according to a simulated loading time for each origin, and lets us monitor the behavior of the full simulated scheduling/loading infrastructure:
  - get a model of the current loading time distribution
  - determine which metrics we want to use to
    - optimize the scheduler policy
    - check for runaway edge cases, e.g. origins that never get loaded even if the "average" behavior is okay
    - reduce the "number of useless visits"
    - reduce the lag between an actual commit and the next visit
    - ...
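
To make the pieces above more concrete, here is a minimal sketch of a toy policy and a toy visit simulator. Everything in it is an assumption for illustration: `Origin`, `next_origins` and `simulate` are hypothetical names and not part of the `swh.scheduler` API, the policy (never-visited origins first, then the largest lag between last update and last visit) is only one possible choice, and the exponential loading time with a 60-second mean stands in for a real model of the loading time distribution.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Origin:
    """Hypothetical stand-in for a row of the scheduler tables."""
    url: str
    last_update: Optional[float] = None  # assumed: last upstream change seen by the lister
    last_visit: Optional[float] = None   # assumed: end time of the last visit


def next_origins(origins: List[Origin], count: int) -> List[Origin]:
    """Toy policy: never-visited origins first, then the largest update/visit lag."""
    def priority(origin: Origin):
        if origin.last_visit is None:
            return (0, 0.0)
        lag = (origin.last_update or 0.0) - origin.last_visit
        return (1, -lag)  # larger lag sorts earlier

    return sorted(origins, key=priority)[:count]


def simulate(origins: List[Origin], batch_size: int, rounds: int,
             rng: random.Random) -> Dict[str, float]:
    """Run `rounds` scheduling rounds and collect a few candidate metrics."""
    clock = 0.0
    useless_visits = 0
    for _ in range(rounds):
        batch = next_origins(origins, batch_size)
        # Assumed loading time model: exponential, 60 s mean; a round lasts
        # as long as its slowest load.
        durations = [rng.expovariate(1 / 60.0) for _ in batch]
        clock += max(durations, default=0.0)
        for origin in batch:
            already_visited = origin.last_visit is not None
            no_new_update = (origin.last_update is None
                             or origin.last_update <= origin.last_visit) if already_visited else False
            if already_visited and no_new_update:
                useless_visits += 1  # visit that could not find anything new
            origin.last_visit = clock
    never_visited = sum(1 for o in origins if o.last_visit is None)
    return {"clock": clock, "useless_visits": useless_visits,
            "never_visited": never_visited}


if __name__ == "__main__":
    sample = [
        Origin("https://example.org/a"),
        Origin("https://example.org/b", last_update=10.0, last_visit=5.0),
        Origin("https://example.org/c", last_update=1.0, last_visit=5.0),
    ]
    print([o.url for o in next_origins(sample, 2)])
    print(simulate(sample, batch_size=3, rounds=1, rng=random.Random(0)))
```

The `never_visited` counter is one way to surface the runaway case where some origins never get loaded, and `useless_visits` counts visits to origins with no update since their last visit. A real simulator would presumably draw loading times from the measured distribution rather than from a fixed exponential.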

Revisions and Commits

rDSCH Scheduling utilities
	Abandoned		D5809 Direct scheduling of origin visits in celery
	Closed		D4846 Introduce a `swh scheduler origin schedule-next` cli
	Closed		D4844 Introduce a `swh scheduler origin grab-next` cli
		D4898	rDSCHacad712ad3f7 Add scheduling policy for never visited origins
		D4899	rDSCH2f47936731cf Add scheduling policy for already visited origins with known last update
		D4881	rDSCHf8627a96fed6 Move the `last_scheduled` ts from ListedOrigin to OriginVisitStatus

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T2345 Improve handling of recurrent loading tasks in scheduler
Migrated	gitlab-migration	T2454 Stop creating tasks directly in listers
Migrated	gitlab-migration	T2444 Implement the scheduling policy for the recurrent visit scheduler
Migrated	gitlab-migration	T2442 Provide a unified API for listers to interact with the scheduler
Migrated	gitlab-migration	T2955 Port Bitbucket lister to the new Lister API
Migrated	gitlab-migration	T2956 Port PyPI lister to the new Lister API
Migrated	gitlab-migration	T2972 Port npm lister to the new Lister API
Migrated	gitlab-migration	T2979 Port debian lister to the new Lister API
Migrated	gitlab-migration	T2989 Port CRAN lister to the new Lister API
Migrated	gitlab-migration	T2990 Port GNU lister to the new Lister API
Migrated	gitlab-migration	T2991 Port packagist lister to the new Lister API
Migrated	gitlab-migration	T2992 Port launchpad lister to the new Lister API
Migrated	gitlab-migration	T2987 Port gitlab lister to the new `swh.lister.pattern.Lister` API
Migrated	gitlab-migration	T2984 Port cgit lister to the new Lister API
Migrated	gitlab-migration	T3073 Properly document the new unified API lister
Migrated	gitlab-migration	T2443 Implement a bulk-queryable cache of latest visits for use by the recurrent visit scheduler
Migrated	gitlab-migration	T2963 Add visit_type field to OriginVisitStatus model object
Migrated	gitlab-migration	T2964 Adapt origin_visit_status_(get|add) api to deal with the visit_type
Migrated	gitlab-migration	T2965 Adapt storage to actually write the visit_type in the origin_visit_status topic
Migrated	gitlab-migration	T2966 Backfill origin_visit_status with the `visit_type` field properly given
Migrated	gitlab-migration	T2967 Write journal client subscribed to origin_visit_status topics
Migrated	gitlab-migration	T2978 Deploy visit-stats journal client on staging
Migrated	gitlab-migration	T2993 Deploy visit-stats journal client on production