Page MenuHomeSoftware Heritage

scheduler.backend: Make filter_task_to_archive a paginated endpoint
ClosedPublic

Authored by ardumont on Dec 14 2019, 11:25 AM.

Diff Detail

Repository
rDSCH Scheduling utilities
Branch
into-paginated-endpoint
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 9760
Build 14401: tox-on-jenkinsJenkins
Build 14400: arc lint + arc unit

Event Timeline

  • backend: Filter properly archive within the defined range

Now the behavior is enforced server side. It is already dealt with within the
client cli.

vlorentz added a subscriber: vlorentz.

Please use the same pagination mechanism as the other services, for consistency. eg. https://forge.softwareheritage.org/source/swh-indexer/browse/master/swh/indexer/storage/__init__.py$791-813 and https://forge.softwareheritage.org/source/swh-search/browse/master/swh/search/elasticsearch.py$123-140

also, rename limit to count, also for consistency with them.

(I'll review the rest of the diff after you're done)

This revision now requires changes to proceed.Dec 14 2019, 4:59 PM

Please use the same pagination mechanism as the other services, for consistency. eg. https://forge.softwareheritage.org/source/swh-indexer/browse/master/swh/indexer/storage/__init__.py$791-813 and https://forge.softwareheritage.org/source/swh-search/browse/master/swh/search/elasticsearch.py$123-140

Thanks for the links

also, rename limit to count, also for consistency with them.

sure

note that the storage one stills uses limit

  • backend: Align paginated endpoint consistently with others

(I initially kept the original signature so it did not touch the client part)

This also:

  • checks consistently the tasks in paginated calls
  • improve the backend's docstring method

Fix docstring

swh/scheduler/sql/40-swh-func.sql
309

Prior to this, the actual filtering ended up listing tasks completely out of the input range.
Because we have tasks with started column left null (they never started).
Those have the scheduled entry set so we use it for filtering (which was already done client side in the task cli).

Now we are consistent.
And we will finally be able to filter those out of the scheduler db.

vlorentz added inline comments.
swh/scheduler/backend.py
421

page_token should be an optional string (both for consistency, and be cause we want it to be an opaque token we can change at any time)

swh/scheduler/cli/task.py
539–557

The last_id as a loop variable should be renamed page_token, because it should be an opaque identifier.

As for its use as argument of index_data, I don't know what to do with it. Maybe add an extra parameter to filter_task_to_archive, that is ignored if page_token is not None? Or just remove it altogether if we don't actually need it?

tox.ini
9

why?

This revision now requires changes to proceed.Dec 16 2019, 1:49 PM
swh/scheduler/backend.py
421

yes, i saw it this morning when reading back your other diffs.

I'm ultimately trying to add tests on the other diff and it takes me way too long.
i'll get back to that after i'm done with those tests.

swh/scheduler/cli/task.py
539–557

we do need it, i'm trying to avoid repeating the while truc loop.
Take a look below

gen = index_data...
if cleanup:
    # do something with gen
else:
    # do something else with gen
tox.ini
9

Because i'm using tox and ipdb to try and debug the hell of what's happening in there ;)

tox -e py3-dev -- -x -s --no-cov --log-level=DEBUG -k test_backend_setup
swh/scheduler/api/client.py
101 ↗(On Diff #8676)

revert back to limit now.

swh/scheduler/cli/task.py
539–557

i meant while True ;)

  • backend: Align paginated endpoint consistently with others:
    • Revert to use limit (and not count)
    • use optional string type as opaque identifier
vlorentz added inline comments.
swh/scheduler/cli/task.py
576

You're still not using it as an opaque identifier if it comes from a CLI option documented as "task id to start from". But I guess there is no easy way other than this one

This revision is now accepted and ready to land.Dec 16 2019, 4:45 PM