Page MenuHomeSoftware Heritage

cli.task: Migrate scheduler cli to latest storage change on iter_origins
ClosedPublic

Authored by ardumont on Aug 1 2020, 10:04 AM.

Details

Summary
  • Drop --max-id, --min-id cli parameter which can no longer work
  • Expose --page-token and --limit instead

Related to T645

Test Plan

tox

Diff Detail

Repository
rDSCH Scheduling utilities
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D3682 (id=12964)

Rebasing onto 849d063758...

Current branch diff-target is up to date.
Changes applied before test
commit e48885ce005c8bbf27bf3fc85883489248af7a31
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Aug 1 10:03:23 2020 +0200

    cli.task: Migrate scheduler cli to latest storage change on iter_origins
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/50/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/50/console

vlorentz added inline comments.
swh/scheduler/cli/task.py
291–293

why not use islice directly?

382

wrong order of args btw

.

swh/scheduler/cli/task.py
372–376

I don't understand the issue. Just change the arguments to take the page_token as argument instead of min_id

swh/scheduler/cli/task.py
291–293

yeah, i initially took it from the doc on itertools [1] and then thought to use it directly the generator...

[1] https://docs.python.org/3/library/itertools.html#itertools-recipes

382

because my brain wanted to use islice all along ;)

ardumont added inline comments.
swh/scheduler/cli/task.py
372–376

well for one, iter_origins does not return any page_token after that...

swh/scheduler/cli/task.py
372–376

Hmmm...

We could write a variant of iter_origins that yields tokens in addition to the origins.

swh/scheduler/cli/task.py
372–376

or drop iter_origins and use directly storage.origin_list which does just that...?

swh/scheduler/cli/task.py
372–376

yeah, indeed

swh/scheduler/cli/task.py
372–376

that'd simplify even further D3681 either by:

  • dropping iter_origins altogether
  • simplifying its definition to reuse directly swh.core.api.classes.stream_results (and stop exposing the page_token)
  • Amend according to discussions
  • Add missing test which fails due to a most probably wrong origin-list implementation for the in-memory storage
ardumont edited the summary of this revision. (Show Details)

Build has FAILED

Patch application report for D3682 (id=12965)

Rebasing onto 849d063758...

Current branch diff-target is up to date.
Changes applied before test
commit e4e94a48f184164e015f173027a3c471dfc21389
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Aug 1 10:03:23 2020 +0200

    cli.task: Migrate scheduler cli to latest storage change on iter_origins
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/51/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/51/console

Add scenario with page-token

tests need storage > v0.11.8 to pass

Build has FAILED

Patch application report for D3682 (id=12968)

Rebasing onto 849d063758...

Current branch diff-target is up to date.
Changes applied before test
commit 95882a4212324f828a13c19ddc0ed8284e93362b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Aug 1 10:03:23 2020 +0200

    cli.task: Migrate scheduler cli to latest storage change on iter_origins
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/53/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/53/console

ardumont edited the test plan for this revision. (Show Details)

Build has FAILED

Patch application report for D3682 (id=12972)

Rebasing onto 849d063758...

Current branch diff-target is up to date.
Changes applied before test
commit ba058084f2e403a9c13371b51eff629f228ea23a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Aug 1 10:03:23 2020 +0200

    cli.task: Migrate scheduler cli to latest storage change on iter_origins
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/54/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/54/console

Fix flake8 warnings, turns out we can compose f and r prefix for strings (cool)

tests still failing since they still need a storage bump

Build has FAILED

Patch application report for D3682 (id=12973)

Rebasing onto 849d063758...

Current branch diff-target is up to date.
Changes applied before test
commit 7db78dd9bf7a3cdba950094ce850fb6144dedbaa
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Aug 1 10:03:23 2020 +0200

    cli.task: Migrate scheduler cli to latest storage change on iter_origins
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/55/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/55/console

It would be nice to print the token from time to time so the process can be resumed if interrupted, but that's good enough for now

This revision is now accepted and ready to land.Aug 3 2020, 9:13 AM

It would be nice to print the token from time to time so the process can be resumed if interrupted, but that's good enough for now

Yeah, it hit me during the night that i forgot that.
How do you envision this, is a logger.info instruction enough?

I was thinking of print(), but that would work too

I was thinking of print(), but that would work too

yeah, ok but now i'm wondering where.
I suppose it's common enough it should go in swh.core.api.classes.stream_results

If not, then i suppose, this can go in the iter_origins (from the storage diff)?

Adapt according to review/discussion, inline iter_origins within the scheduler
so it displays the page_token during pagination.

Tests should be fine now as latest storage got bumped

Build is green

Patch application report for D3682 (id=12989)

Rebasing onto 849d063758...

Current branch diff-target is up to date.
Changes applied before test
commit 2cf14951a7fc4f2a68cae9e80a9f352f6e27495c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Aug 1 10:03:23 2020 +0200

    cli.task: Migrate scheduler cli to latest storage change on iter_origins
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/56/ for more details.

swh/scheduler/cli/task.py
294

"Eventually" does not mean "éventuellement", but "finalement" ;)

Use "optionally" instead

Rework docstring sentences
(thx for the heads up)

Build is green

Patch application report for D3682 (id=12990)

Rebasing onto 849d063758...

Current branch diff-target is up to date.
Changes applied before test
commit 642620845ac21c4b3fbb0aae69befc87c5ac64be
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Aug 1 10:03:23 2020 +0200

    cli.task: Migrate scheduler cli to latest storage change on iter_origins
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/57/ for more details.