Page MenuHomeSoftware Heritage

Move the `last_scheduled` ts from ListedOrigin to OriginVisitStatus
ClosedPublic

Authored by douardda on Jan 19 2021, 2:49 PM.

Details

Summary

this timestamp being actually a loading-related value, it makes more
sense to keep it in the OriginVisitStatus table.

Related to T2444.

Diff Detail

Event Timeline

Build is green

Patch application report for D4881 (id=17338)

Rebasing onto 5e609d5205...

Current branch diff-target is up to date.
Changes applied before test
commit f87d9d62fd087e373d1e86fdbbc6a7b5f6c09f24
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jan 19 14:45:09 2021 +0100

    Move the `last_scheduled` ts from ListedOrigin to OriginVisitStatus
    
    this timestamp being actually a loading-related value, it makes more
    sense to keep it in the OriginVisitStatus table.
    
    Related to T2444.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/157/ for more details.

forget the migration script

Build is green

Patch application report for D4881 (id=17339)

Rebasing onto 5e609d5205...

Current branch diff-target is up to date.
Changes applied before test
commit dd905f5adeca8a4ea83f05b62b45dcc354a30d42
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jan 19 14:45:09 2021 +0100

    Move the `last_scheduled` ts from ListedOrigin to OriginVisitStatus
    
    this timestamp being actually a loading-related value, it makes more
    sense to keep it in the OriginVisitStatus table.
    
    Related to T2444.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/158/ for more details.

vlorentz added inline comments.
swh/scheduler/backend.py
311

why did you remove this line?

swh/scheduler/backend.py
311

because postgresql would not let me have it... but as I'm a noob in sql, I am very welcome to any help :-)

Nice!

I've taken the liberty of reorganizing/rewriting the SQL; I have quickly tested it on a scratch database and it seems to work.

I think we'll be able to use the same three-part query scaffolding to implement other scheduling policies, only changing the filter / sort criteria, so that feels like a big win.

swh/scheduler/backend.py
305–321

I've reorganized your query in three parts:

  • selecting all the data for the next origins to visit from a join of the lister and stats table
  • updating the stats table
  • returning the selected data directly

We don't need to chain all the data through using a returning in the update_stats CTE, we can just select all the interesting data in the first select, do the bulk update, and return the data from the first CTE directly.

315

We'll probably want to be able to set this now() timestamp from the outside, for simulation purposes.

807

Maybe this would deserve a separate commit (no need for a separate diff)?

swh/scheduler/backend.py
305–321

awesome, thx a lot

807

I can but this is really needed here as a side effect of the new implementation of grab_next_visits() (which inserts rows in origin_visit_stats)

Use olasd's version os the SQL query

Build is green

Patch application report for D4881 (id=17346)

Rebasing onto 5e609d5205...

Current branch diff-target is up to date.
Changes applied before test
commit 197dcea6448b15f39fd5406ab3b741a00de8319d
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jan 19 14:45:09 2021 +0100

    Move the `last_scheduled` ts from ListedOrigin to OriginVisitStatus
    
    this timestamp being actually a loading-related value, it makes more
    sense to keep it in the OriginVisitStatus table.
    
    Related to T2444.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/159/ for more details.

vsellier added inline comments.
sql/updates/24.sql
9

Shouldn't the indexes be added here too?

Build is green

Patch application report for D4881 (id=17352)

Rebasing onto 0a32a31195...

Current branch diff-target is up to date.
Changes applied before test
commit f8627a96fed6fa8d4485d4092159982dd5c6362b
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jan 19 14:45:09 2021 +0100

    Move the `last_scheduled` ts from ListedOrigin to OriginVisitStatus
    
    this timestamp being actually a loading-related value, it makes more
    sense to keep it in the OriginVisitStatus table.
    
    Related to T2444.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/163/ for more details.

This revision is now accepted and ready to land.Jan 20 2021, 9:02 AM