Page MenuHomeSoftware Heritage

origin_visit_get_latest: Order by visit id instead of date
ClosedPublic

Authored by vlorentz on Mar 11 2022, 1:41 PM.

Details

Summary

This allows both the postgresql and cassandra backends to make efficient
queries by using an index (resp. clustering key) instead of scanning
all visits of the given origin then sorting by date.

This does not affect the results for the last majority of cases,
as ids are always in increasing chronological, unless an origin was
re-loaded from an old archive.

This should resolve recurring timeouts on this endpoint (6.4k occurrences in a month)

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Mar 11 2022, 1:43 PM
Harbormaster failed remote builds in B27416: Diff 26532!

Build is green

Patch application report for D7339 (id=26532)

Could not rebase; Attempt merge onto 4e78014b07...

Updating 4e78014b..6c5fbabc
Fast-forward
 swh/storage/cassandra/cql.py       |  6 ++++--
 swh/storage/cassandra/storage.py   | 30 +++++++++---------------------
 swh/storage/in_memory.py           |  4 ++--
 swh/storage/postgresql/db.py       |  8 ++++----
 swh/storage/postgresql/storage.py  | 12 ++++++------
 swh/storage/tests/storage_tests.py |  4 ++--
 6 files changed, 27 insertions(+), 37 deletions(-)
Changes applied before test
commit 6c5fbabc39dd37ff1c7560f12a9f692f562aa8fc
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 11 13:38:51 2022 +0100

    origin_visit_get_latest: Order by visit id instead of date
    
    This allows both the postgresql and cassandra backends to make efficient
    queries by using an index (resp. clustering key) instead of scanning
    all visits of the given origin then sorting by date.
    
    This does not affect the results for the last majority of cases,
    as ids are always in increasing chronological, unless an origin was
    re-loaded from an old archive.

commit 600e87fb2351736c9e05ffb3e3d9a1641a00b846
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 11 13:36:07 2022 +0100

    origin_visit_get_latest: Materialize subquery on 'origin' table.
    
    postgresql's query planner does not understand the origin is unique, so it performs
    a partial index scan on origin_visit_pkey, which is inefficient on origins with
    many visits.
    
    This commit itself is not enough to make it use the proper index,
    but provides this necessary change that will be used by a future commit.

commit b0cdab58b71b25e07d40dc022a5f0b0c77298c3f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 11 11:36:01 2022 +0100

    postgresql: Increase timeouts that often fail
    
    According to Sentry, in the last 30 days:
    
    * directory_entry_get_by_path: 958 events, https://sentry.softwareheritage.org/share/issue/c4c2124953a145b2bd325f6f6b7df5a6/
    * revision_get: 841 events, https://sentry.softwareheritage.org/share/issue/55fbe01c6f4d4c9bbf684c7608a62ad9/
    * release_get: 14 events, https://sentry.softwareheritage.org/share/issue/37c53354541b4c4eaa1faf4e20a68418/
    * origin_visit_find_by_date: 114 events, https://sentry.softwareheritage.org/share/issue/a674c12049a941968a717661a0226559/
    * origin_get: 79 events, https://sentry.softwareheritage.org/share/issue/bf21d6bc7b24442eb18643d80d936d27/ ; 67 events, https://sentry.softwareheritage.org/share/issue/010a4b1e085a4e2089ba4897c6de6038/

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1568/ for more details.

Build is green

Patch application report for D7339 (id=26533)

Could not rebase; Attempt merge onto 4e78014b07...

Updating 4e78014b..ccde0975
Fast-forward
 swh/storage/cassandra/cql.py       |  6 ++++--
 swh/storage/cassandra/storage.py   | 30 +++++++++---------------------
 swh/storage/in_memory.py           |  4 ++--
 swh/storage/postgresql/db.py       |  8 ++++----
 swh/storage/postgresql/storage.py  | 12 ++++++------
 swh/storage/tests/storage_tests.py |  4 ++--
 6 files changed, 27 insertions(+), 37 deletions(-)
Changes applied before test
commit ccde097533cd674a03e2afa6e10d994c97741d79
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 11 13:38:51 2022 +0100

    origin_visit_get_latest: Order by visit id instead of date
    
    This allows both the postgresql and cassandra backends to make efficient
    queries by using an index (resp. clustering key) instead of scanning
    all visits of the given origin then sorting by date.
    
    This does not affect the results for the last majority of cases,
    as ids are always in increasing chronological, unless an origin was
    re-loaded from an old archive.

commit 600e87fb2351736c9e05ffb3e3d9a1641a00b846
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 11 13:36:07 2022 +0100

    origin_visit_get_latest: Materialize subquery on 'origin' table.
    
    postgresql's query planner does not understand the origin is unique, so it performs
    a partial index scan on origin_visit_pkey, which is inefficient on origins with
    many visits.
    
    This commit itself is not enough to make it use the proper index,
    but provides this necessary change that will be used by a future commit.

commit b0cdab58b71b25e07d40dc022a5f0b0c77298c3f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 11 11:36:01 2022 +0100

    postgresql: Increase timeouts that often fail
    
    According to Sentry, in the last 30 days:
    
    * directory_entry_get_by_path: 958 events, https://sentry.softwareheritage.org/share/issue/c4c2124953a145b2bd325f6f6b7df5a6/
    * revision_get: 841 events, https://sentry.softwareheritage.org/share/issue/55fbe01c6f4d4c9bbf684c7608a62ad9/
    * release_get: 14 events, https://sentry.softwareheritage.org/share/issue/37c53354541b4c4eaa1faf4e20a68418/
    * origin_visit_find_by_date: 114 events, https://sentry.softwareheritage.org/share/issue/a674c12049a941968a717661a0226559/
    * origin_get: 79 events, https://sentry.softwareheritage.org/share/issue/bf21d6bc7b24442eb18643d80d936d27/ ; 67 events, https://sentry.softwareheritage.org/share/issue/010a4b1e085a4e2089ba4897c6de6038/

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1569/ for more details.

This revision is now accepted and ready to land.Mar 11 2022, 2:09 PM