pg: Rewrite _origin_query to force the query planner to filter on URLs before filtering on visits.
Closed · Public

Authored by vlorentz on Jul 31 2020, 10:12 AM.

Details

Summary

URL filters usually match only a few rows and can use the index, whereas
filtering on visits requires scanning the entire origin table first.

This makes the query considerably faster.

Credit for the idea goes to @olasd.
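
The query shape described in the summary can be sketched as follows. This is a minimal illustration, not the code from this diff: the function names `build_origin_query` and `demo`, the CTE name `filtered_origins`, and the simplified two-table schema are assumptions, and `sqlite3` is used only so the sketch runs without a server. SQLite's planner behaves differently from PostgreSQL's, so the sketch shows the structure of the query (URL filter inside a CTE, visit filter applied to its result) rather than the speedup. Note also that PostgreSQL 12 and later may inline such a CTE unless it is declared `AS MATERIALIZED`.

```python
import sqlite3


def build_origin_query(with_visit: bool) -> str:
    """Build SQL that filters origins by URL first, then optionally by visits.

    Hypothetical helper: the names and the schema are illustrative only.
    """
    # Step 1: the CTE narrows the origin table using the URL filter.
    query = """
        WITH filtered_origins AS (
            SELECT id, url FROM origin WHERE url LIKE ?
        )
        SELECT o.id, o.url
        FROM filtered_origins AS o
    """
    if with_visit:
        # Step 2: the visit check only runs on rows that passed the URL filter.
        query += """
        WHERE EXISTS (
            SELECT 1 FROM origin_visit AS ov WHERE ov.origin = o.id
        )
        """
    return query + " ORDER BY o.id"


def demo(with_visit: bool) -> list:
    """Run the query against a tiny in-memory database and return the rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE origin (id INTEGER PRIMARY KEY, url TEXT NOT NULL);
        CREATE TABLE origin_visit (origin INTEGER NOT NULL, visit INTEGER NOT NULL);
        INSERT INTO origin VALUES
            (1, 'https://example.org/a'),
            (2, 'https://example.org/b'),
            (3, 'https://other.test/c');
        INSERT INTO origin_visit VALUES (1, 1), (3, 1);
        """
    )
    rows = conn.execute(build_origin_query(with_visit), ("%example.org%",)).fetchall()
    conn.close()
    return rows
```

With this sample data, only origins 1 and 2 match the URL pattern, and only origin 1 among them has a visit, so the visit check never touches origin 3.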

Diff Detail

Repository: rDSTO Storage manager
Branch: master
Lint: No Linters Available
Unit: No Unit Test Coverage
Build Status: Buildable 14125
  Build 21699: Phabricator diff pipeline on jenkins
  Build 21698: arc lint + arc unit

Unit Tests: Failed

Time      Test
9,067 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_kafka_writer::test_storage_direct_writer
          kafka_prefix = 'ftpywlqlqy', kafka_server = '127.0.0.1:40039' consumer = <cimpl.Consumer object at 0x7f23503f2158>
8,227 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_kafka_writer::test_storage_direct_writer_anonymized
          kafka_prefix = 'bqhkorllfh', kafka_server = '127.0.0.1:40039' consumer = <cimpl.Consumer object at 0x7f23503fb048>
9,013 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_replay::test_storage_play_with_collision
          replayer_storage_and_client = (<swh.storage.in_memory.InMemoryStorage object at 0x7f23503ca1d0>, <swh.journal.client.JournalClient object at 0x7f23503ca780>) caplog = <_pytest.logging.LogCaptureFixture object at 0x7f235155e160>
10 ms     Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.fixer::swh.storage.fixer._fix_content
6 ms      Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.fixer::swh.storage.fixer._fix_origin

Full test results: 3 Failed · 750 Passed · 17 Skipped

Event Timeline

vlorentz added a subscriber: olasd.

Build has FAILED

Patch application report for D3662 (id=12888)

Could not rebase; Attempt merge onto cf9f44e805...

Updating cf9f44e8..8c8b0b83
Fast-forward
 swh/storage/backfill.py                | 43 +++++++++++++++++++++++++++++++++-
 swh/storage/converters.py              | 38 ++++++++++++++++++++++++++++++
 swh/storage/db.py                      | 42 +++++++++++++++++++--------------
 swh/storage/replay.py                  |  8 ++++++-
 swh/storage/storage.py                 | 31 ++----------------------
 swh/storage/tests/test_backfill.py     |  5 +++-
 swh/storage/tests/test_kafka_writer.py |  9 +++++++
 swh/storage/tests/test_replay.py       |  9 ++++---
 8 files changed, 131 insertions(+), 54 deletions(-)
Changes applied before test
commit 8c8b0b83d7831c77143f62dfa83c142c8d112b1f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jul 31 10:11:57 2020 +0200

    pg: Rewrite _origin_query to force the query planner to filter on URLs before filtering on visits.
    
    URL filters usually have a few matches and use the index; whereas filtering
    on visits requires to scan the entire origin table first.
    
    This makes the query considerably faster.

commit 0c5a8e274af2aae42cdaedf3b462a9db0fdbf177
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jul 30 19:39:41 2020 +0200

    Add support for metadata-related object types to the backfiller and replayer.
    
    Existing tests automatically test them, using data from swh.journal.tests.

commit 24bc51dfff6c2fd825534a6dac23ff6a7e02faa0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jul 30 19:33:14 2020 +0200

    test_replay: update for swh.journal 0.4.1.
    
    DUPLICATE_CONTENTS now contains BaseModel objects.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/631/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/631/console

Build has FAILED

Patch application report for D3662 (id=12889)

Could not rebase; Attempt merge onto cf9f44e805...

Updating cf9f44e8..df943ec2
Fast-forward
 swh/storage/backfill.py                | 43 +++++++++++++++++++++++++++++++++-
 swh/storage/converters.py              | 38 ++++++++++++++++++++++++++++++
 swh/storage/db.py                      | 42 +++++++++++++++++++--------------
 swh/storage/replay.py                  |  8 ++++++-
 swh/storage/storage.py                 | 31 ++----------------------
 swh/storage/tests/test_backfill.py     |  5 +++-
 swh/storage/tests/test_kafka_writer.py |  9 +++++++
 swh/storage/tests/test_replay.py       |  9 ++++---
 8 files changed, 131 insertions(+), 54 deletions(-)
Changes applied before test
commit df943ec25cf91c0417c23a7376d40414a429db7d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jul 31 10:11:57 2020 +0200

    pg: Rewrite _origin_query to force the query planner to filter on URLs before filtering on visits.
    
    URL filters usually have a few matches and use the index; whereas filtering
    on visits requires to scan the entire origin table first.
    
    This makes the query considerably faster.
    
    Credit for the idea goes to @olasd.

commit 0c5a8e274af2aae42cdaedf3b462a9db0fdbf177
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jul 30 19:39:41 2020 +0200

    Add support for metadata-related object types to the backfiller and replayer.
    
    Existing tests automatically test them, using data from swh.journal.tests.

commit 24bc51dfff6c2fd825534a6dac23ff6a7e02faa0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jul 30 19:33:14 2020 +0200

    test_replay: update for swh.journal 0.4.1.
    
    DUPLICATE_CONTENTS now contains BaseModel objects.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/632/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/632/console

Build is green

Patch application report for D3662 (id=12889)

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/635/ for more details.

This revision is now accepted and ready to land. (Jul 31 2020, 1:22 PM)