pg: Rewrite _origin_query to force the query planner to filter on URLs before filtering on visits.
Closed, Public

Authored by vlorentz on Jul 31 2020, 10:12 AM.

Details

Summary

URL filters usually match only a few rows and can use the index, whereas
filtering on visits first requires scanning the entire origin table.

This makes the query considerably faster.

Credit for the idea goes to @olasd.
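
To illustrate the idea in the summary, here is a minimal sketch. It is not the actual change to swh/storage/db.py. The helper name build_origin_query, the simplified origin and origin_visit tables, and the use of an in-memory SQLite database are all assumptions made so the example runs on its own. The URL filter is placed in a CTE so that the visit check only sees rows that already matched. Note that PostgreSQL 12 and later may inline a CTE unless it is marked MATERIALIZED, so a real query may need that keyword or a similar optimization barrier.

```python
import sqlite3


def build_origin_query(with_visit: bool) -> str:
    """Return SQL that narrows origins by URL first, then checks visits.

    Hypothetical helper: names and schema are illustrative only.
    """
    # Step 1: the URL filter runs inside a CTE, so only matching rows
    # are handed to the (more expensive) visit check.
    query = """
        WITH filtered_origins AS (
            SELECT id, url FROM origin WHERE url LIKE ?
        )
        SELECT id, url FROM filtered_origins AS o
    """
    # Step 2: optionally keep only origins with at least one visit
    # that produced a snapshot.
    if with_visit:
        query += """
        WHERE EXISTS (
            SELECT 1 FROM origin_visit AS ov
            WHERE ov.origin = o.id AND ov.snapshot IS NOT NULL
        )
        """
    return query + " ORDER BY id"


def demo() -> list:
    """Run the query against a tiny in-memory database (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE origin (id INTEGER PRIMARY KEY, url TEXT UNIQUE);
        CREATE TABLE origin_visit (origin INTEGER, visit INTEGER, snapshot TEXT);
        INSERT INTO origin VALUES
            (1, 'https://github.com/a/b'),
            (2, 'https://github.com/c/d'),
            (3, 'https://gitlab.com/e/f');
        INSERT INTO origin_visit VALUES (1, 1, 'snp1'), (3, 1, 'snp2');
        """
    )
    rows = conn.execute(
        build_origin_query(with_visit=True), ("%github.com%",)
    ).fetchall()
    conn.close()
    return rows
```

Calling demo() returns only the origin whose URL matches the pattern and which also has a visit with a snapshot. Whether the planner actually evaluates the URL filter first depends on the database engine and version, so the resulting plan should be checked with EXPLAIN on the target system.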

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 14126
Build 21701: Phabricator diff pipeline on jenkins
Build 21700: arc lint + arc unit

Event Timeline

vlorentz added a subscriber: olasd.

Build has FAILED

Patch application report for D3662 (id=12888)

Could not rebase; Attempt merge onto cf9f44e805...

Updating cf9f44e8..8c8b0b83
Fast-forward
 swh/storage/backfill.py                | 43 +++++++++++++++++++++++++++++++++-
 swh/storage/converters.py              | 38 ++++++++++++++++++++++++++++++
 swh/storage/db.py                      | 42 +++++++++++++++++++--------------
 swh/storage/replay.py                  |  8 ++++++-
 swh/storage/storage.py                 | 31 ++----------------------
 swh/storage/tests/test_backfill.py     |  5 +++-
 swh/storage/tests/test_kafka_writer.py |  9 +++++++
 swh/storage/tests/test_replay.py       |  9 ++++---
 8 files changed, 131 insertions(+), 54 deletions(-)
Changes applied before test
commit 8c8b0b83d7831c77143f62dfa83c142c8d112b1f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jul 31 10:11:57 2020 +0200

    pg: Rewrite _origin_query to force the query planner to filter on URLs before filtering on visits.
    
    URL filters usually have a few matches and use the index; whereas filtering
    on visits requires to scan the entire origin table first.
    
    This makes the query considerably faster.

commit 0c5a8e274af2aae42cdaedf3b462a9db0fdbf177
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jul 30 19:39:41 2020 +0200

    Add support for metadata-related object types to the backfiller and replayer.
    
    Existing tests automatically test them, using data from swh.journal.tests.

commit 24bc51dfff6c2fd825534a6dac23ff6a7e02faa0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jul 30 19:33:14 2020 +0200

    test_replay: update for swh.journal 0.4.1.
    
    DUPLICATE_CONTENTS now contains BaseModel objects.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/631/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/631/console

Build has FAILED

Patch application report for D3662 (id=12889)

Could not rebase; Attempt merge onto cf9f44e805...

Updating cf9f44e8..df943ec2
Fast-forward
 swh/storage/backfill.py                | 43 +++++++++++++++++++++++++++++++++-
 swh/storage/converters.py              | 38 ++++++++++++++++++++++++++++++
 swh/storage/db.py                      | 42 +++++++++++++++++++--------------
 swh/storage/replay.py                  |  8 ++++++-
 swh/storage/storage.py                 | 31 ++----------------------
 swh/storage/tests/test_backfill.py     |  5 +++-
 swh/storage/tests/test_kafka_writer.py |  9 +++++++
 swh/storage/tests/test_replay.py       |  9 ++++---
 8 files changed, 131 insertions(+), 54 deletions(-)
Changes applied before test
commit df943ec25cf91c0417c23a7376d40414a429db7d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jul 31 10:11:57 2020 +0200

    pg: Rewrite _origin_query to force the query planner to filter on URLs before filtering on visits.
    
    URL filters usually have a few matches and use the index; whereas filtering
    on visits requires to scan the entire origin table first.
    
    This makes the query considerably faster.
    
    Credit for the idea goes to @olasd.

commit 0c5a8e274af2aae42cdaedf3b462a9db0fdbf177
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jul 30 19:39:41 2020 +0200

    Add support for metadata-related object types to the backfiller and replayer.
    
    Existing tests automatically test them, using data from swh.journal.tests.

commit 24bc51dfff6c2fd825534a6dac23ff6a7e02faa0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jul 30 19:33:14 2020 +0200

    test_replay: update for swh.journal 0.4.1.
    
    DUPLICATE_CONTENTS now contains BaseModel objects.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/632/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/632/console

Build is green

Patch application report for D3662 (id=12889)

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/635/ for more details.

This revision is now accepted and ready to land. (Jul 31 2020, 1:22 PM)