
Write queries for the crossminer dataset and estimate cost
Closed, Migrated

Description

Origin URL and directory found at:

  • last visit
  • on master/HEAD branch
  • containing pom.xml or build.gradle files

With and without filtering

Event Timeline

Query with last visit, master branch and 'pom.xml' file name filtering:

EXPLAIN SELECT o.url origin, convert_from(occ.branch, 'utf-8') branch, occ.target revision, rev.directory directory, def.target, convert_from(def.name, 'utf-8'),
                              (SELECT MAX(date) FROM origin_visit
                               WHERE origin= o.id) AS date
FROM origin o
INNER JOIN occurrence occ on o.id= occ.origin
INNER JOIN revision rev on occ.target = rev.id
INNER JOIN directory dir on rev.directory=dir.id
INNER JOIN directory_entry_file def on def.id = any(dir.file_entries) 
WHERE occ.branch='refs/heads/master' AND def.name='pom.xml';

                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=56458607.55..749330456.83 rows=1924919 width=178)
   Workers Planned: 2
   ->  Nested Loop  (cost=56457607.55..608997867.22 rows=802050 width=154)
         ->  Nested Loop  (cost=56457606.98..605695604.36 rows=802050 width=111)
               ->  Nested Loop  (cost=56457606.40..152194225.02 rows=17310424 width=168)
                     ->  Hash Join  (cost=56457605.70..77417326.13 rows=17310424 width=71)
                           Hash Cond: ((occ.target)::bytea = (rev.id)::bytea)
                           ->  Parallel Seq Scan on occurrence occ  (cost=0.00..12235749.33 rows=17310424 width=50)
                                 Filter: (branch = '\x726566732f68656164732f6d6173746572'::bytea)
                           ->  Hash  (cost=36674515.42..36674515.42 rows=929260742 width=42)
                                 ->  Seq Scan on revision rev  (cost=0.00..36674515.42 rows=929260742 width=42)
                     ->  Index Scan using directory_pkey on directory dir  (cost=0.70..4.32 rows=1 width=118)
                           Index Cond: ((id)::bytea = (rev.directory)::bytea)
               ->  Index Scan using directory_entry_file_pkey on directory_entry_file def  (cost=0.58..26.19 rows=1 width=48)
                     Index Cond: (id = ANY (dir.file_entries))
                     Filter: ((name)::bytea = '\x706f6d2e786d6c'::bytea)
         ->  Index Scan using origin_pkey on origin o  (cost=0.57..4.12 rows=1 width=51)
               Index Cond: (id = occ.origin)
   SubPlan 1
     ->  Aggregate  (cost=72.79..72.80 rows=1 width=8)
           ->  Index Scan using origin_visit_pkey on origin_visit  (cost=0.57..72.70 rows=35 width=8)
                 Index Cond: (origin = o.id)

Query with only last visit filtering:

EXPLAIN SELECT o.url origin, convert_from(occ.branch, 'utf-8') branch, occ.target revision, rev.directory directory, def.target sha1, convert_from(def.name, 'utf-8') file_name,
                               (SELECT MAX(date) FROM origin_visit
                               WHERE origin= o.id) AS date
FROM origin o
INNER JOIN occurrence occ on o.id= occ.origin
INNER JOIN revision rev on occ.target = rev.id
INNER JOIN directory dir on rev.directory=dir.id
INNER JOIN directory_entry_file def on def.id = any(dir.file_entries) ;


                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=288420868.90..424232612766.15 rows=5727713914 width=178)
   Workers Planned: 2
   ->  Nested Loop  (cost=288419868.90..6667405981.87 rows=2386547464 width=154)
         ->  Hash Join  (cost=288419868.32..399575775.31 rows=238654747 width=211)
               Hash Cond: ((rev.directory)::bytea = (dir.id)::bytea)
               ->  Hash Join  (cost=59962799.91..96824486.71 rows=238654747 width=114)
                     Hash Cond: (occ.origin = o.id)
                     ->  Hash Join  (cost=56457613.73..83938300.38 rows=238654747 width=71)
                           Hash Cond: ((occ.target)::bytea = (rev.id)::bytea)
                           ->  Parallel Seq Scan on occurrence occ  (cost=0.00..11639112.47 rows=238654747 width=50)
                           ->  Hash  (cost=36674520.77..36674520.77 rows=929260877 width=42)
                                 ->  Seq Scan on revision rev  (cost=0.00..36674520.77 rows=929260877 width=42)
                     ->  Hash  (cost=1739299.08..1739299.08 rows=79310008 width=51)
                           ->  Seq Scan on origin o  (cost=0.00..1739299.08 rows=79310008 width=51)
               ->  Hash  (cost=120843582.96..120843582.96 rows=3577798996 width=118)
                     ->  Seq Scan on directory dir  (cost=0.00..120843582.96 rows=3577798996 width=118)
         ->  Index Scan using directory_entry_file_pkey on directory_entry_file def  (cost=0.58..26.16 rows=10 width=48)
               Index Cond: (id = ANY (dir.file_entries))
   SubPlan 1
     ->  Aggregate  (cost=72.79..72.80 rows=1 width=8)
           ->  Index Scan using origin_visit_pkey on origin_visit  (cost=0.57..72.70 rows=35 width=8)
                 Index Cond: (origin = o.id)

Query without filtering:

EXPLAIN SELECT o.url origin, ov.date date, convert_from(occ.branch, 'utf-8') branch, occ.target revision, rev.directory directory, def.target sha1, convert_from(def.name, 'utf-8') file_name
FROM origin o
INNER JOIN origin_visit ov on o.id = ov.origin
INNER JOIN occurrence occ on o.id= occ.origin
INNER JOIN revision rev on occ.target = rev.id
INNER JOIN directory dir on rev.directory=dir.id
INNER JOIN directory_entry_file def on def.id = any(dir.file_entries) ;

                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=480269831.87..6321750116.95 rows=14724963464 width=178)
   Merge Cond: (o.id = ov.origin)
   ->  Gather Merge  (cost=445168861.26..7385863622.55 rows=5727713914 width=162)
         Workers Planned: 2
         ->  Nested Loop  (cost=445167861.23..6724742317.93 rows=2386547464 width=162)
               ->  Merge Join  (cost=445167860.65..449063193.42 rows=238654747 width=219)
                     Merge Cond: (occ.origin = o.id)
                     ->  Sort  (cost=431086397.16..431683034.02 rows=238654747 width=168)
                           Sort Key: occ.origin
                           ->  Hash Join  (cost=284915183.38..383893494.21 rows=238654747 width=168)
                                 Hash Cond: ((rev.directory)::bytea = (dir.id)::bytea)
                                 ->  Hash Join  (cost=56457622.79..83938311.44 rows=238654747 width=71)
                                       Hash Cond: ((occ.target)::bytea = (rev.id)::bytea)
                                       ->  Parallel Seq Scan on occurrence occ  (cost=0.00..11639112.47 rows=238654747 width=50)
                                       ->  Hash  (cost=36674526.13..36674526.13 rows=929261013 width=42)
                                             ->  Seq Scan on revision rev  (cost=0.00..36674526.13 rows=929261013 width=42)
                                 ->  Hash  (cost=120843842.93..120843842.93 rows=3577806693 width=118)
                                       ->  Seq Scan on directory dir  (cost=0.00..120843842.93 rows=3577806693 width=118)
                     ->  Sort  (cost=14081448.52..14279723.54 rows=79310008 width=51)
                           Sort Key: o.id
                           ->  Seq Scan on origin o  (cost=0.00..1739299.08 rows=79310008 width=51)
               ->  Index Scan using directory_entry_file_pkey on directory_entry_file def  (cost=0.58..26.20 rows=10 width=48)
                     Index Cond: (id = ANY (dir.file_entries))
   ->  Materialize  (cost=35097022.44..36116484.12 rows=203892336 width=16)
         ->  Sort  (cost=35097022.44..35606753.28 rows=203892336 width=16)
               Sort Key: ov.origin
               ->  Seq Scan on origin_visit ov  (cost=0.00..4467662.36 rows=203892336 width=16)

From the get go: the occurrence table is stale and should not be used anymore, please use the snapshot/snapshot_branch/snapshot_branches tables instead.

I'm still not entirely convinced that this work should be done through raw SQL queries rather than through the higher level Python APIs. This depends on the following considerations:

  • how often do you need to update the data, and for how long?
  • how many different parameters do you need to have? (pom.xml/build.gradle/...)
  • how often will the parameters change?

Please also note that the SQL schema is not set in stone and that raw queries are at the mercy of schema changes, while tools using the higher level swh.storage APIs will (mostly) keep working.

Depending on the answer to those questions you can either go ahead with SQL, or maybe try to build tools that compose on top of existing APIs to perform the same tasks in a more future-proof manner.

If pure SQL is the right answer, you'll really want to use CTEs ("Common Table Expressions", syntax WITH $label AS ($query), $label AS ($query) $final_query) to narrow the lookups down step by step. Your current queries do a full sequential scan on revision, directory and occurrence, which will take several hours and impact everything else touching the database.

The query can be broken down into a few steps:

  • find the latest visit for all origins
  • find the default branch associated with the snapshot for these visits (snapshot_branch.name = 'HEAD')
  • find the directory that this default branch points to
  • check if those directories contain the files you're looking for

I suggest that you consider storing the intermediary results as well so you can restart and slice the queries at each level, avoiding having transactions open for too long. Squinting a bit, I think those queries will mostly use the indexes that are available on the read replica, so you can work there rather than on the main database.
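
For illustration, a minimal sketch of what materialising one intermediate step could look like, assuming write access to some database (otherwise the same SELECT can be exported with \copy); the tmp_last_visited table name and the id range are made up, and the columns follow the schema used elsewhere in this task:

CREATE TABLE tmp_last_visited AS
    SELECT o.id AS origin_id, o.url, ov.snapshot_id, ov.date
    FROM origin o
    INNER JOIN origin_visit ov ON ov.origin = o.id
    WHERE o.id >= 0 AND o.id < 1000
      AND ov.visit = (SELECT MAX(visit) FROM origin_visit WHERE origin = o.id);

-- the next steps (snapshot branch, revision, directory, file lookup) can then
-- join against tmp_last_visited in separate, shorter transactions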

If you want to go ahead with the Python APIs rather than SQL, the steps will be pretty much the same. You could even mix and match explicit queries with calls to the swh.storage API.

In T975#18129, @olasd wrote:

I'm still not entirely convinced that this work should be done through raw SQL queries rather than through the higher level Python APIs. This depends on the following considerations:

  • how often do you need to update the data, and for how long?
  • how many different parameters do you need to have? (pom.xml/build.gradle/...)
  • how often will the parameters change?

We've discussed something similar with @moranegg last week, and she plans to discuss these points (to get an answer) with the CROSSMINER folks this week. I don't know if it has happened yet or not.

Depending on the answer to those questions you can either go ahead with SQL, or maybe try to build tools that compose on top of existing APIs to perform the same tasks in a more future-proof manner.

Yep, a future-proof solution would be preferable, also because it is likely we'll have to support similar kinds of queries for others in the future.
The only reason to do otherwise, IMO, is if this is really a *one shot* task. @moranegg should be able to determine whether or not this is the case this week.

Thank you both for your answers.

In T975#18129, @olasd wrote:

From the get go: the occurrence table is stale and should not be used anymore, please use the snapshot/snapshot_branch/snapshot_branches tables instead.

I wasn't sure whether we had completely migrated to snapshots, which is why I used occurrences here, but this is only a draft idea for a query.

In T975#18131, @zack wrote:
In T975#18129, @olasd wrote:

I'm still not entirely convinced that this work should be done through raw SQL queries rather than through the higher level Python APIs. This depends on the following considerations:

  • how often do you need to update the data, and for how long?
  • how many different parameters do you need to have? (pom.xml/build.gradle/...)
  • how often will the parameters change?

We've discussed something similar with @moranegg last week, and she plans to discuss these points (to get an answer) with the CROSSMINER folks this week. I don't know if it has happened yet or not.

I agree with you that it would be preferable to provide a long-term solution rather than a one-time query, and thus not to do it with SQL.
I would also rather improve and deploy my revision/content metadata indexers than work on one-time queries...
But as @zack suggested, this depends on the CROSSMINER folks; they want a dataset to work with as soon as possible.

After yesterday's conversation, the request is for a one-time dataset, filtered or unfiltered (following the details in the task description),
and that's why I checked the cost difference between the queries; but I later understood (thanks to @olasd) that I can't rely on the EXPLAIN feature.
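
For what it's worth, on a small id slice EXPLAIN ANALYZE actually executes the query and reports measured row counts and timings rather than planner estimates; a minimal sketch (the id bound is arbitrary):

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.url, ov.snapshot_id, ov.date
FROM origin o
INNER JOIN origin_visit ov ON o.id = ov.origin
WHERE o.id < 1000;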

In T975#18131, @zack wrote:
In T975#18129, @olasd wrote:

Depending on the answer to those questions you can either go ahead with SQL, or maybe try to build tools that compose on top of existing APIs to perform the same tasks in a more future-proof manner.

Yep, a future-proof solution would be preferable, also because it is likely we'll have to support similar kinds of queries for others in the future.
The only reason to do otherwise, IMO, is if this is really a *one shot* task. @moranegg should be able to determine whether or not this is the case this week.

Maybe we can break down the solution into two steps:

  1. Provide a one time dataset and determine if it's a filtered dataset
  2. Build tools for a long term solution

In T975#18129, @olasd wrote:
If pure SQL is the right answer, you'll really want to use CTEs ("Common Table Expressions", syntax WITH $label AS ($query), $label AS ($query) $final_query) to narrow the lookups down step by step. Your current queries do a full sequential scan on revision, directory and occurrence, which will take several hours and impact everything else touching the database.

The query can be broken down into a few steps:

  • find the latest visit for all origins
  • find the default branch associated with the snapshot for these visits (snapshot_branch.name = 'HEAD')
  • find the directory that this default branch points to
  • check if those directories contain the files you're looking for

I suggest that you consider storing the intermediary results as well so you can restart and slice the queries at each level, avoiding having transactions open for too long. Squinting a bit, I think those queries will mostly use the indexes that are available on the read replica, so you can work there rather than on the main database.

If you want to go ahead with the Python APIs rather than SQL, the steps will be pretty much the same. You could even mix and match explicit queries with calls to the swh.storage API.

Thanks for the precious advice.
I would love to rewrite the queries more efficiently, but first I want to be sure that the queries will actually be executed and whether we should do the filtering.

Following @olasd's suggestion, with a CTE:

WITH last_visited AS (
        SELECT o.url url,
               ov.snapshot_id snp,
               (SELECT MAX(date) FROM origin_visit
                WHERE origin= o.id) AS date
        FROM origin o
        INNER JOIN origin_visit ov on o.id = ov.origin limit 1000
     ), head_branch_revision AS (
        SELECT lv.url url, s.id snp_sha1, sb.target revision_sha1, lv.date date
        FROM last_visited lv
        INNER JOIN snapshot s on lv.snp = s.object_id
        INNER JOIN snapshot_branches sbs on  s.object_id  = sbs.snapshot_id
        INNER JOIN snapshot_branch sb on sbs.branch_id = sb.object_id
        WHERE sb.name = 'HEAD' AND sb.target_type = 'revision' limit 100
      )
SELECT hbr.url url,
       hbr.snp_sha1 snp_sha1,
       hbr.revision_sha1 revision_sha1,
       hbr.date visit_date,
       dir.id directory
FROM head_branch_revision hbr
INNER JOIN revision rev on hbr.revision_sha1 = rev.id
INNER JOIN directory dir on rev.directory = dir.id
INNER JOIN directory_entry_file def on def.id = any(dir.file_entries)
WHERE def.name='pom.xml' limit 1;

Result example:

url                                              |                  snp_sha1                  |               revision_sha1                |          visit_date           |                 directory           
----------------------------------------------------+--------------------------------------------+--------------------------------------------+-------------------------------+--------------------------------------------
https://github.com/BeyongTheMemory/pop-user-center | \x1c0adc8e2413114d064abcae1b8ddde14589e284 | \x1d35accb1b6f3e36127e54a857c3c0cc411ec29d | 2018-02-16 08:21:07.307359+00 | \x678a00df685665c442e22c698145969f223b8b3a

WITH last_visited AS (
        SELECT o.url url,
               ov.snapshot_id snp,
               (SELECT MAX(date) FROM origin_visit
                WHERE origin= o.id) AS date
        FROM origin o
        INNER JOIN origin_visit ov on o.id = ov.origin 
        WHERE o.id < 1000
     ), head_branch_revision AS (
        SELECT lv.url url, s.id snp_sha1, sb.target revision_sha1, lv.date date
        FROM last_visited lv
        INNER JOIN snapshot s on lv.snp = s.object_id
        INNER JOIN snapshot_branches sbs on  s.object_id  = sbs.snapshot_id
        INNER JOIN snapshot_branch sb on sbs.branch_id = sb.object_id
        WHERE sb.name = 'HEAD' AND sb.target_type = 'revision'
      )
SELECT hbr.url url,  dir.id directory
FROM head_branch_revision hbr
INNER JOIN revision rev on hbr.revision_sha1 = rev.id
INNER JOIN directory dir on rev.directory = dir.id
INNER JOIN directory_entry_file def on def.id = any(dir.file_entries)
WHERE def.name='pom.xml';

This has better explain output (see below):

WITH last_visited AS (
        SELECT o.url url,
               ov.snapshot_id snp,
               (SELECT MAX(date) FROM origin_visit
                WHERE origin= o.id) AS date
        FROM origin o
        INNER JOIN origin_visit ov on o.id = ov.origin 
        WHERE 0 <= o.id and o.id < 1000 order by o.id limit 1000  -- order by + limit mentioned below [1]
     ), head_branch_revision AS (
        SELECT lv.url url, s.id snp_sha1, sb.target revision_sha1, lv.date date
        FROM last_visited lv
        INNER JOIN snapshot s on lv.snp = s.object_id
        INNER JOIN snapshot_branches sbs on  s.object_id  = sbs.snapshot_id
        INNER JOIN snapshot_branch sb on sbs.branch_id = sb.object_id
        WHERE sb.name = 'HEAD' AND sb.target_type = 'revision'  -- an index on snapshot_branch(name, target_type) was created temporarily to check 
                                                                -- if it's beneficial (or not)
      )
SELECT hbr.url url,  dir.id directory
FROM head_branch_revision hbr
INNER JOIN revision rev on hbr.revision_sha1 = rev.id
INNER JOIN directory dir on rev.directory = dir.id
INNER JOIN directory_entry_file def on def.id = any(dir.file_entries)
WHERE def.name='pom.xml';
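
For reference, a sketch of the temporary index mentioned in the comment above (the index name is an assumption; CONCURRENTLY avoids blocking concurrent writes):

CREATE INDEX CONCURRENTLY snapshot_branch_name_target_type_idx
    ON snapshot_branch (name, target_type);

-- and once the measurements are done:
DROP INDEX CONCURRENTLY snapshot_branch_name_target_type_idx;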

Note that my understanding is that the query is run in a loop in charge of moving the range forward: [0, 1000), [1000, 2000), and so on.
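
For illustration, that loop could be driven from psql with variables for the range bounds, e.g. psql -v lo=0 -v hi=1000 -f slice.sql, then -v lo=1000 -v hi=2000, and so on (the variable and file names are assumptions); the body of the first CTE would then become:

SELECT o.url url,
       ov.snapshot_id snp,
       (SELECT MAX(date) FROM origin_visit
        WHERE origin = o.id) AS date
FROM origin o
INNER JOIN origin_visit ov ON o.id = ov.origin
WHERE :lo <= o.id AND o.id < :hi
ORDER BY o.id LIMIT 1000;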

Here are the main costs from the explain output for the different forms:

1000:
cost=6028594.48..104650980.19 (order by + limit as demonstrated [1])
cost=4967360.59..103590488.28 (order by + limit + new index that avoids the seq scan on snapshot_branch) -> data seems to come back sooner-ish with the new index
cost=7794867.38.. 253557652.74 (basis)

10000:
cost=76940605.74..943933501.12 (order by + limit)
cost=76321609.50..957651428.79 (order by + limit + new index that avoids the seq scan on snapshot_branch) -> that new index seems not that interesting here
cost=93342140.01..2252940107.13 (basis)

I just noticed a potential error that explains an inconsistency.

You retrieve all the visits for each origin (overriding the date with the origin's last visit date).
I think you meant to retrieve only the last visit.
So here is such a query (you can now take either 1000 or 10000 as you like; the explain output is consistent, ratio (10x) included) [1]:

WITH last_visited AS (
        SELECT o.url url,
               ov.snapshot_id snp, date
        FROM origin o
        INNER JOIN origin_visit ov on o.id = ov.origin
        WHERE 0 <= o.id AND
              o.id < 1000 AND
              ov.visit = (select max(visit) FROM origin_visit
                          where origin=o.id)
        order by o.id limit 1000  -- order by + limit mentioned below [1]
     ), head_branch_revision AS (
        SELECT lv.url url, s.id snp_sha1, sb.target revision_sha1, lv.date date
        FROM last_visited lv
        INNER JOIN snapshot s on lv.snp = s.object_id
        INNER JOIN snapshot_branches sbs on  s.object_id  = sbs.snapshot_id
        INNER JOIN snapshot_branch sb on sbs.branch_id = sb.object_id
        WHERE sb.name = 'HEAD' AND sb.target_type = 'revision'  -- an index on snapshot_branch(name, target_type) was created temporarily to check 
                                                                -- if it's beneficial (or not)
      )
SELECT hbr.url url,  dir.id directory
FROM head_branch_revision hbr
INNER JOIN revision rev on hbr.revision_sha1 = rev.id
INNER JOIN directory dir on rev.directory = dir.id
INNER JOIN directory_entry_file def on def.id = any(dir.file_entries)
WHERE def.name='pom.xml';

[1] explain output cost:

  • 1000: (cost=102001.92..1284496.76 rows=1293 width=53)
  • 10000: (cost=995800.31..12508842.89 rows=12606 width=53)
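
As a side note, the "only the last visit" step can also be expressed with PostgreSQL's DISTINCT ON instead of a correlated max(visit) subquery; a minimal sketch for the same id slice (not benchmarked here):

SELECT DISTINCT ON (ov.origin)
       o.url url,
       ov.snapshot_id snp,
       ov.date AS date
FROM origin o
INNER JOIN origin_visit ov ON o.id = ov.origin
WHERE 0 <= o.id AND o.id < 1000
ORDER BY ov.origin, ov.visit DESC;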