Details

Rather than depend on the counter table, which is only updated every so often and which doesn't account for potential holes in origin ids, we should use something more reliable to get a random row like the SQL 2003 tablesample clause implemented in PostgreSQL.

I also think that we should directly look for visits, as we'll want to filter on data from these (notably to get the visit type information, as well as whether the origin was "recently successful") anyway.

Something along the lines of:

select
  origin.url,
  origin_visit.*
 from origin_visit tablesample bernoulli (0.1)
 inner join origin 
   on (origin_visit.origin = origin.id)
 where 
   type='git'
   and status='full'
   and date > now() - '3 months'::interval
 limit 1;

tablesample bernoulli (0.1) will do a sequential scan of the table and return 0.1% of the rows, which I think is a decent compromise between sampling and speed (of course, the limit shortcuts the query as soon as a result is found).

This revision now requires changes to proceed.Dec 6 2019, 9:15 PM

Agreed.
I'll see to the changes.

Thanks for the reasoning behind ;)

tablesample bernoulli (0.1) will do a sequential scan of the table and return 0.1% of the rows, which I think is a decent compromise between sampling and speed (of course, the limit shortcuts the query as soon as a result is found).

Playing a bit with this, i noticed that the first query is actually slow (order of minutes to answer something on somerset for example).
The next queries (for the same type and same interval) are instantaneous (in comparison).

And for some origin type, that returns nothing (type svn for example). Which i did not account for here.
And it's always slow (worse case scenario on a seq scan).

for information, I'm currently fighting the tests now ;)

As the filtering is done after the sampling [1], i'm trying to find the right way to add enough data for tests to go green (and not slow too much things down).
I'll update the diff when i have reproducible green tests.

[1] article's note:

One important thing to consider is that any conditionals in the WHERE clause will be applied after the sampling. Which means that the requested number of rows will be picked first and then be filtered using the conditionals expressed in the WHERE clause.

storage: Prefer sampling to counter table

ardumont edited the summary of this revision. (Show Details)Dec 9 2019, 1:48 PM

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/817/ for more details.

Harbormaster completed remote builds in B9604: Diff 8520.Dec 9 2019, 1:49 PM

Improve sentence phrasing

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/818/ for more details.

Harbormaster completed remote builds in B9605: Diff 8521.Dec 9 2019, 1:57 PM

ardumont mentioned this in T2120: Loaders: standalone task tests.Dec 10 2019, 10:15 AM

ardumont added inline comments.Dec 10 2019, 11:34 AM

swh/storage/db.py
664 ↗	(On Diff #8521)	I have kept the 3 months interval as is. Tell me if you prefer to have this as a parameter instead. I thought it ok to leave this as is as this is mostly opened for test purposes.

In D2404#56910, @ardumont wrote:

tablesample bernoulli (0.1) will do a sequential scan of the table and return 0.1% of the rows, which I think is a decent compromise between sampling and speed (of course, the limit shortcuts the query as soon as a result is found).

Playing a bit with this, i noticed that the first query is actually slow (order of minutes to answer something on somerset for example).
The next queries (for the same type and same interval) are instantaneous (in comparison).

And for some origin type, that returns nothing (type svn for example). Which i did not account for here.
And it's always slow (worse case scenario on a seq scan).

Did we really do any svn loads in the last three months? Maybe there's really nothing to show. In that case, the query _will_ do a sequential scan of the whole table.

What we could do is:

add a btree index on the tuple (origin_visit.type, origin_visit.status, origin_visit.date)
do a CTE to scan the table for the given origin type, status and date interval (which should end up being an index scan)
do the bernoulli sampling on the CTE

This should allievate the speed concerns, and increase the probability of getting a result.

Did we really do any svn loads in the last three months? Maybe there's really nothing to show. In that case, the query _will_ do a sequential scan of the whole table.

Yes, that's my understanding.

Just saying that initially, i thought it would always return something.
I was just wrong :)

add a btree index on the tuple (origin_visit.type, origin_visit.status, origin_visit.date)
do a CTE to scan the table for the given origin type, status and date interval (which should end up being an index scan)
do the bernoulli sampling on the CTE
This should allievate the speed concerns, and increase the probability of getting a result.

Indeed. Sounds like a good plan.
I will adapt accordingly.

Would have been nice.
Tablesample only works on table and materialized views.

explain with visits as (
  select origin.url, origin_visit.*
  from origin_visit
  inner join origin
  on origin_visit.origin = origin.id
  where origin_visit.status='full' and
        origin_visit.type='deb' and
        origin_visit.date > now() - '3 months'::interval
)
select *
from visits tablesample bernoulli(1)
limit 1;
ERROR:  TABLESAMPLE clause can only be applied to tables and materialized views
LINE 11: from visits tablesample bernoulli(1)
              ^

(tested on staging)

In any case, the btree index can actually help the current query anyway (a tad late though).

materialized views.

We could implement a materialized view on a sliding window of the last 3 months visits.
And take a sample out of it.

pros:

simpler? (i'm not that convinced ;)

cons:

"more" maintenance (when and how to refresh it ~> periodic cron every night?)
less parametric (current code is not, but we could adapt the current code to parametrize the current 3 months interval)

In D2404#57201, @ardumont wrote:

materialized views.

We could implement a materialized view on a sliding window of the last 3 months visits.
And take a sample out of it.

pros:

simpler? (i'm not that convinced ;)

cons:

"more" maintenance (when and how to refresh it ~> periodic cron every night?)

less parametric (current code is not, but we could adapt the current code to parametrize the current 3 months interval)