
storage: Add endpoint to randomly pick an origin
Closed, Public

Authored by ardumont on Dec 6 2019, 12:52 PM.

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 9600
Build 14134: tox-on-jenkins (Jenkins)
Build 14133: arc lint + arc unit

Event Timeline

Refresh counters prior to random get on origins

Fix db storage test, it's actually working as initially expected ;)

olasd requested changes to this revision. Dec 6 2019, 9:15 PM

Rather than depend on the counter table, which is only updated every so often and which doesn't account for potential holes in origin ids, we should use something more reliable to get a random row, like the SQL:2003 TABLESAMPLE clause implemented in PostgreSQL.

I also think that we should directly look for visits, as we'll want to filter on data from these (notably to get the visit type information, as well as whether the origin was "recently successful") anyway.

Something along the lines of:

select
  origin.url,
  origin_visit.*
from origin_visit tablesample bernoulli (0.1)
inner join origin
  on origin_visit.origin = origin.id
where
  origin_visit.type = 'git'
  and origin_visit.status = 'full'
  and origin_visit.date > now() - '3 months'::interval
limit 1;

tablesample bernoulli (0.1) will do a sequential scan of the table and return 0.1% of the rows, which I think is a decent compromise between sampling and speed (of course, the limit shortcuts the query as soon as a result is found).

This revision now requires changes to proceed. Dec 6 2019, 9:15 PM

Agreed.
I'll see to the changes.

Thanks for explaining the reasoning behind it ;)

tablesample bernoulli (0.1) will do a sequential scan of the table and return 0.1% of the rows, which I think is a decent compromise between sampling and speed (of course, the limit shortcuts the query as soon as a result is found).

Playing a bit with this, I noticed that the first query is actually slow (on the order of minutes to return something on somerset, for example).
The next queries (for the same type and the same interval) are instantaneous in comparison.

For some origin types it returns nothing (type svn, for example), which I did not account for here.
And in that case it's always slow (worst-case scenario for a seq scan).

For information, I'm currently fighting the tests now ;)

As the filtering is done after the sampling [1], I'm trying to find the right way to add enough data for the tests to go green (without slowing things down too much).
I'll update the diff when I have reproducible green tests.

[1] article's note:

One important thing to consider is that any conditionals in the WHERE clause will be applied after the sampling. Which means that the requested number of rows will be picked first and then be filtered using the conditionals expressed in the WHERE clause.
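
To make this concrete, here is a quick illustrative query on the same tables as above; the sample is drawn first, so a selective filter (the svn case mentioned earlier is a good example) can easily leave nothing to return:

-- the 0.1% sample is taken from origin_visit first; the WHERE clause then
-- filters the already-sampled rows, so rare combinations (e.g. recent svn
-- visits) may well yield no row at all
select origin_visit.*
from origin_visit tablesample bernoulli (0.1)
where origin_visit.type = 'svn'
  and origin_visit.status = 'full'
  and origin_visit.date > now() - '3 months'::interval
limit 1;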
  • storage: Prefer sampling to counter table

Improve sentence phrasing

swh/storage/db.py:663

I have kept the 3-month interval as is.
Tell me if you prefer to have this as a parameter instead.
I thought it was OK to leave this as is, since this endpoint is mostly opened for test purposes.

tablesample bernoulli (0.1) will do a sequential scan of the table and return 0.1% of the rows, which I think is a decent compromise between sampling and speed (of course, the limit shortcuts the query as soon as a result is found).

Playing a bit with this, I noticed that the first query is actually slow (on the order of minutes to return something on somerset, for example).
The next queries (for the same type and the same interval) are instantaneous in comparison.

For some origin types it returns nothing (type svn, for example), which I did not account for here.
And in that case it's always slow (worst-case scenario for a seq scan).

Did we really do any svn loads in the last three months? Maybe there's really nothing to show. In that case, the query _will_ do a sequential scan of the whole table.

What we could do is:

  • add a btree index on the tuple (origin_visit.type, origin_visit.status, origin_visit.date)
  • do a CTE to scan the table for the given origin type, status and date interval (which should end up being an index scan)
  • do the bernoulli sampling on the CTE

This should alleviate the speed concerns and increase the probability of getting a result.
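
For reference, a minimal sketch of the proposed index (the index name is an assumption; the actual migration is not part of this thread):

-- btree index covering the equality filters (type, status) and the date range;
-- "concurrently" avoids locking origin_visit while the index is being built
create index concurrently origin_visit_type_status_date_idx
    on origin_visit (type, status, date);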

Did we really do any svn loads in the last three months? Maybe there's really nothing to show. In that case, the query _will_ do a sequential scan of the whole table.

Yes, that's my understanding.

Just saying that initially, I thought it would always return something.
I was just wrong :)

add a btree index on the tuple (origin_visit.type, origin_visit.status, origin_visit.date)
do a CTE to scan the table for the given origin type, status and date interval (which should end up being an index scan)
do the bernoulli sampling on the CTE
This should alleviate the speed concerns and increase the probability of getting a result.

Indeed. Sounds like a good plan.
I will adapt accordingly.

That would have been nice.
Tablesample only works on tables and materialized views.

explain with visits as (
  select origin.url, origin_visit.*
  from origin_visit
  inner join origin
  on origin_visit.origin = origin.id
  where origin_visit.status='full' and
        origin_visit.type='deb' and
        origin_visit.date > now() - '3 months'::interval
)
select *
from visits tablesample bernoulli(1)
limit 1;
ERROR:  TABLESAMPLE clause can only be applied to tables and materialized views
LINE 11: from visits tablesample bernoulli(1)
              ^

(tested on staging)

In any case, the btree index can still help the current query (a tad late, though).

materialized views.

We could implement a materialized view over a sliding window of the last 3 months of visits.
And take a sample out of it.

pros:

  • Simpler? (I'm not that convinced ;)

cons:

  • "more" maintenance (when and how to refresh it ~> periodic cron every night?)
  • less parametric (the current code is not parametric either, but it could be adapted to parametrize the 3-month interval)
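
For the sake of argument, a rough sketch of what that could look like (the view name and exact filters are assumptions, and this is not what the diff ends up implementing):

-- hypothetical materialized view over the last 3 months of full visits;
-- now() is evaluated at refresh time, so the window slides with each refresh
create materialized view recent_origin_visits as
  select origin.url, origin_visit.*
  from origin_visit
  inner join origin on origin_visit.origin = origin.id
  where origin_visit.status = 'full'
    and origin_visit.date > now() - '3 months'::interval;

-- to be refreshed periodically, e.g. from a nightly cron job
refresh materialized view recent_origin_visits;

-- sampling then works, since tablesample accepts materialized views
select * from recent_origin_visits tablesample bernoulli (0.1) limit 1;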

materialized views.

We could implement a materialized view over a sliding window of the last 3 months of visits.
And take a sample out of it.

pros:

  • Simpler? (I'm not that convinced ;)

cons:

  • "more" maintenance (when and how to refresh it ~> periodic cron every night?)
  • less parametric (the current code is not parametric either, but it could be adapted to parametrize the 3-month interval)

After discussing with @olasd, this is a bit overkill (in the context of end-to-end tests, which are themselves not triggered that often).

A mix of the CTE approach and the original code from this diff (using random()) could be enough.
Let's see ;)

Use mix of CTE (with index) and random() to make query faster

^ Actually tested on belvedere (with the index installed); it's faster on each tryout.
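
For context, a minimal sketch of what such a mix could look like, reusing the CTE from above (the authoritative query is the one in swh/storage/db.py in this diff):

-- hypothetical sketch: the CTE filter benefits from the new (type, status, date)
-- index, and random() replaces tablesample, which cannot be applied to a CTE
with visits as (
  select origin.url, origin_visit.*
  from origin_visit
  inner join origin on origin_visit.origin = origin.id
  where origin_visit.status = 'full'
    and origin_visit.type = 'git'
    and origin_visit.date > now() - '3 months'::interval
)
select *
from visits
order by random()
limit 1;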

This revision is now accepted and ready to land. Dec 10 2019, 2:44 PM