Differential D2404
storage: Add endpoint to randomly pick an origin
Authored by ardumont on Dec 6 2019, 12:52 PM.

Details

Now implement a mix of a CTE (with an index) and random() to make the query even faster ;)
Related to T2120

Diff Detail
Event Timeline

Build has FAILED. Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tox/814/

Build has FAILED. Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tox/815/

Build is green.

Rather than depend on the counter table, which is only updated every so often and which doesn't account for potential holes in origin ids, we should use something more reliable to get a random row, like the SQL 2003 TABLESAMPLE clause implemented in PostgreSQL. I also think that we should directly look for visits, as we'll want to filter on data from these (notably to get the visit type information, as well as whether the origin was "recently successful") anyway. Something along the lines of:

    select origin.url, origin_visit.*
    from origin_visit tablesample bernoulli (0.1)
    inner join origin on (origin_visit.origin = origin.id)
    where type='git'
      and status='full'
      and date > now() - '3 months'::interval
    limit 1;

tablesample bernoulli (0.1) will do a sequential scan of the table and return 0.1% of the rows, which I think is a decent compromise between sampling and speed (of course, the limit shortcuts the query as soon as a result is found).
Playing a bit with this, I noticed that the first query is actually slow (on the order of minutes on somerset, for example), and for some origin types it returns nothing (type svn, for example), which I did not account for here.

For information, I'm currently fighting the tests now ;) As the filtering is done after the sampling [1], I'm trying to find the right way to add enough data for the tests to go green (without slowing things down too much).

[1] From the article's note: "One important thing to consider is that any conditionals in the WHERE clause will be applied after the sampling. Which means that the requested number of rows will be picked first and then be filtered using the conditionals expressed in the WHERE clause."

Build is green.

Build is green.
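(As an aside, the sampling-then-filtering behavior quoted in [1] can be illustrated with a query like the following; this is a hedged sketch, not part of the thread, reusing the table and column names from the queries above.)

    -- bernoulli(0.1) first draws ~0.1% of the table's rows, and only
    -- then applies the WHERE clause; if few of the sampled rows match,
    -- the query can return nothing even though matching visits exist.
    select count(*)
    from origin_visit tablesample bernoulli (0.1)
    where type = 'svn'
      and status = 'full'
      and date > now() - '3 months'::interval;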
Did we really do any svn loads in the last three months? Maybe there's really nothing to show. In that case, the query _will_ do a sequential scan of the whole table. What we could do is:

- add a btree index on the columns used for filtering (type, status, date), so matching visits can be found quickly (a sketch of such an index follows below);
- filter the visits first, and only sample among the rows that match.

This should alleviate the speed concerns, and increase the probability of getting a result.
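(A minimal sketch of such an index; the name and exact column list are assumptions based on the queries in this thread, not taken from the diff.)

    -- hypothetical index supporting the (type, status, date) filter
    create index concurrently origin_visit_type_status_date_idx
        on origin_visit using btree (type, status, date);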
Yes, that's my understanding. I'm just saying that initially, I thought it would always return something.
Indeed. Sounds like a good plan.

Would have been nice:

    explain
    with visits as (
        select origin.url, origin_visit.*
        from origin_visit
        inner join origin on origin_visit.origin = origin.id
        where origin_visit.status='full'
          and origin_visit.type='deb'
          and origin_visit.date > now() - '3 months'::interval
    )
    select *
    from visits tablesample bernoulli(1)
    limit 1;

    ERROR:  TABLESAMPLE clause can only be applied to tables and materialized views
    LINE 11: from visits tablesample bernoulli(1)
                  ^

(tested on staging)

In any case, the btree index can actually help the current query anyway (a tad late though).
We could implement a materialized view on a sliding window of the last 3 months of visits (a sketch follows this list).

pros:

- TABLESAMPLE can be applied to materialized views (per the error message above), and the filtering is paid once at refresh time, so sampling stays fast.

cons:

- the view needs to be refreshed periodically, which adds maintenance overhead and some staleness.
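(A minimal sketch of what such a materialized view could look like; the view name and refresh strategy are assumptions, not something proposed verbatim in the thread.)

    -- hypothetical materialized view over the last 3 months of full visits;
    -- now() is evaluated at creation/refresh time, so the window slides
    -- whenever the view is refreshed
    create materialized view recent_full_visits as
        select origin.url, origin_visit.*
        from origin_visit
        inner join origin on origin_visit.origin = origin.id
        where origin_visit.status = 'full'
          and origin_visit.date > now() - '3 months'::interval;

    -- periodic refresh keeps the window sliding
    refresh materialized view recent_full_visits;

    -- sampling then works, since TABLESAMPLE accepts materialized views
    select *
    from recent_full_visits tablesample bernoulli (1)
    where type = 'svn'
    limit 1;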
Discussing with @olasd: it's a bit overkill (in the context of end-to-end tests, which are themselves not triggered that often). A mix of using the CTE and the original code from this diff (using random()) could be enough.

Use a mix of the CTE (with the index) and random() to make the query faster.

^ Actually tested on belvedere (index installed); it's faster at each tryout.

Build is green.
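(The final query is not quoted in the thread; the following is a hedged sketch of a CTE-plus-random() approach along the lines described above, and the actual implementation in the diff may differ.)

    with candidates as (
        -- the btree index on (type, status, date) keeps this filter cheap
        select origin.url, origin_visit.*
        from origin_visit
        inner join origin on origin_visit.origin = origin.id
        where origin_visit.type = 'git'
          and origin_visit.status = 'full'
          and origin_visit.date > now() - '3 months'::interval
    )
    select *
    from candidates
    order by random()   -- pick one matching visit uniformly at random
    limit 1;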