
Migrate origin_visit_status records to add the type value
Closed, Migrated

Description

Once the new scheduler pattern is deployed, we'll need to align the origin_visit_status data records so they actually have their type filled in.

That means:

  • modify the schema to add the visit_type column (see the sketch after this list)
  • write a sql file to migrate the data from origin_visit to origin_visit_status
  • check the impact on storage clients (i.e. loaders) (already dealt with, as it turns out)
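
A minimal sketch of the schema change (hypothetical: the actual swh-storage upgrade script may differ; note the column ends up simply named type, as the queries below show):

#+begin_src sql
-- hypothetical sketch: add a nullable type column so that existing rows
-- can be backfilled incrementally afterwards
alter table origin_visit_status add column type text;
#+end_src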

Data to migrate: 1239705800 rows out of 1259171054.

Production status (mirror):

#+begin_src sql
softwareheritage=> select now(), count(*) from origin_visit_status;
              now              |   count
-------------------------------+------------
 2021-02-02 09:47:15.904791+00 | 1259171054
(1 row)

softwareheritage=> select now(), count(*) from origin_visit_status where type is null;
              now              |   count
-------------------------------+------------
 2021-02-02 09:53:08.903614+00 | 1239705800
(1 row)
#+end_src
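
For reference, that is roughly 98.5% of the table still lacking a type (my arithmetic from the counts above):

#+begin_src sql
-- fraction of origin_visit_status rows still to backfill
select round(1239705800::numeric / 1259171054 * 100, 1);  -- 98.5 %
#+end_src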

Deliverables: migrate both:

  • staging db (as it will be used to exercise the migration script and it must be up-to-date as well)
  • checks
  • production db (details below)
  • T2968#58099: checks

Event Timeline

ardumont triaged this task as Normal priority. Jan 13 2021, 9:58 AM
ardumont created this task.
ardumont updated the task description.
ardumont added a subscriber: vsellier.
ardumont changed the task status from Open to Work in Progress. Feb 2 2021, 10:26 AM
ardumont moved this task from Backlog to Weekly backlog on the System administration board.
ardumont moved this task from Weekly backlog to in-progress on the System administration board.

The naive approach is:

explain update origin_visit_status as ovs
set type=ov.type
from origin_visit ov
where ov.visit=ovs.visit and ov.origin=ovs.origin
and ovs.type is null;

But that will take too much time in one big transaction, so we need to split it up and run it in an incremental fashion.

Either by looping over ranges of origin (which might be more or less fast, depending on the number of visits per origin, and that can vary a bit):

explain update origin_visit_status as ovs
set type=ov.type
from origin_visit ov
where ov.visit=ovs.visit and ov.origin=ovs.origin
and ovs.type is null
and 0 <= ov.origin and ov.origin < 10000;

or using the following query, which makes progress only as long as there are rows left to update (loop it until it reports UPDATE 0):

explain with selected_origin as (
   select ov.origin, ov.visit, ov.type
   from origin_visit_status ovs
      inner join origin_visit ov using (origin, visit)
   where ovs.type is null
   limit 100000
)
update origin_visit_status as ovs
set type=s.type
from selected_origin s
where ovs.origin=s.origin and ovs.visit=s.visit;

Status on staging:

swh=> select ov.type, count(*)
     from origin_visit_status ovs
        inner join origin_visit ov using (origin, visit)
     where ovs.type is null
     group by ov.type;
  type   |  count
---------+----------
 deb     |  2901943
 deposit |       58
 git     |   118004
 hg      |        2
 nixguix |        3
 npm     |   165363
 pypi    | 12829210
 svn     |        1
(8 rows)

After some runs of the 2nd query:

  type   |  count
---------+----------
 deb     |  2877588
 deposit |       58
 git     |   117943
 npm     |   165363
 pypi    | 12828630
 svn     |        1

  type   |  count
---------+----------
 deb     |  2367588
 deposit |       58
 git     |   117943
 npm     |   165363
 pypi    | 12828630
 svn     |        1
(6 rows)

This looks like it distributes well over the 'type' values.

After adding some timing and trying out batch sizes, batches of 100000 origin visits
(with the selected_origin CTE) seem fast enough:

|------------+------------------------|
| batch size | timing (\timing) in ms |
|------------+------------------------|
|       1000 | ~600                   |
|      10000 | ~1100                  |
|     100000 | ~1600                  |
|------------+------------------------|
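
A back-of-the-envelope throughput comparison from the timings above (my arithmetic, not measured separately) shows that larger batches amortize the per-query overhead:

select round(1000 / 0.6)   as rows_per_s_at_1k,    -- ~1667
       round(10000 / 1.1)  as rows_per_s_at_10k,   -- ~9091
       round(100000 / 1.6) as rows_per_s_at_100k;  -- 62500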

And most importantly, after some runs on staging, everything seems to stay consistent:

swh=> select ov.origin, ov.visit, ovs.type, ov.type from origin_visit_status ovs inner join origin_visit ov using (origin, visit) where ovs.type is not null and ovs.type != ov.type limit 10;
 origin | visit | type | type
--------+-------+------+------
(0 rows)

Projecting this, if no blocking transactions get in the way, the migration should take roughly 6 hours:

|--------------+------------+--------------------|
| environment  |       rows |           time (s) |
|--------------+------------+--------------------|
| (one batch)  |     100000 |                1.6 |
| prod         | 1239705800 |         19835.2928 |
| staging      |   15477583 | 247.64132800000002 |
|--------------+------------+--------------------|
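
Spelling out the projection, assuming a constant ~1.6 s per 100000-row batch (which, as it turns out below, does not hold):

select 1239705800 / 100000.0 * 1.6 as prod_seconds,     -- 19835.29 s, ~5.5 h
       15477583   / 100000.0 * 1.6 as staging_seconds;  -- 247.64 s, ~4 min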

Script resulting from the prior analysis:

#!/usr/bin/env bash

# set -x
set -e

SQL_FILE=/tmp/t2968-migrate-ovs.sql
LOG_FILE=/tmp/t2968-migrate-ovs.log
LIMIT=100000
cat > $SQL_FILE <<EOF
  \timing

  with selected_origin as (
     select ov.origin, ov.visit, ov.type
     from origin_visit_status ovs
        inner join origin_visit ov using (origin, visit)
     where ovs.type is null
     limit ${LIMIT}
  )
  update origin_visit_status as ovs
  set type=s.type
  from selected_origin s
  where ovs.origin=s.origin and ovs.visit=s.visit;
EOF

while true; do
    psql service=admin-staging-swh -f $SQL_FILE | tee $LOG_FILE
    grep -q "UPDATE 0" $LOG_FILE && break
done

Currently running on the staging db.

The run completed in ~20 min on the staging db [1].

The select subquery that determines the data to migrate takes more time on each loop
iteration (see [1] for the ever-increasing read time).

So the initial 5 min projection did not take this into account.

Nonetheless, everything is now migrated in staging without issues [2].

There may still be room to improve that query so it runs more smoothly in production,
but that should not block starting the migration this way first.

[1] P935

[2]

swh=> select ov.type, count(*)
     from origin_visit_status ovs
        inner join origin_visit ov using (origin, visit)
     where ovs.type is null
     group by ov.type;
 type | count
------+-------
(0 rows)

swh=> select ov.origin, ov.visit, ovs.type, ov.type from origin_visit_status ovs inner join origin_visit ov using (origin, visit) where ovs.type is not null and ovs.type != ov.type limit 10;
 origin | visit | type | type
--------+-------+------+------
(0 rows)

Improved version (moving the join out of the select subquery and into the update part of the query):

#!/usr/bin/env bash

# set -x
set -e

SQL_FILE=/tmp/t2968-migrate-ovs.sql
LOG_FILE=/tmp/t2968-migrate-ovs.log

cat > $SQL_FILE <<EOF
  \timing

  with selected_origin as (
    select distinct origin, visit
    from origin_visit_status
    where type is null
    limit 100000
  )
  update origin_visit_status as ovs
  set type=ov.type
  from selected_origin s
  inner join origin_visit ov on s.origin=ov.origin and s.visit=ov.visit
  where ovs.origin=s.origin and ovs.visit=s.visit;

EOF

while true; do
    psql service=admin-swh -f $SQL_FILE | tee $LOG_FILE
    grep -q "UPDATE 0" $LOG_FILE && break
done

(a slight improvement, from the then-current ~30s down to ~15s)

So the "improved query keeps on taking more time from 15s initially to cranking up
around 2 min now (after running from yesterday up to now).

Timing is on.
UPDATE 102309
Time: 15301.318 ms (00:15.301)
Timing is on.
UPDATE 103198
Time: 15691.569 ms (00:15.692)
Timing is on.
UPDATE 108097
Time: 15471.194 ms (00:15.471)
Timing is on.
UPDATE 108052
Time: 15596.098 ms (00:15.596)
...
... night ...
...
Timing is on.
UPDATE 107813
Time: 146027.771 ms (02:26.028)
Timing is on.
UPDATE 107355
Time: 144460.339 ms (02:24.460)
Timing is on.
UPDATE 108683
Time: 138898.824 ms (02:18.899)
Timing is on.
UPDATE 107206
Time: 143704.396 ms (02:23.704)
Timing is on.
UPDATE 108893
Time: 139246.971 ms (02:19.247)

About 200M rows have been migrated so far.

So, looking back into the first approach (from early in this ticket) and testing it:

update origin_visit_status as ovs
set type=ov.type
from origin_visit ov
where ov.visit=ovs.visit and ov.origin=ovs.origin
and ovs.type is null
-- and 190000 <= ov.origin and ov.origin < 200000;  -- staging
and 100000000 <= ov.origin and ov.origin < 100010000;  -- prod

In a transaction, this was almost instantaneous:

...
  UPDATE 10002
  Time: 184.027 ms

So on to some adaptations:

#!/usr/bin/env bash

# select min(origin) from origin_visit_status where type is null;
first=23000000

SQL_FILE=/tmp/t2968-migrate-ovs-v2.sql
LOG_FILE=/tmp/t2968-migrate-ovs-v2.log

inc=100000
# select max(origin) from origin_visit_status where type is null;
FULL_STOP=151900000

start=$first
while true; do
    end=$(( $start + $inc ))

    echo "### range: [$start, $end]"
    cat > $SQL_FILE <<EOF
\timing

update origin_visit_status as ovs
set type=ov.type
from origin_visit ov
where ov.visit=ovs.visit and ov.origin=ovs.origin
and ovs.type is null
and $start <= ov.origin and ov.origin < $end;

EOF
    start=$(( $start + $inc ))

    psql service=admin-swh -f $SQL_FILE | tee $LOG_FILE

    # no more origins past $FULL_STOP, so stop
    if [ $start -gt $FULL_STOP ]; then
        break
    fi
done
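
For scale, with the bounds above the loop issues on the order of 1290 range updates of 100000 origins each (my arithmetic from the first/FULL_STOP values in the script):

select (151900000 - 23000000) / 100000;  -- 1289 ranges, plus one final range past FULL_STOP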

It's currently running and it's migrating faster now:

...
### range: [29700000, 29800000]
Timing is on.
UPDATE 842734
Time: 12607.146 ms (00:12.607)
### range: [29800000, 29900000]
Timing is on.
UPDATE 704757
Time: 11942.186 ms (00:11.942)
### range: [29900000, 30000000]
Timing is on.
UPDATE 787165
Time: 48050.921 ms (00:48.051)
### range: [30000000, 30100000]
Timing is on.

tl;dr: ETA: 12 hours [3]

Status: it's steadily updating origin_visit_status.
Now we should be able to extrapolate an ETA.

Snapshot of the remaining rows with no type:

              now              |   count
-------------------------------+-----------
 2021-02-03 13:05:48.849344+00 | 820438405
 2021-02-03 14:56:13.873901+00 | 710952313

Delta: 109486092 origin-visit-status rows updated in 111 min

using: select now(), count(*) from origin_visit_status where type is null; (on the replica)

As expected, some ranges contain a lot of data to update, which takes more time.
Still, the update writes roughly ~1M rows/min. [2]

ETA: 12 hours [3]

[1] https://grafana.softwareheritage.org/goto/py1X3EYGk

[2]

(let ((status_t0 820438405)
      (status_t1 710952313)
      (delta_t_minutes 111))
    (/ (- status_t0 status_t1) delta_t_minutes)) ; 986361 / min

[3]

(let ((speed (/ (- 820438405 710952313) 111))
      (remaining 710952313))
  (/ (/ remaining speed) 60)) ;; 12 hours

Migration ran to the end:

### range: [151800000, 151900000]
Timing is on.
UPDATE 0
Time: 3085.767 ms (00:03.086)
### range: [151900000, 152000000]
Timing is on.
UPDATE 0
Time: 6.366 ms

And the replication caught up: we have no more visit statuses without a type:

softwareheritage=> \conninfo
You are connected to database "softwareheritage" as user "guest" on host "somerset.internal.softwareheritage.org" (address "192.168.100.103") at port "5432".
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)

softwareheritage=> select now(), count(*) from origin_visit_status where type is null;
              now              |  count
-------------------------------+---------
 2021-02-04 07:11:23.214145+00 | 7702062
(1 row)

softwareheritage=> select now(), count(*) from origin_visit_status where type is null;
              now              | count
-------------------------------+-------
 2021-02-04 07:51:41.390543+00 |     0
(1 row)
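
One generic way to confirm the replica has caught up (a standard streaming-replication check, not from the ticket):

-- on the standby: how far behind the primary the WAL replay is
select now() - pg_last_xact_replay_timestamp() as replication_lag;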

A final check is running to ensure there is no discrepancy between origin_visit and
origin_visit_status (this check was already run once in a while during the migration,
with no discrepancy so far).

softwareheritage=> select now(), count(*) from origin_visit_status ovs inner join origin_visit ov using (origin, visit)
 where ovs.type != ov.type;
              now              | count
-------------------------------+-------
 2021-02-04 08:50:19.833201+00 |     0
(1 row)
ardumont claimed this task.
ardumont moved this task from deployed/landed/monitoring to done on the System administration board.