
Database replication lag keeps growing on somerset
Closed, Resolved (Public)

Description

It seems the softwareheritage database replication from belvedere to somerset has not been working for at least one week.

This is what we get when we count the number of origins on belvedere:

antoine@guggenheim:~$ psql service=swh
psql (11.5 (Debian 11.5-1+deb10u1), server 11.3 (Debian 11.3-1.pgdg90+1))
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-CHACHA20-POLY1305, bits: 256, compression: off)
Type "help" for help.

softwareheritage=> select count(*) from origin;
  count   
----------
 90429442
(1 row)

While the same query on somerset returns the following:

antoine@guggenheim:~$ psql service=swh-replica
psql (11.5 (Debian 11.5-1+deb10u1), server 11.3 (Debian 11.3-1.pgdg90+1))
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

softwareheritage=> select count(*) from origin;
  count   
----------
 90233725
(1 row)

So the replica is currently missing 195717 origins. This number keeps growing: yesterday it was 171809.
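
As a rough sketch of the arithmetic above, the gap and its day-over-day growth can be computed as follows. The function name `replication_gap` is hypothetical, and the counts are simply the ones quoted in this report.

```python
def replication_gap(primary_count: int, replica_count: int) -> int:
    """Rows present on the primary (belvedere) but not yet on the replica (somerset)."""
    return primary_count - replica_count


# Counts copied from the two `select count(*) from origin` results above.
gap_today = replication_gap(90429442, 90233725)

# Gap reported for the previous day in this task description.
gap_yesterday = 171809

# A positive value means the replica is falling further behind.
growth = gap_today - gap_yesterday

print(gap_today, growth)
```

A growth that stays positive from one day to the next suggests the replication is stalled rather than merely slow.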

This lack of replication impacts the Software Heritage web application, as it uses the database hosted on somerset.
For instance, all 'Save code now' requests submitted over the past week are still marked as scheduled even though they were executed successfully.
Because the newly ingested origins are missing from the replica database, no visit date can be found for them, hence the erroneous
reported status.

Event Timeline

anlambert triaged this task as High priority. Sep 25 2019, 1:37 PM
anlambert created this task.
ardumont added subscribers: douardda, olasd, ardumont. Edited Sep 27 2019, 10:08 AM

Yes, there is a replication lag.

@douardda and I started investigating yesterday around noon.
It turned out that at least one WAL file was missing.

Looking into it further with @olasd in the afternoon, we found that around 12 WAL files from belvedere were missing,
which made the logical replication fail.

@olasd showed me that we can use our backup system (barman) to restore the missing WAL files. [1]
So we did. I don't remember the exact command, but it was something along the lines of:

ssh barman@banco.i.s.o barman get-wal $server_name swh-11 > $wal_name

The idea, roughly:

  • ask barman to restore the missing WAL files (their names can be found in the postgres log files on either somerset or belvedere)
  • either wait for postgres to notice that the WAL files are no longer missing, or force the postgres service to restart (which is what we did).
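
The first step above could be sketched as follows. This is only an illustration under stated assumptions, not the commands we actually ran: it assumes the postgres log reports missing segments with the usual "requested WAL segment ... has already been removed" message, that `barman get-wal` takes the server name followed by the WAL name, and that `swh-11` is the barman server name. The helper names and the sample segment name are made up.

```python
import re
import shlex

# Assumed log message format for a WAL segment that is no longer available.
WAL_MISSING = re.compile(
    r"requested WAL segment ([0-9A-F]{24}) has already been removed"
)


def missing_wal_segments(log_lines):
    """Return the distinct WAL segment names reported missing, in first-seen order."""
    segments = []
    for line in log_lines:
        match = WAL_MISSING.search(line)
        if match and match.group(1) not in segments:
            segments.append(match.group(1))
    return segments


def barman_commands(segments, server="swh-11", host="barman@banco.i.s.o"):
    """Build (but do not run) one restore command per missing segment.

    Assumption: `barman get-wal SERVER_NAME WAL_NAME` writes the WAL to stdout.
    """
    return [
        f"ssh {host} barman get-wal {shlex.quote(server)} {shlex.quote(seg)}"
        f" > {shlex.quote(seg)}"
        for seg in segments
    ]


# Made-up log excerpt; the segment name is purely illustrative.
sample_log = [
    "ERROR:  requested WAL segment 000000010000ABCD000000EF has already been removed",
    "LOG:  unrelated line",
    "ERROR:  requested WAL segment 000000010000ABCD000000EF has already been removed",
]

for command in barman_commands(missing_wal_segments(sample_log)):
    print(command)
```

The commands are only printed so that they can be reviewed before anything is copied into the postgres WAL directory.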

In any case, the gap is shrinking:

90466174 - 90303185
162989

[1] Logical replication in PostgreSQL 11 actually uses streaming replication underneath (hence the WAL files).

By the way, we still don't really know why the WAL files went missing.

All I can say is that it is a priori not a disk issue, as there is enough space on both belvedere and somerset (and there already was before we unstuck the replication).

ardumont closed this task as Resolved. Edited Sep 29 2019, 4:04 PM
ardumont claimed this task.

The replication gap is now below 100 origins.

$ bc 
90511267 - 90511175  # main - replica
92

So this can be closed now.