
Get a better estimated object count for the archive
Closed, Migrated (Edits Locked)

Description

The munin query listing the object count in the archive returns overestimated results, as it uses PostgreSQL's estimated tuple count instead of an actual count query. This error propagates to the website, as we publish data from munin there.
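
To make the difference concrete, here is a minimal sketch comparing a pg_class-based estimate with an exact count over a psycopg2 connection; the connection string is a placeholder and the actual munin plugin query may differ:

```python
import psycopg2

# Placeholder connection string; point it at the archive database.
with psycopg2.connect("dbname=softwareheritage") as db:
    with db.cursor() as cur:
        # Catalog estimate (roughly what the munin plugin reads from pg_class):
        # cheap, but it can drift far from reality during heavy update activity.
        cur.execute("SELECT reltuples::bigint FROM pg_class WHERE relname = 'content'")
        (estimated,) = cur.fetchone()

        # Exact count: accurate, but requires a full table scan.
        cur.execute("SELECT count(*) FROM content")
        (exact,) = cur.fetchone()

        print(f"estimated: {estimated:,}  exact: {exact:,}")
```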

We're running a massive reindexation of all contents to add a new hash, and we're likely to do that again in the future. For each of those updates, PostgreSQL creates a new tuple and marks the old one as dead once the transaction completes. Dead tuples are only reclaimed when vacuum runs, which only happens every so often.

Maybe using the pg_stat_user_tables statistics view instead of pg_class would give better data.
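
A quick sketch of what querying that view could look like (same placeholder connection string; n_live_tup and n_dead_tup are themselves estimates maintained by the statistics collector):

```python
import psycopg2

# Placeholder connection string.
with psycopg2.connect("dbname=softwareheritage") as db:
    with db.cursor() as cur:
        # n_live_tup / n_dead_tup are kept by the statistics collector and
        # refreshed by (auto)vacuum and analyze, so they are estimates too.
        cur.execute("""
            SELECT n_live_tup, n_dead_tup
              FROM pg_stat_user_tables
             WHERE relname = 'content'
        """)
        live, dead = cur.fetchone()
        print(f"live: {live}  dead: {dead}")
```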

Event Timeline

And of course now there's a discrepancy between the graphs (exported from statistics on the main database) and the counter (exported from real-time statistics on the replica database, on which vacuum has never been run)...

olasd raised the priority of this task from Normal to High.

Well, it turns out that pg_stat_user_tables is pretty bad as well, just in a different way than pg_class is...

Running a full count on the content table takes around 25 minutes (on the idle replica), which is not as bad as I would have expected.

I propose the following:

  • A new table, object_counts, with the following columns: object_type, object_count, last_update;
  • A new cron job that updates the object counts regularly; to avoid impacting the database, I propose running it offset from the full database backups, e.g. on Wednesdays.

We can then update the API endpoint, and the graphs, to use those exact counts.
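
Something along these lines could implement the proposal; the DDL, the list of object types and the upsert are illustrative assumptions rather than a final design:

```python
import psycopg2

# Hypothetical DDL matching the proposed columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS object_counts (
    object_type  text PRIMARY KEY,
    object_count bigint NOT NULL,
    last_update  timestamptz NOT NULL DEFAULT now()
)
"""

# Illustrative subset of tables to count; the real cron would cover every object type.
OBJECT_TABLES = ["content", "revision", "release", "origin"]


def refresh_object_counts(dsn="dbname=softwareheritage"):
    """Recompute exact counts and store them; meant to be run from the weekly cron."""
    with psycopg2.connect(dsn) as db:
        with db.cursor() as cur:
            cur.execute(SCHEMA)
            for table in OBJECT_TABLES:
                # The slow part: a full scan per table, hence the weekly schedule.
                cur.execute("SELECT count(*) FROM %s" % table)
                (count,) = cur.fetchone()
                cur.execute(
                    """
                    INSERT INTO object_counts (object_type, object_count, last_update)
                    VALUES (%s, %s, now())
                    ON CONFLICT (object_type) DO UPDATE
                        SET object_count = EXCLUDED.object_count,
                            last_update = EXCLUDED.last_update
                    """,
                    (table, count),
                )


if __name__ == "__main__":
    refresh_object_counts()
```

The API endpoint and the graphs would then read object_counts instead of the catalog estimates.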

Sounds viable and good to me. (From what you wrote I'm assuming you plan to run the cron count on the replica also when in production.)

In T719#13453, @zack wrote:

From what you wrote I'm assuming you plan to run the cron count on the replica also when in production.

My current point of view is this: we've been bitten by inconsistencies between the primary and the replica before, so I think the counts should run on the primary and get replicated through the standard means to the replicas, even if that means stressing the primary a bit more.

That's fine. AFAIR the DB is currently not a bottleneck in the ingestion process, and we can always play with the frequency of the cron if it becomes a problem.

BTW, a related problem is how to cure the inconsistencies in the time series on the /archive page…

The content table has a nice ctime field that will allow us to regenerate historical data. I'm looking into this now.
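
A possible sketch of that ctime-based backfill: bucket contents by the day they were first seen and accumulate a running total (the daily bucketing and the output format are assumptions):

```python
import psycopg2

# Placeholder connection string; best run against the replica to keep load off the primary.
with psycopg2.connect("dbname=softwareheritage") as db:
    with db.cursor() as cur:
        # One row per day with the number of contents first seen that day.
        cur.execute("""
            SELECT date_trunc('day', ctime) AS day, count(*) AS added
              FROM content
             GROUP BY day
             ORDER BY day
        """)
        total = 0
        for day, added in cur:
            total += added
            # (day, total) pairs form the historical series to push back into munin.
            print(day.date(), total)
```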

Further updates to the time series now use the exact same data as the (not so) live counters, so they won't drift apart again.

I have now backfilled the rrd files in munin with historic data grabbed from the content table.

I did a manual run of all the crons on all machines to refresh the data cached on the website.

Our graphs are now as pretty as ever, and probably as accurate as ever as well.