
Keep an up-to-date count of the number of objects in each archive
Closed, Migrated

Description

We need to keep a running count of the number of objects in each of our archives, and to publish it.

Scanning the 3 billion rows of the archiver table is not a reasonable option, as it takes multiple hours: we need to do something smarter.

One proposal is to keep a running count for all the objects, bucketed by the last bytes of the id, updated via a trigger on the content_archive table.

We can then get a full count cheaply by summing the few hundred thousand entries of the resulting counts table (content_archive_counts), instead of scanning content_archive itself.
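As a rough illustration of the bucketing idea, here is a minimal Python sketch that uses an in-memory dictionary as a stand-in for the counts table. The helper names (`bucket_of`, `record_copy`, `total`), the archive names, and the choice of two trailing bytes on a 20-byte id are assumptions made for the example, not details taken from the actual implementation.

```python
from collections import defaultdict

# Hypothetical stand-in for the counts table:
# (archive, bucket) -> number of objects present in that archive.
counts = defaultdict(int)


def bucket_of(content_id: bytes) -> bytes:
    """Bucket an object by the trailing bytes of its id (assumed: last 2 bytes)."""
    return content_id[-2:]


def record_copy(archive: str, content_id: bytes, delta: int = 1) -> None:
    """Mimic what a trigger would do on each change: adjust a single bucket."""
    counts[(archive, bucket_of(content_id))] += delta


def total(archive: str) -> int:
    """Full count for one archive: sum over its buckets only."""
    return sum(n for (a, _), n in counts.items() if a == archive)


# Made-up ids and archive names, for illustration only.
record_copy("archive_a", bytes.fromhex("00" * 18 + "abcd"))
record_copy("archive_a", bytes.fromhex("11" * 18 + "abcd"))
record_copy("archive_a", bytes.fromhex("22" * 18 + "0001"))
record_copy("archive_b", bytes.fromhex("00" * 18 + "abcd"))
# An object leaving an archive decrements its bucket.
record_copy("archive_a", bytes.fromhex("22" * 18 + "0001"), delta=-1)

print(total("archive_a"), total("archive_b"))
```

With two trailing bytes there are at most 65,536 buckets per archive, so each update touches one small row and the full count only has to sum a bounded number of rows, however many objects are archived.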

Event Timeline

olasd changed the task status from Open to Work in Progress. Feb 7 2017, 6:58 PM
olasd created this task.

The counting strategy has been implemented in rDSTO598114c5da.

The initial, incremental counting of the objects in each archive is running in a screen on prado.

We will then be able to add the trigger to keep the counts updated.

The initial count has been done:

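# Shell: create a named pipe to stream the export through
mkfifo /tmp/fifo

-- psql: export one (bucket, archive) row per copy marked 'present';
-- the bucket is the trailing bytes of content_id, from position 19 onward
\copy (select substring(content_id from 19) as bucket, jbe.key as archive
        from content_archive
        join lateral jsonb_each(copies) jbe on true
        where jbe.value->>'status' = 'present') to /tmp/fifo

# Python: read the pipe and count the rows for each (bucket, archive) pair
from collections import Counter

with open('/tmp/fifo', 'r') as f:
    c = Counter(tuple(l.strip().split()) for l in f)

# Write one tab-separated (archive, bucket, count) row per pair
with open('/tmp/buckets', 'w') as out:
    for (bucket, archive), count in c.items():
        print(archive, bucket, count, sep='\t', file=out)

-- psql: load the aggregated counts into the counts table
\copy content_archive_counts (archive, bucket, count) from '/tmp/buckets'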

The trigger to update the table has been enabled.
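The trigger itself is not shown in this task. As a hedged sketch of the logic such a trigger might apply, the Python function below compares the old and new values of a row's `copies` JSON and works out which archives gained or lost a 'present' copy. The function name `count_deltas`, the archive names, and the 'missing' status value are assumptions for illustration; only the 'present' status and the per-archive layout of `copies` come from the query above.

```python
def count_deltas(old_copies, new_copies):
    """Return {archive: +1 or -1} for archives whose 'present' status changed.

    old_copies is None for an insert, new_copies is None for a delete.
    """
    def present(copies):
        return {
            archive
            for archive, info in (copies or {}).items()
            if info.get("status") == "present"
        }

    before, after = present(old_copies), present(new_copies)
    deltas = {archive: 1 for archive in after - before}
    deltas.update({archive: -1 for archive in before - after})
    return deltas


# Made-up rows: archive_b moves from 'missing' to 'present'.
old = {"archive_a": {"status": "present"}, "archive_b": {"status": "missing"}}
new = {"archive_a": {"status": "present"}, "archive_b": {"status": "present"}}

print(count_deltas(old, new))   # update
print(count_deltas(None, new))  # insert
print(count_deltas(old, None))  # delete
```

Each non-zero delta would then be applied to the single counts row for that archive and the object's bucket, which keeps the per-row cost of the trigger small.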

zack added a project: Restricted Project. Feb 13 2017, 3:33 PM