FTR, the query I've used to generate the stats is:
(the encoding there is needed due to T818)
This does not reproduce on a stretch node (Python 3.5).
Ok, this is the empty file:
After adding some logging, this is the output:
After "manual" computation of the remaining hashes:
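For reference, the "manual" rehash step can be sketched like this (assuming the usual set of content hashes; the real job goes through swh.model's hashutil, so this is illustrative only):

```python
import hashlib

def rehash(data: bytes) -> dict:
    """Recompute the full set of content hashes for one blob (sketch)."""
    # sha1_git is the sha1 of the git loose-object header plus the data
    header = b"blob %d\x00" % len(data)
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(header + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "blake2s256": hashlib.blake2s(data, digest_size=32).hexdigest(),
    }
```

For an empty content, `sha1_git` comes out as git's well-known empty-blob id.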
After some more manual poking, we're now in the following status:
I've done several things today to try to wrap this up, and we're ever so close (3000 or so objects left).
Schedule the rehash once the archiver has finished its copy to banco/azure. IIRC, the orchestration is already possible through the director's setup.
But some glue code may be needed, since I believe there is a slight discrepancy between the data the director outputs and the data the rehash job expects as input.
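The glue in question could look like the following sketch. The shapes here are hypothetical (plain sha1 hex strings out of the director, batched dicts into the rehash task); this is not the actual swh-scheduler API:

```python
def adapt(sha1s, batch_size=1000):
    """Convert a stream of sha1 hex strings (hypothetical director output)
    into batches of {"sha1": ...} dicts (hypothetical rehash input)."""
    batch = []
    for sha1 in sha1s:
        batch.append({"sha1": sha1})
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the trailing partial batch
        yield batch
```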
Good call on pausing this to avoid uffizi hangs (assuming this was the cause).
We want the different object storage copies to converge, so I think waiting for the archiver to close the gap (possibly giving it more resources if that helps) before restarting this is the right solution here.
We have reached a point where the remaining contents to rehash are stored only in uffizi (not on the other mirrors, azure and banco, according to the logs).
This is currently being solved (as in listing + scheduling those missed).
Current status: overall, the contents (3.6B) are mostly rehashed.
But a known issue (T760) caused some contents (around 5M) to be missed.
Closing task due to change of plan.
- take only the first 10k of the raw contents (as a possible configuration option).
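That option could be a simple prefix read before hashing; a minimal sketch, assuming "10k" means the first 10 KiB of the raw content (the constant and function names are hypothetical, not an existing config knob):

```python
MAX_LENGTH = 10 * 1024  # hypothetical cap: hash only the first 10 KiB

def read_prefix(path, max_length=MAX_LENGTH):
    """Read at most max_length bytes of the raw content, so huge blobs
    don't dominate the rehash time (sketch)."""
    with open(path, "rb") as f:
        return f.read(max_length)
```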
I created a paste P163 instead of directly commenting the results here.
So, it turns out that sending all contents to rehash in one shot was dumb...
It cluttered the disk of the rabbitmq machine (saatchi).
3262961641 contents sent in batches of 1000, so ~3.26 billion messages in the swh_indexer_content_rehash queue.
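One way to avoid cluttering the broker next time is to bound the backlog: only publish more batches once the queue has drained below a threshold. A sketch under that assumption (the threshold, callbacks, and names are all hypothetical, not the scheduler's actual mechanism):

```python
import time

QUEUE_THRESHOLD = 100_000  # hypothetical bound on pending messages

def send_throttled(batches, queue_length, publish, threshold=QUEUE_THRESHOLD):
    """Publish batches one by one, backing off while the queue is
    above the threshold (queue_length and publish are callables)."""
    for batch in batches:
        while queue_length() >= threshold:
            time.sleep(1)  # wait for rabbitmq to catch up
        publish(batch)
```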
It remains to update the azure workers with the latest indexer.
I'm on it.