
Update existing contents with new hash blake2s256
Closed, Migrated (Edits Locked)

Description

Leveraging azure infrastructure, trigger the blake2s256 update on the existing contents.

This means:

  • Provisioning azure VMs (sizing: DS2_V2, 7GB RAM, 14GB SSD disk, 2 cores; €85.33/month); 2 VMs for now
  • code: make the configuration composable for storage reads/writes and adapt the objstorage reads
  • puppet: puppetize swh_indexer_rehash
  • Deploying the swh.indexer.rehash module (+ fixing bits and pieces along the way)
  • Computing the list of sha1s to rehash from the swh content table (IN PROGRESS, in uffizi:/srv/storage/space/lists/contents-sha1-to-rehash.txt.gz); see the listing sketch after this list
  • Sending all contents to the swh_indexer_rehash queue
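
For reference, producing that list amounts to something along these lines (a sketch only; the DSN and output path are placeholders, not the exact command run on uffizi):

import gzip
import psycopg2

# Placeholder DSN; the real listing was produced against the softwareheritage db.
db = psycopg2.connect("dbname=softwareheritage")
# Server-side (named) cursor so rows are streamed instead of loaded in RAM.
with db, db.cursor(name="to_rehash") as cur, \
        gzip.open("contents-sha1-to-rehash.txt.gz", "wt") as out:
    cur.execute("SELECT encode(sha1, 'hex') FROM content WHERE blake2s256 IS NULL")
    for (sha1,) in cur:
        out.write(sha1 + "\n")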

Note:
Regarding the storage stack to use, we can:

  • either use azure's objstorage (the copy is 'complete', as in the snapshot copy); this will be the starting point,
  • or, if the cost projection is too high, use uffizi's (or banco's) objstorage, since inbound transfers to azure cost nothing,
  • or use a multiplexer objstorage with azure as the initial objstorage, falling back to banco if the object is not found there, and then to uffizi (the solution used; sketched below).
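
In configuration terms, the multiplexer option boils down to something like the sketch below (key names, ports and URLs are illustrative, not the exact swh.objstorage configuration schema of the time; only the fallback order matters):

# Illustrative objstorage configuration for the rehash workers:
# read from azure first, then banco, then uffizi.
objstorage_config = {
    "cls": "multiplexer",
    "args": {
        "objstorages": [
            {"cls": "azure", "args": {}},                               # primary: the azure copy (credentials omitted)
            {"cls": "remote", "args": {"url": "http://banco:5003/"}},   # fallback if the object is not found
            {"cls": "remote", "args": {"url": "http://uffizi:5003/"}},  # last resort
        ]
    },
}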


Event Timeline

ardumont updated the task description.

3262961641 contents sent in batches of 1000, i.e. ~3.26 million messages in the swh_indexer_content_rehash queue.

2 workers: worker0[1-2].euwest.azure.internal.softwareheritage.org

gzip -dc /srv/storage/space/lists/contents-sha1-to-rehash.txt.gz | SWH_WORKER_INSTANCE=swh_indexer_rehash python3 -m swh.indexer.producer --batch 1000 --task-name rehash --dict-with-key sha1

Starting date: Thu Apr 27 18:26:55 CEST 2017

So, it turns out that sending all contents to rehash in one shot was dumb...
It cluttered the rabbitmq machine's disk (saatchi).

So, after cleaning everything up and giving it some more thought, we:

  • Reworked swh.indexer.rehash to read the raw content from the objstorage only when needed (it can be pricey depending on the configuration), that is, either when an option flag explicitly forces it (not right now) or when the fields to compute are not filled in. This makes the job idempotent, which is good since some contents have already been processed (prior to the incident) and we do not want to read them again; the logic is sketched below.
  • Send only batches of hashes, split on the first character of the hash. Currently, only the 0-prefixed hashes are sent.
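
The idempotence logic boils down to something like the following sketch (hypothetical names, not the actual swh.indexer.rehash code):

import hashlib

# Hash columns every content row should eventually carry.
WANTED = ("sha1", "sha1_git", "sha256", "blake2s256")

def rehash(content_row, objstorage, force=False):
    """Fill in the missing hashes of one content row, reading the raw data
    from the objstorage only when something actually needs computing."""
    missing = [h for h in WANTED if not content_row.get(h)]
    if not missing and not force:
        # Already fully hashed (e.g. processed before the incident):
        # skip without touching the objstorage, making the job idempotent.
        return content_row
    data = objstorage.get(content_row["sha1"])  # the expensive read
    if force or not content_row.get("blake2s256"):
        content_row["blake2s256"] = hashlib.blake2s(data, digest_size=32).digest()
    # ... same pattern for the other missing hashes ...
    return content_row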

Command used:

gzip -dc /srv/storage/space/lists/azure-rehash/0.gz | SWH_WORKER_INSTANCE=swh_indexer_rehash python3 -m swh.indexer.producer --batch 100 --task-name rehash --dict-with-key sha1

This sent ~2M jobs distributed among 4 azure machines, each working on 8 tasks in parallel.

Note:
All listing files are stored in uffizi:/srv/storage/space/lists/azure-rehash/.

ardumont changed the task status from Open to Work in Progress. May 5 2017, 2:37 PM

Current status: overall, the contents (3.6B) are mostly rehashed.
But a known issue (T760) caused some contents to be missed (around 5M).

Yesterday (14/09/2017) I rescheduled around 5M contents (4867588 to be precise), which is now done as well.

Presumably some holes remain (I saw the same error occurring during those rehash computations).
This is currently being addressed (i.e. listing + scheduling the missed contents).


There you go: 1920000 contents not rehashed.
It's currently being dealt with.

We have reached a point where the remaining contents to rehash are only stored on uffizi (not on the other mirrors, azure and banco, according to the logs).

This puts high pressure on uffizi's objstorage, to the point where the objstorage stops responding.
Uffizi itself starts hanging.

For now, I have put the remaining 500k on hold, until we find the right solution to this problem.

The possible solutions I foresee without any development are:

  • As the main purpose of the main storage/objstorage is to support writes from the loaders, the objstorage is set up with fewer workers than the storage (16 vs 96).

One possibility would be to slightly increase the number of objstorage workers and decrease the number of storage workers.

  • Another solution would be to pause the rehash computations and let the archiver fill the gap (the archiver is running). Cranking up the archiver to make it go faster might also be possible.

Another solution, requiring some development, would be:

  • Schedule the rehash once the archiver has done its copy to banco/azure. IIRC, this orchestration is already possible through the director's setup.

But it's possible that some small amount of code is needed, since I believe there is a slight discrepancy between the data coming out of the director and the data expected by the rehash job.

Good call on pausing this to avoid the uffizi hangs (assuming this was the cause).
We want the different object storage copies to converge, so I think waiting for the archiver to close the gap (possibly increasing resources to it if that helps) before restarting this is the right solution here.

I'm not clear on whether, at steady state, the archiver is currently capable of keeping the various copies aligned. But even if it is not, this specific issue "only" needs the gap to be closed on the 500k contents that are still waiting for blake2 hashing.

So, no matter what, it looks like there is a way forward here.

Schedule the rehash once the archiver has done its copy to banco/azure. IIRC, this orchestration is already possible through the director's setup.
But it's possible that some small amount of code is needed, since I believe there is a slight discrepancy between the data coming out of the director and the data expected by the rehash job.

To correct my last assertion: I checked, and such orchestration is indeed possible.
It goes through the worker (swh.archiver.worker.ArchiverToBackendWorker), not the director as I previously hinted.
Also, some small amount of work would indeed be needed on the data structures flowing between the two (the worker outputs an id, while the rehash computation expects a dict), roughly as sketched below.
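
Concretely, the adapter would be as small as this (illustrative names):

def archiver_output_to_rehash_input(content_ids):
    """Adapt the archiver worker's output (a list of content ids) to the
    one-dict-per-content shape the rehash task expects (--dict-with-key sha1)."""
    return [{"sha1": content_id} for content_id in content_ids]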

... (assuming this was the cause).

Well, starting/stopping the rehash computations leads to uffizi starting to hang (RAM full, swapping, objstorage icinga check down, etc.) / becoming well again (after some time passes).
So I think it's the cause.

We want the different object storage copies to converge, so I think waiting for the archiver to close the gap (possibly increasing resources to it if that helps) before restarting this is the right solution here.

agreed

So...

I've done several things today to try to wrap this up, and we're ever so close (3000 or so objects left).

  1. I've manually scheduled the archiver to run on the blake2-missing contents
  2. I've started looking at deploying nginx in front of the backend API servers
  3. I've queued rehashing for the objects whose archival succeeded

The (manual) nginx deployment really helps with the pipelining-related errors (T760), so I think it's worthwhile, but it doesn't help with flaws intrinsic to the archiver: the archiver is really bad at archiving big objects.

For the last stragglers, I'm using a workaround: using a local objstorage instead of the API server, and pushing the objects for archival one by one... Yes, it's as horrible as it sounds.
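
In spirit, the workaround looks like this (a sketch with illustrative objstorage objects, not the real swh.objstorage API calls):

def archive_one_by_one(obj_ids, local_objstorage, remote_objstorages):
    """Read each straggler from a local objstorage on uffizi and push it to
    the remote copies one object at a time, skipping copies already present."""
    for obj_id in obj_ids:
        data = local_objstorage.get(obj_id)
        for remote in remote_objstorages:
            if obj_id not in remote:
                remote.add(data, obj_id=obj_id)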

.....

softwareheritage=> select count(*) from content where blake2s256 is null;
 count 
-------
  2800
(1 row)

After some more manual poking, we're now in the following status:

softwareheritage=> select count(*) from content where blake2s256 is null;
 count 
-------
   873
(1 row)

softwareheritage=> select min(length) from content where blake2s256 is null;
    min    
-----------
 350253114
(1 row)

At this point, the rehash workers on azure also fail to handle the size of the object and get nuked by the OOM killer...

I'll process the final few entries with an ad-hoc script running on uffizi, so we can finally close this issue and make the blake2s256 column NOT NULL.
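
For the record, that ad-hoc computation amounts to something like the sketch below (placeholder DSN and object path layout; the point is streaming the file so the >350MB objects never sit fully in memory):

import hashlib
import psycopg2

def blake2s256_of_file(path, chunk_size=1024 * 1024):
    """Stream the file through blake2s (32-byte digest) to keep memory flat."""
    h = hashlib.blake2s(digest_size=32)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()

# Placeholder DSN and path layout; the real script ran directly on uffizi.
db = psycopg2.connect("dbname=softwareheritage")
with db, db.cursor() as cur:
    cur.execute("SELECT encode(sha1, 'hex') FROM content WHERE blake2s256 IS NULL")
    remaining = cur.fetchall()
    for (sha1_hex,) in remaining:
        digest = blake2s256_of_file("/srv/objects/" + sha1_hex)  # hypothetical layout
        cur.execute(
            "UPDATE content SET blake2s256 = %s WHERE sha1 = decode(%s, 'hex')",
            (digest, sha1_hex),
        )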

After "manual" computation of the remaining hashes:

softwareheritage=> select count(*) from content where blake2s256 is null;
 count 
-------
     0
(1 row)