
2020-11-18 Datacenter operations in Rocquencourt
Closed, Migrated. Edits Locked.

Description

In August, we ordered a new storage enclosure from supermicro and a new server from dell, to extend the main storage and to replace the server hosting it. After a long delay in building the enclosure and testing disks on the supermicro side, all deliveries were completed in the first week of November.

We therefore planned the installation in Rocquencourt on 2020-11-18.

As the rack with the existing storage enclosures was almost full, we decided to go ahead with the decommissioning of orsay, and to move uffizi to the almost empty rack next to it to make space. The planned operations were:

  • decommission orsay and the attached storage array
  • remove uffizi from the rack
  • retrieve the HBAs and NVMe storage from uffizi and move them to the new server
  • rack the new server in place of uffizi
  • install the new storage enclosure, connect it to the new server, and chain the other supermicro enclosure to it
  • reinstall uffizi in the other rack

We've also decided to use our presence in the DC to perform the following pending operations:

  • replace RAM in ceph-mon1 with the RAM bought in the summer
  • reinstall ceph-mon1 from scratch to prepare it for becoming a hypervisor
  • add RAM retrieved from ceph-mon1 to db1.staging and storage1.staging

Event Timeline

olasd changed the task status from Open to Work in Progress. Nov 19 2020, 11:16 AM
olasd triaged this task as High priority.
olasd created this task.

The downtime, scheduled for the whole day of 18 November, was posted to status.io on 16 November (https://status.softwareheritage.org/pages/maintenance/578e5eddcdc0cc7951000520/5fb2ae1fbf590a04c7fdffb0)

@vsellier and @olasd met in Rocquencourt on the morning of 18 November.

The workers were disabled; zfs pools were exported on uffizi before shutting it down, to allow for recovering them on the new host (https://docs.oracle.com/cd/E19253-01/819-5461/gbchy/index.html)
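
Exporting a pool before moving its disks is a one-liner per pool; as a sketch (the pool name is a placeholder, not the actual pool used on uffizi):

root@uffizi:~# zpool list -H -o name      # list the pools to export
root@uffizi:~# zpool export <pool>        # unmount the datasets and mark the pool as cleanly exported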

The disassembly of orsay and uffizi happened without an issue.

We had two relatively new 4TB SSDs in orsay which have been moved to uffizi.

We picked saam as name for the new server, and @vsellier prepared its inventory entry and puppet configuration.

saam had exactly the right number of free PCIe slots to receive the add-on cards:

  • 1 SAS HBA for the dell MD3460 array (Full Height, half length)
  • 2 SAS HBAs for the supermicro arrays (Full Height, half length)
  • 1 Intel Optane SSD DC P4800X card (Half Height; full height bracket available in the SWH "stock" in the DSI office)
  • 2 M.2 NVMe - PCIe adapters (Half Height cards with full height brackets; the half height brackets are in the SWH stock of the DSI office)

The server came with two PCIe cards pre-installed: the card for the boot storage M.2 SSDs, and an HBA for the front SAS bays.

We installed the rack mount rails for saam and for the new storage array, and rack-mounted both.
We also reinstalled the rack mount hardware for uffizi in the other cabinet and rack-mounted it back.

I went on to cable saam and the new storage array while @vsellier handled uffizi.

saam cabling:

  • attached the SAS cables from the MD3460 to the HBA
  • attached the SAS cables left over from the old supermicro array to the new supermicro array
  • attached the new SAS cables from the new supermicro array to both HBAs on saam
  • reused iDRAC and SFP+ cables from uffizi

uffizi cabling:

  • attached new SFP+ cabling to the top of rack switch
  • attached new iDRAC cable to the top of rack switch

After lunch, we finished power cabling and went on to set up an OS on saam.

The boot hung before system setup while initializing devices. After trying again, on a hunch, I disconnected all the SAS cables from the back of the server. This allowed us to access the system setup.

  • Set the iDRAC IP address to the next IP in our range; set the iDRAC password to the default Dell value
  • Disabled the boot ROM on the PCIe slots for the external HBAs

Before rebooting, I plugged the SAS cables back in, and popped in a Debian Installer stick.

The system booted to the Debian installer, which failed to find the USB stick (I guess it's not happy when the USB drive is /dev/sdrq1, yes, two letters). I unplugged the SAS cables again and we could finally install Debian.

After the Debian install, I plugged the SAS cables back in and rebooted. At first the system failed to boot: I had disabled booting from all PCIe cards, including the one holding the boot storage; after re-enabling that, Debian booted.

The Dell MD3460 virtual devices were detected on boot, but the other SAS enclosures would time out after enumerating some of the disks. After rebooting a few times, no dice.

I then opened the manual for the SAS enclosures again, and noticed that it was very specific about how to wire multiple HBAs and multiple chained enclosures. After removing all the SAS cables and redoing them one at a time, using the manufacturer-mandated wiring scheme, all the disks showed up on the system properly.

Once that was done, we could zfs import the pools.
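
(The import side is the standard ZFS procedure; a sketch with a placeholder pool name:)

root@saam:~# zpool import              # scan the attached devices and list importable pools
root@saam:~# zpool import <pool>       # import a pool under its original name
root@saam:~# zpool status <pool>       # verify that all devices are ONLINE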

We also validated iDRAC access to saam, and recorded the credentials to the password store.

uffizi was cabled again and rebooted. The network setup is pending actions from the DSI network admins.

Once the main operations were done, we went on to the bonus track:

We've replaced the RAM in ceph-mon1. @vsellier reinstalled a plain Debian so we can recycle it as a hypervisor. We validated that IPMI access was still OK after the re-racking.
We've expanded the RAM in db1.staging and storage1.staging. IPMI access was validated there too.
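
(IPMI access is validated remotely with something along these lines; the BMC host name and credentials are placeholders:)

ipmitool -I lanplus -H <bmc-hostname> -U <user> -P <password> chassis status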

We still have issues in boot ordering on saam, but we had done enough physical setup to be able to handle them remotely, so we decided that the physical part of the operation was complete and that we could follow up remotely.

Takeaways from the physical setup:

  • the rack mount arm of the supermicro array is a bit too small for 8 micro-SAS cables (16 conductors) + IPMI network + power; the swinging arm also interferes with the (pretty bulky) SAS connectors on the enclosure controllers, preventing the array from going fully into the rack.
  • the rack mount rails of the supermicro array protrude far enough back that they block some PDU ports.
  • multipath SAS cabling is very sensitive and needs to be done carefully, which is really hard with the bulky connectors and the very limited space at the back of the rack.
  • ceph-mon1 is not on pull-out rails, so it needs to be supported while being taken out of the rack
  • the three supermicro servers have no cable management arms, so you need to pull some cable slack at the back of the rack before pulling them out

On the saam software setup:

  • we recovered the multipath config from uffizi.
  • we recovered /etc/exports from uffizi.
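
(/etc/exports is the standard NFS exports file; a purely illustrative line, with a hypothetical path and client network, looks like:

/srv/softwareheritage/objects  192.168.100.0/24(ro,no_subtree_check)

The real entries were simply carried over from uffizi.)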

The boot is timing out because of a race condition between systemd-udev-settle.service and multipathd.service: udev calls multipath -c for every drive, but that needs multipathd to be running, which systemd doesn't start before systemd-udev-settle returns.

It turns out we already had that issue on uffizi, and solved it by:

  • overriding the multipathd.service unit to get ordered before systemd-udev-settle.service, instead of after
  • adding an override to zfs-import-cache.service to be ordered after multipathd.socket.

Both of these are implemented in D4523.
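
An illustrative sketch of what such drop-in overrides look like (the exact contents are in D4523; drop-ins can only add ordering dependencies, so if the packaged unit already orders itself the other way, a full unit override is needed instead):

# /etc/systemd/system/multipathd.service.d/override.conf
[Unit]
Before=systemd-udev-settle.service

# /etc/systemd/system/zfs-import-cache.service.d/override.conf
[Unit]
After=multipathd.socket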

There was an issue with the indexer storage package missing the swh.indexer setuptools metadata. I've moved the metadata to the swh.indexer.storage package in rDCIDX3809bb03

D4528 carries the local storage/objstorage configuration over from uffizi to saam.
D4531/D4532 add all the local mountpoints needed for this local config to work.

D4516 by @vsellier moves all storage/objstorage/indexer storage clients from uffizi to saam

After a few hit-and-miss reboots, we noticed that the multipath configuration was set to only enable explicitly whitelisted WWIDs, but the list of WWIDs was empty.

After pulling the list from uffizi, adding the WWIDs of the new devices, and regenerating the initramfs, everything came up properly.
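
A sketch of the kind of commands involved (device names are placeholders; the actual WWID list came from uffizi's configuration):

root@saam:~# /lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/sdX    # print the WWID of a device
root@saam:~# multipath -a /dev/sdX        # add that WWID to the /etc/multipath/wwids whitelist
root@saam:~# multipath -ll                # check that the multipath maps assemble correctly
root@saam:~# update-initramfs -u -k all   # rebuild the initramfs so the same whitelist is used at boot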

We're now ready for the final switch over of storage backends.

  • The configuration was applied on moma
  • a manual import was performed on worker01:
    • the /etc/softwareheritage/loader_git.yml config was updated:
root@worker01:/etc/softwareheritage# diff -U3 /tmp/loader_git.yml loader_git.yml 
--- /tmp/loader_git.yml	2020-11-20 08:43:18.682462213 +0000
+++ loader_git.yml	2020-11-20 08:44:00.150375756 +0000
@@ -13,7 +13,7 @@
   - cls: filter
   - cls: remote
     args:
-      url: http://uffizi.internal.softwareheritage.org:5002/
+      url: http://saam.internal.softwareheritage.org:5002/
 max_content_size: 104857600
 save_data: false
 save_data_path: "/srv/storage/space/data/sharded_packfiles"
  • the import was run on the puppet-swh-site repository:
root@worker01:/etc/softwareheritage# sudo -u swhworker SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh loader run git https://github.com/SoftwareHeritage/puppet-swh-site

The first try returned this exception:

swh.core.api.RemoteException: <RemoteException 500 ValueError: ["Storage class azure-prefixed is not available: No module named 'swh.objstorage.backends.azure'"]>

After applying the D4359 change on saam, the load is OK:

root@worker01:/etc/softwareheritage# sudo -u swhworker SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh loader run git https://github.com/SoftwareHeritage/puppet-swh-site
INFO:swh.loader.git.BulkLoader:Load origin 'https://github.com/SoftwareHeritage/puppet-swh-site' with type 'git'
Enumerating objects: 537, done.
Counting objects: 100% (537/537), done.
Compressing objects: 100% (326/326), done.
Total 19066 (delta 260), reused 445 (delta 194), pack-reused 18529
INFO:swh.loader.git.BulkLoader:Listed 3 refs for repo https://github.com/SoftwareHeritage/puppet-swh-site
{'status': 'eventful'}

The last commit of the diff is indeed present [1] and the file is correctly stored on the saam object storage:

softwareheritage=> select * from content where sha1_git='\x1781d66d33737d1e422cd54add562f7f04f16b30';
-[ RECORD 1 ]------------------------------------------------------------------
sha1       | \xa12d17353c310908068110a859f9b54e618c775a
sha1_git   | \x1781d66d33737d1e422cd54add562f7f04f16b30
sha256     | \xc927814db44f633cf72ac735cc740d950e3bfe9d75dd8409564708759203f03d
length     | 235
ctime      | 2020-11-20 09:35:43.992683+00
status     | visible
object_id  | 9177836614
blake2s256 | \x42a45c33b44a8ebb6011dc3679dbc6a90b389cb3ef6b92f7651988ed72f13a93
root@saam:/srv/softwareheritage/objects/a1/a12d1# ls -alh /srv/softwareheritage/objects/a1/a12d1/a12d17353c310908068110a859f9b54e618c775a
-rw-r--r-- 1 swhstorage swhstorage 235 Nov 20 09:35 /srv/softwareheritage/objects/a1/a12d1/a12d17353c310908068110a859f9b54e618c775a
root@saam:/srv/softwareheritage/objects/a1/a12d1# cat /srv/softwareheritage/objects/a1/a12d1/a12d17353c310908068110a859f9b54e618c775a
class role::swh_storage_baremetal inherits role::swh_storage {
  include profile::dar::server
  include profile::megacli
  include profile::multipath
  include profile::mountpoints

  include ::profile::swh::deploy::objstorage_cloud
}

[1] https://archive.softwareheritage.org/browse/revision/966b1d2eabd94ecb064845ab4e77ddfa4042a959/?origin_url=https://github.com/SoftwareHeritage/puppet-swh-site#swh-revision-changes

  • puppet applied on worker01
  • task-by-task tests:
    • mercurial:
swhworker@worker01:~$ SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_mercurial.yml swh loader run mercurial https://foss.heptapod.net/fluiddyn/fluidfft
INFO:swh.loader.mercurial.Bundle20Loader:Load origin 'https://foss.heptapod.net/fluiddyn/fluidfft' with type 'hg'
{'status': 'eventful'}
swhworker@worker01:~$ SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_mercurial.yml swh loader run mercurial https://hg.mozilla.org/projects/nss
INFO:swh.loader.mercurial.Bundle20Loader:Load origin 'https://hg.mozilla.org/projects/nss' with type 'hg'
WARNING:swh.loader.mercurial.Bundle20Loader:No matching revision for tag NSS_3_15_5_BETA2 (hg changeset: e5d3ec1d9a35f7cac554543d52775092de9f6a01). Skipping
WARNING:swh.loader.mercurial.Bundle20Loader:No matching revision for tag NSS_3_15_5_BETA2 (hg changeset: 0000000000000000000000000000000000000000). Skipping
WARNING:swh.loader.mercurial.Bundle20Loader:No matching revision for tag NSS_3_18_RTM (hg changeset: 0000000000000000000000000000000000000000). Skipping
WARNING:swh.loader.mercurial.Bundle20Loader:No matching revision for tag NSS_3_18_RTM (hg changeset: 0000000000000000000000000000000000000000). Skipping
WARNING:swh.loader.mercurial.Bundle20Loader:No matching revision for tag NSS_3_24_BETA3 (hg changeset: 0000000000000000000000000000000000000000). Skipping
{'status': 'eventful'}
  • svn
root@worker01:~# SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_svn.yml swh loader run svn svn://svn.appwork.org/utils
INFO:swh.loader.svn.SvnLoader:Load origin 'svn://svn.appwork.org/utils' with type 'svn'
INFO:swh.loader.svn.SvnLoader:Processing revisions [3428-3436] for {'swh-origin': 'svn://svn.appwork.org/utils', 'remote_url': 'svn://svn.appwork.org/utils', 'local_url': b'/tmp/swh.loader.svn.dojsubkd-890577/utils', 'uuid': b'21714237-3853-44ef-a1f0-ef8f03a7d1fe'}
{'status': 'eventful'}
  • npm:

KO (failed): https://sentry.softwareheritage.org/share/issue/363ef9d218ac4817a992b7dc9bf283a6/

root@worker01:~# SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_npm.yml swh loader run npm https://www.npmjs.com/package/bootstrap-vue
WARNING:swh.storage.retry:Retry adding a batch
WARNING:swh.storage.retry:Retry adding a batch
WARNING:swh.storage.retry:Retry adding a batch
ERROR:swh.loader.package.loader:Failed loading branch releases/2.18.0 for https://www.npmjs.com/package/bootstrap-vue
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 333, in call
    result = fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/retry.py", line 117, in raw_extrinsic_metadata_add
    return self.storage.raw_extrinsic_metadata_add(metadata)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 181, in meth_
    return self.post(meth._endpoint_path, post_data)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 278, in post
    return self._decode_response(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 352, in _decode_response
    self.raise_for_status(response)
  File "/usr/lib/python3/dist-packages/swh/storage/api/client.py", line 29, in raise_for_status
    super().raise_for_status(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 342, in raise_for_status
    raise exception from None
swh.core.api.RemoteException: <RemoteException 500 TypeError: ["__init__() got an unexpected keyword argument 'id'"]>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/package/loader.py", line 424, in load
    res = self._load_revision(p_info, origin)
  File "/usr/lib/python3/dist-packages/swh/loader/package/loader.py", line 577, in _load_revision
    self._load_metadata_objects([original_artifact_metadata])
  File "/usr/lib/python3/dist-packages/swh/loader/package/loader.py", line 788, in _load_metadata_objects
    self.storage.raw_extrinsic_metadata_add(metadata_objects)
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 241, in wrapped_f
    return self.call(f, *args, **kw)
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 330, in call
    start_time=start_time)
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 298, in iter
    six.raise_from(retry_exc, fut.exception())
  File "<string>", line 3, in raise_from
tenacity.RetryError: RetryError[<Future at 0x7f6fe4e98cf8 state=finished raised RemoteException>]
  • deposit

Same issue as npm:

swhworker@worker01:~$ SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_deposit.yml swh loader run deposit https://www.softwareheritage.org/check-deposit-2020-11-17T20:48:13.534821 deposit_id=1114
WARNING:swh.storage.retry:Retry adding a batch
WARNING:swh.storage.retry:Retry adding a batch
WARNING:swh.storage.retry:Retry adding a batch
ERROR:swh.loader.package.loader:Failed loading branch HEAD for https://www.softwareheritage.org/check-deposit-2020-11-17T20:48:13.534821
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 333, in call
    result = fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/retry.py", line 117, in raw_extrinsic_metadata_add
    return self.storage.raw_extrinsic_metadata_add(metadata)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 181, in meth_
    return self.post(meth._endpoint_path, post_data)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 278, in post
    return self._decode_response(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 352, in _decode_response
    self.raise_for_status(response)
  File "/usr/lib/python3/dist-packages/swh/storage/api/client.py", line 29, in raise_for_status
    super().raise_for_status(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 342, in raise_for_status
    raise exception from None
swh.core.api.RemoteException: <RemoteException 500 TypeError: ["__init__() got an unexpected keyword argument 'id'"]>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/package/loader.py", line 424, in load
    res = self._load_revision(p_info, origin)
  File "/usr/lib/python3/dist-packages/swh/loader/package/loader.py", line 577, in _load_revision
    self._load_metadata_objects([original_artifact_metadata])
  File "/usr/lib/python3/dist-packages/swh/loader/package/loader.py", line 788, in _load_metadata_objects
    self.storage.raw_extrinsic_metadata_add(metadata_objects)
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 241, in wrapped_f
    return self.call(f, *args, **kw)
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 330, in call
    start_time=start_time)
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 298, in iter
    six.raise_from(retry_exc, fut.exception())
  File "<string>", line 3, in raise_from
tenacity.RetryError: RetryError[<Future at 0x7fdecca288d0 state=finished raised RemoteException>]

The problem was not reproduced in staging at first, where the worker and the storage were still running matching (older) package versions:

vsellier@worker0 ~ % apt list --upgradable
Listing... Done
python3-swh.deposit.client/unknown 0.6.0-1~swh1~bpo10+1 all [upgradable from: 0.5.0-1~swh1~bpo10+1]
python3-swh.deposit.loader/unknown 0.6.0-1~swh1~bpo10+1 all [upgradable from: 0.5.0-1~swh1~bpo10+1]
python3-swh.deposit/unknown 0.6.0-1~swh1~bpo10+1 all [upgradable from: 0.5.0-1~swh1~bpo10+1]
python3-swh.indexer.storage/unknown 0.5.0-2~swh1~bpo10+1 all [upgradable from: 0.4.2-1~swh1~bpo10+1]
python3-swh.indexer/unknown 0.5.0-2~swh1~bpo10+1 all [upgradable from: 0.4.2-1~swh1~bpo10+1]
python3-swh.journal/unknown 0.5.1-1~swh1~bpo10+1 all [upgradable from: 0.5.0-1~swh1~bpo10+1]
python3-swh.loader.git/unknown 0.5.0-1~swh1~bpo10+1 all [upgradable from: 0.4.1-1~swh1~bpo10+1]
python3-swh.model/unknown 0.9.0-1~swh1~bpo10+1 all [upgradable from: 0.7.3-1~swh1~bpo10+1]
python3-swh.storage/unknown 0.17.2-1~swh1~bpo10+1 all [upgradable from: 0.17.0-1~swh1~bpo10+1]
python3-swh.vault/unknown 0.3.3-1~swh1~bpo10+1 all [upgradable from: 0.3.1-1~swh1~bpo10+1]
vsellier@storage1 ~ % apt list --upgradable
Listing... Done
libpq5/buster-pgdg 13.1-1.pgdg100+1 amd64 [upgradable from: 13.0-1.pgdg100+1]
postgresql-13/buster-pgdg 13.1-1.pgdg100+1 amd64 [upgradable from: 13.0-1.pgdg100+1]
postgresql-client-13/buster-pgdg 13.1-1.pgdg100+1 amd64 [upgradable from: 13.0-1.pgdg100+1]
postgresql-client-common/buster-pgdg 223.pgdg100+1 all [upgradable from: 220.pgdg100+1]
postgresql-client/buster-pgdg 13+223.pgdg100+1 all [upgradable from: 13+220.pgdg100+1]
postgresql-common/buster-pgdg 223.pgdg100+1 all [upgradable from: 220.pgdg100+1]
postgresql/buster-pgdg 13+223.pgdg100+1 all [upgradable from: 13+220.pgdg100+1]
python3-swh.indexer.storage/unknown 0.5.0-2~swh1~bpo10+1 all [upgradable from: 0.4.2-1~swh1~bpo10+1]
python3-swh.indexer/unknown 0.5.0-2~swh1~bpo10+1 all [upgradable from: 0.4.2-1~swh1~bpo10+1]
python3-swh.journal/unknown 0.5.1-1~swh1~bpo10+1 all [upgradable from: 0.5.0-1~swh1~bpo10+1]
python3-swh.model/unknown 0.9.0-1~swh1~bpo10+1 all [upgradable from: 0.7.3-1~swh1~bpo10+1]
python3-swh.storage/unknown 0.17.2-1~swh1~bpo10+1 all [upgradable from: 0.17.0-1~swh1~bpo10+1]
  • after upgrading storage1.staging (but not the worker), the exact same problem appeared there too
  • after also upgrading the worker, everything went well, confirming a package version mismatch between the worker and the storage server

After upgrading the packages on worker01, the npm load was successful:

swhworker@worker01:~$ time SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_npm.yml swh loader run npm https://www.npmjs.com/package/bootstrap-vue
{'status': 'eventful', 'snapshot_id': '30d32aff7fab1a2c364dc5c61503b0aec3f9fb11'}

real	0m50.440s
user	0m4.542s
sys	0m0.884s

The deposit load too:

swhworker@worker01:~$ SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_deposit.yml swh loader run deposit https://www.softwareheritage.org/check-deposit-2020-11-17T20:48:13.534821 deposit_id=1114
{'status': 'eventful', 'snapshot_id': '1ed556891c9f8da6e5292973dd6aa2d3865de847'}

Automatic tasks were restarted on worker01; the logs are under watch.

All the loaders were restarted on all the workers:

sudo clush -b -w @swh-workers 'apt-get update && apt-get -y upgrade -V'
sudo clush -b -w @swh-workers 'puppet agent --enable && puppet agent --test'
sudo clush -b -w @swh-workers 'systemctl default'

I have also restarted the listers to make sure they are up to date after the package upgrade:

sudo clush -b -w @swh-workers 'systemctl restart swh-worker@lister.service'

All services are in nominal shape.
Resolved the maintenance notice on the status.io page.

olasd claimed this task.