
Prepare the disks and configure zfs
Closed, Migrated

Description

The disks are configured with a 512-byte block size by default:

root@cassandra03:/sys/block/nvme0n1/device# lsblk -o NAME,PHY-SeC | grep nvme
nvme0n1                          512
├─nvme0n1p1                      512
└─nvme0n1p9                      512
nvme2n1                          512
├─nvme2n1p1                      512
└─nvme2n1p9                      512
nvme1n1                          512
├─nvme1n1p1                      512
└─nvme1n1p9                      512
nvme4n1                          512
├─nvme4n1p1                      512
└─nvme4n1p9                      512
nvme3n1                          512
├─nvme3n1p1                      512
└─nvme3n1p9                      512

They will have better performance with a 4k block size:

root@cassandra03:/sys/block/nvme0n1/device# ls /dev/nvme?n1 | xargs -n1 -t nvme id-ns -H | grep LBA
nvme id-ns -H /dev/nvme0n1
  [3:0] : 0	Current LBA Format Selected
  [0:0] : 0x1	Metadata as Part of Extended Data LBA Supported
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format  1 : Metadata Size: 8   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good 
LBA Format  2 : Metadata Size: 16  bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good 
LBA Format  3 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best 
LBA Format  4 : Metadata Size: 8   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best 
LBA Format  5 : Metadata Size: 64  bytes - Data Size: 4096 bytes - Relative Performance: 0 Best 
LBA Format  6 : Metadata Size: 128 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best 
nvme id-ns -H /dev/nvme1n1
  [3:0] : 0	Current LBA Format Selected
  [0:0] : 0	Metadata as Part of Extended Data LBA Not Supported
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best 
nvme id-ns -H /dev/nvme2n1
  [3:0] : 0	Current LBA Format Selected
  [0:0] : 0	Metadata as Part of Extended Data LBA Not Supported
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best 
nvme id-ns -H /dev/nvme3n1
  [3:0] : 0	Current LBA Format Selected
  [0:0] : 0	Metadata as Part of Extended Data LBA Not Supported
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best 
nvme id-ns -H /dev/nvme4n1
  [3:0] : 0	Current LBA Format Selected
  [0:0] : 0	Metadata as Part of Extended Data LBA Not Supported
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best

They can be reformatted with this command [1]:

nvme format --lbaf=<lba format id> /dev/<device>

[1] from https://www.bjonnh.net/article/20210721_nvme4k/
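The right `--lbaf` value differs per disk model (format 3 on the first disk above, format 1 on the others), so it is worth extracting it from the `nvme id-ns -H` output rather than hardcoding it. A minimal sketch, assuming we always want the first format with a 4096-byte data size and no metadata (the `best_4k_lbaf` helper name is ours, not part of nvme-cli):

```shell
# Hypothetical helper: read `nvme id-ns -H` output on stdin and print the id
# of the first LBA format with 4096-byte data size and no metadata.
best_4k_lbaf() {
  awk '/^LBA Format/ && /Data Size: 4096/ && /Metadata Size: 0 / { print $3; exit }'
}

# Example with the mixed-use disks' output from above:
sample='LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best'
printf '%s\n' "$sample" | best_4k_lbaf   # prints: 1

# Then, per device (destructive; only on a quiesced, data-free disk):
# lbaf=$(nvme id-ns -H /dev/nvme1n1 | best_4k_lbaf)
# nvme format --lbaf="$lbaf" /dev/nvme1n1
```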

Event Timeline

vsellier changed the task status from Open to Work in Progress. Aug 18 2022, 4:46 PM
vsellier triaged this task as Normal priority.
vsellier created this task.
vsellier updated the task description.

The nvme format command didn't succeed on the write-intensive disk: it never exits, and the disk becomes unresponsive afterwards.

There is no problem with the mixed-use disks, and the command returns immediately:

root@cassandra03:~# nvme format -f --lbaf=1 /dev/nvme1n1
Success formatting namespace:1
root@cassandra03:~# nvme id-ns -H /dev/nvme1n1 | grep "in use"
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)

Doing the same for the other disks.

Testing the performance of the different configurations (on a zfs pool with only one disk):

  • disk block: 512B / zpool ashift: 9
zpool create -o ashift=9 -O mountpoint=none mixeduse /dev/disk/by-id/nvme-MO003200KXAVU_SJA4N7938I0405A0U
zfs create -o mountpoint=/srv/cassandra/instance1/data -o atime=off -o relatime=on mixeduse/cassandra-data
cd /srv/cassandra/instance1/data
bonnie++ -d . -m cassandra04 -u nobody                                                                                                                                                                                                                                                 
Using uid:65534, gid:65534.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  2.00       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
cassandra04 515496M  293k  99  1.0g  99  703m  99  661k  99  1.4g  91 13717 463
Latency             48216us    7316us    8224us   23303us    7928us    1606us
Version  2.00       ------Sequential Create------ --------Random Create--------
cassandra04         -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 16384  98 +++++ +++ 16384   8 +++++ +++ +++++ +++ 16384  99
Latency              2679us    1207us    4851ms    2850us     138us     301us
1.98,2.00,cassandra04,1,1659338044,515496M,,8192,5,293,99,1080974,99,720299,99,661,99,1488832,91,13717,463,16,,,,,28232,98,+++++,+++,2018,8,+++++,+++,+++++,+++,24821,99,48216us,7316us,8224us,23303us,7928us,1606us,2679us,1207us,4851ms,2850us,138us,301us
  • disk block: 512B / zpool ashift: 12
root@cassandra04:/srv/cassandra/instance1# zpool create -o ashift=12 -O mountpoint=none mixeduse /dev/disk/by-id/nvme-MO003200KXAVU_SJA4N7938I0405A0U
root@cassandra04:/srv/cassandra/instance1# zfs create -o mountpoint=/srv/cassandra/instance1/data -o atime=off -o relatime=on mixeduse/cassandra-data
cd /srv/cassandra/instance1/data
bonnie++ -d . -m cassandra04-512m-12 -u nobody                                                                                                                                                                                                                                                 
Using uid:65534, gid:65534.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  2.00       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
cassandra04 515496M  217k  99  1.1g  99  742m  99  775k  99  1.5g  95 13584 470
Latency             49839us   57810us    8420us   13985us    7891us    1527us
Version  2.00       ------Sequential Create------ --------Random Create--------
cassandra04-512m-12 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 16384  98 +++++ +++ 16384   9 16384  99 +++++ +++ 16384  99
Latency              2846us    1528us    4036ms    2980us      54us     301us
1.98,2.00,cassandra04-512m-12,1,1658730922,515496M,,8192,5,217,99,1130625,99,759750,99,775,99,1533158,95,13584,470,16,,,,,29506,98,+++++,+++,2297,9,32494,99,+++++,+++,23597,99,49839us,57810us,8420us,13985us,7891us,1527us,2846us,1528us,4036ms,2980us,54us,301us
  • disk block: 4k / zpool ashift: 12
zpool create -o ashift=12 -O mountpoint=none mixedused /dev/disk/by-id/nvme-MO003200KXAVU_SJA4N7938I0405A0V
zfs create -o mountpoint=/srv/cassandra/instance1/data -o atime=off -o relatime=on mixedused/cassandra-data
cd /srv/cassandra/instance1/data
bonnie++ -d . -m cassandra03-4k-12 -u nobody  
Using uid:65534, gid:65534.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  2.00       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
cassandra03 515496M  225k  99  956m  99  638m  99  653k  99  1.4g  97  8358 336
Latency             61845us    7367us   93868us   16967us    6614us    1334us
Version  2.00       ------Sequential Create------ --------Random Create--------
cassandra03-4k-12   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 16384  82 +++++ +++ 16384  10 16384  99 +++++ +++ 16384  99
Latency              2944us    1446us    3693ms    3073us     290us     299us
1.98,2.00,cassandra03-4k-12,1,1660828277,515496M,,8192,5,225,99,979351,99,653546,99,653,99,1497220,97,8358,336,16,,,,,23747,82,+++++,+++,2383,10,32113,99,+++++,+++,23371,99,61845us,7367us,93868us,16967us,6614us,1334us,2944us,1446us,3693ms,3073us,290us,299us
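For reference, ashift is the base-2 logarithm of the pool's allocation size, so the two values tested above correspond to the two sector sizes in play:

```shell
# ashift = log2(allocation size), i.e. 2^ashift bytes per allocation unit.
echo $((1 << 9))    # ashift=9  -> 512
echo $((1 << 12))   # ashift=12 -> 4096

# To double-check a live pool (hedged: `zpool get ashift` assumes an OpenZFS
# version exposing ashift as a pool property; otherwise `zdb -C <pool>` shows it):
# zpool get ashift mixeduse
```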

Recreating the zpools correctly:

# mixedused
ls /dev/disk/by-id/nvme-MO003200KXAVU* | grep -v part | xargs -t zpool create -o ashift=12 -O mountpoint=none mixeduse
zfs create -o mountpoint=/srv/cassandra/instance1/data mixeduse/cassandra-instance1-data

# write intensive
ls /dev/disk/by-id/nvme-EO000375KWJUC* | grep -v part | xargs -t -r zpool create -o ashift=12 -O mountpoint=none writeintensive
zfs create -o mountpoint=/srv/cassandra/instance1/commitlog writeintensive/cassandra-instance1-commitlog
vsellier moved this task from Backlog to done on the System administration board.

All servers reconfigured and Cassandra started on them:

/opt/cassandra/bin/nodetool status
Datacenter: sesi_rocquencourt
=============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns (effective)  Host ID                               Rack 
UN  192.168.100.184  88.65 KiB  16      34.3%             e0c24d24-6f68-4a26-8561-94e67b58211a  rack1
UN  192.168.100.181  84.71 KiB  16      31.3%             1d9b9e7d-b376-4afe-8f67-482e8412f21b  rack1
UN  192.168.100.186  69.07 KiB  16      34.2%             0dd3426d-9159-47bd-9b4e-065ff0fbb889  rack1
UN  192.168.100.183  69.08 KiB  16      37.1%             78281a92-7fa0-43bd-bc33-c5b419ee8715  rack1
UN  192.168.100.185  69.07 KiB  16      32.2%             abf9b69e-3cec-4ac3-a195-a54481e4d9da  rack1
UN  192.168.100.182  74.05 KiB  16      30.9%             eca5ea5d-8bd5-4301-9a5e-ffa01aa1b7e5  rack1