Striped Logical Volume Design and Construction: Block Stacking for Fun and Profit

For those of you well versed in UNIX-like systems, you know the logical volume manager (LVM) is a big deal.  Coming from a Windows world, think of it as a much deeper and more powerful take on the Virtual Disk Service.  It's a lot cooler, too, because it aims to do everything useful, whereas Windows' tooling in this area seems geared more towards providing a simple interface to common functionality.  That tooling isn't useless by any means, and it can do most things LVM can do, but I like the free feel of LVM.

Anyway, I digress.  The point is that my modest Fedora 21 system currently has a small 64 GB SSD which provides my root and home volumes.  I previously dug up an old 7200 RPM 1 TB WD (I think) drive, created a single partition on it, and formatted it with XFS.  I like XFS, and I didn't think too far ahead at the time because I desperately needed the space for Steam and my various virtual machine disk images.

In training for the LFCE examination, I have found the need to run three or four virtual machines simultaneously.  That really doesn't work so well with all of those virtual machines accessing images on the same 7200 RPM disk.  Trying to use yum to update two systems simultaneously lets you watch the disk accesses queue up as the machines practically take turns installing updates.

So I don't want to spend money if I don't have to, and I love using older hardware.  One of the most beautiful implications of the logical volume manager's elegant design is that logical volumes don't need to take up entire disks.  I can therefore create a striped logical volume that takes up, say, 200 GB of each disk (yielding 400 GB of total storage), striping the data across them.  This allows for performance increases in I/O operations very similar to those of classic RAID 0 arrangements* without requiring that I devote the entire disk devices to the striping relationship (unlike RAID 0).  That leaves me storage space on these disk devices for other logical volumes, such as a mirrored (RAID 1) volume where I can store data in need of greater resilience and availability.

What I need to configure is as follows:

  1. Disk Devices
  2. Physical Volumes
  3. Volume Groups
  4. Logical Volumes
  5. File Systems

So let’s start from the start:

Disk Devices

My 1 TB disk devices are /dev/sdb and /dev/sdc.

Since I don’t remember what they are, let’s see if Linux can help us find out:

$ cat /sys/block/sdb/device/model 
WDC WD10EACS-00Z
$ cat /sys/block/sdc/device/model 
SAMSUNG HD103SJ

Despite the warnings in the IBM article referenced below, I can't find any product spec sheets indicating whether or not they are Advanced Format (AF) drives.  They are, however, both equipped with respectable on-board caches.  The kernel reports their physical block sizes as follows:

$ cat /sys/block/sdb/queue/physical_block_size
512
$ cat /sys/block/sdc/queue/physical_block_size
512
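
One more data point worth gathering: Advanced Format drives that emulate 512 byte sectors ("512e") typically report a 512 byte logical block size alongside a 4096 byte physical block size, so comparing the two sysfs values is a quick way to spot them.  On these drives the logical size also comes back as 512, matching the physical size above, so the kernel treats them as classic 512 byte sector devices either way:

$ cat /sys/block/sdb/queue/logical_block_size
512
$ cat /sys/block/sdc/queue/logical_block_size
512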

This article from IBM features a great explanation of the implications relevant to our current endeavor.  At the lowest level, each call to a read or write operation on a disk operates at a size of 512 bytes.  If you want to write only a single byte, the disk reads the appropriate 512 byte sector and then writes a new 512 byte sector with the single byte you wanted modified.  Now, it could be that these drives support Advanced Format 4096 byte sectors, but I don’t think they do based on some rudimentary Internet research.  The hdparm utility also seems to agree (sample output for one of the drives below):

$ sudo hdparm -I /dev/sdb

/dev/sdb:

ATA device, with non-removable media
        Model Number:       WDC WD10EACS-00ZJB0                     
        Serial Number:      WD-WCASJ1368263
        Firmware Revision:  01.01B01
        Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5
Standards:
        Supported: 8 7 6 5 
        Likely used: 8
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors: 1953525168
        Logical/Physical Sector size:           512 bytes
        device size with M = 1024*1024:      953869 MBytes
        device size with M = 1000*1000:     1000204 MBytes (1000 GB)
        cache/buffer size  = 16384 KBytes

In either case, what we're worried about is making sure that our storage units line up so as to enable the most efficient use of the disk devices.  Since the disk will always read and write in increments of 512 bytes, we want to make sure we don't divide the disk into partitions that cause data which could fit in one sector to be written across two sectors (as described in the IBM article).  That would mean the disk performs two operations (accessing two sectors) instead of the single operation (accessing one sector) needed to read that data, and when that inefficiency is stacked across thousands and thousands of disk requests, it can amount to big adverse performance consequences (as the IBM article's benchmarks show).

So we need to make sure that our partitions start at sector boundaries.  With 512 byte sectors this isn't a big challenge, but with a 4096 byte sector size, we wouldn't want to begin a partition at some strange position.  Take the 63rd 512 byte sector, for example: if we began our partition there, it would start 512*63 = 32256 bytes into the disk, and 32256 bytes divided by 4096 byte sectors yields 7.875, meaning the partition would start misaligned to the 4096 byte sectors, falling 0.125 sectors (512 bytes) short of the next boundary.  The first 4096 bytes of data written to the partition would therefore not be contained in one nice sector, but would be spread across two: the first 512 bytes would land in the one-eighth of a sector that begins the partition, and the remainder would fill out 7/8 of the following sector, requiring two disk operations where one should have sufficed.  When this kind of inefficiency is spread across I/O operations generally (though not every operation is affected, since some need to read contiguous sectors anyway), severe performance effects will be noticed (as shown in the IBM article's benchmarks).
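
A quick shell sanity check makes the rule concrete: a partition's starting sector (counted in 512 byte units) lands on a 4096 byte boundary only if it is divisible by 8, since 4096 / 512 = 8.  The sector numbers below are just illustrative values:

$ echo $(( 2048 % 8 ))    # the usual modern default starting sector
0
$ echo $(( 63 % 8 ))      # the old DOS-era default starting sector
7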

So, to avoid any potential problem, I simply partitioned the disks with the latest iteration of cfdisk (menu-driven interface – try it out).  Since cfdisk assumes a sector size of 512 bytes, I created a single partition on each device with the default starting position of sector 2048, which works equally well for classic 512 byte sectors and Advanced Format 4096 byte sectors (the 2048th 512 byte sector is the 256th 4096 byte sector, so it's evenly divisible either way).  The resulting partx output looks like this:

$ sudo partx --show /dev/sdb
NR START        END    SECTORS   SIZE NAME UUID
 1  2048 1953525167 1953523120 931.5G      a2287534-01
$ sudo partx --show /dev/sdc
NR START        END    SECTORS   SIZE NAME UUID
 1  2048 1953523711 1953521664 931.5G      44cb72a0-01
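
If you prefer something scriptable to cfdisk, parted can do the same job.  The commands below are only a sketch along those lines, using my device names (and note that mklabel destroys any existing partition table, so don't run it casually); the align-check subcommand should then report partition 1 as satisfying the device's optimal alignment:

$ sudo parted /dev/sdb mklabel msdos
$ sudo parted /dev/sdb mkpart primary 2048s 100%
$ sudo parted /dev/sdb align-check optimal 1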

What have we accomplished?  Relative to the start of the disk device, our disk partition begins on a 4096 byte boundary.

What will we now endeavor to accomplish?  Within the partition, our data structures will align to 4096 byte boundaries relative to the beginning of the partition (and therefore to the beginning of the disk, given the partition alignment we just established).

All of this is done so that the disk efficiently accesses this data when it is called upon to do so.  A 4096 byte chunk of data will be placed in a single disk sector, allowing the disk’s firmware to read it in one fell swoop (or eight contiguous fell swoops, if we really are using disks that read in 512 byte sectors).

Physical Volumes

I can now create the physical volumes to be used by the logical volume manager.  Here, I need to take into consideration the alignment of the physical extents (whose size I will specify during volume group creation in the next step) with the underlying disk blocks.  Because I started my disk partitions at the 1 MiB mark, I can be confident that I am beginning on a sector boundary with either 512 byte or 4096 byte sectors, so I don't need to address any potential partition misalignment.  The warning in the pvcreate manual page harkens back to our discussion of partition alignment in that it provides a method to compensate for misaligned partitions.  We needn't worry about that here: any disk made after 2008 shouldn't require it, and the only relevant reference I could find (a really old Google-cached page) warns that my Samsung disk specifically does not do that sort of internal compensation, so I'm going to assume neither drive does.

Therefore, all we have to do is create the physical volumes without any data alignment considerations:

$ sudo pvcreate /dev/sdb1 /dev/sdc1
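
If you want to double-check where LVM will actually begin laying down data within each partition, pvs can report the offset of the first physical extent; I would expect the 1st PE column to show the default 1.00m here, which keeps everything on a 4096 byte boundary:

$ sudo pvs -o +pe_start /dev/sdb1 /dev/sdc1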

Volume Group

Now, when we create the volume group, we specify the physical extent size. The implications of the physical extent size are roughly:

  1. It’s very difficult to change the physical extent size for a volume group once it’s set, so we want to get it right.
  2. We grow Logical Volumes by physical extents, so the size of the extent is the smallest increment by which we can grow a logical volume.
  3. To allow for extents to be easily read and written, the extent size needs to be a multiple of 4 KiB (4096 bytes) so that it nicely fits within disk sectors.
  4. To optimize storage use, the extent size should be a multiple of the stripe size we will eventually implement for our RAID-0-resembling logical volume.

So basically, we have to think ahead to the size of the stripes we want to create for our performance-oriented, RAID-0-resembling logical volume.  We want our stripe size to be a multiple of our disk sector size, so again we'll take the safe choice and align along the 4096 byte boundary (which lines up with the 512 byte sector boundary as well).  These stripes will be written into the physical extents we allocate to the logical volume, so we want the physical extent size to be a multiple of our stripe size so that we don't waste any space.

To better determine our stripe size for optimal throughput, we need to consider the kind of data we're working with.  In my case, along with storing Steam game data, I am attempting to run virtual machines that will treat their disk image files residing on this logical volume as block devices for their own use.  Bigger stripe sizes generally offer the best performance so long as your data isn't predominantly handled at sizes smaller than the stripes (since that would mean the data fails to spread between disk devices, along with unnecessarily large read and write operations when handling those small chunks of data).  Small stripe sizes are generally hard on disk I/O, since they increase the number of seek and read/write operations needed to deal with files larger than the stripe size (a 1 KiB stripe size would be painful when reading a 1 MiB file, for example, since that would require 1024 stripe reads, whereas a 128 KiB stripe size would require only 8).  Very small stripe sizes are generally only appropriate for databases whose I/O operations are almost entirely limited to actions at or below the designated stripe size (and in those cases, small stripes really do optimize performance like crazy).
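
The arithmetic in that parenthetical is easy to verify in the shell (everything expressed in KiB, with the 1 MiB file being 1024 KiB):

$ echo $(( 1024 / 1 ))      # 1 MiB file read in 1 KiB stripes
1024
$ echo $(( 1024 / 128 ))    # 1 MiB file read in 128 KiB stripes
8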

Given that I only have two disks and the data access will vary (perhaps widely, depending on what I end up doing with my VMs), I’m going to go with a medium-large stripe size.  Remembering our general constraints:

  • The stripe size should be a multiple of the disk sector size
  • The extent size should be a multiple of this stripe size.

I'll go with a 64 KiB stripe size and a 4 MiB physical extent size.  With that pairing, each extent holds exactly 64 of my 64 KiB stripe chunks (64 * 64 KiB = 4096 KiB = 4 MiB), so nothing is wasted.  It seems pretty, at least.  Since 4 MiB is the default extent size, I don't need to specify it during volume group creation:

$ sudo vgcreate labor_vg /dev/sdb1 /dev/sdc1

All that thinking to just perform a default volume group creation!
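
For the record, if I had wanted a non-default extent size, vgcreate accepts one via -s.  The sketch below shows the equivalent explicit form of the command above, followed by a vgs query to confirm the extent size the volume group actually ended up with:

$ sudo vgcreate -s 4M labor_vg /dev/sdb1 /dev/sdc1
$ sudo vgs -o +vg_extent_size labor_vg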

Logical Volume

So now we already know what stripe size we’ll be using (64 KiB), and we simply create the volume as follows:

$ sudo lvcreate -i 2 -I 64 -L 400G -n workbench_lv labor_vg
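
To convince yourself the stripes came out as intended, lvs can report the segment layout of the new volume; you should see two stripes (one per disk) with a 64 KiB stripe size:

$ sudo lvs --segments -o +stripe_size labor_vg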

Alright!  Almost done.

File System

Which file system should we use with our nifty little RAID-0-resembling logical volume?  I’m a fan of the tried-and-true XFS (and benchmarks regularly show it in the lead or among contenders for the lead, especially when it comes to writes, about which I am most concerned).  The mkfs.xfs command is extremely user-friendly, and I don’t even have to specify anything to get the correct output since it queries the logical volume for pertinent information when creating the file system:

$ sudo mkfs.xfs /dev/labor_vg/workbench_lv
 
meta-data=/dev/labor_vg/workbench_lv isize=256    agcount=16, agsize=6553584 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=104857344, imaxpct=25
         =                       sunit=16     swidth=32 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=51200, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Now, the figures in which we are most interested are sunit (stripe unit) and swidth (stripe width).  They don't look right, though, do they?  That's because even though mkfs.xfs interprets sunit and swidth as being specified in units of 512 byte sectors when provided as options on the command line, it reports them during file system creation in units of whatever block size your file system ends up using, which in this case is the default 4096 bytes (bsize=4096).  That means our stripe unit is 16*4096 bytes = 64 KiB, as expected, and our stripe width is 32*4096 bytes = 128 KiB, also as expected (since we have two disk devices across which our stripes are spread, with 64 KiB going to each device).
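
Had mkfs.xfs not picked these values up from LVM on its own, they could have been supplied by hand.  The invocation below is the equivalent explicit form for this layout (su being the stripe unit and sw the number of data disks); I'm showing it for reference rather than suggesting you re-run it over the freshly made file system:

$ sudo mkfs.xfs -d su=64k,sw=2 /dev/labor_vg/workbench_lv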

And now, my friends, we have reached the end of our long journey.  Time to mount the logical volume to its permanent home, make another RAID-1-resembling logical volume, mount it to its home, and be done!  I’ll let you know if the performance increase is wicked awesome.
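
For completeness, mounting looks something like the following.  The /mnt/workbench mount point is just a placeholder for wherever you want the volume to live, and the fstab entry is the sort of line that makes the mount permanent:

$ sudo mkdir -p /mnt/workbench
$ sudo mount /dev/labor_vg/workbench_lv /mnt/workbench
$ echo '/dev/labor_vg/workbench_lv /mnt/workbench xfs defaults 0 0' | sudo tee -a /etc/fstab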

Update:

The performance is WICKED AWESOME.  When I was running two CentOS 6.6 VMs concurrently off of the same single 7200 RPM 1 TB drive, an attempt to update them simultaneously resulted in a very obvious queue depth increase, with the yum operations on each system practically taking turns with their I/O calls in a very visible manner.  Now, I can run three systems and update them simultaneously with no noticeable performance degradation during the yum operations.  I am most pleased.

*I don't think the author of that article makes much sense when he talks about the differences in intent between LVM and RAID 0, just FYI.  I am thankful for the data, however.
