Caching LVM for Pomf | Lain.La Infrablog

After the slice range improvement, IOWait (the share of time the CPU sits idle waiting for storage calls to finish) across the four edge nodes used for Pomf traffic went up quite a bit, because the slow storage cache disk now has to address a much larger number of fragmented files. Before, it wasn't really a problem, but with over 100,000 slices to manage, the IOPS requirements change quite a bit.

Typical IOWait on a Pomf node was about 30% of the total CPU load. That's pretty high. Here's the vitals of one of these nodes before the LVM change, using my new favorite tool, dstat.

You can see here that 1gb/s of traffic is fairly intensive on a Pomf node. Nginx is quite busy. For a 2-CPU machine (at the time of writing), a load average of 4 is high. You may also notice that the "wai" column, which is IOWait, is pretty high. When IOWait spikes due to a long read or write, Nginx processes start to get blocked (the "blk" column).
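If you want the same IOWait number without dstat, it can be derived from two samples of the aggregate "cpu" line in /proc/stat, where the fifth counter after the label is iowait. This is only a minimal sketch and not how the dstat output was produced; the function name iowait_pct is made up for illustration.

```shell
#!/bin/sh
# iowait_pct (hypothetical helper): given two samples of the aggregate "cpu"
# line from /proc/stat, print the percentage of CPU time that was spent in
# iowait between them.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Columns 2-9 are user, nice, system, idle, iowait, irq, softirq, steal.
            # guest/guest_nice (10-11) are already included in user/nice, so skip them.
            total[NR] = 0
            for (i = 2; i <= 9 && i <= NF; i++) total[NR] += $i
            iow[NR] = $6
        }
        END {
            dt = total[2] - total[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (iow[2] - iow[1]) / dt
        }'
}
```

On a live machine, something like a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat); iowait_pct "$a" "$b" would give the average over those five seconds.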

Each Pomf node has the following configuration as of today (January 2024):

4 Ryzen 5900X CPUs
16GB of RAM
10Gbit Networking
480GB of SSD
3TB of HDD
Debian 11

Before implementing LVM caching, reads/writes were buffered by RAM only. Everything else hit the HDD cache disk. Now, we're using 425GB of that 480GB flash disk as an LVM cache, using lvmcache (ha.) to accomplish this. This means that we now have multiple tiers of cache to help keep Pomf speedy. In order of access speed, this is how a file being served is fetched. If the file is not in the first level, the system moves to the next level:

L1: RAM
L2: LVM Cache (SSD)
L3: Pomf Cache (HDD)
L4: Pomf Itself (Cache Miss)

So what is LVM caching? Simple! You take a volume group and shove a cache PV in it, then just enable caching using that PV. In my case, because the root disk is not LVM, I'm actually using a loopback device to accomplish the job of creating the cache PV without nuking the root disk. This uses an fallocate'd 425G file as a disk. Neat, huh? Here's the list of commands I used, with my notes in parentheses:

apt install lvm2 (install LVM. We kinda need this)
fallocate -l 425G pvcache.img (allocate a 425G file that will act as our cache PV)
losetup /dev/loop0 pvcache.img (set up the loop device for the PV)
pvcreate /dev/loop0 (create a PV from the loop device)
pvcreate /dev/sda (create a PV for the Pomf disk)
vgcreate vgcrypt /dev/sda (create a VG with the Pomf disk)
lvcreate -n lvcrypt -L 3071.9G vgcrypt (create an LV on the Pomf disk)
cryptsetup luksFormat /dev/vgcrypt/lvcrypt (this is for LUKS)
cryptsetup luksOpen /dev/vgcrypt/lvcrypt volume2 (this is for LUKS)
mkfs.ext4 -j /dev/mapper/volume2 (Put a filesystem on the LV)
mount /dev/mapper/volume2 /var/www/permcache (Mount the encrypted Pomf LV)
tune2fs -m 0.1 /dev/mapper/volume2 (Get some storage back!)
vgextend vgcrypt /dev/loop0 (Extend the VG with our cache PV)
lvcreate -n lvcache -L 424.9G vgcrypt /dev/loop0 (Create an LV for the cache PV in the same VG)
lvconvert --type cache --cachepool lvcache vgcrypt/lvcrypt (Enable caching using the cache PV)
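One caveat with the loopback approach: a loop device created with losetup does not survive a reboot, so the image has to be attached again before the VG has all of its PVs. The following is a sketch of what that could look like, not the actual boot procedure on the Pomf nodes. The run and reattach_cache helpers, the DRY_RUN switch and the /root/pvcache.img path in the usage note are all made up for illustration.

```shell
#!/bin/sh
# run (hypothetical helper): print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# reattach_cache (hypothetical helper): re-create the loop device backing the
# cache PV, then activate the VG so the cached LV is available again.
# Arguments: image path, loop device (default /dev/loop0), VG name (default vgcrypt).
reattach_cache() {
    img=$1
    loopdev=${2:-/dev/loop0}
    vg=${3:-vgcrypt}
    run losetup "$loopdev" "$img" || return 1
    run vgchange -ay "$vg"
}
```

With DRY_RUN=1 set, reattach_cache /root/pvcache.img only prints the two commands it would run. The luksOpen and mount steps from the list above would still need to follow.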

Special thanks to the following articles: https://man.archlinux.org/man/core/lvm2/lvmcache.7.en and https://landoflinux.com/linux_lvm_example_02.html

The performance improvement has been excellent after the change, dropping IOWait to single digits.
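To check how much of the read traffic the SSD tier is absorbing, LVM keeps hit and miss counters for a cached LV (lvs exposes them as the cache_read_hits and cache_read_misses report fields). A rough sketch of turning those into a hit ratio follows. The hit_ratio helper is hypothetical, and the numbers in the usage note are invented, not measurements from a Pomf node.

```shell
#!/bin/sh
# hit_ratio (hypothetical helper): percentage of reads served from the cache,
# given a hit counter and a miss counter.
hit_ratio() {
    awk -v h="$1" -v m="$2" 'BEGIN {
        total = h + m
        # Avoid dividing by zero on a freshly created cache.
        if (total == 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * h / total
    }'
}
```

For example, hit_ratio 900 100 prints 90.0. On a real node the two counters could come from lvs --noheadings -o cache_read_hits,cache_read_misses vgcrypt/lvcrypt, run as root.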

Here's some lvdisplay stats (from when the HDD cache was just 1TB)