Caching LVM for Pomf

After the slice range improvement, IOWait (the share of CPU time spent idle while waiting for storage I/O to complete) went up quite a bit across the four edge nodes used for Pomf traffic, due to the need to address a much larger number of fragmented files on the slow storage cache disk. It wasn't really a problem before, but with over 100,000 slices to manage, the IOPS requirements change quite a bit.
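To keep an eye on that IOWait figure, here's a minimal sketch of how you can measure it yourself from two samples of the aggregate "cpu" line in /proc/stat. The function names and the sample strings are mine, not anything from the actual edge node tooling; the counter layout (user, nice, system, idle, iowait, ...) follows the proc(5) man page.

```python
def parse_cpu_line(line: str) -> list[int]:
    """Split a '/proc/stat' aggregate cpu line into its numeric jiffy counters."""
    fields = line.split()
    assert fields[0] == "cpu", "expected the aggregate 'cpu' line"
    return [int(v) for v in fields[1:]]


def iowait_fraction(before: str, after: str) -> float:
    """Fraction of total CPU time spent in iowait between two samples.

    The fifth counter (index 4) is iowait per proc(5); the fraction is the
    iowait delta over the sum of all counter deltas between the samples.
    """
    b, a = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [x - y for x, y in zip(a, b)]
    total = sum(deltas)
    return deltas[4] / total if total else 0.0


# Hypothetical samples a few seconds apart: 200 of 1050 elapsed jiffies
# were spent in iowait, i.e. roughly 19% of CPU time stuck on storage.
sample_1 = "cpu 100 0 50 800 50 0 0 0"
sample_2 = "cpu 200 0 100 1500 250 0 0 0"
print(f"{iowait_fraction(sample_1, sample_2):.1%}")
```

In practice you'd read /proc/stat twice with a sleep in between instead of using canned strings; the arithmetic is the same thing tools like iostat and top do under the hood.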

Funding for Ideas - Short Version

Update 1/7/2025: BuyVM has been acquired, and this throws Lain.la's future into doubt.

BuyVM is the provider of the edge nodes that I use to publish Lain.la, and Francisco, the owner, has sold the company. While I don't expect anything to change immediately, what I'm doing today (e.g. pushing 500TB to 600TB of traffic a month) is now in jeopardy.

November Updates and Metrics

I'm still in one of the most aggressive infrastructure upgrade periods I've ever had for Lain.la. I spent 13 hours today just working on upgrades, servers, patching, etc., not to mention the largest maintenance window I've ever had to put up with this month. Let's go over some stuff.

October Updates and Metrics

Oh boy, how things have changed. This has been one of the most aggressive few months I've had for upgrades and changes.

Updates:
  • Stor1.lain.local is now fully operational, with a usable storage pool of 115TB.
  • Stor2.lain.local is now fully operational, with a usable storage pool of 90TB.

10/22/2022 4:00am - Incident Post-Mortem Analysis

Two incidents in a month. Ouch. And always when I'm sleeping, too.

So, at about 4:00am, the two main NYC lain.la nodes on BuyVM had their 1TB block storage mounts effectively jammed. No reads, no writes, no nothing. This wreaked all sorts of havoc - CPU deadlocked, Nginx deadlocked, nothing was going in or out of these two nodes. The other two nodes in Miami were fine. I woke up at about 10am, saw the problem, yanked the nodes, and popped open a ticket to BuyVM.
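One way to catch a jammed block storage mount like this before waking up to it: probe the mount from a thread with a timeout. A process touching a hung device typically ends up in uninterruptible sleep and can't be killed, but the watchdog's main thread can still notice the probe never returned and alert. This is a minimal sketch of that idea, not the monitoring actually running on these nodes; the function name and timeout are mine.

```python
import concurrent.futures
import os


def mount_is_responsive(path: str, timeout: float = 5.0) -> bool:
    """Return True if a small metadata read on `path` completes within
    `timeout` seconds.

    The read runs in a worker thread. Against a jammed device the worker
    just hangs in uninterruptible sleep, so the future never finishes and
    we time out instead of hanging the whole checker with it.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(os.listdir, path)
        try:
            future.result(timeout=timeout)
            return True
        except concurrent.futures.TimeoutError:
            return False
        except OSError:
            # Missing or errored-out mount point counts as unhealthy too.
            return False
    finally:
        # Don't block on a worker that may be stuck forever.
        pool.shutdown(wait=False)
```

A cron job or systemd timer running this against each block storage mount, paired with any paging mechanism, would have turned a six-hour outage into a 4:05am phone buzz.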

10/13/2022 7:30am - Incident Post-Mortem Analysis

At approximately 7:30am on October 13th, 2022, Pomf suffered an upstream failure (affecting uncached GETs and all POSTs) due to a loss of storage connectivity on the hypervisor it was running on (esxi3.lain.local). The issue was traced to a transient failure of the NIC, a Mellanox ConnectX-2 10GbE card, which had timed out responding to OS commands and caused the software iSCSI stack to crash irrecoverably. The condition was rectified at 11:30am with a hard reboot of the host, following approximately 45 minutes of troubleshooting and the evacuation of VMs from it.