November Updates and Metrics

I'm still in one of the most aggressive infrastructure upgrade periods I've ever had for Lain.la. I spent 13 hours today just working on upgrades, servers, patching, etc., and that's on top of this month's maintenance window, the largest I've ever had to put up with. Let's go over some stuff.

October Updates and Metrics

Oh boy, how things have changed. This has been one of the most aggressive few months I've had for upgrades and changes.

Updates:
  • Stor1.lain.local is now fully operational, with a usable storage pool of 115TB.
  • Stor2.lain.local is now fully operational, with a usable storage pool of 90TB.

10/22/2022 4:00am - Incident Post-Mortem Analysis

Two incidents in a month. Ouch. And always when I'm sleeping, too.

So, at about 4:00am, the two main NYC lain.la nodes on BuyVM had their 1TB block storage mounts effectively jammed. No reads, no writes, no nothing. This wreaked all sorts of havoc - CPU deadlocked, Nginx deadlocked, nothing was going in or out of these two nodes. The other two nodes in Miami were fine. I woke up at about 10am, saw the problem, yanked the nodes, and popped open a ticket to BuyVM.
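A hang like this (a mount that accepts no reads and no writes) is the kind of thing a small liveness probe could flag before anyone wakes up. What follows is only a minimal sketch, not what actually runs on these nodes: the function name `mount_is_responsive`, the 5 second timeout, and the idea of probing from a child process are all assumptions made for illustration. The probe runs in a separate process because I/O against a jammed mount can block indefinitely, and the monitor itself must not get stuck with it.

```python
import subprocess
import sys

# Hypothetical probe script: write a small file under the given directory,
# flush it to disk, read it back, and clean up. Exit code 0 means success.
_PROBE = (
    "import os, sys, tempfile\n"
    "fd, name = tempfile.mkstemp(dir=sys.argv[1])\n"
    "os.write(fd, b'ping'); os.fsync(fd); os.close(fd)\n"
    "with open(name, 'rb') as f:\n"
    "    ok = f.read() == b'ping'\n"
    "os.unlink(name)\n"
    "sys.exit(0 if ok else 1)\n"
)


def mount_is_responsive(path: str, timeout_s: float = 5.0) -> bool:
    """Return True if a small write and read under `path` completes in time.

    Illustrative helper only; the name and default timeout are assumptions.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", _PROBE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # A healthy mount finishes almost instantly; a jammed one never returns.
        return child.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck in uninterruptible I/O may linger,
        # so we deliberately do not wait for it after the kill.
        child.kill()
        return False
```

Called from cron or a monitoring agent with the mount point as `path`, a `False` result could pull the node out of rotation or send a page. The probe also returns `False` when the path does not exist or is not writable, which is usually the desired behavior for a storage mount that has dropped.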

10/13/2022 7:30am - Incident Post-Mortem Analysis

At approximately 7:30am on October 13th, 2022, Pomf suffered an upstream (uncached GETs and all POSTs) failure due to a loss of storage connectivity on the hypervisor (esxi3.lain.local) it was running on. The issue was traced back to a transient failure of the NIC, a Mellanox ConnectX-2 10GbE card, which had timed out while responding to OS commands. This caused the software iSCSI stack to crash irrecoverably. The condition was rectified at 11:30am by a hard reboot of the host, following approximately 45 minutes of troubleshooting and evacuating VMs from it.