The DIY UPS Project - Power Outages? Not a problem!
Oh boy. Where do I begin with this article? Let's start with: Happy New Year!
Okay, maybe a table of contents is a better start:
Caching LVM for Pomf
After the slice range improvement, IOWait (the amount of time the CPU sits stalled waiting for storage calls to finish) across the four edge nodes used for Pomf traffic went up quite a bit, due to the need to address a larger number of fragmented files across the slow storage cache disk. Before, it wasn't really a problem, but now that there are over 100,000 slices to manage, it changes the IOPS requirements quite a bit.
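For anyone wanting to put a number on IOWait, here is a minimal sketch, assuming a Linux node where the aggregate `cpu` line of `/proc/stat` is sampled twice. The function name `iowait_percent` and the sample values are made up for illustration and are not taken from the actual edge nodes.

```python
def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two samples of the
    aggregate 'cpu' line from /proc/stat (Linux only)."""

    def parse(line: str) -> list[int]:
        fields = line.split()
        if not fields or fields[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
        # Counters in order: user, nice, system, idle, iowait, irq, softirq, steal.
        # Guest counters are skipped because they are already included in user/nice.
        return [int(value) for value in fields[1:9]]

    first, second = parse(before), parse(after)
    deltas = [new - old for old, new in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


# Hypothetical samples taken a few seconds apart.
sample_a = "cpu  100 0 50 800 50 0 0 0 0 0"
sample_b = "cpu  150 0 100 1400 350 0 0 0 0 0"
print(f"{iowait_percent(sample_a, sample_b):.1f}% iowait")
```

On a live node the two strings would come from reading the first line of `/proc/stat` at two points in time; tools such as `top` and `iostat` report the same figure.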
Funding for Ideas - Short Version
Update 1/7/2025: BuyVM has been acquired, and this throws Lain.la's future into doubt.
BuyVM is the provider of the edge nodes that I use to publish Lain.la, and Francisco, the owner, has sold the company. While I don't expect anything to change immediately, I would expect that what I am continuing to do today (e.g. pushing 500TB to 600TB of traffic a month) is in jeopardy.
Pomf Now Uses Cache Slicing!
Hello again dear reader! Pomf has continued to scale to wild heights, and so the cracks are starting to show. One such "crack" was the issue of cache refills. Read more on how I have just solved this problem, hopefully forever.
November Updates and Metrics
I'm still in one of the most aggressive infrastructure upgrade periods I've ever had for Lain.la. I spent 13 hours today just working on upgrades, servers, patching, etc., not to mention this month's maintenance window, the largest I've ever had to put up with. Let's go over some stuff.
Rant: EC-Council - Beware the False Prophets of Security
Preface: This article is entirely my opinion, based on my direct experiences with EC-Council courseware, training, and examinations. I currently hold an EC-Council certification. This might change if they ever read this and manage to round up enough outsourced Indians to figure out who I am.
October Updates and Metrics
Oh boy, how things have changed. This has been one of the most aggressive few months I've had for upgrades and changes.
Updates:
- Stor1.lain.local is now fully operational, with a usable storage pool of 115TB.
- Stor2.lain.local is now fully operational, with a usable storage pool of 90TB.
10/22/2022 4:00am - Incident Post-Mortem Analysis
Two incidents in a month. Ouch. And always when I'm sleeping, too.
So, at about 4:00am, the two main NYC lain.la nodes on BuyVM had their 1TB block storage mounts effectively jammed. No reads, no writes, no nothing. This wreaked all sorts of havoc - CPU deadlocked, Nginx deadlocked, nothing was going in or out of these two nodes. The other two nodes in Miami were fine. I woke up at about 10am, saw the problem, yanked the nodes, and popped open a ticket to BuyVM.
10/13/2022 7:30am - Incident Post-Mortem Analysis
At approximately 7:30am on October 13th, 2022, Pomf suffered an upstream (uncached GETs and all POSTs) failure due to a loss of storage connectivity on the hypervisor (esxi3.lain.local) it was running on. The issue was traced back to a transient failure of the NIC, a Mellanox ConnectX-2 10GbE card, which had timed out in responding to OS commands. This led to the software iSCSI stack crashing irrecoverably. This condition was rectified at 11:30am by a hard reboot of the host, following approximately 45 minutes of troubleshooting and VM evacuations from it.
Storage. Storage Storage Storage.
I jumped the gun on the storage server project. Decided to just go for it. And guess what? It's done! Pomf is on the new server right now!