Two incidents in a month. Ouch. And always when I'm sleeping, too.

So, at about 4:00am, the two main NYC lain.la nodes on BuyVM had their 1TB block storage mounts effectively jammed. No reads, no writes, no nothing. This wreaked all sorts of havoc - CPU deadlocked, Nginx deadlocked, nothing was going in or out of these two nodes. The other two nodes in Miami were fine. I woke up at about 10am, saw the problem, yanked the nodes, and popped open a ticket to BuyVM.

(Grey is iowait in htop's extended colors, by the way)

Now you may be wondering to yourself, "[7666], don't you have a script to remove bad nodes from DNS when this kind of thing happens?" And yes, yes I do. But, thanks to a teeny tiny bit of lazy scripting on my part, it did not function during this event despite my previous testing.

So, here's the logic behind the monitor script. Admittedly, this code has really gotten to be a mess after years of moving between providers, adding functionality, and never refactoring it to just take a damn IP as a parameter. At a high level (sketched in code after this list):

  • Detect failures via probing HTTP, SSH, and ping on all four nodes
  • If there is a failure, count and log these failures, and print what the failure was
  • If enough failures accumulate, yank the node from DNS, but only if we haven't yanked 2 nodes already (to prevent cutting ourselves off entirely), and then log a NODE#DEAD entry (where # is the node's ID) in the kill count text file.
  • (New) After the cron job has removed enough strikes from the strike log (only possible if new strikes aren't detected), re-admit the node to DNS and remove the NODE#DEAD entry.
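
Boiled down, the loop looks something like this. This is a sketch, not the real (messy) script - the node IPs are placeholders, the strike log path and threshold are made up, and yank_from_dns stands in for the actual DNS API call; only the killcount.txt path and the NODE#DEAD convention are real:

#!/bin/bash
# Sketch of the monitor loop - placeholder IPs, hypothetical strike file.

KILLFILE=/etc/cron.witness/killcount.txt
STRIKEFILE=/etc/cron.witness/strikes.txt    # hypothetical path
MAXSTRIKES=3                                # hypothetical threshold

declare -A NODES=( [1]=203.0.113.1 [2]=203.0.113.2 [3]=203.0.113.3 [4]=203.0.113.4 )

yank_from_dns() { :; }    # stub - the real script calls the DNS provider's API

probe_node() {            # probe ping, SSH, and HTTP; print what failed
    local ip=$1
    ping -c 1 -W 2 "$ip" > /dev/null 2>&1   || { echo "ping failed"; return 1; }
    nc -z -w 2 "$ip" 22                     || { echo "ssh failed"; return 1; }
    curl -sf -m 5 "http://$ip/" > /dev/null || { echo "http failed"; return 1; }
}

touch "$STRIKEFILE" "$KILLFILE"
for ID in "${!NODES[@]}"; do
    if ! RESULT=$(probe_node "${NODES[$ID]}"); then
        echo "NODE$ID $(date) $RESULT" >> "$STRIKEFILE"   # count and log the failure
        STRIKES=$(grep -c "NODE$ID" "$STRIKEFILE")
        KILLCOUNT=$(grep -c NODE "$KILLFILE")             # nodes already yanked
        if [ "$STRIKES" -ge "$MAXSTRIKES" ] && [ "$KILLCOUNT" -lt 2 ]; then
            yank_from_dns "$ID"
            echo "NODE${ID}DEAD" >> "$KILLFILE"
        fi
    fi
done

The re-admission half (the cron job that ages out strikes and removes the NODE#DEAD entry) hangs off the same two files.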

The detection code worked very well. Here's a snippet of the output:

The issue was that the script was simply not yanking the nodes. Why? Well, in the before times, I enjoyed using "wc -l" as a crutch to get a usable result out of grep without actually looking at grep's return code. For example, run "cat file.txt | grep HELLO | wc -l" and check for anything nonzero to see if HELLO exists in the file. Crappy, yes, but reliable when done correctly. However, when rewriting the monitor script, I forgot to add the "grep" to the line that reads the text file keeping record of how many nodes we've killed.
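
For the record, the non-crutch way is to check grep's exit code directly - grep -q exits 0 when there's a match and 1 when there isn't, so you don't need wc at all:

if grep -q HELLO file.txt; then
    echo "HELLO is in the file"
fi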

KILLCOUNT=$(cat /etc/cron.witness/killcount.txt | wc -l)

Now, this was never a problem until the rewrite that added the ability to re-admit a node to DNS after good behavior. The way I removed the "NODE#DEAD" entry from the text file was this:

sed -i 's/NODE1DEAD//' /etc/cron.witness/killcount.txt

This particular statement successfully removes the NODE#DEAD entry from the killcount file, BUT - it leaves a newline behind. With just a basic wc -l, even though the NODE#DEAD entry has been removed, that leftover newline still gets picked up in the kill count tally. Apparently, two of these stray newlines had been sitting in the kill count file for some time, which meant the script thought it had already killed two nodes, so it refused to kill another.
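
The whole failure mode reproduces in four lines:

printf 'NODE1DEAD\nNODE2DEAD\n' > /tmp/killcount.txt
sed -i 's/NODE1DEAD//' /tmp/killcount.txt    # entry gone, blank line left behind
wc -l < /tmp/killcount.txt                   # prints 2 - the blank line still counts
grep -c NODE /tmp/killcount.txt              # prints 1 - only the real entry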

The fix is pretty simple, although the text file will still accumulate stray newlines over time (I'll add a newline-trimming part to the sed statement to deal with that):

KILLCOUNT=$(cat /etc/cron.witness/killcount.txt | grep NODE | wc -l)

This ensures that only lines matching NODE entries - and not blank lines - make it to wc -l for the count. So the script will now work in the future; I tested it with this change and it yanked both nodes from DNS immediately.
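
As for the newline trimming, it will probably end up as a second sed expression that deletes whatever line the substitution just emptied (untested as of this writing):

sed -i -e 's/NODE1DEAD//' -e '/^$/d' /etc/cron.witness/killcount.txt

That keeps the file at one line per dead node, so even a plain wc -l would count correctly again.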

Between the time of detection and writing this incident log, there has been no response from BuyVM, but it's only been an hour. It will probably get fixed today. For now, Lain.la can run just fine on two nodes.