10/13/2022 7:30am - Incident Post-Mortem Analysis

At approximately 7:30am on October 13th, 2022, Pomf suffered an upstream failure (affecting uncached GETs and all POSTs) due to a loss of storage connectivity on the hypervisor it was running on (esxi3.lain.local). The issue was traced back to a transient failure of the host's NIC, a Mellanox ConnectX-2 10GbE card, which had timed out in responding to OS commands. This led to the software iSCSI stack crashing irrecoverably. The condition was rectified at 11:30am by a hard reboot of the host, following approximately 45 minutes of troubleshooting and evacuating VMs from it. I was getting my beauty sleep between 7:30am and 10:30am.

Now I know what you're thinking - didn't you just move Pomf to your new servers? Guess they suck, huh! Not quite, dear reader. This was a little different. The NIC on the host itself failed multiple times in the hours leading up to the event, then eventually crashed. The NIC failure may have been caused by the heavy I/O load of transferring my backups to the other new storage server during this time, but the card should not have crashed at all, really. Neither stor1.lain.local (R730xd, pomf) nor stor2.lain.local (R720xd, backups) encountered an issue.

Here's a transient failure record:

Here's when it finally died:

In VMware parlance, this is known as an All Paths Down event. The underlying storage for the VM was cut off, causing deadlocks in the hypervisor and in Pomf itself. With no access to storage, Pomf could not read or write anything to the main storage pool, but caching on the edge nodes was of course still functional.

So, how do we prevent this from happening in the future?

Well, the redundant networking project is certainly a start. It would prevent a single temperamental NIC from severing an entire server from the new storage systems and would have prevented this issue entirely.
Replacing this particular Mellanox card is also a good idea. I don't trust it anymore. I have two more Broadcom 57810s on the way, which have been battle tested in my other two servers and have performed well so far. I will install them as soon as they arrive. There will be no downtime from this maintenance.
Pomf did not auto-migrate in the HA cluster because it has a higher performing database drive that remained accessible (it lives on the local SSD RAID on esxi3.lain.local). This drive is used for the NCMEC lookup table. I will move it onto the new storage system to ensure it does not prevent failover migration - although Pomf is not reboot safe due to the cryptsetup partition anyway.
Monitoring completely missed this event due to the caching layer. I need to devise a solution to test Pomf's upstream server consistently instead of hitting the cache nodes.
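
For the monitoring item, here is a rough sketch of what an upstream-only probe could look like. It is not what Pomf runs. It assumes the edge cache keys on the full URL, so a unique query parameter forces a miss and the request reaches the upstream server; if the cache ignores query strings, the probe would need a dedicated uncached path or the origin's direct address instead. The `_probe` parameter name, the function names, and the example URL are all made up for illustration.

```python
import urllib.parse
import urllib.request
import uuid
from typing import Callable, Optional


def cache_busting_url(base_url: str, token: Optional[str] = None) -> str:
    """Append a unique query parameter so a URL-keyed cache should miss.

    The "_probe" parameter name is hypothetical; any otherwise unused name works.
    """
    token = token or uuid.uuid4().hex
    parts = urllib.parse.urlsplit(base_url)
    query = urllib.parse.parse_qsl(parts.query, keep_blank_values=True)
    query.append(("_probe", token))
    new_query = urllib.parse.urlencode(query)
    return urllib.parse.urlunsplit(parts._replace(query=new_query))


def http_fetch(url: str, timeout: float) -> int:
    """Perform a real GET and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"Cache-Control": "no-cache"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def upstream_is_healthy(
    base_url: str,
    fetch: Callable[[str, float], int] = http_fetch,
    timeout: float = 10.0,
) -> bool:
    """Return True only if an uncached request comes back with a 2xx status.

    Timeouts, connection failures, and HTTP errors raised by urllib are all
    OSError subclasses, so any of them counts as unhealthy.
    """
    try:
        status = fetch(cache_busting_url(base_url), timeout)
    except OSError:
        return False
    return 200 <= status < 300


if __name__ == "__main__":
    # Hypothetical endpoint; substitute a real uncached URL on the site.
    print(upstream_is_healthy("https://example.invalid/"))
```

The timeout matters here: during an All Paths Down event the upstream tends to hang rather than refuse connections, so a probe without a timeout would hang right along with it.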

One other side note: I was able to migrate every virtual machine to another host, even with esxi3.lain.local in its degraded state, EXCEPT pfSense. This failed because I had special low-latency settings on the VM that prevented migration. I was able to enable a compatibility override in the VM's advanced settings to allow it to migrate while preserving the CPU and RAM reservations that help it perform as well as possible.

-7666