On November 29th, 2021 at 12:51pm EST, the lain.la datacenter (ha.) suffered a total power outage after a brief moment of voltage fluctuations on the main grid. This was the first time since I started lain.la that a real, extended power outage had occurred, and I got to do a generator cutover and learn many lessons along the way. The incident ended at 1:53pm EST when all systems were restored.
First, what went right:
- My emergency plan to cut in the generator worked! I was able to get all five UPSes connected to generator power before the UPS batteries ran out.
- Power was maintained to all networking systems and the critical R730 server, preventing any interruption in those resources.
- Return to main power was also a simple process with no serious problems experienced.
- My recent change of running each power lead on dual PSU systems to separate UPSes was a very smart move.
- Even though a server and NAS failure (described below) occurred, almost all systems came back up with no intervention.
Now, what went wrong:
- One UPS failed to accept generator power. This UPS is an older, cheaper AVR (automatic voltage regulation) model. I bought it at least 6 years ago. All of my UPSes are PFC (power factor correction with pure sine wave) models except this one, and this one happened to be running the NAS. This caused the NAS to drop, meaning the backing storage for my backups, UPS monitoring, one Hentai node, and most importantly Pomf, dropped. Strangely, this issue did not occur during my cutover testing at the start of the year.
- A second UPS failed to accept generator power some of the time. Eventually the battery gave way and I lost this one too. This UPS ran ESXi1 and that server had a hard power off. I believe this one was due to a configuration setting in the UPS that I set a very long time ago that forced high sensitivity to power disruption. I just set this back to defaults. If you're not aware, generator power waveforms are VERY choppy, and can piss off all sorts of delicate equipment.
- The generator did not start immediately. This is my fault since I didn't give it a good startup over the summer. It slowed me down by 5 minutes trying to get fuel into the carburetor after not running for a year.
- The run times estimated by the UPSes were way over what they could actually deliver. I barely made it in time. If I had been one minute slower, I'd have lost a third UPS.
- A loose power cable on the NAS's power brick caused it to drop power a second time when cleaning up after the cutover. This is now fixed.
How do we improve?
- Some of the runtime problems were caused by yours truly. I'm Ethereum mining on my main rig (almost pays for lain.la entirely) and that rig is tied to the third UPS mentioned above (the one I nearly lost), putting a disproportionate load on it. I will move my main rig (the Ethereum miner) over to the UPS that is having problems so that nothing actually critical fails if it dies.
- The generator will be started each time a maintenance cycle is completed for lain.la. This means it will be started approximately 4 times a year. In addition, an extended load test of each UPS will be conducted to ensure battery and converter health.
- A new UPS to replace the failing AVR model has already been ordered, and the swap will take place in the very near future. This may cause minor downtime, as non-redundant systems will have to be power cycled to move the power lead over to the fancy new UPS. This UPS is the same model as the other two that did not experience any problems whatsoever.
- ESXi1 will be moved to this new UPS as well, just in case the second UPS mentioned above is still having problems. This will also force a reboot of ESXi1.
- Increasing the cache sizes of nodes at the edge will help keep lots of active Pomf content in available VPS storage - meaning most people won't even know there's an outage except when they go to upload something or query an old file. This is already in progress.
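To illustrate why a bigger edge cache hides an origin outage, here is a toy sketch in Python. It is not how the lain.la edge nodes are actually built; `EdgeCache`, `fetch`, the LRU eviction policy, and the file sizes are all assumptions made up for the example.

```python
from collections import OrderedDict


class EdgeCache:
    """Toy size-bounded LRU cache standing in for an edge node (hypothetical)."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # name -> content, least recently used first

    def put(self, name, content):
        # Replace any existing copy so the byte count stays accurate.
        if name in self._items:
            self.used_bytes -= len(self._items.pop(name))
        if len(content) > self.capacity_bytes:
            return  # too large to cache at all
        self._items[name] = content
        self.used_bytes += len(content)
        # Evict least recently used entries until everything fits.
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)

    def get(self, name):
        if name not in self._items:
            return None
        self._items.move_to_end(name)  # mark as recently used
        return self._items[name]


def fetch(cache, name, origin, origin_up):
    """Serve from cache; fall back to the origin only if it is reachable."""
    cached = cache.get(name)
    if cached is not None:
        return cached
    if not origin_up:
        return None  # cache miss during an outage: the request fails
    content = origin[name]
    cache.put(name, content)
    return content


# Five 100-byte files on the origin (the NAS-backed storage in this analogy).
origin = {f"file{i}": b"x" * 100 for i in range(5)}
small, large = EdgeCache(200), EdgeCache(500)

# Warm both caches while the origin is up.
for cache in (small, large):
    for name in origin:
        fetch(cache, name, origin, origin_up=True)

# Origin goes down: count how many files each edge can still serve.
served_small = sum(fetch(small, n, origin, origin_up=False) is not None for n in origin)
served_large = sum(fetch(large, n, origin, origin_up=False) is not None for n in origin)
print(served_small, served_large)
```

In this toy setup the small cache can only serve the most recently requested files once the origin drops, while the large one serves everything it saw. Uploads and requests for files that were never cached still fail either way.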
What risks remain?
- Any power outage carries a risk of component failure due to the harshness of generator power and battery strain, especially in fully depleted battery scenarios. Lead acid batteries don't like being run down all the way.
- Without an automatic transfer switch on a backup propane generator OR a solar system, the risks of a power outage taking down lain.la continue to be present. I'm not at the point yet where I'm willing to spend the extreme amount of money to do either project though. This can cost anywhere from $5k to $50k.
- My previous power outage mitigation plan assumed that, if I was at work, I'd have enough time to get home and cut the generator in. I no longer believe this to be the case. The run time on most of the UPSes I had was 18 minutes. It takes me at least 20 minutes to get home, and let's say 10 minutes more to cut in the generator. Therefore, if I'm not at home and there is a power outage (from, say, a winter storm), I am shit outta luck.
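The time budget in that last point can be written out as a quick calculation. This is only a sketch using the numbers quoted above; the function name is made up for illustration, and the 10 minute cutover figure is my rough estimate rather than a measurement.

```python
# Figures taken from the post; the function and variable names are illustrative.
UPS_RUNTIME_MIN = 18  # run time on most of the UPSes
TRAVEL_MIN = 20       # minimum time to get home from work
CUTOVER_MIN = 10      # rough estimate to cut in the generator


def cutover_margin(runtime_min, travel_min, cutover_min):
    """Minutes of battery left when the generator comes online.

    A negative result means the UPSes run dry before the cutover finishes.
    """
    return runtime_min - (travel_min + cutover_min)


margin_away = cutover_margin(UPS_RUNTIME_MIN, TRAVEL_MIN, CUTOVER_MIN)
margin_home = cutover_margin(UPS_RUNTIME_MIN, 0, CUTOVER_MIN)

print(f"Away from home: {margin_away} min")
print(f"Already home:   {margin_home} min")
```

With these numbers the margin is negative when I'm at work and only a few minutes positive when I'm already home, and that is before counting overly optimistic UPS run time estimates or a generator that won't start right away.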
With these risks in mind, purchasing the new PFC UPS to replace the old AVR is probably the best cost/benefit change I can make here and this is happening now. Everything else is just a risk we have to take going forward. Remember - I do this for free :)