It sure has been a long time since we've had one of these, huh?

Overview:

At 3:13AM EST on December 18th, 2023, the Lain.la Datacenter suffered a power outage. Power outages aren't usually a cause for concern, except this one was the mother of all power outages: over 14 hours, with power returning at 5:29 PM. No power outage in the history of Lain.la has ever come close to this duration, but a wind storm that blew through the area caused damage extensive enough that the local utility company needed significant time to repair it.

Lain.la's battery system is designed to handle power outages of up to 6 hours, but this event was way beyond its capacity. Keeping things powered required the generator I keep on hand, feeding the primary UPS and therefore all of the systems attached to it. This was not as straightforward as it sounds, however: not all devices and UPSes enjoy the rough waveform a generator produces, so multiple secondary UPSes switched to battery power unexpectedly and needed some nurturing to return to functionality.

One such UPS provided battery power to the optical network terminal (ONT) that converts the light carried by fiber optic cables into an RJ45 Ethernet uplink, sort of like how a coaxial modem converts coaxial signals to an Ethernet uplink. This UPS failed, looping between powered-on and powered-off states and causing the ONT to rapidly power cycle. This disrupted WAN access from 2:50PM to 3:15PM, cutting Lain.la's servers off from the outside world until I could figure out the problem and put a replacement in. The replacement UPS I installed suffered the same fate a short time later, interrupting WAN access from 4:30PM to 4:40PM. After my second replacement UPS refused to even turn on, I decided to bypass UPSes for the ONT entirely and wire it directly into generator power. Reversing that change after power came back required disrupting WAN access one last time, from 5:30PM to 5:35PM. All in all, across the 14-hour outage, Lain.la was only down for 40 minutes in total. The drops show up as the orange line in the graph below.
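For what it's worth, the 40-minute figure is just those three windows added together. A trivial sketch, if you feel like checking the arithmetic:

```python
from datetime import datetime, timedelta

# The three WAN outage windows from the timeline above (local time, Dec 18th 2023).
windows = [
    ("14:50", "15:15"),  # first ONT UPS failure
    ("16:30", "16:40"),  # replacement UPS failing the same way
    ("17:30", "17:35"),  # moving the ONT back off generator power
]

fmt = "%H:%M"
total = sum(
    (datetime.strptime(end, fmt) - datetime.strptime(start, fmt) for start, end in windows),
    timedelta(),
)
print(total)  # 0:40:00 -> 40 minutes of downtime across a ~14 hour outage
```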

Storing Power for Outages:

As you may be aware, Lain.la keeps quite a large amount of energy on standby for exactly this sort of incident, somewhere in the range of 8kWh, with completely automatic transfer to battery power within milliseconds of a grid disruption. This is contained in two UPS systems - the primary system, which is the custom UPS I built, and the secondary systems (Bank A and B), which are the 1500VA consumer UPSes that act as a buffer against the primary's transfer switch and against possibly noisy input power from, say, a generator. The chart illustrating how this battery system works is below, as a refresher.

The primary UPS functioned 100% perfectly. Its runtime these days is down to approximately 5 hours instead of 6, but that's only due to additional loads I have placed on it since construction, such as the seedbox server and the redundant LAN components. All in all, the system pulls about 120 amps to get the job done, translating to 1.44kW. This doesn't mean Lain.la consumes 1.44kW, as there are losses in play from power conversion and various other factors, but that is what is being extracted from the batteries. That meant when the outage occurred at three in the morning, I just went back to bed.
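If you're wondering where the roughly 5 hour figure comes from, it's simple division. The 12V bus voltage below is the value implied by 1.44kW at 120A rather than anything I measured for this post, so treat it as back-of-the-envelope only:

```python
# Back-of-the-envelope runtime math for the primary UPS.
bus_voltage_v = 12.0     # nominal battery bus voltage (implied by 1.44 kW / 120 A, assumed here)
battery_draw_a = 120.0   # current pulled from the batteries under load
capacity_kwh = 7.68      # total energy stored in the bank

draw_kw = bus_voltage_v * battery_draw_a / 1000   # 1.44 kW drawn from the batteries
runtime_h = capacity_kwh / draw_kw                # ~5.3 hours

print(f"{draw_kw:.2f} kW draw -> {runtime_h:.1f} h of runtime")
```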

The secondary UPSes are where things got tricky. When the primary reached about 20% charge at 7AM, I woke up and just knew I'd need to get the generator going. Thankfully, I was able to do so much faster than usual (turns out, applying the correct amount of pull-start force gets it started quickly. Who knew?). I ran the generator's main lead to the primary UPS input; it began charging and switched into bypass mode, feeding the generator power directly to the secondaries. This was just fine until, maybe an hour later, the secondaries started occasionally balking at the power quality and refusing to take the generator power. Why would they do that? Let's do a little science.

Generators, Generators, Generators:

Conventional generators are not perfect. They are tied to a combustion engine that isn't always consistent at maintaining speed, and those little imperfections, along with the lack of a pure sine wave output due to the way the generator functions, can wreak havoc on sensitive electronics that require clean power. See below the difference between the pure sine wave AC input you get from the grid and the simulated sine wave you get from something like a generator or a cheap UPS.
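If you'd rather poke at the difference numerically than eyeball the picture, here's a rough numpy sketch: it synthesizes one cycle of a clean sine and one cycle of a crude stepped "simulated sine", then compares how much of each signal's energy sits in harmonics above the fundamental. The stepped shape is a generic approximation for illustration, not a capture of my generator's actual output:

```python
import numpy as np

# One 60 Hz cycle, sampled finely.
f, fs = 60.0, 60.0 * 1024
t = np.arange(int(fs / f)) / fs

pure = 170 * np.sin(2 * np.pi * f * t)   # ~120 Vrms grid sine (170 V peak)

# Crude "simulated sine": a stepped square that idles at 0 V between half-cycles,
# the kind of waveform cheap UPSes and inverters put out.
phase = (t * f) % 1.0
stepped = np.where((phase > 0.05) & (phase < 0.45), 170.0,
           np.where((phase > 0.55) & (phase < 0.95), -170.0, 0.0))

def thd(x):
    """Total harmonic distortion: harmonic energy relative to the fundamental."""
    spectrum = np.abs(np.fft.rfft(x))
    fundamental = spectrum[1]                      # bin 1 = one cycle over the window
    harmonics = np.sqrt(np.sum(spectrum[2:] ** 2))
    return harmonics / fundamental

print(f"pure sine THD:    {thd(pure):.3f}")        # essentially zero
print(f"stepped wave THD: {thd(stepped):.3f}")     # substantial harmonic content
```

A clean sine puts essentially all of its energy at 60Hz; the stepped wave dumps a big chunk of its energy into harmonics, which is exactly the stuff that upsets picky electronics.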

The choppy, square-like pattern is what makes these devices very, very mad. In addition, for my generator in particular, the voltage readout was 107 volts, somewhat low, and the input frequency was 58.7Hz, which is most definitely not the 60Hz it should be. This was the root cause of losing the ONT twice and of having to micromanage the other secondary UPSes by swapping them between the primary UPS's remaining power and generator power to "reset" their state. I have ordered a new UPS specifically for the ONT that matches the exact model of the ones I use for the secondaries. Even though the ONT only draws something in the neighborhood of 5 watts, I'm buying the expensive model specifically because it largely tolerates generator power and won't drain to zero or fry like my other UPSes did.
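To put those readings in perspective, here's a quick check against the usual 120V / 60Hz US nominal, which I'm assuming as the reference for "somewhat low":

```python
# Generator readings during the incident vs. nominal US wall power (assumed 120 V / 60 Hz).
nominal_v, nominal_hz = 120.0, 60.0
measured_v, measured_hz = 107.0, 58.7

v_dev = (measured_v - nominal_v) / nominal_v * 100      # about -10.8%
hz_dev = (measured_hz - nominal_hz) / nominal_hz * 100  # about -2.2%

print(f"voltage:   {measured_v} V ({v_dev:+.1f}% from nominal)")
print(f"frequency: {measured_hz} Hz ({hz_dev:+.1f}% from nominal)")
# Neither deviation alone is dramatic, but stacked on top of a dirty waveform it is
# enough to make a picky consumer UPS refuse the input or drop to battery.
```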

See below, the graveyard of "lesser" UPSes zeroed out or fried during this incident. Sure glad I have all these spares!

Final Analysis:

So, with all of that science business out of the way, what can we conclude?

  • Lain.la stayed running well in excess of the "6 hour" off-grid survivability rating I designed it for. This is huge.
  • Impact was limited to 40 minutes of WAN connectivity loss.
  • The only correctable, impactful issue found can be fixed with a $250 drop-in replacement (which I have already purchased).
  • Generators suck, but ones that provide nice, clean power are probably extremely expensive and not worth investing in for such a rare event.
  • It is good to keep lots of spare gasoline. I was down to my last tank after emptying the one spare 5-gallon jug I keep as a reserve, although I suppose I could have siphoned some out of my car.
  • Incorrect assumptions you make about your infrastructure will turn into hard realities when you're in the shit. Always have backup plans.
  • High winds suck! Trees suck too!

I am thoroughly pleased with how my infrastructure design held up under a worst-case, extended blackout scenario. Throughout this entire event, not a single physical or virtual server went down, which meant zero repair or restoration time for servers and no possibility of data loss from a hard shutdown. Sure, we lost the WAN link a few times, but that only cut us off from the outside world, and the cause wasn't even the fault of a server but rather the upstream device's power source, which will now be rectified for the next time this happens. Since the blackout lasted 14 hours, doubling the battery capacity from 5 hours of runtime to 10 (7.68kWh to 15.36kWh) isn't even a worthwhile investment, despite me mulling the idea over sometimes; it wouldn't have helped here, so no changes are really needed there.
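Running the same runtime arithmetic from earlier against the actual outage length backs that up; even a doubled bank comes up hours short:

```python
# Same draw figure as before; the outage ran from 3:13 AM to 5:29 PM.
draw_kw = 1.44
outage_h = 14 + 16 / 60   # 14 hours 16 minutes

for capacity_kwh in (7.68, 15.36):
    runtime_h = capacity_kwh / draw_kw
    verdict = "covers it" if runtime_h >= outage_h else "still comes up short"
    print(f"{capacity_kwh:5.2f} kWh -> {runtime_h:4.1f} h runtime ({verdict})")
```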

The uptime chart below could have looked a hell of a lot worse if I had never built that UPS system. We'd have lost servers and had to battle the secondary UPSes falsely tripping to battery power, without any sort of "clean" power source buffer to help reset them. I predict Lain.la would have failed within a few hours.

I knew someday that project would pay off, and that day was today.

-7666