For context, you may want to read about LVM caching here - in this article, I scrap the entire idea.
Hello dear reader. It is June 2023, and here I am at my wits' end, attempting to determine why my load average shot up to 20+ on yet another edge node. As usual, my only lead is "I/O", but I can't pin down the cause. I have cycled Nginx settings like mad, rebooted like crazy, and tried to introduce chaos into a problem that will not budge. What's worse, it's only one or two nodes, not all of them. So, what gives?
(Example of overload beginning to occur:)
I'll spare you the details of my nightmarish debugging, but I came up with a very plausible two-pronged theory:
- BuyVM's storage slabs are inconsistently slow. I don't know why; they just suck for performance every so often, leading to the load spikes above.
- My LVM caching architecture was introducing undue overhead on the system in some way, shape, or form. This is possible; it was complicated, after all.
As a reminder, this was the cache architecture:
- L1: RAM
- L2: LVM Cache (SSD)
- L3: Pomf Cache (HDD)
- L4: Pomf Itself (Cache Miss)
Now it's:
- L1: RAM
- L2: Pomf Cache (SSD)
- L3: Pomf Itself (Cache Miss)
This means the HDD is entirely gone, and thus the LVM config and cache system are gone with it. While Pomf's edge cache is now about 1/4 the size it used to be (270GB instead of 1TB), we can now guarantee excellent caching performance, as that entire 270GB is SSD storage that comes with the VM. Plus, with 4MB slice range caching, we aren't going to have problems refreshing content anyhow. We'll just have a few more cache misses that have to travel over the WAN, so we'll see some increased traffic there, but nothing approaching contention, since 4MB calls are cheap to make. Ultimately, the performance is much better.
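To make that concrete, here's a minimal sketch of what an SSD-only edge cache with 4MB slice range caching can look like in Nginx. This is illustrative only, not my actual config: the paths, zone name, hostnames, and TTLs are placeholders.

```nginx
# Sketch of an SSD-only Nginx edge cache with 4MB slice range caching.
# Paths, names, and TTLs below are placeholders, not the real setup.

proxy_cache_path /var/cache/nginx/pomf          # lives on the VM's local SSD
                 levels=1:2
                 keys_zone=pomf_cache:100m
                 max_size=270g                  # the whole cache fits on the SSD now
                 inactive=30d
                 use_temp_path=off;

server {
    listen 80;                                  # TLS omitted for brevity
    server_name pomf-edge.example;              # placeholder hostname

    location / {
        slice              4m;                  # fetch and cache in 4MB slices
        proxy_cache        pomf_cache;
        proxy_cache_key    $uri$is_args$args$slice_range;
        proxy_set_header   Range $slice_range;  # pass the slice's byte range upstream
        proxy_cache_valid  200 206 30d;         # cache both full and partial responses
        proxy_pass         https://pomf-origin.example;  # cache miss -> over the WAN
    }
}
```

Because $slice_range is part of the cache key, each 4MB chunk is cached independently, so a cache miss only costs one small ranged request over the WAN instead of a re-download of the whole file.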
Oh, and one more very important thing: no more HDDs in the cache means we're saving $5/mo per edge node, which shaves $20/mo off my costs! I was going to stop donating to the EFF ($250/yr) this year due to Lain.la's soaring costs, but I reinstated that membership now that we've made a cost reduction roughly equivalent to the fee. (UPDATE: I added more NVMe storage at $10/mo, so this actually ended up being more expensive. The reliability, in my opinion, was worth it. I'll also keep my EFF membership regardless of the cost.)
I will keep an eye on the WAN traffic and the load averages on each node. Right now, they're in the "cache refill" stage, where most Pomf requests are cache misses. This means I'm pushing 600-700Mbps out of my internet link until that cools off, and performance will be reduced until tunnel contention ceases (approx. 8-12 hours). Poor pfSense is burning half of its CPU on AES encryption for the tunnels with all the traffic!
You can see what the cache refill looks like on the edge node. The blue line is inbound data - as more files populate the cache, fewer calls have to be made to origin, so it gradually "warms up".
Here are the results, however, from Slice4, which was my testing node for this change and now has a fully warmed-up cache. That's quite the improvement!
-7666