At approximately 2pm on April 10th, 2022, high latency was detected across all endpoint nodes.

There is evidence that this event may have begun occurring as early as 12:06am on April 10th, but the full effects weren't felt until 2pm.

Lain.la maintains five endpoint nodes across two datacenters (Miami and NYC) via Frantech (BuyVM). Four of them handle Pomf traffic; the fifth is a quiet node reserved for more performance-sensitive applications. This setup prevents single points of failure and spreads traffic evenly across the endpoints. However, the issue here was clearly at a different single point of failure: either Frantech's Autonomous System (AS53667) or something en route to their AS, such as Verizon or another transit provider (e.g. PATH, the DDoS mitigation provider).
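
As a rough illustration of the fan-out, here's a minimal sketch that just counts how many addresses a hostname resolves to. Both the hostname (pomf.lain.la) and the idea that DNS is what spreads clients across the endpoints are my assumptions for the example, not a description of the real setup.

import socket

# Hypothetical hostname -- substitute whichever name actually fronts the endpoints.
HOSTNAME = "pomf.lain.la"

# getaddrinfo returns one tuple per (address, family, socktype) combination;
# collapsing to a set of IPs shows how many distinct endpoints DNS hands out.
addresses = {info[4][0] for info in socket.getaddrinfo(HOSTNAME, 443, proto=socket.IPPROTO_TCP)}

print(f"{HOSTNAME} resolves to {len(addresses)} address(es):")
for addr in sorted(addresses):
    print("  ", addr)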

Initial investigation did not produce an immediate source. The data observed indicated that all nodes were actually stable despite the injected latency. The baseline latency to NYC is 10ms and the baseline to Miami is 40ms. If you subtract 120ms from each of the elevated ping readings, you land very close to those baselines, meaning there was a 120ms detour coming from somewhere.
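
To make the subtraction concrete, here's a tiny sketch using the baselines above; the "observed" numbers are placeholders in the right ballpark, standing in for the actual ping checks.

# Baselines from monitoring: NYC ~10ms, Miami ~40ms.
BASELINES_MS = {"NYC": 10, "Miami": 40}

# Placeholder observed latencies during the event (illustrative, not exact readings).
observed_ms = {"NYC": 130, "Miami": 160}

for site, observed in observed_ms.items():
    detour = observed - BASELINES_MS[site]
    print(f"{site}: observed {observed}ms - baseline {BASELINES_MS[site]}ms = {detour}ms of detour")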

After finding a suitable traceroute source to work with (I forgot pfSense has built-in utilities for literally this kind of problem), I was able to get a traceroute with DNS resolution (a per-hop breakdown of where the latency jumps follows the dump):

1 192.168.1.1 (192.168.1.1) 0.147 ms 0.143 ms 0.115 ms
2 * * *
3 B5301.NWRKNJ-LCR-22.verizon-gni.net (100.41.128.24) 11.963 ms 11.566 ms
B5301.NWRKNJ-LCR-21.verizon-gni.net (100.41.15.254) 8.167 ms
4 * * *
5 * * *
6 0.ae12.GW12.IAD8.ALTER.NET (140.222.234.29) 13.367 ms 12.095 ms
0.ae11.GW12.IAD8.ALTER.NET (140.222.234.27) 11.429 ms
7 63.88.105.94 (63.88.105.94) 21.247 ms 20.456 ms 24.014 ms
8 * * 81.52.166.77 (81.52.166.77) 13.427 ms
9 hundredgige0-1-0-6.lontr6.london.opentransit.net (193.251.241.139) 92.404 ms
hundredgige0-2-0-4.lontr6.london.opentransit.net (193.251.128.12) 92.288 ms 93.860 ms
10 ae303-0.ffttr6.frankfurt.opentransit.net (193.251.243.248) 112.943 ms 98.196 ms 98.710 ms
11 rostelecom-3.gw.opentransit.net (193.251.255.6) 100.025 ms 100.547 ms 117.617 ms
12 87.226.183.87 (87.226.183.87) 123.501 ms
87.226.181.87 (87.226.181.87) 125.159 ms
87.226.183.87 (87.226.183.87) 120.902 ms
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
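
The raw dump is easier to read as per-hop deltas. This sketch just hardcodes the minimum RTT from each hop that answered above and prints the jump between consecutive hops; the roughly 79ms spike at hop 9 (the London opentransit.net router) is where the detour begins.

# Minimum RTT observed at each responding hop in the traceroute above (ms).
hop_rtts_ms = {
    1: 0.115,    # 192.168.1.1 (local gateway)
    3: 8.167,    # verizon-gni.net, Newark
    6: 11.429,   # ALTER.NET, IAD
    7: 20.456,   # 63.88.105.94
    8: 13.427,   # 81.52.166.77
    9: 92.288,   # opentransit.net, London
    10: 98.196,  # opentransit.net, Frankfurt
    11: 100.025, # rostelecom gw
    12: 120.902, # 87.226.183.87
}

previous = None
for hop, rtt in sorted(hop_rtts_ms.items()):
    delta = 0.0 if previous is None else rtt - previous
    flag = "  <-- the detour starts here" if delta > 50 else ""
    print(f"hop {hop:2d}: {rtt:7.3f} ms ({delta:+8.3f} ms vs previous hop){flag}")
    previous = rtt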

The astute observer may note a very strange router appearing here: rostelecom-3.gw.opentransit.net. It's the Russians! You can also WHOIS the 87.* IPs and see that they come back registered to Rostelecom.
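
If you'd rather do that lookup from a script than from the whois binary, here's a minimal sketch that speaks the raw WHOIS protocol (TCP port 43). Pointing it at whois.ripe.net is my choice, since the 87.* space is RIPE-region; any other query term works the same way.

import socket

def whois(query, server="whois.ripe.net", port=43):
    # Raw WHOIS protocol (RFC 3912): send the query, then read until the server closes.
    with socket.create_connection((server, port), timeout=10) as sock:
        sock.sendall((query + "\r\n").encode())
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")

# One of the hop-12 addresses from the traceroute above.
print(whois("87.226.183.87"))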

So we've determined we're being intercepted by the Russian Federation. Great! What can we do about it? Nothing!

This is quite possibly an example of either BGP hijacking (malicious) or a BGP route leak (an oopsie woopsie) upstream from my network (Verizon). What's interesting is that the leak only affected Frantech's AS, as I couldn't replicate it anywhere else. Granted, I only tested a few websites, but still. The combination of Verizon as the upstream and Frantech as the destination was the only way I could get traffic routed through Russia. In addition, testing lain.la inbound from multiple personal VPNs did NOT route through Rostelecom. It only happened with Verizon as the source, which happens to be how all of my VPN tunnels get out to the internet, via my fiber line. I also verified that pfSense, my DD-WRT primary router, and everything else between my servers and the fiber ONT weren't the source of the route snafu (I feared that might be the case).
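
One way to lean toward hijack vs. leak is to check whether the origin AS of the destination's prefix changed (hijack) or stayed put while only the path detoured (leak). Here's a rough sketch against RIPEstat's public data API; the endpoint name and response fields are from memory and should be treated as assumptions, and the target IP is a documentation-range placeholder.

import json
import urllib.request

# Placeholder address inside the affected destination network -- substitute a real endpoint IP.
TARGET_IP = "198.51.100.10"

# RIPEstat "network-info" data call: returns the covering prefix and the AS(es) originating it.
# (Endpoint path and field names are assumptions based on RIPEstat's documented data API.)
url = f"https://stat.ripe.net/data/network-info/data.json?resource={TARGET_IP}"

with urllib.request.urlopen(url, timeout=15) as response:
    payload = json.loads(response.read().decode())

data = payload.get("data", {})
print("Covering prefix :", data.get("prefix"))
print("Origin AS(es)   :", data.get("asns"))
# If the origin still reads AS53667 (Frantech), the announcement itself wasn't stolen;
# the detour lived in the path, which points at a route leak rather than a prefix hijack.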

So what was the impact? Well, the latency wasn't great. It caused significant slowdowns in latency-sensitive applications and made my upload speed go to total crap for Pomf archive retrievals to the endpoint nodes. Surprisingly, nobody emailed me, but still. I can't imagine the performance was great while Putin had his sticky paws all over my traffic. Speaking of which, is this a security incident? Not really. Sure, it's concerning that the Russians technically had custody of all data leaving my datacenter, but it's all encrypted over the wire with AES-128 or better. All of the VPN tunnels are negotiated with a secure PKI and pointed only at explicit IPs. Plus, mandatory HTTPS for everything entering lain.la means that even if they had redirected ALL inbound traffic, all of those HTTPS key exchanges would still function securely. SSH would be fine too. This was more of a nuisance and a performance degradation than anything. The only thing I know of (that I control) that would NOT have been encrypted is Xonotic maps and UDP packets, and I really doubt anybody was playing on my server at the time.
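
On the "everything is encrypted anyway" point, it's easy to check what a client actually negotiates with the front end. This sketch opens a TLS connection and prints the protocol version and cipher; lain.la is just the example hostname, and nothing here depends on the site's real configuration.

import socket
import ssl

HOSTNAME = "lain.la"  # example target; any HTTPS host works

context = ssl.create_default_context()  # verifies the certificate chain and hostname

with socket.create_connection((HOSTNAME, 443), timeout=10) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=HOSTNAME) as tls_sock:
        # cipher() returns (cipher_name, protocol_version, secret_bits)
        name, proto, bits = tls_sock.cipher()
        print(f"Negotiated {proto} with {name} ({bits}-bit)")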

I opened a ticket with BuyVM/Frantech to see if there's anything they can glean from this incident, and I'll report back if I find anything more. In addition, this may make a stronger case for LTE failover for Lain.la, but that would add significant cost ($50/mo) and complexity (failover conditional engineering on DD-WRT) to my networking.
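
In the meantime, the cheapest detection is to just keep watching the path. Here's a rough sketch of that, assuming a Unix-ish monitoring box with the traceroute binary installed; the "suspicious" markers are simply the hop names that showed up during this incident.

import subprocess

TARGET = "lain.la"  # or any of the endpoint node IPs
SUSPICIOUS = ("rostelecom", "opentransit.net")  # hop names seen during this incident

# -q 1: one probe per hop, -w 2: two-second wait, keeps the check quick
result = subprocess.run(
    ["traceroute", "-q", "1", "-w", "2", TARGET],
    capture_output=True, text=True, timeout=120,
)

bad_hops = [
    line.strip()
    for line in result.stdout.splitlines()
    if any(marker in line.lower() for marker in SUSPICIOUS)
]

if bad_hops:
    print("Traffic appears to be detouring again:")
    for hop in bad_hops:
        print("  ", hop)
else:
    print("Path looks clean.")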

Oh and I'm sticking another Ukraine flag here as a test to see if the traffic gets re-routed again.