Hello again! I apologize for taking so long to put out another article. I am currently doing some patching over the weekend for work servers, and procrastinating on this quarter's cert rotation procedure. I figured I'd weaponize that procrastination to write an article letting everyone know I'm alive and what I'm working on.
As you may have heard, Broadcom, a despicable and greedy corporation, purchased VMware. I just happen to use VMware for the hypervisors behind all of Lain.la because:
- A. It is (or was) a relevant skill to improve upon for the job market and your career.
- B. It is dead simple and very stable.
- C. It's what I started with at the very beginning!
So this acquisition represented an existential threat to the platform underpinning all my servers and services. Plus, I just don't like proprietary garbage in my network anymore, and this is the last piece of it. It is a very thorny and critical piece, however. So, this is my journey to Lain.la v4.0 - The Proxmox conversion!
The Plan
Any good migration starts with a plan, but to plan, you have to have information! So here's a bunch of data points:
- 2 Hosts (esxi3 and esxi5)
- 123 allocated vCPUs
- 336GB of allocated memory
- 2TB of Flash in use
- 83TB of HDDs in use
- 60 VMs
- VMs Encrypted per-disk via vCenter trust provider key
- Veeam for backups on a Windows VM
Here's a picture of what vSphere sees - note that the numbers will be off due to thin allocation of RAM, GHz used as a metric, and the backups being virtualized:
And here's some requirements I've set out for myself:
- Full removal of all proprietary components from the hypervisors.
- Accommodation of existing networking schemes (VLANs, redundant networking / failover, pfSense for the core)
- High availability features similar to vSphere HA (Proxmox's ha-manager should cover this - see the sketch after this list).
- Accommodation of existing backup schemes (deduplication, offsites, incrementals, job schedules, encryption)
- Similar manageability, reporting, metrics, etc.
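For the HA piece specifically, my understanding is that Proxmox's built-in ha-manager gets me most of the way to what vSphere HA did. Here's a minimal sketch of how I expect that to look - the node names, group name, and VM ID are placeholders, and a two-node cluster also needs a QDevice (or a third vote) somewhere for quorum:

```
# Create an HA group spanning the two compute nodes (names are made up)
ha-manager groupadd compute --nodes kvm1,kvm2

# Register a VM as an HA resource so it gets restarted on the surviving node
ha-manager add vm:100 --group compute --state started
```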
The very high-level plan is this:
- Acquire hardware to build a Proxmox cluster ALONGSIDE the vSphere cluster. - Done!
- Configure and install all hardware. - Still in progress.
- Load all operating systems - Proxmox Backup Server on the backup server, TrueNAS on primary storage, and PVE on the compute nodes.
- Get PBS integrated into PVE, get TrueNAS volumes integrated into PVE, get networking set up and ready to go (see the storage sketch after this list).
- Fire up a bespoke VM to test everything above.
- Once the cluster is confirmed working, test the migration plan: decrypt ESXi5's VMs, reinstall ESXi5's OS as PVE, and import the existing VMFS6 volume on disk. Configure the above networking and such too.
- Convert the seedboxes on ESXi5 to qcow2 and boot them as VMs (see the conversion sketch after this list). If it works, we have a way forward.
- Test a migration from existing storage (stor1) to new storage (stor4) of a VM from vSphere and refine that process.
- Move all freebie VMs (yep, you're the guinea pigs!).
- Get backups going to my specs.
- Get offsites tested and going to my specs.
- Move pfSense.
- Move all lain.la VMs under 1TB.
- Figure out how the hell to move the big VMs (Pomf, Minio - probably gonna rsync, honestly; see the sketch after this list). Then move them.
- Once the old cluster has been completely evacuated and everything has been working for AT LEAST two weeks, dump the OS on stor1, stor2, and esxi3, reinstall them with PBS, TrueNAS, and PVE, and integrate them into the new cluster, creating an absolutely massive Proxmox cluster.
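A quick note on the storage integration step, since it's the glue for everything else: getting PBS and the TrueNAS exports visible to PVE should just be a couple of pvesm calls. This is a rough sketch only - the storage IDs, hostnames, datastore/export names, credentials, and fingerprint below are all placeholders:

```
# Attach the PBS datastore on STOR3 as a backup target
# (the fingerprint comes off the PBS dashboard; every value here is made up)
pvesm add pbs stor3-backups \
    --server stor3.internal --datastore backups \
    --username root@pam --password 'changeme' \
    --fingerprint 'AA:BB:CC:...'

# Attach an NFS export from TrueNAS on STOR4 for VM disks
pvesm add nfs stor4-vm \
    --server stor4.internal --export /mnt/tank/vm \
    --content images,rootdir
```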
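And here's roughly what I expect the VMFS-to-qcow2 conversion test to look like. I'm assuming vmfs6-tools can read the old datastore from Linux; the device name, VM names, and the target VM ID and storage are placeholders:

```
# Mount the old VMFS6 datastore from the freshly reinstalled node
apt install vmfs6-tools
vmfs6-fuse /dev/sdb1 /mnt/vmfs

# Convert a seedbox's (already decrypted) VMDK to qcow2, then attach it to a PVE VM
qemu-img convert -p -f vmdk -O qcow2 \
    /mnt/vmfs/seedbox1/seedbox1.vmdk /tmp/seedbox1.qcow2
qm importdisk 101 /tmp/seedbox1.qcow2 stor4-vm
```

If that boots cleanly, the same recipe should carry most of the smaller VMs over.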
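For the monster VMs like Pomf, my current thinking is to skip disk-image conversion entirely and just rsync the data from inside the old VM to a fresh one on the new storage. Something like this, with made-up paths and hostnames - run once while everything is live, and once more after stopping the services:

```
# First pass: seed the bulk of the data while the old VM is still serving traffic
rsync -aHAX --info=progress2 /mnt/pomf/ root@stor4.internal:/mnt/tank/pomf/

# Final pass after stopping services: pick up the last deltas and deletions
rsync -aHAX --delete --info=progress2 /mnt/pomf/ root@stor4.internal:/mnt/tank/pomf/
```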
Hardware
Okay, so you probably saw that the hardware acquisition phase is marked done up there. Here's a picture of that.
This is a LOT of firepower. We need it. Trust me. The upper two servers are KVM1 and KVM2. The bottom two servers are STOR3 and STOR4. KVM servers are specced with the following (each):
- Dell PowerEdge R630 chassis
- 2x E5-2690v4 14 core, 2.6GHz base (3.5GHz boost) CPUs
- 384GB DDR4-2400 RAM
- Dual 240GB Kingston boot drives, RAID1
- Dual 1.6TB Intel DC S3610 SSDs, RAID1, for local use (e.g. Swap, local cache, whatever)
- Quad 10Gb Networking
- Dual PSUs
Storage servers are a new design by yours truly that can stuff 64TB more raw storage per server than the previous "generation". Here's the config:
- Dell PowerEdge R730XD chassis
- 2x E5-2620 v4 8 core, 2.1GHz base (3.0GHz boost) CPUs
- 128GB of DDR4-2133 RAM (32GB for the backup server, STOR3)
- Dual 240GB Kingston boot drives, RAID1
- Dual 256GB NVMe SSDs for ZFS SLOG, RAID1
- Dual 4TB Silicon Power TLC NVMe SSDs, RAID1 (STOR4 Only - this is an analog to the 6.4TB SAS SSDs in STOR1 running remote flash workloads)
- Single 512GB NVMe SSD for ZFS L2ARC
- 8x 16TB Seagate Exos SAS drives in RAIDZ2 (RAIDZ1 for the backup server, STOR3) - see the pool sketch after this list
- Dual 10Gb networking
- Dual PSUs
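To give a sense of how that disk layout hangs together, here's roughly what STOR4's pool looks like in plain ZFS terms. TrueNAS will actually build this through its UI, and the pool and device names below are made up:

```
# 8x 16TB Exos in RAIDZ2, mirrored NVMe SLOG, single NVMe L2ARC (device names are placeholders)
zpool create tank \
    raidz2 sda sdb sdc sdd sde sdf sdg sdh \
    log mirror nvme0n1 nvme1n1 \
    cache nvme2n1

# Sensible defaults for bulk VM and file storage
zfs set compression=lz4 tank
zfs set atime=off tank
```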
All of this allows me to do a migration rather than a cutover or reinstall, because I have enough capacity to run a second "lain.la" in the same rack. This should make the process much easier, since I can test everything without disturbing users of my services. Then, when the migration is complete, I will absorb the old hardware into the new cluster to more than double its capacity, giving us a monstrous amount of resources for the future. When all is said and done, this rack will have 650TB of raw spinning storage (with the option to expand to 900TB), 74 Broadwell (or better) cores to utilize for virtualization, something like 12TB of flash storage, and well over 1TB of RAM. At this stage, all the servers are racked and networked and have passed power-on testing; I'm just short a few parts to finish it all up.
As stated previously, this is a long, expensive, and arduous process. It will take time. It is the largest overhaul of Lain.la to date in its over-four-year history. I will report back when able, but just know I am making steady progress and things are moving forward in between the usual upkeep my services require. For additional updates (and bad memes, jaded musings on life, etc.) you can follow me on the fediverse at https://comp.lain.la/users/7666 (yes, that instance runs on my servers).
-7666