This Server Deployment was HORRIBLE

Linus Tech Tips (@LinusTechTips) · Feb 4, 2020 · 17:40
Source: YouTube
Views: 1.8M
Subscribers: 16.8M

Promos

Our new server is TOO fast... and no, that is not a good thing.

Get $20 in free credit on your new account at linode.com
Monitor and manage your PC in real-time with Pulseway! Create your free account today at lmg.gg
Sub to Level1Techs!! youtube.com
Buy AMD EPYC Rome processors on Amazon (PAID LINK): geni.us
Buy Intel P4500 on Amazon (PAID LINK): geni.us
Buy Crucial 32GB DDR4-2933 on Amazon (PAID LINK): geni.us
Gigabyte EPYC server: gigabyte.com
Purchases made through some store links may provide some compensation to Linus Media Group.

AI Overview

The video opens with Linus describing a high-performance storage server project that was intended to be the pinnacle of speed and efficiency. The initial experience quickly reveals a puzzling flaw: inconsistent transfer speeds on a 24-drive NVMe storage array, sometimes hitting hundreds of megabytes per second and other times dropping to mere tens. The team documents their diagnostic process, first suspecting Windows Storage Spaces and later discovering that individual drives were being reset at the PCIe level, causing timeouts and stalls in data transfers. They experiment with driver updates, power management tweaks, and different BIOS/firmware configurations, but the dropouts persist across both Windows and Linux, hinting at a hardware or fundamental architectural bottleneck. This section emphasizes the challenge of pushing NVMe hardware to its theoretical limits and the reality that software optimizations can only go so far when the underlying platform struggles to keep up.

The narrative then pivots to a deeper hardware-centric investigation, underscoring the importance of memory bandwidth, PCIe lane utilization, and CPU interrupt handling in multi-drive arrays. Because the same dropouts appear on Linux as on Windows, the team suspects a hardware issue rather than a misconfiguration. The discussion delves into the kernel's handling of interrupts, the limits of memory bandwidth, and how multi-core CPUs can complicate the timely handling of data arriving from many NVMe devices. They explain that software RAID and ZFS can exacerbate the problem under heavy parity calculations and simultaneous reads and writes, leading to instability in video editing workflows.

The video then explains a crucial shift in strategy: moving away from a naive, single approach to storage toward a multi-disk software RAID setup with tuned parameters, including a new chunking strategy, running on Proxmox, a Linux-based virtualization platform. This portion builds a clear picture of how theory meets messy practical constraints in high-end storage deployments.

The final act focuses on practical results and ongoing optimization. They experiment with a 64-core CPU, ultimately dialing back to 32 cores to balance cost and performance, and switch from interrupt-driven I/O to polling-based I/O to better exploit the hardware. The team reports sustained transfer rates of around 3 gigabytes per second under multi-client use, with latency improvements that matter in a multi-user editing environment. They also stress that latency is more important than raw peak bandwidth for real-world editing tasks, where small delays can disrupt timeline scrubbing and playback. The video concludes with practical takeaways, acknowledging that even optimized software and hardware configurations have limits and trade-offs, but that the current setup delivers usable, stable performance for their needs. Finally, they praise a community contributor and invite viewers to explore more infrastructure-focused content on Level1Techs, while referencing additional Linus-related content and promotions.

Topics · technology · hardware · server · storage · linux · virtualization · troubleshooting

Questions answered

What caused the NVMe dropouts in the 24-drive storage array?
The dropouts were caused by a hardware interaction at the PCIe/driver level where drives reset and time out, effectively stalling data transfers and requiring recovery.
What solution strategy did the team settle on to achieve usable performance?
They migrated to a Linux-based setup with Proxmox, switched to a 32-core CPU, implemented a polling-based I/O model, experimented with 128K chunk sizes, and used a four-disk software RAID layout to balance latency and throughput.
What performance did the final configuration achieve under multi-client use?
The final setup delivered around 3 gigabytes per second of reads/writes with three clients concurrently, with improved latency suitable for video editing workflows.
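
To make the chunk-size point concrete, here is a minimal sketch of how a 128K chunk size and a four-disk layout could map a logical offset onto disks. It assumes plain striping with no parity, which the summary does not confirm; the function names `locate` and `disks_touched` are hypothetical and not taken from the video or from any RAID tool.

```python
CHUNK_SIZE = 128 * 1024  # 128 KiB chunk size, as mentioned in the summary
NUM_DISKS = 4            # four-disk layout, as mentioned in the summary


def locate(offset: int, chunk_size: int = CHUNK_SIZE,
           num_disks: int = NUM_DISKS) -> tuple[int, int]:
    """Map a logical byte offset to (disk index, byte offset on that disk).

    Assumes simple round-robin striping without parity (an illustration only).
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    chunk_index, within_chunk = divmod(offset, chunk_size)
    stripe, disk = divmod(chunk_index, num_disks)
    return disk, stripe * chunk_size + within_chunk


def disks_touched(offset: int, length: int, chunk_size: int = CHUNK_SIZE,
                  num_disks: int = NUM_DISKS) -> set[int]:
    """Return the set of disks that a request of `length` bytes at `offset` hits."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first_chunk = offset // chunk_size
    last_chunk = (offset + length - 1) // chunk_size
    return {chunk % num_disks for chunk in range(first_chunk, last_chunk + 1)}


if __name__ == "__main__":
    print(locate(0))                         # first byte of the array
    print(locate(CHUNK_SIZE))                # first byte of the second chunk
    print(disks_touched(0, 4096))            # a small request
    print(disks_touched(0, 4 * CHUNK_SIZE))  # a full-stripe request
```

Under these assumptions, a small request stays on one disk while a request spanning a full stripe is spread across all four, which is one way chunk size can trade latency against throughput.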