Building an ML workstation

Intro

I have been keeping a close eye on developments in AI/ML for a while now. Whenever I come across an interesting demo, I of course like to try it out. Because my main computer at home only has a weak iGPU, I often resort to running workloads in the cloud (mostly Google Colab or AWS). While that works reasonably well, there are some downsides:

  • risk of going over budget when an instance accidentally keeps running after use
  • general mild inconvenience of working with remote systems
  • cloud defeats the purpose of running a private/local LLM
  • more expensive in the long run

That is why I decided to build my own ML system last month. I am not sure if I will actually end up saving money this way, but it is going to be an educational experience regardless. It is still early days, but this post contains my lessons learned so far.

Component selection

GPU

The core of any ML workstation is the GPU. Due to the ubiquity of CUDA requirements in deep learning, there is only a single viable brand: Nvidia. Their offerings can be divided into three segments:

Architecture      Desktop           Workstation    Datacenter
Pascal (2016)     GeForce GTX 10xx  Quadro P       Tesla P4 / Tesla P100
Volta (2017)      N/A               Quadro GV100   Tesla V100
Turing (2018)     GeForce RTX 20xx  Quadro RTX     Tesla T4
Ampere (2020)     GeForce RTX 30xx  RTX A series   A100
Ada (2022)        GeForce RTX 40xx  RTX 6000 Ada   L4 / L40
Hopper (2022)     N/A               N/A            H100
Blackwell (2024)  TBA               TBA            TBA

If you have tens of thousands of dollars to burn, you will want to look at Nvidia's enterprise offerings and more specifically at the A100 or newer H100 GPUs. These options come with abundant VRAM (40 to 80GB), which we can put to good use in a deep learning context. Additionally, they are very power efficient with a lower TDP compared to consumer-grade GeForce cards. This translates to a smaller physical footprint, so that multiple cards can fit in a single server case. Be careful when installing these GPUs in a regular desktop case though: they only have passive (i.e., fanless) cooling, so they rely on the kind of intensive external ventilation that is standard in a typical, temperature-controlled data center.

Notice how I focus on VRAM above all else. The reasoning behind this is simple: if you do not have enough memory, your model simply will not run. The other specs will only determine how patient you will have to be to see the result.
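
As a rough sanity check (a sketch under simple assumptions, not a precise sizing method): the weights of a model alone need about (number of parameters) x (bytes per parameter), so a 7B-parameter model in fp16 comes to roughly 7e9 x 2 bytes = 14GB, before counting activations or a KV cache. On a running Linux box, nvidia-smi shows how much VRAM you actually have to play with:

    # per-GPU VRAM totals and current usage
    nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv

    # back-of-the-envelope weight size (example values: 7B parameters, 2 bytes/parameter for fp16)
    python3 -c "print(7e9 * 2 / 1e9, 'GB')"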

One step down in the price range (5 000EUR - 10 000EUR) we find their workstation offerings. Here, the RTX A6000 and RTX 6000 Ada with 48GB VRAM both look appealing. These come in a "blower style" form factor instead of the more traditional "open air" form factor, meaning they exhaust hot air via the back instead of spreading it back into the case. This again makes it possible to install many cards in a limited physical space without having to worry too much about heat dissipation. Unfortunately, these cards make a lot of noise, and this type of cooling is not suitable for more power-hungry consumer-grade cards (250W+).

The price range of up to 2000EUR makes those consumer-grade cards a lot more viable for most people. Conveniently, the current and last generation flagships - RTX 4090 and RTX 3090 (Ti) respectively - both have 24GB VRAM. As a result, buying a (lightly) used RTX 3090 (700EUR - 800EUR) might be the most budget-friendly option out there. Downsides of these consumer-grade cards are their high power usage and their unwieldy form factor. (You will have a hard time fitting an RTX 4090 in a 4U server case.)

While Nvidia produces all of its own cards in the datacenter and workstation segments, there is a lot more competition in the consumer space. Nvidia releases Founders Edition (FE) cards for some of its GPU chips, and many other companies (Asus, MSI, Gigabyte, ...) build their own alternative cards around those same chips. They all have their own peculiarities:

  • factory overclocking
    • not at all relevant for us; if anything, we might end up power limiting our card to keep power usage and temperatures under control over long time spans
  • cooling method (1-3 fans, optional watercooling)
  • physical size
    • typically 3+ expansion slots
    • watercooled cards can be slimmed down to 1 slot height
  • power usage
  • power connectors
    • most cards need 2 to 3 PCIe 6+2pin connectors
  • looks
    • especially RGB lights, if you are into that
  • ...

For my build, I am going to start out with a single RTX 3090 FE, but I want to select the other components carefully so that I can expand to 2x3090 or even 3x3090 in the future. Specifications:

  • Ampere architecture
  • 24GB GDDR6X VRAM (= 24 chips x 1GB/chip, half of them on the back of the PCB)
    • these memory chips have a tendency to run hot (>100 degrees Celsius)
    • might need to replace the thermal pads
    • using a GPU brace is also reported to fix some heat issues
    • or consider power limiting with sudo nvidia-smi -i <GPU_index> -pl <power_limit> (see the sketch after this list)
    • or look into custom water cooling blocks, if that is your thing
  • memory bus: 384 bit (= 12 channels x 32 bits/channel)
  • dimensions: 313 mm x 138 mm x 3 expansion slots
  • PCIe connector: PCIe Gen 4 x16
    • note: PCIe Gen 5 is the latest standard, but there are no Gen 5 GPUs yet
  • power
    • 350W
    • connector
      • placed on long edge of card (instead of short edge in higher segments)
      • (variant of) new 12VHPWR connector found on new ATX 3.0 PSUs
      • a conversion cable to 2x PCIe 6+2-pin connectors is included
  • last consumer card to support NVLink
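
A quick illustration of the power limiting mentioned above, as a minimal sketch using stock nvidia-smi commands (the 280W cap is just an example value; pick whatever suits your cooling):

    # show the current, default, min and max power limits of all GPUs
    nvidia-smi -q -d POWER

    # cap GPU 0 at 280W; the limit resets at the next reboot
    sudo nvidia-smi -i 0 -pl 280

    # optional: enable persistence mode so the setting is not lost when the driver unloads between jobs
    sudo nvidia-smi -pm 1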

For a much more in-depth analysis of GPUs for deep learning, check out https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/.

Multi-GPU considerations

To make a multi-GPU setup possible, we need a motherboard and CPU combination with enough PCIe Gen 4 x16 slots, enough PCIe lanes to feed them, and enough physical spacing in between. The main points to consider:

  • space
  • heat
  • power
  • PCIe lanes
  • SLI / NVLink
    • bridge sold separately
    • requires fixed amount of space between cards
      • not compatible with "creative" (i.e., vertical) GPU placement options
  • use same make and model for all cards (a quick check is sketched after this list)
    • else computation will often wait for the slowest card to finish
  • note: this kind of setup only makes sense for deep learning, not for gaming
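
Once multiple cards are installed, a quick sanity check along these lines helps confirm the setup (assuming the Nvidia driver is installed; exact field names can differ per driver version):

    # all entries should report the same model and the expected maximum lane width
    nvidia-smi --query-gpu=index,name,pcie.link.width.max --format=csv

    # if an NVLink bridge is installed, this should list active links per GPU
    nvidia-smi nvlink --status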

CPU

For the CPU we have to choose between Intel and AMD. While Intel used to be a no-brainer in the not-too-distant past, the tables have turned in recent years. I knew this was true in the consumer space, but as it turns out it also holds in high-end segments such as HEDT, workstation and server CPUs.

For an ML workstation, the CPU is not nearly as important as the GPU. When possible, extra budget should go to the GPU instead. However, the CPU has an important role in making sure the GPUs can reach their full potential: it has to be able to supply them with enough data so that they are not sitting idle. This ability to feed the GPUs - memory bandwidth and, above all, PCIe bandwidth - will play a crucial role in our choice of CPU segment.

Some background on PCIe slots: each has a physical size (i.e., width) and a number of communication lanes, both expressed with indicators such as "x1", "x2", "x4", "x8" or "x16". A GPU typically occupies a physical x16 slot because a lot of data has to be transferred back and forth. Note that smaller expansion cards fit in larger slots but not the other way around. For example, an x4 card can plug into an x16 PCIe slot, thereby forfeiting the other 12 lanes. Typically, a slot that is x4 wide and contains an x4 expansion card will use all 4 lanes, and a slot that is x16 wide will use all 16 lanes. In practice, however, the two factors can diverge: when either the CPU or the motherboard is not able to handle the combined number of lanes over all slots, it can decide to run one or more slots at half the number of lanes. So an x16 GPU slot can end up running with x8 lanes.

Furthermore, each PCIe generation roughly doubles the bandwidth of the previous one, so PCIe Gen 5 x8 can work as fast as PCIe Gen 4 x16. You might now be thinking: it evens out if we put a Gen 4 x16 GPU in a Gen 5 x8 slot. Unfortunately, that is not the case. The Gen 5 slot is fully backwards compatible with Gen 4 expansion cards, but it will also be limited to Gen 4 speeds in that case. Effectively, it will still be running at Gen 4 x8 if only 8 lanes are available. To be clear: running at half the number of lanes does not halve the effective speed of the GPU. The number of lanes is not the main bottleneck in most systems, so the performance penalty will be much lower.
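
To see which generation and lane width a GPU link actually trained at, something along these lines works on Linux (note that the link may drop to a lower generation at idle, so check while the GPU is busy):

    # current PCIe generation and lane width as reported by the driver
    nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.width.current --format=csv

    # alternative via lspci: LnkCap is what slot and card support, LnkSta is what is in use
    # (01:00.0 is a placeholder address; find your GPU with: lspci | grep -i nvidia)
    sudo lspci -vv -s 01:00.0 | grep -E "LnkCap:|LnkSta:"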

Consumer-grade CPUs such as Intel Core i3-i9 and AMD Ryzen 3-9 have a very limited number of available PCIe lanes. For example, the high-end AMD Ryzen 9 7900X can only manage 28 PCIe lanes (of which 4 are reserved to communicate with the motherboard chipset). That leaves 24 lanes (e.g., x16 + x8) for our GPUs. For almost all consumers - who are only ever interested in having a single GPU - this is plenty. However, we have to ask ourselves if running our second GPU with only x8 instead of x16 lanes is acceptable. For many people the answer will be "yes" and they should stick to this segment. The alternative is looking at HEDT, workstation or server segment CPUs, as we will do below.

HEDT or high-end desktop started with Intel Extreme Edition CPUs, and later Intel Core X CPUs. These days, Intel's HEDT segment has been folded into its Xeon lineup of workstation and server processors, specifically the Xeon W-2400 and W-3400 series. The category sits somewhere between consumer-grade hardware and workstation hardware, offering more multithreading performance and more PCIe lanes. AMD is going back and forth with regard to their HEDT support. Threadripper CPUs are in the HEDT segment, while Threadripper PRO CPUs are in the workstation segment. AMD had not released a non-PRO Threadripper in a while, but in late 2023 they announced a new lineup (e.g., the AMD Ryzen Threadripper 7970X with 32 cores and 92 PCIe 5.0 lanes). In summary, HEDT is a good match for our build, but the segment is being squeezed between high-end consumer hardware and lower-end workstation hardware.

Specifically, the AMD Ryzen Threadripper PRO 5000WX series (based on the older Zen 3 architecture) is very competitively priced these days. It offers workstation CPUs with up to 64 cores, 2TB of DDR4 RAM and 128 PCIe lanes. As we will see in the motherboard section, these builds come with typical enterprise features that are redundant for our target audience, to the point where a HEDT build would be a better match if properly priced. An additional benefit of using somewhat older (i.e., 2022) hardware is that the DDR4 memory and PCIe 4.0 SSDs that come with it are cheaper than the recent DDR5/PCIe 5.0 counterparts.

I also briefly looked at the server segment (Intel Xeon and AMD EPYC) but found no better offerings there. In the end I settled on a Threadripper PRO 5955WX, a 16-core CPU with a TDP of 280W and 128 PCIe lanes, that I could get a decent deal on. The Intel counterparts have fewer cores, lower clock rates and fewer PCIe lanes for the same or more money.

Another fun fact about CPUs: you can buy them boxed (default) or as "tray". Tray refers to the trays with multiple CPUs that are typically bought by OEMs for use in their prebuilt systems. As such, these do not come with any extras (no box, no manual, no stock cooler, ...). OEMs are not supposed to resell these to consumers, but it sometimes happens regardless. Manufacturers like Intel and AMD will typically not provide factory warranty for such products, so you will have to talk to the intermediary (i.e., the OEM) instead in case of problems. The main advantage of tray CPUs is their lower price. If the discount is significant enough, it is worth considering. However, I learned the hard way that Threadripper CPUs are supposed to come with a torque wrench to tighten the CPU mount to exactly the prescribed torque. The tray version of these CPUs obviously does not include this tool.

CPU cooler

For workstation-grade builds, I prefer air cooling over water cooling. It requires barely any maintenance and it can run without failure for years on end. We do however need a cooler with a sizable heatsink to be able to dissipate the TDP of our CPU. When choosing one, make sure it does not occlude any RAM or PCIe slots that you intend to use.

Specifically for Threadripper PRO builds, pay attention to the orientation of the cooler. In desktops and workstations, coolers are supposed to blow air from front to back in the case. However, our CPU is from the server segment - contrary to the non-PRO Threadrippers in the HEDT segment - where a horizontal socket orientation is more common. In that case a regular cooler will blow air from bottom to top. The Noctua NH-U14S and NH-U12S both suffer from this. Eventually, I discovered the Arctic Freezer 4U-M, which has the correct orientation and also matches all other requirements (i.e., socket and TDP). The "4U" terminology refers to server height in a rack.

Motherboard

Our choice of CPU (or more precisely its sWRX80 socket) limits our choice of motherboards quite significantly. Again, we select with our multi-GPU setup in mind. Specifically, we are looking for plenty of PCIe Gen 4 x16 slots that can run at full speed and are spaced far enough apart. Additionally, we would like a few M.2 slots with heatsinks that have a direct connection to the CPU and that are placed far enough away from hot GPUs. Finally, make sure to check the connectivity options (USB, USB-C, WiFi, Bluetooth). Fortunately, most motherboards on the short list fit the bill. I eventually went with the ASUS Pro WS WRX80E-SAGE SE WIFI because I saw it being used in Lambda Labs builds and I could get a good deal on one. It was only afterwards that I realized the awkward dimensions of this board (see the Case section). It is worth looking for smaller alternatives, but make sure to study the block diagram showing all interconnections before making a decision. The ASRock WRX80 Creator comes to mind, although it seems hard to come by and does not support x16 lanes in all PCIe slots.

Some random notes on the ASUS Pro WS WRX80E-SAGE SE WIFI board:

  • requires lots of power cables
    • 1 x 24-pin ATX connector
    • 2 x 8-pin CPU/EPS connector
    • 2 x 6-pin PCIe connector
    • 1 x 6+2-pin PCIe connector
  • main feature: 7 PCIe 4.0 x16 slots
  • built-in power and reset button
    • works without connecting front panel headers of case
  • built-in VGA output
    • useful when you don't have discrete GPU yet
      • note: AMD Threadripper PRO does not have an iGPU
    • makes Linux crash on boot
      • first attempt: add acpi=off to grub bootloader options list
        • makes it possible to boot into live environment
        • also disables all but two USB ports
        • also crashes KVM via BMC
        • also disables all NVMe SSD drives
      • proper fix 1: add pci=nommconf to grub bootloader options list (consolidated sketch after this list)
        • in the GRUB boot menu: press E, add the option, press F10 to boot
          • linux /boot/vmlinuz-x.y.z-... ro quiet splash pci=nommconf
        • make permanent when booted
          • vi /etc/default/grub
            • add pci=nommconf to GRUB_CMDLINE_LINUX variable
          • sudo update-grub
          • reboot
      • proper fix 2: disable VGA header via physical switch
    • still works with BMC disabled
  • Q-code display output does not seem to match with table in manual
  • BMC / IPMI
    • typical server-level feature
    • access
      • can only be accessed over ethernet (not WiFi) via one of the two ports
      • make sure to use HTTPS
      • check IP address in BIOS
      • user: admin
      • password: admin
    • makes system take minutes to boot after complete power down (e.g., after unplugging)
      • much faster after regular shutdown and start
        • but still slow compared to regular desktop
      • fixed when BMC is disabled
    • LEDs
      • stay on when system is off
      • green (blinking): BMC is up and running
      • orange: on iff new warning in system event log
        • possibly about fans with RPM below threshold
    • control fan curves via web portal
      • or via BIOS (after firmware update)
      • (non-PWM?) fans will run at max speed when BMC is disabled
    • built-in KVM
  • contains two small fans
  • pin headers on the bottom edge face outward (south) instead of up
    • pro: allows large GPU to hang off bottom edge of motherboard
    • con: many cases have limited space near bottom to connect everything
  • WiFi
    • 6, not 6E
    • shark-shaped WiFi antenna is very impractical
      • alternative: aftermarket antennas that attach directly to connectors
  • no thunderbolt header (as is typical in AMD builds)
  • sound when running Ubuntu
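
For reference, here is the pci=nommconf fix from the notes above as one consolidated sketch (assuming an Ubuntu-style GRUB setup; append the option to whatever is already in the variable):

    # /etc/default/grub (edit with: sudoedit /etc/default/grub)
    GRUB_CMDLINE_LINUX="pci=nommconf"

    # regenerate the GRUB config and reboot
    sudo update-grub
    sudo reboot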

Other

RAM
  • check the motherboard QVL (qualified vendor list) for compatibility
    • mine insisted specifically on DDR4-3200 RAM
    • a quick way to verify what actually got installed is sketched after this list
  • amount
    • more is better (up to a point)
    • at least 20% more than total VRAM
  • type: DDR4 (cheaper) or DDR5
  • form factor
    • DIMM (desktop)
    • make sure they fit under the CPU cooler
  • speed, timings, latency: not important
  • overclocking profiles (Intel XMP/AMD EXPO): not important
  • heatsink: not important
  • mostly works best with two modules in dual channel mode
  • ECC
    • nice to have
    • more expensive
    • more difficult to find
  • warranty: lifetime
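
As referenced in the list above, a quick way to verify what the system actually ended up with (assuming a Linux install with dmidecode available; output formatting differs a bit between BIOS vendors):

    # reported module type, speed and whether the platform runs with error correction
    sudo dmidecode --type memory | grep -E "Error Correction Type|Type: DDR|Speed"
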
Storage
  • main SSD
    • type: NVMe M.2 SSD
    • size: 1TB+
      • (downloaded) models and datasets take up a lot of room
    • PCIe
      • typically uses x4 lanes
      • ideally directly connected to CPU instead of via motherboard chipset
      • both Gen 4 and Gen 5 options are available
    • no need for a heatsink if your motherboard already has one
    • warranty (5+ years)
  • optional 5400RPM HDD(s) for cheap extra storage
PSU
  • must haves
    • power rating
      • rule of thumb: add up the TDPs of all components and keep generous (20-30%) headroom for transient power spikes (worked example after this list)
    • right type and amount of connectors
      • ATX24 for motherboard
      • EPS for CPU
      • PCIe depending on motherboard, GPU and other components
      • warning: never daisy-chain PCIe power connectors to a power-hungry GPU; run separate cables from the PSU
  • nice to haves
    • 80+ efficiency rating (gold < platinum < titanium)
      • note: small percentage differences become relevant when drawing a lot of power
    • 12VHPWR connector
      • supports up to 600W
      • plug these in properly, or you risk melting the plug
    • modular design
    • silent
    • warranty (10+ years)
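
As a worked example of the rule of thumb above, using the (estimated, not measured) numbers from this build:

    1x RTX 3090 (350W) + 5955WX (280W) + ~150W for board, RAM, drives and fans ≈ 780W -> with ~25% headroom, a ~1000W PSU
    3x RTX 3090 (1050W) + 5955WX (280W) + ~150W rest ≈ 1480W -> even a 1600W PSU only has headroom if the cards are power limited
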
Case
  • volume
    • for aircooled multi-GPU setup, disregard any case with a volume below 60 liters
  • constraints
    • supports motherboard form factor
    • CPU cooler height
    • GPU length
    • PSU length
  • nice to haves
    • dust filters
    • cable management options
    • easy to open
    • built-in GPU brace(s)

I realized fairly late in the process that my motherboard has an unusual form factor: EEB (12.2" x 13") instead of the far more common ATX (12" x 9.6"). This severely limited the number of compatible cases I could choose from. Even cases that officially claim to support the EEB form factor come with some caveats. For example, because of the downward-facing connectors on the bottom edge of the motherboard, I had to make sure there was spare room in that area to connect all the cables. Note that this extra space is useful anyway if you plan on installing a large GPU in the bottom PCIe slot. Furthermore, the standard cable management holes in the backplate of many cases get covered by the much wider motherboard, which results in some unconventional cable management practices. If I were to do this build over again, I would put a much stronger emphasis on selecting a standard ATX motherboard.

Some feasible options:

  • Corsair 7000D Airflow (very tight, not recommended)
  • Fractal Design Define 7 XL
  • Fractal Design Meshify 2 XL
  • Lian Li O11 Dynamic XL
  • Phanteks Enthoo Pro 2

Cooling fans

Ventilation is very important in a high-powered system, especially if the goal is to sustain long-duration workloads. Do not cheap out on fans after building a $2000+ computer.

  • case should have positive pressure (slightly more intake than exhaust airflow, so dust only enters through the filters)
  • size
    • use 120mm or 140mm fans
    • not 200mm, they fail first due to high torque
  • RPM trade-off
    • high RPM = more airflow
    • low RPM = less noise
  • purpose
    • static pressure: for radiators, meshes, filters
    • airflow: elsewhere
    • hybrid
  • connector
    • 3 pin (voltage regulated)
    • 4 pin (PWM regulated, better)
  • bearings
    • fluid
      • cheap
      • mineral oil for lubrication
      • dust sensitive
      • go bad after a while
      • sensitive to orientation (avoid horizontal)
    • ball
      • expensive
      • long lasting
      • more noisy
      • ideal for servers
      • any orientation
    • sleeve
      • hybrid between ball and fluid bearing
      • closed system
      • prefers horizontal orientation
    • rifle
      • like sleeve
      • with Archimedes screw
        • prefers horizontal orientation
      • used in be quiet! fans
    • magnetic / maglev
      • lowest noise
      • expensive
      • any orientation
  • fan orientation
    • vertical: fails first
    • horizontal
  • recommendations

Further reading