Running Portal natively on Apple Silicon

Portal, a puzzle-platform game developed by Valve, was first released in 2007 as part of The Orange Box. It quickly became a cult classic, known for its unique mechanics and dark humor. Set in a mysterious research facility, the game revolves around the use of a "portal gun" to create linked portals that allow the player to navigate through the environment and solve puzzles. It’s widely regarded as one of the best games of the 2000s and remains a staple of gaming culture.

Portal was later released on multiple platforms, including macOS. However, Valve's macOS version of Portal remained stuck in the 32-bit era, making it incompatible with the latest macOS releases. With macOS Catalina (10.15), released in 2019, Apple dropped support for 32-bit applications entirely, forcing many older games and applications to either be updated or abandoned.

For those looking to play Portal on macOS today, there is a workaround: using the leaked Source Engine code to build a 64-bit engine yourself. The source code was made public years ago, and while it has some issues out of the box, with a bit of tinkering it can be compiled to run on modern macOS versions.

In this post, I'll walk you through the steps to build Portal on macOS using the leaked Source Engine, specifically for users who are dealing with Apple’s 64-bit-only policy. The process involves downloading the leaked source code, building the engine, downloading the necessary game assets, and combining everything to make Portal run on your system.

Step 1: Download the Leaked Source Code

The leaked source code for the Source Engine is available on GitHub. However, you'll want to use a specific fork for it to build successfully on macOS.

Step 2: Build the Source Code

Once you've downloaded the source, follow these instructions to build it.

Prerequisites

Install the required dependencies:

xcode-select --install
brew install sdl2 freetype2 fontconfig pkg-config opus libpng libedit jpeg jpeg-turbo python3

Next, set up your workspace:

mkdir -p ~/workspace && cd ~/workspace
git clone --recursive https://github.com/er2off/source-engine.git
cd source-engine
git checkout clang19

Build the Engine

Now you can configure and build the source:

python3 waf configure -T release --prefix='' --build-games=portal
python3 waf build
python3 waf install --destdir="$HOME/Documents/Gaming/Portal"

Step 3: Download Game Assets from Steam

Portal for macOS is still available on Steam today, but only as a 32-bit version. This means you can download it, but you cannot run it. Our goal is to combine the assets from this download with our own 64-bit engine build. Unfortunately, recent updates to the game have made it incompatible with the leaked source engine. The last version of Portal from 2024 that still works with the leaked engine can be found on SteamDB.

Luckily, the current beta branch "SteamPipe Beta" points to this older version, so it is easy to download from Steam: opt into that beta under the game's properties (Betas tab) and let Steam fetch the older build.
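If you prefer the command line, the same download can in principle be done with Valve's steamcmd tool. This is only a sketch: Portal's app ID is 400, but the exact key of the beta branch (shown here as the placeholder <branch_key>) has to be looked up on SteamDB, and you need to log in with an account that owns the game.

steamcmd +login <your_steam_account> +app_update 400 -beta <branch_key> validate +quit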

Step 4: Combine the Engine and Assets

Now that you’ve built the engine and downloaded the necessary game files, it’s time to combine them. First we back up Steam's Portal folder, then delete the 32-bit binaries, and finally replace them with our own 64-bit versions.

cd ~/Library/Application\ Support/Steam/steamapps/common/Portal
cp -r . ~/Portal_backup
rm -rf ./bin ./portal/bin ./hl2_osx
cp -r ~/Documents/Gaming/Portal/bin ./bin
cp -r ~/Documents/Gaming/Portal/portal/bin ./portal/bin
cp ~/Documents/Gaming/Portal/hl2_launcher ./hl2_osx

Note how hl2_launcher gets renamed to hl2_osx.
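To double check that the swap worked, you can inspect the launcher with the file command. The original Steam-provided binary reports a 32-bit (i386) Mach-O executable, while our freshly built one should report a 64-bit architecture (x86_64 or arm64, depending on what you built for):

file ./hl2_osx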

Step 5: Run the Game

Finally, you're ready to run the game!

./hl2_osx -game Portal

This should launch the game using the custom-built engine. In my experiments it runs flawlessly.

The launch button in Steam should now work as well.

What about Portal 2?

While this approach works for the original Portal (and Half-Life 2), it does not work for Portal 2. Portal 2 requires a more recent version of the Source Engine, so it is not compatible with the leaked code we used. If you're looking to play Portal 2, or want a more straightforward way to play Portal on newer versions of macOS, there are several alternatives based on compatibility layers. I have not tried these myself, but googling for terms such as Wine, Whisky, CrossOver and Proton should get you started.

Happy gaming!

PS: In retrospect, it might be safer to combine the engine and the assets in a folder outside of Steam, so that they do not accidentally get overwritten by incoming updates.
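For example, a minimal sketch of that approach (the destination path is just an illustration): copy the game assets out of the Steam folder once, install the engine build on top of the copy, and launch from there. Incoming Steam updates will then never touch your working copy.

cp -r ~/Library/Application\ Support/Steam/steamapps/common/Portal ~/Documents/Gaming/PortalStandalone
cd ~/Documents/Gaming/PortalStandalone
rm -rf ./bin ./portal/bin ./hl2_osx
cp -r ~/Documents/Gaming/Portal/bin ./bin
cp -r ~/Documents/Gaming/Portal/portal/bin ./portal/bin
cp ~/Documents/Gaming/Portal/hl2_launcher ./hl2_osx
./hl2_osx -game Portal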

Running CUDA 12 workloads on Ubuntu

Introduction

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform that allows developers to harness the power of NVIDIA GPUs for general-purpose computing (GPGPU). CUDA provides a suite of tools and libraries that enable high-performance computing on GPUs, making it a go-to solution for a wide range of computational tasks, including deep learning.

CUDA competes with other GPU computing platforms, such as AMD's ROCm and Intel's oneAPI. Both ROCm and oneAPI are open-source platforms that offer similar capabilities to CUDA. However, CUDA remains dominant, especially in the AI and deep learning space, due to its mature ecosystem and widespread support.

CUDA can be deployed on various operating systems, including Linux and Windows. It is worth noting that CUDA support for macOS was discontinued after version 10.2, as Apple stopped supporting NVIDIA GPUs and later transitioned to its own ARM-based Apple Silicon.

In this blog post, we will dive into CUDA by exploring it across three layers:

  1. System-wide setup: We will cover the installation and configuration of the graphics driver.
  2. CUDA and cuDNN setup: We will discuss two different approaches: system-wide or isolated.
  3. Using CUDA: How to leverage CUDA in your projects, including the installation of GPU-accelerated libraries and frameworks.

System overview

For this guide, we will walk through setting up CUDA on a Linux system. Specifically, we will be using Ubuntu 23.10. While it would have been ideal to use a long-term support (LTS) version like 24.04, the differences in setup will be minimal.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 23.10
Release:        23.10
Codename:       mantic
$ uname -m
x86_64
$ uname -r
6.5.0-44-generic
$ ldd --version
ldd (Ubuntu GLIBC 2.38-1ubuntu6.3) 2.38
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

Linux display driver setup

When setting up CUDA on Linux, one of the first steps is ensuring that your display driver is correctly installed. On Linux, this is always a system-wide process, and you have two main options: the proprietary NVIDIA driver or the open-source Nouveau driver.

  • Proprietary drivers: The official drivers provided by NVIDIA, offering the best performance and full support for CUDA. Common versions include 470, 525, 535, 545 and 550.
  • Nouveau drivers: The Nouveau driver is an open-source alternative to NVIDIA's proprietary driver. While it provides basic functionality and is a good choice for general use, it does not support CUDA.

In this guide, we will be using the proprietary NVIDIA drivers. These drivers also include the nvidia-smi tool, which is vital for managing and monitoring your GPU.

There are two main procedures for installing the drivers: automatic or manual.

Automatic Installation

The easiest way to install the appropriate NVIDIA driver is through the automatic installation process, which detects your GPU and recommends the best driver.

$ ubuntu-drivers devices
== /sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0 ==
modalias : pci:v000010DEd00002204sv000010DEsd0000147Dbc03sc00i00
vendor   : NVIDIA Corporation
model    : GA102 [GeForce RTX 3090]
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-545 - distro non-free
driver   : nvidia-driver-545-open - distro non-free
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-535 - distro non-free recommended
driver   : nvidia-driver-535-open - distro non-free
driver   : nvidia-driver-535-server-open - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
$ ubuntu-drivers list
$ ubuntu-drivers install

Manual Installation

For those who prefer more control over the installation process, or if you want the latest drivers not available in the default Ubuntu repositories, you can manually install the driver.

You can install the display drivers either from the default Ubuntu repositories or from the additional PPA (Personal Package Archive) provided by Ubuntu's graphics drivers team if you want the latest and greatest.

$ sudo add-apt-repository ppa:graphics-drivers/ppa && sudo apt update
$ sudo apt install nvidia-driver-535

These commands add the PPA and install NVIDIA driver version 535; you can replace "535" with your desired version number.

Once installed, you can verify that the driver is correctly set up using the nvidia-smi tool:

$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:41:00.0  On |                  N/A |
|  0%   45C    P8              32W / 350W |    541MiB / 24576MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:42:00.0 Off |                  N/A |
|  0%   40C    P8              19W / 350W |     10MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2635      G   /usr/lib/xorg/Xorg                          192MiB |
|    0   N/A  N/A      2944      G   /usr/bin/gnome-shell                         82MiB |
|    0   N/A  N/A      3607      G   ...irefox/3626/usr/lib/firefox/firefox      134MiB |
|    0   N/A  N/A      4701      G   ...erProcess --variations-seed-version      116MiB |
|    1   N/A  N/A      2635      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

Important Notes
  • Stability vs. latest features: When choosing your driver version, consider the trade-off between stability and access to the latest features. Older drivers like version 470 are more stable and widely tested, while newer versions like 550 offer the latest updates and support for newer hardware.
  • GPU compatibility: Ensure that the driver you select is compatible with your GPU model. The automatic detection method mentioned above typically handles this well.
  • Bundled CUDA runtime: The NVIDIA driver comes with a minimal CUDA runtime (version 12.2 for the 535 driver used here) necessary for running basic CUDA applications. However, it does not include the full CUDA toolkit required for development purposes. The runtime version bundled with the driver will not change even if you separately install a proper CUDA runtime or toolkit.

To see where the CUDA runtime is located on your system, you can run:

$ find /usr -name "libcuda.so*"
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so.535.171.04
/usr/lib/i386-linux-gnu/libcuda.so.1
/usr/lib/i386-linux-gnu/libcuda.so
/usr/lib/i386-linux-gnu/libcuda.so.535.171.04

This command will display the paths to the installed CUDA runtime libraries, which are essential for running CUDA-enabled applications.

Compute capability

Compute capability identifies the feature set of a GPU architecture; it is fixed in hardware and cannot be upgraded through software. Here's a table summarizing the compute capabilities of various NVIDIA GPU architectures:

Architecture    Compute Capability    GPU Models
Volta           7.0                   V100
Turing          7.5                   GeForce RTX 20xx, Quadro RTX 8000 and RTX 6000, Tesla T4
Ampere          8.x                   A100 (8.0), GeForce RTX 30xx (8.6), RTX A6000 (8.6)
Ada Lovelace    8.9                   GeForce RTX 40xx, RTX 6000 Ada
Hopper          9.0                   H100, H200
Blackwell       10.x                  B100, B200, GeForce RTX 5090

To check the compute capability of your GPU, you can use the following command:

$ nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
8.6
8.6

For more detailed information on compute capabilities and their implications, refer to the NVIDIA CUDA C Programming Guide.
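The compute capability also comes into play when compiling CUDA code, as nvcc accepts it as an architecture flag. A small illustration (vector_add.cu is a hypothetical source file), targeting the RTX 3090's compute capability of 8.6:

$ nvcc -arch=sm_86 -o vector_add vector_add.cu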

CUDA and cuDNN

When working with CUDA, it is important to distinguish between the CUDA runtime and the CUDA toolkit, similar to the difference between the Java Runtime Environment (JRE) and the Java Development Kit (JDK). The NVIDIA driver includes a minimal CUDA runtime that enables you to run basic CUDA-enabled applications. However, this runtime is limited and does not include all the components needed for more advanced CUDA tasks.

The CUDA toolkit, on the other hand, is a comprehensive package that provides all the development tools necessary for creating, compiling, and running CUDA applications. It also includes a more complete runtime, which provides additional libraries and features needed for more complex applications. Most deep learning tasks will require this toolkit to function correctly.

When installing the CUDA toolkit, ensure that you only install versions that are less than or equal to the runtime version bundled with your display driver. Installing a higher version without updating the driver first can lead to instability.
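A quick way to check which CUDA version your driver supports before picking a toolkit version is to read it from the nvidia-smi banner:

$ nvidia-smi | grep "CUDA Version"

On the system used in this guide this reports 12.2, so any toolkit version up to and including 12.2 is safe to install.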

In addition to the CUDA toolkit, some deep learning frameworks also require the CUDA Deep Neural Network (cuDNN) package.

System-wide installation

To set up CUDA and cuDNN system-wide, start by ensuring that the NVIDIA proprietary drivers are installed, as discussed earlier. Next, you will need to install a C++ compiler, which is required for compiling CUDA code. This can be done with sudo apt install gcc g++.

Once GCC is installed, you can proceed to install CUDA. You have two main options for this:

  • The easiest approach is to use the official Ubuntu repository by running sudo apt install nvidia-cuda-toolkit
  • Alternatively, for more control or to get the latest version, you can install CUDA directly from NVIDIA’s official source. This involves following the detailed instructions provided in the NVIDIA CUDA Installation Guide for Linux.

After installing CUDA, the next step is to set up cuDNN, which is essential for deep learning applications. You can do this by following the instructions in the NVIDIA cuDNN Installation Guide.
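Putting the system-wide route together, a minimal sketch using the Ubuntu-packaged toolkit (package names differ if you install from NVIDIA's own repositories) could look like this:

$ sudo apt install gcc g++
$ sudo apt install nvidia-cuda-toolkit
$ nvcc --version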

Isolated installation

For situations where you need to manage multiple projects with different CUDA or cuDNN versions, setting up an isolated environment using Conda or Mamba is the recommended option. This approach keeps the system-wide components minimal and allows each environment to have its own specific setup.

Start by ensuring that the NVIDIA proprietary drivers are installed as discussed earlier, since they are the only system-wide component required. Next, install Mamba, which is a faster alternative to Conda, via Miniforge; a minimal install sketch follows below. You can then verify that the installation was successful by running mamba info.
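Assuming the official Miniforge installer script and an interactive install in the default location, the installation could look like this:

$ curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
$ bash Miniforge3-$(uname)-$(uname -m).sh

After restarting your shell so that mamba is on your PATH, mamba info should report something like the output below.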

$ mamba info

          mamba version : 1.5.5
     active environment : None
            shell level : 0
       user config file : /home/user/.condarc
 populated config files : /home/user/miniforge3/.condarc
          conda version : 23.11.0
    conda-build version : not installed
         python version : 3.10.13.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=zen3
                          __conda=23.11.0=0
                          __cuda=12.2=0
                          __glibc=2.38=0
                          __linux=6.5.0=0
                          __unix=0=0
       base environment : /home/user/miniforge3  (writable)
      conda av data dir : /home/user/miniforge3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /home/user/miniforge3/pkgs
                          /home/user/.conda/pkgs
       envs directories : /home/user/miniforge3/envs
                          /home/user/.conda/envs
               platform : linux-64
             user-agent : conda/23.11.0 requests/2.31.0 CPython/3.10.13 Linux/6.5.0-44-generic ubuntu/23.10 glibc/2.38 solver/libmamba conda-libmamba-solver/23.12.0 libmambapy/1.5.5
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False

Relevance of virtual packages in mamba environments

Mamba environments utilize virtual packages to dynamically detect and represent certain system features that are critical for package resolution and compatibility. These virtual packages include system-specific details such as architecture, the operating system, the version of the GNU C Library (glibc), and, importantly, the version of CUDA supported by your installed NVIDIA drivers.

Virtual packages are not installed in the traditional sense but are automatically detected by Mamba. They help the package manager resolve dependencies by ensuring that the packages you install are compatible with your system's underlying hardware and software.

Among these virtual packages, __cuda is particularly important when working with CUDA. It represents the maximum version of CUDA that your NVIDIA driver officially supports. This information is automatically detected and provided by Mamba, assisting in the selection of the appropriate CUDA toolkit and related packages for your environment.

The output above indicates that our system's NVIDIA drivers support CUDA up to version 12.2. While Mamba does not strictly enforce this version when installing the CUDA toolkit, it serves as a guideline. You can technically install lower or higher versions of the toolkit, but installing a version higher than what __cuda indicates is generally not recommended, as it could lead to instability or compatibility issues. If your project requires a toolkit version higher than what your driver supports, consider upgrading the graphics driver first.
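For completeness: you can also override the detected value through the CONDA_OVERRIDE_CUDA environment variable, which is mainly useful on machines without a GPU (such as CI runners) where there is no driver to detect. A hedged example:

$ CONDA_OVERRIDE_CUDA=12.2 mamba create -n build-env cuda-toolkit=12.2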

Creating a new mamba environment

Once Mamba is set up, we can proceed to install the CUDA toolkit within an isolated environment. We can do this by creating a new environment and specifying the CUDA version we need, along with any other packages such as Python or cuDNN. For example:

$ mamba create -n my-environment python=3.12 cuda-toolkit=12.2 cudnn
$ mamba activate my-environment

These commands create and activate a new environment named my-environment with Python 3.12, the CUDA 12.2 toolkit, and a recent, compatible version of cuDNN.

After activating your environment, you can verify that CUDA is correctly installed by checking the version of the NVIDIA CUDA compiler:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

Package overview

When working with CUDA 12, it is important to be aware of the significant changes in package names and structures compared to earlier versions. In CUDA 11 and earlier versions, the installation was typically done using the cudatoolkit or cudatoolkit-dev package. However, starting with CUDA 12, there has been a restructuring of package names. This shift makes installation trickier, especially as much of the older documentation still references these outdated package names.

The following meta-packages can be used to set up a CUDA 12 environment:

  • cuda-toolkit: This package (notice the hyphen) is now the primary way of installing CUDA development tools. Running mamba install cuda-toolkit=12.2, for instance, will typically provide all the necessary components for CUDA 12.2 development, including the compiler, libraries, and headers. The downside is that it is fairly big, and it will likely include much more than what is strictly necessary for your use case.
  • cuda-runtime: If you only need the runtime components, install this package. Note that this refers to the complete runtime, not the minimal one bundled with the driver.
  • cuda: This is a meta-package that pulls in both the toolkit and runtime. Seeing as the runtime contains a subset of the packages in the toolkit, this is in effect functionally equivalent to the cuda-toolkit package.
  • Other meta packages such as cuda-libraries, cuda-libraries-dev, cuda-compiler and cuda-tools can be used to specify more precisely what your project needs. Below is a simplified hierarchy of the most relevant packages. Check the appendix for a full list.
cuda
├── cuda-runtime
│   └── cuda-libraries
└── cuda-toolkit
    ├── cuda-compiler
    ├── cuda-libraries
    ├── cuda-libraries-dev
    └── cuda-tools
        └── cuda-command-line-tools

Finally, for the fullest possible control, refer to the actual CUDA packages instead of these meta-packages (which are simply groups of packages).

Note that neither cuda, cuda-runtime, nor cuda-toolkit include cuDNN. It is a separate package specifically tailored for deep learning applications, which needs to be installed independently as shown earlier.
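As an illustration of that finer-grained control, the following sketch creates a leaner development environment that only pulls in the compiler, the development libraries, and the command-line tools instead of the full cuda-toolkit (assuming version 12.2 is available on your configured channels):

$ mamba create -n cuda-dev python=3.12 cuda-compiler=12.2 cuda-libraries-dev=12.2 cuda-command-line-tools=12.2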

Channel selection

When setting up CUDA and related packages in a Mamba environment, the various channels that offer similar packages might cause confusion. The two primary channels to consider are conda-forge (default in Miniforge) and nvidia, both of which offer nearly identical CUDA-related packages, including the CUDA toolkit, cuDNN, and other NVIDIA libraries. We will ignore the anaconda channel (default in Anaconda) because it typically hosts somewhat outdated package versions. Many mamba commands accept a -c <channel> option to include an extra channel on top of the default channels configured in the .condarc file.

The conda-forge channel is a widely used, community-driven repository known for its extensive package coverage beyond CUDA. This makes conda-forge particularly suitable for projects that require a mix of CUDA and other libraries. Additionally, conda-forge is continuously updated and maintained, ensuring that you have access to recent versions of packages.

In contrast, the official nvidia channel, maintained directly by NVIDIA, is dedicated specifically to CUDA and other NVIDIA tools. To install the complete CUDA suite from this channel, you can use mamba install -c nvidia cuda. Although the CUDA-related (meta-)packages in the nvidia channel are almost identical to those found in conda-forge, the nvidia channel provides slightly earlier access to the latest versions and includes some less common versions that may not be available on conda-forge.

In most cases, if your environment requires a wide range of software, conda-forge is likely the better option due to its extensive package offerings. It also helps to avoid some minor hiccups that can result from multi-channel package resolution.
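If you want to make this preference explicit, you can pin conda-forge in your user-level .condarc. Miniforge already configures conda-forge as its default channel (see the "populated config files" line in the mamba info output above); a minimal ~/.condarc along these lines is a reasonable sketch:

$ cat ~/.condarc
channels:
  - conda-forge
channel_priority: strict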

Note that certain packages, such as pytorch, also provide their own dedicated channel to install from.

Setting environment variables

For certain applications, you might need to manually set additional environment variables:

  • CUDA_HOME and CUDA_PATH
    • These are interchangeable and typically point to the root of the CUDA toolkit folder, which contains the lib and bin directories.
    • In the case of a conda environment, they should point to the environment's root.
  • LD_LIBRARY_PATH
    • Add $CUDA_HOME/lib to this path to ensure your system can locate the necessary libraries.

$ echo $CONDA_PREFIX
/home/user/miniforge3/envs/my-environment

$ which nvcc
/home/user/miniforge3/envs/my-environment/bin/nvcc

$ export CUDA_HOME=$CONDA_PREFIX
$ export CUDA_PATH=$CUDA_HOME
$ export LD_LIBRARY_PATH=$CUDA_HOME/lib:$LD_LIBRARY_PATH

Install frameworks and libraries

Once CUDA and cuDNN are set up, the next step is installing the Python packages that leverage GPU acceleration. Depending on your development environment, you can install these using either pip in a virtual environment or mamba.

Here are some popular libraries and frameworks:

  • PyCUDA: Python wrapper for CUDA.
  • CuPy: NumPy-compatible library that runs on CUDA.
  • cuNumeric: A drop-in replacement for NumPy, optimized for CUDA.
  • RAPIDS: A suite of libraries for data science and analytics on GPUs, including cuDF (a faster pandas) and cuML (a faster scikit-learn).
  • Deep learning frameworks: TensorFlow, PyTorch, ONNX

For example, to install PyTorch with CUDA support using mamba:

$ mamba create -n torch-env -c pytorch -c nvidia python=3.12 pytorch-cuda=12.1 torchvision torchaudio


Looking for: ['python=3', 'pytorch-cuda=12.1', 'torchvision', 'torchaudio']

...

  Package                          Version  Build                         Channel           Size
──────────────────────────────────────────────────────────────────────────────────────────────────
  Install:
──────────────────────────────────────────────────────────────────────────────────────────────────

  + libcublas                    12.1.0.26  0                             nvidia           345MB
  + libcufft                      11.0.2.4  0                             nvidia           108MB
  + libcusolver                  11.4.4.55  0                             nvidia           103MB
  + libcusparse                  12.0.2.55  0                             nvidia           171MB
  + libnpp                       12.0.2.50  0                             nvidia           147MB
  + cuda-cudart                   12.1.105  0                             nvidia           193kB
  + cuda-nvrtc                    12.1.105  0                             nvidia            21MB
  + libnvjitlink                  12.1.105  0                             nvidia            18MB
  + libnvjpeg                    12.1.1.14  0                             nvidia             3MB
  + cuda-cupti                    12.1.105  0                             nvidia            16MB
  + cuda-nvtx                     12.1.105  0                             nvidia            58kB
  ...
  + libcurand                    10.3.7.37  0                             nvidia            54MB
  + libcufile                    1.11.0.15  0                             nvidia             1MB
  + cuda-opencl                    12.6.37  0                             nvidia            27kB
  + cuda-libraries                  12.1.0  0                             nvidia             2kB
  + cuda-runtime                    12.1.0  0                             nvidia             1kB
  ...
  + pytorch                          2.4.0  py3.12_cuda12.1_cudnn9.1.0_0  pytorch            1GB
  + torchtriton                      3.0.0  py312                         pytorch          245MB
  + torchaudio                       2.4.0  py312_cu121                   pytorch            7MB
  + torchvision                     0.19.0  py312_cu121                   pytorch            9MB

  Summary:

  Install: 180 packages

  Total download: 3GB

───────────────────────────────────────────────────────────────────────────────────────────────────


Confirm changes: [Y/n]

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

To activate this environment, use

     $ mamba activate torch-env

To deactivate an active environment, use

     $ mamba deactivate

Here, the meta-package pytorch-cuda allows us to specify the required CUDA version. Since version 12.2 is not available, we settle for version 12.1. If we had installed the regular pytorch package, we would have downloaded the CPU version without CUDA acceleration. We can double check the build string of the pytorch package in the command output: py3.12_cuda12.1_cudnn9.1.0_0.

Notice how we need two extra channels: pytorch and nvidia. The first attempt without the nvidia channel failed because pytorch-cuda=12.1 has a dependency on a very specific version of cuBLAS that is unavailable in conda-forge.

Furthermore, notice how we did not specify cudnn this time. PyTorch with CUDA support includes a statically linked version of this library, so we don't need to include it separately.

When the environment is created and activated, we can test whether CUDA support is enabled.

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
2
>>> torch.cuda.current_device()
0
>>> torch.backends.cudnn.version()
90100

Topics for another time

  • more elaborate pytorch example
    • or tensorflow vs tensorflow-gpu
  • NVIDIA TensorRT

Appendix

Glossary

  • cudaRT - CUDA runtime
  • cuBLAS - CUDA BLAS
  • cuFFT - CUDA Fast Fourier Transform
  • cuDPP - CUDA Data Parallel Primitives
  • cuDNN - CUDA Deep Neural Network
  • cuRAND - CUDA Random Number Generation library
  • cuSOLVER - CUDA based collection of dense and sparse direct solvers
  • cuSPARSE - CUDA Sparse Matrix library
  • NPP - NVIDIA Performance Primitives library
  • nvGRAPH - NVIDIA Graph Analytics library
  • NVML - NVIDIA Management Library
  • NVRTC - NVIDIA Runtime Compilation library for CUDA C++
  • NVCC - Nvidia CUDA Compiler
    • based on LLVM
    • source file extension: *.cu
  • NCCL - NVIDIA Collective Communications Library
    • multi-GPU setup
  • Thrust: open source C++ library of parallel algorithms and data structures

Mamba: CUDA 12 package overview

  • useful commands
    • mamba search -c <channel> --override-channels [--info] <package-spec>
    • mamba repoquery whoneeds --tree --recursive -c <channel> <package>
    • mamba repoquery depends --tree --recursive -c <channel> <package>

Hierarchical meta package list
  • cuda
    • cuda-runtime
      • cuda-libraries
        • (see below)
    • cuda-toolkit
      • cuda-compiler
        • c-compiler
        • cuda-cuobjdump
        • cuda-cuxxfilt
        • cuda-nvcc
        • cuda-nvprune
        • cxx-compiler
      • cuda-libraries
        • cuda-cudart
        • cuda-nvrtc
        • cuda-opencl
        • libcublas
        • libcufft
        • libcufile
        • libcurand
        • libcusolver
        • libcusparse
        • libnpp
        • libnvfatbin
        • libnvjitlink
        • libnvjpeg
      • cuda-libraries-dev
        • cuda-cccl
        • cuda-cudart-dev
        • cuda-driver-dev
        • cuda-nvrtc-dev
        • cuda-opencl-dev
        • cuda-profiler-api
        • libcublas-dev
        • libcufft-dev
        • libcufile-dev
        • libcurand-dev
        • libcusolver-dev
        • libcusparse-dev
        • libnpp-dev
        • libnvfatbin-dev
        • libnvjitlink-dev
        • libnvjpeg-dev
      • cuda-nvml-dev
      • cuda-tools
        • cuda-command-line-tools
          • cuda-cupti-dev
          • cuda-gdb
          • cuda-nvdisasm
          • cuda-nvprof
          • cuda-nvtx
          • cuda-sanitizer-api
        • cuda-visual-tools
        • gds-tools
  • cuda-minimal-build
    • cuda-cccl
    • cuda-compiler
      • ...
    • cuda-cudart-dev
    • cuda-profiler-api
  • not part of any meta package
    • cuda-compat
    • cuda-crt
    • cuda-nsight
    • cuda-nvvm
    • cuda-nvvp
    • cuda-python
    • cudnn
    • cuquantum
    • cutensor
    • nccl

Flat package list
  • cuda-cccl
  • cuda-compat
  • cuda-crt
  • cuda-crt-dev_linux-64
  • cuda-crt-tools
  • cuda-cudart
  • cuda-cudart-dev
  • cuda-cuobjdump
  • cuda-cupti
  • cuda-cupti-dev
  • cuda-cupti-doc
  • cuda-cuxxfilt
  • cuda-driver-dev
  • cuda-gdb
  • cuda-gdb-src
  • cuda-nsight
  • cuda-nvcc
  • cuda-nvcc-dev_linux-64
  • cuda-nvcc-impl
  • cuda-nvcc-tools
  • cuda-nvdisasm
  • cuda-nvml-dev
  • cuda-nvprof
  • cuda-nvprune
  • cuda-nvrtc
  • cuda-nvrtc-dev
  • cuda-nvtx
  • cuda-nvtx-dev
  • cuda-nvvm
  • cuda-nvvm-dev_linux-64
  • cuda-nvvm-impl
  • cuda-nvvm-tools
  • cuda-nvvp
  • cuda-opencl
  • cuda-opencl-dev
  • cuda-profiler-api
  • cuda-python
  • cuda-sanitizer-api
  • cuda-visual-tools
  • cudnn
  • cupti
  • cuquantum
  • cutensor
  • libcublas
  • libcublas-dev
  • libcufft
  • libcufft-dev
  • libcuquantum
  • libcurand
  • libcurand-dev
  • libcusolver
  • libcusolver-dev
  • libcusparse
  • libcusparse-dev
  • libcutensor
  • nccl

Repoqueries
  • output has been slightly edited for clarity
$ mamba repoquery depends cuda=12.6 -c conda-forge --tree --recursive

cuda[12.6.0]
  ├─ cuda-runtime[12.6.0]
    └─ cuda-libraries[12.6.0]
       ├─ cuda-cudart[12.6.37]
         ├─ cuda-cudart_linux-64[12.6.37]
           └─ cuda-version[12.6]
         ├─ libgcc-ng[14.1.0]
           ├─ _libgcc_mutex[0.1]
           └─ _openmp_mutex[4.5]
              ├─ _libgcc_mutex already visited
              └─ llvm-openmp[18.1.8]
                 ├─ libzlib[1.3.1]
                 └─ zstd[1.5.6]
                    ├─ libzlib already visited
                    └─ libstdcxx-ng[14.1.0]
         └─ libstdcxx-ng already visited
       ├─ cuda-nvrtc[12.6.20]
         ├─ libgcc-ng already visited
         └─ libstdcxx-ng already visited
       ├─ cuda-opencl[12.6.37]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ ocl-icd[2.3.2]
            └─ libgcc-ng already visited
       ├─ libcublas[12.6.0.22]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ cuda-nvrtc[12.0.76]
            ├─ libgcc-ng already visited
            ├─ libstdcxx-ng already visited
            └─ cuda-version[12.0.0]
       ├─ libcufft[11.2.6.28]
         ├─ libgcc-ng already visited
         └─ libstdcxx-ng already visited
       ├─ libcufile[1.11.0.15]
         ├─ libgcc-ng already visited
         └─ libstdcxx-ng already visited
       ├─ libcurand[10.3.7.37]
         ├─ libgcc-ng already visited
         └─ libstdcxx-ng already visited
       ├─ libcusolver[11.6.4.38]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         ├─ libcublas already visited
         ├─ libcusparse[12.5.2.23]
           ├─ libgcc-ng already visited
           ├─ libstdcxx-ng already visited
           └─ libnvjitlink[12.6.20]
              ├─ libgcc-ng already visited
              └─ libstdcxx-ng already visited
         └─ libnvjitlink already visited
       ├─ libcusparse already visited
       ├─ libnvjitlink already visited
       ├─ libnpp[12.3.1.23]
         ├─ libgcc-ng already visited
         └─ libstdcxx-ng already visited
       ├─ libnvfatbin[12.6.20]
         ├─ libgcc-ng already visited
         └─ libstdcxx-ng already visited
       └─ libnvjpeg[12.3.3.23]
          ├─ libgcc-ng already visited
          └─ libstdcxx-ng already visited
  └─ cuda-toolkit[12.6.0]
     ├─ cuda-libraries already visited
     ├─ cuda-compiler[12.6.0]
       ├─ c-compiler[1.0.0]
         ├─ libgcc-ng already visited
         └─ gcc_linux-64[10.3.0]
            ├─ binutils_linux-64[2.36]
              ├─ binutils_impl_linux-64[2.36.1]
                ├─ ld_impl_linux-64[2.36.1]
                └─ sysroot_linux-64[2.12]
                   └─ kernel-headers_linux-64[2.6.32]
              └─ sysroot_linux-64 already visited
            ├─ sysroot_linux-64 already visited
            └─ gcc_impl_linux-64[10.3.0]
               ├─ libgcc-ng already visited
               ├─ libstdcxx-ng already visited
               ├─ binutils_impl_linux-64 already visited
               ├─ sysroot_linux-64 already visited
               ├─ libgcc-devel_linux-64[10.3.0]
               ├─ libgomp[14.1.0]
                 └─ _libgcc_mutex already visited
               └─ libsanitizer[10.3.0]
                  └─ libgcc-ng already visited
       ├─ cuda-cuobjdump[12.6.20]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ cuda-nvdisasm[12.0.76]
            ├─ libgcc-ng already visited
            └─ libstdcxx-ng already visited
       ├─ cuda-cuxxfilt[12.6.20]
         ├─ libgcc-ng already visited
         └─ libstdcxx-ng already visited
       ├─ cuda-nvcc[12.6.20]
         ├─ gcc_linux-64 already visited
         ├─ cuda-nvcc_linux-64[12.6.20]
           ├─ cuda-cudart-dev_linux-64[12.6.37]
             ├─ cuda-cccl_linux-64[12.0.90]
             ├─ cuda-cudart-static_linux-64[12.0.107]
             └─ cuda-cudart_linux-64[12.0.107]
                ├─ libgcc-ng already visited
                └─ libstdcxx-ng already visited
           ├─ cuda-driver-dev_linux-64[12.6.37]
           ├─ cuda-nvcc-dev_linux-64[12.6.20]
             ├─ libgcc-ng already visited
             ├─ cuda-crt-dev_linux-64[12.6.20]
             └─ cuda-nvvm-dev_linux-64[12.6.20]
           ├─ cuda-nvcc-impl[12.6.20]
             ├─ cuda-cudart already visited
             ├─ cuda-nvcc-dev_linux-64 already visited
             ├─ cuda-cudart-dev[12.0.107]
               ├─ libgcc-ng already visited
               ├─ libstdcxx-ng already visited
               ├─ cuda-cudart[12.0.107]
                 ├─ libgcc-ng already visited
                 └─ libstdcxx-ng already visited
               ├─ cuda-cudart-dev_linux-64[12.0.107]
                 ├─ cuda-cccl_linux-64 already visited
                 └─ cuda-cudart-static_linux-64 already visited
               └─ cuda-cudart-static[12.0.107]
                  ├─ libgcc-ng already visited
                  ├─ libstdcxx-ng already visited
                  └─ cuda-cudart-static_linux-64 already visited
             ├─ cuda-nvcc-tools[12.6.20]
               ├─ libgcc-ng already visited
               ├─ libstdcxx-ng already visited
               ├─ cuda-crt-tools[12.6.20]
               └─ cuda-nvvm-tools[12.6.20]
                  ├─ libgcc-ng already visited
                  └─ libstdcxx-ng already visited
             └─ cuda-nvvm-impl[12.6.20]
                ├─ libgcc-ng already visited
                └─ libstdcxx-ng already visited
           ├─ cuda-nvcc-tools already visited
           └─ sysroot_linux-64[2.28]
              ├─ _sysroot_linux-64_curr_repodata_hack[3]
              └─ kernel-headers_linux-64[4.18.0]
                 └─ _sysroot_linux-64_curr_repodata_hack already visited
         └─ gxx_linux-64[10.3.0]
            ├─ gcc_linux-64 already visited
            ├─ binutils_linux-64 already visited
            ├─ sysroot_linux-64 already visited
            └─ gxx_impl_linux-64[10.3.0]
               ├─ sysroot_linux-64 already visited
               ├─ gcc_impl_linux-64 already visited
               └─ libstdcxx-devel_linux-64[10.3.0]
       ├─ cuda-nvprune[12.6.20]
         ├─ libgcc-ng already visited
         └─ libstdcxx-ng already visited
       └─ cxx-compiler[1.0.0]
          ├─ libgcc-ng already visited
          ├─ libstdcxx-ng already visited
          └─ gxx_linux-64 already visited
     ├─ cuda-libraries-dev[12.6.0]
       ├─ cuda-cccl[12.6.37]
         ├─ cccl[2.5.0]
         └─ cuda-cccl_linux-64[12.6.37]
       ├─ cuda-cudart-dev[12.6.37]
         ├─ cuda-cudart already visited
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         ├─ cuda-cudart-dev_linux-64 already visited
         └─ cuda-cudart-static[12.6.37]
            ├─ libgcc-ng already visited
            ├─ libstdcxx-ng already visited
            └─ cuda-cudart-static_linux-64[12.6.37]
       ├─ cuda-driver-dev[12.6.37]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ cuda-driver-dev_linux-64[12.0.107]
       ├─ cuda-nvrtc-dev[12.6.20]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ cuda-nvrtc already visited
       ├─ cuda-opencl-dev[12.6.37]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ cuda-opencl already visited
       ├─ cuda-profiler-api[12.6.37]
         └─ cuda-cudart-dev already visited
       ├─ libcublas-dev[12.6.0.22]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ libcublas already visited
       ├─ libcufft-dev[11.2.6.28]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ libcufft already visited
       ├─ libcufile-dev[1.11.0.15]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ libcufile already visited
       ├─ libcurand-dev[10.3.7.37]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ libcurand already visited
       ├─ libcusolver-dev[11.6.4.38]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ libcusolver already visited
       ├─ libcusparse-dev[12.5.2.23]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         ├─ libcusparse already visited
         └─ libnvjitlink already visited
       ├─ libnpp-dev[12.3.1.23]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ libnpp already visited
       ├─ libnvfatbin-dev[12.6.20]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ libnvfatbin already visited
       ├─ libnvjitlink-dev[12.6.20]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ libnvjitlink already visited
       └─ libnvjpeg-dev[12.3.3.23]
          ├─ libnvjpeg already visited
          └─ cuda-cudart-dev already visited
     ├─ cuda-nvml-dev[12.6.37]
       ├─ libgcc-ng already visited
       └─ libstdcxx-ng already visited
     └─ cuda-tools[12.6.0]
        ├─ cuda-command-line-tools[12.6.0]
          ├─ cuda-cupti-dev[12.6.37]
            ├─ libgcc-ng already visited
            ├─ libstdcxx-ng already visited
            └─ cuda-cupti[12.6.37]
               ├─ libgcc-ng already visited
               └─ libstdcxx-ng already visited
          ├─ cuda-gdb[12.6.37]
            ├─ libgcc-ng already visited
            ├─ libstdcxx-ng already visited
            └─ gmp[6.3.0]
               ├─ libgcc-ng already visited
               └─ libstdcxx-ng already visited
          ├─ cuda-nvdisasm[12.6.20]
            ├─ libgcc-ng already visited
            └─ libstdcxx-ng already visited
          ├─ cuda-nvprof[12.6.37]
            ├─ libgcc-ng already visited
            ├─ libstdcxx-ng already visited
            └─ cuda-cupti[12.0.90]
               ├─ libgcc-ng already visited
               └─ libstdcxx-ng already visited
          ├─ cuda-nvtx[12.6.37]
            ├─ libgcc-ng already visited
            └─ libstdcxx-ng already visited
          └─ cuda-sanitizer-api[12.6.34]
             ├─ libgcc-ng already visited
             └─ libstdcxx-ng already visited
        ├─ cuda-visual-tools[12.6.0]
          ├─ cuda-libraries-dev already visited
          ├─ cuda-nvml-dev already visited
          ├─ cuda-nsight[12.6.20]
          ├─ cuda-nvvp[12.6.37]
            ├─ libgcc-ng already visited
            ├─ libstdcxx-ng already visited
            ├─ cuda-nvdisasm already visited
            └─ cuda-nvprof[12.0.90]
               ├─ libgcc-ng already visited
               ├─ libstdcxx-ng already visited
               └─ cuda-cupti already visited
          └─ nsight-compute[2024.3.0.15]
             └─ ...
        └─ gds-tools[1.11.0.15]
           ├─ libgcc-ng already visited
           ├─ libstdcxx-ng already visited
           └─ libcufile already visited

$ mamba repoquery depends cuda-minimal-build=12.6 -c conda-forge --tree --recursive

cuda-minimal-build[12.6.0]
  ├─ cuda-cccl[12.6.37]
    ├─ cccl[2.5.0]
    └─ cuda-cccl_linux-64[12.6.37]
  ├─ cuda-compiler[12.6.0]
    ├─ c-compiler[1.0.0]
      ├─ gcc_linux-64[10.3.0]
        ├─ binutils_linux-64[2.36]
          ├─ binutils_impl_linux-64[2.36.1]
            ├─ ld_impl_linux-64[2.36.1]
            └─ sysroot_linux-64[2.12]
               └─ kernel-headers_linux-64[2.6.32]
          └─ sysroot_linux-64 already visited
        ├─ sysroot_linux-64 already visited
        └─ gcc_impl_linux-64[10.3.0]
           ├─ binutils_impl_linux-64 already visited
           ├─ sysroot_linux-64 already visited
           ├─ libgcc-devel_linux-64[10.3.0]
           ├─ libgcc-ng[14.1.0]
             ├─ _libgcc_mutex[0.1]
             └─ _openmp_mutex[4.5]
                ├─ _libgcc_mutex already visited
                └─ llvm-openmp[18.1.8]
                   ├─ libzlib[1.3.1]
                   └─ zstd[1.5.6]
                      ├─ libzlib already visited
                      └─ libstdcxx-ng[14.1.0]
           ├─ libstdcxx-ng already visited
           ├─ libgomp[14.1.0]
             └─ _libgcc_mutex already visited
           └─ libsanitizer[10.3.0]
              └─ libgcc-ng already visited
      └─ libgcc-ng already visited
    ├─ cuda-cuobjdump[12.6.20]
      ├─ libgcc-ng already visited
      ├─ libstdcxx-ng already visited
      └─ cuda-nvdisasm[12.0.76]
         ├─ libgcc-ng already visited
         ├─ libstdcxx-ng already visited
         └─ cuda-version[12.0.0]
    ├─ cuda-cuxxfilt[12.6.20]
      ├─ libgcc-ng already visited
      └─ libstdcxx-ng already visited
    ├─ cuda-nvcc[12.6.20]
      ├─ gcc_linux-64 already visited
      ├─ cuda-nvcc_linux-64[12.6.20]
        ├─ cuda-cudart-dev_linux-64[12.6.37]
          ├─ cuda-cccl_linux-64[12.0.90]
          ├─ cuda-cudart-static_linux-64[12.0.107]
          └─ cuda-cudart_linux-64[12.0.107]
             ├─ libgcc-ng already visited
             └─ libstdcxx-ng already visited
        ├─ cuda-driver-dev_linux-64[12.6.37]
        ├─ cuda-nvcc-dev_linux-64[12.6.20]
          ├─ libgcc-ng already visited
          ├─ cuda-crt-dev_linux-64[12.6.20]
          └─ cuda-nvvm-dev_linux-64[12.6.20]
        ├─ cuda-nvcc-impl[12.6.20]
          ├─ cuda-nvcc-dev_linux-64 already visited
          ├─ cuda-cudart[12.6.37]
            ├─ libgcc-ng already visited
            ├─ libstdcxx-ng already visited
            └─ cuda-cudart_linux-64[12.6.37]
          ├─ cuda-cudart-dev[12.0.107]
            ├─ libgcc-ng already visited
            ├─ libstdcxx-ng already visited
            ├─ cuda-cudart[12.0.107]
              ├─ libgcc-ng already visited
              └─ libstdcxx-ng already visited
            ├─ cuda-cudart-dev_linux-64[12.0.107]
              ├─ cuda-cccl_linux-64 already visited
              └─ cuda-cudart-static_linux-64 already visited
            └─ cuda-cudart-static[12.0.107]
               ├─ libgcc-ng already visited
               ├─ libstdcxx-ng already visited
               └─ cuda-cudart-static_linux-64 already visited
          ├─ cuda-nvcc-tools[12.6.20]
            ├─ libgcc-ng already visited
            ├─ libstdcxx-ng already visited
            ├─ cuda-crt-tools[12.6.20]
            └─ cuda-nvvm-tools[12.6.20]
               ├─ libgcc-ng already visited
               └─ libstdcxx-ng already visited
          └─ cuda-nvvm-impl[12.6.20]
             ├─ libgcc-ng already visited
             └─ libstdcxx-ng already visited
        ├─ cuda-nvcc-tools already visited
        └─ sysroot_linux-64[2.28]
           ├─ _sysroot_linux-64_curr_repodata_hack[3]
           └─ kernel-headers_linux-64[4.18.0]
              └─ _sysroot_linux-64_curr_repodata_hack already visited
      └─ gxx_linux-64[10.3.0]
         ├─ gcc_linux-64 already visited
         ├─ binutils_linux-64 already visited
         ├─ sysroot_linux-64 already visited
         └─ gxx_impl_linux-64[10.3.0]
            ├─ sysroot_linux-64 already visited
            ├─ gcc_impl_linux-64 already visited
            └─ libstdcxx-devel_linux-64[10.3.0]
    ├─ cuda-nvprune[12.6.20]
      ├─ libgcc-ng already visited
      └─ libstdcxx-ng already visited
    └─ cxx-compiler[1.0.0]
       ├─ libgcc-ng already visited
       ├─ libstdcxx-ng already visited
       └─ gxx_linux-64 already visited
  ├─ cuda-cudart-dev[12.6.37]
    ├─ libgcc-ng already visited
    ├─ libstdcxx-ng already visited
    ├─ cuda-cudart-dev_linux-64 already visited
    ├─ cuda-cudart already visited
    └─ cuda-cudart-static[12.6.37]
       ├─ libgcc-ng already visited
       ├─ libstdcxx-ng already visited
       └─ cuda-cudart-static_linux-64[12.6.37]
  └─ cuda-profiler-api[12.6.37]
     └─ cuda-cudart-dev already visited

Building an ML workstation

Intro

I have been keeping a close eye on the evolutions in AI/ML for a while now. Whenever I come across an interesting demo, I of course like to try it out. Because my main computer at home only has a weak iGPU, I often resort to running workloads in the cloud (mostly Google Colab or AWS). While that works reasonably well, there are some downsides:

  • risk of going over budget when an instance accidentally keeps running after use
  • general mild inconvenience of working with remote systems
  • cloud defeats the purpose of running a private/local LLM
  • more expensive in the long run

That is why I decided to build my own ML system last month. I am not sure if I will actually end up saving money this way, but it is going to be an educational experience regardless. It is still early days, but this post contains my lessons learned so far.

Component selection

GPU

The core of any ML workstation is the GPU. Due to the ubiquity of CUDA requirements in deep learning, there is only a single viable brand: Nvidia. Their offerings fall into three categories:

Architecture     Desktop             Workstation     Datacenter
Pascal (2016)    GeForce GTX 10xx    Quadro P        Tesla P4 / Tesla P100
Volta (2017)     N/A                 Quadro GV100    Tesla V100
Turing (2018)    GeForce RTX 20xx    Quadro RTX      Tesla T4
Ampere (2020)    GeForce RTX 30xx    RTX A series    A100
Ada (2022)       GeForce RTX 40xx    RTX 6000 Ada    N/A?
Hopper (2022)    N/A                 N/A             H100
Blackwell

If you have tens of thousands of dollars to burn, you will want to look at Nvidia's enterprise offerings and more specifically at the A100 or newer H100 GPUs. These options come with abundant VRAM (40 to 80GB), which we can put to good use in a deep learning context. Additionally, they are very power efficient with a lower TDP compared to consumer-grade GeForce cards. This translates to a smaller physical footprint, so that multiple cards can fit in a single server case. Be careful when installing these GPUs in a regular desktop case though: they only have passive (i.e., fanless) cooling so they require very intensive external ventilation as is standard in a typical, temperature-controlled data center.

Notice how I focus on VRAM memory above all else. The reasoning behind this is simple: if you do not have enough memory, your model simply will not run. The other specs will only determine how patient you will have to be to see the result.

One step down in the price range (5 000EUR - 10 000EUR) we find their workstation offerings. Here, the RTX A6000 and RTX 6000 Ada with 48GB VRAM both look appealing. These come in a "blower style" form factor instead of the more traditional "open air" form factor, meaning they exhaust hot air via the back instead of spreading it back into the case. This again makes it possible to install many cards in a limited physical space without having to worry too much about heat dissipation. Unfortunately, these cards make a lot of noise, and this type of cooling is not suitable for more power-hungry consumer-grade cards (250W+).

The price range of up to 2000EUR makes those consumer-grade cards a lot more viable for most people. Conveniently, the current and last generation flagships - RTX 4090 and RTX 3090 (Ti) respectively - both have 24GB VRAM. In conclusion, buying a (lightly) used RTX 3090 (700EUR - 800EUR) might be the most budget-friendly option out there. Downsides of these consumer grade cards are their high power usage and their unwieldy form factor. (You will have a hard time fitting an RTX 4090 in a 4U server case.)

While Nvidia produces all of its own cards in the datacenter and workstation segments, there is a lot more competition in the consumer space. Nvidia releases Founders Edition (FE) cards for some of its own GPU chips, but many other companies (Asus, MSI, Gigabyte, ...) build their own alternative cards around those same chips. They all have their own peculiarities:

  • factory overclocking
    • not relevant for us; if anything, we might end up power-limiting our card to keep power usage and temperatures under control over long runs
  • cooling method (1-3 fans, optional watercooling)
  • physical size
    • typically 3+ expansion slots
    • watercooled cards can be slimmed down to 1 slot height
  • power usage
  • power connectors
    • most cards need 2 to 3 PCIe 6+2pin connectors
  • looks
    • especially RGB lights, if you are into that
  • ...

For my build, I am going to start out with a single RTX 3090 FE, but I want to select the other components carefully so that I can expand to 2x3090 or even 3x3090 in the future. Specifications:

  • Ampere architecture
  • 24GB GDDR6X VRAM (24 chips x 1GB/chip, mounted on both sides of the PCB)
    • the memory has a tendency to run hot (>100 degrees Celsius)
    • might need to replace the thermal pads
    • using a GPU brace is also reported to fix some heat issues
    • or consider power-limiting the card with sudo nvidia-smi -i <GPU_index> -pl <power_limit> (see the sketch after this list)
    • or look into custom water cooling blocks, if that is your thing
  • memory bus: 384 bit (12 x 32-bit memory controllers)
  • dimensions: 313 mm x 138 mm x 3 expansion slots
  • PCIe connector: PCIe Gen 4 x16
    • note: PCIe Gen 5 is the latest standard, but there are no Gen 5 GPUs yet
  • power
    • 350W
    • connector
      • placed on long edge of card (instead of short edge in higher segments)
      • (variant of) new 12Vhpwr connector found on new ATX 3.0 PSUs
      • including conversion cable to 2xPCIe 6+2pin connectors
  • last consumer card to support NVLink
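As referenced in the list above, a hedged sketch of capping the power limit with nvidia-smi (the 280W value is only illustrative; check your card's supported range with nvidia-smi -q -d POWER first):

sudo nvidia-smi -i 0 -pm 1
sudo nvidia-smi -i 0 -pl 280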

For a much more in-depth analysis of GPUs for deep learning, check out https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/.

Multi-GPU considerations

To run multiple GPUs, we need a motherboard and CPU combination that offers enough PCIe Gen 4 x16 slots, enough PCIe lanes to feed them, and enough physical spacing in between. The main considerations are:

  • space
  • heat
  • power
  • PCIe lanes
  • SLI / NVLink
    • bridge sold separately
    • requires fixed amount of space between cards
      • not compatible with "creative" (i.e., vertical) GPU placement options
  • use same make and model for all cards
    • else computation will often wait for the slowest card to finish
  • note: this kind of setup only makes sense for deep learning, not for gaming

CPU

For a CPU we have to choose between Intel and AMD. While Intel used to be a no-brainer in the not-too-distant past, the tables have turned in recent years. I knew this was true in the consumer space, but as it turns out it is also valid in high-end segments such as HEDT, workstation and server CPUs.

For an ML workstation, the CPU is not nearly as important as the GPU. When possible, extra budget should go to the GPU instead. However, the CPU has an important role in making sure the GPUs can reach their full potential: it has to be able to supply them with enough data so that they are not sitting idle. This data-transfer capacity, largely determined by the number of available PCIe lanes, will play a crucial role in our choice of CPU segment.

Some background on PCIe slots: each slot has a physical size (i.e., width) and a number of communication lanes, both expressed with indicators such as "x1", "x2", "x4", "x8" or "x16". A GPU typically occupies a physical x16 slot because a lot of data has to be transferred back and forth. Note that smaller expansion cards fit in larger slots, but not the other way around. For example, an x4 card can plug into an x16 PCIe slot, leaving the remaining 12 lanes unused. Typically, a slot that is x4 wide and contains an x4 expansion card will use all 4 lanes, and a slot that is x16 wide will use all 16 lanes. In practice, however, the two can diverge: when either the CPU or the motherboard cannot handle the combined number of lanes over all slots, one or more slots may be run at half the number of lanes. So an x16 GPU slot can end up running with only x8 lanes.

Furthermore, each PCIe generation is roughly twice as fast as the previous one. So PCIe Gen 5 x8 can be as fast as PCIe Gen 4 x16. You might now be thinking: it evens out if we put a Gen 4 x16 GPU in a Gen 5 x8 slot. Unfortunately, that is not the case. The Gen 5 slot is fully backwards compatible with Gen 4 expansion cards, but it will also be limited to Gen 4 speeds in that case. Effectively, it will still be running at Gen 4 x8 if only 8 lanes are available. To be clear: running at half the number of lanes does not halve the effective speed of the GPU. The number of lanes is not the main bottleneck in most systems, so the performance penalty will be much lower.
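
A quick back-of-the-envelope sketch to make the generation and lane numbers concrete (the GB/s-per-lane figures are rounded approximations per direction, ignoring protocol overhead):

# approximate per-direction throughput in GB/s per PCIe lane
GBPS_PER_LANE = {3: 1.0, 4: 2.0, 5: 4.0}

def slot_bandwidth(gen, lanes):
    return GBPS_PER_LANE[gen] * lanes

print(slot_bandwidth(4, 16))  # ~32 GB/s: Gen 4 card in a full x16 slot
print(slot_bandwidth(4, 8))   # ~16 GB/s: same Gen 4 card when only 8 lanes are available
print(slot_bandwidth(5, 8))   # ~32 GB/s: only reachable if the card itself were Gen 5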

Consumer-grade CPUs such as Intel Core i3-i9 and AMD Ryzen 3-9 have a very limited number of available PCIe lanes. For example, the top end AMD Ryzen 9 7900X can only manage 28 PCIe lanes (of which 4 are reserved to communicate with the motherboard chipset). That leaves 24 lanes (e.g., x16 + x8) for our GPUs. For almost all consumers - who are only ever interested in having a single GPU - this is plenty. However, we have to ask ourselves if running our second GPU with only x8 instead of x16 lanes is worth it. For many people the answer will be "yes" and they should stick to this segment. The alternative is looking at HEDT, workstation or server segment CPUs, as we will do below.

HEDT or high-end desktop started with Intel Extreme Edition CPUs, and later Intel Core X CPUs. These days Intel's HEDT segment has been folded into their Xeon lineup of workstation and server processors, specifically the Xeon W-2400 and W-3400 series. The category sits somewhere between consumer-grade hardware and workstation hardware, offering more multithreading performance and more PCIe lanes. AMD is going back and forth with regard to their HEDT support: Threadripper CPUs are in the HEDT segment, while Threadripper PRO CPUs are in the workstation segment. AMD had not released a non-PRO Threadripper in a while, but at CES 2024 they announced a new lineup (e.g., the AMD Ryzen Threadripper 7970X with 32 cores and 92 PCIe 5.0 lanes). In summary, HEDT is a good match for our build, but the segment is being squeezed by high-end consumer hardware on one side and lower-end workstation hardware on the other.

Specifically, the AMD Ryzen Threadripper PRO 5000WX series (based on the older Zen 3 architecture) is very competitively priced these days. It offers workstation CPUs with up to 64 cores, 2TB of DDR4 RAM and 128 PCIe lanes. As we will see in the motherboard section, these builds come with typical enterprise features that are redundant for our target audience, to the point where a HEDT build would be a better match if properly priced. An additional benefit of using somewhat older (i.e., 2022) hardware is that the DDR4 memory and PCIe 4.0 SSDs that come with it are cheaper than the recent DDR5/PCIe 5.0 counterparts.

I also briefly looked at the server segment (Intel Xeon and AMD EPYC) but found no better offerings there. In the end I settled for a Threadripper PRO 5955WX, a 16-core CPU with a TDP of 280W and 128 PCIe lanes that I could get a decent deal on. The Intel counterparts have fewer cores, lower clock rates and fewer PCIe lanes for the same or more money.

Another fun fact about CPUs: you can buy them boxed (default) or as "tray". Tray refers to the tray of multiple CPUs that is typically bought by OEMs for use in their prebuilt systems. As such, these don't come with any extras (no box, no manual, no stock cooler, ...). OEMs are not supposed to resell these to consumers, but it sometimes happens regardless. Manufacturers like Intel and AMD will typically not provide factory warranty for such products, so you will have to talk to the intermediary (i.e., the OEM) instead in case of problems. The main advantage of tray CPUs is their lower price. If the discount is significant enough, it is worth considering. However, I learned the hard way that Threadripper CPUs are supposed to come with a torque wrench to fasten the CPU mount precisely as tight as prescribed. The tray versions of these CPUs obviously do not include this tool.

CPU cooler

For workstation-grade builds, I prefer aircooling over watercooling. It requires barely any maintenance and it can run without failure for years on end. We do however need a cooler with a sizable heatsink to be able to dissipate the TDP of our CPU. When choosing one, make sure it does not occlude any RAM or PCIe slots that you intend to use.

Specifically for Threadripper PRO builds, pay attention to the orientation of the cooler. In desktops and workstations, coolers are supposed to blow air from front to back in the case. However, our CPU is from the server segment - contrary to the non-PRO Threadrippers in the HEDT segment - where a horizontal socket orientation is more common. In that case a regular cooler will blow air from bottom to top. The Noctua NH-U14S and NH-U12S both suffer from this. Eventually, I discovered the Arctic Freezer 4U-M, which has the correct orientation and also matches all other requirements (i.e., socket and TDP). The "4U" terminology refers to server height in a rack.

Motherboard

Our choice of CPU (or more precisely its sWRX80 socket) limits our choice of motherboards quite significantly. Again, we select with our multi-GPU setup in mind. Specifically, we are looking for plenty of PCIe Gen 4 x16 slots that can run at full speed and are spaced far enough apart. Additionally, we would like a few M.2 slots with heatsinks that have a direct connection to the CPU and that are placed far enough away from hot GPUs. Finally, make sure to check the connectivity options (USB, USB-C, WiFi, Bluetooth). Fortunately, most motherboards on the short list fit the bill. I eventually went with the ASUS Pro WS WRX80E-SAGE SE WIFI because I saw it being used in Lambda Labs builds and I could get a good deal on one. It was only afterwards that I realized the awkward dimensions of this board (see the Case section). It is worth looking for smaller alternatives, but make sure to study the block diagram showing all interconnections before making a decision. The ASRock WRX80 Creator comes to mind, although it seems hard to come by and does not support x16 lanes in all PCIe slots.

Some random notes on the ASUS Pro WS WRX80E-SAGE SE WIFI board:

  • requires lots of power cables
    • 1 x 24-pin ATX connector
    • 2 x 8-pin CPU/EPS connector
    • 2 x 6-pin PCIe connector
    • 1 x 6+2-pin PCIe connector
  • main feature: 7 PCIe 4.0 x16 slots
  • built-in power and reset button
    • works without connecting front panel headers of case
  • built-in VGA output
    • useful when you don't have discrete GPU yet
      • note: AMD Threadripper PRO does not have an iGPU
    • makes Linux crash on boot
      • first attempt: add acpi=off to grub bootloader options list
        • makes it possible to boot into live environment
        • also disables all but two USB ports
        • also crashes KVM via BMC
        • also disables all NVMe SSD drives
      • proper fix 1: add pci=nommconf to grub bootloader options list
        • in bootloader: E, add option, F10
          • linux /boot/vmlinuz-x.y.z-... ro quiet splash pci=nommconf
        • make permanent when booted
          • vi /etc/default/grub
            • add pci=nommconf to GRUB_CMDLINE_LINUX variable
          • sudo update-grub
          • reboot
      • proper fix 2: disable VGA header via physical switch
    • still works with BMC disabled
  • Q-code display output does not seem to match with table in manual
  • BMC / IPMI
    • typical server-level feature
    • access
      • can only be accessed over ethernet (not WiFi) via one of the two ports
      • make sure to use HTTPS
      • check IP address in BIOS
      • user: admin
      • password: admin
    • makes system take minutes to boot after complete power down (e.g., after unplugging)
      • much faster after regular shutdown and start
        • but still slow compared to regular desktop
      • fixed when BMC is disabled
    • LEDs
      • stay on when system is off
      • green (blinking): BMC is up and running
      • orange: on iff new warning in system event log
        • possibly about fans with RPM below threshold
    • control fan curves via web portal
      • or via BIOS (after firmware update)
      • (non-PWM?) fans will run at max speed when BMC is disabled
    • built-in KVM
  • contains two small fans
  • bottom pins are oriented south instead of up
    • pro: allows large GPU to hang off bottom edge of motherboard
    • con: many cases have limited space near bottom to connect everything
  • WiFi
    • 6, not 6E
    • shark-shaped WiFi antenna is very impractical
      • alternative: aftermarket antennas that attach directly to connectors
  • no Thunderbolt header (as is typical in AMD builds)
  • sound when running Ubuntu

Other

RAM
  • check QVL list of motherboard for compatibility
    • mine insisted specifically on DDR4-3200 RAM
  • amount
    • more is better (up to a point)
    • at least 20% more than total VRAM (see the quick calculation after this list)
  • type: DDR4 (cheaper) or DDR5
  • form factor
    • DIMM (desktop)
    • make sure they fit under the CPU cooler
  • speed, timings, latency: not important
  • overclocking profiles (Intel XMP/AMD EXPO): not important
  • heatsink: not important
  • mostly works best with two modules in dual channel mode
  • ECC
    • nice to have
    • more expensive
    • more difficult to find
  • warranty: lifetime
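
A quick calculation for the 20% rule above, using a hypothetical end goal of three RTX 3090s:

# planned worst case: 3x RTX 3090 with 24 GB VRAM each
total_vram_gb = 3 * 24
min_ram_gb = 1.2 * total_vram_gb  # at least 20% more RAM than total VRAM
print(min_ram_gb)  # 86.4 -> round up to the next common kit size, e.g. 96 or 128 GB
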
Storage
  • main SSD
    • type: NVMe M.2 SSD
    • size: 1TB+
      • models take up a lot of room
    • PCIe
      • typically uses x4 lanes
      • ideally directly connected to CPU instead of via motherboard chipset
      • both Gen 4 and Gen 5 options are available
    • no need for a heatsink if your motherboard already has one
    • warranty (5+ years)
  • optional 5400RPM HDD(s) for cheap extra storage
PSU
  • must haves
    • power rating
      • rule of thumb:
    • right type and amount of connectors
      • ATX24 for motherboard
      • EPS for CPU
      • PCIe depending on motherboard, GPU and other components
      • warning: never power a GPU through daisy-chained PCIe cables; use a separate cable per connector
  • nice to haves
    • 80+ efficiency rating (gold < platinum < titanium)
      • note: small percentual differences become relevant when using lots of power
    • 12Vhpwr connector
      • supports up to 600W
      • plug these in properly, or you risk melting the plug
    • modular design
    • silent
    • warranty (10+ years)
Case
  • volume
    • for aircooled multi-GPU setup, disregard any case with a volume below 60 liters
  • constraints
    • supports motherboard form factor
    • CPU cooler height
    • GPU length
    • PSU length
  • nice to haves
    • dust filters
    • cable management options
    • easy to open
    • built-in GPU brace(s)

I realized fairly late in the process that my motherboard has an unusual form factor: EEB (12.2" x 13") instead of the far more common ATX (12" x 9.6"). This severely limited the number of compatible cases I could choose from. Even cases that officially claimed to support the EEB form factor had some caveats. For example, because of the downward-facing connectors on the bottom edge of the motherboard, I had to make sure I had spare room in that area to be able to connect all cables. Note that this extra space is useful in any case if you plan on installing a large GPU in the bottom PCIe slot. Furthermore, the standard cable management holes in the backplate of many cases get covered by the much wider motherboard, which results in some unconventional cable management practices. If I were to do this build over again, I would put a much stronger emphasis on selecting a standard ATX motherboard.

Some feasible options:

  • Corsair 7000D Airflow (very tight, not recommended)
  • Fractal Design Define 7 XL
  • Fractal Design Meshify 2 XL
  • Lian Li O11 Dynamic XL
  • Phanteks Enthoo Pro 2
Cooling fans

Ventilation is very important in a high-powered system, especially if the goal is to sustain long-duration workloads. Do not cheap out on fans after building a $2000+ computer.

  • case should have positive pressure (slightly more intake than exhaust airflow)
  • size
    • use 120mm or 140mm fans
    • not 200mm, they fail first due to high torque
  • RPM trade-off
    • high RPM = more airflow
    • low RPM = less noise
  • purpose
    • static pressure: for radiators, meshes, filters
    • airflow: elsewhere
    • hybrid
  • connector
    • 3 pin (voltage regulated)
    • 4 pin (PWM regulated, better)
  • bearings
    • fluid
      • cheap
      • mineral oil for lubrication
      • dust sensitive
      • go bad after a while
      • sensitive to orientation (avoid horizontal)
    • ball
      • expensive
      • long lasting
      • more noisy
      • ideal for servers
      • any orientation
    • sleeve
      • hybrid between ball and fluid bearing
      • closed system
      • prefers horizontal orientation
    • rifle
      • like sleeve
      • with Archimedes screw
        • prefers horizontal orientation
      • used in be quiet! fans
    • magnetic / maglev
      • lowest noise
      • expensive
      • any orientation
  • fan orientation
    • vertical: fails first
    • horizontal
  • recommendations

Further reading

PTI + DragGAN

I came across a tool called DragGAN this weekend. Although GANs are somewhat outdated, the fun example videos prompted me to play with the technique for a bit. Running the provided demos is very easy in Google Colab. The only hiccup I experienced was that I had to manually upload the StyleGAN-Human model to Colab to add it to the GUI list. It is not included in the original download script.

The DragGAN tutorial suggests using the PTI technique to work with your own custom images. There are however no detailed instructions on how to combine the two techniques and pass the correct information between them. This notebook shows how it can be done. It can run in Google Colab on a T4 GPU.

Note that the base model we use here is stylegan2_ada_ffhq, which has been trained on Flickr-Faces-HQ (FFHQ). As such, it will only work on pictures of faces.

In [1]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

class Downloader(object):
    def __init__(self, use_pydrive):
        self.use_pydrive = use_pydrive

        if self.use_pydrive:
            self.authenticate()

    def authenticate(self):
        auth.authenticate_user()
        gauth = GoogleAuth()
        gauth.credentials = GoogleCredentials.get_application_default()
        self.drive = GoogleDrive(gauth)

    def download_file(self, file_id, file_dst):
        if self.use_pydrive:
            downloaded = self.drive.CreateFile({'id':file_id})
            downloaded.FetchMetadata(fetch_all=True)
            downloaded.GetContentFile(file_dst)
        else:
            !gdown --id $file_id -O $file_dst

downloader = Downloader(True)

Step 1 - Install Packages required by PTI

In [ ]:
!pip install lpips wandb

# used for faster inference of StyleGAN by enabling C++ code compilation
!wget https://github.com/ninja-build/ninja/releases/download/v1.8.2/ninja-linux.zip
!sudo unzip ninja-linux.zip -d /usr/local/bin/
!sudo update-alternatives --install /usr/bin/ninja ninja /usr/local/bin/ninja 1 --force

Step 2 - Download Pretrained models

In [ ]:
!git clone https://github.com/XingangPan/DragGAN.git
In [ ]:
!git clone https://github.com/danielroich/PTI.git
%cd /content/PTI/
!git checkout da94d59d15d94822e95840ab5a0aa9ba1a19c851
In [10]:
import os
image_dir_name = 'image'
os.makedirs(f'./{image_dir_name}_original', exist_ok=True)
os.makedirs(f'./{image_dir_name}_processed', exist_ok=True)
save_path = "pretrained_models"
os.makedirs(save_path, exist_ok=True)
In [11]:
downloader.download_file("125OG7SMkXI-Kf2aqiwLLHyCvSW-gZk3M", os.path.join(save_path, 'ffhq.pkl'))
In [12]:
downloader.download_file("1xPmn19T6Bdd-_RfCVlgNBbfYoh1muYxR", os.path.join(save_path, 'align.dat'))

Step 3 - Configuration Setup

In [19]:
import sys
import pickle
import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from IPython.display import display

from configs import paths_config, hyperparameters, global_config
from utils.align_data import pre_process_images
from scripts.run_pti import run_PTI

image_name = 'personal_image'
global_config.device = 'cuda'
paths_config.e4e = '/content/PTI/pretrained_models/e4e_ffhq_encode.pt'
paths_config.input_data_id = image_dir_name
paths_config.input_data_path = f'/content/PTI/{image_dir_name}_processed'
paths_config.stylegan2_ada_ffhq = '/content/PTI/pretrained_models/ffhq.pkl'
paths_config.checkpoints_dir = '/content/PTI/'
paths_config.style_clip_pretrained_mappers = '/content/PTI/pretrained_models'
hyperparameters.use_locality_regularization = False

Step 4 - Preprocess Data

TODO: upload a picture to /content/PTI/image_original/personal_image.jpg
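
One way to do this from within Colab is sketched below (files.upload() opens a file picker and writes the chosen file to the current working directory; dragging the file into the Colab file browser works just as well):

from google.colab import files
import shutil

uploaded = files.upload()  # select a .jpg containing a face
src = list(uploaded.keys())[0]
shutil.move(src, f'./{image_dir_name}_original/{image_name}.jpg')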

In [ ]:
original_image = Image.open(f'./{image_dir_name}_original/{image_name}.jpg')
In [ ]:
pre_process_images(f'/content/PTI/{image_dir_name}_original')

Step 5 - Invert images using PTI

In order to run PTI and use StyleGAN2-ada, the cwd should be the parent of 'torch_utils' and 'dnnlib'

In [ ]:
model_id = run_PTI(use_wandb=False, use_multi_id_training=False)

Visualize results

In [26]:
def load_generators(model_id, image_name):
  with open(paths_config.stylegan2_ada_ffhq, 'rb') as f:
    d = pickle.load(f)
    old_G = d['G_ema'].cuda()
    old_D = d['D'].cuda()

  with open(f'{paths_config.checkpoints_dir}/model_{model_id}_{image_name}.pt', 'rb') as f_new:
    new_G = torch.load(f_new).cuda()

  return old_G, old_D, new_G
In [27]:
old_G, old_D, new_G = load_generators(model_id, image_name)
In [28]:
# def plot_syn_images(syn_images):
#   for img in syn_images:
#       img = (img.permute(0, 2, 3, 1) * 127.5 + 128).clamp(0, 255).to(torch.uint8).detach().cpu().numpy()[0]
#       plt.axis('off')
#       resized_image = Image.fromarray(img,mode='RGB').resize((256,256))
#       display(resized_image)
#       del img
#       del resized_image
#       torch.cuda.empty_cache()
In [29]:
w_pivot_path = f'{paths_config.embedding_base_dir}/{paths_config.input_data_id}/{paths_config.pti_results_keyword}/{image_name}/0.pt'
# w_pivot = torch.load(w_pivot_path)

# old_image = old_G.synthesis(w_pivot, noise_mode='const', force_fp32 = True)
# new_image = new_G.synthesis(w_pivot, noise_mode='const', force_fp32 = True)

# print('Upper image is the inversion before Pivotal Tuning and the lower image is the product of pivotal tuning')
# plot_syn_images([old_image, new_image])

Export

In [31]:
def export_updated_pickle(old_G, old_D, new_G, output_path):
  tmp = {}
  tmp['G'] = old_G.eval().requires_grad_(False).cpu()
  tmp['G_ema'] = new_G.eval().requires_grad_(False).cpu()
  tmp['D'] = old_D.eval().requires_grad_(False).cpu()
  tmp['training_set_kwargs'] = None
  tmp['augment_pipe'] = None

  with open(output_path, 'wb') as f:
      pickle.dump(tmp, f)

output_path = f'{paths_config.checkpoints_dir}/stylegan2_{image_name}.pkl'
export_updated_pickle(old_G, old_D, new_G, output_path)
In [32]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

!mkdir -p /content/DragGAN/checkpoints
!cp $output_path /content/DragGAN/checkpoints
!cp $w_pivot_path /content/DragGAN/checkpoints

DragGAN

In [ ]:
%cd /content/DragGAN
!git checkout c5e88b3eaf64c33a9e82782d75b4329d16711c3a
In [ ]:
!pip install -r requirements.txt
In [35]:
# !python scripts/download_model.py

Fix some errors in the Python scripts:

  • use our custom w_pivot from PTI
  • set the default model in the GUI to our own
  • bypass the watermark due to a font issue
In [36]:
!sed -i 's#None.*w_load#torch.load("/content/DragGAN/checkpoints/0.pt"),#' /content/DragGAN/visualizer_drag_gradio.py
!sed -i 's/stylegan2_lions_512_pytorch/stylegan2_personal_image/' /content/DragGAN/visualizer_drag_gradio.py
!sed -i 's/d = ImageDraw/return input_image_array  # d = ImageDraw/' /content/DragGAN/viz/renderer.py
In [ ]:
!python /content/DragGAN/visualizer_drag_gradio.py

StarCoder (WIP)

Intro

How to set up StarCoder on AWS

  • create S3 bucket
  • create policy that allows read/write to that bucket
  • create EC2 role containing that policy
  • start a new EC2 instance
    • TODO select right instance type
    • t2.micro for now to set up S3 properly
    • use newly created IAM role
  • sudo yum install git
  • Amazon Linux 2023 does not support git-lfs out of the box, workaround:
    • curl -LO https://github.com/git-lfs/git-lfs/releases/download/v3.3.0/git-lfs-linux-amd64-v3.3.0.tar.gz
    • tar xvfz git-lfs-linux-amd64-v3.3.0.tar.gz
    • sudo ./install.sh instead of git lfs install
    • git lfs version
  • git clone https://huggingface.co/bigcode/starcoder
    • takes a while, needs to download 65GB
  • cd starcoder
  • TODO save to S3
  • don't forget to stop the instance when you're done

Local, out of the box usage

  • conda create -n starcoder python=3.11
  • conda activate starcoder
  • git clone https://github.com/bigcode-project/starcoder.git
  • cd starcoder
  • pip install -r requirements.txt
  • set the HUGGING_FACE_HUB_TOKEN environment variable (needed because the weights on the Hub are gated)

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

checkpoint = "bigcode/starcoder"

# downloads the model weights from the Hugging Face Hub on first run (large download)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# device=0 runs generation on the first GPU
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
print(pipe("def hello():"))
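
The snippet above loads the weights in full precision, which will not fit in the 24GB of a single consumer GPU. A more memory-friendly variant is sketched below (assuming the accelerate package is also installed, so that device_map="auto" can spread the layers over the available GPU(s) and CPU RAM):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,  # half precision: roughly 30GB of weights instead of 60GB
    device_map="auto",          # requires accelerate; places layers on GPU(s) and CPU as needed
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("def hello():", max_new_tokens=32))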

What's new in Generative AI land

Regularly updated sources

Big tech news

Adobe

AI21 labs

Amazon

Andrej Karpathy / Eureka Labs

Anthropic

Apple

Black Forest Labs

Chip Huyen

Cohere

Databricks

Google

Huggingface

Inflection

Lightning AI

Meta

Microsoft

Midjourney

Mistral

NVIDIA

Ollama

OpenAI

Runway ML

Stability AI

xAI

Other news

Tools

Recent models

Catching up with the Deep Learning revolution

Timeline

Information sources

Important people

  • Geoffrey Hinton (1947): Google Brain, 1/3 godfathers of AI, backpropagation
  • Yann LeCun (1960): FB, 1/3 godfathers of AI, CNN
  • Yoshua Bengio (1964): Deep Learning book, 1/3 godfathers of AI
  • Andrew Ng (1976): Google Brain, Baidu, Coursera, deeplearning.ai
  • Ian Goodfellow (1986): Deep Learning book, Google Brain, OpenAI, Apple, GANs, supervised by Ng + Bengio
  • François Chollet: Google, Keras
  • Aaron Courville
  • Pieter Abbeel, prof EE/robotics/AI @ UC Berkeley
    • ESAT at KUL
    • PhD at Stanford under Andrew Ng
    • podcast: The Robot Brains
  • Andrej Karpathy: Stanford, Tesla, OpenAI, Eureka Labs
  • Chip Huyen: Stanford, Claypot AI, Voltron Data
  • Ilya Sutskever: AlexNet, Google, OpenAI
  • Tim Dettmers: QLoRA, bitsandbytes, GPU comparison

Modalities

  • input
    • text
      • code
    • audio
      • speech / voice
    • visual
      • image
      • video
  • output
    • text
      • code
    • audio
      • speech / voice
      • music
    • actions
      • movement (robots)
      • tools/APIs (agents)

Glossary

  • AE: auto encoder
  • AI: artificial intelligence
  • ANN: artificial neural network
  • BERT: bidirectional encoder representations from transformers
  • BPE: byte pair encoding
  • CLIP: contrastive language-image pretraining
  • CNN: convolutional neural network
  • CoT: chain of thought
  • CPU: central processing unit
  • DBN: deep belief network
  • DL: deep learning
  • DNN: deep neural network
  • DRL: deep reinforcement learning
  • EM: expectation maximization
  • Flan: finetuned language net
  • FNN: feedforward neural network
  • GAN: generative adversarial network
  • GPT: generative pre-trained transformer
  • GPU: graphics processing unit
  • HF: HuggingFace
  • LiT: locked image tuning
  • LLM: large language model
  • LoRA: low-rank adaptation
  • LSTM: long short term memory
  • ML: machine learning
  • MLP: multilayer perceptron
  • MoE: mixture of experts
  • MP: max pooling
  • NLG: natural language generation
  • NLP: natural language processing
  • NLU: natural language understanding
  • PEFT: parameter-efficient fine-tuning
  • RAG: retrieval-augmented generation
  • RBM: restricted Boltzmann machine
  • ReLU: rectified linear unit
  • RL: reinforcement learning
  • RNN: recurrent neural network
  • SFT: supervised finetuning
  • SGD: stochastic gradient descent
  • SL: supervised learning
  • SOTA: state of the art
  • SSL: self-supervised learning
  • SVM: support vector machines
  • TPU: tensor processing unit
  • UL: unsupervised learning
  • VAE: variational auto encoder
  • ViT: vision transformer
  • VRAM: video RAM (i.e., the memory of the GPU)

Infrastructure

  • you will need one or more Nvidia GPUs
    • with CUDA, Tensor Cores and cuDNN support
    • overview of recent Nvidia GPU architectures:
Architecture Desktop Workstation Datacenter
Pascal (2016) GeForce GTX 10xx Quadro P Tesla P4 / Tesla P100
Volta (2017) N/A Quadro GV100 Tesla V100
Turing (2018) GeForce RTX 20xx Quadro RTX Tesla T4
Ampere (2020) GeForce RTX 30xx RTX A series A100
Ada (2022) GeForce RTX 40xx RTX 6000 Ada L4 / L40
Hopper (2022) N/A N/A H100, H200
Blackwell (2024) GeForce RTX 50xx ? B100, B200

Cloud environments

Accelerator Standard RAM High RAM*
None 12.7 GB 25.5 GB
Standard GPU 12.7 GB 25.5 GB
Premium GPU* 12.7 GB 25.5 GB
TPU 12.7 GB 35.2 GB

Machine learning libraries

Datasets

Model hubs

Model metrics and benchmarks

Vision models

  • outdated
    • MNIST error rate
    • ImageNet error rate
  • recent
    • ...

Language models

Misc

Recent Generative AI Models

This page was last updated on 2025-09-13.

New Table

Table 1 - Models

Model

Company

Date

Params

Paper

Source

Website

Weights

Remarks

Robotic Transformer 1

Google DeepMind

2022-12-13

link

link

Einstein GPT

Salesforce

2023-03-07

link

uses OpenAI API?

🧨 Stable UnCLIP 2.1

Stability AI

2023-03-24

link

link

link

model behind Reimagine

LLaVA

University of Wisconsin-Madison

2023-04-17

link

link

link

link

LLaVA = Large Language and Vision Assistant

WizardLM

Microsoft

2023-04-24

7B, 13B, 30B, 70B

link

link

link

based on llama

Eleven Multilingual v1

ElevenLabs

2023-04-27

link

link

English, French, German, Hindi, Italian, Polish, Portuguese, Spanish

PaLM 2

Google

2023-05-10

link

link

LIMA

Meta AI

2023-05-18

65B

link

based on llama

🔈Massive Multilingual Speech

Meta AI

2023-05-22

300M, 1B

link

link

link

link

Falcon

TII.AE

2023-05-26

1B, 7B, 40B

coming soon

link

AlphaDev

Google DeepMind

2023-06-07

link

link

🔈 StyleTTS 2

Columbia University

2023-07-13

link

link

link

link

WizardCoder

Microsoft

2023-06-14

15B

link

link

link

Llama 2

Meta AI

2023-07-18

7B, 13B, 70B

link

link

link link2

link

Meta-Transformer

2023-07-20

85M, 302M

link

link

link

link

12 modalities

Stable Beluga 2

Stability AI

2023-07-21

70B

link

link

based on llama 2

🧨 Stable Diffusion XL 1.0

Stability AI

2023-07-26

3.5B

link

link

link

base refiner

Robotic Transformer 2

Google DeepMind

2023-07-28

link

link

StableCode

Stability AI

2023-08-08

3B

link

base instruct

🔈 AudioSep - Separate Anything You Describe

Audio-AGI

2023-08-09

link

link

link

link

🔈 AudioLDM2

ByteDance

2023-08-10

link

link

link

link

🔈 Eleven Multilingual v2

ElevenLabs

2023-08-22

link

link

English, French, German, Hindi, Italian, Polish, Portuguese, Spanish, Chinese, Korean, Dutch, Turkish, Swedish, Indonesian, Filipino, Japanese, Ukrainian, Greek, Czech, Finnish, Romanian, Danish, Bulgarian, Malay, Slovak, Croatian, Classic Arabic, Tamil

SeamlessM4T

Meta AI

2023-08-22

1.2B, 2.3B

link

link

link

link

Code Llama

Meta AI

2023-08-24

7B, 13B, 34B

link

link

link

link

Nougat OCR

Meta AI

2023-08-25

link

link

link

link

Specialized in academic documents

Falcon 180B

TII

2023-09-06

180B

coming soon

link

link

see also: falcon-40b

Persimmon

Adept

2023-09-07

8B

link

link

link

🔈 StableAudio

Stability AI

2023-09-13

link

🧨 DALL-E 3

OpenAI

2023-09-21

link

📽️ LaVie

Shanghai Artificial Intelligence Laboratory

2023-09-26

link

link

link

link

Mistral-7B

Mistral AI

2023-09-27

7B

link

link

link

Qwen

Alibaba

2023-09-28

7B, 14B

link

link

link

LLaVA 1.5

University of Wisconsin-Madison

2023-10-05

link

link

link

link

jina-embeddings-v2

Jina AI

2023-10-25

link

link

link

Yi

01.ai

2023-11-02

6B, 34B

link

link

📽️ Emu Video

Meta AI

2023-11-16

link

link

📽️ Stable Video Diffusion

Stability AI

2023-11-21

link

link

link

link

Meditron

École Polytechnique Fédérale de Lausanne (EPFL)

2023-11-27

7B, 70B

link

link

link

🧨 SDXL Turbo

Stability AI

2023-11-28

link

link

link

link

📽️ Animate Anyone

Alibaba

2023-11-28

link

link

link

Seamless

Meta AI

2023-11-30

link

link

link

link

OpenVoice

MyShell.ai

2023-12-03

7B, 13B, 34B, 70B

link

link

link

Gemini

Google DeepMind

2023-12-06

link

link

nano / pro / ultra, pro will power Bard

AlphaCode 2

Google DeepMind

2023-12-06

link

link

Stable LM Zephyr 3B

Stability AI

2023-12-07

3B

link

link

Mixtral 8x7B

Mistral AI

2023-12-11

45B

link

link

link

🧨 Imagen 2

Google DeepMind

2023-12-13

link

Stable Code 3B

Stability AI

2024-01-16

3B

link

link

Stable LM 2

Stability AI

2024-01-19

1.6B

link

link

Eagle 7B

RWKV

2024-01-29

7B

link

link

RWKV-v5 architecture

Code Llama 70B

Meta AI

2024-01-29

7B, 13B, 34B, 70B

link

link

link

link

MGIE

Apple

2024-02-05

link

link

link

link

Sora

OpenAI

2024-02-15

link

link

Gemma

Google

2024-02-21

2B, 7B

link

link

link

link 1 link 2

🧨 Stable Diffusion 3

Stability AI

2024-02-22 (preview)

0.8B, ..., 8B

link

Old Table

Table 2 - Models (old)

Model

Company

Date

Base Model

Parameters

Training Data Size

Training Time

Context length

Paper

Source

Website

Training data

Code License

Weights License

Type

Model weights

Instruction Tuning

RLHF

Remarks

Deep Blue

IBM

1996-01-01

from scratch

N/A

https://www.sciencedirect.com/science/article/pii/S0004370201001291

N/A

https://www.ibm.com/ibm/history/ibm100/us/en/icons/deepblue/

games

Chess

Watson

IBM

2011-01-01

from scratch

N/A

https://doi.org/10.1609/aimag.v31i3.2303

N/A

games

Jeopardy

AlexNet

A. Krizhevsky, G. Hinton

2012-09-30

from scratch

60M

https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

https://github.com/dansuh17/alexnet-pytorch (clone)

vision

won ImageNet LSVRC 2012 challenge with 15.3%

word2vec

Google

2013-01-16

https://arxiv.org/abs/1301.3781

no

no

Inception v1

Google

2014-09-17

from scratch

https://arxiv.org/abs/1409.4842

https://github.com/google/deepdream

vision

won ImageNet LSVRC 2014 challenge with 6.7%

DQN

Google DeepMind

2015-02-25

from scratch

https://www.nature.com/articles/nature14236

https://github.com/deepmind/dqn

deep RL

char-rnn

Andrej Karpathy

2015-05-21

from scratch

https://karpathy.github.io/2015/05/21/rnn-effectiveness/

https://github.com/karpathy/char-rnn

language

Features on https://www.aiweirdness.com/

GloVe

Stanford

2015-09-01

https://nlp.stanford.edu/pubs/glove.pdf

https://github.com/stanfordnlp/GloVe

https://nlp.stanford.edu/projects/glove/

Apache 2.0

Apache 2.0

yes

no

no

fastText

Facebook

2015-11-09

https://arxiv.org/abs/1607.04606

https://github.com/facebookresearch/fastText

https://fasttext.cc/

MIT

MIT

yes

no

no

Inception v3

Google

2015-12-02

https://arxiv.org/abs/1512.00567

vision

https://huggingface.co/timm/inception_v3.tv_in1k

ResNet

Microsoft

2015-12-10

from scratch

https://arxiv.org/abs/1512.03385

vision

won ImageNet LSVRC 2015 challenge with 3.57%; "better than humans"

AlphaGo

Google DeepMind

2016-01-27

from scratch

https://www.nature.com/articles/nature16961

games

Inception v4

Google

2016-02-23

vision

https://huggingface.co/timm/inception_v4.tf_in1k

Tay

Microsoft

2016-03-23

N/A

N/A

https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/

chatbot

CycleGAN

UC Berkeley

2017-03-30

https://arxiv.org/abs/1703.10593

https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix

GAN

yes

AlphaGo Zero

Google DeepMind

2017-10-19

https://www.nature.com/articles/nature24270

games

AlphaZero

Google DeepMind

2017-12-05

https://arxiv.org/abs/1712.01815

games

ELMo (Embeddings from Language Models)

Allen Institute for AI

2018-02-15

180M

https://arxiv.org/abs/1802.05365

language

yes

GPT (Generative Pre-trained Transformer)

OpenAI

2018-06-11

from scratch

117M

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

https://github.com/openai/finetune-transformer-lm

transformer

yes

no

no

BERT (Bidirectional Encoder Representations from Transformers)

Google

2018-10-11

108M, 334M

https://arxiv.org/abs/1810.04805

https://github.com/google-research/bert

transformer

yes

StyleGAN

Nvidia

2018-12-12

https://arxiv.org/abs/1812.04948

https://github.com/NVlabs/stylegan

GAN

yes

https://thispersondoesnotexist.com

GPT2

OpenAI

2019-02-14

1.5B

https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

https://github.com/openai/gpt-2

transformer

yes

no

no

XLNet

CMU & Google

2019-06-19

117M, 360M

https://arxiv.org/abs/1906.08237

https://github.com/zihangdai/xlnet

Apache 2.0

yes

RoBERTa

Meta AI

2019-07-26

BERT

354M

https://arxiv.org/abs/1907.11692

transformer

yes

ALBERT (A Lite BERT)

Google

2019-09-26

BERT

12M, 18M, 60M, 235M

https://arxiv.org/abs/1909.11942

https://github.com/google-research/ALBERT

Apache 2.0

transformer

yes

DistilBERT

HuggingFace

2019-10-02

BERT

66M

https://arxiv.org/abs/1910.01108

https://github.com/huggingface/transformers

Apache 2.0

transformer

yes

Text-to-Text Transfer Transformer (T5)

Google

2019-10-23

from scratch

11B

1T tokens

https://arxiv.org/abs/1910.10683

https://github.com/google-research/text-to-text-transfer-transformer

https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html

Apache 2.0

Apache 2.0

transformer

yes

no

no

AlphaFold

Google DeepMind

2020-01-15

from scratch

https://www.nature.com/articles/s41586-019-1923-7

https://github.com/deepmind/deepmind-research/tree/master/alphafold_casp13

yes

Turing NLG

Microsoft

2020-02-13

17B

N/A

https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

ELECTRA

Stanford & Google

2020-03-23

BERT?

14M, 110M, 335M

https://arxiv.org/abs/2003.10555

yes

DeBERTa

Microsoft

2020-06-05

BERT

https://arxiv.org/abs/2006.03654

https://github.com/microsoft/DeBERTa

MIT

transformer

yes

GPT3

OpenAI

2020-06-11

from scratch

175B

300B tokens

https://arxiv.org/abs/2005.14165

/

private

private

transformer

no

no

no

ImageGPT

OpenAI

2020-06-17

https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf

https://github.com/openai/image-gpt

private

private

transformer

no

mT5

Google

2020-10-22

from scratch

300M - 13B

1T tokens

https://arxiv.org/abs/2010.11934

https://github.com/google-research/multilingual-t5

mC4

Apache 2.0

Apache 2.0

transformer

https://huggingface.co/google/mt5-base

DALL-E

OpenAI

2021-01-05

GPT-3

12B

https://arxiv.org/abs/2102.12092

private

private

transformer

no

DeBERTa V2

Microsoft

2021-02-03

900M - 1.5B

N/A

transformer

yes

CLIP

OpenAI

2021-02-26

https://arxiv.org/abs/2103.00020

https://github.com/OpenAI/CLIP

https://openai.com/research/clip

MIT

yes

GLM

Tsinghua University

2021-03-18

110M - 10B

https://arxiv.org/abs/2103.10360

https://github.com/THUDM/GLM

transformer

yes

GPT-Neo

EleutherAI

2021-03-21

125M, 1.3B, 2.7B

N/A

https://github.com/EleutherAI/gpt-neo

https://www.eleuther.ai/artifacts/gpt-neo

MIT

transformer

https://huggingface.co/EleutherAI/gpt-neo-1.3B

LaMDA

Google

2021-05-18

from scratch

137B

2.8T tokens

58d

https://arxiv.org/abs/2201.08239

N/A

N/A

transformer

no

GPT-J

EleutherAI

2021-06-09

6B

https://github.com/kingoflolz/mesh-transformer-jax

yes

Apache 2.0

Apache 2.0

transformer

https://huggingface.co/EleutherAI/gpt-j-6b

no

no

CPM-2

Tsinghua University

2021-06-20

11B

https://arxiv.org/abs/2106.10715

https://github.com/TsinghuaAI/CPM

yes

Copilot

GitHub

2021-06-29

OpenAI Codex

N/A

N/A

code

no

ERNIE 3.0

Baidu

2021-07-05

10B

375B tokens

https://arxiv.org/abs/2107.02137

N/A

http://research.baidu.com/Blog/index-view?id=160

N/A

N/A

transformer

no

AlphaFold 2

Google DeepMind

2021-07-15

21B

https://www.nature.com/articles/s41586-021-03819-2

https://github.com/deepmind/alphafold

yes

Jurassic-1

AI21 Labs

2021-08-01

178B

300B tokens

N/A

N/A

https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1

N/A

N/A

no

Codex

OpenAI

2021-08-10

GPT3

12B

100B tokens

https://arxiv.org/abs/2107.03374

N/A

https://openai.com/blog/openai-codex

private

private

code

no

T0

BigScience

2021-10-15

T5

11B

27h

https://arxiv.org/abs/2110.08207

https://github.com/bigscience-workshop/t-zero

Apache 2.0

Apache 2.0

transformer

https://huggingface.co/bigscience/T0

DeBERTa V3

Microsoft

2021-11-18

https://arxiv.org/abs/2111.09543

https://github.com/microsoft/DeBERTa

MIT

transformer

yes

Gopher

Google DeepMind

2021-12-08

from scratch

280B

300B tokens

38d

https://arxiv.org/abs/2112.11446

no

no

no

GLaM (Generalist Language Model)

Google

2021-12-13

from scratch

1.2T

280T tokens

24d

https://arxiv.org/abs/2112.06905

WebGPT

OpenAI

2021-12-17

GPT 3

175B

https://arxiv.org/abs/2112.09332

N/A

private

private

transformer

no

no

yes

ClipSeg

2021-12-18

https://arxiv.org/abs/2112.10003

https://github.com/timojl/clipseg

InstructGPT

OpenAI

2022-01-27

GPT3

175B

https://arxiv.org/abs/2203.02155

N/A

private

private

transformer

no

yes

yes

Megatron-Turing (MT) NLG

Microsoft

2022-01-28

530B

270B tokens

https://arxiv.org/abs/2201.11990

N/A

transformer

no

AlphaCode

Google DeepMind

2022-02-02

0.3B,1B,3B,9B,41B

967B tokens

https://arxiv.org/abs/2203.07814

https://www.deepmind.com/blog/competitive-programming-with-alphacode

N/A

code

no

GPT3.5

OpenAI

2022-03-15

355B

N/A

private

private

transformer

no

Imagen

Google

2022-03-23

https://arxiv.org/abs/2205.11487

https://imagen.research.google/

CodeGen-Multi

Salesforce

2022-03-25

350M - 16B

2048

https://arxiv.org/abs/2203.13474v1

code

https://huggingface.co/Salesforce/codegen-350M-multi

Chinchilla

Google DeepMind

2022-03-29

70B

1.4T tokens

https://arxiv.org/abs/2203.15556

N/A

https://www.deepmind.com/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training

N/A

no

T5X

Google

2022-03-31

https://arxiv.org/abs/2203.17189

https://github.com/google-research/t5x

transformer

PaLM (Pathways Language Model)

Google

2022-04-04

8B, 62B, 540B

780B tokens

https://arxiv.org/abs/2204.02311

https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html

N/A

N/A

transformer

no

GPT-NeoX

EleutherAI

2022-04-14

20B

825GB

https://arxiv.org/abs/2204.06745

https://github.com/EleutherAI/gpt-neox

Apache 2.0

Apache 2.0

transformer

https://huggingface.co/EleutherAI/gpt-neox-20b

no

no

Tk-Instruct

Allen Institute for AI

2022-04-16

T5

3B, 11B

4h

https://arxiv.org/abs/2204.07705

https://github.com/yizhongw/Tk-Instruct

Apache 2.0

https://huggingface.co/allenai/tk-instruct-11b-def

yes

Flamingo

Google DeepMind

2022-04-29

https://arxiv.org/abs/2204.14198

https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model

N/A

no

OPT

Meta AI

2022-05-03

from scratch

125M - 175B

180B tokens

https://arxiv.org/abs/2205.01068

MIT

NC research

transformer

https://huggingface.co/facebook/opt-30b

no

no

UL2

Google Brain

2022-05-10

20B

1T tokens

https://arxiv.org/abs/2205.05131

Apache 2.0

Apache 2.0

transformer

yes

no

no

LaMDA 2

Google

2022-05-11

N/A

transformer

no

YaLM

Yandex

2022-06-22

from scratch

100B

N/A

https://github.com/yandex/YaLM-100B

transformer

yes

BLOOM

BigScience

2022-07-06

from scratch

up to 176B

366B tokens

105d

https://arxiv.org/abs/2211.05100

bigscience-bloom-rail-1.0

bigscience-bloom-rail-1.0

transformer

https://huggingface.co/bigscience/bloom

no

no

NLLB-200 (No Language Left Behind)

Meta AI

2022-07-06

from scratch

55B

https://about.fb.com/news/2022/07/new-meta-ai-model-translates-200-languages-making-technology-more-accessible/

translator

translate between 200 languages

Midjourney

Midjourney Inc.

2022-07-12

from scratch

N/A

N/A

https://www.midjourney.com

N/A

diffuser

no

Exposed as Discord bot

DALL-E 2

OpenAI

2022-07-20

GPT-3

https://cdn.openai.com/papers/dall-e-2.pdf

private

private

diffuser

no

AlexaTM

Amazon

2022-08-02

from scratch

20B

1.3T tokens

120d

https://arxiv.org/abs/2208.01448

https://github.com/amazon-science/alexa-teacher-models

transformer

via SageMaker

no

no

Stable Diffusion

Stability AI

2022-08-10

from scratch

890M

https://arxiv.org/abs/2112.10752

https://github.com/CompVis/stable-diffusion

https://stability.ai/blog/stable-diffusion-announcement

diffuser

yes

See also https://stablediffusionweb.com/

DreamBooth

Google

2022-08-25

https://arxiv.org/abs/2208.12242

https://github.com/google/dreambooth

https://dreambooth.github.io/

N/A

no

CodeGeeX

Tsinghua University

2022-09-19

from scratch

13B

850B tokens

60d

https://arxiv.org/abs/2303.17568

https://github.com/THUDM/CodeGeeX

https://models.aminer.cn/codegeex/blog/

Apache 2.0

CodeGeeX License

code

on request

N/A

N/A

WeLM

WeChat

2022-09-21

from scratch

10B

300B tokens

24d

https://arxiv.org/abs/2209.10372

https://welm.weixin.qq.com/docs/api/

yes

no

no

Chinese language

Sparrow

Google DeepMind

2022-09-22

from scratch

70B

https://arxiv.org/abs/2209.14375

https://www.deepmind.com/blog/building-safer-dialogue-agents

N/A

no

no

yes

GLM-130B

Tsinghua University

2022-10-05

from scratch

130B

400B tokens

60d

https://arxiv.org/abs/2210.02414

https://github.com/THUDM/GLM-130B

transformer

yes

Flan-T5

Google

2022-10-20

T5

60M - 11B

https://arxiv.org/abs/2210.11416

https://github.com/google-research/t5x

Apache 2.0

Apache 2.0

transformer

yes

yes

no

Flan-PaLM

Google

2022-10-20

PaLM

540B

37h

https://arxiv.org/abs/2210.11416

N/A

N/A

N/A

N/A

transformer

no

yes

no

U-PaLM

Google

2022-10-20

PaLM

8B, 62B, 540B

5d

https://arxiv.org/abs/2210.11399

N/A

N/A

transformer

no

no

no

BLOOMZ

BigScience

2022-11-03

BLOOM

176B

https://arxiv.org/abs/2211.01786

https://github.com/bigscience-workshop/xmtf

bigscience-bloom-rail-1.0

bigscience-bloom-rail-1.0

transformer

yes

yes

no

BLOOM + Multitask prompted finetuning (MTF)

mT0

BigScience

2022-11-03

mT5

300M - 13B

https://arxiv.org/abs/2211.01786

https://github.com/bigscience-workshop/xmtf

Apache 2.0

Apache 2.0

https://huggingface.co/bigscience/mt0-large

Google mT5 + Multitask prompted finetuning (MTF)

OpenJourney

PromptHero

2022-11-08

Stable Diffusion

N/A

diffuser

https://huggingface.co/prompthero/openjourney

Stable Diffusion finetuned to resemble MidJourney

Galactica

Meta AI

2022-11-16

from scratch

125M - 120B

106B tokens

https://arxiv.org/abs/2211.09085

cc-by-nc-4.0

transformer

https://huggingface.co/facebook/galactica-120b

Focussed on Science

Stable Diffusion v2

Stability AI

2022-11-24

from scratch

N/A

https://github.com/Stability-AI/stablediffusion

https://stability.ai/blog/stable-diffusion-v2-release

diffuser

yes

GPT-JT

TogetherComputer

2022-11-29

GPT-J

6B

N/A

https://www.together.xyz/blog/releasing-v1-of-gpt-jt-powered-by-open-source-ai

Apache 2.0

Apache 2.0

transformer

https://huggingface.co/togethercomputer/GPT-JT-6B-v1

no

ChatGPT

OpenAI

2022-11-30

GPT 3.5

N/A

N/A

https://openai.com/blog/chatgpt

no

private

private

chatbot

no

yes

yes

OpenCLIP

various

2022-12-14

from scratch

https://arxiv.org/pdf/2212.07143.pdf

https://github.com/LAION-AI/scaling-laws-openclip

OPT-IML

Meta AI

2022-12-22

OPT

30B, 175B

https://arxiv.org/abs/2212.12017

MIT

NC research

transformer

yes

yes

no

Bard

Google

2023-02-06

LaMDA 2 or PaLM 2?

N/A

chatbot

no

LLaMA

Meta AI

2023-02-23

from scratch

7B, 13B, 30B, 65B

1.4T tokens

21d

https://arxiv.org/abs/2302.13971

https://github.com/facebookresearch/llama

https://ai.facebook.com/blog/large-language-model-llama-meta-ai/

GPL 3.0

NC research

transformer

https://huggingface.co/decapoda-research/llama-65b-hf

no

no

Flan-UL2

Google Brain

2023-02-28

UL2

20B

Flan collection

https://arxiv.org/abs/2205.05131v3

https://github.com/google-research/google-research/tree/master/ul2

Apache 2.0

Apache 2.0

https://huggingface.co/google/flan-ul2

yes

no

Open-Assistant SFT-1

OpenAssistant

2023-03-09

Pythia 12B

12B

N/A

https://github.com/LAION-AI/Open-Assistant/tree/main/model/model_training

https://open-assistant.io/

Apache 2.0

transformer

https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b

Jurassic-2

AI21 Labs

2023-03-09

?

N/A

N/A

https://www.ai21.com/blog/introducing-j2

N/A

N/A

no

Alpaca-LoRA

Eric J. Wang

2023-03-13

LLaMA

N/A

https://github.com/tloen/alpaca-lora

transformer

yes

Alpaca

Stanford

2023-03-13

LLaMA

7B

N/A

https://github.com/tatsu-lab/stanford_alpaca

https://crfm.stanford.edu/2023/03/13/alpaca.html

transformer

yes

h2oGPT

H2O.ai

2023-03-13

Pythia 12B, GPT-NeoX 20B

12B, 20B

N/A

https://github.com/h2oai/h2ogpt

https://gpt.h2o.ai/

Apache 2.0

transformer

https://huggingface.co/h2oai

ChatGLM

Tsinghua University

2023-03-14

GLM / GLM-130B?

6B

https://github.com/THUDM/ChatGLM-6B

https://chatglm.cn/blog

chatbot

GPT4

OpenAI

2023-03-14

from scratch

8x220B

https://arxiv.org/abs/2303.08774

private

private

transformer

no

yes

yes

Zero-1-to-3

Columbia University

2023-03-20

https://arxiv.org/abs/2303.11328

https://github.com/cvlab-columbia/zero123

https://zero123.cs.columbia.edu/

diffuser

yes

Dolly v1

Databricks

2023-03-24

GPT-J

6B

N/A

https://github.com/databrickslabs/dolly

https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html

cc-by-nc-4.0

chatbot

https://huggingface.co/databricks/dolly-v1-6b

GPT4All

Nomic AI

2023-03-28

LLaMA

7B

https://static.nomic.ai/gpt4all/2023_GPT4All_Technical_Report.pdf

https://github.com/nomic-ai/gpt4all

yes

GPL 3.0

chatbot

https://huggingface.co/nomic-ai/gpt4all-lora

Finetuned LLaMA 7B based on GPT3.5 chats

Cerebras-GPT

Cerebras Systems

2023-03-28

from scratch

111M - 13B

https://arxiv.org/abs/2304.03208

https://github.com/Cerebras/modelzoo

https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/

Apache 2.0

Apache 2.0

transformer

https://huggingface.co/cerebras/Cerebras-GPT-13B

no

no

Reproduction of GPT 3 training process

LLaMA-Adapter

Shanghai AI Lab

2023-03-28

LLaMA

7B

https://arxiv.org/abs/2303.16199

https://github.com/ZrrSkywalker/LLaMA-Adapter

ColossalChat

Colossal AI

2023-03-29

LLaMA

https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat

https://chat.colossalai.org/

Apache 2.0

chatbot

Vicuna

LM-SYS

2023-03-30

LLaMA

7B, 13B

N/A

https://github.com/lm-sys/FastChat

https://vicuna.lmsys.org/

see LLaMA

transformer

yes

BloombergGPT

Bloomberg

2023-03-30

50B

https://arxiv.org/abs/2303.17564

https://www.bloomberg.com/company/press/bloomberggpt-50-billion-parameter-llm-tuned-finance/

transformer

RWKV-4 Raven

BlinkDL

2023-04-01

1.5B, 3B, 7B, 14B

https://arxiv.org/abs/2305.13048

https://github.com/BlinkDL/RWKV-LM

RNN

https://huggingface.co/BlinkDL/rwkv-4-raven

Pythia

EleutherAI

2023-04-03

70M - 12B

300B tokens

https://arxiv.org/abs/2304.01373

https://github.com/EleutherAI/pythia

Apache 2.0

Apache 2.0

transformer

https://huggingface.co/EleutherAI/pythia-12b

no

no

Koala

UC Berkeley

2023-04-03

LLaMA

7B, 13B

N/A

https://github.com/young-geng/EasyLM#koala

https://bair.berkeley.edu/blog/2023/04/03/koala

transformer

https://huggingface.co/young-geng/koala/tree/main

Baize

Baize Project

2023-04-03

LLaMA

7B, 13B, 30B

https://arxiv.org/abs/2304.01196

https://github.com/project-baize/baize-chatbot

transformer

https://huggingface.co/project-baize/baize-lora-7B

Finetuned LLaMA with LoRA

SAM

Meta AI

2023-04-05

https://arxiv.org/abs/2304.02643

https://github.com/facebookresearch/segment-anything

https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation/

yes

vision

Bark

Suno

2023-04-09

80M

N/A

https://github.com/suno-ai/bark

cc-by-nc-4.0

voice

yes

Dolly v2

Databricks

2023-04-12

Pythia

3B, 7B, 12B

N/A

https://github.com/databrickslabs/dolly

Apache 2.0

MIT

chatbot

https://huggingface.co/databricks/dolly-v2-12b

yes

no

CodeWhisperer

Amazon

2023-04-13

N/A

N/A

N/A

https://aws.amazon.com/blogs/aws/amazon-codewhisperer-free-for-individual-use-is-now-generally-available/

N/A

code

no

Self-hosted Copilot clone

GPT4All-J

Nomic AI

2023-04-14

GPT-J

6.7B

https://static.nomic.ai/gpt4all/2023_GPT4All-J_Technical_Report_2.pdf

https://github.com/nomic-ai/gpt4all

yes

Apache 2.0

Apache 2.0

transformer

https://huggingface.co/nomic-ai/gpt4all-j

yes

no

DINOv2

Meta AI

2023-04-14

from scratch

21M - 1.1B

https://arxiv.org/abs/2304.07193

https://github.com/facebookresearch/dinov2

https://ai.facebook.com/blog/dino-v2-computer-vision-self-supervised-learning/

vision

yes

VideoLDM

Nvidia

2023-04-18

Stable Diffusion

https://arxiv.org/abs/2304.08818

N/A

https://research.nvidia.com/labs/toronto-ai/VideoLDM/

StableLM

Stability AI

2023-04-19

from scratch

3B, 7B, (15B, 65B, 175B)

N/A

https://github.com/stability-AI/stableLM/

https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models

cc-by-nc-4.0

transformer

yes

Open-Assistant SFT-6

OpenAssistant

2023-04-22

LLaMA

30B

https://arxiv.org/abs/2304.07327

see LLaMA

transformer

https://huggingface.co/OpenAssistant/oasst-sft-6-llama-30b-xor

WizardLM

Microsoft

2023-04-24

LLaMA

7B

https://arxiv.org/abs/2304.12244

https://github.com/nlpxucan/WizardLM

transformer

yes

DeepFloyd IF

Stability AI

2023-04-28

N/A

https://github.com/deep-floyd/IF

https://stability.ai/blog/deepfloyd-if-text-to-image-model

StableVicuna

Stability AI

2023-04-28

Vicuna 13B

13B

N/A

https://github.com/Stability-AI/StableLM

https://stability.ai/blog/stablevicuna-open-source-rlhf-chatbot

cc-by-nc-4.0

transformer

https://huggingface.co/CarperAI/stable-vicuna-13b-delta

Vicuna 13B + RLHF

FastChat-T5

LM-SYS

2023-04-28

Flan-T5-XL

3B

N/A

https://github.com/lm-sys/FastChat#fastchat-t5

Apache 2.0

transformer

https://huggingface.co/lmsys/fastchat-t5-3b-v1.0

LLaMA-Adapter V2

Shanghai AI Lab

2023-04-28

LLaMA

https://arxiv.org/abs/2304.15010

https://github.com/ZrrSkywalker/LLaMA-Adapter

transformer

Replit Code

Replit

2023-05-02

from scratch

2.7B

N/A

https://github.com/replit/ReplitLM

https://replit.com/site/ghostwriter

cc-by-sa-4.0

code

https://huggingface.co/replit/replit-code-v1-3b

OpenLLaMA

OpenLM Research

2023-05-02

from scratch

7B

https://github.com/openlm-research/open_llama

RedPajama

Apache 2.0

transformer

https://huggingface.co/openlm-research/open_llama_7b_preview_300bt

Apache 2.0 LLaMA clone based on RedPajama data

Shap-E

OpenAI

2023-05-03

from scratch

300M

https://arxiv.org/pdf/2305.02463.pdf

https://github.com/openai/shap-e

MIT

diffuser

https://github.com/openai/shap-e/blob/main/shap_e/models/download.py

3D image generation

StarCoder

BigCode

2023-05-04

15B

1T tokens + 35B python tokens

8k

https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view

https://github.com/bigcode-project/starcoder

https://huggingface.co/blog/starcoder

BigCode OpenRAIL-M v1

code

https://huggingface.co/bigcode/starcoder

RedPajama

TogetherComputer

2023-05-05

from scratch

3B, 7B

N/A

https://github.com/togethercomputer/RedPajama-Data

https://www.together.xyz/blog/redpajama-models-v1

Apache 2.0

https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-7B-v0.1

Open reproduction of LLaMA

MPT-7B (MosaicML Pretrained Transformer)

MosaicML

2023-05-05

from scratch

7B

N/A

https://github.com/mosaicml/llm-foundry

https://www.mosaicml.com/blog/mpt-7b

Apache 2.0

transformer

https://huggingface.co/mosaicml/mpt-7b-instruct

Open reproduction of LLaMA

MPT-30B (MosaicML Pretrained Transformer)

MosaicML

2023-06-22

from scratch

30B

N/A

https://github.com/mosaicml/llm-foundry

https://www.mosaicml.com/blog/mpt-30b

Apache 2.0

transformer

https://huggingface.co/mosaicml/mpt-30b-instruct

Open reproduction of LLaMA

PanGu-sigma

Huawei

AnthropicLM

Anthropic AI

N/A

no

Lit-LLaMA

LLaMA

7B, 13B, 30B, 65B

Apache 2.0

NC research

optional with Alpaca

no

ImageBind

Meta AI

2023-05-09

from scratch

https://arxiv.org/abs/2305.05665

https://github.com/facebookresearch/ImageBind

https://ai.facebook.com/blog/imagebind-six-modalities-binding-ai/

cc-by-nc-4.0

cc-by-nc-4.0

transformer

https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth

six different modalities: images, text, audio, depth, thermal, and IMU

Open-LLaMA V2

s-JoL

2023-05-11

from scratch

N/A

https://github.com/s-JoL/Open-Llama

MIT

MIT

transformer

https://huggingface.co/s-JoL/Open-Llama-V2

yes

yes

PaLM 2

Google

2023-05-10

from scratch

https://ai.google/static/documents/palm2techreport.pdf

N/A

https://ai.google/discover/palm2

transformer

no

AWS certification

Last month, I decided to get some IT certifications for the first time. I have always been sceptical of such certifications, but a few colleagues managed to convince me that some are worth it. The ones from AWS for example are reasonably priced and the exams require more than just rote memorization.

This blog post from A Cloud Guru helped me decide where to start. The following image in particular was very helpful:

ACG certification guide

Next, I followed their training on PluralSight, and one month later I am happy to report that I am now triple-certified!

AWS Certified Cloud Practitioner badge AWS Certified Solutions Architect badge AWS Certified Developer badge

Over the summer, I might give the Machine Learning Specialty a try as well.

Update 2024-03: it took a bit longer than anticipated, but I finally obtained the AWS MLS certificate. The example questions you can find online are quite hard and the answers not always as clear-cut as you would want them to be. This held me back from quickly doing the exam. However, in the end, the questions on the real exam were fairly straightforward. Anyone with a few years of machine learning experience should be able to handle them.

The confusing world of OpenAI pricing

As we all know by now, a free version of ChatGPT exists with unpredictable levels of availability. This free version is based on a model called GPT-3.5. If you want higher availability or if you want to be able to switch to the newer GPT-4 model, you need a ChatGPT Plus subscription. That will cost you $20 per month (excl. tax). So far so good.

Confusingly, this subscription will not help you when you want to use the OpenAI API to access GPT models. That requires a separate subscription with a different pricing model. Instead of a fixed price per month, you pay per 1000 tokens as shown in the table below.

model 1K prompt tokens 1K completion tokens context size
gpt-3.5-turbo $0.002 $0.002 4,096 tokens
gpt-4 $0.030 $0.060 8,192 tokens
gpt-4-32k $0.060 $0.120 32,768 tokens

On average, a token corresponds to roughly 4 characters of text. Check out this interactive tool to see exactly how text is parsed into tokens. To give you an idea of how to interpret the context size, the current version of the ChatGPT Wikipedia page up to and including the "See also" section contains a little under 8000 tokens. That is about 12.5 pages. That means the 32000 tokens of gpt-4-32k correspond to about 50 pages.
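
If you want to count tokens locally, OpenAI's tiktoken package can do so (a small sketch; install it with pip install tiktoken first). It also makes it easy to estimate what a prompt will cost at the gpt-4 rates from the table above:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
prompt = "ELI5: quantum computing"
n_tokens = len(enc.encode(prompt))
print(n_tokens, "prompt tokens")

# prompt tokens cost $0.03 per 1K for gpt-4; completion tokens are billed separately at $0.06 per 1K
print(f"prompt cost: ${n_tokens / 1000 * 0.03:.5f}")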

When I wanted to try the API, I installed the openai Python package in a conda environment and created the following code snippet.

conda create -n openai python=3 openai
conda activate openai

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# print([model["id"] for model in openai.Model.list()["data"]])
# model = "gpt-3.5-turbo"
model = "gpt-4"
# model = "gpt-4-32k"
response = openai.ChatCompletion.create(model=model, messages=[
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "ELI5: quantum computing"}
])
# print(response)
print(response["choices"][0]["message"]["content"])

To my surprise, the requested model could not be found. Turns out there is a GPT-4 API waiting list that you have to sign up for, even if you are already a ChatGPT Plus subscriber and as such have access to GPT-4 via the chat interface.

In conclusion: this code snippet will unfortunately only work once you have subscribed to the API and received an invite after signing up for the waiting list. You could switch to gpt-3.5-turbo while waiting for the invitation. For light to medium usage, that might be a cheaper and more reliable way to access a GPT assistant than spending $20 per month on ChatGPT Plus.

Update 2023-07-06: the GPT-4 API is now generally available to all paying customers without having to join a waitlist.

References