NVIDIA AI has just released cuda oxide: an experimental Rust-to-CUDA compiler backend that compiles SIMT GPU cores directly to the PTX


Step 01 of 09 · Basic requirements

What you need before you start

cuda-Oxide has specific version requirements for each dependency. Before installing anything, make sure your system meets all of these elements. Project currently Linux only (Tested on Ubuntu 24.04).

Linux (Ubuntu 24.04)
Rust at night
CUDA Toolkit 12.x+
LLVM 21+
Clang 21/libclang-common-21-dev
gate

ⓘ Why LLVM 21?
Simple kernels may work on LLVM 20, but anything targeting Hopper or Blackwell – TMA, tcgen05, WGMMA – requires llc From LLVM 21 or later. This is a difficult requirement, not a recommendation.

Check your current CUDA version to confirm compatibility:

nvcc --version

Step 02 of 09 · Install Rust Nightly

Set up the Rust Nightly toolchain

Coda oxide requires rust At night With two additional components: rust-src and rustc-dev. The toolchain is installed on nightly-2026-04-03 via rust-toolchain.toml In the repository – It will be automatically installed when you run the build for the first time inside the repo.

If you need to install it manually:

 Install the pinned nightly toolchain
rustup toolchain install nightly-2026-04-03

 Add required components
rustup component add rust-src rustc-dev \
  --toolchain nightly-2026-04-03

 Confirm the toolchain is active
rustup show

ⓘ Why these ingredients?
rustc-dev Exposes the internal compiler APIs that the custom codec backend links to. rust-src This is needed so that the compiler can find its standard library sources and compile them for the machine target.

Step 03 of 09 · Install LLVM 21

Install LLVM 21 with NVPTX Backend

The coda oxide pipeline emits the LLVM IR script (.ll files) and delivered to external llc Binary for PTX production. You need LLVM 21 or later with the NVPTX backend enabled.

 Ubuntu/Debian
sudo apt install llvm-21

 Verify the NVPTX backend is present
llc-21 --version | grep nvptx

Automatically detects pipeline llc-22 and llc-21 On your PATH In this order. To install a specific binary, set the environment variable:

 Pin to a specific llc binary
export CUDA_OXIDE_LLC=/usr/bin/llc-21

⚠ Common failure
If NVPTX does not appear in the output llc-21 --versionyour LLVM architecture is compiled without an NVPTX target. Install from the official LLVM apt repository instead of the distribution’s default packages, which may omit the GPU backends.

Step 04 of 09 · Install Clang

Install Clang 21 for cuda link box

the cuda-bindings Fund uses bindgen To create FFI links to cuda.h At the time of construction. bindgen needs libclang Specifically, it needs Clang’s resource guide (which includes stddef.h). Naked libclang1-* The runtime package is Not enough.

 Install the full clang-21 package (includes resource headers)
sudo apt install clang-21

 Alternatively, the -dev header package also works
sudo apt install libclang-common-21-dev

⚠ Symptoms of reindeer loss
If you only install the runtime and not the headers, building the host with an encrypted file will fail 'stddef.h' file not found Error during bindgen. Being cargo oxide doctor The next step is to catch this before attempting to build.

Step 05 of 09 · Charge oxide installation

Ribo cloning and charge oxide stabilization

cargo-oxide It is a shipping subcommand that runs the entire build pipeline cargo oxide build, cargo oxide run, cargo oxide debugand cargo oxide pipeline.

Inside the repo (to try examples):

git clone https://github.com/NVlabs/cuda-oxide.git
cd cuda-oxide

 cargo oxide works out of the box via a workspace alias
cargo oxide run vecadd

Outside the repo (For your own projects):

 Install globally from the git source
cargo install \
  --git https://github.com/NVlabs/cuda-oxide.git \
  cargo-oxide

 On first run, cargo-oxide fetches and builds the codegen backend

Then check that all prerequisites are met using the built-in health check:

cargo oxide doctor

ⓘ What the doctor examines
It validates your Rust toolchain (nightly, rust-src, rustc-dev), CUDA toolkit, LLVM version, NVPTX support, Clang/libclang headers, and codegen backend binary. Fix any red items before continuing.

Step 06 of 09 · Run the first kernel

Build and run vecadd example

The first canonical example is vecadd – A vector addition kernel that adds two arrays of 1024 f32 values ​​on the GPU and check the result on the host.

 Build and run end-to-end
cargo oxide run vecadd

If everything is configured correctly, you will see:

✓ SUCCESS: All 1024 elements correct!

To see the full translation path – from Rust MIR through every Pliron dialect all the way to PTX – run:

 Print the full Rust MIR - dialect-mir - mem2reg - dialect-llvm - LLVM IR - PTX trace
cargo oxide pipeline vecadd

To correct errors with cuda-gdb:

cargo oxide debug vecadd --tui

ⓘ Output artifacts
A successful build results in two files: target/debug/vecadd (binary host) and target/debug/vecadd.ptx (device code). The host binary loads the PTX file via the CUDA driver at runtime.

Step 07 of 09 · Write the kernel

Write your own number[kernel] job

The kernel function is explained using [kernel]. is used DisjointSlice For mutable outputs and andamp;[T] For read-only input. Access the thread’s unique device index using thread::index_1d().

use cuda_device::{kernel, thread, DisjointSlice};

// Tier 1 safety: race-free by construction, no `unsafe` needed.
// DisjointSlice::get_mut() only accepts a ThreadIndex -
// a hardware-derived opaque type guaranteeing unique writes per thread.
[kernel]
pub fn scale(input: andamp;[f32], factor: f32, mut out: DisjointSliceandlt;f32andgt;) {
    let idx = thread::index_1d();
    if let Some(elem) = out.get_mut(idx) {
        *elem = input[idx.get()] * factor;
    }
}

ⓘ Level 1 Safety – How it works
ThreadIndex It’s an obscure new genre around usize which can only be generated from registers built into the devices (threadIdx, blockIdx, blockDim). Since each thread gets a unique value, and DisjointSlice::get_mut() Only A is accepted ThreadIndexSweat-free writing by construction – no unsafe anywhere in the kernel.

Step 08 of 09 · Launching from the host

Launching the kernel from host code

The host and machine code live at the same time .rs file. The host side is used CudaContext, DeviceBufferand cuda_launch! Macro for GPU memory management and transmission.

use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};
use cuda_host::{cuda_launch, load_kernel_module};

fn main() {
    // Initialize GPU context on device 0
    let ctx    = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = load_kernel_module(andamp;ctx, "scale_example").unwrap();

    // Upload input data to GPU memory
    let data: Vecandlt;f32andgt; = (0..1024).map(|i| i as f32).collect();
    let input  = DeviceBuffer::from_host(andamp;stream, andamp;data).unwrap();
    let mut output = DeviceBuffer::andlt;f32andgt;::zeroed(andamp;stream, 1024).unwrap();

    // Dispatch the kernel - LaunchConfig auto-sizes blocks/grids
    cuda_launch! {
        kernel: scale,
        stream: stream,
        module: module,
        config: LaunchConfig::for_num_elems(1024),
        args: [slice(input), 2.5f32, slice_mut(output)]
    }.unwrap();

    // Download result back to host
    let result = output.to_host_vec(andamp;stream).unwrap();
    assert!((result[1] - 2.5).abs() andlt; 1e-5);
    println!("✓ Kernel ran successfully!");
}

ⓘ What cuda_launch! He does
It scales a list of arguments-slice flatness, scalars, and captured occlusions-to PTX kernel parameters and sends the kernel on the specified flow. You do not have to manually organize the argument.

Step 09 of 09 · Next steps

What to explore next?

You have a working coda oxide setting. Here are the high-value forward paths, ranked by complexity:

  • General beads with uniformity – Try generic Example (cargo oxide run generic) Let’s see how fn scale Combine to separate PTX beads for each type.
  • Closing with snaps – the host_closure The example shows how a move |x: f32| x * factor The closures are escalated and passed as PTX kernel parameters automatically.
  • Asynchronous GPU executioncuda_launch_async! Lazy returns DeviceOperation Which is executed on .sync() or .await. Look async_mlp and async_vecadd Examples.
  • Shared memory and warp essence – These require scope unsafe Blocks with documented safety contracts. See Level 2 in the security model documentation.
  • GEMM at the speed of light – the gemm_sol Example achieves 868 TFLOPS on B200 (58% of cuBLAS SoL) using cta_group::2The CLC and 4-stage pipeline.
  • Blackwell tensor cores – the tcgen05 The example targets sm_100a with TMEM, MMA, and… cta_group::2. Requires LLVM 21+.

ⓘ Known limitations in version 0.1.0
index_2d(stride) It is currently documented as unhealthy – if threads in the same kernel use different step values, two threads can get stuck andamp;mut T To the same item with no unsafe vision. Until the fix lands (raising the step to the type parameter), bind the step to a single parameter let Link them and reuse them in every communication site.

Full documentation: nvlabs.github.io/cuda-oxyde · source: github.com/NVlabs/cuda-oxyde

Leave a Reply