Adventures with the AMD Radeon AI Pro R9700 (gfx1201)

Your Primer

Picking AMD and ROCm for AI is choosing hard mode. I guess you could pick a "harder" mode by using an Intel GPU or a less popular AI framework like tinygrad, but if you want the DIY experience of fussing with source builds and missing kernels, AMD with ROCm will do just fine.

What is ROCm, you say? Think of it as AMD's swing at competing with Nvidia's CUDA, but with rougher edges. Like its CUDA counterpart, ROCm is the umbrella term for all of the libraries, tools, and abstractions used to program compute operations on these cards. Now let's complicate things a little more: AMD ships two different microarchitectures and instruction sets. There is the CDNA family, which you will find in datacenter accelerators like the MI300, and the RDNA family, which you will find in your edgy PC gamer desktop builds. Since the AI boom is primarily happening in datacenters, CDNA cards have better support for pretty much everything in the ROCm stack.

A Tale of Two Cards

With that out of the way, let's talk about my experience. On 10/25/2025, AMD launched the Radeon AI Pro R9700, a new AI workstation card marketed as "AI ready!" Let's see some specs!

Radeon AI Pro R9700:

| Specification | Value |
| --- | --- |
| GPU Architecture | AMD RDNA™ 4 |
| Ray Accelerators | 64 |
| AI Accelerators | 128 |
| Stream Processors | 4096 |
| Compute Units | 64 |
| Boost Frequency | Up to 2920 MHz |
| Game Frequency | 2350 MHz |
| Peak Pixel Fill-Rate | Up to 373.76 GP/s |
| Peak Single Precision (FP32 Vector) Performance | 47.8 TFLOPs |
| Peak Half Precision (FP16 Vector) Performance | 47.8 TFLOPs |
| Peak Half Precision (FP16 Matrix) Performance | 191 TFLOPs |
| Peak Half Precision (FP16 Matrix) Performance with Structured Sparsity | 383 TFLOPs |
| Peak 8-bit Precision (FP8 Matrix) Performance (E5M2, E4M3) | 383 TFLOPs |
| Peak 8-bit Precision (FP8 Matrix) Performance with Structured Sparsity | 766 TFLOPs |
| Peak 8-bit Precision (INT8 Matrix) Performance | 383 TOPs |
| Peak 8-bit Precision (INT8 Matrix) Performance with Structured Sparsity | 766 TOPs |
| Peak 4-bit Precision (INT4 Matrix) Performance | 766 TOPs |
| Peak 4-bit Precision (INT4 Matrix) Performance with Structured Sparsity | 1531 TOPs |
| ROPs | 128 |
| Transistor Count | 53.9 Billion |
| OS Support | Windows 10 (64-Bit), Windows 11 (64-Bit), Linux x86 64-Bit |
| Wattage | 300W |

As for memory it has:

| Specification | Value |
| --- | --- |
| Dedicated Memory Size | 32 GB |
| Dedicated Memory Type | GDDR6 |
| AMD Infinity Cache Technology | 64 MB |
| Memory Interface | 256-bit |
| Peak Memory Bandwidth | 640 GB/s |
| Memory ECC Support | Yes (Linux Only) |

The price tag starts at $1300, and unlike the 5090, these aren't made of Unobtanium. Did you notice it's an RDNA4 card? That's right, this is basically a Radeon RX 9070 XT with 32 GB of RAM and a blower fan. Roughly, I like to describe it as 60% of a 5090 at 40% of the cost. Arguably though, we should be comparing it to the NVIDIA RTX PRO 4500 Blackwell workstation card.

NVIDIA RTX PRO 4500 Blackwell:

| Specification | Value |
| --- | --- |
| GPU Name | GB203 |
| Architecture | Blackwell 2.0 |
| Process Size | 5 nm |
| Transistors | 45,600 million |
| Release Date | Mar 18th, 2025 |
| Base Clock | 1635 MHz |
| Boost Clock | 2407 MHz |
| Memory Size | 32 GB |
| Memory Type | GDDR7 |
| Bandwidth | 896.0 GB/s |
| Tensor Cores | 328 |
| FP16 (half) | 50.53 TFLOPS (1:1) |
| FP32 (float) | 50.53 TFLOPS |
| Wattage | 200W |

This card comes in at $2,759.99 on Newegg as of 02/27/2026. Unfortunately, I could not find INT4, INT8, or FP8 metrics anywhere. At FP32/FP16, though, the AMD Radeon Pro R9700 and NVIDIA RTX PRO 4500 are in the same ballpark. The memory bandwidth of the NVIDIA RTX PRO 4500 is 40% higher (GDDR7 vs GDDR6), and on a per-watt basis it is also roughly 60% more efficient at FP32 compute. That price though. You can get two AMD Radeon Pro R9700 cards for the price of one NVIDIA RTX PRO 4500, and that's just what I did.
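To put numbers on that comparison, here's a quick back-of-the-envelope script using the FP32 and wattage figures from the two spec tables (the Newegg price rounded to $2,760):

```python
# Back-of-the-envelope comparison from the two spec tables above.
r9700_fp32_tflops, r9700_watts, r9700_price = 47.8, 300, 1300
rtx4500_fp32_tflops, rtx4500_watts, rtx4500_price = 50.53, 200, 2760

r9700_per_watt = r9700_fp32_tflops / r9700_watts        # ~0.16 TFLOPs/W
rtx4500_per_watt = rtx4500_fp32_tflops / rtx4500_watts  # ~0.25 TFLOPs/W

efficiency_gain = rtx4500_per_watt / r9700_per_watt - 1
cards_per_rtx = rtx4500_price / r9700_price
print(f"RTX PRO 4500 is {efficiency_gain:.0%} more efficient per watt")  # ~59%
print(f"One RTX PRO 4500 buys {cards_per_rtx:.1f} R9700s")               # ~2.1
```

Spec-sheet TFLOPs never translate directly into tokens per second, of course, but it frames the price-versus-efficiency trade cleanly.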

The Build

I was fortunate enough to start buying my PC parts at the start of the transistor apocalypse. I paid a bit of a premium for my RAM, but not nearly what y'all are getting charged today.

| Part | Value |
| --- | --- |
| CPU | AMD Ryzen 9 9900X3D |
| RAM | 128 GB 5600 MHz |
| Motherboard | ASUS ProArt X870E-CREATOR |
| Hard drive | Samsung SSD 9100 PRO 2TB |
| GPU | AMD Radeon AI Pro R9700 (2x) |
| Power Supply | Thermaltake 1650W |
| Case | Fractal Design North XL |
| Cooler | ASUS ProArt LC 360 AIO |

Gimme Those Tokens!

OK, initially I only had one GPU, so we have to walk down this road a little before we get to the dual-wielding beast this machine became. I have the unfounded belief that Mistral's Devstral Small 2 24B is going to be the model for local coding agents. Let's get into some of the details of this one before we come back to inference hosting.

Devstral Small 2 is a 24-billion-parameter dense model in FP8. At inference time, all 24B parameters are activated for each input. This is in contrast to a mixture-of-experts (MoE) model, like the Qwen3 line of agentic coding models. In an MoE, only a small subset of the parameters is active, and a router decides which experts to activate. There is more nuance than raw active parameter count though, so let's talk about hidden dimensions.

In an LLM, the hidden dimension defines the size of the vectors the model passes between layers. We can think of this as the "resolution" of tokens. Higher hidden dimensionality means the model can potentially capture more nuance in the relationships between tokens. I say potentially, because it still comes down to training. There is a tax to high dimensionality though, and you pay it in VRAM. You can roughly guesstimate the KV cache cost per token as 2 × layers × hidden dimension × bytes per element (the 2 covers the separate key and value tensors). So double the hidden dimension and we double the cost per token in the KV cache, which in turn determines how large a context window we can run given the finite resources of our machine.
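That rule of thumb is easy to sanity-check in code. The layer count and hidden dimension below are purely illustrative, not any particular model's config:

```python
def kv_bytes_per_token(layers: int, hidden_dim: int, bytes_per_elem: int = 2) -> int:
    """Per-token KV cache cost: 2 (keys + values) x layers x hidden dim x element size."""
    return 2 * layers * hidden_dim * bytes_per_elem

# Illustrative shape: 40 layers, 5120 hidden dim, FP16 (2-byte) cache.
base = kv_bytes_per_token(40, 5120)         # 819,200 bytes per token
doubled = kv_bytes_per_token(40, 10240)
assert doubled == 2 * base                  # double the hidden dim, double the cost
print(f"{base / 2**10:.0f} KiB per token")  # 800 KiB
```

Note that models using grouped-query attention cache far fewer dimensions than the full hidden size, which is why real-world numbers come in well under this guesstimate.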

The Radeon Pro R9700 has 30 GiB of usable VRAM if you have ECC enabled, which I do because I'm not a savage. With most of the layers at FP8, the model takes roughly 28.8 GiB of VRAM on vLLM. Effectively this leaves me with no usable context window.
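The raw weight math roughly lines up with that 28.8 GiB figure. As a simplification, assume all ~24B parameters are stored at one byte each for FP8 (in reality some layers stay in higher precision):

```python
params = 24e9              # ~24B parameters
weight_bytes = params * 1  # FP8: one byte per parameter
print(f"{weight_bytes / 2**30:.1f} GiB of weights alone")  # 22.4 GiB
# The gap up to the observed 28.8 GiB comes from unquantized layers,
# activations, and HIP runtime overhead.
```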

The Side Questing Begins

Side Quest 1: Quantization

I prefer vLLM, but that's just because I have used it frequently for work. Looking on Hugging Face, I found a 4-bit AWQ version (https://huggingface.co/cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit) which could halve the memory requirements. Firing it up in vLLM using the ROCm nightly build, I found the CounchLinear kernel was the only one that would work, but not with the group size of 32 this quant uses. Of course, I did what any reasonable person would do at that point: I stalked the developer of this quantized version across multiple social networks and eventually convinced him to share his LLM Compressor scripts and methodology. A couple of hours on a rented H100 later, and I had my own shiny new quantized version with a group size of 128. Round two, and I got a 10 TPS single-user experience. Not usable. Really, this is a problem with ROCm and AITER: AMD just doesn't appear to have kernels there for W4A16 models.
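Why does group size matter? In W4A16 quantization, each group of 4-bit weights shares one higher-precision scale, so smaller groups quantize more accurately but cost more storage, and each group size needs its own kernel support. A rough storage-overhead estimate, assuming one FP16 scale per group and ignoring zero points:

```python
def scale_overhead(group_size: int, scale_bits: int = 16, weight_bits: int = 4) -> float:
    """Fraction of extra storage spent on per-group quantization scales."""
    return scale_bits / (group_size * weight_bits)

print(f"group size 32:  {scale_overhead(32):.1%} extra storage")   # 12.5%
print(f"group size 128: {scale_overhead(128):.1%} extra storage")  # 3.1%
```

So moving from group size 32 to 128 trades a little quantization fidelity for less overhead, and, in my case, for a kernel that would actually run.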

Side Quest 2: Pivot to Llama.cpp

It's at this point you may be picking up on a particular character flaw: I love me a good underdog. You tell me to turn right, I go left. Nvidia cards are the best? I buy AMD. I probably should have just used llama.cpp directly or LM Studio, but for this side quest I chose Ollama instead.

Interesting thing about Ollama: it builds against ROCm 6.x, and the gfx1201 (my card) isn't really supported until ROCm 7.2. Don't get me wrong, you can lie and say it's an older RDNA3 card and it may work, but that's not for me. It was at this time I signed up for the AMD AI Developer Program and got access to the Discord. With the help of #resolver0, AMD_LOG_LEVEL=3, and OLLAMA_DEBUG=2, I was able to build an Ollama 0.15.4 fork with ROCm 7.2. I also have a Docker image if you'd like to try it out for yourself.

docker run -d --name ollama-rocm7p2 \
    --device /dev/kfd \
    --device /dev/dri \
    --group-add video \
    -e HSA_OVERRIDE_GFX_VERSION=12.0.1 \
    -p 11434:11434 \
    --security-opt seccomp=unconfined \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v ./:/model \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    androiddrew/ollama:0.15.4-rocm-7.2

I don't really have any complaints about Ollama. It works. Using unsloth's Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf took 14.4 GiB of VRAM, leaving me 15.6 GiB for context. I was getting roughly 33 TPS at decode for a single-user experience, but that context window is less than 40K. Since I want to use Opencode and crank the context window up to 120K, I'm really left with only two options: use this new Ollama build to offload layers onto the CPU so I can get more context, or acquire more VRAM.
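For the CPU-offload route, Ollama exposes both knobs through a Modelfile: `num_ctx` sets the context window and `num_gpu` caps how many layers stay on the GPU, spilling the rest to system RAM. The values below are illustrative, not a tuned configuration:

```
FROM ./Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf
# Illustrative values: bigger context, fewer GPU-resident layers.
PARAMETER num_ctx 131072
PARAMETER num_gpu 30
```

The catch is that every layer pushed to the CPU drags decode speed down with it, which is what nudged me toward the second option.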

Side Quest 3: Money Solves Problems

Well, I was fortunate enough that by this time it was my birthday, and the wife did not begrudge me spending an additional $1300. Turns out I needed a larger power supply too, but I didn't share that part. Now that I was equipped with a dual-wielding beast with 62 GiB of usable VRAM, it was time to re-evaluate my inference choices. Llama.cpp is pretty damn decent, but it lacks true tensor parallelism. With vLLM I can split the model weights across both GPUs, allowing me to load models I couldn't fit on a single GPU and leverage the full available VRAM.

Remember that 10 TPS I saw with my 4-bit quantized version? Well, at the standard FP8 I actually get a usable kernel from the ROCm stack. With 28.8 GiB of model weights, that leaves me 33.2 GiB for context. At standard FP16 KV cache, that's about 217,000 tokens. There is enough VRAM there now for two concurrent sessions with --max-model-len over 100,000. Unfortunately, I couldn't get the KV cache at FP8 to work, but no need to be greedy. I have enough compute for me and a friend.
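That 217K figure is consistent with grouped-query attention: the KV cache is sized by KV heads times head dimension, not the full hidden dimension. The shape below is my reading of the Mistral Small-family config (40 layers, 8 KV heads of dimension 128); check the model card before trusting it:

```python
layers, kv_heads, head_dim = 40, 8, 128  # assumed Mistral Small-family shape
fp16_bytes = 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # keys + values
free_vram = 33.2 * 2**30                                   # GiB left after weights
print(f"{per_token} bytes per token")                      # 163840
print(f"{free_vram / per_token:,.0f} tokens of context")   # ~217,580
```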

So what am I seeing for performance? Well a quick benchmark showed:

Throughput: 3.03 requests/s, 1369.18 total tokens/s, 659.90 output tokens/s
Total num prompt tokens:  23401
Total num output tokens:  21772

A single user sees decode throughput in the low 30s TPS.

(APIServer pid=2725) INFO 02-24 20:57:23 [loggers.py:257] Engine 000: Avg prompt throughput: 6.3 tokens/s, Avg generation throughput: 29.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=2725) INFO 02-24 20:57:33 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%

Not bad. Certainly usable, and I can definitely have multiple agents running at the same time. The thing to remember is that performance is just going to keep getting better. Both Ollama and vLLM currently lack the Composable Kernel (CK) attention backend that AMD recommends for these cards, and none of the kernels that ship in rocm-libs are tuned yet for the gfx1201. This performance is purely from fallback kernels. If you believe AMD's stats, we could see a 15%-25% bump in performance for certain model shapes.

Side Quest 4: Opencode and vLLM

I did not know what it would take to get Opencode and vLLM to play nice with each other. I will keep this brief, but it took a couple of hours of figuring out what the hell I was doing. The key pieces were the following vLLM flags and the Opencode config.

  • --tool-call-parser mistral
  • --tokenizer_mode mistral
  • --enable-auto-tool-choice
  • --chat-template-content-format string

Full command used:

export MODEL_PATH='mistralai/Devstral-Small-2-24B-Instruct-2512'
vllm serve $MODEL_PATH \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 150000 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 1 \
    --dtype auto \
    --tool-call-parser mistral \
    --enable-auto-tool-choice \
    --chat-template-content-format string \
    --tokenizer_mode mistral \
    --trust-remote-code \
    --skip-mm-profiling \
    --limit-mm-per-prompt '{"image": 2}'
    # executed inside the rocm/vllm-dev:rocm7.2_navi_ubuntu24.04_py3.12_pytorch_2.9_vllm_0.14.0rc0 container

Full Opencode JSON config:

{
  "$schema": "https://opencode.ai/config.json",
  "tools": {
    "bash": true,
    "edit": true,
    "write": true,
    "read": true,
    "grep": true,
    "glob": true,
    "list": true,
    "lsp": true,
    "patch": true,
    "skill": true,
    "todowrite": true,
    "todoread": true,
    "webfetch": true
  },
  "provider": {
    "warpspace-vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "vLLM (Warpspace)",
      "options": {
        "baseURL": "http://10.0.1.141:8000/v1",
        "apiKey": "shouldnotmatteri",
        "supportsStreaming": false
      },
      "models": {
        "mistralai/Devstral-Small-2-24B-Instruct-2512": {
          "name": "mistralai/Devstral-Small-2-24B-Instruct-2512",
          "capabilities": {
            "tools": true
          }
        }
      }
    }
  }
}

Let’s Put a Bow on This Llama

Thank you for sticking it out until the end. I stand by my statement that choosing AMD is choosing hard mode for AI. If you want something that "just works" and you have the money, just buy Nvidia or a Mac M4 Studio. For me though, this friction is what's necessary to learn. I'm not an expert when it comes to AI, even though I pretend to be at work most days, but these past few weeks have demystified gen AI for me. If you have any questions, you either already know me, or you can ping @androiddrew on the AMD Developer Community Discord.

tags: amd