AI Storage Reference Architecture v2

Ⅰ

Kernel-Free Data Path

Zero kernel involvement from NVMe to GPU HBM. SPDK + DPDK + cuFile. Every syscall is latency left on the table.

Ⅱ

Blob-Native Semantics

No POSIX. No inodes. No directory tree. Blobs are the primitive. Content-addressed by token hash. POSIX is a compatibility shim, never the hot path.

Ⅲ

GPUDirect Everywhere

NVMe → GPU HBM via RDMA DMA. CPU DRAM is never in the tensor data path. BlueField-3 DPU as RDMA peer on every GPU node.

Ⅳ

Commodity at Scale

Standard x86 + PCIe 5.0 NVMe + ConnectX-7 NICs. No proprietary ASICs. Any node replaceable in minutes. Hardware is fungible; software defines all performance.

Ⅴ

Kubernetes-Native Control

Every storage resource is a Kubernetes object. CSI, operators, and scheduler extensions — not bolted on but co-designed from day one.

AI Framework Layer

NIXL + cuFile

NIXL (NVIDIA Inference Xfer Library): NVIDIA's transfer abstraction for AI storage and memory. Integrates natively with vLLM, PyTorch, and HuggingFace. The blob API surface is exposed directly to inference engines.
cuFile / GDS — kernel-bypass path from NVMe-oF into GPU HBM via DMA. No CPU memcopy.

NVIDIA Open

KV Cache Manager

Mooncake / custom

Disaggregated KV cache — prefill tokens stored as content-addressed blobs, shared across inference pods. Eliminates redundant prefill compute for repeated prompt prefixes.

Apache 2.0

Blob Orchestration

Custom Blob Engine

DHT router — consistent-hash key → stor-node mapping. Rebalances on node add/remove.
Blob lifecycle — tiering policy (HBM → NVMe → cold object).
Stripe planner — EC stripe assignment, parity computation coordination.
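
The DHT router's key → stor-node mapping can be illustrated with a small consistent-hash ring. This is a minimal sketch, not the blob engine's actual router: the `HashRing` class, the virtual-node count, and the use of SHA-256 are all assumptions made for illustration.

```python
import bisect
import hashlib


def _h(value: str) -> int:
    """Map a string to a 64-bit point on the hash ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place `vnodes` points for the node on the ring."""
        for i in range(self.vnodes):
            point = _h(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        """Drop every ring point that belongs to the node."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, blob_key: str) -> str:
        """Return the node owning the first ring point at or after the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._points, _h(blob_key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing([f"stor-node-{i}" for i in range(1, 7)])
owner = ring.lookup("blob-0001")  # one of stor-node-1 .. stor-node-6
```

The property that matters for rebalancing is that adding a node only moves keys onto the new node, and removing it restores the previous mapping; keys never shuffle between the surviving nodes.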

Custom · Apache 2.0

Erasure Coding

ISA-L (Intel)

SIMD-accelerated erasure coding in user space, configurable as 4+2 or 8+2. AVX-512 GF(2^8) multiply, claimed to be far faster than kernel RAID-6. No kernel MD involvement.
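
The trade-off between the two stripe widths can be worked out directly. The helper below is an illustrative sketch (the function name `ec_profile` is hypothetical); it only does the arithmetic and does not call ISA-L.

```python
from fractions import Fraction


def ec_profile(data_shards: int, parity_shards: int) -> dict:
    """Capacity and durability figures for a k+m erasure-coded stripe."""
    total = data_shards + parity_shards
    return {
        # Share of raw capacity that holds user data.
        "usable_fraction": Fraction(data_shards, total),
        # Bytes written to media per byte of user data.
        "write_amplification": Fraction(total, data_shards),
        # Shards that can be lost without losing the stripe.
        "shard_losses_tolerated": parity_shards,
    }


for k, m in ((4, 2), (8, 2)):
    print(f"{k}+{m}:", ec_profile(k, m))
```

Both schemes tolerate two lost shards; 8+2 keeps 4/5 of raw capacity usable against 2/3 for 4+2, at the cost of a wider stripe that touches more shards per read and rebuild.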

BSD License

Storage Engine

SPDK Blobstore

SPDK NVMe driver — user-space PCIe 5.0 NVMe, poll-mode, zero kernel.
SPDK Blobstore — blob primitive directly on NVMe. No VFS, no page cache.
SPDK NVMe-oF target — serves blobs to GPU nodes via RDMA fabric.

BSD 3-Clause

Network Data Path

DPDK Poll-Mode

ConnectX-7 driven by DPDK — no kernel TCP/IP. Replication and metadata traffic bypasses the OS network stack entirely. Huge pages pre-allocated at boot.

BSD 3-Clause

Control Plane

Kubernetes Operator

Custom CSI driver — provisions blob volumes as PersistentVolumes. StorageClass maps to EC policy + tier.
Scheduler extension — topology-aware pod placement (GPU node ↔ nearest stor-nodes).
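
Topology-aware placement comes down to scoring candidate GPU nodes by their distance to the stor-nodes that hold a pod's blobs. The sketch below shows only that scoring step, assuming the 2-tier fat-tree described in this document. A real scheduler extension would be built against the Kubernetes scheduler framework; `Node`, `hops`, and `rank_gpu_nodes` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    leaf: str  # leaf switch the node is cabled to


def hops(a: Node, b: Node) -> int:
    """Switch hops in a 2-tier fat-tree: 1 via a shared leaf, 3 via leaf-spine-leaf."""
    return 1 if a.leaf == b.leaf else 3


def rank_gpu_nodes(gpu_nodes, stor_nodes):
    """Sort candidate GPU nodes by total switch hops to the given stor-nodes.

    Ties are broken by node name so the ordering is deterministic.
    """
    return sorted(
        gpu_nodes,
        key=lambda g: (sum(hops(g, s) for s in stor_nodes), g.name),
    )


gpus = [Node("gpu-node-1", "leaf-1"), Node("gpu-node-2", "leaf-2")]
stors = [Node("stor-node-1", "leaf-2"), Node("stor-node-2", "leaf-2"),
         Node("stor-node-3", "leaf-1")]
best = rank_gpu_nodes(gpus, stors)[0]  # gpu-node-2 shares a leaf with two stor-nodes
```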

Custom · Apache 2.0

Observability

DCGM + Prometheus

DCGM exporter — GPU utilization, HBM bandwidth, NVLink saturation.
SPDK vhost metrics — NVMe queue depth, latency percentiles per blob.
Grafana with AI storage dashboards. Alert on GPU idle caused by storage stall.

NVIDIA Open

Kubernetes Control Plane

ctrl-node

×3 nodes — HA quorum

Platform: Supermicro AS-2125HS-TNR
CPU: 2× AMD EPYC 9654 (96c / 192t)
RAM: 512 GB DDR5-4800 ECC RDIMM
OS / etcd disk: 2× 3.84 TB NVMe PCIe 5.0 (RAID-1)
NIC (fabric): 2× ConnectX-7 HDR/100 GbE
NIC (mgmt): 1× 25 GbE OOB + IPMI
GPU: None

Runs: kube-apiserver, etcd (3-node Raft), CoreDNS, Cilium operator, Blob Operator controller, Prometheus, Grafana. Tainted — no user workloads.

GPU Compute Node

gpu-node

×8–64 nodes (scale unit: 8)

Platform: NVIDIA DGX H200 or HGX H200
CPU: 2× Intel Xeon Platinum 8592+ (64c)
RAM: 2 TB DDR5-5600 (8-channel / CPU)
GPU: 8× NVIDIA H200 SXM5, 141 GB HBM3e each
GPU interconnect: NVLink 4.0, 900 GB/s per GPU (bidirectional)
Local NVMe: 8× 3.84 TB PCIe 5.0 (scratch/spill only)
NIC (GDS fabric): 4× ConnectX-7 NDR InfiniBand 400Gb
DPU: BlueField-3 as GDS RDMA peer, RDMA write → HBM3e

Runs: vLLM, PyTorch DataLoader pods. GDS enabled via cuFile mounted into pod. All persistent tensors on stor-nodes over NDR IB. Local NVMe = scratch only — never persistent.

Storage Node (SPDK/DPDK)

stor-node

×6–48 nodes (scale unit: 6)

Platform: Supermicro SSG-211E-NE312R (2U, 32× E3.S)
CPU: 2× AMD EPYC 9654 (96c / 192t)
RAM: 1 TB DDR5 (SPDK huge-pages: 512 GB reserved)
NVMe: 32× Samsung PM9D3a 30.72 TB E3.S PCIe 5.0
Raw capacity / node: ~983 TB
NVMe BW / node: ~320 GB/s read, ~120 GB/s write
NIC (data fabric): 2× ConnectX-7 NDR IB 400Gb (RDMA)
NIC (replication): 2× ConnectX-7 200 GbE (DPDK)
CXL memory: 4× 128 GB CXL 2.0 DIMM (hot metadata tier)

Runs: SPDK NVMe-oF target daemon (user-space), DPDK replication agent, custom blob engine, ISA-L EC worker threads. Zero kernel involvement in data path. CXL DIMMs hold DHT routing table and hot blob metadata in memory-semantic tier.

Storage Gateway

gw-node

×2–4 nodes (active-active pair)

Platform: Supermicro AS-2125HS-TNR
CPU: 2× AMD EPYC 9554 (64c / 128t)
RAM: 512 GB DDR5-4800
NVMe (metadata cache): 8× 3.84 TB PCIe 5.0
NIC (uplink): 4× ConnectX-7 400 GbE (client-facing)
NIC (fabric): 2× ConnectX-7 NDR IB 400Gb

Runs: Blob API endpoint (native blob + POSIX compatibility shim), S3-compatible adapter for ingestion pipelines, NVMe-oF initiator for clients that lack GDS. ECMP load-balanced. Active-active pair — no single point of failure.

① Data Plane — InfiniBand NDR 400Gb (RDMA)

Switch: NVIDIA Quantum-2 QM9700, 64-port NDR 400Gb, 25.6 Tb/s aggregate unidirectional bandwidth

Topology: 2-tier fat-tree (leaf + spine). Full bisection bandwidth. No oversubscription on this plane.

Protocol: InfiniBand native RDMA — lower latency than RoCEv2 (no ECN/PFC tuning required, no pause frame propagation risk)

GPUDirect: Storage node RDMA Write → GPU node BlueField-3 → HBM3e. Fully peer-to-peer, CPU not in path.

Latency: <1 µs fabric RTT; <200 µs end-to-end NVMe→GPU (target)

② Replication Plane — 200 GbE Ethernet (DPDK)

Switch: Arista 7060CX3-64D — 64× 100GbE, dedicated to stor-node replication only

Purpose: ISA-L parity writes between storage nodes. EC stripe replication via DPDK user-space transport (no kernel TCP/IP).

Isolation: Physically separate switch, separate NIC ports on stor-nodes. Replication traffic cannot compete with GDS reads.

Durability: Synchronous 3-node commit before ACK to writer. Cross-AZ async replication via 10 GbE WAN uplink from each gateway pair.

Target: <500 µs synchronous replication round-trip within fault domain

③ Management Plane — 25 GbE OOB + BMC

Switch: Dedicated OOB Arista 7020 — management traffic only, never touches data plane VLANs

Kubernetes traffic: Cilium eBPF CNI, Gateway API ingress, WireGuard inter-node encryption. No kube-proxy.

BMC: Redfish API on all nodes. Full IPMI OOB — power cycle, console, firmware update without touching OS.

BGP: eBGP between leaf and spine for pod IP advertisement. No flat L2 domain — L3 from the rack top.

Security: mTLS between all components. Cilium network policy enforces zero-trust between node roles.

Path A — Inference Read: NVMe → GPU HBM (GPUDirect Storage, CPU DRAM = 0 bytes) Target: <200 µs

Samsung PM9D3a NVMe E3.S (stor-node)
→ PCIe 5.0 DMA burst → SPDK Blobstore (user-space)
→ DPDK zero-copy → ConnectX-7 NDR 400Gb (RDMA Write)
→ InfiniBand, <1 µs RTT → Quantum-2 QM9700 fabric
→ InfiniBand, <1 µs RTT → BlueField-3 DPU (gpu-node GDS peer)
→ cuFile p2p DMA → H200 HBM3e, 141 GB (GPU kernel ready)

CPU DRAM is never in this path. The RDMA Write from ConnectX-7 on the stor-node goes directly to the GPU's BAR-mapped HBM3e address space, brokered by BlueField-3 on the GPU node. This is the core latency claim: architectures that stage data through CPU DRAM are estimated to add 3–5× latency.

Path B — Checkpoint Write: GPU HBM → Durable NVMe (3-replica, synchronous) Target: <500 µs to durable

H200 HBM3e (gpu-node)
→ cuFile async write → BlueField-3 DPU (RDMA initiator)
→ NDR IB 400Gb → SPDK NVMe-oF Target (stor-node-1)
→ blob_put, DHT key → ISA-L EC encoder (4+2 stripe)
→ DPDK repl fabric → NVMe SSDs ×3 nodes (sync commit)
→ ACK to writer → ✓ Durable (3-node EC)

ISA-L EC runs in user space on the primary stor-node CPU. The quoted encode time is ~2 µs for a 4+2 stripe of a 1 MB blob using AVX-512 GF multiply (an implied encode rate of roughly 500 GB/s). Parity shards are dispatched via DPDK over the dedicated replication fabric. The ACK is returned to the writer only after all 3 nodes confirm NVMe write completion.
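
The "ACK only after all 3 nodes confirm" rule can be sketched as a parallel fan-out that succeeds only when every node reports completion. This is a minimal illustration under stated assumptions: the six shards of a 4+2 stripe are assumed to be placed two per node (the placement is not spelled out here), and `write_shard` is a hypothetical stand-in for the DPDK replication transport.

```python
from concurrent.futures import ThreadPoolExecutor


def commit_stripe(shards_by_node, write_shard, timeout_s=1.0):
    """Write every shard group in parallel; return True only if all nodes confirm.

    shards_by_node: dict mapping node name -> list of shard bytes.
    write_shard:    callable(node, shards) -> bool, a hypothetical stand-in
                    for the real replication transport.
    """
    if not shards_by_node:
        return False
    with ThreadPoolExecutor(max_workers=len(shards_by_node)) as pool:
        futures = [
            pool.submit(write_shard, node, shards)
            for node, shards in shards_by_node.items()
        ]
        try:
            # One missing confirmation is enough to withhold the ACK.
            return all(f.result(timeout=timeout_s) for f in futures)
        except Exception:
            # A timeout or transport error means the write is not durable.
            return False


placement = {
    "stor-node-1": [b"data-0", b"data-1"],
    "stor-node-2": [b"data-2", b"data-3"],
    "stor-node-3": [b"parity-0", b"parity-1"],
}
acked = commit_stripe(placement, lambda node, shards: True)
```

A production path would also need to roll back or repair partially written stripes when the function returns False; that is left out of the sketch.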

Path C — KV Cache Offload: vLLM PagedAttention → Content-Addressed Blob Store Cross-pod dedup via token hash

vLLM PagedAttn (KV block evict)
→ NIXL blob API → Blob Orch. (token hash → blob key)
→ DHT lookup, CXL-cached → DHT Router (key → stor-node)
→ cache hit? skip write → SPDK Blobstore (stor-node)
→ NVMe write → NVMe SSD (persisted)

Content-addressing is the KV cache superpower: if two inference pods computed identical prompt prefix KV blocks, the second pod finds the blob already present (same token hash → same key → cache hit). This eliminates redundant prefill compute and redundant NVMe writes cluster-wide — a capability no filesystem-based system can offer natively.
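
One way to derive such keys is to chain hashes across KV blocks, so that each key identifies the entire prompt prefix up to that block. The sketch below assumes this chained scheme, a block size of 16 tokens, and SHA-256; the design does not specify any of the three. `block_keys` and `BlobStore` are hypothetical names, and the store is an in-memory stand-in.

```python
import hashlib
import struct


def block_keys(token_ids, block_size=16):
    """Return one content address per full KV block of a prompt.

    Each key hashes the previous block's digest plus this block's token ids,
    so a key identifies the whole prefix up to and including the block.
    A trailing partial block gets no key.
    """
    prev = b""
    keys = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        payload = prev + struct.pack(f"<{len(block)}I", *block)
        prev = hashlib.sha256(payload).digest()
        keys.append(prev.hex())
    return keys


class BlobStore:
    """In-memory stand-in for the blob store (hypothetical interface)."""

    def __init__(self):
        self._blobs = {}
        self.writes = 0

    def put_if_absent(self, key, value):
        """Store value under key unless present; return True on a cache hit."""
        if key in self._blobs:
            return True
        self._blobs[key] = value
        self.writes += 1
        return False


store = BlobStore()
shared_prefix = list(range(32))
for prompt in (shared_prefix + [101] * 16, shared_prefix + [202] * 16):
    for key in block_keys(prompt):
        store.put_if_absent(key, b"kv-block-bytes")
# The two prompts share their first two blocks, so only four blobs are written.
```

Chaining matters because the KV values of a block depend on every token before it: two blocks with identical tokens but different prefixes must not collide on the same key.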

Path D — CXL Memory Tier: Hot Metadata & DHT Table (Sub-microsecond lookup) Target: <300 ns lookup

Blob Orchestrator (lookup request)
→ memory-semantic → CXL 2.0 DIMM ×4 (512 GB total, on stor-node)
→ load/store, no protocol → DHT routing table entry (→ NVMe addr)
→ direct dispatch → SPDK I/O submission queue pair

CXL 2.0 DIMMs on each storage node act as a memory-semantic metadata tier: faster than NVMe by 10–100×, cheaper than DRAM by 3×, and capacity-persistent across reboots. The DHT routing table and hot blob metadata index live here. This eliminates an NVMe lookup hop for every blob access, which is critical for the sub-200 µs end-to-end target.
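
A load/store metadata tier implies a table that can be read in place, with fixed-size records and no serialization step. The sketch below shows one such layout: an open-addressed table with linear probing over a flat byte buffer. It is illustrative only. The `bytearray` stands in for a CXL-mapped region, and the record format, `FlatTable` class, and BLAKE2b digest are assumptions rather than the design's actual format.

```python
import hashlib
import struct

# Record layout: 16-byte key digest, 2-byte stor-node id, 8-byte NVMe offset.
RECORD = struct.Struct("<16sHQ")
EMPTY = bytes(16)  # an all-zero digest marks a free slot


class FlatTable:
    """Fixed-size hash table stored in one contiguous buffer (hypothetical)."""

    def __init__(self, slots=1024):
        self.slots = slots
        self.buf = bytearray(RECORD.size * slots)  # stand-in for mapped memory

    @staticmethod
    def _digest(key: str) -> bytes:
        return hashlib.blake2b(key.encode(), digest_size=16).digest()

    def _probe(self, digest):
        """Yield slot indexes starting at the key's home slot, wrapping once."""
        start = int.from_bytes(digest[:8], "little") % self.slots
        for i in range(self.slots):
            yield (start + i) % self.slots

    def put(self, key, node_id, nvme_offset):
        digest = self._digest(key)
        for slot in self._probe(digest):
            stored, _, _ = RECORD.unpack_from(self.buf, slot * RECORD.size)
            if stored in (EMPTY, digest):
                RECORD.pack_into(self.buf, slot * RECORD.size,
                                 digest, node_id, nvme_offset)
                return
        raise MemoryError("table full")

    def get(self, key):
        digest = self._digest(key)
        for slot in self._probe(digest):
            stored, node_id, nvme_offset = RECORD.unpack_from(
                self.buf, slot * RECORD.size)
            if stored == digest:
                return node_id, nvme_offset
            if stored == EMPTY:
                return None  # reached a free slot: key was never inserted
        return None


table = FlatTable(slots=1024)
table.put("blob-0001", node_id=3, nvme_offset=4096)
entry = table.get("blob-0001")  # (3, 4096)
```

The sketch supports insert and overwrite but not deletion, which would need tombstones to keep probe chains intact.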

Edge Inference Lab

Enterprise private AI · 16–32 GPUs

ctrl-node: 3×
gpu-node (8× H200): 2–4×
stor-node (32× 30 TB NVMe): 6×
gw-node: 2×
Raw NVMe: ~5.9 PB
Usable (EC 4+2): ~3.9 PB
GPU HBM3e total: 16–32× 141 GB
Peak storage BW: ~1.9 TB/s
Storage : GPU ratio: 3:1

Mid-Scale AI Cloud

Neocloud inference service · 128–256 GPUs

ctrl-node: 3×
gpu-node (8× H200): 16–32×
stor-node (32× 30 TB NVMe): 24×
gw-node: 4×
Raw NVMe: ~23.6 PB
Usable (EC 4+2): ~15.7 PB
GPU HBM3e total: 128–256× 141 GB
Peak storage BW: ~7.7 TB/s
Storage : GPU ratio: 3:2

Hyperscale AI DC Pod

Training + inference · 512–2048 GPUs

ctrl-node: 3×
gpu-node (8× H200): 64–256×
stor-node (32× 30 TB NVMe): 48–96×
gw-node: 8×
Raw NVMe: 47–94 PB
Usable (EC 8+2): 37–75 PB
GPU HBM3e total: 512–2048× 141 GB
Peak storage BW: 15–31 TB/s
Storage : GPU ratio: 3:4 (BW-matched)
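
The capacity and bandwidth figures in the three deployment profiles follow from the stor-node spec (32× 30.72 TB drives, ~320 GB/s NVMe read, 2× 400Gb data-fabric NICs). The sketch below reproduces that arithmetic; `size_cluster` is a hypothetical helper, and the fabric figure is raw line rate with protocol overhead ignored.

```python
DRIVES_PER_NODE = 32
DRIVE_TB = 30.72
NODE_NVME_READ_GB_S = 320      # aggregate NVMe read bandwidth per stor-node
NODE_FABRIC_GB_S = 2 * 400 / 8  # 2x 400 Gb/s data-fabric NICs = 100 GB/s


def size_cluster(stor_nodes, data_shards, parity_shards):
    """Raw and usable capacity (PB) and read bandwidth (TB/s) for a cluster."""
    raw_pb = stor_nodes * DRIVES_PER_NODE * DRIVE_TB / 1000
    usable_pb = raw_pb * data_shards / (data_shards + parity_shards)
    return {
        "raw_pb": round(raw_pb, 1),
        "usable_pb": round(usable_pb, 1),
        "nvme_read_tb_s": round(stor_nodes * NODE_NVME_READ_GB_S / 1000, 1),
        "fabric_read_tb_s": round(stor_nodes * NODE_FABRIC_GB_S / 1000, 1),
    }


for name, nodes, k, m in (("edge", 6, 4, 2), ("mid-scale", 24, 4, 2),
                          ("hyperscale-min", 48, 8, 2)):
    print(name, size_cluster(nodes, k, m))
```

The "Peak storage BW" rows match the aggregate NVMe read figure. With the NIC counts listed for the stor-node, the data-fabric line rate per node is 100 GB/s, so bandwidth deliverable to GPU nodes would be bounded by the fabric figure rather than the NVMe figure.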