AI Storage Reference Architecture

Constraint-free · Best-of-breed hardware · Purpose-built for AI inference & training · February 2026

Five Non-Negotiable Design Principles
Kernel-Free Data Path
Zero kernel involvement from NVMe to GPU HBM. SPDK + DPDK + cuFile. Every syscall is latency left on the table.
Blob-Native Semantics
No POSIX. No inodes. No directory tree. Blobs are the primitive. Content-addressed by token hash. POSIX is a compatibility shim, never the hot path.
GPUDirect Everywhere
NVMe → GPU HBM via RDMA DMA. CPU DRAM is never in the tensor data path. BlueField-3 DPU as RDMA peer on every GPU node.
Commodity at Scale
Standard x86 + PCIe 5.0 NVMe + ConnectX-7 NICs. No proprietary ASICs. Any node replaceable in minutes. Hardware is fungible; software defines all performance.
Kubernetes-Native Control
Every storage resource is a Kubernetes object. CSI, operators, and scheduler extensions — not bolted on but co-designed from day one.
Software Stack — Open Source Foundation, No Proprietary Storage Dependencies
AI Framework Layer
NIXL + cuFile
NIXL — NVIDIA's AI storage abstraction. Native vLLM, PyTorch, HuggingFace integration. Blob API surface exposed directly to inference engines.
cuFile / GDS — kernel-bypass path from NVMe-oF into GPU HBM via DMA. No CPU memcopy.
NVIDIA Open
KV Cache Manager
Mooncake / custom
Disaggregated KV cache — prefill tokens stored as content-addressed blobs, shared across inference pods. Eliminates redundant prefill compute for repeated prompt prefixes.
Apache 2.0
Blob Orchestration
Custom Blob Engine
DHT router — consistent-hash key → stor-node mapping. Rebalances on node add/remove.
Blob lifecycle — tiering policy (HBM → NVMe → cold object).
Stripe planner — EC stripe assignment, parity computation coordination.
Custom · Apache 2.0
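The DHT router's key → stor-node mapping can be illustrated with a small consistent-hash ring. This is a minimal sketch, not the blob engine's actual code: the `HashRing` class, the virtual-node count of 64, and the use of SHA-256 for ring positions are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative, hypothetical API)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of SHA-256 interpreted as an unsigned 64-bit position.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        # Each node contributes `vnodes` points so load spreads evenly.
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, blob_key: str) -> str:
        # Owner is the first ring point clockwise from the key, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(blob_key))
        return self._owner[self._points[idx % len(self._points)]]


ring = HashRing([f"stor-node-{i}" for i in range(1, 7)])
keys = [f"blob-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("stor-node-7")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding a node")
```

The property that matters for rebalancing is that adding a node only moves keys onto the new node; keys never shuffle between existing stor-nodes.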
Erasure Coding
ISA-L (Intel)
SIMD-accelerated erasure coding in user space. 4+2 or 8+2 configurable. AVX-512 GF multiply — orders of magnitude faster than kernel RAID-6. No kernel MD involvement.
BSD License
Storage Engine
SPDK Blobstore
SPDK NVMe driver — user-space PCIe 5.0 NVMe, poll-mode, zero kernel.
SPDK Blobstore — blob primitive directly on NVMe. No VFS, no page cache.
SPDK NVMe-oF target — serves blobs to GPU nodes via RDMA fabric.
Apache 2.0
Network Data Path
DPDK Poll-Mode
ConnectX-7 driven by DPDK — no kernel TCP/IP. Replication and metadata traffic bypasses the OS network stack entirely. Huge pages pre-allocated at boot.
Apache 2.0
Control Plane
Kubernetes Operator
Custom CSI driver — provisions blob volumes as PersistentVolumes. StorageClass maps to EC policy + tier.
Scheduler extension — topology-aware pod placement (GPU node ↔ nearest stor-nodes).
Custom · Apache 2.0
Observability
DCGM + Prometheus
DCGM exporter — GPU utilization, HBM bandwidth, NVLink saturation.
SPDK vhost metrics — NVMe queue depth, latency percentiles per blob.
Grafana with AI storage dashboards. Alert on GPU idle caused by storage stall.
NVIDIA Open
Node Profiles — Best Available Hardware (2026)
Kubernetes Control Plane
ctrl-node
×3 nodes — HA quorum
Platform: Supermicro AS-2125HS-TNR
CPU: 2× AMD EPYC 9654 (96c / 192t)
RAM: 512 GB DDR5-4800 ECC RDIMM
OS / etcd disk: 2× 3.84 TB NVMe PCIe 5.0 (RAID-1)
NIC (fabric): 2× ConnectX-7 HDR/100 GbE
NIC (mgmt): 1× 25 GbE OOB + IPMI
GPU: None
Runs: kube-apiserver, etcd (3-node Raft), CoreDNS, Cilium operator, Blob Operator controller, Prometheus, Grafana. Tainted — no user workloads.
GPU Compute Node
gpu-node
×8–64 nodes (scale unit: 8)
Platform: NVIDIA DGX H200 or HGX H200
CPU: 2× Intel Xeon Platinum 8592+ (64c)
RAM: 2 TB DDR5-5600 (8-channel / CPU)
GPU: 8× NVIDIA H200 SXM5, 141 GB HBM3e each
GPU interconnect: NVLink 4.0, 900 GB/s per GPU (bidirectional)
Local NVMe: 8× 3.84 TB PCIe 5.0 (scratch/spill only)
NIC (GDS fabric): 4× ConnectX-7 NDR InfiniBand 400Gb
DPU: BlueField-3 as GDS RDMA peer, RDMA write → HBM3e
Runs: vLLM, PyTorch DataLoader pods. GDS enabled via cuFile mounted into pod. All persistent tensors on stor-nodes over NDR IB. Local NVMe = scratch only — never persistent.
Storage Node (SPDK/DPDK)
stor-node
×6–48 nodes (scale unit: 6)
Platform: Supermicro SSG-211E-NE312R (2U, 32× E3.S)
CPU: 2× AMD EPYC 9654 (96c / 192t)
RAM: 1 TB DDR5 (SPDK huge-pages: 512 GB reserved)
NVMe: 32× Samsung PM9D3a 30.72 TB E3.S PCIe 5.0
Raw capacity / node: ~983 TB
NVMe BW / node: ~320 GB/s read, ~120 GB/s write
NIC (data fabric): 2× ConnectX-7 NDR IB 400Gb (RDMA)
NIC (replication): 2× ConnectX-7 200 GbE (DPDK)
CXL memory: 4× 128 GB CXL 2.0 DIMM (hot metadata tier)
Runs: SPDK NVMe-oF target daemon (user-space), DPDK replication agent, custom blob engine, ISA-L EC worker threads. Zero kernel involvement in data path. CXL DIMMs hold DHT routing table and hot blob metadata in memory-semantic tier.
Storage Gateway
gw-node
×2–4 nodes (active-active pair)
Platform: Supermicro AS-2125HS-TNR
CPU: 2× AMD EPYC 9554P (64c / 128t)
RAM: 512 GB DDR5-4800
NVMe (metadata cache): 8× 3.84 TB PCIe 5.0
NIC (uplink): 4× ConnectX-7 400 GbE (client-facing)
NIC (fabric): 2× ConnectX-7 NDR IB 400Gb
Runs: Blob API endpoint (native blob + POSIX compatibility shim), S3-compatible adapter for ingestion pipelines, NVMe-oF initiator for clients that lack GDS. ECMP load-balanced. Active-active pair — no single point of failure.
Network Fabric — Three Isolated Planes
① Data Plane — InfiniBand NDR 400Gb (RDMA)
Switch: NVIDIA Quantum-2 QM9700, 64-port NDR 400Gb, 25.6 Tb/s unidirectional aggregate throughput
Topology: 2-tier fat-tree (leaf + spine). Full bisection bandwidth. No oversubscription on this plane.
Protocol: InfiniBand native RDMA — lower latency than RoCEv2 (no ECN/PFC tuning required, no pause frame propagation risk)
GPUDirect: Storage node RDMA Write → GPU node BlueField-3 → HBM3e. Fully peer-to-peer, CPU not in path.
Latency: <1 µs fabric RTT; <200 µs end-to-end NVMe→GPU (target)
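As a rough check on the 2-tier fat-tree claim, the sketch below computes how many endpoints a non-blocking leaf/spine fabric supports for a given switch radix, then compares that with the NDR port demand of the largest sizing profile. The function name is hypothetical, and the port counts (4 NDR ports per gpu-node, 2 per stor-node) are taken from the node profiles; gw-node and DPU ports are left out as a simplifying assumption.

```python
def two_tier_fat_tree(radix: int) -> dict:
    """Non-blocking leaf/spine sizing for switches with `radix` ports."""
    down = radix // 2      # leaf ports facing hosts
    up = radix - down      # leaf ports facing spines (1:1, no oversubscription)
    spines = up            # one uplink from every leaf to every spine
    leaves = radix         # each spine port terminates one leaf
    return {
        "leaf_switches": leaves,
        "spine_switches": spines,
        "max_endpoints": leaves * down,
    }


fabric = two_tier_fat_tree(64)

# Port demand at the top of the hyperscale profile (assumption: gw-node ports ignored).
gpu_ports = 256 * 4    # 256 gpu-nodes, 4 NDR ports each
stor_ports = 96 * 2    # 96 stor-nodes, 2 NDR ports each
demand = gpu_ports + stor_ports

print(fabric)
print(f"ports needed: {demand}, fits: {demand <= fabric['max_endpoints']}")
```

With 64-port switches this gives 64 leaves, 32 spines, and 2048 host-facing ports, so the 1216 ports counted above fit in two tiers without oversubscription.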
② Replication Plane — 200 GbE Ethernet (DPDK)
Switch: Arista 7060CX3-64D — 64× 100GbE, dedicated to stor-node replication only
Purpose: ISA-L parity writes between storage nodes. EC stripe replication via DPDK user-space transport (no kernel TCP/IP).
Isolation: Physically separate switch, separate NIC ports on stor-nodes. Replication traffic cannot compete with GDS reads.
Durability: Synchronous 3-node commit before ACK to writer. Cross-AZ async replication via 10 GbE WAN uplink from each gateway pair.
Target: <500 µs synchronous replication round-trip within fault domain
③ Management Plane — 25 GbE OOB + BMC
Switch: Dedicated OOB Arista 7020 — management traffic only, never touches data plane VLANs
Kubernetes traffic: Cilium eBPF CNI, Gateway API ingress, WireGuard inter-node encryption. No kube-proxy.
BMC: Redfish API on all nodes. Full IPMI OOB — power cycle, console, firmware update without touching OS.
BGP: eBGP between leaf and spine for pod IP advertisement. No flat L2 domain — L3 from the rack top.
Security: mTLS between all components. Cilium network policy enforces zero-trust between node roles.
Critical Data Paths
Path A — Inference Read: NVMe → GPU HBM (GPUDirect Storage, CPU DRAM = 0 bytes) Target: <200 µs
Samsung PM9D3a NVMe E3.S (stor-node) → PCIe 5.0 DMA burst → SPDK Blobstore (user-space) → DPDK zero-copy → ConnectX-7 NDR 400Gb (RDMA Write) → InfiniBand, <1 µs RTT → Quantum-2 QM9700 fabric → InfiniBand, <1 µs RTT → BlueField-3 DPU (gpu-node GDS peer) → cuFile p2p DMA → H200 HBM3e 141 GB (GPU kernel ready)
CPU DRAM is never in this path. The RDMA Write from ConnectX-7 on the stor-node lands directly in the GPU's BAR-mapped HBM3e address space, brokered by BlueField-3 on the gpu-node. This is the core latency claim: architectures that stage tensors through CPU DRAM are expected to see roughly 3–5× higher end-to-end latency.
Path B — Checkpoint Write: GPU HBM → Durable NVMe (3-replica, synchronous) Target: <500 µs to durable
H200 HBM3e (gpu-node) → cuFile async write → BlueField-3 DPU (RDMA initiator) → NDR IB 400Gb → SPDK NVMe-oF Target (stor-node-1) → blob_put, DHT key → ISA-L EC encoder (4+2 stripe) → DPDK repl fabric → NVMe SSDs ×3 nodes (sync commit) → ACK to writer → ✓ Durable (3-node EC)
ISA-L EC runs in user space on the primary stor-node CPU. AVX-512 GF multiply encodes at multi-GB/s rates per core, so a 4+2 stripe of a 1 MB blob should take on the order of tens of microseconds per core, a small share of the 500 µs budget. Parity shards are dispatched via DPDK over the dedicated replication fabric. The ACK is returned to the writer only after all 3 nodes confirm NVMe write completion.
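The commit rule for Path B (no ACK until every node confirms) can be sketched with `asyncio`. Everything here is a stand-in: `write_shards` is a hypothetical placeholder for the DPDK transfer plus NVMe write, the latency is simulated, and the round-robin placement of 6 shards over 3 nodes (2 per node) is an assumption about how a 4+2 stripe maps onto the 3-node commit.

```python
import asyncio
import random


async def write_shards(node: str, shards: list) -> str:
    """Hypothetical stand-in for shard transfer + NVMe write on one stor-node."""
    await asyncio.sleep(random.uniform(0.0001, 0.0004))  # simulated latency
    return node


async def durable_put(shards: list, nodes: list) -> list:
    # Assumed placement: round-robin the k+m shards across the nodes.
    placement = {n: shards[i::len(nodes)] for i, n in enumerate(nodes)}
    # gather() resolves only once every node has confirmed; if any write
    # raises, the exception propagates and the writer never receives an ACK.
    acks = await asyncio.gather(
        *(write_shards(n, s) for n, s in placement.items())
    )
    return list(acks)


stripe = [bytes([i]) * 4 for i in range(6)]  # 4 data + 2 parity shards (dummy bytes)
nodes = ["stor-node-1", "stor-node-2", "stor-node-3"]
acks = asyncio.run(durable_put(stripe, nodes))
print("ACK to writer after:", acks)
```

Under this placement a single node failure costs two shards, which a 4+2 code can still reconstruct.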
Path C — KV Cache Offload: vLLM PagedAttention → Content-Addressed Blob Store Cross-pod dedup via token hash
vLLM PagedAttn (KV block evict) → NIXL blob API → Blob Orch. (token hash to blob key) → DHT lookup, CXL-cached → DHT Router (key to stor-node) → cache hit? skip write → SPDK Blobstore (stor-node) → NVMe write → NVMe SSD (persisted)
Content-addressing is the KV cache superpower: if two inference pods computed identical prompt prefix KV blocks, the second pod finds the blob already present (same token hash → same key → cache hit). This eliminates redundant prefill compute and redundant NVMe writes cluster-wide — a capability no filesystem-based system can offer natively.
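A minimal sketch of the dedup logic described above, assuming the blob key is a SHA-256 digest over a model identifier plus the full token prefix up to the end of the block. `kv_block_key`, `BlobStore`, and `put_if_absent` are hypothetical names for illustration; the source does not specify the hash function or the key layout.

```python
import hashlib
import struct


def kv_block_key(token_ids: list, model_id: str) -> str:
    """Derive a content address from the model and the token prefix (assumed scheme)."""
    h = hashlib.sha256(model_id.encode() + b"\0")
    # Pack token ids as little-endian uint32 so the digest is layout-stable.
    h.update(struct.pack(f"<{len(token_ids)}I", *token_ids))
    return h.hexdigest()


class BlobStore:
    """In-memory stand-in for the content-addressed blob store."""

    def __init__(self):
        self._blobs = {}
        self.writes = 0

    def put_if_absent(self, key: str, payload: bytes) -> bool:
        # Cache hit: the blob already exists, so the write is skipped.
        if key in self._blobs:
            return False
        self._blobs[key] = payload
        self.writes += 1
        return True


store = BlobStore()
prefix = [101, 2023, 2003, 1037]  # example token ids shared by two pods

key_a = kv_block_key(prefix, "model-x")  # pod A evicts a KV block
key_b = kv_block_key(prefix, "model-x")  # pod B computed the same prefix
first = store.put_if_absent(key_a, b"kv-block-bytes")
second = store.put_if_absent(key_b, b"kv-block-bytes")
print(first, second, store.writes)
```

Including the model identifier in the key is a design choice made here so that two models with the same tokenizer never share KV blocks.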
Path D — CXL Memory Tier: Hot Metadata & DHT Table (Sub-microsecond lookup) Target: <300 ns lookup
Blob Orchestrator (lookup request) → memory-semantic → CXL 2.0 DIMM ×4 (512 GB total, on stor-node) → load/store, no protocol → DHT routing table entry (to NVMe addr) → direct dispatch → SPDK I/O submission queue pair
CXL 2.0 DIMMs on each storage node act as a memory-semantic metadata tier: faster than NVMe by 10–100×, cheaper than DRAM by roughly 3×, and capacity-persistent across reboots. The DHT routing table and hot blob metadata index live here. This eliminates an NVMe lookup hop for every blob access, which is critical for sub-200 µs end-to-end targets.
Cluster Sizing Profiles
Edge Inference Lab
Enterprise private AI · 16–32 GPUs
ctrl-node
gpu-node (8× H200): 2–4×
stor-node (32× 30 TB NVMe): 6×
gw-node
Raw NVMe: ~5.9 PB
Usable (EC 4+2): ~3.9 PB
GPU HBM3e total: 16–32× 141 GB
Peak storage BW: ~1.9 TB/s
Storage : GPU ratio: 3:1
Mid-Scale AI Cloud
Neocloud inference service · 128–256 GPUs
ctrl-node
gpu-node (8× H200): 16–32×
stor-node (32× 30 TB NVMe): 24×
gw-node
Raw NVMe: ~23.6 PB
Usable (EC 4+2): ~15.7 PB
GPU HBM3e total: 128–256× 141 GB
Peak storage BW: ~7.7 TB/s
Storage : GPU ratio: 3:2
Hyperscale AI DC Pod
Training + inference · 512–2048 GPUs
ctrl-node
gpu-node (8× H200): 64–256×
stor-node (32× 30 TB NVMe): 48–96×
gw-node
Raw NVMe: 47–94 PB
Usable (EC 8+2): 37–75 PB
GPU HBM3e total: 512–2048× 141 GB
Peak storage BW: 15–31 TB/s
Storage : GPU ratio: 3:4 (BW-matched)
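The sizing figures above follow from the stor-node profile (32 drives of 30.72 TB, ~320 GB/s read per node) and the erasure-coding efficiency k / (k + m). The sketch below reproduces that arithmetic; `size_cluster` is a hypothetical helper, and the stor-node count of 6 for the Edge Inference Lab is inferred from its ~5.9 PB raw capacity. Peak bandwidth here is the sum of per-node NVMe read bandwidth and ignores any network limit.

```python
DRIVES_PER_NODE = 32
DRIVE_TB = 30.72          # Samsung PM9D3a capacity per drive
NODE_READ_GB_S = 320      # approximate NVMe read bandwidth per stor-node


def size_cluster(stor_nodes: int, k: int, m: int) -> tuple:
    """Return (raw PB, usable PB, peak read TB/s), each rounded to 0.1."""
    raw_pb = stor_nodes * DRIVES_PER_NODE * DRIVE_TB / 1000
    usable_pb = raw_pb * k / (k + m)          # EC k+m keeps k of every k+m shards
    peak_tb_s = stor_nodes * NODE_READ_GB_S / 1000
    return round(raw_pb, 1), round(usable_pb, 1), round(peak_tb_s, 1)


profiles = {
    "Edge Inference Lab (6 nodes, EC 4+2)": size_cluster(6, 4, 2),
    "Mid-Scale AI Cloud (24 nodes, EC 4+2)": size_cluster(24, 4, 2),
    "Hyperscale low end (48 nodes, EC 8+2)": size_cluster(48, 8, 2),
    "Hyperscale high end (96 nodes, EC 8+2)": size_cluster(96, 8, 2),
}
for name, (raw, usable, bw) in profiles.items():
    print(f"{name}: raw {raw} PB, usable {usable} PB, peak {bw} TB/s")
```

The results match the profile tables to within rounding, for example 5.9 PB raw and 3.9 PB usable for six nodes at EC 4+2.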