Ⅰ
Kernel-Free Data Path
Zero kernel involvement from NVMe to GPU HBM. SPDK + DPDK + cuFile. Every syscall is latency left on the table.
Ⅱ
Blob-Native Semantics
No POSIX. No inodes. No directory tree. Blobs are the primitive. Content-addressed by token hash. POSIX is a compatibility shim, never the hot path.
Ⅲ
GPUDirect Everywhere
NVMe → GPU HBM via RDMA DMA. CPU DRAM is never in the tensor data path. BlueField-3 DPU as RDMA peer on every GPU node.
Ⅳ
Commodity at Scale
Standard x86 + PCIe 5.0 NVMe + ConnectX-7 NICs. No proprietary ASICs. Any node replaceable in minutes. Hardware is fungible; software defines all performance.
Ⅴ
Kubernetes-Native Control
Every storage resource is a Kubernetes object. CSI, operators, and scheduler extensions — not bolted on but co-designed from day one.
Kubernetes Control Plane
ctrl-node
×3 nodes — HA quorum
PlatformSupermicro AS-2125HS-TNR
CPU2× AMD EPYC 9654 (96c / 192t)
RAM512 GB DDR5-4800 ECC RDIMM
OS / etcd disk2× 3.84 TB NVMe PCIe 5.0 (RAID-1)
NIC — fabric2× ConnectX-7 HDR/100 GbE
NIC — mgmt1× 25 GbE OOB + IPMI
GPUNone
Runs: kube-apiserver, etcd (3-node Raft), CoreDNS, Cilium operator, Blob Operator controller, Prometheus, Grafana. Tainted — no user workloads.
GPU Compute Node
gpu-node
×8–64 nodes (scale unit: 8)
PlatformNVIDIA DGX H200 or HGX H200
CPU2× Intel Xeon Platinum 8592+ (64c)
RAM2 TB DDR5-5600 (8-channel / CPU)
GPU8× NVIDIA H200 SXM5 — 141 GB HBM3e each
GPU interconnectNVLink 4.0 — 900 GB/s all-reduce
Local NVMe8× 3.84 TB PCIe 5.0 (scratch/spill only)
NIC — GDS fabric4× ConnectX-7 NDR InfiniBand 400Gb
DPUBlueField-3 — GDS RDMA peer, RDMA write → HBM3e
Runs: vLLM, PyTorch DataLoader pods. GDS enabled via cuFile mounted into pod. All persistent tensors on stor-nodes over NDR IB. Local NVMe = scratch only — never persistent.
Storage Node (SPDK/DPDK)
stor-node
×6–48 nodes (scale unit: 6)
PlatformSupermicro SSG-211E-NE312R (2U, 32× E3.S)
CPU2× AMD EPYC 9654 (96c / 192t)
RAM1 TB DDR5 (SPDK huge-pages: 512 GB reserved)
NVMe32× Samsung PM9D3a 30.72 TB E3.S PCIe 5.0
Raw capacity / node~983 TB
NVMe BW / node~320 GB/s read, ~120 GB/s write
NIC — data fabric2× ConnectX-7 NDR IB 400Gb (RDMA)
NIC — replication2× ConnectX-7 200 GbE (DPDK)
CXL memory4× 128 GB CXL 2.0 DIMM (hot metadata tier)
Runs: SPDK NVMe-oF target daemon (user-space), DPDK replication agent, custom blob engine, ISA-L EC worker threads. Zero kernel involvement in data path. CXL DIMMs hold DHT routing table and hot blob metadata in memory-semantic tier.
Storage Gateway
gw-node
×2–4 nodes (active-active pair)
PlatformSupermicro AS-2125HS-TNR
CPU2× AMD EPYC 9554P (64c / 128t)
RAM512 GB DDR5-4800
NVMe (metadata cache)8× 3.84 TB PCIe 5.0
NIC — uplink4× ConnectX-7 400 GbE (client-facing)
NIC — fabric2× ConnectX-7 NDR IB 400Gb
Runs: Blob API endpoint (native blob + POSIX compatibility shim), S3-compatible adapter for ingestion pipelines, NVMe-oF initiator for clients that lack GDS. ECMP load-balanced. Active-active pair — no single point of failure.
Path A — Inference Read: NVMe → GPU HBM (GPUDirect Storage, CPU DRAM = 0 bytes)
Target: <200 µs
Samsung PM9D3a
NVMe E3.S
stor-node
→
PCIe 5.0
DMA burst
SPDK
Blobstore
user-space
→
DPDK
zero-copy
ConnectX-7
NDR 400Gb
RDMA Write
→
InfiniBand
<1 µs RTT
Quantum-3
QM9700
fabric
→
InfiniBand
<1 µs RTT
BlueField-3
DPU
gpu-node GDS peer
→
cuFile
p2p DMA
H200
HBM3e
141 GB
GPU kernel ready
CPU DRAM is never in this path. The RDMA Write from ConnectX-7 on the stor-node goes directly to the GPU's BAR-mapped HBM3e address space, brokered by BlueField-3 on the GPU node. This is the core latency claim — every other architecture that touches CPU DRAM adds 3–5× latency.
Path B — Checkpoint Write: GPU HBM → Durable NVMe (3-replica, synchronous)
Target: <500 µs to durable
H200 HBM3e
gpu-node
→
cuFile
async write
BlueField-3
DPU
RDMA initiator
→
NDR IB
400Gb
SPDK
NVMe-oF
Target
stor-node-1
→
blob_put
DHT key
ISA-L EC
encoder
4+2 stripe
→
DPDK
repl fabric
NVMe SSDs
×3 nodes
sync commit
→
ACK to
writer
✓ Durable
3-node EC
ISA-L EC runs in user space on the primary stor-node CPU — AVX-512 GF multiply completes a 4+2 stripe of a 1 MB blob in ~2 µs. Parity shards dispatched via DPDK over the dedicated replication fabric. ACK returned to writer only after all 3 nodes confirm NVMe write completion.
Path C — KV Cache Offload: vLLM PagedAttention → Content-Addressed Blob Store
Cross-pod dedup via token hash
vLLM
PagedAttn
KV block evict
→
NIXL
blob API
Blob Orch.
token hash
→ blob key
→
DHT lookup
CXL-cached
DHT Router
key → stor-node
→
cache hit?
skip write
SPDK
Blobstore
stor-node
→
NVMe
write
NVMe SSD
persisted
Content-addressing is the KV cache superpower: if two inference pods computed identical prompt prefix KV blocks, the second pod finds the blob already present (same token hash → same key → cache hit). This eliminates redundant prefill compute and redundant NVMe writes cluster-wide — a capability no filesystem-based system can offer natively.
Path D — CXL Memory Tier: Hot Metadata & DHT Table (Sub-microsecond lookup)
Target: <300 ns lookup
Blob
Orchestrator
lookup request
→
memory-
semantic
CXL 2.0
DIMM ×4
512 GB total
on stor-node
→
load/store
no protocol
DHT routing
table entry
→ NVMe addr
→
direct
dispatch
SPDK I/O
submission
queue pair
CXL 2.0 DIMMs on each storage node act as a memory-semantic metadata tier — faster than NVMe by 10–100×, cheaper than DRAM by 3×, and capacity-persistent across reboots. The DHT routing table and hot blob metadata index live here. This eliminates a NVMe lookup hop for every blob access — critical for sub-200 µs end-to-end targets.
Edge Inference Lab
Enterprise private AI · 16–32 GPUs
ctrl-node3×
gpu-node (8× H200)2–4×
stor-node (32× 30 TB NVMe)6×
gw-node2×
Raw NVMe~5.9 PB
Usable (EC 4+2)~3.9 PB
GPU HBM3e total16–32× 141 GB
Peak storage BW~1.9 TB/s
Storage : GPU ratio3:1
Mid-Scale AI Cloud
Neocloud inference service · 128–256 GPUs
ctrl-node3×
gpu-node (8× H200)16–32×
stor-node (32× 30 TB NVMe)24×
gw-node4×
Raw NVMe~23.6 PB
Usable (EC 4+2)~15.7 PB
GPU HBM3e total128–256× 141 GB
Peak storage BW~7.7 TB/s
Storage : GPU ratio3:2
Hyperscale AI DC Pod
Training + inference · 512–2048 GPUs
ctrl-node3×
gpu-node (8× H200)64–256×
stor-node (32× 30 TB NVMe)48–96×
gw-node8×
Raw NVMe47–94 PB
Usable (EC 8+2)37–75 PB
GPU HBM3e total512–2048× 141 GB
Peak storage BW15–31 TB/s
Storage : GPU ratio3:4 (BW-matched)