Traditional Storage Is Not AI Storage

They share the word "storage." They share almost nothing else.

Dimension	Traditional Enterprise Storage	AI-Native Storage
I/O Pattern	Random reads and writes. Applications depend on in-place modification, POSIX locking, and byte-range updates.	Write-once / read-many. Model weights, checkpoints, and training samples are immutable blobs. No random writes needed.
Protocol	NFS, iSCSI, SMB — designed for spinning disk over TCP. Kernel-mediated I/O with multiple context switches per operation.	NVMe-oF over RDMA, GPUDirect Storage, CXL. Kernel-bypassed via DPDK/SPDK. Zero-copy from NVMe directly into GPU HBM.
Latency Target	Milliseconds acceptable. HDD-era designs persist even on all-flash arrays due to legacy software stack overhead.	Microseconds required. A GPU costs $30k+, so every microsecond of idle GPU time is measurable dollar waste. Sub-250 µs is baseline.
Bottleneck	IOPS and capacity. Scale-up storage controllers dominate. Single-threaded I/O stacks inherited from HDD era.	Bandwidth and parallelism. Thousands of GPU cores demand simultaneous access. Scale-out across commodity nodes is mandatory.
New Primitives	No concept of KV cache, embedding vectors, or tensor blobs. Filesystems expose bytes and blocks — not AI-meaningful data units.	KV cache offload/sharing, vector embedding storage, content-addressable tensor blobs, and GPU-DMA-aligned chunking are first-class.
Hardware Model	Proprietary controllers and vendor-specific stacks (Pure DirectFlash, NetApp ONTAP). Vendor lock-in is a feature: margins depend on it.	Commodity NVMe + RDMA NICs on standard x86. Software defines all performance. Hardware is fungible and replaceable.
Moat	Hardware, support contracts, installed base inertia. Winner-take-most dynamics already resolved (NetApp, Pure, Dell EMC).	Software architecture, AI-primitive integration (NIXL, CUDA, PyTorch), protocol leadership. Category not yet decided.

Architecture: Blob-Native, Kernel-Free, GPU-Direct

Built from first principles — not adapted from general-purpose storage.

AI Application Layer

vLLM · PyTorch DataLoader · HuggingFace · RAG Pipelines

↓ Blob API (blob_get / blob_put / blob_stripe) ↓

Blob Orchestration

DHT key routing · Stripe logic · Erasure coding (ISA-L) · KV cache lifecycle

↓ SPDK Blobstore + DPDK transport primitives ↓

User-Space Data Path

SPDK NVMe driver · DPDK poll-mode NIC · GPUDirect Storage (cuFile) · CXL memory tier

↓ VFIO hardware mapping (single kernel crossing) ↓

Commodity Hardware

Standard x86 servers · NVMe SSDs · ConnectX RDMA NICs · NVIDIA GPUs

↓ Kubernetes Operator ↓

Control Plane

Kubernetes Operator · CSI plugin · Observability · Multi-tenant QoS
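
The Blob API named in the stack above (blob_get / blob_put / blob_stripe) is not specified on this page. Purely as an illustration of write-once blob semantics, an in-memory stand-in might look like the sketch below. The class name, the signatures, and the stripe size are assumptions, not NabuStore's actual interface.

```python
from typing import Dict, List


class InMemoryBlobStore:
    """Toy write-once blob store; hypothetical signatures, not the real API."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def blob_put(self, key: str, data: bytes) -> None:
        # Write-once: a key can never be overwritten or modified in place.
        if key in self._blobs:
            raise KeyError(f"blob {key!r} already exists (write-once)")
        self._blobs[key] = bytes(data)

    def blob_get(self, key: str) -> bytes:
        # Whole-blob read; there are no seek or byte-range update semantics.
        return self._blobs[key]

    def blob_stripe(self, key: str, stripe_size: int) -> List[bytes]:
        # Split a stored blob into fixed-size stripes (the last may be shorter).
        data = self._blobs[key]
        return [data[i:i + stripe_size] for i in range(0, len(data), stripe_size)]


store = InMemoryBlobStore()
store.blob_put("weights/layer0", b"abcdefghij")
stripes = store.blob_stripe("weights/layer0", stripe_size=4)
```

Rejecting a second put for the same key is what removes the need for locking or in-place update paths in this toy model.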

Why Blob-Native Beats Filesystem for AI

AI I/O is write-once / read-many — POSIX random access semantics are unnecessary overhead
Blob boundaries align naturally with GPU DMA — one blob = one tensor = one DMA transfer
No inode table, no directory tree, no POSIX locking — entire filesystem metadata complexity eliminated
Content-addressable keys (token hash → KV cache blob) enable cluster-wide cache deduplication natively
SPDK Blobstore already ships this primitive — we compose, not invent
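
As a minimal sketch of the content-addressable idea above (token hash → KV cache blob), the following derives a key from a token-ID prefix and shows that identical prefixes collapse to one stored blob. The hashing scheme (SHA-256 over packed 32-bit token IDs) and the names kv_cache_key and DedupStore are assumptions made for illustration; the page does not say how NabuStore computes its keys.

```python
import hashlib
import struct
from typing import Dict, Sequence


def kv_cache_key(token_ids: Sequence[int]) -> str:
    """Derive a content address from a token-ID prefix (assumed scheme)."""
    # Pack each token ID as an unsigned 32-bit little-endian integer so the
    # byte encoding is unambiguous, then hash the result.
    payload = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return hashlib.sha256(payload).hexdigest()


class DedupStore:
    """Keeps one copy per content key, however many callers put it."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}
        self.puts = 0

    def put(self, token_ids: Sequence[int], kv_blob: bytes) -> str:
        key = kv_cache_key(token_ids)
        self.puts += 1
        # setdefault keeps the first copy; later identical puts are no-ops.
        self.blobs.setdefault(key, kv_blob)
        return key


store = DedupStore()
prompt = [101, 2023, 2003, 1037, 3231]
k1 = store.put(prompt, b"kv-bytes")
k2 = store.put(list(prompt), b"kv-bytes")    # same prefix from another client
k3 = store.put(prompt + [102], b"other-kv")  # longer prefix, different key
```

Because the key depends only on content, two nodes that compute the same prefix independently arrive at the same blob without coordinating.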

Architecture & Techniques

Deep dives into each component of the storage stack, from GPU memory to NVMe, with interactive diagrams and protocol walkthroughs.

📐

Reference Architecture

The complete blueprint: five design principles, full software stack (blob orchestration through SPDK/DPDK to hardware), four node types (storage, GPU, metadata, management), three network fabrics, critical data paths with latency budgets, and cluster sizing for 1k-GPU deployments.

Full Stack · Node Types · Network Fabric · Sizing

View reference architecture →

🗄️

Storage Node Architecture

Complete hardware and software stack of a NabuStore node: vLLM integration, CXL-backed Robin Hood index, content-addressed blob store, DHT ring with 150 vnodes, and the three-plane network topology (GPU IB, replication Ethernet, inference).

CXL DIMM · SPDK · DHT Ring · vLLM

View architecture diagram →
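
To make the "DHT ring with 150 vnodes" concrete, here is a small consistent-hashing sketch in which each physical node owns 150 points on the ring and a lookup walks clockwise until it has collected the requested number of distinct nodes. The hash function, the vnode naming ("node#i"), and the HashRing class are assumptions for illustration, not the project's implementation.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash64(value: str) -> int:
    # First 8 bytes of SHA-256 as a position on a 2^64 ring.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: List[str], vnodes: int = 150) -> None:
        # Each physical node owns `vnodes` points on the ring.
        self._points: List[Tuple[int, str]] = sorted(
            (_hash64(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]
        self._node_count = len(set(nodes))

    def lookup(self, key: str, replicas: int = 1) -> List[str]:
        """Walk clockwise from the key and collect `replicas` distinct nodes."""
        if not 1 <= replicas <= self._node_count:
            raise ValueError("replicas must be between 1 and the node count")
        start = bisect.bisect_right(self._keys, _hash64(key))
        owners: List[str] = []
        for offset in range(len(self._points)):
            node = self._points[(start + offset) % len(self._points)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == replicas:
                    break
        return owners


ring = HashRing([f"node-{i}" for i in range(6)])
owners = ring.lookup("blob:abc", replicas=6)  # e.g. one node per EC shard
```

Many virtual points per node smooth out the key distribution, and skipping repeated owners during the walk is one simple way to guarantee that shards of the same stripe land on distinct nodes.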

⚡

Blob Transfer via NVMe-oF & RDMA

Zero-copy GPU blob retrieval: a NIXL request triggers a DHT lookup and NVMe-oF target registration, GPU HBM3e is registered as an RDMA memory region, and the data then moves by direct RDMA Write from the NVMe DMA buffer across the InfiniBand NDR 400 Gb/s fabric into GPU memory. No CPU copies.

NVMe-oF · RDMA · BlueField-3 · GPUDirect

View data path →

🔄

DPDK Blob Replication Pipeline

Userspace replication using DPDK poll-mode drivers: EC shards encoded with ISA-L, wrapped in huge-page DMA buffers, dispatched via rte_eth_tx_burst over ConnectX-7 200 GbE. Dedicated replication VLAN on Arista 7060CX3 fabric, zero-copy receive path.

DPDK · ISA-L · Zero-Copy · 200 GbE

View replication pipeline →

🛡️

EC 4+2 Blob Write Pipeline

Step-by-step erasure-coded write: 4 MB blob split into 4 data shards, ISA-L AVX-512 GF(2^8) Cauchy encoding produces 2 parity shards, DHT ring maps each shard to a distinct node, SPDK blobstore writes via PCIe 5.0 DMA, 6-way ACK confirms stripe durability.

Reed-Solomon · AVX-512 · PCIe 5.0 · 6-Node Stripe

View write pipeline →
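
The 4+2 encoding step can be sketched in pure Python to show the arithmetic involved. This is a slow stand-in for what ISA-L does with AVX-512, not a use of ISA-L itself. The field polynomial (0x11d), the Cauchy matrix layout 1/(x XOR y), and the function names are assumptions for illustration. Recovery is shown only for a single lost data shard, since the general two-erasure case needs a matrix inversion that is omitted here.

```python
from typing import List

# GF(2^8) log/antilog tables for the polynomial 0x11d (assumed).
EXP = [0] * 512
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a: int, b: int) -> int:
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_inv(a: int) -> int:
    return EXP[255 - LOG[a]]


K, M = 4, 2  # data shards, parity shards

# Cauchy rows: C[i][j] = 1 / (x_i XOR y_j) with x_i = K + i and y_j = j.
CAUCHY = [[gf_inv((K + i) ^ j) for j in range(K)] for i in range(M)]


def encode(blob: bytes) -> List[bytes]:
    """Split a blob into K data shards and append M parity shards."""
    shard_len = -(-len(blob) // K)  # ceiling division
    padded = blob.ljust(shard_len * K, b"\0")
    data = [padded[j * shard_len:(j + 1) * shard_len] for j in range(K)]
    parity = []
    for i in range(M):
        row = bytearray(shard_len)
        for j in range(K):
            coeff = CAUCHY[i][j]
            for pos, byte in enumerate(data[j]):
                row[pos] ^= gf_mul(coeff, byte)
        parity.append(bytes(row))
    return data + parity


def recover_one(shards: List[bytes], lost: int) -> bytes:
    """Rebuild one lost data shard from parity 0 and the surviving data."""
    acc = bytearray(shards[K])  # parity shard 0
    for j in range(K):
        if j != lost:
            for pos, byte in enumerate(shards[j]):
                acc[pos] ^= gf_mul(CAUCHY[0][j], byte)
    inv = gf_inv(CAUCHY[0][lost])
    return bytes(gf_mul(inv, b) for b in acc)


blob = bytes(range(256)) * 4  # small stand-in for a 4 MB blob
shards = encode(blob)
rebuilt = recover_one(shards, lost=2)
```

Each of the six resulting shards would then be routed to a different node and acknowledged individually, as the description above outlines.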

🧩

Plugin Architecture

Three-tier extensibility model: Tier 1 static registry (init()-based, zero overhead), Tier 2 gRPC out-of-process (vendor binaries over UDS), Tier 3 hardware auto-detection. Server vendors supply drivers without modifying core code. Full API reference and vendor plugin guide.

Tier 1: In-Process · Tier 2: gRPC · Tier 3: Auto-Detect

View plugin guide →

Storage Tier Hierarchy

Three-tier caching from DRAM through CXL memory to NVMe, with automatic promotion and demotion based on access patterns.

Tier	Medium	Latency
L1 (Hot)	DRAM	~100 ns
L2 (Warm)	CXL Memory	~200–300 ns
L3 (Cold)	NVMe SSD	~100 µs
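
One way to picture promotion and demotion across the three tiers is an LRU list per tier: a hit moves a blob one tier closer to DRAM, and an overflowing tier pushes its least-recently-used blob one tier down. The policy, the capacities, and the TieredCache class below are assumptions for illustration; the page does not describe the actual promotion heuristics.

```python
from collections import OrderedDict
from typing import List, Optional

TIERS = ["DRAM", "CXL", "NVMe"]


class TieredCache:
    def __init__(self, capacities: List[int]) -> None:
        # capacities[i] bounds tier i; the last tier (NVMe) is unbounded here.
        self.capacities = capacities
        self.tiers = [OrderedDict() for _ in TIERS]

    def _insert(self, level: int, key: str, value: bytes) -> None:
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        # Demote least-recently-used entries when a bounded tier overflows.
        while level < len(TIERS) - 1 and len(tier) > self.capacities[level]:
            old_key, old_value = tier.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def put(self, key: str, value: bytes) -> None:
        # New blobs land in the cold tier and earn promotion by being read.
        self._insert(len(TIERS) - 1, key, value)

    def get(self, key: str) -> Optional[bytes]:
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                # Promote one tier toward DRAM on every hit.
                self._insert(max(level - 1, 0), key, value)
                return value
        return None

    def tier_of(self, key: str) -> Optional[str]:
        for name, tier in zip(TIERS, self.tiers):
            if key in tier:
                return name
        return None


cache = TieredCache(capacities=[1, 1])  # one slot each in DRAM and CXL
cache.put("a", b"A")
cache.put("b", b"B")
cache.get("a")
cache.get("a")  # "a" has now climbed NVMe -> CXL -> DRAM
```

Reading "b" twice after this would move it into DRAM and push "a" back down to CXL, which is the demotion half of the policy.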

Index: Robin Hood Map (CXL-backed mmap)
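
For readers unfamiliar with Robin Hood hashing, the sketch below shows the core idea: on a collision, the entry that has probed farther from its home slot takes the slot, which keeps probe lengths short and lets lookups stop early. A plain Python list stands in for the CXL-backed mmap region, the hash (CRC32) is an arbitrary choice, and resizing and deletion are left out. None of this is taken from the NabuStore code.

```python
import zlib
from typing import List, Optional, Tuple

Slot = Optional[Tuple[str, int, int]]  # (key, value, probe distance)


class RobinHoodMap:
    def __init__(self, capacity: int = 16) -> None:
        self.slots: List[Slot] = [None] * capacity
        self.size = 0

    def _home(self, key: str) -> int:
        return zlib.crc32(key.encode()) % len(self.slots)

    def put(self, key: str, value: int) -> None:
        # Keep at least one empty slot so every probe sequence terminates.
        if self.size >= len(self.slots) - 1:
            raise RuntimeError("table full; resizing is omitted in this sketch")
        idx = self._home(key)
        cur_key, cur_val, dist = key, value, 0
        while True:
            slot = self.slots[idx]
            if slot is None:
                self.slots[idx] = (cur_key, cur_val, dist)
                self.size += 1
                return
            if slot[0] == cur_key:
                self.slots[idx] = (cur_key, cur_val, slot[2])  # update value
                return
            if slot[2] < dist:
                # Resident is closer to its home than we are: take its slot
                # and carry the displaced entry forward instead.
                self.slots[idx] = (cur_key, cur_val, dist)
                cur_key, cur_val, dist = slot
            idx = (idx + 1) % len(self.slots)
            dist += 1

    def get(self, key: str) -> Optional[int]:
        idx, dist = self._home(key), 0
        while True:
            slot = self.slots[idx]
            # An empty slot, or a resident closer to home than we would be,
            # proves the key is absent.
            if slot is None or slot[2] < dist:
                return None
            if slot[0] == key:
                return slot[1]
            idx = (idx + 1) % len(self.slots)
            dist += 1


index = RobinHoodMap(capacity=16)
for i in range(10):
    index.put(f"blob-{i}", i)
```

The flat, pointer-free slot array is what makes this layout a plausible fit for a memory-mapped region, since no addresses inside the table need fixing up.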

Data Plane: RDMA / DPDK (<2 µs wire)

Frequently Asked Questions

Common questions about NabuStore.

NabuStore is in active early-stage development. The core blob storage engine, DHT ring routing, and NVMe-oF/RDMA data paths are functional, but the project is not yet recommended for production workloads. We welcome early adopters who want to evaluate the technology, run benchmarks, and help shape the roadmap. Reach out at contact@nabustore.com if you'd like to discuss your use case.

NabuStore runs on commodity hardware — no proprietary controllers required. You need standard x86 servers with NVMe SSDs, RDMA-capable NICs (e.g. NVIDIA ConnectX-6/7), and a Kubernetes cluster. A 3-node cluster is sufficient for evaluation. For production-grade performance, we recommend 100 GbE or InfiniBand NDR fabric and PCIe 5.0 NVMe drives.

Weka and VAST are excellent general-purpose storage platforms. NabuStore is a specialist: purpose-built for AI inference I/O patterns (write-once / read-many blobs), with kernel-bypassed data paths via SPDK and DPDK, native KV cache lifecycle management, and sub-microsecond metadata lookups via CXL-backed indexing. If you need a general-purpose filesystem, NabuStore is not the right tool. If you're running GPU inference clusters at scale, it is.

The best starting point is the GitHub repository and the documentation, which includes a reference architecture, deployment guides, and Kubernetes operator manifests. For questions or to discuss your specific environment, open a GitHub issue or email contact@nabustore.com.

NabuStore is Apache 2.0 open source and actively welcomes contributions. Browse open issues on GitHub for good first issues, review the plugin architecture docs if you want to extend the storage layer, or reach out directly to discuss a larger contribution or infrastructure integration.
