NabuStore

AI Inference Storage for GPU Clusters

Kubernetes-native AI inference storage built for horizontal scalability. Handles model weights, KV cache blobs, and checkpoints across distributed GPU clusters with sub-microsecond metadata lookups.

Apache 2.0 · Go 1.24+ · Kubernetes Native · Plugin Architecture · Documentation

  • CXL index lookup: <300 ns
  • Erasure coding: EC 8+2
  • RDMA fabric: 400 Gb/s
  • NVMe per node: 5.8 TB

Live cluster dashboard with consistent hash ring, topology graph, and real-time node metrics

Traditional Storage Is Not AI Storage

They share the word "storage." They share almost nothing else.

| Dimension | Traditional Enterprise Storage | AI-Native Storage |
|---|---|---|
| I/O Pattern | Random reads and writes. Applications depend on in-place modification, POSIX locking, and byte-range updates. | Write-once / read-many. Model weights, checkpoints, and training samples are immutable blobs. No random writes needed. |
| Protocol | NFS, iSCSI, SMB: designed for spinning disk over TCP. Kernel-mediated I/O with multiple context switches per operation. | NVMe-oF over RDMA, GPUDirect Storage, CXL. Kernel-bypassed via DPDK/SPDK. Zero-copy from NVMe directly into GPU HBM. |
| Latency Target | Milliseconds acceptable. HDD-era designs persist even on all-flash arrays due to legacy software stack overhead. | Microseconds required. GPU costs $30k+; every microsecond of idle GPU is measurable dollar waste. Sub-250µs is baseline. |
| Bottleneck | IOPS and capacity. Scale-up storage controllers dominate. Single-threaded I/O stacks inherited from HDD era. | Bandwidth and parallelism. Thousands of GPU cores demand simultaneous access. Scale-out across commodity nodes is mandatory. |
| New Primitives | No concept of KV cache, embedding vectors, or tensor blobs. Filesystems expose bytes and blocks, not AI-meaningful data units. | KV cache offload/sharing, vector embedding storage, content-addressable tensor blobs, and GPU-DMA-aligned chunking are first-class. |
| Hardware Model | Proprietary controllers (Pure DirectFlash, NetApp ONTAP ASIC). Vendor lock-in is a feature; margins depend on it. | Commodity NVMe + RDMA NICs on standard x86. Software defines all performance. Hardware is fungible and replaceable. |
| Moat | Hardware, support contracts, installed base inertia. Winner-take-most dynamics already resolved (NetApp, Pure, Dell EMC). | Software architecture, AI-primitive integration (NIXL, CUDA, PyTorch), protocol leadership. Category not yet decided. |

Architecture: Blob-Native, Kernel-Free, GPU-Direct

Built from first principles — not adapted from general-purpose storage.

AI Application Layer
vLLM · PyTorch DataLoader · HuggingFace · RAG Pipelines
↓ Blob API (blob_get / blob_put / blob_stripe) ↓
Blob Orchestration
DHT key routing · Stripe logic · Erasure coding (ISA-L) · KV cache lifecycle
↓ SPDK Blobstore + DPDK transport primitives ↓
User-Space Data Path
SPDK NVMe driver · DPDK poll-mode NIC · GPUDirect Storage (cuFile) · CXL memory tier
↓ VFIO hardware mapping (single kernel crossing) ↓
Commodity Hardware
Standard x86 servers · NVMe SSDs · ConnectX RDMA NICs · NVIDIA GPUs
↓ Kubernetes Operator ↓
Control Plane
Kubernetes Operator · CSI plugin · Observability · Multi-tenant QoS

Why Blob-Native Beats Filesystem for AI

  • AI I/O is write-once / read-many — POSIX random access semantics are unnecessary overhead
  • Blob boundaries align naturally with GPU DMA — one blob = one tensor = one DMA transfer
  • No inode table, no directory tree, no POSIX locking — entire filesystem metadata complexity eliminated
  • Content-addressable keys (token hash → KV cache blob) enable cluster-wide cache deduplication natively
  • SPDK Blobstore already ships this primitive — we compose, not invent
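
To make the content-addressable key idea in the list above concrete, here is a minimal Go sketch of deriving a KV cache blob key from a token prefix. The function name `BlobKey`, the use of SHA-256, the key layout, and the example model ID are all assumptions for illustration; the page does not specify how NabuStore computes its keys.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// BlobKey derives a content-addressed key from a model ID and a token prefix.
// Hypothetical helper: the hash function and key layout are assumptions,
// not NabuStore's documented format.
func BlobKey(modelID string, tokens []uint32) string {
	h := sha256.New()
	h.Write([]byte(modelID))
	h.Write([]byte{0}) // separator between the model ID and the token stream
	var buf [4]byte
	for _, t := range tokens {
		binary.LittleEndian.PutUint32(buf[:], t)
		h.Write(buf[:])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// "example-model" and the token IDs are placeholder values.
	a := BlobKey("example-model", []uint32{1, 15043, 3186})
	b := BlobKey("example-model", []uint32{1, 15043, 3186})
	c := BlobKey("example-model", []uint32{1, 15043, 3187})

	fmt.Println(a == b) // true: same prefix, same key, so the blob is stored once
	fmt.Println(a == c) // false: a different prefix yields a different key
}
```

Because the key depends only on content, two inference workers that compute the same prefix arrive at the same key independently, which is what lets deduplication happen without a coordination step.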

Architecture & Techniques

Deep dives into each component of the storage stack, from GPU memory to NVMe, with interactive diagrams and protocol walkthroughs.

Reference Architecture

The complete blueprint: five design principles, full software stack (blob orchestration through SPDK/DPDK to hardware), four node types (storage, GPU, metadata, management), three network fabrics, critical data paths with latency budgets, and cluster sizing for 1k-GPU deployments.

Full Stack · Node Types · Network Fabric · Sizing
View reference architecture →

Storage Node Architecture

Complete hardware and software stack of a NabuStore node: vLLM integration, CXL-backed Robin Hood index, content-addressed blob store, DHT ring with 150 vnodes, and the three-plane network topology (GPU IB, replication Ethernet, inference).

CXL DIMM · SPDK · DHT Ring · vLLM
View architecture diagram →
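
As a rough illustration of a DHT ring with 150 virtual nodes per physical node, the following Go sketch builds a consistent hash ring and returns the first distinct nodes clockwise from a key. Only the vnode count comes from the description above; the `Ring` type, the SHA-256 based placement, and the `node#index` vnode naming are assumptions, not NabuStore's actual implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

const vnodesPerNode = 150 // vnode count taken from the description above

// Ring is a hypothetical consistent hash ring.
type Ring struct {
	points []uint64          // sorted vnode positions on the ring
	owner  map[uint64]string // vnode position -> physical node
}

// hash64 maps a string to a ring position (assumed hash: first 8 bytes of SHA-256).
func hash64(s string) uint64 {
	sum := sha256.Sum256([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

// NewRing places vnodesPerNode virtual nodes on the ring for every physical node.
func NewRing(nodes []string) *Ring {
	r := &Ring{owner: make(map[uint64]string)}
	for _, n := range nodes {
		for v := 0; v < vnodesPerNode; v++ {
			p := hash64(fmt.Sprintf("%s#%d", n, v))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Lookup walks clockwise from the key's position and returns up to count
// distinct physical nodes, so each shard of a stripe lands on a different node.
func (r *Ring) Lookup(key string, count int) []string {
	if len(r.points) == 0 {
		return nil
	}
	h := hash64(key)
	start := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	seen := make(map[string]bool)
	var out []string
	for i := 0; i < len(r.points) && len(out) < count; i++ {
		n := r.owner[r.points[(start+i)%len(r.points)]]
		if !seen[n] {
			seen[n] = true
			out = append(out, n)
		}
	}
	return out
}

func main() {
	ring := NewRing([]string{"node-a", "node-b", "node-c", "node-d"})
	fmt.Println(ring.Lookup("blob-123", 3)) // three distinct nodes for this key
}
```

Many virtual nodes per server smooth out the key distribution and keep the amount of data that moves small when a node joins or leaves.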

Blob Transfer via NVMe-oF & RDMA

Zero-copy GPU blob retrieval: NIXL request triggers DHT lookup, NVMe-oF target registration, GPU HBM3e registered as RDMA memory region, then direct RDMA Write from NVMe DMA buffer through InfiniBand NDR 400 Gb/s fabric to GPU memory. No CPU copies.

NVMe-oF · RDMA · BlueField-3 · GPUDirect
View data path →

DPDK Blob Replication Pipeline

Userspace replication using DPDK poll-mode drivers: EC shards encoded with ISA-L, wrapped in huge-page DMA buffers, dispatched via rte_eth_tx_burst over ConnectX-7 200 GbE. Dedicated replication VLAN on Arista 7060CX3 fabric, zero-copy receive path.

DPDK · ISA-L · Zero-Copy · 200 GbE
View replication pipeline →

EC 4+2 Blob Write Pipeline

Step-by-step erasure-coded write: 4 MB blob split into 4 data shards, ISA-L AVX-512 GF(2^8) Cauchy encoding produces 2 parity shards, DHT ring maps each shard to a distinct node, SPDK blobstore writes via PCIe 5.0 DMA, 6-way ACK confirms stripe durability.

Reed-Solomon · AVX-512 · PCIe 5.0 · 6-Node Stripe
View write pipeline →
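
The split-and-encode step of the EC 4+2 write can be sketched in plain Go. This is an illustrative, unoptimized stand-in for the ISA-L AVX-512 path, not the project's code: the function names, the Cauchy matrix parameters (x_i = i, y_j = 2 + j), the 0x11d reduction polynomial, and the 4 MiB blob size are assumptions. For brevity it recovers only a single lost data shard; repairing two losses additionally requires inverting a 2×2 submatrix.

```go
package main

import (
	"bytes"
	"fmt"
)

const (
	dataShards   = 4
	parityShards = 2
)

// gfMul multiplies two elements of GF(2^8) modulo x^8+x^4+x^3+x^2+1 (0x11d).
func gfMul(a, b byte) byte {
	var p byte
	for b > 0 {
		if b&1 == 1 {
			p ^= a
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1d
		}
		b >>= 1
	}
	return p
}

// gfInv finds the multiplicative inverse by brute force; a must be non-zero.
func gfInv(a byte) byte {
	for x := 1; x < 256; x++ {
		if gfMul(a, byte(x)) == 1 {
			return byte(x)
		}
	}
	panic("zero has no inverse")
}

// cauchy returns the matrix entry 1/(x_i + y_j) with x_i = i, y_j = parityShards + j.
func cauchy(i, j int) byte {
	return gfInv(byte(i) ^ byte(parityShards+j))
}

// Split cuts a blob into dataShards equal shards, zero-padding the tail.
func Split(blob []byte) [][]byte {
	size := (len(blob) + dataShards - 1) / dataShards
	shards := make([][]byte, dataShards)
	for j := range shards {
		shards[j] = make([]byte, size)
		if lo := j * size; lo < len(blob) {
			copy(shards[j], blob[lo:])
		}
	}
	return shards
}

// Encode computes parityShards parity shards from the data shards.
func Encode(data [][]byte) [][]byte {
	parity := make([][]byte, parityShards)
	for i := range parity {
		parity[i] = make([]byte, len(data[0]))
		for j, shard := range data {
			c := cauchy(i, j)
			for b, v := range shard {
				parity[i][b] ^= gfMul(c, v)
			}
		}
	}
	return parity
}

// RecoverOne rebuilds one lost data shard from parity shard 0 and the survivors.
func RecoverOne(data [][]byte, lost int, parity0 []byte) []byte {
	acc := append([]byte(nil), parity0...)
	for j, shard := range data {
		if j == lost {
			continue
		}
		c := cauchy(0, j)
		for b, v := range shard {
			acc[b] ^= gfMul(c, v)
		}
	}
	inv := gfInv(cauchy(0, lost))
	for b := range acc {
		acc[b] = gfMul(inv, acc[b])
	}
	return acc
}

func main() {
	blob := make([]byte, 4<<20) // assumed 4 MiB blob with a synthetic byte pattern
	for i := range blob {
		blob[i] = byte(i*31 + 7)
	}
	data := Split(blob)
	parity := Encode(data)
	fmt.Println("shard size:", len(data[0]), "bytes; shards per stripe:", len(data)+len(parity))

	lost := data[2]
	data[2] = nil // simulate losing one data shard
	fmt.Println("recovered:", bytes.Equal(RecoverOne(data, 2, parity[0]), lost))
}
```

With four data shards and two parity shards, the stripe occupies 1.5 times the blob size and survives the loss of any two of its six shards.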

Plugin Architecture

Three-tier extensibility model: Tier 1 static registry (init()-based, zero overhead), Tier 2 gRPC out-of-process (vendor binaries over UDS), Tier 3 hardware auto-detection. Server vendors supply drivers without modifying core code. Full API reference and vendor plugin guide.

Tier 1: In-Process · Tier 2: gRPC · Tier 3: Auto-Detect
View plugin guide →
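
The Tier 1 static registry follows a common Go pattern: each plugin package registers a factory from its `init()` function, so the core never has to name the plugin explicitly. The sketch below shows that pattern under assumed names. The `Driver` interface, `Register`, `Open`, `Registered`, and the `example-nvme` plugin are hypothetical; the project's real API is described in its plugin guide.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Driver is a hypothetical plugin interface used only for this sketch.
type Driver interface {
	Name() string
}

var (
	mu      sync.RWMutex
	drivers = map[string]func() Driver{}
)

// Register adds a driver factory to the static registry.
// Plugins call it from init(), so registration happens before main runs.
func Register(name string, factory func() Driver) {
	mu.Lock()
	defer mu.Unlock()
	if _, dup := drivers[name]; dup {
		panic("plugin registered twice: " + name)
	}
	drivers[name] = factory
}

// Open instantiates a registered driver by name.
func Open(name string) (Driver, error) {
	mu.RLock()
	factory, ok := drivers[name]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("unknown plugin %q", name)
	}
	return factory(), nil
}

// Registered lists the registered plugin names in sorted order.
func Registered() []string {
	mu.RLock()
	defer mu.RUnlock()
	names := make([]string, 0, len(drivers))
	for n := range drivers {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// exampleDriver stands in for a vendor-supplied driver.
type exampleDriver struct{}

func (exampleDriver) Name() string { return "example-nvme" }

// In a real plugin this init() would live in the vendor's own package and be
// pulled in with a blank import.
func init() {
	Register("example-nvme", func() Driver { return exampleDriver{} })
}

func main() {
	fmt.Println(Registered())
	d, err := Open("example-nvme")
	if err != nil {
		panic(err)
	}
	fmt.Println(d.Name())
}
```

Registration costs one map insert at startup and lookups are plain map reads, which is consistent with the "zero overhead" claim for in-process plugins.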

Storage Tier Hierarchy

Three-tier caching from DRAM through CXL memory to NVMe, with automatic promotion and demotion based on access patterns.

  • L1 (Hot): DRAM, ~100 ns
  • L2 (Warm): CXL Memory, ~200–300 ns
  • L3 (Cold): NVMe SSD, ~100 µs
  • Index: Robin Hood Map, CXL-backed mmap
  • Data Plane: RDMA / DPDK, <2 µs wire
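
For readers unfamiliar with the Robin Hood map named as the index above, here is a minimal in-memory Go sketch of the technique: open addressing where an inserted entry displaces any resident entry that sits closer to its home slot. The `Map` type, the fixed power-of-two capacity, and the splitmix64-style mixer are assumptions; the real index is described as CXL-backed and memory-mapped, which this sketch does not attempt.

```go
package main

import "fmt"

type slot struct {
	key, val uint64
	dist     int  // probe distance from the key's home slot
	used     bool
}

// Map is a hypothetical fixed-capacity Robin Hood hash map.
type Map struct {
	slots []slot
	mask  uint64
	n     int
}

// New creates a map; capacity must be a power of two.
func New(capacity int) *Map {
	if capacity <= 0 || capacity&(capacity-1) != 0 {
		panic("capacity must be a power of two")
	}
	return &Map{slots: make([]slot, capacity), mask: uint64(capacity - 1)}
}

// mix scrambles the key bits (splitmix64 finalizer).
func mix(k uint64) uint64 {
	k ^= k >> 30
	k *= 0xbf58476d1ce4e5b9
	k ^= k >> 27
	k *= 0x94d049bb133111eb
	k ^= k >> 31
	return k
}

// Get probes from the home slot and stops early once it meets an entry that
// is closer to its own home than the probe is: the key cannot lie beyond it.
func (m *Map) Get(key uint64) (uint64, bool) {
	i := mix(key) & m.mask
	for dist := 0; dist < len(m.slots); dist++ {
		s := m.slots[i]
		if !s.used || s.dist < dist {
			return 0, false
		}
		if s.key == key {
			return s.val, true
		}
		i = (i + 1) & m.mask
	}
	return 0, false
}

// Put inserts or updates a key. It returns false if the table is full.
func (m *Map) Put(key, val uint64) bool {
	if m.n == len(m.slots) {
		if _, ok := m.Get(key); !ok {
			return false
		}
	}
	cur := slot{key: key, val: val, used: true}
	i := mix(key) & m.mask
	for {
		s := &m.slots[i]
		if !s.used {
			*s = cur
			m.n++
			return true
		}
		if s.key == cur.key {
			s.val = cur.val
			return true
		}
		if s.dist < cur.dist {
			*s, cur = cur, *s // take from the "rich" entry, keep probing for it
		}
		cur.dist++
		i = (i + 1) & m.mask
	}
}

func main() {
	m := New(8)
	for k := uint64(1); k <= 6; k++ {
		m.Put(k, k*100)
	}
	v, ok := m.Get(4)
	fmt.Println(v, ok) // 400 true
	_, ok = m.Get(42)
	fmt.Println(ok) // false
}
```

Keeping probe distances short and evenly spread bounds the number of memory touches per lookup, which matters when each touch costs a CXL round trip of a few hundred nanoseconds.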

Frequently Asked Questions

Common questions about NabuStore.

Is NabuStore production-ready?

NabuStore is in active early-stage development. The core blob storage engine, DHT ring routing, and NVMe-oF/RDMA data paths are functional, but the project is not yet recommended for production workloads. We welcome early adopters who want to evaluate the technology, run benchmarks, and help shape the roadmap. Reach out at contact@nabustore.com if you'd like to discuss your use case.

What hardware do I need?

NabuStore runs on commodity hardware; no proprietary controllers are required. You need standard x86 servers with NVMe SSDs, RDMA-capable NICs (e.g. NVIDIA ConnectX-6/7), and a Kubernetes cluster. A 3-node cluster is sufficient for evaluation. For production-grade performance, we recommend 100 GbE or InfiniBand NDR fabric and PCIe 5.0 NVMe drives.

How does NabuStore compare to Weka and VAST?

Weka and VAST are excellent general-purpose storage platforms. NabuStore is a specialist: purpose-built for AI inference I/O patterns (write-once / read-many blobs), with kernel-bypassed data paths via SPDK and DPDK, native KV cache lifecycle management, and sub-microsecond metadata lookups via CXL-backed indexing. If you need a general-purpose filesystem, NabuStore is not the right tool. If you're running GPU inference clusters at scale, it is.

How do I get started?

The best starting point is the GitHub repository and the documentation, which includes a reference architecture, deployment guides, and Kubernetes operator manifests. For questions or to discuss your specific environment, open a GitHub issue or email contact@nabustore.com.

How can I contribute?

NabuStore is Apache 2.0 open source and actively welcomes contributions. Browse the open issues on GitHub for good first issues, review the plugin architecture docs if you want to extend the storage layer, or reach out directly to discuss a larger contribution or infrastructure integration.