How We Built a Stateless Distributed Cache for ClickHouse
TL;DR: In a stateless distributed database, the network is the bottleneck — not CPU or memory. We built a distributed cache mesh where every ClickHouse node contributes its local NVMe to a shared cache layer, achieving near-100% virtual cache hit rates and reducing S3 reads to almost zero. It's available for all customers on all instance types.
When Latest-Gen CPUs and Blazing-Fast Memory Are Not the Bottleneck
When you run ClickHouse on a single node, performance is straightforward. Data lives on local NVMe, the OS page cache keeps hot data in memory, and queries fly. Add replicas, and not much changes: each replica has its own copy of the data on local disk, though merging results of queries against distributed tables does add some network overhead.
But the moment you move to a stateless architecture, where compute nodes have no persistent storage and data lives in S3, everything changes.
The fastest CPUs and the most memory in the world don't help if you're waiting for data.
The Storage Hierarchy Gap
Let's look at the numbers:
| Storage | Latency | Throughput | IOPS |
|---|---|---|---|
| L1/L2 Cache | ~1–10 ns | TB/s | — |
| Memory (DDR5) | ~80 ns | ~50 GB/s | — |
| Local NVMe | ~100 µs | ~5 GB/s | ~1M |
| S3 (same region) | ~10–50 ms | ~1 GB/s | ~5,500 |
| S3 (cold/first byte) | ~100–200 ms | ~100 MB/s | — |
These are rough values. Actual throughput depends on file size, parallelism, and the type of workload.
From NVMe to S3, we're looking at a 100–1000× increase in latency and a 5× drop in throughput per stream. For a database engine that's optimized to scan columnar data at memory speed, this gap is brutal.
The Cold Cluster Problem
Consider what happens when a 20-node ObsessionDB cluster wakes up and needs to serve queries over a 500 GB dataset.
Without caching, every node fetches its portion from S3 independently. Even with parallel connections, you're looking at minutes of warm-up time and significant S3 GET ops. Worse, if the same data is requested by queries landing on different nodes, the same S3 objects get downloaded multiple times.
With per-node caching, things improve — but only for the node that served the query. Node 3 downloads and caches the data for Q1, but when Q2 lands on node 7, it goes back to S3 for the exact same data. Multiply this across 20 nodes, and you're downloading the dataset up to 20 times.
And it gets worse.
Merges: The Cache Killer
ClickHouse continuously merges smaller parts into larger ones in the background. This is fundamental to how MergeTree works and critical for query performance. But from the cache's perspective, merges are destructive: the source parts that were carefully cached become outdated, and the newly merged part needs to be cached again.
With 20 nodes all performing independent merges, cache invalidation becomes a constant churn. Even a dataset that fits comfortably in cache can generate significant S3 traffic just from merge activity — reading source parts, writing merged results, and re-populating caches across nodes.
We needed a fundamentally different approach.
Attempt 1: Centralized Cache Cluster
Our first idea was simple: place a dedicated cache tier between ClickHouse and S3.
We used consistent hashing to route requests: each S3 object maps to exactly one cache node based on its key. Cache hits are served from NVMe, misses go to S3 and get stored on the owning cache node.
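The routing step can be sketched with a minimal consistent-hash ring. This is an illustrative Python sketch, not our actual implementation; the node names and virtual-node count are made up:

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Maps each S3 object key to exactly one cache node."""

    def __init__(self, nodes, vnodes=64):
        # Virtual nodes smooth out the key distribution across physical nodes.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    def owner(self, key):
        # The first ring position clockwise from the key's hash owns the key.
        idx = bisect_right(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
node = ring.owner("s3://bucket/table/part-0001/data.bin")
```

The placement is deterministic: every node computes the same owner for the same key, so no central routing table is needed.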
This solved the duplication problem — each file is cached exactly once. But it introduced new ones:
- Bottleneck: All reads and writes funnel through a handful of dedicated cache nodes (say, three). Their NVMe and network become the ceiling for the entire cluster.
- Waste: We're paying for those extra dedicated cache servers that do nothing but proxy data. Their CPUs sit mostly idle while their network saturates.
- Scaling mismatch: Adding ClickHouse nodes doesn't increase cache capacity. The cache tier scales independently and awkwardly.
We needed the cache to scale with compute, not instead of it.
Attempt 2: The Distributed Cache Mesh
The insight was obvious in hindsight: every ClickHouse node already has local NVMe. Why not make every node a member of a distributed cache?
No dedicated cache servers. No wasted resources. Every node contributes its NVMe to the shared cache pool. Add a node, and you automatically add cache capacity.
But how should the nodes coordinate?
The Design Decision: Two Valid Approaches
Option A: Consistent Hashing
Each node owns a slice of the key space. When node-3 needs file X, it hashes the path and routes the request to whichever node owns that hash range.
Pros: Deterministic placement, exactly one copy of each file, efficient NVMe usage.
Cons: Requires a hash ring, rebalancing on node add/remove, custom routing protocol, connection pooling. Every file access needs a network hop to the owning node.
Option B: Shared Filesystem (Stateless Distributed Layer)
Instead of routing individual requests, present all node-local NVMe drives as a single shared filesystem. Every node sees every file. No routing needed.
Pros: Zero routing logic, zero custom protocol, files naturally land on the node that writes them (data locality), standard POSIX I/O.
Cons: Adds a filesystem abstraction layer, remote reads are slower than local ones, slightly less deterministic placement.
Why We Chose Option B
Both approaches are valid and each has trade-offs depending on the workload. We couldn't determine a clear winner on cache efficiency or scaling behavior alone.
But Option B unlocked something that tipped the scales: locality-aware merges. Because cache files naturally reside on the node that wrote them, we can prefer merging parts that are already cached locally. This turns the cache from a passive read optimization into an active scheduling signal, and it dramatically reduces cross-node data movement during merges. The extra layer of complexity was worth it.
How It Works
From ClickHouse's perspective, it's writing cache files to a local directory. The magic is that this "local" directory is backed by a distributed filesystem, and all nodes see all files at all times.
ClickHouse doesn't know or care that the file is physically on another node. It just opens a file and reads. The distributed filesystem handles the rest.
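From the application side this really is just file I/O. A minimal sketch, with a hypothetical mount path:

```python
import os

# Hypothetical mount point backed by the distributed cache filesystem.
CACHE_ROOT = "/var/lib/clickhouse/dist_cache"

def read_cached_file(relative_path, cache_root=CACHE_ROOT):
    """Plain POSIX-style I/O. Whether the bytes come from local NVMe or a
    peer node's NVMe is decided by the filesystem layer, not by this code."""
    with open(os.path.join(cache_root, relative_path), "rb") as f:
        return f.read()
```

No routing logic, no custom protocol, no awareness of the cluster topology in the read path.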
The Math: Why This Scales
Let's think about bandwidth in a 20-node cluster.
Each node has a network bandwidth of B. In a centralized cache architecture, the cache nodes are the bottleneck — you get maybe 3B of total cache bandwidth regardless of how many compute nodes you add.
In our mesh, every node is both a producer and consumer of cache data. The total available bandwidth for intra-cache traffic is 20 × B.
But it gets better. Because data is naturally distributed across nodes (each node writes its own INSERT data to its local volume), reads are spread across the cluster. When 20 nodes query different parts of a dataset, they're pulling from 20 different source NVMe drives over 20 different network paths. No single node becomes a bottleneck. As the cluster grows, bandwidth scales linearly with node count, and the natural load distribution approaches ~100% bandwidth efficiency at scale.
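The back-of-envelope arithmetic, assuming an illustrative 25 Gbps NIC per node (the exact figure varies by instance type):

```python
def aggregate_cache_bandwidth_gbps(compute_nodes, nic_gbps, cache_nodes=None):
    """Total bandwidth available for serving cached data.

    Centralized tier: only the dedicated cache nodes' NICs serve data.
    Mesh: every compute node's NIC contributes.
    """
    serving_nodes = cache_nodes if cache_nodes is not None else compute_nodes
    return serving_nodes * nic_gbps

centralized = aggregate_cache_bandwidth_gbps(20, 25, cache_nodes=3)  # 75 Gbps ceiling
mesh = aggregate_cache_bandwidth_gbps(20, 25)                        # 500 Gbps, grows per node
```

In the centralized design the ceiling stays fixed no matter how many compute nodes you add; in the mesh it rises with every node.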
The Cache Hit Rate Equation
With cache-on-write enabled, every INSERT immediately populates the distributed cache. This means:
- Virtual cache hit rate: ~100% — Any data ever written is available in the mesh. Every node can read every file without going to S3.
- Physical cache hit rate: ~1/N — On a given node, roughly 1/N of the mesh data resides on its local NVMe (where N is the number of nodes). The rest is fetched from other nodes via the network.
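In numbers, for a 20-node cluster (a simple identity, under the idealized assumption that data spreads evenly across nodes):

```python
def cache_hit_rates(n_nodes):
    """Virtual vs. physical cache hit rate with cache-on-write.

    Assumes data is spread evenly across the mesh (an idealization).
    """
    virtual = 1.0              # every written file lives somewhere in the mesh
    physical = 1.0 / n_nodes   # expected fraction on this node's own NVMe
    return virtual, physical

virtual, physical = cache_hit_rates(20)   # 1.0 and 0.05
```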
That "network fetch from another node" sounds expensive, but it's a fundamentally different kind of network access than S3:
| | Intra-cluster (node-to-node) | S3 |
|---|---|---|
| Latency | ~0.1 ms | ~10–50 ms |
| Small file overhead | Negligible | High (per-request overhead) |
| Connection setup | Already connected | HTTPS handshake |
For many small files — which is exactly what ClickHouse parts consist of — the intra-cluster network is orders of magnitude faster than S3. And because we increase the total network surface with every node we add, we reduce the chance of congestion dramatically.
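A rough model makes the gap concrete. All numbers are illustrative, taken from the latency table above; real figures depend on object size and parallelism:

```python
def fetch_time_ms(n_files, file_kb, latency_ms, throughput_mb_s):
    """Sequential fetch time for many small files, in milliseconds.

    For small files the per-request latency term dominates the transfer
    term, which is why S3 is so punishing for this access pattern.
    """
    transfer_ms = (file_kb / 1024) / throughput_mb_s * 1000
    return n_files * (latency_ms + transfer_ms)

# 1,000 files of 64 KB each:
s3 = fetch_time_ms(1000, 64, latency_ms=30, throughput_mb_s=100)      # ~30.6 s
mesh = fetch_time_ms(1000, 64, latency_ms=0.1, throughput_mb_s=5000)  # ~0.11 s
```

Even before parallelism, the per-request latency difference alone puts the mesh two orders of magnitude ahead for this workload shape.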
The Result
With cache-on-write enabled and a warmed cluster, we reduced S3 reads to near zero during normal operation. S3 is only touched for:
- The initial INSERT (write path — data must be durably stored)
- True cold reads where data was evicted from the cache
Combined with the fast intra-cluster cache network, ObsessionDB stays snappy even at scale. Queries that would wait hundreds of milliseconds for S3 round-trips instead read from the mesh in microseconds.
For ObsessionDB the cache architecture is not an optimization — it's the architecture. It's available for all our instances without extra charge.