Engineering·February 22, 2026

How We Built a Stateless Distributed Cache for ClickHouse

Marc Leuthardt

TL;DR: In a stateless distributed database, the network is the bottleneck — not CPU or memory. We built a distributed cache mesh where every ClickHouse node contributes its local NVMe to a shared cache layer, achieving near-100% virtual cache hit rates and reducing S3 reads to almost zero. It's available for all customers on all instance types.


When Latest-Gen CPUs and Blazing-Fast Memory Are Not the Bottleneck

When you run ClickHouse on a single node, performance is straightforward. Data lives on local NVMe, the OS page cache keeps hot data in memory, and queries fly. Add replicas, and not much changes — each replica has its own copy of the data on local disk, though replication does add network overhead for merging the results of queries against distributed tables.

But the moment you move to a stateless architecture, where compute nodes have no persistent storage and data lives in S3, everything changes.

The fastest CPUs and the most memory in the world don't help if you're waiting for data.

The Storage Hierarchy Gap

Let's look at the numbers:

Storage                 Latency        Throughput   IOPS
L1/L2 cache             ~1–10 ns       TB/s         —
Memory (DDR5)           ~80 ns         ~50 GB/s     —
Local NVMe              ~100 µs        ~5 GB/s      ~1M
S3 (same region)        ~10–50 ms      ~1 GB/s      ~5,500
S3 (cold/first byte)    ~100–200 ms    ~100 MB/s    —

These are rough values. Actual throughput depends on file size, parallelism, and the type of workload.

From NVMe to S3, we're looking at a 100–1000× increase in latency and a 5× drop in throughput per stream. For a database engine that's optimized to scan columnar data at memory speed, this gap is brutal.
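To make the gap concrete, here is a back-of-envelope calculation of how long a full sequential scan of the 500 GB example dataset would take at each tier's throughput from the table above (illustrative figures only; real scans are parallelized and rarely read everything):

```python
# Time to scan a 500 GB dataset at each tier's sequential throughput.
# Values are the rough figures from the table above, not measurements.
DATASET_GB = 500

throughput_gb_s = {
    "memory (DDR5)": 50,
    "local NVMe": 5,
    "S3 (same region)": 1,
}

for tier, gbps in throughput_gb_s.items():
    seconds = DATASET_GB / gbps
    print(f"{tier:18s}: {seconds:7.1f} s")
```

Even in this idealized single-stream view, the same scan that finishes in seconds from memory takes minutes from S3.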

The Cold Cluster Problem

Consider what happens when a 20-node ObsessionDB cluster wakes up and needs to serve queries over a 500 GB dataset.

Without caching, every node fetches its portion from S3 independently. Even with parallel connections, you're looking at minutes of warm-up time and a large number of S3 GET requests. Worse, if the same data is requested by queries landing on different nodes, the same S3 objects get downloaded multiple times.

With per-node caching, things improve — but only for the node that served the query. Node 3 downloads and caches the data for Q1, but when Q2 lands on node 7, it goes back to S3 for the exact same data. Multiply this across 20 nodes, and you're downloading the dataset up to 20 times.
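The worst case for per-node caching is easy to quantify with the numbers from the example (a sketch, not a measured result):

```python
# Worst case for per-node caching: every node independently downloads
# the full dataset from S3. Numbers are the example from the text.
dataset_gb = 500
nodes = 20

per_node_cache_gb = dataset_gb * nodes  # each node fetches everything
shared_cache_gb = dataset_gb            # shared cache: each file fetched once

print(f"per-node caching, worst case: {per_node_cache_gb:,} GB from S3")
print(f"shared cache:                 {shared_cache_gb:,} GB from S3")
```

Up to 10 TB of S3 traffic to warm caches for a 500 GB dataset — a 20× amplification that a shared cache eliminates entirely.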

And it gets worse.

Merges: The Cache Killer

ClickHouse continuously merges smaller parts into larger ones in the background. This is fundamental to how MergeTree works and critical for query performance. But from the cache's perspective, merges are destructive: the source parts that were carefully cached become outdated, and the newly merged part needs to be cached again.

With 20 nodes all performing independent merges, cache invalidation becomes a constant churn. Even a dataset that fits comfortably in cache can generate significant S3 traffic just from merge activity — reading source parts, writing merged results, and re-populating caches across nodes.

We needed a fundamentally different approach.

Attempt 1: Centralized Cache Cluster

Our first idea was simple: place a dedicated cache tier between ClickHouse and S3.

    ClickHouse Nodes (1..N)
              |
              v
    Cache Cluster (consistent hash)
    [cache-0 NVMe] [cache-1 NVMe] [cache-2 NVMe]
              |
              v
              S3

We used consistent hashing to route requests: each S3 object maps to exactly one cache node based on its key. Cache hits are served from NVMe, misses go to S3 and get stored on the owning cache node.

This solved the duplication problem — each file is cached exactly once. But it introduced new ones:

  • Bottleneck: All reads and writes funnel through a handful of cache nodes (three, in this example). Their NVMe and network become the ceiling for the entire cluster.
  • Waste: We're paying for those extra dedicated cache servers that do nothing but proxy data. Their CPUs sit mostly idle while their network saturates.
  • Scaling mismatch: Adding ClickHouse nodes doesn't increase cache capacity. The cache tier scales independently and awkwardly.

We needed the cache to scale with compute, not instead of it.

Attempt 2: The Distributed Cache Mesh

The insight was obvious in hindsight: every ClickHouse node already has local NVMe. Why not make every node a member of a distributed cache?

    Distributed Cache Mesh

    [node-0: CH + NVMe] <-> [node-1: CH + NVMe] <-> [node-2: CH + NVMe] <-> ... <-> [node-N: CH + NVMe]
                                        |
                                        v
                                        S3

No dedicated cache servers. No wasted resources. Every node contributes its NVMe to the shared cache pool. Add a node, and you automatically add cache capacity.

But how should the nodes coordinate?

The Design Decision: Two Valid Approaches

Option A: Consistent Hashing

Each node owns a slice of the key space. When node-3 needs file X, it hashes the path and routes the request to whichever node owns that hash range.

    hash("s3://bucket/abc/data.bin") = 0x7A3F...

    node-0: 0x00–0x3F
    node-1: 0x40–0x6F
    node-2: 0x70–0x9F   <- owns it
    node-3: 0xA0–0xFF

Pros: Deterministic placement, exactly one copy of each file, efficient NVMe usage.

Cons: Requires a hash ring, rebalancing on node add/remove, custom routing protocol, connection pooling. Every file access needs a network hop to the owning node.
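A minimal sketch of such a hash ring, to show what Option A would entail (illustrative only — no virtual-node tuning, failure handling, or rebalancing logic, which is precisely the machinery this option requires):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring. Each physical node appears as
    several virtual nodes so the key space is spread more evenly."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def owner(self, key):
        # First virtual node clockwise from the key's hash owns the key.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing([f"node-{i}" for i in range(4)])
print(ring.owner("s3://bucket/abc/data.bin"))  # deterministic owner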

Option B: Shared Filesystem (Stateless Distributed Layer)

Instead of routing individual requests, present all node-local NVMe drives as a single shared filesystem. Every node sees every file. No routing needed.

    Shared Filesystem View

    /cache/part_1/data.bin  ->  physically on node-0 NVMe
    /cache/part_2/data.bin  ->  physically on node-2 NVMe
    /cache/part_3/data.bin  ->  physically on node-1 NVMe
    ...

    All nodes see all files. Read from any node.
    Local reads hit NVMe directly; remote reads go over the network.

Pros: Zero routing logic, zero custom protocol, files naturally land on the node that writes them (data locality), standard POSIX I/O.

Cons: Adds a filesystem abstraction layer, remote reads are slower than local ones, slightly less deterministic placement.

Why We Chose Option B

Both approaches are valid and each has trade-offs depending on the workload. We couldn't determine a clear winner on cache efficiency or scaling behavior alone.

But Option B unlocked something that tipped the scales: locality-aware merges. Because cache files naturally reside on the node that wrote them, we can prefer merging parts that are already cached locally. This turns the cache from a passive read optimization into an active scheduling signal, and it dramatically reduces cross-node data movement during merges. The extra layer of complexity was worth it.
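The idea of using the cache as a scheduling signal can be sketched roughly as follows. This is a hypothetical illustration, not ObsessionDB's actual scheduler; the `Part` type and `cached_locally` flag are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Part:
    name: str
    size_bytes: int
    cached_locally: bool  # does this part's cache file live on our NVMe?

def merge_score(parts):
    """Score a merge candidate by the fraction of its bytes that are
    already on local NVMe, i.e. readable without any network traffic."""
    total = sum(p.size_bytes for p in parts)
    local = sum(p.size_bytes for p in parts if p.cached_locally)
    return local / total if total else 0.0

candidates = [
    [Part("p1", 100, True), Part("p2", 50, True)],    # fully local
    [Part("p3", 100, False), Part("p4", 200, True)],  # 2/3 local
]
best = max(candidates, key=merge_score)
print([p.name for p in best])  # → ['p1', 'p2']
```

Preferring the fully-local candidate means the merge reads everything from local NVMe and ships nothing across the network.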

How It Works

From ClickHouse's perspective, it's writing cache files to a local directory. The magic is that this "local" directory is backed by a distributed filesystem, and all nodes see all files at all times.

node-0 INSERTs data:

  1. Part written to S3 (durable storage)
  2. Cache files written to the local cache path
  3. Files stored on node-0 NVMe

node-7 reads the same data:

  1. Opens the same cache file path
  2. File exists (written by node-0)
  3. Reads from node-0 NVMe via the network
  4. No S3 download needed

ClickHouse doesn't know or care that the file is physically on another node. It just opens a file and reads. The distributed filesystem handles the rest.
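From the application's side, the whole read path is ordinary POSIX file I/O. A sketch of what that looks like, with an S3 fallback for true misses (the `/cache` mount point and `fetch_from_s3` callable are hypothetical stand-ins, not ObsessionDB's actual interfaces):

```python
import os

CACHE_ROOT = "/cache"  # hypothetical mount point of the distributed FS

def read_object(relpath, fetch_from_s3):
    """Open a cache file with plain POSIX I/O. Whether the bytes live on
    local or remote NVMe is invisible to the caller; only on a true miss
    do we fall back to S3 and populate the cache for every node."""
    path = os.path.join(CACHE_ROOT, relpath)
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        data = fetch_from_s3(relpath)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)  # now visible to all nodes in the mesh
        return data
```

Note there is no routing, no custom protocol, and no awareness of which node physically holds the file — the distributed filesystem resolves that underneath `open()`.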

The Math: Why This Scales

Let's think about bandwidth in a 20-node cluster.

Each node has a network bandwidth of B. In a centralized cache architecture, the cache nodes are the bottleneck — you get maybe 3B of total cache bandwidth regardless of how many compute nodes you add.

In our mesh, every node is both a producer and consumer of cache data. The total available bandwidth for intra-cache traffic is 20 × B.

But it gets better. Because data is naturally distributed across nodes (each node writes its own INSERT data to its local volume), reads are spread across the cluster. When 20 nodes query different parts of a dataset, they're pulling from 20 different source NVMe drives over 20 different network paths. No single node becomes a bottleneck. As the cluster grows, bandwidth scales linearly with node count, and the natural load distribution approaches ~100% bandwidth efficiency at scale.
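The bandwidth argument reduces to simple arithmetic (B = 10 GB/s is an assumed figure for illustration, not a measured one):

```python
# Aggregate cache bandwidth under the two architectures, assuming each
# node has network bandwidth B. Illustrative numbers only.
B_GB_S = 10

def cache_bandwidth(nodes, dedicated_cache_nodes=None):
    if dedicated_cache_nodes is not None:
        # Centralized tier: the ceiling is set by the cache nodes,
        # no matter how many compute nodes you add.
        return dedicated_cache_nodes * B_GB_S
    # Mesh: every node serves cache traffic, so bandwidth scales with N.
    return nodes * B_GB_S

print(cache_bandwidth(20, dedicated_cache_nodes=3))  # → 30
print(cache_bandwidth(20))                           # → 200
```

Same 20-node cluster, nearly 7× more aggregate cache bandwidth — and the gap widens with every node added.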

The Cache Hit Rate Equation

With cache-on-write enabled, every INSERT immediately populates the distributed cache. This means:

  • Virtual cache hit rate: ~100% — Any data ever written is available in the mesh. Every node can read every file without going to S3.
  • Physical cache hit rate: ~1/N — On a given node, roughly 1/N of the mesh data resides on its local NVMe (where N is the number of nodes). The rest is fetched from other nodes via the network.
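The expected read mix for a warmed N-node cluster follows directly from these two rates (a simplified model assuming uniformly distributed data and queries):

```python
# Expected read mix on a warmed cluster with cache-on-write: every file
# is somewhere in the mesh, but only ~1/N of it is on local NVMe.
def read_mix(n_nodes):
    local = 1.0 / n_nodes   # served from local NVMe
    remote = 1.0 - local    # served from a peer's NVMe over the network
    s3 = 0.0                # warmed cluster: no S3 reads
    return local, remote, s3

local, remote, s3 = read_mix(20)
print(f"local {local:.0%}, remote {remote:.0%}, S3 {s3:.0%}")
# → local 5%, remote 95%, S3 0%
```

So on a 20-node cluster, the vast majority of reads cross the intra-cluster network — which is why the next comparison matters.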

That "network fetch from another node" sounds expensive, but it's a fundamentally different kind of network access than S3:

                        Intra-cluster (node-to-node)    S3
Latency                 ~0.1 ms                         ~10–50 ms
Small-file overhead     Negligible                      High (per-request overhead)
Connection setup        Already connected               HTTPS handshake

For many small files — which is exactly what ClickHouse parts consist of — the intra-cluster network is orders of magnitude faster than S3. And because we increase the total network surface with every node we add, we reduce the chance of congestion dramatically.
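A quick worked example using the latencies from the table above (sequential fetches for simplicity; real clients parallelize, but per-request overhead still dominates for small files):

```python
# Fetching 1,000 small files, dominated by per-request latency.
# Latencies are the rough figures from the table above.
files = 1_000
peer_latency_ms = 0.1  # node-to-node, connection already open
s3_latency_ms = 25     # mid-range of the ~10-50 ms figure

print(f"mesh: {files * peer_latency_ms / 1000:.1f} s")  # → 0.1 s
print(f"S3:   {files * s3_latency_ms / 1000:.1f} s")    # → 25.0 s
```

Two-plus orders of magnitude for exactly the access pattern — many small files — that ClickHouse parts produce.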

The Result

With cache-on-write enabled and a warmed cluster, we reduced S3 reads to near zero during normal operation. S3 is only touched for:

  1. The initial INSERT (write path — data must be durably stored)
  2. True cold reads where data was evicted from the cache

Combined with the fast node-to-node cache network, ObsessionDB stays snappy even at scale. Queries that would wait hundreds of milliseconds for S3 round-trips instead read from the mesh in microseconds to a fraction of a millisecond.

For ObsessionDB the cache architecture is not an optimization — it's the architecture. It's available for all our instances without extra charge.
