Engineering·February 7, 2026

ClickHouse® Database: What It Is, How It Works, and When It Falls Apart

Alvaro Garcia

Tesla has ingested over a quadrillion rows into ClickHouse®. Cloudflare processes 11 million rows per second. Netflix pushes roughly 5 petabytes of logs through it daily. At Numia, we run ClickHouse at 100 billion+ rows serving 10 million requests per day. This post is what I wish someone had explained before I started.

Most "what is ClickHouse" articles stop at the marketing pitch. This one goes deeper. By the end, you'll understand why ClickHouse is architected the way it is, when to reach for it, and where it falls apart.

Why ClickHouse Exists

In 2009, Yandex had a problem. Their web analytics platform, Yandex.Metrica, was processing 200 million events per day. By 2016, that number hit 25 billion events daily. The challenge wasn't storage — it was real-time analytics on non-aggregated data.

Traditional solutions failed at this scale. OLAP cubes required pre-aggregation, which killed query flexibility. Row-based databases like PostgreSQL couldn't scan billions of rows fast enough for interactive dashboards.

The numbers that broke everything: 374 servers, 20.3 trillion rows, 2 petabytes compressed. Report load times averaged 26 seconds. After migrating to ClickHouse: 0.8 seconds. Same reports, 32x faster. Not from better hardware, but from better architecture.

Two decisions made this possible:

  1. Columnar storage: Read only the data your query needs
  2. Vectorized execution: Process data in batches, not one row at a time

Columnar Storage

The difference between row-based and columnar storage is the single biggest factor in query performance. Consider a simple table:

```sql
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type String
)
```

Row-based storage (PostgreSQL, MySQL):

[timestamp1, user_id1, event_type1], [timestamp2, user_id2, event_type2]...

Every row lives together. Reading one column means reading them all.

Columnar storage (ClickHouse):

[timestamp1, timestamp2...], [user_id1, user_id2...], [event_type1, event_type2...]

Each column is stored separately. Reading one column means reading only that column.

[Figure] Row-based vs columnar storage — how ClickHouse reads only the data your query needs. For SELECT count(timestamp) FROM events, a row store reads all three columns from disk (100% of the data), while a column store reads only the timestamp column (33%).

This matters because of how analytics queries work. "How many events per day?" only needs the timestamp column. Row-based databases read 100% of the data. Columnar reads ~33% (one of three columns). A table with 50 columns where your query needs 3? Row-based reads everything. Columnar reads 6%.

The flip side: fetching a single complete row is worse. A row-store reads one contiguous block. A column-store reads from every column file separately, one access per column. With 50 columns, that's 50 reads instead of 1. This is why columnar databases are terrible for SELECT * WHERE id = 123 workloads.

Engines And MergeTree

MergeTree is where ClickHouse does most of its work. The name is literal: data arrives in parts, and a background process merges them.

The hierarchy matters:

  • Table: Collection of parts
  • Part: Immutable data chunk created on each INSERT
  • Granule: Smallest unit ClickHouse reads (default: 8,192 rows)
  • Mark: Pointer to a granule's location

Unlike traditional databases that index every row (B-tree), ClickHouse uses a sparse index: one entry per granule, not one entry per row. Each entry is called a mark, and it points to where a granule starts on disk.

[Figure] How data is organized in MergeTree, from table to part to granule, with a sparse index that always fits in RAM. A part (here 2.5M rows, sorted by ORDER BY) splits into granules of 8,192 rows, each with one mark holding its file offset; 8.87M rows yield 1,083 marks (96.93 KB) that a query binary-searches in ~10 steps.

Take a table with 8.87 million rows:

  • 8.87M / 8,192 = 1,083 granules
  • Primary index stores one mark per granule
  • Total index size: 96.93 KB in memory

The entire primary index fits in RAM. Always. This is by design. When you query, ClickHouse binary searches over these marks to find which granules might contain your data. For 1,083 marks, that's ~10 steps. Then it reads at most 8,192 rows per matching granule.
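
The arithmetic above can be sketched in a few lines of Python. This is an illustrative model, not ClickHouse's implementation: the primary key is a synthetic sorted integer and the file offsets are invented.

```python
import bisect

GRANULE = 8_192
TOTAL_ROWS = 8_870_000

# One mark per granule: (first primary-key value in the granule, file offset).
num_granules = (TOTAL_ROWS + GRANULE - 1) // GRANULE          # ceiling division
marks = [(g * GRANULE, g * 0x4000) for g in range(num_granules)]
keys = [k for k, _ in marks]

def granule_for(key):
    """Binary search over the marks: the key can only live in the granule
    whose first key is the greatest one <= key (~10 steps for 1,083 marks)."""
    return max(bisect.bisect_right(keys, key) - 1, 0)

print(num_granules)            # 1083 granules -> 1083 marks
print(granule_for(4_000_000))  # 488 -> scan at most 8,192 rows in that granule
```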

Creating a MergeTree table:

```sql
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type LowCardinality(String),
    payload String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (event_type, user_id, timestamp);
```

Two decisions here matter:

PARTITION BY controls physical separation. Monthly partitioning means January data lives in different files than February. Queries filtering by month skip entire partitions.

ORDER BY controls sorting within each part.

Data flows like this:

  1. INSERT creates a new part (immutable)
  2. Background thread merges small parts into larger ones
  3. Merges can trigger computation for specialized engines
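
That lifecycle can be simulated in a toy model. The merge policy here is deliberately naive (merge the smallest parts); real merge scheduling also weighs part age and level:

```python
# Each INSERT appends one immutable part; background merges replace several
# small parts with one larger part, keeping the total part count bounded.
parts = []  # part sizes in rows

def insert(num_rows):
    parts.append(num_rows)          # every INSERT -> one new part

def merge_smallest(k=3):
    """Merge the k smallest parts into one (naive stand-in for the real policy)."""
    global parts
    parts.sort()
    merged, parts = sum(parts[:k]), parts[k:]
    parts.append(merged)

for _ in range(10):
    insert(1_000)                   # ten small INSERTs -> ten parts
while len(parts) > 4:
    merge_smallest()

print(sorted(parts))                # [1000, 3000, 3000, 3000]
```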

The most relevant variants:

Engine                 Purpose
MergeTree              Base engine
ReplacingMergeTree     Deduplicates during merge
AggregatingMergeTree   Pre-aggregates during merge

Why ORDER BY Matters

The sparse index can only help if the data is sorted in a way that groups relevant rows into the same granules. That's what ORDER BY does. It controls how rows are sorted within each part, which directly controls how many granules ClickHouse can skip.

Consider:

```sql
SELECT * FROM events
WHERE event_type = 'click' AND user_id = 12345;
```

With ORDER BY (event_type, user_id, timestamp):

  • Binary search finds event_type = 'click' granules
  • Within those, finds user_id = 12345 granules
  • Scans maybe 8,192 rows

With ORDER BY (timestamp, event_type, user_id):

  • event_type is second in order
  • Can't use index efficiently
  • Scans potentially millions of rows

Same query, same data, different ORDER BY. The difference can be 100x in query time.

The real rule: query patterns first, then cardinality. Your ORDER BY should start with the columns that appear most often in your WHERE clauses. A column that never gets filtered on is useless in the index, no matter its cardinality.

Once you've identified which columns you actually query by, order them from low cardinality to high cardinality. Why? A low-cardinality column at the front eliminates large chunks of granules in one step. If event_type has 10 unique values and user_id has 1 million, putting event_type first eliminates ~90% of granules immediately.
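
Here's a rough way to watch that happen, with synthetic data and a toy "index" that just counts how many granules contain a matching row. The table shape, granule size, and numbers are invented for illustration:

```python
GRANULE = 1_000
# 100,000 synthetic events cycling through 10 event types and 1,000 user ids.
rows = [(f"type_{ts % 10}", ts % 1_000, ts) for ts in range(100_000)]

def granules_to_read(sort_key):
    """Sort rows into granules and count granules holding at least one match —
    a stand-in for how many granules the sparse index cannot skip."""
    data = sorted(rows, key=sort_key)
    return sum(
        1 for i in range(0, len(data), GRANULE)
        if any(r[0] == "type_3" and r[1] == 43 for r in data[i:i + GRANULE])
    )

good = granules_to_read(lambda r: (r[0], r[1], r[2]))  # ORDER BY (event_type, user_id, timestamp)
bad = granules_to_read(lambda r: (r[2], r[0], r[1]))   # ORDER BY (timestamp, event_type, user_id)
print(good, bad)  # 1 100 — same data, same query, 100x fewer granules to scan
```

With the good ordering, every matching row is packed into one granule; with timestamp first, the matches scatter across the whole table.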

Compression benefits from good ordering too. The documentation shows 3:1 compression with poor ordering → 39:1 with optimal ordering. Same data, 13x better compression from column order alone.

Vectorized Execution

Traditional databases process row by row:

for each row:
    apply filter
    compute aggregation

ClickHouse processes in batches:

for each batch of 65,536 rows:
    apply filter to entire batch (SIMD)
    compute aggregation on batch (SIMD)

SIMD means Single Instruction, Multiple Data. One CPU instruction processes multiple values simultaneously. Modern CPUs support AVX2 (4 doubles at once) and AVX-512 (8 doubles at once).

ClickHouse has multiple SIMD implementations per operator and auto-selects based on your CPU at runtime. It has 30+ precompiled hash table variants for different data patterns.

In practice, this means 10-100x faster aggregations compared to row-at-a-time processing. It's why ClickHouse aggregates billions of rows in seconds while PostgreSQL takes minutes on the same data.
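
Python has no SIMD, but the shape of the idea — amortizing per-row overhead by operating on whole batches — can be sketched. Here `sum()` over a slice stands in for a vectorized kernel:

```python
from array import array

BATCH = 65_536
values = array("d", (float(i % 100) for i in range(1_000_000)))

def filter_sum_row_at_a_time(vals):
    """Per-value overhead is paid on every single row."""
    total = 0.0
    for v in vals:
        if v > 50.0:              # filter, one row at a time
            total += v            # aggregate, one row at a time
    return total

def filter_sum_batched(vals):
    """Work proceeds one batch at a time; the inner loop runs in C."""
    total = 0.0
    for start in range(0, len(vals), BATCH):
        batch = vals[start:start + BATCH]
        total += sum(v for v in batch if v > 50.0)
    return total

assert filter_sum_row_at_a_time(values) == filter_sum_batched(values)
print(filter_sum_batched(values))  # 36750000.0
```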

Compression

Columnar storage has another advantage: similar values sit next to each other, and similar values compress well. Timestamps next to timestamps achieve 10:1 to 30:1 compression. With monotonic sequences, ClickHouse hits 800:1 using DoubleDelta encoding.

Every column stored on disk gets compressed. The smaller it is on disk, the less I/O your query needs. ClickHouse applies compression in two layers:

  1. A specialized codec transforms the data to make it more compressible
  2. A general compression algorithm actually shrinks it

First you fold everything neatly, then you vacuum-seal the bag.

General compression

These algorithms work on any data. You choose one based on a simple tradeoff:

Codec   Speed                  Compression
LZ4     Faster decompression   Good (2-3x reduction)
ZSTD    Slightly slower        ~30% better than LZ4

Both decompress faster than most SSDs can read, so in practice you rarely notice the speed difference. LZ4 is the default and a safe choice. Use ZSTD when storage cost matters more than marginal CPU savings.

Specialized codecs

These don't compress data themselves. They transform it into a form that general compression can handle much better. Each one targets a specific pattern:

Delta stores the difference between consecutive values instead of the values themselves.

Original:      [1000, 1001, 1002, 1003]
After Delta:   [1000, 1, 1, 1]

A column full of small numbers like 1, 1, 1 compresses much better than 1000, 1001, 1002. Good for timestamps and any monotonically increasing column.
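
The transform is a one-liner in each direction. A sketch (the real codec operates on fixed-width binary data, not Python lists):

```python
def delta_encode(values):
    """Keep the first value; store differences for the rest."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(encoded):
    """A running sum restores the original values — the transform is lossless."""
    out = encoded[:1]
    for d in encoded[1:]:
        out.append(out[-1] + d)
    return out

timestamps = [1000, 1001, 1002, 1003]
print(delta_encode(timestamps))                           # [1000, 1, 1, 1]
assert delta_decode(delta_encode(timestamps)) == timestamps
```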

DoubleDelta goes one step further and stores the delta of the deltas.

Original:           [1000, 1005, 1010, 1015]
After Delta:        [1000, 5, 5, 5]
After DoubleDelta:  [1000, 5, 0, 0]

When the interval between values is constant (like a sensor reporting every 5 seconds), the second delta is all zeros. This can hit 800:1 compression on regular time-series data. Good for counters and regular-interval sequences.
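
The same trick applied twice can be sketched as follows. Only the transform is shown; the real DoubleDelta codec then bit-packs the result so those zeros take almost no space:

```python
def double_delta_encode(values):
    """Keep the first value and first delta; store deltas of the deltas."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[:1] + deltas[:1] + [b - a for a, b in zip(deltas, deltas[1:])]

readings = [1000, 1005, 1010, 1015]    # e.g. a sensor reporting every 5 units
print(double_delta_encode(readings))   # [1000, 5, 0, 0]
```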

Gorilla uses XOR encoding, designed for floating-point numbers. It XORs each value with the previous one. When consecutive floats are similar, the result is mostly zeros.

Original:   [36.6, 36.7, 36.6, 36.8]
After XOR:  [36.6, (mostly zeros), (mostly zeros), (mostly zeros)]

Good for metrics and measurements where values don't jump around much between rows.
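
The bit-level effect is easy to see with Python's struct module. This sketch shows only the XOR transform; the real Gorilla codec additionally encodes the runs of zero bits compactly:

```python
import struct

def float_bits(x):
    """IEEE-754 bit pattern of a float, as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def xor_stream(values):
    """XOR each value's bit pattern with the previous one's. Similar floats
    share sign, exponent, and high mantissa bits, so the XOR is mostly zeros."""
    prev = float_bits(values[0])
    out = [prev]
    for v in values[1:]:
        bits = float_bits(v)
        out.append(prev ^ bits)
        prev = bits
    return out

temps = [36.6, 36.7, 36.6, 36.8]
for word in xor_stream(temps):
    print(f"{word:064b}")  # first line is dense; later lines start with long zero runs
```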

Putting it together

You pair one specialized codec with one general codec per column, choosing based on the data type:

```sql
CREATE TABLE metrics (
    timestamp DateTime CODEC(Delta, ZSTD),
    value Float64 CODEC(Gorilla, LZ4),
    counter UInt64 CODEC(DoubleDelta, ZSTD)
)
```

If you're unsure, the defaults (LZ4, no specialized codec) are solid. Add specialized codecs when you understand your data patterns and want to squeeze out better compression.

When ClickHouse Shines

ClickHouse is at its best when you're scanning a lot of data and aggregating it down. Log analytics, time-series, event tracking, real-time dashboards. Cloudflare processes 11M rows/sec through it. PostHog built their entire analytics product on top of it. Timestamps sort naturally and compress well with Delta encoding, which makes time-series workloads particularly fast.

The query patterns that work well are aggregations (COUNT, SUM, AVG) over millions of rows, time-range filters like WHERE timestamp > now() - INTERVAL 7 DAY, and GROUP BY on low-cardinality columns like country or device type.

Some production numbers:

Company      Scale
Tesla        1 quadrillion+ rows
Netflix      ~5 PB logs/day
Cloudflare   11M rows/sec, 47 Gbps
Yandex       20.3 trillion rows (2014)

When ClickHouse Falls Apart

ClickHouse is not a general-purpose database. I've hit each of these walls in production.

There are no ACID transactions and no row-level locking. If you need transactional guarantees, use PostgreSQL.

Point lookups on columns that aren't in the ORDER BY are painful:

```sql
-- Slow if user_id isn't first in ORDER BY
SELECT * FROM events WHERE user_id = 12345;
```

The sparse index can't skip granules efficiently, so you end up scanning everything.

Updates don't exist in the way you'd expect. ClickHouse doesn't update in place. You can use ReplacingMergeTree, but merge timing isn't guaranteed. CollapsingMergeTree works but requires explicit delete/insert with sign columns. None of it is as clean as UPDATE ... WHERE.
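
What ReplacingMergeTree does during a merge can be sketched roughly like this, assuming a (key, version, payload) row shape invented for the example. Remember you don't control when the merge actually runs:

```python
def merge_replacing(parts, key=lambda r: r[0], version=lambda r: r[1]):
    """Collapse duplicate keys across parts, keeping the highest version.
    Until a merge runs, queries can still see the duplicates."""
    latest = {}
    for part in parts:
        for row in part:
            k = key(row)
            if k not in latest or version(row) >= version(latest[k]):
                latest[k] = row
    return sorted(latest.values())

part_1 = [("user:1", 1, "old@mail"), ("user:2", 1, "b@mail")]
part_2 = [("user:1", 2, "new@mail")]             # later INSERT, higher version
print(merge_replacing([part_1, part_2]))
# [('user:1', 2, 'new@mail'), ('user:2', 1, 'b@mail')]
```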

JOINs on high-cardinality columns are another pain point:

```sql
-- Millions of unique join keys = pain
SELECT * FROM events e JOIN users u ON e.user_id = u.id;
```

The right side of the JOIN gets loaded into memory. Large tables cause OOM. The workaround is to denormalize your data or use dictionaries.
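
The memory behavior follows from how a hash join works: build a hash table from the right side, then stream the left side through it. A minimal sketch with toy data, not ClickHouse's actual join code:

```python
def hash_join(left, right, left_key, right_key):
    """Classic hash join: the ENTIRE right side is materialized in memory;
    the left side streams through one row at a time."""
    table = {}
    for row in right:                              # right side -> hash table
        table.setdefault(right_key(row), []).append(row)
    for lrow in left:                              # left side streams
        for rrow in table.get(left_key(lrow), []):
            yield lrow + rrow

events = [(1, "click"), (2, "view"), (1, "purchase")]
users = [(1, "ana"), (2, "bo")]                    # keep the SMALL table on the right
print(list(hash_join(events, users, lambda e: e[0], lambda u: u[0])))
# [(1, 'click', 1, 'ana'), (2, 'view', 2, 'bo'), (1, 'purchase', 1, 'ana')]
```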

And finally, small frequent inserts. Each INSERT creates a part on disk. Thousands of small inserts mean thousands of parts, which creates merge overhead. Batch your inserts (1,000–100,000 rows at a time).
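
A batching wrapper is the usual fix: flush on row count or elapsed time, whichever comes first. The thresholds and the `sink` callable are illustrative, not a real client API:

```python
import time

class BatchedInserter:
    """Buffer rows and flush in large batches instead of one INSERT per row."""
    def __init__(self, sink, flush_rows=10_000, flush_seconds=60.0):
        self.sink = sink                    # stand-in for the real INSERT call
        self.flush_rows = flush_rows
        self.flush_seconds = flush_seconds
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.buffer.append(row)
        if (len(self.buffer) >= self.flush_rows
                or time.monotonic() - self.last_flush >= self.flush_seconds):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)          # one INSERT -> one part on disk
            self.buffer = []
        self.last_flush = time.monotonic()

parts = []                                  # each flushed batch = one part
inserter = BatchedInserter(parts.append, flush_rows=1_000)
for i in range(25_000):
    inserter.add((i, "click"))
inserter.flush()                            # flush the tail on shutdown
print(len(parts))                           # 25 parts instead of 25,000
```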

The Mental Model

ClickHouse is fast because it reads only the columns you need, keeps the entire primary index in RAM (96 KB for 8.87 million rows), processes rows in SIMD batches, and compresses everything 10-30x so there's less to read from disk in the first place.

It breaks when you need transactions, fast lookups on non-indexed columns, frequent updates, JOINs on millions of unique keys, or high-throughput single-row inserts.

That's the trade-off. ClickHouse gives up write flexibility and transactional guarantees in exchange for read speed on analytical queries. If that matches your workload, nothing else comes close. If it doesn't, you'll fight the system at every turn.


If you're running ClickHouse in production, the distributed setup is where the real complexity lives. I wrote about that separately in The Hard Truth About Self-Hosting ClickHouse® at Scale.
