Engineering·February 7, 2026

ClickHouse® Database: What It Is, How It Works, and When It Falls Apart

Alvaro Garcia

Tesla has ingested over a quadrillion rows into ClickHouse®. Cloudflare processes 11 million rows per second. Netflix pushes roughly 5 petabytes of logs through it daily. At Numia, we run ClickHouse at 100 billion+ rows serving 10 million requests per day. This post is what I wish someone had explained before I started.

Most "what is ClickHouse" articles stop at the marketing pitch. This one goes deeper. By the end, you'll understand why ClickHouse is architected the way it is, when to reach for it, and where it falls apart.

Why ClickHouse Exists

In 2009, Yandex had a problem. Their web analytics platform, Yandex.Metrica, was processing 200 million events per day. By 2016, that number hit 25 billion events daily. The challenge wasn't storage — it was real-time analytics on non-aggregated data.

Traditional solutions failed at this scale. OLAP cubes required pre-aggregation, which killed query flexibility. Row-based databases like PostgreSQL couldn't scan billions of rows fast enough for interactive dashboards.

The numbers that broke everything: 374 servers, 20.3 trillion rows, 2 petabytes compressed. Report load times averaged 26 seconds. After migrating to ClickHouse: 0.8 seconds. Same reports, 32x faster. Not from better hardware, but from better architecture.

Two decisions made this possible:

  1. Columnar storage: Read only the data your query needs
  2. Vectorized execution: Process data in batches, not one row at a time

Columnar Storage

The difference between row-based and columnar storage is the single biggest factor in query performance. Consider a simple table:

```sql
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type String
)
```

Row-based storage (PostgreSQL, MySQL):

[timestamp1, user_id1, event_type1], [timestamp2, user_id2, event_type2]...

Every row lives together. Reading one column means reading them all.

Columnar storage (ClickHouse):

[timestamp1, timestamp2...], [user_id1, user_id2...], [event_type1, event_type2...]

Each column is stored separately. Reading one column means reading only that column.

[Figure] Row-based vs columnar storage — how ClickHouse reads only the data your query needs. For SELECT count(timestamp) FROM events, a row store reads all three columns from disk (100% of the data), while a column store reads only the timestamp column (33%).

This matters because of how analytics queries work. "How many events per day?" only needs the timestamp column. Row-based databases read 100% of the data. Columnar reads ~33% (one of three columns). A table with 50 columns where your query needs 3? Row-based reads everything. Columnar reads 6%.

The flip side: fetching a single complete row is worse. A row-store reads one contiguous block. A column-store reads from every column file separately, one access per column. With 50 columns, that's 50 reads instead of 1. This is why columnar databases are terrible for SELECT * WHERE id = 123 workloads.

Engines And MergeTree

MergeTree is where ClickHouse does most of its work. The name is literal: data arrives in parts, and a background process merges them.

The hierarchy matters:

  • Table: Collection of parts
  • Part: Immutable data chunk created on each INSERT
  • Granule: Smallest unit ClickHouse reads (default: 8,192 rows)
  • Mark: Pointer to a granule's location

Unlike traditional databases that index every row (B-tree), ClickHouse uses a sparse index: one entry per granule, not one entry per row. Each entry is called a mark, and it points to where a granule starts on disk.

[Figure] How data is organized in MergeTree, from table to part to granule, with a sparse index that always fits in RAM. A part (here 2.5M rows, sorted by ORDER BY) splits into granules of 8,192 rows, each with one mark holding its file offset; 8.87M rows yield 1,083 marks (96.93 KB) that a query binary-searches in ~10 steps.

Take a table with 8.87 million rows:

  • 8.87M / 8,192 = 1,083 granules
  • Primary index stores one mark per granule
  • Total index size: 96.93 KB in memory

The entire primary index fits in RAM. Always. This is by design. When you query, ClickHouse binary searches over these marks to find which granules might contain your data. For 1,083 marks, that's ~10 steps. Then it reads at most 8,192 rows per matching granule.
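
The arithmetic above can be sketched in a few lines of Python. This is an illustrative model, not ClickHouse's implementation: the primary key is a synthetic sorted integer and the file offsets are invented.

```python
import bisect

GRANULE = 8_192
TOTAL_ROWS = 8_870_000

# One mark per granule: (first primary-key value in the granule, file offset).
num_granules = (TOTAL_ROWS + GRANULE - 1) // GRANULE          # ceiling division
marks = [(g * GRANULE, g * 0x4000) for g in range(num_granules)]
keys = [k for k, _ in marks]

def granule_for(key):
    """Binary search over the marks: the key can only live in the granule
    whose first key is the greatest one <= key (~10 steps for 1,083 marks)."""
    return max(bisect.bisect_right(keys, key) - 1, 0)

print(num_granules)            # 1083 granules -> 1083 marks
print(granule_for(4_000_000))  # 488 -> scan at most 8,192 rows in that granule
```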

Creating a MergeTree table:

```sql
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type LowCardinality(String),
    payload String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (event_type, user_id, timestamp);
```

Two decisions here matter:

PARTITION BY controls physical separation. Monthly partitioning means January data lives in different files than February. Queries filtering by month skip entire partitions.

ORDER BY controls sorting within each part.

Data flows like this:

  1. INSERT creates a new part (immutable)
  2. Background thread merges small parts into larger ones
  3. Merges can trigger computation for specialized engines
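
That lifecycle can be simulated in a toy model. The merge policy here is deliberately naive (merge the smallest parts); real merge scheduling also weighs part age and level:

```python
# Each INSERT appends one immutable part; background merges replace several
# small parts with one larger part, keeping the total part count bounded.
parts = []  # part sizes in rows

def insert(num_rows):
    parts.append(num_rows)          # every INSERT -> one new part

def merge_smallest(k=3):
    """Merge the k smallest parts into one (naive stand-in for the real policy)."""
    global parts
    parts.sort()
    merged, parts = sum(parts[:k]), parts[k:]
    parts.append(merged)

for _ in range(10):
    insert(1_000)                   # ten small INSERTs -> ten parts
while len(parts) > 4:
    merge_smallest()

print(sorted(parts))                # [1000, 3000, 3000, 3000]
```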

The most relevant variants:

Engine                 Purpose
MergeTree              Base engine
ReplacingMergeTree     Deduplicates during merge
AggregatingMergeTree   Pre-aggregates during merge

Why ORDER BY Matters

The sparse index can only help if the data is sorted in a way that groups relevant rows into the same granules. That's what ORDER BY does. It controls how rows are sorted within each part, which directly controls how many granules ClickHouse can skip.

Consider:

```sql
SELECT * FROM events
WHERE event_type = 'click' AND user_id = 12345;
```

With ORDER BY (event_type, user_id, timestamp):

  • Binary search finds event_type = 'click' granules
  • Within those, finds user_id = 12345 granules
  • Scans maybe 8,192 rows

With ORDER BY (timestamp, event_type, user_id):

  • event_type is second in order
  • Can't use index efficiently
  • Scans potentially millions of rows

Same query, same data, different ORDER BY. The difference can be 100x in query time.

The real rule: query patterns first, then cardinality. Your ORDER BY should start with the columns that appear most often in your WHERE clauses. A column that never gets filtered on is useless in the index, no matter its cardinality.

Once you've identified which columns you actually query by, order them from low cardinality to high cardinality. Why? A low-cardinality column at the front eliminates large chunks of granules in one step. If event_type has 10 unique values and user_id has 1 million, putting event_type first eliminates ~90% of granules immediately.
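
Here's a rough way to watch that happen, with synthetic data and a toy "index" that just counts how many granules contain a matching row. The table shape, granule size, and numbers are invented for illustration:

```python
GRANULE = 1_000
# 100,000 synthetic events cycling through 10 event types and 1,000 user ids.
rows = [(f"type_{ts % 10}", ts % 1_000, ts) for ts in range(100_000)]

def granules_to_read(sort_key):
    """Sort rows into granules and count granules holding at least one match —
    a stand-in for how many granules the sparse index cannot skip."""
    data = sorted(rows, key=sort_key)
    return sum(
        1 for i in range(0, len(data), GRANULE)
        if any(r[0] == "type_3" and r[1] == 43 for r in data[i:i + GRANULE])
    )

good = granules_to_read(lambda r: (r[0], r[1], r[2]))  # ORDER BY (event_type, user_id, timestamp)
bad = granules_to_read(lambda r: (r[2], r[0], r[1]))   # ORDER BY (timestamp, event_type, user_id)
print(good, bad)  # 1 100 — same data, same query, 100x fewer granules to scan
```

With the good ordering, every matching row is packed into one granule; with timestamp first, the matches scatter across the whole table.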

Compression benefits from good ordering too. The documentation shows 3:1 compression with poor ordering → 39:1 with optimal ordering. Same data, 13x better compression from column order alone.

Vectorized Execution

Traditional databases process row by row:

for each row:
    apply filter
    compute aggregation

ClickHouse processes in batches:

for each batch of 65,536 rows:
    apply filter to entire batch (SIMD)
    compute aggregation on batch (SIMD)

SIMD means Single Instruction, Multiple Data. One CPU instruction processes multiple values simultaneously. Modern CPUs support AVX2 (4 doubles at once) and AVX-512 (8 doubles at once).

ClickHouse has multiple SIMD implementations per operator and auto-selects based on your CPU at runtime. It has 30+ precompiled hash table variants for different data patterns.

In practice, this means 10-100x faster aggregations compared to row-at-a-time processing. It's why ClickHouse aggregates billions of rows in seconds while PostgreSQL takes minutes on the same data.
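
Python has no SIMD, but the shape of the idea — amortizing per-row overhead by operating on whole batches — can be sketched. Here `sum()` over a slice stands in for a vectorized kernel:

```python
from array import array

BATCH = 65_536
values = array("d", (float(i % 100) for i in range(1_000_000)))

def filter_sum_row_at_a_time(vals):
    """Per-value overhead is paid on every single row."""
    total = 0.0
    for v in vals:
        if v > 50.0:              # filter, one row at a time
            total += v            # aggregate, one row at a time
    return total

def filter_sum_batched(vals):
    """Work proceeds one batch at a time; the inner loop runs in C."""
    total = 0.0
    for start in range(0, len(vals), BATCH):
        batch = vals[start:start + BATCH]
        total += sum(v for v in batch if v > 50.0)
    return total

assert filter_sum_row_at_a_time(values) == filter_sum_batched(values)
print(filter_sum_batched(values))  # 36750000.0
```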

Compression

Columnar storage has another advantage: similar values sit next to each other, and similar values compress well. Timestamps next to timestamps achieve 10:1 to 30:1 compression. With monotonic sequences, ClickHouse hits 800:1 using DoubleDelta encoding.

Every column stored on disk gets compressed. The smaller it is on disk, the less I/O your query needs. ClickHouse applies compression in two layers:

  1. A specialized codec transforms the data to make it more compressible
  2. A general compression algorithm actually shrinks it

First you fold everything neatly, then you vacuum-seal the bag.

General compression

These algorithms work on any data. You choose one based on a simple tradeoff:

Codec   Speed                  Compression
LZ4     Faster decompression   Good (2-3x reduction)
ZSTD    Slightly slower        ~30% better than LZ4

Both decompress faster than most SSDs can read, so in practice you rarely notice the speed difference. LZ4 is the default and a safe choice. Use ZSTD when storage cost matters more than marginal CPU savings.

Specialized codecs

These don't compress data themselves. They transform it into a form that general compression can handle much better. Each one targets a specific pattern:

Delta stores the difference between consecutive values instead of the values themselves.

Original:      [1000, 1001, 1002, 1003]
After Delta:   [1000, 1, 1, 1]

A column full of small numbers like 1, 1, 1 compresses much better than 1000, 1001, 1002. Good for timestamps and any monotonically increasing column.
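
The transform is a one-liner in each direction. A sketch (the real codec operates on fixed-width binary data, not Python lists):

```python
def delta_encode(values):
    """Keep the first value; store differences for the rest."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(encoded):
    """A running sum restores the original values — the transform is lossless."""
    out = encoded[:1]
    for d in encoded[1:]:
        out.append(out[-1] + d)
    return out

timestamps = [1000, 1001, 1002, 1003]
print(delta_encode(timestamps))                           # [1000, 1, 1, 1]
assert delta_decode(delta_encode(timestamps)) == timestamps
```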

DoubleDelta goes one step further and stores the delta of the deltas.

Original:           [1000, 1005, 1010, 1015]
After Delta:        [1000, 5, 5, 5]
After DoubleDelta:  [1000, 5, 0, 0]

When the interval between values is constant (like a sensor reporting every 5 seconds), the second delta is all zeros. This can hit 800:1 compression on regular time-series data. Good for counters and regular-interval sequences.
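
The same trick applied twice can be sketched as follows. Only the transform is shown; the real DoubleDelta codec then bit-packs the result so those zeros take almost no space:

```python
def double_delta_encode(values):
    """Keep the first value and first delta; store deltas of the deltas."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[:1] + deltas[:1] + [b - a for a, b in zip(deltas, deltas[1:])]

readings = [1000, 1005, 1010, 1015]    # e.g. a sensor reporting every 5 units
print(double_delta_encode(readings))   # [1000, 5, 0, 0]
```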

Gorilla uses XOR encoding, designed for floating-point numbers. It XORs each value with the previous one. When consecutive floats are similar, the result is mostly zeros.

Original:   [36.6, 36.7, 36.6, 36.8]
After XOR:  [36.6, (mostly zeros), (mostly zeros), (mostly zeros)]

Good for metrics and measurements where values don't jump around much between rows.
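
The bit-level effect is easy to see with Python's struct module. This sketch shows only the XOR transform; the real Gorilla codec additionally encodes the runs of zero bits compactly:

```python
import struct

def float_bits(x):
    """IEEE-754 bit pattern of a float, as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def xor_stream(values):
    """XOR each value's bit pattern with the previous one's. Similar floats
    share sign, exponent, and high mantissa bits, so the XOR is mostly zeros."""
    prev = float_bits(values[0])
    out = [prev]
    for v in values[1:]:
        bits = float_bits(v)
        out.append(prev ^ bits)
        prev = bits
    return out

temps = [36.6, 36.7, 36.6, 36.8]
for word in xor_stream(temps):
    print(f"{word:064b}")  # first line is dense; later lines start with long zero runs
```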

Putting it together

You pair one specialized codec with one general codec per column, choosing based on the data type:

```sql
CREATE TABLE metrics (
    timestamp DateTime CODEC(Delta, ZSTD),
    value Float64 CODEC(Gorilla, LZ4),
    counter UInt64 CODEC(DoubleDelta, ZSTD)
)
```

If you're unsure, the defaults (LZ4, no specialized codec) are solid. Add specialized codecs when you understand your data patterns and want to squeeze out better compression.

When ClickHouse Shines

ClickHouse is at its best when you're scanning a lot of data and aggregating it down. Log analytics, time-series, event tracking, real-time dashboards. Cloudflare processes 11M rows/sec through it. PostHog built their entire analytics product on top of it. Timestamps sort naturally and compress well with Delta encoding, which makes time-series workloads particularly fast.

The query patterns that work well are aggregations (COUNT, SUM, AVG) over millions of rows, time-range filters like WHERE timestamp > now() - INTERVAL 7 DAY, and GROUP BY on low-cardinality columns like country or device type.

Some production numbers:

Company      Scale
Tesla        1 quadrillion+ rows
Netflix      ~5 PB logs/day
Cloudflare   11M rows/sec, 47 Gbps
Yandex       20.3 trillion rows (2014)

When ClickHouse Falls Apart

ClickHouse is not a general-purpose database. I've hit each of these walls in production.

There are no ACID transactions and no row-level locking. If you need transactional guarantees, use PostgreSQL.

Point lookups on columns that aren't in the ORDER BY are painful:

```sql
-- Slow if user_id isn't first in ORDER BY
SELECT * FROM events WHERE user_id = 12345;
```

The sparse index can't skip granules efficiently, so you end up scanning everything.

Updates don't exist in the way you'd expect. ClickHouse doesn't update in place. You can use ReplacingMergeTree, but merge timing isn't guaranteed. CollapsingMergeTree works but requires explicit delete/insert with sign columns. None of it is as clean as UPDATE ... WHERE.
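
What ReplacingMergeTree does during a merge can be sketched roughly like this, assuming a (key, version, payload) row shape invented for the example. Remember you don't control when the merge actually runs:

```python
def merge_replacing(parts, key=lambda r: r[0], version=lambda r: r[1]):
    """Collapse duplicate keys across parts, keeping the highest version.
    Until a merge runs, queries can still see the duplicates."""
    latest = {}
    for part in parts:
        for row in part:
            k = key(row)
            if k not in latest or version(row) >= version(latest[k]):
                latest[k] = row
    return sorted(latest.values())

part_1 = [("user:1", 1, "old@mail"), ("user:2", 1, "b@mail")]
part_2 = [("user:1", 2, "new@mail")]             # later INSERT, higher version
print(merge_replacing([part_1, part_2]))
# [('user:1', 2, 'new@mail'), ('user:2', 1, 'b@mail')]
```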

JOINs on high-cardinality columns are another pain point:

```sql
-- Millions of unique join keys = pain
SELECT * FROM events e JOIN users u ON e.user_id = u.id;
```

The right side of the JOIN gets loaded into memory. Large tables cause OOM. The workaround is to denormalize your data or use dictionaries.
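
The memory behavior follows from how a hash join works: build a hash table from the right side, then stream the left side through it. A minimal sketch with toy data, not ClickHouse's actual join code:

```python
def hash_join(left, right, left_key, right_key):
    """Classic hash join: the ENTIRE right side is materialized in memory;
    the left side streams through one row at a time."""
    table = {}
    for row in right:                              # right side -> hash table
        table.setdefault(right_key(row), []).append(row)
    for lrow in left:                              # left side streams
        for rrow in table.get(left_key(lrow), []):
            yield lrow + rrow

events = [(1, "click"), (2, "view"), (1, "purchase")]
users = [(1, "ana"), (2, "bo")]                    # keep the SMALL table on the right
print(list(hash_join(events, users, lambda e: e[0], lambda u: u[0])))
# [(1, 'click', 1, 'ana'), (2, 'view', 2, 'bo'), (1, 'purchase', 1, 'ana')]
```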

And finally, small frequent inserts. Each INSERT creates a part on disk. Thousands of small inserts mean thousands of parts, which creates merge overhead. Batch your inserts (1,000–100,000 rows at a time).
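
A batching wrapper is the usual fix: flush on row count or elapsed time, whichever comes first. The thresholds and the `sink` callable are illustrative, not a real client API:

```python
import time

class BatchedInserter:
    """Buffer rows and flush in large batches instead of one INSERT per row."""
    def __init__(self, sink, flush_rows=10_000, flush_seconds=60.0):
        self.sink = sink                    # stand-in for the real INSERT call
        self.flush_rows = flush_rows
        self.flush_seconds = flush_seconds
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.buffer.append(row)
        if (len(self.buffer) >= self.flush_rows
                or time.monotonic() - self.last_flush >= self.flush_seconds):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)          # one INSERT -> one part on disk
            self.buffer = []
        self.last_flush = time.monotonic()

parts = []                                  # each flushed batch = one part
inserter = BatchedInserter(parts.append, flush_rows=1_000)
for i in range(25_000):
    inserter.add((i, "click"))
inserter.flush()                            # flush the tail on shutdown
print(len(parts))                           # 25 parts instead of 25,000
```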

The Mental Model

ClickHouse is fast because it reads only the columns you need, keeps the entire primary index in RAM (96 KB for 8.87 million rows), processes rows in SIMD batches, and compresses everything 10-30x so there's less to read from disk in the first place.

It breaks when you need transactions, fast lookups on non-indexed columns, frequent updates, JOINs on millions of unique keys, or high-throughput single-row inserts.

That's the trade-off. ClickHouse gives up write flexibility and transactional guarantees in exchange for read speed on analytical queries. If that matches your workload, nothing else comes close. If it doesn't, you'll fight the system at every turn.


If you're running ClickHouse in production, the distributed setup is where the real complexity lives. I wrote about that separately in The Hard Truth About Self-Hosting ClickHouse® at Scale.
