B-trees vs LSM Trees: A Deep Dive for System Designers
If you’re designing systems that need to store and retrieve billions of records efficiently, understanding B-trees and LSM trees is essential. These two data structures power most modern databases, but they make fundamentally different trade-offs. Let’s build your intuition from the ground up.
The Core Problem: Finding a Needle in a Billion Haystacks
Before we dive into B-trees and LSM trees, we need to understand the problem they’re solving. Imagine you have 1 billion user records stored on disk. When someone logs in, you need to find their specific record. How do you do that quickly?
Why This Is Hard: The Disk Speed Problem
Here’s the critical constraint that shapes everything in database design:
Reading from disk is roughly 100,000x slower than reading from RAM.
Let me put that in perspective:
- RAM access: ~100 nanoseconds
- SSD access: ~100 microseconds (1000x slower)
- Hard drive access: ~10 milliseconds (100,000x slower)
If you could read from RAM in the time it takes to blink (100ms), reading from a hard drive would take 2.7 hours!
This massive speed difference means that the number of disk accesses matters more than almost anything else.
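If you want to sanity-check that scaling yourself, here is a quick Python sketch using the rounded latency figures above (real numbers vary widely by hardware):

```python
# Scale the latencies above so that one RAM access takes one "blink" (100 ms).
ram_ns, ssd_ns, hdd_ns = 100, 100_000, 10_000_000   # ~100 ns, ~100 µs, ~10 ms
blink_ms = 100

for name, ns in [("RAM", ram_ns), ("SSD", ssd_ns), ("HDD", hdd_ns)]:
    scaled_s = blink_ms * (ns / ram_ns) / 1000       # seconds at "blink scale"
    print(f"{name}: {scaled_s:,.1f} s at blink scale ({ns // ram_ns:,}x RAM)")
# RAM: 0.1 s, SSD: 100 s (~1.7 minutes), HDD: 10,000 s (~2.8 hours)
```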
The Index: Our First Attempt
Your first instinct might be: “Let’s create an index!” Good thinking. An index is a lookup structure that maps keys to their disk locations.
Scenario: Building a Simple Index
Let’s say you have 1 billion user records:
- Each entry needs: User ID (8 bytes) + disk location pointer (8 bytes)
- Total index size: 16 bytes × 1 billion = 16 GB
Now we face a new problem: where do we store this 16 GB index?
Option 1: Keep it all in RAM
- Works great for small datasets
- For 1 million users = 16 MB index ✓
- For 1 billion users = 16 GB index (uses half your server’s RAM!)
- For 100 billion records = 1.6 TB index ✗ (doesn’t fit in RAM)
Option 2: Store the index on disk
- But now we’ve just recreated the problem!
- We need to search through a 16 GB file on disk
- Scanning it naively means millions of disk reads; even binary search over a sorted file still costs ~30 scattered reads
The Key Insight
What we need is a structure that:
- Keeps the “hot” parts in RAM
- Stores the rest on disk
- Organizes the disk part so efficiently that we only need 3-4 disk reads to find anything
This is where B-trees come in.
B-trees: The Phone Book Approach
Understanding B-trees Through Analogy
Think about an old-fashioned phone book with a million names. You don’t read every page to find “Rodriguez”. Instead:
- Top level: Open to the middle - see sections “A-F | G-M | N-S | T-Z”
- Middle level: You want “R”, so look under “N-S”, which splits into “N | O | P-Q | R | S”
- Bottom level: Under “R” you find actual names: “Ramirez, Robertson, Rodriguez, Rogers…”
You only flipped to 3-4 pages total, not all 1000 pages. That’s the B-tree idea.
Why Not Just Use a Binary Tree?
You might think: “Why not use a binary tree? That’s a classic computer science structure!”
Here’s the problem:
Binary Tree:
- Each node has 2 children
- For 1 billion items, tree height ≈ 30 (log₂ 1 billion)
- Each level requires one disk read
- Total: 30 disk reads per lookup 😞
Remember, disk reads are the enemy. 30 disk reads is way too slow.
B-trees: Fat Nodes for Fewer Hops
B-trees use a clever trick: instead of each node having 2 children, each node has hundreds or thousands of children.
How it works:
- Each node contains many keys (say, 100 keys) stored in sorted order
- Each node has 101 possible children to follow
- Each node is sized to fit exactly one disk page (typically 4 KB)
The math changes dramatically:
- For 1 billion items with nodes of 100 keys each
- Tree height ≈ 4-5 (log₁₀₀ of 1 billion ≈ 4.5)
- Total: 3-4 disk reads per lookup, because the root and the next level down are almost always cached in RAM ✓ (see the quick check below)
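Here is a back-of-the-envelope check of those heights in Python; the 100-keys-per-node fanout is the rough assumption from the list above:

```python
import math

N = 1_000_000_000        # records
binary_fanout = 2
btree_fanout = 100       # roughly 100 keys per node, as assumed above

def levels(n, fanout):
    """Smallest height h such that fanout**h >= n."""
    return math.ceil(math.log(n, fanout))

print(levels(N, binary_fanout))  # 30 -> ~30 disk reads for a binary tree on disk
print(levels(N, btree_fanout))   # 5  -> 3-4 disk reads once the top levels are cached in RAM
```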
Walking Through a B-tree Lookup
Let’s make this concrete. You’re looking for user_id = 58392.
Step 1: Read root node from disk → RAM
Root: [10000, 20000, 30000, 40000, 50000, 60000, 70000, ...]
↓ ↓ ↓ ↓ ↓ ↓ ↓
ptr1 ptr2 ptr3 ptr4 ptr5 ptr6 ptr7
Binary search within this node (fast - it’s in RAM!):
- 58392 is between 50000 and 60000
- Follow ptr5 to the next level
Step 2: Read that child node from disk → RAM
Node: [50100, 51000, 52000, ..., 58000, 59000, ...]
↓ ↓ ↓ ↓ ↓
ptr1 ptr2 ptr3 ptr58 ptr59
Binary search within this node:
- 58392 is between 58000 and 59000
- Follow ptr58
Step 3: Read leaf node from disk → RAM
Leaf: [58001, 58050, 58200, 58392, 58500, ...]
data1 data2 data3 data4 data5
Found it! This leaf also contains a pointer to where the actual user data lives.
Total: 3 disk reads to find one record among billions.
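Here is a minimal Python sketch of that descent. The dict-based nodes and the `read_node` helper are hypothetical in-memory stand-ins for 4 KB disk pages, not a real storage engine:

```python
from bisect import bisect_right

# Hypothetical stand-ins for disk pages: internal nodes hold sorted keys plus child
# ids; leaves hold sorted keys plus pointers to the actual records.
internal = {"root": {"keys": [10000, 20000, 30000], "children": ["c0", "c1", "c2", "c3"]}}
leaves = {"c2": {"keys": [20010, 25000, 28000], "values": ["rec-a", "rec-b", "rec-c"]}}

def read_node(node_id):
    # In a real B-tree this is one disk read that pulls a whole page into RAM.
    return internal.get(node_id) or leaves[node_id]

def btree_lookup(root_id, key):
    node = read_node(root_id)
    while "children" in node:                    # descend through the internal levels
        idx = bisect_right(node["keys"], key)    # binary search inside the node (it is in RAM)
        node = read_node(node["children"][idx])  # one more "disk read" per level
    i = bisect_right(node["keys"], key) - 1      # now at a leaf
    if i >= 0 and node["keys"][i] == key:
        return node["values"][i]
    return None

print(btree_lookup("root", 25000))   # 'rec-b', found after 2 simulated disk reads
```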
B-tree Writes: Maintaining Order
Reads are straightforward, but what about writes? This is where B-trees get interesting.
Scenario: Inserting a New User
You want to insert user_id = 58393 into our B-tree.
Step 1: Find the right location
Follow the same path as a read:
- Root → choose path for 50000-60000
- Middle → choose path for 58000-59000
- Leaf → arrive at:
[58001, 58050, 58200, 58392, 58500]
Step 2: Insert in sorted order
The new key 58393 belongs between 58392 and 58500:
Before: [58001, 58050, 58200, 58392, 58500]
After: [58001, 58050, 58200, 58392, 58393, 58500]
But wait - what if the node is already full?
Handling Node Overflow: The Split
Let’s say each node can hold a maximum of 5 keys (in reality it’s hundreds, but 5 keeps the example clear).
Our node is full:
[58100, 58200, 58300, 58400, 58500] ← can't add more!
We need to insert 58393. Here’s what happens:
Split the node:
- Take the middle key (58300)
- Split into two nodes:
  - Left: [58100, 58200]
  - Right: [58400, 58500]
- Promote 58300 to the parent
Now insert 58393:
- It belongs in the right node (58393 > 58300)
- Right node becomes: [58393, 58400, 58500]
Visual representation:
Before split:
Leaf: [58100, 58200, 58300, 58400, 58500] FULL!
After split:
Parent gets: [..., 58300, ...]
↓
┌───────────┴───────────┐
[58100, 58200] [58400, 58500]
After inserting 58393:
Parent: [..., 58300, ...]
↓
┌───────┴────────┐
[58100, 58200] [58393, 58400, 58500]
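A toy version of the split, mirroring the walkthrough above (split the full leaf around its middle key, promote it, then insert into the correct half). The 5-key limit and list-based nodes are illustrative assumptions, not how a production B-tree manages pages:

```python
from bisect import insort

MAX_KEYS = 5   # toy value from the example; real nodes hold hundreds of keys

def split_full_leaf(keys):
    """Split a full leaf around its middle key.
    Returns (left_keys, promoted_key, right_keys)."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

leaf = [58100, 58200, 58300, 58400, 58500]      # full: can't add more
left, promoted, right = split_full_leaf(leaf)    # ([58100, 58200], 58300, [58400, 58500])

new_key = 58393
target = left if new_key < promoted else right   # 58393 > 58300 -> right node
insort(target, new_key)                          # keep the node sorted
print(left, promoted, right)   # [58100, 58200] 58300 [58393, 58400, 58500]
```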
When Splits Propagate Upward
Here’s where it gets more interesting. What if the parent node is also full?
Scenario: Cascading splits
Parent before promotion:
[50000, 55000, 60000, 65000, 70000] ← already full!
We need to insert 58300, but there’s no room. Solution: split the parent too!
Split parent:
Left: [50000, 55000]
Right: [65000, 70000]
Promote: 60000 to grandparent
Now insert 58300 into the left half (58300 < 60000):
Left half: [50000, 55000, 58300] ✓
The split propagates upward like a ripple through the tree.
Growing the Tree: Root Split
Key insight: What happens when even the root is full?
The root has no parent to promote to. Solution: create a new root!
Before:
Root: [20000, 40000, 60000, 80000] FULL
(needs to split)
After:
New Root: [60000]
↓
┌──────┴──────┐
[20000, 40000] [80000]
(old root left) (old root right)
This is the ONLY way a B-tree grows taller: from the bottom up, not the top down. Because every leaf-to-root path gains a level at the same moment, the tree stays perfectly balanced.
B-tree Characteristics Summary
Strengths:
- Fast reads: 3-4 disk accesses for any record
- Data stays sorted (great for range queries)
- Predictable performance
- Good for read-heavy workloads
Weaknesses:
- Writes require finding the right location (3-4 disk reads)
- Then writing the modified node back (1+ disk writes)
- Splits require additional writes
- Not great for write-heavy workloads
Real-world usage:
- PostgreSQL, MySQL (InnoDB)
- SQLite
- Most traditional relational databases
LSM Trees: The Append-Only Approach
Now let’s look at a completely different approach to the same problem. LSM trees (Log-Structured Merge trees) optimize for writes at the expense of reads.
The Cashier Analogy
Imagine you’re a cashier at a busy store during the holiday rush. Customers keep handing you receipts to file.
Approach A: Perfect Organization (B-tree)
- Every receipt gets filed in the exact right spot immediately
- Open the filing cabinet, find the folder, insert in sorted order
- Very organized, but slow when you’re swamped with customers
Approach B: Pile It Up (Bad idea)
- Just throw receipts in a pile on your desk
- Super fast to “file”! But impossible to find anything later
Approach C: Batch Processing (LSM tree)
- During rush hours: throw receipts in a box (fast!)
- During quiet times: sort the box and merge into filing cabinet
- When someone needs a receipt: check today’s box first, then yesterday’s box, then the filing cabinet
Approach C trades immediate organization for write speed. This is LSM trees.
LSM Tree Structure
An LSM tree has multiple levels of storage:
Level 0: MemTable (RAM)
- New writes go here first
- It’s a sorted tree structure in memory (like a red-black tree)
- Writes are blazingly fast - just memory operations
Level 1-N: SSTables on Disk (Sorted String Tables)
- When MemTable fills up, flush it to disk as an SSTable
- SSTables are immutable - never modified after creation
- Each SSTable is a sorted file
- Multiple SSTables accumulate over time
LSM Tree Operations
Writing to an LSM Tree
Scenario: High-frequency writes
Imagine you’re ingesting 100,000 sensor readings per second:
Write sensor_id=58393, value=42.3
→ Insert into MemTable (in-memory sorted tree)
→ Done! (microseconds)
Write sensor_id=61250, value=38.7
→ Insert into MemTable
→ Done! (microseconds)
... 100,000 more writes ...
MemTable full (say, 64 MB)
→ Flush entire MemTable to disk as SSTable-7
→ Sequential write to disk (very fast!)
→ Create new empty MemTable
→ Continue accepting writes
Why this is fast:
- Writes to memory are instant
- Disk writes are sequential (much faster than random writes)
- No need to find the “right spot” - just append
- No read operations needed
Compare this to B-tree:
- Find the right leaf (3-4 disk reads)
- Modify the leaf (1 disk write)
- Maybe split nodes (more disk writes)
LSM tree writes can be 10-100x faster than B-tree writes.
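The write path can be sketched in a few lines of Python. The `TinyLSMWriter` class, its tiny 4-entry flush threshold, and the JSON file format are illustrative assumptions, not any particular engine's implementation:

```python
import json
import os
import tempfile

MEMTABLE_LIMIT = 4          # toy threshold; real engines flush at e.g. 64 MB

class TinyLSMWriter:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.memtable = {}                      # key -> value, sorted only at flush time
        self.sstable_count = 0

    def put(self, key, value):
        self.memtable[key] = value              # just a memory write: microseconds
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # Write the keys in sorted order as one sequential file, then start fresh.
        path = os.path.join(self.data_dir, f"sstable-{self.sstable_count}.json")
        with open(path, "w") as f:
            json.dump(sorted(self.memtable.items()), f)
        self.sstable_count += 1
        self.memtable = {}

writer = TinyLSMWriter(tempfile.mkdtemp())
for sensor_id, value in [(58393, 42.3), (61250, 38.7), (10001, 7.0), (20002, 9.9)]:
    writer.put(sensor_id, value)                # the 4th put triggers a flush to sstable-0.json
```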
Reading from an LSM Tree
Here’s the trade-off. Reads are more complex:
Scenario: Looking up sensor_id=58393
Step 1: Check MemTable (RAM)
→ Binary search through in-memory tree
→ Not found
Step 2: Check SSTable-7 (newest on disk)
→ Use Bloom filter: "might be here"
→ Read index, then data block
→ Not found
Step 3: Check SSTable-6
→ Bloom filter: "might be here"
→ Read index, then data block
→ Not found
Step 4: Check SSTable-5
→ Bloom filter: "definitely not here" → Skip!
Step 5: Check SSTable-4
→ Bloom filter: "might be here"
→ Read index, then data block
→ Found it! ✓
The problem: Read Amplification
You might have to check multiple SSTables before finding your data (or confirming it doesn’t exist). This is called read amplification.
If you have 10 SSTables, worst case you read from 10 places. Compare this to B-tree’s guaranteed 3-4 reads.
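A minimal read-path sketch, assuming the MemTable is a plain dict and each SSTable is a sorted list already sitting in memory (a real engine would consult Bloom filters and read index and data blocks from disk instead):

```python
from bisect import bisect_left

def lsm_get(key, memtable, sstables_newest_first):
    if key in memtable:                        # Step 1: check RAM
        return memtable[key]
    for table in sstables_newest_first:        # Steps 2..N: newest SSTable first
        keys = [k for k, _ in table]
        i = bisect_left(keys, key)             # binary search, since each SSTable is sorted
        if i < len(keys) and keys[i] == key:
            return table[i][1]
    return None                                # not found anywhere

memtable = {61250: 38.7}
sstables = [
    [(45000, 1.0), (58000, 2.0)],              # newest
    [(20000, 3.0), (58393, 42.3)],             # older; holds the key we want
]
print(lsm_get(58393, memtable, sstables))      # 42.3, found in the second SSTable checked
```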
Compaction: Keeping Reads Fast
To prevent reads from getting too slow, LSM trees periodically compact SSTables.
Compaction process:
Before compaction:
SSTable-7: [45000, 47000, 58000, 61000]
SSTable-6: [20000, 30000, 58000, 60000] ← duplicate 58000!
SSTable-5: [10000, 25000, 40000, 58000] ← duplicate 58000!
Run compaction (merge + deduplicate):
SSTable-8: [10000, 20000, 25000, 30000, 40000, 45000,
47000, 58000, 60000, 61000]
Delete SSTable-5, SSTable-6, SSTable-7
Now reads have fewer SSTables to check: one instead of three in this example!
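Compaction itself is essentially a k-way merge with newest-wins deduplication. The list-based SSTables and the `compact` helper below are illustrative, not a real compaction implementation:

```python
from heapq import merge

def compact(sstables_newest_first):
    """Merge several sorted SSTables into one sorted run, keeping only the newest
    value when the same key appears in more than one table."""
    def tag(table, age):
        for key, value in table:           # age 0 = newest table, 1 = next oldest, ...
            yield key, age, value

    runs = [tag(t, age) for age, t in enumerate(sstables_newest_first)]
    merged, seen = [], set()
    for key, age, value in merge(*runs):   # streams the sorted runs in (key, age) order
        if key not in seen:                # first occurrence has the smallest age = newest copy
            merged.append((key, value))
            seen.add(key)
    return merged

sstable_7 = [(45000, "g"), (58000, "new")]   # newest
sstable_6 = [(20000, "b"), (58000, "old")]   # older, with a duplicate 58000
print(compact([sstable_7, sstable_6]))
# [(20000, 'b'), (45000, 'g'), (58000, 'new')]  -- one sorted run, duplicate resolved
```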
Compaction strategies:
- Size-Tiered Compaction
  - Merge SSTables of similar size
  - Example: Four 10 MB tables → One 40 MB table
- Leveled Compaction
  - Organize into levels with size limits
  - Level 0: 10 MB total
  - Level 1: 100 MB total
  - Level 2: 1 GB total
  - Merge down levels as they fill up
The trade-off:
- Compaction uses CPU and I/O resources
- But it keeps read performance from degrading
- More aggressive compaction = faster reads, more write overhead
- Less compaction = slower reads, less write overhead
Optimizations: Bloom Filters
LSM trees use Bloom filters to avoid unnecessary disk reads.
What’s a Bloom filter?
A space-efficient probabilistic data structure that answers:
- “Is X definitely NOT in this SSTable?” → 100% accurate
- “Is X possibly in this SSTable?” → might be (check to confirm)
Example:
Looking for key=58393
SSTable-5 Bloom filter:
→ Bloom filter says: "Definitely not here"
→ Skip reading this SSTable entirely! ✓
SSTable-4 Bloom filter:
→ Bloom filter says: "Might be here"
→ Read the SSTable to check (and it is there!)
Why this matters:
Without Bloom filters, you’d read every SSTable from disk. With Bloom filters, you can skip most of them. This dramatically improves read performance.
A Bloom filter for 1 million keys takes only ~1.2 MB of RAM (with 1% false positive rate).
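A toy Bloom filter, sized with the standard formula m = -n·ln(p)/(ln 2)², shows where that ~1.2 MB figure comes from. The salted-SHA-256 hashing here is just one simple choice, not what any particular database uses:

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n_keys, fp_rate=0.01):
        # m = -n * ln(p) / (ln 2)^2 bits; for 1M keys at 1% this is ~9.6M bits (~1.2 MB)
        self.m = int(-n_keys * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, round(self.m / n_keys * math.log(2)))   # optimal hash count (~7)
        self.bits = bytearray(self.m // 8 + 1)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter(n_keys=1_000_000)
bf.add(58393)
print(bf.might_contain(58393))   # True: "might be here" -> read the SSTable
print(bf.might_contain(99999))   # almost certainly False: "definitely not here" -> skip it
```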
LSM Tree Characteristics Summary
Strengths:
- Extremely fast writes (10-100x faster than B-trees)
- Excellent for write-heavy workloads
- Sequential disk writes (SSD-friendly)
- Good compression (entire files can be compressed)
Weaknesses:
- Slower reads (must check multiple SSTables)
- Read amplification in worst cases
- Write amplification during compaction
- More complex to tune and maintain
Real-world usage:
- Apache Cassandra
- RocksDB (used by Facebook, LinkedIn)
- LevelDB (used by Bitcoin, Minecraft)
- Apache HBase
- ScyllaDB
B-trees vs LSM Trees: Head-to-Head Comparison
Performance Characteristics
| Operation | B-tree | LSM Tree |
|---|---|---|
| Writes | Moderate (3-4 reads + 1+ writes) | Very fast (memory write + occasional flush) |
| Reads | Fast (3-4 disk reads, predictable) | Variable (depends on # of SSTables) |
| Range queries | Excellent (data is sorted) | Good (must merge from multiple SSTables) |
| Space overhead | Low (minimal wasted space) | Medium (multiple copies during compaction) |
| Write amplification | Low to medium (splits) | High (compaction rewrites data) |
| Read amplification | Low (3-4 reads) | High (check multiple SSTables) |
When to Choose B-trees
Choose B-trees when:
- Read-heavy workload (10:1 or higher read:write ratio)
- Predictable performance is crucial
- Point queries are your primary access pattern
- Range scans are common
- ACID transactions are needed (easier with B-trees)
Example scenarios:
- Traditional web applications (lots of user lookups)
- Inventory management systems
- Banking systems (balance lookups)
- Product catalogs
When to Choose LSM Trees
Choose LSM trees when:
- Write-heavy workload (1:1 or higher write:read ratio)
- Append-only or time-series data
- Ingestion speed is critical
- SSD storage (sequential writes benefit SSDs)
- Large datasets where compression matters
Example scenarios:
- Time-series databases (metrics, logs, IoT sensors)
- Event streaming systems
- Analytics databases
- Social media feeds (constant updates)
- Monitoring systems
Time-Series Databases: A Perfect LSM Use Case
Time-series data has special characteristics that make LSM trees ideal:
Why Time-Series Loves LSM
Characteristic 1: Write-heavy
- Millions of data points per second (sensor readings, metrics, logs)
- Writes far outnumber reads
- LSM’s fast writes shine here
Characteristic 2: Time-ordered
- Data naturally arrives in time order
- LSM trees already write sequentially
- Perfect match!
Characteristic 3: Immutable recent data
- Recent data is rarely updated (append-only)
- Fits LSM’s immutable SSTable model
- No need for in-place updates
Characteristic 4: Retention policies
- Old data gets deleted in bulk
- Can delete entire old SSTables
- Much easier than deleting from B-tree
Time-Series LSM Optimization
Scenario: IoT sensor network
1 million sensors × 1 reading per second = 1 million writes/sec
LSM tree optimization:
Partitioning by time:
SSTable-2024-10-07-14:00: All data for 2-3 PM
SSTable-2024-10-07-13:00: All data for 1-2 PM
SSTable-2024-10-07-12:00: All data for 12-1 PM
Query: "Get sensor 58393 data from 2:15-2:30 PM"
→ Only check SSTable-2024-10-07-14:00
→ Avoid reading other time ranges
Retention: "Delete data older than 30 days"
→ Delete entire SSTables older than 30 days
→ No per-record deletion needed
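Here is a small sketch of that idea: name each SSTable after the hour it covers, then enforce retention by dropping whole files. The naming scheme and helper functions are hypothetical:

```python
from datetime import datetime, timedelta, timezone

def sstable_name(ts: datetime) -> str:
    return ts.strftime("sstable-%Y-%m-%d-%H")           # one SSTable per hour of data

def expired(name: str, now: datetime, retention_days: int = 30) -> bool:
    bucket = datetime.strptime(name, "sstable-%Y-%m-%d-%H").replace(tzinfo=timezone.utc)
    return now - bucket > timedelta(days=retention_days)

now = datetime(2024, 10, 7, 14, tzinfo=timezone.utc)
tables = [sstable_name(now), sstable_name(now - timedelta(days=45))]
print([t for t in tables if expired(t, now)])
# ['sstable-2024-08-23-14'] -> delete this file wholesale, no per-record deletes needed
```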
Additional optimizations:
- Compress old SSTables aggressively (Zstandard, LZ4)
- Use different compaction for recent vs old data
- Keep recent data in faster storage (SSD), old data in slower storage (HDD)
Real-world Time-Series Databases
InfluxDB (LSM-based):
- Handles billions of data points
- Optimized for time-range queries
- Custom TSM (Time-Structured Merge tree) storage engine
Prometheus (LSM-inspired):
- Metrics monitoring
- 2-hour blocks written as immutable chunks
- Compacts blocks over time
TimescaleDB (Hybrid):
- Built on PostgreSQL (B-tree)
- But uses hypertables (time-based partitioning)
- Combines B-tree reliability with time-series optimizations
Deep Dive: Write Amplification
Both B-trees and LSM trees suffer from write amplification - writing more data than the application requested. Understanding this helps you tune your database.
B-tree Write Amplification
Scenario: Insert one 100-byte record
- Read leaf node (4 KB) into memory
- Modify it to add the record
- Write entire node back (4 KB write)
Write amplification: 4 KB / 100 bytes = 40x
If the insert causes splits:
- Write left split node (4 KB)
- Write right split node (4 KB)
- Write modified parent node (4 KB)
Write amplification: 12 KB / 100 bytes = 120x
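The same arithmetic in Python, using the 100-byte record and 4 KB page from the scenario (the prose above rounds 4 KB down to 4,000 bytes, which is why it quotes 40x and 120x):

```python
# Write amplification = bytes physically written / bytes the application asked to write.
record_bytes, page_bytes = 100, 4096           # 100-byte record, 4 KB page

plain_insert = page_bytes / record_bytes       # rewrite a single leaf page
with_split = 3 * page_bytes / record_bytes     # left leaf + right leaf + parent page
print(f"{plain_insert:.0f}x, {with_split:.0f}x")   # ~41x, ~123x
```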
LSM Tree Write Amplification
Scenario: Insert one 100-byte record
Phase 1 (MemTable → SSTable):
- Write to MemTable (100 bytes)
- When MemTable flushes → write 64 MB SSTable
Phase 2 (Compaction Level 0 → 1):
- Merge 4 SSTables (256 MB total) → write 256 MB
Phase 3 (Compaction Level 1 → 2):
- Merge into level 2 → write 2.5 GB
Total write amplification: Can reach 50-100x over the lifetime of data
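One rough way to estimate that figure under leveled compaction: each byte is rewritten about once per level it descends through, multiplied by the per-level fanout. The level count and fanout below are illustrative assumptions, not measured values:

```python
# Rough leveled-compaction estimate; both numbers below are assumptions.
on_disk_levels = 4     # levels beyond the initial MemTable flush
level_fanout = 10      # each level is ~10x larger than the one above it

write_amplification = 1 + on_disk_levels * level_fanout   # the 1 covers the initial flush
print(write_amplification)   # 41 -> the same ballpark as the 50-100x quoted above
```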
Mitigating Write Amplification
For B-trees:
- Use larger nodes (reduce splits)
- Batch writes when possible
- Use write-ahead log (WAL) for durability
For LSM trees:
- Tune compaction strategy
- Use leveled compaction (less amplification than size-tiered)
- Delay compaction during peak write times
- Adjust compaction triggers
Practical System Design Scenarios
Scenario 1: Social Media Feed
Requirements:
- 100 million active users
- Average 10 posts per day per user = 1 billion writes/day
- Average user checks feed 50 times/day = 5 billion reads/day
- Recent posts are most accessed
Analysis:
- Read-heavy: 5:1 read:write ratio
- But writes are very bursty
- Recent data needs fast access
- Old posts rarely accessed
Recommended approach: Hybrid
- LSM tree for write path (handle burst writes)
- Cache recent posts in memory
- B-tree indexes for user lookups
- Archive old data to cold storage
Scenario 2: Banking Transactions
Requirements:
- Account balance lookups (very frequent)
- Transaction inserts (less frequent)
- ACID guarantees required
- Consistent read performance critical
Analysis:
- Read-heavy
- Strong consistency needed
- Predictable latency critical
- Can’t afford read amplification
Recommended approach: B-tree
- PostgreSQL or similar
- B-tree indexes on account_id
- WAL for durability
- Replication for read scaling
Scenario 3: Application Monitoring
Requirements:
- 10,000 servers × 100 metrics/sec = 1 million writes/sec
- Query patterns: time-range aggregations
- Retention: 90 days, then delete
- Reads are dashboards and alerts
Analysis:
- Extremely write-heavy
- Time-series data
- Bulk deletes needed
- Aggregations over time ranges
Recommended approach: LSM Tree (Time-Series DB)
- InfluxDB or TimescaleDB
- Partition by time
- Compress old data
- Delete old SSTables wholesale
Scenario 4: E-commerce Product Catalog
Requirements:
- 10 million products
- Frequent searches by category, price, name
- Occasional inventory updates
- Range queries (price ranges)
Analysis:
- Read-dominated
- Complex queries (multiple indexes)
- Range scans common
- ACID transactions helpful
Recommended approach: B-tree
- PostgreSQL with multiple indexes
- B-tree on product_id (primary key)
- B-tree on category, price
- Full-text search index for names
Conclusion: Choosing Your Data Structure
The choice between B-trees and LSM trees isn’t about which is “better” - it’s about understanding your workload:
Choose B-trees when:
- Reads dominate your workload
- You need predictable, consistent latency
- ACID transactions are important
- Traditional relational database features matter
Choose LSM trees when:
- Writes dominate your workload
- You’re willing to trade read latency for write throughput
- Your data is append-mostly or time-series
- You can benefit from compression
Consider hybrid approaches when:
- You have complex requirements
- Different parts of your system have different needs
- You can cache hot data to mitigate read amplification
Remember: modern databases often use both ideas in different ways. PostgreSQL (primarily B-tree) relies on a write-ahead log, which is log-structured. RocksDB (LSM) keeps sorted index blocks inside each SSTable, so lookups within a single file work much like a small tree search. Understanding both helps you make better decisions.
Further Learning
Books:
- “Designing Data-Intensive Applications” by Martin Kleppmann
- “Database Internals” by Alex Petrov
Papers:
- “The Log-Structured Merge-Tree (LSM-Tree)” - O’Neil et al. (1996)
- “The Ubiquitous B-Tree” - Douglas Comer (1979)
Open Source:
- Study RocksDB source code (excellent LSM implementation)
- Study PostgreSQL B-tree implementation
- Experiment with both using benchmark tools
The best way to truly understand these structures is to use them, measure them, and see their trade-offs in action. Happy building!