B-trees vs LSM Trees: A Deep Dive for System Designers

18 min read
#databases #data-structures #system-design #b-trees #lsm-trees

If you’re designing systems that need to store and retrieve billions of records efficiently, understanding B-trees and LSM trees is essential. These two data structures power most modern databases, but they make fundamentally different trade-offs. Let’s build your intuition from the ground up.

The Core Problem: Finding a Needle in a Billion Haystacks

Before we dive into B-trees and LSM trees, we need to understand the problem they’re solving. Imagine you have 1 billion user records stored on disk. When someone logs in, you need to find their specific record. How do you do that quickly?

Why This Is Hard: The Disk Speed Problem

Here’s the critical constraint that shapes everything in database design:

Reading from disk is roughly 100,000x slower than reading from RAM.

Let me put that in perspective:

If you could read from RAM in the time it takes to blink (100ms), reading from a hard drive would take 2.7 hours!

This massive speed difference means that the number of disk accesses matters more than almost anything else.

The Index: Our First Attempt

Your first instinct might be: “Let’s create an index!” Good thinking. An index is a lookup structure that maps keys to their disk locations.

Scenario: Building a Simple Index

Let’s say you have 1 billion user records. Each index entry needs roughly 16 bytes - an 8-byte key plus an 8-byte pointer to the record’s location on disk - so the index alone is 1 billion × 16 bytes = 16 GB.

Now we face a new problem: where do we store this 16 GB index?

Option 1: Keep it all in RAM. 16 GB of RAM dedicated to a single index is expensive, and the index only grows with the data.

Option 2: Store the index on disk. Now every index lookup costs disk reads of its own - the very problem the index was supposed to solve.

The Key Insight

What we need is a structure that:

  1. Keeps the “hot” parts in RAM
  2. Stores the rest on disk
  3. Organizes the disk part so efficiently that we only need 3-4 disk reads to find anything

This is where B-trees come in.

B-trees: The Phone Book Approach

Understanding B-trees Through Analogy

Think about an old-fashioned phone book with a million names. You don’t read every page to find “Rodriguez”. Instead:

  1. Top level: Open to the middle - see sections “A-F | G-M | N-S | T-Z”
  2. Middle level: You want “R”, so look under “N-S”, which splits into “N | O | P-Q | R | S”
  3. Bottom level: Under “R” you find actual names: “Ramirez, Robertson, Rodriguez, Rogers…”

You only flipped to 3-4 pages total, not all 1000 pages. That’s the B-tree idea.

Why Not Just Use a Binary Tree?

You might think: “Why not use a binary tree? That’s a classic computer science structure!”

Here’s the problem:

Binary Tree: each node has only 2 children, so 1 billion keys produce a tree about log₂(10⁹) ≈ 30 levels deep. With nodes scattered across the disk, one lookup can cost up to 30 disk reads.

Remember, disk reads are the enemy. 30 disk reads is way too slow.

B-trees: Fat Nodes for Fewer Hops

B-trees use a clever trick: instead of each node having 2 children, each node has hundreds or thousands of children.

How it works: each node fills an entire disk page (typically 4-16 KB) and holds hundreds of sorted keys with pointers to children. One disk read loads a whole node, and a fast in-RAM binary search picks which child to visit next.

The math changes dramatically: with ~1,000 keys per node, the tree depth drops from log₂(10⁹) ≈ 30 to log₁₀₀₀(10⁹) = 3.
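
To see the difference concretely, here is a quick back-of-the-envelope check (a sketch; the 1,000-keys-per-node branching factor is an illustrative assumption):

```python
import math

keys = 1_000_000_000

# Binary tree: 2 children per node, so depth ~= log2(N).
print(math.log2(keys))       # ~29.9 -> about 30 disk reads per lookup

# B-tree: ~1,000 keys per node, so depth ~= log_1000(N).
print(math.log(keys, 1000))  # ~3.0  -> about 3 disk reads per lookup
```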

Walking Through a B-tree Lookup

Let’s make this concrete. You’re looking for user_id = 58392.

Step 1: Read root node from disk → RAM

Root: [10000, 20000, 30000, 40000, 50000, 60000, 70000, ...]
       ↓      ↓      ↓      ↓      ↓      ↓      ↓
     ptr1   ptr2   ptr3   ptr4   ptr5   ptr6   ptr7

Binary search within this node (fast - it’s in RAM!): 58392 falls between 50000 and 60000, so follow the pointer for that range.

Step 2: Read that child node from disk → RAM

Node: [50100, 51000, 52000, ..., 58000, 59000, ...]
       ↓      ↓      ↓           ↓      ↓
     ptr1   ptr2   ptr3       ptr58   ptr59

Binary search within this node: 58392 falls between 58000 and 59000, so follow ptr58 to the leaf.

Step 3: Read leaf node from disk → RAM

Leaf: [58001, 58050, 58200, 58392, 58500, ...]
       data1  data2  data3  data4  data5

Found it! This leaf also contains a pointer to where the actual user data lives.

Total: 3 disk reads to find one record among billions.
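
Here is a minimal in-memory sketch of that lookup logic. The Node class and its fields are hypothetical, and where a real B-tree would issue a disk read per level, this sketch just follows Python references:

```python
from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted keys within this node
        self.children = children  # child nodes (internal nodes only)
        self.values = values      # record pointers (leaf nodes only)

def search(node, key):
    # Each loop iteration = one node visited = one disk read in a real B-tree.
    while node.children is not None:
        i = bisect_right(node.keys, key)  # binary search inside the node (RAM)
        node = node.children[i]           # descend ("read child from disk")
    i = bisect_right(node.keys, key) - 1  # leaf: locate the exact key
    if i >= 0 and node.keys[i] == key:
        return node.values[i]
    return None

# Tiny two-level tree mirroring the walkthrough above.
leaf = Node(keys=[58001, 58050, 58200, 58392, 58500],
            values=["data1", "data2", "data3", "data4", "data5"])
root = Node(keys=[58001], children=[Node(keys=[], values=[]), leaf])
print(search(root, 58392))  # -> data4
```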

B-tree Writes: Maintaining Order

Reads are straightforward, but what about writes? This is where B-trees get interesting.

Scenario: Inserting a New User

You want to insert user_id = 58393 into our B-tree.

Step 1: Find the right location

Follow the same path as a read: root → internal node → leaf. You land in the leaf [58001, 58050, 58200, 58392, 58500].

Step 2: Insert in sorted order

The new key 58393 belongs between 58392 and 58500:

Before: [58001, 58050, 58200, 58392, 58500]
After:  [58001, 58050, 58200, 58392, 58393, 58500]

But wait - what if the node is already full?

Handling Node Overflow: The Split

Let’s say each node can hold a maximum of 5 keys (in reality it’s hundreds, but 5 makes the example clearer).

Our node is full:

[58100, 58200, 58300, 58400, 58500] ← can't add more!

We need to insert 58393. Here’s what happens:

Split the node: break it into two halves and promote the middle key (58300) into the parent.

Now insert 58393: it is greater than the promoted 58300, so it goes into the right half.

Visual representation:

Before split:
  Leaf: [58100, 58200, 58300, 58400, 58500] FULL!

After split:
  Parent gets: [..., 58300, ...]

          ┌───────────┴───────────┐
    [58100, 58200]         [58400, 58500]

After inserting 58393:
  Parent: [..., 58300, ...]

         ┌───────┴────────┐
    [58100, 58200]  [58393, 58400, 58500]
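
Here is a small sketch of the split step under the toy 5-key limit (the function name and list representation are illustrative - real implementations split fixed-size disk pages, not Python lists):

```python
from bisect import insort

def split_full_leaf(leaf):
    """Split a full leaf into (left half, promoted middle key, right half)."""
    mid = len(leaf) // 2
    return leaf[:mid], leaf[mid], leaf[mid + 1:]

leaf = [58100, 58200, 58300, 58400, 58500]  # full: no room for 58393
left, promoted, right = split_full_leaf(leaf)
# left = [58100, 58200], promoted = 58300 (goes to parent), right = [58400, 58500]

insort(right, 58393)  # 58393 > 58300, so it belongs in the right half
print(left, promoted, right)
# [58100, 58200] 58300 [58393, 58400, 58500]
```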

When Splits Propagate Upward

Here’s where it gets more interesting. What if the parent node is also full?

Scenario: Cascading splits

Parent before promotion:

[50000, 55000, 60000, 65000, 70000] ← already full!

We need to promote 58300 into this parent (from a child split below), but there’s no room. Solution: split the parent too!

Split parent:
  Left: [50000, 55000]
  Right: [65000, 70000]
  Promote: 60000 to grandparent

Now insert 58300:
  Left child: [50000, 55000, 58300] ✓

The split propagates upward like a ripple through the tree.

Growing the Tree: Root Split

Key insight: What happens when even the root is full?

The root has no parent to promote to. Solution: create a new root!

Before:
  Root: [20000, 40000, 60000, 80000] FULL
  (needs to split)

After:
      New Root: [60000]

    ┌──────┴──────┐
  [20000, 40000]  [80000]
  (old root left) (old root right)

This is the ONLY way a B-tree grows taller. Trees grow from the bottom up, not top down!

B-tree Characteristics Summary

Strengths:

  • Predictable reads: every lookup costs the same 3-4 disk reads
  • Excellent range queries: keys are kept in sorted order
  • Updates happen in place, so there is one copy of each record

Weaknesses:

  • Every write is a read-modify-write of at least one full page, sometimes more when nodes split
  • Writes land as random I/O scattered across the disk
  • Pages are deliberately left partially empty, wasting some space

Real-world usage: PostgreSQL, MySQL (InnoDB), SQLite, and most relational databases use B-trees (usually the B+tree variant) as their primary index structure.

LSM Trees: The Append-Only Approach

Now let’s look at a completely different approach to the same problem. LSM trees (Log-Structured Merge trees) optimize for writes at the expense of reads.

The Cashier Analogy

Imagine you’re a cashier at a busy store during the holiday rush. Customers keep handing you receipts to file.

Approach A: Perfect Organization (B-tree)

Approach B: Pile It Up (Bad idea)

Approach C: Batch Processing (LSM tree)

Approach C trades immediate organization for write speed. This is LSM trees.

LSM Tree Structure

An LSM tree has multiple levels of storage:

Level 0: MemTable (RAM) - an in-memory sorted structure (typically a skip list or balanced tree) that absorbs all incoming writes at memory speed.

Level 1-N: SSTables on Disk (Sorted String Tables) - immutable files of sorted key-value pairs, produced by flushing the MemTable. Once written, an SSTable is never modified in place - only merged with others during compaction.

LSM Tree Operations

Writing to an LSM Tree

Scenario: High-frequency writes

Imagine you’re ingesting 100,000 sensor readings per second:

Write sensor_id=58393, value=42.3
  → Insert into MemTable (in-memory sorted tree)
  → Done! (microseconds)

Write sensor_id=61250, value=38.7
  → Insert into MemTable
  → Done! (microseconds)

... 100,000 more writes ...

MemTable full (say, 64 MB)
  → Flush entire MemTable to disk as SSTable-7
  → Sequential write to disk (very fast!)
  → Create new empty MemTable
  → Continue accepting writes

Why this is fast: every write is either a pure memory operation or part of one large sequential flush - no random disk reads, no in-place page updates on the write path.

Compare this to a B-tree: each insert first pays 3-4 disk reads to find the right leaf, then writes at least one full page back, more if nodes split.

LSM tree writes can be 100x faster than B-tree writes.
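
Here is a minimal sketch of that write path. The plain dict standing in for the MemTable, the tiny flush threshold, and the list standing in for SSTable files are all illustrative - real engines use a sorted in-memory structure plus a write-ahead log for crash safety:

```python
class MemTable:
    """In-memory write buffer; real engines use a skip list or balanced tree."""
    def __init__(self, max_entries=4):  # tiny threshold to make the demo flush
        self.data = {}
        self.max_entries = max_entries

sstables = []  # stand-in for SSTable files on disk; newest last

def write(memtable, key, value):
    memtable.data[key] = value                     # memory write: microseconds
    if len(memtable.data) >= memtable.max_entries:
        # Flush: one sorted, sequential write; then start a fresh MemTable.
        sstables.append(sorted(memtable.data.items()))
        memtable.data = {}

mt = MemTable()
for key, value in [(58393, 42.3), (61250, 38.7), (10001, 5.0), (58394, 42.9)]:
    write(mt, key, value)
print(sstables)  # one flushed SSTable, already sorted by key
```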

Reading from an LSM Tree

Here’s the trade-off. Reads are more complex:

Scenario: Looking up sensor_id=58393

Step 1: Check MemTable (RAM)
  → Binary search through in-memory tree
  → Not found

Step 2: Check SSTable-7 (newest on disk)
  → Use Bloom filter: "might be here"
  → Read index, then data block
  → Not found

Step 3: Check SSTable-6
  → Bloom filter: "might be here"
  → Read index, then data block
  → Not found

Step 4: Check SSTable-5
  → Bloom filter: "definitely not here" → Skip!

Step 5: Check SSTable-4
  → Bloom filter: "might be here"
  → Read index, then data block
  → Found it! ✓

The problem: Read Amplification

You might have to check multiple SSTables before finding your data (or confirming it doesn’t exist). This is called read amplification.

If you have 10 SSTables, worst case you read from 10 places. Compare this to B-tree’s guaranteed 3-4 reads.
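
A sketch of the read path under the same kind of stand-ins, with plain Python sets playing the role of Bloom filters (the real, probabilistic version is covered below):

```python
def lsm_get(memtable, sstables, key):
    """Check newest data first; return on the first hit."""
    if key in memtable:                 # 1. MemTable: free, it's in RAM
        return memtable[key]
    for table in reversed(sstables):    # 2. SSTables, newest to oldest
        if key not in table["filter"]:  # "definitely not here" -> skip
            continue                    #    (saves a disk read)
        if key in table["data"]:        # simulated disk read
            return table["data"][key]
    return None  # worst case we consulted every SSTable: read amplification

# The key lives in the oldest SSTable, so several tables get checked.
memtable = {}
sstables = [
    {"filter": {58393, 10001}, "data": {58393: 42.3, 10001: 5.0}},  # oldest
    {"filter": {70000},        "data": {70000: 9.9}},
    {"filter": {61250},        "data": {61250: 38.7}},              # newest
]
print(lsm_get(memtable, sstables, 58393))  # -> 42.3
```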

Compaction: Keeping Reads Fast

To prevent reads from getting too slow, LSM trees periodically compact SSTables.

Compaction process:

Before compaction:
  SSTable-7: [45000, 47000, 58000, 61000]
  SSTable-6: [20000, 30000, 58000, 60000]  ← duplicate 58000!
  SSTable-5: [10000, 25000, 40000, 58000]  ← duplicate 58000!

Run compaction (merge + deduplicate):
  SSTable-8: [10000, 20000, 25000, 30000, 40000, 45000, 
              47000, 58000, 60000, 61000]
  
  Delete SSTable-5, SSTable-6, SSTable-7

Now reads check one SSTable instead of three!
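
Here is a sketch of the merge-and-deduplicate step. Tagging each entry with its table’s age so the newest version of a duplicated key wins is one simple way to do it (illustrative, not any particular engine’s algorithm):

```python
import heapq

def compact(*sstables_newest_first):
    """Merge sorted runs, keeping only the newest value for each key."""
    tagged = [
        [(key, age, value) for key, value in table]  # age 0 = newest table
        for age, table in enumerate(sstables_newest_first)
    ]
    merged, seen = [], set()
    for key, age, value in heapq.merge(*tagged):     # global sorted order
        if key not in seen:           # first occurrence = newest version
            seen.add(key)
            merged.append((key, value))
    return merged

sstable_7 = [(45000, "g"), (47000, "h"), (58000, "new"), (61000, "i")]
sstable_6 = [(20000, "e"), (30000, "f"), (58000, "old"), (60000, "j")]
sstable_5 = [(10000, "a"), (25000, "b"), (40000, "c"), (58000, "older")]
print(compact(sstable_7, sstable_6, sstable_5))  # one sorted run, 58000 -> "new"
```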

Compaction strategies:

  1. Size-Tiered Compaction

    • Merge SSTables of similar size
    • Example: Four 10 MB tables → One 40 MB table
  2. Leveled Compaction

    • Organize into levels with size limits
    • Level 0: 10 MB total
    • Level 1: 100 MB total
    • Level 2: 1 GB total
    • Merge down levels as they fill up

The trade-off: compaction keeps reads fast, but it burns background I/O and CPU and rewrites data the application already wrote once - this is the write amplification we quantify later.

Optimizations: Bloom Filters

LSM trees use Bloom filters to avoid unnecessary disk reads.

What’s a Bloom filter?

A space-efficient probabilistic data structure that answers one question: “is this key possibly in this set?” The answer is either “definitely not here” (guaranteed correct - no false negatives) or “might be here” (with a small false-positive rate).

Example:

Looking for key=58393

SSTable-5 Bloom filter:
  → Bloom filter says: "Definitely not here" 
  → Skip reading this SSTable entirely! ✓

SSTable-4 Bloom filter:
  → Bloom filter says: "Might be here"
  → Read the SSTable to check (and it is there!)

Why this matters:

Without Bloom filters, you’d read every SSTable from disk. With Bloom filters, you can skip most of them. This dramatically improves read performance.

A Bloom filter for 1 million keys takes only ~1.2 MB of RAM (with 1% false positive rate).
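
Here is a minimal, unoptimized Bloom filter sketch to make the “definitely not / might be” behavior concrete (the bit-array size and hashing scheme are illustrative; production filters are far more tuned):

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions by salting one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False -> definitely absent; True -> probably present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
bf.add(58393)
print(bf.might_contain(58393))  # True: "might be here"
print(bf.might_contain(12345))  # almost certainly False: "definitely not here"

# The ~1.2 MB figure above: optimal bits/key = -ln(p) / (ln 2)^2.
bits_per_key = -math.log(0.01) / math.log(2) ** 2  # ~9.6 bits for p = 1%
print(1_000_000 * bits_per_key / 8 / 1e6, "MB")    # ~1.2 MB
```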

LSM Tree Characteristics Summary

Strengths:

  • Very fast writes: a memory insert plus an occasional sequential flush
  • Sequential disk I/O, which both HDDs and SSDs handle best
  • Immutable, sorted SSTables compress well

Weaknesses:

  • Read amplification: a lookup may touch several SSTables
  • Compaction runs in the background and can cause latency spikes
  • Space amplification: stale versions linger until compaction removes them

Real-world usage: LevelDB, RocksDB, Apache Cassandra, HBase, and most time-series databases (InfluxDB among them) are built on LSM trees.

B-trees vs LSM Trees: Head-to-Head Comparison

Performance Characteristics

| Operation | B-tree | LSM Tree |
|---|---|---|
| Writes | Moderate (3-4 reads + 1+ writes) | Very fast (memory write + occasional flush) |
| Reads | Fast (3-4 disk reads, predictable) | Variable (depends on # of SSTables) |
| Range queries | Excellent (data is sorted) | Good (must merge from multiple SSTables) |
| Space overhead | Low (minimal wasted space) | Medium (multiple copies during compaction) |
| Write amplification | Low to medium (splits) | High (compaction rewrites data) |
| Read amplification | Low (3-4 reads) | High (check multiple SSTables) |

When to Choose B-trees

Choose B-trees when:

  1. Read-heavy workload (10:1 or higher read:write ratio)
  2. Predictable performance is crucial
  3. Point queries are your primary access pattern
  4. Range scans are common
  5. ACID transactions are needed (easier with B-trees)

Example scenarios: user account stores, banking ledgers, product catalogs - systems where lookups and range scans dominate and latency must stay consistent.

When to Choose LSM Trees

Choose LSM trees when:

  1. Write-heavy workload (1:1 or higher write:read ratio)
  2. Append-only or time-series data
  3. Ingestion speed is critical
  4. SSD storage (sequential writes benefit SSDs)
  5. Large datasets where compression matters

Example scenarios: log and event ingestion, IoT sensor data, metrics pipelines - systems where data pours in faster than it is read back.

Time-Series Databases: A Perfect LSM Use Case

Time-series data has special characteristics that make LSM trees ideal:

Why Time-Series Loves LSM

Characteristic 1: Write-heavy - readings stream in continuously, and writes can outnumber reads by orders of magnitude.

Characteristic 2: Time-ordered - new data arrives with ever-increasing timestamps, so it lands naturally in the newest SSTables.

Characteristic 3: Immutable recent data - a recorded reading is essentially never updated, so in-place modification is rarely needed.

Characteristic 4: Retention policies - old data expires wholesale (“keep 30 days”), which maps neatly onto dropping entire time-partitioned SSTables.

Time-Series LSM Optimization

Scenario: IoT sensor network

1 million sensors × 1 reading per second = 1 million writes/sec

LSM tree optimization:

Partitioning by time:
  SSTable-2024-10-07-14:00: All data for 2-3 PM
  SSTable-2024-10-07-13:00: All data for 1-2 PM
  SSTable-2024-10-07-12:00: All data for 12-1 PM

Query: "Get sensor 58393 data from 2:15-2:30 PM"
  → Only check SSTable-2024-10-07-14:00
  → Avoid reading other time ranges

Retention: "Delete data older than 30 days"
  → Delete entire SSTables older than 30 days
  → No per-record deletion needed
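
A sketch of that partition routing and retention logic, assuming hourly buckets held in a plain dict (the names write_reading, query, and enforce_retention are hypothetical):

```python
from datetime import datetime, timedelta

partitions = {}  # hour bucket -> list of (timestamp, sensor_id, value)

def write_reading(ts, sensor_id, value):
    hour = ts.replace(minute=0, second=0, microsecond=0)
    partitions.setdefault(hour, []).append((ts, sensor_id, value))

def query(sensor_id, start, end):
    """Only touch hour buckets that overlap [start, end)."""
    hits = []
    for hour, rows in partitions.items():
        if hour + timedelta(hours=1) <= start or hour >= end:
            continue  # skip irrelevant time ranges entirely
        hits += [r for r in rows if r[1] == sensor_id and start <= r[0] < end]
    return hits

def enforce_retention(now, max_age=timedelta(days=30)):
    """Retention = dropping whole buckets; no per-record deletes."""
    for hour in [h for h in partitions if now - h > max_age]:
        del partitions[hour]

write_reading(datetime(2024, 10, 7, 14, 20), 58393, 42.3)
print(query(58393, datetime(2024, 10, 7, 14, 15), datetime(2024, 10, 7, 14, 30)))
```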

Additional optimizations: timestamps compress extremely well with delta encoding (store gaps between readings rather than full values), and aging data can be downsampled into lower-resolution aggregates before it is finally dropped.

Real-world Time-Series Databases

InfluxDB (LSM-based): built on a Time-Structured Merge (TSM) tree, an LSM variant specialized for timestamped data with aggressive compression.

Prometheus (LSM-inspired): appends samples to an in-memory head block backed by a write-ahead log, then periodically writes immutable, time-partitioned blocks to disk.

TimescaleDB (Hybrid): a PostgreSQL extension that keeps B-tree indexes but partitions tables into time-based chunks (hypertables), so retention becomes a cheap chunk drop.

Deep Dive: Write Amplification

Both B-trees and LSM trees suffer from write amplification - writing more data than the application requested. Understanding this helps you tune your database.

B-tree Write Amplification

Scenario: Insert one 100-byte record

  1. Read leaf node (4 KB) into memory
  2. Modify it to add the record
  3. Write entire node back (4 KB write)

Write amplification: 4 KB / 100 bytes ≈ 40x

If the insert causes splits:

  1. Write left split node (4 KB)
  2. Write right split node (4 KB)
  3. Write modified parent node (4 KB)

Write amplification: 12 KB / 100 bytes ≈ 120x

LSM Tree Write Amplification

Scenario: Insert one 100-byte record

Phase 1 (MemTable → SSTable):

  1. Write to MemTable (100 bytes)
  2. When MemTable flushes → write 64 MB SSTable

Phase 2 (Compaction Level 0 → 1): when Level 0 SSTables merge into Level 1, the 100-byte record is rewritten along with everything else in the overlapping files.

Phase 3 (Compaction Level 1 → 2): the record is rewritten yet again as it migrates down the levels - each level it passes through costs another full rewrite.

Total write amplification: Can reach 50-100x over the lifetime of data
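
A back-of-the-envelope model of both figures; the page size matches the example above, while the level count and rewrites-per-level for the LSM case are illustrative assumptions:

```python
record_size = 100  # bytes the application actually asked to write

# B-tree: one insert rewrites a whole 4 KB page (three pages on a split).
print(4096 / record_size, 3 * 4096 / record_size)  # ~41x, ~123x

# LSM: the record is rewritten roughly once per merge as it migrates down.
# With leveled compaction, each level multiplies rewrites by about the fan-out.
levels, rewrites_per_level = 5, 10                 # illustrative
print(1 + levels * rewrites_per_level)             # ~51x over its lifetime
```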

Mitigating Write Amplification

For B-trees: batch small updates (group commit), keep records compact relative to the page size, and lean on the write-ahead log to absorb bursts before pages are rewritten.

For LSM trees: choose the compaction strategy deliberately (size-tiered rewrites less data than leveled, at the cost of more read amplification), enlarge the MemTable so flushes happen less often, and monitor the compaction backlog.

Practical System Design Scenarios

Scenario 1: Social Media Feed

Requirements: a constant stream of new posts (write-heavy ingestion) alongside an even larger stream of feed reads, both latency-sensitive.

Analysis: the post timeline behaves like time-series data and favors an LSM store; account and social-graph lookups are read-heavy point queries that favor B-trees.

Recommended approach: Hybrid - an LSM-based store for the post stream, a B-tree-based relational store for accounts and relationships.

Scenario 2: Banking Transactions

Requirements: strict ACID transactions, strong consistency, and predictable latency, with reads (balance checks, statement queries) far outnumbering writes.

Analysis: a read-heavy, correctness-critical workload where transactional guarantees are simpler to provide over an update-in-place structure.

Recommended approach: B-tree

Scenario 3: Application Monitoring

Requirements: relentless ingestion of metrics and logs from many hosts, queries that mostly touch recent data, and a retention window after which data expires.

Analysis: the textbook write-heavy, time-ordered, retention-driven workload from the time-series section above.

Recommended approach: LSM Tree (Time-Series DB)

Scenario 4: E-commerce Product Catalog

Requirements: product lookups, category browsing, and range scans (by price, by rating) dominate; catalog updates are comparatively rare.

Analysis: a classic read-heavy workload with frequent range queries - exactly where B-trees shine.

Recommended approach: B-tree

Conclusion: Choosing Your Data Structure

The choice between B-trees and LSM trees isn’t about which is “better” - it’s about understanding your workload:

Choose B-trees when: reads dominate, latency must be predictable, range scans matter, and transactional guarantees are central.

Choose LSM trees when: writes dominate, data is append-only or time-ordered, and you can accept background compaction and somewhat more variable reads.

Consider hybrid approaches when: your system has both a heavy ingest path and a heavy serving path - use each structure where its strengths apply.

Remember: modern databases often use both structures in different ways. PostgreSQL (primarily B-tree) uses WAL (which is log-structured). RocksDB (LSM) can use B-tree indexes for SSTables. Understanding both helps you make better decisions.

Further Learning

Books: Designing Data-Intensive Applications by Martin Kleppmann - chapter 3 covers B-trees and LSM trees in depth.

Papers: “The Log-Structured Merge-Tree (LSM-Tree)” (O’Neil et al., 1996); “Organization and Maintenance of Large Ordered Indices” (Bayer and McCreight, 1972).

Open Source: SQLite (B-tree), LevelDB and RocksDB (LSM) - all approachable, well-documented implementations worth reading.

The best way to truly understand these structures is to use them, measure them, and see their trade-offs in action. Happy building!

About the Author

Aniket Indulkar is an Android Engineer based in London with a Master's in Artificial Intelligence. He writes about AI, ML, Android development, and his continuous learning journey.
