
B-trees vs LSM Trees: A Deep Dive for System Designers

Learn the fundamental data structures powering modern databases through an intuitive, scenario-based approach

#Databases #System-design #Learning-log

Table of contents

  1. B-trees vs LSM Trees: A Deep Dive for System Designers
  2. The Core Problem: Finding a Needle in a Billion Haystacks
  3. Why This Is Hard: The Disk Speed Problem
  4. The Index: Our First Attempt
  5. Scenario: Building a Simple Index
  6. The Key Insight
  7. B-trees: The Phone Book Approach
  8. Understanding B-trees Through Analogy
  9. Why Not Just Use a Binary Tree?
  10. B-trees: Fat Nodes for Fewer Hops
  11. Walking Through a B-tree Lookup
  12. B-tree Writes: Maintaining Order
  13. B-tree Characteristics Summary
  14. LSM Trees: The Append-Only Approach
  15. The Cashier Analogy
  16. LSM Tree Structure
  17. LSM Tree Operations
  18. Compaction: Keeping Reads Fast
  19. Optimizations: Bloom Filters
  20. LSM Tree Characteristics Summary
  21. B-trees vs LSM Trees: Head-to-Head Comparison
  22. Performance Characteristics
  23. When to Choose B-trees
  24. When to Choose LSM Trees
  25. Time-Series Databases: A Perfect LSM Use Case
  26. Why Time-Series Loves LSM
  27. Time-Series LSM Optimization
  28. Real-world Time-Series Databases
  29. Deep Dive: Write Amplification
  30. B-tree Write Amplification
  31. LSM Tree Write Amplification
  32. Mitigating Write Amplification
  33. Practical System Design Scenarios
  34. Scenario 1: Social Media Feed
  35. Scenario 2: Banking Transactions
  36. Scenario 3: Application Monitoring
  37. Scenario 4: E-commerce Product Catalog
  38. Conclusion: Choosing Your Data Structure
  39. Further Learning

B-trees vs LSM Trees: A Deep Dive for System Designers

If you’re designing systems that need to store and retrieve billions of records efficiently, understanding B-trees and LSM trees is essential. These two data structures power most modern databases, but they make fundamentally different trade-offs. Let’s build your intuition from the ground up.

The Core Problem: Finding a Needle in a Billion Haystacks

Before we dive into B-trees and LSM trees, we need to understand the problem they’re solving. Imagine you have 1 billion user records stored on disk. When someone logs in, you need to find their specific record. How do you do that quickly?

Why This Is Hard: The Disk Speed Problem

Here’s the critical constraint that shapes everything in database design:

Reading from a spinning disk is roughly 100,000x slower than reading from RAM (and even an SSD is ~1,000x slower).

Let me put that in perspective:

  • RAM access: ~100 nanoseconds
  • SSD access: ~100 microseconds (1000x slower)
  • Hard drive access: ~10 milliseconds (100,000x slower)

If you could read from RAM in the time it takes to blink (100ms), reading from a hard drive would take 2.7 hours!

This massive speed difference means that the number of disk accesses matters more than almost anything else.

The Index: Our First Attempt

Your first instinct might be: “Let’s create an index!” Good thinking. An index is a lookup structure that maps keys to their disk locations.

Scenario: Building a Simple Index

Let’s say you have 1 billion user records:

  • Each entry needs: User ID (8 bytes) + disk location pointer (8 bytes)
  • Total index size: 16 bytes × 1 billion = 16 GB

Now we face a new problem: where do we store this 16 GB index?

Option 1: Keep it all in RAM

  • Works great for small datasets
  • For 1 million users = 16 MB index ✓
  • For 1 billion users = 16 GB index (uses half your server’s RAM!)
  • For 100 billion records = 1.6 TB index ✗ (doesn’t fit in RAM)

Option 2: Store the index on disk

  • But now we’ve just recreated the problem!
  • We need to search through a 16 GB file on disk
  • A naive scan means millions of page reads, and even a binary search over that file still costs ~30 random disk reads

The Key Insight

What we need is a structure that:

  1. Keeps the “hot” parts in RAM
  2. Stores the rest on disk
  3. Organizes the disk part so efficiently that we only need 3-4 disk reads to find anything

This is where B-trees come in.

B-trees: The Phone Book Approach

Understanding B-trees Through Analogy

Think about an old-fashioned phone book with a million names. You don’t read every page to find “Rodriguez”. Instead:

  1. Top level: Open to the middle - see sections “A-F | G-M | N-S | T-Z”
  2. Middle level: You want “R”, so look under “N-S”, which splits into “N | O | P-Q | R | S”
  3. Bottom level: Under “R” you find actual names: “Ramirez, Robertson, Rodriguez, Rogers…”

You only flipped to 3-4 pages total, not all 1000 pages. That’s the B-tree idea.

Why Not Just Use a Binary Tree?

You might think: “Why not use a binary tree? That’s a classic computer science structure!”

Here’s the problem:

Binary Tree:

  • Each node has 2 children
  • For 1 billion items, tree height ≈ 30 (log₂ 1 billion)
  • Each level requires one disk read
  • Total: 30 disk reads per lookup 😞

Remember, disk reads are the enemy. 30 disk reads is way too slow.

B-trees: Fat Nodes for Fewer Hops

B-trees use a clever trick: instead of each node having 2 children, each node has hundreds or thousands of children.

How it works:

  • Each node contains many keys (say, 100 keys) stored in sorted order
  • Each node has 101 possible children to follow
  • Each node is sized to fit exactly one disk page (typically 4 KB)

The math changes dramatically:

  • For 1 billion items with nodes of 100 keys
  • Tree height ≈ 4-5 (log₁₀₀ of 1 billion is 4.5)
  • With the top level or two cached in RAM: 3-4 disk reads per lookup! ✓ (see the quick check below)
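
A quick back-of-the-envelope check as a sketch; the fanout of 100 is the illustrative figure from above, not a fixed property of real B-trees:

import math

N = 1_000_000_000                               # records to index

binary_height = math.ceil(math.log2(N))         # ~30 levels -> ~30 disk reads
btree_height  = math.ceil(math.log(N, 100))     # fanout ~100 -> ~5 levels

# With the root (and usually the level below it) cached in RAM,
# a lookup only pays for the remaining levels on disk.
disk_reads = btree_height - 1

print(binary_height, btree_height, disk_reads)  # 30 5 4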

Walking Through a B-tree Lookup

Let’s make this concrete. You’re looking for user_id = 58392.

Step 1: Read root node from disk → RAM

Root: [10000, 20000, 30000, 40000, 50000, 60000, 70000, ...]
       ↓      ↓      ↓      ↓      ↓      ↓      ↓
     ptr1   ptr2   ptr3   ptr4   ptr5   ptr6   ptr7

Binary search within this node (fast - it’s in RAM!):

  • 58392 is between 50000 and 60000
  • Follow ptr5 to the next level

Step 2: Read that child node from disk → RAM

Node: [50100, 51000, 52000, ..., 58000, 59000, ...]
       ↓      ↓      ↓           ↓      ↓
     ptr1   ptr2   ptr3       ptr58   ptr59

Binary search within this node:

  • 58392 is between 58000 and 59000
  • Follow ptr58

Step 3: Read leaf node from disk → RAM

Leaf: [58001, 58050, 58200, 58392, 58500, ...]
       data1  data2  data3  data4  data5

Found it! This leaf also contains a pointer to where the actual user data lives.

Total: 3 disk reads to find one record among billions.
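
Here is that descent as a minimal Python sketch. The node layout (sorted keys plus child pointers) is hypothetical; real B-tree pages also track key counts, sibling links, and page IDs, but the navigation logic is the same idea.

from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys              # sorted keys within this node
        self.children = children      # child nodes (None for leaves)
        self.values = values          # record pointers (leaves only)

def btree_lookup(node, key):
    """Descend from the root to a leaf; each hop is one disk read in a real tree."""
    while node.children is not None:
        # Binary search inside the node is cheap: the page is already in RAM
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_right(node.keys, key) - 1
    if i >= 0 and node.keys[i] == key:
        return node.values[i]
    return None

# Tiny two-level example (B+-tree style: the separator key repeats in a leaf)
root = Node(keys=[58200], children=[
    Node(keys=[58001, 58050], values=["rec-a", "rec-b"]),
    Node(keys=[58200, 58392, 58500], values=["rec-c", "rec-d", "rec-e"]),
])
print(btree_lookup(root, 58392))   # -> "rec-d"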

B-tree Writes: Maintaining Order

Reads are straightforward, but what about writes? This is where B-trees get interesting.

Scenario: Inserting a New User

You want to insert user_id = 58393 into our B-tree.

Step 1: Find the right location

Follow the same path as a read:

  • Root → choose path for 50000-60000
  • Middle → choose path for 58000-59000
  • Leaf → arrive at: [58001, 58050, 58200, 58392, 58500]

Step 2: Insert in sorted order

The new key 58393 belongs between 58392 and 58500:

Before: [58001, 58050, 58200, 58392, 58500]
After:  [58001, 58050, 58200, 58392, 58393, 58500]

But wait - what if the node is already full?

Handling Node Overflow: The Split

Let’s say each node can hold at most 5 keys (in reality it’s hundreds, but 5 keeps the example readable).

Our node is full:

[58100, 58200, 58300, 58400, 58500] ← can't add more!

We need to insert 58393. Here’s what happens:

Split the node:

  • Take the middle key (58300)
  • Split into two nodes:
    • Left: [58100, 58200]
    • Right: [58400, 58500]
  • Promote 58300 to the parent

Now insert 58393:

  • It belongs in the right node (58393 > 58300)
  • Right node becomes: [58393, 58400, 58500]

Visual representation:

Before split:
  Leaf: [58100, 58200, 58300, 58400, 58500] FULL!

After split:
  Parent gets: [..., 58300, ...]

          ┌───────────┴───────────┐
    [58100, 58200]         [58400, 58500]

After inserting 58393:
  Parent: [..., 58300, ...]

         ┌───────┴────────┐
    [58100, 58200]  [58393, 58400, 58500]
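
Here is the same split as a tiny Python sketch, under the toy assumption of at most 5 keys per node; real B-tree code also updates parent pointers and handles page I/O.

from bisect import insort

def split_leaf(keys):
    """Split a full leaf: left half, middle key promoted to the parent, right half."""
    mid = len(keys) // 2
    left, promoted, right = keys[:mid], keys[mid], keys[mid + 1:]
    return left, promoted, right

full = [58100, 58200, 58300, 58400, 58500]      # the full node from above
left, promoted, right = split_leaf(full)
# left     -> [58100, 58200]
# promoted -> 58300 (goes up to the parent)
# right    -> [58400, 58500]

insort(right, 58393)                            # right -> [58393, 58400, 58500]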

When Splits Propagate Upward

Here’s where it gets more interesting. What if the parent node is also full?

Scenario: Cascading splits

Parent before promotion:

[50000, 55000, 60000, 65000, 70000] ← already full!

We need to insert 58300, but there’s no room. Solution: split the parent too!

Split parent:
  Left: [50000, 55000]
  Right: [65000, 70000]
  Promote: 60000 to grandparent

Now insert 58300:
  Left child: [50000, 55000, 58300] ✓

The split propagates upward like a ripple through the tree.

Growing the Tree: Root Split

Key insight: What happens when even the root is full?

The root has no parent to promote to. Solution: create a new root!

Before:
  Root: [20000, 40000, 60000, 80000] FULL
  (needs to split)

After:
      New Root: [60000]

    ┌──────┴──────┐
  [20000, 40000]  [80000]
  (old root left) (old root right)

This is the ONLY way a B-tree grows taller: a new root is created at the top, triggered by splits that ripple up from the leaves. B-trees grow from the bottom up, not the top down!

B-tree Characteristics Summary

Strengths:

  • Fast reads: 3-4 disk accesses for any record
  • Data stays sorted (great for range queries)
  • Predictable performance
  • Good for read-heavy workloads

Weaknesses:

  • Writes require finding the right location (3-4 disk reads)
  • Then writing the modified node back (1+ disk writes)
  • Splits require additional writes
  • Not great for write-heavy workloads

Real-world usage:

  • PostgreSQL, MySQL (InnoDB)
  • SQLite
  • Most traditional relational databases

LSM Trees: The Append-Only Approach

Now let’s look at a completely different approach to the same problem. LSM trees (Log-Structured Merge trees) optimize for writes at the expense of reads.

The Cashier Analogy

Imagine you’re a cashier at a busy store during the holiday rush. Customers keep handing you receipts to file.

Approach A: Perfect Organization (B-tree)

  • Every receipt gets filed in the exact right spot immediately
  • Open the filing cabinet, find the folder, insert in sorted order
  • Very organized, but slow when you’re swamped with customers

Approach B: Pile It Up (Bad idea)

  • Just throw receipts in a pile on your desk
  • Super fast to “file”! But impossible to find anything later

Approach C: Batch Processing (LSM tree)

  • During rush hours: throw receipts in a box (fast!)
  • During quiet times: sort the box and merge into filing cabinet
  • When someone needs a receipt: check today’s box first, then yesterday’s box, then the filing cabinet

Approach C trades immediate organization for write speed. This is LSM trees.

LSM Tree Structure

An LSM tree has multiple levels of storage:

In RAM: the MemTable

  • New writes go here first
  • It’s a sorted tree structure in memory (like a red-black tree)
  • Writes are blazingly fast - just memory operations

On disk: SSTables (Sorted String Tables)

  • When MemTable fills up, flush it to disk as an SSTable
  • SSTables are immutable - never modified after creation
  • Each SSTable is a sorted file
  • Multiple SSTables accumulate over time

LSM Tree Operations

Writing to an LSM Tree

Scenario: High-frequency writes

Imagine you’re ingesting 100,000 sensor readings per second:

Write sensor_id=58393, value=42.3
  → Insert into MemTable (in-memory sorted tree)
  → Done! (microseconds)

Write sensor_id=61250, value=38.7
  → Insert into MemTable
  → Done! (microseconds)

... 100,000 more writes ...

MemTable full (say, 64 MB)
  → Flush entire MemTable to disk as SSTable-7
  → Sequential write to disk (very fast!)
  → Create new empty MemTable
  → Continue accepting writes

Why this is fast:

  • Writes to memory are instant
  • Disk writes are sequential (much faster than random writes)
  • No need to find the “right spot” - just append
  • No read operations needed

Compare this to B-tree:

  • Find the right leaf (3-4 disk reads)
  • Modify the leaf (1 disk write)
  • Maybe split nodes (more disk writes)

LSM tree writes can be 100x faster than B-tree writes.
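
A toy sketch of that write path. The flush threshold, file names, and JSON format are illustrative only; a real engine such as RocksDB keeps the memtable sorted, writes a write-ahead log before acknowledging the write, and uses a binary SSTable format.

import json

class LSMWriter:
    def __init__(self, flush_threshold=4):
        self.memtable = {}                  # in-memory buffer of recent writes
        self.flush_threshold = flush_threshold
        self.sstable_seq = 0

    def put(self, key, value):
        """Accept a write: pure in-memory work unless a flush is triggered."""
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Dump the memtable to disk as one sorted, immutable SSTable (one sequential write)."""
        self.sstable_seq += 1
        with open(f"sstable-{self.sstable_seq}.json", "w") as f:
            json.dump({str(k): v for k, v in sorted(self.memtable.items())}, f)
        self.memtable = {}                  # start a fresh, empty memtable

writer = LSMWriter(flush_threshold=4)       # tiny threshold so the demo flushes quickly
writer.put(58393, 42.3)
writer.put(61250, 38.7)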

Reading from an LSM Tree

Here’s the trade-off. Reads are more complex:

Scenario: Looking up sensor_id=58393

Step 1: Check MemTable (RAM)
  → Binary search through in-memory tree
  → Not found

Step 2: Check SSTable-7 (newest on disk)
  → Use Bloom filter: "might be here"
  → Read index, then data block
  → Not found

Step 3: Check SSTable-6
  → Bloom filter: "might be here"
  → Read index, then data block
  → Not found

Step 4: Check SSTable-5
  → Bloom filter: "definitely not here" → Skip!

Step 5: Check SSTable-4
  → Bloom filter: "might be here"
  → Read index, then data block
  → Found it! ✓

The problem: Read Amplification

You might have to check multiple SSTables before finding your data (or confirming it doesn’t exist). This is called read amplification.

If you have 10 SSTables, worst case you read from 10 places. Compare this to B-tree’s guaranteed 3-4 reads.
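
Here is that lookup order as a sketch; bloom_might_contain() and get() are assumed methods on a hypothetical SSTable object, not a real API.

def lsm_get(key, memtable, sstables):
    """Check the memtable first, then SSTables from newest to oldest."""
    if key in memtable:
        return memtable[key]
    for table in sstables:                  # newest first
        if not table.bloom_might_contain(key):
            continue                        # "definitely not here" -> skip the disk read
        value = table.get(key)              # index lookup + data block read (hypothetical method)
        if value is not None:
            return value
    return None                             # not found anywhere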

Compaction: Keeping Reads Fast

To prevent reads from getting too slow, LSM trees periodically compact SSTables.

Compaction process:

Before compaction:
  SSTable-7: [45000, 47000, 58000, 61000]
  SSTable-6: [20000, 30000, 58000, 60000]  ← duplicate 58000!
  SSTable-5: [10000, 25000, 40000, 58000]  ← duplicate 58000!

Run compaction (merge + deduplicate):
  SSTable-8: [10000, 20000, 25000, 30000, 40000, 45000, 
              47000, 58000, 60000, 61000]
  
  Delete SSTable-5, SSTable-6, SSTable-7

Now reads have far fewer SSTables to check!
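
A minimal merge-and-deduplicate sketch, with SSTables modeled as plain dicts; a real engine streams the sorted files and merges them without loading everything into memory.

def compact(sstables_newest_first):
    """Merge several SSTables into one, keeping only the newest value per key."""
    merged = {}
    for table in sstables_newest_first:     # newest first, so older values never overwrite
        for key, value in table.items():
            merged.setdefault(key, value)
    return dict(sorted(merged.items()))     # one sorted, immutable output SSTable

new_table = compact([
    {45000: "...", 58000: "v7"},            # SSTable-7 (newest)
    {20000: "...", 58000: "v6"},            # SSTable-6
    {10000: "...", 58000: "v5"},            # SSTable-5 (oldest)
])
# new_table keeps 58000 -> "v7", the newest copy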

Compaction strategies:

  1. Size-Tiered Compaction

    • Merge SSTables of similar size
    • Example: Four 10 MB tables → One 40 MB table
  2. Leveled Compaction

    • Organize into levels with size limits
    • Level 0: 10 MB total
    • Level 1: 100 MB total
    • Level 2: 1 GB total
    • Merge down levels as they fill up

The trade-off:

  • Compaction uses CPU and I/O resources
  • But it keeps read performance from degrading
  • More aggressive compaction = faster reads, more write overhead
  • Less compaction = slower reads, less write overhead

Optimizations: Bloom Filters

LSM trees use Bloom filters to avoid unnecessary disk reads.

What’s a Bloom filter?

A space-efficient probabilistic data structure that answers:

  • “Is X definitely NOT in this SSTable?” → 100% accurate
  • “Is X possibly in this SSTable?” → might be (check to confirm)

Example:

Looking for key=58393

SSTable-5 Bloom filter:
  → Bloom filter says: "Definitely not here" 
  → Skip reading this SSTable entirely! ✓

SSTable-4 Bloom filter:
  → Bloom filter says: "Might be here"
  → Read the SSTable to check (and it is there!)

Why this matters:

Without Bloom filters, you’d read every SSTable from disk. With Bloom filters, you can skip most of them. This dramatically improves read performance.

A Bloom filter for 1 million keys takes only ~1.2 MB of RAM (with 1% false positive rate).
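
To make that concrete, here is a toy Bloom filter using the standard sizing formula. Double hashing built on SHA-256 is a simplification; real engines use faster hashes such as xxHash or MurmurHash.

import hashlib
import math

class BloomFilter:
    def __init__(self, n_keys, fp_rate=0.01):
        # Standard sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hash functions
        self.m = math.ceil(-n_keys * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, round((self.m / n_keys) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        h = hashlib.sha256(str(key).encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter(n_keys=1_000_000, fp_rate=0.01)   # ~9.6 million bits ≈ 1.2 MB, as noted above
bf.add(58393)
print(bf.might_contain(58393))   # True
print(bf.might_contain(99999))   # almost certainly False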

LSM Tree Characteristics Summary

Strengths:

  • Extremely fast writes (10-100x faster than B-trees)
  • Excellent for write-heavy workloads
  • Sequential disk writes (SSD-friendly)
  • Good compression (entire files can be compressed)

Weaknesses:

  • Slower reads (must check multiple SSTables)
  • Read amplification in worst cases
  • Write amplification during compaction
  • More complex to tune and maintain

Real-world usage:

  • Apache Cassandra
  • RocksDB (used by Facebook, LinkedIn)
  • LevelDB (used by Bitcoin, Minecraft)
  • Apache HBase
  • ScyllaDB

B-trees vs LSM Trees: Head-to-Head Comparison

Performance Characteristics

| Operation | B-tree | LSM Tree |
| --- | --- | --- |
| Writes | Moderate (3-4 reads + 1+ writes) | Very fast (memory write + occasional flush) |
| Reads | Fast (3-4 disk reads, predictable) | Variable (depends on # of SSTables) |
| Range queries | Excellent (data is sorted) | Good (must merge from multiple SSTables) |
| Space overhead | Low (minimal wasted space) | Medium (multiple copies during compaction) |
| Write amplification | Low to medium (splits) | High (compaction rewrites data) |
| Read amplification | Low (3-4 reads) | High (check multiple SSTables) |

When to Choose B-trees

Choose B-trees when:

  1. Read-heavy workload (10:1 or higher read:write ratio)
  2. Predictable performance is crucial
  3. Point queries are your primary access pattern
  4. Range scans are common
  5. ACID transactions are needed (easier with B-trees)

Example scenarios:

  • Traditional web applications (lots of user lookups)
  • Inventory management systems
  • Banking systems (balance lookups)
  • Product catalogs

When to Choose LSM Trees

Choose LSM trees when:

  1. Write-heavy workload (1:1 or higher write:read ratio)
  2. Append-only or time-series data
  3. Ingestion speed is critical
  4. SSD storage (sequential writes benefit SSDs)
  5. Large datasets where compression matters

Example scenarios:

  • Time-series databases (metrics, logs, IoT sensors)
  • Event streaming systems
  • Analytics databases
  • Social media feeds (constant updates)
  • Monitoring systems

Time-Series Databases: A Perfect LSM Use Case

Time-series data has special characteristics that make LSM trees ideal:

Why Time-Series Loves LSM

Characteristic 1: Write-heavy

  • Millions of data points per second (sensor readings, metrics, logs)
  • Writes far outnumber reads
  • LSM’s fast writes shine here

Characteristic 2: Time-ordered

  • Data naturally arrives in time order
  • LSM trees already write sequentially
  • Perfect match!

Characteristic 3: Immutable recent data

  • Recent data is rarely updated (append-only)
  • Fits LSM’s immutable SSTable model
  • No need for in-place updates

Characteristic 4: Retention policies

  • Old data gets deleted in bulk
  • Can delete entire old SSTables
  • Much easier than deleting from B-tree

Time-Series LSM Optimization

Scenario: IoT sensor network

1 million sensors × 1 reading per second = 1 million writes/sec

LSM tree optimization:

Partitioning by time:
  SSTable-2024-10-07-14:00: All data for 2-3 PM
  SSTable-2024-10-07-13:00: All data for 1-2 PM
  SSTable-2024-10-07-12:00: All data for 12-1 PM

Query: "Get sensor 58393 data from 2:15-2:30 PM"
  → Only check SSTable-2024-10-07-14:00
  → Avoid reading other time ranges

Retention: "Delete data older than 30 days"
  → Delete entire SSTables older than 30 days
  → No per-record deletion needed
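
A sketch of how time-partitioned SSTable files make both query pruning and retention trivial; the file-naming scheme here is made up for the example.

from datetime import datetime, timedelta
import glob
import os

def sstables_for_range(start, end, directory="."):
    """Only open SSTables whose hourly window overlaps the query's time range."""
    selected = []
    for path in glob.glob(os.path.join(directory, "sstable-*.db")):
        hour = datetime.strptime(os.path.basename(path), "sstable-%Y-%m-%d-%H.db")
        if hour <= end and hour + timedelta(hours=1) >= start:
            selected.append(path)
    return selected

def apply_retention(days=30, directory="."):
    """Retention = deleting whole files; no per-record tombstones needed."""
    cutoff = datetime.now() - timedelta(days=days)
    for path in glob.glob(os.path.join(directory, "sstable-*.db")):
        hour = datetime.strptime(os.path.basename(path), "sstable-%Y-%m-%d-%H.db")
        if hour < cutoff:
            os.remove(path)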

Additional optimizations:

  • Compress old SSTables aggressively (Zstandard, LZ4)
  • Use different compaction for recent vs old data
  • Keep recent data in faster storage (SSD), old data in slower storage (HDD)

Real-world Time-Series Databases

InfluxDB (LSM-based):

  • Handles billions of data points
  • Optimized for time-range queries
  • Custom TSM (Time-Structured Merge tree) storage engine

Prometheus (LSM-inspired):

  • Metrics monitoring
  • 2-hour blocks written as immutable chunks
  • Compacts blocks over time

TimescaleDB (Hybrid):

  • Built on PostgreSQL (B-tree)
  • But uses hypertables (time-based partitioning)
  • Combines B-tree reliability with time-series optimizations

Deep Dive: Write Amplification

Both B-trees and LSM trees suffer from write amplification - writing more data than the application requested. Understanding this helps you tune your database.

B-tree Write Amplification

Scenario: Insert one 100-byte record

  1. Read leaf node (4 KB) into memory
  2. Modify it to add the record
  3. Write entire node back (4 KB write)

Write amplification: 4 KB / 100 bytes = 40x

If the insert causes splits:

  1. Write left split node (4 KB)
  2. Write right split node (4 KB)
  3. Write modified parent node (4 KB)

Write amplification: 12 KB / 100 bytes = 120x
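
The same arithmetic as a quick check in code (page and record sizes as given above):

PAGE = 4 * 1024            # one B-tree node = one 4 KB page
RECORD = 100               # bytes the application actually asked to write

simple_insert = PAGE / RECORD             # rewrite one leaf page
print(round(simple_insert))               # ~41x, i.e. the ~40x quoted above

insert_with_split = 3 * PAGE / RECORD     # left node + right node + parent
print(round(insert_with_split))           # ~123x, i.e. the ~120x quoted above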

LSM Tree Write Amplification

Scenario: Insert one 100-byte record

Phase 1 (MemTable → SSTable):

  1. Write to MemTable (100 bytes)
  2. When MemTable flushes → write 64 MB SSTable

Phase 2 (Compaction Level 0 → 1):

  • Merge 4 SSTables (256 MB total) → write 256 MB

Phase 3 (Compaction Level 1 → 2):

  • Merge into level 2 → write 2.5 GB

Total write amplification: Can reach 50-100x over the lifetime of data
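
A rough way to estimate it, as a sketch based on the common approximation for leveled compaction: each byte pushed into a level drags along roughly one size-ratio's worth of that level's existing data. The level count and size ratio below are illustrative.

flush_cost = 1                     # MemTable -> first SSTable
levels = 5                         # on-disk levels the data migrates through over its lifetime
size_ratio = 10                    # each level ~10x larger than the previous one

write_amplification = flush_cost + levels * size_ratio
print(write_amplification)         # ~51x with these assumptions, in the range quoted above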

Mitigating Write Amplification

For B-trees:

  • Use larger nodes (reduce splits)
  • Batch writes when possible
  • Use write-ahead log (WAL) for durability

For LSM trees:

  • Tune compaction strategy
  • Use leveled compaction (less amplification than size-tiered)
  • Delay compaction during peak write times
  • Adjust compaction triggers

Practical System Design Scenarios

Scenario 1: Social Media Feed

Requirements:

  • 100 million active users
  • Average 10 posts per day per user = 1 billion writes/day
  • Average user checks feed 50 times/day = 5 billion reads/day
  • Recent posts are most accessed

Analysis:

  • Read-heavy: 5:1 read:write ratio
  • But writes are very bursty
  • Recent data needs fast access
  • Old posts rarely accessed

Recommended approach: Hybrid

  • LSM tree for write path (handle burst writes)
  • Cache recent posts in memory
  • B-tree indexes for user lookups
  • Archive old data to cold storage

Scenario 2: Banking Transactions

Requirements:

  • Account balance lookups (very frequent)
  • Transaction inserts (less frequent)
  • ACID guarantees required
  • Consistent read performance critical

Analysis:

  • Read-heavy
  • Strong consistency needed
  • Predictable latency critical
  • Can’t afford read amplification

Recommended approach: B-tree

  • PostgreSQL or similar
  • B-tree indexes on account_id
  • WAL for durability
  • Replication for read scaling

Scenario 3: Application Monitoring

Requirements:

  • 10,000 servers × 100 metrics/sec = 1 million writes/sec
  • Query patterns: time-range aggregations
  • Retention: 90 days, then delete
  • Reads are dashboards and alerts

Analysis:

  • Extremely write-heavy
  • Time-series data
  • Bulk deletes needed
  • Aggregations over time ranges

Recommended approach: LSM Tree (Time-Series DB)

  • InfluxDB or TimescaleDB
  • Partition by time
  • Compress old data
  • Delete old SSTables wholesale

Scenario 4: E-commerce Product Catalog

Requirements:

  • 10 million products
  • Frequent searches by category, price, name
  • Occasional inventory updates
  • Range queries (price ranges)

Analysis:

  • Read-dominated
  • Complex queries (multiple indexes)
  • Range scans common
  • ACID transactions helpful

Recommended approach: B-tree

  • PostgreSQL with multiple indexes
  • B-tree on product_id (primary key)
  • B-tree on category, price
  • Full-text search index for names

Conclusion: Choosing Your Data Structure

The choice between B-trees and LSM trees isn’t about which is “better” - it’s about understanding your workload:

Choose B-trees when:

  • Reads dominate your workload
  • You need predictable, consistent latency
  • ACID transactions are important
  • Traditional relational database features matter

Choose LSM trees when:

  • Writes dominate your workload
  • You’re willing to trade read latency for write throughput
  • Your data is append-mostly or time-series
  • You can benefit from compression

Consider hybrid approaches when:

  • You have complex requirements
  • Different parts of your system have different needs
  • You can cache hot data to mitigate read amplification

Remember: modern databases often use both structures in different ways. PostgreSQL (primarily B-tree) uses WAL (which is log-structured). RocksDB (LSM) can use B-tree indexes for SSTables. Understanding both helps you make better decisions.

Further Learning

Books:

  • “Designing Data-Intensive Applications” by Martin Kleppmann
  • “Database Internals” by Alex Petrov

Papers:

  • “The Log-Structured Merge-Tree (LSM-Tree)” - O’Neil et al. (1996)
  • “The Ubiquitous B-Tree” - Douglas Comer (1979)

Open Source:

  • Study RocksDB source code (excellent LSM implementation)
  • Study PostgreSQL B-tree implementation
  • Experiment with both using benchmark tools

The best way to truly understand these structures is to use them, measure them, and see their trade-offs in action. Happy building!
