B-trees vs LSM Trees: A Deep Dive for System Designers
If you’re designing systems that need to store and retrieve billions of records efficiently, understanding B-trees and LSM trees is essential. These two data structures power most modern databases, but they make fundamentally different trade-offs. Let’s build your intuition from the ground up.
The Core Problem: Finding a Needle in a Billion Haystacks
Before we dive into B-trees and LSM trees, we need to understand the problem they’re solving. Imagine you have 1 billion user records stored on disk. When someone logs in, you need to find their specific record. How do you do that quickly?
Why This Is Hard: The Disk Speed Problem
Here’s the critical constraint that shapes everything in database design:
Reading from disk is roughly 100,000x slower than reading from RAM.
Let me put that in perspective:
- RAM access: ~100 nanoseconds
- SSD access: ~100 microseconds (1000x slower)
- Hard drive access: ~10 milliseconds (100,000x slower)
If you could read from RAM in the time it takes to blink (100ms), reading from a hard drive would take 2.7 hours!
This massive speed difference means that the number of disk accesses matters more than almost anything else.
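If you want to sanity-check that scaling yourself, here is a quick Python sketch using the rounded latency figures above (real numbers vary widely by hardware):

```python
# Scale the latencies above so that one RAM access takes one "blink" (100 ms).
ram_ns, ssd_ns, hdd_ns = 100, 100_000, 10_000_000   # ~100 ns, ~100 µs, ~10 ms
blink_ms = 100

for name, ns in [("RAM", ram_ns), ("SSD", ssd_ns), ("HDD", hdd_ns)]:
    scaled_s = blink_ms * (ns / ram_ns) / 1000       # seconds at "blink scale"
    print(f"{name}: {scaled_s:,.1f} s at blink scale ({ns // ram_ns:,}x RAM)")
# RAM: 0.1 s, SSD: 100 s (~1.7 minutes), HDD: 10,000 s (~2.8 hours)
```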
The Index: Our First Attempt
Your first instinct might be: “Let’s create an index!” Good thinking. An index is a lookup structure that maps keys to their disk locations.
Scenario: Building a Simple Index
Let’s say you have 1 billion user records:
- Each entry needs: User ID (8 bytes) + disk location pointer (8 bytes)
- Total index size: 16 bytes × 1 billion = 16 GB
Now we face a new problem: where do we store this 16 GB index?
Option 1: Keep it all in RAM
- Works great for small datasets
- For 1 million users = 16 MB index ✓
- For 1 billion users = 16 GB index (uses half your server’s RAM!)
- For 100 billion records = 1.6 TB index ✗ (doesn’t fit in RAM)
Option 2: Store the index on disk
- But now we’ve just recreated the problem!
- We need to search through a 16 GB file on disk
- Scanning it naively means millions of disk reads; even binary search over a sorted file still costs ~30 scattered reads
The Key Insight
What we need is a structure that:
- Keeps the “hot” parts in RAM
- Stores the rest on disk
- Organizes the disk part so efficiently that we only need 3-4 disk reads to find anything
This is where B-trees come in.
B-trees: The Phone Book Approach
Understanding B-trees Through Analogy
Think about an old-fashioned phone book with a million names. You don’t read every page to find “Rodriguez”. Instead:
- Top level: Open to the middle - see sections “A-F | G-M | N-S | T-Z”
- Middle level: You want “R”, so look under “N-S”, which splits into “N | O | P-Q | R | S”
- Bottom level: Under “R” you find actual names: “Ramirez, Robertson, Rodriguez, Rogers…”
You only flipped to 3-4 pages total, not all 1000 pages. That’s the B-tree idea.
Why Not Just Use a Binary Tree?
You might think: “Why not use a binary tree? That’s a classic computer science structure!”
Here’s the problem:
Binary Tree:
- Each node has 2 children
- For 1 billion items, tree height ≈ 30 (log₂ 1 billion)
- Each level requires one disk read
- Total: 30 disk reads per lookup 😞
Remember, disk reads are the enemy. 30 disk reads is way too slow.
B-trees: Fat Nodes for Fewer Hops
B-trees use a clever trick: instead of each node having 2 children, each node has hundreds or thousands of children.
How it works:
- Each node contains many keys (say, 100 keys) stored in sorted order
- Each node has 101 possible children to follow
- Each node is sized to fit exactly one disk page (typically 4 KB)
The math changes dramatically:
- For 1 billion items with nodes of 100 keys each
- Tree height ≈ 4-5 (log₁₀₀ of 1 billion ≈ 4.5)
- Total: 3-4 disk reads per lookup, because the root and the next level down are almost always cached in RAM ✓ (see the quick check below)
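Here is a back-of-the-envelope check of those heights in Python; the 100-keys-per-node fanout is the rough assumption from the list above:

```python
import math

N = 1_000_000_000        # records
binary_fanout = 2
btree_fanout = 100       # roughly 100 keys per node, as assumed above

def levels(n, fanout):
    """Smallest height h such that fanout**h >= n."""
    return math.ceil(math.log(n, fanout))

print(levels(N, binary_fanout))  # 30 -> ~30 disk reads for a binary tree on disk
print(levels(N, btree_fanout))   # 5  -> 3-4 disk reads once the top levels are cached in RAM
```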
Walking Through a B-tree Lookup
Let’s make this concrete. You’re looking for user_id = 58392.
Step 1: Read root node from disk → RAM
Root: [10000, 20000, 30000, 40000, 50000, 60000, 70000, ...]
↓ ↓ ↓ ↓ ↓ ↓ ↓
ptr1 ptr2 ptr3 ptr4 ptr5 ptr6 ptr7
Binary search within this node (fast - it’s in RAM!):
- 58392 is between 50000 and 60000
- Follow ptr5 to the next level
Step 2: Read that child node from disk → RAM
Node: [50100, 51000, 52000, ..., 58000, 59000, ...]
↓ ↓ ↓ ↓ ↓
ptr1 ptr2 ptr3 ptr58 ptr59
Binary search within this node:
- 58392 is between 58000 and 59000
- Follow ptr58
Step 3: Read leaf node from disk → RAM
Leaf: [58001, 58050, 58200, 58392, 58500, ...]
data1 data2 data3 data4 data5
Found it! This leaf also contains a pointer to where the actual user data lives.
Total: 3 disk reads to find one record among billions.
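Here is a minimal Python sketch of that descent. The dict-based nodes and the `read_node` helper are hypothetical in-memory stand-ins for 4 KB disk pages, not a real storage engine:

```python
from bisect import bisect_right

# Hypothetical stand-ins for disk pages: internal nodes hold sorted keys plus child
# ids; leaves hold sorted keys plus pointers to the actual records.
internal = {"root": {"keys": [10000, 20000, 30000], "children": ["c0", "c1", "c2", "c3"]}}
leaves = {"c2": {"keys": [20010, 25000, 28000], "values": ["rec-a", "rec-b", "rec-c"]}}

def read_node(node_id):
    # In a real B-tree this is one disk read that pulls a whole page into RAM.
    return internal.get(node_id) or leaves[node_id]

def btree_lookup(root_id, key):
    node = read_node(root_id)
    while "children" in node:                    # descend through the internal levels
        idx = bisect_right(node["keys"], key)    # binary search inside the node (it is in RAM)
        node = read_node(node["children"][idx])  # one more "disk read" per level
    i = bisect_right(node["keys"], key) - 1      # now at a leaf
    if i >= 0 and node["keys"][i] == key:
        return node["values"][i]
    return None

print(btree_lookup("root", 25000))   # 'rec-b', found after 2 simulated disk reads
```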
B-tree Writes: Maintaining Order
Reads are straightforward, but what about writes? This is where B-trees get interesting.
Scenario: Inserting a New User
You want to insert user_id = 58393 into our B-tree.
Step 1: Find the right location
Follow the same path as a read:
- Root → choose path for 50000-60000
- Middle → choose path for 58000-59000
- Leaf → arrive at:
[58001, 58050, 58200, 58392, 58500]
Step 2: Insert in sorted order
The new key 58393 belongs between 58392 and 58500:
Before: [58001, 58050, 58200, 58392, 58500]
After: [58001, 58050, 58200, 58392, 58393, 58500]
But wait - what if the node is already full?
Handling Node Overflow: The Split
Let’s say each node can hold a maximum of 5 keys (in reality it’s hundreds, but 5 keeps the example clear).
Our node is full:
[58100, 58200, 58300, 58400, 58500] ← can't add more!
We need to insert 58393. Here’s what happens:
Split the node:
- Take the middle key (58300)
- Split into two nodes:
  - Left: [58100, 58200]
  - Right: [58400, 58500]
- Promote 58300 to the parent
Now insert 58393:
- It belongs in the right node (58393 > 58300)
- Right node becomes: [58393, 58400, 58500]
Visual representation:
Before split:
Leaf: [58100, 58200, 58300, 58400, 58500] FULL!
After split:
Parent gets: [..., 58300, ...]
↓
┌───────────┴───────────┐
[58100, 58200] [58400, 58500]
After inserting 58393:
Parent: [..., 58300, ...]
↓
┌───────┴────────┐
[58100, 58200] [58393, 58400, 58500]
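A toy version of the split, mirroring the walkthrough above (split the full leaf around its middle key, promote it, then insert into the correct half). The 5-key limit and list-based nodes are illustrative assumptions, not how a production B-tree manages pages:

```python
from bisect import insort

MAX_KEYS = 5   # toy value from the example; real nodes hold hundreds of keys

def split_full_leaf(keys):
    """Split a full leaf around its middle key.
    Returns (left_keys, promoted_key, right_keys)."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

leaf = [58100, 58200, 58300, 58400, 58500]      # full: can't add more
left, promoted, right = split_full_leaf(leaf)    # ([58100, 58200], 58300, [58400, 58500])

new_key = 58393
target = left if new_key < promoted else right   # 58393 > 58300 -> right node
insort(target, new_key)                          # keep the node sorted
print(left, promoted, right)   # [58100, 58200] 58300 [58393, 58400, 58500]
```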
When Splits Propagate Upward
Here’s where it gets more interesting. What if the parent node is also full?
Scenario: Cascading splits
Parent before promotion:
[50000, 55000, 60000, 65000, 70000] ← already full!
We need to insert 58300, but there’s no room. Solution: split the parent too!
Split parent:
Left: [50000, 55000]
Right: [65000, 70000]
Promote: 60000 to grandparent
Now insert 58300 into the left half (58300 < 60000):
Left half: [50000, 55000, 58300] ✓
The split propagates upward like a ripple through the tree.
Growing the Tree: Root Split
Key insight: What happens when even the root is full?
The root has no parent to promote to. Solution: create a new root!
Before:
Root: [20000, 40000, 60000, 80000] FULL
(needs to split)
After:
New Root: [60000]
↓
┌──────┴──────┐
[20000, 40000] [80000]
(old root left) (old root right)
This is the ONLY way a B-tree grows taller: from the bottom up, not the top down. Because every leaf-to-root path gains a level at the same moment, the tree stays perfectly balanced.
B-tree Characteristics Summary
Strengths:
- Fast reads: 3-4 disk accesses for any record
- Data stays sorted (great for range queries)
- Predictable performance
- Good for read-heavy workloads
Weaknesses:
- Writes require finding the right location (3-4 disk reads)
- Then writing the modified node back (1+ disk writes)
- Splits require additional writes
- Not great for write-heavy workloads
Real-world usage:
- PostgreSQL, MySQL (InnoDB)
- SQLite
- Most traditional relational databases
LSM Trees: The Append-Only Approach
Now let’s look at a completely different approach to the same problem. LSM trees (Log-Structured Merge trees) optimize for writes at the expense of reads.
The Cashier Analogy
Imagine you’re a cashier at a busy store during the holiday rush. Customers keep handing you receipts to file.
Approach A: Perfect Organization (B-tree)
- Every receipt gets filed in the exact right spot immediately
- Open the filing cabinet, find the folder, insert in sorted order
- Very organized, but slow when you’re swamped with customers
Approach B: Pile It Up (Bad idea)
- Just throw receipts in a pile on your desk
- Super fast to “file”! But impossible to find anything later
Approach C: Batch Processing (LSM tree)
- During rush hours: throw receipts in a box (fast!)
- During quiet times: sort the box and merge into filing cabinet
- When someone needs a receipt: check today’s box first, then yesterday’s box, then the filing cabinet
Approach C trades immediate organization for write speed. This is LSM trees.
LSM Tree Structure
An LSM tree has multiple levels of storage:
Level 0: MemTable (RAM)
- New writes go here first
- It’s a sorted tree structure in memory (like a red-black tree)
- Writes are blazingly fast - just memory operations
Level 1-N: SSTables on Disk (Sorted String Tables)
- When MemTable fills up, flush it to disk as an SSTable
- SSTables are immutable - never modified after creation
- Each SSTable is a sorted file
- Multiple SSTables accumulate over time
LSM Tree Operations
Writing to an LSM Tree
Scenario: High-frequency writes
Imagine you’re ingesting 100,000 sensor readings per second:
Write sensor_id=58393, value=42.3
→ Insert into MemTable (in-memory sorted tree)
→ Done! (microseconds)
Write sensor_id=61250, value=38.7
→ Insert into MemTable
→ Done! (microseconds)
... 100,000 more writes ...
MemTable full (say, 64 MB)
→ Flush entire MemTable to disk as SSTable-7
→ Sequential write to disk (very fast!)
→ Create new empty MemTable
→ Continue accepting writes
Why this is fast:
- Writes to memory are instant
- Disk writes are sequential (much faster than random writes)
- No need to find the “right spot” - just append
- No read operations needed
Compare this to B-tree:
- Find the right leaf (3-4 disk reads)
- Modify the leaf (1 disk write)
- Maybe split nodes (more disk writes)
LSM tree writes can be 10-100x faster than B-tree writes.
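The write path can be sketched in a few lines of Python. The `TinyLSMWriter` class, its tiny 4-entry flush threshold, and the JSON file format are illustrative assumptions, not any particular engine's implementation:

```python
import json
import os
import tempfile

MEMTABLE_LIMIT = 4          # toy threshold; real engines flush at e.g. 64 MB

class TinyLSMWriter:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.memtable = {}                      # key -> value, sorted only at flush time
        self.sstable_count = 0

    def put(self, key, value):
        self.memtable[key] = value              # just a memory write: microseconds
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # Write the keys in sorted order as one sequential file, then start fresh.
        path = os.path.join(self.data_dir, f"sstable-{self.sstable_count}.json")
        with open(path, "w") as f:
            json.dump(sorted(self.memtable.items()), f)
        self.sstable_count += 1
        self.memtable = {}

writer = TinyLSMWriter(tempfile.mkdtemp())
for sensor_id, value in [(58393, 42.3), (61250, 38.7), (10001, 7.0), (20002, 9.9)]:
    writer.put(sensor_id, value)                # the 4th put triggers a flush to sstable-0.json
```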
Reading from an LSM Tree
Here’s the trade-off. Reads are more complex:
Scenario: Looking up sensor_id=58393
Step 1: Check MemTable (RAM)
→ Binary search through in-memory tree
→ Not found
Step 2: Check SSTable-7 (newest on disk)
→ Use Bloom filter: "might be here"
→ Read index, then data block
→ Not found
Step 3: Check SSTable-6
→ Bloom filter: "might be here"
→ Read index, then data block
→ Not found
Step 4: Check SSTable-5
→ Bloom filter: "definitely not here" → Skip!
Step 5: Check SSTable-4
→ Bloom filter: "might be here"
→ Read index, then data block
→ Found it! ✓
The problem: Read Amplification
You might have to check multiple SSTables before finding your data (or confirming it doesn’t exist). This is called read amplification.
If you have 10 SSTables, worst case you read from 10 places. Compare this to B-tree’s guaranteed 3-4 reads.
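A minimal read-path sketch, assuming the MemTable is a plain dict and each SSTable is a sorted list already sitting in memory (a real engine would consult Bloom filters and read index and data blocks from disk instead):

```python
from bisect import bisect_left

def lsm_get(key, memtable, sstables_newest_first):
    if key in memtable:                        # Step 1: check RAM
        return memtable[key]
    for table in sstables_newest_first:        # Steps 2..N: newest SSTable first
        keys = [k for k, _ in table]
        i = bisect_left(keys, key)             # binary search, since each SSTable is sorted
        if i < len(keys) and keys[i] == key:
            return table[i][1]
    return None                                # not found anywhere

memtable = {61250: 38.7}
sstables = [
    [(45000, 1.0), (58000, 2.0)],              # newest
    [(20000, 3.0), (58393, 42.3)],             # older; holds the key we want
]
print(lsm_get(58393, memtable, sstables))      # 42.3, found in the second SSTable checked
```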
Compaction: Keeping Reads Fast
To prevent reads from getting too slow, LSM trees periodically compact SSTables.
Compaction process:
Before compaction:
SSTable-7: [45000, 47000, 58000, 61000]
SSTable-6: [20000, 30000, 58000, 60000] ← duplicate 58000!
SSTable-5: [10000, 25000, 40000, 58000] ← duplicate 58000!
Run compaction (merge + deduplicate):
SSTable-8: [10000, 20000, 25000, 30000, 40000, 45000,
47000, 58000, 60000, 61000]
Delete SSTable-5, SSTable-6, SSTable-7
Now reads have fewer SSTables to check: one instead of three in this example!
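Compaction itself is essentially a k-way merge with newest-wins deduplication. The list-based SSTables and the `compact` helper below are illustrative, not a real compaction implementation:

```python
from heapq import merge

def compact(sstables_newest_first):
    """Merge several sorted SSTables into one sorted run, keeping only the newest
    value when the same key appears in more than one table."""
    def tag(table, age):
        for key, value in table:           # age 0 = newest table, 1 = next oldest, ...
            yield key, age, value

    runs = [tag(t, age) for age, t in enumerate(sstables_newest_first)]
    merged, seen = [], set()
    for key, age, value in merge(*runs):   # streams the sorted runs in (key, age) order
        if key not in seen:                # first occurrence has the smallest age = newest copy
            merged.append((key, value))
            seen.add(key)
    return merged

sstable_7 = [(45000, "g"), (58000, "new")]   # newest
sstable_6 = [(20000, "b"), (58000, "old")]   # older, with a duplicate 58000
print(compact([sstable_7, sstable_6]))
# [(20000, 'b'), (45000, 'g'), (58000, 'new')]  -- one sorted run, duplicate resolved
```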
Compaction strategies:
- Size-Tiered Compaction
  - Merge SSTables of similar size
  - Example: Four 10 MB tables → One 40 MB table
- Leveled Compaction
  - Organize into levels with size limits
  - Level 0: 10 MB total
  - Level 1: 100 MB total
  - Level 2: 1 GB total
  - Merge down levels as they fill up
The trade-off:
- Compaction uses CPU and I/O resources
- But it keeps read performance from degrading
- More aggressive compaction = faster reads, more write overhead
- Less compaction = slower reads, less write overhead
Optimizations: Bloom Filters
LSM trees use Bloom filters to avoid unnecessary disk reads.
What’s a Bloom filter?
A space-efficient probabilistic data structure that answers:
- “Is X definitely NOT in this SSTable?” → 100% accurate
- “Is X possibly in this SSTable?” → might be (check to confirm)
Example:
Looking for key=58393
SSTable-5 Bloom filter:
→ Bloom filter says: "Definitely not here"
→ Skip reading this SSTable entirely! ✓
SSTable-4 Bloom filter:
→ Bloom filter says: "Might be here"
→ Read the SSTable to check (and it is there!)
Why this matters:
Without Bloom filters, you’d read every SSTable from disk. With Bloom filters, you can skip most of them. This dramatically improves read performance.
A Bloom filter for 1 million keys takes only ~1.2 MB of RAM (with 1% false positive rate).
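A toy Bloom filter, sized with the standard formula m = -n·ln(p)/(ln 2)², shows where that ~1.2 MB figure comes from. The salted-SHA-256 hashing here is just one simple choice, not what any particular database uses:

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n_keys, fp_rate=0.01):
        # m = -n * ln(p) / (ln 2)^2 bits; for 1M keys at 1% this is ~9.6M bits (~1.2 MB)
        self.m = int(-n_keys * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, round(self.m / n_keys * math.log(2)))   # optimal hash count (~7)
        self.bits = bytearray(self.m // 8 + 1)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter(n_keys=1_000_000)
bf.add(58393)
print(bf.might_contain(58393))   # True: "might be here" -> read the SSTable
print(bf.might_contain(99999))   # almost certainly False: "definitely not here" -> skip it
```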
LSM Tree Characteristics Summary
Strengths:
- Extremely fast writes (10-100x faster than B-trees)
- Excellent for write-heavy workloads
- Sequential disk writes (SSD-friendly)
- Good compression (entire files can be compressed)
Weaknesses:
- Slower reads (must check multiple SSTables)
- Read amplification in worst cases
- Write amplification during compaction
- More complex to tune and maintain
Real-world usage:
- Apache Cassandra
- RocksDB (used by Facebook, LinkedIn)
- LevelDB (used by Bitcoin, Minecraft)
- Apache HBase
- ScyllaDB
B-trees vs LSM Trees: Head-to-Head Comparison
Performance Characteristics
| Operation | B-tree | LSM Tree |
|---|---|---|
| Writes | Moderate (3-4 reads + 1+ writes) | Very fast (memory write + occasional flush) |
| Reads | Fast (3-4 disk reads, predictable) | Variable (depends on # of SSTables) |
| Range queries | Excellent (data is sorted) | Good (must merge from multiple SSTables) |
| Space overhead | Low (minimal wasted space) | Medium (multiple copies during compaction) |
| Write amplification | Low to medium (splits) | High (compaction rewrites data) |
| Read amplification | Low (3-4 reads) | High (check multiple SSTables) |
When to Choose B-trees
Choose B-trees when:
- Read-heavy workload (10:1 or higher read:write ratio)
- Predictable performance is crucial
- Point queries are your primary access pattern
- Range scans are common
- ACID transactions are needed (easier with B-trees)
Example scenarios:
- Traditional web applications (lots of user lookups)
- Inventory management systems
- Banking systems (balance lookups)
- Product catalogs
When to Choose LSM Trees
Choose LSM trees when:
- Write-heavy workload (1:1 or higher write:read ratio)
- Append-only or time-series data
- Ingestion speed is critical
- SSD storage (sequential writes benefit SSDs)
- Large datasets where compression matters
Example scenarios:
- Time-series databases (metrics, logs, IoT sensors)
- Event streaming systems
- Analytics databases
- Social media feeds (constant updates)
- Monitoring systems
Time-Series Databases: A Perfect LSM Use Case
Time-series data has special characteristics that make LSM trees ideal:
Why Time-Series Loves LSM
Characteristic 1: Write-heavy
- Millions of data points per second (sensor readings, metrics, logs)
- Writes far outnumber reads
- LSM’s fast writes shine here
Characteristic 2: Time-ordered
- Data naturally arrives in time order
- LSM trees already write sequentially
- Perfect match!
Characteristic 3: Immutable recent data
- Recent data is rarely updated (append-only)
- Fits LSM’s immutable SSTable model
- No need for in-place updates
Characteristic 4: Retention policies
- Old data gets deleted in bulk
- Can delete entire old SSTables
- Much easier than deleting from B-tree
Time-Series LSM Optimization
Scenario: IoT sensor network
1 million sensors × 1 reading per second = 1 million writes/sec
LSM tree optimization:
Partitioning by time:
SSTable-2024-10-07-14:00: All data for 2-3 PM
SSTable-2024-10-07-13:00: All data for 1-2 PM
SSTable-2024-10-07-12:00: All data for 12-1 PM
Query: "Get sensor 58393 data from 2:15-2:30 PM"
→ Only check SSTable-2024-10-07-14:00
→ Avoid reading other time ranges
Retention: "Delete data older than 30 days"
→ Delete entire SSTables older than 30 days
→ No per-record deletion needed
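Here is a small sketch of that idea: name each SSTable after the hour it covers, then enforce retention by dropping whole files. The naming scheme and helper functions are hypothetical:

```python
from datetime import datetime, timedelta, timezone

def sstable_name(ts: datetime) -> str:
    return ts.strftime("sstable-%Y-%m-%d-%H")           # one SSTable per hour of data

def expired(name: str, now: datetime, retention_days: int = 30) -> bool:
    bucket = datetime.strptime(name, "sstable-%Y-%m-%d-%H").replace(tzinfo=timezone.utc)
    return now - bucket > timedelta(days=retention_days)

now = datetime(2024, 10, 7, 14, tzinfo=timezone.utc)
tables = [sstable_name(now), sstable_name(now - timedelta(days=45))]
print([t for t in tables if expired(t, now)])
# ['sstable-2024-08-23-14'] -> delete this file wholesale, no per-record deletes needed
```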
Additional optimizations:
- Compress old SSTables aggressively (Zstandard, LZ4)
- Use different compaction for recent vs old data
- Keep recent data in faster storage (SSD), old data in slower storage (HDD)
Real-world Time-Series Databases
InfluxDB (LSM-based):
- Handles billions of data points
- Optimized for time-range queries
- Custom TSM (Time-Structured Merge tree) storage engine
Prometheus (LSM-inspired):
- Metrics monitoring
- 2-hour blocks written as immutable chunks
- Compacts blocks over time
TimescaleDB (Hybrid):
- Built on PostgreSQL (B-tree)
- But uses hypertables (time-based partitioning)
- Combines B-tree reliability with time-series optimizations
Deep Dive: Write Amplification
Both B-trees and LSM trees suffer from write amplification - writing more data than the application requested. Understanding this helps you tune your database.
B-tree Write Amplification
Scenario: Insert one 100-byte record
- Read leaf node (4 KB) into memory
- Modify it to add the record
- Write entire node back (4 KB write)
Write amplification: 4 KB / 100 bytes = 40x
If the insert causes splits:
- Write left split node (4 KB)
- Write right split node (4 KB)
- Write modified parent node (4 KB)
Write amplification: 12 KB / 100 bytes = 120x
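The same arithmetic in Python, using the 100-byte record and 4 KB page from the scenario (the prose above rounds 4 KB down to 4,000 bytes, which is why it quotes 40x and 120x):

```python
# Write amplification = bytes physically written / bytes the application asked to write.
record_bytes, page_bytes = 100, 4096           # 100-byte record, 4 KB page

plain_insert = page_bytes / record_bytes       # rewrite a single leaf page
with_split = 3 * page_bytes / record_bytes     # left leaf + right leaf + parent page
print(f"{plain_insert:.0f}x, {with_split:.0f}x")   # ~41x, ~123x
```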
LSM Tree Write Amplification
Scenario: Insert one 100-byte record
Phase 1 (MemTable → SSTable):
- Write to MemTable (100 bytes)
- When MemTable flushes → write 64 MB SSTable
Phase 2 (Compaction Level 0 → 1):
- Merge 4 SSTables (256 MB total) → write 256 MB
Phase 3 (Compaction Level 1 → 2):
- Merge into level 2 → write 2.5 GB
Total write amplification: Can reach 50-100x over the lifetime of data
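One rough way to estimate that figure under leveled compaction: each byte is rewritten about once per level it descends through, multiplied by the per-level fanout. The level count and fanout below are illustrative assumptions, not measured values:

```python
# Rough leveled-compaction estimate; both numbers below are assumptions.
on_disk_levels = 4     # levels beyond the initial MemTable flush
level_fanout = 10      # each level is ~10x larger than the one above it

write_amplification = 1 + on_disk_levels * level_fanout   # the 1 covers the initial flush
print(write_amplification)   # 41 -> the same ballpark as the 50-100x quoted above
```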
Mitigating Write Amplification
For B-trees:
- Use larger nodes (reduce splits)
- Batch writes when possible
- Use write-ahead log (WAL) for durability
For LSM trees:
- Tune compaction strategy
- Use leveled compaction (less amplification than size-tiered)
- Delay compaction during peak write times
- Adjust compaction triggers
Practical System Design Scenarios
Scenario 1: Social Media Feed
Requirements:
- 100 million active users
- Average 10 posts per day per user = 1 billion writes/day
- Average user checks feed 50 times/day = 5 billion reads/day
- Recent posts are most accessed
Analysis:
- Read-heavy: 5:1 read:write ratio
- But writes are very bursty
- Recent data needs fast access
- Old posts rarely accessed
Recommended approach: Hybrid
- LSM tree for write path (handle burst writes)
- Cache recent posts in memory
- B-tree indexes for user lookups
- Archive old data to cold storage
Scenario 2: Banking Transactions
Requirements:
- Account balance lookups (very frequent)
- Transaction inserts (less frequent)
- ACID guarantees required
- Consistent read performance critical
Analysis:
- Read-heavy
- Strong consistency needed
- Predictable latency critical
- Can’t afford read amplification
Recommended approach: B-tree
- PostgreSQL or similar
- B-tree indexes on account_id
- WAL for durability
- Replication for read scaling
Scenario 3: Application Monitoring
Requirements:
- 10,000 servers × 100 metrics/sec = 1 million writes/sec
- Query patterns: time-range aggregations
- Retention: 90 days, then delete
- Reads are dashboards and alerts
Analysis:
- Extremely write-heavy
- Time-series data
- Bulk deletes needed
- Aggregations over time ranges
Recommended approach: LSM Tree (Time-Series DB)
- InfluxDB or TimescaleDB
- Partition by time
- Compress old data
- Delete old SSTables wholesale
Scenario 4: E-commerce Product Catalog
Requirements:
- 10 million products
- Frequent searches by category, price, name
- Occasional inventory updates
- Range queries (price ranges)
Analysis:
- Read-dominated
- Complex queries (multiple indexes)
- Range scans common
- ACID transactions helpful
Recommended approach: B-tree
- PostgreSQL with multiple indexes
- B-tree on product_id (primary key)
- B-tree on category, price
- Full-text search index for names
Conclusion: Choosing Your Data Structure
The choice between B-trees and LSM trees isn’t about which is “better” - it’s about understanding your workload:
Choose B-trees when:
- Reads dominate your workload
- You need predictable, consistent latency
- ACID transactions are important
- Traditional relational database features matter
Choose LSM trees when:
- Writes dominate your workload
- You’re willing to trade read latency for write throughput
- Your data is append-mostly or time-series
- You can benefit from compression
Consider hybrid approaches when:
- You have complex requirements
- Different parts of your system have different needs
- You can cache hot data to mitigate read amplification
Remember: modern databases often use both ideas in different ways. PostgreSQL (primarily B-tree) relies on a write-ahead log, which is log-structured. RocksDB (LSM) keeps sorted index blocks inside each SSTable, so lookups within a single file work much like a small tree search. Understanding both helps you make better decisions.
Further Learning
Books:
- “Designing Data-Intensive Applications” by Martin Kleppmann
- “Database Internals” by Alex Petrov
Papers:
- “The Log-Structured Merge-Tree (LSM-Tree)” - O’Neil et al. (1996)
- “The Ubiquitous B-Tree” - Douglas Comer (1979)
Open Source:
- Study RocksDB source code (excellent LSM implementation)
- Study PostgreSQL B-tree implementation
- Experiment with both using benchmark tools
The best way to truly understand these structures is to use them, measure them, and see their trade-offs in action. Happy building!