
Database Performance: The Fundamentals That Actually Matter

After 25+ years of working with databases, I’ve learned that performance problems usually come down to a few fundamental issues. The specific database matters less than understanding these core concepts.

The Performance Triangle

Database performance boils down to three things:

  1. How much data do we read? (reducing I/O)
  2. How efficiently do we process it? (query optimization)
  3. How often do we repeat work? (caching and batching)

Most optimization efforts should focus on these fundamentals before reaching for exotic solutions.

Indexes: The First and Last Defense

The Basics

An index is a data structure that allows the database to find rows without scanning the entire table. Without proper indexes, even simple queries become expensive at scale.

-- Without index: Full table scan
SELECT * FROM orders WHERE customer_id = 12345;
-- Reads potentially millions of rows

-- With index: Direct lookup
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- Reads only matching rows
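The difference shows up directly in the execution plan. A small sketch using Python's built-in sqlite3 module (hypothetical orders table; any engine's EXPLAIN output shows the same shift from a full scan to an index lookup):

```python
import sqlite3

# SQLite in-memory sketch with a hypothetical orders table; the same
# plan change (full scan vs. index lookup) appears in any engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO orders (customer_id) VALUES (?)",
                 [(i % 100,) for i in range(10_000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM orders WHERE customer_id = 42"
before = plan(query)   # full table scan, e.g. "SCAN orders"
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
after = plan(query)    # index lookup, e.g. "SEARCH orders USING INDEX ..."
print(before, "->", after)
```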

Composite Index Order Matters

For multi-column indexes, column order is critical:

-- For queries filtering on status AND date range
CREATE INDEX idx_orders_status_date ON orders(status, created_at);

-- This query uses the index efficiently:
SELECT * FROM orders
WHERE status = 'pending'
  AND created_at > '2024-01-01';

-- This query can only use part of the index:
SELECT * FROM orders
WHERE created_at > '2024-01-01';

Put equality conditions first, range conditions last.
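The prefix rule can be verified in the plan as well. A sketch with Python's sqlite3 (hypothetical table and index names): the composite index serves the equality-plus-range query, but not the query that skips the leading column.

```python
import sqlite3

# SQLite sketch of the leftmost-prefix rule: an index on (status, created_at)
# helps only queries that constrain status, the leading column.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, status TEXT, created_at TEXT, total REAL)""")
conn.execute("CREATE INDEX idx_orders_status_date ON orders(status, created_at)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

both = plan("SELECT * FROM orders "
            "WHERE status = 'pending' AND created_at > '2024-01-01'")
date_only = plan("SELECT * FROM orders WHERE created_at > '2024-01-01'")
print(both)       # uses idx_orders_status_date (SEARCH)
print(date_only)  # full table scan: the leading column is missing
```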

The Index Trade-off

Every index speeds up reads but slows down writes. Each INSERT/UPDATE must maintain all indexes. Tables with 20 indexes might have slow writes even if each query is fast.

Query Patterns That Kill Performance

N+1 Queries

The classic ORM mistake:

# Bad: N+1 queries
orders = Order.objects.all()
for order in orders:
    print(order.customer.name)  # Each access hits the database

# Good: Eager loading
orders = Order.objects.select_related('customer').all()
for order in orders:
    print(order.customer.name)  # Customer data already loaded

This pattern appears in every language and framework. Watch for it.

SELECT *

Retrieving all columns when you only need a few wastes I/O:

-- Bad: Returns all columns
SELECT * FROM products WHERE category = 'electronics';

-- Better: Only needed columns
SELECT id, name, price FROM products WHERE category = 'electronics';

This matters especially for wide tables or when columns contain large text/blob data.

Understanding Execution Plans

Every database provides execution plans. Learn to read them:

-- PostgreSQL
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 12345;

-- SQL Server
SET STATISTICS IO ON;
SELECT * FROM orders WHERE customer_id = 12345;

-- Oracle
EXPLAIN PLAN FOR SELECT * FROM orders WHERE customer_id = 12345;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- MySQL
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;

Look for:

  • Full table scans on large tables (usually bad)
  • Index seeks vs index scans (seeks are better)

Index seeks are generally more efficient when retrieving a small number of rows: the engine navigates the B-tree directly to the matching entries. Index scans read the entire index, which makes sense when a large portion of the data is needed. Seeks suit highly selective predicates (e.g., WHERE id = 5); scans appear when many rows are retrieved or when the index supplies ordering, with cost roughly proportional to the number of index entries read.

  • Row estimates vs actual rows (big differences indicate stale statistics)
  • Sort operations on large result sets (often expensive)

Outdated statistics

Updating database statistics is essential to provide the query optimizer with accurate data distribution information (histograms and density), allowing it to generate efficient execution plans. Outdated statistics cause poor choices in join strategies, index usage, and memory allocation, leading to severe performance degradation, especially after significant data changes.

Use scheduled tasks to periodically recalculate statistics, and this shouldn't be a problem.
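The exact command is engine-specific (ANALYZE in PostgreSQL and SQLite, UPDATE STATISTICS in SQL Server, DBMS_STATS in Oracle). A minimal sqlite3 sketch, where ANALYZE writes per-index row statistics into sqlite_stat1:

```python
import sqlite3

# Hypothetical orders table; after a bulk load, ANALYZE refreshes the
# optimizer's statistics (stored by SQLite in the sqlite_stat1 table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE INDEX idx_orders_status ON orders(status)")
conn.executemany("INSERT INTO orders (status) VALUES (?)",
                 [("pending" if i % 10 else "shipped",) for i in range(1000)])

conn.execute("ANALYZE")  # recalculate statistics after the data change
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)  # one row per index with row-count statistics
```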

Connection Management

Connection Pools

Opening database connections is expensive. Use connection pools:

// Configure the pool appropriately (HikariCP; values are illustrative)
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://localhost:5432/app");
config.setMaximumPoolSize(20);       // hard cap on open connections
config.setMinimumIdle(5);            // warm connections kept ready
config.setConnectionTimeout(30000);  // ms to wait for a free connection
HikariDataSource dataSource = new HikariDataSource(config);

Connection Limits

Every database has a maximum connection limit. With microservices, it’s easy to exhaust them:

  • 10 services x 20 connections each = 200 connections
  • Add scaling: 10 services x 5 replicas x 20 connections = 1000 connections

Consider connection proxies (PgBouncer for PostgreSQL) for large deployments.

Caching Strategies

Query Result Caching

For read-heavy workloads, cache query results:

# Check cache first
cache_key = f"user:{user_id}"
user = cache.get(cache_key)
if not user:
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    cache.set(cache_key, user, ttl=300)

Cache Invalidation

The hard part isn’t caching—it’s knowing when cached data is stale. Strategies:

  • TTL-based: Simple but may serve stale data
  • Event-based: Invalidate on writes, more complex but accurate
  • Versioning: Include version in cache key, increment on changes
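The versioning strategy can be sketched in a few lines of Python. The dict-backed cache and the `db` mapping here are stand-ins for a real cache store (Redis, memcached) and a real database query:

```python
# Version-keyed caching sketch: writes bump a version counter instead of
# deleting entries, so stale versioned keys simply stop being read.
cache = {}

def get_user(user_id, db):
    version = cache.get(f"user:{user_id}:ver", 0)
    key = f"user:{user_id}:v{version}"
    if key not in cache:
        cache[key] = db[user_id]  # stand-in for the real SELECT
    return cache[key]

def update_user(user_id, data, db):
    db[user_id] = data
    # Increment the version on change; old cache entries become unreachable.
    cache[f"user:{user_id}:ver"] = cache.get(f"user:{user_id}:ver", 0) + 1

db = {1: {"name": "Ada"}}
first = get_user(1, db)                # cached under user:1:v0
update_user(1, {"name": "Grace"}, db)  # bumps version to 1
second = get_user(1, db)               # fresh read under user:1:v1
print(first, second)
```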

Batch Operations

Bulk Inserts

Individual inserts are slow. Batch them:

-- Slow: Individual inserts
INSERT INTO logs (message) VALUES ('Event 1');
INSERT INTO logs (message) VALUES ('Event 2');
-- ... 998 more

-- Fast: Bulk insert
INSERT INTO logs (message) VALUES
    ('Event 1'), ('Event 2'), ... ('Event 1000');
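From application code, the same batching usually goes through the driver. A sqlite3 sketch; any DB-API driver exposes executemany similarly:

```python
import sqlite3

# Bulk insert through the driver: executemany inside one transaction
# replaces 1,000 separate statements and commits (SQLite stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")
rows = [(f"Event {i}",) for i in range(1, 1001)]
with conn:  # single transaction: one commit instead of one per row
    conn.executemany("INSERT INTO logs (message) VALUES (?)", rows)
count = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
print(count)
```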

Batch Updates

Same principle applies to updates and deletes. Process in batches of 1000-10000 rows rather than one at a time or all at once.
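A chunked delete can be sketched like this with sqlite3 (the batch size and table are illustrative); each batch runs in its own short transaction so locks are released between rounds:

```python
import sqlite3

# Process a large delete in fixed-size batches rather than all at once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")
conn.executemany("INSERT INTO logs (message) VALUES (?)",
                 [(f"Event {i}",) for i in range(10_000)])

BATCH = 1000
batches = 0
while True:
    with conn:  # one short transaction per batch
        n = conn.execute(
            "DELETE FROM logs WHERE id IN (SELECT id FROM logs LIMIT ?)",
            (BATCH,)).rowcount
    if n == 0:
        break
    batches += 1
remaining = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
print(batches, remaining)
```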

Monitor, Measure, Repeat

Performance optimization without measurement is guesswork. Establish:

  • Baseline metrics: Query times, throughput, connection counts
  • Monitoring: Track slow queries, lock waits, resource usage
  • Alerting: Know when performance degrades before users complain

Every database provides tools for this. Use them.

The 80/20 Rule

In my experience, 80% of database performance problems come from:

  1. Missing or wrong indexes
  2. N+1 query patterns
  3. No connection pooling
  4. Queries returning too much data

Fix these basics before pursuing advanced optimizations.

Struggling with database performance? Get in touch for expert analysis and optimization recommendations.
