More bad query is worse than less bad query

Stop hiding performance problems and start solving them

January 15, 2024

performance

database

engineering

The Wrong Way to Fix Performance Problems

Moving It to the Background Doesn't Fix It

When faced with slow operations, there's a natural tendency to reach for the same solution: "Let's move it to the background."

This doesn't solve the problem. It just moves it.

The Background Job Fallacy

Here's what typically happens:

Feature works great in development (10 records)
Gets slow in staging (1,000 records)
Becomes unusable in production (100,000 records)
Solution: "Let's add a background job!"

Now you have:

The same slow query running somewhere else
Added complexity of job queues
Delayed user feedback
More infrastructure to maintain
The same performance problem, just hidden
Users wondering why their data isn't ready yet
Support tickets about "missing" data that's just processing

The Cascade of Complications

Once you move something to the background, the problems multiply:

Now You Need Progress Indicators

Users can't see results immediately, so you add progress bars, spinners, "check back later" messages. More code, more complexity.

Now You Need Notifications

When the job finishes, you need to tell users. Email? Push notifications? In-app alerts? More infrastructure.

Now You Need Retry Logic

Background jobs fail. Networks hiccup. Databases go down. So you add retries, exponential backoff, dead letter queues. More complexity.

Now You Need Monitoring

Is the queue backed up? Are jobs failing? How long is the average wait? You need dashboards, alerts, on-call rotations.

The Real Problem

The queries you run are one of the few things that really matter for performance. Everything else is rearranging deck chairs on the Titanic.

When something is slow, the answer isn't to hide it. The answer is to understand why it's slow and fix it.

Common Cop-Outs

"We'll paginate it"

Great, now it's slow 50 times instead of once, because each page still runs the same unoptimized query. And users have to click through pages to find what they need.

"We'll cache it"

Caching seems like a silver bullet until:

The cache expires during peak traffic
You need to invalidate it (cache invalidation is one of the two hard problems in computer science)
Different users need different data (per-user caching gets expensive fast)
The first user after expiry hits the slow path and times out
You realize you're now maintaining two systems: the database and the cache
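The expiry problem in the list above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `TTLCache` class, the `slow_report` stand-in for the slow query, and the fake clock used to skip ahead in time. It is not a production cache, only a way to see that the slow path comes back every time the entry expires.

```python
import time

class TTLCache:
    """A deliberately tiny time-based cache (illustrative, not production-ready)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}  # key -> (expires_at, value)

    def get(self, key, compute):
        entry = self.store.get(key)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1]        # fresh hit: the slow path is skipped
        value = compute()          # miss or expired: this caller pays the full cost
        self.store[key] = (now + self.ttl, value)
        return value

slow_calls = 0

def slow_report():
    """Hypothetical stand-in for the slow query; counts how often it really runs."""
    global slow_calls
    slow_calls += 1
    return "report"

fake_now = [0.0]  # injectable clock so the example does not need to sleep
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])

cache.get("report", slow_report)  # cold cache: slow path runs
cache.get("report", slow_report)  # served from the cache
fake_now[0] = 61.0                # the TTL has passed
cache.get("report", slow_report)  # slow path runs again
print(slow_calls)
```

The cache reduced three slow calls to two, but the query itself is exactly as slow as before, and whoever arrives right after expiry still waits for it.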

"We'll pre-compute it"

Now you're running the slow query all the time instead of on-demand. And dealing with stale data. And managing another background job.

"We'll use a faster language"

Your N+1 query problem is still an N+1 query problem in Rust. Bad algorithms are bad in any language.
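To make the N+1 point concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `users`/`orders` schema, the sample rows, and the function names are assumptions for illustration; the same pattern shows up with any driver or ORM, in any language.

```python
import sqlite3

def make_db():
    """Build a tiny in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
        INSERT INTO users VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
        INSERT INTO orders VALUES (1, 1, 10.0), (2, 2, 20.0), (3, 3, 30.0), (4, 1, 40.0);
    """)
    return conn

def n_plus_one(conn):
    """One query for the orders, then one more query per order."""
    queries = 0
    orders = conn.execute("SELECT id, user_id, total FROM orders ORDER BY id").fetchall()
    queries += 1
    result = []
    for order_id, user_id, total in orders:
        row = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
        queries += 1
        result.append((order_id, row[0], total))
    return result, queries

def single_join(conn):
    """The same data fetched with one JOIN."""
    rows = conn.execute("""
        SELECT o.id, u.name, o.total
        FROM orders o
        JOIN users u ON o.user_id = u.id
        ORDER BY o.id
    """).fetchall()
    return rows, 1

conn = make_db()
slow_rows, slow_queries = n_plus_one(conn)
fast_rows, fast_queries = single_join(conn)
print(slow_queries, fast_queries)
```

Both functions return the same rows. The first issues one query per order plus one for the list, so its cost grows with the table; the second issues one query no matter how many orders there are.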

"We'll throw hardware at it"

Scaling vertically has limits. Even if the query parallelizes perfectly, 30 seconds on 8 cores becomes 15 seconds on 16 cores. Still too slow.
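The 15-second figure is the best case. A rough way to see this is Amdahl's law, sketched below in Python. The function name and the 90% parallel fraction are assumptions chosen for illustration, not measurements of any real query.

```python
def time_after_scaling(current_seconds, current_cores, new_cores, parallel_fraction):
    """Estimate runtime on new_cores using Amdahl's law.

    parallel_fraction is the share of the single-core work that can be spread
    across cores (1.0 = perfectly parallel, 0.0 = entirely serial).
    """
    serial_fraction = 1.0 - parallel_fraction

    def relative_time(cores):
        # Runtime on `cores` cores, with single-core runtime normalised to 1.
        return serial_fraction + parallel_fraction / cores

    single_core_seconds = current_seconds / relative_time(current_cores)
    return single_core_seconds * relative_time(new_cores)

# Perfectly parallel work: doubling the cores halves the time.
best_case = time_after_scaling(30, 8, 16, parallel_fraction=1.0)

# Assumed 90% parallel work: the serial 10% does not get any faster.
realistic = time_after_scaling(30, 8, 16, parallel_fraction=0.9)

print(round(best_case, 1), round(realistic, 1))
```

Under these assumptions, doubling the hardware takes the query from 30 seconds to about 22, not 15. Either way it is still far too slow for a user waiting on a page.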

"We'll shard/partition the data"

Now you have distributed systems problems on top of your query problems. Plus the complexity of routing requests to the right shard.

The Right Approach

Profile First: Run the query under EXPLAIN ANALYZE (or your database's equivalent) and read the plan. Understand what's actually happening.
Fix the Query: Add the right indexes. Remove unnecessary joins. Fetch only what you need.
Measure Again: Confirm it's actually faster.
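The three steps above can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the `orders` table and its rows are made up, and SQLite's `EXPLAIN QUERY PLAN` stands in for `EXPLAIN ANALYZE`, which SQLite does not have. It reports the chosen plan rather than measured timings, and the exact wording of the plan varies by SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL, created_at TEXT)"
)
# 1,000 hypothetical orders from 2023 plus a single order from 2024.
rows = [(i % 10, float(i), f"2023-{(i % 12) + 1:02d}-15") for i in range(1000)]
rows.append((1, 5.0, "2024-02-01"))
conn.executemany("INSERT INTO orders (user_id, total, created_at) VALUES (?, ?, ?)", rows)

QUERY = "SELECT id, total FROM orders WHERE created_at > '2024-01-01'"

def plan(conn, sql):
    """Return the planner's description of how it would run `sql`."""
    plan_rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The fourth column of each row is the human-readable detail.
    return " | ".join(row[3] for row in plan_rows)

# Step 1: profile. With no index, the planner has to read the whole table.
before = plan(conn, QUERY)

# Step 2: fix. Index the column used in the WHERE clause.
conn.execute("CREATE INDEX idx_orders_created_at ON orders(created_at)")

# Step 3: measure again. The plan should now mention the new index.
after = plan(conn, QUERY)
matching = conn.execute(QUERY).fetchall()

print("before:", before)
print("after: ", after)
print("rows returned:", len(matching))
```

On a real system the third step should also compare wall-clock timings on production-sized data, since a plan that looks better is not proof that the query is faster.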

Only after exhausting query optimization should you consider architectural changes.

A Quick Example

Instead of moving this to a background job:

-- Takes 30 seconds
SELECT * FROM orders o
JOIN users u ON o.user_id = u.id
JOIN products p ON o.product_id = p.id
WHERE o.created_at > '2024-01-01';

Fix the actual problem:

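-- Takes 0.3 seconds with proper indexes
-- Index the filter column so the date range can use an index instead of scanning every order
CREATE INDEX idx_orders_created_at ON orders(created_at);
-- Composite index on the foreign keys used by the joins
CREATE INDEX idx_orders_user_product ON orders(user_id, product_id);

-- Fetch only the needed columns; alias the two name columns so they don't collide
SELECT o.id, u.name AS user_name, p.name AS product_name, o.total
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN products p ON o.product_id = p.id
WHERE o.created_at > '2024-01-01';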

The Bottom Line

Performance problems don't disappear when you move them to the background. They just become someone else's problem - usually yours at 3 AM when the job queue backs up.

Fix the query. It's almost always the query.