"Software engineering is more than just centering a div." — howtocenterdiv.com
TL;DR: You profiled Node.js, added Redis, switched runtimes. Still slow. The database was the problem the whole time. Fix N+1 queries and missing indexes before you touch anything else.
Same story every time. App is slow. Someone swaps the framework. Still slow. Someone adds Redis. Still slow. Finally somebody opens up the query plan and there it is — a sequential scan across 4 million rows because nobody put an index on user_id. Four seconds per request. Was sitting there the whole time.
Nobody ever looks at the database first though. It's always "let's try Fastify" or "let's add a cache" because those feel like real engineering work. An index is one line of SQL. You can't demo one line of SQL in standup. A framework migration gets you a PR, a benchmark chart, maybe even a blog post. But it won't fix a missing index, obviously.
And honestly? Most backend devs have never been taught to read a query plan. Bootcamps don't go there, tutorials barely mention EXPLAIN, and you end up figuring it all out at 2am during a production fire. After that you never forget it. But it shouldn't take a production incident to learn this stuff.
N+1 Queries
We all learn about N+1 early on. We all still write one at some point:
```javascript
const orders = await db.query('SELECT * FROM orders LIMIT 50');

for (const order of orders) {
  order.user = await db.query(
    'SELECT * FROM users WHERE id = $1',
    [order.userId]
  );
}
```
51 round trips for 50 orders. PM says "can we do 500 per page?" and now it's 501. The page is slow because of a loop hitting the database on every iteration — nothing to do with Node being slow or not.
What's really annoying is that N+1 hides in dev. You've got 10 rows locally, it returns instantly, you move on. Deploys, data piles up, and six months from now somebody files a Jira ticket about the orders page taking 8 seconds. By then everyone forgot this code existed.
One JOIN:
```sql
SELECT orders.*, users.name, users.email
FROM orders
JOIN users ON orders.user_id = users.id
LIMIT 50;
```
ORMs make it worse because the code looks clean:
```javascript
// Reads beautifully. Fires 51 queries.
const orders = await prisma.order.findMany({ take: 50 });
for (const order of orders) {
  const user = await prisma.user.findUnique({ where: { id: order.userId } });
}

// This does the same thing in 1 query.
const orders = await prisma.order.findMany({
  take: 50,
  include: { user: true },
});
```
Nothing about that first version looks wrong. Prisma, Sequelize, TypeORM, all of them will do this to you if you don't pay attention. The whole selling point of an ORM is that it hides SQL. But it hides the query count too, and that's the part that bites.
Turn on query logging. Prisma has log: ['query'] on the client, Sequelize takes logging: console.log. Go look at what gets generated. I've opened up codebases where one endpoint was doing 200+ queries because somebody nested includes three levels deep. Endpoint worked, was just painfully slow, and nobody thought to check the logs until users started complaining.
GraphQL makes it uglier because resolvers fire per field, so a single query can spawn dozens of DB calls without anyone realizing. DataLoader was built for exactly this — batches and deduplicates calls within one tick. Retrofitting it later is painful so do it early if you can.
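The batching trick is easy to see in miniature. Here's a hand-rolled sketch of what DataLoader does (illustrative names, not the real library): collect every key requested in the same tick, dedupe them, and resolve the whole set with one batch call.

```javascript
// Minimal DataLoader-style batcher. batchFn receives a deduped array of
// keys and must return results in the same order.
function createLoader(batchFn) {
  let queue = [];
  let scheduled = false;
  return function load(key) {
    return new Promise((resolve, reject) => {
      queue.push({ key, resolve, reject });
      if (!scheduled) {
        scheduled = true;
        // Flush once the current tick's synchronous work is done.
        queueMicrotask(async () => {
          const batch = queue;
          queue = [];
          scheduled = false;
          const keys = [...new Set(batch.map((e) => e.key))]; // dedupe
          try {
            const results = await batchFn(keys); // ONE query for all keys
            const byKey = new Map(keys.map((k, i) => [k, results[i]]));
            batch.forEach((e) => e.resolve(byKey.get(e.key)));
          } catch (err) {
            batch.forEach((e) => e.reject(err));
          }
        });
      }
    });
  };
}
```

Every resolver calls `load(userId)` as if it were a single fetch, but anything requested in the same tick collapses into one `WHERE id IN (...)` query behind the scenes.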
EXPLAIN ANALYZE
Slow query? Don't touch the code yet. Run this first:
```sql
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
```
The output tells you how Postgres actually executed the query. Seq Scan is the bad one — it means Postgres read through the entire table, row by row, to find your data. On a table with 200 rows nobody cares. On a table with millions of rows that's where your response time went. Index Scan is what you want — Postgres found an index and skipped straight to the relevant rows.
```sql
-- Seq Scan: reading 4,999,999 rows to find 1. Painful.
Seq Scan on orders  (cost=0.00..89234.00 rows=1 width=120)
  Filter: (user_id = 42)
  Rows Removed by Filter: 4999999

-- One index fixes it
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Now look at the difference
Index Scan using idx_orders_user_id on orders
  Index Cond: (user_id = 42)
```
4 seconds to 4 milliseconds on a production table with real data. Not theoretical.
There's a third one you might run into: Bitmap Heap Scan. Postgres builds a map of which disk pages probably contain matching rows and reads those instead of scanning everything. Not usually a problem, but if your query should return like 3 rows and you see a bitmap scan, dig into it. The index might not match the query well, or Postgres has stale statistics for that table.
Also look at estimated rows versus actual rows in the output. When those are wildly different (Postgres expected 10, actually got 50,000) the planner is making decisions on wrong assumptions. Fix it by running ANALYZE tablename — different command from EXPLAIN ANALYZE, it just tells Postgres to rescan and update its internal statistics. I've seen terrible query plans go to perfectly fine after that one command.
Do this for every query before it ships. Not just the complicated ones — the boring three-line SELECTs are the ones that slip through review without anyone running EXPLAIN, and those tend to be the worst performers.
Indexes
Indexes speed up reads but slow down writes. Every INSERT, UPDATE, DELETE also has to update every index on that table. I've debugged write latency issues that turned out to be nothing more than too many indexes on one table — 15 of them, most added "just in case" and never used by any query.
Only index what you actually filter, join, or sort on. WHERE, JOIN, ORDER BY. If someone says "let's index this column in case we need it later" — push back. You probably won't need it, and writes just got slower for nothing.
One thing that confuses people: a composite index on (user_id, created_at) is a completely different thing from two separate indexes on each column. Not interchangeable even though it feels like they should be. Your query does WHERE user_id = 42 ORDER BY created_at? The composite covers both filter and sort in one pass. Separate indexes — Postgres picks one for the WHERE, grabs the rows, sorts them in memory separately.
Column order in composites matters a lot. (user_id, created_at) works for queries that filter on user_id alone. It does absolutely nothing for queries filtering only on created_at. Leftmost column has to be in the query. Works like a phone book — sorted by last name first, then first name within that. Easy to find all the Smiths. Impossible to efficiently find every John across all last names.
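The phone-book behavior is easy to demonstrate with a toy model — a pre-sorted array standing in for a B-tree. This is an illustration of why sort order helps, not how Postgres actually stores indexes.

```javascript
// Sorted by last name, then first name — like an index on (user_id, created_at).
const phoneBook = [
  ['jones', 'anna'], ['smith', 'bob'], ['smith', 'zoe'], ['wong', 'mei'],
];

// Leftmost-column lookup: entries with the same last name are contiguous,
// so we jump to the first match and stop at the first non-match. The sort
// order does the work (a real index would binary-search to `start`).
function findByLastName(book, last) {
  const start = book.findIndex(([l]) => l === last);
  if (start === -1) return [];
  const out = [];
  for (let i = start; i < book.length && book[i][0] === last; i++) {
    out.push(book[i]);
  }
  return out;
}

// Second-column-only lookup: the sort order says nothing about where a
// given first name lives, so every entry must be checked — a full scan.
function findByFirstName(book, first) {
  return book.filter(([, f]) => f === first);
}
```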
Partial indexes are another trick not enough people know about. If you've got a table where 95% of the rows are 'completed' and you're only ever querying for the 'pending' ones, you can tell Postgres to only index those:
```sql
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';
```
Tiny index, covers exactly what you need, barely any write overhead.
Connection Pools
Postgres defaults to 100 max connections. Your app uses a pool — reusable connections instead of opening new ones per request. Most libraries default to 10 and nobody revisits that number after initial setup.
Real traffic breaks this immediately:
```
Pool size: 10
Requests per second: 300
Each request holds a connection for 50ms

Connections needed: 300 * 0.05 = 15
You have 10. Five requests queued at any moment.
Waits compound. p99 latency climbs.
```
People reference the formula (core_count * 2) + effective_spindle_count. Four-core SSD box gives roughly 9. But this is a starting point at best — you need to load test with realistic traffic to find what actually works.
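Both numbers fall out of a couple of lines of arithmetic — Little's law for the first, the rule of thumb for the second. Same figures as the example above:

```javascript
// Connections in use ≈ arrival rate × time each request holds a connection
// (Little's law applied to the pool).
function connectionsNeeded(requestsPerSecond, holdTimeMs) {
  return Math.ceil(requestsPerSecond * (holdTimeMs / 1000));
}

// The (core_count * 2) + effective_spindle_count starting point.
function poolStartingPoint(coreCount, spindleCount) {
  return coreCount * 2 + spindleCount;
}

console.log(connectionsNeeded(300, 50)); // 15 — demand exceeds a pool of 10
console.log(poolStartingPoint(4, 1));    // 9 for a four-core box with one SSD
```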
One thing people get wrong here: cranking the pool up doesn't help. Setting it to 50 per instance can actually make performance worse. Postgres has to context-switch between connections and each one takes memory on the server side. The sweet spot is usually lower than you'd expect. PostgreSQL wiki has a good writeup on why.
Kubernetes makes it worse. 10 pods × 10 pool size = 100 connections just from your app. Already hitting the default limit. Then a rolling deploy starts, new pods open connections before old pods release theirs, and for a few seconds you're way over. Seen this crash databases.
PgBouncer fixes this. Sits between your app and Postgres, multiplexes connections at the protocol level. Your 10 pods each think they've got 10 connections but PgBouncer maps all of that to maybe 20 actual Postgres connections. Set it up before you actually need it — debugging connection limits at 3am while Googling "FATAL: too many clients already" is not a fun time.
The Slow Query Log
The single most underrated tool in Postgres, and almost nobody has it turned on:
```sql
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();
```
That 1000 is milliseconds. Logs every query over one second. You can lower it to 500 or 200 later but start at a second so you don't drown in output.
Leave it on in production for a day then go read the file. You will find queries you completely forgot about. Some ancient cron job nobody's looked at since it was written, an admin panel endpoint that hits a full table scan twice a year when someone actually clicks the button.
Here's the thing about optimization — most teams do it by guessing. "This query has three JOINs so it's probably the slow one." Then you check and the three-JOIN query runs in 2ms because the indexes are right. Meanwhile some tiny SELECT with no WHERE index has been quietly running 10,000 times a day, each one doing a seq scan. The log doesn't guess. It just records what happened.
pg_stat_statements goes even further if you want it. Extension that tracks execution count, total time, average time for every distinct query. Basically gives you a leaderboard of which queries eat the most database time. Takes five minutes to turn on.
When Redis Actually Makes Sense
After queries are cleaned up. After indexes are right. After the slow query log stopped turning up problems.
Then sure, put Redis in front of it.
Good fit for data that changes rarely but gets read all the time — sessions, feature flags, config, product catalogs. Bad fit as a patch over a slow query. The slow query doesn't go away. It still fires on cache misses: traffic spikes, cold starts after deploys, TTL expiration at 3am when nobody's monitoring.
Check if you even need a separate service. Postgres materialized views are cached query results that live right in the database. Refresh them on a schedule or trigger. For dashboards and aggregated stats, they might be enough without adding Redis to your infrastructure.
If you do go with Redis, figure out invalidation before writing any code. "We'll handle cache invalidation later" always ends with users seeing stale data and filing bugs nobody can reproduce because the cache expired between the report and the investigation. TTL, write-through, or both — pick one upfront.
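What "pick one upfront" looks like in miniature: a cache-aside helper with a TTL and an explicit invalidation hook. A plain Map stands in for Redis here, and the names are illustrative — the point is that misses fall through to the loader, which is why the underlying query still has to be fast.

```javascript
// Cache-aside with TTL. `now` is injectable so expiry is testable.
function createCache(ttlMs, now = Date.now) {
  const store = new Map();
  return {
    async get(key, loader) {
      const hit = store.get(key);
      if (hit && now() - hit.at < ttlMs) return hit.value; // fresh hit
      const value = await loader(key); // miss or expired: the real query runs
      store.set(key, { value, at: now() });
      return value;
    },
    invalidate(key) {
      store.delete(key); // call this on writes so readers don't see stale data
    },
  };
}
```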
Fix the query first. Cache if the numbers justify it after.
Transactions
Transactions hold locks. Longer the transaction, longer the lock, bigger the pileup:
```javascript
// Holds a lock while sendEmail does... whatever sendEmail does
await db.transaction(async (trx) => {
  await trx('orders').where({ id: orderId }).update({ status: 'confirmed' });
  await sendEmail(order.userEmail); // could take 2+ seconds
  await trx('inventory').where({ productId }).decrement('stock', 1);
});
```
Email has no business being inside a database transaction. SMTP is slow, sometimes very slow, and during that entire time you've got a lock held on the orders row while every other request that needs it waits.
"But what if the email fails?" Putting it inside the transaction doesn't actually solve that. If sendEmail throws between the two DB calls, the whole transaction rolls back — now your order isn't confirmed AND the email didn't send. Worse than before. Better approach: commit the DB work, send the email separately, handle failures with a retry queue or dead letter table. Side effects and transactions don't mix.
Transactions should be short and contain only DB operations. HTTP calls, file I/O, email sends — none of that belongs in there.
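The fixed version of the earlier example, sketched in the same knex-style calls. `enqueueEmail` is a hypothetical job-queue helper, not a real API — the shape is what matters: commit first, side effects after.

```javascript
// DB work commits quickly; the email happens after the lock is released.
// A failed email retries from a queue instead of rolling back an order.
async function confirmOrder(db, { orderId, productId, userEmail }, enqueueEmail) {
  await db.transaction(async (trx) => {
    await trx('orders').where({ id: orderId }).update({ status: 'confirmed' });
    await trx('inventory').where({ productId }).decrement('stock', 1);
  }); // lock released here

  // Side effect outside the transaction. `enqueueEmail` is assumed to hand
  // off to something retry-capable (job queue, outbox table, etc.).
  await enqueueEmail({ to: userEmail, template: 'order-confirmed' });
}
```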
SELECT * Considered Harmful (Sometimes)
Problem with SELECT * is that it pulls back everything on the row. Every column. Including that massive TEXT field full of HTML from the CMS feature someone bolted on two years ago that nobody remembers. You need id, name, and price for a product list — but you're also dragging 10KB of product description HTML across the wire for every single row.
```sql
-- Grabs everything including the giant description column
SELECT * FROM products WHERE category_id = 5;

-- Just what you need
SELECT id, name, price, thumbnail_url FROM products WHERE category_id = 5;
```
Non-issue on small tables. But wide tables returning hundreds of rows — the bandwidth waste shows up in response times. Postgres also has this thing called covering indexes where if every column in your SELECT is already in the index, it never has to touch the actual table. Really fast when it kicks in. SELECT * prevents that from ever happening because obviously no index has every column.
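A covering index for the product-list query above might look like this (a sketch, assuming that query really is hot; INCLUDE is Postgres 11+ and adds columns to the index leaves without making them part of the key):

```sql
-- Key column serves the WHERE; INCLUDEd columns let the SELECT list be
-- answered entirely from the index — an index-only scan, no table access.
CREATE INDEX idx_products_category_covering
ON products (category_id)
INCLUDE (id, name, price, thumbnail_url);
```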
ORMs select everything by default. Prisma has select, Sequelize has attributes. Worth using for list endpoints where you don't need every field.
Before You Add Infrastructure
Before Redis. Before read replicas. Before anyone brings up sharding:
1. Run EXPLAIN ANALYZE on your slowest queries.
2. Hunt for N+1 patterns in your ORM code.
3. Add indexes on columns in WHERE, JOIN, ORDER BY.
4. Turn on the slow query log.
5. Check pool size against real traffic.
6. Nothing slow inside transactions.
Step 2 or 3 alone fixes most apps. Rest is cleanup.
New infrastructure looks great on an architecture diagram and ships fast. But if the actual problem is a missing index or a loop full of queries, none of it matters. A bad query behind three caches is still a bad query — now you just also have three caches to debug.
This also isn't a one-time thing. Queries that work at 100K rows collapse at 10M. An index you added for a feature that got killed last quarter is just slowing down writes now. Check the slow query log every few weeks, not just during outages. It's ongoing work.