Six months. 200x user growth. Zero unplanned downtime. This is the engineering story of how we dismantled a monolith under live traffic, re-architected it as a set of independently deployable services, and kept the platform running throughout — no maintenance windows, no user-visible degradation.
Where we started: a monolith at its limits
The client had a logistics coordination platform that had grown organically over four years. The codebase was a single Django application: one database, one deployment, one team responsible for everything. At 10,000 active users it was manageable. At 80,000, the cracks began to show.
Deploy times exceeded 40 minutes because every change required a full application restart. The PostgreSQL database had grown to 2.8TB with no partitioning. A spike in shipment-tracking requests would starve the background job queue. A failed background job could corrupt auth session state. Everything was coupled to everything else.
The business needed to scale to 500,000 users within six months to meet a contractual obligation. Our assessment: the existing architecture could not get there without a fundamental redesign.
The architecture decision: strangler fig, not big bang
We rejected the "rewrite from scratch" approach immediately. A parallel rewrite would take 8–12 months minimum, require maintaining two systems simultaneously, and introduce enormous risk at the cutover moment. The business needed continuous delivery of features throughout the migration.
Instead, we applied the strangler fig pattern: extract one service at a time from the monolith, route traffic to the new service, and let the old code path die gradually. The monolith shrinks as new services take over its responsibilities.
The sequencing was critical. We extracted services in order of highest blast radius first — the ones whose failure caused the most cascading damage. This meant the riskiest work happened while the team was still learning the new infrastructure, but it also meant that once those extractions were complete, the remaining monolith was far more stable.
Phase 1: The extraction order
We mapped every domain in the monolith by two axes: coupling (how many other domains touched it) and traffic volume. The four candidates for first extraction:
- Shipment tracking — highest read traffic, pure query workload, minimal writes. Perfect first candidate.
- Notification service — already quasi-isolated in the codebase, fire-and-forget pattern. Low risk.
- User auth & sessions — highest coupling, but the session store was causing the most production incidents. Worth the risk.
- Document generation — CPU-intensive, caused latency spikes that affected all other requests. Needed isolation.
We left the core shipment management logic — creation, assignment, status transitions — in the monolith until Phase 2. It had the deepest business logic and the most test coverage. Touching it first would have been reckless.
The traffic routing layer: a proxy that earned its keep
Every extraction followed the same pattern: build the new service, run it alongside the monolith in shadow mode (receiving real traffic, returning responses we discarded), compare outputs, then flip the router. We built a lightweight routing proxy using Nginx + Lua that could switch traffic per-endpoint without a deployment.
```lua
-- route /api/tracking/* to new service when flag is active
local flags = require("feature_flags")
local target = ngx.var.uri
if target:match("^/api/tracking") then
    if flags.is_enabled("tracking_v2", ngx.var.arg_tenant_id) then
        ngx.var.upstream = "tracking-service:8080"
    else
        ngx.var.upstream = "monolith:8000"
    end
end
```
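The interesting step in shadow mode is the output comparison. A minimal sketch in Python, assuming JSON response bodies; the `ignore_fields` default is illustrative, not the real exclusion list:

```python
import json

def compare_shadow(primary_body: str, shadow_body: str,
                   ignore_fields=("generated_at",)) -> list[str]:
    """Return the fields where the shadow service diverged from the monolith.

    Fields expected to differ legitimately (timestamps, trace IDs) are
    ignored so the diff only surfaces real divergence.
    """
    primary = json.loads(primary_body)
    shadow = json.loads(shadow_body)
    diffs = []
    for key in primary.keys() | shadow.keys():
        if key in ignore_fields:
            continue
        if primary.get(key) != shadow.get(key):
            diffs.append(key)
    return sorted(diffs)
```

Logging the per-endpoint diff rate over time gives a concrete signal for when a service is safe to promote out of shadow mode.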
Feature flags were scoped per tenant. We rolled out each new service to 1% of tenants, monitored for a week, then 10%, then 50%, then 100%. If something went wrong, we flipped the flag back — no redeploy, no incident, no user impact.
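A percentage rollout scoped per tenant needs deterministic bucketing, so a tenant stays in (or out of) the rollout across requests during the ramp. A sketch of the kind of check the proxy's `flags.is_enabled` would perform; this is an illustrative implementation, not the production flag service:

```python
import hashlib

def is_enabled(flag: str, tenant_id: str, rollout_percent: int) -> bool:
    # Hash flag + tenant so each flag ramps over an independent
    # shuffling of tenants, and the same tenant always lands in
    # the same bucket for a given flag.
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in 0..99
    return bucket < rollout_percent
```

Flipping the flag back is then just setting `rollout_percent` to 0 in the flag store: no redeploy, and every tenant instantly routes to the monolith again.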
Database: the hardest problem
The monolith's PostgreSQL instance was the most dangerous thing we touched. Every service was reading from and writing to the same 2.8TB database. Extracting a service without extracting its data is pointless — you've just added a network hop.
The migration approach
For each extracted service we followed a three-step database migration:
- Dual-write phase: The monolith continues to write to the original tables. The new service writes to its own schema simultaneously. All reads still go to the monolith's data. Duration: 2 weeks per service.
- Backfill and validation: Migrate historical data to the new schema. Run automated reconciliation scripts every 6 hours to catch divergence. Duration: 1 week.
- Read cutover: Flip reads to the new service's schema. Keep dual-write running for another week as a safety net. Then remove monolith writes.
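The dual-write step can be sketched as follows; `monolith_db` and `service_db` are hypothetical stand-ins for the real data-access layers:

```python
import logging

log = logging.getLogger("dual_write")

def save_shipment(shipment: dict, monolith_db, service_db) -> None:
    """Dual-write: the monolith's tables remain the source of truth;
    the write to the new service's schema is best-effort shadow state."""
    monolith_db.save(shipment)  # authoritative write
    try:
        service_db.save(shipment)  # shadow write to the new schema
    except Exception:
        # A failed shadow write must never fail the user request;
        # the periodic reconciliation catches the divergence.
        log.exception("shadow write failed for shipment %s", shipment.get("id"))
```

The asymmetry is deliberate: during this phase a shadow-write failure is a data-quality problem for reconciliation to flag, not a user-facing error.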
The entire process took 4–5 weeks per domain. It was slower than we wanted. It was worth every day — we caught three data inconsistencies during reconciliation that would have caused silent corruption if we'd moved faster.
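The reconciliation itself reduces to a keyed diff between the two schemas. A simplified in-memory sketch; a real run would page through the tables in batches and compare per-column checksums rather than whole rows:

```python
def reconcile(monolith_rows: dict[str, dict],
              service_rows: dict[str, dict]) -> dict[str, list[str]]:
    """Compare the two schemas keyed by primary key and report divergence."""
    report: dict[str, list[str]] = {
        "missing_in_service": [],
        "missing_in_monolith": [],
        "mismatched": [],
    }
    for pk, row in monolith_rows.items():
        other = service_rows.get(pk)
        if other is None:
            report["missing_in_service"].append(pk)
        elif other != row:
            report["mismatched"].append(pk)
    for pk in service_rows:
        if pk not in monolith_rows:
            report["missing_in_monolith"].append(pk)
    return report
```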
Phase 2: Handling the growth spike
By month three, we had four services live. User counts were already past 400,000 — well ahead of the contractual pace, because a new enterprise customer had onboarded early. The tracking service was handling 3,200 requests per second at peak.
Two things saved us here that we hadn't fully planned for:
Read replicas with connection pooling. We'd added PgBouncer in front of every database, but hadn't configured it optimally. Under load, we were exhausting connection limits. Reconfiguring PgBouncer's pool mode from session to transaction halved connection count and cut p99 latency from 340ms to 80ms in the tracking service.
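The pool-mode change is a single line in `pgbouncer.ini`. A minimal fragment; the sizing values are illustrative, only `pool_mode` reflects the change described above:

```ini
[pgbouncer]
; session pooling pins a server connection to a client for its whole
; session; transaction pooling returns it after each transaction, so
; far fewer server connections are needed under bursty load.
pool_mode = transaction
default_pool_size = 20     ; illustrative — tune per workload
max_client_conn = 2000     ; illustrative
```

One caveat worth stating: transaction pooling is incompatible with session-level PostgreSQL features (session-scoped prepared statements, advisory locks, `SET` without `LOCAL`), so the application needs auditing before the flip.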
Aggressive caching at the edge. Shipment status queries are read-heavy and tolerate 30-second staleness. Adding Redis in front of the tracking service dropped database load by 73%. The insight that unlocked this: most users check shipment status obsessively during the final 4 hours of delivery, so even a short TTL absorbs the bulk of that repeated read load while keeping near-delivery data fresh.
```python
async def get_shipment_status(shipment_id: str) -> ShipmentStatus:
    cache_key = f"status:v2:{shipment_id}"
    cached = await redis.get(cache_key)
    if cached:
        return ShipmentStatus.parse_raw(cached)
    status = await db.fetch_status(shipment_id)
    # TTL varies by delivery phase — shorter when near delivery
    ttl = 30 if status.hours_to_delivery > 4 else 8
    await redis.setex(cache_key, ttl, status.json())
    return status
```
The migration timeline
- Month 1, infrastructure foundation: Kubernetes cluster provisioned, service mesh (Linkerd) configured, observability stack deployed (Prometheus, Grafana, Tempo). Feature flag service live. Proxy routing layer tested.
- Month 2, first extractions (tracking + notifications): Tracking service extracted and handling 100% of traffic. Notification service decoupled via async queue (RabbitMQ). First database migration complete. 85,000 users.
- Month 3, auth extraction + caching layer: Session state migrated to a Redis-backed auth service. Document generation service isolated with a dedicated worker pool. Redis caching added to tracking. 430,000 users.
- Month 4, core domain extraction begins: Shipment management (creation, assignment, status) extraction starts. Dual-write phase active. PgBouncer reconfiguration resolves connection exhaustion at scale. 820,000 users.
- Month 5, core domain live + horizontal scaling: Core shipment service fully migrated. Auto-scaling policies configured per service. Monolith reduced to a thin API gateway. 1.4M users without performance degradation.
- Month 6, monolith decommissioned: Final monolith routes migrated. Original server retired. Full microservices architecture in production. 2.1M users. Zero unplanned outages recorded across the entire migration.
What we'd do differently
The migration succeeded, but three things cost us more time than they should have:
We underestimated service discovery complexity. The moment you have more than three services, hardcoded inter-service URLs become a maintenance burden. We retrofitted Consul-based service discovery in month three when we should have built it on day one.
Distributed tracing was added too late. We deployed Tempo in month one but didn't propagate trace IDs through the full request path until month three. The two-month gap meant we debugged cross-service issues with logs and guesswork. Never again.
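The propagation step we delayed is mechanically simple. A toy sketch of the idea, using a hypothetical `x-trace-id` header; in practice use the OpenTelemetry SDK and the W3C `traceparent` header rather than hand-rolling this:

```python
import uuid

TRACE_HEADER = "x-trace-id"

def outgoing_headers(incoming_headers: dict) -> dict:
    """Headers to attach to any downstream service call.

    Reuse the caller's trace ID if one arrived with the request,
    otherwise start a new trace at this service.
    """
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return {TRACE_HEADER: trace_id}
```

Once every service both forwards and logs this ID, a single grep (or a Tempo query) reconstructs a cross-service request path.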
The dual-write reconciliation scripts stayed manual for too long. We wrote bespoke reconciliation scripts for each service. By the fourth service, we'd built enough shared tooling that we could have automated this from the start. The time savings on the later extractions were significant — we should have invested in that infrastructure on service one.
The actual hard part: organizational, not technical
The technology was solvable. The harder challenge was maintaining a shared mental model across a team that was simultaneously migrating existing functionality and delivering new features on a contractual timeline.
What made it work: every engineer owned both the migration work for their domain and the feature work on top of it. We didn't create a separate "platform team" to do the migration while product engineers kept building on the monolith. That separation creates two systems and two codebases that diverge. Instead, the engineers who knew the domain did the extraction — they were the only ones who understood the edge cases well enough to do it safely.
The velocity cost was real: feature output dropped by roughly 30% during peak migration periods. We communicated this upfront to stakeholders with a clear model of when velocity would recover. It did, on schedule, in month five — and then exceeded pre-migration velocity because deployments went from 40 minutes to 4 minutes.