Six months. 200x user growth. Zero unplanned downtime. This is the engineering story of how we dismantled a monolith under live traffic, re-architected it as a set of independently deployable services, and kept the platform running throughout — no maintenance windows, no user-visible degradation.
Where we started: a monolith at its limits
The client had a logistics coordination platform that had grown organically over four years. The codebase was a single Django application: one database, one deployment, one team responsible for everything. At 10,000 active users it was manageable. At 80,000, the cracks began to show.
Deploy times exceeded 40 minutes because every change required a full application restart. The PostgreSQL database had grown to 2.8TB with no partitioning. A spike in shipment-tracking requests would starve the background job queue. A failed background job could corrupt auth session state. Everything was coupled to everything else.
The business needed to scale to 500,000 users within six months to meet a contractual obligation. Our assessment: the existing architecture could not get there without a fundamental redesign.
The architecture decision: strangler fig, not big bang
We rejected the "rewrite from scratch" approach immediately. A parallel rewrite would take 8–12 months minimum, require maintaining two systems simultaneously, and introduce enormous risk at the cutover moment. The business needed continuous delivery of features throughout the migration.
Instead, we applied the strangler fig pattern: extract one service at a time from the monolith, route traffic to the new service, and let the old code path die gradually. The monolith shrinks as new services take over its responsibilities.
The sequencing was critical. We extracted services in order of highest blast radius first — the ones whose failure caused the most cascading damage. This meant the riskiest work happened while the team was still learning the new infrastructure, but it also meant that once those extractions were complete, the remaining monolith was far more stable.
Phase 1: The extraction order
We mapped every domain in the monolith by two axes: coupling (how many other domains touched it) and traffic volume. The four candidates for first extraction:
- Shipment tracking — highest read traffic, pure query workload, minimal writes. Perfect first candidate.
- Notification service — already quasi-isolated in the codebase, fire-and-forget pattern. Low risk.
- User auth & sessions — highest coupling, but the session store was causing the most production incidents. Worth the risk.
- Document generation — CPU-intensive, caused latency spikes that affected all other requests. Needed isolation.
We left the core shipment management logic — creation, assignment, status transitions — in the monolith until Phase 2. It had the deepest business logic and the most test coverage. Touching it first would have been reckless.
The traffic routing layer: a proxy that earned its keep
Every extraction followed the same pattern: build the new service, run it alongside the monolith in shadow mode (receiving real traffic, returning responses we discarded), compare outputs, then flip the router. We built a lightweight routing proxy using Nginx + Lua that could switch traffic per-endpoint without a deployment.
```lua
-- route /api/tracking/* to new service when flag is active
local flags = require("feature_flags")
local target = ngx.var.uri
if target:match("^/api/tracking") then
    if flags.is_enabled("tracking_v2", ngx.var.arg_tenant_id) then
        ngx.var.upstream = "tracking-service:8080"
    else
        ngx.var.upstream = "monolith:8000"
    end
end
```
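The interesting step in shadow mode is the output comparison. A minimal sketch in Python, assuming JSON response bodies; the `ignore_fields` default is illustrative, not the real exclusion list:

```python
import json

def compare_shadow(primary_body: str, shadow_body: str,
                   ignore_fields=("generated_at",)) -> list[str]:
    """Return the fields where the shadow service diverged from the monolith.

    Fields expected to differ legitimately (timestamps, trace IDs) are
    ignored so the diff only surfaces real divergence.
    """
    primary = json.loads(primary_body)
    shadow = json.loads(shadow_body)
    diffs = []
    for key in primary.keys() | shadow.keys():
        if key in ignore_fields:
            continue
        if primary.get(key) != shadow.get(key):
            diffs.append(key)
    return sorted(diffs)
```

Logging the per-endpoint diff rate over time gives a concrete signal for when a service is safe to promote out of shadow mode.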
Feature flags were scoped per tenant. We rolled out each new service to 1% of tenants, monitored for a week, then 10%, then 50%, then 100%. If something went wrong, we flipped the flag back — no redeploy, no incident, no user impact.
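A percentage rollout scoped per tenant needs deterministic bucketing, so a tenant stays in (or out of) the rollout across requests during the ramp. A sketch of the kind of check the proxy's `flags.is_enabled` would perform; this is an illustrative implementation, not the production flag service:

```python
import hashlib

def is_enabled(flag: str, tenant_id: str, rollout_percent: int) -> bool:
    # Hash flag + tenant so each flag ramps over an independent
    # shuffling of tenants, and the same tenant always lands in
    # the same bucket for a given flag.
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in 0..99
    return bucket < rollout_percent
```

Flipping the flag back is then just setting `rollout_percent` to 0 in the flag store: no redeploy, and every tenant instantly routes to the monolith again.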
Database: the hardest problem
The monolith's PostgreSQL instance was the most dangerous thing we touched. Every service was reading from and writing to the same 2.8TB database. Extracting a service without extracting its data is pointless — you've just added a network hop.
The migration approach
For each extracted service we followed a three-step database migration:
- Dual-write phase: The monolith continues to write to the original tables. The new service writes to its own schema simultaneously. All reads still go to the monolith's data. Duration: 2 weeks per service.
- Backfill and validation: Migrate historical data to the new schema. Run automated reconciliation scripts every 6 hours to catch divergence. Duration: 1 week.
- Read cutover: Flip reads to the new service's schema. Keep dual-write running for another week as a safety net. Then remove monolith writes.
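The dual-write step can be sketched as follows; `monolith_db` and `service_db` are hypothetical stand-ins for the real data-access layers:

```python
import logging

log = logging.getLogger("dual_write")

def save_shipment(shipment: dict, monolith_db, service_db) -> None:
    """Dual-write: the monolith's tables remain the source of truth;
    the write to the new service's schema is best-effort shadow state."""
    monolith_db.save(shipment)  # authoritative write
    try:
        service_db.save(shipment)  # shadow write to the new schema
    except Exception:
        # A failed shadow write must never fail the user request;
        # the periodic reconciliation catches the divergence.
        log.exception("shadow write failed for shipment %s", shipment.get("id"))
```

The asymmetry is deliberate: during this phase a shadow-write failure is a data-quality problem for reconciliation to flag, not a user-facing error.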
The entire process took 4–5 weeks per domain. It was slower than we wanted. It was worth every day — we caught three data inconsistencies during reconciliation that would have caused silent corruption if we'd moved faster.
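The reconciliation itself reduces to a keyed diff between the two schemas. A simplified in-memory sketch; a real run would page through the tables in batches and compare per-column checksums rather than whole rows:

```python
def reconcile(monolith_rows: dict[str, dict],
              service_rows: dict[str, dict]) -> dict[str, list[str]]:
    """Compare the two schemas keyed by primary key and report divergence."""
    report: dict[str, list[str]] = {
        "missing_in_service": [],
        "missing_in_monolith": [],
        "mismatched": [],
    }
    for pk, row in monolith_rows.items():
        other = service_rows.get(pk)
        if other is None:
            report["missing_in_service"].append(pk)
        elif other != row:
            report["mismatched"].append(pk)
    for pk in service_rows:
        if pk not in monolith_rows:
            report["missing_in_monolith"].append(pk)
    return report
```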
Phase 2: Handling the growth spike
By month three, we had four services live. User counts were already past 400,000 — well ahead of the contractual pace, because a new enterprise customer had onboarded early. The tracking service was handling 3,200 requests per second at peak.
Two things saved us here that we hadn't fully planned for:
Read replicas with connection pooling. We'd added PgBouncer in front of every database, but hadn't configured it optimally. Under load, we were exhausting connection limits. Reconfiguring PgBouncer's pool mode from session to transaction halved connection count and cut p99 latency from 340ms to 80ms in the tracking service.
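The pool-mode change is a single line in `pgbouncer.ini`. A minimal fragment; the sizing values are illustrative, only `pool_mode` reflects the change described above:

```ini
[pgbouncer]
; session pooling pins a server connection to a client for its whole
; session; transaction pooling returns it after each transaction, so
; far fewer server connections are needed under bursty load.
pool_mode = transaction
default_pool_size = 20     ; illustrative — tune per workload
max_client_conn = 2000     ; illustrative
```

One caveat worth stating: transaction pooling is incompatible with session-level PostgreSQL features (session-scoped prepared statements, advisory locks, `SET` without `LOCAL`), so the application needs auditing before the flip.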
Aggressive caching at the edge. Shipment status queries are read-heavy and tolerate 30-second staleness. Adding Redis in front of the tracking service dropped database load by 73%. The insight that unlocked this: most users check shipment status obsessively during the final 4 hours of delivery, so even a short TTL absorbs the bulk of that repeated read load while keeping near-delivery data fresh.
```python
async def get_shipment_status(shipment_id: str) -> ShipmentStatus:
    cache_key = f"status:v2:{shipment_id}"
    cached = await redis.get(cache_key)
    if cached:
        return ShipmentStatus.parse_raw(cached)
    status = await db.fetch_status(shipment_id)
    # TTL varies by delivery phase — shorter when near delivery
    ttl = 30 if status.hours_to_delivery > 4 else 8
    await redis.setex(cache_key, ttl, status.json())
    return status
```
The migration timeline
- Month 1, infrastructure foundation: Kubernetes cluster provisioned, service mesh (Linkerd) configured, observability stack deployed (Prometheus, Grafana, Tempo). Feature flag service live. Proxy routing layer tested.
- Month 2, first extractions (tracking + notifications): Tracking service extracted and handling 100% of traffic. Notification service decoupled via async queue (RabbitMQ). First database migration complete. 85,000 users.
- Month 3, auth extraction + caching layer: Session state migrated to a Redis-backed auth service. Document generation service isolated with a dedicated worker pool. Redis caching added to tracking. 430,000 users.
- Month 4, core domain extraction begins: Shipment management (creation, assignment, status) extraction starts. Dual-write phase active. PgBouncer reconfiguration resolves connection exhaustion at scale. 820,000 users.
- Month 5, core domain live + horizontal scaling: Core shipment service fully migrated. Auto-scaling policies configured per service. Monolith reduced to a thin API gateway. 1.4M users without performance degradation.
- Month 6, monolith decommissioned: Final monolith routes migrated. Original server retired. Full microservices architecture in production. 2.1M users. Zero unplanned outages recorded across the entire migration.
What we'd do differently
The migration succeeded, but three things cost us more time than they should have:
We underestimated service discovery complexity. The moment you have more than three services, hardcoded inter-service URLs become a maintenance burden. We retrofitted Consul-based service discovery in month three when we should have built it on day one.
Distributed tracing was added too late. We deployed Tempo in month one but didn't propagate trace IDs through the full request path until month three. The two-month gap meant we debugged cross-service issues with logs and guesswork. Never again.
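The propagation step we delayed is mechanically simple. A toy sketch of the idea, using a hypothetical `x-trace-id` header; in practice use the OpenTelemetry SDK and the W3C `traceparent` header rather than hand-rolling this:

```python
import uuid

TRACE_HEADER = "x-trace-id"

def outgoing_headers(incoming_headers: dict) -> dict:
    """Headers to attach to any downstream service call.

    Reuse the caller's trace ID if one arrived with the request,
    otherwise start a new trace at this service.
    """
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return {TRACE_HEADER: trace_id}
```

Once every service both forwards and logs this ID, a single grep (or a Tempo query) reconstructs a cross-service request path.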
The dual-write reconciliation scripts stayed manual for too long. We wrote bespoke reconciliation scripts for each service. By the fourth service, we'd built enough shared tooling that we could have automated this from the start. The time savings on the later extractions were significant — we should have invested in that infrastructure on service one.
The actual hard part: organizational, not technical
The technology was solvable. The harder challenge was maintaining a shared mental model across a team that was simultaneously migrating existing functionality and delivering new features on a contractual timeline.
What made it work: every engineer owned both the migration work for their domain and the feature work on top of it. We didn't create a separate "platform team" to do the migration while product engineers kept building on the monolith. That separation creates two systems and two codebases that diverge. Instead, the engineers who knew the domain did the extraction — they were the only ones who understood the edge cases well enough to do it safely.
The velocity cost was real: feature output dropped by roughly 30% during peak migration periods. We communicated this upfront to stakeholders with a clear model of when velocity would recover. It did, on schedule, in month five — and then exceeded pre-migration velocity because deployments went from 40 minutes to 4 minutes.