Seven months to modular: re-architecting a multi-tenant SaaS platform without stopping delivery

The system before

The platform was a multi-tenant SaaS application serving organizations that needed financial ledger, identity, and reporting capabilities in one product. It had been built and extended over several years by teams with different priorities at different moments. The result was predictable: a system that worked but could not evolve quickly.

Four constraints triggered the re-architecture.

Adding a new module took months. The core application had no extension points, so a new capability meant modifying the shared data model, the monolithic application layer, and the test suite. Engineers building new capabilities had to navigate the entire existing system just to work out what they might break. Estimates for new modules ran from two to four months.

Tenant isolation depended on developer discipline. Tenant scoping was enforced at the application layer, so a developer had to remember to add the right filter to every query. Security reviews had found several cases where the filter was missing. There was no structural guarantee, and the risk kept coming back.

Deployment was a coordinated event. Every deployment touched the whole system. A change to the reporting module meant deploying everything, including the financial ledger module it had nothing to do with. Deployment frequency was low and falling, and change batching was producing larger, riskier releases.

Module communication was hardcoded. When the ledger module needed to notify the reporting module, it called it directly, a hard in-process dependency. Moving to async communication later would mean rewriting both sides.

The first hypothesis

The initial approach was a standard layered modular monolith: extract modules with clear boundaries, keep them in a single deployable, communicate in-process. That would handle the "adding a new module takes months" constraint and give teams independent ownership of their code.

The hypothesis held for the first phase. Then it exposed a second problem. The in-process communication pattern was simple, but it was wiring modules together in ways that would make independent deployment impossible later. Module A called module B the same way whether the two were logically separate or not.

The re-strategization at month three was a big one. The communication layer had to be abstracted before it became too embedded to change. The dispatch abstraction, a unified interface that sends commands and publishes events without the caller knowing or caring whether the transport is in-process or a message bus, moved from a Phase 4 concern to a Phase 2 decision.

That meant rework. It was still the right call.

The architecture that emerged

The module contract

Every module follows the same structure:

Module/
├── Contracts       (commands, queries, integration events - shared across modules)
├── Domain          (entities, value objects, domain events - zero infrastructure)
├── Application     (handlers, validators, DTOs - consumes domain and contracts)
├── Infrastructure  (database context, repositories, external clients)
└── Api             (IModuleStartup implementation - wires DI and HTTP endpoints)

The Contracts layer is the only public surface. Other modules import contracts, never domain or application code. The domain layer has zero infrastructure dependencies, and architecture tests enforce that at build time. A domain layer that imports Entity Framework Core fails the build.
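To make the two boundary rules concrete, here is a minimal sketch in Python. It is an illustration only: the platform's own checks run through NetArchTest, and the `Reference` type, the layer-name matching, and `find_violations` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass

# Layer names follow the module layout above; the matching rules are assumptions.
INTERNAL_LAYERS = {"Domain", "Application", "Infrastructure", "Api"}
FORBIDDEN_IN_DOMAIN = ("Infrastructure", "EntityFrameworkCore")


@dataclass(frozen=True)
class Reference:
    """One 'source depends on target' edge between two code units."""
    source: str  # "<Module>.<Layer>", e.g. "Ledger.Domain"
    target: str  # "<Module>.<Layer>" or an external package name


def find_violations(references: list[Reference]) -> list[str]:
    violations = []
    for ref in references:
        src_module, _, src_layer = ref.source.partition(".")
        tgt_module, _, tgt_layer = ref.target.partition(".")
        # Rule 1: the Domain layer must not reference infrastructure code.
        if src_layer == "Domain" and any(n in ref.target for n in FORBIDDEN_IN_DOMAIN):
            violations.append(f"{ref.source} must not reference {ref.target}")
        # Rule 2: across modules, only the Contracts layer may be referenced.
        elif src_module != tgt_module and tgt_layer in INTERNAL_LAYERS:
            violations.append(f"{ref.source} may only use {tgt_module}.Contracts")
    return violations
```

A build step would fail whenever `find_violations` returns a non-empty list. The second rule is the kind of cross-module check that matters once modules start consuming each other's contracts.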

The IModuleStartup contract is how the platform discovers and loads modules:

IModuleStartup
  ├── ModuleName         - identity
  ├── ConfigureServices  - register all module dependencies
  └── MapEndpoints       - declare HTTP routes under /api/v1/{module}/*

At application startup, the module loader scans assemblies, discovers all IModuleStartup implementations via reflection, and invokes them in dependency order. Adding a new module means: implement the interface, register the assembly. Nothing else in the platform changes.
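The discovery-and-ordering step can be sketched as follows. This is Python for illustration only, not the platform's loader: `__subclasses__()` stands in for assembly scanning, and the `depends_on` attribute is an assumption, since the contract above lists only ModuleName, ConfigureServices, and MapEndpoints and does not say how dependencies are declared.

```python
from abc import ABC, abstractmethod
from graphlib import TopologicalSorter


class ModuleStartup(ABC):
    """Python stand-in for the IModuleStartup contract."""
    name: str
    depends_on: tuple[str, ...] = ()  # hypothetical: not part of the contract above

    @abstractmethod
    def configure_services(self, services: dict) -> None: ...

    @abstractmethod
    def map_endpoints(self, routes: list) -> None: ...


class IdentityModule(ModuleStartup):
    name = "identity"

    def configure_services(self, services):
        services["identity.tokens"] = object()

    def map_endpoints(self, routes):
        routes.append("/api/v1/identity/token")


class LedgerModule(ModuleStartup):
    name = "ledger"
    depends_on = ("identity",)

    def configure_services(self, services):
        services["ledger.accounts"] = object()

    def map_endpoints(self, routes):
        routes.append("/api/v1/ledger/transactions")


def load_modules(services: dict, routes: list) -> list[str]:
    # Discover every implementation of the contract.
    modules = {cls.name: cls() for cls in ModuleStartup.__subclasses__()}
    # Map each module to the modules it depends on, then sort dependencies first.
    graph = {name: set(m.depends_on) for name, m in modules.items()}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        modules[name].configure_services(services)
        modules[name].map_endpoints(routes)
    return order
```

Adding a third module here means defining one more subclass. `load_modules` itself does not change, which is the property the contract is meant to deliver.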

The transport abstraction

The dispatch layer is the most consequential architectural decision in the system.

graph TD
    subgraph CALLER["Handler (any module)"]
        CMD[Command or Event]
    end

    subgraph DISPATCHER["IModuleDispatcher"]
        ROUTE{Transport Config}
    end

    subgraph INPROC["In-Process (Phase 1-3)"]
        MT_MED[MassTransit Mediator]
        HANDLER_A[Target Handler - same process]
    end

    subgraph BUS["Out-of-Process (Phase 4+)"]
        MT_BUS[MassTransit Bus]
        RABBIT[RabbitMQ]
        HANDLER_B[Target Handler - remote consumer]
    end

    CMD --> DISPATCHER
    ROUTE -->|inproc| MT_MED
    ROUTE -->|bus| MT_BUS
    MT_MED --> HANDLER_A
    MT_BUS --> RABBIT
    RABBIT --> HANDLER_B

The calling module uses IModuleDispatcher.SendAsync or PublishAsync. Whether the message travels in-process or over the bus is a configuration value:

"Dispatch": {
  "Ledger → Reporting": "inproc",
  "Identity → Notifications": "bus"
}

No handler code changes when a module edge moves from in-process to bus. The abstraction isolates the transport detail. It proved its worth when two module edges moved to async bus communication in month six without touching either module's business logic.
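A minimal sketch of the idea, in Python for illustration only. The real system uses MassTransit; the class names, the stub bus queue, and the default-to-inproc rule below are assumptions made for this example. The only thing that selects a transport is the route table.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[None]]


class InProcTransport:
    """Invokes the target handler directly, inside the caller's process."""

    def __init__(self) -> None:
        self.handlers: dict[str, Handler] = {}

    async def send(self, target: str, message: dict) -> None:
        await self.handlers[target](message)


class BusTransport:
    """Stub for a message bus: it only enqueues; a remote consumer would dequeue."""

    def __init__(self) -> None:
        self.queues: dict[str, asyncio.Queue] = defaultdict(asyncio.Queue)

    async def send(self, target: str, message: dict) -> None:
        await self.queues[target].put(message)


class ModuleDispatcher:
    def __init__(self, routes: dict[str, str], transports: dict[str, object]) -> None:
        self._routes = routes          # e.g. {"Ledger → Reporting": "inproc"}
        self._transports = transports  # {"inproc": ..., "bus": ...}

    async def send_async(self, source: str, target: str, message: dict) -> None:
        # Assumption: edges without an explicit entry stay in-process.
        kind = self._routes.get(f"{source} → {target}", "inproc")
        await self._transports[kind].send(target, message)
```

The caller only ever sees `send_async`. Changing `"inproc"` to `"bus"` in the route table changes where the message goes, while the calling code and the handler stay as they are.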

Tenant isolation: defense in depth

The tenant isolation design is the piece I would not compromise on, and it took the most explaining to stakeholders who wanted to ship faster.

The system enforces tenant isolation at two independent layers.

The first is an application-level ambient context. When an HTTP request arrives with a JWT carrying a tenant_id claim, middleware extracts it and stores it in an AsyncLocal variable. Every database query in that async call tree automatically picks up WHERE tenant_id = [current] through a global query filter. Handlers cannot forget this filter, because they do not apply it. The context does.
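The ambient-context mechanism can be sketched in Python, where `contextvars.ContextVar` plays roughly the role AsyncLocal plays in .NET. The in-memory `ROWS` table, the `query` helper, and the tenant names are hypothetical. A real global query filter would live in the data access layer.

```python
import asyncio
from contextvars import ContextVar

# Ambient tenant for the current async call tree (no default on purpose).
current_tenant: ContextVar[str] = ContextVar("current_tenant")

ROWS = [
    {"tenant_id": "tenant-a", "account": "cash", "balance": 100},
    {"tenant_id": "tenant-b", "account": "cash", "balance": 250},
]


def query(table: list[dict]) -> list[dict]:
    """Every read goes through here, so the tenant filter is applied centrally."""
    tenant = current_tenant.get()  # raises LookupError if no tenant was set
    return [row for row in table if row["tenant_id"] == tenant]


async def handle_request(claims: dict) -> list[dict]:
    current_tenant.set(claims["tenant_id"])  # what the middleware would do
    await asyncio.sleep(0)                   # yield so concurrent requests interleave
    return query(ROWS)                       # the handler never mentions the tenant


async def main() -> list[list[dict]]:
    # gather() runs each request as its own task with its own copy of the context.
    return await asyncio.gather(
        handle_request({"tenant_id": "tenant-a"}),
        handle_request({"tenant_id": "tenant-b"}),
    )
```

Each request sees only its own tenant's rows even though the two interleave. A query issued with no tenant in context fails loudly instead of returning everything.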

The second is database row-level security. Before executing SQL, a connection interceptor issues SET LOCAL app.tenant_id = '[guid]', which scopes the setting to the current transaction, and a Postgres row security policy enforces USING (tenant_id = current_setting('app.tenant_id')::uuid). Even if the application filter is missing, the database returns zero rows for cross-tenant access.
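A sketch of what the database side might look like, written as Python that issues the SQL. The table name `transactions`, the policy name, and the `run_scoped` helper are assumptions for illustration. `set_config(..., true)` is the function form of SET LOCAL and is used here because it accepts a bound parameter.

```python
import uuid

# One-time DDL for a hypothetical "transactions" table with a tenant_id uuid column.
# FORCE is included on the assumption that the application connects as the table
# owner, since owners otherwise bypass row security policies.
POLICY_DDL = [
    "ALTER TABLE transactions ENABLE ROW LEVEL SECURITY",
    "ALTER TABLE transactions FORCE ROW LEVEL SECURITY",
    "CREATE POLICY tenant_isolation ON transactions "
    "USING (tenant_id = current_setting('app.tenant_id')::uuid)",
]


def run_scoped(cursor, tenant_id: uuid.UUID, sql: str, params: tuple = ()):
    """Set the tenant for the current transaction, then run the statement.

    `cursor` is any DB-API style cursor using %s placeholders. The third
    argument to set_config (true) limits the setting to the current
    transaction, so it cannot carry over to the next use of a pooled connection.
    """
    cursor.execute("SELECT set_config('app.tenant_id', %s, true)", (str(tenant_id),))
    cursor.execute(sql, params)
    return cursor.fetchall()
```

Because the setting is transaction-scoped, both statements need to run inside the same transaction. With the policy in place, a query that forgets its tenant filter still returns only the current tenant's rows.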

Both layers have to fail at the same time for cross-tenant data to leak. The combination turns a per-developer discipline problem into a structural guarantee. Security review stopped flagging it after month four.

Load testing as hypothesis validation

We did not take the system's performance claims on faith. We tested them before cutover. The load testing strategy was built to answer specific questions, not to produce pass/fail numbers.

Three questions drove the design.

Does the dispatch abstraction add meaningful latency when running in-process versus direct calls?
At what request volume does the in-process transport need to move to async bus?
Does the tenant context propagation hold correctly under concurrent multi-tenant load?

SLO targets established before testing:

Endpoint                        Target
POST transaction (write)        p95 under 150ms
GET account balance (read)      p95 under 80ms
Token endpoint (auth)           p95 under 250ms
Saga completion (multi-step)    p99 under 2 seconds

The k6 load tests ran against a production-equivalent environment with representative multi-tenant data volumes. The dispatch abstraction overhead was measurable but under 5ms in-process, well within bounds. The tenant context propagation held correctly under concurrent load, confirmed by cross-tenant isolation checks in the test suite.
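For readers who want to see how targets like the ones in the table above turn into a check, here is a small Python sketch. It is not the k6 configuration the team used. The nearest-rank percentile method, the dictionary keys, and the shape of the input data are assumptions for illustration.

```python
# (percentile, limit in milliseconds), taken from the SLO table above.
SLO_TARGETS_MS = {
    "POST transaction": (95, 150),
    "GET account balance": (95, 80),
    "Token endpoint": (95, 250),
    "Saga completion": (99, 2000),
}


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def check_slos(latencies_ms: dict[str, list[float]]) -> dict[str, bool]:
    """Return, per measured endpoint, whether its percentile is under the target."""
    results = {}
    for endpoint, (pct, limit_ms) in SLO_TARGETS_MS.items():
        if endpoint in latencies_ms:
            results[endpoint] = percentile(latencies_ms[endpoint], pct) < limit_ms
    return results
```

Writing the targets down as data before the run is what keeps the exercise honest: the numbers being checked are fixed before any measurement exists.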

The load tests also surfaced something we had not expected: outbox delivery lag under high write volume. The transactional outbox pattern (write the message to a database table atomically with the business data, then deliver it to the bus asynchronously) was the right call for reliability, but the delivery service needed tuning at sustained high write rates. We found and fixed it before cutover, not in production.
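The pattern itself fits in a few lines. The sketch below uses Python's built-in sqlite3 so it runs anywhere. The table layout, the `TransactionPosted` event name, and the `batch_size` parameter are assumptions for illustration. The article does not say which settings the delivery service's tuning actually changed.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE transactions (id INTEGER PRIMARY KEY, account TEXT, amount INTEGER);
        CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, delivered INTEGER DEFAULT 0);
    """)
    return db


def post_transaction(db: sqlite3.Connection, account: str, amount: int) -> None:
    # `with db` commits both inserts together, or rolls both back on error.
    with db:
        cur = db.execute(
            "INSERT INTO transactions (account, amount) VALUES (?, ?)", (account, amount)
        )
        event = {"type": "TransactionPosted", "transaction_id": cur.lastrowid}
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def deliver_batch(db: sqlite3.Connection, publish, batch_size: int = 100) -> int:
    """Publish up to batch_size pending messages in order; return how many were sent."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE delivered = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays pending for retry
        with db:
            db.execute("UPDATE outbox SET delivered = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Delivery here is at-least-once: a crash between `publish` and the update re-sends the message on the next pass, so consumers need to tolerate duplicates. Under sustained writes the outbox grows faster than a small batch drains it, which is one plausible way the lag described above can appear.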

The re-strategization moments

Seven months of architectural work does not run in a straight line. Here are the points where the plan changed.

Month 3, transport abstraction moved earlier. The original plan deferred the dispatch abstraction to Phase 4. We caught the risk of embedding direct in-process calls early enough to fix it without a full rewrite.

Month 4, saga compensation scope reduced. The original saga design included compensation workflows for every possible failure mode. Stakeholder review showed that some failure modes had acceptable manual resolution paths, so we scoped the saga implementation to the flows where automated compensation was genuinely necessary. That saved three weeks of implementation and test work.

Month 5, migration sequencing changed. The original plan migrated tenants alphabetically for simplicity. Integration testing showed that two tenants had unusual configuration patterns that would stress the new system in ways the standard test suite did not cover. We moved those tenants earlier in the sequence to find problems while there was still time to fix them.

Month 6, architecture test coverage expanded. The initial NetArchTest coverage caught obvious boundary violations. After a code review found a subtle dependency violation the tests had missed, we expanded coverage to cross-module contract usage patterns. That was extra work, not in the original plan, but it removed a class of risk.

Every one of these changes had to be explained to stakeholders. The explanation always had the same shape: here is what we planned, here is what we learned, here is the specific risk the change addresses, here is the timeline impact.

The results

After seven months of incremental work, parallel running, and phased cutover, here is where the platform landed.

New module delivery time. A new module implementing IModuleStartup, with its own schema, handlers, and tests, can be added without modifying any existing module. Delivery time for a new capability module dropped from two to four months in the old system, where engineers had to navigate entangled code, to five to ten days in the new one, where they build against a known interface.

Tenant isolation. Structural, with two independent enforcement layers. Security review confirmed no missing filters in the new architecture, because the filters cannot be missing. Individual queries do not apply them.

Deployment. Module boundaries are enforced at the code level. A change to the reporting module does not touch the ledger module's deployment artifact. The blast radius shrank to the modules actually changing.

Transport flexibility. Two module edges have moved to async bus communication since cutover. Zero handler rewrites. Configuration change only.

What it took beyond the architecture

The technical architecture was the solvable part. The harder part was holding stakeholder confidence over seven months of work that was incrementally valuable but not fully visible until the parallel running phase.

What worked was a weekly one-pager that translated each phase's technical progress into the business constraint it addressed. Not "we implemented the dispatch abstraction this week" but "the communication layer is now configuration, so the two module edges we discussed can move to async without rewriting handlers, which unblocks the reporting scale requirement from Q2."

The team that delivered this work was small, and it was made up of people who understood both why the architecture needed to change and what business outcome the change was supposed to produce. That combination, not the specific technology choices, is what made the re-strategization moments navigable instead of disruptive.