Scaling an AI engineering team from 3 to 6: the rituals that held us together

The problem with doubling a team mid-flight

Scaling an engineering team sounds like a good problem to have. It is, right up until you notice that every new hire adds communication overhead, dilutes institutional knowledge, and introduces new failure points into a system that worked precisely because the team was small and tight.

At an AI-powered loyalty and gifting platform, we ran straight into this. We were shipping production LLM systems, including RAG pipelines, AI-powered loyalty logic, and financial forecasting models handling $7.5M in projected transaction flows, with three engineers who knew the system deeply. Then we needed six.

The naive approach is to hire fast and onboard as you go. I have watched that destroy team velocity for six months while the business waits. We did not have six months.

The starting point: what made three work

Before adding anyone, I spent time working out why the team of three was effective. Keeping those properties alive under scale was the real challenge.

Three things made the small team work. Shared context: all three engineers had been in the room for every architectural decision, so no knowledge lived only in one person's head. Short feedback loops: code review happened within hours, and architectural questions got resolved in a conversation rather than a ticket. And high trust: the team had shipped together under pressure and knew each other's judgment.

Doubling the team would dilute all three. New engineers would not have the shared context. The feedback loop would stretch. Trust takes time to build.

The rituals we built were designed to rebuild those properties under the new conditions.

The four rituals

1. Structured onboarding sprint

Every new engineer joined on a dedicated onboarding sprint: two weeks, no feature delivery commitments.

The sprint had a fixed curriculum. Days 1 and 2 were a system architecture walkthrough with the existing tech lead, and the new engineer wrote a summary afterward as a comprehension check. Days 3 to 5 were paired work on a pre-selected archaeology task, a low-risk but representative piece of the codebase that forced them to understand the system's key patterns. Week 2 was an independent contribution to a scoped, well-defined task with a designated reviewer.

The archaeology task mattered most. It forced new engineers to read code, ask questions, and build a mental model before they wrote anything. Engineers who skip this step spend their first two months making assumptions the existing team has already tested and discarded.

By the end of the sprint, the new engineer could take part in architecture reviews with real context, not just sit and listen.

2. Weekly architecture reviews

As the team grew, I introduced a weekly 45-minute architecture review. Not a status meeting. Not a demo. A structured conversation about a single technical question or decision in play.

The format was simple. One engineer wrote up a problem or proposed approach and shared it 24 hours ahead. The team reviewed it asynchronously and showed up with written reactions. The session resolved the open questions and produced a documented decision.

This kept the full team's context in sync as the system evolved. It also gave newer engineers an explicit window into how architectural decisions got made, the reasoning and not just the outcomes.

3. Prompt review process

We were building production LLM systems. Prompt engineering is real engineering, and it carries real risk. A prompt change that degrades output quality in the loyalty recommendation engine hits live users and has downstream financial implications.

So I introduced a prompt review process modeled on code review. Every prompt change went through a pull request with a structured template: what changed, why, what was tested, what failure modes were considered. At least one engineer with context on the downstream system reviewed it before merge. And every prompt change triggered a targeted evaluation run against a labeled test set before deployment.
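As a rough sketch of what the pre-deployment evaluation gate could look like, the snippet below scores a baseline prompt and a candidate prompt against a small labeled set and blocks the change if accuracy drops. Everything named here (`LabeledCase`, `accuracy`, `passes_gate`, the `max_regression` threshold, the stub predictors, and the sample cases) is hypothetical and stands in for real model calls. Exact-match accuracy is an assumption chosen for simplicity; a real gate would likely need a task-specific metric.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class LabeledCase:
    """One entry in the labeled test set (hypothetical shape)."""
    input_text: str
    expected_label: str


def accuracy(predict: Callable[[str], str], cases: Sequence[LabeledCase]) -> float:
    """Fraction of cases where the output exactly matches the expected label."""
    if not cases:
        raise ValueError("labeled test set is empty")
    hits = sum(1 for case in cases if predict(case.input_text) == case.expected_label)
    return hits / len(cases)


def passes_gate(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    cases: Sequence[LabeledCase],
    max_regression: float = 0.0,  # assumed threshold, for illustration only
) -> bool:
    """Allow the change only if accuracy drops by at most max_regression."""
    return accuracy(candidate, cases) >= accuracy(baseline, cases) - max_regression


# Stubs standing in for "call the model with the old prompt / the new prompt".
def baseline_predict(text: str) -> str:
    return "gift_card" if "gift" in text else "points"


def candidate_predict(text: str) -> str:
    return "points"  # deliberately degraded candidate


CASES = [
    LabeledCase("gift for a colleague", "gift_card"),
    LabeledCase("redeem my balance", "points"),
]

if __name__ == "__main__":
    # The degraded candidate should be blocked by the gate.
    print("candidate passes gate:", passes_gate(baseline_predict, candidate_predict, CASES))
```

Comparing against the current prompt rather than a fixed score keeps the gate meaningful as the labeled set grows or changes.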

This slowed individual prompt iterations slightly. It also prevented three production incidents in the first 60 days that would have cost far more to remediate than to prevent.

4. Incident retrospectives without blame

Production LLM systems fail in genuinely surprising ways. Hallucinations reach users. RAG retrieval returns irrelevant context. Financial calculations hit edge cases that were not in the test suite.

When an incident happened, I ran a structured retrospective within 48 hours: timeline reconstruction, contributing factors rather than blame, and explicit action items with owners and deadlines.

The rule was simple. The retrospective is about the system, not the engineer.

In the first one, a newer engineer admitted they had spotted an anomaly two days before the incident and had not raised it, because they were not sure it mattered. That became the case study we used to set expectations going forward: raise anomalies early, even when you are not sure they matter.
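The retrospective format described above (held within 48 hours, a timeline, contributing factors, and action items with owners and deadlines) could be captured as a small record with a completeness check. This is a minimal sketch only: the `Retrospective` and `ActionItem` classes, their field names, and the `problems` method are hypothetical, not a tool we actually used.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta


@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date


@dataclass
class Retrospective:
    incident_at: datetime
    held_at: datetime
    timeline: list[str] = field(default_factory=list)
    # Contributing factors describe the system, not an individual engineer.
    contributing_factors: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return what is missing from the record; an empty list means complete."""
        issues = []
        if self.held_at - self.incident_at > timedelta(hours=48):
            issues.append("held more than 48 hours after the incident")
        if not self.timeline:
            issues.append("timeline is empty")
        if not self.contributing_factors:
            issues.append("no contributing factors recorded")
        if not self.action_items:
            issues.append("no action items")
        for item in self.action_items:
            if not item.owner.strip():
                issues.append(f"action item without owner: {item.description}")
        return issues
```

Because `deadline` and `owner` are required fields on `ActionItem`, an action item cannot be recorded without them, which mirrors the "explicit action items with owners and deadlines" rule.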

How the team structure scaled

graph TD
    subgraph T3["Team of 3 - Original"]
        CTO3[Fractional CTO / Architect]
        E1[Engineer 1 - AI / RAG]
        E2[Engineer 2 - Platform]
    end

    subgraph T6["Team of 6 - Scaled"]
        CTO6[Fractional CTO / Architect]
        TL1[Tech Lead - AI Systems]
        TL2[Tech Lead - Platform]
        NE1[Engineer 3 - AI]
        NE2[Engineer 4 - AI]
        NE3[Engineer 5 - Platform]
        NE4[Engineer 6 - Platform]
        CTO6 --> TL1
        CTO6 --> TL2
        TL1 --> NE1
        TL1 --> NE2
        TL2 --> NE3
        TL2 --> NE4
        TL1 -. weekly arch review .- TL2
    end

The key structural call was promoting from within to create two tech lead roles instead of hiring mid-senior engineers externally. The original engineers had the context. Giving them ownership reinforced that context and created a natural mentorship layer for the incoming engineers.

Results: 60 days after doubling

Metric	Before doubling	After doubling
Regression releases per month	1 on average	0
New engineer time to independent contribution	6-8 weeks typical	3 weeks
Production incidents from prompt changes	3 in prior quarter	0 in 60 days
Open architectural decisions deferred	11 items	0 (all cleared)

The financial forecasting module, the highest-stakes component of the platform, went through a substantial refactor during this period. A new engineer did it under tech lead supervision, without incident.

What I would tell anyone scaling an AI team

Hiring is the easy part. Transferring context is the hard part.

An AI engineering team carries unusual risk because the failure modes are less visible than in traditional software. A bug in a CRUD API is usually obvious: a request fails or a test goes red. A degraded LLM response is subtle, sometimes invisible to automated testing, and potentially harmful at scale.

The rituals are not overhead. The onboarding sprints, the architecture reviews, the prompt review, the retrospectives: together they are the system that makes it safe to scale.

A team of six that ships safely and consistently is worth more than a team of eight that ships fast and breaks production.