Adventures in orchestrating AI to build software at scale
A few things lined up at once. I’d just finished a data science master’s at Berkeley, so those skills were fresh. The AI coding tools had crossed a real line — they could finally operate on files and call tools directly, not just suggest snippets in a chat. And I’d been piloting ways to drive them through a structured development process: enough to suspect there was something there, not enough to know. The question I couldn’t answer from the outside was a skeptic’s question: can you actually use AI to build real software at scale — not just code that runs, but code that’s well-designed, documented, and maintainable, the kind you’d sign your name to — and still come out ahead on effort, enough that one person could do the work of a team? If someone had told me they had AI that could do that, I’d have said show me. So I set out to show myself.
It was a real bet: three months of my own time, in full immersion, on a hunch. The immersion was the part that made it possible. You can’t answer a question like this in evenings and weekends, and you certainly can’t answer it while running teams at a large company, where the day goes to coordination and reviews rather than to building something new end to end. I happened to be at a point where I could clear the decks and go all in — and that window, more than the skills or the tools, is the rare part.
The hunch came from a decade at Google, leading mixed teams across data and software. The limit on shipping was rarely the coding; it was coordination — design reviews, cross-team negotiation, security and compliance gates, the wait for a few hours of the design-system team’s time. AI was already making the coding itself faster. The open question was what one person could do with the coordination stripped away and the tools pushed as far as they’d go — and whether what came out would actually hold up as engineering.
1. AI hit a structural wall, not a model limit
While working on the capstone for my Berkeley master’s, a large group project on commodity forecasting, my team used Claude Code (a command-line AI coding tool, the same kind as Cursor or Windsurf) to rebuild weeks of a teammate’s work in a couple of days. At that scope, the tool held the whole job in one session and kept it consistent; well-scoped work clearly wasn’t the bottleneck anymore.
The same project showed where the tool would fail at larger scope: sessions drifted apart. I’d have Claude write a function, then ask it to connect that function to another part of the code, and it would rewrite the function differently — sometimes inconsistent, sometimes incompatible. Half an hour in, a session would be working from its own version of the project that no longer matched the real one. The model wasn’t the limit; it kept getting better the whole time. Each session worked on its own slice with no way to see how the pieces fit together, and a bigger context window wouldn’t have closed the gap — the information had no structure that carried from one session to the next.
That was the problem my three-month sprint was meant to solve. The first build was Crucible itself: an orchestration layer designed to keep sessions working off one shared, persistent record of the project rather than their own drifting copies. The architecture was clear from the capstone; the sprint was about delivering it on a timeline and at a cost that made sense for one person.
2. Orchestration got me past it
The orchestration layer is straightforward to describe: one shared, persistent record of the project that every session can read and write; specialized agents with their own tools; checks that hold each step until the work it depends on is done; and me weighing in on the calls that need real judgment. About two weeks of focused work produced a working alpha. With it in place, the drift all but stopped, because the shared record held what individual sessions couldn’t. From there, I used Crucible to build Crucible.
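To make the "checks that hold each step" idea concrete, here is a minimal sketch of dependency gating over a shared record. It is an illustration under stated assumptions, not Crucible's actual implementation: `Step`, `SharedRecord`, `run`, the `needs_human` flag, and the `agent` and `approve` callables are all hypothetical names, and the record lives in memory rather than being persisted.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    depends_on: list[str] = field(default_factory=list)
    needs_human: bool = False  # hypothetical flag for calls that need judgment


@dataclass
class SharedRecord:
    """Hypothetical stand-in for the shared project record (in memory only)."""
    done: dict[str, str] = field(default_factory=dict)  # step name -> result note

    def is_ready(self, step: Step) -> bool:
        # A step is held until every step it depends on is recorded as done.
        return all(dep in self.done for dep in step.depends_on)


def run(steps, record, agent, approve):
    """Run steps as their dependencies complete; return the execution order."""
    pending = list(steps)
    order = []
    while pending:
        ready = [s for s in pending if record.is_ready(s)]
        if not ready:
            raise RuntimeError("blocked: unmet or circular dependencies")
        for step in ready:
            if step.needs_human and not approve(step):
                raise RuntimeError(f"human rejected step: {step.name}")
            # Every agent reads from and writes to the same record.
            record.done[step.name] = agent(step, record)
            order.append(step.name)
            pending.remove(step)
    return order


# Steps are listed out of order on purpose; the gate sorts them out.
steps = [
    Step("implement", depends_on=["design"]),
    Step("design", needs_human=True),
    Step("test", depends_on=["implement"]),
]
record = SharedRecord()
order = run(
    steps,
    record,
    agent=lambda step, rec: f"{step.name} complete",  # placeholder agent
    approve=lambda step: True,                        # placeholder human sign-off
)
print(order)
```

The point of the sketch is the shape: no step starts from its own private view of the project, and nothing runs before the work it depends on is in the record.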
There’s an explosion right now of systems augmenting LLMs with memory and structure: retrieval, shared context, agent checks. After the capstone, I went looking for tools that addressed the problem it had exposed. What I found was a lot of activity, with many tools offering real value in their own way; I also found low-hanging fruit no one seemed to be picking: agent memory, cross-session communication, and integration with actual development work. AI has lowered the bar enough that one person can take on this kind of work, where not long ago it would have taken fundraising and a team. So I leaned into the moment and built it, first and foremost for my own use, and to develop deep expertise in the gaps and how to close them.
With the orchestration in place I’m dramatically more productive than without it — enough that I suspect others would find it useful too. The leverage is in the integration: how documents get into the shared record in the first place, how agents search and reference them while working on individual tasks, and the checks that keep sessions in sync on each other’s work.
This isn’t the best approach or the only one. But it’s a good one — the key pieces in one place in a way I haven’t seen elsewhere — and it works for me.
There’s a threshold here, lower than I think most people assume. Below it, the tools work fine on their own, and plenty of ambitious projects ship that way. Above it, they stop. That point is the coherence wall.
Figure 1: The coherence wall, qualitatively. This is a rough sketch of my own experience, not a measured curve. “The tools on their own” means command-line AI coding tools used without orchestration: Claude Code in my case, with Cursor and Windsurf in the same family. On their own they held up on small, well-scoped tasks, then lost track once a project had a couple of weeks of work in it; the orchestration layer is what kept the work consistent as it grew. Where the wall falls depends on the work, the tools, and the person, but the shape (strong on small tasks, weakening as size and time grow) shows up across independent measurements.[1]
3. Three months produced 780,000 lines, for $600
Once the platform held, the rest came quickly. I used Crucible to build everything else, directing its agents rather than writing the code myself: Visualist, a tool that turns research and conversation into slide decks (release pending); a CRM in operational use for mail routing and support; a few live websites on a shared design system; and Rolling Thunder, a small side-scroller (demoed). The cross-platform release setup came together the same way, under my direction, from pieces already built.
All of it together is roughly 780,000 lines of production code; the released platform alone is about 265,000, in the range of a mature open-source project like Redis. It’s genuine, working software, built to a real bar. That bar sits between two I know from experience: internal software, shipped to colleagues who tolerate rough edges, and external software, shipped to the public with the heavy extra cost of accessibility, internationalization, failover, and formal security and legal review — often most of the work on a product. What I built is documented and tested more thoroughly than the internal projects I worked on, but it skips the external-launch work. It’s alpha, for individual engineers and small teams. It also came with its own paper trail — requirements, architecture, decisions, and tasks, cross-mapped to the code — because maintaining that shared record is what the orchestration layer does as it works.
Work like this normally takes a team of dozens, a few years, and tens of millions of dollars in salary. I did it alone, in three months, for about six hundred dollars in subscription fees. I checked the gap two ways: Boehm’s COCOMO, the textbook cost model, and a back-of-the-envelope estimate from the real team sizes I’ve seen. Both run high; even pre-AI at Google, we discounted estimates like COCOMO heavily, especially for internal-grade work. But the gap survives the discounting: cut these models by three-quarters, or even by a full order of magnitude, and one person in three months still comes out far ahead of what the work would normally cost. The workings, and the ratios, are in a footnote.[2]
Most of that gap isn’t me — it’s the AI doing real engineering work. My narrower claim is only about what happens above the wall, where the tools stop holding together on their own: orchestration is what got me past it.
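The arithmetic behind that comparison is easy to check. The sketch below assumes the textbook figures come from Basic COCOMO in semi-detached mode (effort = 3.0 × KLOC^1.12 person-months); the article does not state its coefficients, so that mapping is an assumption, though it does reproduce the footnote's estimates of roughly 1,550 and 5,200 person-months. The team-size range (twenty to thirty engineers over twelve to eighteen months) is taken from the same footnote.

```python
# Basic COCOMO (Boehm, 1981), semi-detached mode:
#   effort in person-months = A * KLOC ** B
# Assumption: A and B are the published semi-detached coefficients. The article
# does not say which variant or coefficients it used.
A, B = 3.0, 1.12
ACTUAL_PERSON_MONTHS = 3.0  # one person for three months


def cocomo_effort(kloc: float) -> float:
    """Estimated effort in person-months for `kloc` thousand lines of code."""
    return A * kloc ** B


def ratio(estimate_pm: float, keep: float = 1.0) -> float:
    """Estimated over actual effort, keeping only a fraction of the estimate."""
    return estimate_pm * keep / ACTUAL_PERSON_MONTHS


platform = cocomo_effort(265)    # released platform, about 265,000 lines
portfolio = cocomo_effort(780)   # full portfolio, about 780,000 lines
team_low, team_high = 20 * 12, 30 * 18  # experience-based range, person-months

for label, pm in [("platform", platform), ("portfolio", portfolio),
                  ("team (low)", team_low), ("team (high)", team_high)]:
    print(f"{label:12s} {pm:7.0f} pm  "
          f"full {ratio(pm):6.0f}:1  "
          f"cut 3/4 {ratio(pm, 0.25):5.0f}:1  "
          f"cut 10x {ratio(pm, 0.1):5.0f}:1")
```

Even the lowest row, the small end of the team estimate cut by a full order of magnitude, stays at 8-to-1, which is the floor the paragraph above refers to.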
4. The AI writes the code; I hold the vision
This is where the hype and the reality split. I didn’t write the code; I directed it. I decided what to build, in what order, and to what standard; the agents wrote it, and they faithfully carried out the vision. But they only did that because I was there holding the vision. Two different things get lumped together as “AI can write the code.” One is functional code that holds together at scale: it runs, it does what the task asked, and it stays coherent across a large project instead of repeating itself, building the same thing twice, or quietly breaking what already worked. The model on its own can’t do that past the wall; that coherence is exactly what the orchestration buys, and with it the functional layer is reliable. The other is code that solves the problem you’re actually trying to solve, in the way you imagined it, and that lands with the people who’ll use it. That, no tool can do for you yet, not nearly. It takes a clear picture of what you’re making, and someone holding that picture as the work proceeds. Left alone, the system drifts from your intent in ways that look fine up close and are wrong as a whole. So you stay in the loop the way you would with a strong team: check in, probe, catch the drift early, point it back at the vision. I could even turn the system on itself: have it audit its own codebase and propose improvements, approve the ones worth doing, and watch it implement and test them. A good deal of Crucible was built that way. But none of it steers itself. The leverage is real, and it is not autonomy.
5. The companies seeing returns reorganized, not retooled
At the level of one person, the gains from these tools are real: I saw them over those three months, and they show up across the field. At the company level, the returns mostly haven’t followed. In PwC’s 2026 survey of more than four thousand CEOs, most reported neither cost savings nor revenue gains from their AI spending.[3] The ones who did see returns were markedly more likely to have changed how they work around the tools, not just bought better ones.
My wall was about coherence — real capability that doesn’t hold at scale without structure built around it. A company’s version is the same shape: real capability that doesn’t reach the bottom line until the work is reorganized around it. Either way, it looks more like an organizational limit than a technical one.
If the main blocker were the technology, you’d be stuck waiting for better models. The evidence runs the other way: where companies have done the organizational work, the returns have followed — so it’s probably something you can act on now, not later. You can likely get a lot more out of the AI you’re already paying for by changing how the work is organized around it, rather than by spending more on tokens or waiting for the next model. Personal-level gains compounded at the org level aren’t marginal — they’re the kind of efficiency that lets one operator own more, a team ship faster, and a company get ahead of competitors that haven’t done the integration work. The shift is expansion of what each person can do, not replacement.
I can’t prove any of this from one run, and the companies actually closing the gap know more about it than I do. But the heart of it was a partnership: I can’t operate at this scale without Crucible, and Crucible has no agency without me. If you’ve hit the same wall, it’s free — point it at something real and see whether it holds for you the way it did for me. And if you’re working on this inside an organization, I’d like to hear what’s slowing you down, what’s working, and what the tools still have to do for the gains to show up.
Crucible is a free AI software-development platform from Ember Agentic Labs that builds working, tested software by directing specialized AI agents. It’s free to install at crucible.emberagenticlabs.com — look at the architecture yourself, and take whatever proves useful.
[1] The curve is drawn from my own experience, but the shape (strong on small work, degrading as size, duration, and complexity grow) recurs in independent measurements, several from 2026. METR’s 2025 randomized trial found experienced developers on large, mature codebases ~19% slower with AI even as the same tools give 2–5x speedups on small greenfield work; METR’s larger February 2026 follow-up softened that to roughly 4% slower, not statistically significant. Long-horizon agent benchmarks show the same cliff by scope: SWE-EVO (January 2026) drops a frontier model from ~73% on isolated verified tasks to ~25% on multi-file software-evolution tasks, and NL2Repo-Bench (December 2025) holds even the strongest agents under 40%, naming the failure modes directly: loss of global coherence, fragile cross-file dependencies, weak planning over hundreds of steps. On context, a 2025 study found LLM accuracy falls 14–85% as input length grows even with perfect retrieval, so a bigger context window doesn’t buy coherence. None is my exact setup and none proves a universal threshold; together they show the strong-then-degrading shape isn’t unique to my setup.
[2] The three reference points, in detail. Textbook estimation. Barry Boehm’s COCOMO, calibrated against pre-AI engineering practice, at semi-detached complexity puts the 265,000-line released platform at roughly 1,550 person-months (about forty engineers for three-plus years) and the 780,000-line full portfolio at roughly 5,200, for ratios of about 500-to-1 and 1,700-to-1 against the three person-months I actually spent. Experience-based. Working backward from the component inventory at the quality bar I built to (between an internal- and an external-launch standard), a realistic team for the full portfolio is twenty to thirty engineers plus program and leadership staff over twelve to eighteen months: 240 to 540 person-months, a ratio of 80-to-1 to 180-to-1. This one is informal proxy estimation, not a peer-reviewed model. Floor. These models run high, and in practice they get discounted hard; even pre-AI, the teams I worked on discounted textbook estimates like COCOMO heavily, especially for internal-grade work. Cut everything here by three-quarters and the gap still runs from about 20-to-1 up to several hundred to one; cut it by a full order of magnitude and even the most conservative estimate is still about 8-to-1. Loaded team cost at these scales runs $20–60 million at strong-tech-company salary bands; my out-of-pocket was about $600. Lines of code is an imperfect proxy, and AI-assisted code can run longer per unit of function. That would inflate a LOC-derived estimate, which is one more reason these are discounted as hard as they are here, not a mark against them.
[3] PwC, 2026 Global CEO Survey (4,454 respondents): 56% report neither cost savings nor revenue gains from AI investments; 12% report both. CEOs who report returns are markedly more likely to have embedded AI into how they operate rather than only adopting tools.