A roadmap for autonomous software engineering
By now, everybody is aware that AI can write code. AI has made staggering progress in the past few years, and it’s made a lot of people very excited and optimistic about enabling non-developers to build software.
But when you look around for examples of real software products built by non-developers with AI, you won’t find much. You’ll find lots of impressive demos and prototypes, but very few instances where someone has actually followed through and launched a useful, real-world product with paying customers. In fact, the ones that have managed to launch are often built by very experienced developers who knew exactly what they wanted to build and firmly steered things in the right direction, filling in the gaps and fixing problems where necessary.
If there’s so much demand for this, why are we not seeing more success stories? It’s still early days, but there are multiple established services on the market – such as Replit, Lovable, and Bolt – where non-developers can log in, describe an app, and get something back that resembles what they asked for. These services have tens of millions of users – why are these not converting to real products in the market en masse?
When you look at the results, what you’ll find is that these services are quite good at putting together a somewhat functional prototype, but one that is a substantial distance away from being an actual product you can launch. So, while these services are exciting and show a lot of promise, they tend to be more useful as a prototyping tool than anything else. Which – don’t get me wrong – is still a fantastic achievement, but it’s short of the goal we are shooting for.
Why are they falling short? Well, it’s because there’s a truly vast difference between a prototype and a real product, and it’s incredibly easy to underestimate this. Developers have known this for a long time:
The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.
— Tom Cargill, Bell Labs
To a non-developer, writing code to make something come to life seems like the bulk of the work. Once you can click on stuff and see the things you want to happen actually happen, it’s mostly done, right? Sadly not. The gap between making a prototype work and having a product that’s ready for the real world is where all the software engineering happens.
So far, AI has made a tonne of progress on the first 90% of effort – the proof of concept stage that involves writing a bunch of code – but hasn’t gotten very far with the second 90% of effort – the software engineering stage. In order to fulfil the promise of enabling non-developers to build software products, we need to make a lot more progress on the software engineering side of things.
How can we get there from here?
When you look at how these AI platforms work, you’ll see that they all follow pretty much the same sort of process. They expand on your prompt by adding more detail, they split it up into a to-do list, then they set off a bunch of agents to try to do each task. Once those are complete, the user is presented with the result and has to identify all the things that aren’t working. When they describe those, more agents are started to solve those problems, and so on.
This is bad software engineering. We know this is bad software engineering. Trying to capture a fully comprehensive specification, mechanically iterating through the entire thing, and then testing and fixing all the problems you find at the end is the Bad Old Days of the 80s and 90s. It’s the whole problem that agile was a response to. And these platforms don’t even do it that well. Instead of a proper specification, it’s a few sentences vaguely pointing in the direction of something familiar. This is like getting a developer to listen to your pitch and then sit down and start hammering out code straight away. Engineering teams get terrible results when they work this way, so why would we expect AI to do better?
We know how to solve these problems because we have solved them already for humans. We are adept at breaking products down into smaller MVPs, writing PRDs, building design systems, wireframing pages, mocking up designs, writing tests, and so on. We don’t throw a bunch of crap at the wall all in one go and then try to clean up the mess because we know we’ll fail if we do it that way.
One of the most important skills a developer has is the ability to organise code so that individual components can be worked on in isolation. It’s not just so that they can focus all of their attention in one spot, it also means that they can have a very tight iteration loop where they can test and improve just that one thing without pulling in the entire project. This is even more important for AI. The larger the project gets, the more context you need and the more difficult it is for AI to check its own work. If it can’t check its own work then it needs constant hand-holding, which makes it a lot less effective for developers and a complete failure for non-developers.
One of the biggest problems AI faces is that mistakes compound. If AI gets a little way off track, it can fail to complete a task well. If that bad code sticks around, it can influence everything that comes next. The larger a project gets, the more likely it is for something to go wrong, and once it does, more problems are likely to follow. The ability to separate concerns and work on smaller pieces in isolation acts as a firewall against this effect.
But beyond that, we also know that there are some things AI is unable to get right no matter how much time it spends on it. When AI is being used by a developer, this isn’t an especially serious issue because the developer can step in and do the parts that AI struggles with. But when AI is being used by a non-developer, this is a showstopper. AI that can accomplish 80% of development tasks is 80% useful to a developer, but 0% useful to a non-developer.
Smarter models won’t solve this any time soon
So far, a lot of the focus on improving AI has been aimed at making models smarter. This has paid off really well. As models have gotten smarter, they have been able to do progressively more complex tasks. But progress on this front has been slowing down. As progress has slowed, the focus has begun shifting more towards collections of agents that work in concert to get things done. This is a good start, but we haven’t even begun to scratch the surface of what’s possible when we put agents into a framework designed for them to succeed at software engineering.
People who believe that scale is all we need will no doubt be keen to point out that the more we work on domain-specific harnesses for AI to operate within, the faster we are heading towards learning the Bitter Lesson yet again. That’s fair. Perhaps they are right. Perhaps we can just throw more computation at the problem and AI will eventually get smart enough to one-shot entire software products. But I think the current state of the industry gives a very good indication that – at least in the short to medium term – we will see far more progress building a framework that steers AI towards processes we know work well for human software developers. Maybe with scale we can get models that can accomplish 90% of development tasks. That’s still 0% useful to non-developers, so let’s not sit around waiting for scale to save us when we know we can do a lot better right now.
What problems do we need to solve?
We need to be clear about what we are building
A prompt, no matter how detailed, is not enough. AI can make educated guesses and fill in the gaps. But what ends up happening is that it builds something and then once the user sees what it has built, they realise it’s wrong and ask for changes. Waiting until something has been built before figuring out the details is a great way to waste time.
We need to be able to agree on what to build before we build it. This means turning prompts into PRDs, wireframes, and mockups. Describing how something should work is a series of agreements that stack upon each other. Let’s get them correct in turn instead of trying to skip to the end. Write the code after the user and the AI agree on what the goal is, not before.
We need to control scope
No more running off and trying to build it all at once. Pare a project down to an MVP, get it into production, and then iterate.
Identify ways in which we can isolate tasks more effectively. Working on the look & feel should not mean bringing in a load of backend logic just to iterate – or worse, making inadvertent changes to the backend because the agent got distracted.
We need a real software development lifecycle
It’s not enough for an agent to make a to-do list and then try to build the whole thing. That results in a mess of a prototype with many shortcomings.
We know what success looks like here. Agree on architecture. Split things up into features. Make sure each feature has a clear spec and acceptance criteria. Write the code for that feature. Write tests. Run the tests. Fix the bugs. Keep doing this until the feature is correct, and only then move on to the next feature.
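The loop above can be sketched as a simple control flow. A minimal sketch, assuming hypothetical `implement` and `run_tests` callables standing in for the coding agent and the test runner:

```python
def build_feature(spec, implement, run_tests, max_attempts=5):
    """Implement one feature, iterating until its acceptance tests pass."""
    code = implement(spec, feedback=None)
    for _ in range(max_attempts):
        failures = run_tests(code)                 # acceptance criteria as tests
        if not failures:
            return code                            # feature is correct: move on
        code = implement(spec, feedback=failures)  # fix the bugs and retry

    # Escape hatch: the AI could not converge, so a human takes over.
    raise RuntimeError("escalate to a human developer")

# Toy run: the "agent" produces a better draft once it sees test failures.
drafts = iter(["draft with bug", "working draft"])
result = build_feature(
    spec="login form",
    implement=lambda spec, feedback: next(drafts),
    run_tests=lambda code: [] if code == "working draft" else ["test failed"],
)
```

The point of the structure is that failure is a first-class outcome: when the loop gives up, the feature is escalated rather than quietly left broken.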
We need tools that enable the AI to evaluate success
Where are the frontend agents that can take a screenshot of a component in a design system and compare it to a PRD? That can write a test that clicks through a page and then watch a screen recording in five different browsers to spot any incompatibilities?
Where are the backend agents that fill the database with test data and measure query times to detect bad schema design? That can run Nessus against the API and evaluate potential security vulnerabilities? That can deploy with Terraform to an isolated AWS organisation to check that deployments work?
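As a small illustration of the first idea, here is a sketch of the kind of check such an agent could run, using SQLite purely as a stand-in for the product’s real database: fill a table with test data, time a hot query, and flag the schema if an index makes the query dramatically faster.

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, email TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [(i, f"user{i}@example.com") for i in range(50_000)])

def query_time(runs=5):
    """Best-of-n wall time for a lookup the product's hot path would make."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        db.execute("SELECT id FROM users WHERE email = ?",
                   ("user49999@example.com",)).fetchall()
        best = min(best, time.perf_counter() - start)
    return best

scan_time = query_time()                      # full table scan, no index
db.execute("CREATE INDEX idx_email ON users(email)")
indexed_time = query_time()                   # index lookup
missing_index = scan_time > indexed_time * 2  # crude heuristic an agent could flag
```

The heuristic here is deliberately crude; the point is that the agent measures rather than guesses, so it can evaluate its own schema decisions.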
We need reusability
There are only so many variations of login flows the world needs. If the user doesn’t ask for anything special, why is the AI figuring everything out from scratch every time instead of picking a known-good PRD from a catalogue, then picking known-good page layouts and components from a design system? Sure, it uses libraries here and there, but we could be building things with much larger building blocks and it would be faster and higher quality. Leave the thinking for the parts that truly need to be novel.
We need escape hatches everywhere
This is probably the most important thing on the list. We know AI can’t do everything, but those things need to be done regardless. The state-of-the-art solution to this appears to be to shrug our shoulders and tell the user that they are on their own. Maybe they’ll run to Upwork and try to figure out how to hire a developer to dump the whole thing on.
The next big step forward will not be fully autonomous software engineering. This is too large a problem to solve in one go. The next big step forward will be partially autonomous software engineering, and that cannot succeed without humans in the loop. A successful path forward must integrate humans and AI in a better way than simply saying “here’s the mess, you fix it” to a human developer.
The roadmap
There could be several ways to solve the problems I’ve outlined above. What I’ve focused on here is a series of tools that have value outside of this roadmap. It may seem like a long list, but because each of them is valuable in its own right, they can all be worked on independently without treating this roadmap as a monolith. First I’ll describe each of them, then I’ll outline how they fit together.
The product planner
Users are pretty bad at describing what they want. When you ask a non-developer to describe a product, they will include a tonne of ambiguities, and perhaps even things that are impractical to build. Worse, they will often deliver it in forms that are not helpful for people to collaborate on or for AI to digest. And they will normally pile in all the bells and whistles instead of focusing on an MVP.
Solution: An agentic, collaborative product planner that a user talks to in order to build a clear specification. It will select known-good solutions from a library of existing PRDs, write new ones where existing solutions are not available, interrogate the user to clarify ambiguities, and prompt them for things they might not have considered. It will work with the user to identify the key features and split things up into an MVP and subsequent iterations.
This will result in a strategy document and interactive functional specification for the MVP with wireframes. These are things that humans can collaborate on, review, and amend. The goal of this product is to achieve consensus on what to build and how it should work before any code is written.
Escape hatch: These are normal documents that can be commented on and refined just like Google Docs. If the AI just isn’t getting it right, humans can fill in the blanks.
The design system builder
Current AI builders don’t do a good job of building user interfaces because they try to tackle the whole product at once. This means that they’ve already written a tonne of code before the user sees anything. Once the user is in a position to provide feedback and ask for changes, the agent has a load of code to sift through in order to address that feedback. They currently do not do a good job of this, burning context, getting distracted, or inadvertently making unrelated changes.
Solution: A design system builder that will only build the user interface. It will take inspiration from existing products to build a mood board that reflects the user’s preferences, it will identify all UI components required from the wireframes from the product planner, it will define common tokens to provide consistency across the design, and it will work on each UI component in turn to implement it.
When building the UI components, it has access to tools like BrowserStack to take screenshots and screen recordings and confirm browser compatibility. It can write tests that compare components to the wireframes and PRD to measure success. It can use snapshot tests to ensure that future work doesn’t cause regressions.
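To make the snapshot idea concrete, here is a minimal sketch. Real systems diff screenshots pixel by pixel; hashing a component’s rendered markup is the same idea in miniature, and the `render_button` component here is purely illustrative.

```python
import hashlib

snapshots = {}  # approved component output, keyed by component name

def render_button(label):
    """Hypothetical component under test."""
    return f'<button class="btn-primary">{label}</button>'

def check_snapshot(name, output):
    """Record the first approved output; flag any later drift as a regression."""
    digest = hashlib.sha256(output.encode()).hexdigest()
    if name not in snapshots:
        snapshots[name] = digest          # first run: record the approved state
        return True
    return snapshots[name] == digest      # later runs: must match the approved state

first = check_snapshot("button", render_button("Sign up"))
same = check_snapshot("button", render_button("Sign up"))
regressed = check_snapshot("button", render_button("Sign Up"))  # casing changed
```

Once a component is approved, any future change that alters its output is caught automatically, which is exactly the guard rail an agent needs to stop mistakes compounding.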
Once more, it can pull from an existing library of high quality components to avoid reinventing the wheel. Where a PRD covers a wide range of behaviour, it can lift solutions for entire user journeys such as registration flows into the project wholesale.
It will present the user with the interactive UI components so that they can confirm they are happy with the look and feel before any business logic is written. Because we are working on the UI in isolation, the user can iterate on the look and feel without worrying about the backend functionality getting in the way, and without the AI burning context on it.
Escape hatch: We know that AI will not always be able to build every component. Having a formal design system means that it is low friction to bring a front-end developer on board to complete one or two components without dumping a huge, messy project on them.
The developer
This is basically what AI builders are trying to build today, except they are trying to roll all the previous work up into this one part. Because we have a high-quality specification, wireframes, and design system to work from, at this point the developer can focus on wiring things up instead of holding the implementation details of everything in its context at once.
Once more, where the PRDs have pulled in known-good solutions, the developer can re-use the corresponding implementations instead of starting from scratch. Where there is a 95% match, it can start there and hit the eject button to make changes just for this project.
The freelance platform
This is probably not what you were expecting to see here, but it’s a key part of the puzzle. As I said before, we are not going to get to fully autonomous software engineering without going through a phase of semi-autonomous software engineering with human developers in the loop. Any realistic roadmap needs to account for this instead of us burying our heads in the sand and hoping that smarter models will solve this problem for us.
Right now, both AI builders and freelance platforms treat projects as monoliths. The AI builder produces a project, and in order to get a human developer to work on it, the user needs to figure out how to hire a developer and put all their eggs in one basket by trying to hire the right one. Results vary immensely and few people walk away from the experience on either side feeling happy.
One thing I underlined earlier is that a key factor in successful software engineering is decomposing larger pieces of work into smaller ones. When you can separate concerns and work on isolated tasks, success rates improve substantially for both human and AI developers.
At this point, our platform has already broken things down into smaller pieces. It can already determine what developer profile is suitable to work on each piece. We already have clear requirements. We can already evaluate success. We have so many pieces in place that we can do a far, far better job of connecting users with freelance developers. Better yet, we can do it in parallel, so that if you have five tasks that take a week each, you can give them to five developers and get them done in a week instead of one developer in five weeks.
Solution: A freelance platform that hires semi-autonomously for fine-grained tasks. It can provide the exact PRD to prospective developers, evaluate their profiles, and collect bids – presenting recommendations to the user. When the freelancers work on the product, the platform can run the same quality checks and evaluations that run against the AI-created features to ensure that the developer meets all the appropriate standards.
There are three important ways in which this outperforms existing freelance platforms. Firstly, we are able to eliminate most of the effort involved in hiring candidates and evaluating their work. Secondly, because the work is so fine-grained, it allows for a much greater degree of specialisation. If a developer wanted to, they could just take on work for form controls and focus on being the best form builder there is. Finally – and most importantly for the purpose of adoption – because this platform provides value to the rest of our product, we can run it without taking fees. This instantly undercuts every other freelance platform and solves the chicken-and-egg problem by filling the developer side as soon as developers realise they can keep 100% of their earnings.
This is our generic escape hatch. It works because the user is guided to extract just what is necessary for a human to act upon instead of having them inherit an entire project. Consider it a huge agentic framework where AI agents and human agents operate as peers.
The project manager
The current state of the art with AI builders is that the agent writes a to-do list and describes it to the user. We already know how to manage software projects effectively, and this isn’t it.
Solution: A project management tool that takes the PRDs, writes the cards and puts them into a Kanban board, then keeps it up to date. When tasks get implemented, the user gets prompted for UAT before the card is moved on and the feature finalised. If the AI builder fails a task, the card gets assigned to the freelance platform, and when a freelancer completes the work, the card is moved on and the freelancer paid.
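The card lifecycle described here is essentially a small state machine. A minimal sketch, with illustrative state and event names, including the escalation path to the freelance platform:

```python
# Allowed transitions for a Kanban card: state -> {event: next_state}.
TRANSITIONS = {
    "backlog":     {"start": "in_progress"},
    "in_progress": {"built": "uat", "ai_failed": "freelance"},
    "freelance":   {"built": "uat"},
    "uat":         {"approved": "done", "rejected": "in_progress"},
}

def advance(state, event):
    """Move a card along the board, rejecting transitions the process forbids."""
    allowed = TRANSITIONS.get(state, {})
    if event not in allowed:
        raise ValueError(f"cannot '{event}' from '{state}'")
    return allowed[event]

# Happy path with the escape hatch: the AI fails, a freelancer delivers,
# the customer signs off in UAT, and the card reaches done.
state = advance("backlog", "start")
state = advance(state, "ai_failed")   # AI builder gave up on this card
state = advance(state, "built")       # freelancer completed the work
state = advance(state, "approved")    # customer accepted in UAT
```

Encoding the lifecycle explicitly means a card can never be finalised without UAT, and an AI failure routes to a human instead of silently stalling.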
This becomes more important once the project matures. After the MVP is launched, the user will be thinking about new features and bug fixes. This is how new work is added to the system, and we then bring this full circle by feeding new requests into the product planner to start the process all over again.
Putting it all together
When you assemble these pieces, you get a framework that is designed to put AI agents to work in the most effective way. It uses best practices from human software engineering to avoid most of the pitfalls existing platforms are currently falling into.
Most importantly, it recognises that AI cannot do everything, and is carefully constructed with escape hatches whereby we can integrate humans in the loop with a minimum of effort and without derailing the overall agentic flow. We no longer need to eject from an AI builder, throw our hands up in the air, and say “you fix it” to a human developer.
Once we build this, we will not only have an extremely valuable AI builder, but we will also have what we need to close the capabilities gap and get to fully autonomous software engineering. Because every time we’ve hit a wall with AI and given a carefully curated problem to a human, we’ve received a focused, tested, human-approved solution to that problem. We haven’t just been building a semi-autonomous software engineering platform, we’ve been building a high quality, curated training set for AI to learn from. And it all comes with appropriate licensing from freelancers who were paid for their contributions.
This may sound like a lot of work, but one of the things that makes a roadmap like this feasible is that we do not have to boil the ocean all in one go. Each of the constituent pieces is a valuable startup in its own right. One single startup does not need to build all of this. Either multiple startups can work on each piece and integrate, or a startup aiming to build all of them can use just one piece of the puzzle as their GTM strategy.
What does this look like for the customer?
The plan
The customer has a great idea for an app, so they put the description into the product planner. The planner asks them to clarify a few things, then makes recommendations about which features should be tackled as part of the MVP and which should be deferred until the product is in the market.
It then assembles a high-level description of the MVP and a set of wireframes with descriptions of how each part works. It provides alternative recommendations for things that aren’t feasible or could be done a different way.
The customer steps through this to confirm that it is what they want. Perhaps they edit a few things, or share it with their team for further input. Once both customer and AI are in agreement about what to build, they move forward.
At this point, the customer has a clear product definition and wireframes, and they know what’s reasonable to accomplish as an MVP and what should be put off until later.
If the customer chooses to do so, they can depart the AI builder at this point. Go and find a full-stack freelancer or dev agency to build it for them. The AI has already provided value without writing a single line of code.
The design
The design system builder asks the customer which sites they like the look of and accepts URLs and screenshots. It analyses colour schemes, typography, and related design elements to come up with a draft set of design tokens. It presents the customer with a set of standard page templates that use these values to confirm that the general style is okay as a rough start.
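As a toy illustration of deriving draft colour tokens, here is a sketch that quantises pixels into coarse buckets and keeps the dominant colours. A real implementation would decode an actual screenshot; here the “screenshot” is just a list of RGB tuples so the sketch stays self-contained.

```python
from collections import Counter

def draft_colour_tokens(pixels, n=2):
    """Quantise pixels into coarse buckets and return the n dominant colours."""
    def bucket(rgb):
        # Snap each channel to 32-step bins so near-identical shades merge.
        return tuple(channel // 32 * 32 for channel in rgb)
    counts = Counter(bucket(p) for p in pixels)
    return [colour for colour, _ in counts.most_common(n)]

# Mostly-white page with a blue accent: white should come out as the
# dominant token, with the accent blue second.
screenshot = [(250, 250, 250)] * 90 + [(30, 90, 200)] * 10
tokens = draft_colour_tokens(screenshot)
```

The draft tokens are only a starting point; as described above, the builder would confirm them with the customer against standard page templates before committing to them.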
Then the design system builder analyses the wireframes and specification to determine which page layouts and UI components need to be built. It lays the pages out using placeholders for the components, and confirms with the customer.
Once the customer is happy with the pages, the design system builder selects the off-the-shelf components they can use from its library, and triggers agents to build the remainder in parallel. It writes tests for the custom components and produces screen recordings of them in action.
The customer can see progress being made in the project manager. As the design system builder completes each component, it is presented to the customer for approval as a live demo of that specific component. When enough components are completed to finish a page, those pages are also presented as live demos and approved by the customer.
The AI may fail to build some components. In this case, the customer is offered the option of hiring a freelancer. They see a rough estimate in both time and money. If they choose to move forward with this approach, after a short wait, they will be presented with a final price that they can accept or reject. They do not need to sift through candidates unless they want to. They also have the option of providing their own freelancer to do this. If they are unable to get a freelancer, they will return to the product planner to amend the specification and wireframes to find an alternative solution.
At this point, the customer has a clear product definition and wireframes, they have scoped things down to a reasonable MVP, they have a moodboard and design system, and they have working front-end components for all aspects of their UI.
If the customer chooses to do so, they can depart the AI builder at this point. They can hand off the project so far to a backend developer to wire everything up. The AI has already provided value, even if it hasn’t built the entire product.
The development process
Now each feature needs to be implemented. The project manager will break the specification and wireframes down into individual features and put them into the Kanban board. Each feature will be worked on in turn, with the customer able to track progress. As each feature passes its tests and matches the acceptance criteria from the product planner, it is presented to the customer for user acceptance testing. When the customer is happy, the feature is merged and the next feature begins development.
The AI may fail to build some features. Just like with the design system, where the AI cannot get something done, the project manager can outsource it to a freelance developer with minimal friction.
At this point, the customer has a working MVP that can be brought to market. There are many places where we can add value at this point too – for instance deployment, hosting, and monitoring; go-to-market strategies; advertising, etc. And of course, when they want to build more features, they just return to the product planner and start talking.
One key difference in the user experience is that existing AI builders use chatbots as the primary interface. We know that communicating product concepts to an engineering team is a collaborative process that relies upon artefacts like PRDs and wireframes, and is organised with things like Kanban boards. The current chatbot model tries to compress all of that into a single conversation – wholly inadequate for successful software engineering. Let’s learn from what has worked before and build on our experience instead of figuring everything out the hard way all over again.
Is this realistic?
All of this is possible today without any further advancements in AI. Today’s models are already smart enough to get all of this done, but only if they are put into the right framework that helps them succeed. This is an engineering problem, not a research problem.
We’re raising
If this excites you and you invest at the pre-seed stage, get in touch!