Unsupervised Codebase Refactoring: How To, What Works and What Doesn't
Could code refactoring become a one-click operation, like running a formatter but for architecture? Split the bloated files, untangle the responsibilities, deduplicate the logic, and come back to a clean codebase?
So I ran an experiment. I pointed Opus 4.6 at a 60K line Go microservice I had
vibe coded over the past few months, gave it some refactoring principles, and let it
run unsupervised. Two hours and 43 euros later, it had produced 22 commits across 46
files, changing 11,174 lines of code.
The codebase #
The service itself is nothing out of the ordinary:
- A CRUD REST API backed by a Postgres database, defined by OpenAPI specs and autogenerated
- A few pages of server-side rendered HTML
- A library to parse and generate an obscure financial data format from the 70s
- An asynchronous job dispatcher backed by SQS queues and Lambda executors
- REST API clients for various third-party and internal services, also autogenerated
- Business logic orchestrating all of the above
- An exhaustive unit test suite
- Terraform infrastructure in the same repository
Since I didn't know Go before starting this project, and since vibe coding tends
to make one a bit lazy, the overall code quality was not stellar. Some files grew too
big, some types took on too many responsibilities, some logic was duplicated, etc, etc.
The technique #
I used the Ralph technique. The agent is run in a loop where, at each step, it either plans a set of tasks or picks one task and works on it.
This allows you to run Claude unsupervised on tasks bigger than one session can handle.
The same prompt is used at every step:
# REFACTORING.md
Hello, we are about to refactor this codebase to cleanup the code. Here's the plan of action.
- 1. Read the whole code of the repository.
- 2. Read the TASKS.md file if it exists.
- 2.1. If it exists and is not empty, pick a refactoring task from the list. Choose the most appropriate.
- 2.1.1. Refactor the code according to the task description.
- 2.1.2. Commit the changes to git.
- 2.1.3 Remove the task from TASKS.md
- 2.1.4. You are done. Do NOT pick and process another task.
- 2.2. If it doesn't exist or is empty:
- 2.2.1. Identify the parts of the code that could be refactored, following the following principles
- A class should have a single responsibility
- The dependencies of the class should be mockable and injected at class instantiation
- Repeated code should be factored into functions
- Files shouldn't be longer than 1.5K lines
- 2.2.2: If using the previous insights you think there is valuable refactoring work to be done:
- 2.2.2.1 Write a list of refactoring tasks in TASKS.md
- 2.2.2.2: You are done.
- 2.2.3: If there is no more refactoring to be done, notify me with 'say "I am done with refactoring"'
There are some important things to note about this prompt:
- We load the whole application code into the context before doing anything. This not only gives the model a global understanding of the code, but since it does one task at a time, it will always have the latest version of the code in its context.
- The refactoring principles are explicitly listed, and I knew beforehand that those were pain points where meaningful work could be done (see the sketch after this list).
- The AI is left free to choose which task it wants to work on. This is important because the model is smart enough to identify the dependencies between the tasks, and those evolve as the code gets refactored.
- When the task list is emptied, the agent builds a new set of tasks. It's the AI's job to decide when it's done, when there is nothing left to refactor.
- Each step is concluded by a git commit, which helps with reviewing the massive amount of code change this will generate.
- I didn't specify anything regarding running the unit tests since I know from experience that Claude will do that on its own unprompted.
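To make the injection principle from the prompt concrete, here is a minimal Go sketch of what it asks for; the JobDispatcher and Queue names are illustrative, not taken from the actual service:

package jobs

import (
	"context"
	"log/slog"
)

// Queue abstracts the SQS client so tests can inject a fake implementation.
type Queue interface {
	Send(ctx context.Context, body string) error
}

// JobDispatcher receives its dependencies at construction time
// instead of creating them internally, which keeps it mockable.
type JobDispatcher struct {
	queue  Queue
	logger *slog.Logger
}

func NewJobDispatcher(queue Queue, logger *slog.Logger) *JobDispatcher {
	return &JobDispatcher{queue: queue, logger: logger}
}

func (d *JobDispatcher) Dispatch(ctx context.Context, payload string) error {
	d.logger.Info("dispatching job")
	return d.queue.Send(ctx, payload)
}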
I then ran this prompt in a loop using the following script. It looks more complex than it really is: it just runs Claude Code in yolo mode in a loop. The jq part is there to print ongoing progress; without it, the script works silently.
#!/usr/bin/env bash
trap 'exit 0' INT
while true; do
    printf '\n\n\n ============ STEP ============ \n\n\n'
    claude --dangerously-skip-permissions --verbose --output-format stream-json -p "Read and follow the instructions in ./REFACTORING.md" </dev/null | \
        jq -rj '
            if .type == "assistant" then
                (.message.content[]? |
                    if .type == "text" and (.text | length) > 0 then
                        .text
                    elif .type == "tool_use" then
                        "\n>>> \(.name): \(.input.file_path // .input.command // .input.pattern // "") <<<\n"
                    else
                        empty
                    end)
            elif .type == "result" then
                "\n\n=== Done (turns: \(.num_turns), cost: $\(.total_cost_usd)) ===\n"
            else
                empty
            end
        '
    sleep 1
done
What it did #
Here is a sample of the changes, roughly in the order they were made:
- Split big files into smaller and meaningfully named ones
- Split types into more focused ones
- Regrouped the various HTML rendering functions into a PageRenderer type (see the sketch after this list)
- Fixed a circular dependency
- Extracted data constants into external JSON files
- Refactored the HTML templates to use common templates
- Introduced helper functions to deduplicate repeated logic
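To give a flavor of that kind of change, here is a sketch of the PageRenderer shape in Go; the fields and the RenderDashboard method are my own illustration, not the actual code:

package web

import (
	"html/template"
	"io"
)

// PageRenderer groups the previously free-floating HTML rendering
// helpers behind one type that holds their shared dependencies.
type PageRenderer struct {
	templates *template.Template
}

func NewPageRenderer(templates *template.Template) *PageRenderer {
	return &PageRenderer{templates: templates}
}

// Each former top-level rendering function becomes a method,
// so callers depend on a single injected renderer.
func (r *PageRenderer) RenderDashboard(w io.Writer, data any) error {
	return r.templates.ExecuteTemplate(w, "dashboard.html", data)
}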
The ordering was sound: it tackled the high-impact structural changes first, then moved on to progressively smaller improvements. It ran the build and the unit tests before every commit, leaving a clean and workable history. The git commit messages were also top-notch.
And the fact that it added only a net 49 lines while changing 11,174 shows that what it did was truly refactoring. Deduplicating logic removes code, splitting types adds some, and it struck a good balance.
What it missed #
There are some refactoring tasks I wish it had identified and fixed but didn't, although I'm quite sure it could do them if prompted specifically.
One example is the logging. Logs need to include contextual data, so that data needs
to be passed around to the log call site. If done via function arguments, this
unnecessarily bloats the function signatures. You can fix that by introducing logger
objects that can hold that data, and then only the logger instance needs to be passed.
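A minimal sketch of that pattern in Go, using the standard log/slog package; the function and field names here are hypothetical:

package payments

import (
	"context"
	"log/slog"
)

// Instead of threading request_id and account_id through every
// function signature, attach them to a logger once and pass only
// the logger down the call chain.
func handleTransfer(ctx context.Context, base *slog.Logger, requestID, accountID string) error {
	logger := base.With("request_id", requestID, "account_id", accountID)
	return processTransfer(ctx, logger)
}

func processTransfer(ctx context.Context, logger *slog.Logger) error {
	// The contextual fields come along for free at every call site.
	logger.InfoContext(ctx, "processing transfer")
	return nil
}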
Introducing this pattern would have been an excellent refactoring for this codebase,
but it wasn't picked. The quality of the output is bounded by the specificity of the
refactoring concerns in the prompt. General principles like single responsibility
get you general improvements. If you want the agent to tackle a specific architectural
smell, you need to name it.
What went wrong #
At some point in the code, we re-fetch some database records immediately before doing a write, to avoid updating from stale data. The agent decided those calls were unnecessary and removed them.
That is not entirely the model's fault: there were no comments explaining this, the calls were not wrapped in a transaction as they should have been, and no unit tests covered the case. It was not just slop; it was a loaded footgun.
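For context, the safe version of that pattern looks roughly like this, with the re-read and the write inside one transaction; this is a sketch with invented table and column names, not the service's actual code:

package store

import (
	"context"
	"database/sql"
)

// UpdateBalance re-reads the row inside the transaction so the write
// is never based on stale data; dropping the SELECT reintroduces the
// race the original re-fetch was guarding against.
func UpdateBalance(ctx context.Context, db *sql.DB, id string, delta int64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	var balance int64
	// Re-fetch with a row lock instead of trusting a value read earlier.
	if err := tx.QueryRowContext(ctx,
		"SELECT balance FROM accounts WHERE id = $1 FOR UPDATE", id).Scan(&balance); err != nil {
		return err
	}

	if _, err := tx.ExecContext(ctx,
		"UPDATE accounts SET balance = $1 WHERE id = $2", balance+delta, id); err != nil {
		return err
	}
	return tx.Commit()
}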
I caught it because I reviewed each commit individually, and because I was aware of
that fragile section of the code. If I hadn't, it would have gone to production.
This is the thing with unsupervised refactoring. The agent can encounter your existing
footguns and pull the trigger. Bang!
What made it possible #
I think there are some key elements that made this possible at all:
- This service was of reasonable size and all its code was in a single repository, cleanly separated from the other services.
- There was extensive unit test coverage.
- There was a good AGENTS.md file that gave the necessary context to understand the application.
- The interfaces with other services are not defined as code but as static YAML files, giving confidence that the interfaces were left untouched by the refactor.
- I had already identified the issues with the existing code beforehand and could steer the agent in that direction.
- The original code, while not perfect, was not a vibe coded disaster either.
Conclusion #
My dream, starting this experiment, was to put this process in a CI pipeline action, so that every engineer at the company could just press a refactor button when the code got bad, and code refactoring would become a solved problem, just like automated code formatting.
We're not there yet. Two things are missing.
You need to give direction. The refactoring principles are specific to each project's goals and technology choices. You could encode them in an AGENTS.md, but there is no universal "make my codebase clean" prompt. Someone who understands the code still needs to identify what "clean" means for that codebase.
And the agent doesn't know when to stop. As iterations went on, the changes became less
and less meaningful and less and less focused on the criteria in the prompt. The agent
wants to be helpful, so at every step it will find work to do, even past the point
where it makes any sense. I have seen this behavior in many other unsupervised agentic coding
experiments, and I would love to hear your solutions for that problem.
Still, the model did in about two hours what would have taken me much longer any other way. Even with the time required to review the changes, this was a huge productivity win. In most cases the limiting factor on refactoring is the engineering time available to do it, and being able to do it faster can only increase the quality of our codebases.