Refactoring Codebases through Library Design

Cornell University, Cohere
Figure: Overview of the refactoring problem. A refactoring task comprises a set of files. We refactor the files by designing a new library. Candidate refactorings are evaluated with a refactoring metric and are expected to preserve the correctness of the original code sources (pass rate). We explore several refactoring metrics in this paper.
Abstract

Maintainable and general software allows developers to build robust applications efficiently, yet achieving these qualities often requires refactoring specialized solutions into reusable components. This challenge becomes particularly relevant as code agents are increasingly used to solve isolated, one-off programming problems. We investigate code agents' capacity to refactor code in ways that support growth and reusability. We first ask what makes a good refactoring, finding via simulation results and a human study that Minimum Description Length (MDL) correlates best with preferable refactorings. We then present both a benchmark and a method for refactoring: MiniCode, a benchmark where multiple files must be refactored into a shared library, and Librarian, a sample-and-rerank method for generating reusable libraries. We compare Librarian to state-of-the-art library generation methods and study it on real-world codebases.

Motivation: Technical Debt in the Age of AI

Much of software engineering involves not writing new code, but rewriting existing code: debugging, optimizing, and refactoring. Poor rewrites lead to "technical debt," a pervasive issue costing the software industry an estimated $2 trillion annually. This problem may be amplified by the rise of Large Language Models (LLMs). While LLMs excel at solving isolated programming tasks, their limited context can lead them to generate specialized, one-off solutions that add to a codebase's redundancy rather than reduce it. This raises a critical question:

Can we use library learning to build code agents that perform large-scale, repository-level refactoring to create more reusable and maintainable software?

Problem Statement

For now, assume a placeholder metric $M$ measuring refactoring quality; we seek to minimize $M$ while preserving correctness. Given a task comprising files $\{\rho_n\}_{n=1}^N$, we output a new library $\mathcal{L}$ together with refactored versions of the original files, $\{\rho_n'\}_{n=1}^N$. We define the tests passed, $\tau(\rho_n)$, as the set of unit tests that $\rho_n$ passes, and consider both refactoring several files ($N > 1$) and refactoring a single large file ($N = 1$).

We optimize the following objective, which requires each refactored program to pass at least the tests its original passed and, subject to that constraint, minimizes the chosen metric $M$:

$$\ell(\mathcal{L}, \{\rho_n'\}) = \begin{cases} M(\mathcal{L}, \{\rho_n'\}) & \text{if } \forall n,\ \tau(\rho_n) \subseteq \tau(\rho_n') \\ \infty & \text{otherwise} \end{cases}$$
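
As a concrete illustration, here is a minimal Python sketch of this objective; `run_tests` and `metric_fn` are placeholders standing in for whatever test harness and metric $M$ a given task uses.

```python
# Minimal sketch of the refactoring objective (helper names are illustrative placeholders).
import math

def refactoring_loss(library, refactored, originals, run_tests, metric_fn):
    """Return metric_fn(...) if every refactored file passes at least the
    tests its original passed; otherwise return infinity."""
    for orig, new in zip(originals, refactored):
        # run_tests(p) plays the role of tau(p): the set of unit tests p passes.
        if not run_tests(orig) <= run_tests(new):  # subset check on test sets
            return math.inf
    return metric_fn(library, refactored)
```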

What makes a "good" refactoring?

Before automating refactoring, we must first define what makes a redesign "good." Simply minimizing code length is not the answer, as this can lead to obfuscated and unreadable code, a practice known as "code golf". We investigated several quantitative metrics, from classic software engineering measures like the Maintainability Index (MI) to compression-based objectives like token count and Minimum Description Length (MDL).
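
For concreteness, the candidate metrics can be computed roughly as follows; this sketch assumes the `radon` and `tiktoken` packages and is meant only to show what each metric measures, not the paper's exact implementation.

```python
# Rough sketch of candidate refactoring metrics (assumes the `radon` and
# `tiktoken` packages; not necessarily the paper's exact implementation).
import tiktoken
from radon.metrics import mi_visit        # Maintainability Index
from radon.complexity import cc_visit     # Cyclomatic Complexity

def token_count(source: str) -> int:
    """Compression proxy: number of tokens under a fixed tokenizer."""
    return len(tiktoken.get_encoding("cl100k_base").encode(source))

def maintainability_index(source: str) -> float:
    """Classic software-engineering measure (higher is considered better)."""
    return mi_visit(source, multi=True)

def mean_cyclomatic_complexity(source: str) -> float:
    """Average cyclomatic complexity over all functions/blocks (lower is better)."""
    blocks = cc_visit(source)
    return sum(b.complexity for b in blocks) / max(len(blocks), 1)
```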

Asymptotic behavior of metrics in the large-sample regime

Are these metrics equally effective at encouraging modular and reusable libraries? To answer this question, we run LIBRARIAN on 15 CodeContests tasks (three files each) using MDL, token count, maintainability index, and cyclomatic complexity, while varying the inference-time sample budget $K$.

Tokens and MDL separate cleanly from the classic software engineering metrics: optimizing tokens or MDL, both of which essentially compress the original programs, does not yield steady improvements in MI/CC, and vice-versa. To understand whether these libraries expose shared abstractions, we examine the average number of library invocations per task and the average number of calls per library function. This teases apart tokens and MDL: optimizing MDL yields more reusable libraries (about 8 library invocations per task) whose functions are called more often (about 2.2 calls per function), exceeding the other metrics we consider.

Figure: Asymptotic behavior of metrics for scoring libraries and refactorings. MDL produces libraries with higher function reuse than the other metrics.

What refactoring metric do humans agree with the most?

We perform a human study to corroborate these findings using the exact same CodeContests clusters. The study compares tokens, MDL, and Maintainability Index by (1) refactoring clusters into libraries, (2) presenting participants with the original sources and their refactorings under pairs of metrics, and (3) eliciting pairwise preferences.

Humans prefer MDL-minimizing libraries, and although the preference is only statistically significant for MDL vs. MI, the data suggest a rank-order preference of MDL > Tokens > MI. With 14 participants (129 judgments), we already see a general preference for the compression-based metrics (MDL and tokens), with only MDL crossing the threshold of statistical significance.

Figure: Human evaluation of different refactoring objectives. Judges compare pairs of refactorings that both pass all test cases. MDL aligns best with human preferences.

We therefore adopt $M_{\mathrm{MDL}}$ as the primary objective for the remainder of this paper. Beyond the support from this human study, (1) Bayesian arguments favor MDL; (2) corner cases in the style of 'Perl golf' provide existence proofs of the liability of merely minimizing tokens; and (3) reasonable proxies for library reuse favor MDL.
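
Concretely, MDL can be approximated as the description length, in bits, of the library plus the refactored programs under a language model. The sketch below assumes a placeholder LM API `token_logprobs(text, context)` that returns per-token log-probabilities; it illustrates the idea rather than reproducing the paper's implementation.

```python
# MDL-style score: description length (bits) of the library plus the refactored
# programs under a language model. `token_logprobs` is a placeholder for any LM
# API returning per-token log-probabilities of `text` given `context`.
import math

def bits(logprobs) -> float:
    """Shannon code length, in bits, of a token sequence with the given log-probs."""
    return -sum(logprobs) / math.log(2)

def mdl_score(library_src, refactored_srcs, token_logprobs) -> float:
    # Cost of stating the library once, plus each refactored program encoded given it.
    total = bits(token_logprobs(library_src, context=""))
    total += sum(bits(token_logprobs(src, context=library_src)) for src in refactored_srcs)
    return total
```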

Method and Benchmark

LIBRARIAN: Our Refactoring Method

Having established a workable proxy for refactoring quality, we introduce LIBRARIAN, our method for the problem setup described above. LIBRARIAN generates a new library from a set of programs while migrating those programs to use the new library, following a sample-and-rerank framework: prompt a backend LLM for $K$ candidates, then pick the one that minimizes the loss $\ell$.

Naively, we would optimize:

$$ \mathcal{L}, \left\{ \rho'_n \right\} = \arg\min_{\mathcal{L}, \left\{ \rho'_n \right\}\in \mathrm{Sample}_K(\left\{ \rho_n \right\})} \ell(\mathcal{L}, \left\{ \rho'_n \right\}) $$

But this does not scale to large tasks with many programs, which cannot fit into the context window of most LLMs. Even long-context models cannot process the entirety of, e.g., the Linux kernel, and even if they could, it is not clear that doing so is the most effective way of focusing the model's attention. To address this, we wrap sample-and-rerank in a clustering algorithm that decomposes the task into manageable chunks.

How It Works:

  1. Clustering: Meaningful abstractions arise when programs share underlying functionality or structure. To surface these, we cluster the task's programs into small groups that are likely to share reusable structure, and refactor each cluster separately from the rest. This decomposition shrinks the prompt size, and gives independent searches for the best per-cluster refactoring, which may be more tractable. We use agglomerative clustering on code summaries generated by prompting a model to summarize each program, using text-embedding-ada-002 to embed descriptions of code sources for clustering.
  2. Sample-and-Rerank: For each cluster, we prompt an LLM to generate many candidate refactorings, then use our MDL objective to score them and select the best candidate that passes all original unit tests. We accumulate a library across clusters, and when refactoring a cluster, add the accumulated library to the prompt. This lets abstractions discovered earlier carry forward across the collection (see the code sketch after this list).
  3. Library Accumulation: The simplest approach refactors each cluster independently and takes the union of each cluster's library. A more sophisticated approach accumulates a library across clusters, allowing abstractions discovered in one cluster to be useful in another cluster.
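
The overall loop, in a simplified Python sketch: the `summarize`, `embed`, and `sample_refactorings` helpers and the `loss` function stand in for the components described above, and the real system includes details (prompting, test execution, fallbacks) omitted here.

```python
# Simplified sketch of the LIBRARIAN loop: cluster, sample-and-rerank, accumulate.
# `summarize`, `embed`, `sample_refactorings`, and `loss` are placeholders for the
# components described in the text (LLM summaries, embeddings, candidate sampling,
# and the objective defined earlier).
from sklearn.cluster import AgglomerativeClustering

def librarian(programs, K, cluster_size, summarize, embed, sample_refactorings, loss):
    # 1. Cluster programs by embeddings of their natural-language summaries.
    vectors = [embed(summarize(p)) for p in programs]
    n_clusters = max(1, len(programs) // cluster_size)
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(vectors)

    library, refactored = "", []
    for c in range(n_clusters):
        cluster = [p for p, label in zip(programs, labels) if label == c]
        # 2. Sample K candidate (library, rewritten programs) pairs,
        #    conditioning on the library accumulated so far.
        candidates = sample_refactorings(cluster, library, K)
        # 3. Rerank: keep the candidate with the lowest loss
        #    (the loss is infinite if any original test regresses).
        best_lib, best_progs = min(candidates, key=lambda cand: loss(cand[0], cand[1], cluster))
        library += "\n" + best_lib  # accumulate the library across clusters
        refactored.extend(best_progs)
    return library, refactored
```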

MINICODE: Our Refactoring Benchmark

MINICODE presents systems with a task comprising a set of programs, then asks them to design a unified library along with rewritten versions of the original programs that use it. There are two key desiderata for benchmark tasks:

  1. They should contain related programs that share latent abstractions.
  2. They should be verifiable, so that we can measure how well refactored programs preserve functional correctness.

Domain          Files   Tasks   Avg LoC   Avg Tests / Program
Code Contests     300      10        87                    10
Transformers       10       1       538                   181
Diffusers          11       2       685                    75
Logo              300       1        10                     1
Date              246       1        14                     1

Table 1: MINICODE statistics.

CodeContests

Competition problems are crafted with specific variations of algorithmic approaches in mind, yielding both shared latent concepts and ready-made test cases. As a result, competition coding is both verifiable and ready to refactor. We therefore take solutions, prompts, and tests from CodeContests, a competition programming dataset.

Huggingface 🤗 Transformers Library

We test refactoring across implementations of large language and vision–language models from the Huggingface transformers repository (modeling_<name>.py files, e.g., Qwen2, LLaMA, DeepSeek-V3). Unlike competition coding, these sources are production-scale, and Huggingface requires that all changes pass an extensive suite of integration tests before merging into the main branch. A refactoring is only deemed correct if it passes the unmodified Transformers test suite, making this a high-stakes setting that demands correctness and compatibility.

Huggingface 🤗 Diffusers Library

We test refactoring across implementations of diffusion models from the Huggingface diffusers repository (unet_<name>.py and scheduling_<name>.py files, e.g., the Stable Diffusion UNet and DDPMScheduler), yielding two distinct tasks. Like Transformers, Diffusers requires that all changes pass a comprehensive suite of integration tests before merging into the main branch.

What do we learn from running LIBRARIAN on MINICODE?

We empirically study LIBRARIAN on MINICODE with the goal of understanding (1) the degree to which library abstractions are reused across programs, (2) how our method compares to state-of-the-art library learning on existing datasets, and (3) whether LIBRARIAN holds value for real-world repos.

LIBRARIAN discovers reusable functions for competition programming—but some functions are only called once.

We test on CodeContests with a cluster size of $S=3$ and a sample budget of $K=8$ draws from o4-mini, as reasoning models perform well on competition programming. The resulting refactorings and libraries approximately halve the MDL, which incidentally reduces program size as well (a 44% relative reduction in token count). Pass rate modestly improves as an incidental consequence of sampling and filtering with test cases. Libraries average 10 functions, each heavily reused: about 5 uses per function within tasks comprising only 10 programs. But almost 40% of library functions are used only once.

A signature of the MDL objective is a preference for whatever a language model assigns high a priori probability to. Although a single-use function does not reduce line count or tokens (the function could simply be inlined), it improves MDL if it yields a more natural decomposition of the target programs. Indeed, human-written libraries sometimes include functions that are seldom used, provided they serve as a conceptually modular abstraction. We therefore see single-use functions as a feature, not a bug.
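
A hypothetical illustration of this effect (not drawn from the learned libraries): the helper below has a single call site, so inlining it would barely change token count, yet the named abstraction gives a decomposition that a language model plausibly finds more natural.

```python
# Hypothetical single-use helper: inlining it saves almost nothing, but the
# named abstraction yields a more modular, "natural" decomposition.
def normalize_rows(rows):
    """Scale each row so it sums to 1 (rows summing to 0 are left unchanged)."""
    result = []
    for row in rows:
        total = sum(row)
        result.append([x / total for x in row] if total else list(row))
    return result

def transition_matrix(counts):
    # The only call site of normalize_rows in this toy program.
    return normalize_rows(counts)
```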

Are these libraries useful for solving new, unseen programming problems?

Library learning has long sought to learn libraries from training programs that then help solve new, unseen program synthesis tasks. The Logo and Date datasets fit within this paradigm, and REGAL recently improved the state of the art on them. Because our clustering is heavily inspired by REGAL, for a fair comparison we keep exactly their clustering setup but add MDL-based reranking with $K=5$ samples. Despite the simplicity of these datasets, our more involved method still adds value: sampling and reranking by MDL yields up to a 41.8% relative improvement in solve rate on unseen programming problems, and even when the gains are more modest, we still improve on the state of the art.

Metric                    Value
Pass Rate                 90.67% ±1.88
Pass Rate Improvement      6.33% ±1.41
MDL Ratio                  0.53 ±0.03
Token Ratio                0.66 ±0.04
Library Functions         10.30 ±1.41
Avg Calls per Function     5.17 ±1.08
% Single-Use Functions    38.03% ±4.88

Table 2: Results for LIBRARIAN on 10 CodeContests tasks ($K=8$, $S=3$).

Dataset   Model                       Pass Rate
Logo      REGAL (gpt-3.5-turbo)       49.3% ±1.1
          LIBRARIAN (gpt-3.5-turbo)   69.9% ±0.9
Date      REGAL (gpt-3.5-turbo)       90.2% ±0.5
          LIBRARIAN (gpt-3.5-turbo)   94.7% ±0.7

Table 3: Solving held-out test program synthesis tasks using learned libraries.

How does LIBRARIAN perform on real-world refactoring tasks?

The HuggingFace Transformers library is used by nearly 400k GitHub projects. We deploy LIBRARIAN on 10 source files, using Claude Code to sample $K=15$ refactorings per cluster of size $S=5$, reasoning that an agent such as Claude Code should excel at repo-level edits. LIBRARIAN distilled repeated abstractions such as MLP, attention, and decoder classes and RoPE helper functions, lowering MDL to 67.2% of its original value while still passing all integration tests. The top-3 refactorings by MDL contain an average of 18 abstractions (functions and classes) in the library, each called on average 4.59 times in the refactored models.
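
To make the kind of extracted abstraction concrete, here is a hedged sketch of a rotary-position-embedding helper of the sort that recurs nearly verbatim across modeling files; the signatures are illustrative, not the exact functions LIBRARIAN produced.

```python
# Sketch of a RoPE-style helper that recurs across modeling_<name>.py files and is a
# natural candidate for a shared library (illustrative, not LIBRARIAN's actual output).
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Split the last dimension in half, negate the second half, and swap the halves."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q: torch.Tensor, k: torch.Tensor,
                         cos: torch.Tensor, sin: torch.Tensor):
    """Apply rotary position embeddings to query and key tensors."""
    return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
```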

For Diffusers, scheduler clusters yielded top-3 MDL refactorings with an average of 12.3 functions and 3.0 calls per function, while UNet refactorings produced richer abstractions with an average of 17.0 functions/classes and 3.43 calls each.

Refactoring at scale proved expensive: each refactoring took approximately 30 minutes to generate and test. But this is a one-off cost, and in our view, the refactored Transformers and Diffusers sources are much cleaner, and the new library is transparently reusable. To the best of our knowledge, this is the first time any library learning algorithm has been successfully applied to real-world software projects.

Figure: Representative result of refactoring HuggingFace Transformers using LIBRARIAN.

Learned libraries from real-world codebases are useful for unseen downstream refactoring tasks

When a library learned on one cluster of Transformer files (5 models) is applied to refactor a second cluster, LIBRARIAN reduces the unseen cluster's MDL to 73% of its original value, with an average of 3.0 calls per library function. This demonstrates that LIBRARIAN learns libraries that can be repurposed to rewrite unseen real-world code more compactly.

Citation

@misc{kovacic2025refactoringcodebaseslibrarydesign,
      title={Refactoring Codebases through Library Design}, 
      author={Ziga Kovacic and Justin T. Chiu and Celine Lee and Wenting Zhao and Kevin Ellis},
      year={2025},
      eprint={2506.11058},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2506.11058}, 
}