Refactoring Codebases through Library Design

¹Cornell University, ²Cohere
Figure 1: Given a collection of code sources (each source is either a program or a repository) and an optional existing library, agents must refactor the code sources by designing a new modular, reusable library. Candidate refactorings are evaluated on program simplicity (compression) across both the library and the refactored sources, and are expected to maintain the correctness of the original code sources (pass rate).

Abstract

Maintainable and general software allows developers to build robust applications efficiently, yet achieving these qualities often requires refactoring specialized solutions into reusable components. This challenge becomes particularly relevant as code agents become increasingly accurate at solving isolated programming problems. We investigate code agents' capacity to refactor code in ways that support growth and reusability. We present both a method and a benchmark for refactoring: Librarian, a sample-and-rerank method for generating reusable libraries, and MiniCode, a benchmark where code agents must minimize and refactor multiple independent solutions into a joint library. Compared to state-of-the-art code agents, Librarian achieves strong results on both compression and correctness on MiniCode, obtaining compression rates 1.6-2x better than coding agents while also improving correctness.

Demo

Librarian refactors competition coding solutions in MiniCode: Given a collection of code solutions, Librarian identifies useful abstractions and creates a library. It then rewrites each code solution using the library.

Key Contributions

Problem Statement

Given multiple code sources that contain problem-specific implementations, we evaluate whether agents can create a cohesive library that captures shared abstractions. This library must reduce the total code size while supporting all original use cases, potentially opening up new use cases as well by mining and formalizing latent shared abstractions.

Libraries and refactored sources must be:

  1. Correct: The refactored code passes all original tests.
  2. Simple: Elegant code is short and natural.

We measure correctness by requiring that refactored code pass at least as many tests as the original sources, and we measure simplicity via minimum description length (MDL). MDL, essentially the total negative log probability of all code under a language model, captures both shortness and naturalness. This avoids the pitfalls of code golf, where shortness is achieved through obfuscation.

Formally, given a set of original programs $\{\rho_n\}_{n=1}^N$, we want to find a new library $\mathcal{L}$ and refactored programs $\{\rho'_n\}_{n=1}^N$. We define the pass rate $\tau(\rho_n)$ as the fraction of unit tests program $\rho_n$ passes. In practice we are concerned both with the case where we are refactoring several sources ($N>1$) and also the case where there is only a single large source we are refactoring ($N=1$).

Refactorings are evaluated using the following objective:

$$ \ell(\mathcal{L}, \{\rho'_n\}) = \begin{cases} -\log p_{\text{LM}}(\mathcal{L}) + \sum_n -\log p_{\text{LM}}(\rho'_n\mid\mathcal{L}) & \forall \rho_n, \, \tau(\rho_n) \leq \tau(\rho'_n) \\ \infty & \text{otherwise} \end{cases} $$

Here, $p_{\text{LM}}(\mathcal{L})$ is the probability of the library under a language model, and $p_{\text{LM}}(\rho'_n\mid\mathcal{L})$ is the probability of the refactored program $\rho'_n$ given the library $\mathcal{L}$. The constraint $\tau(\rho_n) \leq \tau(\rho'_n)$ ensures that the refactored programs pass at least as many tests as the originals. The loss function $\ell$ thus encourages solutions that are both correct and have minimal description length, as measured by the language model.
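The objective can be made concrete with a small sketch. In the code below, pass_rate and neg_log_prob are assumed helpers supplied by the caller (a test runner returning the fraction of unit tests passed, and a code language model returning summed negative token log probabilities); neither is part of a released API.

import math

def refactoring_loss(library, refactored, originals, pass_rate, neg_log_prob):
    """Sketch of the loss: infinite if any program regresses on tests,
    otherwise the total description length of the library and refactored programs."""
    # Correctness constraint: every refactored program must pass at least as many
    # tests as the original it replaces.
    for original, new in zip(originals, refactored):
        if pass_rate(new) < pass_rate(original):
            return math.inf
    # MDL term: -log p(L) plus -log p(rho'_n | L) for each refactored program.
    total = neg_log_prob(library)
    total += sum(neg_log_prob(program, context=library) for program in refactored)
    return total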

The MiniCode Benchmark

We instantiate our evaluation across three splits of varying difficulty: large repositories, small repositories, and competition coding. In each of these domains, agents must understand a collection of code sources, synthesize a set of shared abstractions into a library, then refactor the code sources using that library. The refactored code and library are evaluated on correctness and simplicity.

Repository Split

We synthesize both large-scale and small-scale Python repositories by prompting LMs. To obtain a collection of refactorable repositories, we prompt LMs to generate ideas, then synthesize repositories by generating variations of those ideas via personas. Agents must create a unified common library package that is imported into the original repository packages, as illustrated in the sketch below.
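As a purely hypothetical illustration of the expected structure (the package and module names below are invented, not drawn from the benchmark), a refactored collection might look like this:

# Hypothetical layout after refactoring a collection of repositories:
#
#   collection/
#     common_lib/        # new shared library package created by the agent
#       __init__.py
#       config.py
#     repo_a/            # original repository, rewritten to import common_lib
#       app.py
#     repo_b/
#       app.py
#
# Inside repo_a/app.py, duplicated logic is replaced by an import of the shared code:
from common_lib.config import load_config  # invented helper name

settings = load_config("repo_a.toml")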

CodeContests Split

Sourced from the CodeContests dataset, this domain uses competitive programming problems which naturally contain shared concepts and test cases. Each collection provides multiple solutions, and the agent's task is to create a central library.py file that is imported into each refactored solution.
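For instance, a refactored collection could look like the hypothetical pair of files below; the helper names are invented for illustration and are not taken from the benchmark.

# library.py -- shared helpers factored out of several solutions (hypothetical example)
def read_ints():
    """Read one line of whitespace-separated integers from stdin."""
    return list(map(int, input().split()))

def prefix_sums(xs):
    """Return prefix sums with a leading zero: prefix_sums([1, 2, 3]) == [0, 1, 3, 6]."""
    out = [0]
    for x in xs:
        out.append(out[-1] + x)
    return out

# solution.py -- one refactored solution, now importing the shared library
from library import read_ints, prefix_sums

def solve():
    nums = read_ints()
    print(prefix_sums(nums)[-1])  # e.g., the total sum of the input

if __name__ == "__main__":
    solve()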

Domain             | Sources | Collections | Avg LoC | Avg Tests | Generated by
Code Contests      | 300     | 30          | 87      | 10        | Humans
Small Repositories | 262     | 22          | 209     | 12        | o4-mini
Large Repositories | 20      | 10          | 6,433   | 101       | Claude Sonnet 3.7
Table 1: MiniCode statistics

Explore a CodeContests Collection

We visualize the original code sources as presented to a code agent below. We also provide the library and refactored solutions that Claude Sonnet 3.7 created in a refactor attempt.

Check out the full benchmark here.

Librarian: Refactoring Code to Create Libraries

Librarian is our method for refactoring existing code into a more organized and reusable library. By identifying common patterns and abstracting them into shared building blocks, Librarian compresses collections of programs while migrating them to use these new components, reducing overall code size and often improving functionality. The method operates on a simple sample-and-rerank framework: it maintains and grows a library of useful functions while searching for refactorings that minimize the objective $\ell$ defined above. Figure 1 illustrates the overall process.

Concretely, Librarian solves the following optimization:

$$ \mathcal{L}^\star, \left\{ \rho^\star_n \right\} = \arg\min_{\mathcal{L}, \left\{ \rho'_n \right\} \in \mathrm{Sample}(\left\{ \rho_n \right\})} \ell(\mathcal{L}, \left\{ \rho'_n \right\}). $$

How It Works: Librarian samples candidate refactorings, each consisting of a library and rewritten versions of the original programs, discards candidates whose programs fail more tests than the originals, and reranks the remaining candidates by the MDL objective, keeping the best library and refactorings found.
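The loop below is a minimal sketch of this search, omitting the progressive library growth described above. Here propose_refactoring stands in for an LM agent call that drafts a candidate library and rewritten programs, and refactoring_loss implements the objective $\ell$ from the problem statement (with the pass-rate and language-model scorers already bound in); both names are placeholders, not part of a released API.

import math

def sample_and_rerank(originals, propose_refactoring, refactoring_loss, num_samples=8):
    """Draw candidate (library, refactored programs) pairs and keep the best under the loss."""
    best_candidate, best_loss = None, math.inf
    for _ in range(num_samples):
        # An LM-based agent proposes a library plus a rewrite of each source.
        library, refactored = propose_refactoring(originals)
        # Candidates that lose test coverage receive an infinite loss and are never selected.
        loss = refactoring_loss(library, refactored, originals)
        if loss < best_loss:
            best_candidate, best_loss = (library, refactored), loss
    return best_candidate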

Results

We present the results for the CodeContests split of MiniCode below, comparing Librarian to Sonnet and codex-mini agent baselines.

Agent      | Pass rate (%) | MDL ratio (%)
original   | 82.0          | 100.0
Sonnet 3.7 | 93.9          | 107.4
Sonnet 4   | 84.4          | 77.1
codex-mini | 82.0          | 86.8
Librarian  | 90.67         | 53.0
Table 2: Results on the MiniCode CodeContests split. Lower MDL ratio indicates better compression.
We also present results on the small repository split, which consists of repositories generated by o4-mini. We experiment with Claude Sonnet 3.7 as the planner and either Claude Sonnet 3.7 or o4-mini as the implementer.
Agent                                                    | Pass rate (%) | MDL ratio (%)
original                                                 | 100           | 100
Claude Sonnet 3.7 planner, Claude Sonnet 3.7 implementer | 78            | 514
Claude Sonnet 3.7 planner, o4-mini implementer           | 80            | 244
Table 3: Average results on MiniCode-repositories small
Finally, we present results on the large repository split. Given the stronger performance of the Sonnet models, we evaluate only Sonnet models on this split to limit cost.
Agent             | Pass rate (%) | MDL ratio (%)
original          | 100           | 100
Claude Sonnet 3.7 | 84.3          | 152
Claude Sonnet 4   | 88.3          | 143.3
Table 4: Average refactoring results on MiniCode-repositories large, comparing the original code sources to refactored sources from Claude Sonnet 3.7 and Sonnet 4
Check out the paper for the full details on results!

Citation

@misc{kovacic2025refactoringcodebaseslibrarydesign,
      title={Refactoring Codebases through Library Design}, 
      author={Ziga Kovacic and Celine Lee and Justin Chiu and Wenting Zhao and Kevin Ellis},
      year={2025},
      eprint={2506.11058},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2506.11058}, 
}