Abstract
Maintainable and general software allows developers to build robust applications efficiently, yet achieving these qualities often requires refactoring specialized solutions into reusable components. This challenge becomes particularly relevant as code agents become increasingly accurate at solving isolated programming problems. We investigate code agents' capacity to refactor code in ways that support growth and reuse. We present both a method and a benchmark for refactoring: Librarian, a sample-and-rerank method for generating reusable libraries, and MiniCode, a benchmark in which code agents must minimize and refactor multiple independent solutions into a joint library. On MiniCode, Librarian outperforms state-of-the-art code agents on both compression and correctness, achieving compression rates 1.6-2x better than those agents while also improving correctness.
Key Contributions
Librarian is a sample-and-rerank method that refactors codebases into reusable libraries. It clusters code to find shared structures, samples refactorings, and ranks them by simplicity and correctness. It achieves 1.6-2x better compression than top code agents while boosting accuracy.
MiniCode is a benchmark for testing code agents' ability to create unified libraries from multiple code sources, such as competition coding programs and Python repositories. It requires open-ended design and large-context understanding in order to craft simple libraries.
Problem Statement
Given multiple code sources that contain problem-specific implementations, we evaluate whether agents can create a cohesive library that captures shared abstractions. This library must reduce the total code size while supporting all original use cases, and may open up new use cases by mining and formalizing latent shared abstractions.
Libraries and refactored sources must be:
- Correct: The refactored code passes all original tests.
- Simple: The refactored code and library are short and natural, as measured by description length under a language model.
Formally, given a set of original programs $\{\rho_n\}_{n=1}^N$, we want to find a new library $\mathcal{L}$ and refactored programs $\{\rho'_n\}_{n=1}^N$. We define the pass rate $\tau(\rho_n)$ as the fraction of unit tests that program $\rho_n$ passes. In practice we care both about refactoring several sources ($N>1$) and about refactoring a single large source ($N=1$).
Refactorings are evaluated using the following objective:

$$\min_{\mathcal{L},\, \{\rho'_n\}} \; \ell\big(\mathcal{L}, \{\rho'_n\}\big) \;=\; -\log p_{\text{LM}}(\mathcal{L}) \;-\; \sum_{n=1}^{N} \log p_{\text{LM}}(\rho'_n \mid \mathcal{L}) \qquad \text{s.t.}\;\; \tau(\rho_n) \le \tau(\rho'_n)\ \ \forall n$$
Here, $p_{\text{LM}}(\mathcal{L})$ is the probability of the library under a language model, and $p_{\text{LM}}(\rho'_n\mid\mathcal{L})$ is the probability of the refactored program $\rho'_n$ given the library $\mathcal{L}$. The constraint $\tau(\rho_n) \leq \tau(\rho'_n)$ ensures that the refactored programs pass at least as many tests as the originals. The loss function $\ell$ thus encourages solutions that are both correct and have minimal description length, as measured by the language model.
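As a concrete illustration, the sketch below scores a candidate refactoring under this objective using per-token log-probabilities from a language model. The function names and the log-probability input format are hypothetical stand-ins for illustration, not the paper's implementation.

```python
import math

def description_length(token_logprobs):
    """Description length in nats: -log p_LM(code), computed from per-token
    log-probabilities returned by some language model (hypothetical format)."""
    return -sum(token_logprobs)

def refactoring_loss(library_logprobs, refactored_logprobs, pass_orig, pass_new):
    """Hedged sketch of the objective: total description length of the library
    plus the refactored programs, subject to the correctness constraint."""
    # Constraint: tau(rho_n) <= tau(rho'_n) for every program n.
    if any(new < orig for orig, new in zip(pass_orig, pass_new)):
        return math.inf  # infeasible: a refactored program regressed on tests
    loss = description_length(library_logprobs)  # -log p_LM(L)
    # -sum_n log p_LM(rho'_n | L)
    loss += sum(description_length(lp) for lp in refactored_logprobs)
    return loss
```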
The MiniCode Benchmark
We instantiate our evaluation across three splits of varying difficulty: large repositories, small repositories, and competition coding. In each of these domains, agents must understand a collection of code sources, synthesize a set of shared abstractions into a library, then refactor the code sources using that library. The refactored code and library are evaluated on correctness and simplicity.
Repository Split
We synthesize both large-scale and small-scale Python repositories by prompting LMs. To obtain a collection of refactorable repositories, we prompt LMs to generate project ideas, then synthesize repositories by generating persona-driven variations of those ideas. Agents must create a unified common library package that is imported into the original repository packages.
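For intuition, the following is a minimal sketch of such a persona-driven generation loop, assuming a hypothetical `generate` function that wraps an LM call; the actual prompts and personas used to build MiniCode are not reproduced here.

```python
def synthesize_collection(generate, num_variants=5):
    """Hedged sketch: build a refactorable collection of related repositories
    by varying a single project idea across several developer personas.
    `generate` is a hypothetical LM wrapper, not the paper's implementation."""
    idea = generate("Propose a small Python project idea for a repository.")
    personas = [
        generate(f"Describe developer persona #{i} who would build: {idea}")
        for i in range(num_variants)
    ]
    # Each persona yields a distinct repository that shares latent abstractions
    # with the others, making the collection a good refactoring target.
    return [
        generate(f"As the following developer, write a small Python repository "
                 f"implementing '{idea}':\n{persona}")
        for persona in personas
    ]
```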
CodeContests Split
Sourced from the CodeContests dataset, this domain uses competitive programming problems, which naturally contain shared concepts and test cases. Each collection provides multiple solutions, and the agent's task is to create a central library.py file that is imported into each refactored solution.
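As a schematic example of this structure, a shared library.py and one refactored solution might look like the following; the helper names are hypothetical, and the actual library contents are produced by the agent.

```python
# library.py -- hypothetical shared helpers factored out of several solutions
def read_ints():
    """Read one line of whitespace-separated integers from stdin."""
    return list(map(int, input().split()))

def max_subarray_sum(xs):
    """Kadane's algorithm: maximum sum of a contiguous subarray."""
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best
```

```python
# solution_001.py -- refactored to import from the shared library
from library import read_ints, max_subarray_sum

def main():
    _n, = read_ints()            # first line: array length (unused here)
    xs = read_ints()             # second line: the array itself
    print(max_subarray_sum(xs))

if __name__ == "__main__":
    main()
```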
| Domain | Sources | Collections | Avg LoC | Avg Tests | Generated by |
|---|---|---|---|---|---|
| CodeContests | 300 | 30 | 87 | 10 | Humans |
| Small Repositories | 262 | 22 | 209 | 12 | o4-mini |
| Large Repositories | 20 | 10 | 6,433 | 101 | Claude Sonnet 3.7 |
Explore a CodeContests Collection
We visualize the original code sources as presented to a code agent below, along with the library and refactored solutions that Claude Sonnet 3.7 produced in one refactoring attempt.
Librarian: Refactoring Code to Create Libraries
Librarian is our method for refactoring existing code into a more organized and reusable library. By identifying common patterns and abstracting them into shared building blocks, Librarian compresses collections of programs while migrating them to use these new components, reducing overall code size and often improving functionality. The method operates on a simple sample-and-rerank framework that progressively builds a library of useful functions to maximize the refactoring objective described above. Figure 1 illustrates the overall process.
Concretely, the framework proceeds as follows:
How It Works:
- Clustering: We group related input programs into "tuples" by having a language model summarize the code, then clustering these summaries. This focuses the language model's attention on relevant code chunks.
- Sampling Refactorings: For each tuple, Librarian retrieves relevant existing library functions. Then, using the original code and retrieved functions as context, a language model proposes K candidate refactorings.
- Ranking with Compression: All K candidates are evaluated. We select the candidate that achieves the best compression (shortest description length under the language model) while maintaining (or improving) test accuracy compared to the original code. New, useful library functions from the chosen refactoring are then added to the Librarian library for future use.
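The following is a minimal sketch of this loop, assuming the programs have already been clustered into tuples. Here `propose`, `pass_rate`, and `description_length` are caller-supplied stand-ins (for example, an LM call, a test runner, and a token-level log-probability score); this is an illustration, not the released implementation.

```python
import math

def librarian(tuples, propose, pass_rate, description_length, k=4):
    """Hedged sketch of Librarian's sample-and-rerank loop over clustered tuples.

    tuples: lists of related programs, grouped by summary similarity.
    propose(tup, library): samples one candidate (new_functions, refactored_programs).
    pass_rate(program): fraction of unit tests the program passes.
    description_length(code): -log p_LM(code), e.g. from token log-probs.
    """
    library = []
    for tup in tuples:
        # Sample K candidate refactorings conditioned on the current library.
        candidates = [propose(tup, library) for _ in range(k)]

        best, best_score = None, math.inf
        for new_funcs, refactored in candidates:
            # Keep only candidates whose programs pass at least as many tests.
            if any(pass_rate(new) < pass_rate(old)
                   for old, new in zip(tup, refactored)):
                continue
            # Rank by compression: new library functions plus refactored programs.
            score = sum(map(description_length, new_funcs + refactored))
            if score < best_score:
                best, best_score = (new_funcs, refactored), score

        if best is not None:
            library.extend(best[0])  # grow the library for future tuples
    return library
```

In this view, the library acts as accumulated context: functions adopted while refactoring earlier tuples are retrieved and reused when refactoring later ones.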
Results
We present results on MiniCode below. The first table reports the CodeContests split, comparing Librarian to Sonnet and codex-mini agent baselines; the subsequent tables report agent baselines on the repository splits.
| Agent | Pass rate (%) | MDL ratio (%) |
|---|---|---|
| Original | 82.0 | 100.0 |
| Claude Sonnet 3.7 | 93.9 | 107.4 |
| Claude Sonnet 4 | 84.4 | 77.1 |
| codex-mini | 82.0 | 86.8 |
| Librarian | 90.7 | 53.0 |
| Agent | Pass rate (%) | MDL ratio (%) |
|---|---|---|
| Original | 100 | 100 |
| Claude Sonnet 3.7 planner, Claude Sonnet 3.7 implementer | 78 | 514 |
| Claude Sonnet 3.7 planner, o4-mini implementer | 80 | 244 |
| Agent | Pass rate (%) | MDL ratio (%) |
|---|---|---|
| Original | 100 | 100 |
| Claude Sonnet 3.7 | 84.3 | 152 |
| Claude Sonnet 4 | 88.3 | 143.3 |
Citation
@misc{kovacic2025refactoringcodebaseslibrarydesign,
title={Refactoring Codebases through Library Design},
author={Ziga Kovacic and Celine Lee and Justin Chiu and Wenting Zhao and Kevin Ellis},
year={2025},
eprint={2506.11058},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2506.11058},
}