Abstract
Maintainable and general software allows developers to build robust applications efficiently, yet achieving these qualities often requires refactoring specialized solutions into reusable components. This challenge becomes particularly relevant as code agents become increasingly accurate at solving isolated programming problems. We investigate code agents' capacity to refactor code in ways that support growth and reusability. We present both a method and a benchmark for refactoring: Librarian, a sample-and-rerank method for generating reusable libraries, and Minicode, a benchmark where code agents must minimize and refactor multiple independent solutions into a joint library. On Minicode, Librarian obtains compression rates 1.6-2x better than state-of-the-art code agents while also improving correctness. We open-source our code, our benchmark, and the scripts used to generate it.
Key Contributions
Librarian is a novel sample-and-rerank method that refactors codebases into reusable libraries. It clusters code to find shared structures, samples refactorings, and ranks them by simplicity and correctness. It achieves 1.6-2x better compression than top code agents while boosting accuracy.
Minicode is a benchmark for testing code agents' ability to create unified libraries from multiple code sources, such as competition coding programs and Python repositories. It requires open-ended design and large-context understanding in order to craft simple libraries.
Problem Statement
We study the problem of refactoring code for better organization and efficiency. Given multiple codebases with similar functionalities, our goal is to create a unified library that captures common patterns. This process should significantly reduce the total amount of code while ensuring all original functionality remains intact.
We evaluate refactorings based on two key principles:
- Correctness is straightforward: Does the refactored code pass all the original tests?
- Simplicity is more nuanced: we don't just count characters; we define simplicity using Minimum Description Length (MDL). This means we're looking for code that is not only short but also natural, elegant, and extensible, like finding the most concise yet understandable way to express an idea, rather than the shortest, potentially unreadable version (think "Perl Golf", where the shortest code is often incomprehensible). A toy contrast follows this list.
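To make the MDL intuition concrete, here is a toy contrast of our own (not from the paper): both functions below compute the same value, but a code language model assigns far higher probability to the idiomatic version, so under MDL it is the "simpler" one despite using more characters.

```python
# "Perl Golf" style: fewer characters, but unnatural and hard to predict.
f=lambda l:sum(x*x for x in l if~x%2)

# Idiomatic style: more characters, yet higher probability under a code LM,
# so its description length (-log p_LM) is lower.
def sum_of_even_squares(numbers):
    """Return the sum of the squares of the even numbers in `numbers`."""
    return sum(n * n for n in numbers if n % 2 == 0)

assert f([1, 2, 3, 4]) == sum_of_even_squares([1, 2, 3, 4]) == 20
```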
Formalization
Formally, given a set of original programs $\{\rho_n\}_{n=1}^N$, we want to find a new library $\mathcal{L}$ and refactored programs $\{\rho'_n\}_{n=1}^N$. We optimize the following objective:

$$\min_{\mathcal{L},\,\{\rho'_n\}} \; \ell\bigl(\mathcal{L}, \{\rho'_n\}_{n=1}^N\bigr) \;=\; -\log p_{\text{LM}}(\mathcal{L}) \;-\; \sum_{n=1}^{N} \log p_{\text{LM}}(\rho'_n \mid \mathcal{L}) \quad \text{subject to} \quad \tau(\rho_n) \leq \tau(\rho'_n) \;\; \forall\, n$$
Here, $p_{\text{LM}}(\mathcal{L})$ is the probability of the library under a language model, and $p_{\text{LM}}(\rho'_n\mid\mathcal{L})$ is the probability of the refactored program $\rho'_n$ given the library $\mathcal{L}$. The constraint $\tau(\rho_n) \leq \tau(\rho'_n)$ ensures that the refactored programs pass at least as many tests as the originals. The loss function $\ell$ thus encourages solutions that are both correct and have minimal description length, as measured by the language model.
In simpler terms, we're looking for a library and refactored programs that pass at least as many tests as the originals, and whose combined "description length" (how hard they are to describe using a language model) is minimized. This ensures our refactored code is not only correct but also intuitively simple and well-structured.
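To show how this objective can be evaluated for a candidate refactoring, here is a minimal sketch of the scoring logic. The `lm_logprob` and `run_tests` callables are hypothetical stand-ins for a language-model scoring API and the test harness; they are not part of the paper's released code.

```python
import math

def refactoring_loss(library, refactored, originals, lm_logprob, run_tests):
    """Score a candidate (library, refactored programs) pair.

    lm_logprob(text, context="") -> log p_LM(text | context), in nats.
    run_tests(program) -> number of tests the program passes.
    Returns math.inf when the constraint tau(rho_n) <= tau(rho'_n) fails.
    """
    for rho, rho_prime in zip(originals, refactored):
        if run_tests(rho_prime) < run_tests(rho):
            return math.inf  # refactoring lost functionality: reject

    # Description length of the library itself: -log p_LM(L).
    total = -lm_logprob(library)
    # Plus each refactored program given the library: -log p_LM(rho'_n | L).
    for rho_prime in refactored:
        total += -lm_logprob(rho_prime, context=library)
    return total
```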
Librarian: Refactoring Code to Create Libraries
Librarian is our method for refactoring existing code into a more organized and reusable library. By identifying common patterns and abstracting them into shared building blocks, Librarian compresses collections of programs while migrating them to use these new components—reducing overall code size and often improving functionality. The method operates on a simple sample-and-rerank framework, progressively building a library of useful functions to maximize our refactoring objective. Figure 1 illustrates the overall process.
Concretely, the framework proceeds in three steps:
- Clustering: We group related input programs into "tuples" by having a language model summarize the code, then clustering these summaries. This focuses the language model's attention on relevant code chunks.
- Sampling Refactorings: For each tuple, Librarian retrieves relevant existing library functions. Then, using the original code and retrieved functions as context, a language model proposes K candidate refactorings.
- Ranking with Compression: All K candidates are evaluated. We select the refactoring with the lowest description length (best compression) among those that maintain or improve test accuracy relative to the original code. New, useful library functions from the chosen refactoring are then added to the Librarian library for future use. A sketch of this whole loop follows the list.
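Putting the three steps together, the sketch below paraphrases this loop in Python. The helpers `summarize`, `cluster_by_summary`, `retrieve`, and `propose_refactorings` are hypothetical names for the paper's components, and `refactoring_loss` is the scoring sketch from the Formalization section; this is an illustration of the pipeline, not the released implementation.

```python
def librarian(programs, lm, run_tests, k=4):
    """One pass of the sample-and-rerank loop (illustrative sketch)."""
    library = []  # shared functions discovered so far, as source strings

    # 1. Clustering: group related programs via LM-written summaries.
    summaries = [summarize(p, lm) for p in programs]
    for group in cluster_by_summary(programs, summaries):
        # 2. Sampling: retrieve relevant library functions, then propose
        #    k candidate refactorings of this group with the LM.
        context = retrieve(library, group)
        candidates = propose_refactorings(group, context, lm, k=k)

        # 3. Ranking: keep the candidate with the lowest loss, i.e. the
        #    best compression that does not lose any passing tests.
        def loss(c):
            return refactoring_loss(
                "\n".join(library + c.new_functions),  # library as text
                c.programs, group, lm.logprob, run_tests)
        best = min(candidates, key=loss)
        library.extend(best.new_functions)  # grow the library for next groups
    return library
```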
The MINICODE Benchmark
MINICODE evaluates a code agent's capability to identify abstractions across multiple implementations and design reusable libraries. Agents are presented with a collection of code sources and are tasked with refactoring them into a unified library. Key desiderata for these collections are that they must be compressible, containing a latent shared abstraction, and verifiable, allowing functional correctness to be measured. Agents interact with the benchmark via the terminal, managing multi-package Python repositories.
CodeContests Domain
Sourced from the CodeContests dataset, this domain uses competitive programming problems, which naturally contain shared concepts and test cases. Each collection provides multiple solutions, and the agent's task is to create a central library.py file that is imported by each refactored solution.
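As a toy illustration of the expected output shape (the helper functions and the solution file name are invented for this example; only library.py is prescribed by the benchmark):

```python
# library.py -- shared abstractions extracted across the collection.
def read_ints():
    """Parse one line of whitespace-separated integers from stdin."""
    return list(map(int, input().split()))

def prefix_sums(xs):
    """Prefix sums with a leading 0, so sum(xs[i:j]) == ps[j] - ps[i]."""
    ps = [0]
    for x in xs:
        ps.append(ps[-1] + x)
    return ps
```

```python
# solution_017.py -- one refactored solution importing the shared library.
from library import read_ints, prefix_sums

n, q = read_ints()
ps = prefix_sums(read_ints())
for _ in range(q):
    i, j = read_ints()
    print(ps[j] - ps[i])  # answer each range-sum query via the helpers
```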
Repositories Domain
This domain features synthesized projects with controlled complexity and overlap. Using a generative process, we create collections of repositories tailored to specific use cases. Agents must extract reusable functions from across these repositories and rewrite the original source code to use a new, shared common subpackage.
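For intuition, here is a minimal before-and-after of such a migration (the module, package, and function names are hypothetical):

```python
# Before: repo_a/loader.py and repo_b/ingest.py each ship a near-identical
# copy of the same utility.
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]
```

```python
# After: the utility is hoisted into common/io_utils.py, and each
# repository's module is rewritten to import the shared version.
from common.io_utils import load_jsonl
```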
Domain | Sources | Collections | Avg LoC | Avg Tests | Generated by
---|---|---|---|---|---
CodeContests | 300 | 30 | 87 | 10 | Humans
Small Repositories | 262 | 22 | 209 | 12 | o4-mini
Large Repositories | 20 | 10 | 6,433 | 101 | Claude 3.7 Sonnet
Citation
```bibtex
@misc{kovacic2025refactoringcodebaseslibrarydesign,
  title={Refactoring Codebases through Library Design},
  author={Ziga Kovacic and Celine Lee and Justin Chiu and Wenting Zhao and Kevin Ellis},
  year={2025},
  eprint={2506.11058},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2506.11058},
}
```