olmo-eval: An evaluation workbench for the model development loop
olmo-eval is a new evaluation workbench designed to streamline the LLM development process. It offers greater flexibility and modularity than existing tools, allowing rapid experimentation and in-depth analysis of how model performance changes across interventions. It helps developers assess whether a model adjustment genuinely improves performance or whether the change in score is merely noise.
Evaluating large language models (LLMs) repeatedly during development is a critical but often cumbersome process. Every change, from data adjustments to architectural shifts, requires re-running benchmarks and analyzing the results. Most existing evaluation tools are not built for this iterative cycle: they struggle to keep pace with constantly evolving models and to reflect real-world behavior accurately.
olmo-eval is designed to address these challenges, building upon the Open Language Model Evaluation Standard (OLMES). While OLMES aimed to standardize benchmark scores, olmo-eval extends this by simplifying the implementation of new evaluations, offering more flexibility in execution, and enabling the composition of complex workflows. It supports agentic and multi-turn evaluations as first-class use cases and provides enhanced analysis tools to distinguish genuine improvements from noise.
olmo-eval differs from tools like Harbor by focusing on the everyday needs of model development. Unlike Harbor, which emphasizes running and publishing agent benchmarks in a single, containerized environment, olmo-eval allows users to choose how each benchmark runs. This flexibility means simpler benchmarks can run directly for speed and cost-efficiency, while those requiring isolated environments receive a containerized setup only when necessary.
Adding a benchmark to olmo-eval is designed to be fast and customizable, in contrast to Harbor's more rigorous, publication-oriented process. olmo-eval offers several ways to define a benchmark, from short descriptions for basic evaluations to thin wrappers around existing benchmark code. This adaptability lets developers move through evaluation cycles quickly.
Both olmo-eval and Harbor separate benchmark logic from runtime policy, but olmo-eval offers greater modularity. Components such as the model, tools, environment, and helper models (e.g., an LLM-as-a-judge) are all swappable. This allows components to be reused across different evaluations and individual settings to be adjusted without affecting the others.
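The idea of swappable components can be illustrated with a small sketch. This is not olmo-eval's API: the names `EvalConfig`, `run_eval`, `upper_model`, and `case_insensitive_judge` are hypothetical, and the "model" and "judge" are plain Python functions standing in for real LLM calls. It only shows how keeping the benchmark loop separate from its components lets one piece change while the rest stays fixed.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical types for illustration only; not olmo-eval's actual interfaces.
Model = Callable[[str], str]
Judge = Callable[[str, str], float]


@dataclass(frozen=True)
class EvalConfig:
    """Bundle of swappable components for one evaluation run."""
    model: Model
    judge: Optional[Judge] = None   # helper model, e.g. an LLM-as-a-judge
    environment: str = "local"      # e.g. "local" or "container"
    tools: Tuple[str, ...] = ()


def run_eval(config: EvalConfig, examples: Sequence[Tuple[str, str]]) -> list:
    """Benchmark logic: score each (prompt, reference) pair.

    Falls back to exact-match scoring when no judge is configured.
    """
    scores = []
    for prompt, reference in examples:
        answer = config.model(prompt)
        if config.judge is not None:
            scores.append(config.judge(answer, reference))
        else:
            scores.append(float(answer.strip() == reference))
    return scores


def upper_model(prompt: str) -> str:
    """Stand-in for a real model: returns the prompt upper-cased."""
    return prompt.upper()


def case_insensitive_judge(answer: str, reference: str) -> float:
    """Stand-in for a judge model: ignores letter case."""
    return float(answer.lower() == reference.lower())


examples = [("a", "A"), ("b", "b")]
base = EvalConfig(model=upper_model)
# Swap only the judge; model, tools, and environment are reused unchanged.
judged = replace(base, judge=case_insensitive_judge)

print(run_eval(base, examples))    # exact-match scoring
print(run_eval(judged, examples))  # judge-based scoring
```

Using an immutable config with `dataclasses.replace` is one design choice among many; it makes each variant an explicit, comparable object rather than a mutation of shared state.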
olmo-eval provides detailed performance reports, including the standard error and the minimum detectable effect, to help interpret results. It also lets developers compare results on individual questions across model checkpoints, to determine whether a small change in the overall average reflects a true improvement or merely noise.
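To make these two statistics concrete, here is a minimal sketch of how they could be computed for a paired, per-question comparison of two checkpoints. It is not taken from olmo-eval: the function names, the normal approximation, and the defaults (two-sided significance level 0.05, power 0.8) are assumptions chosen for illustration, and the per-question scores are made-up toy data.

```python
import math
from statistics import NormalDist


def standard_error(values):
    """Standard error of the mean, using the sample variance (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)


def paired_difference(old_scores, new_scores):
    """Mean per-question score difference and its standard error.

    Assumes both lists score the same questions in the same order.
    """
    diffs = [new - old for old, new in zip(old_scores, new_scores)]
    return sum(diffs) / len(diffs), standard_error(diffs)


def minimum_detectable_effect(se, alpha=0.05, power=0.8):
    """Smallest true difference detectable at the given alpha and power.

    Normal approximation: MDE = (z_{1 - alpha/2} + z_{power}) * SE.
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se


# Toy data (assumed): 1 = correct, 0 = incorrect, for the same 8 questions.
old_checkpoint = [1, 0, 1, 1, 0, 1, 0, 1]
new_checkpoint = [1, 1, 1, 1, 0, 1, 0, 1]

mean_diff, se_diff = paired_difference(old_checkpoint, new_checkpoint)
mde = minimum_detectable_effect(se_diff)

print(f"mean difference: {mean_diff:.3f}")
print(f"standard error:  {se_diff:.3f}")
print(f"MDE:             {mde:.3f}")
print("distinguishable from noise:", abs(mean_diff) >= mde)
```

In this toy case the new checkpoint answers one more question correctly, but the observed difference is smaller than the minimum detectable effect, so a benchmark of this size could not separate it from noise.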
