Is it agentic enough? Benchmarking open models on your own tooling
This article introduces a new methodology for benchmarking open-source models, focusing on the efficiency and effort required by AI agents to achieve a task, rather than just the final outcome. It highlights the importance of designing APIs and documentation to be "agent-friendly" to optimize performance and reduce computational costs for agents.
Coding agents are increasingly interacting with our software, automating tasks from library selection to debugging. This shift introduces a new paradigm in library development: software must now be designed for effective agent interaction, not just correctness and speed. Clunky APIs and outdated documentation can lead agents down inefficient and more expensive paths.
Traditional benchmarks primarily evaluate final answers. This approach instead looks at the entire process: how much effort an agent expends to reach a solution. By varying models, library revisions, and tasks, it measures not only correctness but also the efficiency of the path taken. The `transformers` library serves as a key case study for this benchmarking methodology.
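To make the idea of measuring effort alongside correctness concrete, here is a minimal sketch of how per-run records might be aggregated across models and library revisions. It is not the article's actual harness: the `Run` fields, the choice of tool calls and tokens as effort proxies, and the sample values are all assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    """One agent attempt at one task (all fields are hypothetical)."""
    model: str        # placeholder model identifier
    revision: str     # placeholder library revision label
    task: str         # placeholder task name
    solved: bool      # did the agent reach a correct result?
    tool_calls: int   # assumed effort proxy: actions the agent took
    tokens: int       # assumed effort proxy: tokens consumed


def summarize(runs):
    """Group runs by (model, revision); report success rate and mean effort on solved runs."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run.model, run.revision)].append(run)

    summary = {}
    for key, group in groups.items():
        solved = [r for r in group if r.solved]
        summary[key] = {
            "success_rate": len(solved) / len(group),
            # Effort is averaged over solved runs only, so failures do not skew it.
            "mean_tool_calls": mean([r.tool_calls for r in solved]) if solved else None,
            "mean_tokens": mean([r.tokens for r in solved]) if solved else None,
        }
    return summary


# Illustrative, made-up records: same model, two library revisions, two tasks.
runs = [
    Run("model-a", "rev-old", "load-and-generate", True, 12, 9000),
    Run("model-a", "rev-old", "fine-tune", False, 30, 25000),
    Run("model-a", "rev-new", "load-and-generate", True, 6, 4000),
    Run("model-a", "rev-new", "fine-tune", True, 10, 8000),
]
summary = summarize(runs)
for key, stats in summary.items():
    print(key, stats)
```

Keeping success rate and effort as separate columns, rather than folding them into one score, makes it visible when a library revision leaves correctness unchanged but shortens the path an agent takes to get there.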
Optimizing software for agents involves two core principles: discoverability and clarity. APIs must be clear, and documentation needs to be extensive, well-structured, and easily accessible with relevant examples. To ensure a tool works effectively for agents, it must be tested specifically for agentic use cases. This approach helps in identifying improvements that streamline agent workflows and reduce resource consumption.
Related articles
Build real agentic apps using CUGA: two dozen working examples on a lightweight harness
CUGA, IBM's open-source Agent Harness, simplifies building agentic applications by handling infrastructure, allowing developers to focus on tools and prompts. It offers pre-assembled components for planning, execution, and state management, significantly reducing development time. CUGA has topped agent benchmarks like AppWorld and WebArena.
OpenAI launches new initiative to help find and patch open source bugs
OpenAI has launched "Patch the Planet," a new initiative in partnership with cybersecurity firm Trail of Bits, to enhance the security of open-source projects. This program aims to assist maintainers in identifying and patching bugs, utilizing OpenAI's AI-powered security tools while reducing the burden on project teams.
PP-OCRv6 on Hugging Face: 50-Language OCR from 1.5M to 34.5M Parameters
Baidu has released PP-OCRv6, an advanced optical character recognition (OCR) model supporting 50 languages. Available on Hugging Face, this version significantly improves accuracy and efficiency across various parameter sizes, from 1.5 million to 34.5 million, marking a substantial leap in multilingual OCR technology.
