Microsoft Research Releases Webwright: A Terminal-Native Web Agent Framework That Scores 60.1% on Odysseys, Up from Base GPT-5.4’s 33.5%

Microsoft Research has released Webwright, an open-source framework that gives AI web agents a terminal instead of a click-by-click browser interface. Agents write and execute code to drive the browser, which yields large gains on complex web tasks over step-by-step action prediction.
Most current web agents interact with browsers one action at a time: they receive the page state and predict the next click or keypress. That design suited less capable language models, but it has become a bottleneck now that models are adept at writing and debugging code.
Developed by Microsoft Research's AI Frontiers lab, Webwright takes a different approach. Instead of a stateful browser session, the agent gets a terminal, where it can write Playwright code to control browsers, execute bash commands, inspect logs, and iteratively refine its scripts. This decouples the agent from the browser: the browser becomes a disposable tool used while developing code rather than a persistent session.
Webwright mirrors how a developer approaches Robotic Process Automation (RPA): a script is written once and then rerun, adapted, and shared, rather than repeating the same actions by hand. Applied to LLM-powered agents, this lets them express multi-step interactions as compact programs, using loops, functions, and abstractions to generalize across similar tasks without predicting every low-level step.
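To make the "compact program" idea concrete, here is a minimal sketch of the kind of script an agent might write in the terminal. It is not taken from Webwright itself: the function names, the search URL layout, and the `q` query parameter are all assumptions for illustration. It uses Playwright's synchronous Python API and replaces one predicted action per query with a single loop.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str) -> str:
    """Build a search URL for one query (hypothetical site layout, 'q' is assumed)."""
    return f"{base_url}?{urlencode({'q': query})}"


def collect_titles(base_url: str, queries: list[str]) -> dict[str, str]:
    """Visit one search page per query and record each page title."""
    # Imported here so build_search_url stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    titles: dict[str, str] = {}
    with sync_playwright() as p:
        # The browser is launched for this script and thrown away afterwards.
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            for query in queries:  # one loop instead of one predicted action per step
                page.goto(build_search_url(base_url, query))
                titles[query] = page.title()
        finally:
            browser.close()
    return titles
```

Because the browser lives only for the duration of `collect_titles`, the script can be rerun or adapted for a new list of queries without carrying any session state between runs.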
The system addresses two recurring problems: agents that declare a task finished too early, and context-length limits. Before reporting success, the agent must generate a self-reflection configuration and pass its own check, which guards against false positives. To manage long coding trajectories, Webwright compacts the history into a single summary every 20 steps to conserve context.
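The compaction step can be sketched as follows. This is an assumption-laden illustration, not Webwright's implementation: the `History` class, its fields, and the `summarize` callback are hypothetical names, and in a real agent the summary would presumably come from a model call rather than a stub. Only the 20-step interval comes from the article.

```python
from dataclasses import dataclass, field
from typing import Callable

COMPACT_EVERY = 20  # interval stated in the article


@dataclass
class History:
    """Hypothetical trajectory buffer that collapses itself into one summary."""
    summarize: Callable[[list[str]], str]  # assumed hook; stands in for a model call
    entries: list[str] = field(default_factory=list)
    steps_since_compaction: int = 0

    def add(self, step: str) -> None:
        """Record one step; every COMPACT_EVERY steps, replace the history with a summary."""
        self.entries.append(step)
        self.steps_since_compaction += 1
        if self.steps_since_compaction >= COMPACT_EVERY:
            # Everything so far, including any earlier summary, becomes one entry.
            self.entries = [self.summarize(self.entries)]
            self.steps_since_compaction = 0


# Usage with a stub summarizer that only reports how many entries it received.
history = History(summarize=lambda entries: f"summary of {len(entries)} entries")
for i in range(25):
    history.add(f"step {i}")
print(len(history.entries))  # one summary plus the five steps recorded after it
```

Under this scheme the context never holds more than one summary plus the steps taken since the last compaction, at the cost of whatever detail the summarizer drops.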
On the Odysseys benchmark, Webwright running on GPT-5.4 scored 60.1%, up 26.6 percentage points from base GPT-5.4's 33.5% and 15.6 points above the previous state of the art of 44.5%. The result points to the effectiveness of a code-driven, terminal-based approach over traditional step-by-step coordinate prediction. Even smaller models such as Qwen3.5-9B, when augmented with pre-built tool scripts, can achieve strong results on complex web tasks.
