Tools & Platforms · MarkTechPost · May 24, 2026

Microsoft Research Releases Webwright: A Terminal-Native Web Agent Framework That Scores 60.1% on Odysseys, Up from Base GPT-5.4’s 33.5%

Microsoft Research has released Webwright, an open-source framework that changes how AI agents interact with web browsers. Instead of issuing one browser action at a time, the agent works from a terminal where it writes and executes code, which substantially improves results on complex web tasks compared to traditional methods.

Author: Morein.ai Editorial

Most current web agents interact with browsers one action at a time: they receive the page state and predict the next click or keypress. That design suited less capable language models, but it becomes a bottleneck as models grow more adept at writing and debugging code.

Microsoft Research's AI Frontiers lab developed Webwright, an open-source framework that takes a different approach. Instead of a stateful browser session, Webwright gives the agent a terminal. From there the agent can write Playwright code to control browsers, execute bash commands, inspect logs, and iteratively refine its scripts. The design decouples the agent from the browser: the browser becomes a disposable tool used during code development rather than a persistent session.
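The write, run, inspect cycle described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Webwright's actual harness: `run_script` and `PLAYWRIGHT_SCRIPT` are hypothetical names, the URL is a placeholder, and the embedded script assumes Playwright for Python and a Chromium build are installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# A script an agent might write. It is only stored as text here; running it
# requires Playwright for Python and a browser build (assumption).
PLAYWRIGHT_SCRIPT = '''
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    print(page.title())
    browser.close()  # the browser is thrown away after each run
'''


def run_script(source: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Write `source` to a temporary file and run it in a fresh process.

    The exit code, stdout, and stderr are returned so the caller can
    inspect them and decide how to revise the script.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir) / "task.py"
        path.write_text(source, encoding="utf-8")
        return subprocess.run(
            [sys.executable, str(path)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
```

Calling `run_script(PLAYWRIGHT_SCRIPT)` launches a fresh browser, prints the page title, and exits. A non-zero `returncode` together with the captured `stderr` is the kind of feedback an agent could use to revise the script and try again.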

Webwright mirrors how a developer approaches Robotic Process Automation (RPA): a script is written once so it can be rerun, adapted, and shared, rather than repeating the same actions by hand. Applied to LLM-powered agents, this lets them express multi-step interactions as compact programs, using loops, functions, and abstractions to generalize across similar tasks without predicting every low-level step.
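As a rough illustration of a "compact program", the sketch below repeats the same search interaction for any number of items with a single loop. Everything specific is assumed for the example: the function name `lookup_prices`, the `/search` path, and the CSS selectors are hypothetical. The `page` argument only needs the `goto`, `fill`, `click`, and `inner_text` methods that Playwright's `Page` object provides.

```python
from typing import Dict, Iterable, Protocol


class PageLike(Protocol):
    """The subset of Playwright's Page methods this sketch relies on."""

    def goto(self, url: str) -> object: ...
    def fill(self, selector: str, value: str) -> None: ...
    def click(self, selector: str) -> None: ...
    def inner_text(self, selector: str) -> str: ...


def lookup_prices(page: PageLike, base_url: str, products: Iterable[str]) -> Dict[str, str]:
    """Run the same search flow once per product and collect the results.

    The URL path and selectors below are hypothetical placeholders.
    """
    prices: Dict[str, str] = {}
    for name in products:
        page.goto(f"{base_url}/search")
        page.fill("#query", name)
        page.click("#submit")
        prices[name] = page.inner_text(".result .price")
    return prices
```

A step-by-step agent would have to predict each of these actions separately for every product. Here the loop is written once, and the product list can grow without any additional model decisions.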

The system addresses two recurring problems: premature task completion and context length limits. Before reporting success, the agent must generate a self-reflection configuration and pass its own check, which guards against false positives. To keep long coding trajectories manageable, Webwright compacts the history every 20 steps into a single summary, reducing the context the model has to carry.
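Both mechanisms can be sketched in simplified form. The code below is an assumption-laden illustration, not Webwright's implementation: `Trajectory`, `naive_summary`, and `may_report_success` are hypothetical names, the summary function stands in for what would presumably be a model-written summary, and the success gate reduces the self-reflection step to a dictionary of named pass/fail checks. Only the 20-step interval comes from the article.

```python
from typing import Callable, Dict, List

COMPACT_EVERY = 20  # compaction interval stated in the article


def naive_summary(entries: List[str]) -> str:
    """Stand-in for a model-written summary (assumption, not Webwright's)."""
    return f"[summary of {len(entries)} entries, ending at: {entries[-1]}]"


class Trajectory:
    """Keeps the agent's working context bounded by periodic compaction."""

    def __init__(self, summarize: Callable[[List[str]], str] = naive_summary):
        self.summarize = summarize
        self.context: List[str] = []
        self.steps_since_compaction = 0

    def record(self, step: str) -> None:
        self.context.append(step)
        self.steps_since_compaction += 1
        if self.steps_since_compaction >= COMPACT_EVERY:
            # Fold everything so far, including any earlier summary,
            # into a single summary entry.
            self.context = [self.summarize(self.context)]
            self.steps_since_compaction = 0


def may_report_success(checks: Dict[str, bool]) -> bool:
    """Allow a success report only if checks exist and every one passes."""
    return bool(checks) and all(checks.values())
```

With this scheme the context never holds more than one summary plus the steps recorded since the last compaction, and an empty or partially failing set of checks blocks the success report.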

Evaluations bear this out. On the Odysseys benchmark, Webwright powered by GPT-5.4 scored 60.1%, which is 26.6 percentage points above base GPT-5.4's 33.5% and 15.6 points above the previous state of the art of 44.5%. The result supports a code-driven, terminal-based approach over traditional step-by-step coordinate prediction. Even smaller models such as Qwen3.5-9B, when augmented with pre-built tool scripts, can achieve strong results on complex web tasks, which suggests the approach is not limited to frontier models.
