NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code

NVIDIA researchers have introduced Polar, a rollout framework that trains language agents with reinforcement learning without requiring any modifications to the agents' existing harnesses. The researchers report improved SWE-Bench Verified pass@1 scores across several coding agent harnesses.
The framework operates by inserting a model API proxy between the agent harness and the inference server. The proxy captures token-level interactions, so that each rollout can be reconstructed as a trajectory in trainer-ready form.
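The proxy idea can be illustrated with a minimal Python sketch. This is not Polar's implementation: the `Backend` type, `RecordingProxy`, and `build_trajectory` are hypothetical names, the inference server is replaced by a plain function, and the sketch assumes each new prompt simply extends the tokens seen so far, which a real harness may not guarantee.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-in for an inference server: takes prompt token IDs and
# returns (completion token IDs, per-token log-probabilities).
Backend = Callable[[List[int]], Tuple[List[int], List[float]]]


@dataclass
class Call:
    prompt_ids: List[int]
    completion_ids: List[int]
    logprobs: List[float]


@dataclass
class RecordingProxy:
    """Sits between the harness and the backend and records every call."""
    backend: Backend
    calls: List[Call] = field(default_factory=list)

    def generate(self, prompt_ids: List[int]) -> List[int]:
        completion_ids, logprobs = self.backend(prompt_ids)
        # Store exact token IDs so nothing is lost to re-tokenization later.
        self.calls.append(Call(list(prompt_ids), list(completion_ids), list(logprobs)))
        return completion_ids


def build_trajectory(calls: List[Call]) -> Tuple[List[int], List[int]]:
    """Flatten recorded calls into token IDs plus a loss mask (1 = generated)."""
    tokens: List[int] = []
    mask: List[int] = []
    for call in calls:
        # Assumption: each prompt is the previous trajectory plus new context.
        if call.prompt_ids[:len(tokens)] != tokens:
            raise ValueError("prompt does not extend the recorded trajectory")
        new_context = call.prompt_ids[len(tokens):]
        tokens += new_context
        mask += [0] * len(new_context)       # tool output or user text
        tokens += call.completion_ids
        mask += [1] * len(call.completion_ids)  # model-generated tokens
    return tokens, mask


def fake_backend(prompt_ids: List[int]) -> Tuple[List[int], List[float]]:
    # Toy model: emits two tokens derived from the last prompt token.
    last = prompt_ids[-1]
    return [last + 1, last + 2], [-0.1, -0.2]


proxy = RecordingProxy(fake_backend)
first = proxy.generate([1, 2])           # harness sends the initial prompt
proxy.generate([1, 2] + first + [9])     # harness appends a tool result (9)
tokens, mask = build_trajectory(proxy.calls)
```

The harness only ever calls `generate`, so it needs no changes; the recorded calls are what a trainer would consume. The loss mask separates model-generated tokens from context the harness injected.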
Using GRPO on a Qwen3.5-4B base model, Polar improved SWE-Bench Verified pass@1 by 22.6 points under the Codex harness, 4.8 points with Claude Code, and 6.2 points under Pi.
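For readers unfamiliar with GRPO, its core step scores each rollout relative to the other rollouts sampled for the same task. The sketch below shows that group-relative advantage in Python; the function name, the `eps` constant, and the use of the population standard deviation are illustrative choices, not details taken from Polar.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and std of its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task, rewarded 1.0 if the tests pass and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat their group's average get a positive advantage and the rest get a negative one, so no separate value model is needed.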
Polar is registered as a NeMo Gym environment and is publicly available under the ProRL Agent Server repository.
