ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
Frontier AI models score below 50% on ITBench-AA, the first benchmark for agentic enterprise IT tasks, which initially focuses on Site Reliability Engineering. The benchmark evaluates how well models diagnose live systems and identify root causes in Kubernetes incidents, an area where current AI capabilities still fall well short.
Artificial Analysis and IBM Software Innovation Lab have launched ITBench-AA, a new benchmark series to evaluate AI models on agentic enterprise IT tasks. The initial focus is on Site Reliability Engineering (SRE) tasks, where frontier models currently score below 50%. This benchmark is the first of its kind, designed to assess how well AI can handle real-world IT incidents.
The ITBench-AA SRE tasks specifically evaluate model performance in Kubernetes incident response. Models are required to diagnose live systems by analyzing logs, tracing dependencies, and identifying root-cause entities within complex infrastructure. The underlying ITBench dataset, developed by IBM, brings extensive expertise in enterprise IT operations to this evaluation.
Key findings show that even leading models such as Claude Opus 4.7 and GPT-5.5 score below 50%. This makes ITBench-AA SRE one of the least saturated agentic benchmarks, with a wide margin for improvement. Notably, more turns or longer diagnostic trajectories do not necessarily lead to higher accuracy.
The benchmark includes 59 SRE tasks, comprising both public and newly held-out tasks. Each task provides a snapshot of a Kubernetes incident with various data points like alerts, logs, and application topology. Models must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident. Faults simulate typical SRE failure modes, including infrastructure, service, and application issues.
To ensure a fair comparison, evaluations use the open-source Stirrup reference harness, which gives models shell access to a sandboxed file system. Models submit a list of root-cause entities, which is compared against ground-truth data provided by IBM. Scoring uses average precision at full recall: an attempt that misses any ground-truth root cause scores 0.0. The harness is kept constant across all models, enabling an "apples-to-apples" comparison.
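The scoring rule can be illustrated with a short Python sketch. This is one plausible reading of "average precision at full recall", not the official ITBench-AA scorer: it assumes submissions are treated as unordered sets, that an attempt covering every ground-truth entity earns its precision (correct entities divided by submitted entities), and that the final score is the mean over attempts. The function names and the entity strings are hypothetical.

```python
from typing import Iterable, List, Tuple


def precision_at_full_recall(submitted: Iterable[str], ground_truth: Iterable[str]) -> float:
    """Score one attempt (assumed interpretation, not the official scorer).

    Returns 0.0 if any ground-truth root cause is missing from the submission;
    otherwise returns the fraction of submitted entities that are correct.
    """
    predicted = set(submitted)
    truth = set(ground_truth)
    # Assumption: a task with no ground truth cannot be scored, so return 0.0.
    if not truth or not truth.issubset(predicted):
        return 0.0
    # Full recall reached: every extra (wrong) entity lowers precision.
    return len(truth) / len(predicted)


def average_score(attempts: List[Tuple[List[str], List[str]]]) -> float:
    """Mean of per-attempt scores over (submitted, ground_truth) pairs."""
    if not attempts:
        return 0.0
    return sum(precision_at_full_recall(s, g) for s, g in attempts) / len(attempts)


if __name__ == "__main__":
    # Hypothetical entity names, for illustration only.
    truth = ["deployment/checkout"]
    print(precision_at_full_recall(["deployment/checkout"], truth))                    # exact match
    print(precision_at_full_recall(["deployment/checkout", "service/cart"], truth))    # one extra guess
    print(precision_at_full_recall(["service/cart"], truth))                           # root cause missed
```

Under this reading, a model cannot raise its score by listing many candidate entities: over-reporting lowers precision, and under-reporting zeroes the attempt.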
