Research & Papers · Hugging Face - Blog · May 27, 2026

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

Frontier AI models score below 50% on ITBench-AA, the first benchmark for agentic enterprise IT tasks, which initially focuses on Site Reliability Engineering. The benchmark evaluates how well models diagnose live systems and identify root causes in Kubernetes incidents, an area where current models still struggle.

Author: Morein.ai Editorial

Artificial Analysis and IBM Software Innovation Lab have launched ITBench-AA, a new benchmark series to evaluate AI models on agentic enterprise IT tasks. The initial focus is on Site Reliability Engineering (SRE) tasks, where frontier models currently score below 50%. This benchmark is the first of its kind, designed to assess how well AI can handle real-world IT incidents.

The ITBench-AA SRE tasks evaluate model performance in Kubernetes incident response. Models must diagnose live systems by analyzing logs, tracing dependencies, and identifying root-cause entities within complex infrastructure. The underlying ITBench dataset was developed by IBM and draws on its expertise in enterprise IT operations.

Key findings show that even leading models like Claude Opus 4.7 and GPT-5.5 score below 50%, highlighting the significant challenges in this domain. This makes ITBench-AA SRE one of the least saturated agentic benchmarks, indicating a wide margin for improvement. Importantly, more turns or longer diagnostic trajectories do not necessarily lead to higher accuracy.

The benchmark includes 59 SRE tasks, comprising both public and newly held-out tasks. Each task provides a snapshot of a Kubernetes incident with various data points like alerts, logs, and application topology. Models must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident. Faults simulate typical SRE failure modes, including infrastructure, service, and application issues.

To ensure a fair comparison, all evaluations use the open-source Stirrup reference harness, which gives models shell access to a sandboxed file system. The harness is kept constant across all models, enabling an "apples-to-apples" comparison. Models submit a list of root-cause entities, which is then compared against ground-truth data provided by IBM. Scoring uses average precision at full recall: an attempt that misses any ground-truth root cause scores 0.0.
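
To make the scoring rule concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the benchmark's actual scorer: it assumes the submitted list is treated as an unranked set, that an attempt achieving full recall is scored by its precision, and that per-attempt scores are averaged. The function names and the entity names in the usage note are hypothetical.

```python
from typing import Iterable, List, Tuple


def precision_at_full_recall(submitted: Iterable[str],
                             ground_truth: Iterable[str]) -> float:
    """Score one attempt (assumed interpretation, not the official scorer).

    Returns 0.0 if any ground-truth root cause is missing; otherwise the
    fraction of submitted entities that are true root causes.
    """
    submitted_set = set(submitted)
    truth_set = set(ground_truth)
    if not truth_set:
        raise ValueError("ground truth must contain at least one entity")
    # Full recall is required: one missed root cause zeroes the attempt.
    if not truth_set.issubset(submitted_set):
        return 0.0
    # Every true root cause is present, so precision is |truth| / |submitted|.
    return len(truth_set) / len(submitted_set)


def mean_score(attempts: List[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Average the per-attempt scores over (submitted, ground_truth) pairs."""
    scores = [precision_at_full_recall(s, g) for s, g in attempts]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, a submission of two entities such as "pod/checkout" and "node/worker-1" against a single true root cause "pod/checkout" scores 0.5, while a submission that omits "pod/checkout" scores 0.0 no matter what else it contains. The rule penalizes both guessing broadly and missing a cause.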
