Core dump epidemiology: fixing an 18-year-old bug
OpenAI debugged elusive C++ crashes in its Rockset service using a population-level analysis, which revealed an 18-year-old bug in GNU libunwind and a silent hardware corruption issue. This epidemiological approach to debugging allowed the team to identify and fix both issues affecting its data infrastructure.
OpenAI encountered persistent, perplexing C++ crashes in its Rockset service, a critical component of its ChatGPT data infrastructure. The crashes often manifested as functions returning to invalid memory addresses, a sign of severe stack corruption. In-depth inspection of individual crash reports failed to find the underlying cause, and AI-powered diagnostic attempts did no better.
The debugging team adopted a novel "epidemiological" approach, shifting from individual case studies to a population-level analysis of all crashes. This involved building a high-quality dataset from numerous core dumps, which are snapshots of a program's state at the moment it crashes. Analyzing the crashes as a population, rather than one at a time, proved crucial in uncovering what was actually going wrong.
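To make the population-level idea concrete, here is a minimal sketch of one way such an analysis could look. It is not OpenAI's pipeline: the record fields (`host`, `top_frame`), the sample data, and the helper name `flag_outliers` are all hypothetical, and the median-based threshold is simply one plausible heuristic. The sketch assumes each core dump has already been reduced to a small record of metadata, then counts crashes per value of a chosen field and flags values that appear far more often than is typical.

```python
from collections import Counter
from statistics import median


def flag_outliers(crashes, field, factor=5.0):
    """Return {value: count} for values of `field` that crash unusually often.

    A value is flagged when its crash count exceeds `factor` times the
    median count across all observed values of that field.
    """
    counts = Counter(record[field] for record in crashes)
    typical = median(counts.values())
    return {value: n for value, n in counts.items() if n > factor * typical}


# Hypothetical crash records, one per core dump (field names are assumptions).
crashes = (
    [{"host": "host-a", "top_frame": "frame_x"}] * 1
    + [{"host": "host-b", "top_frame": "frame_y"}] * 2
    + [{"host": "host-c", "top_frame": "frame_x"}] * 1
    + [{"host": "host-d", "top_frame": "frame_z"}] * 20
)

suspect_hosts = flag_outliers(crashes, "host")
print(suspect_hosts)
```

Grouping by a different field (for example, the top stack frame instead of the host) reuses the same function. The point is that an anomaly invisible in any single crash report can stand out once the crashes are counted together.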
Ultimately, the investigation revealed not one but two unrelated issues that together produced the mysterious crashes. The first was silent hardware corruption on an Azure host, where the CPU was performing calculations incorrectly. The second, and more surprising, was a race condition in GNU libunwind, a widely used open-source library, that had gone unaddressed for 18 years. The dual discovery highlights the complexity of debugging at scale and the importance of comprehensive data analysis.
Scalable data infrastructure is crucial for OpenAI's models and agents, which increasingly rely on efficient data retrieval during inference. C++ provides performance and memory efficiency but also introduces memory-safety challenges. Resolving these deep-seated bugs improves the reliability and stability of OpenAI's critical services.
