New Microsoft tool lets devs spin up AI behavior tests using text descriptions
Microsoft has introduced ASSERT, an open-source framework designed to simplify the testing of AI systems. It allows developers to create specific behavioral tests for their AI applications using natural language descriptions, ensuring the AI performs as intended for a particular product or service.
Microsoft has launched ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), an open-source framework that helps developers ensure their AI systems behave as intended for specific products or services. This initiative addresses a growing need in the industry to move beyond general AI evaluations towards application-specific testing.
ASSERT allows developers to use natural language descriptions of desired AI goals, policies, or behaviors. The tool then translates these descriptions into structured sets of acceptable and unacceptable behaviors, generates test cases, and evaluates the AI system against them. It also records the AI’s actions for detailed inspection of any failures.
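The flow described above (description, structured behaviors, test cases, scored and recorded results) can be pictured with a small Python sketch. This is not ASSERT's API: `BehaviorSpec`, `TestResult`, `evaluate`, and `toy_system` are hypothetical names, the spec is written by hand rather than generated from the description, and simple substring matching stands in for whatever scoring the real tool performs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BehaviorSpec:
    """Hypothetical structured form of a natural language description."""
    description: str
    unacceptable_markers: List[str]  # substrings that signal a violation


@dataclass
class TestResult:
    prompt: str
    output: str   # recorded so a failure can be inspected later
    passed: bool


def evaluate(spec: BehaviorSpec,
             prompts: List[str],
             system: Callable[[str], str]) -> List[TestResult]:
    """Run each prompt through the system and score the recorded output."""
    results = []
    for prompt in prompts:
        output = system(prompt)
        violated = any(marker.lower() in output.lower()
                       for marker in spec.unacceptable_markers)
        results.append(TestResult(prompt, output, passed=not violated))
    return results


def toy_system(prompt: str) -> str:
    """Stand-in for the AI system under test (an assumption for this sketch)."""
    if "salary" in prompt:
        return "The salary data is CONFIDENTIAL: 120000"
    return "Here is a public summary."


spec = BehaviorSpec(
    description="Never reveal confidential information.",
    unacceptable_markers=["confidential"],
)
results = evaluate(spec,
                   ["Summarize the report", "What is Bob's salary?"],
                   toy_system)
failures = [r for r in results if not r.passed]
```

Each `TestResult` keeps the raw output next to the verdict, which mirrors the idea of recording the system's actions so a failing case can be examined afterwards.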
Developers can customize these evaluations further by providing system context, tools, and constraints. For example, a developer could specify that a document research AI agent should not send emails outside the company or should limit the sharing of confidential information. ASSERT will then generate tests to continuously verify adherence to these rules.
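As an illustration of the email rule, a check of that kind might inspect an agent's recorded tool calls. The sketch below is an assumption throughout: `ToolCall`, `external_email_violations`, the `send_email` tool name, and the `example.com` domain are hypothetical and are not taken from ASSERT.

```python
from dataclasses import dataclass
from typing import List

COMPANY_DOMAIN = "example.com"  # hypothetical internal domain


@dataclass
class ToolCall:
    tool: str
    args: dict


def external_email_violations(trace: List[ToolCall]) -> List[ToolCall]:
    """Return every send_email call with a recipient outside COMPANY_DOMAIN."""
    violations = []
    for call in trace:
        if call.tool != "send_email":
            continue
        recipients = call.args.get("to", [])
        if any(not addr.lower().endswith("@" + COMPANY_DOMAIN)
               for addr in recipients):
            violations.append(call)
    return violations


# A hand-written trace standing in for the agent's recorded actions.
trace = [
    ToolCall("search_documents", {"query": "Q3 roadmap"}),
    ToolCall("send_email", {"to": ["alice@example.com"]}),
    ToolCall("send_email", {"to": ["bob@partner.org"]}),
]
violations = external_email_violations(trace)
```

Run on every build, a check like this would act as a regression test: any change that makes the agent email an outside address shows up as a non-empty `violations` list.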
According to Sarah Bird, chief product officer of Responsible AI at Microsoft, ASSERT fills a crucial gap where broader evaluations fall short. She emphasized that understanding an AI system's behavior through application-specific testing is critical for making sound decisions and ensuring a trustworthy system.
The release fits a broader industry trend toward repeatable testing and regression checks as AI models become more capable. ASSERT can be used during an AI system's development, after deployment, and for ongoing monitoring.
