How to Speed Up Transformer Training Using NVIDIA Apex (FusedAdam, FusedLayerNorm) and Native torch.amp

This article explores optimizing Transformer training with NVIDIA Apex and torch.amp. It details setting up Apex with CUDA extensions, benchmarking FusedAdam and FusedLayerNorm against native PyTorch, and integrating these into a Transformer for performance gains.

Author: Morein.ai Editorial | Published: June 2, 2026 | Updated: June 2, 2026

This guide shows how to accelerate Transformer training with NVIDIA Apex and native torch.amp, building on modern GPU training workflows. Rather than treating Apex as a general mixed-precision library, we isolate and test the components that are still useful. The setup involves checking the CUDA runtime, building Apex with its CUDA and C++ extensions, and verifying that the fused kernels are available. That last check matters because the fused kernels are where Apex's performance comes from, and a Python-only installation silently omits them.
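The check for silently missing kernels can be sketched as follows. This is a minimal illustration, not the article's own script: the helper name `find_apex_extensions` and the `APEX_EXTENSIONS` table are hypothetical, and the table lists only a subset of the compiled modules an Apex build with `--cpp_ext` and `--cuda_ext` is expected to produce. It uses `importlib.util.find_spec`, so nothing is actually imported and the check runs even on a machine without a GPU.

```python
import importlib.util

# Compiled extension modules an Apex build with --cpp_ext / --cuda_ext is
# expected to install. Illustrative subset, not an exhaustive list.
APEX_EXTENSIONS = {
    "apex_C": "C++ helpers (--cpp_ext)",
    "amp_C": "multi-tensor kernels used by FusedAdam (--cuda_ext)",
    "fused_layer_norm_cuda": "FusedLayerNorm / FusedRMSNorm kernels (--cuda_ext)",
}


def find_apex_extensions():
    """Report which Apex pieces are installed, without importing any of them."""
    report = {"apex": importlib.util.find_spec("apex") is not None}
    for name in APEX_EXTENSIONS:
        report[name] = importlib.util.find_spec(name) is not None
    return report


report = find_apex_extensions()
for name, found in report.items():
    note = APEX_EXTENSIONS.get(name, "Python package")
    print(f"{name:24s} {'found' if found else 'MISSING'}  ({note})")

# A Python-only install shows up as apex present but compiled modules missing.
if report["apex"] and not report["amp_C"]:
    print("Apex is installed without CUDA extensions; fused kernels are inactive.")
```

If `apex` is reported as found while `amp_C` or `fused_layer_norm_cuda` is missing, the installation is most likely Python-only and the fused optimizers and normalization layers will not be usable.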

We benchmark FusedAdam against PyTorch's AdamW optimizer and compare FusedLayerNorm and FusedRMSNorm with the standard normalization layers. We also run examples for both the legacy apex.amp and the more modern torch.amp. Before any measurement, we prepare the CUDA environment: confirm that a GPU is available and report the active PyTorch version, CUDA version, and GPU model. Finally, we define a reusable benchmarking helper so that every comparison is timed the same way.
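A reusable timing helper of the kind described above might look like this. It is a sketch under assumptions: the name `benchmark` and its parameters are hypothetical, not taken from the original article. The optional `sync` argument exists because CUDA kernels launch asynchronously; on a GPU you would pass `torch.cuda.synchronize` so the timer only stops once all queued work has finished.

```python
import time


def benchmark(fn, warmup=5, iters=20, sync=None):
    """Return the mean wall-clock time of fn() in milliseconds.

    sync: optional zero-argument callable run before starting and before
    stopping the timer (for example torch.cuda.synchronize on a GPU).
    """
    # Warm-up calls absorb one-time costs such as allocation and caching.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters * 1e3


# CPU-only usage example; no GPU or PyTorch is needed to try the helper.
mean_ms = benchmark(lambda: sum(range(10_000)))
print(f"{mean_ms:.4f} ms per call")
```

Keeping warm-up, iteration count, and synchronization in one place means every optimizer and normalization comparison is measured under the same conditions.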

The article compares PyTorch AdamW and Apex FusedAdam in depth, using a model built from multiple linear layers that is designed to expose optimizer overhead, so the measurement reflects update speed. FusedLayerNorm and FusedRMSNorm are evaluated against the standard normalization layers in the same way: each layer receives the same input tensor, forward and backward pass times are measured, and the speedup is computed from those timings.
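The optimizer comparison can be sketched roughly as below. This is not the article's benchmark: the model shape, batch size, hyperparameters, and the helper names `make_model` and `time_optimizer_step` are all assumptions chosen for illustration. Only the `optimizer.step()` call is timed, and the FusedAdam branch runs only when Apex is importable and a CUDA device is present, so the script still runs on a CPU-only machine with AdamW alone.

```python
import time

import torch
import torch.nn as nn

try:
    from apex.optimizers import FusedAdam  # needs Apex built with --cuda_ext
except ImportError:
    FusedAdam = None


def make_model(width=256, depth=8, device="cpu"):
    """Stack of linear layers: many parameter tensors for the optimizer to update."""
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.GELU()]
    return nn.Sequential(*layers).to(device)


def time_optimizer_step(opt_factory, device, iters=20, warmup=5):
    """Mean milliseconds per optimizer.step(), excluding forward and backward."""
    model = make_model(device=device)
    opt = opt_factory(model.parameters())
    x = torch.randn(32, 256, device=device)
    total = 0.0
    for i in range(warmup + iters):
        opt.zero_grad(set_to_none=True)
        model(x).sum().backward()
        if device == "cuda":
            torch.cuda.synchronize()  # finish backward before starting the timer
        start = time.perf_counter()
        opt.step()
        if device == "cuda":
            torch.cuda.synchronize()  # wait for the update kernels to complete
        if i >= warmup:
            total += time.perf_counter() - start
    return total / iters * 1e3


device = "cuda" if torch.cuda.is_available() else "cpu"
results = {
    "AdamW": time_optimizer_step(
        lambda p: torch.optim.AdamW(p, lr=1e-3, weight_decay=0.01), device
    )
}
if FusedAdam is not None and device == "cuda":
    try:
        results["FusedAdam"] = time_optimizer_step(
            lambda p: FusedAdam(p, lr=1e-3, weight_decay=0.01, adam_w_mode=True),
            device,
        )
    except RuntimeError as err:  # raised when the CUDA extensions were not built
        print(f"FusedAdam unavailable: {err}")

for name, ms in results.items():
    print(f"{name}: {ms:.3f} ms per step")
```

The same pattern, with the timer wrapped around the forward and backward calls instead of `opt.step()`, would apply to comparing `torch.nn.LayerNorm` against Apex's fused normalization layers.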
