List of AI News about Muon
| Time | Details |
|---|---|
| 2026-02-03 21:49 | **Latest Analysis: FP8 Training Reduces GPT-2 Training Time to 2.91 Hours with H100 GPUs.** According to Andrej Karpathy on Twitter, enabling FP8 training improved 'time to GPT-2' by 4.3%, reducing the training run to 2.91 hours on an 8x H100 GPU setup. Karpathy notes that, at spot instance pricing, reproducing GPT-2 training now costs roughly $20. This marks a striking shift from GPT-2's 2019 framing as 'too dangerous to release' to being about as accessible as MNIST today. The FP8 implementation posed practical challenges, with support limitations and real-world performance falling short of theoretical FLOPS gains. Tensorwise scaling yielded a speedup of about 7.3%, and Karpathy suggests further optimizations could lower the time and cost even more. For comparison, torchao reported a 25% speedup for Llama3-8B training with FP8. Karpathy also underscores that, thanks to advances such as Flash Attention 3 and the Muon optimizer, the cost of training GPT-2 has fallen by a factor of nearly 600 over the past seven years, creating substantial opportunities for AI startups and researchers seeking low-cost, rapid model prototyping. As Karpathy reports, ongoing optimizations in projects like nanochat continue to drive down training costs and times, making advanced language model training accessible to a wider audience. |
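For context on the tensorwise scaling mentioned above: in FP8 training, a single scale factor is computed per tensor from its absolute maximum before casting to an 8-bit float format. The sketch below is a minimal illustration of that idea using PyTorch's `torch.float8_e4m3fn` dtype; the function names are hypothetical and this is not the nanochat or torchao implementation.

```python
import torch

def quantize_fp8_tensorwise(x: torch.Tensor):
    """Quantize a tensor to FP8 (e4m3) using one per-tensor scale.

    'Tensorwise scaling' means a single scale maps the tensor's absolute
    maximum onto the FP8 representable range (as opposed to per-row or
    per-block scales).
    """
    FP8_E4M3_MAX = 448.0                    # largest normal value of e4m3
    amax = x.abs().max().clamp(min=1e-12)   # avoid division by zero
    scale = FP8_E4M3_MAX / amax             # one scale for the whole tensor
    x_fp8 = (x * scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate full-precision tensor from FP8 values and the scale."""
    return x_fp8.to(torch.float32) / scale

if __name__ == "__main__":
    w = torch.randn(1024, 1024)
    w_fp8, s = quantize_fp8_tensorwise(w)
    err = (dequantize_fp8(w_fp8, s) - w).abs().max()
    print(f"max abs round-trip error: {err:.5f}")
```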
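The quoted $20 figure can also be sanity-checked with simple arithmetic: 2.91 hours on 8 GPUs at an assumed spot rate of roughly $0.85 per H100-hour lands near $20. The per-GPU-hour rate below is an illustrative assumption, not a number from Karpathy's post.

```python
# Back-of-the-envelope check of the reported GPT-2 reproduction cost.
hours = 2.91                      # reported wall-clock training time
num_gpus = 8                      # reported 8x H100 setup
spot_price_per_gpu_hour = 0.85    # assumption: approximate H100 spot rate (USD)

cost = hours * num_gpus * spot_price_per_gpu_hour
print(f"estimated cost: ${cost:.2f}")  # ~$19.79, consistent with the ~$20 figure
```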