Microsoft and NVIDIA Enhance Llama Model Performance on Azure AI Foundry
Ted Hisokawa Mar 21, 2025 02:05
Microsoft and NVIDIA collaborate to significantly boost Meta Llama model performance on Azure AI Foundry using NVIDIA TensorRT-LLM optimizations, enhancing throughput, reducing latency, and improving cost efficiency.

Microsoft and NVIDIA Collaborate for Performance Boost
In a strategic collaboration, Microsoft and NVIDIA have announced significant performance enhancements for the Meta Llama family of models on Microsoft's Azure AI Foundry platform. The partnership leverages NVIDIA TensorRT-LLM optimizations to deliver higher throughput and lower latency, according to NVIDIA.
Significant Throughput and Latency Improvements
The integration of NVIDIA TensorRT-LLM has delivered a 45% throughput increase for the Llama 3.3 70B and Llama 3.1 70B models, and a 34% increase for the Llama 3.1 8B model, within the serverless deployment model catalog. These gains translate into faster token generation and better real-time performance for applications such as chatbots and virtual assistants.
Optimized Deployment and Cost Efficiency
Azure AI Foundry simplifies the deployment of these optimized Llama models, enabling developers to scale without the burden of infrastructure management. The platform's serverless APIs offer a pay-as-you-go pricing model, reducing the cost per token and improving the price-performance ratio for AI-driven applications.
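As a rough illustration of the pay-as-you-go model, serverless deployments are typically reached through an OpenAI-style chat-completions endpoint. The sketch below only assembles a request payload; the field names follow the common OpenAI-compatible shape rather than an authoritative Azure AI Foundry contract, and the system prompt and parameter values are placeholders.

```python
import json

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Assemble a chat-completions payload in the OpenAI-compatible
    shape that serverless inference APIs broadly follow. Because
    billing is per token, capping max_tokens bounds the cost of
    each call under a pay-as-you-go pricing model."""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

payload = build_chat_request("Summarize TensorRT-LLM in one sentence.")
print(json.dumps(payload, indent=2))
```

Since the platform manages the underlying infrastructure, the developer's surface area is essentially this payload plus an endpoint URL and key.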
Technical Innovations Driving Performance
The collaboration between Microsoft and NVIDIA involved deep technical integration, with NVIDIA TensorRT-LLM serving as the backend for model deployment. Key optimizations include the GEMM Swish-Gated Linear Unit (SwiGLU) activation plugin and the Reduce Fusion optimization, which improve computational efficiency and reduce latency.
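For reference, SwiGLU gates one linear projection of the input with the SiLU activation of another; the plugin fuses the two GEMMs and the activation into a single kernel, but the underlying math can be sketched in a few lines of NumPy (the weight names `w_gate` and `w_up` are illustrative, not TensorRT-LLM identifiers):

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x: np.ndarray, w_gate: np.ndarray, w_up: np.ndarray) -> np.ndarray:
    # SwiGLU: a SiLU-gated elementwise product of two linear projections.
    # TensorRT-LLM's fused plugin computes this in one kernel; this
    # reference version runs the two GEMMs and the gate separately.
    return silu(x @ w_gate) * (x @ w_up)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))       # (batch, hidden)
w_gate = rng.standard_normal((8, 16))  # (hidden, intermediate)
w_up = rng.standard_normal((8, 16))
out = swiglu(x, w_gate, w_up)
print(out.shape)  # (2, 16)
```

Fusing these steps avoids writing the two intermediate projections back to GPU memory, which is where the efficiency gain comes from.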
Furthermore, the User Buffer feature in TensorRT-LLM v0.16 significantly boosts inter-GPU communication performance, particularly for FP8 precision in large-scale models. These technical advancements ensure that increased throughput does not compromise the quality of model outputs.
Broader Implications and Accessibility
The performance gains achieved through this collaboration are available to the wider developer community. Developers can utilize these optimizations for faster and more cost-effective AI inference, facilitating the creation of scalable AI products on NVIDIA-accelerated platforms.
In addition to these advancements, Microsoft and NVIDIA announced the integration of NVIDIA NIM with Azure AI Foundry at NVIDIA GTC 2025. This integration provides pre-optimized AI models and microservices, enhancing the capabilities available to AI application developers.
Future Prospects
The collaboration exemplifies the synergy between Microsoft's cloud infrastructure expertise and NVIDIA's AI performance optimization leadership. The enhancements promise to empower developers to build more efficient and responsive AI applications, whether through Azure AI Foundry's managed services or custom deployments on Azure VMs or Kubernetes.
Image source: Shutterstock