Microsoft and NVIDIA Enhance Llama Model Performance on Azure AI Foundry
Microsoft and NVIDIA collaborate to significantly boost Meta Llama model performance on Azure AI Foundry using NVIDIA TensorRT-LLM optimizations, increasing throughput, reducing latency, and improving cost efficiency.
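For context on how such a deployment is consumed, a Llama endpoint on Azure AI Foundry can be called through the Azure AI Inference SDK. A minimal sketch, assuming the `azure-ai-inference` Python package and placeholder endpoint and key values you would supply yourself:

```python
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

# Endpoint URL and API key are placeholders for your own deployed Llama model.
client = ChatCompletionsClient(
    endpoint="https://<your-deployment>.inference.ai.azure.com",
    credential=AzureKeyCredential("<your-api-key>"),
)

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Summarize what TensorRT-LLM optimizations do."),
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```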
NVIDIA Enhances TensorRT-LLM with KV Cache Optimization Features
NVIDIA introduces new KV cache optimizations in TensorRT-LLM that improve the performance and efficiency of large language models on GPUs through smarter management of cache memory and compute.
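The KV cache stores the attention keys and values of already-processed tokens so they are not recomputed at every decoding step; the optimizations concern how its GPU memory is allocated, reused, and evicted. A toy sketch of a paged KV cache block pool with LRU eviction (illustrative only, not TensorRT-LLM's actual manager):

```python
from collections import OrderedDict

class PagedKVCache:
    """Toy paged KV cache: a fixed pool of blocks with LRU eviction.

    Illustrative only -- TensorRT-LLM's real KV cache manager adds paging
    granularity, reuse across requests, quantization, and priorities.
    """

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # free block ids
        self.lru = OrderedDict()             # block id -> owning sequence id

    def allocate(self, seq_id: str) -> int:
        if not self.free:                    # pool exhausted: evict LRU block
            victim, _ = self.lru.popitem(last=False)
            self.free.append(victim)
        block = self.free.pop()
        self.lru[block] = seq_id             # mark as most recently used
        return block

cache = PagedKVCache(num_blocks=4)
for step in range(6):                        # 6 allocations force 2 evictions
    print("seq", step, "-> block", cache.allocate(f"seq{step}"))
```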
NVIDIA Enhances Llama 3.3 70B Model Performance with TensorRT-LLM
Discover how NVIDIA's TensorRT-LLM boosts Llama 3.3 70B model inference throughput by 3x using advanced speculative decoding techniques.
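Speculative decoding has a small draft model cheaply propose several tokens that the large target model then verifies, accepting the longest agreeing prefix, so multiple tokens can be committed per expensive target pass. A toy sketch with hypothetical draft_step/target_step callables (in a real engine the target verifies all draft tokens in a single batched forward pass):

```python
def speculative_decode(target_step, draft_step, prompt, n_draft=4, n_new=12):
    """Toy draft-target speculative decoding.

    target_step / draft_step are hypothetical greedy next-token functions.
    Output is identical to decoding with the target alone; the speedup in a
    real engine comes from verifying all n_draft tokens in ONE batched target
    pass instead of one pass per token (simulated here token by token).
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) Cheap draft model proposes a run of candidate tokens.
        ctx, draft = list(tokens), []
        for _ in range(n_draft):
            ctx.append(draft_step(ctx))
            draft.append(ctx[-1])
        # 2) Target verifies: accept matching tokens, stop at first mismatch
        #    (the mismatch position still yields the target's own token).
        for t in draft:
            expected = target_step(tokens)
            tokens.append(expected)
            if expected != t:
                break
    return tokens[: len(prompt) + n_new]

target = lambda seq: (seq[-1] + 1) % 10                         # toy "large" model
draft = lambda seq: (seq[-1] + 1) % 10 if len(seq) % 5 else 0   # imperfect drafter
print(speculative_decode(target, draft, [0]))
```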
NVIDIA TensorRT-LLM Enhances Encoder-Decoder Models with In-Flight Batching
NVIDIA's TensorRT-LLM now supports encoder-decoder models with in-flight batching, bringing optimized generative AI inference to NVIDIA GPUs.
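In-flight (continuous) batching lets new requests join a running batch as soon as earlier requests finish, rather than waiting for the whole batch to drain; for encoder-decoder models, the encoder runs once per request while the decoder steps are batched this way. A toy scheduler sketch:

```python
import collections

def inflight_batching(requests, max_batch=3):
    """Toy in-flight batching: requests join the running batch whenever a
    slot frees up, instead of waiting for the whole batch to finish.

    Each request is (name, decode_steps_remaining); for an encoder-decoder
    model the encoder would run once at admission time. One loop iteration
    models one decoding step of the engine.
    """
    queue = collections.deque(requests)
    active, step = {}, 0
    while queue or active:
        # Admit waiting requests into free slots (the "in-flight" part).
        while queue and len(active) < max_batch:
            name, steps = queue.popleft()
            active[name] = steps
        step += 1
        for name in list(active):            # one decode step per active request
            active[name] -= 1
            if active[name] == 0:
                print(f"step {step}: {name} finished")
                del active[name]             # slot freed mid-batch
    return step

total = inflight_batching([("A", 2), ("B", 5), ("C", 3), ("D", 1), ("E", 4)])
print("total steps:", total)
```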
NVIDIA's TensorRT-LLM Multiblock Attention Enhances AI Inference on HGX H200
NVIDIA's TensorRT-LLM introduces multiblock attention, significantly boosting AI inference throughput by up to 3.5x on the HGX H200, addressing the challenges of long sequence lengths.
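Multiblock attention splits a long KV sequence into blocks so that multiple SMs can work on a single query in parallel, then merges the partial softmax results with a numerically stable rescaling. A NumPy sketch of the merge math (the real feature is a CUDA kernel optimization, not Python):

```python
import numpy as np

def attention(q, K, V):
    """Reference single-query attention over the full KV sequence."""
    s = K @ q                                # scores, shape (T,)
    w = np.exp(s - s.max())
    return (w @ V) / w.sum()

def multiblock_attention(q, K, V, n_blocks=4):
    """Toy multiblock attention: compute partial softmax attention per KV
    block (done in parallel on real hardware), then combine the partials by
    rescaling each to a shared softmax maximum."""
    outs, maxes, sums = [], [], []
    for Kb, Vb in zip(np.array_split(K, n_blocks), np.array_split(V, n_blocks)):
        s = Kb @ q
        m = s.max()
        w = np.exp(s - m)
        outs.append(w @ Vb); maxes.append(m); sums.append(w.sum())
    M = max(maxes)
    scale = [np.exp(m - M) for m in maxes]   # rescale partials to common max
    num = sum(o * c for o, c in zip(outs, scale))
    den = sum(s * c for s, c in zip(sums, scale))
    return num / den

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
print(np.allclose(attention(q, K, V), multiblock_attention(q, K, V)))  # True
```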
NVIDIA NIM Revolutionizes AI Model Deployment with Optimized Microservices
NVIDIA NIM streamlines the deployment of fine-tuned AI models, offering performance-optimized inference microservices for enterprise AI applications.
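NIM microservices expose an OpenAI-compatible HTTP API, so a deployed model can be queried with a standard client. A sketch assuming a NIM container listening locally on port 8000 (the model identifier is illustrative):

```python
from openai import OpenAI

# NIM serves an OpenAI-compatible API; base URL and model name below assume
# a locally running container and are illustrative, not prescriptive.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What does NVIDIA NIM provide?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```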
NVIDIA's TensorRT-LLM Enhances AI Efficiency with KV Cache Early Reuse
NVIDIA introduces KV cache early reuse in TensorRT-LLM, significantly speeding up inference and optimizing memory usage for AI models.
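Early reuse lets KV cache blocks computed for a shared prefix (for example, a long system prompt) be reused by later requests as soon as they exist, rather than being recomputed per request. A sketch assuming the high-level LLM API of recent TensorRT-LLM releases, where KvCacheConfig's enable_block_reuse turns this on (exact import paths and defaults vary by version):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Assumption: recent TensorRT-LLM releases expose KvCacheConfig with
# enable_block_reuse; the model id below is illustrative.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),
)

system = "You are a support agent for Acme. Policies: ..."  # shared prefix
for q in ["How do I reset my password?", "What is the refund window?"]:
    # Requests sharing the system-prompt prefix can reuse its KV blocks
    # instead of recomputing them.
    out = llm.generate(system + "\nUser: " + q)
    print(out.outputs[0].text)
```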
NVIDIA's TensorRT-LLM MultiShot Enhances AllReduce Performance with NVSwitch
NVIDIA introduces TensorRT-LLM MultiShot to improve multi-GPU communication efficiency, achieving up to 3x faster AllReduce operations by leveraging NVSwitch technology.
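MultiShot replaces the latency-bound ring AllReduce with two NVSwitch-accelerated phases: a reduce-scatter in which each GPU produces one fully reduced chunk, followed by an all-gather that distributes every chunk to every GPU. A NumPy sketch of that decomposition (the communication is what NVSwitch accelerates; this only checks the arithmetic):

```python
import numpy as np

def multishot_allreduce(per_gpu_tensors):
    """Toy model of a two-phase (MultiShot-style) AllReduce:
    phase 1: reduce-scatter -- each GPU ends up owning the sum of one chunk;
    phase 2: all-gather -- every GPU collects all reduced chunks.
    With NVSwitch multicast each phase sends O(1) data per GPU, versus the
    many serialized steps of a ring AllReduce."""
    n = len(per_gpu_tensors)
    chunks = [np.array_split(t, n) for t in per_gpu_tensors]   # split each tensor
    # Phase 1: chunk c is reduced (summed) across all GPUs.
    reduced = [sum(chunks[g][c] for g in range(n)) for c in range(n)]
    # Phase 2: concatenated result is gathered onto every GPU.
    full = np.concatenate(reduced)
    return [full.copy() for _ in range(n)]

gpus = [np.random.default_rng(i).normal(size=8) for i in range(4)]
out = multishot_allreduce(gpus)
print(np.allclose(out[0], sum(gpus)))  # True: matches a plain sum-AllReduce
```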
Enhancing Large Language Models with NVIDIA Triton and TensorRT-LLM on Kubernetes
Explore NVIDIA's methodology for optimizing large language models with Triton Inference Server and TensorRT-LLM, and for deploying and scaling them efficiently in a Kubernetes environment.
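When Triton serves TensorRT-LLM models on Kubernetes, liveness and readiness probes typically target Triton's KServe v2 health endpoints. A minimal check, assuming a server reachable on localhost:8000 and an illustrative model name:

```python
import requests

# Triton implements the KServe v2 HTTP protocol; Kubernetes probes usually
# point at these paths (host and port are assumptions for this sketch).
BASE = "http://localhost:8000"

live = requests.get(f"{BASE}/v2/health/live")
ready = requests.get(f"{BASE}/v2/health/ready")
print("live:", live.status_code == 200)     # liveness probe condition
print("ready:", ready.status_code == 200)   # readiness probe condition

# Per-model readiness, useful before routing traffic to a TensorRT-LLM model
# (the model name "ensemble" is illustrative).
model_ready = requests.get(f"{BASE}/v2/models/ensemble/ready")
print("model ready:", model_ready.status_code == 200)
```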
NVIDIA Enhances Llama 3.1 405B Performance with TensorRT Model Optimizer
NVIDIA's TensorRT Model Optimizer significantly boosts performance of Meta's Llama 3.1 405B large language model on H200 GPUs.
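TensorRT Model Optimizer accelerates models largely through post-training quantization, for example to FP8. Purely as an illustration of the scaling idea (not Model Optimizer's actual API or calibration pipeline), a toy per-tensor FP8 E4M3 fake-quantization in NumPy:

```python
import numpy as np

def fp8_e4m3_fake_quantize(x):
    """Toy per-tensor FP8 (E4M3) quantize-dequantize.

    Illustrative only: it scales the tensor onto the E4M3 range and rounds
    to ~3 mantissa bits, ignoring exponent clamping and subnormals that a
    real FP8 implementation handles."""
    FP8_MAX = 448.0                          # largest finite E4M3 value
    scale = np.abs(x).max() / FP8_MAX        # map tensor range onto FP8 range
    q = x / scale
    # Simulate the precision loss of a 3-bit mantissa after scaling.
    exp = np.floor(np.log2(np.abs(q) + 1e-30))
    q = np.round(q / 2 ** (exp - 3)) * 2 ** (exp - 3)
    return q * scale                         # dequantize back to float

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
w8 = fp8_e4m3_fake_quantize(w)
print("max abs error:", np.abs(w - w8).max())
```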