Together AI Kernels Team Achieves 3.6x Performance Gains on NVIDIA Hardware
Timothy Morano Apr 01, 2026 19:17
Together AI's kernel research team delivers major GPU optimization breakthroughs, cutting inference latency from 281ms to 77ms for enterprise AI deployments.
The team behind FlashAttention has quietly become one of the most consequential groups in AI infrastructure. Together AI's kernel research unit, now about 15 engineers strong, is solving a problem most people don't even know exists: the massive performance gap between AI models and the hardware running them.
Their latest win? Taking a voice AI company's time-to-first-64-tokens from 281ms down to 77ms, a 3.6x improvement that translated to 7.2x better unit economics.
The Hidden Bottleneck
Here's what most AI discourse misses: having great models and expensive GPUs doesn't guarantee performance. The bottleneck sits in between—the kernel layer that translates mathematical operations into actual silicon instructions.
"The gap between what researchers design and what actually runs fast on hardware is vast," explains Dan Fu, who leads a parallel research lab at UCSD. Get kernels right and you unlock hardware's full potential. Get them wrong and your expensive GPUs sit partially idle.
For companies building AI-native products, this isn't academic. When inference costs run 2x higher than necessary, or when latency breaks the user experience, kernel optimization becomes existential.
One Week Versus One Year
The team's capabilities showed clearly when NVIDIA's Blackwell GPUs arrived in March 2025. NVIDIA had spent a year with dozens of engineers optimizing kernels for the new architecture. Together AI had a week.
Their secret weapon: ThunderKittens, a library developed with Stanford researchers that reduces kernel code from 1,000+ lines of CUDA to roughly 100-200 lines. The abstraction layer is built around NVIDIA's tensor cores, the specialized matrix multiplication units on modern GPUs.
Within seven days of hardware access, the team had some of the fastest FP4 and FP8 GEMM kernels available for Blackwell, achieving up to 2x speedups over cuBLAS on H100s.
Real-World Impact
The voice AI case study illustrates what this means in production. The customer had a hard constraint: time-to-first-64-tokens above roughly 100ms breaks conversational flow. Their B200 deployment was hitting 281ms.
Together's team hand-optimized a "Megakernel" implementation, which runs an entire model in a single kernel and targets the HBM bandwidth ceiling of NVIDIA H100s. Results on Llama-3.2-1B: 77ms, down from 281ms. On Qwen 2.5 1.5B: 127ms, down from 292ms.
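The latencies above can be checked with simple arithmetic, and a rough estimate shows why memory bandwidth is a natural ceiling when a small model generates tokens one at a time. The sketch below is illustrative only: the function names are hypothetical, and the parameter count, bytes per weight, and bandwidth figure are assumptions, not numbers from Together AI.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of old latency to new latency."""
    return before_ms / after_ms


def bandwidth_floor_ms(n_params: float, bytes_per_param: float,
                       hbm_bytes_per_s: float, n_tokens: int) -> float:
    """Lower bound on decode time if every generated token must stream all
    weights from HBM once (batch size 1, KV-cache traffic ignored)."""
    seconds_per_token = n_params * bytes_per_param / hbm_bytes_per_s
    return seconds_per_token * n_tokens * 1e3


# Latencies reported in the article.
llama_speedup = speedup(281, 77)
qwen_speedup = speedup(292, 127)
print(f"Llama-3.2-1B speedup: {llama_speedup:.1f}x")
print(f"Qwen 2.5 1.5B speedup: {qwen_speedup:.1f}x")

# ASSUMED illustrative values, not from the article:
# ~1.2e9 parameters, 2 bytes per weight (16-bit), ~3.35e12 bytes/s of HBM bandwidth.
floor = bandwidth_floor_ms(1.2e9, 2, 3.35e12, n_tokens=64)
print(f"Bandwidth floor for 64 tokens: {floor:.0f} ms")
```

Under those assumptions the floor lands in the tens of milliseconds, which suggests why a kernel that keeps memory traffic close to the bandwidth limit matters more for this workload than raw compute.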
The approach traces back to FlashAttention's original insight. That paper, released around Memorial Day 2022, showed that attention, which much of the field assumed was already fully optimized, still had substantial headroom. By applying database systems principles such as data locality and awareness of memory hierarchies to transformer attention, the team achieved 2-3x speedups where previous sparsity methods showed only 10% real gains.
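The data-locality idea can be sketched in a few lines: process keys and values one block at a time while keeping a running softmax, so the full score vector never has to be held in memory. What follows is a minimal single-query sketch in scalar Python of that general block-wise technique, not the team's kernel; the function names and sample data are made up for illustration.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: compute every score, then softmax, then weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def blockwise_attention(q, keys, values, block_size=2):
    """Visit keys/values block by block, keeping only a running max,
    a running normalizer, and a running output accumulator."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results so they share the new max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


# Made-up sample data: one query, five key/value pairs of dimension 2.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = naive_attention(q, keys, values)
tiled = blockwise_attention(q, keys, values, block_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))
```

Both functions return the same result up to floating-point rounding; the block-wise version only ever touches one small block at a time, which is the property that lets a GPU kernel keep its working set in fast on-chip memory.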
Academic-Industry Pipeline
The team operates through an unusual model. Dan Fu runs his UCSD lab on higher-risk fundamental research. Together AI co-founder Tri Dao is at Princeton. Simran Arora is at Caltech. Ideas get de-risked in academia, then productionized at Together AI. PhD students join the company. Interns work on longer-term research in academic labs.
This produces engineers who bridge theory and production—people who, as Fu puts it, "lose sleep over memory access patterns" and "find beauty in data flow diagrams."
The work isn't glamorous. No announcements when a kernel optimization lands. Just faster training times, lower costs, higher throughput. But these margins determine whether AI-native products feel instant or sluggish, whether unit economics work or don't, whether companies scale to millions of users or plateau at thousands.
For enterprise AI deployments where every millisecond matters—and every percentage point of efficiency translates to significant cost savings—this invisible infrastructure layer may be where the real competitive advantage lies.