NVIDIA and Meta Collaborate on Advanced RAG Pipelines with Llama 3.1 and NeMo Retriever NIMs

Peter Zhang  Jul 24, 2024 11:50  UTC 03:50


In a significant advancement for large language models (LLMs), NVIDIA and Meta have jointly introduced a new framework that combines Llama 3.1 with NVIDIA NeMo Retriever NIMs to enhance retrieval-augmented generation (RAG) pipelines. The collaboration aims to keep LLM responses current and accurate, according to the NVIDIA Technical Blog.

Enhancing RAG Pipelines

Retrieval-augmented generation (RAG) is a key strategy for keeping LLMs from generating outdated or incorrect responses. Retrieval strategies such as semantic search or graph retrieval improve the recall of the documents needed for accurate generation. There is no one-size-fits-all approach, however: the retrieval pipeline must be customized to the data at hand and its hyperparameters tuned accordingly.
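
As a rough illustration of semantic search, the Python sketch below ranks documents by cosine similarity to a query. The embed function is a toy bag-of-words stand-in, assumed here only so the example runs without a model; a real pipeline would call an embedding model instead. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_search(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]
```

The top_k parameter is one of the hyperparameters such a pipeline would tune: larger values raise recall at the cost of passing more text to the LLM.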

Modern RAG systems increasingly incorporate an agentic framework to handle reasoning, decision-making, and reflection on the retrieved data. An agentic system enables an LLM to reason through problems, create plans, and execute them using a set of tools.

Meta's Llama 3.1 and NVIDIA NeMo Retriever NIMs

Meta’s Llama 3.1 family, spanning models with 8 billion to 405 billion parameters, is equipped with capabilities for agentic workloads. These models can break down tasks, act as central planners, and perform multi-step reasoning, all while maintaining model and system-level safety checks.

NVIDIA has optimized the deployment of these models through its NeMo Retriever NIM microservices, providing enterprises with scalable software to customize their data-dependent RAG pipelines. The NeMo Retriever NIMs can be integrated into existing RAG pipelines and work with open-source LLM frameworks like LangChain or LlamaIndex.

LLMs and NIMs: A Powerful Duo

In a customizable agentic RAG, LLMs equipped with function-calling capabilities play a crucial role in decision-making on retrieved data, structured output generation, and tool calling. NeMo Retriever NIMs enhance this process by providing state-of-the-art text embedding and reranking capabilities.
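
Embedding and reranking are typically chained: a fast first stage recalls a handful of candidates, and a more careful second stage reorders them. The Python sketch below shows that shape under stated assumptions. Both scorers are toy functions invented for illustration (word overlap and adjacent-word-pair matching), standing in for an embedding model and a reranking model; none of the names come from the NeMo Retriever API.

```python
from typing import Callable

Scorer = Callable[[str, str], float]  # (query, document) -> relevance score


def retrieve_then_rerank(query: str, docs: list[str],
                         first_stage: Scorer, reranker: Scorer,
                         recall_k: int = 3, final_k: int = 1) -> list[str]:
    """Recall recall_k candidates cheaply, then reorder them with the reranker."""
    candidates = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:recall_k]
    return sorted(candidates, key=lambda d: reranker(query, d), reverse=True)[:final_k]


def overlap_score(query: str, doc: str) -> float:
    """Toy first stage: Jaccard overlap of the word sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def phrase_score(query: str, doc: str) -> float:
    """Toy reranker: counts the query's adjacent word pairs found in the document."""
    words = query.lower().split()
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return float(sum(pair in doc.lower() for pair in pairs))
```

The split matters because the first stage runs over the whole corpus and must be cheap, while the reranker only sees recall_k candidates and can afford to look at word order or other finer signals.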

NVIDIA NeMo Retriever NIMs

NeMo Retriever microservices, packaged with NVIDIA Triton Inference Server and NVIDIA TensorRT, offer several benefits:

  • Scalable deployment: Seamlessly scale to meet user demands.
  • Flexible integration: Integrate into existing workflows and applications with ease.
  • Secure processing: Ensure data privacy and rigorous data protection.

Meta Llama 3.1 Tool Calling

Llama 3.1 models are built with agentic use in mind: they can plan and select appropriate tools to solve complex problems. The models support OpenAI-style tool calling, which yields structured outputs without the need for regex parsing.
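
To make the idea concrete, the Python sketch below declares one tool in the OpenAI-style function schema and dispatches a tool call to it. The growth_rate tool is a hypothetical example, and simulated_call is a hand-written message in the shape such a model response takes, not real model output. No model or server is contacted.

```python
import json


def growth_rate(start: float, end: float) -> float:
    """Hypothetical tool: percentage growth from start to end."""
    return (end - start) / start * 100.0


# Tool description in the OpenAI-style function-calling schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "growth_rate",
        "description": "Percentage growth between two values.",
        "parameters": {
            "type": "object",
            "properties": {
                "start": {"type": "number"},
                "end": {"type": "number"},
            },
            "required": ["start", "end"],
        },
    },
}]

# Maps tool names to the Python callables that implement them.
REGISTRY = {"growth_rate": growth_rate}


def dispatch(tool_call: dict) -> float:
    """Run the function named in a tool call; arguments arrive as a JSON string."""
    function = REGISTRY[tool_call["function"]["name"]]
    arguments = json.loads(tool_call["function"]["arguments"])
    return function(**arguments)


# Hand-written stand-in for a tool call emitted by the model.
simulated_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "growth_rate", "arguments": '{"start": 200, "end": 250}'},
}
```

Because the arguments arrive as JSON that matches the declared schema, dispatch(simulated_call) can parse them with json.loads rather than scraping free-form text.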

RAG with Agents

Agentic frameworks enhance RAG pipelines by adding layers of decision-making and self-reflection. These frameworks, such as self-RAG and corrective RAG, improve the quality of retrieved data and generated responses by ensuring post-generation verification and alignment with factual information.
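
The control flow behind such frameworks can be sketched in a few lines of Python. This is a simplified loop in the spirit of corrective RAG, not the published algorithm: retrieved passages are graded, a fallback source is used if none pass, and the answer is checked against the passages before it is returned. Every callable is an assumed placeholder to be supplied by the caller, typically backed by an LLM or a retriever.

```python
from typing import Callable

NO_ANSWER = "I could not find a grounded answer."


def corrective_rag(question: str,
                   retrieve: Callable[[str], list[str]],
                   grade: Callable[[str, str], bool],
                   fallback: Callable[[str], list[str]],
                   generate: Callable[[str, list[str]], str],
                   is_grounded: Callable[[str, list[str]], bool],
                   max_attempts: int = 2) -> str:
    """Retrieve, grade, generate, then verify the answer after generation."""
    # Keep only passages the grader judges relevant to the question.
    passages = [p for p in retrieve(question) if grade(question, p)]
    if not passages:
        # Corrective step: nothing relevant was retrieved, so try another source.
        passages = fallback(question)
    for _ in range(max_attempts):
        answer = generate(question, passages)
        # Post-generation verification: accept only answers supported by the passages.
        if is_grounded(answer, passages):
            return answer
    return NO_ANSWER
```

Retrying generation only helps when generate is non-deterministic, as sampled LLM output usually is; with a deterministic generator, max_attempts=1 is equivalent.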

Architecture and Node Specifications

Multi-agent frameworks like LangGraph allow developers to group LLM application-level logic into nodes and edges, offering finer control over agentic decision-making. Noteworthy nodes include:

  • Query decomposer: Breaks down complex questions into smaller logical parts.
  • Router: Decides which source to retrieve documents from, or handles the response itself.
  • Retriever: Implements the core RAG pipeline, often combining semantic and keyword search methods.
  • Grader: Checks the relevance of retrieved passages.
  • Hallucination checker: Verifies the factual accuracy of generated content.

Additional tools can be integrated based on specific use cases, such as financial calculators for answering trend or growth-related questions.
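
The nodes-and-edges idea can be illustrated without any framework. The Python sketch below is plain Python, not the LangGraph API: each node is a function that updates a shared state dictionary and returns the name of the next node, which plays the role of an edge. The routing rule, the word-overlap grader, and the calculator stub are all toy assumptions made for illustration.

```python
from typing import Callable

State = dict
Node = Callable[[State], str]  # a node returns the name of the next node, or "END"


def router(state: State) -> str:
    # Toy rule: growth questions go to the calculator tool, the rest to retrieval.
    return "calculator" if "growth" in state["question"].lower() else "retriever"


def retriever(state: State) -> str:
    state["passages"] = list(state["corpus"])  # stand-in for a real RAG retriever
    return "grader"


def grader(state: State) -> str:
    # Keep passages sharing at least one longer word with the question.
    keywords = {w for w in state["question"].lower().split() if len(w) > 3}
    state["passages"] = [p for p in state["passages"]
                         if keywords & set(p.lower().split())]
    return "generator"


def generator(state: State) -> str:
    # Stand-in for LLM generation: echo the best passage if one survived grading.
    state["answer"] = state["passages"][0] if state["passages"] else "No relevant passage found."
    return "END"


def calculator(state: State) -> str:
    state["answer"] = "Routed to the calculator tool."
    return "END"


NODES: dict[str, Node] = {"router": router, "retriever": retriever, "grader": grader,
                          "generator": generator, "calculator": calculator}


def run_graph(state: State, entry: str = "router", max_steps: int = 10) -> State:
    """Follow edges from the entry node until a node returns "END"."""
    current, steps = entry, 0
    while current != "END":
        if steps >= max_steps:
            raise RuntimeError("graph did not terminate")
        state.setdefault("trace", []).append(current)
        current = NODES[current](state)
        steps += 1
    return state
```

Recording the visited nodes in state["trace"] makes each routing decision inspectable, which is the kind of finer control the node-based layout is meant to offer.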

Getting Started

Developers can access NeMo Retriever embedding and reranking NIM microservices, along with Llama 3.1 NIMs, on NVIDIA’s AI platform. A detailed implementation guide is available in NVIDIA’s developer Jupyter notebook.


