In a recent post, NVIDIA introduced NeMo Curator, a tool for curating custom datasets for large language models (LLMs) and small language models (SLMs). NeMo Curator aims to streamline data preparation for pretraining and continuous training, as well as for fine-tuning existing foundation models on domain-specific datasets, according to the NVIDIA Technical Blog.
Overview
The blog post walks through an example of using NeMo Curator for email classification. The demonstration uses the Enron emails dataset, which is publicly available on HuggingFace. The dataset contains approximately 1,400 records, each labeled with one of eight categories. The data curation pipeline involves several steps: downloading, iterating over, and extracting the email data, unifying the Unicode representation, and filtering out irrelevant or low-quality records.
Key Steps in Data Curation
The curation process begins with defining downloader, iterator, and extractor classes that convert the dataset into JSONL format. The full pipeline then performs the following operations:
- Downloading and converting the dataset to JSONL format.
- Filtering out emails that are empty or too long.
- Redacting personally identifiable information (PII).
- Adding instruction prompts and ensuring proper formatting.
The pipeline runs in less than five minutes on consumer-grade hardware.
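The conversion and Unicode steps above can be sketched in plain Python. This is a minimal illustration using only the standard library, not NeMo Curator's own classes; the field names (`subject`, `body`, `category`), the sample record, and the choice of NFC normalization are assumptions made for the example.

```python
import io
import json
import unicodedata


def to_record(raw_email: dict) -> dict:
    """Normalize text fields to NFC so visually identical strings compare equal."""
    return {
        "subject": unicodedata.normalize("NFC", raw_email["subject"]).strip(),
        "body": unicodedata.normalize("NFC", raw_email["body"]).strip(),
        "category": raw_email["category"],
    }


def write_jsonl(records, handle) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# Hypothetical sample record; the real dataset's schema may differ.
sample = [
    {
        "subject": "Cafe\u0301 meeting",  # "e" followed by a combining accent
        "body": "See you at 3pm. ",
        "category": "Example category",
    },
]

buffer = io.StringIO()  # stands in for a file opened in text mode
written = write_jsonl((to_record(email) for email in sample), buffer)
jsonl_text = buffer.getvalue()
```

JSONL keeps each record on its own line, so later stages can stream the file one record at a time instead of loading the whole dataset into memory.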
Advanced Fine-Tuning Techniques
The curated data is intended for parameter-efficient fine-tuning (PEFT) with methods such as LoRA and p-tuning, which are supported by the NVIDIA NeMo framework and are widely used for adapting LLMs to specific domains. Because these methods train only a small number of parameters, they allow quick iteration on hyperparameters and data processing techniques, which helps the model learn effectively from domain-specific data.
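To see why a method such as LoRA counts as parameter-efficient, it helps to compare trainable parameter counts. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors, B (d_out x r) and A (r x d_in), whose product is added to the frozen weights. The sketch below does that arithmetic; the layer size of 4096 and the rank of 8 are illustrative assumptions, not figures from the blog post.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for LoRA: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


# Illustrative values only (assumed, not taken from the source).
d_out = d_in = 4096
rank = 8

full = full_update_params(d_out, d_in)        # 16,777,216
lora = lora_update_params(d_out, d_in, rank)  # 65,536
ratio = lora / full                           # 1/256, about 0.39%
```

Under these assumed sizes, the low-rank update trains well under one percent of the parameters of a full update to the same layer, which is what makes repeated experiments cheap.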
Implementing Custom Filters and Modifiers
Custom filters and modifiers play a significant role in refining the dataset. For instance, filters can remove emails that are too long or empty, while modifiers can redact PII and add instructional prompts. These operations can be chained together using the Sequential class in NeMo Curator, enabling a streamlined and efficient data curation process.
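The chaining pattern can be illustrated with a small, self-contained sketch. The helpers below (`keep_if`, `modify`, `sequential`) are hypothetical plain-Python stand-ins for the idea of filters, modifiers, and a sequential pipeline; they are not NeMo Curator's API. The 5,000-character limit, the email-address regex, and the prompt wording are likewise assumptions.

```python
import re
from typing import Callable, Dict, List

Record = Dict[str, str]
Step = Callable[[List[Record]], List[Record]]

# Simplified pattern for email addresses; real PII redaction covers far more.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def keep_if(predicate: Callable[[Record], bool]) -> Step:
    """Build a filter step that keeps only records passing the predicate."""
    return lambda records: [r for r in records if predicate(r)]


def modify(transform: Callable[[Record], Record]) -> Step:
    """Build a modifier step that rewrites every record."""
    return lambda records: [transform(r) for r in records]


def sequential(*steps: Step) -> Step:
    """Compose steps so each one receives the previous step's output."""
    def run(records: List[Record]) -> List[Record]:
        for step in steps:
            records = step(records)
        return records
    return run


def not_empty(record: Record) -> bool:
    return bool(record["body"].strip())


def not_too_long(record: Record, max_chars: int = 5000) -> bool:
    return len(record["body"]) <= max_chars  # assumed threshold


def redact_emails(record: Record) -> Record:
    return {**record, "body": EMAIL_PATTERN.sub("<EMAIL>", record["body"])}


def add_prompt(record: Record) -> Record:
    prompt = "Classify the following email into one category.\n\n" + record["body"]
    return {**record, "prompt": prompt}


pipeline = sequential(
    keep_if(not_empty),
    keep_if(not_too_long),
    modify(redact_emails),
    modify(add_prompt),
)

emails = [
    {"body": "   "},                                  # dropped: empty
    {"body": "x" * 6000},                             # dropped: too long
    {"body": "Contact jane.doe@example.com today."},  # kept and redacted
]
curated = pipeline(emails)
```

Running the filters before the modifiers means the more expensive rewriting steps only touch records that will actually be kept.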
Practical Applications and Future Steps
The curated datasets can be used to fine-tune LLMs like the Llama 2 model for specific applications such as email classification. NVIDIA provides extensive resources, including the NeMo framework PEFT with Llama 2 playbook, to assist developers in leveraging these tools for their machine learning projects.
NVIDIA also offers the NeMo Curator microservice, which simplifies custom generative AI development for enterprises. Interested parties can apply for early access to this microservice on the NVIDIA Developer website.
For more detailed information on NeMo Curator and its applications, visit the NVIDIA Technical Blog.