Anyscale Explores Direct Preference Optimization Using Synthetic Data

Felix Pinkston  Aug 22, 2024 11:00  UTC 03:00


According to Anyscale, Direct Preference Optimization (DPO) has emerged as a significant methodology for tuning language models to align their outputs with human preferences. The company’s latest blog post provides an in-depth case study on the application of DPO using synthetic data, particularly in the context of summarization tasks.

Synthetic Data Generation

Synthetic data generation has become a powerful technique for creating high-quality datasets. Anyscale's approach leverages AI models as data augmenters and judges to improve subsequent models. The blog outlines a detailed pipeline for synthetic data generation, emphasizing the utility of Ray Data and vLLM for scaling and rapid experimentation.

DPO Training and Insights

Direct Preference Optimization (DPO) offers a balanced trade-off between complexity and effectiveness, making it a widely adopted algorithm for preference tuning. Anyscale has integrated DPO into its LLM suite, enabling users to build preference-tuned models through an intuitive API. The blog covers modeling insights and experiments conducted on DPO for summarization.

Evaluation

Anyscale utilizes Ray Data and vLLM for batch inference to evaluate the generated summaries at scale. Evaluation is crucial for determining the quality of models, and Anyscale emphasizes the importance of task-specific evaluation aligned with training objectives. The blog provides key details on setting up preference functions for effective evaluation.

Comparison with Supervised Fine-Tuning

The blog contrasts DPO with traditional supervised fine-tuning (SFT). SFT relies on collecting high-quality data and trains the model to imitate the desired behavior exactly, whereas preference tuning only requires knowing whether one response is preferred over another. This makes data generation easier to scale and allows on-policy data collection, in which the training examples come from the model being tuned and therefore target that model's own weaknesses.

Case Study: Summarization

The case study applies DPO to the Mistral-7B-Instruct-v0.1 model for summarizing CNN articles. Anyscale designed a synthetic summarization preference dataset, using a synthetic judge to reduce costs and ensure alignment between training and evaluation. The preference function combines word count minimization and Q&A accuracy to evaluate summaries.
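
A preference function of this kind can be sketched in a few lines. The version below is only an illustration: the `ScoredSummary` type, the `min_correct` threshold, and the tie-handling are assumptions made for the example, not details confirmed by the article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredSummary:
    text: str
    num_correct: int  # questions the judge answered correctly using only this summary


def prefer(a: ScoredSummary, b: ScoredSummary, min_correct: int = 3) -> Optional[ScoredSummary]:
    """Return the preferred summary, or None when neither is preferred.

    Assumed rule: if both summaries are accurate enough, the shorter one wins;
    otherwise the one with more correct answers wins.
    """
    if a.num_correct >= min_correct and b.num_correct >= min_correct:
        len_a, len_b = len(a.text.split()), len(b.text.split())
        if len_a == len_b:
            return None
        return a if len_a < len_b else b
    if a.num_correct == b.num_correct:
        return None
    return a if a.num_correct > b.num_correct else b
```

Gating on accuracy before comparing length keeps the model from being rewarded for summaries that are short only because they leave out the facts.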

Data Generation

Anyscale used the Mistral-7B-Instruct-v0.1 model to generate on-policy data for summarization. The process involved generating multiple summaries for each article and using the Llama-3-70B-Instruct model to create and answer multiple-choice questions about the original text. This method ensured diverse outputs and accurate evaluation.
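
Once several summaries per article have been scored, they can be turned into training pairs. The sketch below assumes a hypothetical prompt template and the common `prompt` / `chosen` / `rejected` record layout; the `prefer` callable stands in for whatever preference function is used, and none of these names come from the article.

```python
from itertools import combinations


def build_preference_pairs(article, scored_summaries, prefer):
    """Turn N scored summaries of one article into preference records.

    scored_summaries: list of (summary_text, num_correct) tuples.
    prefer: callable that takes two such tuples and returns the preferred
            one, or None when there is no preference.
    """
    # Hypothetical prompt template, used only for illustration.
    prompt = f"Summarize the following article:\n\n{article}"
    records = []
    for a, b in combinations(scored_summaries, 2):
        winner = prefer(a, b)
        if winner is None:
            continue  # skip pairs that carry no preference signal
        loser = b if winner is a else a
        records.append({"prompt": prompt, "chosen": winner[0], "rejected": loser[0]})
    return records
```

Comparing every pair of candidates is one simple option; with many samples per article it may be cheaper to keep only the pairs with the clearest preference.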

DPO Training

Anyscale implemented DPO within its LLM post-training offering, allowing users to configure hyperparameters and compute resources for training runs. The blog provides a detailed example of a DPO training configuration, emphasizing the importance of the β hyperparameter and efficient training using Ray.
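
To see what β controls, it helps to look at the standard DPO objective for a single preference pair. The function below is a minimal plain-Python sketch of that published loss, not Anyscale's implementation; the argument names are chosen for this example, and each log-probability is assumed to be the summed token log-probability of a full response.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin measures how much more
    the policy favors the chosen response than the reference model does.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    z = beta * margin
    # Numerically stable form of -log(sigmoid(z)).
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference model the margin is zero and the loss is log 2. β scales the margin: larger values penalize drift from the reference model more strongly, while smaller values let the policy move further to satisfy the preferences.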

Evaluation

Evaluation involved computing win-rates for each model, comparing DPO-trained models with the original and other baselines. The results demonstrated DPO's advantage in balancing accuracy and compression, outperforming both SFT and GPT-4o baselines.
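
A win-rate can be computed from per-article outcomes of the preference function. How ties are handled is not stated here, so the sketch below assumes a tie counts as half a win; the outcome labels are also hypothetical.

```python
def win_rate(outcomes):
    """Fraction of comparisons won against a baseline.

    outcomes: iterable of "win", "loss", or "tie" labels, one per article.
    Assumption: a tie counts as half a win.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no comparisons to score")
    wins = sum(1 for o in outcomes if o == "win")
    ties = sum(1 for o in outcomes if o == "tie")
    return (wins + 0.5 * ties) / len(outcomes)
```

For example, `win_rate(["win", "loss", "tie", "win"])` gives 0.625 under this convention.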

Insights and Challenges

Anyscale identified key insights for DPO training, including the critical role of β and learning rate hyperparameters. The blog also discusses failure modes, such as long off-topic endings and gibberish tokens, highlighting the need for careful hyperparameter tuning and monitoring.

Iterative On-Policy Training

The blog suggests iterative on-policy training as a method to enhance DPO performance. By regenerating training data with the fine-tuned model and applying additional DPO rounds, Anyscale achieved significant performance gains, making DPO competitive with traditional RLHF methods.
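
The iterative procedure can be outlined as a short loop. This is a structural sketch only: `generate_pairs` and `train_dpo` are hypothetical callables supplied by the caller, and using the latest model as the reference in each round is an assumption rather than something the article specifies.

```python
def iterative_dpo(model, articles, generate_pairs, train_dpo, rounds=2):
    """Run several rounds of on-policy DPO.

    generate_pairs(model, articles) -> preference pairs sampled from `model`.
    train_dpo(policy, reference, pairs) -> a newly trained model.
    Both callables are placeholders for real data-generation and training code.
    """
    for _ in range(rounds):
        # Regenerate data with the current model so the pairs stay on-policy.
        pairs = generate_pairs(model, articles)
        # Assumption: the current model also serves as the reference this round.
        model = train_dpo(policy=model, reference=model, pairs=pairs)
    return model
```

Regenerating the pairs each round means later rounds train on mistakes the improved model still makes, rather than on errors it has already stopped producing.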

For the full detailed case study and methodology, readers can refer to the original post on Anyscale.


