Meta's latest model, LIMA-65B: without RLHF, it far outperforms Alpaca!

Training a large language model is divided into two stages: (1) unsupervised pre-training on raw text to learn general-purpose representations; (2) large-scale instruction tuning and reinforcement learning to better align the model with specific tasks and user preferences. The article shared today covers the latest research released by Meta: LIMA, a model obtained by fine-tuning LLaMA-65B on 1,000 carefully curated prompts and responses, without any RLHF. Experiments show that the model performs remarkably well, and the authors conclude that "almost all of the knowledge in large language models is learned during pre-training, and only limited instruction tuning data is needed to teach the model to produce high-quality output".

Background

Language models are pre-trained on large amounts of data to predict the next token, which lets them learn general-purpose representations that can be transferred to nearly any language understanding or generation task. To enable this transfer, various methods for aligning language models have been proposed, mainly focusing on instruction tuning over large datasets (millions of examples) and on reinforcement learning from human feedback (RLHF). Existing alignment methods require massive compute and specialized data to reach ChatGPT-level performance. But "this paper demonstrates that, given a strong pre-trained language model, remarkably strong performance can be achieved by fine-tuning on only 1,000 carefully selected training examples".


The paper hypothesizes that alignment can be a simple process. To test this hypothesis, the authors curate 1,000 examples that approximate real user prompts paired with high-quality responses. With quality and diversity in mind, 750 top questions and answers were selected from community forums such as Stack Exchange and wikiHow; in addition, 250 examples of prompts and responses were written manually to further broaden task diversity and to emphasize a consistent AI-assistant response style. Finally, LLaMA-65B was fine-tuned on these 1,000 examples to obtain LIMA.


LIMA is compared with state-of-the-art language models and products on 300 challenging test prompts. In the human preference study, LIMA outperformed OpenAI's DaVinci003 (trained with RLHF) and Alpaca-65B (trained on 52,000 examples). While humans typically prefer responses from GPT-4, Claude, and Bard over LIMA's, this is not always the case: LIMA produced equal or better responses in 43%, 46%, and 58% of cases, respectively. "An analysis of LIMA's responses on an absolute scale found that 88% met the prompt requirements and 50% were considered excellent".


Ablation experiments show that the gains diminish sharply when the amount of data is scaled up without also increasing the diversity of prompts, while the gains are large when data quality is improved. Furthermore, although the training set contains no dialogue examples, LIMA can produce coherent multi-turn dialogues, and adding only 30 hand-crafted multi-turn dialogues to the training set significantly improves its dialogue ability.


Superficial alignment hypothesis

The paper defines the "Superficial Alignment Hypothesis": "A model's knowledge and capabilities are learned almost entirely during pre-training, while alignment teaches it which subdistribution of formats to use when interacting with users". If this hypothesis is correct, and alignment is primarily about learning style, then a corollary is that a pre-trained language model can be adequately tuned with a fairly small set of examples.


To this end, the paper collects a dataset of 1,000 prompts and responses, where the outputs (responses) are stylistically consistent with each other but the inputs (prompts) are diverse. Specifically, the authors seek outputs written in the style of a helpful AI assistant. "We curate such examples from a variety of sources, mainly divided into community Q&A forums and manually written examples". They also collect a test set of 300 prompts and a development set of 50 prompts. Table 1 below gives an overview of the different data sources and provides some statistics.


LIMA training

LIMA is trained by fine-tuning LLaMA on the 1,000-example alignment training set. To distinguish the speakers (user and assistant), a special end-of-turn token (EOT) is introduced at the end of each utterance: "This token plays the same role as EOS in halting generation, but avoids conflation with any other meaning that the pre-trained model may have attached to the existing EOS token".
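As a rough illustration of the end-of-utterance marker described above, the sketch below closes every utterance with a marker string and cuts generated text at the first marker. The marker string `<EOT>` and the helper names `format_dialogue` and `first_reply` are hypothetical; the article does not give the actual token or the authors' preprocessing code.

```python
from typing import List, Tuple

# Hypothetical marker string; the real token used by the authors is not given here.
EOT = "<EOT>"


def format_dialogue(turns: List[Tuple[str, str]]) -> str:
    """Concatenate utterances, closing each one with the EOT marker."""
    parts = []
    for speaker, text in turns:
        # The speaker label is only validated; the EOT marker alone separates utterances.
        if speaker not in ("user", "assistant"):
            raise ValueError(f"unknown speaker: {speaker}")
        parts.append(text.strip() + EOT)
    return "".join(parts)


def first_reply(generated: str) -> str:
    """Cut generated text at the first EOT, mimicking how generation would halt."""
    return generated.split(EOT, 1)[0]


example = format_dialogue([
    ("user", "What is RLHF?"),
    ("assistant", "Reinforcement learning from human feedback."),
])
```

In a real setup the marker would be registered as a dedicated token in the tokenizer rather than handled as a plain string; the string version only shows where the marker sits relative to each utterance.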


We follow standard fine-tuning hyperparameters: fine-tuning runs for 15 epochs using AdamW with β1 = 0.9, β2 = 0.95, and weight decay 0.1. Without warmup steps, the initial learning rate is set to 1e-5 and decays linearly to 1e-6 by the end of training. The batch size is 32 examples (64 for smaller models), and texts longer than 2048 tokens are trimmed. A notable difference from conventional training is the use of residual dropout: following Ouyang et al., dropout is applied over the residual connections, starting at a rate of 0.0 at the bottom layer and rising linearly to 0.3 at the last layer (0.2 for smaller models). "We found that perplexity does not correlate with generation quality", so checkpoints between the 5th and 10th epochs were selected manually using the held-out 50-example development set.
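The two linear schedules in this setup (learning-rate decay without warmup, and a dropout rate that grows with layer depth) can be sketched in a few lines. This is a minimal illustration, not the authors' training code: the function names are hypothetical, and the default values (learning rate from 1e-5 down to 1e-6, dropout from 0.0 up to 0.3) are taken to be the ones reported in the LIMA paper and should be treated as assumptions here.

```python
def linear_lr(step: int, total_steps: int,
              lr_start: float = 1e-5, lr_end: float = 1e-6) -> float:
    """Learning rate decaying linearly from lr_start (step 0) to lr_end (last step), no warmup."""
    if total_steps < 2:
        return lr_start
    # Clamp the step into [0, total_steps - 1] before interpolating.
    frac = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    return lr_start + frac * (lr_end - lr_start)


def residual_dropout_rates(num_layers: int,
                           p_top: float = 0.3, p_bottom: float = 0.0) -> list:
    """Per-layer dropout rate rising linearly from the bottom layer to the last layer."""
    if num_layers == 1:
        return [p_bottom]
    return [p_bottom + (p_top - p_bottom) * i / (num_layers - 1)
            for i in range(num_layers)]


def truncate(token_ids: list, max_len: int = 2048) -> list:
    """Trim an example to at most max_len tokens."""
    return token_ids[:max_len]


rates = residual_dropout_rates(5)  # one rate per layer, bottom to top
```

In an actual training loop, `linear_lr` would feed the optimizer's learning rate at each step and each transformer layer would use its own entry from `residual_dropout_rates` for the dropout applied on its residual connections.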


Experimental results

The paper compares LIMA with five baseline language models (Alpaca 65B, DaVinci003, Bard, Claude, and GPT-4), judging the generated outputs in both a human preference test and a GPT-4 preference test. The experimental results are shown in the figure below:

The study found that "although Alpaca 65B was trained on 52 times more data than LIMA, the outputs it produces are generally less preferred than LIMA's"; the same is true of DaVinci003, even though it was trained with RLHF, a supposedly better alignment method. Bard shows the opposite trend and tends to be preferred over LIMA, yet 58% of the time LIMA's output is at least as good as Bard's. Finally, while Claude and GPT-4 generally outperform LIMA, there are still cases where LIMA's output is actually better, and even GPT-4 prefers LIMA's output over its own 19% of the time.
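The "as good or better" figures are simply the share of test prompts on which LIMA either wins or ties. A minimal sketch of that tally follows, assuming each pairwise comparison has been reduced to one of three labels; the label names and the toy data are invented for illustration and are not the paper's annotations.

```python
from collections import Counter

VALID_LABELS = {"lima", "tie", "baseline"}


def preference_summary(judgments):
    """Summarize pairwise judgments, one label per test prompt."""
    if not judgments:
        raise ValueError("no judgments given")
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(judgments)
    return {
        "lima_wins": counts["lima"] / total,
        "tie": counts["tie"] / total,
        "baseline_wins": counts["baseline"] / total,
        # "Equal or better" counts both wins and ties for LIMA.
        "lima_equal_or_better": (counts["lima"] + counts["tie"]) / total,
    }


# Toy labels, made up for illustration only (not data from the paper).
toy = ["lima"] * 3 + ["tie"] * 2 + ["baseline"] * 5
summary = preference_summary(toy)
```

Note that "equal or better" is the complement of the baseline's win rate, so a baseline can be preferred more often than LIMA while LIMA is still "equal or better" on more than half of the prompts.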


The paper studies the impact of training data diversity, quality, and quantity on model performance through ablation experiments. The experiments fine-tune a language model with 7B parameters on different datasets. For each research question, they compare the effect of different data sources and of different filtering steps applied to the data. The results are shown below:

The results show that "selecting high-quality data from the Stack Exchange dataset improves model performance when the amount of data is held fixed"; likewise, comparing the Stack Exchange and wikiHow datasets shows that "more diverse data improves model performance". But "simply increasing the amount of training data does not improve model performance".
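Comparing data quality "while controlling the amount of data" means the filtered and unfiltered training sets must have the same size, so that any difference comes from quality alone. The sketch below shows one way to build such a pair of subsets. The `score` field, the threshold, and the function name `controlled_subsets` are hypothetical stand-ins; the article does not describe the authors' actual filtering criteria.

```python
import random


def controlled_subsets(pool, size, min_score, seed=0):
    """Return (filtered, unfiltered) training subsets of identical size."""
    rng = random.Random(seed)
    # Hypothetical quality filter: keep examples whose score passes the threshold.
    high_quality = [ex for ex in pool if ex["score"] >= min_score]
    if len(high_quality) < size or len(pool) < size:
        raise ValueError("pool too small for the requested subset size")
    filtered = rng.sample(high_quality, size)   # quality-filtered condition
    unfiltered = rng.sample(pool, size)         # same size, no filtering
    return filtered, unfiltered


# Synthetic pool, for illustration only.
pool = [{"id": i, "score": i % 10} for i in range(200)]
filtered, unfiltered = controlled_subsets(pool, size=20, min_score=8)
```

Each subset would then be used to fine-tune a separate copy of the same base model, and the two models compared on the same test prompts.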



