Stanford and Google researchers propose DoReMi: an AI algorithm that reweights data domains for language model training

https://arxiv.org/abs/2305.10429

Datasets for language model (LM) training are often drawn from a mix of domains. For example, The Pile, a sizable publicly accessible dataset, is composed of roughly 24% web data, 9% Wikipedia, 4% GitHub, and so on. The composition of the pre-training data has a significant impact on an LM's performance, yet it is unclear how much of each domain should be included to produce a model that excels at a range of downstream tasks. Existing work sets domain weights (sampling probabilities for each domain) using intuition or a set of downstream tasks. For example, The Pile uses heuristically chosen domain weights, which may not be the best choice.
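To make the idea of domain weights concrete, the sketch below samples training documents from a multi-domain corpus in proportion to a weight vector. This is a minimal illustration, not the paper's data pipeline; the weights and domain names are placeholders loosely echoing the figures above.

```python
import random

# Hypothetical domain weights (illustrative only; not The Pile's exact mixture).
domain_weights = {"web": 0.24, "wikipedia": 0.09, "github": 0.04, "other": 0.63}

def sample_domain(weights):
    """Pick a domain with probability proportional to its weight."""
    domains, probs = zip(*weights.items())
    return random.choices(domains, weights=probs, k=1)[0]

def next_training_example(corpora, weights):
    """Draw the next pre-training document: choose a domain, then a document from it."""
    domain = sample_domain(weights)
    return random.choice(corpora[domain])
```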

In this study, researchers from Google and Stanford University seek domain weights that yield models that perform well across all domains by minimizing the worst-case loss over domains rather than tuning domain weights on a collection of downstream tasks. Because each domain has its own optimal achievable loss (its entropy), a naive worst-case strategy would simply upweight the domains with the noisiest data. Tuning domain weights on downstream tasks, as is done for existing LMs such as PaLM and GLaM, is also problematic: it can require training potentially thousands of LMs with different domain weights and risks overfitting to a particular set of downstream tasks.
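Schematically, the idea is to optimize the worst-case excess loss rather than the raw worst-case loss. The formulation below is a paraphrase in our own notation, not a quotation of the paper's exact objective:

```latex
% Minimax over domain weights (alpha on the probability simplex Delta^k) and proxy
% model parameters theta. Using the excess loss L_i(theta) - L_i(theta_ref) on each
% domain i, instead of the raw loss, keeps high-entropy (inherently noisy) domains
% from being automatically upweighted.
\min_{\theta} \; \max_{\alpha \in \Delta^{k}} \;
  \sum_{i=1}^{k} \alpha_i \left[ L_i(\theta) - L_i(\theta_{\mathrm{ref}}) \right]
```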

Figure 1: Domain Reweighting with Minimax Optimization (DoReMi) optimizes the domain weights of a dataset composed of multiple domains to improve the language models trained on it. DoReMi first trains a reference model using some initial domain weights (step 1). In step 2, a small proxy model is trained with group distributionally robust optimization (Group DRO) over the domains to produce domain weights rather than a robust model. The third step trains a large model using the optimized domain weights.

This is the driving force behind their technique, Domain Reweighting with Minimax Optimization (DoReMi), which uses distributionally robust optimization (DRO) to tune domain weights without any knowledge of the downstream tasks (Figure 1). DoReMi begins by conventionally training a small reference model with 280 million parameters. It then trains a small distributionally robust language model (DRO-LM) to minimize the worst-case excess loss relative to the reference model's loss. Crucially, what they keep is the domain weights produced by DRO training, not the robust LM itself: rather than building a robust model, their strategy repurposes the DRO-LM framework to optimize domain weights. A large (8B-parameter) LM is then trained on a new dataset resampled according to these domain weights.
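At a high level, the three-step pipeline looks roughly like the sketch below. Every function name here is a hypothetical placeholder standing in for a training stage, not an API from the paper or its code release.

```python
def doremi(corpora, initial_weights, small_cfg, large_cfg):
    """Hypothetical outline of the DoReMi pipeline (placeholder functions throughout)."""
    # Step 1: train a small reference model (~280M params) on the initial domain weights.
    reference = train_lm(corpora, initial_weights, small_cfg)

    # Step 2: train a small proxy model with Group DRO over domains; keep the
    # domain weights it produces, not the robust model itself.
    _proxy, tuned_weights = train_group_dro_proxy(corpora, reference, small_cfg)

    # Step 3: train the large model (e.g., 8B params) on data resampled with the tuned weights.
    return train_lm(corpora, tuned_weights, large_cfg)
```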


Instead of subselecting examples from a minibatch, they use Group DRO's online learning-based optimizer, which dynamically updates the domain weights according to the loss on each domain in order to rescale the training objective. DoReMi then takes the domain weights averaged over the DRO training steps. To optimize domain weights on The Pile and the GLaM dataset, they run DoReMi with 280M-parameter proxy and reference models. An 8B-parameter LM, more than 30 times larger, is then trained using the DoReMi domain weights. On The Pile, DoReMi reduces perplexity across all domains compared to the baseline domain weights, even on domains it downweights.
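A minimal sketch of that online update, assuming an exponentiated-gradient style step on the per-domain excess losses followed by renormalization and smoothing toward uniform; the hyperparameter names and values are placeholders, not the paper's exact settings:

```python
import numpy as np

def update_domain_weights(alpha, excess_losses, eta=1.0, smoothing=1e-3):
    """One online update: upweight domains where the proxy's excess loss over the
    reference model is largest, then mix with uniform for stability."""
    alpha = alpha * np.exp(eta * np.clip(excess_losses, 0.0, None))  # exponentiated-gradient step
    alpha = alpha / alpha.sum()                                      # renormalize onto the simplex
    uniform = np.ones_like(alpha) / len(alpha)
    return (1 - smoothing) * alpha + smoothing * uniform             # smooth toward uniform

# DoReMi's final domain weights are the average of alpha over all DRO training steps.
```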

On generative few-shot tasks, DoReMi reaches the baseline's downstream accuracy 2.6x faster than a model trained on The Pile's default domain weights and improves average downstream accuracy by 6.5%. They release the optimized domain weights to improve future LMs trained on The Pile. They also find that DoReMi consistently improves LM training when the sizes of the main model (trained with the optimized domain weights) and the proxy model are varied. On the GLaM dataset, where domain weights tuned on downstream tasks are available, DoReMi even matches the downstream performance of those task-tuned weights despite having no knowledge of the downstream tasks.


Aneesh Tickoo is a Consulting Intern at MarktechPost. She is currently pursuing her BA in Data Science and Artificial Intelligence from Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects that harness the power of machine learning. Her research interest is image processing and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.
