Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
TL;DR
BTM improves in- and out-of-domain perplexities compared to GPT-style Transformer LMs, and the gains grow with the number of domains, suggesting that more aggressive parallelism could be used to efficiently train larger models in future work.
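To make the "embarrassingly parallel" pattern in the title concrete, the sketch below illustrates the branch-train-merge idea: copies of a seed model are trained independently, one per domain, with no communication between branches, and the resulting experts are then combined. Everything here is a toy illustration under stated assumptions, not the authors' implementation: the fake "training" step, the uniform parameter average used as the merge, and all function names are hypothetical stand-ins.

```python
# Minimal sketch of branch-train-merge (illustrative only, not the paper's code).
from multiprocessing import Pool

import numpy as np

SEED_PARAMS = np.zeros(4)  # stand-in for a pretrained seed LM's parameters


def train_on_domain(domain_id: int) -> np.ndarray:
    """Branch: copy the seed parameters, then 'train' on one domain.

    Training is faked with a domain-specific perturbation so the example runs
    anywhere; in BTM each branch is a full LM trained on its own domain corpus
    with no cross-branch communication.
    """
    rng = np.random.default_rng(domain_id)
    return SEED_PARAMS + rng.normal(scale=0.1, size=SEED_PARAMS.shape)


def merge(expert_params: list[np.ndarray]) -> np.ndarray:
    """Merge: combine the experts, shown here as a uniform parameter average.

    The paper also considers combining expert output distributions; the
    average is just the simplest stand-in for a merge step.
    """
    return np.mean(expert_params, axis=0)


if __name__ == "__main__":
    domains = range(8)  # e.g. 8 textual domains, each owning one branch
    with Pool() as pool:
        # Branches never exchange gradients, so they parallelize trivially.
        experts = pool.map(train_on_domain, domains)
    merged = merge(experts)
    print("merged parameters:", merged)
```

Because the branches are fully independent, adding more domains only adds more parallel jobs rather than more synchronization, which is why the TL;DR's observation that gains grow with the number of domains points toward more aggressive parallelism.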