arxiv:2111.10952

ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning

Published on Nov 22, 2021

Abstract

AI-generated summary: Using ExMix, a large-scale collection of diverse NLP tasks, the paper shows that multi-task pre-training at scale can significantly improve model performance, and that ExT5, pre-trained on ExMix together with self-supervised span denoising, outperforms T5 across multiple benchmarks.

Despite the recent success of multi-task learning and transfer learning for natural language processing (NLP), few works have systematically studied the effect of scaling up the number of tasks during pre-training. Towards this goal, this paper introduces ExMix (Extreme Mixture): a massive collection of 107 supervised NLP tasks across diverse domains and task-families. Using ExMix, we study the effect of multi-task pre-training at the largest scale to date, and analyze co-training transfer amongst common families of tasks. Through this analysis, we show that manually curating an ideal set of tasks for multi-task pre-training is not straightforward, and that multi-task scaling can vastly improve models on its own. Finally, we propose ExT5: a model pre-trained using a multi-task objective of self-supervised span denoising and supervised ExMix. Via extensive experiments, we show that ExT5 outperforms strong T5 baselines on SuperGLUE, GEM, Rainbow, Closed-Book QA tasks, and several tasks outside of ExMix. ExT5 also significantly improves sample efficiency while pre-training.
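To make the training setup concrete, here is a minimal Python sketch of the kind of mixture the abstract describes: self-supervised T5-style span denoising blended with supervised tasks cast into a text-to-text format. The task names, example pairs, mixing weight, and simplified span-corruption routine below are illustrative assumptions for exposition only, not ExT5's actual recipe (which mixes 107 ExMix tasks with C4 span denoising at ratios chosen in the paper).

```python
import random

# Hypothetical supervised examples, already cast to text-to-text.
# Task prefixes and example pairs are illustrative, not ExMix's actual data.
SUPERVISED_EXAMPLES = {
    "nli": [
        ("nli premise: A man plays guitar. hypothesis: A person makes music.",
         "entailment"),
    ],
    "summarization": [
        ("summarize: The committee met on Tuesday and approved the budget.",
         "Committee approves budget on Tuesday."),
    ],
}


def span_corrupt(text, corruption_rate=0.15, span_len=3, rng=random):
    """Simplified T5-style span corruption: replace random contiguous spans of
    the input with sentinel tokens; the target lists the dropped spans."""
    tokens = text.split()
    n = len(tokens)
    masked = [False] * n
    budget = max(1, int(n * corruption_rate))
    while budget > 0:
        start = rng.randrange(n)
        length = min(budget, span_len)
        for i in range(start, min(n, start + length)):
            masked[i] = True
        budget -= length
    inp, tgt, sid, i = [], [], 0, 0
    while i < n:
        if masked[i]:
            inp.append(f"<extra_id_{sid}>")
            tgt.append(f"<extra_id_{sid}>")
            while i < n and masked[i]:
                tgt.append(tokens[i])
                i += 1
            sid += 1
        else:
            inp.append(tokens[i])
            i += 1
    return " ".join(inp), " ".join(tgt)


def mixed_pretraining_stream(corpus, denoise_weight=0.5, rng=random):
    """Yield (input, target) pairs, sampling between self-supervised span
    denoising on raw text and supervised text-to-text tasks."""
    tasks = list(SUPERVISED_EXAMPLES)
    while True:
        if rng.random() < denoise_weight:
            yield span_corrupt(rng.choice(corpus), rng=rng)
        else:
            yield rng.choice(SUPERVISED_EXAMPLES[rng.choice(tasks)])


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog near the old river bank"]
    stream = mixed_pretraining_stream(corpus)
    for _ in range(4):
        print(next(stream))
```

In the paper, the balance between the denoising objective and the supervised mixture is a key design choice; the sketch exposes it as a single `denoise_weight` parameter only to show where that choice enters.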

Community

Interesting behind-the-scenes story by @yitayreka!

"Sharing a pretty interesting story about how two research projects can end up with a very different outcome despite almost similar type of “work” being done, just because of research motivation and taste. Here, I was on the side that got rekt so I thought I would share my lessons and introspections. 😂

This was some years ago, when people were scaling up finetuning tasks for training LLMs by collecting many academic datasets. We all know the famous Flan paper by @jasonwei, which was one of the pioneering “instruction tuning” works. However, not many people know that there was another, more low-key effort within Google that did almost all the same work but was pretty under-hyped because it was framed wrongly. That was the ExT5 work that my AI resident @VAribandi and I drove (I take all the blame though).

Despite so much parallel, "almost identical" work being done, and tedious efforts in collecting and processing all those tasks, the framing for ExT5 was off and the research motivation was much less sexy compared to Flan. Note: this was before I joined the Brain team, and Jason and I were on different teams. I also joined the Flan-2 project much later.

I think it was around this time that many people had very similar intuitions (scale up tasks! 📈), but Flan turned out to be the most successful back then. Introspecting a little, I realized the key mistake we made was selling it as a multi-task/transfer thing, while Flan sold the zero-shot capabilities of generative LLMs. Framing it as a T5 variant was also a big blunder on my side.

I dug into the history a bit more and found out that @mrtnjbsm and @Quoclele really influenced the idea of improving zero-shot capabilities, which IMO really made Flan shine and be so impactful. Also note that this was the era when the standard practice was pretraining and then finetuning models to do very well on a few targeted downstream tasks. I think the Flan team had more foresight into the future than we did. Moreover, I was too stuck in the narrow "present moment" of that time: I was focused a lot on architecture research on T5-style models, so I completely overlooked the new capabilities game.

I took away a lot from this: i) who you talk to and brainstorm with in research matters a lot, ii) small details in framing change the impact and reception of the work significantly, and iii) environment matters a lot. I consider myself pretty decent/good at framing research and have pretty solid research taste, but as a researcher one can also be dead wrong sometimes. And ExT5 was one of the moments where I was like, darn, we got rekt. Should have worked with @jasonwei and @Quoclele much earlier.

This was mostly inspired by a recent convo with @huaixiu, where we discussed how Jason was so good at selling/marketing research, and we spent some time reminiscing about how our ExT5 project got completely rekt despite doing very similar things. To be fair, it was not because of Flan, but because the work undersold itself and our research motivation was lacklustre.

Today ExT5 has 200+ citations. Not a complete wreck, but still a far cry from anything to get excited about. Flan has a few thousand, the last I checked.

People don't share the behind-the-scenes of research that much, so I thought this might be interesting too! I learned a lot from @jasonwei about marketing research during the time I worked with him, but in 2022, when I got the chance, I took the opportunity to get my revenge by destroying him 6-0 in competitive Pokémon, so we're even now."

https://x.com/YiTayML/status/1920535379545108835


Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 1

Collections including this paper 0
