Spring 2023 Reading Group

Reading group on Transformers in the lab in Spring 2023

Reading Group Tips

In a reading group, everyone takes turns leading the discussion of a paper each week. Leading the discussion can be as simple as sharing your own annotated notes on Hypothes.is to seed conversation as we go through the paper together, or it can be more involved, such as preparing slides that present your overview of the paper's contributions, highlights, and weak points.


Spring 2023 - Transformers Reading Group

Motivation

AI research has been ongoing since the dawn of computer science itself, and Deep Learning has seen an uninterrupted, accelerating wave of advancing capabilities for over a decade since the public breakthrough of CNNs in 2012. Yet many people, including AI/ML researchers, have been surprised by the abilities of the generative models released since summer 2022 by OpenAI, Facebook, Google and others. These recent systems all rely in various ways on the Transformer model (Vaswani et al., 2017).

Resources

This GitHub page has quite an extensive list of papers and references on the topic, so it seems as good a place as any to start:


See the links and notes on papers we have covered in previous meetings, find the link for the next paper, or look at the planned upcoming and potential future papers. Feel free to suggest other papers or changes to the upcoming order.

Jump to stage: next ~ done ~ upcoming ~ potential

next



done

1. [8] Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
   Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu.
   Arxiv Preprint. 2023.
2. [7] Are Pretrained Convolutions Better than Pretrained Transformers?
   Yi Tay, Mostafa Dehghani, Jai Prakash Gupta, Vamsi Aribandi, Dara Bahri, Zhen Qin, and Donald Metzler.
   In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online. Aug, 2021.
3. [6] LaMDA: Language Models for Dialog Applications
   Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le.
   Arxiv Preprint. 2022.
4. [5] Language Models are Few-Shot Learners
   Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.
   In Advances in Neural Information Processing Systems. Virtual. 2020.
5. [5] Language Models are Unsupervised Multitask Learners
   Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
   OpenAI Blog. 2019.
6. [5] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
   Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer.
   Arxiv Preprint arXiv:1910.13461. 2019.
7. [4] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
   Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
   In Proceedings of NAACL-HLT. 2019.
8. [3] RoBERTa: A Robustly Optimized BERT Pretraining Approach
   Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.
   Arxiv Preprint. Jul, 2019.
9. [3] Improving Language Understanding by Generative Pre-Training
   Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.
   Preprint. 2018.
10. [2] Attention is All you Need
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin.
    In Advances in Neural Information Processing Systems. Long Beach, California, USA. Dec, 2017.
11. [1] Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey
    Dec, 2020.

upcoming

potential

1. [9] Talking About Large Language Models
   Murray Shanahan.
   Arxiv Preprint. Dec, 2022.
2. [99] Galactica: A Large Language Model for Science
   Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic.
   Arxiv Preprint. 2022.