Generated by OpenAI o1 Pro - Deep Research: 20250214
Detail all academic papers relating to LLM, RAG all the way back to the deep learning paper and "attention is all we need". I am looking for a chronological order that I can read. concentrate on Arxiv for example https://arxiv.org/abs/1706.03762 and https://arxiv.org/pdf/1706.03762
To provide a comprehensive chronological list of academic papers related to LLMs (Large Language Models) and RAG (Retrieval-Augmented Generation), starting from foundational deep learning papers like "Attention is All You Need," I have a few clarifying questions:
- Scope: Are you only interested in papers specifically on ArXiv, or should I include other major academic sources if relevant?
- Depth: Should I only list the most influential and foundational papers, or do you want a more exhaustive list with incremental developments?
- Chronology: Do you prefer the papers strictly ordered by publication date, or should I group them by themes (e.g., foundational architectures, scaling laws, fine-tuning, retrieval methods)?
- Format: Would you like a brief summary of each paper along with the link, or just a chronological list with citation details?
- Time Range: Should I go back to all relevant early deep learning papers (e.g., before 2017), or focus on post-transformer developments? Let me know your preferences so I can tailor the research accordingly!
concentrate on Arxiv but recommend any other relevant publication sites if there are papers or details there - that are not on arxiv. Secondarily flag/tag the papers by influential and foundational but show the primary sort in incremental developments. Also try to add the companies the authors worked at - for attribution. Keep the chronology on publication date - but add flagging/tags to mark as for example foundational, scaling laws, fine tuning, rag, retrieval, theory A brief summary with the links and citation would be good. For timeline - yes go back as far as possible - even all the way to the multi layer perceptron of the 1950's, I would think results from deep learning research by Hinton, Goodfellow and Bengio would be good.
I will compile a chronological list of academic papers related to Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), starting from foundational deep learning research. The focus will be on ArXiv, but I will also include key papers from other sources if they are not available on ArXiv.
- Papers will be sorted by publication date to show incremental developments.
- Each paper will be tagged based on its contribution (e.g., foundational, scaling laws, fine-tuning, retrieval, theory, etc.).
- The companies/institutions associated with the authors will be noted for attribution.
- A brief summary along with links and citations will be provided.
- The timeline will extend as far back as relevant, including early work on multilayer perceptrons and key contributions from researchers such as Hinton, Goodfellow, and Bengio.
I will gather this information and let you know once the research is complete.
-
1958 – Frank Rosenblatt: “The Perceptron: A Probabilistic Model…” (Cornell Aeronautical Lab) – [Tag: foundational]. Introduced the perceptron, an early single-layer neural network capable of binary classification. This work laid the groundwork for machine learning by showing how a model could learn weights from data (Perceptron - Wikipedia). Rosenblatt’s perceptron became a cornerstone for neural network research, illustrating how neurons could be trained to recognize patterns.
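For readers who want to see the learning rule concretely, here is a minimal Python/NumPy sketch of a Rosenblatt-style perceptron update: weights are nudged only when a prediction is wrong. The AND task, learning rate, and epoch count are illustrative choices, not details from the 1958 paper.

```python
import numpy as np

# Minimal sketch of the perceptron learning rule (illustrative data, not from the paper).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])          # logical AND: linearly separable, so the rule converges
w, b, lr = np.zeros(2), 0.0, 0.1

for _ in range(20):                  # a few passes over the data
    for xi, target in zip(X, y):
        pred = int(w @ xi + b > 0)   # threshold activation
        w += lr * (target - pred) * xi   # adjust weights only on mistakes
        b += lr * (target - pred)

print([int(w @ xi + b > 0) for xi in X])   # -> [0, 0, 0, 1]
```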
-
1969 – Marvin Minsky & Seymour Papert: Perceptrons (MIT) – [Tag: theory]. A critical analysis of the perceptron’s capabilities and limits. This book proved that single-layer perceptrons cannot solve certain tasks (like XOR), highlighting the need for multi-layer networks (Perceptrons (book) - Wikipedia). Their pessimistic conclusions shifted AI research towards symbolic methods and contributed to an “AI winter” until the multi-layer approach was revisited in the 1980s.
-
1986 – David E. Rumelhart, Geoffrey Hinton, & Ronald Williams: “Learning representations by back-propagating errors” (UC San Diego & Carnegie Mellon) – [Tag: foundational]. Reintroduced and popularized the backpropagation algorithm for training multi-layer neural networks (This week in The History of AI at AIWS.net – David Rumelhart, Geoffrey Hinton, and Ronald Williams published “Learning representations by back-propagating errors” | AIWS.net). This breakthrough demonstrated how hidden layers could be efficiently trained, overcoming earlier limitations. It sparked a resurgence in neural network research, as multi-layer perceptrons could now learn complex non-linear functions. (This week in The History of AI at AIWS.net – David Rumelhart, Geoffrey Hinton, and Ronald Williams published “Learning representations by back-propagating errors” | AIWS.net)
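The following is a compact NumPy sketch of backpropagation on a tiny two-layer network learning XOR, the task single-layer perceptrons cannot solve. Network width, learning rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

# Backpropagation on a tiny 2-layer MLP learning XOR (illustrative hyperparameters).
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.], [1.], [1.], [0.]])

W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: propagate the error gradient layer by layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(0)

print(out.round(2).ravel())   # typically approaches [0, 1, 1, 0]
```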
-
1997 – Sepp Hochreiter & Jürgen Schmidhuber: “Long Short-Term Memory” (TU Munich & IDSIA) – [Tag: foundational]. Proposed the LSTM architecture, a type of recurrent neural network designed to combat the vanishing gradient problem in sequence learning. LSTMs introduced memory cells and gating mechanisms that enabled learning long-term dependencies in sequence data (Long short-term memory - Wikipedia). This innovation became critical for tasks like speech recognition and language modeling by allowing networks to retain information over thousands of time-steps. (Long short-term memory - Wikipedia)
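A minimal sketch of one LSTM cell step is shown below, in the modern formulation that includes a forget gate (the forget gate was a slightly later refinement of the 1997 design). Weights are random and shapes are purely illustrative.

```python
import numpy as np

# One step of an LSTM cell, showing the gating that protects the cell state.
rng = np.random.default_rng(1)
d_in, d_hid = 3, 5
x_t    = rng.normal(size=d_in)   # current input
h_prev = np.zeros(d_hid)         # previous hidden state
c_prev = np.zeros(d_hid)         # previous cell state (the "memory")

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    z = W @ x_t + U @ h_prev + b                      # compute all four gates at once
    i, f, o, g = np.split(z, 4)
    sigm = lambda v: 1 / (1 + np.exp(-v))
    i, f, o, g = sigm(i), sigm(f), sigm(o), np.tanh(g)
    c_t = f * c_prev + i * g                          # forget old memory, write new
    h_t = o * np.tanh(c_t)                            # expose a gated view of memory
    return h_t, c_t

W = rng.normal(size=(4 * d_hid, d_in))
U = rng.normal(size=(4 * d_hid, d_hid))
b = np.zeros(4 * d_hid)
h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, U, b)
print(h_t.shape, c_t.shape)    # (5,) (5,)
```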
-
2003 – Yoshua Bengio et al.: “A Neural Probabilistic Language Model” (University of Montreal) – [Tag: foundational]. Introduced one of the first successful neural network language models. This work proposed learning a distributed word embedding for each word and using a feed-forward neural network to predict the next word in a sequence (Understanding Neural Probabilistic Language Model | De Novo). It demonstrated that neural nets could outperform n-gram models by generalizing to unseen word combinations, kickstarting the use of word embeddings in NLP.
-
2006 – Geoffrey Hinton et al.: “A Fast Learning Algorithm for Deep Belief Nets” (University of Toronto) – [Tag: foundational]. Presented a strategy to train deep neural networks via unsupervised layer-by-layer pre-training of Deep Belief Networks (stacks of Restricted Boltzmann Machines). This greedy algorithm made it feasible to train networks with many layers (A fast learning algorithm for deep belief nets - PubMed). The authors showed that a deep network (after pre-training and fine-tuning) could model complex data distributions (like handwritten digits) and even outperform shallow models on classification tasks (A fast learning algorithm for deep belief nets - PubMed).
-
2006 – Geoffrey Hinton & Ruslan Salakhutdinov: “Reducing the Dimensionality of Data with Neural Networks” (University of Toronto) – [Tag: foundational]. Published in Science in 2006, this influential paper showed how deep autoencoder networks, initialized by unsupervised pre-training, could learn compact low-dimensional codings of data. It complemented the Deep Belief Net approach and helped establish unsupervised pre-training as a practical route to training deep networks. (Source: Hinton’s publications)
(Note: The 2000s also saw the rise of Convolutional Neural Networks for vision (LeCun et al.) and other deep learning advances, but those are outside the direct scope of language models.)
-
2013 – Tomas Mikolov et al.: “Efficient Estimation of Word Representations in Vector Space” (Google) – [Tag: foundational]. Introduced Word2Vec, a pair of novel architectures (Skip-gram and CBOW) to learn continuous vector representations of words from large corpora. The paper showed these word embeddings capture semantic relationships and can be learned efficiently (training on billions of words in hours) ([1301.3781] Efficient Estimation of Word Representations in Vector Space). Word2Vec’s embeddings became a standard tool, enabling systems to represent words in a dense space where similarity reflects meaning. ([1301.3781] Efficient Estimation of Word Representations in Vector Space)
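As a rough illustration of the skip-gram idea, the toy sketch below trains word vectors with a simplified negative-sampling update on a made-up corpus. It is not the optimized Word2Vec tool itself; the corpus, window size, and hyperparameters are invented for the example.

```python
import numpy as np

# Skip-gram with negative sampling on a toy corpus (illustrative only).
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
dim = 16
W_in  = rng.normal(0, 0.1, (len(vocab), dim))   # center-word vectors
W_out = rng.normal(0, 0.1, (len(vocab), dim))   # context-word vectors
sigm = lambda z: 1 / (1 + np.exp(-z))

for _ in range(200):
    for pos, center in enumerate(corpus):
        for off in (-2, -1, 1, 2):                       # context window of 2
            if 0 <= pos + off < len(corpus):
                c, o = idx[center], idx[corpus[pos + off]]
                neg = rng.integers(len(vocab), size=3)   # negative samples (may collide in this toy)
                for tgt, label in [(o, 1.0)] + [(n, 0.0) for n in neg]:
                    v_in, v_out = W_in[c].copy(), W_out[tgt].copy()
                    grad = sigm(v_in @ v_out) - label    # logistic-loss gradient
                    W_out[tgt] -= 0.05 * grad * v_in
                    W_in[c]    -= 0.05 * grad * v_out

# Words that share contexts ("cat"/"dog") tend to drift toward similar vectors.
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos(W_in[idx["cat"]], W_in[idx["dog"]]), 2))
```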
-
2014 – Ilya Sutskever, Oriol Vinyals, Quoc Le: “Sequence to Sequence Learning with Neural Networks” (Google Brain) – [Tag: foundational]. Demonstrated the first end-to-end sequence-to-sequence (seq2seq) learning for machine translation. They used a two-part LSTM: an encoder to convert a source sentence into a fixed-length vector, and a decoder to generate the target sentence from that vector ([1409.3215] Sequence to Sequence Learning with Neural Networks). On an English→French task, their LSTM achieved a translation quality (BLEU score 34.8) on par with traditional phrase-based systems ([1409.3215] Sequence to Sequence Learning with Neural Networks), proving that purely neural approaches could perform complex transductions. ([1409.3215] Sequence to Sequence Learning with Neural Networks)
-
2014 – Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: “Neural Machine Translation by Jointly Learning to Align and Translate” (University of Montreal) – [Tag: foundational]. Introduced the attention mechanism in neural networks. This work removed the bottleneck of encoding an entire sentence into one vector by allowing the decoder to “attend” to different parts of the source sequence during translation ([1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate). The model learns soft alignments (weights) indicating which source words are relevant to each generated word. This attention-based NMT achieved state-of-the-art translation results and the attention mechanism became a paradigm-shifting innovation used in virtually all subsequent LLMs. ([1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate)
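A small NumPy sketch of additive (Bahdanau-style) attention follows: each encoder state is scored against the current decoder state, the scores are softmaxed into alignment weights, and a context vector is formed. Dimensions and weights are random placeholders.

```python
import numpy as np

# Additive attention: score each encoder state against the decoder state,
# softmax into alignment weights, and build a context vector.
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 6, 8, 8, 10      # illustrative sizes
enc_states = rng.normal(size=(T, d_enc))  # one vector per source position
dec_state  = rng.normal(size=d_dec)       # current decoder hidden state

W_a = rng.normal(size=(d_att, d_enc))
U_a = rng.normal(size=(d_att, d_dec))
v_a = rng.normal(size=d_att)

scores = np.array([v_a @ np.tanh(W_a @ h + U_a @ dec_state) for h in enc_states])
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax alignment weights over source positions
context = weights @ enc_states            # weighted sum of encoder states

print(weights.round(2), context.shape)
```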
-
2014 – Ian Goodfellow et al.: “Generative Adversarial Networks” (University of Montreal) – [Tag: foundational]. Proposed the GAN framework, a generative model where two neural networks — a Generator and a Discriminator — are trained in a minimax game. The Generator learns to produce realistic data (originally demonstrated on images) while the Discriminator learns to detect fakes ([1406.2661] Generative Adversarial Networks). This adversarial training approach, though focused on images, influenced generative modeling ideas in NLP and beyond (and Goodfellow’s work earned him recognition as one of the “fathers of deep learning”). ([1406.2661] Generative Adversarial Networks)
-
2017 – Ashish Vaswani et al.: “Attention Is All You Need” (Google Brain/University of Toronto) – [Tag: foundational]. Introduced the Transformer architecture, which relies entirely on self-attention mechanisms and does not use recurrent networks or convolutions. The Transformer achieved superior performance in machine translation, outperforming previous best models by over 2 BLEU points on WMT2014 English→German, with far less training time ([1706.03762] Attention Is All You Need). Its parallelizable architecture and scalability made it the backbone of virtually all modern large language models. ([1706.03762] Attention Is All You Need)
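The core operation of the Transformer, scaled dot-product self-attention, can be sketched in a few lines of NumPy (single head, random toy inputs, no masking or multi-head projections shown):

```python
import numpy as np

# Scaled dot-product self-attention on toy inputs (one head, no masking).
rng = np.random.default_rng(0)
T, d_model, d_k = 5, 16, 16
X = rng.normal(size=(T, d_model))                 # one row per token

W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_k)                   # compare every token with every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
output = weights @ V                              # each token becomes a mixture of all values

print(output.shape)   # (5, 16)
```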
-
2018 – Jacob Devlin et al.: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (Google AI) – [Tag: foundational]. Introduced BERT, a huge leap for NLP. BERT is a bi-directional Transformer pre-trained on massive text via a masked language modeling and next-sentence prediction objective. The result was a single model that could be fine-tuned to achieve state-of-the-art on a wide range of NLP tasks (GLUE, QA, NLI, etc.) ([1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding). BERT’s success validated the pre-train then fine-tune paradigm for language models and led to an explosion of Transformer-based language understanding models. ([1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding)
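To make the masked language modeling objective concrete, here is a simplified sketch of how masked-LM training examples are formed. Real BERT masks about 15% of tokens and sometimes substitutes a random or unchanged token instead of [MASK]; the toy version below raises the masking rate so the short sentence actually gets masked.

```python
import random

# Simplified masked-LM example construction in the spirit of BERT's pre-training.
random.seed(0)
tokens = "the quick brown fox jumps over the lazy dog".split()

inputs, labels = [], []
for tok in tokens:
    if random.random() < 0.3:           # toy rate; BERT uses ~15%
        inputs.append("[MASK]")
        labels.append(tok)              # the model must recover the original token
    else:
        inputs.append(tok)
        labels.append(None)             # no loss at unmasked positions

print(inputs)
print(labels)
```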
-
2018 – Alec Radford et al.: “Improving Language Understanding by Generative Pre-Training” (OpenAI) – [Tag: foundational]. Although not on arXiv, this OpenAI report (GPT-1) demonstrated that a Transformer language model, GPT, pre-trained on unlabeled text in a generative (auto-regressive) manner, could be fine-tuned to outperform task-specific architectures. GPT-1 (117M parameters) showed the power of unsupervised pre-training for downstream NLP tasks. (Source: OpenAI Blog)
-
2019 – Alec Radford et al.: “Language Models are Unsupervised Multitask Learners” (OpenAI) – [Tag: foundational]. This report (GPT-2) scaled up the GPT architecture (to 1.5B parameters) and showed astounding open-ended text generation ability. GPT-2 could generate coherent paragraphs of text and perform rudimentary reading comprehension, translation, and question-answering in a zero-shot fashion. OpenAI initially withheld the full model citing misuse concerns, underscoring both the power and risk of large LMs. (Source: OpenAI Blog)
-
2019 – Zhilin Yang et al.: “XLNet: Generalized Autoregressive Pretraining for Language Understanding” (Carnegie Mellon & Google) – [Tag: foundational]. Proposed a permutation-based language modeling objective that outperformed BERT on many tasks. XLNet showed that autoregressive models (like GPT) can be enhanced to capture bidirectional context while avoiding BERT’s limitations. This further demonstrated creative ways to pre-train language models for stronger performance. (Source: XLNet arXiv paper)
-
2019 – Colin Raffel et al.: “T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” (Google) – [Tag: foundational]. Introduced the T5 model and posed all NLP tasks in a text-to-text format. T5 (with up to 11B parameters) was pre-trained on a massive corpus and then fine-tuned on various tasks, achieving state-of-the-art results. It highlighted the benefit of scaling model size and data and treating every problem (translation, summarization, etc.) in a unified sequence-to-sequence manner. (Source: T5 paper)
-
2020 – Jared Kaplan et al.: “Scaling Laws for Neural Language Models” (OpenAI & JHU) – [Tag: scaling laws]. Empirically measured how model performance improves with scale. This study found that loss follows a power-law decline as model parameters, dataset size, and compute increase, with minimal returns from architecture tweaks ([2001.08361] Scaling Laws for Neural Language Models). Importantly, it showed larger models are more sample-efficient and established guidelines for choosing model size vs. training data for a given compute budget. These scaling laws informed the AI community that simply making models bigger (with more data) yields predictable gains ([2001.08361] Scaling Laws for Neural Language Models).
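The headline empirical finding can be written compactly: test loss falls as a power law in model size N, dataset size D, and compute C, each with a small fitted exponent. The schematic form is below; the constants and exponents are fits reported in the paper and are not reproduced here.

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```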
-
2020 – Tom B. Brown et al.: “Language Models are Few-Shot Learners” (GPT-3) (OpenAI) – [Tag: scaling]. Introduced GPT-3, a 175-billion parameter Transformer, which demonstrated an impressive ability to perform tasks in a zero-shot or few-shot setting (Language Models are Few-Shot Learners). Without gradient updates (only by prompting), GPT-3 could translate, answer questions, and perform basic reasoning by leveraging prompts with a few examples. GPT-3’s few-shot performance on many NLP benchmarks approached or surpassed state-of-the-art, proving that massive scale alone can induce emergent capabilities (Language Models are Few-Shot Learners).
-
2020 – Kelvin Guu et al.: “REALM: Retrieval-Augmented Language Model Pre-Training” (Google Research) – [Tag: retrieval]. Proposed augmenting language models with a differentiable retrieval mechanism. REALM pre-trains a Transformer LM that can consult an external text corpus (Wikipedia) to fill in masked tokens ([2002.08909] REALM: Retrieval-Augmented Language Model Pre-Training). By jointly learning to retrieve and predict, REALM attained strong open-domain QA results, outperforming models that rely purely on parametric memory ([2002.08909] REALM: Retrieval-Augmented Language Model Pre-Training). This was a precursor to the RAG concept, showing that retrieval can make LMs more factual and up-to-date.
-
2020 – Vladimir Karpukhin et al.: “Dense Passage Retrieval (DPR) for Open-Domain Question Answering” (Facebook AI) – [Tag: retrieval]. Introduced DPR, a neural retrieval method using bi-encoders to embed questions and passages in the same vector space. DPR dramatically improved the recall of relevant documents for question answering, outperforming traditional BM25 by 9–19% in top-20 retrieval accuracy ([2004.04906] Dense Passage Retrieval for Open-Domain Question Answering). By providing better passages to reading comprehension models, DPR boosted end-to-end QA performance and became a standard tool for knowledge-augmented NLP tasks. ([2004.04906] Dense Passage Retrieval for Open-Domain Question Answering)
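A sketch of the bi-encoder scoring and in-batch-negative loss at the heart of DPR is shown below; random vectors stand in for the outputs of the real BERT-based question and passage encoders.

```python
import numpy as np

# Bi-encoder retrieval scoring in the style of DPR: questions and passages are
# embedded independently and compared by dot product; training uses in-batch negatives.
rng = np.random.default_rng(0)
B, d = 4, 32                                   # batch of 4 question/passage pairs
q_emb = rng.normal(size=(B, d))                # stand-in question encoder outputs
p_emb = rng.normal(size=(B, d))                # stand-in passage encoder outputs (positives on the diagonal)

scores = q_emb @ p_emb.T                       # similarity of every question to every passage
scores -= scores.max(axis=1, keepdims=True)    # numerical stability for the softmax
log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))            # cross-entropy against each question's matching passage

print(scores.shape, round(float(loss), 3))
```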
-
2020 – Patrick Lewis et al.: “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Facebook AI/UCL) – [Tag: retrieval]. Coined the term RAG (Retrieval-Augmented Generation). This work combined a parametric memory (a pre-trained seq2seq model) with a non-parametric memory (a Wikipedia index accessed via DPR) ([2005.11401] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks). At query time, the model retrieves text passages and conditions its generation on them. RAG achieved state-of-the-art on open-domain QA tasks, outperforming models that either use internal parametric knowledge or a retrieve-then-extract pipeline ([2005.11401] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks). It also produced more factual and specific generation, validating the power of retrieval+LM synergy. ([2005.11401] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks)
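The retrieve-then-generate flow can be sketched schematically as below. The embedding function, corpus, and "generator" are toy stand-ins, not the DPR retriever and BART generator used in the actual paper.

```python
import numpy as np

# Schematic retrieve-then-generate flow (toy stand-ins for the real RAG components).
rng = np.random.default_rng(0)
corpus = [
    "The Transformer architecture was introduced in 2017.",
    "LSTMs use gating to remember long-range context.",
    "RAG conditions a generator on retrieved passages.",
]
doc_emb = rng.normal(size=(len(corpus), 16))       # pretend dense passage embeddings

def embed(text: str) -> np.ndarray:
    # toy "encoder": pseudo-embedding derived from the text hash
    return np.random.default_rng(abs(hash(text)) % (2**32)).normal(size=16)

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = doc_emb @ embed(query)                # dot-product similarity, as in dense retrieval
    return [corpus[i] for i in np.argsort(-scores)[:k]]

def generate(query: str, passages: list[str]) -> str:
    # a real system would run a seq2seq model conditioned on the query plus passages
    return f"Answer to {query!r} grounded in: {passages}"

question = "Who introduced the Transformer?"
print(generate(question, retrieve(question)))
```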
-
2020 – OpenAI: “GPT-3 and Code” – Alongside language tasks, GPT-3’s variants were tested in code generation. This year saw early glimpses of large LMs writing code and reasoning with structured data (a forerunner to OpenAI’s Codex in 2021). It highlighted large LMs’ versatility beyond natural language.
-
2020 – Various Authors: “Vision-Language Models” – Although this list focuses on text, it is worth noting that this period also saw models like CLIP (OpenAI, released in early 2021) bridging vision and language, and T5-based models for multimodal tasks, foreshadowing the multimodal abilities of later LLMs like GPT-4.
-
2021 – Sebastian Borgeaud et al.: “Improving Language Models by Retrieving from Trillions of Tokens” (RETRO) (DeepMind) – [Tag: retrieval]. Introduced RETRO, a 7.5B parameter Transformer that at each generation step retrieves relevant text chunks from a colossal corpus (2 trillion tokens) based on the current context. RETRO showed that a relatively small model with retrieval can match or exceed the performance of models 25× larger (it rivaled GPT-3 175B on the Pile benchmark) ([2112.04426] Improving language models by retrieving from trillions of tokens). After fine-tuning, RETRO also excelled at knowledge-intensive tasks. This work suggested that explicit memory via retrieval can multiply a model’s effective knowledge without massive parameter counts. ([2112.04426] Improving language models by retrieving from trillions of tokens)
-
2021 – OpenAI: “Codex: GPT-3 for Code” – OpenAI fine-tuned GPT-3 on billions of lines of source code to create Codex, capable of generating code from natural language descriptions. Released via the GitHub Copilot partnership, Codex demonstrated the adaptability of LLMs to programming, solving competitive programming problems in few-shot settings. It foreshadowed the later specialization of LLMs in domains like coding.
-
2022 – Long Ouyang et al.: “Training Language Models to Follow Instructions with Human Feedback” (OpenAI) – [Tag: fine-tuning]. Described OpenAI’s InstructGPT models, which align LLMs with human intentions using Reinforcement Learning from Human Feedback (RLHF). They fine-tuned GPT-3 using human-written demonstration and preference data, and showed that a 1.3B-parameter InstructGPT could outperform the 175B GPT-3 on user prompts ([2203.02155] Training language models to follow instructions with human feedback). InstructGPT produced responses that were more helpful, truthful, and less toxic, demonstrating a practical method to make LLMs safer and more aligned with user needs. ([2203.02155] Training language models to follow instructions with human feedback)
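One concrete piece of the RLHF recipe is the pairwise preference loss used to train the reward model; a minimal sketch is below. Random numbers stand in for reward-model scores, and the subsequent PPO policy-optimization step is not shown.

```python
import numpy as np

# Pairwise preference loss for an RLHF reward model: the reward of the response
# humans preferred should exceed the reward of the rejected alternative.
rng = np.random.default_rng(0)
r_chosen   = rng.normal(size=8)     # stand-in reward scores for preferred responses
r_rejected = rng.normal(size=8)     # stand-in reward scores for rejected responses

sigmoid = lambda z: 1 / (1 + np.exp(-z))
loss = -np.mean(np.log(sigmoid(r_chosen - r_rejected)))   # -E[log sigma(r_w - r_l)]

print(round(float(loss), 3))
```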
-
2022 – Hoffmann et al.: “Training Compute-Optimal Large Language Models” (Chinchilla) (DeepMind) – [Tag: scaling laws]. Revisited scaling laws and discovered that many existing large models were under-trained on data. This work argued for a different model-size vs. data trade-off: for a given compute budget, one should use a smaller model and train it on more tokens. They validated this by training Chinchilla (70B params on 1.4T tokens), which outperformed Gopher (280B) and GPT-3 (175B) despite having fewer parameters ([2203.15556] Training Compute-Optimal Large Language Models). This “Chinchilla law” refined our understanding of scaling: optimal performance comes from balancing model size and dataset size, not just scaling parameters alone.
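A back-of-the-envelope version of the compute-optimal sizing, under the commonly quoted approximations that training FLOPs scale as roughly 6·N·D and that the optimum sits near 20 training tokens per parameter:

```python
# Rough compute-optimal sizing in the spirit of Chinchilla's findings.
# Assumptions (approximations, not exact paper values): C ~ 6*N*D and ~20 tokens/parameter.
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5  # C = 6*N*(20*N) => N = sqrt(C/120)
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for flops in (1e21, 1e23, 6e23, 1e25):
    n, d = chinchilla_optimal(flops)
    print(f"C={flops:.0e}: ~{n/1e9:.1f}B params, ~{d/1e12:.2f}T tokens")
```

Under these assumptions, a budget of about 6e23 FLOPs comes out near 70B parameters and 1.4T tokens, which is consistent with the Chinchilla configuration described above.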
-
2022 – Google Brain: “PaLM: Scaling Language Models with Pathways” (Google) – [Tag: scaling]. Introduced PaLM, a 540-billion parameter Transformer, one of the largest at that time. PaLM achieved state-of-the-art results on numerous NLP benchmarks and demonstrated intriguing emergent behaviors (such as step-by-step reasoning when prompted with chain-of-thought). PaLM’s creation under the Pathways system (which allowed efficient parallelism) showcased the engineering feats needed to train models of this scale. (Source: PaLM paper and Google AI blog)
-
2022 – BigScience Collaboration: “BLOOM: A 176B-Parameter Open-Access Multilingual Language Model” – [Tag: foundational]. Released BLOOM, a 176B parameter Transformer model trained on 46 natural and 13 programming languages (BLOOM: A 176B-Parameter Open-Access Multilingual Language Model). Built by an international team of hundreds of researchers, BLOOM was the first truly open LLM of its size, with its weights freely available. It was trained on the French government’s supercomputer over ~3.5 months. BLOOM’s development exemplified a community-driven effort to democratize LLM research, providing an open alternative to proprietary models (BLOOM: A 176B-Parameter Open-Access Multilingual Language Model).
-
2022 – Meta AI: “OPT: Open Pre-trained Transformer” – Meta released OPT-175B, an open-source reproduction of a GPT-3 class model, to academic researchers. While not state-of-the-art, OPT provided transparency into training a large model and further signaled a shift toward openness in LLM development. (Source: Meta AI release)
-
2022 – Aohan Zeng et al.: “GLM-130B: An Open Bilingual Pre-trained Model” (Tsinghua University & Zhipu AI) – Another 100B+ scale open model (130B parameters) supporting both English and Chinese, showing the global efforts in building large LMs. It achieved strong performance and was made available for research, continuing the trend of open-access LLMs. (Source: GLM-130B arXiv)
-
2023 – Hugo Touvron et al.: “LLaMA: Open and Efficient Foundation Language Models” (Meta AI) – [Tag: foundational]. Announced LLaMA, a family of foundation models (7B, 13B, 33B, 65B parameters) trained on only public datasets totaling 1.4 trillion tokens. The key result: LLaMA-13B outperformed GPT-3 (175B) on most benchmarks, and LLaMA-65B was on par with state-of-the-art models like Chinchilla-70B and PaLM-540B ([2302.13971] LLaMA: Open and Efficient Foundation Language Models). By releasing LLaMA to researchers, Meta enabled a wave of innovation (indeed, the weights leaked publicly, spurring countless fine-tuned variants). LLaMA demonstrated that carefully trained mid-sized models can rival much larger ones, emphasizing efficiency and access. ([2302.13971] LLaMA: Open and Efficient Foundation Language Models)
-
2023 – OpenAI: “GPT-4 Technical Report” (OpenAI) – [Tag: scaling]. Introduced GPT-4, a large-scale multimodal model accepting image and text inputs and producing text outputs. GPT-4 demonstrated human-level performance on many professional and academic benchmarks – for example, it scored in the top 10% of test-takers on a simulated bar exam ([2303.08774] GPT-4 Technical Report). It is a Transformer-based model, and OpenAI applied an extensive post-training alignment process (RLHF) to make its behavior more factual and aligned. While full details (like parameter count) weren’t disclosed, GPT-4’s capabilities (such as solving complex problems and understanding images) significantly advanced the state of the art in LLM performance and safety ([2303.08774] GPT-4 Technical Report).
-
2023 – Google: “Bard and PaLM 2” (Google) – Google introduced PaLM 2 (an updated model with improved multilinguality and reasoning; its parameter count was not disclosed) and used it to power Bard, Google’s answer to ChatGPT. PaLM 2 demonstrated enhanced coding skills and reasoning, reflecting refinements in training data and techniques. This marked Google’s deployment of LLMs in consumer-facing products (Google Workspace, Search augmentation, etc.), highlighting real-world impact. (Source: Google I/O 2023 announcements)
-
2023 – Anthropic: “Claude (v1 and v2)” – Anthropic, founded by ex-OpenAI researchers, developed Claude, an AI assistant based on a large proprietary model trained with a technique called “Constitutional AI” (an alignment approach that uses AI feedback guided by a set of written principles in place of much of the human feedback). Claude showed capable performance and fewer harmful outputs, indicating alternative pathways to aligning LLMs. Anthropic’s work suggests that fine-tuning with AI feedback and explicit principles can yield helpful and harmless models. (Source: Anthropic blog)
-
2023 – Meta AI: “LLaMA 2” – An improved version of LLaMA released openly with a favorable license. LLaMA 2 (7B, 13B, 70B) came fine-tuned for chat (through supervised and human-feedback training) and matched the performance of other leading chatbots on many benchmarks. Meta’s open release of LLaMA 2 (including a commercialization license) further pushed the ecosystem toward transparency and wide availability of LLM technology. (Source: LLaMA 2 release paper)
-
2023 – Emergent Tools and Techniques – The community explored Retrieval-Augmented Generation as a service (e.g., tools like LangChain enabling any LLM to use external knowledge bases), tool-use by LLMs (models calling APIs, code interpreters, calculators), and advanced prompt techniques (e.g., Chain-of-Thought prompting, self-consistency, etc.). Researchers also began to study LLM theory (identifying emergent abilities and understanding transformers via mechanistic interpretability) as well as address LLM limitations like hallucinations, leading to a rich field of ongoing research.
Each of the above papers marks a step in the evolution from early neural networks to today’s large-scale, knowledge-equipped language models. This chronology highlights how foundational concepts (like backpropagation and attention), scaling laws, architecture advances (Transformers), massive computing, and retrieval/fine-tuning strategies have all contributed to the powerful LLMs we have now. Researchers from academia and industry (Cornell, MIT, Toronto, Montreal, Google, OpenAI, DeepMind, Meta, Hugging Face, etc.) have all played key roles in this history, which continues to unfold as we push the frontiers of language understanding.
Sources:
- Rosenblatt’s Perceptron (1958) (Perceptron - Wikipedia); Minsky & Papert (1969) (Perceptrons (book) - Wikipedia)
- Backpropagation (1986) (This week in The History of AI at AIWS.net – David Rumelhart, Geoffrey Hinton, and Ronald Williams published “Learning representations by back-propagating errors” | AIWS.net); LSTM (1997) (Long short-term memory - Wikipedia)
- Bengio et al. (2003) (Understanding Neural Probabilistic Language Model | De Novo); Hinton et al. (2006) (A fast learning algorithm for deep belief nets - PubMed)
- Mikolov et al. (2013) ([1301.3781] Efficient Estimation of Word Representations in Vector Space); Sutskever et al. (2014) ([1409.3215] Sequence to Sequence Learning with Neural Networks); Bahdanau et al. (2014) ([1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate); Goodfellow et al. (2014) ([1406.2661] Generative Adversarial Networks)
- Vaswani et al. (2017) ([1706.03762] Attention Is All You Need); Devlin et al. (2018) ([1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding)
- Kaplan et al. (2020) ([2001.08361] Scaling Laws for Neural Language Models); Brown et al. (2020) (Language Models are Few-Shot Learners)
- Guu et al. (2020) ([2002.08909] REALM: Retrieval-Augmented Language Model Pre-Training); Karpukhin et al. (2020) ([2004.04906] Dense Passage Retrieval for Open-Domain Question Answering); Lewis et al. (2020, RAG) ([2005.11401] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks)
- Borgeaud et al. (2021, RETRO) ([2112.04426] Improving language models by retrieving from trillions of tokens); Ouyang et al. (2022, InstructGPT) ([2203.02155] Training language models to follow instructions with human feedback); Hoffmann et al. (2022, Chinchilla) ([2203.15556] Training Compute-Optimal Large Language Models)
- BigScience (2022, BLOOM) (BLOOM: A 176B-Parameter Open-Access Multilingual Language Model); Touvron et al. (2023, LLaMA) ([2302.13971] LLaMA: Open and Efficient Foundation Language Models); OpenAI (2023, GPT-4) ([2303.08774] GPT-4 Technical Report).
In order
I will generate a report listing academic papers related to Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) in chronological order, providing:
- A short summary (concise abstract) for each paper
- Authors and date for citation purposes
- Paper links (preferably from ArXiv when available)
This report will cover foundational deep learning research from early neural networks to the latest developments in LLMs and retrieval techniques.
I will notify you when the report is ready.
-
1950 – Alan Turing: "Computing Machinery and Intelligence" (A. M. Turing – Mind, Oct 1950) – Turing proposes the famous "imitation game" (later known as the Turing Test) as a criterion for machine intelligence. Instead of directly asking “Can machines think?”, Turing suggests an experimental game to judge if a machine can imitate human conversation indistinguishably. This seminal paper laid the conceptual groundwork for artificial intelligence. (Link: DOI 10.1093/mind/LIX.236.433 – Turing, A.M. (1950) Computing Machinery and Intelligence. Mind, 59, 433–460.)
-
1958 – Frank Rosenblatt: "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain" (F. Rosenblatt – Psychological Review, Nov 1958) – Rosenblatt introduces the perceptron, an early single-layer neural network capable of learning to recognize patterns via trial and error. In a famous demo, an IBM 704 computer “taught itself” to distinguish marked cards after 50 training trials, heralded as the first machine that could perceive and recognize aspects of its surroundings without explicit programming (Professor’s perceptron paved the way for AI – 60 years too soon | Cornell Chronicle). The perceptron algorithm demonstrated how a model could adjust its own weights (learn) from errors, foreshadowing later neural networks. (No ArXiv – original journal)
- 1986 – Rumelhart, Hinton & Williams: "Learning Representations by Back-propagating Errors" (D. E. Rumelhart, G. E. Hinton, R. J. Williams – Nature, Oct 1986) – This landmark paper introduced the backpropagation algorithm for training multi-layer neural networks (Learning representations by back-propagating errors - NASA ADS). By “back-propagating” error gradients through hidden layers, the network iteratively adjusts its weights, enabling it to learn complex internal representations (Learning representations by back-propagating errors - NASA ADS). Backpropagation overcame limitations of single-layer perceptrons and sparked a renewal of interest in deep neural networks after the long AI winter. (No ArXiv – original journal)
-
1990 – Jeffrey Elman: "Finding Structure in Time" (J. L. Elman – Cognitive Science, 1990) – Elman demonstrates recurrent neural networks (RNNs) for language processing. He shows that a simple recurrent network can learn temporal structures in sequences, effectively discovering grammatical structure over time. This work illustrated how neural networks could maintain an internal state (context) to process sequences, a precursor to later language models. (No ArXiv – original journal)
-
1997 – Hochreiter & Schmidhuber: "Long Short-Term Memory" (S. Hochreiter, J. Schmidhuber – Neural Computation, 1997) – This paper introduces the Long Short-Term Memory (LSTM) network, which addresses the difficulty of learning long-range dependencies in sequences (Long short-term memory - (Intro to Business Analytics) - Fiveable). LSTM units incorporate gating mechanisms (input, output, forget gates) that regulate information flow, enabling the network to retain relevant information over long time steps and mitigate the vanishing gradient problem in standard RNNs (Long short-term memory - (Intro to Business Analytics) - Fiveable). LSTMs became a foundational architecture for sequence tasks like speech and language modeling. (No ArXiv – original journal)
-
2003 – Bengio et al.: "A Neural Probabilistic Language Model" (Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin – J. Machine Learning Research, 2003) – Bengio and colleagues introduce one of the first neural network-based language models. Their model learns a distributed representation for words (i.e. continuous word embeddings) along with a probability function for word sequences. By mapping each word to a vector in a latent space, the model generalizes to unseen word sequences – if a new sentence contains words with similar embeddings to a known sentence, it assigns it a higher probability. This neural language model significantly outperformed traditional n-gram models and allowed using longer context for prediction. (No ArXiv – JMLR open-access)
-
Late 2000s – Distributional semantics and vector space models – (For completeness: vector space models of word meaning, building on the distributional hypothesis, gained popularity in this period. However, the major breakthrough in this line came in 2013 with Mikolov et al.’s Word2Vec.)
-
2013 – Mikolov et al.: "Efficient Estimation of Word Representations in Vector Space" (T. Mikolov, K. Chen, G. Corrado, J. Dean – arXiv preprint 2013) – This work (Word2Vec) introduces two simple neural architectures (Skip-gram and CBOW) to learn continuous vector representations of words from very large datasets ([1301.3781] Efficient Estimation of Word Representations in Vector Space). The authors demonstrated that their method can produce high-quality word embeddings in less than a day on 1.6B words, and that these embeddings capture syntactic and semantic word relationships, yielding state-of-the-art performance on word similarity tasks ([1301.3781] Efficient Estimation of Word Representations in Vector Space). Notably, Word2Vec achieved these improvements with much lower computational cost than prior neural network models. ([1301.3781] Efficient Estimation of Word Representations in Vector Space)
-
2014 – Sutskever et al.: "Sequence to Sequence Learning with Neural Networks" (I. Sutskever, O. Vinyals, Q. V. Le – NIPS 2014, arXiv 1409.3215) – This paper introduces the sequence-to-sequence (seq2seq) framework for end-to-end learning of mapping one sequence to another ([1409.3215] Sequence to Sequence Learning with Neural Networks). Using a pair of LSTM networks – an encoder and a decoder – the model encodes an input sentence into a fixed-length vector and then decodes it to an output sentence ([1409.3215] Sequence to Sequence Learning with Neural Networks). Seq2seq achieved then state-of-the-art results in machine translation (English–French) and showed that neural networks can directly learn to encode and generate sequences of arbitrary length. It also found tricks (like reversing source sentences) to improve learning of long sequences ([1409.3215] Sequence to Sequence Learning with Neural Networks). ([1409.3215] Sequence to Sequence Learning with Neural Networks)
-
2015 – Bahdanau et al.: "Neural Machine Translation by Jointly Learning to Align and Translate" (D. Bahdanau, K. Cho, Y. Bengio – ICLR 2015, arXiv 1409.0473) – Bahdanau and colleagues introduce the attention mechanism for neural translation models. They show that encoding a whole sentence into a single vector is a bottleneck ([1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate). Instead, their model learns to “soft-search” for relevant parts of the source sentence during decoding, i.e. it aligns each output word with specific input words (via attention weights) ([1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate). This attention-based model outperformed the previous encoder–decoder and produced qualitatively interpretable alignments, marking a significant improvement in neural machine translation. ([1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate)
-
2017 – Vaswani et al.: "Attention Is All You Need" (A. Vaswani et al. – NIPS 2017, arXiv 1706.03762) – This paper introduced the Transformer architecture, which relies solely on self-attention mechanisms and completely removes recurrence and convolution ([1706.03762] Attention Is All You Need). The Transformer processes sequences in parallel and learns contextual relationships via multi-head attention. It achieved superior accuracy on translation tasks (e.g. new state-of-the-art BLEU scores on English-German and English-French) while being more parallelizable and faster to train than RNN-based models ([1706.03762] Attention Is All You Need). Transformers soon became the basis for nearly all modern large language models. ([1706.03762] Attention Is All You Need)
-
2018 – Devlin et al.: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (J. Devlin et al. – NAACL 2019, arXiv 1810.04805) – Google’s BERT model pioneered the idea of bidirectional pre-training on massive text corpora. BERT is a deep Transformer encoder trained on unlabeled text with a masked language model objective (predicting missing words) and next-sentence prediction ([1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding). Unlike earlier generative models, BERT’s bidirectional conditioning (seeing context to left and right) produces powerful context-aware word representations. After pre-training, BERT can be fine-tuned with minimal architecture changes to achieve state-of-the-art results on a wide range of NLP tasks (GLUE, QA, inference) ([1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding). ([1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding)
-
2018 – Radford et al.: "Improving Language Understanding by Generative Pre-Training" (A. Radford et al. – OpenAI Tech Report 2018) – OpenAI’s GPT (Generative Pre-Training) model demonstrated that a Transformer decoder pre-trained on large text can excel at downstream tasks via fine-tuning. GPT-1 was uni-directional (autoregressive) and showed strong results on question answering, commonsense reasoning, etc., using unsupervised pre-training plus supervised fine-tuning. This set the stage for scaling up generative LMs. (No official ArXiv; OpenAI report)
-
2019 – Radford et al.: "Language Models are Unsupervised Multitask Learners" (OpenAI GPT-2, Feb 2019) – GPT-2 (1.5 billion parameters) showed that with sufficient scale, a language model can generate remarkably coherent and diverse text. It was released with a focus on zero-shot evaluation: GPT-2 could perform tasks like translation or summarization without explicit training data by prompt engineering. OpenAI initially withheld the full model citing misuse concerns. GPT-2’s success underscored the paradigm of scaling model size and data for better performance. (OpenAI release, no formal paper)
-
2020 – Brown et al.: "Language Models are Few-Shot Learners" (T. Brown et al. – NeurIPS 2020, arXiv 2005.14165) – This paper announced GPT-3, a 175-billion-parameter Transformer that pushed language modeling to a new scale. GPT-3 demonstrated an astonishing ability to perform tasks in a few-shot setting – it can solve NLP tasks from only a few examples or instructions in the prompt, without any fine-tuning ([2005.14165] Language Models are Few-Shot Learners). By scaling up training (GPT-3 is 10× larger than prior non-sparse LMs) and data, the model achieved near state-of-the-art results on many benchmarks through prompt-based learning alone ([2005.14165] Language Models are Few-Shot Learners). GPT-3’s launch highlighted the power of large unsupervised models and kicked off the era of extremely large language models. ([2005.14165] Language Models are Few-Shot Learners)
-
2020 – Guu et al.: "REALM: Retrieval-Augmented Language Model Pre-Training" (K. Guu et al. – ICML 2020, arXiv 2002.08909) – REALM is a retrieval-augmented language model that incorporates an external text corpus into pre-training ([2002.08909] REALM: Retrieval-Augmented Language Model Pre-Training). Instead of storing all world knowledge in model parameters, REALM learns to retrieve relevant documents (e.g. Wikipedia passages) and attend to them during language modeling. Notably, the retriever is trained jointly with the language model (using masked-LM signals and backpropagating through the retrieval step) ([2002.08909] REALM: Retrieval-Augmented Language Model Pre-Training). Fine-tuned on open-domain QA, REALM outperformed previous models by 4–16% accuracy, while also offering better interpretability (it can cite sources) ([2002.08909] REALM: Retrieval-Augmented Language Model Pre-Training). This was a key early work linking information retrieval with large LM pre-training.
-
2020 – Lewis et al.: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (P. Lewis et al. – NeurIPS 2020, arXiv 2005.11401) – Lewis and colleagues propose Retrieval-Augmented Generation (RAG), a framework that combines a parametric language model with a non-parametric memory of documents ([2005.11401] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks). A neural retriever first fetches textual passages relevant to the query from a large corpus (e.g. Wikipedia), and then a generator (a pre-trained seq2seq Transformer) conditions on those retrieved passages to produce the output ([2005.11401] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks). RAG achieved state-of-the-art results on several open-domain question answering tasks, outperforming models that relied only on fixed parameters ([2005.11401] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks). It also produced more specific and factually accurate generations by grounding its output in retrieved evidence. ([2005.11401] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks)
-
2019/2020 – Khandelwal et al.: "Generalization through Memorization: Nearest Neighbor Language Models" (U. Khandelwal et al. – ICLR 2020, arXiv 1911.00172) – This work introduced the kNN-LM, which augments a pre-trained neural language model with a k-nearest-neighbor retrieval mechanism ([1911.00172] Generalization through Memorization: Nearest Neighbor Language Models). During inference, the model retrieves the most similar past contexts (from the training data or another text corpus) and interpolates their next-word distributions with the base LM’s predictions. This simple plug-in greatly improved language modeling, especially for predicting rare words and facts, essentially by memorizing and recalling specific examples ([1911.00172] Generalization through Memorization: Nearest Neighbor Language Models). kNN-LM showed that an LM could be improved post hoc by adding a retrieval component, without retraining the base model.
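The kNN-LM interpolation itself is simple enough to sketch directly: the base model's next-token distribution is mixed with a distribution induced by retrieved neighbors. Vocabulary size, neighbor distances, and the mixing weight below are toy values.

```python
import numpy as np

# kNN-LM next-token distribution: interpolate the base LM's softmax with a
# distribution induced by nearest-neighbor retrieval over cached contexts.
rng = np.random.default_rng(0)
V = 6                                    # toy vocabulary size
p_lm = rng.dirichlet(np.ones(V))         # stand-in base language-model distribution

# retrieved neighbors: (distance to current context, next token observed there)
neighbors = [(0.4, 2), (0.9, 2), (1.3, 5)]
knn_scores = np.zeros(V)
for dist, token in neighbors:
    knn_scores[token] += np.exp(-dist)   # closer neighbors get more weight
p_knn = knn_scores / knn_scores.sum()

lam = 0.25                               # interpolation weight (tuned on held-out data in the paper)
p_final = lam * p_knn + (1 - lam) * p_lm
print(p_final.round(3), round(float(p_final.sum()), 3))   # still a valid distribution
```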
-
2021 – Borgeaud et al.: "Improving Language Models by Retrieving from Trillions of Tokens" (S. Borgeaud et al. – DeepMind, arXiv 2112.04426, 2022) – DeepMind’s RETRO model pushes retrieval-augmentation to a massive scale. RETRO is an autoregressive Transformer that, for each chunk of text it generates, retrieves similar text chunks from a colossal database of 2 trillion tokens and conditions on them ([2112.04426] Improving language models by retrieving from trillions of tokens). With only 7.5 billion parameters, RETRO matched the performance of much larger models like GPT-3 (175B) on the Pile benchmark by leveraging this external knowledge ([2112.04426] Improving language models by retrieving from trillions of tokens). This approach demonstrated that retrieval can act as a force-multiplier, allowing much smaller models to rival the performance of models 25× their size ([2112.04426] Improving language models by retrieving from trillions of tokens). RETRO can also be “retrofit” to existing pre-trained models, offering a way to upgrade LMs with a factual memory without full retraining. ([2112.04426] Improving language models by retrieving from trillions of tokens)
-
2022 – Chowdhery et al.: "PaLM: Scaling Language Modeling with Pathways" (A. Chowdhery et al. – Google, arXiv 2204.02311) – PaLM is a 540-billion parameter Transformer language model, one of the largest of its time. Trained with Google’s Pathways system, PaLM was a study in scaling laws – it achieved breakthrough performance on hundreds of language understanding and generation tasks in a few-shot setting ([2204.02311] PaLM: Scaling Language Modeling with Pathways). PaLM 540B often outperformed fine-tuned state-of-the-art models and even surpassed average human performance on the BIG-bench benchmark, revealing emergent abilities at scale ([2204.02311] PaLM: Scaling Language Modeling with Pathways). This model underscored the continued benefits of scaling up LMs in terms of capability gains, albeit with high computational cost.
-
2023 – OpenAI: "GPT-4 Technical Report" (OpenAI et al. – arXiv 2303.08774, Mar 2023) – The report on GPT-4 details a large-scale, multimodal model (accepting image and text inputs) that exhibits human-level performance on many academic and professional benchmarks ([2303.08774] GPT-4 Technical Report). GPT-4 significantly improves over its predecessor (GPT-3.5) in both capability and alignment, achieving, for example, top percentiles in bar exams and math competitions. While the model is not open-source and specifics of its size/training are undisclosed, GPT-4’s performance marked a new state-of-the-art for LLMs in 2023, demonstrating advanced reasoning, coding, and knowledge integration (with the ability to handle images as well as text) ([2303.08774] GPT-4 Technical Report). (OpenAI report, later on arXiv)
-
2023 – Touvron et al.: "LLaMA: Open and Efficient Foundation Language Models" (H. Touvron et al. – Meta AI, arXiv 2302.13971) – LLaMA is a suite of open-source foundation LLMs (parameter counts from 7B to 65B) trained on trillions of tokens of publicly available data ([2302.13971] LLaMA: Open and Efficient Foundation Language Models). The key finding is that smaller-high-quality models can match or exceed larger proprietary models: for example, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with DeepMind’s Chinchilla-70B and Google’s PaLM-540B ([2302.13971] LLaMA: Open and Efficient Foundation Language Models). By open-sourcing these efficient models, LLaMA enabled the research community to experiment with large LMs without needing industrial-scale resources. ([2302.13971] LLaMA: Open and Efficient Foundation Language Models)
-
2023 – Touvron et al.: "Llama 2: Open Foundation and Fine-Tuned Chat Models" (H. Touvron et al. – Meta AI, arXiv 2307.09288) – Meta’s LLaMA 2 extends the original LLaMA with both improved pre-trained models and chat-optimized versions. Llama 2-Chat is fine-tuned for dialogue and safety, and achieves performance on par with or exceeding other open-source chat models on most benchmarks ([2307.09288] Llama 2: Open Foundation and Fine-Tuned Chat Models). With up to 70B parameters, Llama 2-Chat is offered as a free, open alternative to closed models, and underwent rigorous safety fine-tuning and human evaluation ([2307.09288] Llama 2: Open Foundation and Fine-Tuned Chat Models). The release of LLaMA 2 (July 2023) significantly spurred open research and applications in large language models, making advanced chat LLMs broadly accessible under a permissive license. ([2307.09288] Llama 2: Open Foundation and Fine-Tuned Chat Models)
-
2023 – Gao et al.: "Retrieval-Augmented Generation for Large Language Models: A Survey" (Y. Gao et al. – arXiv 2312.10997) – This comprehensive survey (late 2023) documents the progression of Retrieval-Augmented Generation (RAG) techniques in tandem with LLM development ([2312.10997] Retrieval-Augmented Generation for Large Language Models: A Survey). It defines the RAG framework’s components – the retriever, the generator, and the augmentation strategy – and reviews state-of-the-art methods in each category. The survey highlights how RAG addresses key challenges of LLMs (like hallucinations and outdated knowledge) by grounding generation on external data ([2312.10997] Retrieval-Augmented Generation for Large Language Models: A Survey), and it outlines open challenges and future directions for integrating retrieval with ever-more-capable large language models. (Survey on arXiv, Dec 2023)