Papers

Artificial Intelligence

Generated by OpenAI o1 Pro - Deep Research: 2025-02-14

Prompt

Detail all academic papers relating to LLM, RAG all the way back to the deep learning paper and "Attention Is All You Need". I am looking for a chronological order that I can read. Concentrate on Arxiv, for example https://arxiv.org/abs/1706.03762 and https://arxiv.org/pdf/1706.03762

Clarify

To provide a comprehensive chronological list of academic papers related to LLMs (Large Language Models) and RAG (Retrieval-Augmented Generation), starting from foundational deep learning papers like "Attention is All You Need," I have a few clarifying questions:

  1. Scope: Are you only interested in papers specifically on ArXiv, or should I include other major academic sources if relevant?
  2. Depth: Should I only list the most influential and foundational papers, or do you want a more exhaustive list with incremental developments?
  3. Chronology: Do you prefer the papers strictly ordered by publication date, or should I group them by themes (e.g., foundational architectures, scaling laws, fine-tuning, retrieval methods)?
  4. Format: Would you like a brief summary of each paper along with the link, or just a chronological list with citation details?
  5. Time Range: Should I go back to all relevant early deep learning papers (e.g., before 2017), or focus on post-transformer developments? Let me know your preferences so I can tailor the research accordingly!

Clarified

concentrate on Arxiv but recommend any other relevant publication sites if there are papers or details there - that are not on arxiv. Secondarily flag/tag the papers by influential and foundational but show the primary sort in incremental developments. Also try to add the companies the authors worked at - for attribution. Keep the chronology on publication date - but add flagging/tags to mark as for example foundational, scaling laws, fine tuning, rag, retrieval, theory. A brief summary with the links and citations would be good. For timeline - yes go back as far as possible - even all the way to the multi layer perceptron of the 1950's, I would think results from deep learning research by Hinton, Goodfellow and Bengio would be good.

Response

I will compile a chronological list of academic papers related to Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), starting from foundational deep learning research. The focus will be on ArXiv, but I will also include key papers from other sources if they are not available on ArXiv.

  • Papers will be sorted by publication date to show incremental developments.
  • Each paper will be tagged based on its contribution (e.g., foundational, scaling laws, fine-tuning, retrieval, theory, etc.).
  • The companies/institutions associated with the authors will be noted for attribution.
  • A brief summary along with links and citations will be provided.
  • The timeline will extend as far back as relevant, including early work on multilayer perceptrons and key contributions from researchers such as Hinton, Goodfellow, and Bengio.

I will gather this information and let you know once the research is complete.

Timeline of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG)

1950s–1970s: Early Neural Network Foundations

  • 1958 – Frank Rosenblatt: “The Perceptron: A Probabilistic Model…” (Cornell Aeronautical Lab) – [Tag: foundational]. Introduced the perceptron, an early single-layer neural network capable of binary classification. This work laid the groundwork for machine learning by showing how a model could learn weights from data (Perceptron - Wikipedia). Rosenblatt’s perceptron became a cornerstone for neural network research, illustrating how neurons could be trained to recognize patterns. (A minimal sketch of the perceptron learning rule follows this list.)

  • 1969 – Marvin Minsky & Seymour Papert: Perceptrons (MIT) – [Tag: theory]. A critical analysis of the perceptron’s capabilities and limits. This book proved that single-layer perceptrons cannot solve certain tasks (like XOR), highlighting the need for multi-layer networks (Perceptrons (book) - Wikipedia). Their pessimistic conclusions shifted AI research towards symbolic methods and contributed to an “AI winter” until the multi-layer approach was revisited in the 1980s.
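
To make the learning rule concrete, here is a minimal sketch of the standard textbook perceptron update in Python/NumPy (not Rosenblatt's original hardware implementation); the AND-gate toy data, learning rate, and function names are illustrative:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Rosenblatt-style perceptron learning rule for labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified: nudge the decision boundary
                w += lr * yi * xi
                b += lr * yi
    return w, b

# linearly separable toy data (logical AND, encoded with -1/+1 labels)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))  # [-1. -1. -1.  1.]
```

The perceptron convergence theorem guarantees this loop terminates on linearly separable data, which is exactly the limitation Minsky and Papert highlighted for problems like XOR.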

1980s–1990s: Backpropagation and Sequence Learning

2000s: Neural Language Models and Deep Learning Resurgence

  • 2003 – Yoshua Bengio et al.: “A Neural Probabilistic Language Model” (University of Montreal) – [Tag: foundational]. Introduced the first successful neural network language model. This work proposed learning a distributed word embedding for each word and using a feed-forward neural network to predict the next word in a sequence (Understanding Neural Probabilistic Language Model | De Novo). It demonstrated that neural nets could outperform n-gram models by generalizing to unseen word combinations, kickstarting the use of word embeddings in NLP. (A forward-pass sketch of this architecture follows this list.)

  • 2006 – Geoffrey Hinton et al.: “A Fast Learning Algorithm for Deep Belief Nets” (University of Toronto) – [Tag: foundational]. Presented a strategy to train deep neural networks via unsupervised layer-by-layer pre-training of Deep Belief Networks (stacks of Restricted Boltzmann Machines). This greedy algorithm made it feasible to train networks with many layers (A fast learning algorithm for deep belief nets - PubMed). The authors showed that a deep network (after pre-training and fine-tuning) could model complex data distributions (like handwritten digits) and even outperform shallow models on classification tasks (A fast learning algorithm for deep belief nets - PubMed).

  • 2006 – Geoffrey Hinton & Ruslan Salakhutdinov: “Reducing the Dimensionality of Data with Neural Networks” (University of Toronto) – [Tag: foundational]. Published in Science, this influential paper showed how a deep autoencoder (a multi-layer network trained to reconstruct its own input) could learn compact, efficient codings of data, outperforming linear methods such as PCA. Together with the Deep Belief Net work, it laid the foundation for unsupervised pre-training and helped trigger the deep learning resurgence. (Source: Hinton’s publications)
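
Below is a minimal forward-pass sketch of the Bengio-style neural language model (embedding lookup, concatenation of the context, tanh hidden layer, softmax over the vocabulary). It omits training and the paper's optional direct input-to-output connections; all sizes and word ids are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n, h = 10000, 64, 3, 128   # vocab size, embedding dim, context length, hidden units

# parameters: embedding table C, hidden layer (H, b1), output layer (U, b2)
C  = rng.normal(scale=0.1, size=(V, d))
H  = rng.normal(scale=0.1, size=(n * d, h))
b1 = np.zeros(h)
U  = rng.normal(scale=0.1, size=(h, V))
b2 = np.zeros(V)

def next_word_probs(context_ids):
    """P(w_t | w_{t-n}, ..., w_{t-1}) via embeddings + tanh hidden layer + softmax."""
    x = C[context_ids].reshape(-1)          # concatenate the n context embeddings
    hidden = np.tanh(x @ H + b1)
    logits = hidden @ U + b2
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()

probs = next_word_probs([12, 7, 421])       # three arbitrary context word ids
print(probs.shape, round(probs.sum(), 6))   # (10000,) 1.0
```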

(Note: The 2000s also saw the rise of Convolutional Neural Networks for vision (LeCun et al.) and other deep learning advances, but those are outside the direct scope of language models.)

Early 2010s: Word Embeddings and Sequence-to-Sequence Models

  • 2013 – Tomas Mikolov et al.: “Efficient Estimation of Word Representations in Vector Space” (Google) – [Tag: foundational]. Introduced Word2Vec, a pair of novel architectures (Skip-gram and CBOW) to learn continuous vector representations of words from large corpora. The paper showed these word embeddings capture semantic relationships and can be learned efficiently (training on billions of words in hours) ([1301.3781] Efficient Estimation of Word Representations in Vector Space). Word2Vec’s embeddings became a standard tool, enabling systems to represent words in a dense space where similarity reflects meaning. (A single skip-gram training step is sketched after this list.)

  • 2014 – Ilya Sutskever, Oriol Vinyals, Quoc Le: “Sequence to Sequence Learning with Neural Networks” (Google Brain) – [Tag: foundational]. Demonstrated the first end-to-end sequence-to-sequence (seq2seq) learning for machine translation. They used a two-part LSTM: an encoder to convert a source sentence into a fixed-length vector, and a decoder to generate the target sentence from that vector ([1409.3215] Sequence to Sequence Learning with Neural Networks). On an English→French task, their LSTM achieved a translation quality (BLEU score 34.8) on par with traditional phrase-based systems ([1409.3215] Sequence to Sequence Learning with Neural Networks), proving that purely neural approaches could perform complex transductions.

  • 2014 – Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: “Neural Machine Translation by Jointly Learning to Align and Translate” (University of Montreal) – [Tag: foundational]. Introduced the attention mechanism in neural networks. This work removed the bottleneck of encoding an entire sentence into one vector by allowing the decoder to “attend” to different parts of the source sequence during translation ([1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate). The model learns soft alignments (weights) indicating which source words are relevant to each generated word. This attention-based NMT achieved state-of-the-art translation results and the attention mechanism became a paradigm-shifting innovation used in virtually all subsequent LLMs.

  • 2014 – Ian Goodfellow et al.: “Generative Adversarial Networks” (University of Montreal) – [Tag: foundational]. Proposed the GAN framework, a generative model in which two neural networks, a Generator and a Discriminator, are trained in a minimax game. The Generator learns to produce realistic data (originally demonstrated on images) while the Discriminator learns to detect fakes ([1406.2661] Generative Adversarial Networks). This adversarial training approach, though focused on images, influenced generative modeling ideas in NLP and beyond (and Goodfellow’s work earned him recognition as one of the “fathers of deep learning”).
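
To illustrate the Word2Vec idea mentioned above, here is a single skip-gram-with-negative-sampling update written from the published objective. The hyperparameters and the `sgns_step` helper are illustrative, not Google's original C implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5000, 100                              # vocabulary size, embedding dimension
W_in  = rng.normal(scale=0.01, size=(V, d))   # "input" (center-word) vectors
W_out = rng.normal(scale=0.01, size=(V, d))   # "output" (context-word) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.025):
    """One skip-gram-with-negative-sampling update for a (center, context) pair.
    Assumes the context id and negative ids are distinct (a simplification)."""
    v = W_in[center]
    ids = np.concatenate(([context], negatives))
    labels = np.zeros(len(ids)); labels[0] = 1.0     # 1 for the true context, 0 for noise
    u = W_out[ids]
    scores = sigmoid(u @ v)
    grad = (scores - labels)[:, None]                # gradient of the logistic loss
    W_in[center] -= lr * (grad * u).sum(axis=0)
    W_out[ids]   -= lr * grad * v

sgns_step(center=42, context=7, negatives=rng.integers(0, V, size=5))
```

Repeating this update over billions of (center, context) pairs is what yields the dense vectors whose geometry reflects word similarity.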

2017–2019: The Transformer and Large-Scale Pre-training

  • 2017 – Ashish Vaswani et al.: “Attention Is All You Need” (Google Brain/University of Toronto) – [Tag: foundational]. Introduced the Transformer architecture, which relies entirely on self-attention mechanisms and does not use recurrent networks or convolutions. The Transformer achieved superior performance in machine translation, outperforming previous best models by over 2 BLEU points on WMT 2014 English→German, with far less training time ([1706.03762] Attention Is All You Need). Its parallelizable architecture and scalability made it the backbone of virtually all modern large language models. (A minimal sketch of scaled dot-product attention, the Transformer’s core operation, follows this list.)

  • 2018 – Jacob Devlin et al.: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (Google AI) – [Tag: foundational]. Introduced BERT, a huge leap for NLP. BERT is a bi-directional Transformer pre-trained on massive text via masked language modeling and next-sentence prediction objectives. The result was a single model that could be fine-tuned to achieve state-of-the-art on a wide range of NLP tasks (GLUE, QA, NLI, etc.) ([1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding). BERT’s success validated the pre-train then fine-tune paradigm for language models and led to an explosion of Transformer-based language understanding models.

  • 2018 – Alec Radford et al.: “Improving Language Understanding by Generative Pre-Training” (OpenAI) – [Tag: foundational]. Although not on arXiv, this OpenAI report (GPT-1) demonstrated that a Transformer language model, GPT, pre-trained on unlabeled text in a generative (auto-regressive) manner, could be fine-tuned to outperform task-specific architectures. GPT-1 (117M parameters) showed the power of unsupervised pre-training for downstream NLP tasks. (Source: OpenAI Blog)

  • 2019 – Alec Radford et al.: “Language Models are Unsupervised Multitask Learners” (OpenAI) – [Tag: foundational]. This report (GPT-2) scaled up the GPT architecture (to 1.5B parameters) and showed astounding open-ended text generation ability. GPT-2 could generate coherent paragraphs of text and perform rudimentary reading comprehension, translation, and question-answering in a zero-shot fashion. OpenAI initially withheld the full model citing misuse concerns, underscoring both the power and risk of large LMs. (Source: OpenAI Blog)

  • 2019 – Zhilin Yang et al.: “XLNet: Generalized Autoregressive Pretraining for Language Understanding” (Carnegie Mellon & Google) – [Tag: foundational]. Proposed a permutation-based language modeling objective that outperformed BERT on many tasks. XLNet showed that autoregressive models (like GPT) can be enhanced to capture bidirectional context while avoiding BERT’s limitations. This further demonstrated creative ways to pre-train language models for stronger performance. (Source: XLNet paper)

  • 2019 – Colin Raffel et al.: “T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” (Google) – [Tag: foundational]. Introduced the T5 model and posed all NLP tasks in a text-to-text format. T5 (with up to 11B parameters) was pre-trained on a massive corpus and then fine-tuned on various tasks, achieving state-of-the-art results. It highlighted the benefit of scaling model size and data and treating every problem (translation, summarization, etc.) in a unified sequence-to-sequence manner. (Source: T5 paper)
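
The Transformer's core computation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V, from the Vaswani et al. paper. Here is a minimal single-head NumPy sketch; the shapes and random data are toy values, and the multi-head projections, masking, and feed-forward layers of a full Transformer are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V (Vaswani et al., 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # attention weights, each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# toy example: 3 query positions attending over 4 key/value positions, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```

Because every position attends to every other position in a single matrix multiplication, the computation parallelizes across the sequence, which is what made Transformers so much faster to train than RNNs.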

2020: The Birth of Modern LLMs and Retrieval-Augmentation

  • 2020 – Jared Kaplan et al.: “Scaling Laws for Neural Language Models” (OpenAI & JHU) – [Tag: scaling laws]. Empirically measured how model performance improves with scale. This study found that loss follows a power-law decline as model parameters, dataset size, and compute increase, with minimal returns from architecture tweaks ([2001.08361] Scaling Laws for Neural Language Models). Importantly, it showed larger models are more sample-efficient and established guidelines for choosing model size vs. training data for a given compute budget. These scaling laws informed the AI community that simply making models bigger (with more data) yields predictable gains ([2001.08361] Scaling Laws for Neural Language Models). (The approximate form of these power laws is shown after this list.)

  • 2020 – Tom B. Brown et al.: “Language Models are Few-Shot Learners” (GPT-3) (OpenAI) – [Tag: scaling]. Introduced GPT-3, a 175-billion parameter Transformer, which demonstrated an impressive ability to perform tasks in a zero-shot or few-shot setting (Language Models are Few-Shot Learners). Without gradient updates (only by prompting), GPT-3 could translate, answer questions, and perform basic reasoning by leveraging prompts with a few examples. GPT-3’s few-shot performance on many NLP benchmarks approached or surpassed state-of-the-art, proving that massive scale alone can induce emergent capabilities (Language Models are Few-Shot Learners).

  • 2020 – Kelvin Guu et al.: “REALM: Retrieval-Augmented Language Model Pre-Training” (Google Research) – [Tag: retrieval]. Proposed augmenting language models with a differentiable retrieval mechanism. REALM pre-trains a Transformer LM that can consult an external text corpus (Wikipedia) to fill in masked tokens ([2002.08909] REALM: Retrieval-Augmented Language Model Pre-Training). By jointly learning to retrieve and predict, REALM attained strong open-domain QA results, outperforming models that rely purely on parametric memory ([2002.08909] REALM: Retrieval-Augmented Language Model Pre-Training). This was a precursor to the RAG concept, showing that retrieval can make LMs more factual and up-to-date.

  • 2020 – Vladimir Karpukhin et al.: “Dense Passage Retrieval (DPR) for Open-Domain Question Answering” (Facebook AI) – [Tag: retrieval]. Introduced DPR, a neural retrieval method using bi-encoders to embed questions and passages in the same vector space. DPR dramatically improved the recall of relevant documents for question answering, outperforming traditional BM25 by 9–19% in top-20 retrieval accuracy ([2004.04906] Dense Passage Retrieval for Open-Domain Question Answering). By providing better passages to reading comprehension models, DPR boosted end-to-end QA performance and became a standard tool for knowledge-augmented NLP tasks.

  • 2020 – Patrick Lewis et al.: “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Facebook AI/UCL) – [Tag: retrieval]. Coined the term RAG (Retrieval-Augmented Generation). This work combined a parametric memory (a pre-trained seq2seq model) with a non-parametric memory (a Wikipedia index accessed via DPR) ([2005.11401] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks). At query time, the model retrieves text passages and conditions its generation on them. RAG achieved state-of-the-art on open-domain QA tasks, outperforming models that either use internal parametric knowledge or a retrieve-then-extract pipeline ([2005.11401] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks). It also produced more factual and specific generation, validating the power of retrieval+LM synergy. (A toy retrieval-augmented pipeline is sketched after this list.)

  • 2020 – OpenAI: “GPT-3 and Code” – Alongside language tasks, GPT-3’s variants were tested in code generation. This year saw early glimpses of large LMs writing code and reasoning with structured data (a forerunner to OpenAI’s Codex in 2021). It highlighted large LMs’ versatility beyond natural language.

  • 2020 – Various Authors: “Vision-Language Models” – Although this timeline focuses on text, it’s worth noting 2020 also saw models like CLIP (OpenAI) bridging vision and language, and T5-based models for multimodal tasks, foreshadowing the multimodal abilities of later LLMs like GPT-4.
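
The scaling-law results above are often summarized by power-law fits of roughly the form below, where N is the number of (non-embedding) parameters, D the dataset size in tokens, and C the training compute; N_c, D_c, C_c and the exponents are empirical constants fit in the paper (the exponents are small, roughly in the 0.05–0.1 range):

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```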
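
To illustrate the retrieval-augmentation pattern shared by DPR and RAG, here is a toy end-to-end sketch: passages are embedded offline, the question is embedded the same way, top passages are selected by dot-product similarity, and the generator is conditioned on them via the prompt. The `embed` function is a deliberately crude stand-in (hashed bag-of-words); real systems use trained dual encoders and a neural generator:

```python
import numpy as np

def embed(texts, dim=256):
    """Hypothetical stand-in encoder: a real DPR setup uses trained question/passage
    BERT encoders. Here, a hashed bag-of-words vector is used purely for illustration."""
    out = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            out[i, hash(tok) % dim] += 1.0
    return out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-9)

corpus = [
    "The Transformer was introduced in 2017 by Vaswani et al.",
    "LSTMs use gating to capture long-range dependencies.",
    "Dense Passage Retrieval embeds questions and passages in one vector space.",
]
passage_vecs = embed(corpus)                      # indexed once, offline

def retrieve(question, k=2):
    q = embed([question])[0]
    scores = passage_vecs @ q                     # dot-product similarity (DPR-style)
    top = np.argsort(-scores)[:k]
    return [corpus[i] for i in top]

def rag_prompt(question):
    """RAG in miniature: condition the generator on the retrieved passages."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(rag_prompt("Who introduced the Transformer?"))
```

The design point is the split between non-parametric memory (the passage index, which can be updated without retraining) and parametric memory (the generator's weights).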

2021–2022: Refinement – Retrieval, Efficiency, and Alignment

  • 2021 – Sebastian Borgeaud et al.: “Improving Language Models by Retrieving from Trillions of Tokens” (RETRO) (DeepMind) – [Tag: retrieval]. Introduced RETRO, a 7.5B parameter Transformer that at each generation step retrieves relevant text chunks from a colossal corpus (2 trillion tokens) based on the current context. RETRO showed that a relatively small model with retrieval can match or exceed the performance of models 25× larger (it rivaled GPT-3 175B on the Pile benchmark) ([2112.04426] Improving language models by retrieving from trillions of tokens). After fine-tuning, RETRO also excelled at knowledge-intensive tasks. This work suggested that explicit memory via retrieval can multiply a model’s effective knowledge without massive parameter counts.

  • 2021 – OpenAI: “Codex: GPT-3 for Code” – OpenAI fine-tuned GPT-3 on billions of lines of source code to create Codex, capable of generating code from natural language descriptions. Released via the GitHub Copilot partnership, Codex demonstrated the adaptability of LLMs to programming, solving competitive programming problems in few-shot settings. It foreshadowed the later specialization of LLMs in domains like coding.

  • 2022 – Long Ouyang et al.: “Training Language Models to Follow Instructions with Human Feedback” (OpenAI) – [Tag: fine-tuning]. Described OpenAI’s InstructGPT models, which align LLMs with human intentions using Reinforcement Learning from Human Feedback (RLHF). They fine-tuned GPT-3 using human-written demonstration and preference data, and showed that a 1.3B-parameter InstructGPT could outperform the 175B GPT-3 on user prompts ([2203.02155] Training language models to follow instructions with human feedback). InstructGPT produced responses that were more helpful, truthful, and less toxic, demonstrating a practical method to make LLMs safer and more aligned with user needs.

  • 2022 – Hoffmann et al.: “Training Compute-Optimal Large Language Models” (Chinchilla) (DeepMind) – [Tag: scaling laws]. Revisited scaling laws and discovered that many existing large models were under-trained on data. This work argued for a different model-size vs. data trade-off: for a given compute budget, one should use a smaller model and train it on more tokens. They validated this by training Chinchilla (70B params on 1.4T tokens), which outperformed Gopher (280B) and GPT-3 (175B) despite fewer parameters ([2203.15556] Training Compute-Optimal Large Language Models). This “Chinchilla Law” refined our understanding of scaling: optimal performance comes from balancing model size and dataset size, not just scaling parameters alone. (The compute-optimal relation is summarized after this list.)

  • 2022 – Google Brain: “PaLM: Scaling Language Models with Pathways” (Google) – [Tag: scaling]. Introduced PaLM, a 540-billion parameter Transformer, one of the largest at that time. PaLM achieved state-of-the-art results on numerous NLP benchmarks and demonstrated intriguing emergent behaviors (such as step-by-step reasoning when prompted with chain-of-thought). PaLM’s creation under the Pathways system (which allowed efficient parallelism) showcased the engineering feats needed to train models of this scale. (Source: PaLM paper and Google AI blog)

  • 2022 – BigScience Collaboration: “BLOOM: A 176B-Parameter Open-Access Multilingual Language Model” – [Tag: foundational]. Released BLOOM, a 176B parameter Transformer model trained on 46 natural and 13 programming languages (BLOOM: A 176B-Parameter Open-Access Multilingual Language Model). Built by an international team of hundreds of researchers, BLOOM was the first truly open LLM of its size, with its weights freely available. It was trained on the French government’s supercomputer over ~3.5 months. BLOOM’s development exemplified a community-driven effort to democratize LLM research, providing an open alternative to proprietary models (BLOOM: A 176B-Parameter Open-Access Multilingual Language Model).

  • 2022 – Meta AI: “OPT: Open Pre-trained Transformer” – Meta released OPT-175B, an open-source reproduction of a GPT-3 class model, to academic researchers. While not state-of-the-art, OPT provided transparency into training a large model and further signaled a shift toward openness in LLM development. (Source: Meta AI release)

  • 2022 – Aohan Zeng et al.: “GLM-130B” (Tsinghua University & Beijing Academy) – Another 100B+ scale open model (130B parameters) supporting both English and Chinese, showing the global efforts in building large LMs. It achieved strong performance and was made available for research, continuing the trend of open-access LLMs. (Source: GLM-130B arXiv)
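
The Chinchilla compute-optimal result can be summarized as follows (approximate exponents from the paper's fits); in practice it works out to roughly 20 training tokens per parameter, consistent with Chinchilla's own 70B parameters trained on 1.4T tokens:

```latex
N_{\text{opt}}(C) \propto C^{a}, \qquad D_{\text{opt}}(C) \propto C^{b}, \qquad a \approx b \approx 0.5
```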

2023: Frontier Models and Multimodality

  • 2023 – Hugo Touvron et al.: “LLaMA: Open and Efficient Foundation Language Models” (Meta AI) – [Tag: foundational]. Announced LLaMA, a family of foundation models (7B, 13B, 33B, 65B parameters) trained on only public datasets totaling 1.4 trillion tokens. The key result: LLaMA-13B outperformed GPT-3 (175B) on most benchmarks, and LLaMA-65B was on par with state-of-the-art models like Chinchilla-70B and PaLM-540B ([2302.13971] LLaMA: Open and Efficient Foundation Language Models). By releasing LLaMA to researchers, Meta enabled a wave of innovation (indeed, the weights leaked publicly, spurring countless fine-tuned variants). LLaMA demonstrated that carefully trained mid-sized models can rival much larger ones, emphasizing efficiency and access. ([2302.13971] LLaMA: Open and Efficient Foundation Language Models)

  • 2023 – OpenAI: “GPT-4 Technical Report” (OpenAI) – [Tag: scaling]. Introduced GPT-4, a large-scale multimodal model accepting image and text inputs and producing text outputs. GPT-4 demonstrated human-level performance on many professional and academic benchmarks – for example, it scored in the top 10% of test-takers on a simulated bar exam ([2303.08774] GPT-4 Technical Report). It is a Transformer-based model, and OpenAI applied an extensive post-training alignment process (RLHF) to make its behavior more factual and aligned. While full details (like parameter count) weren’t disclosed, GPT-4’s capabilities (such as solving complex problems and understanding images) significantly advanced the state of the art in LLM performance and safety ([2303.08774] GPT-4 Technical Report).

  • 2023 – Google: “Bard and PaLM 2” (Google) – Google introduced PaLM 2 (an updated model with improved training and multilinguality; its parameter count was not disclosed) and used it to power Bard, Google’s answer to ChatGPT. PaLM 2 demonstrated enhanced coding skills and reasoning, reflecting refinements in training data and techniques. This marked Google’s deployment of LLMs in consumer-facing products (Google Workspace, Search augmentation, etc.), highlighting real-world impact. (Source: Google I/O 2023 announcements)

  • 2023 – Anthropic: “Claude (v1 and v2)” – Anthropic, founded by ex-OpenAI researchers, developed Claude, an AI assistant based on a large proprietary model trained with a technique called “Constitutional AI” (an alignment approach that uses AI feedback guided by a set of written principles, rather than relying solely on direct human feedback). Claude showed capable performance and fewer harmful outputs, indicating alternative pathways to aligning LLMs. Anthropic’s work suggests that careful fine-tuning using AI feedback and principles can yield helpful and harmless models. (Source: Anthropic blog)

  • 2023 – Meta AI: “LLaMA 2” – An improved version of LLaMA released openly with a favorable license. LLaMA 2 (7B, 13B, 70B) came fine-tuned for chat (through supervised and human-feedback training) and matched the performance of other leading chatbots on many benchmarks. Meta’s open release of LLaMA 2 (including a commercialization license) further pushed the ecosystem toward transparency and wide availability of LLM technology. (Source: LLaMA 2 release paper)

  • 2023 – Emergent Tools and Techniques – The community explored Retrieval-Augmented Generation as a service (e.g., tools like LangChain enabling any LLM to use external knowledge bases), tool use by LLMs (models calling APIs, code interpreters, calculators), and advanced prompt techniques such as Chain-of-Thought prompting (Wei et al., 2022) and self-consistency (a small chain-of-thought prompt is shown below). Researchers also began to study LLM theory (identifying emergent abilities and understanding transformers via mechanistic interpretability) as well as address LLM limitations like hallucinations, leading to a rich field of ongoing research.
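
As a small illustration of chain-of-thought prompting (in the style of Wei et al., 2022): the few-shot exemplar includes intermediate reasoning steps that the model is expected to imitate before giving its final answer. The second question here is made up for this example:

```python
# Chain-of-thought prompting: the exemplar answer spells out its reasoning,
# so the model continues in the same step-by-step pattern for the new question.
prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: A library had 120 books and received 3 boxes of 40 books each. How many books does it have now?
A:"""
print(prompt)  # this string would be sent to an LLM as the prompt
```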


Each of the above papers marks a step in the evolution from early neural networks to today’s large-scale, knowledge-equipped language models. This chronology highlights how foundational concepts (like backpropagation and attention), scaling laws, architecture advances (Transformers), massive computing, and retrieval/fine-tuning strategies have all contributed to the powerful LLMs we have now. Researchers from academia and industry (Cornell, MIT, Toronto, Montreal, Google, OpenAI, DeepMind, Meta, Hugging Face, etc.) have all played key roles in this history, which continues to unfold as we push the frontiers of language understanding.


In order

I will generate a report listing academic papers related to Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) in chronological order, providing:

  • A short summary (concise abstract) for each paper
  • Authors and date for citation purposes
  • Paper links (preferably from ArXiv when available)

This report will cover foundational deep learning research from early neural networks to the latest developments in LLMs and retrieval techniques.

I will notify you when the report is ready.

Chronological Timeline of LLMs and RAG Research

1950s: Early Foundations in AI and Learning

1980s: Neural Network Resurgence

  • 1986 – Rumelhart, Hinton & Williams: "Learning Representations by Back-propagating Errors" (D. E. Rumelhart, G. E. Hinton, R. J. Williams – Nature, Oct 1986) – This landmark paper introduced the backpropagation algorithm for training multi-layer neural networks (Learning representations by back-propagating errors - NASA ADS). By “back-propagating” error gradients through hidden layers, the network iteratively adjusts its weights, enabling it to learn complex internal representations (Learning representations by back-propagating errors - NASA ADS). Backpropagation overcame limitations of single-layer perceptrons and sparked a renewal of interest in deep neural networks after the long AI winter. (The standard update equations are shown below.) (No ArXiv – original journal)
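
In the standard textbook formulation (not the paper's exact notation), for a feed-forward network with pre-activations z^(l) = W^(l) a^(l−1) + b^(l) and activations a^(l) = σ(z^(l)), the backpropagated error signal and gradient-descent weight update are:

```latex
\delta^{(L)} = \nabla_{a^{(L)}} E \odot \sigma'\!\left(z^{(L)}\right), \qquad
\delta^{(l)} = \left(W^{(l+1)}\right)^{\top} \delta^{(l+1)} \odot \sigma'\!\left(z^{(l)}\right)
```

```latex
\frac{\partial E}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^{\top}, \qquad
W^{(l)} \leftarrow W^{(l)} - \eta \, \frac{\partial E}{\partial W^{(l)}}
```

The key insight is the recursive use of the chain rule: the error at layer l is computed from the error at layer l+1, so gradients for all layers are obtained in a single backward pass.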

1990s: Sequence Learning and Memory

  • 1990 – Jeffrey Elman: "Finding Structure in Time" (J. L. Elman – Cognitive Science, 1990) – Elman demonstrates recurrent neural networks (RNNs) for language processing. He shows that a simple recurrent network can learn temporal structures in sequences, effectively discovering grammatical structure over time. This work illustrated how neural networks could maintain an internal state (context) to process sequences, a precursor to later language models. (No ArXiv – original journal)

  • 1997 – Hochreiter & Schmidhuber: "Long Short-Term Memory" (S. Hochreiter, J. Schmidhuber – Neural Computation, 1997) – This paper introduces the Long Short-Term Memory (LSTM) network, which addresses the difficulty of learning long-range dependencies in sequences (Long short-term memory - (Intro to Business Analytics) - Fiveable). LSTM units incorporate gating mechanisms (input, output, forget gates) that regulate information flow, enabling the network to retain relevant information over long time steps and mitigate the vanishing gradient problem in standard RNNs (Long short-term memory - (Intro to Business Analytics) - Fiveable). LSTMs became a foundational architecture for sequence tasks like speech and language modeling. (The commonly used gate equations are shown below.) (No ArXiv – original journal)
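
The commonly used LSTM cell equations are given below, where σ is the logistic sigmoid and ⊙ is element-wise multiplication. (Note that the forget gate f_t was a later refinement by Gers et al., 1999; the 1997 original used only input and output gates.)

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), &\quad i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i),\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), &\quad \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, &\quad h_t &= o_t \odot \tanh(c_t).
\end{aligned}
```

Because the cell state c_t is updated additively (gated copy of the previous state plus gated new content), gradients can flow across many time steps without vanishing, which is exactly the long-range-dependency problem the paper set out to solve.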

2000s: Early Neural Language Models and Embeddings

  • 2003 – Bengio et al.: "A Neural Probabilistic Language Model" (Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin – J. Machine Learning Research, 2003) – Bengio and colleagues introduce one of the first neural network-based language models (Bengio et al., 2003). Their model learns a distributed representation for words (i.e. continuous word embeddings) along with a probability function for word sequences (Bengio et al., 2003). By mapping each word to a vector in a latent space, the model generalizes to unseen word sequences – if a new sentence contains words with similar embeddings to a known sentence, it assigns it a higher probability (Bengio et al., 2003). This neural language model significantly outperformed traditional n-gram models and allowed using longer context for prediction (Bengio et al., 2003). (No ArXiv – JMLR open-access)

  • Late 2000s – Semantic vector space models – (For completeness: in the late 2000s, vector space models for semantics gained popularity. However, the major breakthrough in this line came in 2013 with Mikolov et al.’s Word2Vec.)

  • 2013 – Mikolov et al.: "Efficient Estimation of Word Representations in Vector Space" (T. Mikolov, K. Chen, G. Corrado, J. Dean – arXiv preprint 2013) – This work (Word2Vec) introduces two simple neural architectures (Skip-gram and CBOW) to learn continuous vector representations of words from very large datasets ([1301.3781] Efficient Estimation of Word Representations in Vector Space). The authors demonstrated that their method can produce high-quality word embeddings in less than a day on 1.6B words, and that these embeddings capture syntactic and semantic word relationships, yielding state-of-the-art performance on word similarity tasks ([1301.3781] Efficient Estimation of Word Representations in Vector Space). Notably, Word2Vec achieved these improvements with much lower computational cost than prior neural network models.

2010s: Sequence-to-Sequence Models and Transformers

  • 2014 – Sutskever et al.: "Sequence to Sequence Learning with Neural Networks" (I. Sutskever, O. Vinyals, Q. V. Le – NIPS 2014, arXiv 1409.3215) – This paper introduces the sequence-to-sequence (seq2seq) framework for end-to-end learning of mapping one sequence to another ([1409.3215] Sequence to Sequence Learning with Neural Networks). Using a pair of LSTM networks – an encoder and a decoder – the model encodes an input sentence into a fixed-length vector and then decodes it to an output sentence ([1409.3215] Sequence to Sequence Learning with Neural Networks). Seq2seq achieved then state-of-the-art results in machine translation (English–French) and showed that neural networks can directly learn to encode and generate sequences of arbitrary length. It also found tricks (like reversing source sentences) to improve learning of long sequences ([1409.3215] Sequence to Sequence Learning with Neural Networks).

  • 2015 – Bahdanau et al.: "Neural Machine Translation by Jointly Learning to Align and Translate" (D. Bahdanau, K. Cho, Y. Bengio – ICLR 2015, arXiv 1409.0473) – Bahdanau and colleagues introduce the attention mechanism for neural translation models. They show that encoding a whole sentence into a single vector is a bottleneck ([1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate). Instead, their model learns to “soft-search” for relevant parts of the source sentence during decoding, i.e. it aligns each output word with specific input words (via attention weights) ([1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate). This attention-based model outperformed the previous encoder–decoder and produced qualitatively interpretable alignments, marking a significant improvement in neural machine translation.

  • 2017 – Vaswani et al.: "Attention Is All You Need" (A. Vaswani et al. – NIPS 2017, arXiv 1706.03762) – This paper introduced the Transformer architecture, which relies solely on self-attention mechanisms and completely removes recurrence and convolution ([1706.03762] Attention Is All You Need). The Transformer processes sequences in parallel and learns contextual relationships via multi-head attention. It achieved superior accuracy on translation tasks (e.g. new state-of-the-art BLEU scores on English-German and English-French) while being more parallelizable and faster to train than RNN-based models ([1706.03762] Attention Is All You Need). Transformers soon became the basis for nearly all modern large language models.

  • 2018 – Devlin et al.: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (J. Devlin et al. – NAACL 2019, arXiv 1810.04805) – Google’s BERT model pioneered the idea of bidirectional pre-training on massive text corpora. BERT is a deep Transformer encoder trained on unlabeled text with a masked language model objective (predicting missing words) and next-sentence prediction ([1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding). Unlike earlier generative models, BERT’s bidirectional conditioning (seeing context to left and right) produces powerful context-aware word representations. After pre-training, BERT can be fine-tuned with minimal architecture changes to achieve state-of-the-art results on a wide range of NLP tasks (GLUE, QA, inference) ([1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding).

  • 2018 – Radford et al.: "Improving Language Understanding by Generative Pre-Training" (A. Radford et al. – OpenAI Tech Report 2018) – OpenAI’s GPT (Generative Pre-Training) model demonstrated that a Transformer decoder pre-trained on large text can excel at downstream tasks via fine-tuning. GPT-1 was uni-directional (autoregressive) and showed strong results on question answering, commonsense reasoning, etc., using unsupervised pre-training plus supervised fine-tuning. This set the stage for scaling up generative LMs. (No official ArXiv; OpenAI report)

  • 2019 – Radford et al.: "Language Models are Unsupervised Multitask Learners" (OpenAI GPT-2, Feb 2019) – GPT-2 (1.5 billion parameters) showed that with sufficient scale, a language model can generate remarkably coherent and diverse text. It was released with a focus on zero-shot evaluation: GPT-2 could perform tasks like translation or summarization without explicit training data by prompt engineering. OpenAI initially withheld the full model citing misuse concerns. GPT-2’s success underscored the paradigm of scaling model size and data for better performance. (OpenAI release, no formal paper)

2020s: Large Language Models & Retrieval-Augmented Generation

Genetic Algorithms | Evolutionary Computation