Chatbots For Social Change/Theory of LLMs

Resources
Introductory videos
 * What are LLMs by Google - A simple 5 minute introduction to what LLMs do.
 * What are LLMs by Apple (WWDC) - A great introduction to embeddings, tagging, and LLMs, in the first 6m. Then moves to the main topic, multilingual models.
 * Build an LLM from scratch - A good video introduction, paired with a blog, conceptualizing the process of going from zero to full-scale LLM ($1M later!).
 * Microsoft's infrastructure for training ChatGPT and similar high-throughput applications.
    * Uses DeepSpeed to optimize the training
    * InfiniBand for very high network throughput
    * ONNX to move networks around (also see CRIU for an idea of how checkpointing works)
    * NVIDIA GPUs with Tensor Cores for compute

Lectures
 * Introduction to Neural Networks - Stanford CS224N. A nice mathematical overview of Neural Networks.
 * Scaling Language Models - Stanford CS224N Guest Lecture. A general outlook of the rise in abilities of LLMs.
 * Building Knowledge Representation - Stanford CS224N Guest Lecture. Very useful to understand our methods for vector retrieval, but from a more general perspective.
    * Dot-product similarity admits fast (sub-linear) nearest-neighbor search algorithms.
    * Re-ranking is often necessary, because the dot product alone is not very expressive.
    * Not all sets of vectors are easily indexed; some are "pathological," and to improve performance it can be beneficial to "spread them out."
 * Socially Intelligent NLP Systems - Nice! A deep-dive on how society impinges on language, and how that buggers up our models.
 * LangChain vs. Assistants API - a nice overview of two interfaces for deeper chatbot computation.
 * Emerging architectures for LLM applications, a look from the enterprise side, of architectures and their use. Covers RAG with vector search, Assistants, and general workflow of refining LLM models.
 * GPT from scratch - fantastic introduction to ChatGPT, understanding exactly how it works (Torch), by Andrej Karpathy.
 * Tokenization in GPT - a great intro by Andrej Karpathy, who has many other good lectures in this realm.

Courses
 * NYU Deep Learning - A fully-fledged online course covering many advanced topics, such as Attention and the Transformer, Graph Convolutional Networks, and Deep Learning for Structured Prediction.
 * The textbook Understanding Deep Learning has accompanying exercises, and lecture slides

Textbooks
 * A Hacker's Guide to Language Models, by Jeremy Howard
 * Practical Deep Learning for Coders
 * Understanding Deep Learning, with this nice introduction talk
 * Dahl, D. A. (2023). Natural language understanding with Python: Combine natural language technology, deep learning, and large language models to create human-like language comprehension in computer systems. Packt Publishing.
 * Sinan Ozdemir. (2023). Quick Start Guide to Large Language Models: Strategies and Best Practices for Using ChatGPT and Other LLMs. Addison-Wesley Professional.
 * Zhao, W. X. et al. (2023). A Survey of Large Language Models (arXiv:2303.18223). arXiv.
    * Not exactly a textbook, but at 122 pages of dense referenced material it packs a punch, and should not be considered a resource to be consumed in one sitting.

Reranking
Reranking in the context of information retrieval is a two-step process used to enhance the relevance of search results. Here’s how it typically works:


 * First-Stage Retrieval: In the initial phase, a broad set of documents is retrieved using a fast and efficient method. This is often done using embedding-based retrieval, where documents and queries are represented as vectors in a multi-dimensional space. The aim here is to cast a wide net and retrieve a large candidate set of documents quickly and with relatively low computational cost.
 * Second-Stage Reranking: The documents retrieved in the first stage are then re-evaluated in the second stage to improve the precision of the search results. This stage involves a more computationally intensive algorithm, often powered by a Language Model (like an LLM), which takes the context of the search query more thoroughly into account. This step reorders (reranks) the results from the first stage, promoting more relevant documents to higher positions and demoting less relevant ones.

The reranking step is a trade-off between the relevance of the search results and the computational resources required. By using it as a second stage, systems aim to balance the speed and efficiency of embedding-based retrieval with the depth and relevance of LLM-powered retrieval. This combined approach can yield a set of results that are both relevant and produced within an acceptable timeframe and cost.
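The two stages described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production pipeline: `embed` is a bag-of-words stand-in for a learned embedding model, and `rerank` uses shared word pairs as a cheap stand-in for an expensive LLM or cross-encoder scorer. All function names here are hypothetical.

```python
def embed(text, vocab):
    """Toy embedding (assumption): one word count per vocabulary entry."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def first_stage(query, docs, k):
    """Fast, coarse retrieval: rank all documents by dot product with the query."""
    vocab = sorted({w for text in docs + [query] for w in text.lower().split()})
    q = embed(query, vocab)
    ranked = sorted(docs, key=lambda d: dot(q, embed(d, vocab)), reverse=True)
    return ranked[:k]

def bigrams(text):
    words = text.lower().split()
    return set(zip(words, words[1:]))

def rerank(query, candidates):
    """Slower, finer scoring applied only to the candidates.
    Word order matters here, unlike in the bag-of-words first stage."""
    q = bigrams(query)
    return sorted(candidates, key=lambda d: len(q & bigrams(d)), reverse=True)

docs = [
    "man bites dog",
    "dog bites man in the park",
    "cat sleeps on the mat",
]
query = "dog bites man"
candidates = first_stage(query, docs, k=2)   # wide net, cheap scoring
results = rerank(query, candidates)          # precise ordering, costly scoring
```

The first stage cannot tell "man bites dog" from "dog bites man" because both contain the same words; the second stage, which looks at word order, promotes the document that actually matches the query.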

Contradiction Detection
The Stanford Natural Language Processing Group has worked on detecting contradictions in text and has created contradiction datasets for this purpose. They have annotated the PASCAL RTE datasets for contradiction, marked for a 3-way decision in terms of entailment: "YES" (entails), "NO" (contradicts), and "UNKNOWN" (doesn't entail but is not a contradiction). Additionally, they have created a corpus where contradictions arise from negation by adding negative markers to the RTE2 test data and have gathered a collection of contradictions appearing "in the wild".



Introduction to Large Language Models (LLMs)
Let's dive into the world of Large Language Models (LLMs). These are advanced computer programs designed to understand, use, and generate human language. Imagine them as vast libraries filled with an enormous range of books, covering every topic you can think of. Just like a librarian who knows where to find every piece of information in these books, LLMs can navigate through this vast knowledge to provide us with insights, answers, and even generate new content.

How do they achieve this? LLMs are built upon complex algorithms and mathematical models. They learn from vast amounts of text – from novels and news articles to scientific papers and social media posts. This learning process involves recognizing patterns in language: how words and sentences are structured, how ideas are connected, and how different expressions can convey the same meaning.

Each LLM has millions, and often billions, of parameters: these are the knobs and dials of the model. Each parameter is a single number, and no one parameter encodes a recognizable concept on its own; together they capture aspects of language such as the tone of a sentence, the meaning of a word, or the structure of a paragraph. When you interact with an LLM, it uses these parameters to process your request and generate a response that is likely to be relevant, though not guaranteed to be accurate.

One of the most fascinating aspects of LLMs is their versatility. They can write in different styles, from formal reports to casual conversations. They can answer factual questions, create imaginative stories, or even write code. This adaptability makes them incredibly useful across various fields and applications.

LLMs are a breakthrough in the way we interact with machines. They bring a level of understanding and responsiveness that was previously unattainable, making our interactions with computers more natural and intuitive. As they continue to evolve, they're not just transforming how we use technology, but also expanding the boundaries of what it can achieve.

In this chapter, we'll explore the world of Large Language Models (LLMs) in depth. Starting with their basic definitions and concepts, we'll trace their historical development to understand how they've evolved into today's advanced models. We'll delve into the key components that make LLMs function, including neural network architectures, their training processes, and the complexities of language modeling and prediction. Finally, we'll examine the fundamental applications of LLMs, such as natural language understanding and generation, covering areas like conversational agents, sentiment analysis, content creation, and language translation. This chapter aims to provide a clear and comprehensive understanding of LLMs, showcasing their capabilities and the transformative impact they have in various sectors.

Definition and Basic Concepts
Foundations of Neural Networks

To truly grasp the concept of Large Language Models (LLMs), we must first understand neural networks, the core technology behind them. Neural networks are a subset of machine learning inspired by the human brain. They consist of layers of nodes, or 'neurons,' each capable of performing simple calculations. When these neurons are connected and layered, they can process complex data. In the context of LLMs, these networks analyze and process language data.

The Structure of Neural Networks in LLMs


 * Input Layer: This is where the model receives text data. Each word or character is represented numerically, often as a vector, which is a series of numbers that capture the essence of the word.
 * Hidden Layers: These are where the bulk of processing happens. In LLMs, hidden layers are often very complex, allowing the model to identify intricate patterns in language. The more layers (or 'depth') a model has, the more nuanced its understanding of language can be.
 * Output Layer: This layer produces the final output, which could be a prediction of the next word in a sentence, the classification of text into categories, or other language tasks.
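As a rough illustration of the three kinds of layers listed above, the following sketch pushes one input vector through a single hidden layer and an output layer. The weights, the layer sizes, and the ReLU activation are made-up assumptions chosen for readability; a real LLM learns its weights and is vastly larger.

```python
import math

def linear(x, weights, bias):
    """One layer: each neuron computes a weighted sum of its inputs plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    """Simple non-linearity applied in the hidden layer."""
    return [max(0.0, a) for a in v]

def softmax(v):
    """Turn raw output scores into probabilities that sum to 1."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, params):
    hidden = relu(linear(x, params["W1"], params["b1"]))   # hidden layer
    logits = linear(hidden, params["W2"], params["b2"])    # output layer scores
    return softmax(logits)

# Hypothetical parameters: 2 inputs -> 3 hidden neurons -> 3 output classes.
params = {
    "W1": [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]],
    "b1": [0.0, 0.0, 0.0],
    "W2": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "b2": [0.0, 0.0, 0.0],
}
x = [1.0, 2.0]              # input layer: a word represented as a vector
probs = forward(x, params)  # probabilities over three possible outputs
```

In a language model the output probabilities would range over the whole vocabulary rather than three classes, but the flow from input vector to hidden representation to output distribution is the same.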

Training Large Language Models

Training an LLM involves feeding it a vast amount of text data. During this process, the model makes predictions about the text (like guessing the next word in a sentence). It then compares its predictions against the actual text, adjusting its parameters (the weights and biases of the neurons) to improve accuracy. This process is repeated countless times, enabling the model to learn from its mistakes and improve its language understanding.

Parameters: The Building Blocks of LLMs

Parameters in a neural network are the numerical values (weights and biases) that the model adjusts during training. In LLMs, these parameters are numerous, often in the hundreds of millions or more. They allow the model to capture and remember the nuances of language, from basic grammar to complex stylistic elements.

From Data to Language Understanding

Through training, LLMs develop an ability to model context, grammar, and semantics. This isn't just word recognition, but a model of how language is structured and used in different situations. They can often pick up on subtleties like sarcasm, humor, and emotion, which are challenging even for human beings, although they do not do so reliably.

Generating Language with LLMs

Once trained, LLMs can generate text. They do this by predicting what comes next in a given piece of text. This capability is not just a parroting back of learned data, but an intelligent synthesis of language patterns that the model has internalized.
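The train-then-generate cycle described in this section can be sketched with a deliberately tiny model. The example below is an assumption-laden toy, not how any production LLM is built: the "model" is just a table of scores `W[prev][next]` (a bigram model), trained by gradient descent on next-word prediction and then used to generate text greedily. The function names and hyperparameters are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bigram(tokens, vocab, epochs=100, lr=0.5):
    """Learn next-word scores by repeatedly predicting, comparing, and adjusting."""
    index = {tok: i for i, tok in enumerate(vocab)}
    n = len(vocab)
    W = [[0.0] * n for _ in range(n)]          # the model's parameters
    pairs = [(index[a], index[b]) for a, b in zip(tokens, tokens[1:])]
    for _ in range(epochs):
        for prev, target in pairs:
            probs = softmax(W[prev])           # the model's prediction
            for j in range(n):
                # Gradient of the cross-entropy loss with respect to each score.
                grad = probs[j] - (1.0 if j == target else 0.0)
                W[prev][j] -= lr * grad        # adjust parameters to reduce error
    return W, index

def generate(W, index, vocab, start, length):
    """Greedy generation: repeatedly append the most probable next word."""
    out = [start]
    for _ in range(length):
        probs = softmax(W[index[out[-1]]])
        out.append(vocab[probs.index(max(probs))])
    return out

tokens = "to be or not to be".split()
vocab = sorted(set(tokens))
W, index = train_bigram(tokens, vocab)
text = generate(W, index, vocab, start="to", length=5)
```

A real LLM conditions on the whole preceding context rather than only the previous word, and usually samples from the predicted distribution instead of always taking the most probable word, but the loop of predicting, measuring error, and updating parameters has the same shape.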

By understanding these fundamental concepts, we begin to see LLMs not just as tools or programs, but as advanced systems that mimic some of the most complex aspects of human intelligence. This section sets the stage for a deeper exploration into their historical development, key components, and the transformative applications they enable.

Historical Development of LLMs
The journey of Large Language Models began with rule-based systems in the early days of computational linguistics. These early models, dating back to the 1950s and 60s, were based on sets of handcrafted rules for syntax and grammar. The advent of statistical models in the late 1980s and 1990s marked a significant shift. These models used probabilities to predict word sequences, laying the groundwork for modern language modeling.

The 2000s witnessed a transition from statistical models to machine learning-based approaches. This era introduced neural networks in language modeling, but these early networks were relatively simple, often limited to specific tasks like part-of-speech tagging or named entity recognition. The focus was primarily on improving specific aspects of language processing rather than developing comprehensive language understanding.

The introduction of deep learning and word embeddings in the early 2010s revolutionized NLP. Models like Word2Vec provided a way to represent words in vector space, capturing semantic relationships between words. This period also saw the widespread adoption of more complex neural network architectures, such as Long Short-Term Memory (LSTM) networks (first proposed in 1997), which were better at handling the sequential nature of language.

The introduction of the Transformer model in 2017 was a watershed moment. The Transformer, first introduced in a paper titled "Attention Is All You Need," abandoned recurrent layers in favor of attention mechanisms. This allowed for more parallel processing and significantly improved the efficiency and effectiveness of language models.

Rise of Large-Scale Language Models

Following the Transformer's success, there was a rapid escalation in the scale of language models. Notable models include OpenAI's GPT series, Google's BERT, and others like XLNet and T5. These models, with parameter counts ranging from hundreds of millions into the billions, demonstrated unprecedented language understanding and generation capabilities. They were trained on diverse and extensive datasets, enabling them to perform a wide range of language tasks with high proficiency.

Recent Developments: Increasing Abilities and Scale

The most recent phase in the development of LLMs is marked by further increases in model size and capabilities. Models like GPT-3 and its successors have pushed the boundaries in terms of the number of parameters and the depth of language understanding. These models exhibit remarkable abilities in generating coherent and contextually relevant text, answering complex questions, translating languages, and even creating content that is often difficult to distinguish from human-written text.

Architecture
Large Language Models (LLMs), such as those based on the Transformer architecture, represent a significant advancement in the field of natural language processing. The Transformer model, introduced in the paper "Attention Is All You Need", has become the backbone of most modern LLMs.

The architecture of a Transformer-based LLM is complex, consisting of several layers and components that work together to process and generate language. The key elements of this architecture include:


 * Input Embedding Layer: This layer converts input text into numerical vectors. Each word or token in the input text is represented as a vector in a high-dimensional space. This process is crucial for the model to process language data.


 * Positional Encoding: In addition to word embeddings, Transformer models add positional encodings to the input embeddings to capture the order of the words in a sentence. This is important because the model itself does not process words sequentially as in previous architectures like RNNs (Recurrent Neural Networks).
 * Encoder and Decoder Layers: The original Transformer model has an encoder-decoder structure. The encoder processes the input text, and the decoder generates the output text. The encoder and the decoder each consist of multiple layers. Many modern LLMs, such as the GPT series, use only the decoder stack.
    * Each layer in the encoder includes two sub-layers: a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network.
    * Each layer in the decoder has the same two sub-layers, plus a third sub-layer that performs attention over the encoder's output.
 * Self-Attention Mechanism: This mechanism allows the model to weigh the importance of different words in the input sentence. It enables the model to capture contextual information from the entire sentence, which is a key feature of the Transformer model.
 * Multi-Head Attention: This component splits the attention mechanism into multiple heads, allowing the model to simultaneously attend to information from different representation subspaces at different positions.
 * Feed-Forward Neural Networks: These networks are applied to each position separately and identically. They consist of fully connected layers with activation functions.
 * Normalization and Dropout Layers: These layers are used in between the other components of the Transformer architecture to stabilize and regularize the training process.
 * Output Layer: The final decoder output is transformed into a predicted word or token, often using a softmax layer to generate a probability distribution over possible outputs.
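To make the self-attention component in the list above more concrete, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions: the projection matrices are identity matrices rather than learned weights, there is only one head, and the `causal` flag is an illustrative parameter that mimics the masking used in decoder self-attention so that a position cannot attend to later positions.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, wq, wk, wv, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    if causal:
        # Hide later positions by giving them a score of minus infinity.
        scores = [[s if j <= i else float("-inf") for j, s in enumerate(row)]
                  for i, row in enumerate(scores)]
    weights = [softmax(row) for row in scores]   # how much each token attends to each other token
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]              # stand-in for learned projections
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # three token embeddings of size 2
output, weights = self_attention(tokens, identity, identity, identity, causal=True)
```

Each row of `weights` sums to 1 and says how strongly one position draws on every other position; multi-head attention runs several such computations in parallel with different projections and concatenates the results.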

The Transformer architecture is highly parallelizable, making it more efficient to train on large datasets compared to older architectures like RNNs or LSTMs. This efficiency is one of the reasons why Transformer-based models can be scaled to have a large number of parameters and process extensive language data.

For more detailed information on the Transformer architecture, see the Wikipedia Transformer page.

Chain-of-Thought Prompting
See the Graph of Thoughts paper, which has a nice exposition of chain-of-thought and tree-of-thought prompting.

Advanced Applications of LLMs
Hadi et al.'s (2023) 44-page survey offers a solid and recent resource here.