LLM Collection
This section collects and summarizes notable and foundational LLMs.
Models
| Model | Release Date | Size (B) | Checkpoints | Description | 
|---|---|---|---|---|
| Falcon LLM | Sep 2023 | 7, 40, 180 | Falcon-7B, Falcon-40B, Falcon-180B | Falcon LLM is a family of foundational large language models (LLMs) released by TII. The largest model, Falcon-180B, has 180 billion parameters and was trained on 3,500 billion tokens. |
| Mistral-7B-v0.1 | Sep 2023 | 7 | Mistral-7B-v0.1 | Mistral-7B-v0.1 is a pretrained generative text model with 7 billion parameters. The model is based on a transformer architecture with features like Grouped-Query Attention, Byte-fallback BPE tokenizer and Sliding-Window Attention. |
| CodeLlama | Aug 2023 | 7, 13, 34 | CodeLlama-7B, CodeLlama-13B, CodeLlama-34B | The Code Llama family is designed for general code synthesis and understanding. It is specifically tuned for instruction following and safer deployment. The models are auto-regressive and use an optimized transformer architecture. They are intended for commercial and research use in English and relevant programming languages. |
| Llama-2 | Jul 2023 | 7, 13, 70 | Llama-2-7B, Llama-2-13B, Llama-2-70B | LLaMA-2, developed by Meta AI, was released in July 2023 with models of 7, 13, and 70 billion parameters. It maintains a similar architecture to LLaMA-1 but uses 40% more training data. LLaMA-2 includes foundational models and dialog-fine-tuned models, known as LLaMA-2 Chat, and is available for many commercial uses, with some restrictions. |
| XGen-7B-8K | Jul 2023 | 7 | XGen-7B-8K | The XGen-7B-8K, developed by Salesforce AI Research, is a 7B parameter language model. |
| Claude-2 | Jul 2023 | 130 | - | Claude 2 is a foundational LLM built by Anthropic, designed to be safer and more "steerable" than its previous version. It is conversational and can be used for a variety of tasks like customer support, Q&A, and more. It can process large amounts of text and is well-suited for applications that require handling extensive data, such as documents, emails, FAQs, and chat transcripts. |
| Tulu | Jun 2023 | 7, 13, 30, 65 | Tulu-7B, Tulu-13B, Tulu-30B, Tulu-65B | Tulu is a family of models developed by the Allen Institute for AI. The models are LLaMA models that have been fine-tuned on a mixture of instruction datasets, including FLAN V2, CoT, Dolly, Open Assistant 1, GPT4-Alpaca, Code-Alpaca, and ShareGPT. They are designed to follow complex instructions across various NLP tasks. |
| ChatGLM2-6B | Jun 2023 | 6 | ChatGLM2-6B | ChatGLM2-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model ChatGLM-6B. It has improved performance, longer context capabilities, more efficient inference, and an open license for academic and commercial use. The model uses a hybrid objective function and has been trained with 1.4T bilingual tokens. It shows substantial improvements in performance on various datasets compared to its first-generation counterpart. |
| Nous-Hermes-13B | Jun 2023 | 13 | Nous-Hermes-13B | Nous-Hermes-13B is a language model fine-tuned by Nous Research on over 300,000 instructions. |
| Baize-v2 | May 2023 | 7, 13 | Baize-v2-13B | Baize-v2 is an open-source chat model developed by UCSD and Sun Yat-Sen University, fine-tuned with LoRA, and trained with supervised fine-tuning (SFT) and self-distillation with feedback (SDF). |
| RWKV-4-Raven | May 2023 | 1.5, 3, 7, 14 | RWKV-4-Raven | RWKV-4-Raven is a series of models. These models are fine-tuned on various datasets like Alpaca, CodeAlpaca, Guanaco, GPT4All, and ShareGPT. They follow a 100% RNN architecture for the language model. |
| Guanaco | May 2023 | 7, 13, 33, 65 | Guanaco-7B, Guanaco-13B, Guanaco-33B, Guanaco-65B | Guanaco models are open-source chatbots fine-tuned through 4-bit QLoRA tuning of LLaMA base models on the OASST1 dataset. They are intended for research purposes. The models allow for cheap and local experimentation with high-quality chatbot systems. |
| PaLM 2 | May 2023 | - | - | A Language Model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. |
| Gorilla | May 2023 | 7 | Gorilla | Gorilla: Large Language Model Connected with Massive APIs |
| RedPajama-INCITE | May 2023 | 3, 7 | RedPajama-INCITE | A family of models including base, instruction-tuned & chat models. |
| LIMA | May 2023 | 65 | - | A 65B parameter LLaMA language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. |
| Replit Code | May 2023 | 3 | Replit Code | The replit-code-v1-3b model is a 2.7B LLM trained on 20 languages from the Stack Dedup v1.2 dataset. |
| h2oGPT | May 2023 | 7, 12, 20, 40 | h2oGPT | h2oGPT is an LLM fine-tuning framework and chatbot UI with document question-answering capabilities. |
| CodeGen2 | May 2023 | 1, 3, 7, 16 | CodeGen2 | Code models for program synthesis. |
| CodeT5 and CodeT5+ | May 2023 | 16 | CodeT5 | CodeT5 and CodeT5+ models for Code Understanding and Generation from Salesforce Research. |
| StarCoder | May 2023 | 15 | StarCoder | StarCoder: A State-of-the-Art LLM for Code |
| MPT | May 2023 | 7, 30 | MPT-7B, MPT-30B | MosaicML's MPT models are open-source, commercially licensed Large Language Models, offering customizable AI solutions optimized for various NLP tasks. |
| DLite | May 2023 | 0.124 - 1.5 | DLite-v2-1.5B | Lightweight instruction following models which exhibit ChatGPT-like interactivity. |
| WizardLM | Apr 2023 | 13, 30, 70 | WizardLM-13B, WizardLM-30B, WizardLM-70B | WizardLM is a family of large language models designed to follow complex instructions. The models perform well in coding, mathematical reasoning, and open-domain conversations. They are license-friendly and adopt a prompt format from Vicuna for multi-turn conversations. The models are developed by the WizardLM Team for a range of NLP tasks. |
| FastChat-T5-3B | Apr 2023 | 3 | FastChat-T5-3B | FastChat-T5 is an open-source chatbot trained by fine-tuning Flan-t5-xl (3B parameters) on user-shared conversations collected from ShareGPT. It's based on an encoder-decoder transformer architecture and can autoregressively generate responses to users' inputs. |
| GPT4All-13B-Snoozy | Apr 2023 | 13 | GPT4All-13B-Snoozy | GPT4All-13B-Snoozy is a GPL-licensed chatbot trained on a massive curated corpus of assistant interactions including word problems, multi-turn dialogue, code, poems, songs, and stories. It was fine-tuned from LLaMA 13B and is developed by Nomic AI. The model is designed for assistant-style interaction and is primarily in English. |
| Koala-13B | Apr 2023 | 13 | Koala-13B | Koala-13B is a chatbot created by Berkeley AI Research (BAIR). It is fine-tuned on Meta's LLaMA and focuses on dialogue data scraped from the web. The model aims to balance performance and cost, providing a lighter, open-source alternative to models like ChatGPT. It has been trained on interaction data that includes conversations with highly capable closed-source models such as ChatGPT. |
| OpenAssistant (Llama family) | Apr 2023 | 30, 70 | Llama2-30b-oasst, Llama2-70b-oasst | OpenAssistant-LLaMA models are language models from OpenAssistant's work on the Llama models. They support CPU + GPU inference using the GGML format and aim to provide an open-source alternative for instruction-following tasks. |
| Dolly | Apr 2023 | 3, 7, 12 | Dolly-v2-3B, Dolly-v2-7B, Dolly-v2-12B | An instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use. |
| StableLM | Apr 2023 | 3, 7 | StableLM-Alpha-3B, StableLM-Alpha-7B | Stability AI's StableLM series of language models |
| Pythia | Apr 2023 | 0.070 - 12 | Pythia | A suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. |
| Open Assistant (Pythia Family) | Mar 2023 | 12 | Open Assistant | OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so. |
| Med-PaLM 2 | Mar 2023 | - | - | Towards Expert-Level Medical Question Answering with Large Language Models |
| ChatGLM-6B | Mar 2023 | 6 | ChatGLM-6B | ChatGLM-6B is an open-source, Chinese-English bilingual dialogue model based on the General Language Model (GLM) architecture with 6.2 billion parameters. Although its small size can cause factual or mathematical-logic errors, it is adept at Chinese question answering, summarization, and conversational tasks, having been trained on over 1 trillion English and Chinese tokens. |
| GPT-3.5-turbo | Mar 2023 | 175 | - | GPT-3.5-Turbo is OpenAI's advanced language model optimized for chat but also works well for traditional completion tasks. It offers better performance across all aspects compared to GPT-3 and is 10 times cheaper per token. |
| Vicuna | Mar 2023 | 7, 13, 33 | Vicuna-7B, Vicuna-13B | Vicuna is a family of auto-regressive language models based on the transformer architecture. It's fine-tuned from LLaMA and primarily intended for research on large language models and chatbots. It's developed by LMSYS and has a non-commercial license. |
| Alpaca-13B | Mar 2023 | 13 | - | Alpaca is an instruction-following language model fine-tuned from Meta's LLaMA 7B. It's designed for academic research to address issues like misinformation and toxicity. Alpaca is trained on 52K instruction-following demonstrations and aims to be a more accessible option for academic study. It's not intended for commercial use due to licensing and safety concerns. |
| Claude-1 | Mar 2023 | 137 | - | Claude is a foundational large language model (LLM) built by Anthropic. It is designed to be a helpful, honest, and harmless AI assistant. It can perform a wide variety of conversational and text processing tasks and is accessible through a chat interface and API. |
| Cerebras-GPT | Mar 2023 | 0.111 - 13 | Cerebras-GPT | Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster |
| BloombergGPT | Mar 2023 | 50 | - | BloombergGPT: A Large Language Model for Finance |
| PanGu-Σ | Mar 2023 | 1085 | - | PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing |
| GPT-4 | Mar 2023 | - | - | GPT-4 Technical Report |
| LLaMA | Feb 2023 | 7, 13, 33, 65 | LLaMA | LLaMA: Open and Efficient Foundation Language Models |
| ChatGPT | Nov 2022 | - | - | A model called ChatGPT which interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests. |
| Galactica | Nov 2022 | 0.125 - 120 | Galactica | Galactica: A Large Language Model for Science |
| mT0 | Nov 2022 | 13 | mT0-xxl | Crosslingual Generalization through Multitask Finetuning |
| BLOOM | Nov 2022 | 176 | BLOOM | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model |
| U-PaLM | Oct 2022 | 540 | - | Transcending Scaling Laws with 0.1% Extra Compute |
| UL2 | Oct 2022 | 20 | UL2, Flan-UL2 | UL2: Unifying Language Learning Paradigms |
| Sparrow | Sep 2022 | 70 | - | Improving alignment of dialogue agents via targeted human judgements |
| Flan-T5 | Oct 2022 | 11 | Flan-T5-xxl | Scaling Instruction-Finetuned Language Models |
| AlexaTM | Aug 2022 | 20 | - | AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model |
| GLM-130B | Oct 2022 | 130 | GLM-130B | GLM-130B: An Open Bilingual Pre-trained Model |
| OPT-IML | Dec 2022 | 30, 175 | OPT-IML | OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization |
| OPT | May 2022 | 175 | OPT-13B, OPT-66B | OPT: Open Pre-trained Transformer Language Models |
| PaLM | Apr 2022 | 540 | - | PaLM: Scaling Language Modeling with Pathways |
| Tk-Instruct | Apr 2022 | 11 | Tk-Instruct-11B | Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks |
| GPT-NeoX-20B | Apr 2022 | 20 | GPT-NeoX-20B | GPT-NeoX-20B: An Open-Source Autoregressive Language Model |
| Chinchilla | Mar 2022 | 70 | - | Shows that for a compute budget, the best performances are not achieved by the largest models but by smaller models trained on more data. |
| InstructGPT | Mar 2022 | 175 | - | Training language models to follow instructions with human feedback |
| CodeGen | Mar 2022 | 0.350 - 16 | CodeGen | CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis |
| AlphaCode | Feb 2022 | 41 | - | Competition-Level Code Generation with AlphaCode |
| MT-NLG | Jan 2022 | 530 | - | Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model |
| LaMDA | Jan 2022 | 137 | - | LaMDA: Language Models for Dialog Applications |
| GLaM | Dec 2021 | 1200 | - | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts |
| Gopher | Dec 2021 | 280 | - | Scaling Language Models: Methods, Analysis & Insights from Training Gopher |
| WebGPT | Dec 2021 | 175 | - | WebGPT: Browser-assisted question-answering with human feedback |
| Yuan 1.0 | Oct 2021 | 245 | - | Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning |
| T0 | Oct 2021 | 11 | T0 | Multitask Prompted Training Enables Zero-Shot Task Generalization |
| FLAN | Sep 2021 | 137 | - | Finetuned Language Models Are Zero-Shot Learners |
| HyperCLOVA | Sep 2021 | 82 | - | What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers |
| ERNIE 3.0 Titan | Dec 2021 | 260 | - | ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation |
| Jurassic-1 | Aug 2021 | 178 | - | Jurassic-1: Technical Details and Evaluation |
| ERNIE 3.0 | Jul 2021 | 10 | - | ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation |
| Codex | Jul 2021 | 12 | - | Evaluating Large Language Models Trained on Code |
| GPT-J-6B | Jun 2021 | 6 | GPT-J-6B | A 6 billion parameter, autoregressive text generation model trained on The Pile. |
| CPM-2 | Jun 2021 | 198 | CPM | CPM-2: Large-scale Cost-effective Pre-trained Language Models |
| PanGu-α | Apr 2021 | 13 | PanGu-α | PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation |
| mT5 | Oct 2020 | 13 | mT5 | mT5: A massively multilingual pre-trained text-to-text transformer |
| BART | Jul 2020 | - | BART | Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension |
| GShard | Jun 2020 | 600 | - | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding |
| GPT-3 | May 2020 | 175 | - | Language Models are Few-Shot Learners |
| CTRL | Sep 2019 | 1.63 | CTRL | CTRL: A Conditional Transformer Language Model for Controllable Generation |
| ALBERT | Sep 2019 | 0.235 | ALBERT | A Lite BERT for Self-supervised Learning of Language Representations |
| XLNet | Jun 2019 | - | XLNet | Generalized Autoregressive Pretraining for Language Understanding and Generation |
| T5 | Oct 2019 | 0.06 - 11 | Flan-T5 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer |
| GPT-2 | Nov 2019 | 1.5 | GPT-2 | Language Models are Unsupervised Multitask Learners |
| RoBERTa | Jul 2019 | 0.125 - 0.355 | RoBERTa | A Robustly Optimized BERT Pretraining Approach |
| BERT | Oct 2018 | - | BERT | Bidirectional Encoder Representations from Transformers |
| GPT | Jun 2018 | - | GPT | Improving Language Understanding by Generative Pre-Training |
⚠️
This section is under development.
Data adopted from Papers with Code and the recent work by Zhao et al. (2023).