List of large language models

A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a large amount of text.

This page lists notable large language models.

For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64×10¹⁹ FLOP. In addition, only the cost of the largest model is listed.
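
The conversion above is simple unit arithmetic. The following Python sketch is illustrative only; the function name is made up for this example, and the 3,640 petaFLOP-day figure is the GPT-3 training cost from the table below.

```python
# Convert the table's training-cost figures from petaFLOP-days to raw FLOP counts.
PETAFLOP_PER_SECOND = 1e15   # FLOP per second at a sustained 1 petaFLOP/s
SECONDS_PER_DAY = 86_400

def petaflop_days_to_flop(petaflop_days: float) -> float:
    """1 petaFLOP-day = 1e15 FLOP/s x 86,400 s = 8.64e19 FLOP."""
    return petaflop_days * PETAFLOP_PER_SECOND * SECONDS_PER_DAY

# Example: GPT-3's 3,640 petaFLOP-days from the table below.
print(f"{petaflop_days_to_flop(3640):.3e}")  # -> 3.145e+23 FLOP
```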

Name Release date[a] Developer Number of parameters (billions)[b] Corpus size Training cost (petaFLOP-day) License[c] Notes
GPT-1 June 2018 OpenAI 0.117 1[1] MIT[2] First GPT model, a decoder-only transformer. Trained for 30 days on 8 P600 GPUs.
BERT October 2018 Google 0.340[3] 3.3 billion words[3] 9[4] Apache 2.0[5] An early and influential language model.[6] Encoder-only, and thus not built to be prompted or generative.[7] Training took 4 days on 64 TPUv2 chips.[8]
T5 October 2019 Google 11[9] 34 billion tokens[9] Apache 2.0[10] Base model for many Google projects, such as Imagen.[11]
XLNet June 2019 Google 0.340[12] 33 billion words 330 Apache 2.0[13] An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.[14]
GPT-2 February 2019 OpenAI 1.5[15] 40 GB[16] (~10 billion tokens)[17] 28[18] MIT[19] Trained on 32 TPU v3 chips for one week.[18]
GPT-3 May 2020 OpenAI 175[20] 300 billion tokens[17] 3640[21] Proprietary A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.[22]
GPT-Neo March 2021 EleutherAI 2.7[23] 825 GiB[24] MIT[25] The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.[25]
GPT-J June 2021 EleutherAI 6[26] 825 GiB[24] 200[27] Apache 2.0 GPT-3-style language model.
Megatron-Turing NLG October 2021[28] Microsoft and Nvidia 530[29] 338.6 billion tokens[29] 38000[30] Restricted web access Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene Supercomputer, for over 3 million GPU-hours.[30]
Ernie 3.0 Titan December 2021 Baidu 260[31] 4 Tb Proprietary Chinese-language LLM. Ernie Bot is based on this model.
Claude[32] December 2021 Anthropic 52[33] 400 billion tokens[33] beta Fine-tuned for desirable behavior in conversations.[34]
GLaM (Generalist Language Model) December 2021 Google 1200[35] 1.6 trillion tokens[35] 5600[35] Proprietary Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3.
Gopher December 2021 DeepMind 280[36] 300 billion tokens[37] 5833[38] Proprietary Later developed into the Chinchilla model.
LaMDA (Language Models for Dialog Applications) January 2022 Google 137[39] 1.56T words,[39] 168 billion tokens[37] 4110[40] Proprietary Specialized for response generation in conversations.
GPT-NeoX February 2022 EleutherAI 20[41] 825 GiB[24] 740[27] Apache 2.0 Based on the Megatron architecture.
Chinchilla March 2022 DeepMind 70[42] 1.4 trillion tokens[42][37] 6805[38] Proprietary Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law.
PaLM (Pathways Language Model) April 2022 Google 540[43] 768 billion tokens[42] 29,250[38] Proprietary Trained for ~60 days on ~6000 TPU v4 chips.[38] As of October 2024, it is the largest dense Transformer published.
OPT (Open Pretrained Transformer) May 2022 Meta 175[44] 180 billion tokens[45] 310[27] Non-commercial research[d] GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published.[46]
YaLM 100B June 2022 Yandex 100[47] 1.7TB[47] Apache 2.0 English-Russian model based on Microsoft's Megatron-LM.
Minerva June 2022 Google 540[48] 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server[48] Proprietary For solving "mathematical and scientific questions using step-by-step reasoning".[49] Initialized from PaLM models, then finetuned on mathematical and scientific data.
BLOOM July 2022 Large collaboration led by Hugging Face 175[50] 350 billion tokens (1.6TB)[51] Responsible AI Essentially GPT-3 but trained on a multilingual corpus (30% English, excluding programming languages).
Galactica November 2022 Meta 120 106 billion tokens[52] Unknown CC-BY-NC-4.0 Trained on scientific text and modalities.
AlexaTM (Teacher Models) November 2022 Amazon 20[53] 1.3 trillion[54] Proprietary[55] Bidirectional sequence-to-sequence architecture.
LLaMA (Large Language Model Meta AI) February 2023 Meta AI 65[56] 1.4 trillion[56] 6300[57] Non-commercial research[e] Corpus has 20 languages. "Overtrained" (compared to the Chinchilla scaling law) for better performance with fewer parameters;[56] see the sketch after the table.
GPT-4 March 2023 OpenAI Unknown[f] (rumored: 1760)[59] Unknown Unknown Proprietary Available for ChatGPT Plus users and used in several products.
Chameleon June 2024 Meta AI 34[60] 4.4 trillion
Cerebras-GPT March 2023 Cerebras 13[61] 270[27] Apache 2.0 Trained with the Chinchilla formula.
Falcon March 2023 Technology Innovation Institute 40[62] 1 trillion tokens, from RefinedWeb (filtered web text corpus)[63] plus some "curated corpora".[64] 2800[57] Apache 2.0[65]
BloombergGPT March 2023 Bloomberg L.P. 50 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets[66] Proprietary Trained on financial data from proprietary sources, for financial tasks.
PanGu-Σ March 2023 Huawei 1085 329 billion tokens[67] Proprietary
OpenAssistant[68] March 2023 LAION 17 1.5 trillion tokens Apache 2.0 Trained on crowdsourced open data.
Jurassic-2[69] March 2023 AI21 Labs Unknown Unknown Proprietary Multilingual.[70]
PaLM 2 (Pathways Language Model 2) May 2023 Google 340[71] 3.6 trillion tokens[71] 85,000[57] Proprietary Was used in the Bard chatbot.[72]
Llama 2 July 2023 Meta AI 70[73] 2 trillion tokens[73] 21,000 Llama 2 license 1.7 million A100-hours.[74]
Claude 2 July 2023 Anthropic Unknown Unknown Unknown Proprietary Used in the Claude chatbot.[75]
Granite 13b July 2023 IBM Unknown Unknown Unknown Proprietary Used in IBM Watsonx.[76]
Mistral 7B September 2023 Mistral AI 7.3[77] Unknown Apache 2.0
Claude 2.1 November 2023 Anthropic Unknown Unknown Unknown Proprietary Used in the Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.[78]
Grok-1[79] November 2023 xAI 314 Unknown Unknown Apache 2.0 Used in the Grok chatbot. Grok-1 has a context length of 8,192 tokens and has access to X (Twitter).[80]
Gemini 1.0 December 2023 Google DeepMind Unknown Unknown Unknown Proprietary Multimodal model, comes in three sizes. Used in the chatbot of the same name.[81]
Mixtral 8x7B December 2023 Mistral AI 46.7 Unknown Unknown Apache 2.0 Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.[82] Mixture-of-experts model, with 12.9 billion parameters activated per token.[83]
Mixtral 8x22B April 2024 Mistral AI 141 Unknown Unknown Apache 2.0 [84]
DeepSeek LLM November 29, 2023 DeepSeek 67 2T tokens[85] 12,000 DeepSeek License Trained on English and Chinese text. 1e24 FLOPs for 67B. 1e23 FLOPs for 7B.[85]
Phi-2 December 2023 Microsoft 2.7 1.4T tokens 419[86] MIT Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs.[86]
Gemini 1.5 February 2024 Google DeepMind Unknown Unknown Unknown Proprietary Multimodal model, based on a mixture-of-experts (MoE) architecture. Context window above 1 million tokens.[87]
Gemini Ultra February 2024 Google DeepMind Unknown Unknown Unknown
Gemma February 2024 Google DeepMind 7 6T tokens Unknown Gemma Terms of Use[88]
Claude 3 March 2024 Anthropic Unknown Unknown Unknown Proprietary Includes three models: Haiku, Sonnet, and Opus.[89]
Nova October 2024 Rubik's AI Unknown Unknown Unknown Proprietary Includes three models: Nova-Instant, Nova-Air, and Nova-Pro.
DBRX March 2024 Databricks and Mosaic ML 136 12T tokens Databricks Open Model License Training cost 10 million USD.
Fugaku-LLM May 2024 Fujitsu, Tokyo Institute of Technology, etc. 13 380B tokens The largest model ever trained using only CPUs, on the Fugaku supercomputer.[90]
Phi-3 April 2024 Microsoft 14[91] 4.8T tokens MIT Microsoft markets them as "small language models".[92]
Granite Code Models May 2024 IBM Unknown Unknown Unknown Apache 2.0
Qwen2 June 2024 Alibaba Cloud 72[93] 3T tokens Unknown Qwen License Multiple sizes, the smallest being 0.5B.
DeepSeek V2 June 2024 DeepSeek 236 8.1T tokens 28,000 DeepSeek License 1.4M hours on H800 GPUs.[94]
Nemotron-4 June 2024 Nvidia 340 9T tokens 200,000 NVIDIA Open Model License Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024.[95][96]
Llama 3.1 July 2024 Meta AI 405 15.6T tokens 440,000 Llama 3 license 405B version took 31 million hours on H100-80GB, at 3.8E25 FLOPs.[97][98]
DeepSeek V3 December 2024 DeepSeek 671 14.8T tokens 56,000 DeepSeek License 2.788M hours on H800 GPUs.[99]
Amazon Nova December 2024 Amazon Unknown Unknown Unknown Proprietary Includes three models: Nova Micro, Nova Lite, and Nova Pro.[100]
DeepSeek R1 January 2025 DeepSeek 671 Unknown Unknown MIT No pretraining; reinforcement-learned on top of V3-Base.[101][102]
Qwen2.5 January 2025 Alibaba 72 18T tokens Unknown Qwen License [103]
MiniMax-Text-01 January 2025 Minimax 456 4.7T tokens[104] Unknown Minimax Model license [105][104]
Gemini 2.0 February 2025 Google DeepMind Unknown Unknown Unknown Proprietary Three models released: Flash, Flash-Lite and Pro.[106][107][108]
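
The "overtrained (compared to the Chinchilla scaling law)" remark in the LLaMA row refers to the compute-optimal recipe from the Chinchilla paper (Hoffmann et al., 2022), often summarized as roughly 20 training tokens per model parameter. The Python sketch below is a minimal illustration assuming that rule of thumb, using figures from the table above; the function name is invented for this example.

```python
# Rule of thumb from the Chinchilla paper: a compute-optimal training run uses
# roughly 20 tokens per model parameter. This is an illustrative sketch only.
CHINCHILLA_TOKENS_PER_PARAMETER = 20

def chinchilla_optimal_tokens_billions(parameters_billions: float) -> float:
    """Approximate compute-optimal training-token count, in billions."""
    return CHINCHILLA_TOKENS_PER_PARAMETER * parameters_billions

# Chinchilla itself: 70B parameters -> ~1,400B (1.4T) tokens, matching the table.
print(chinchilla_optimal_tokens_billions(70))   # 1400
# LLaMA's 7B variant was trained on ~1,000B tokens, far above the ~140B this
# ratio suggests, which is the sense in which it is "overtrained".
print(chinchilla_optimal_tokens_billions(7))    # 140
```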

See also

Notes

  1. ^ This is the date that documentation describing the model's architecture was first released.
  2. ^ In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.
  3. ^ This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can be easily replicated.
  4. ^ The smaller models including 66B are publicly available, while the 175B model is available on request.
  5. ^ Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were leaked and became widely available.
  6. ^ As stated in Technical report: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method ..."[58]

References

  1. ^ Improving language understanding with unsupervised learning. openai.com. June 11, 2018 [2023-03-18]. (原始内容存档于2023-03-18). 
  2. ^ finetune-transformer-lm. GitHub. [2 January 2024]. (原始内容存档于19 May 2023). 
  3. ^ 3.0 3.1 Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 11 October 2018. arXiv:1810.04805v2可免费查阅 [cs.CL]. 
  4. ^ Prickett, Nicole Hemsoth. Cerebras Shifts Architecture To Meet Massive AI/ML Models. The Next Platform. 2021-08-24 [2023-06-20]. (原始内容存档于2023-06-20). 
  5. ^ BERT. March 13, 2023 [March 13, 2023]. (原始内容存档于January 13, 2021) –通过GitHub. 
  6. ^ Manning, Christopher D. Human Language Understanding & Reasoning. Daedalus. 2022, 151 (2): 127–138 [2023-03-09]. S2CID 248377870. doi:10.1162/daed_a_01905可免费查阅. (原始内容存档于2023-11-17). 
  7. ^ Patel, Ajay; Li, Bryan; Rasooli, Mohammad Sadegh; Constant, Noah; Raffel, Colin; Callison-Burch, Chris. Bidirectional Language Models Are Also Few-shot Learners. 2022. arXiv:2209.14500可免费查阅 [cs.LG]. 
  8. ^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 11 October 2018. arXiv:1810.04805v2可免费查阅 [cs.CL]. 
  9. ^ 9.0 9.1 Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research. 2020, 21 (140): 1–67. ISSN 1533-7928. arXiv:1910.10683可免费查阅. 
  10. ^ google-research/text-to-text-transfer-transformer, Google Research, 2024-04-02 [2024-04-04], (原始内容存档于2024-03-29) 
  11. ^ Imagen: Text-to-Image Diffusion Models. imagen.research.google. [2024-04-04]. (原始内容存档于2024-03-27). 
  12. ^ Pretrained models — transformers 2.0.0 documentation. huggingface.co. [2024-08-05]. (原始内容存档于2024-08-05). 
  13. ^ xlnet. GitHub. [2 January 2024]. (原始内容存档于2 January 2024). 
  14. ^ Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Ruslan; Le, Quoc V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. 2 January 2020. arXiv:1906.08237可免费查阅 [cs.CL]. 
  15. ^ GPT-2: 1.5B Release. OpenAI. 2019-11-05 [2019-11-14]. (原始内容存档于2019-11-14) (英语). 
  16. ^ Better language models and their implications. openai.com. [2023-03-13]. (原始内容存档于2023-03-16). 
  17. ^ 17.0 17.1 OpenAI's GPT-3 Language Model: A Technical Overview. lambdalabs.com. 3 June 2020 [13 March 2023]. (原始内容存档于27 March 2023). 
  18. ^ 18.0 18.1 openai-community/gpt2-xl · Hugging Face. huggingface.co. [2024-07-24]. (原始内容存档于2024-07-24). 
  19. ^ gpt-2. GitHub. [13 March 2023]. (原始内容存档于11 March 2023). 
  20. ^ Wiggers, Kyle. The emerging types of language models and why they matter. TechCrunch. 28 April 2022 [9 March 2023]. (原始内容存档于16 March 2023). 
  21. ^ Table D.1 in Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario. Language Models are Few-Shot Learners. May 28, 2020. arXiv:2005.14165v4可免费查阅 [cs.CL]. 
  22. ^ ChatGPT: Optimizing Language Models for Dialogue. OpenAI. 2022-11-30 [2023-01-13]. (原始内容存档于2022-11-30). 
  23. ^ GPT Neo. March 15, 2023 [March 12, 2023]. (原始内容存档于March 12, 2023) –通过GitHub. 
  24. ^ 24.0 24.1 24.2 Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. 31 December 2020. arXiv:2101.00027可免费查阅 [cs.CL]. 
  25. ^ 25.0 25.1 Iyer, Abhishek. GPT-3's free alternative GPT-Neo is something to be excited about. VentureBeat. 15 May 2021 [13 March 2023]. (原始内容存档于9 March 2023). 
  26. ^ GPT-J-6B: An Introduction to the Largest Open Source GPT Model | Forefront. www.forefront.ai. [2023-02-28]. (原始内容存档于2023-03-09). 
  27. ^ 27.0 27.1 27.2 27.3 Dey, Nolan; Gosal, Gurpreet; Zhiming; Chen; Khachane, Hemant; Marshall, William; Pathria, Ribhu; Tom, Marvin; Hestness, Joel. Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster. 2023-04-01. arXiv:2304.03208可免费查阅 [cs.LG]. 
  28. ^ Alvi, Ali; Kharya, Paresh. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model. Microsoft Research. 11 October 2021 [13 March 2023]. (原始内容存档于13 March 2023). 
  29. ^ 29.0 29.1 Smith, Shaden; Patwary, Mostofa; Norick, Brandon; LeGresley, Patrick; Rajbhandari, Samyam; Casper, Jared; Liu, Zhun; Prabhumoye, Shrimai; Zerveas, George; Korthikanti, Vijay; Zhang, Elton; Child, Rewon; Aminabadi, Reza Yazdani; Bernauer, Julie; Song, Xia. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. 2022-02-04. arXiv:2201.11990可免费查阅 [cs.CL]. 
  30. ^ 30.0 30.1 Rajbhandari, Samyam; Li, Conglong; Yao, Zhewei; Zhang, Minjia; Aminabadi, Reza Yazdani; Awan, Ammar Ahmad; Rasley, Jeff; He, Yuxiong, DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, 2022-07-21, arXiv:2201.05596可免费查阅 
  31. ^ Wang, Shuohuan; Sun, Yu; Xiang, Yang; Wu, Zhihua; Ding, Siyu; Gong, Weibao; Feng, Shikun; Shang, Junyuan; Zhao, Yanbin; Pang, Chao; Liu, Jiaxiang; Chen, Xuyi; Lu, Yuxiang; Liu, Weixin; Wang, Xi; Bai, Yangfan; Chen, Qiuliang; Zhao, Li; Li, Shiyong; Sun, Peng; Yu, Dianhai; Ma, Yanjun; Tian, Hao; Wu, Hua; Wu, Tian; Zeng, Wei; Li, Ge; Gao, Wen; Wang, Haifeng. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation. December 23, 2021. arXiv:2112.12731可免费查阅 [cs.CL]. 
  32. ^ Product. Anthropic. [14 March 2023]. (原始内容存档于16 March 2023). 
  33. ^ 33.0 33.1 Askell, Amanda; Bai, Yuntao; Chen, Anna; et al. A General Language Assistant as a Laboratory for Alignment. 9 December 2021. arXiv:2112.00861可免费查阅 [cs.CL]. 
  34. ^ Bai, Yuntao; Kadavath, Saurav; Kundu, Sandipan; et al. Constitutional AI: Harmlessness from AI Feedback. 15 December 2022. arXiv:2212.08073可免费查阅 [cs.CL]. 
  35. ^ 35.0 35.1 35.2 Dai, Andrew M; Du, Nan. More Efficient In-Context Learning with GLaM. ai.googleblog.com. December 9, 2021 [2023-03-09]. (原始内容存档于2023-03-12). 
  36. ^ Language modelling at scale: Gopher, ethical considerations, and retrieval. www.deepmind.com. 8 December 2021 [20 March 2023]. (原始内容存档于20 March 2023). 
  37. ^ 37.0 37.1 37.2 Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; et al. Training Compute-Optimal Large Language Models. 29 March 2022. arXiv:2203.15556可免费查阅 [cs.CL]. 
  38. ^ 38.0 38.1 38.2 38.3 Table 20 and page 66 of PaLM: Scaling Language Modeling with Pathways. Archived 2023-06-10 at the Wayback Machine.
  39. ^ 39.0 39.1 Cheng, Heng-Tze; Thoppilan, Romal. LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything. ai.googleblog.com. January 21, 2022 [2023-03-09]. (原始内容存档于2022-03-25). 
  40. ^ Thoppilan, Romal; De Freitas, Daniel; Hall, Jamie; Shazeer, Noam; Kulshreshtha, Apoorv; Cheng, Heng-Tze; Jin, Alicia; Bos, Taylor; Baker, Leslie; Du, Yu; Li, YaGuang; Lee, Hongrae; Zheng, Huaixiu Steven; Ghafouri, Amin; Menegali, Marcelo. LaMDA: Language Models for Dialog Applications. 2022-01-01. arXiv:2201.08239可免费查阅 [cs.CL]. 
  41. ^ Black, Sidney; Biderman, Stella; Hallahan, Eric; et al. GPT-NeoX-20B: An Open-Source Autoregressive Language Model. Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models. Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models: 95–136. 2022-05-01 [2022-12-19]. (原始内容存档于2022-12-10). 
  42. ^ 42.0 42.1 42.2 Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Sifre, Laurent. An empirical analysis of compute-optimal large language model training. Deepmind Blog. 12 April 2022 [9 March 2023]. (原始内容存档于13 April 2022). 
  43. ^ Narang, Sharan; Chowdhery, Aakanksha. Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance. ai.googleblog.com. April 4, 2022 [2023-03-09]. (原始内容存档于2022-04-04) (英语). 
  44. ^ Susan Zhang; Mona Diab; Luke Zettlemoyer. Democratizing access to large-scale language models with OPT-175B. ai.facebook.com. [2023-03-12]. (原始内容存档于2023-03-12). 
  45. ^ Zhang, Susan; Roller, Stephen; Goyal, Naman; Artetxe, Mikel; Chen, Moya; Chen, Shuohui; Dewan, Christopher; Diab, Mona; Li, Xian; Lin, Xi Victoria; Mihaylov, Todor; Ott, Myle; Shleifer, Sam; Shuster, Kurt; Simig, Daniel; Koura, Punit Singh; Sridhar, Anjali; Wang, Tianlu; Zettlemoyer, Luke. OPT: Open Pre-trained Transformer Language Models. 21 June 2022. arXiv:2205.01068可免费查阅 [cs.CL]. 
  46. ^ metaseq/projects/OPT/chronicles at main · facebookresearch/metaseq. GitHub. [2024-10-18] (英语). 
  47. ^ 47.0 47.1 Khrushchev, Mikhail; Vasilev, Ruslan; Petrov, Alexey; Zinov, Nikolay, YaLM 100B, 2022-06-22 [2023-03-18], (原始内容存档于2023-06-16) 
  48. ^ 48.0 48.1 Lewkowycz, Aitor; Andreassen, Anders; Dohan, David; Dyer, Ethan; Michalewski, Henryk; Ramasesh, Vinay; Slone, Ambrose; Anil, Cem; Schlag, Imanol; Gutman-Solo, Theo; Wu, Yuhuai; Neyshabur, Behnam; Gur-Ari, Guy; Misra, Vedant. Solving Quantitative Reasoning Problems with Language Models. 30 June 2022. arXiv:2206.14858可免费查阅 [cs.CL]. 
  49. ^ Minerva: Solving Quantitative Reasoning Problems with Language Models. ai.googleblog.com. 30 June 2022 [20 March 2023]. 
  50. ^ Ananthaswamy, Anil. In AI, is bigger always better?. Nature. 8 March 2023, 615 (7951): 202–205 [9 March 2023]. Bibcode:2023Natur.615..202A. PMID 36890378. S2CID 257380916. doi:10.1038/d41586-023-00641-w. (原始内容存档于16 March 2023). 
  51. ^ bigscience/bloom · Hugging Face. huggingface.co. [2023-03-13]. (原始内容存档于2023-04-12). 
  52. ^ Taylor, Ross; Kardas, Marcin; Cucurull, Guillem; Scialom, Thomas; Hartshorn, Anthony; Saravia, Elvis; Poulton, Andrew; Kerkez, Viktor; Stojnic, Robert. Galactica: A Large Language Model for Science. 16 November 2022. arXiv:2211.09085可免费查阅 [cs.CL]. 
  53. ^ 20B-parameter Alexa model sets new marks in few-shot learning. Amazon Science. 2 August 2022 [12 March 2023]. (原始内容存档于15 March 2023). 
  54. ^ Soltan, Saleh; Ananthakrishnan, Shankar; FitzGerald, Jack; et al. AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model. 3 August 2022. arXiv:2208.01448可免费查阅 [cs.CL]. 
  55. ^ AlexaTM 20B is now available in Amazon SageMaker JumpStart | AWS Machine Learning Blog. aws.amazon.com. 17 November 2022 [13 March 2023]. (原始内容存档于13 March 2023). 
  56. ^ 56.0 56.1 56.2 Introducing LLaMA: A foundational, 65-billion-parameter large language model. Meta AI. 24 February 2023 [9 March 2023]. (原始内容存档于3 March 2023). 
  57. ^ 57.0 57.1 57.2 The Falcon has landed in the Hugging Face ecosystem. huggingface.co. [2023-06-20]. (原始内容存档于2023-06-20). 
  58. ^ GPT-4 Technical Report (PDF). OpenAI. 2023 [March 14, 2023]. (原始内容存档 (PDF)于March 14, 2023). 
  59. ^ Schreiner, Maximilian. GPT-4 architecture, datasets, costs and more leaked. THE DECODER. 2023-07-11 [2024-07-26]. (原始内容存档于2023-07-12) (美国英语). 
  60. ^ Dickson, Ben. Meta introduces Chameleon, a state-of-the-art multimodal model. VentureBeat. 22 May 2024. 
  61. ^ Dey, Nolan. Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models. Cerebras. March 28, 2023 [March 28, 2023]. (原始内容存档于March 28, 2023). 
  62. ^ Abu Dhabi-based TII launches its own version of ChatGPT. tii.ae. [2023-04-03]. (原始内容存档于2023-04-03). 
  63. ^ Penedo, Guilherme; Malartic, Quentin; Hesslow, Daniel; Cojocaru, Ruxandra; Cappelli, Alessandro; Alobeidli, Hamza; Pannier, Baptiste; Almazrouei, Ebtesam; Launay, Julien. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. 2023-06-01. arXiv:2306.01116可免费查阅 [cs.CL]. 
  64. ^ tiiuae/falcon-40b · Hugging Face. huggingface.co. 2023-06-09 [2023-06-20]. 
  65. ^ UAE's Falcon 40B, World's Top-Ranked AI Model from Technology Innovation Institute, is Now Royalty-Free. Archived 2024-02-08 at the Wayback Machine. 31 May 2023.
  66. ^ Wu, Shijie; Irsoy, Ozan; Lu, Steven; Dabravolski, Vadim; Dredze, Mark; Gehrmann, Sebastian; Kambadur, Prabhanjan; Rosenberg, David; Mann, Gideon. BloombergGPT: A Large Language Model for Finance. March 30, 2023. arXiv:2303.17564可免费查阅 [cs.LG]. 
  67. ^ Ren, Xiaozhe; Zhou, Pingyi; Meng, Xinfan; Huang, Xinjing; Wang, Yadao; Wang, Weichao; Li, Pengfei; Zhang, Xiaoda; Podolskiy, Alexander; Arshinov, Grigory; Bout, Andrey; Piontkovskaya, Irina; Wei, Jiansheng; Jiang, Xin; Su, Teng; Liu, Qun; Yao, Jun. PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing. March 19, 2023. arXiv:2303.10845可免费查阅 [cs.CL]. 
  68. ^ Köpf, Andreas; Kilcher, Yannic; von Rütte, Dimitri; Anagnostidis, Sotiris; Tam, Zhi-Rui; Stevens, Keith; Barhoum, Abdullah; Duc, Nguyen Minh; Stanley, Oliver; Nagyfi, Richárd; ES, Shahul; Suri, Sameer; Glushkov, David; Dantuluri, Arnav; Maguire, Andrew. OpenAssistant Conversations – Democratizing Large Language Model Alignment. 2023-04-14. arXiv:2304.07327可免费查阅 [cs.CL]. 
  69. ^ Wrobel, Sharon. Tel Aviv startup rolls out new advanced AI language model to rival OpenAI. www.timesofisrael.com. [2023-07-24]. (原始内容存档于2023-07-24). 
  70. ^ Wiggers, Kyle. With Bedrock, Amazon enters the generative AI race. TechCrunch. 2023-04-13 [2023-07-24]. (原始内容存档于2023-07-24). 
  71. ^ 71.0 71.1 Elias, Jennifer. Google's newest A.I. model uses nearly five times more text data for training than its predecessor. CNBC. 16 May 2023 [18 May 2023]. (原始内容存档于16 May 2023). 
  72. ^ Introducing PaLM 2. Google. May 10, 2023 [May 18, 2023]. (原始内容存档于May 18, 2023). 
  73. ^ 73.0 73.1 Introducing Llama 2: The Next Generation of Our Open Source Large Language Model. Meta AI. 2023 [2023-07-19]. (原始内容存档于2024-01-05). 
  74. ^ llama/MODEL_CARD.md at main · meta-llama/llama. GitHub. [2024-05-28]. (原始内容存档于2024-05-28). 
  75. ^ Claude 2. anthropic.com. [12 December 2023]. (原始内容存档于15 December 2023). 
  76. ^ Nirmal, Dinesh. Building AI for business: IBM's Granite foundation models. IBM Blog. 2023-09-07 [2024-08-11]. (原始内容存档于2024-07-22) (美国英语). 
  77. ^ Announcing Mistral 7B. Mistral. 2023 [2023-10-06]. (原始内容存档于2024-01-06). 
  78. ^ Introducing Claude 2.1. anthropic.com. [12 December 2023]. (原始内容存档于15 December 2023). 
  79. ^ xai-org/grok-1, xai-org, 2024-03-19 [2024-03-19], (原始内容存档于2024-05-28) 
  80. ^ Grok-1 model card. x.ai. [12 December 2023]. 
  81. ^ Gemini – Google DeepMind. deepmind.google. [12 December 2023]. (原始内容存档于8 December 2023). 
  82. ^ Franzen, Carl. Mistral shocks AI community as latest open source model eclipses GPT-3.5 performance. VentureBeat. 11 December 2023 [12 December 2023]. (原始内容存档于11 December 2023). 
  83. ^ Mixtral of experts. mistral.ai. 11 December 2023 [12 December 2023]. (原始内容存档于13 February 2024). 
  84. ^ AI, Mistral. Cheaper, Better, Faster, Stronger. mistral.ai. 2024-04-17 [2024-05-05]. (原始内容存档于2024-05-05). 
  85. ^ 85.0 85.1 DeepSeek-AI; Bi, Xiao; Chen, Deli; Chen, Guanting; Chen, Shanhuang; Dai, Damai; Deng, Chengqi; Ding, Honghui; Dong, Kai, DeepSeek LLM: Scaling Open-Source Language Models with Longtermism, 2024-01-05, arXiv:2401.02954可免费查阅 
  86. ^ 86.0 86.1 Hughes, Alyssa. Phi-2: The surprising power of small language models. Microsoft Research. 12 December 2023 [13 December 2023]. (原始内容存档于12 December 2023). 
  87. ^ Our next-generation model: Gemini 1.5. Google. 15 February 2024 [16 February 2024]. (原始内容存档于16 February 2024). This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens. 
  88. ^ Gemma –通过GitHub. 
  89. ^ Introducing the next generation of Claude. www.anthropic.com. [2024-03-04]. (原始内容存档于2024-03-04). 
  90. ^ Fugaku-LLM/Fugaku-LLM-13B · Hugging Face. huggingface.co. [2024-05-17]. (原始内容存档于2024-05-17). 
  91. ^ Phi-3. azure.microsoft.com. 23 April 2024 [2024-04-28]. (原始内容存档于2024-04-27). 
  92. ^ Phi-3 Model Documentation. huggingface.co. [2024-04-28]. (原始内容存档于2024-05-13). 
  93. ^ Qwen2. GitHub. [2024-06-17]. (原始内容存档于2024-06-17). 
  94. ^ DeepSeek-AI; Liu, Aixin; Feng, Bei; Wang, Bin; Wang, Bingxuan; Liu, Bo; Zhao, Chenggang; Dengr, Chengqi; Ruan, Chong, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, 2024-06-19, arXiv:2405.04434可免费查阅 
  95. ^ nvidia/Nemotron-4-340B-Base · Hugging Face. huggingface.co. 2024-06-14 [2024-06-15]. (原始内容存档于2024-06-15). 
  96. ^ Nemotron-4 340B | Research. research.nvidia.com. [2024-06-15]. (原始内容存档于2024-06-15). 
  97. ^ "The Llama 3 Herd of Models" (July 23, 2024) Llama Team, AI @ Meta
  98. ^ llama-models/models/llama3_1/MODEL_CARD.md at main · meta-llama/llama-models. GitHub. [2024-07-23]. (原始内容存档于2024-07-23) (英语). 
  99. ^ deepseek-ai/DeepSeek-V3, DeepSeek, 2024-12-26 [2024-12-26] 
  100. ^ Amazon Nova Micro, Lite, and Pro - AWS AI Service Cards, Amazon, 2024-12-27 [2024-12-27] 
  101. ^ deepseek-ai/DeepSeek-R1, DeepSeek, 2025-01-21 [2025-01-21] 
  102. ^ DeepSeek-AI; Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; Xu, Runxin; Zhu, Qihao; Ma, Shirong, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025-01-22, arXiv:2501.12948可免费查阅 
  103. ^ Qwen; Yang, An; Yang, Baosong; Zhang, Beichen; Hui, Binyuan; Zheng, Bo; Yu, Bowen; Li, Chengyuan; Liu, Dayiheng, Qwen2.5 Technical Report, 2025-01-03, arXiv:2412.15115可免费查阅 
  104. ^ 104.0 104.1 MiniMax; Li, Aonian; Gong, Bangwei; Yang, Bo; Shan, Boji; Liu, Chang; Zhu, Cheng; Zhang, Chunhao; Guo, Congchao, MiniMax-01: Scaling Foundation Models with Lightning Attention, 2025-01-14 [2025-01-26], arXiv:2501.08313可免费查阅 
  105. ^ MiniMax-AI/MiniMax-01, MiniMax, 2025-01-26 [2025-01-26] 
  106. ^ Kavukcuoglu, Koray. Gemini 2.0 is now available to everyone. Google. [6 February 2025]. 
  107. ^ Gemini 2.0: Flash, Flash-Lite and Pro. Google for Developers. [6 February 2025]. 
  108. ^ Franzen, Carl. Google launches Gemini 2.0 Pro, Flash-Lite and connects reasoning model Flash Thinking to YouTube, Maps and Search. VentureBeat. 5 February 2025 [6 February 2025].