One Year of ChatGPT: Are Open-Source Large Language Models Catching Up?
Since its release in late 2022, ChatGPT has brought significant changes to the research and commercial fields of artificial intelligence. Through supervised fine-tuning and reinforcement learning from human feedback, the model can answer human questions and follow instructions across a wide range of tasks. Following this success, interest in LLMs has surged, with new LLMs continuously emerging in academia and industry, including many startups focused on LLMs.
Although closed-source LLMs (such as OpenAI's GPT and Anthropic's Claude) generally outperform their open-source counterparts, the progress of the latter has been rapid, with claims of achieving performance comparable to or even surpassing ChatGPT in certain tasks. This has profoundly impacted research on large language models and holds remarkable commercial value. On the one-year anniversary of ChatGPT's release, this article aims to provide a comprehensive overview of the success of open-source LLMs and thoroughly investigate the tasks where open-source LLMs claim to match or exceed ChatGPT's performance.
Note: The latest version of this article was updated on December 5 and does not yet cover the recently released first open-source MoE large model, Mixtral (8x7B), which is reported to match or even surpass Llama-2 (70B) and GPT-3.5. (The following content was compiled and published by OneFlow. For reprinting, please contact OneFlow for authorization. Original source: https://arxiv.org/pdf/2311.16989.pdf)
A year ago, OpenAI released ChatGPT, which quickly swept through the AI community and the world. It was the first AI chatbot application capable of providing useful, safe, and detailed answers to most questions, following instructions, and even admitting and correcting previous mistakes. Notably, it seemed to excel at natural language tasks typically handled by language models that were pre-trained and then fine-tuned for a specific purpose (such as summarization or question answering).
As a pioneering work in the field, ChatGPT garnered widespread attention—attracting 100 million users within two months of its launch, far outpacing the growth of other popular apps like TikTok or YouTube. [1] Due to its ability to reduce labor costs, automate workflows, and even create entirely new experiences for customers (Cheng et al., 2023), it also attracted massive investments.
However, ChatGPT was not open-sourced and remained under the control of a private company, leaving most technical details unknown. Although OpenAI claimed it followed the procedures introduced in InstructGPT (also known as GPT-3.5) (Ouyang et al., 2022b), its exact architecture, pre-training data, and fine-tuning data remain undisclosed. This closed-source nature has led to several critical issues. First, due to the lack of understanding of the internal details of pre-training and fine-tuning procedures, especially given that LLMs are known to generate harmful, unethical, or false content, it is difficult to accurately assess the potential risks ChatGPT poses to society. Second, there have been reports that ChatGPT's performance changes over time, hindering the reproducibility of results (Chen et al., 2023). Third, ChatGPT has experienced multiple failures, including two major outages in November 2023 alone, during which access to the ChatGPT website and its API was completely blocked. Fourth, companies adopting ChatGPT may worry about high API costs, service disruptions, data ownership, privacy issues, and other unforeseen events, such as the recent dramatic episode involving CEO Sam Altman's dismissal, employee pressure on the board, and his eventual return to the company's leadership.
On the other hand, open-source large language models potentially address or circumvent most of the aforementioned issues, offering a promising direction. For this reason, the research community has actively promoted the development of high-performance LLMs in an open-source environment. However, as of late 2023, it was widely believed that open-source LLMs like Llama-2 (Touvron et al., 2023b) or Falcon (Almazrouei et al., 2023) lagged behind their proprietary counterparts, such as OpenAI's GPT-3.5 (ChatGPT) and GPT-4 (OpenAI, 2023b), Anthropic's Claude 2, or Google's Bard [3], with GPT-4 generally considered the leading proprietary model. Encouragingly, the gap between open-source and proprietary models is narrowing, and open-source LLMs are rapidly catching up.
In fact, as shown in Figure 1, the best open-source LLMs have already outperformed GPT-3.5-turbo on certain standard benchmarks. Catching up, however, is no easy feat for open-source LLMs, and the landscape continues to evolve: proprietary LLMs are regularly retrained on updated data, while open-source LLMs also advance with frequent new releases. Moreover, the sheer diversity of evaluation datasets and benchmarks for LLMs makes it difficult to pinpoint the best-performing model.
This article aims to consolidate recent research on open-source LLMs and outline their performance in various domains, where they match or even surpass ChatGPT. Our contributions include the following three aspects:
- Consolidating various evaluations of open-source LLMs, providing a fair and comprehensive comparison with ChatGPT (see Figure 1, Section 3.1).
- Systematically reviewing open-source LLMs that match or exceed ChatGPT's performance across tasks, along with corresponding analyses (see Figure 2, Sections 3 and 4.2). We also maintain a live webpage to track the latest model updates.[4]
- Analyzing trends in open-source LLM development (Section 4.1), best practices for training them (Section 4.3), and potential issues (Section 4.4).
Who can benefit from this report? This study aims to help academia and industry understand the current landscape and future potential of open-source LLMs. For researchers, it provides a detailed review of the progress and evolving trends in open-source LLMs, highlighting promising directions for future work. For businesses, this survey offers valuable insights and guidance to help decision-makers evaluate the suitability and benefits of adopting open-source LLMs. Next, we will first introduce background concepts (Section 2), then conduct an in-depth exploration of open-source LLMs that outperform ChatGPT in various domains (Section 3), followed by a discussion on insights and issues regarding open-source LLMs (Section 4), and finally conclude (Section 5).
2. Background Concepts
This section briefly introduces fundamental concepts related to LLMs.
2.1 Training Paradigms
Pre-training: All LLMs rely on large-scale self-supervised pre-training on internet text data (Radford et al., 2018; Brown et al., 2020). Decoder-only LLMs follow the causal language modeling objective, where the model learns to predict the next token given the sequence of previous tokens (Bengio et al., 2000). According to the pre-training details shared by open-source LLMs (Touvron et al., 2023a), text data sources include CommonCrawl [5], C4 (Raffel et al., 2020), GitHub, Wikipedia, books, and online discussions from sites such as Reddit or StackOverflow. It is well known that scaling up the pre-training corpus improves model performance and complements scaling up the model size, a phenomenon known as the scaling law, analyzed in depth by Hoffmann et al. (2022a). Today, the pre-training corpus for LLMs can reach hundreds of billions to trillions of tokens (Touvron et al., 2023b; Penedo et al., 2023).
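As a concrete illustration of the causal language modeling objective described above, here is a minimal sketch assuming PyTorch (a choice of ours; the survey does not prescribe a framework). The `logits` are assumed to come from some decoder-only model, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Next-token prediction loss: each position is trained to predict the token that follows it."""
    # logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len)
    shift_logits = logits[:, :-1, :]   # predictions for positions 0 .. T-2
    shift_labels = input_ids[:, 1:]    # targets are the tokens at positions 1 .. T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```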
Fine-tuning [8]: The goal is to adapt pre-trained LLMs to downstream tasks by updating weights using existing supervised information, typically with datasets several orders of magnitude smaller than those used for pre-training (Devlin et al., 2018). T5 (Raffel et al., 2020) was one of the earliest models to incorporate fine-tuning into a unified text-to-text framework.
Instruction Fine-tuning: Later, fine-tuning was expanded to multiple tasks (Wei et al., 2021a; Aribandi et al., 2021), with each task described by a natural language instruction. Instruction fine-tuning quickly gained popularity due to its ability to significantly improve the zero-shot performance of LLMs (including performance on new tasks not seen during training).
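To make the data format concrete, here is a hypothetical instruction-tuning example in the style popularized by FLAN/Alpaca-like datasets (the fields and strings below are our own, purely illustrative):

```python
example = {
    "instruction": "Summarize the following article in one sentence.",
    "input": "Open-source LLMs have narrowed the gap with proprietary models on several benchmarks ...",
    "output": "Open-source LLMs are rapidly catching up with proprietary models.",
}

# During instruction fine-tuning, the instruction and input are concatenated into a prompt,
# and the model is trained with the causal LM loss, typically on the response tokens only.
prompt = f"{example['instruction']}\n\n{example['input']}\n\n"
target = example["output"]
```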
Standard instruction fine-tuning and multi-task supervised fine-tuning (often referred to as SFT) may still fail to produce models that are aligned with human intent, safe, ethical, and harmless. Such models can be further improved through Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022b): human annotators score the outputs of the fine-tuned model, and the model is fine-tuned again with reinforcement learning (Ouyang et al., 2022b). Recent research suggests that human feedback can be replaced by feedback from LLMs, a process termed Reinforcement Learning from AI Feedback (RLAIF) (Bai et al., 2022b). Direct Preference Optimization (DPO) (Rafailov et al., 2023b) bypasses the need to fit a reward model to human preferences in RLHF, directly fine-tuning the policy with a cross-entropy objective and thereby aligning LLMs with human preferences more effectively.
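To show what "directly fine-tuning the policy with a cross-entropy objective" amounts to, here is a minimal sketch of the DPO loss, again assuming PyTorch; the per-response log-probabilities for the chosen (y_w) and rejected (y_l) completions under the policy and the frozen reference model are assumed to be computed elsewhere:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Binary cross-entropy over implicit rewards given by policy/reference log-ratios."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```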
Some studies focus on quality over quantity when constructing multi-task instruction fine-tuning datasets: Lima (Zhou et al., 2023a) fine-tuned LLaMA-65B with only 1,000 examples, outperforming GPT-3, while Alpagasus (Chen et al., 2023c) improved the performance of Alpaca (Taori et al., 2023) by cleaning its instruction fine-tuning dataset, reducing the number of examples from 52,000 to 9,000.

Continuous pre-training: This refers to performing another round of pre-training on an already pre-trained LLM, typically with a smaller dataset than in the initial phase. It can be used to quickly adapt the model to new domains or to elicit new capabilities. For example, Lemur (Xu et al., 2023d) used continuous pre-training to improve coding and reasoning abilities, while Llama-2-long (Xiong et al., 2023) employed it to extend the context window.
Inference: Several decoding strategies exist for sequence generation with autoregressive LLMs, differing in the degree of randomness and diversity of their outputs. Increasing the temperature parameter during sampling makes outputs more diverse, while setting the temperature to 0 reverts to greedy decoding, which may be necessary when deterministic outputs are required. Sampling methods like top-k (Fan et al., 2018) and top-p (Holtzman et al., 2019) limit the pool of tokens available for sampling at each decoding step.
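The following sketch illustrates how temperature, greedy, and top-p (nucleus) sampling interact when choosing the next token; it assumes PyTorch and a 1-D vector of next-token logits, and is purely illustrative:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, top_p: float = 0.9) -> int:
    """logits: (vocab_size,) unnormalized scores for the next token."""
    if temperature == 0.0:
        return int(torch.argmax(logits))                  # greedy decoding: deterministic
    probs = torch.softmax(logits / temperature, dim=-1)   # higher temperature -> flatter distribution
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p              # smallest prefix whose mass reaches top_p
    kept = sorted_probs * keep
    kept = kept / kept.sum()                              # renormalize over the nucleus
    idx = torch.multinomial(kept, num_samples=1)
    return int(sorted_ids[idx])
```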
Several techniques exist to improve inference speed, particularly for longer sequences where attention complexity grows quadratically with input length. FlashAttention (Dao et al., 2022) accelerates both training and inference by optimizing read/write operations between GPU memory hierarchies. FlashDecoding (Dao et al., 2023) parallelizes the loading of key-value (KV) caches in attention mechanisms, achieving up to 8× end-to-end speedup. Speculative decoding (Leviathan et al., 2023; Chen et al., 2023b) uses an additional small language model to approximate the distribution of next tokens from the LLM, maintaining performance while accelerating decoding. vLLM (Kwon et al., 2023) leverages the PagedAttention algorithm (which optimizes memory usage for attention keys and values) to accelerate LLM inference and serving.
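As a usage sketch, serving a model with vLLM typically looks like the following (interface details may differ across vLLM versions, and the checkpoint named below is only an example):

```python
from vllm import LLM, SamplingParams

# PagedAttention-backed engine; downloads the example checkpoint from the Hugging Face Hub.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```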
2.2 Task Domains and Evaluation
Proper evaluation of LLM capabilities remains an active research area due to the need for diverse and comprehensive assessments. Question-answering datasets (Joshi et al., 2017; Kwiatkowski et al., 2019; Lin et al., 2022) are popular benchmarks, though new benchmarks specifically tailored for LLM evaluation have recently emerged (Dubois et al., 2023; Beeching et al., 2023; Zheng et al., 2023).
3. Open-source LLMs vs. ChatGPT
In this section, we explore LLM capabilities across six major domains: generalization ability, agent capabilities, logical reasoning (including mathematical and coding abilities), long-text modeling, specific applications (such as QA or summarization), and trustworthiness. Due to space limitations, interested readers can refer to Section 3 of the original paper (https://arxiv.org/pdf/2311.16989.pdf) for details, with key conclusions presented in Section 4.
4.1 Trends in LLM Development
Since Brown et al. (2020) demonstrated the remarkable zero-shot and few-shot performance of the GPT-3 model across various tasks, significant efforts have been devoted to the development and advancement of large language models (LLMs). One research direction focuses on scaling up model parameters, including models like Gopher (Rae et al., 2021), GLaM (Du et al., 2022), LaMDA (Thoppilan et al., 2022), MT-NLG (Smith et al., 2022), and PaLM (Chowdhery et al., 2022), with the largest model reaching 540 billion parameters. Despite their impressive capabilities, these models are closed-source, limiting their widespread application, which has led to increasing interest in developing open-source LLMs (Zhang et al., 2022; Workshop et al., 2022).
In contrast to scaling model size, another research direction explores better strategies or objectives for pre-training smaller models, such as Chinchilla (Hoffmann et al., 2022b) and UL2 (Tay et al., 2022). Beyond pre-training, efforts have also been directed toward instruction fine-tuning of language models, including FLAN (Wei et al., 2021b), T0 (Sanh et al., 2021), and Flan-T5 (Chung et al., 2022).
A year ago, OpenAI's ChatGPT significantly shifted the research focus of the natural language processing (NLP) community (Qin et al., 2023a). To catch up with OpenAI, Google and Anthropic introduced Bard and Claude, respectively. Although their performance is comparable to ChatGPT in many tasks, there remains a performance gap with OpenAI's latest model, GPT-4 (OpenAI, 2023b). Since the success of these models largely stems from reinforcement learning from human feedback (RLHF) (Schulman et al., 2017b; Ouyang et al., 2022a), researchers have explored various improvements to RLHF (Yuan et al., 2023; Rafailov et al., 2023b; Lee et al., 2023b).
To advance open-source LLM research, Meta released the LLaMA series of models (Touvron et al., 2023a, b). Since then, open-source models based on LLaMA have rapidly emerged. One representative research direction involves fine-tuning LLaMA with instruction data, including models like Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), Lima (Zhou et al., 2023b), and WizardLM (Xu et al., 2023a). Current research also explores enhancing agent capabilities (Xu et al., 2023d; Zeng et al., 2023; Patil et al., 2023; Qin et al., 2023b), logical reasoning (Roziere et al., 2023; Luo et al., 2023a, c), and long-context modeling (Tworkowski et al., 2023; Xiong et al., 2023; Xu et al., 2023b) in LLaMA-based open-source LLMs. Additionally, instead of building LLMs on LLaMA, many efforts focus on training powerful LLMs from scratch, such as MPT (Team, 2023), Falcon (Almazrouei et al., 2023), XGen (Nijkamp et al., 2023), Phi (Gunasekar et al., 2023; Li et al., 2023e), Baichuan (Yang et al., 2023a), Mistral (Jiang et al., 2023a), Grok (xAI, 2023), and Yi (01ai, 2023). We believe that developing more powerful and efficient open-source LLMs to democratize the capabilities of closed-source LLMs is a promising future research direction.

In terms of comprehensive capabilities, Llama-2-chat-70B (Touvron et al., 2023b) outperforms GPT-3.5-turbo on certain benchmarks but still lags behind on most other tasks. Through distilled direct preference optimization, Zephyr-7B (Tunstall et al., 2023) approaches the performance of 70B LLMs. WizardLM-70B (Xu et al., 2023a) and GodziLLa-70B (Philippines, 2023) achieve performance comparable to GPT-3.5-turbo, indicating a promising research direction.
In some domains, open-source LLMs surpass GPT-3.5-turbo. For LLM-based agents, with broader and more task-specific pre-training and fine-tuning, open-source LLMs can outperform GPT-3.5-turbo on certain tasks. For example, Lemur-70B-chat (Xu et al., 2023d) performs better at exploring environments and following feedback on coding tasks, AgentTuning (Zeng et al., 2023) shows improvements on unseen agent tasks, ToolLLaMA (Qin et al., 2023b) excels at tool usage, and Gorilla (Patil et al., 2023) is better than GPT-4 at writing API calls.
In logical reasoning, WizardCoder (Luo et al., 2023c) and WizardMath (Luo et al., 2023a) enhance reasoning capabilities through improved instruction fine-tuning. Lemur (Xu et al., 2023d) and Phi (Gunasekar et al., 2023; Li et al., 2023e) achieve stronger capabilities by pre-training on higher-quality data.
For modeling long contexts, Llama-2-long (Xiong et al., 2023) is pre-trained on longer sequences with a larger context window, improving performance on selected benchmarks. Xu et al. (2023b) improve performance on seven long-context tasks by combining context window extension with positional interpolation and retrieval augmentation.

For specific application capabilities, InstructRetro (Wang et al., 2023a) improves performance in open-ended QA through retrieval and instruction fine-tuning. With task-specific fine-tuning, MentaLlama-chat-13B (Yang et al., 2023c) surpasses GPT-3.5-turbo on mental health analysis datasets, and Radiology-Llama2 (Liu et al., 2023) improves performance on radiology reports. Struc-Bench (Tang et al., 2023b), a fine-tuned 7B model, improves structured response generation compared to GPT-3.5-turbo, a core capability for supporting agent tasks. Shepherd (Wang et al., 2023c), with only 7B parameters, achieves performance comparable to or better than GPT-3.5-turbo in generating model feedback and evaluations.

For trustworthy AI, hallucinations can be reduced using higher-quality fine-tuning data (Lee et al., 2023a), context-aware decoding techniques (Dhuliawala et al., 2023), external knowledge augmentation (Li et al., 2023c; Yu et al., 2023b; Peng et al., 2023; Feng et al., 2023), or multi-agent dialogue (Cohen et al., 2023; Du et al., 2023). In the field of AI safety, GPT-3.5-turbo and GPT-4 remain unsurpassed. Because of the large-scale RLHF (Reinforcement Learning from Human Feedback) applied to GPT models (Bai et al., 2022a), they are widely considered to exhibit safer and more ethical behavior. This may be more critical for commercial LLMs than for open-source LLMs. However, as the RLHF process becomes more accessible (Bai et al., 2022b; Rafailov et al., 2023a), open-source LLMs are expected to achieve further improvements in safety.
4.3 Secrets of the Best Open-Source LLMs
Training large language models involves complex practices and requires substantial resources, including data collection, preprocessing, model design, and training. Although the release of open-source LLMs is increasing, the detailed practices of leading models are often kept confidential. Below are some widely recognized best practices in the community.
Data: Pretraining involves the use of trillions of tokens from publicly accessible sources. From an ethical standpoint, it is crucial to exclude all data containing private information (Touvron et al., 2023b). Unlike pretraining data, fine-tuning data is smaller in quantity but higher in quality. LLMs fine-tuned with high-quality data have demonstrated superior performance in specific domains (Philippines, 2023; Zeng et al., 2023; Xu et al., 2023d, a).
Model Architecture: Although most LLMs use decoder-only Transformer architectures, they also employ various techniques to optimize efficiency. Llama-2 utilizes Ghost Attention to improve multi-turn dialogue control (Touvron et al., 2023b). Mistral (Jiang et al., 2023b) uses sliding window attention to handle extended context lengths.
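To make the sliding-window idea concrete, here is a toy sketch of a sliding-window causal attention mask in the spirit of Mistral's design (our illustration, not the model's actual implementation); each query token may attend to itself and at most `window - 1` preceding tokens:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len); True where attention is allowed."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)      # causal and within the local window
```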
Training: The process of supervised fine-tuning (SFT) with instruction-tuning data is critical. To achieve high-quality results, tens of thousands of SFT annotations are required—for example, Llama-2 used 27,540 annotations (Touvron et al., 2023b). The diversity and quality of the data are paramount (Xu et al., 2023a). During the RLHF stage, the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017a) is often preferred to better align model behavior with human preferences and instructions, making PPO crucial for enhancing LLM safety. Direct Preference Optimization (DPO) serves as an alternative to PPO (Rafailov et al., 2023a). For instance, Zephyr-7B (Tunstall et al., 2023) employs distilled DPO and demonstrates performance comparable to 70B-LLMs on various general benchmarks, even surpassing GPT-3.5-turbo on AlpacaEval.
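For reference, here is a minimal sketch of PPO's clipped surrogate objective (Schulman et al., 2017a) as it is typically applied in RLHF; the per-token log-probabilities and advantage estimates are assumed to come from rollout and reward-model stages not shown here:

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)                          # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                    # maximize the clipped objective
```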
4.4 Vulnerabilities and Potential Issues
Data Contamination in Pre-training: This issue has become increasingly prominent, especially since the release of foundation models without disclosed pre-training corpus sources. The lack of transparency can bias perceptions of large language models' (LLMs) true generalization capabilities. Beyond cases where benchmark data is annotated by human experts or larger models and deliberately added to training sets, the root cause of data contamination is that benchmark data sources are included in pre-training corpora. Although such models are not intentionally pre-trained on supervised benchmark data, they may still have seen the exact content. Therefore, pre-training corpus detection (Shi et al., 2023), investigation of overlaps between existing benchmarks and widely used pre-training corpora, and evaluation of benchmark overfitting (Wei et al., 2023) are crucial for enhancing LLM reliability. Future directions may include establishing standardized practices for disclosing pre-training corpus details and developing methods to mitigate data contamination throughout the model development lifecycle.
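As a toy illustration of such overlap checks, the sketch below flags benchmark examples that share long n-grams with a pre-training corpus; real contamination studies use far more careful matching and normalization, and all names here are hypothetical:

```python
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_examples: list[str], corpus_ngrams: set, n: int = 13) -> float:
    """Fraction of benchmark examples sharing at least one n-gram with the corpus."""
    hits = sum(1 for ex in benchmark_examples if ngrams(ex, n) & corpus_ngrams)
    return hits / max(len(benchmark_examples), 1)
```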
Closed-Source Alignment Development: Within the community, RLHF applications using general preference data for alignment have gained increasing attention. However, due to the scarcity of high-quality, publicly available preference datasets and pretrained reward models, only a few open-source LLMs have implemented RLHF for enhanced alignment. Some initiatives (Bai et al., 2022a; Wu et al., 2023; Cui et al., 2023) have attempted to contribute to the open-source community. Nevertheless, challenges remain in complex reasoning, programming, and safety scenarios due to the lack of diverse, high-quality, and scalable preference data.
Difficulties in Continuously Improving Model Capabilities: The breakthroughs in fundamental capabilities outlined here reveal several challenging issues: (1) During pretraining, significant effort has been devoted to exploring improved data compositions for building more robust foundation models. However, the associated costs make such attempts impractical for real-world applications. (2) Models outperforming GPT-3.5-turbo or GPT-4 primarily leverage knowledge distillation from closed-source models and additional expert annotations. While efficient, over-reliance on knowledge distillation may obscure issues that arise when scaling these methods to teacher models.
Additionally, while LLMs are expected to function as agents and provide reasonable explanations to support decision-making, annotating agent-style data for real-world scenarios remains expensive and time-consuming. Essentially, optimizing solely through knowledge distillation or expert annotation cannot sustainably improve LLMs' fundamental capabilities and may soon reach diminishing returns. Future research may explore new methodologies such as unsupervised or self-supervised learning paradigms to achieve continuous progress in core LLM capabilities while mitigating the associated challenges and costs.

5. Conclusion
In this report, we systematically review high-performance open-source LLMs that have surpassed or caught up with ChatGPT across various tasks one year after its release (Section 3). Additionally, we provide in-depth insights and analysis of open-source large language models and explore potential issues (Section 4). We believe this survey will contribute to exploring future development directions for open-source LLMs, stimulate further research and development in this field, and help narrow the gap between open-source and proprietary models.