Supporting 200K Ultra-Long Context, Capable of Reading 300,000 Chinese Characters at Once, "InternLM" 2.0 Officially Open-Sourced
On January 17th, SenseTime and Shanghai AI Lab, in collaboration with The Chinese University of Hong Kong and Fudan University, officially released the next-generation large language model InternLM2. The core philosophy of InternLM2 lies in returning to the essence of language modeling: achieving qualitative improvements in the model's foundational language modeling capability by raising the quality and information density of the training corpus. This has led to significant progress in areas such as mathematics, coding, dialogue, and creative writing, with overall performance reaching the leading level among open-source models.
InternLM2 was trained on a high-quality corpus of 2.6 trillion tokens. Following the setup of the first-generation InternLM, InternLM2 comes in two parameter sizes (7B and 20B), each with base and chat versions, to suit different application scenarios. Like its predecessor, it is open-source and free for commercial use.
Open-source links:
Github: https://github.com/InternLM/InternLM
HuggingFace: https://huggingface.co/internlm
ModelScope: https://modelscope.cn/organization/Shanghai_AI_Laboratory
Research on large language models should return to the essence of language modeling: the foundation for improving a large model's performance across tasks lies in strengthening its language modeling capability. To this end, the joint team developed a new generation of data cleaning and filtering technology, which strengthens the foundation of large model capabilities through higher-quality corpora with higher information density.
The team has mainly developed technical methods in the following aspects:
Multi-dimensional data value assessment: Comprehensive evaluation and improvement of data value based on dimensions such as text quality, information quality, and information density;
High-quality corpus-driven data enrichment: utilizing the characteristics of high-quality corpora to further enrich similar corpora drawn from the physical world, the internet, and corpus repositories;
Targeted data completion: targeted supplementation of corpora, focusing on strengthening core capabilities such as real-world knowledge, mathematics, and coding.
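The release does not include the actual filtering pipeline; a minimal illustrative sketch of multi-dimensional quality scoring, using simple placeholder heuristics (length, lexical variety, and unigram entropy as a crude stand-in for information density), might look like:

```python
import math

def score_document(text: str) -> dict:
    """Score a document on illustrative quality dimensions.
    These heuristics are placeholders, not InternLM2's actual pipeline."""
    words = text.split()
    n = max(len(words), 1)
    # Text quality: penalize very short documents and heavy repetition.
    unique_ratio = len(set(words)) / n
    length_score = min(n / 100, 1.0)
    # Information density: approximate with unigram entropy, normalized to [0, 1].
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "text_quality": length_score * unique_ratio,
        "info_density": min(entropy / 8, 1.0),
    }

def keep(text: str, threshold: float = 0.3) -> bool:
    """Keep a document only if it clears the threshold on every dimension."""
    return min(score_document(text).values()) >= threshold

varied_doc = " ".join(f"token{i}" for i in range(120))  # diverse, long enough
spam_doc = "spam spam spam spam"                        # repetitive, short
kept = [d for d in [varied_doc, spam_doc] if keep(d)]
```

A real pipeline would combine model-based quality classifiers, deduplication, and domain-specific rules rather than these toy scores, but the structure (score per dimension, filter on all of them) carries over.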
Currently, the data cleaning and filtering technology behind InternLM has undergone three rounds of iterative upgrades. Using only about 60% of the training data, models achieve performance comparable to training on 1T tokens of second-generation data, significantly improving training efficiency.
Based on the third-generation data cleaning and filtering technology, the language modeling capabilities of InternLM2 have been significantly enhanced.
The ability to handle and understand long-context inputs can significantly expand the application scenarios of large models, such as large-document processing, complex reasoning and calculation, and tool calls in real-world settings. However, the limited context length of large models remains a major challenge in academia and industry. By expanding the training window size and improving position encoding, InternLM2 supports a context of 200,000 tokens, accepting and processing roughly 300,000 Chinese characters (about five to six hundred pages of documents) at once and accurately extracting key information, achieving "needle in a haystack" retrieval in long texts. Following industry practice, researchers conducted a needle-in-a-haystack test on InternLM2: randomly inserting key information at different positions within a long text and posing questions to test the model's ability to extract it.
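The test harness itself is straightforward to sketch. The snippet below (an illustrative construction, not the researchers' actual code) builds a long filler context, inserts a "needle" at a chosen fractional depth, and forms the probe prompt; a real evaluation would sweep context length and depth, send each prompt to the model, and record whether the answer recovers the needle.

```python
def build_haystack(filler: str, needle: str, total_chars: int, depth: float) -> str:
    """Insert `needle` at fractional `depth` (0 = start, 1 = end) of repeated filler."""
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(total_chars * depth)
    return body[:pos] + "\n" + needle + "\n" + body[pos:]

def make_probe(context: str, question: str) -> str:
    """Wrap the long context and the retrieval question into a single prompt."""
    return f"{context}\n\nQuestion: {question}\nAnswer:"

needle = "The secret passphrase is BLUE-HARBOR-42."
ctx = build_haystack("Lorem ipsum dolor sit amet. ", needle, 10_000, depth=0.5)
prompt = make_probe(ctx, "What is the secret passphrase?")
# A real test would send `prompt` to the model and check the answer at each
# (context length, depth) grid point, producing the heatmap described below.
```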
The above diagram shows InternLM2's recall accuracy (indicated by color) for retrieving key information across different context lengths (horizontal axis) and insertion positions within the context (vertical axis). Red indicates lower recall accuracy; green indicates higher. The results demonstrate that InternLM2 maintains near-perfect recall even when the context length extends to 200K, validating its robust support for ultra-long contexts.
To test InternLM2's capability in real-world long-text processing tasks, researchers input a 3-hour transcript of a public meeting recording into the model and asked it to extract key information. The results showed that despite numerous uncorrected typos in the raw transcript, InternLM2 accurately distilled the key information and summarized the main speakers' viewpoints.
InternLM2 shows comprehensive improvements across all capabilities, with particularly significant advancements in reasoning, mathematics, and coding compared to the first-generation InternLM, outperforming other open-source models of similar scale. Based on the application methods of large language models and key areas of user focus, researchers have defined six capability dimensions: language, knowledge, reasoning, mathematics, coding, and examinations. A comprehensive evaluation was conducted on multiple models of similar scale across 55 mainstream benchmark datasets. The results show that InternLM2's lightweight and medium-sized versions perform exceptionally well among models of comparable scale.
The following table compares the performance of various InternLM2 versions with ChatGPT (GPT-3.5) and GPT-4 on typical benchmark datasets. Notably, InternLM2 achieves performance comparable to ChatGPT with only 20B parameters in a medium-sized model. In benchmarks requiring high reasoning capabilities such as AGIEval, BigBench-Hard (BBH), GSM8K, and MATH, InternLM2 even outperforms ChatGPT.
Meanwhile, the enhanced comprehensive performance brings all-round improvements in downstream tasks. The newly released InternLM2 offers excellent conversational and creative experiences, supports multi-round task planning and tool invocation, and provides practical data analysis capabilities.
InternLM2 not only shows significant improvements in objective performance metrics but also demonstrates noticeable enhancements in subjective user experience, delivering superior conversational and interactive experiences. Research tests indicate that InternLM2-Chat can accurately understand and follow user intentions, demonstrating strong empathy and rich structured creative capabilities. Below are some examples:
Example 1: Developing a course syllabus under strict formatting requirements
Example 2: Providing comforting responses with humanistic care to users
Example 3: Unleashing imagination to write the script for "The Wandering Earth 3"
The improvement in conversational and creative experiences stems from two factors: a significant enhancement in fundamental language capabilities and advancements in fine-tuning techniques. During the fine-tuning of InternLM2, instruction-tuning corpora processed with the third-generation data cleaning and filtering technology were employed, along with a stronger Online RLHF pipeline.
Researchers conducted three rounds of iterative updates to both the reward model and the dialogue model during fine-tuning, with each round refining preference data and prompts based on the previous model's performance. In both the reward model (RM) training and Proximal Policy Optimization (PPO) phases, researchers balanced various types of prompts, which improved both dialogue safety and user experience.
With enhanced capabilities in instruction understanding, tool selection, and result reflection, InternLM2 supports the construction of complex agents, enabling multi-round tool invocation and multi-step planning to accomplish intricate tasks. The collaborative team developed a fine-grained tool-invocation evaluation set, T-Eval (https://open-compass.github.io/T-Eval). On this benchmark, InternLM2-Chat-7B outperforms Claude-2.1 and current open-source models, with performance approaching that of GPT-3.5.
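The article does not publish the RLHF training code; for intuition, the standard pairwise loss used in reward-model training (which pipelines like the one described here build on) can be sketched as follows. The reward values here are stand-ins for scalar outputs of a reward model.

```python
import math

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the reward of the preferred response above the
    rejected one. Computed in a numerically stable form."""
    diff = r_chosen - r_rejected
    # -log sigmoid(diff) == log(1 + e^{-diff}), split to avoid overflow
    return math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)

# Correct ranking (chosen scores higher) gives a small loss:
low = pairwise_rm_loss(2.0, -1.0)
# Inverted ranking gives a large loss, producing a strong gradient signal:
high = pairwise_rm_loss(-1.0, 2.0)
```

In the iterative scheme the text describes, each round would retrain the RM on freshly collected preference pairs before the PPO phase uses it to score rollouts.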
Through tool invocation, large language models can acquire knowledge and handle more complex problems via search, computation, and code interpreters, thereby expanding application boundaries. Researchers conducted granular decomposition and analysis of the model's tool invocation process, implementing targeted enhancements and optimizations for planning, reasoning, tool selection, comprehension, execution, and reflection steps.
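The decomposition into planning, tool selection, execution, and reflection can be illustrated with a toy agent step. Everything below is a hypothetical sketch (the tool registry, the `safe_calc` helper, and the structured `plan` dict are all assumptions for illustration), not InternLM2's or T-Eval's actual interface.

```python
import ast
import operator

def safe_calc(expr: str) -> float:
    """A tiny calculator tool: safely evaluates +, -, *, / expressions."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            return ops[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

TOOLS = {"calculator": safe_calc}  # hypothetical tool registry

def run_agent_step(plan: dict) -> dict:
    """One select -> execute -> reflect cycle; `plan` stands in for the
    model's structured tool call emitted during the planning stage."""
    tool = TOOLS[plan["tool"]]             # tool selection
    result = tool(plan["input"])           # execution
    ok = isinstance(result, (int, float))  # reflection: sanity-check the result
    return {"result": result, "ok": ok}

out = run_agent_step({"tool": "calculator", "input": "37 * 21 + 5"})
```

Benchmarks like T-Eval score each of these stages separately, which is what makes the evaluation "fine-grained" rather than end-to-end only.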
Mathematical capability is a crucial manifestation of a model's logical thinking and reasoning abilities. Shanghai AI Lab has comprehensively improved InternLM2's mathematical capabilities, bringing them to the benchmark level of current open-source models.
Based on scientifically constructed pre-training corpora, InternLM2 has developed strong endogenous computational abilities. Without relying on external tools like calculators, it achieves nearly 100% accuracy on simple arithmetic with operands up to 100, and approximately 80% accuracy with operands up to 1,000. In GSM8K and MATH evaluations, InternLM2-20B surpasses ChatGPT (GPT-3.5). For more complex computations, InternLM2-Chat can leverage a code interpreter to write code for calculations or to formally verify reasoning results, solving problems with higher computational demands or more intricate derivations. On typical mathematical evaluation sets like GSM8K and MATH, InternLM2 achieved higher scores with the assistance of the code interpreter. Notably, on the more challenging MATH dataset, InternLM2's score improved significantly from 32.5 to 51.2, even surpassing GPT-4's performance.
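The code-interpreter pattern is: the model emits Python code, a runtime executes it, and the captured output is fed back as the answer. A minimal sketch of such a runtime (illustrative only; a production interpreter would run in a proper sandbox, not a restricted `exec`):

```python
import contextlib
import io

def run_generated_code(code: str) -> str:
    """Execute model-generated Python in a restricted namespace and
    capture whatever it prints. Illustration only, not a secure sandbox."""
    buf = io.StringIO()
    namespace = {"__builtins__": {"print": print, "range": range, "sum": sum}}
    with contextlib.redirect_stdout(buf):
        exec(code, namespace)
    return buf.getvalue().strip()

# Suppose the model, asked for the sum of squares of 1..100, writes:
generated = "print(sum(i * i for i in range(1, 101)))"
answer = run_generated_code(generated)  # captured output: "338350"
```

Offloading exact arithmetic to executed code like this is what lifts scores on computation-heavy benchmarks beyond what pure next-token prediction achieves.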
The following example demonstrates how InternLM2 can collaborate with the code interpreter to solve complex advanced mathematical problems.
With its robust foundational capabilities in computation and tool invocation, InternLM2 gains practical data analysis and visualization abilities, further aligning with real user application scenarios.