Anonymous Paper Proposes Clever Trick to Enhance Long-Text Capabilities of Large Models
When it comes to improving large models' long-text capabilities, do you immediately think of length extrapolation or context window expansion?
No, these methods are too hardware-intensive.
Here's a fascinating new approach: unlike KV caching or length extrapolation, it stores vast amounts of contextual information directly in the model's parameters.
The specific method is to create a temporary Lora module that is "streaming-updated" only during long-text generation: previously generated content is continually fed back as training data, so the knowledge it contains is stored in the module's parameters. Once inference is complete, the module is discarded, leaving the model's original parameters untouched.
This lets us store as much contextual information as we want, without expanding the context window at all.
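To make that lifecycle concrete, here is a minimal sketch (my own illustration, not the authors' code) using the Hugging Face transformers and peft libraries; the base model name, Lora rank, and target modules below are placeholder choices.

```python
# Minimal lifecycle sketch of a temporary Lora adapter (illustration, not the paper's code).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# 1) Attach a fresh Lora adapter (initially a no-op); the base weights stay frozen.
lora_cfg = LoraConfig(r=64, lora_alpha=64, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, lora_cfg)

# 2) Generate the document chunk by chunk; after each chunk, take gradient steps
#    on that chunk so its content is absorbed into the adapter weights
#    (see the training-loop sketch further below).

# 3) Once the document is finished, discard the adapter; the base model is untouched.
base_model = model.unload()  # peft's unload() strips the Lora layers without merging them
```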
Experiments show that this approach significantly improves quality on long-text tasks, achieving a 29.6% reduction in perplexity and a 53.2% increase in long-text translation quality (BLEU score).
It is also compatible with and enhances most existing long-text generation methods.
Most importantly, it can greatly reduce computational costs.
While still providing a modest quality gain (a 3.8% reduction in perplexity), it cuts the FLOPs required for inference by 70.5% and latency by 51.5%! For the details, let's dive into the paper.
The method is called Temp-Lora, and its architecture diagram is as follows:
At its core, it involves training temporary Lora modules step-by-step on previously generated text in an autoregressive manner.
This module is highly adaptable and can continuously adjust, enabling a deep understanding of both near and distant contexts. The specific algorithm is as follows:
During the generation process, tokens are produced chunk by chunk. For each chunk, the most recent Lx tokens are used as the input x to generate the subsequent tokens.
Once the number of generated tokens reaches the predefined chunk size ∆, the latest chunk is used to start training the Temp-Lora module, and then generation of the next chunk begins. In the experiments, the authors set ∆ + Lx = W to fully utilize the model's context window. For training the Temp-Lora module, learning to generate a new chunk without any conditioning would not form an effective training objective and could lead to severe overfitting.
To address this issue, the authors incorporate the LT tokens preceding each chunk into the training process, using them as the input and the chunk itself as the output, so the loss is computed only on the newly generated chunk.
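Putting the pieces together, one generate-then-update cycle could look roughly like the following sketch (a reconstruction from the description above, not the released implementation). Here `delta`, `l_x`, and `l_t` stand for ∆, Lx, and LT, the model is assumed to already carry a trainable Temp-Lora adapter as in the earlier snippet, and the default values are illustrative.

```python
import torch

def generate_with_temp_lora(model, prompt_ids, total_tokens,
                            delta=1024, l_x=3072, l_t=1024, lr=5e-5):
    """Illustrative Temp-Lora loop: generate a chunk of `delta` tokens from the
    latest `l_x` tokens, then fine-tune the temporary Lora adapter on that chunk
    conditioned on the `l_t` tokens that precede it."""
    trainable = [p for p in model.parameters() if p.requires_grad]  # Lora weights only
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    generated = prompt_ids  # tensor of shape (1, seq_len) holding token ids

    while generated.shape[1] < total_tokens:
        # Generate the next chunk from the most recent l_x tokens only.
        context = generated[:, -l_x:]
        with torch.no_grad():
            out = model.generate(context, max_new_tokens=delta, do_sample=True)
        new_chunk = out[:, context.shape[1]:]
        generated = torch.cat([generated, new_chunk], dim=1)

        # Update the adapter: the l_t preceding tokens are the input,
        # the freshly generated chunk is the target.
        chunk_len = new_chunk.shape[1]
        prefix = generated[:, -(l_t + chunk_len):-chunk_len]
        sample = torch.cat([prefix, new_chunk], dim=1)
        labels = sample.clone()
        labels[:, :prefix.shape[1]] = -100  # compute the loss only on the new chunk
        loss = model(input_ids=sample, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    return generated
```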
Finally, the authors also propose a strategy called Cache Reuse to make inference more efficient. In a standard framework, after the Temp-Lora module is updated we would need to recompute the KV states with the updated parameters.
Alternatively, we can reuse the existing cached KV state while using the updated model for subsequent text generation.
Specifically, the KV states are recomputed with the latest Temp-Lora module only when the generated text reaches the maximum length (the context window size W). This cache-reuse method can significantly speed up generation without any noticeable drop in quality.
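The exact bookkeeping isn't spelled out here, but the branch logic might look roughly like this sketch (assuming the legacy tuple KV-cache format of transformers; how much context to recompute once the window is full is also an assumption).

```python
import torch

def maybe_refresh_kv_cache(model, generated_ids, past_key_values, window_w):
    """Reuse KV states computed with older adapter weights, and only rebuild them
    with the freshly updated Temp-Lora parameters once the cache reaches W."""
    # Legacy tuple cache: past_key_values[layer][0] has shape
    # (batch, num_heads, seq_len, head_dim), so dim 2 is the cached length.
    cached_len = past_key_values[0][0].shape[2] if past_key_values else 0
    if cached_len < window_w:
        return past_key_values  # keep the slightly "stale" cache

    # Cache is full: recompute KV states for the retained context with the
    # current (just-updated) Temp-Lora weights, then continue from there.
    recent = generated_ids[:, -window_w:]  # choice of retained span is an assumption
    with torch.no_grad():
        out = model(input_ids=recent, use_cache=True)
    return out.past_key_values
```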
That's all for the introduction to the Temp-Lora method. Now, let's focus on the testing.
The authors evaluated the Temp-Lora framework on Llama2-7B-4K, Llama2-13B-4K, Llama2-7B-32K, and Yi-Chat-6B models, covering two types of long-text tasks: generation and translation.
The test data consists of two parts: a subset of the long-text language-modeling benchmark PG19, from which 40 books were randomly selected, and a randomly sampled subset of the GuoFeng dataset from WMT 2023, containing 20 Chinese web novels professionally translated into English.
First, let's look at the results on PG19.
The table below compares PPL (perplexity, which reflects the model's uncertainty about a given input; lower is better) on PG19 for the various models with and without the Temp-Lora module. The documents are divided into segments ranging from 0-100K to 500K+ tokens. All models show a significant PPL reduction after applying Temp-Lora, and the effect becomes more pronounced as the text grows longer (only a 3.6% reduction for 0-100K, but a 13.2% reduction for 500K+). So we can simply conclude: the longer the text, the more necessary Temp-Lora becomes.
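As a quick reminder of what is being measured, perplexity is simply the exponential of the average per-token negative log-likelihood; a minimal sketch with a Hugging Face causal LM:

```python
import math
import torch

def perplexity(model, input_ids):
    """Perplexity = exp(mean negative log-likelihood per token); lower is better."""
    with torch.no_grad():
        # For causal LMs, passing labels=input_ids returns the mean token NLL as .loss
        loss = model(input_ids=input_ids, labels=input_ids).loss
    return math.exp(loss.item())
```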
Additionally, we found that increasing the chunk size from 1024 to 2048 and then 4096 led to a slight increase in PPL.
This isn't surprising, since the Temp-Lora module is always trained on data from previous chunks. These results mainly tell us that the choice of chunk size is a key trade-off between generation quality and computational efficiency (further analysis can be found in the paper).
Finally, we can also observe that cache reuse does not lead to any performance degradation.
This is very encouraging news.
Below are the results on the GuoFeng Chinese-to-English translation dataset. It is evident that Temp-Lora has a significant effect on this long-text literary translation task.
Compared to the base model, all metrics show substantial improvements: PPL decreased by 29.6%, the BLEU score (which measures the similarity between machine-translated text and high-quality reference translations) increased by 53.2%, and the COMET score (another quality metric) improved by 8.4%.
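For reference, corpus-level BLEU scores of this kind are commonly computed with the sacrebleu package; a generic illustration (not the paper's evaluation script):

```python
import sacrebleu

hypotheses = ["the model's English translation of a chapter"]   # system outputs
references = ["the professional reference translation of it"]   # human references
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")  # higher means closer to the references
```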
Finally, the authors also explore the trade-off between computational efficiency and generation quality. They found that the most "economical" Temp-Lora configuration (∆=2K, W=4K) reduces PPL by 3.8% while saving 70.5% of FLOPs and 51.5% of latency. Conversely, if computational cost is ignored entirely and the most "luxurious" configuration (∆=1K, W=24K) is used, PPL drops by 5.0%, at the cost of an additional 17% in FLOPs and 19.6% in latency.
Summarizing the above results, the authors also provide three practical recommendations for applying Temp-Lora:
- For applications that demand the highest long-text generation quality, integrating Temp-Lora into an existing model, without changing any of its original parameters, can significantly improve performance at a relatively moderate cost.
- For applications that prioritize minimal latency or memory usage, computational costs can be significantly reduced by shrinking the input length and the amount of context information stored in Temp-Lora. In this setup, a fixed short window size (such as 2K or 4K) can be used to process nearly infinite-length texts (up to 500K+ tokens in the authors' experiments).
- Finally, note that in scenarios without large amounts of text, for example when the context is shorter than the model's window size as in pretraining, Temp-Lora is of no use at all.
It's worth mentioning that despite inventing such a simple yet innovative method, the authors left very little information about themselves: the institution is listed simply as "Confidential Institution", and only the surnames of the three authors are given. Judging from the email addresses, however, they may be from institutions such as City University of Hong Kong or The Chinese University of Hong Kong.
Finally, what do you think of this method?