Microsoft Open Sources SliceGPT: Compresses Large Models by ~25% While Maintaining Performance
Microsoft and ETH Zurich researchers have jointly open-sourced SliceGPT, a technique that compresses the weight matrices of large models, reducing model size by approximately 25% while maintaining performance. Experiments show that SliceGPT has been successfully applied to several large models, including LLAMA-2 70B, OPT 66B, and Phi-2, while preserving zero-shot task performance.
The core idea behind SliceGPT is computational invariance: applying an orthogonal transformation to each weight matrix leaves the network's output unchanged, and in the transformed basis entire rows and columns can be sliced away with little loss, yielding a smaller model. In addition, the sliced models run directly on consumer-grade GPUs, such as NVIDIA's RTX 4090 and 4080, without requiring additional code optimization, making deployment more convenient.
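The invariance property can be illustrated with a toy example. The sketch below is hypothetical and not Microsoft's actual code: it shows that inserting an orthogonal matrix `Q` and its transpose between two linear layers leaves the output unchanged, because `Q @ Q.T` is the identity.

```python
import numpy as np

# Illustrative sketch of computational invariance (not the SliceGPT code).
rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((1, d))   # input activation
W1 = rng.standard_normal((d, d))  # first weight matrix
W2 = rng.standard_normal((d, d))  # second weight matrix

# Original two-layer linear computation
y = x @ W1 @ W2

# Random orthogonal matrix from a QR decomposition
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Absorb Q into the first matrix and Q.T into the second
W1_rot = W1 @ Q
W2_rot = Q.T @ W2

# The transformed network computes exactly the same function
y_rot = x @ W1_rot @ W2_rot
assert np.allclose(y, y_rot)
```

This freedom to rotate the weights is what makes slicing possible: one can pick a `Q` that concentrates the important signal in a few directions and then discard the rest.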
In experiments, the researchers found the slicing procedure to be simple and efficient: compression completes within a few hours on a single GPU, with no complex fine-tuning required. The sliced models maintain high-quality generative performance while improving throughput. By substantially reducing deployment resources while preserving model performance, the open-source release of SliceGPT offers developers and enterprises a practical new option for compressing and serving large models.
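The slicing step itself can be sketched as follows. This is a simplified, hypothetical illustration of the general idea, not the released implementation: rotate into the principal-component basis of a layer's activations, then drop the lowest-variance directions, shrinking the weight matrix by roughly 25% while approximately preserving the layer's output.

```python
import numpy as np

# Hypothetical sketch of PCA-based slicing (not the actual SliceGPT code).
rng = np.random.default_rng(1)
d, n = 16, 256
# Synthetic activations with decaying variance across directions
X = rng.standard_normal((n, d)) @ np.diag(np.linspace(2.0, 0.01, d))
W = rng.standard_normal((d, d))  # layer weight matrix

# Principal-component basis of the activations
cov = X.T @ X / n
eigvals, Q = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # sort directions by variance, descending
Q = Q[:, order]

# Keep 75% of the dimensions (~25% compression), slicing off the rest
k = int(0.75 * d)
X_small = X @ Q[:, :k]    # activations projected onto the top-k directions
W_small = Q[:, :k].T @ W  # sliced weight matrix: shape (k, d) instead of (d, d)

# The sliced computation approximates the original layer output
err = np.linalg.norm(X @ W - X_small @ W_small) / np.linalg.norm(X @ W)
print(f"relative reconstruction error: {err:.3f}")
```

Because the discarded directions carry little activation variance, the reconstruction error stays small; the actual method applies such transformations consistently across the transformer's blocks.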
Open-source address: https://github.com/microsoft/TransformerCompression
Paper address: https://arxiv.org/abs/2401.15024