Apple's Large Model MM1 Enters the Arena: 30 Billion Parameters, Over Half of Authors Are Chinese

Apple has recently released MM1, a large multimodal foundation model with 30 billion parameters that uses a mixture-of-experts (MoE) architecture; more than half of the paper's authors are Chinese. The model is a significant contribution to the multimodal domain and may hint at future Apple products built on this technology.
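
The article does not describe the MoE design in detail, but the general pattern is to replace each dense feed-forward block with several expert networks and route every token to a small number of them. Below is a minimal, illustrative PyTorch sketch with hypothetical sizes and top-2 routing, not MM1's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Sparse mixture-of-experts FFN: each token is routed to its top-k experts."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to a stream of tokens
        tokens = x.reshape(-1, x.size(-1))
        gates = F.softmax(self.router(tokens), dim=-1)         # (n_tokens, n_experts)
        weights, chosen = gates.topk(self.top_k, dim=-1)       # pick k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the k gate weights
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = chosen[:, k] == e                       # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(tokens[mask])
        return out.reshape_as(x)
```

Because only `top_k` experts run per token, the parameter count grows with `n_experts` while per-token compute stays close to that of a single dense layer, which is the usual motivation for MoE scaling.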

This year, Apple has significantly increased its investment in generative artificial intelligence (GenAI), signaling its determination to make major advances in the field. Reportedly, some Apple teams that originally worked on car projects have shifted their focus to GenAI research and development.

Paper: https://arxiv.org/pdf/2403.09611.pdf

The launch of MM1 has attracted significant attention. According to the paper, the model adopts a variant of MoE and shows leading results on pretraining metrics and multiple multimodal benchmarks. Through extensive ablation studies, the researchers explored the importance of model architecture, pretraining data selection, and training procedures, finding that image resolution, the visual encoder's pretraining loss, and the pretraining data all play crucial roles in model design.

On pretraining data selection, the researchers identified several key insights: interleaved image-text data helps improve few-shot and text-only performance, while caption data significantly enhances zero-shot performance.
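
In practice, such a mixture can be realized by weighted sampling across data sources. The sketch below uses purely hypothetical weights (the paper's actual ratios come from its ablations) and already includes the text-only source discussed in the next paragraph:

```python
import random

# Hypothetical mixture weights, for illustration only; the paper tunes
# its actual ratios via ablation studies.
MIXTURE = {
    "interleaved": 0.45,  # interleaved image-text documents: few-shot / text-only gains
    "captions":    0.45,  # image-caption pairs: zero-shot gains
    "text_only":   0.10,  # pure text: preserves language ability
}

def sample_batch(loaders: dict, batch_size: int) -> list:
    """Draw each example from a data source chosen by the mixture weights.

    Usage: loaders = {"interleaved": iter(ds1), "captions": iter(ds2),
                      "text_only": iter(ds3)}
    """
    sources = list(MIXTURE.keys())
    weights = [MIXTURE[s] for s in sources]
    batch = []
    for _ in range(batch_size):
        source = random.choices(sources, weights=weights, k=1)[0]
        batch.append(next(loaders[source]))
    return batch
```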

Additionally, pure text data is equally vital for few-shot and text-only performance; by mixing image and text data in the right proportions, optimal multimodal performance can be achieved while maintaining strong language capabilities. The researchers also found that synthetic data aids few-shot learning.

Ultimately, the researchers settled on MM1's final configuration, including the choice of image encoder, vision-language connector, and pretraining data. They scaled the LLM to 3B, 7B, and 30B parameters, and further extended the model through mixture-of-experts variants. In supervised fine-tuning experiments, MM1 demonstrated competitive performance across multiple benchmarks, with the MoE models outperforming their dense counterparts in nearly all tests.
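
The paper evaluates several vision-language connector designs; the stand-in below simply pools the visual encoder's patch features to a fixed number of tokens and projects them into the LLM's embedding space. It is a generic sketch, not MM1's chosen connector:

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Maps visual encoder patch features into the LLM's token embedding space.

    Average pooling plus a linear projection; a generic stand-in for the
    connector variants compared in the paper's ablations.
    """

    def __init__(self, d_visual: int, d_llm: int, n_visual_tokens: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(n_visual_tokens)  # fix the number of image tokens
        self.proj = nn.Linear(d_visual, d_llm)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, n_patches, d_visual)
        pooled = self.pool(patch_features.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)  # (batch, n_visual_tokens, d_llm), prepended to text tokens
```

The number of visual tokens and the image resolution are the kinds of design knobs the paper's ablations single out as most important.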

    The release of MM1 marks a significant advancement for Apple in the multimodal domain and lays the technical foundation for potential future products. The research outcomes hold substantial importance for advancing the field of generative AI, warranting close attention from the industry.
