Open-source ChatGPT-like Platform Mistral AI Secures Massive Funding Again
On December 6th, Bloomberg reported that the open-source ChatGPT-like platform Mistral AI raised €450 million (roughly RMB 3.5 billion) in new funding, at a valuation approaching $2 billion (roughly RMB 14.2 billion). The investment was led by NVIDIA and Salesforce.
Mistral AI's open-source large language model, Mistral 7B, is known for its small parameter count, low energy consumption, strong performance, and commercially usable license. It supports text and code generation, fine-tuning on custom data, content summarization, and more, and currently has around 4.5k stars on GitHub.
Notably, Mistral AI previously secured $113 million in seed funding without releasing any product, making it one of the largest seed rounds in European tech history.
Open-source address: https://github.com/mistralai/mistral-src
Documentation: https://docs.mistral.ai/
API interface: https://docs.mistral.ai/api
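For readers who want to try the model locally, below is a minimal inference sketch. It assumes the Hugging Face transformers library (plus accelerate) and the publicly released checkpoint ID mistralai/Mistral-7B-v0.1; these are assumptions of this article, not Mistral AI's reference code, which lives in the repository linked above, while the hosted API is documented separately at the links above.

```python
# Minimal local-inference sketch (assumes the Hugging Face `transformers` and `accelerate`
# packages and the public checkpoint "mistralai/Mistral-7B-v0.1"; not Mistral AI's own code).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Summarize the key ideas behind sliding window attention in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Running the 7B checkpoint in half precision needs roughly 15 GB of GPU memory; quantized builds reduce this considerably.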
Unlike the metaverse, ChatGPT, which has just celebrated its first birthday, has withstood multiple tests, including commercial deployment and user adoption, and has driven numerous tech companies into the generative AI transformation.
Currently, the landscape is divided into two main camps: closed-source and open-source. After Meta's Llama fired the first shot, the open-source large language model field has seen the emergence of outstanding companies like Writer, Baichuan Intelligence, Together.ai, and Mistral AI, which have also gained recognition in the capital markets. These companies firmly believe that open-source is one of the shortcuts for large models to achieve AGI.
As early as June this year, 'AIGC Open Community' introduced Mistral AI, which left a strong impression at the time. The company had not released a single product, and its official website carried only three sentences: "We are assembling a world-class technical team to develop the best generative AI models. We operate in Europe, with our headquarters in Paris, France. If you have extensive research or development experience in AI, please contact us."
With just three sentences, they secured $113 million in seed funding at a $260 million valuation. Typically, such companies either ride the hype wave to get funding and then make minor model adjustments while waiting to fade away, or they are technical masters who make a grand entrance and shake the industry. Judging by this funding round, Mistral AI clearly belongs to the latter category and has real substance.
Public records show that Mistral AI's three co-founders, Timothée Lacroix, Guillaume Lample, and Arthur Mensch, are no small players. They have impressive track records at major tech companies and on well-known projects, and they also know each other from their university days.
Timothée and Guillaume previously worked in Meta's AI research department, where they led the development of LLaMA, the pioneer of open-source ChatGPT-like models. Arthur worked at Google's AI research lab, DeepMind.
In terms of products, Mistral AI's Mistral 7B, launched on September 27, is currently one of the strongest open-source models in its size class: it outperforms Llama 2 13B on all benchmarks, matches or exceeds Llama 1 34B on many benchmarks, and performs comparably to CodeLlama 7B on coding tests. To make inference faster and more energy efficient, Mistral AI relies on two key mechanisms: grouped query attention and sliding window attention.
Grouped query attention is a refinement of the standard multi-head attention mechanism. In Transformer models, attention involves three sets of vectors: queries, keys, and values, and in standard multi-head attention every query head carries its own key and value heads.
For long sequences this makes the key and value projections, and especially the inference-time key/value cache, expensive in memory and bandwidth. Grouped query attention divides the query heads into groups, and all query heads within a group share a single set of key and value heads. Reducing the number of key/value heads shrinks the cache and the memory traffic per decoding step, which makes inference markedly more efficient with little loss in quality.
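As a concrete illustration, here is a small PyTorch sketch of the grouping idea (illustrative only, with made-up head counts; it is not Mistral AI's implementation): a few key/value heads are repeated so that every group of query heads attends through a shared key/value head.

```python
# Grouped query attention sketch (illustrative; head counts are example values, not Mistral's).
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """
    q: (batch, n_query_heads, seq_len, head_dim)
    k, v: (batch, n_kv_heads, seq_len, head_dim), with n_query_heads a multiple of n_kv_heads.
    Each group of n_query_heads // n_kv_heads query heads shares one key/value head.
    """
    n_query_heads, n_kv_heads = q.size(1), k.size(1)
    group_size = n_query_heads // n_kv_heads
    # Repeat each key/value head so it lines up with the query heads in its group.
    k = k.repeat_interleave(group_size, dim=1)  # -> (batch, n_query_heads, seq_len, head_dim)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Example: 32 query heads sharing 8 key/value heads.
batch, seq_len, head_dim = 1, 16, 64
q = torch.randn(batch, 32, seq_len, head_dim)
k = torch.randn(batch, 8, seq_len, head_dim)
v = torch.randn(batch, 8, seq_len, head_dim)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 32, 16, 64])
```

The saving comes from storing and reading only the smaller number of key/value heads in the cache instead of one per query head, which matters most during autoregressive decoding.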
Sliding window attention is a technique used in sequence processing tasks to limit the scope of attention mechanisms and reduce computational load. In this method, the attention for each element is not calculated over the entire sequence but is instead restricted to elements within a nearby window.
In this way, each position only needs to process the information inside its window, which reduces the number of elements involved in every attention calculation.
This not only decreases computational requirements but also limits the model's contextual scope, helping it focus on local information.
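The masking itself is simple to express. The following PyTorch sketch (illustrative, with a toy window size rather than Mistral's actual setting) builds a causal mask that lets each position see only itself and a fixed number of preceding positions.

```python
# Causal sliding-window attention sketch (illustrative; window_size=4 is a toy value).
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window_size=4):
    """
    q, k, v: (batch, n_heads, seq_len, head_dim)
    Each position attends only to itself and the window_size - 1 preceding positions.
    """
    seq_len = q.size(-2)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    pos = torch.arange(seq_len)
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)     # dist[i, j] = i - j
    # Forbid future positions (dist < 0) and positions outside the local window.
    mask = (dist < 0) | (dist >= window_size)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v

batch, n_heads, seq_len, head_dim = 1, 4, 16, 64
q, k, v = (torch.randn(batch, n_heads, seq_len, head_dim) for _ in range(3))
print(sliding_window_attention(q, k, v).shape)  # torch.Size([1, 4, 16, 64])
```

Because each layer's outputs already summarize their own local window, stacking layers lets information propagate well beyond a single window, which is how a limited window can still support fairly long contexts.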