Alibaba Tongyi's Fun-ASR Speech Model Upgraded with Over 15% Accuracy Leap in Vertical Domains

baoshi.rao

Alibaba Tongyi has officially launched Fun-ASR, its new-generation end-to-end large speech recognition model. With stronger context awareness and high-precision transcription, the model improves recognition accuracy by more than 15% in vertical industry scenarios such as home decoration and insurance. Test data show an 18% accuracy gain over the previous generation in the insurance sector, and gains of 15% to 20% in home decoration, livestock, and other fields.

Fun-ASR is a speech recognition model driven by a large language model. It uses self-developed speech algorithms together with Qwen3 supervised fine-tuning, combining a cutting-edge model architecture with text-modality alignment techniques. While retaining its strengths in language processing, it integrates a retrieval-augmented generation (RAG) scheme that supports importing more than 1,000 custom hotwords. The system automatically matches domain-specific hotwords, historical documents, and contextual records to the audio, which significantly improves keyword recognition in specific scenarios.
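The hotword matching described above can be illustrated with a small sketch. It is not Fun-ASR's actual pipeline, which the article does not detail. It assumes a rough first-pass transcript is available as the retrieval context, scores each hotword by character trigram overlap, and passes only the best matches to the model as a text hint. The function names (`retrieve_hotwords`, `build_prompt`), the threshold, and the prompt wording are all hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lowercased string."""
    s = text.lower()
    if len(s) < n:
        return Counter([s]) if s else Counter()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def score(hotword: str, context: str, n: int = 3) -> float:
    """Fraction of the hotword's n-grams that also occur in the context."""
    hw = char_ngrams(hotword, n)
    ctx = char_ngrams(context, n)
    total = sum(hw.values())
    if total == 0:
        return 0.0
    shared = sum(min(count, ctx[gram]) for gram, count in hw.items())
    return shared / total


def retrieve_hotwords(hotwords, context, top_k=3, min_score=0.5):
    """Return up to top_k hotwords whose overlap score reaches min_score."""
    scored = [(score(h, context), h) for h in hotwords]
    kept = [(s, h) for s, h in scored if s >= min_score]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [h for _, h in kept[:top_k]]


def build_prompt(selected):
    """Build a hypothetical text hint listing the retrieved hotwords."""
    if not selected:
        return "Transcribe the audio."
    return "Transcribe the audio. Likely terms: " + ", ".join(selected) + "."


# Hypothetical custom hotword list mixing several vertical domains.
hotwords = ["term life insurance", "deductible", "drywall", "grout sealer", "heifer"]
# Hypothetical first-pass transcript containing recognition errors.
context = "customer asked about the deductable on a term life insurence policy"

selected = retrieve_hotwords(hotwords, context)
print(build_prompt(selected))
```

Retrieving a handful of candidates instead of passing the full list keeps the hint short even when more than 1,000 hotwords are imported. A production system would more plausibly match on phonetic or embedding similarity than on raw character overlap.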

To address speech recognition pain points such as noise interference, language confusion, and hallucinated output, the R&D team introduced reinforcement learning (RL), which dynamically optimizes the recognition strategy to reduce errors and substantially improves system stability and reliability. Notably, the model outperforms comparable products on dialects such as Sichuanese, Cantonese, and Hokkien. It also adapts to complex acoustic conditions such as far-field pickup and near-field noise reduction, covering settings that include meeting rooms, workstations, supermarkets, and outdoor environments.

Fun-ASR is trained on hundreds of millions of hours of audio data and deeply integrates professional terminology libraries from more than ten fields, including the internet, technology, livestock, and automotive. This data gives it a significant edge in vertical industries: in livestock farming, for example, it can accurately pick out key commands amid animal sounds and environmental noise.

The Alibaba Tongyi technical team said that the evolution of Fun-ASR marks a shift in speech recognition from general-purpose use toward specialized, scenario-specific applications. As the model is deployed in more industries, its dynamic hotword updates and multimodal interaction capabilities are expected to further improve the efficiency of speech interaction.