Multimodal Neural Network SALMONN: An AI Model Capable of Understanding the Auditory World

baoshi.rao

SALMONN is a multimodal neural network capable of directly processing and understanding general audio inputs, including speech, audio events, and music, while delivering competitive performance across multiple speech and audio tasks.

Paper: https://arxiv.org/pdf/2310.13289v1.pdf

SALMONN uses two complementary audio encoders, one for speech and one for non-speech audio events, so that a single model can perform well across diverse audio tasks.
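To make the two-encoder idea concrete, here is a minimal sketch of one way the outputs of a speech encoder and an audio-event encoder could be combined frame by frame. It is an illustration under stated assumptions, not the authors' implementation: the names `pad_frames` and `fuse_features` are hypothetical, the features are plain lists of floats, and zero-padding the shorter sequence is an assumed way to align the two streams.

```python
from typing import List

Frame = List[float]  # one feature vector per time frame


def pad_frames(frames: List[Frame], length: int, dim: int) -> List[Frame]:
    """Append all-zero frames until the sequence has `length` frames."""
    return frames + [[0.0] * dim for _ in range(length - len(frames))]


def fuse_features(speech: List[Frame], events: List[Frame]) -> List[Frame]:
    """Concatenate two feature sequences along the feature dimension.

    Assumption: both encoders emit frames at the same rate, so frame i of
    one stream lines up with frame i of the other. The shorter stream is
    zero-padded so every fused frame has the same width.
    """
    speech_dim = len(speech[0]) if speech else 0
    event_dim = len(events[0]) if events else 0
    n_frames = max(len(speech), len(events))
    speech = pad_frames(speech, n_frames, speech_dim)
    events = pad_frames(events, n_frames, event_dim)
    # Each fused frame is [speech features..., audio-event features...].
    return [s + e for s, e in zip(speech, events)]


# Hypothetical encoder outputs: 2 speech frames (dim 2), 1 event frame (dim 1).
fused = fuse_features([[1.0, 2.0], [3.0, 4.0]], [[5.0]])
print(fused)
```

In a real system the fused sequence would then be passed on to the language model side of the network; this sketch covers only the point where the two streams are combined.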

The paper introduces an activation tuning stage to address task over-fitting, in which SALMONN becomes biased toward the tasks it saw most during training. This stage allows SALMONN to show cross-modal abilities it was not explicitly trained for, such as question answering and storytelling. The authors present the work as a step toward artificial intelligence with general hearing abilities.