Audio & Speech Perception: Speech recognition, auditory scene analysis, and multimodal audio-visual integration
Keywords:
Speech recognition, auditory scene analysis, audio-visual integration, computational modeling, neural mechanisms, predictive coding, deep learning, multisensory perception

Abstract
Audio and speech perception is central to human communication, relying on complex neural processes that decode acoustic signals into meaningful information. This paper synthesizes advances in speech recognition, auditory scene analysis (ASA), and multimodal audio-visual integration. It compares deep learning-based automatic speech recognition (ASR) with human speech processing, examines how the auditory system segregates complex acoustic scenes during ASA, and analyzes the integration of auditory and visual cues in speech perception, with emphasis on temporal synchrony and predictive coding. The review highlights the interplay between bottom-up sensory processing and top-down cognitive influences, addressing challenges posed by noisy environments and by individual differences in perception. A unified framework is proposed to bridge these domains, along with future directions for theoretical models and for applications in assistive technologies and human-machine interaction.
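As a concrete illustration of the audio-visual cue integration summarized above, the sketch below implements the standard reliability-weighted (maximum-likelihood) fusion rule from the multisensory perception literature, in which each modality is weighted by its inverse variance. This is a minimal sketch of that general principle, not the paper's own model; the function name and the example values are illustrative assumptions.

```python
def fuse_av_estimates(audio_est, audio_var, visual_est, visual_var):
    """Reliability-weighted fusion of an auditory and a visual estimate.

    Each modality contributes in proportion to its reliability
    (inverse variance), and the fused variance is lower than either
    unimodal variance, as in the classic maximum-likelihood account
    of multisensory integration. All values here are hypothetical.
    """
    w_audio = (1.0 / audio_var) / (1.0 / audio_var + 1.0 / visual_var)
    w_visual = 1.0 - w_audio
    fused_est = w_audio * audio_est + w_visual * visual_est
    fused_var = 1.0 / (1.0 / audio_var + 1.0 / visual_var)
    return fused_est, fused_var

# Example: a noisy auditory cue (e.g., speech in background babble)
# is down-weighted relative to a more reliable visual (lip-reading) cue.
est, var = fuse_av_estimates(audio_est=0.8, audio_var=4.0,
                             visual_est=0.2, visual_var=1.0)
print(f"fused estimate = {est:.2f}, fused variance = {var:.2f}")
```

In this toy example the visual cue is four times more reliable than the auditory cue, so the fused estimate (0.32) lies much closer to the visual estimate, mirroring the behavioral shift toward vision that is observed when the auditory signal degrades.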