Audio & Speech Perception: Speech recognition, auditory scene analysis, and multimodal audio-visual integration
Keywords:
Speech recognition, auditory scene analysis, audio-visual integration, computational modeling, neural mechanisms, predictive coding, deep learning, multisensory perception

Abstract
Audio and speech perception is central to human communication, relying on complex neural processes that decode acoustic signals into meaningful information. This paper synthesizes advances in speech recognition, auditory scene analysis (ASA), and multimodal audio-visual integration. It compares deep learning-based automatic speech recognition (ASR) with human speech processing, examines how the auditory system segregates complex acoustic scenes during ASA, and analyzes the integration of auditory and visual cues in speech perception, with emphasis on temporal synchrony and predictive coding. The review highlights the interplay between bottom-up sensory processing and top-down cognitive influences, addressing challenges posed by noisy environments and by individual differences in perception. A unified framework is proposed to bridge these domains, along with future directions for theoretical models and for applications in assistive technologies and human-machine interaction.
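As a concrete illustration of the audio-visual cue integration summarized above, the sketch below implements the standard reliability-weighted (maximum-likelihood) fusion rule from the multisensory perception literature, in which each modality is weighted by its inverse variance. This is a minimal sketch of that general principle, not the paper's own model; the function name and the example values are illustrative assumptions.

```python
def fuse_av_estimates(audio_est, audio_var, visual_est, visual_var):
    """Reliability-weighted fusion of an auditory and a visual estimate.

    Each modality contributes in proportion to its reliability
    (inverse variance), and the fused variance is lower than either
    unimodal variance, as in the classic maximum-likelihood account
    of multisensory integration. All values here are hypothetical.
    """
    w_audio = (1.0 / audio_var) / (1.0 / audio_var + 1.0 / visual_var)
    w_visual = 1.0 - w_audio
    fused_est = w_audio * audio_est + w_visual * visual_est
    fused_var = 1.0 / (1.0 / audio_var + 1.0 / visual_var)
    return fused_est, fused_var

# Example: a noisy auditory cue (e.g., speech in background babble)
# is down-weighted relative to a more reliable visual (lip-reading) cue.
est, var = fuse_av_estimates(audio_est=0.8, audio_var=4.0,
                             visual_est=0.2, visual_var=1.0)
print(f"fused estimate = {est:.2f}, fused variance = {var:.2f}")
```

In this toy example the visual cue is four times more reliable than the auditory cue, so the fused estimate (0.32) lies much closer to the visual estimate, mirroring the behavioral shift toward vision that is observed when the auditory signal degrades.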