Responsible: Yu Zhang
Duration of the project: Aug. 2019 - Jul. 2021
Humans are by nature voice experts. We produce voices in idiosyncratic ways and are fully capable of perceiving others’ unique vocal productions. However, the relationship between acoustics and perception is not yet fully understood. Past research on voice recognition focused on time-invariant glottal and vocal-tract attributes of speech (e.g. fundamental and formant frequencies) as perceptually useful acoustic cues to individual voices. Recent studies showed that the dynamic nature of speech articulation is also critical in encoding speaker identity. I therefore propose to investigate how speaker-specific information is perceived in the temporal organization of the speech signal. Our previous studies showed that temporal variations in the parts of the speech signal corresponding to mouth-closing gestures (hereafter: signal negative dynamics) carry more speaker-specific information, and that this may be a universal, language-independent phenomenon. However, the presence of idiosyncratic temporal features in acoustic analyses does not necessarily mean that listeners use them as identity cues in voice recognition. The proposed project aims to bridge this gap: I will use behavioral voice recognition methods to test whether, and to what extent, voice recognition is facilitated by signal negative dynamics. I also plan to test to what degree the parts of the speech signal corresponding to negative dynamics can assist automatic speaker recognition (ASR) systems. Since signal negative dynamics contain more speaker-specific information, we expect them to provide important complementary information that helps ASR systems achieve higher recognition accuracy and robustness. The findings will enrich our understanding of the perceptual mechanisms underlying voice identity processing and shed light on the evolution of human communication.
The project also goes beyond the traditional ASR approach and opens up new avenues for developing ASR systems with higher performance and lower computational cost.
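For illustration, regions of signal negative dynamics can be approximated as stretches where the smoothed amplitude envelope of the speech signal is falling. The following minimal sketch (not the project's actual analysis pipeline; the function name and the 50 ms smoothing window are illustrative assumptions) marks such regions in a synthetic rise–fall "syllable":

```python
import numpy as np
from scipy.signal import hilbert


def negative_dynamics_mask(signal, sr, smooth_ms=50):
    """Boolean mask marking samples where the smoothed amplitude
    envelope is falling (a rough proxy for mouth-closing gestures,
    i.e. signal negative dynamics).  Window size is an assumption."""
    envelope = np.abs(hilbert(signal))          # amplitude envelope
    win = max(1, int(sr * smooth_ms / 1000))    # smoothing window in samples
    kernel = np.ones(win) / win
    smoothed = np.convolve(envelope, kernel, mode="same")
    return np.gradient(smoothed) < 0            # True where envelope falls


# Synthetic "syllable": envelope rises for 0.25 s, then falls.
sr = 16000
t = np.linspace(0, 0.5, int(sr * 0.5), endpoint=False)
amp = np.sin(np.pi * t / 0.5)                   # rise-fall envelope
sig = amp * np.sin(2 * np.pi * 150 * t)         # 150 Hz carrier
mask = negative_dynamics_mask(sig, sr)
```

In the behavioral and ASR experiments, a mask of this kind could be used to retain or remove only the negative-dynamics portions of stimuli before presenting them to listeners or feeding them to a recognition system.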
Keywords: speaker individuality, signal dynamics, human performance, machine performance
Funding source(s): Forschungskredit UZH (Candoc): FK-19-069
Partners: Volker Dellwo