
Towards multiple vocal effort speaker verification: exploring speaker-dependent invariant information between normal and whispered speech.

Sarria Paja, Milton Orlando (2016). Towards multiple vocal effort speaker verification: exploring speaker-dependent invariant information between normal and whispered speech. Thesis. Québec, Université du Québec, Institut national de la recherche scientifique, Doctorate in Telecommunications, 171 p.



Speech-based biometrics is one of the most effective approaches to identity management and one of the methods preferred by users and companies, given its flexibility, speed, and reduced cost. In speaker recognition, the two most popular tasks are speaker identification (SI) and speaker verification (SV). SV generally has broader practical applications than SI, especially in access control and identity management. Current state-of-the-art SV systems are known to be strongly dependent on the condition of the speech material provided as input and can be affected by unexpected variability present during testing, such as environmental noise or changes in vocal effort. In this thesis, SV using whispered speech is explored: whispered speech is a natural speaking style that, despite its reduced perceptibility, still conveys the intended message (i.e., remains intelligible) as well as cues to the speaker's identity and gender. However, given the acoustic differences between whispered and normally-phonated speech, speech applications trained on the latter but tested on the former exhibit unacceptable performance levels. Within an automated speaker verification task, previous research has shown that i) conventional features (e.g., mel-frequency cepstral coefficients, MFCCs) do not convey sufficient speaker-discrimination cues across the two vocal efforts, and ii) multi-style training, while improving performance for whispered speech, tends to degrade performance for normal speech. Moreover, an exploration of the performance boundaries achievable with whispered speech for speaker verification showed that the lack of sufficient data to train whispered-speech speaker models is a major limiting factor.
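The baseline MFCC features referred to above follow a standard pipeline (framing, windowing, power spectrum, mel filterbank, log, DCT). The sketch below is a minimal NumPy/SciPy illustration of that generic pipeline, not the exact extraction settings used in the thesis; the frame length, hop size, filter count, and coefficient count are assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
         n_mels=26, n_ceps=13):
    """Minimal MFCC extraction: frame -> window -> |FFT|^2 -> mel filterbank -> log -> DCT."""
    # Slice the signal into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular mel filterbank spanning 0 Hz to sr/2
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # Log mel energies, then DCT to decorrelate -> cepstral coefficients
    log_mel = np.log(power @ fbank.T + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]

# Usage: one second of a 440 Hz tone yields one 13-dim vector per 10 ms hop
t = np.arange(16000) / 16000.0
feats = mfcc(np.sin(2 * np.pi * 440.0 * t))   # shape (98, 13)
```

In practice, such features are usually augmented with delta and delta-delta coefficients before being fed to the speaker model; those are omitted here for brevity.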
As such, this thesis aims to address these three shortcomings and proposes i) innovative features that are less sensitive to the change from normal to whispered speech, thus helping to reduce the mismatch gap, ii) fusion strategies that improve accuracy for both normal and whispered speech, and iii) machine learning principles that overcome the limited-data problem. To properly address the task at hand, we first perform a comparative analysis of some of the most common feature-based approaches and training techniques reported in related fields that have shown benefits for the mismatch problem induced by vocal-effort variations, including whispered speech. This allows us to narrow down candidate solutions and strategies that, after careful fine-tuning, could be useful for whispered-speech speaker verification. Next, by combining these insights with those previously reported in the literature, together with statistical analysis tools, innovative features are proposed that provide not only invariant information across the two vocal efforts but also complementary information. Subsequently, to take advantage of the complementary information extracted from the different feature representations, we explore fusion schemes at different levels. When a combination of normal and whispered speech data is used during parameter estimation, improved multi-vocal-effort speaker verification is achieved, with relative improvements as high as 68% and 70% for normal and whispered speech, respectively, over a baseline system based on MFCC features. When whispered speech is also included during enrollment, further improvements are achieved for whispered speech without severely affecting performance for normal speech. These results show that the proposed strategies make efficient use of the limited resources available and achieve performance levels for whispered speech in line with those obtained for normal speech.
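One of the fusion schemes mentioned above, score-level fusion, can be sketched as a weighted sum of normalized subsystem scores. The weights, the z-normalization, and the zero decision threshold below are illustrative assumptions, not the calibration actually used in the thesis.

```python
import numpy as np

def znorm(scores):
    """Zero-mean, unit-variance normalization so scores from different subsystems are comparable."""
    return (scores - scores.mean()) / (scores.std() + 1e-12)

def fuse_scores(score_lists, weights):
    """Weighted-sum score-level fusion of several subsystems' trial scores."""
    normed = [w * znorm(np.asarray(s, dtype=float)) for s, w in zip(score_lists, weights)]
    return np.sum(normed, axis=0)

# Toy example: two subsystems scoring the same four verification trials
mfcc_scores = [2.1, -0.5, 1.8, -1.2]   # e.g. a baseline MFCC system
aux_scores  = [0.9, -0.2, 1.1, -0.7]   # e.g. a system using a complementary feature set
fused = fuse_scores([mfcc_scores, aux_scores], weights=[0.6, 0.4])
decisions = fused > 0.0                # accept/reject against an illustrative threshold of 0
```

The idea is that subsystems built on complementary features make partially independent errors, so their combined score separates target and impostor trials better than either system alone.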

Document type: Thesis
Thesis supervisor: Falk, Tiago H.
Keywords: whispered speech; speaker verification; modulation spectrum; mutual information; system fusion; mismatch problem; neural networks
Centre: Centre Énergie Matériaux Télécommunications
Deposited: 16 Feb 2018 21:21
Last modified: 16 Feb 2018 21:21
URI: http://espace.inrs.ca/id/eprint/6539
