Applications of deep learning to speech enhancement.

Santos, João Felipe (2019). Applications of deep learning to speech enhancement. Thesis. Québec, Université du Québec, Institut national de la recherche scientifique, Doctorate in Telecommunications, 156 p.

Abstract

Deep neural networks (DNNs) have been successfully employed in a broad range of applications, achieving state-of-the-art results in tasks such as acoustic modeling for automatic speech recognition and image classification. However, their application to speech enhancement problems such as denoising, dereverberation, and source separation is more recent and therefore at a more preliminary stage. In this work, we explore DNN-based speech enhancement from three different and complementary points of view. First, we propose a model that performs speech dereverberation by estimating the spectral magnitude of the clean speech from its reverberant counterpart. The model extracts features that take into account both short- and long-term dependencies in the signal through a convolutional encoder and a recurrent neural network, the latter extracting long-term information. It outperforms a recently proposed model that uses different context information depending on the reverberation time, without requiring any additional input, yielding improvements of up to 0.4 on PESQ, 0.3 on STOI, and 1.0 on POLQA relative to reverberant speech. We also show that our model generalizes to real room impulse responses even when trained only with simulated room impulse responses, different speakers, and high reverberation times. Listening tests further show that the proposed method outperforms benchmark models in reducing perceived reverberation. We also study the role of residual and highway connections in deep neural networks for speech enhancement, and examine whether they function in the same way as their digital signal processing counterparts. We visualize the outputs of such connections, projected back to the spectral domain, in models trained for speech denoising, and show that while skip connections do not necessarily improve performance with regard to the number of parameters, they make speech enhancement models more interpretable.
We also discover, through visualization of the hidden units of the context-aware model we proposed, that many of the neurons are in a state that we call "stuck" (i.e., their output is a constant value different from zero). We propose a method to prune those neurons from the model without affecting performance, and compare it to other methods in the literature. The proposed method can be applied post hoc to any pretrained model that uses sigmoid or hyperbolic tangent activations. It also leads to dense models, and therefore does not require special software or hardware to take advantage of the compressed model. Finally, in order to investigate how useful current objective speech quality metrics are for DNN-based speech enhancement, we performed online listening tests using the outputs of three different DNN-based speech enhancement models for both denoising and dereverberation. When assessing the predictive power of several objective metrics, we found that existing non-intrusive methods fail at monitoring signal quality. To overcome this limitation, we propose a new metric based on a combination of a handful of relevant acoustic features, which attains results in line with those obtained with intrusive measures. In a leave-one-model-out test, the proposed non-intrusive metric also outperforms two non-intrusive benchmarks for all three DNN enhancement methods, showing that it is capable of generalizing to unseen models. We then take the first steps toward incorporating such knowledge into the training of DNN-based speech enhancement models by designing a quality-aware cost function. While our approach increases PESQ scores for a model with a vocoder output, participants in a small-scale preference test considered its output less natural than that of the same model trained with a conventional MSE loss function, which signals that more work is needed in the development of quality-aware models.
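
The "stuck" neuron pruning mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the procedure, so the code below rests on an assumption: a hidden unit whose output is (nearly) constant over the data contributes only a fixed offset to the next layer, so it can be removed by folding that offset into the next layer's bias. All function names, the tolerance `tol`, and the toy weights are hypothetical and chosen for illustration only; this is not the thesis implementation.

```python
import math

def hidden_tanh(W1, b1, x):
    """Hidden layer: tanh(W1 x + b1), with W1 given as a list of rows."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W1, b1)]

def output_linear(W2, b2, h):
    """Linear output layer: W2 h + b2."""
    return [sum(w * hi for w, hi in zip(row, h)) + b
            for row, b in zip(W2, b2)]

def find_stuck_units(activations, tol=1e-3):
    """Return {unit index: mean output} for units whose output range
    over all recorded activations is below tol (assumed criterion)."""
    stuck = {}
    for j in range(len(activations[0])):
        column = [a[j] for a in activations]
        if max(column) - min(column) < tol:
            stuck[j] = sum(column) / len(column)
    return stuck

def prune_stuck_units(W1, b1, W2, b2, stuck):
    """Drop stuck hidden units and fold their constant contribution
    into the next layer's bias. The result is a smaller dense model."""
    keep = [j for j in range(len(W1)) if j not in stuck]
    new_b2 = [b + sum(row[j] * c for j, c in stuck.items())
              for row, b in zip(W2, b2)]
    new_W1 = [W1[j] for j in keep]
    new_b1 = [b1[j] for j in keep]
    new_W2 = [[row[j] for j in keep] for row in W2]
    return new_W1, new_b1, new_W2, new_b2

# Toy network: hidden unit 1 is saturated by its large bias (hypothetical values).
W1 = [[0.5, -0.2], [0.01, -0.01], [-0.3, 0.8]]
b1 = [0.1, 8.0, -0.2]
W2 = [[1.0, 2.0, -1.5]]
b2 = [0.05]
inputs = [[-1.0, -1.0], [-1.0, 1.0], [0.0, 0.0], [1.0, -1.0], [1.0, 1.0]]

activations = [hidden_tanh(W1, b1, x) for x in inputs]
stuck = find_stuck_units(activations)
W1p, b1p, W2p, b2p = prune_stuck_units(W1, b1, W2, b2, stuck)

# Largest output difference between the original and the pruned network.
max_diff = max(
    abs(output_linear(W2, b2, hidden_tanh(W1, b1, x))[0]
        - output_linear(W2p, b2p, hidden_tanh(W1p, b1p, x))[0])
    for x in inputs
)
print(sorted(stuck), len(W1p), max_diff)
```

Because the pruned layers are simply smaller dense matrices, no sparse kernels are needed to benefit from the reduction, which is consistent with the abstract's remark about dense models. The saturating nature of sigmoid and tanh units is what makes a non-zero constant output possible in this toy case.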

Document type: Thesis
Thesis supervisor: Falk, Tiago H.
Keywords: speech enhancement; dereverberation; denoising; deep neural networks; recurrent neural networks; interpretability; pruning; speech quality
Centre: Centre Énergie Matériaux Télécommunications
Date deposited: 08 Nov. 2019 17:41
Last modified: 08 Nov. 2019 17:41
URI: https://espace.inrs.ca/id/eprint/9018
