In a recent noise reduction study, it was found that optimized iterative spectral subtraction (SS) achieves speech enhancement with almost no musical-noise generation, but this method is valid only for stationary noise. In this chapter, we introduce a musical-noise-free blind speech extraction method using a microphone array for application to nonstationary noise.

Weighted prediction error (WPE) is one of the well-known dereverberation signal processing methods, used especially to alleviate the degradation of automatic speech recognition (ASR) performance in distant-speaker scenarios. WPE usually assumes that the desired source signals always follow a predefined source prior, such as a Gaussian with time-varying variances (TVG). Although WPE works well in practice under this assumption, proper priors generally depend on the sources and cannot be known in advance of processing. On-demand estimation of source priors, e.g. for each utterance, is thus required. For this purpose, we extend WPE by introducing a complex-valued generalized Gaussian (CGG) prior, together with an estimator of its shape parameter inside the processing, to deal with a variety of super-Gaussian sources. Blind estimation of the shape parameter of the prior is realized by adding a shape parameter estimator as a sub-network to WPE-CGG, which is treated as a differentiable neural network. The sub-network can be trained by backpropagation from the outputs of the whole network using any criterion, such as signal-level mean square error or even ASR errors, if the WPE-CGG computational graph is connected to that of the ASR network. Experimental results show that the proposed method outperforms conventional baseline methods with the TVG prior, even though proper shape parameter values are not given in the evaluation.

This paper addresses the problem of speech enhancement employing the Minimum Mean-Square Error (MMSE) of the β-order Short Time Spectral Amplitude (STSA). The motivation has been to take advantage of both Laplacian speech modeling and the β-order cost function in MMSE estimation of clean speech. We present an analytical solution for the β-order MMSE STSA estimator assuming a Laplacian prior for the real and imaginary parts of the Discrete Fourier Transform (DFT) coefficients of (clean) speech; we also assume a Gaussian distribution for the real and imaginary parts of the DFT coefficients of the noise. The analytical solution, named β-order LapMMSE, does not have a closed form and is highly non-linear and computationally complex. Using some approximations for the joint probability density function and the Bessel function, we also present an improved closed-form version of the estimator (called β-order ImpLapMMSE), in which the value of β is adapted as a function of the frame Signal-to-Noise Ratio (SNR). We have compared the performance of the proposed estimator with state-of-the-art estimators that assume either Gaussian or Laplacian probability density functions for the real and imaginary parts of the DFT coefficients of clean speech; to this end, the input noisy signal and the outputs of the MMSE STSA, β-order STSA, and ImpLapMMSE estimators have been compared with the output of the proposed estimator. Our comparative evaluations in terms of Segmental SNR (SegSNR), Perceptual Evaluation of Speech Quality (PESQ), and Log-Likelihood Ratio (LLR) distance demonstrate the superior performance of the proposed β-order ImpLapMMSE estimator.
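For readers unfamiliar with the spectral subtraction (SS) baseline mentioned in the first abstract, a minimal single-pass magnitude-domain sketch in NumPy is shown below. All parameter values (frame length, spectral floor, number of noise-only lead-in frames) are illustrative assumptions; the optimized iterative SS and the microphone-array extension discussed in the chapter are not reproduced here.

```python
# Minimal magnitude spectral subtraction sketch (illustrative only).
import numpy as np

def spectral_subtract(noisy, frame_len=256, noise_frames=10, beta=0.01):
    """One pass of magnitude spectral subtraction.

    noisy: 1-D array of samples; the first `noise_frames` frames are
    assumed to be noise-only (a common simplifying assumption).
    beta: spectral floor; residual peaks poking above this floor are
    what is perceived as "musical noise".
    """
    n_frames = len(noisy) // frame_len
    frames = noisy[: n_frames * frame_len].reshape(n_frames, frame_len)
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Noise magnitude estimate from the noise-only lead-in frames.
    noise_mag = mag[:noise_frames].mean(axis=0)

    # Subtract the noise estimate and apply the spectral floor.
    clean_mag = np.maximum(mag - noise_mag, beta * mag)

    # Resynthesize with the noisy phase (standard in SS methods).
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    return clean.reshape(-1)

# Toy usage: a 440 Hz tone buried in white noise, with a noise-only lead-in.
rng = np.random.default_rng(0)
t = np.arange(4096) / 8000.0
noise = 0.5 * rng.standard_normal(4096)
noisy = np.sin(2 * np.pi * 440 * t) + noise
noisy[: 10 * 256] = noise[: 10 * 256]   # noise-only frames for estimation
enhanced = spectral_subtract(noisy)
print(enhanced.shape)  # (4096,)
```

Because the subtraction leaves randomly placed residual spectral peaks, this simple version generates the musical noise that the optimized iterative SS, and the blind extraction method above, are designed to avoid.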