Kalman Filtering with Machine Learning Methods for Speech Enhancement
Author: Roy, Sujan K.
Published: 2021-05-04
Thesis Type: Thesis (PhD Doctorate)
School: School of Engineering and Built Environment
DOI: https://doi.org/10.25904/1912/4179
Copyright Statement: The author owns the copyright in this thesis, unless stated otherwise.
Downloaded from: http://hdl.handle.net/10072/404456
Griffith Research Online: https://research-repository.griffith.edu.au

GRIFFITH UNIVERSITY
School of Engineering and Built Environment

Kalman Filtering with Machine Learning Methods for Speech Enhancement

Sujan Kumar Roy
B.Sc (Honours), M.Sc (Research), M.A.Sc (Research)
GUID: s5080774
April 2021

Dissertation submitted in fulfilment of the requirements for the degree of Doctor of Philosophy

Principal Supervisor: Professor Kuldip K. Paliwal
Associate Supervisor: Doctor Stephen So

To my parents and wife

Abstract

Speech corrupted by background noise (or noisy speech) can reduce the efficiency of human-to-human and human-to-machine communication. A speech enhancement algorithm (SEA) can be used to suppress the embedded background noise and increase the quality and intelligibility of noisy speech. Many applications, such as speech communication systems, hearing aid devices, and speech recognition systems, typically rely upon speech enhancement algorithms for robustness. This dissertation focuses on single-channel speech enhancement using Kalman filtering with machine learning methods.

In Kalman filter (KF)-based speech enhancement, each clean speech frame is represented by an auto-regressive (AR) process, whose parameters comprise the linear prediction coefficients (LPCs) and the prediction error variance. The LPC parameters and the additive noise variance are used to form the recursive equations of the KF. In the augmented KF (AKF), both the clean speech and additive noise LPC parameters are incorporated into an augmented matrix to construct the recursive equations of the AKF. Given a frame of noisy speech samples, the KF and AKF give a linear MMSE estimate of the clean speech samples using the recursive equations. In practice, inaccurate estimates of the parameters introduce bias in the KF and AKF gain, leading to a degradation in speech enhancement performance.

The research contributions in this dissertation can be grouped into three focus areas. In the first work, we propose an iterative KF (IT-KF) that offsets the bias in the KF gain for speech enhancement through improved parameter estimation in real-life noise conditions. In the second work, we jointly incorporate robustness and sensitivity metrics to offset the bias in the KF and AKF gain, which addresses speech enhancement in real-life noise conditions. The third focus area consists of deep neural network (DNN) and whitening filter assisted KF and AKF for speech enhancement. Specifically, the DNN and whitening-filter-based approaches provide the parameter estimates required by the KF and AKF. However, the whitening filter still produces biased speech LPC estimates for the KF and AKF, which results in degraded speech. To address this, we propose DeepLPC, a framework constructed with the state-of-the-art residual network and temporal convolutional network (ResNet-TCN), to jointly estimate the speech and noise LPC parameters from the noisy speech for the AKF. Recently, the multi-head self-attention network (MHANet) has demonstrated the ability to model the long-term dependencies of noisy speech more efficiently than ResNet-TCN.
Therefore, we employ the MHANet within DeepLPC, termed DeepLPC-MHANet, to further improve the speech and noise LPC parameter estimates for the AKF. Finally, we perform a comprehensive study on four different training targets for LPC estimation using ResNet-TCN and MHANet. This study aims to determine which training target and DNN method produce accurate speech and noise LPC parameters for AKF-based speech enhancement in practice. Objective and subjective scores demonstrate that the proposed methods in this dissertation produce enhanced speech with higher quality and intelligibility than the competing methods in various noise conditions for a wide range of signal-to-noise ratio (SNR) levels.

Statement of Originality

I hereby declare that this dissertation is the outcome of research conducted by myself under the principal supervision of Prof. Kuldip K. Paliwal, in the School of Engineering and Built Environment at Griffith University, Brisbane, Australia. This work has not previously been submitted for a degree or diploma in any university. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made in the thesis itself.

Sujan Kumar Roy
April 2021

Acknowledgements

First of all, I would like to express my sincerest gratitude and appreciation to my principal supervisor, Prof. Kuldip K. Paliwal, for his expert guidance, mentorship, encouragement, and support at all levels of my doctoral study. I am also grateful to Dr. Aaron Nicolson for his collaboration and assistance. I would like to thank my lab mates, Dr. Timothy Roberts, Dr. Jack Hanson, Dr. Aadel Alatwi, Miss Sisi, Mr. Jaswinder Singh, Mr. Jaspreet Singh, and Mr. Anil Kumar Hanumanthappa, for their friendship and cooperation. I would also like to thank Mrs. Natalie Dunstan, Mrs. Lynda Ashworth, and other Griffith University officials for their help and support. I acknowledge Griffith University for providing me with the GUIPRS, GUPRS, and Engineering Top-Up scholarships, without which I would not have been able to undertake doctoral study. I would also like to thank the University of Rajshahi, Bangladesh, for granting me study leave during my doctoral study. I am also grateful to my wife, Laboni Rani, for her valuable time and encouragement during my doctoral study. Last but not least, I would like to thank my parents for their lifelong love and support without boundaries, which have always been a source of motivation and happiness for me.

Contents

Abstract
Statement of Originality
Acknowledgements

I General Introduction

1 Introduction
  1.1 Overview of Speech Enhancement
  1.2 KF and AKF-based Speech Enhancement
  1.3 Research Contributions
  1.4 Publications Resulting from Research
  1.5 Thesis Outline
  1.6 Ethical Clearances for Experimentation
II Iterative Kalman Filtering for Speech Enhancement

2 An Iterative Kalman Filter with Reduced-Biased Kalman Gain for Single Channel Speech Enhancement in Non-stationary Noise Condition

III Tuning of Kalman Filter as well as Augmented Kalman Filter for Speech Enhancement

3 Robustness and Sensitivity Tuning of the Kalman Filter for Speech Enhancement

4 Robustness and Sensitivity Metrics-Based Tuning of the Augmented Kalman Filter for Single-Channel Speech Enhancement

IV Kalman Filtering and Machine Learning Methods for Speech Enhancement

5 Deep Learning-Based Kalman Filter and Augmented Kalman Filter for Speech Enhancement

6 DeepLPC: A Deep Learning Approach to Augmented Kalman Filter-Based Single-Channel Speech Enhancement

7 DeepLPC-MHANet: Multi-Head Self-Attention for Augmented Kalman Filter-Based Speech Enhancement

8 On Training Targets for Supervised LPC Estimation to Augmented Kalman Filter-Based Speech Enhancement

V Conclusion

9 Summary, Conclusions, and Future Work
  9.1 Summary and Conclusions
    9.1.1 Chapter 2: An Iterative Kalman Filter with Reduced-Biased Kalman Gain for Single Channel Speech Enhancement in Non-stationary Noise Condition
    9.1.2 Chapter 3: Robustness and Sensitivity Tuning of the Kalman Filter for Speech Enhancement
    9.1.3 Chapter 4: Robustness and Sensitivity Metrics-Based Tuning of the Augmented Kalman Filter for Single-Channel Speech Enhancement
    9.1.4 Chapter 5: Deep Learning-Based Kalman Filter and Augmented Kalman Filter for Speech Enhancement
    9.1.5 Chapter 6: DeepLPC: A Deep Learning Approach to Augmented Kalman Filter-Based Single-Channel Speech Enhancement
    9.1.6 Chapter 7: DeepLPC-MHANet: Multi-Head Self-Attention for Augmented Kalman Filter-Based Speech Enhancement
    9.1.7 Chapter 8: On Training Targets for Supervised LPC Estimation to Augmented Kalman Filter-based Speech Enhancement
  9.2 Future Work

Bibliography

Part I
General Introduction

Chapter 1
Introduction

1.1 Overview of Speech Enhancement

In real life, clean speech corrupted by background noise (called noisy speech) can reduce the efficiency of human-to-human and human-to-machine communication. A speech enhancement algorithm (SEA) can be used to suppress the embedded background noise and increase the quality and intelligibility of noisy speech. An SEA is useful in many applications where noise-corrupted speech is undesirable yet unavoidable. For example, speech communication systems, hearing aid devices, and speech recognition systems typically rely upon speech enhancement for robustness.

SEAs can be classified according to the number of input channels or microphones (single/multiple) and the domain of processing (time/transform). In a single-channel SEA, one noisy mixture gives the overall spectral information of the degraded speech; thus, a single-channel SEA estimates the clean speech from the noisy speech alone. Conversely, in multi-channel speech enhancement, multiple microphones are available for capturing the speech and noise sources.
Therefore, multi-channel SEAs are suitable for enhancing speech corrupted by background noise and reverberation from surface reflections (noisy-reverberant speech).

The research area of this dissertation is single-channel speech enhancement. Various single-channel SEAs, such as spectral subtraction (SS), the Wiener filter (WF), minimum mean square error (MMSE) estimators, the Kalman filter (KF), the augmented KF (AKF), computational auditory scene analysis (CASA), and deep learning methods, have been introduced over the decades. In this dissertation, we focus on Kalman filtering with machine learning methods for single-channel speech enhancement.

1.2 KF and AKF-based Speech Enhancement

In the KF, each clean speech frame is represented by an auto-regressive (AR) process, whose parameters comprise the linear prediction coefficients (LPCs) and the prediction error variance. The LPC parameters and noise variance are used to construct the KF recursive equations. In the augmented KF (AKF), both the clean speech and the additive noise are represented by two AR processes. The speech and noise LPC parameters are incorporated in an augmented matrix to construct the recursive equations of the AKF. Given a frame of noisy speech samples, the KF and AKF give a linear MMSE estimate of the clean speech samples using the recursive equations. Therefore, the KF performance depends on the accuracy of the LPC parameter and noise variance estimates in practice, while the AKF performance largely relies on the accuracy of the speech and noise LPC parameter estimation. Usually, inaccurate estimates of the parameters introduce bias in the KF and AKF gain, leading to a degradation in speech enhancement performance. (A small illustrative sketch of this AR/LPC parameterization is given at the end of Section 1.3.)

1.3 Research Contributions

The research contributions in this dissertation can be grouped into three focus areas. In the first work, we propose an iterative KF (IT-KF) that offsets the bias in the KF gain for speech enhancement through improved parameter estimation in real-life noise conditions. In the second work, we jointly incorporate robustness and sensitivity metrics to offset the bias in the KF and AKF gain, which addresses speech enhancement in real-life noise conditions. The third focus area consists of deep neural network (DNN) and whitening filter assisted KF and AKF for speech enhancement. Specifically, the DNN and whitening-filter-based approaches provide the parameter estimates for the KF and AKF. However, the whitening filter still produces biased speech LPC estimates for the KF and AKF, which results in degraded speech. To address this, we propose a DeepLPC framework constructed with the state-of-the-art residual network and temporal convolutional network (ResNet-TCN) to jointly estimate the speech and noise LPC parameters from the noisy speech for the AKF. Recently, the multi-head self-attention network (MHANet) has demonstrated the ability to model the long-term dependencies of noisy speech more efficiently than ResNet-TCN. Therefore, we employ the MHANet within DeepLPC, termed DeepLPC-MHANet, to further improve the speech and noise LPC parameter estimates for the AKF. Finally, we perform a comprehensive study on four different training targets for LPC estimation using ResNet-TCN and MHANet. This study aims to determine which training target and DNN method produce accurate speech and noise LPC parameters for AKF-based speech enhancement in real-life noise conditions.
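To make the AR/LPC parameterization above concrete, here is a minimal Python sketch, not taken from the dissertation, of how the speech LPCs {a_i} and the prediction error variance σ_u² can be computed from one frame via the autocorrelation method (Levinson-Durbin recursion). The function name, the frame input, and the model order p are placeholder assumptions.

    import numpy as np

    def lpc_params(frame, p=10):
        """Estimate p-th order LPCs {a_i} and excitation variance sigma_u^2
        from one speech frame via the autocorrelation (Levinson-Durbin)
        method, using the convention s(n) = -sum_i a_i s(n-i) + u(n)."""
        n = len(frame)
        # Per-sample (biased) autocorrelation estimates r[0..p]
        r = np.array([frame[:n - i] @ frame[i:] for i in range(p + 1)]) / n
        a = np.zeros(p + 1)
        a[0] = 1.0
        err = r[0] + 1e-12            # running prediction error power
        for i in range(1, p + 1):
            # Reflection coefficient for order i
            k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
            # Symmetric order update of the existing coefficients
            a[1:i] = a[1:i] + k * a[i - 1:0:-1]
            a[i] = k
            err *= (1.0 - k * k)      # residual energy shrinks each order
        return a[1:], err             # ({a_1..a_p}, sigma_u^2)

In clean speech these estimates are reliable; the point made throughout this dissertation is that when the same procedure is applied to noisy frames, both {a_i} and σ_u² become biased, which in turn biases the KF and AKF gains.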
1.4 Publications Resulting from Research

[1] Sujan Kumar Roy and Kuldip K. Paliwal. An iterative Kalman filter with reduced-biased Kalman gain for single channel speech enhancement in non-stationary noise condition. International Journal of Signal Processing Systems, 7(1):7-13, March 2019.

[2] Sujan Kumar Roy and Kuldip K. Paliwal. Robustness and sensitivity tuning of the Kalman filter for speech enhancement. Under review with Signals (submitted 26 February 2021).

[3] Sujan Kumar Roy and Kuldip K. Paliwal. Robustness and sensitivity metrics-based tuning of the augmented Kalman filter for single-channel speech enhancement. Under review with Applied Acoustics (submitted 4 March 2021).

[4] Sujan Kumar Roy, Aaron Nicolson, and Kuldip K. Paliwal. A deep learning-based Kalman filter for speech enhancement. Proc. Interspeech 2020, pages 2692-2696, October 2020.

[5] Sujan Kumar Roy, Aaron Nicolson, and Kuldip K. Paliwal. Deep learning with augmented Kalman filter for single-channel speech enhancement. IEEE International Symposium on Circuits and Systems (ISCAS), pages 1-5, October 2020.

[6] Sujan Kumar Roy, Aaron Nicolson, and Kuldip K. Paliwal. DeepLPC: A deep learning approach to augmented Kalman filter-based single-channel speech enhancement. IEEE Access (revisions submitted 18 March 2021).

[7] Sujan Kumar Roy, Aaron Nicolson, and Kuldip K. Paliwal. DeepLPC-MHANet: Multi-head self-attention for augmented Kalman filter-based speech enhancement. Under review with IEEE Access (submitted 8 April 2021).

[8] Sujan Kumar Roy and Kuldip K. Paliwal. On training targets for supervised LPC estimation to augmented Kalman filter-based speech enhancement. Under review with Speech Communication (submitted 12 April 2021).

[9] Sujan Kumar Roy and Kuldip K. Paliwal. A non-iterative Kalman filter for single channel speech enhancement in non-stationary noise condition. 12th International Conference on Signal Processing and Communication Systems (ICSPCS), pages 1-7, December 2018.

[10] Sujan Kumar Roy and Kuldip K. Paliwal. Sensitivity metric-based tuning of the augmented Kalman filter for speech enhancement. 14th International Conference on Signal Processing and Communication Systems (ICSPCS), pages 1-6, December 2020.

[11] Sujan Kumar Roy and Kuldip K. Paliwal. Causal convolution encoder decoder-based augmented Kalman filter for speech enhancement. 14th International Conference on Signal Processing and Communication Systems (ICSPCS), pages 1-7, December 2020.

[12] Sujan Kumar Roy and Kuldip K. Paliwal. Deep residual network-based augmented Kalman filter for speech enhancement. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 667-673, December 2020.

[13] Sujan Kumar Roy and Kuldip K. Paliwal. Causal convolutional neural network-based Kalman filter for speech enhancement. Asia-Pacific Conference on Computer Science and Data Engineering, December 2020.

1.5 Thesis Outline

This dissertation is organized into five parts. The first part consists of the general introduction. The second to fourth parts comprise Chapters 2-8, the research contributions of this dissertation; Chapters 2-8 include the publications listed in [1]-[8]. Specifically, in the second part, Chapter 2 presents the iterative KF for speech enhancement as in [1]. In the third part, Chapters 3 and 4 present the robustness and sensitivity metrics-based tuning of the KF and AKF for speech enhancement as in [2, 3].
In the fourth part, Chapter 5 covers [4, 5], the deep learning and whitening filter assisted KF and AKF for speech enhancement; Chapter 6 presents the DeepLPC framework, built on a residual network and temporal convolutional network (ResNet-TCN), for AKF-based speech enhancement as in [6]; Chapter 7 presents the DeepLPC-MHANet framework, built on a multi-head attention network (MHANet), for AKF-based speech enhancement as in [7]; and Chapter 8 performs a comprehensive study on training targets for supervised LPC estimation with an application to AKF-based speech enhancement as in [8]. Finally, in the fifth part, Chapter 9 summarises the dissertation and provides recommendations for future research. Detailed summaries of the chapters in the second to fifth parts of this dissertation are given below:

Chapter 2: This chapter presents an iterative KF (IT-KF)-based SEA as in [1]. It is demonstrated that inaccurate estimates of the speech LPC parameters and noise variance introduce bias in the KF gain, leading to a degradation in speech enhancement performance. To address this, unlike the existing IT-KF, the proposed SEA computes the speech LPC parameters from pre-smoothed speech, while the noise variance is estimated from each noisy speech frame. The KF is then constructed with the estimated parameters and processes the noisy speech at the first iteration. The parameters are re-estimated from the processed speech, the KF is re-constructed, and the process is repeated at the second iteration. It is shown that the parameters re-estimated at the second iteration result in a significant reduction of the bias in the KF gain, which addresses speech enhancement in non-stationary noise conditions. Objective and subjective scores demonstrate that the proposed SEA outperforms the competing methods in non-stationary noise conditions.

Chapter 3: In this chapter, we propose a tuning algorithm for the KF to perform speech enhancement in real-life noise conditions. Although the current tuning methods offset the bias in the KF, particularly in stationary noise conditions, they do not address real-life noise conditions. To address this, we first estimate the speech LPC parameters and noise variance from each noisy speech frame. The KF is then constructed with the estimated parameters, where the robustness and sensitivity metrics are jointly incorporated to dynamically offset the bias in the KF gain and achieve better noise reduction. It is shown that the reduced-biased KF gain achieved by the proposed tuning algorithm addresses speech enhancement in real-life noise conditions. Objective and subjective scores demonstrate that the proposed method outperforms the competing methods in real-life noise conditions.

Chapter 4: This chapter presents a tuning algorithm for the AKF to perform speech enhancement in real-life noise conditions. Although current tuning methods are capable of offsetting the bias in the AKF gain in colored noise conditions, they do not adequately address non-stationary noise conditions. To cope with this problem, we first estimate the speech and noise LPC parameters from each noisy speech frame. The AKF is then constructed with the estimated speech and noise LPC parameters. To achieve better noise reduction, the proposed tuning algorithm jointly incorporates the robustness and sensitivity metrics to dynamically offset the bias in the AKF gain.
It is shown that the reduced-biased AKF gain achieved by the proposed tuning algorithm addresses speech enhancement in real-life noise conditions. Objective and subjective scores demonstrate that the proposed method outperforms the competing methods in various noise conditions.

Chapter 5: This chapter investigates deep learning and whitening filter assisted KF and AKF for speech enhancement. Specifically, a deep learning framework estimates the noise PSD from each noisy speech frame to compute the noise parameters for the KF and AKF. A whitening filter (with its coefficients computed from the estimated noise PSD) is then constructed to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. The KF and AKF are then constructed with the estimated parameters for speech enhancement. Objective and subjective scores demonstrate that the proposed KF and AKF methods outperform other competing methods in various noise conditions.

Chapter 6: In this chapter, we propose a DeepLPC framework built on ResNet-TCN to jointly estimate the speech and noise LPC parameters for the AKF in real-life noise conditions. Specifically, DeepLPC learns to map each frame of the noisy speech magnitude spectrum to the clean speech and noise LPC power spectra (LPC-PS). The clean speech and noise LPC-PS estimates are then used to compute the LPC estimates required to construct the AKF. The DeepLPC framework demonstrates a lower spectral distortion (SD) level in the estimated clean speech LPCs than existing deep learning methods. It is shown that the AKF constructed with the DeepLPC-driven speech and noise LPC parameters addresses speech enhancement in real-life noise conditions. Objective and subjective scores demonstrate that the proposed method produces enhanced speech with higher quality and intelligibility than competing methods in various noise conditions.

Chapter 7: This chapter presents an extension of the DeepLPC framework with a multi-head attention network (MHANet), called DeepLPC-MHANet, to further improve the accuracy of speech and noise LPC parameter estimation in practice. This is because the MHANet has demonstrated the ability to model the long-term dependencies of noisy speech more efficiently than ResNet-TCN. Therefore, MHANet learns a better mapping of each frame of the noisy speech magnitude spectrum to the clean speech and noise LPC-PS. The speech and noise LPC parameters are computed from the estimated LPC-PSs to construct the AKF. DeepLPC-MHANet demonstrates a lower spectral distortion (SD) level in the estimated speech LPCs than DeepLPC-ResNet-TCN. Objective and subjective scores indicate that the AKF constructed with the DeepLPC-MHANet-driven LPC parameters outperforms other competing methods in various noise conditions.

Chapter 8: This chapter presents a comprehensive study on training targets for supervised LPC estimation. In a supervised technique, typically, a deep neural network (DNN) is trained to learn a mapping from noisy speech features to the clean speech and noise LPCs. Training targets for DNN-based clean speech and noise LPC estimation fall into four categories: line spectrum frequencies (LSF), LPC power spectrum (LPC-PS), power spectrum (PS), and magnitude spectrum (MS). In this study, two state-of-the-art DNN methods, ResNet-TCN and MHANet, are used to evaluate the training targets. The choice of an appropriate training target as well as DNN method can have a significant impact on LPC estimation in practice.
Motivated by this, we aim to determine which training target and DNN method produce accurate speech and noise LPC parameter estimates in practice. It is demonstrated that the LPC-PS training target with MHANet produces a lower spectral distortion (SD) level in the estimated clean speech LPCs in real-life noise conditions. Objective and subjective scores also demonstrate that the AKF constructed with the MHANet-LPC-PS-driven speech and noise LPC parameters produces enhanced speech with higher quality and intelligibility than competing methods.

Chapter 9: This chapter summarizes the main findings and conclusions presented in the research Chapters 2-8 of this dissertation. It also outlines future directions of research.

1.6 Ethical Clearances for Experimentation

All subjective listening experiments requiring human participants were conducted with approval from Griffith University's Human Research Ethics Committee: database protocol number 2018/671.

Part II
Iterative Kalman Filtering for Speech Enhancement

Chapter 2
An Iterative Kalman Filter with Reduced-Biased Kalman Gain for Single Channel Speech Enhancement in Non-stationary Noise Condition

STATEMENT OF CONTRIBUTION TO CO-AUTHORED PUBLISHED PAPER

This chapter includes a co-authored paper. The bibliographic details of the co-authored paper, including all authors, are:

Sujan Kumar Roy and Kuldip K. Paliwal, "An Iterative Kalman Filter with Reduced-Biased Kalman Gain for Single Channel Speech Enhancement in Non-stationary Noise Condition," International Journal of Signal Processing Systems, Vol. 7, No. 1, pp. 7-13, March 2019, doi: 10.18178/ijsps.7.1.7-13.

My contribution to the paper involved:
• Preliminary experiments.
• Experiment design.
• Conducted the experiments.
• Code writing.
• Design of models.
• Analysis of results.
• Literature review.
• Manuscript writing.

Professor Kuldip K. Paliwal provided supervision and aided with editing the final manuscript.

(Signed) Sujan Kumar Roy, 02/04/2021
(Countersigned) Supervisor: Professor Kuldip K. Paliwal, 02/04/2021

International Journal of Signal Processing Systems, Vol. 7, No. 1, March 2019

An Iterative Kalman Filter with Reduced-Biased Kalman Gain for Single Channel Speech Enhancement in Non-stationary Noise Condition

Sujan Kumar Roy and Kuldip K. Paliwal
Signal Processing Laboratory, Griffith School of Engineering, Griffith University, Brisbane, QLD, Australia, 4111
Email: [email protected], [email protected]
Abstract—This paper presents an iterative Kalman filter (IT-KF) with a reduced-biased Kalman gain for single channel speech enhancement in non-stationary noise conditions (NNCs). The proposed IT-KF aims to offset the bias in the Kalman gain through efficient parameter estimation, leading to improved speech enhancement performance. To do this, we introduce a decision-directed (DD) and a posteriori SNR-based noise variance estimation method controlled through a speech activity detector (SAD). The proposed SAD incorporates a majority voting of three distinct SAD fusions. The LPC parameters are computed from a pre-smoothed version of the noisy speech. With these initial estimated parameters, the IT-KF processes the noisy speech at the first iteration. The parameters are re-estimated from the processed speech, the Kalman gain is re-adjusted, and the process is repeated at the second iteration. It is shown that the adjusted Kalman gain enables the IT-KF to minimize the remaining artifacts of the processed speech, yielding the enhanced speech. Extensive simulation results reveal that the proposed method outperforms other benchmark methods in NNCs for a wide range of SNRs.

Index Terms—speech enhancement, Kalman filter, non-stationary noise, speech activity detector, pre-smoothing

Manuscript received December 24, 2018; revised February 18, 2019. doi: 10.18178/ijsps.7.1.7-13

I. INTRODUCTION

Background noise degrades the speech signal during voice communication over the telephone, speech recognition, speech coding, etc. A speech enhancement algorithm (SEA) acts as a front-end tool for these applications by providing an estimate of the clean speech. Many SEAs, such as spectral subtraction (SS) [1], MMSE [2], and the Kalman filter (KF) [3], have been proposed over the decades. However, speech enhancement performance varies over the SEAs and deteriorates when the speech is corrupted by non-stationary noise.

The enhanced speech of the SS method suffers from musical noise and distortion due to under- or over-subtraction of the estimated noise spectrum from the noisy spectrum [4]. Although the MMSE method [2] shows improvement over SS, its efficiency depends entirely on the accuracy of the a priori and a posteriori SNR computation in noisy conditions. In [5], a regional-statistics-based noise PSD tracking method was proposed, which could be used to further improve the performance of the MMSE method [2]; however, significant background noise still remains in the enhanced speech. Paliwal and Basu first introduced a KF-based SEA in white noise conditions [3]; here, the LPC parameters are computed from the clean speech. Gibson et al. introduced an iterative KF (IT-KF) for colored noise suppression [6], where the LPC parameters are estimated by processing each noisy speech frame with 3-4 iterations. Due to inaccurate parameter estimates, the Kalman gain becomes biased and a significant amount of background noise remains in the enhanced speech. So and Paliwal [7] studied the impact of the long tapered window (LTW) on LPC estimation, which influences the KF performance; however, windowing improves the KF performance only to some extent. In [8], a sub-band (SB) IT-KF was introduced. It applies an IT-KF to the partially reconstructed high-frequency (HF) sub-bands among the 16 decomposed SBs, while keeping the low-frequency (LF) SBs unchanged. The full-band enhanced speech is obtained by adding the HF enhanced speech to the LF SBs. However, the LF SBs can also be affected by noise when processing non-stationary noise corrupted speech. Recently, Roy and Paliwal [9] introduced an NIT-KF-based SEA to minimize the biasing effect of the Kalman gain through efficient parameter estimation; however, some artifacts still remain in the enhanced speech.

Although most SEAs perform relatively well in white noise conditions, their performance degrades in NNCs. The authors in [9] showed that the KF performance deteriorates due to the biased estimate of the Kalman gain in NNCs.
In this paper, we focus on further adjustment of the biased Kalman gain through improving the initial estimates of the parameters, followed by their re-estimation in the subsequent iterations of the IT-KF. The initial estimated parameters are applied to the IT-KF for filtering the noisy speech at the first iteration. The parameters are then re-estimated to compute the Kalman gain, and the process is repeated at the second iteration. The adjusted Kalman gain is effective in minimizing the remaining artifacts of the processed speech, yielding the enhanced speech. The efficiency of the proposed method with respect to other benchmark SEAs, in terms of subjective and objective testing, is reported in this paper.

The rest of the paper is organized as follows. Section II describes the conventional KF for speech enhancement, with the problem statement in II-A. Section III introduces the proposed speech enhancement system: parameter estimation in III-A, the proposed SAD algorithm in III-A(1), the proposed noise variance estimation in III-A(2), estimation of the initial LPC parameters in III-A(3), the proposed parameter re-estimation method in III-A(4), a summary of the proposed IT-KF based SEA in III-B, and an optimality comparison of the Kalman gain in III-C. Section IV describes the speech enhancement experiment, where the simulation setup is given in IV-A and the experimental results and discussion in IV-B. Section V gives some concluding remarks and future research directions.

II. CONVENTIONAL KF FOR SPEECH ENHANCEMENT

The noisy speech y(n) (nth sample) captured by a single microphone is represented as

    y(n) = s(n) + v(n),    (1)

where s(n) is the clean speech and v(n) is the additive noise with variance σ_v², uncorrelated with s(n). The clean speech s(n) in (1) can be represented with pth-order LPCs (a_i's) as [10]

    s(n) = −∑_{i=1}^{p} a_i s(n−i) + u(n),    (2)

where u(n) is a white Gaussian excitation with zero mean and variance σ_u².

Equations (1) and (2) are used to form the following state-space model (bold-faced letters represent vectors/matrices):

    x(n) = Ψx(n−1) + cu(n),    (3)
    y(n) = dx(n) + v(n),    (4)

where Ψ is the state transition matrix containing the a_i's, x(n) = [s(n−p+1) s(n−p+2) … s(n)]^T is the state vector, and d = c^T = [0 0 … 0 1] are the measurement vectors for the excitation and observation noises, respectively.

For a particular frame, the KF computes an unbiased, linear MMSE estimate x̂(n|n) of x(n) at time n, given y(n), y(n−1), …, y(1), by using the following recursive equations [3]:

    x̂(n|n−1) = Ψx̂(n−1|n−1),    (5)
    Σ(n|n−1) = ΨΣ(n−1|n−1)Ψ^T + cσ_u²c^T,    (6)
    K(n) = Σ(n|n−1)d^T (dΣ(n|n−1)d^T + σ_v²)^{−1},    (7)
    x̂(n|n) = x̂(n|n−1) + K(n)(y(n) − dx̂(n|n−1)),    (8)
    Σ(n|n) = (I − K(n)d)Σ(n|n−1).    (9)

The estimated speech at time n is given by ŝ(n) = dx̂(n|n). The above procedure is repeated for the following frames, yielding the enhanced speech ŝ(n).
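The following is a minimal Python sketch of eqs. (5)-(9) applied to one noisy frame; it is our illustration rather than the authors' code. The companion form of Ψ and the function and variable names are assumptions, and the LPCs a = [a_1, …, a_p], σ_u², and σ_v² are taken to be supplied by whatever parameter estimator is in use.

    import numpy as np

    def kalman_enhance_frame(y, a, sigma_u2, sigma_v2):
        """Run the KF recursion of eqs. (5)-(9) over one noisy frame y,
        given p-th order LPCs a = [a_1, ..., a_p], excitation variance
        sigma_u2, and noise variance sigma_v2. Returns the enhanced frame."""
        p = len(a)
        # Companion-form state transition matrix Psi built from the LPCs
        Psi = np.zeros((p, p))
        Psi[:-1, 1:] = np.eye(p - 1)   # shift rows: state advances by one sample
        Psi[-1, :] = -a[::-1]          # last row: [-a_p, ..., -a_1] per eq. (2)
        d = np.zeros(p); d[-1] = 1.0   # measurement vector (d = c^T)
        c = d.reshape(-1, 1)
        x = np.zeros((p, 1))           # x_hat(0|0)
        Sigma = np.zeros((p, p))       # Sigma(0|0)
        s_hat = np.zeros(len(y))
        for n, yn in enumerate(y):
            x = Psi @ x                                          # (5)
            Sigma = Psi @ Sigma @ Psi.T + sigma_u2 * (c @ c.T)   # (6)
            K = Sigma @ c / (d @ Sigma @ c + sigma_v2)           # (7)
            x = x + K * (yn - d @ x)                             # (8)
            Sigma = (np.eye(p) - K @ d.reshape(1, -1)) @ Sigma   # (9)
            s_hat[n] = x[-1, 0]        # s_hat(n) = d x_hat(n|n)
        return s_hat

Note that the only frame-dependent inputs are the three parameters; this is exactly why their estimation accuracy, discussed next, dominates the filter's behaviour.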
Gibson et al. introduced an IT-KF by repeating eqs. (5)-(9) iteratively, where Ψ is formed with the pth- and qth-order LPCs of s(n) and v(n). The parameters are re-estimated at the end of each iteration, which increases the computational complexity. For computational efficiency, Roy et al. showed that the Ψ of the IT-KF can be formed with the LPCs of s(n) only and remains effective for non-stationary noise suppression [8]. Unlike the IT-KF in [8], the proposed IT-KF re-estimates the parameters in the subsequent iterations differently, based on an SAD.

[Figure 1. The impact of σ_u² and σ_v² on the biased K_0(n) of the NIT-KF: (a) spectrogram of clean speech, (b) spectrogram of noisy speech (corrupted by 0 dB restaurant noise), (c) K_0(n), where σ_u² and σ_v² are computed in the oracle case, (d) spectrogram of enhanced speech (oracle case, PESQ = 2.42), (e) K_0(n), where σ_u² and σ_v² are computed from the noisy speech, (f) spectrogram of enhanced speech (noisy case, PESQ = 2.07).]

A. Problem Statement

Though the conventional KF works reasonably well in stationary noise conditions, its performance suffers in NNCs. Roy and Paliwal showed that poor estimates of the LPC parameters ({a_i} and σ_u²) and the noise variance (σ_v²) introduce a biasing effect in the first component, K_0(n), of the Kalman gain K(n), particularly during silent activity, resulting in a significant amount of residual noise in the enhanced speech [9]. We briefly review the impact of the biased K(n) on speech enhancement performance. To do this, we simplify K(n) in (7) to represent K_0(n) as [7]

    K_0(n) = Σ_0(n|n−1) / (Σ_0(n|n−1) + σ_v²),    (10)

where Σ_0(n|n−1) corresponds to the prediction error σ_u² of the first component of the a priori state estimate x̂(n|n−1). By replacing Σ_0(n|n−1) with σ_u², (10) can be represented as

    K_0(n) = σ_u² / (σ_u² + σ_v²).    (11)

To examine the impact of the biased K_0(n) on speech enhancement (SE) by the KF, we further simplify the a posteriori state estimate x̂(n|n) in (8) and represent it in scalar form, x̂(n|n), as [7]

    x̂(n|n) = K_0(n)y(n) + (1 − K_0(n))x̂(n|n−1).    (12)

In the special case of σ_v² = 0, according to (11), K_0(n) becomes unity and the output equals y(n) by (12). Conversely, when σ_u² = 0 during silent activity, K_0(n) = 0 and no corrupting noise from y(n) is passed to the output (enhanced speech).
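A quick numeric check of eqs. (11)-(12) makes the bias visible. The values below are hypothetical, chosen only to mirror the normalized variances discussed around Fig. 2; they are our example, not data from the paper.

    # Illustrating eq. (11): K_0 = sigma_u^2 / (sigma_u^2 + sigma_v^2)
    def k0(sigma_u2, sigma_v2):
        return sigma_u2 / (sigma_u2 + sigma_v2)

    print(k0(0.0, 0.4))   # oracle silence: sigma_u^2 = 0 -> K_0 = 0, noise blocked
    print(k0(0.9, 0.0))   # oracle clean speech: sigma_v^2 = 0 -> K_0 = 1
    print(k0(0.5, 0.5))   # noisy-case estimates of similar size -> K_0 = 0.5,
                          # so eq. (12) passes 50% of y(n) straight to the output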
In the oracle case, {a_i}, σ_u², and σ_v² are computed from the clean speech and noise, respectively. Thus, the computed K_0(n) in the oracle case shows a smooth transition between 0 and 1 depending on the silent/speech activity. The smooth K_0(n) blends y(n) and x̂(n|n−1) in (12) in an effective manner, yielding a better x̂(n|n). Therefore, the enhanced speech (Fig. 1(d)) obtained by the oracle KF is almost identical to the clean speech (Fig. 1(a)).

[Figure 2. Comparing the variances (σ_v²) of the noisy speech of Fig. 1(b) and the predicted speech (inverted) with the prediction error variance (σ_u²) for the same experimental setup used in Fig. 1.]

In practice, however, {a_i}, σ_u², and σ_v² are computed from the noisy speech. Thus, the predicted σ_u² and σ_v² become worse and rise up to 1 (in normalized form). For example, it can be seen from Fig. 2 that the predicted σ_v² and σ_u² remain close to each other from 0.9 s to 2.18 s. Therefore, the computed K_0(n) in the noisy case remains biased around 0.5 between 0.9 s and 2.18 s (Fig. 1(e)); the K_0(n) for the other regions varies accordingly. With K_0(n) biased at 0.5, according to (12), 50% of y(n) is passed to the output. As a result, the corresponding enhanced speech contains 50% background noise, clearly visible in Fig. 1(f), specifically from 0.9 s to 2.18 s. This is attributed to the biasing effect of the Kalman gain.

This paper aims to offset the bias in K_0(n) through improving the initial estimates of {a_i}, σ_v², and σ_u², followed by re-estimation of these in subsequent iterations of the IT-KF.

III. PROPOSED IT-KF BASED SEA

[Figure 3. Schematic diagram of the proposed IT-KF based speech enhancement system.]

Fig. 3 shows the schematic diagram of the proposed SEA. Firstly, y(n) is converted through overlapping and windowing into frames y(n,k), where n is the time index (n = 0, 1, 2, …, M−1) and k is the frame index (k = 0, 1, 2, …, N−1). We use a 50% overlapped Kaiser window with β = 2.5 for generating y(n,k), which has been found effective in terms of a bias-reduced K_0(n) [9].

A. Parameter Estimation

The σ_v² is computed from the estimated noise spectrum |V̂(m,k)|, and the LPC parameters ({a_i} and σ_u²) are computed from a pre-smoothed speech ŝ_p(n,k) of the noisy y(n,k), whereas {a_i}, σ_u², and σ_v² are re-estimated in the subsequent iterations of the IT-KF based on the SAD. The next sections describe the proposed SAD and parameter estimation methods.

1) Proposed SAD Method: The proposed SAD is implemented through a majority voting (MV) of three distinct SAD fusions corresponding to spectral flatness (SF), zero-crossing-rate-weighted root mean square energy (ZCRMS), and Kaiser-Teager energy (KTE). It is observed that the SF approaches 1 or 0 depending on the silent/speech activity [11]. The degree of speech activity can be predicted through the ZCRMS when it rises to 1, while it remains close to 0 for silent frames. During speech activity, the KTE rises up to 1 and gives a prominent local peak, while it goes down to 0 for silent frames [12].

[Figure 4. (a) Noisy speech (corrupted by 0 dB restaurant noise), (b) computed SF, ZCRMS, and KTE from (a).]

In noisy conditions, Fig. 4 reveals that the SF still varies between 0 and 1 depending on the speech/silent activity, whereas the ZCRMS and KTE rise up to 1 once speech activity is present and approach 0 during silent activity. To make the SAD robust against noise, the threshold for each feature is continually updated on a framewise basis to implement the corresponding SAD fusions. The MV takes the average of the SAD fusions ∈ {0,1} (0: silent, 1: speech activity) for each frame, and speech activity is declared if MV > 0.5; otherwise, the frame is silent.

For the kth frame, the SF (denoted by η) is computed as [11]

    η(k) = (∏_{m=0}^{L−1} |Y(m,k)|)^{1/L} / ((1/L)∑_{m=0}^{L−1} |Y(m,k)|),    (13)

where |Y(m,k)| is the magnitude spectrum of y(n,k) and m = 0, 1, 2, …, L−1 is the acoustic frequency bin index. For the kth frame, the ZCRMS (denoted by ζ) is given by [12]

    ζ(k) = √((1/M)∑_{n=0}^{M−1} y²(n,k)) / ((1/2M)∑_{n=1}^{M} |sign(y(n,k)) − sign(y(n−1,k))|).    (14)

For the kth frame, the KTE (denoted by λ) is given by [12]

    λ(k) = √(∑_{n=0}^{M−1} {y²(n,k) − y(n−1,k)y(n+1,k)}).    (15)
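A compact sketch of the three features, as we read eqs. (13)-(15), is given below; the epsilon guards, the clipping of the KTE sum at zero before the square root, and the FFT-based spectrum are our additions rather than details specified in the paper.

    import numpy as np

    def sad_features(frame):
        """Per-frame SAD features of eqs. (13)-(15): spectral flatness (SF),
        ZCR-weighted RMS energy (ZCRMS), and Kaiser-Teager energy (KTE).
        `frame` is one windowed time-domain frame."""
        eps = 1e-12
        Y = np.abs(np.fft.rfft(frame)) + eps          # |Y(m,k)|
        # (13) SF: geometric mean over arithmetic mean of the magnitude spectrum
        sf = np.exp(np.mean(np.log(Y))) / np.mean(Y)
        # (14) ZCRMS: RMS energy divided by the mean sign-change rate
        rms = np.sqrt(np.mean(frame ** 2))
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
        zcrms = rms / (zcr + eps)
        # (15) KTE, clipped at zero as a numerical guard before the sqrt
        kte = np.sqrt(max(np.sum(frame[1:-1] ** 2 - frame[:-2] * frame[2:]), 0.0))
        return sf, zcrms, kte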
By assuming that the 5 starting y(n,k)'s are silent, the proposed SAD algorithm is given below (where t_η, t_ζ, and t_λ are the adaptive thresholds of η, ζ, and λ, respectively).

Algorithm 1: Proposed SAD Algorithm
1) Initialization:
   f_η = 0, f_ζ = 0, f_λ = 0
   S_η = (1/5)∑_{k=0}^{4} η(k), S_ζ = (1/5)∑_{k=0}^{4} ζ(k), S_λ = (1/5)∑_{k=0}^{4} λ(k)
   FLAG(k) = 0 for k = 0, 1, 2, 3, 4
2) for k = 5 to N−1 do [framewise loop]
   a) Update thresholds:
      S_η = S_η + η(k), t_η = S_η/k
      S_ζ = S_ζ + ζ(k), t_ζ = S_ζ/k
      S_λ = S_λ + λ(k), t_λ = S_λ/k
   b) if η(k) < t_η then f_η = 1
      elseif ζ(k) > t_ζ then f_ζ = 1
      elseif λ(k) > t_λ then f_λ = 1
      end if
   c) MV = (f_η + f_ζ + f_λ)/3
   d) if MV > 0.5 then FLAG(k) = 1 [speech activity]
      else FLAG(k) = 0 [silent activity]
      end if
   end for

[Figure 5. Comparing the reference and detected SAD flags for clean (Fig. 1(a)) and noisy speech (corrupted by 5 dB restaurant noise).]

It can be seen from Fig. 5 that only a few miss-detections occur between the detected and reference SAD flags. Note that the reference SAD flags are generated by visually inspecting the clean speech (Fig. 1(a)) frames (0: silence, −1: speech activity).

2) Proposed σ_v² Estimation Method: The initial noise periodogram |V̂(m,k)|² is computed by assuming that the 5 starting y(n,k)'s are silent:

    |V̂(m,k)|² = (1/5)∑_{k=0}^{4} |Y(m,k)|².    (16)

During silent activity of y(n,k) (k > 4), |V̂(m,k)|² is updated by using the DD approach as [4]

    |V̂(m,k)|² = G|V̂(m,k−1)|² + (1 − G)|Y(m,k)|²,    (17)

where G is a smoothing parameter, set to 0.9.

In stationary noise conditions, estimating |V̂(m,k)|² during non-speech activity is effective [4]. Since non-stationary noise is characterized by time-varying amplitude, the active speech regions are also affected by the noise. Therefore, the traditional DD approach is not appropriate for estimating |V̂(m,k)|² in NNCs. To address this issue, we compute the a posteriori SNR (denoted by γ) during speech activity to assess the amount of noise present. For the kth frame, γ(k) is computed as

    γ(k) = 10 log_10(|Y(m,k)|² / |V̂(m,k−1)|²).    (18)

It is observed that γ(k) becomes lower (mostly negative) if the active speech region is highly affected by additive noise. Thus, we compute an adaptive threshold t_γ by taking the average of the γ(k)'s up to frame k while processing the kth frame:

    t_γ = (1/k)∑_{i=0}^{k−1} γ(i).    (19)

During speech activity of y(n,k), if γ(k) ≤ t_γ, |V̂(m,k)|² is updated by (17); otherwise, |V̂(m,k)|² is kept unchanged. The σ_v² is computed from v̂(n,k) (the IDFT of |V̂(m,k)| exp[j∠Y(m,k)]) as

    σ_v² = (1/M)∑_{n=0}^{M−1} v̂²(n,k).    (20)
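The sketch below shows one framewise pass of this adaptive update as we interpret eqs. (17)-(19). Averaging γ(k) over frequency bins, and the names V2, Y_k, and gamma_hist, are our assumptions rather than anything fixed by the paper.

    import numpy as np

    def update_noise_psd(V2, Y_k, speech_flag, gamma_hist, G=0.9):
        """One framewise update of the noise periodogram per eqs. (17)-(19).
        V2: current |V(m,k-1)|^2 estimate; Y_k: complex spectrum of frame k;
        speech_flag: FLAG(k) from Algorithm 1; gamma_hist: past gamma values."""
        Y2 = np.abs(Y_k) ** 2
        # (18) a posteriori SNR of frame k (here averaged over frequency bins)
        gamma = 10.0 * np.log10(np.mean(Y2) / (np.mean(V2) + 1e-12))
        t_gamma = np.mean(gamma_hist) if gamma_hist else gamma   # (19)
        gamma_hist.append(gamma)
        # (17) DD update: always in silence; during speech only if gamma <= t_gamma
        if (not speech_flag) or (gamma <= t_gamma):
            V2 = G * V2 + (1.0 - G) * Y2
        # sigma_v^2 then follows eq. (20): the mean power of the time-domain
        # noise estimate obtained via an IDFT of |V(m,k)| with the noisy phase.
        return V2, gamma_hist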
3) Initial {a_i} and σ_u² Computation: The LPC parameters ({a_i} and σ_u²) are very sensitive to noise, especially at low SNRs. The existing IT-KF methods compute these parameters from the noisy speech at the first iteration and re-estimate them at the subsequent iterations [6, 8]. Since these parameters are computed from the noisy speech at the first iteration, σ_u² rises up to 1 and introduces a biasing effect in K_0(n) (eq. (11)). To address this issue, we apply a 5th-order triangular smoothing to the noisy y(n,k) to reduce the noise effect, giving a pre-smoothed speech ŝ_p(n,k) as [13]

    ŝ_p(n,k) = (1/9)∑_{i=−(L_s−1)/2}^{+(L_s−1)/2} w[i + (L_s+1)/2] y(n−i,k),    (21)

where w = [1 2 3 2 1] is a 5th-order triangular smoothing window and L_s is the length of w. We then compute {a_i} and σ_u² from ŝ_p(n,k) by using the autocorrelation-based method [10]. A performance comparison of the initial estimated {a_i} is shown in Fig. 6.

4) Re-estimation of {a_i}, σ_u², and σ_v² in the Proposed IT-KF: The conventional IT-KF [6] re-estimates {a_i} and σ_u² from ŝ_j(n,k) (the processed speech at the jth iteration) while taking no action on σ_v². Since each iteration gives a more refined enhanced speech, the additive noise effect is reduced; therefore, it is also necessary to update σ_v² from ŝ_j(n,k), as introduced in our proposed IT-KF method. To make the re-estimation of the parameters effective, unlike the conventional IT-KF [6], we perform it differently based on the SAD. Specifically, during silent activity, ŝ_j(n,k) is filled up completely with noise; thus, σ_v² is updated during silent activity of ŝ_j(n,k) while keeping {a_i} and σ_u² unchanged. During speech activity of ŝ_j(n,k), {a_i} and σ_u² are re-estimated while keeping σ_v² unchanged.

B. Summary of the Proposed IT-KF Based SEA

For the kth frame, letting MAX = 2, the proposed IT-KF based SEA is summarized below (a compact code sketch of this per-frame flow is given at the end of this section).

Algorithm 2: Proposed IT-KF Based SEA
1) Initialization:
   a) Extract FLAG(k) from y(n,k) by the SAD (III-A1)
   b) Compute σ_v² from y(n,k) (III-A2)
   c) Compute the initial {a_i} and σ_u² from y(n,k) (III-A3)
   d) Set x̂_1(0|0) = 0 and Σ_1(0|0) = [0]_{p×p}
   e) Form Ψ with the estimated {a_i}
   f) Set ŝ_0(n,k) = y(n,k)
2) for j = 1 to MAX do [iteration loop]
   a) for n = 0 to M−1 do [samplewise loop]
      x̂_j(n|n−1,k) = Ψ_j x̂_j(n−1|n−1,k)    (22)
      Σ_j(n|n−1,k) = Ψ_j Σ_j(n−1|n−1,k)Ψ_j^T + cσ_u²c^T    (23)
      e_j(n,k) = ŝ_{j−1}(n,k) − dx̂_j(n|n−1,k)    (24)
      K_j(n,k) = Σ_j(n|n−1,k)d^T (dΣ_j(n|n−1,k)d^T + σ_v²)^{−1}    (25)
      x̂_j(n|n,k) = x̂_j(n|n−1,k) + K_j(n,k)e_j(n,k)    (26)
      Σ_j(n|n,k) = (I − K_j(n,k)d)Σ_j(n|n−1,k)    (27)
      ŝ_j(n,k) = dx̂_j(n|n,k)    (28)
      end for [end of samplewise processing loop]
   b) Re-estimate {a_i}, σ_u², and σ_v² from ŝ_j(n,k) (III-A4)
   end for [end of iteration loop]
3) Set ŝ(n,k) = ŝ_j(n,k) and apply overlap-add synthesis to ŝ(n,k), yielding the enhanced speech ŝ(n).

C. Optimality Comparison of Kalman Gain

[Figure 6. LPC spectrum comparison computed from the clean speech, noisy speech, ŝ_p(n,k), and IT-KF (2nd iteration) for the same setup used in Fig. 1.]

Fig. 6 shows that the re-estimated LPC envelope at the 2nd iteration of the IT-KF is sharper than the initial LPC envelope and closer to the clean speech envelope, whereas the LPC envelope computed from the corresponding noisy frame deviates somewhat from the clean envelope. Due to the improved {a_i}, the re-estimated σ_u² becomes lower than the initial estimated σ_u². Also, the re-estimation of σ_v² during silent activity makes it more effective. Therefore, the σ_u² and σ_v² offset the bias in K_0(n) effectively.

[Figure 7. Comparing the trajectory of K_0(n) computed through the oracle, noisy, and proposed cases with the same experimental setup used in Fig. 1.]

It can be seen from Fig. 7 that the adjusted K_0(n) at the 2nd iteration of the IT-KF is almost free of the biasing effect and shows a smooth transition at the edges, like the oracle-case K_0(n), even at a low SNR of 0 dB, whereas the K_0(n) computed from the noisy speech is biased around 0.5 along almost the entire trajectory.
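Putting the pieces together, a rough sketch of the per-frame flow of Algorithm 2 might look as follows. It reuses the hypothetical lpc_params() and kalman_enhance_frame() helpers from the earlier sketches, and assumes the SAD flag and σ_v² come from Algorithm 1 and eq. (20); it is an illustration under those assumptions, not the authors' implementation.

    import numpy as np

    def it_kf_frame(y_frame, speech_flag, sigma_v2, p=10, max_iter=2):
        """Sketch of Algorithm 2 for one frame: pre-smooth, run the KF, then
        re-estimate the parameters from the processed speech and repeat."""
        # Eq. (21): 5th-order triangular pre-smoothing, w = [1 2 3 2 1]/9
        w = np.array([1, 2, 3, 2, 1]) / 9.0
        s_pre = np.convolve(y_frame, w, mode='same')
        a, sigma_u2 = lpc_params(s_pre, p)        # initial {a_i}, sigma_u^2
        s_hat = y_frame                           # s_hat_0(n,k) = y(n,k)
        for j in range(max_iter):
            s_hat = kalman_enhance_frame(s_hat, a, sigma_u2, sigma_v2)
            # SAD-gated re-estimation (Sec. III-A4): speech -> refresh the
            # LPC parameters; silence -> refresh the noise variance instead.
            if speech_flag:
                a, sigma_u2 = lpc_params(s_hat, p)
            else:
                sigma_v2 = np.mean(s_hat ** 2)
        return s_hat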
IV. SPEECH ENHANCEMENT EXPERIMENT

A. Simulation Setup

To evaluate the performance of the proposed SEA, 30 speech sentences belonging to six speakers are taken from the NOIZEUS corpus, sampled at 16 kHz [14, Chapter 12]. To perform the experiments, we generate a stimuli set corrupted by restaurant and babble noises for a wide range of SNRs (−5 dB to 15 dB). The objective quality evaluation was carried out by PESQ and spectrogram analysis [14, Chapter 11]. We used the quasi-stationary speech transmission index (QSTI) for objective intelligibility testing, which provides a rating in % [15]. The subjective evaluation was performed on two sentences (1 male and 1 female) randomly chosen from the stimuli set. Five English-speaking listeners rated the quality of the enhanced speech obtained by all methods on a pre-defined scale, as introduced in the mean opinion score (MOS) test [14]. During this test, the listeners had no information about the proposed and benchmark methods, to keep the test unbiased. The efficiency of the proposed method (IT-KF) is assessed by comparing it with other benchmark methods: the sub-band iterative KF (SBIT-KF) [8], MMSE with regional statistics (MMSE-RS) [5], and the long tapered window based KF (LTW-KF) [7].

[Figure 8. Average QSTI (%) comparison between the proposed and other SEAs on the NOIZEUS corpus corrupted with (a) restaurant and (b) babble noises for SNRs of −5 dB to 15 dB.]

[Figure 9. Average PESQ comparison between the proposed and other SEAs on the NOIZEUS corpus corrupted with (a) restaurant and (b) babble noises for SNRs of −5 dB to 15 dB.]

B. Simulation Results and Discussion

The QSTI results for the restaurant noise experiment in Fig. 8(a) show that the IT-KF method gives a QSTI of 0.68 to 0.88 at all SNRs, whereas the competing methods give QSTI values between 0.57 and 0.77. The QSTI for the babble noise experiment in Fig. 8(b) shows that the proposed IT-KF method yields 0.63 to 0.91, while the other methods give average QSTI values ranging from 0.4 to 0.8. The high QSTI of the proposed method reveals that its enhanced speech provides better intelligibility than the benchmark methods for a wide range of SNRs.

The average PESQ results for the restaurant noise experiment are shown in Fig. 9(a). It can be seen from this figure that the proposed method (PESQ between 1.83 and 3.13) is better than the other methods (PESQ ranging from 1.4 to 2.88). Similar results are obtained for the babble noise experiment, as shown in Fig. 9(b). Note that a high PESQ indicates that the enhanced speech has a natural sound quality, whereas a low PESQ indicates quality degradation. Therefore, the PESQ evaluation results in Fig. 9 reveal that the proposed method ensures better quality in the enhanced speech than the benchmark methods.
[Figure 10. Spectrogram comparison among: (a) clean speech (as in Fig. 1(a)), (b) noisy speech (corrupted with restaurant noise at 5 dB SNR), and enhanced speech obtained through (c) LTW-KF [7], (d) MMSE-RS [5], (e) SBIT-KF [8], and (f) IT-KF (proposed).]

These methods are also compared in terms of their spectrograms in Fig. 10. Here, it can be seen that the IT-KF enhanced speech is almost free of the noise floor, whereas the existing SEAs retain a significant amount of it. Informal listening tests also confirm that the existing methods produce very annoying sounds compared to the negligible audio artifacts produced by the proposed method. However, when compared with the clean speech spectrogram, the proposed method introduces a slight distortion in the enhanced speech. This may result from an over-suppression of the spectral valleys by the adjusted Kalman gain during speech activity.

[Figure 11. Average MOS comparison between the proposed and other SEAs on the NOIZEUS corpus corrupted with (a) restaurant and (b) babble noises for SNRs of −5 dB to 15 dB.]

Fig. 11 shows the subjective MOS results for a male sentence, "The birch canoe slid on the smooth planks", and a female sentence, "Bring your best compass to the third class". It can be seen from this figure that the proposed method was clearly preferred by the listeners and gives superior quality over the other methods. Specifically, the restaurant noise experimental results (Fig. 11(a)) reveal that the proposed method gives an average MOS of 2.68 to 4.11 at all SNRs, whereas the competing methods range from 2.23 to 3.75. The proposed method shows continuous improvement over the benchmark methods for the babble noise experiment (Fig. 11(b)). Among the benchmark methods, the listeners preferred the SBIT-KF [8] over the other SEAs, apart from the oracle KF.

V. CONCLUSION

In this paper, an IT-KF with a reduced-biased Kalman gain has been proposed for single channel speech enhancement in NNCs. We have introduced an improved noise variance estimation method. The initial LPC parameters are computed from a pre-smoothed speech. The estimated parameters are used at the first iteration of the IT-KF to process the noisy speech, and they are re-estimated at the second iteration from the processed speech. It is shown that the re-estimated parameters offset the bias in the Kalman gain and enable the IT-KF to minimize the noise effect, giving a better enhanced speech. Experimental results reveal that the proposed method outperforms other benchmark SEAs in NNCs for a wide range of SNRs. An opportunity for further research lies in dynamically offsetting the bias of the Kalman gain under NNCs.

REFERENCES

[1] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113-120, April 1979.
[2] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109-1121, Dec. 1984.
[3] K. K. Paliwal and A. Basu, "A speech enhancement method based on Kalman filtering," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 12, April 1987, pp. 177-180.
[4] N. Upadhyay and A. Karmakar, "Speech enhancement using spectral subtraction-type algorithms: A comparison and simulation study," Procedia Computer Science, vol. 54, pp. 574-584, Aug. 2015.
[5] X. Li, L. Girin, S. Gannot, and R. Horaud, "Non-stationary noise power spectral density estimation based on regional statistics," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 2016, pp. 181-185.
[6] J. D. Gibson, B. Koo, and S. D. Gray, "Filtering of colored noise for speech enhancement and coding," IEEE Transactions on Signal Processing, vol. 39, no. 8, pp. 1732-1742, Aug. 1991.
[7] S. So and K. K. Paliwal, "Suppressing the influence of additive noise on the Kalman gain for low residual noise speech enhancement," Speech Communication, vol. 53, no. 3, pp. 355-378, March 2011.
[8] S. K. Roy, W. P. Zhu, and B. Champagne, "Single channel speech enhancement using subband iterative Kalman filter," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), May 2016, pp. 762-765.
[9] S. K. Roy and K. K. Paliwal, "A non-iterative Kalman filter for single channel speech enhancement in non-stationary noise condition," in Proc. 12th International Conference on Signal Processing and Communication Systems (ICSPCS), Cairns, Australia, 2018.
[10] S. V. Vaseghi, Advanced Digital Signal Processing and Noise Reduction, Hoboken: John Wiley & Sons, Ltd., 2001, pp. 227-262.
[11] Y. Ma and A. Nishihara, "Efficient voice activity detection algorithm using long-term spectral flatness measure," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2013, pp. 87:1-87:18, July 2013.
[12] C. Shahnaz, W. P. Zhu, and M. O. Ahmad, "A multifeature voiced/unvoiced decision algorithm for noisy speech," in Proc. IEEE International Symposium on Circuits and Systems, Island of Kos, 2006.
[13] T. O'Haver, A Pragmatic Introduction to Signal Processing: With Applications in Scientific Measurement, CreateSpace Independent Publishing Platform, 2017.
[14] P. C. Loizou, Speech Enhancement: Theory and Practice, Signal Processing and Communications, 2007.
[15] B. Schwerin and K. K. Paliwal, "An improved speech transmission index for intelligibility prediction," Speech Communication, vol. 65, pp. 9-19, Dec. 2014.

Sujan Kumar Roy was born in Kurigram, Bangladesh, in 1983. He received the B.Sc. and M.Sc. degrees in Computer Science and Engineering from the University of Rajshahi, Bangladesh, in 2008 and 2010, respectively. He also received a Master of Applied Science (M.A.Sc.) degree in Electrical and Computer Engineering from Concordia University, Canada, in May 2016. He is currently a Ph.D. candidate in the School of Engineering at Griffith University, Brisbane, Australia. His research interests include speech enhancement.

Kuldip K. Paliwal was born in Aligarh, India, in 1952. He received the B.S. degree from Agra University, India, in 1969, the M.S. degree from Aligarh Muslim University, India, in 1971, and the Ph.D. degree from Bombay University, India, in 1978.
He has worked at the Tata Institute of Fundamental Research, Bombay, India; the Norwegian Institute of Technology, Trondheim, Norway; the University of Keele, U.K.; AT&T Bell Laboratories, Murray Hill, New Jersey, U.S.A.; AT&T Shannon Laboratories, Florham Park, New Jersey, U.S.A.; and the Advanced Telecommunication Research Laboratories, Kyoto, Japan. Since July 1993, he has been a professor at Griffith University, Brisbane, Australia, in the School of Microelectronic Engineering. His current research interests include speech recognition, speech coding, speaker recognition, speech enhancement, face recognition, image coding, pattern recognition, and artificial neural networks. He has published more than 300 papers in these research areas. He is a Fellow of the Acoustical Society of India. He served on the IEEE Signal Processing Society's Neural Networks Technical Committee as a founding member from 1991 to 1995 and on the Speech Processing Technical Committee from 1999 to 2003. He was an Associate Editor of the IEEE Transactions on Speech and Audio Processing during the periods 1994-1997 and 2003-2004. He is on the Editorial Board of the IEEE Signal Processing Magazine. He also served as an Associate Editor of the IEEE Signal Processing Letters from 1997 to 2000. He was the General Co-Chair of the Tenth IEEE Workshop on Neural Networks for Signal Processing (NNSP2000). He has co-edited two books: "Speech Coding and Synthesis" (published by Elsevier) and "Speech and Speaker Recognition: Advanced Topics" (published by Kluwer). He received the IEEE Signal Processing Society's best (senior) paper award in 1995 for his paper on LPC quantization. He served as the Editor-in-Chief of the Speech Communication journal (published by Elsevier) during 2005-2011.
Part III

Tuning of the Kalman Filter and the Augmented Kalman Filter for Speech Enhancement

Chapter 3

Robustness and Sensitivity Tuning of the Kalman Filter for Speech Enhancement

STATEMENT OF CONTRIBUTION TO CO-AUTHORED PUBLISHED PAPER

This chapter includes a co-authored paper. The bibliographic details of the co-authored paper, including all authors, are:

Sujan Kumar Roy, Kuldip K. Paliwal, "Robustness and Sensitivity Tuning of the Kalman Filter for Speech Enhancement", Under review with: Signals (Submitted 26 Feb. 2021).

My contribution to the paper involved:
• Preliminary experiments.
• Experiment design.
• Conducting the experiments.
• Code writing.
• Design of models.
• Analysis of results.
• Literature review.
• Manuscript writing.

Professor Kuldip K. Paliwal provided supervision and aided with editing the final manuscript.

(Signed) _____________ (Date) 02/04/2021
Sujan Kumar Roy

(Countersigned) _____________ (Date) 02/04/2021
Supervisor: Professor Kuldip K. Paliwal

Article

Robustness and Sensitivity Tuning of the Kalman Filter for Speech Enhancement

Sujan Kumar Roy*, Kuldip K. Paliwal

Signal Processing Laboratory, Griffith University, Nathan Campus, Brisbane, QLD, 4111, Australia; k.paliwal@griffith.edu.au
* Correspondence: sujankumar.roy@griffithuni.edu.au

Abstract: Inaccurate estimates of the linear prediction coefficients (LPCs) and the noise variance introduce bias in the Kalman filter (KF) gain and degrade speech enhancement performance. Current tuning methods for the biased KF gain operate mainly in stationary noise conditions. This paper introduces a new tuning algorithm for the KF gain for speech enhancement in real-life noise conditions. Firstly, a speech presence probability (SPP) method estimates the noise from each noisy speech frame to compute the noise variance. A whitening filter is constructed, with its coefficients computed from the estimated noise, to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. The KF is constructed with the estimated parameters, where the robustness metric offsets the bias in the KF gain during speech absence and the sensitivity metric does so during speech presence, achieving better noise reduction. The noise variance and the speech model parameters are adopted as a speech activity detector. It is shown that the reduced-biased KF gain achieved by the proposed method addresses speech enhancement in real-life noise conditions. Objective and subjective scores on the NOIZEUS corpus demonstrate that the proposed method produces enhanced speech of higher quality and intelligibility than the competing methods.

Keywords: Speech enhancement; Kalman filter; Kalman gain; robustness metric; sensitivity metric; LPC; whitening filter; real-life noise.

1. Introduction

The main objective of a speech enhancement algorithm (SEA) is to improve the quality and intelligibility of noisy speech [1]. This can be achieved by eliminating the embedded noise from a noisy speech signal without distorting the speech.
Many speech processing systems, such as speech communication systems, hearing aid devices, and speech recognition systems, rely to some degree upon SEAs for robustness. Various SEAs, namely spectral subtraction (SS) [2–5], the Wiener filter (WF) [6–8], minimum mean square error (MMSE) [9–11], the Kalman filter (KF) [12], the augmented KF (AKF) [13], and deep neural networks (DNN) [14–16], have been introduced over the decades. This paper focuses on KF-based speech enhancement in real-life noise conditions.

The Kalman filter (KF) was first used for speech enhancement by Paliwal and Basu [12]. In the KF, a speech signal is represented by an auto-regressive (AR) process, whose parameters comprise the linear prediction coefficients (LPCs) and the prediction error variance. The LPC parameters and the noise variance are used to construct the KF recursion equations. The KF gives a linear MMSE estimate of the current state of the clean speech given the observed noisy speech for each sample within a frame. Therefore, inaccurate estimates of the LPC parameters and noise variance degrade the quality and intelligibility of the enhanced speech produced by the KF. In [12], it was demonstrated that the KF shows excellent performance in stationary white Gaussian noise (WGN) conditions when the LPC parameters are estimated from the clean speech, which is unobserved in practice.

In [13], Gibson et al. introduced an augmented KF (AKF) to enhance colored-noise-corrupted speech. In this SEA, both the clean speech and the noise are represented by two AR processes. The speech and noise LPC parameters are incorporated in an augmented matrix form to construct the recursive equations of the AKF. In [13], the AKF processes the colored-noise-corrupted speech iteratively (usually 3-4 iterations) to eliminate the noise, yielding the enhanced speech. During this, the LPC parameters for the current frame are computed from the corresponding filtered speech frame of the previous AKF iteration. Although the enhanced speech of the AKF demonstrates an improvement in signal-to-noise ratio (SNR), it suffers from musical noise and speech distortion. Therefore, this method [13] does not adequately address the inaccurate speech and noise LPC parameter estimation issue in practice.

In [17], Roy et al. introduced a sub-band (SB) iterative KF (SBIT-KF)-based SEA. This method enhances only the high-frequency sub-bands (SBs), using an iterative KF, among the 16 decomposed SBs of the noisy speech for a given utterance, with the assumption that the impact of noise on the low-frequency SBs (LF-SBs) is negligible. However, the LF-SBs can also be affected by noise, typically when operating in conditions that have time-varying amplitudes. As demonstrated in [13], the SBIT-KF [17] also suffers from speech distortion due to the iterative processing of the noisy speech by the KF.

In [18], Saha et al. proposed a robustness metric and a sensitivity metric for tuning the biased KF gain in instrument engineering applications.
Later on, So et al. applied the tuning of the biased KF gain to speech enhancement in WGN conditions [19,20]. Specifically, the enhanced speech (for each sample within a noisy speech frame) is given by recursively averaging the observed noisy speech and the predicted speech, weighted by a scalar KF gain [19]. However, inaccurate estimates of the LPC parameters introduce bias in the KF gain, resulting in significant residual noise leaking into the enhanced speech. In [19], a robustness metric is used to offset the bias in the KF gain for speech enhancement. However, So et al. further showed that the robustness metric strongly suppresses the KF gain in speech regions, resulting in distorted speech [20]. In [20], a sensitivity metric was used to offset the bias in the KF gain, which produced less distorted speech. In [21], George et al. proposed a robustness metric-based tuning of the AKF (AKF-RMBT) for enhancing colored-noise-corrupted speech. As in [19], it is shown that the tuning method in the AKF-RMBT [21] also offsets the bias in the AKF gain for silent frames to some extent; however, it over-suppresses the components in speech regions, resulting in distorted speech.

The existing KF methods [19,20] address the tuning of the biased KF gain in WGN conditions under the prior assumption that the impact of WGN on the LPCs is negligible. Although the AKF-RMBT [21] performs tuning of the biased AKF gain in colored noise conditions, it still produces distorted speech.

In this paper, we propose a new tuning algorithm for the KF gain for speech enhancement in real-life noise conditions. For this purpose, we estimate the noise from each noisy speech frame using an SPP method to compute the noise variance. To minimize bias in the speech LPC parameters, we compute them from a pre-whitened version of each noisy speech frame. We then construct the KF with the estimated parameters. To achieve better noise reduction, the robustness metric is employed to offset the bias in the KF gain during speech absence of the noisy speech, and the sensitivity metric during speech presence. We also adopt the noise variance and the AR model parameters as a speech activity detector. The proposed method aims to mitigate the weaknesses of previously proposed KF methods by providing a reduced-biased KF gain. The motivation is to produce enhanced speech of higher quality and intelligibility, even in real-life noise conditions.

The structure of this paper is as follows. Section 2 describes the KF for speech enhancement, including the paradigm shift of the KF recursive equations and the impact of a biased KF gain on KF-based speech enhancement in WGN and real-life noise conditions. Section 3 describes the proposed SEA, which includes the proposed parameter estimation as well as the proposed tuning algorithm. Following this, Section 4 describes the experimental setup in terms of the speech corpus, the objective and subjective evaluation metrics, and the specifications of the competing SEAs. The experimental results are then presented in Section 5. Finally, Section 6 gives some concluding remarks.

2. Kalman Filter for Speech Enhancement

Assuming the noise, v(n), to be additive and uncorrelated with the clean speech, s(n), at sample n, the noisy speech, y(n), can be represented as:

$$ y(n) = s(n) + v(n). \qquad (1) $$
Since the KF operates on a frame-by-frame basis, firstly, a 20 ms rectangular window with 0% overlap is used to convert y(n) into frames as:

$$ y(n,k) = s(n,k) + v(n,k), \qquad (2) $$

where k ∈ {0, 1, 2, ..., N−1} is the frame index, N is the total number of frames in an utterance, and M is the total number of samples in each frame, i.e., n ∈ {0, 1, 2, ..., M−1}. For simplicity of representation, the frame index is omitted in the KF recursive equations.

Each clean speech frame in eq. (2) can be represented by a pth-order autoregressive (AR) model as [22, Chapter 8]:

$$ s(n) = -\sum_{i=1}^{p} a_i\, s(n-i) + w(n), \qquad (3) $$

where {a_i; i = 1, 2, ..., p} are the LPCs and w(n) is assumed to be white noise with zero mean and variance σw².

Equations (2)-(3) can be used to form the following state-space model (SSM) of the KF (bold variables denote vector/matrix quantities; unbolded variables denote scalars):

$$ \mathbf{x}(n) = \mathbf{\Phi}\,\mathbf{x}(n-1) + \mathbf{d}\,w(n), \qquad (4) $$
$$ y(n) = \mathbf{c}^\top \mathbf{x}(n) + v(n). \qquad (5) $$

In the above SSM,
1. x(n) is a p × 1 state vector at sample n, given by:
$$ \mathbf{x}(n) = [\,s(n)\;\; s(n-1)\;\; \dots\;\; s(n-p+1)\,]^\top, \qquad (6) $$
2. Φ is a p × p state transition matrix, represented as:
$$ \mathbf{\Phi} = \begin{bmatrix} -a_1 & -a_2 & \dots & -a_{p-1} & -a_p \\ 1 & 0 & \dots & 0 & 0 \\ 0 & 1 & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & 0 \end{bmatrix}, \qquad (7) $$
3. d and c are the p × 1 measurement vectors for the excitation noise and the observation, written as $\mathbf{d} = \mathbf{c} = [\,1\;\; 0\;\; \dots\;\; 0\,]^\top$,
4. y(n) is the observed noisy speech at sample n.

For a particular frame, the KF recursively computes an unbiased linear MMSE estimate, x̂(n|n), of the state vector, x(n), given the observed noisy speech up to sample n, i.e., y(1), y(2), ..., y(n), using the following equations [12]:

$$ \hat{\mathbf{x}}(n|n-1) = \mathbf{\Phi}\,\hat{\mathbf{x}}(n-1|n-1), \qquad (8) $$
$$ \mathbf{\Psi}(n|n-1) = \mathbf{\Phi}\,\mathbf{\Psi}(n-1|n-1)\,\mathbf{\Phi}^\top + \sigma_w^2\,\mathbf{d}\mathbf{d}^\top, \qquad (9) $$
$$ \mathbf{K}(n) = \mathbf{\Psi}(n|n-1)\,\mathbf{c}\,\big[\mathbf{c}^\top\mathbf{\Psi}(n|n-1)\,\mathbf{c} + \sigma_v^2\big]^{-1}, \qquad (10) $$
$$ \hat{\mathbf{x}}(n|n) = \hat{\mathbf{x}}(n|n-1) + \mathbf{K}(n)\big[y(n) - \mathbf{c}^\top\hat{\mathbf{x}}(n|n-1)\big], \qquad (11) $$
$$ \mathbf{\Psi}(n|n) = \big[\mathbf{I} - \mathbf{K}(n)\,\mathbf{c}^\top\big]\,\mathbf{\Psi}(n|n-1). \qquad (12) $$

In eqs. (8)-(12), Ψ(n|n−1) and Ψ(n|n) are the error covariance matrices of the a priori and a posteriori state estimates, x̂(n|n−1) and x̂(n|n); K(n) is the Kalman gain; σv² is the variance of the additive noise, v(n); and I is the identity matrix. During the processing of each frame, the estimated LPC parameters, ({a_i}, σw²), and noise variance, σv², remain unchanged for that frame, while K(n), Ψ(n|n), and x̂(n|n) are continually updated on a sample-by-sample basis. The estimated speech at sample n is given by ŝ(n|n) = c⊤x̂(n|n). Once all noisy speech frames have been processed, synthesis over the enhanced frames yields the enhanced speech, ŝ(n).
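To make the recursion concrete, the following is a minimal NumPy sketch of eqs. (8)-(12) applied to one frame. The function name and interface are illustrative only, not the authors' code; it assumes the per-frame parameters ({a_i}, σw², σv²) have already been estimated.

```python
import numpy as np

def kalman_filter_frame(y, a, var_w, var_v):
    """Apply the KF recursion of eqs. (8)-(12) to one noisy frame y.

    a     : LPCs {a_i} of the p-th order AR model of eq. (3)
    var_w : excitation variance sigma_w^2
    var_v : additive noise variance sigma_v^2
    """
    p = len(a)
    # State transition matrix Phi (eq. (7)): companion form of the AR model.
    Phi = np.zeros((p, p))
    Phi[0, :] = -np.asarray(a)
    Phi[1:, :-1] = np.eye(p - 1)
    d = np.zeros((p, 1)); d[0] = 1.0      # excitation/observation vector, d = c
    c = d.copy()
    x = np.zeros((p, 1))                  # state estimate x_hat(n|n)
    Psi = np.eye(p)                       # error covariance Psi(n|n)
    s_hat = np.zeros(len(y))
    for n, yn in enumerate(y):
        x_prior = Phi @ x                                    # eq. (8)
        Psi_prior = Phi @ Psi @ Phi.T + var_w * (d @ d.T)    # eq. (9)
        K = Psi_prior @ c / (c.T @ Psi_prior @ c + var_v)    # eq. (10)
        x = x_prior + K * (yn - float(c.T @ x_prior))        # eq. (11)
        Psi = (np.eye(p) - K @ c.T) @ Psi_prior              # eq. (12)
        s_hat[n] = x[0, 0]                # s_hat(n|n) = c^T x_hat(n|n)
    return s_hat
```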
2.1. Paradigm shift of the recursive equations

The paradigm shift of the recursive eqs. (8)-(12) transforms them into scalar form, which aids understanding of the KF operation in the speech enhancement context. The simplification starts with the output of the KF, ŝ(n|n) = c⊤x̂(n|n), which is re-written as:

$$ \mathbf{c}^\top\hat{\mathbf{x}}(n|n) = \begin{bmatrix} 1 & 0 & \dots & 0 \end{bmatrix} \begin{bmatrix} \hat{s}(n|n) \\ \hat{s}(n|n-1) \\ \vdots \\ \hat{s}(n|n-p+1) \end{bmatrix} = \hat{s}(n|n). \qquad (13) $$

Multiplying both sides of eq. (11) by c⊤ gives:

$$ \mathbf{c}^\top\hat{\mathbf{x}}(n|n) = \mathbf{c}^\top\hat{\mathbf{x}}(n|n-1) + \mathbf{c}^\top\mathbf{K}(n)\big[y(n) - \mathbf{c}^\top\hat{\mathbf{x}}(n|n-1)\big]. \qquad (14) $$

According to eq. (13), c⊤x̂(n|n−1) is also given by:

$$ \mathbf{c}^\top\hat{\mathbf{x}}(n|n-1) = \hat{s}(n|n-1). \qquad (15) $$

In eq. (14), c⊤K(n) represents the first component, K0(n), of the Kalman gain vector, K(n), i.e.:

$$ K_0(n) = \mathbf{c}^\top\mathbf{K}(n). \qquad (16) $$

Substituting eq. (10) into eq. (16) gives:

$$ K_0(n) = \frac{\mathbf{c}^\top\mathbf{\Psi}(n|n-1)\,\mathbf{c}}{\mathbf{c}^\top\mathbf{\Psi}(n|n-1)\,\mathbf{c} + \sigma_v^2}. \qquad (17) $$

With eq. (9), c⊤Ψ(n|n−1)c of eq. (17) is simplified as:

$$ \mathbf{c}^\top\mathbf{\Psi}(n|n-1)\,\mathbf{c} = \mathbf{c}^\top\mathbf{\Phi}\,\mathbf{\Psi}(n-1|n-1)\,\mathbf{\Phi}^\top\mathbf{c} + \mathbf{c}^\top\sigma_w^2\,\mathbf{d}\mathbf{d}^\top\mathbf{c}. \qquad (18) $$

The linear algebra operation on c⊤σw²dd⊤c gives:

$$ \mathbf{c}^\top\sigma_w^2\,\mathbf{d}\mathbf{d}^\top\mathbf{c} = \sigma_w^2, \qquad (19) $$

and c⊤ΦΨ(n−1|n−1)Φ⊤c represents the transmission of the a posteriori error variance by the speech model from the previous time sample, n−1, denoted as [20]:

$$ \mathbf{c}^\top\mathbf{\Phi}\,\mathbf{\Psi}(n-1|n-1)\,\mathbf{\Phi}^\top\mathbf{c} = \alpha^2(n). \qquad (20) $$

Substituting eqs. (19)-(20) into eq. (18) gives:

$$ \mathbf{c}^\top\mathbf{\Psi}(n|n-1)\,\mathbf{c} = \alpha^2(n) + \sigma_w^2. \qquad (21) $$

From eqs. (21) and (17), K0(n) is given by:

$$ K_0(n) = \frac{\alpha^2(n) + \sigma_w^2}{\alpha^2(n) + \sigma_w^2 + \sigma_v^2}. \qquad (22) $$

Substituting eqs. (13), (15)-(16) into eq. (14) gives:

$$ \hat{s}(n|n) = \hat{s}(n|n-1) + K_0(n)\big[y(n) - \hat{s}(n|n-1)\big]. \qquad (23) $$

Re-arranging eq. (23) yields:

$$ \hat{s}(n|n) = \big[1 - K_0(n)\big]\,\hat{s}(n|n-1) + K_0(n)\,y(n). \qquad (24) $$

Eq. (24) implies that an accurate estimate of ŝ(n|n) (the output of the KF) will be achieved if K0(n) is unbiased. However, in practice, inaccurate estimates of ({a_i}, σw²) and σv² introduce bias in K0(n), resulting in a degraded ŝ(n|n). In [18], Saha et al. introduced a robustness metric and a sensitivity metric to quantify the level of robustness and sensitivity of the KF, which can be used to offset the bias in K0(n). In the speech enhancement context, these metrics can be computed by simplifying the mean squared error, c⊤Ψ(n|n)c, of the KF output, ŝ(n|n), as:

$$ \mathbf{c}^\top\mathbf{\Psi}(n|n)\,\mathbf{c} = \mathbf{c}^\top\big[\mathbf{I} - \mathbf{K}(n)\mathbf{c}^\top\big]\mathbf{\Psi}(n|n-1)\,\mathbf{c} \;\;[\text{from (12)}]\; = \mathbf{c}^\top\mathbf{\Psi}(n|n-1)\,\mathbf{c} - \mathbf{c}^\top\mathbf{K}(n)\,\mathbf{c}^\top\mathbf{\Psi}(n|n-1)\,\mathbf{c}. \qquad (25) $$

Substituting eqs. (16) and (21) into (25) gives:

$$ \begin{aligned} \Psi_{0,0}(n|n) &= \alpha^2(n) + \sigma_w^2 - K_0(n)\big[\alpha^2(n) + \sigma_w^2\big], \\ \Psi_{0,0}(n|n) - \alpha^2(n) &= \sigma_w^2 - \frac{\big[\alpha^2(n) + \sigma_w^2\big]^2}{\alpha^2(n) + \sigma_w^2 + \sigma_v^2}, \\ \frac{\Psi_{0,0}(n|n) - \alpha^2(n)}{\alpha^2(n) + \sigma_w^2} &= \frac{\sigma_w^2}{\alpha^2(n) + \sigma_w^2} - \frac{\alpha^2(n) + \sigma_w^2}{\alpha^2(n) + \sigma_w^2 + \sigma_v^2}, \\ \frac{\Psi_{0,0}(n|n) - \alpha^2(n)}{\alpha^2(n) + \sigma_w^2} &= \frac{\sigma_w^2}{\alpha^2(n) + \sigma_w^2} + \frac{\sigma_v^2}{\alpha^2(n) + \sigma_w^2 + \sigma_v^2} - 1, \\ \frac{\Psi_{0,0}(n|n) - \alpha^2(n)}{\alpha^2(n) + \sigma_w^2} + 1 &= \frac{\sigma_w^2}{\alpha^2(n) + \sigma_w^2} + \frac{\sigma_v^2}{\alpha^2(n) + \sigma_w^2 + \sigma_v^2}, \\ \Delta\Psi(n|n) + 1 &= J_2(n) + J_1(n), \end{aligned} \qquad (26) $$

where J2(n) and J1(n) are the robustness and sensitivity metrics of the KF, given as [19,20]:

$$ J_2(n) = \frac{\sigma_w^2}{\alpha^2(n) + \sigma_w^2}, \qquad (27) $$
$$ J_1(n) = \frac{\sigma_v^2}{\alpha^2(n) + \sigma_w^2 + \sigma_v^2}. \qquad (28) $$

The KF-based SEAs in [19,20] address the tuning of the biased K0(n) using the J2(n) and J1(n) metrics for speech enhancement in WGN conditions, as described next.
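The scalar quantities above are cheap to compute alongside the recursion. A small sketch with illustrative names follows; α²(n) is obtained from the propagated covariance as in eq. (20):

```python
def scalar_gain_and_metrics(alpha2, var_w, var_v):
    """Scalar KF gain and tuning metrics from eqs. (22), (27), (28).

    alpha2 : transmitted a posteriori error variance alpha^2(n), eq. (20)
    var_w  : excitation variance sigma_w^2
    var_v  : additive noise variance sigma_v^2
    """
    K0 = (alpha2 + var_w) / (alpha2 + var_w + var_v)   # eq. (22)
    J2 = var_w / (alpha2 + var_w)                      # eq. (27), robustness
    J1 = var_v / (alpha2 + var_w + var_v)              # eq. (28), sensitivity
    return K0, J2, J1
```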
Figure 1. Review of existing KF-based SEAs: (a)-(b) spectrograms of the clean speech (utterance sp05) and the noisy speech (sp05 corrupted with 5 dB WGN), (c) J2(n) and J1(n) metrics, (d) oracle and non-oracle K0(n) with adjusted K0′(n) and K0″(n), spectrograms of enhanced speech produced by: (e) the KF-Oracle method, (f) the KF-Non-oracle method, (g)-(h) the methods in [19,20].

2.2. Impact of Biased K0(n) on KF-based Speech Enhancement in WGN Conditions

We analyze the shortcomings of the existing KF-based SEAs [19,20] in terms of the biased behaviour of K0(n). For this purpose, we conduct an experiment with the utterance sp05 ("Wipe the grease off his dirty face") of the NOIZEUS corpus [1, Chapter 12] (sampled at 8 kHz) corrupted with WGN at 5 dB SNR [23]. In [19], So et al. first analyze K0(n) in the oracle case, where ({a_i}, σw²) (p = 10) and σv² are computed from each frame of the clean speech and the noise signal, s(n,k) and v(n,k). It can be seen that K0(n) approaches 1 during speech presence in the noisy speech, which passes almost clean speech to the output (e.g., 0.16-0.33 s or 0.9-1.06 s in Figure 1 (d)-(e)). Conversely, K0(n) remains approximately 0 during speech absence, so no corrupting noise is passed (e.g., 0-0.15 s or 1.8-2.19 s in Figure 1 (d)-(e)). Thus, the KF-Oracle method produces speech (Figure 1 (e)) almost identical to the clean speech (Figure 1 (a)).

In the non-oracle case, ({a_i}, σw²) are computed from each noisy speech frame, resulting in biased estimates ({ã_i}, σ̃w²). Then K0(n) in (21), using the biased σ̃w², is given by:

$$ K_0(n) = \frac{\alpha^2(n) + \tilde{\sigma}_w^2}{\alpha^2(n) + \tilde{\sigma}_w^2 + \sigma_v^2}. \qquad (29) $$

In [19,20], So et al. assumed that the impact of WGN on {ã_i} is negligible. Thus, σ̃w² can be approximated as σ̃w² ≈ σw² + σv² [19,20]. Substituting σ̃w² ≈ σw² + σv² into eq. (29) and re-arranging yields:

$$ K_0(n) = \frac{\alpha^2(n) + \sigma_w^2 + \sigma_v^2}{\alpha^2(n) + \sigma_w^2 + 2\sigma_v^2}. \qquad (30) $$

During speech pauses of y(n,k), s(n,k) = 0 gives α²(n) = 0 and σw² = 0. According to eq. (30), K0(n) then becomes biased around 0.5 (e.g., 0-0.15 s or 1.8-2.19 s in Figure 1 (d)). The biased K0(n) leaks significant residual noise into the enhanced speech, as shown in Figure 1 (f).

In the non-oracle case, it is also observed that J2(n) ≈ 1, typically during speech pauses of y(n,k) (e.g., 0-0.15 s or 1.8-2.19 s in Figure 1 (c)). Therefore, the J2(n) metric is found to be useful in tuning the biased K0(n) as [19]:

$$ K_0'(n) = K_0(n)\big[1 - J_2(n)\big]. \qquad (31) $$

Figure 1 (d) reveals that K0′(n) ≈ 0 during speech pauses. However, K0′(n) is over-suppressed during speech presence of y(n,k), resulting in distorted speech, as shown in Figure 1 (g). To address this, So et al. proposed a J1(n) metric-based tuning of K0(n) [20]. It can be seen from Figure 1 (c) that J1(n) lies around 0.5 during speech pauses (e.g., 0-0.15 s or 1.8-2.19 s), whereas it approaches 0 in speech regions (e.g., 0.16-0.33 s or 0.9-1.06 s). Therefore, the tuning of the biased K0(n) using the J1(n) metric is performed as [20]:

$$ K_0''(n) = K_0(n) - J_1(n). \qquad (32) $$

It can be seen from Figure 1 (d) that K0″(n) is closely similar to the oracle K0(n), which minimizes distortion in the enhanced speech (Figure 1 (h)) compared with (Figure 1 (g)).

Technically, real-life (colored/non-stationary) noise may contain time-varying amplitudes, which impact ({a_i}, σw²) significantly, as opposed to the negligible impact of WGN on these parameters [19,20]. Therefore, the assumption σ̃w² ≈ σw² + σv² made in [19,20] is invalid for real-life noise conditions. Moreover, the existing methods [19,20] do not analyze the impact of the noise variance, σv², on K0(n). According to eq. (22), besides α²(n) and σw², σv² is also an important parameter in computing K0(n). In light of the study in this section, the existing tuning methods in [19,20] are not applicable to speech enhancement in real-life noise conditions. Therefore, we perform a detailed analysis of the biasing effect of K0(n) on KF-based speech enhancement in real-life noise conditions.
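For reference, the two existing adjustments are one-line operations on top of the quantities computed earlier (again an illustrative sketch, not the authors' code):

```python
def tune_robustness(K0, J2):
    """Robustness metric-based tuning of [19], eq. (31)."""
    return K0 * (1.0 - J2)

def tune_sensitivity(K0, J1):
    """Sensitivity metric-based tuning of [20], eq. (32)."""
    return K0 - J1
```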
2.3. Impact of Biased K0(n) on KF-based Speech Enhancement in Real-life Noise Conditions

To analyze the impact of the biased K0(n) on KF-based speech enhancement in real-life noise conditions, we repeat the experiment in Figure 1, except that the utterance sp05 is corrupted with a typical non-stationary noise, babble [23], at 5 dB SNR. A 32 ms rectangular window with 50% overlap [24, Sec 7.2.1] was used for converting y(n) into frames, y(n,k) (as in eq. (2)).

Figure 2. Biasing effect of K0(n): (a)-(b) spectrograms of the clean speech and the noisy speech (sp05 corrupted with 5 dB babble noise), (c) K0(n) computed in the oracle and non-oracle cases, (d)-(e) [α²(n) + σw²] and σv² computed in the oracle and non-oracle cases, (f) J2(n) and J1(n) computed from the noisy speech in (b), spectrograms of enhanced speech produced by: (g) the KF-Oracle method, and (h) the KF-Non-oracle method.

As shown in Section 2.2, in the oracle case, K0(n) shows a smooth transition between 0 and 1 depending on the speech absence or presence in the noisy speech (Figure 2 (c)). Technically, during speech pauses of y(n,k), the total a priori prediction error of the AR model, [α²(n) + σw²], is 0 (e.g., 0-0.15 s or 1.8-2.19 s in Figure 2 (d)). Substituting [α²(n) + σw²] = 0 into eq. (22) gives K0(n) = 0, which in turn gives ŝ(n|n) = 0 (eq. (24)), i.e., nothing is passed to the output (e.g., 0-0.15 s or 1.8-2.19 s of Figure 2 (c), (g)). Conversely, it is observed that [α²(n) + σw²] >> σv² in speech regions of y(n,k), for which K0(n) approaches 1 (e.g., 0.16-0.33 s or 0.9-1.06 s in Figure 2 (c)). As discussed in Section 2.2, the high K0(n) passes almost the clean speech to the output. Therefore, the enhanced speech produced by the oracle KF (Figure 2 (g)) is closely similar to the clean speech (Figure 2 (a)).

In the non-oracle case, the biased estimates ({ã_i}, σ̃w²) and σ̃v² result in [α̃²(n) + σ̃w²] ≈ σ̃v² (e.g., 0-0.15 s or 1.8-2.19 s in Figure 2 (e)). According to eq. (22), this condition introduces a bias of around 0.5 in K0(n) (e.g., 0-0.15 s or 1.8-2.19 s in Figure 2 (c)). During speech presence of y(n,k), it is observed that σ̃v² >> [α̃²(n) + σ̃w²] (e.g., 0.16-0.33 s or 0.9-1.06 s of Figure 2 (e)), resulting in an under-estimated K0(n) compared with the oracle case (Figure 2 (c)). The 0.5-biased K0(n) leaks around 50% of the residual noise into ŝ(n|n), particularly in silent regions (Figure 2 (h)), while the under-estimated K0(n) in speech regions introduces significant distortion in the enhanced speech (Figure 2 (h)). In addition, the J2(n) and J1(n) metrics (Figure 2 (f)) do not comply with the desired characteristics found in the WGN condition (Figure 1 (c)). Therefore, it is inappropriate to apply the J2(n) and J1(n) metrics (Figure 2 (f)) to the tuning of the biased K0(n) (Figure 2 (c)) using eqs. (31)-(32).

In the AKF-RMBT [21], the noise LPC parameters are computed from the first noisy speech frame by assuming that it contains no speech. The computed noise LPC parameters then remain constant while processing all noisy speech frames of a given utterance. A whitening filter is constructed with the constant noise LPCs to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. The whitening filter only reduces the bias in the speech LPC parameters; the method then utilizes the J2(n) metric for the tuning of the biased K0(n) in colored noise conditions [21, Figure 5 (d)]. As in [19], the J2(n) metric-based tuning of K0(n) still produces distorted speech. In light of this analysis, the AKF-RMBT [21] does not adequately address speech enhancement in real-life noise conditions.
Motivated by the shortcomings of the existing tuning methods in [19–21], the J2(n) and J1(n) metrics are jointly incorporated in the proposed tuning algorithm to dynamically offset the bias in K0(n), which addresses speech enhancement in real-life noise conditions.

Figure 3. Block diagram of the proposed KF-based SEA.

3. Proposed Speech Enhancement Algorithm

Figure 3 shows the block diagram of the proposed SEA. Firstly, y(n) is converted into frames, y(n,k), with the same setup as used in Section 2.3. To carry out the tuning of K0(n) in real-life noise conditions, the J2(n) and J1(n) metrics, unlike the biased ones in Figure 2 (f), should exhibit characteristics similar to those observed in the WGN condition (Figure 1 (c)). This can be achieved by improving the estimates of ({â_i}, σ̂w²) and σ̂v², as described in the next section.

3.1. Parameter Estimation

It is known that ({a_i}, σw²) are very sensitive to real-life noises. Since the clean speech, s(n,k), is unobserved in practice, it is difficult to accurately estimate ({a_i}, σw²) from the noisy speech, y(n,k). Therefore, we first focus on estimating the noise, v̂(n,k), for each noisy speech frame using a speech presence probability (SPP) method [25] (described in Section 3.2) to compute σ̂v². Given v̂(n,k), σ̂v² is computed as:

$$ \hat{\sigma}_v^2 = \frac{1}{M}\sum_{n=0}^{M-1} \hat{v}^2(n,k). \qquad (33) $$

To reduce the bias in the estimated ({â_i}, σ̂w²) for each noisy speech frame, we compute them from the corresponding pre-whitened speech, y_w(n,k), using the autocorrelation method [22]. The frame-wise y_w(n,k) is obtained by applying a whitening filter, H_w(z), to y(n,k). H_w(z) is given by [22]:

$$ H_w(z) = 1 + \sum_{j=1}^{q} \hat{b}_j\, z^{-j}, \qquad (34) $$

where the coefficients {b̂_j} (q = 20) are computed from v̂(n,k) using the autocorrelation method [22].
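A compact sketch of this parameter-estimation step is given below. The Levinson-Durbin routine implements the standard autocorrelation method of [22]; `scipy.signal.lfilter` applies the FIR whitening filter of eq. (34). Function names and the small numerical guard are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_autocorr(frame, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.

    Returns coefficients {a_i} of the error filter A(z) = 1 + sum_i a_i z^-i
    (so the AR model matches eq. (3)) and the prediction error variance.
    """
    M = len(frame)
    r = np.correlate(frame, frame, mode='full')[M - 1:M + order] / M
    a = np.zeros(order)
    err = r[0] + 1e-12          # guard against all-zero (silent) frames
    for i in range(order):
        acc = r[i + 1] + np.dot(a[:i], r[i:0:-1])
        k = -acc / err          # reflection coefficient
        a[:i] = a[:i] + k * a[:i][::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err               # ({a_i}, sigma^2)

def estimate_parameters(y_frame, v_frame, p=20, q=20):
    """Parameter estimation of Section 3.1 for one frame.

    y_frame : noisy speech frame y(n, k)
    v_frame : noise estimate v_hat(n, k) from the SPP method of Sec. 3.2
    """
    var_v = np.mean(v_frame ** 2)                   # eq. (33)
    b, _ = lpc_autocorr(v_frame, q)                 # noise LPCs {b_j}
    y_w = lfilter(np.concatenate(([1.0], b)), [1.0], y_frame)  # eq. (34)
    a, var_w = lpc_autocorr(y_w, p)                 # speech LPCs from whitened frame
    return a, var_w, var_v
```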
3.2. Proposed v̂(n,k) Estimation Method

The proposed noise estimation is performed in the acoustic domain using the SPP method [25]. For this purpose, the noisy speech, y(n) (eq. (1)), is analyzed frame-wise using the short-time Fourier transform (STFT):

$$ Y_k(m) = S_k(m) + V_k(m), \qquad (35) $$

where Y_k(m), S_k(m), and V_k(m) denote the complex-valued STFT coefficients of the noisy speech, the clean speech, and the noise signal, respectively, for time-frame index k and frequency-bin index m ∈ {0, 1, ..., 255}. A Hamming window with 50% overlap is used in the STFT analysis [24, Sec 7.2.1]. In polar form, Y_k(m) = R_k(m)e^{jφ_k(m)}, S_k(m) = A_k(m)e^{jϕ_k(m)}, and V_k(m) = D_k(m)e^{jθ_k(m)}, where R_k(m), A_k(m), and D_k(m) are the magnitude spectra of the noisy speech, the clean speech, and the noise signal, respectively, and φ_k(m), ϕ_k(m), and θ_k(m) are the corresponding phase spectra.

We process each frequency bin of the single-sided noisy speech power spectrum, R²_k(m), to estimate the noise power spectrum, D̂²_k(m), where m ∈ {0, 1, ..., 128} contains the DC and Nyquist frequency components. To initialize the algorithm, we consider the first frame (k = 0) of R²_0(m) as silent, giving an estimate of the noise power, D̂²_0(m) = R²_0(m). The noise PSD, λ̂_0(m), is also initialized as λ̂_0(m) = D̂²_0(m).

For k ≥ 1, using the speech presence uncertainty principle [25], an MMSE estimate of D̂²_k(m) at the mth frequency bin is given by:

$$ \hat{D}_k^2(m) = P(H_0^m|R_k(m))\,R_k^2(m) + P(H_1^m|R_k(m))\,\hat{\lambda}_{k-1}(m), \qquad (36) $$

where P(H_0^m|R_k(m)) and P(H_1^m|R_k(m)) are the conditional probabilities of speech absence and speech presence, given R_k(m) at the mth frequency bin. The simplified P(H_1^m|R_k(m)) estimate, which results from assuming equal a priori probabilities of speech absence and presence, P(H_0) = P(H_1), is given by [25]:

$$ P(H_1^m|R_k(m)) = \left[1 + (1 + \xi_{opt})\exp\!\left(-\frac{R_k^2(m)}{\hat{\lambda}_{k-1}(m)}\,\frac{\xi_{opt}}{1 + \xi_{opt}}\right)\right]^{-1}, \qquad (37) $$

where ξ_opt is the optimal a priori SNR. In [25], the optimal choice for ξ_opt is found to be 10 log10(ξ_opt) = 15 dB, and P(H_0^m|R_k(m)) is given by P(H_0^m|R_k(m)) = 1 − P(H_1^m|R_k(m)).

If P(H_1^m|R_k(m)) = 1 occurs at the mth frequency bin, it causes stagnation, which stops the updating of D̂²_k(m) (eq. (36)). Rather than monitoring the status of P(H_1^m|R_k(m)) = 1 over a long time as reported in [25], we simply resolve this issue by setting P(H_1^m|R_k(m)) = 0.99 once this condition occurs, prior to updating D̂²_k(m). It is observed that R²_k(m) is completely filled with additive noise during silent activity, thus giving an estimate of the noise power. Therefore, unlike the existing method [25], which always updates D̂²_k(m) using eq. (36), we proceed differently depending on the silent/speech activity of R²_k(m) (for each frequency bin m). Specifically, at the mth frequency bin (k ≥ 1), if P(H_1^m|R_k(m)) < 0.5, R²_k(m) indicates silent activity, giving D̂²_k(m) = R²_k(m); otherwise, D̂²_k(m) is estimated using eq. (36). With the estimated D̂²_k(m), λ̂_k(m) is updated as:

$$ \hat{\lambda}_k(m) = \eta\,\hat{\lambda}_{k-1}(m) + (1 - \eta)\,\hat{D}_k^2(m), \qquad (38) $$

where the smoothing constant, η, is set to 0.9. The |IDFT| of P_v(m)e^{jφ_k(m)} yields the estimated noise, v̂(n,k), where P_v(m) = √λ̂_k(m). To ensure conjugate symmetry, the components of P_v(m) at m ∈ {1, 2, ..., 127} are flipped to m ∈ {129, 130, ..., 255} of P_v(m) before taking the |IDFT|.

Figure 4. Comparing the estimated: (a) [α̂²(n) + σ̂w²] and σ̂v², and (b) the J2(n) and J1(n) metrics, computed from the noisy speech in Figure 2 (b).
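The per-bin noise-PSD tracker of eqs. (36)-(38) can be sketched as follows (illustrative NumPy; `R2` is the noisy power spectrum R²_k(m) of one frame and `lam` carries λ̂_{k−1}(m) between frames):

```python
import numpy as np

XI_OPT = 10 ** (15 / 10)   # optimal a priori SNR of 15 dB [25]
ETA = 0.9                  # PSD smoothing constant, eq. (38)

def spp_noise_psd_update(R2, lam):
    """One frame of the proposed SPP-based noise PSD update (Sec. 3.2)."""
    # Speech presence probability, eq. (37).
    p_h1 = 1.0 / (1.0 + (1.0 + XI_OPT) *
                  np.exp(-(R2 / lam) * (XI_OPT / (1.0 + XI_OPT))))
    # Avoid stagnation: cap P(H1|R) at 0.99 before the update.
    p_h1 = np.minimum(p_h1, 0.99)
    # MMSE noise power estimate, eq. (36), with P(H0|R) = 1 - P(H1|R).
    D2 = (1.0 - p_h1) * R2 + p_h1 * lam
    # Proposed modification: a bin with P(H1|R) < 0.5 is taken as pure noise.
    D2 = np.where(p_h1 < 0.5, R2, D2)
    # Recursive PSD smoothing, eq. (38).
    return ETA * lam + (1.0 - ETA) * D2
```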
3.3. Proposed K0(n) Tuning Method

Firstly, we construct the KF with ({â_i}, σ̂w²) and σ̂v² and extract the tuning parameters, as shown in Figure 4. It can be seen from Figure 4 (a) that [α̂²(n) + σ̂w²] and σ̂v² exhibit characteristics very similar to those of the KF-Oracle method (Figure 2 (e)). The improvement of [α̂²(n) + σ̂w²] and σ̂v² results in J2(n) and J1(n) metrics (Figure 4 (b)) quite similar to those appearing in the WGN condition (Figure 1 (c)). Therefore, the J2(n) and J1(n) metrics (Figure 4 (b)) are now suitable for dynamically offsetting the bias in K0(n) in real-life noise conditions. However, as investigated in Section 2.2, the J2(n) metric is suitable for offsetting the bias in K0(n) during speech pauses, since it results in an under-estimated K0(n) during speech presence of the noisy speech [20]. On the contrary, since the J1(n) metric approaches 0 in speech regions of the noisy speech, according to eq. (32) it minimizes the under-estimation of K0(n). In light of these observations, for each sample of y(n,k), we incorporate the J2(n) metric during speech pauses and the J1(n) metric during speech presence to dynamically offset the bias in K0(n).

We observed that [α̂²(n) + σ̂w²] and σ̂v² can be adopted as a speech activity detector for each sample of y(n,k). For example, during speech pauses, the condition σ̂v² ≥ [α̂²(n) + σ̂w²] holds (e.g., 0-0.15 s or 1.8-2.19 s of Figure 4 (a)). Conversely, [α̂²(n) + σ̂w²] >> σ̂v² is found in speech regions (e.g., 0.16-0.33 s or 0.9-1.06 s of Figure 4 (a)). Therefore, at sample n, if σ̂v² ≥ [α̂²(n) + σ̂w²], y(n,k) is classed as silent and the decision parameter (denoted by ζ) is set to ζ(n) = 0; otherwise, speech activity occurs and ζ(n) = 1.

Figure 5. Comparing the detected flags from Figure 2 (b) to the reference flags computed from Figure 2 (a).

It can be seen from Figure 5 that the flags detected by the proposed method (0/1: silent/speech) are closely similar to the reference flags (0/−1: silent/speech; generated by visually inspecting the utterance sp05).

At sample n, if ζ(n) = 0, the adjusted K0′(n) in the proposed SEA is given by:

$$ K_0'(n) = K_0(n)\big[1 - J_2(n)\big] = \left(\frac{\hat{\alpha}^2(n) + \hat{\sigma}_w^2}{\hat{\alpha}^2(n) + \hat{\sigma}_w^2 + \hat{\sigma}_v^2}\right)\left(\frac{\hat{\alpha}^2(n)}{\hat{\alpha}^2(n) + \hat{\sigma}_w^2}\right) = \frac{\hat{\alpha}^2(n)}{\hat{\alpha}^2(n) + \hat{\sigma}_w^2 + \hat{\sigma}_v^2}. \qquad (39) $$

Figure 6. K0′(n) responses in terms of: (a) α̂²(n) and [α̂²(n) + σ̂w² + σ̂v²], and (b) [α̂²(n) + σ̂w²]² and [α̂²(n) + σ̂w² + σ̂v²]², where the same experimental setup as Figure 2 (b) is used.

To justify the validity of K0′(n), Figure 6 (a) shows the numerator and the denominator of eq. (39) computed from the noisy speech in Figure 2 (b). It can be seen that α̂²(n) ≈ 0 during speech pauses (e.g., 0-0.15 s or 1.8-2.19 s of Figure 6 (a)). According to eq. (39), this results in K0′(n) ≈ 0. Since [α̂²(n) + σ̂w² + σ̂v²] >> α̂²(n) occurs during speech presence (e.g., 0.16-0.33 s or 0.9-1.06 s of Figure 6 (a)), it may result in an under-estimated K0′(n), as in the WGN experiment (Figure 1 (d)). Thus, the J2(n) metric does not adequately address the tuning of the biased K0(n) during speech presence of y(n,k).

As discussed in Section 2.2, during speech presence of y(n,k), the J1(n) metric can be used to offset the bias in K0(n). However, our further investigation of the J1(n) metric-based tuning in eq. (32) reveals that the subtraction of J1(n) from the biased K0(n) may still produce an under-estimated gain. To cope with this problem, at sample n, if ζ(n) = 1, we propose a more appropriate tuning of the biased K0(n) using the J1(n) metric as:

$$ K_0'(n) = K_0(n)\big[1 - J_1(n)\big] = \left(\frac{\hat{\alpha}^2(n) + \hat{\sigma}_w^2}{\hat{\alpha}^2(n) + \hat{\sigma}_w^2 + \hat{\sigma}_v^2}\right)\left(\frac{\hat{\alpha}^2(n) + \hat{\sigma}_w^2}{\hat{\alpha}^2(n) + \hat{\sigma}_w^2 + \hat{\sigma}_v^2}\right) = \frac{\big[\hat{\alpha}^2(n) + \hat{\sigma}_w^2\big]^2}{\big[\hat{\alpha}^2(n) + \hat{\sigma}_w^2 + \hat{\sigma}_v^2\big]^2}. \qquad (40) $$

To justify the validity of this K0′(n), the numerator and the denominator of eq. (40) are shown in Figure 6 (b). It can be seen that [α̂²(n) + σ̂w² + σ̂v²]² ≥ [α̂²(n) + σ̂w²]² during speech presence of y(n,k) (e.g., 0.16-0.33 s or 0.9-1.06 s), which causes K0′(n) to approach 1.

Figure 7. Comparing K0(n) obtained using the KF-Oracle, proposed, and AKF-RMBT [21] methods for the utterance sp05 corrupted at 5 dB with: (a) non-stationary (babble) and (b) colored (f16) noise.

To examine the performance of the proposed tuning algorithm in real-life non-stationary noise conditions, we repeat the experiment in Figure 2.
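Putting the decision rule and eqs. (39)-(40) together, the proposed per-sample tuning reduces to a short branch (an illustrative sketch built on the estimated quantities of Section 3.1):

```python
def proposed_tuned_gain(alpha2, var_w, var_v):
    """Proposed dynamic tuning of the KF gain (Sec. 3.3).

    The speech-activity decision zeta(n) and the adjusted gain K0'(n)
    follow eqs. (39)-(40); inputs are the per-sample estimates
    alpha_hat^2(n), sigma_hat_w^2, and sigma_hat_v^2.
    """
    total = alpha2 + var_w + var_v
    if var_v >= alpha2 + var_w:                  # zeta(n) = 0: speech pause
        return alpha2 / total                    # eq. (39): K0 * (1 - J2)
    else:                                        # zeta(n) = 1: speech presence
        return ((alpha2 + var_w) / total) ** 2   # eq. (40): K0 * (1 - J1)
```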
It can be seen from Figure 7 (a) that the adjusted KF gain, K0′(n), produced by the proposed method is closely similar to the oracle K0(n). Specifically, it maintains a smooth transition at the edges, and the temporal changes in speech regions closely match the oracle K0(n). Conversely, the AKF-RMBT method [21] produces a significantly under-estimated K0(n) in speech regions. We also repeat the experiment in Figure 2, except that the utterance sp05 is corrupted by 5 dB colored (f16) noise. Figure 7 (b) also reveals that the biasing effect is reduced significantly in the adjusted K0′(n) of the proposed method, which is closely similar to the oracle K0(n); the AKF-RMBT [21] still produces an under-estimated K0(n) in speech regions. In light of this KF gain comparison, the proposed method adequately addresses the tuning of the biased K0(n) in both real-life non-stationary and colored noise conditions. The reduced-biased K0′(n) achieved by the proposed method also mitigates the risk of distortion in the enhanced speech compared with the AKF-RMBT [21].

4. Speech Enhancement Experiment

4.1. Test Set

For objective experiments, 30 phonetically balanced utterances belonging to six speakers (three male and three female) are taken from the NOIZEUS corpus [1, Chapter 12]. The noisy speech for the test set is formed by mixing the clean speech with real-world non-stationary (babble, street) and colored (factory2 and f16) noise recordings at multiple SNR levels (from -5 dB to 15 dB, in 5 dB increments). The street noise is taken from [26] and the remaining noises are from [23]. All the clean speech and noise recordings are single-channel with a sampling frequency of 8 kHz. This provides 30 examples per condition, with 20 conditions in total.

4.2. Objective Evaluation

The objective measures are used to evaluate the quality and intelligibility of the enhanced speech with respect to the corresponding clean speech. The following objective evaluation metrics are used in this paper:

• Perceptual Evaluation of Speech Quality (PESQ) for objective quality evaluation [27]. The PESQ score ranges from -0.5 to 4.5. A higher PESQ score indicates better speech quality.
• Short-time objective intelligibility (STOI) for objective intelligibility evaluation [28]. It ranges from 0 to 1 (or 0 to 100%). A higher STOI score indicates better speech intelligibility.

We also analyzed the spectrograms of the enhanced speech produced by the proposed method and the competing methods to visually assess the level of residual background noise as well as speech distortion.
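Scores of this kind can be reproduced with the open-source `pesq` and `pystoi` packages, which are third-party implementations of [27] and [28] rather than the reference tools used in the paper; a minimal sketch at the corpus sampling rate of 8 kHz:

```python
import soundfile as sf
from pesq import pesq        # pip install pesq
from pystoi import stoi      # pip install pystoi

FS = 8000  # NOIZEUS sampling rate

def objective_scores(clean_path, enhanced_path):
    """PESQ [27] and STOI [28] for one clean/enhanced utterance pair."""
    clean, fs = sf.read(clean_path)
    enhanced, _ = sf.read(enhanced_path)
    assert fs == FS, "NOIZEUS recordings are sampled at 8 kHz"
    pesq_score = pesq(FS, clean, enhanced, 'nb')  # narrow-band mode at 8 kHz
    stoi_score = stoi(clean, enhanced, FS)        # in [0, 1]
    return pesq_score, stoi_score
```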
4.3. Subjective Evaluation

The subjective evaluation was carried out through a series of blind AB listening tests [5, Section 3.3.4]. To perform the tests, we generated a set of stimuli by corrupting the utterances sp05 and sp27 from the NOIZEUS corpus [1, Chapter 12]. The reference transcript for utterance sp05 is "Wipe the grease off his dirty face", and it is corrupted with 5 dB babble noise. The reference transcript for utterance sp27 is "Bring your best compass to the third class", and it is corrupted with 5 dB f16 noise. Utterances sp05 and sp27 were uttered by a male and a female speaker, respectively. In this test, the enhanced speech produced by six SEAs, as well as the corresponding clean speech and noise-corrupted speech signals, were played as stimuli pairs to the listeners. Specifically, the test was performed on a total of 112 stimuli pairs (56 for each utterance) played in random order to each listener, excluding comparisons between the same method. For each stimuli pair, the listener indicates which of the first or second stimulus is perceptually better, or gives a third response indicating that no difference is found between them. For pairwise scoring, a 100% award is given to the preferred method, 0% to the other, and 50% to each for a "no difference" response. The participants could re-listen to stimuli if required. Ten English-speaking listeners participated in the blind AB listening tests. The average of the preference scores given by the listeners, termed the mean preference score (%), is used to compare the performance of the SEAs.

4.4. Specifications of the Competing SEAs

The performance of the proposed SEA is evaluated by comparing it with the following benchmark SEAs (p: order of {a_i}; σw²: excitation variance of the AR model; w_f: analysis frame duration (ms); s_f: analysis frame shift (ms)):

1. Noisy: No enhancement (speech corrupted with noise).
2. KF-Oracle: KF, where ({a_i}, σw²) and σv² are computed from the clean speech and the noise signal; p = 10, w_f = 32 ms, s_f = 16 ms, rectangular window used for framing.
3. KF-Non-oracle: KF, where ({a_i}, σw²) and σv² are computed from the noisy speech; p = 10, w_f = 32 ms, s_f = 16 ms, rectangular window used for framing.
4. MMSE-STSA [9]: w_f = 25 ms, s_f = 10 ms, Hamming window used for analysis and synthesis.
5. AKF-IT [13]: AKF operating with two iterations, where the initial ({a_i}, σw²) and ({b_j}, σu²) are computed from the noisy speech and then re-estimated from the filtered speech after the first iteration; p = 10, noise LPC order q = 10, w_f = 20 ms, s_f = 0 ms, rectangular window used for framing.
6. AKF-RMBT [21]: Robustness metric-based tuning of the AKF, where ({a_i}, σw²) and ({b_j}, σu²) are computed from the pre-whitened speech and the first noisy speech frame, respectively; p = 10, q = 40, w_f = 20 ms, s_f = 0 ms, rectangular window used for framing.
7. Proposed: Robustness and sensitivity metrics-based tuning of the KF gain, where ({â_i}, σ̂w²) and σ̂v² are computed from the pre-whitened speech and the estimated noise; p = 20, q = 20, w_f = 32 ms, s_f = 16 ms, rectangular window used for time-domain framing, Hamming window used for acoustic-domain analysis and synthesis (the time-domain framing configuration is sketched below).
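For reproducibility, the framing configuration of the proposed method (item 7 above) translates to the following at the 8 kHz NOIZEUS sampling rate; the helper name is illustrative:

```python
import numpy as np

FS = 8000
W_F = int(0.032 * FS)   # 32 ms frame -> 256 samples
S_F = int(0.016 * FS)   # 16 ms shift -> 128 samples (50% overlap)

def frame_signal(y, frame_len=W_F, frame_shift=S_F):
    """Split y(n) into overlapping rectangular-windowed frames y(n, k)."""
    n_frames = 1 + max(0, (len(y) - frame_len) // frame_shift)
    return np.stack([y[k * frame_shift: k * frame_shift + frame_len]
                     for k in range(n_frames)])
```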
5. Results and Discussion

5.1. Objective Quality Evaluation

Figure 8 shows the average PESQ score for each SEA. It can be seen that the KF-Oracle method exhibits the highest PESQ score across the tested conditions. This is because ({a_i}, σw²) and σv² are computed from the clean speech and the noise signal; informally, it can be considered an upper bound on the PESQ score improvement. Conversely, Noisy shows the lowest PESQ score in every test condition. The proposed SEA consistently produces a higher PESQ score than the competing SEAs across the tested conditions. The PESQ scores for the proposed method are also very similar to those of the KF-Oracle method, because the reduced-biased Kalman gain achieved by the proposed tuning algorithm is closely similar to that of the KF-Oracle method (Figure 7). Amongst the competing methods, the AKF-RMBT [21] produced the highest PESQ score across the tested conditions. In light of this comparative study, the proposed SEA yields higher quality enhanced speech than the competing SEAs across the tested conditions.

5.2. Objective Intelligibility Evaluation

Figure 9 shows the average STOI score for each SEA. As in Section 5.1, the KF-Oracle method achieves the highest STOI score for each tested condition. The proposed method attained a higher STOI score than the competing methods for each tested condition. Amongst the competing methods, the AKF-RMBT [21] produced the highest STOI score for each tested condition. Conversely, Noisy shows the lowest STOI score in every tested condition. In light of this study, the proposed method produces more intelligible enhanced speech than the competing SEAs across the tested conditions.

5.3. Spectrogram Analysis of the SEAs

Figures 10-11 compare the spectrograms of the enhanced speech produced by each SEA for utterance sp05 corrupted with the babble (non-stationary) and f16 (colored) noise sources at 5 dB SNR. Specifically, the biased KF gain of the KF-Non-oracle passes significant residual noise into the enhanced speech (Figures 10 (c)-11 (c)). Also, poor estimates of the a priori SNR introduce a high degree of residual noise in the enhanced speech produced by MMSE-STSA [9] (Figures 10 (d)-11 (d)). The degree of residual noise decreases in the enhanced speech produced by AKF-IT [13] (Figures 10 (e)-11 (e)); however, the residual noise appears as musical noise, and the enhanced speech also suffers from significant distortion. The AKF-RMBT method [21] exhibits less residual noise in the enhanced speech, but suffers from distortion due to the under-estimated AKF gain (Figures 10 (f)-11 (f)). There is less residual noise as well as less distortion in the enhanced speech produced by the proposed method (Figures 10 (g)-11 (g)). Finally, the enhanced speech produced by KF-Oracle (Figures 10 (h)-11 (h)) is almost identical to the clean speech (Figures 10 (a)-11 (a)), since KF-Oracle uses the clean speech and noise (unobserved in practice) for parameter estimation.

5.4. Subjective Evaluation by AB Listening Test

The mean preference score (%) comparisons for each SEA are shown in Figures 12-13. The non-stationary (babble) noise experiment in Figure 12 reveals that the proposed method is widely preferred (73%) by the listeners over the competing methods, apart from the clean speech (100%) and the KF-Oracle method (81%). The AKF-RMBT [21] is found to be the most preferred (60%) amongst the competing methods. AKF-IT [13] was preferred more (47%) than MMSE-STSA [9] (31%), even though MMSE-STSA attained higher objective scores (Sections 5.1 and 5.2). This may be because AKF-IT demonstrates superior noise suppression in speech regions compared with MMSE-STSA, as indicated in [21].

The blind AB listening test results for the colored (f16) noise condition are shown in Figure 13. It can be seen that the proposed method achieves a significantly higher preference score (75%) than the competing methods, except for the clean speech (100%) and the KF-Oracle method (82%). As in the previous experiment, the AKF-RMBT [21] is found to be the most preferred (64%) amongst the competing methods, followed by AKF-IT [13] (48%) and MMSE-STSA [9] (32%).
In light of the blind AB listening tests, the enhanced speech produced by the proposed method exhibits the best perceived quality amongst all of the competing SEAs for both the male and female utterances corrupted by real-life non-stationary and coloured noise sources.

6. Conclusion

Robustness and sensitivity metrics-based tuning of the Kalman filter gain for single-channel speech enhancement has been investigated in this paper. At first, the noise variance is computed from the estimated noise for each noisy speech frame using an SPP method. A whitening filter is also constructed to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. The KF is then constructed with the estimated parameters. To achieve better noise reduction, the robustness metric is employed to dynamically offset the bias in the KF gain during speech absence of the noisy speech, and the sensitivity metric during speech presence. The noise variance and the AR model parameters are adopted as a speech activity detector. It is shown that the reduced-biased KF gain achieved by the proposed tuning algorithm addresses speech enhancement in real-life noise conditions. Objective and subjective scores on the NOIZEUS test set demonstrate that the proposed method outperforms the competing methods in real-life noise conditions for a wide range of SNR levels.

Author Contributions: The contribution of Sujan Kumar Roy includes: preliminary experiments, experiment design, conducting the experiments, code writing, analysis of results, literature review, and manuscript writing. Kuldip K. Paliwal: supervision.

Institutional Review Board Statement: The subjective AB listening tests in Section 4.3 were conducted with approval from Griffith University's Human Research Ethics Committee: database protocol number 2018/671.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Loizou, P.C. Speech Enhancement: Theory and Practice, 2nd ed.; CRC Press, Inc.: Boca Raton, FL, USA, 2013.
2. Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing 1979, 27, 113–120. doi:10.1109/TASSP.1979.1163209.
3. Berouti, M.; Schwartz, R.; Makhoul, J. Enhancement of speech corrupted by acoustic noise. IEEE International Conference on Acoustics, Speech, and Signal Processing 1979, 4, 208–211. doi:10.1109/TASSP.1979.1163209.
4. Kamath, S.; Loizou, P. A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. IEEE International Conference on Acoustics, Speech, and Signal Processing 2002, 4, 4160–4164. doi:10.1109/ICASSP.2002.5745591.
5. Paliwal, K.; Wójcicki, K.; Schwerin, B. Single-channel speech enhancement using spectral subtraction in the short-time modulation domain. Speech Communication 2010, 52, 450–475. doi:10.1016/j.specom.2010.02.004.
6. Lim, J.S.; Oppenheim, A.V. Enhancement and bandwidth compression of noisy speech. Proceedings of the IEEE 1979, 67, 1586–1604. doi:10.1109/PROC.1979.11540.
7. Scalart, P.; Filho, J.V. Speech enhancement based on a priori signal to noise estimation. IEEE International Conference on Acoustics, Speech, and Signal Processing 1996, 2, 629–632. doi:10.1109/ICASSP.1996.543199.
8. Plapous, C.; Marro, C.; Mauuary, L.; Scalart, P. A two-step noise reduction technique. IEEE International Conference on Acoustics, Speech, and Signal Processing 2004, 1, 289–292. doi:10.1109/ICASSP.2004.1325979.
9. Ephraim, Y.; Malah, D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 1984, 32, 1109–1121. doi:10.1109/TASSP.1984.1164453.
10. Ephraim, Y.; Malah, D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 1985, 33, 443–445. doi:10.1109/TASSP.1985.1164550.
11. Paliwal, K.; Schwerin, B.; Wójcicki, K. Speech enhancement using a minimum mean-square error short-time spectral modulation magnitude estimator. Speech Communication 2012, 54, 282–305. doi:10.1016/j.specom.2011.09.003.
12. Paliwal, K.; Basu, A. A speech enhancement method based on Kalman filtering. IEEE International Conference on Acoustics, Speech, and Signal Processing 1987, 12, 177–180. doi:10.1109/ICASSP.1987.1169756.
13. Gibson, J.D.; Koo, B.; Gray, S.D. Filtering of colored noise for speech enhancement and coding. IEEE Transactions on Signal Processing 1991, 39, 1732–1742. doi:10.1109/78.91144.
14. Wang, Y.; Wang, D. Towards scaling up classification-based speech separation. IEEE Transactions on Audio, Speech, and Language Processing 2013, 21, 1381–1390. doi:10.1109/TASL.2013.2250961.
15. Xu, Y.; Du, J.; Dai, L.; Lee, C. An experimental study on speech enhancement based on deep neural networks. IEEE Signal Processing Letters 2014, 21, 65–68. doi:10.1109/LSP.2013.2291240.
16. Williamson, D.S.; Wang, Y.; Wang, D. Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2016, 24, 483–492. doi:10.1109/TASLP.2015.2512042.
17. Roy, S.K.; Zhu, W.P.; Champagne, B. Single channel speech enhancement using subband iterative Kalman filter. IEEE International Symposium on Circuits and Systems 2016, pp. 762–765. doi:10.1109/ISCAS.2016.7527352.
18. Saha, M.; Ghosh, R.; Goswami, B. Robustness and sensitivity metrics for tuning the extended Kalman filter. IEEE Transactions on Instrumentation and Measurement 2014, 63, 964–971. doi:10.1109/TIM.2013.2283151.
19. So, S.; George, A.E.W.; Ghosh, R.; Paliwal, K.K. A non-iterative Kalman filtering algorithm with dynamic gain adjustment for single-channel speech enhancement. International Journal of Signal Processing Systems 2016, 4, 263–268. doi:10.18178/ijsps.4.4.263-268.
20. So, S.; George, A.E.W.; Ghosh, R.; Paliwal, K.K. Kalman filter with sensitivity tuning for improved noise reduction in speech. Circuits, Systems, and Signal Processing 2017, 36, 1476–1492. doi:10.1007/s00034-016-0363-y.
21. George, A.E.; So, S.; Ghosh, R.; Paliwal, K.K. Robustness metric-based tuning of the augmented Kalman filter for the enhancement of speech corrupted with coloured noise. Speech Communication 2018, 105, 62–76. doi:10.1016/j.specom.2018.10.002.
22. Vaseghi, S.V. Linear prediction models. In Advanced Digital Signal Processing and Noise Reduction; John Wiley & Sons, 2009; chapter 8, pp. 227–262.
23. Varga, A.; Steeneken, H.J. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication 1993, 12, 247–251. doi:10.1016/0167-6393(93)90095-3.
24. Oppenheim, A.V.; Schafer, R.W. Discrete-Time Signal Processing, 3rd ed.; Prentice Hall Press: Upper Saddle River, NJ, USA, 2009.
25. Gerkmann, T.; Hendriks, R.C. Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Transactions on Audio, Speech, and Language Processing 2012, 20, 1383–1393. doi:10.1109/TASL.2011.2180896.
26. Pearce, D.; Hirsch, H.G. The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. INTERSPEECH, ISCA, 2000, pp. 29–32.
27. Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. IEEE International Conference on Acoustics, Speech, and Signal Processing 2001, 2, 749–752. doi:10.1109/ICASSP.2001.941023.
28. Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing 2011, 19, 2125–2136. doi:10.1109/TASL.2011.2114881.

Figure 8. Average PESQ score for each SEA, averaged over all frames for each condition specified in Section 4.1.

Figure 9. Average STOI score for each SEA, averaged over all frames for each condition specified in Section 4.1.

Figure 10. Comparing the spectrograms of: (a) clean speech (utterance sp05), (b) noisy speech (sp05 corrupted with 5 dB babble noise) (PESQ=2.10), and enhanced speech produced by the: (c) KF-Non-oracle (PESQ=2.18), (d) MMSE-STSA [9] (PESQ=2.32), (e) AKF-IT [13] (PESQ=2.26), (f) AKF-RMBT [21] (PESQ=2.42), (g) Proposed (PESQ=2.55), and (h) KF-Oracle (PESQ=2.61) methods.

Figure 11. Comparing the spectrograms of: (a) clean speech (utterance sp05), (b) noisy speech (sp05 corrupted with 5 dB f16 noise) (PESQ=2.14), and enhanced speech produced by the: (c) KF-Non-oracle (PESQ=2.26), (d) MMSE-STSA [9] (PESQ=2.39), (e) AKF-IT [13] (PESQ=2.31), (f) AKF-RMBT [21] (PESQ=2.53), (g) Proposed (PESQ=2.65), and (h) KF-Oracle (PESQ=2.70) methods.

Figure 12. Mean preference score (%) comparison between the proposed and benchmark SEAs for the utterance sp05 corrupted with 5 dB non-stationary babble noise. The error bars indicate the standard deviation of the scores.

Figure 13. Mean preference score (%) comparison between the proposed and benchmark SEAs for the utterance sp27 corrupted with 5 dB colored f16 noise. The error bars indicate the standard deviation of the scores.
Chapter 4

Robustness and Sensitivity Metrics-Based Tuning of the Augmented Kalman Filter for Single-Channel Speech Enhancement

STATEMENT OF CONTRIBUTION TO CO-AUTHORED PUBLISHED PAPER

This chapter includes a co-authored paper. The bibliographic details of the co-authored paper, including all authors, are:

Sujan Kumar Roy, Kuldip K. Paliwal, "Robustness and Sensitivity Metrics-Based Tuning of the Augmented Kalman Filter for Single-Channel Speech Enhancement", Under review with: Applied Acoustics (Submitted 04 March 2021).

My contribution to the paper involved:
• Preliminary experiments.
• Experiment design.
• Conducting the experiments.
• Code writing.
• Design of models.
• Analysis of results.
• Literature review.
• Manuscript writing.

Professor Kuldip K. Paliwal provided supervision and aided with editing the final manuscript.

(Signed) _____________ (Date) 02/04/2021
Sujan Kumar Roy

(Countersigned) _____________ (Date) 02/04/2021
Supervisor: Professor Kuldip K. Paliwal

Robustness and Sensitivity Metrics-Based Tuning of the Augmented Kalman Filter for Single-Channel Speech Enhancement

Sujan Kumar Roy*, Kuldip K. Paliwal

Signal Processing Laboratory, Griffith University, Nathan Campus, Brisbane, QLD, 4111, Australia
* Corresponding author. Email addresses: sujankumar.roy@griffithuni.edu.au (Sujan Kumar Roy), k.paliwal@griffith.edu.au (Kuldip K. Paliwal)

Abstract

Inaccurate estimates of the speech and noise linear prediction coefficients (LPCs) introduce bias in the augmented Kalman filter (AKF) gain, which impacts the quality and intelligibility of the enhanced speech. Although current tuning methods offset the bias in the AKF gain, particularly in colored noise conditions, they do not adequately address non-stationary noise conditions. This paper introduces a new tuning algorithm for the AKF gain for speech enhancement in real-life noise conditions. For this purpose, a speech presence probability (SPP) method first estimates the noise power spectral density (PSD) from each noisy speech frame to compute the noise LPC parameters. A whitening filter is constructed with the noise LPCs to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. The AKF is then constructed with the estimated speech and noise LPC parameters. To achieve better noise reduction, the robustness metric is employed to dynamically offset the bias in the AKF gain during speech absence of the noisy speech, and the sensitivity metric during speech presence. The speech activity is obtained by adopting the speech and noise production model parameters. It is shown that the reduced-biased AKF gain achieved by the proposed tuning algorithm addresses speech enhancement in real-life noise conditions. Objective and subjective scores on the NOIZEUS corpus demonstrate that the proposed method produces enhanced speech with higher quality and intelligibility than the competing methods in real-life noise conditions for a wide range of signal-to-noise ratio (SNR) levels.

Keywords: Speech enhancement, Kalman filter, augmented Kalman filter, robustness metric, sensitivity metric, LPC.
An SEA is useful in many applications where noise-corrupted speech is unreliable. For example, mobile communication systems, hearing aid devices, and speech recognition systems typically rely upon the accuracy of speech enhancement for robustness. Various SEAs, such as spectral subtraction (SS) [1–4], the Wiener filter (WF) [5–7], minimum mean square error (MMSE) [8–10], the Kalman filter (KF) [11], the augmented KF (AKF) [12], computational auditory scene analysis (CASA) [13], and deep neural networks (DNN) [14], have been introduced over the decades. This paper focuses on AKF-based single-channel speech enhancement in real-life noise conditions.

The KF was first used for speech enhancement by Paliwal and Basu [11]. In the KF, each clean speech frame is represented by an auto-regressive (AR) model, whose parameters comprise the linear prediction coefficients (LPCs) and the prediction error variance. The LPC parameters and the additive noise variance are used to construct the KF recursive equations. Given a frame of noisy speech samples, the KF gives a linear MMSE estimate of the clean speech samples using the recursive equations. Therefore, the KF performance for speech enhancement largely depends upon the accuracy of the LPC, prediction error variance, and additive noise variance estimates in practice.

In [12], Gibson et al. introduced an augmented KF (AKF) for speech enhancement in colored noise conditions. In the AKF, both the clean speech and the additive noise are represented by AR models. The speech and noise LPC parameters are incorporated in an augmented matrix form to construct the recursive equations of the AKF. In [12], the AKF processes the colored-noise-corrupted speech iteratively (usually three to four iterations) to eliminate the embedded noise, yielding the enhanced speech. During this, the LPC parameters for the current frame are computed from the corresponding filtered speech frame of the previous AKF iteration. Although the AKF demonstrates an improvement in the signal-to-noise ratio (SNR) of the noisy speech, it suffers from musical noise and speech distortion. Therefore, the AKF method in [12] does not adequately address the inaccurate speech and noise LPC parameter estimation issue in practice.

∗Corresponding author. Email addresses: [email protected] (Sujan Kumar Roy), [email protected] (Kuldip K. Paliwal)
In [15], Roy et al. proposed a sub-band (SB) iterative KF (SBIT-KF)-based SEA. In this SEA, the noisy speech is first decomposed into 16 sub-bands (SBs). Then a partial reconstruction of the noisy speech is made with the high-frequency SBs (HFSBs). An iterative KF (two iterations) is applied to the partially reconstructed noisy speech, yielding a partial enhanced speech. As in [12], the speech LPC parameters for the current frame are computed from the corresponding filtered speech frame of the previous KF iteration. Also, the noise variance is estimated using a derivative-based high-pass filter from each frame of the partially reconstructed noisy speech. Conversely, the low-frequency SBs (LFSBs) are kept unprocessed under the assumption that the impact of noise on the LFSBs is negligible. The partial enhanced speech is then added to the LFSBs to reconstruct the final enhanced speech. However, the LFSBs can also be affected by noise, typically when operating in conditions that have time-varying amplitudes. As demonstrated in [12], the iterative processing of the partially reconstructed noisy speech using the KF [15] also produced distorted speech.

In [16], Saha et al. proposed a robustness metric and a sensitivity metric for tuning the bias in the KF gain for instrument engineering applications. Later on, So et al. employed the tuning of the KF gain in a speech enhancement context [17]. Specifically, it is shown in [17] that the enhanced speech (for each sample within a noisy speech frame) is given by recursively averaging the observed noisy speech and the predicted speech weighted by a scalar KF gain. However, inaccurate estimates of the LPC parameters introduce bias in the KF gain, resulting in a significant residual noise leaking into the enhanced speech.
In [17], a robustness metric is used to offset the bias in the KF gain for speech enhancement. In [18], So et al. further showed that the robustness metric strongly suppresses the KF gain in speech regions, resulting in distorted speech. To cope with this problem, in [18], a sensitivity metric was used to offset the bias in the KF gain. It was shown that the sensitivity tuning of the KF gain produced less distorted speech than that of [17]. However, both of the KF methods [17, 18] address speech enhancement particularly in stationary white noise conditions. In [19], George et al. introduced a robustness metric-based tuning of the AKF (AKF-RMBT) for speech enhancement in colored noise conditions. Firstly, the noise LPC parameters are computed from the first noisy speech frame by assuming that it contains no speech. The computed noise LPC parameters remain constant while processing all noisy speech frames of a given utterance. A whitening filter is also constructed with the noise LPCs to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. The AKF is then constructed with the estimated LPC parameters. As in [17], it is shown that the robustness metric-based tuning method offsets the bias in the AKF gain for silent frames to some extent; however, it over-suppresses the components in speech regions, resulting in distorted speech. In addition, the speech and noise LPC parameter estimation process as well as the tuning method in [19] do not account for conditions that have time-varying amplitudes.

In [20], Roy and Paliwal proposed an extension of the work in [19] by employing a sensitivity metric-based tuning of the AKF (AKF-SMBT). In this SEA, the speech and noise LPC parameters are computed with a similar process as in [19]. It is demonstrated that the application of the sensitivity metric in the tuning method of [20] minimizes the under-estimation issue of the AKF gain in [19], particularly in speech regions. It is also shown that the reduced-biased AKF gain in [20] yields less residual noise as well as less distortion in the enhanced speech as compared to [19]. However, this SEA [20] also does not account for conditions that have time-varying amplitudes.

Motivated by the shortcomings of the previous KF and AKF methods [17–20], in this paper, we propose a new tuning algorithm to dynamically offset the bias in the AKF gain for speech enhancement, even for conditions that have time-varying amplitudes. Firstly, we estimate the noise power spectral density (PSD) from each noisy speech frame using a speech presence probability (SPP) method to compute the noise LPC parameters. A whitening filter is also constructed with the noise LPCs to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. The AKF is then constructed with the estimated speech and noise LPC parameters, where a robustness metric is employed to dynamically offset the bias in the AKF gain during speech absence, and a sensitivity metric during speech presence, to achieve better noise reduction. The proposed method aims to mitigate the weaknesses of the previously proposed tuning methods by providing a reduced-biased AKF gain, even for noise conditions that have time-varying amplitudes. The motivation of this is to produce enhanced speech with higher quality and intelligibility.

The structure of this paper is as follows: background knowledge is presented in Section 2, including the signal model, the AKF for speech enhancement, a paradigm shift of the AKF recursive equations, and the impact of a biased AKF gain on speech enhancement in colored as well as non-stationary noise conditions. Following this, Section 3 describes the proposed SEA, which includes the speech and noise LPC parameter estimation and the proposed AKF gain tuning method. Section 4 describes the experimental setup in terms of speech corpus, objective and subjective evaluation metrics, and specifications of the competing SEAs. The experimental results are then presented in Section 5. Finally, Section 6 gives some concluding remarks.

2. Background

2.1. Signal Model

The noisy speech $y(n)$, at discrete-time sample $n$, is assumed to be given by:
$$y(n) = s(n) + v(n), \qquad (1)$$
where $s(n)$ is the clean speech and $v(n)$ is uncorrelated additive noise. Since the AKF operates on a frame-by-frame basis for speech enhancement, a 20 ms rectangular window with 0% overlap is first used to convert $y(n)$ into frames [19], denoted by $y(n, l)$:
$$y(n, l) = s(n, l) + v(n, l), \qquad (2)$$
where $l \in \{0, 1, 2, \ldots, L-1\}$ is the frame index, $L$ is the total number of frames in an utterance, and $N$ is the total number of samples within each frame, i.e., $n \in \{0, 1, \ldots, N-1\}$.
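To make the framing step of eq. (2) concrete, the following is a minimal NumPy sketch. The function name frame_signal and the stacking strategy are illustrative assumptions, not the authors' code; any trailing samples that do not fill a whole frame are discarded.

import numpy as np

def frame_signal(y, fs=8000, frame_ms=20, shift_ms=None):
    """Convert y(n) into frames y(n, l) as in eq. (2): rectangular
    analysis frames of frame_ms duration; shift_ms defaults to the
    frame duration (i.e., 0% overlap)."""
    N = int(fs * frame_ms / 1000)                     # samples per frame
    hop = N if shift_ms is None else int(fs * shift_ms / 1000)
    L = 1 + (len(y) - N) // hop                       # total number of frames
    return np.stack([y[l * hop: l * hop + N] for l in range(L)])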
2.2. AKF for speech enhancement

For simplicity, the frame index is omitted in the AKF recursive equations. Each frame of the clean speech and noise signal in (2) can be represented with $p$th and $q$th order AR models, as in [21, Chapter 8]:
$$s(n) = -\sum_{i=1}^{p} a_i\, s(n-i) + w(n), \qquad (3)$$
$$v(n) = -\sum_{j=1}^{q} b_j\, v(n-j) + u(n), \qquad (4)$$
where $\{a_i;\, i = 1, 2, \ldots, p\}$ and $\{b_j;\, j = 1, 2, \ldots, q\}$ are the LPCs. $w(n)$ and $u(n)$ are assumed to be white noise with zero mean and variances $\sigma_w^2$ and $\sigma_u^2$, respectively.

The state vector $\mathbf{s}(n)$ corresponding to the clean speech samples $s(n)$ is represented as:
$$\mathbf{s}(n) = [s(n),\ s(n-1),\ s(n-2),\ \ldots,\ s(n-p+1)]^\top. \qquad (5)$$
The state transition matrix $\Phi_s$ of $\mathbf{s}(n)$ is given by:
$$\Phi_s = \begin{bmatrix} -a_1 & -a_2 & \ldots & -a_{p-1} & -a_p \\ 1 & 0 & \ldots & 0 & 0 \\ 0 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \ldots & 1 & 0 \end{bmatrix}. \qquad (6)$$
Equations (5)-(6) are used to form the state-space model (SSM) of the clean speech as:
$$\mathbf{s}(n) = \Phi_s\, \mathbf{s}(n-1) + \mathbf{d}_s\, w(n), \qquad (7)$$
where $\mathbf{d}_s = [1\ 0\ \ldots\ 0]^\top$.

The additive noise state vector $\mathbf{v}(n)$ and the corresponding state transition matrix $\Phi_v$ are given by:
$$\mathbf{v}(n) = [v(n),\ v(n-1),\ v(n-2),\ \ldots,\ v(n-q+1)]^\top, \qquad (8)$$
$$\Phi_v = \begin{bmatrix} -b_1 & -b_2 & \ldots & -b_{q-1} & -b_q \\ 1 & 0 & \ldots & 0 & 0 \\ 0 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \ldots & 1 & 0 \end{bmatrix}. \qquad (9)$$
Equations (8)-(9) are used to form the SSM of the additive noise as:
$$\mathbf{v}(n) = \Phi_v\, \mathbf{v}(n-1) + \mathbf{d}_v\, u(n), \qquad (10)$$
where $\mathbf{d}_v = [1\ 0\ \ldots\ 0]^\top$.

The SSMs of the speech and additive noise can be combined in augmented matrix form as:
$$\begin{bmatrix} \mathbf{s}(n) \\ \mathbf{v}(n) \end{bmatrix} = \begin{bmatrix} \Phi_s & 0 \\ 0 & \Phi_v \end{bmatrix} \begin{bmatrix} \mathbf{s}(n-1) \\ \mathbf{v}(n-1) \end{bmatrix} + \begin{bmatrix} \mathbf{d}_s & 0 \\ 0 & \mathbf{d}_v \end{bmatrix} \begin{bmatrix} w(n) \\ u(n) \end{bmatrix}. \qquad (11)$$
By setting $\mathbf{x}(n) = \begin{bmatrix} \mathbf{s}(n) \\ \mathbf{v}(n) \end{bmatrix}$, $\Phi = \begin{bmatrix} \Phi_s & 0 \\ 0 & \Phi_v \end{bmatrix}$, $\mathbf{d} = \begin{bmatrix} \mathbf{d}_s & 0 \\ 0 & \mathbf{d}_v \end{bmatrix}$, and $\mathbf{z}(n) = \begin{bmatrix} w(n) \\ u(n) \end{bmatrix}$, eq. (11) can be written as:
$$\mathbf{x}(n) = \Phi\, \mathbf{x}(n-1) + \mathbf{d}\,\mathbf{z}(n). \qquad (12)$$
The noisy observation $y(n)$ in eq. (2) can be represented in augmented matrix form as:
$$y(n) = \begin{bmatrix} \mathbf{c}_s^\top & \mathbf{c}_v^\top \end{bmatrix} \begin{bmatrix} \mathbf{s}(n) \\ \mathbf{v}(n) \end{bmatrix}, \qquad (13)$$
where $\mathbf{c}_s = [1\ 0\ \ldots\ 0]^\top$ and $\mathbf{c}_v = [1\ 0\ \ldots\ 0]^\top$ are $p \times 1$ and $q \times 1$ vectors, respectively. By setting $\mathbf{c}^\top = [\mathbf{c}_s^\top\ \mathbf{c}_v^\top]$, eq. (13) becomes:
$$y(n) = \mathbf{c}^\top \mathbf{x}(n). \qquad (14)$$

Equations (12) and (14) together form the augmented SSM (ASSM) of the AKF. For each noisy speech frame, the AKF computes an unbiased linear MMSE estimate, $\hat{\mathbf{x}}(n|n)$ at sample $n$, given the observed noisy speech $y(n)$, by using the following recursive equations [12]:
$$\hat{\mathbf{x}}(n|n-1) = \Phi\, \hat{\mathbf{x}}(n-1|n-1), \qquad (15)$$
$$\Psi(n|n-1) = \Phi\, \Psi(n-1|n-1)\, \Phi^\top + \mathbf{d}\mathbf{Q}\mathbf{d}^\top, \qquad (16)$$
$$\mathbf{K}(n) = \Psi(n|n-1)\,\mathbf{c}\,(\mathbf{c}^\top \Psi(n|n-1)\,\mathbf{c})^{-1}, \qquad (17)$$
$$\hat{\mathbf{x}}(n|n) = \hat{\mathbf{x}}(n|n-1) + \mathbf{K}(n)\,[y(n) - \mathbf{c}^\top \hat{\mathbf{x}}(n|n-1)], \qquad (18)$$
$$\Psi(n|n) = [\mathbf{I} - \mathbf{K}(n)\,\mathbf{c}^\top]\,\Psi(n|n-1), \qquad (19)$$
where the process noise covariance matrix $\mathbf{Q}$ is given by:
$$\mathbf{Q} = \begin{bmatrix} \sigma_w^2 & 0 \\ 0 & \sigma_u^2 \end{bmatrix}. \qquad (20)$$

For a noisy speech frame, the error covariances ($\Psi(n|n-1)$ and $\Psi(n|n)$, corresponding to $\hat{\mathbf{x}}(n|n-1)$ and $\hat{\mathbf{x}}(n|n)$) and the Kalman gain $\mathbf{K}(n)$ are continually updated on a sample-by-sample basis, while $(\{a_i\}, \sigma_w^2)$ and $(\{b_j\}, \sigma_u^2)$ remain unchanged. Once all noisy speech frames of a given utterance have been processed, synthesis over the enhanced frames gives the enhanced speech, $\hat{s}(n)$.
2.3. Paradigm shift of the AKF recursive equations

A paradigm shift of the recursive equations (15)-(19) transforms them into scalar form. This aids the understanding of the AKF operation in a speech enhancement context. For this purpose, at sample $n$, we first extract the estimated speech, $\hat{s}(n|n)$ (the output of the AKF), as $\mathbf{g}^\top \hat{\mathbf{x}}(n|n)$, where $\mathbf{g} = [1\ 0\ 0\ \ldots\ 0]^\top$ is a $(p+q) \times 1$ column vector. $\mathbf{g}^\top \hat{\mathbf{x}}(n|n)$ is simplified as [19]:
$$\mathbf{g}^\top \hat{\mathbf{x}}(n|n) = [1\ 0\ \ldots\ 0]\,[\hat{s}(n|n),\ \hat{s}(n|n-1),\ \ldots,\ \hat{s}(n|n-p+1),\ \hat{v}(n|n),\ \hat{v}(n|n-1),\ \ldots,\ \hat{v}(n|n-q+1)]^\top. \qquad (21)$$
The matrix multiplication in eq. (21) gives:
$$\mathbf{g}^\top \hat{\mathbf{x}}(n|n) = \hat{s}(n|n). \qquad (22)$$
Multiplying both sides of eq. (18) by $\mathbf{g}^\top$ gives:
$$\mathbf{g}^\top \hat{\mathbf{x}}(n|n) = \mathbf{g}^\top \hat{\mathbf{x}}(n|n-1) + \mathbf{g}^\top \mathbf{K}(n)\,[y(n) - \mathbf{c}^\top \hat{\mathbf{x}}(n|n-1)]. \qquad (23)$$
According to eq. (22), $\mathbf{g}^\top \hat{\mathbf{x}}(n|n-1)$ is given by:
$$\mathbf{g}^\top \hat{\mathbf{x}}(n|n-1) = \hat{s}(n|n-1). \qquad (24)$$
Also, $\mathbf{c}^\top \hat{\mathbf{x}}(n|n-1)$ can be re-written as:
$$\mathbf{c}^\top \hat{\mathbf{x}}(n|n-1) = [1\ 0\ \ldots\ 0\ \ 1\ 0\ \ldots\ 0]\,[\hat{s}(n|n-1),\ \ldots,\ \hat{s}(n|n-p+1),\ \hat{v}(n|n-1),\ \ldots,\ \hat{v}(n|n-q+1)]^\top. \qquad (25)$$
The matrix multiplication in eq. (25) gives:
$$\mathbf{c}^\top \hat{\mathbf{x}}(n|n-1) = \hat{s}(n|n-1) + \hat{v}(n|n-1). \qquad (26)$$
In eq. (23), $\mathbf{g}^\top \mathbf{K}(n)$ gives the first component, $K_0(n)$, of the Kalman gain vector $\mathbf{K}(n)$:
$$K_0(n) = \mathbf{g}^\top \mathbf{K}(n). \qquad (27)$$
Substituting eq. (17) into eq. (27) gives:
$$K_0(n) = \frac{\mathbf{g}^\top \Psi(n|n-1)\,\mathbf{c}}{\mathbf{c}^\top \Psi(n|n-1)\,\mathbf{c}}. \qquad (28)$$
With eq. (16), $\mathbf{c}^\top \Psi(n|n-1)\,\mathbf{c}$ is expressed as:
$$\mathbf{c}^\top \Psi(n|n-1)\,\mathbf{c} = \mathbf{c}^\top \Phi\, \Psi(n-1|n-1)\, \Phi^\top \mathbf{c} + \mathbf{c}^\top \mathbf{d}\mathbf{Q}\mathbf{d}^\top \mathbf{c}. \qquad (29)$$
The term $\mathbf{d}\mathbf{Q}\mathbf{d}^\top$ in eq. (29) is written as:
$$\mathbf{d}\mathbf{Q}\mathbf{d}^\top = \begin{bmatrix} \mathbf{d}_s & 0 \\ 0 & \mathbf{d}_v \end{bmatrix} \begin{bmatrix} \sigma_w^2 & 0 \\ 0 & \sigma_u^2 \end{bmatrix} \begin{bmatrix} \mathbf{d}_s^\top & 0 \\ 0 & \mathbf{d}_v^\top \end{bmatrix} = \begin{bmatrix} \sigma_w^2\, \mathbf{d}_s \mathbf{d}_s^\top & 0 \\ 0 & \sigma_u^2\, \mathbf{d}_v \mathbf{d}_v^\top \end{bmatrix}. \qquad (30)$$
Now, $\mathbf{c}^\top \mathbf{d}\mathbf{Q}\mathbf{d}^\top \mathbf{c}$ is simplified as:
$$\mathbf{c}^\top \mathbf{d}\mathbf{Q}\mathbf{d}^\top \mathbf{c} = \sigma_w^2\, (\mathbf{c}_s^\top \mathbf{d}_s)^2 + \sigma_u^2\, (\mathbf{c}_v^\top \mathbf{d}_v)^2 = \sigma_w^2 + \sigma_u^2. \qquad (31)$$
In eq. (29), $\mathbf{c}^\top \Phi\, \Psi(n-1|n-1)\, \Phi^\top \mathbf{c}$ is written as:
$$\mathbf{c}^\top \Phi\, \Psi(n-1|n-1)\, \Phi^\top \mathbf{c} = \mathbf{c}_s^\top \Phi_s \Psi_s(n-1|n-1) \Phi_s^\top \mathbf{c}_s + \mathbf{c}_v^\top \Phi_v \Psi_v(n-1|n-1) \Phi_v^\top \mathbf{c}_v = \alpha^2(n) + \beta^2(n), \qquad (32)$$
where $\alpha^2(n)$ and $\beta^2(n)$ represent the transmission of the a posteriori error variance of the speech and noise from the previous time sample, given by [19]:
$$\alpha^2(n) = \mathbf{c}_s^\top \Phi_s \Psi_s(n-1|n-1) \Phi_s^\top \mathbf{c}_s, \qquad (33)$$
$$\beta^2(n) = \mathbf{c}_v^\top \Phi_v \Psi_v(n-1|n-1) \Phi_v^\top \mathbf{c}_v. \qquad (34)$$
In equations (33)-(34), $\Psi_s(n-1|n-1)$ and $\Psi_v(n-1|n-1)$ represent the error covariance matrices of the speech and noise state estimates, forming $\Psi$ as:
$$\Psi = \begin{bmatrix} \Psi_s & 0 \\ 0 & \Psi_v \end{bmatrix}. \qquad (35)$$
Substituting equations (31)-(32) into eq. (29) gives:
$$\mathbf{c}^\top \Psi(n|n-1)\,\mathbf{c} = \alpha^2(n) + \beta^2(n) + \sigma_w^2 + \sigma_u^2. \qquad (36)$$
Now, $\mathbf{g}^\top \Psi(n|n-1)\,\mathbf{c}$ in eq. (28) can be expressed as:
$$\mathbf{g}^\top \Psi(n|n-1)\,\mathbf{c} = \mathbf{g}^\top \Phi\, \Psi(n-1|n-1)\, \Phi^\top \mathbf{c} + \mathbf{g}^\top \mathbf{d}\mathbf{Q}\mathbf{d}^\top \mathbf{c}. \qquad (37)$$
Using an expression similar to the derivation of eq. (32), it can be shown that:
$$\mathbf{g}^\top \Phi\, \Psi(n-1|n-1)\, \Phi^\top \mathbf{c} = \mathbf{c}_s^\top \Phi_s \Psi_s(n-1|n-1) \Phi_s^\top \mathbf{c}_s = \alpha^2(n). \qquad (38)$$
Also, using a derivation similar to eq. (31):
$$\mathbf{g}^\top \mathbf{d}\mathbf{Q}\mathbf{d}^\top \mathbf{c} = \sigma_w^2. \qquad (39)$$
Substituting equations (38)-(39) into eq. (37) yields:
$$\mathbf{g}^\top \Psi(n|n-1)\,\mathbf{c} = \mathbf{g}^\top \Psi(n|n-1)\,\mathbf{g} = \mathbf{c}^\top \Psi(n|n-1)\,\mathbf{g} = \alpha^2(n) + \sigma_w^2. \qquad (40)$$
Substituting equations (40) and (36) into eq. (28) gives:
$$K_0(n) = \frac{\alpha^2(n) + \sigma_w^2}{\alpha^2(n) + \beta^2(n) + \sigma_w^2 + \sigma_u^2}. \qquad (41)$$
Substituting equations (22), (24), (26)-(27) into eq. (23) yields:
$$\hat{s}(n|n) = [1 - K_0(n)]\,\hat{s}(n|n-1) + K_0(n)\,[y(n) - \hat{v}(n|n-1)]. \qquad (42)$$

Equation (42) implies that the estimated speech at sample $n$, $\hat{s}(n|n)$, is given by the sum of the predicted speech, $\hat{s}(n|n-1)$, and the measurement innovation, $[y(n) - \hat{v}(n|n-1)]$, weighted by the scalar Kalman gain $K_0(n)$. Therefore, the temporal trajectory of $K_0(n)$ is a useful indicator of the $\hat{s}(n|n)$ estimate. In practice, inaccurate estimates of $(\{a_i\}, \sigma_w^2)$ and $(\{b_j\}, \sigma_u^2)$ introduce bias in $K_0(n)$, resulting in a degraded $\hat{s}(n|n)$. Therefore, there should be a performance metric that can quantify the level of bias in $K_0(n)$. George et al. introduced a robustness metric and a sensitivity metric, which can be used to offset the bias in $K_0(n)$ [19]. In the AKF-based SEA [19, Section 3.2], the robustness and sensitivity metrics are defined by simplifying the mean squared error, $\mathbf{g}^\top \Psi(n|n)\,\mathbf{g}$, of the AKF output $\hat{s}(n|n)$. From eqs. (19) and (27):
$$\mathbf{g}^\top \Psi(n|n)\,\mathbf{g} = \mathbf{g}^\top \Psi(n|n-1)\,\mathbf{g} - K_0(n)\,\mathbf{c}^\top \Psi(n|n-1)\,\mathbf{g}. \qquad (43)$$
Denoting $\Psi_{0,0}(n|n) = \mathbf{g}^\top \Psi(n|n)\,\mathbf{g}$, substituting eqs. (36), (40), and (41) into eq. (43), and dividing through by $[\alpha^2(n) + \sigma_w^2]$ gives:
$$\Delta\Psi(n|n) + 1 = J_2(n) + J_1(n), \qquad (44)$$
where $\Delta\Psi(n|n) = \frac{\Psi_{0,0}(n|n) - \alpha^2(n)}{\alpha^2(n) + \sigma_w^2}$, and $J_2(n)$ and $J_1(n)$ are the robustness and sensitivity metrics of the AKF, given as [19]:
$$J_2(n) = \frac{\sigma_w^2}{\alpha^2(n) + \sigma_w^2}, \qquad (45)$$
$$J_1(n) = \frac{\beta^2(n) + \sigma_u^2}{\alpha^2(n) + \beta^2(n) + \sigma_w^2 + \sigma_u^2}. \qquad (46)$$
In AKF-RMBT [19], a $J_2(n)$ metric-based tuning of $K_0(n)$ was proposed for speech enhancement in colored noise conditions. In AKF-SMBT [20], an extension of AKF-RMBT using a $J_1(n)$ metric-based tuning of $K_0(n)$ was proposed. Section 2.4 demonstrates the shortcomings of AKF-RMBT and AKF-SMBT [19, 20] in terms of the biased interpretation of $K_0(n)$.
2.4. Impact of biased $K_0(n)$ on AKF-based speech enhancement in colored noise conditions

To analyze the shortcomings of AKF-RMBT and AKF-SMBT [19, 20], we conduct an experiment with the utterance sp27 ("Bring your best compass to the third class") of the NOIZEUS corpus [22, Chapter 12] (sampled at 8 kHz) corrupted with colored (factory) noise (taken from the RSG-10 database [23]) at a 5 dB SNR level. As in [19, 20], $p = 10$ and $q = 40$ have been used in this analysis.

In the oracle case, $(\{a_i\}, \sigma_w^2)$ and $(\{b_j\}, \sigma_u^2)$ are computed from the clean speech and the additive noise, respectively. During speech pauses of the observed noisy speech, since $s(n, l) = 0$, we have $\{a_i\} = 0$, $\hat{s}(n|n-1) = 0$, and $[\alpha^2(n) + \sigma_w^2] = 0$, which makes $K_0(n) = 0$ (according to eq. (41)). For example, $K_0(n) = 0$ between 0-0.2 s and 2.2-2.52 s of Fig. 1 (d). With $K_0(n) = 0$ and $\hat{s}(n|n-1) = 0$, eq. (42) implies that nothing is passed to the enhanced speech, i.e., $\hat{s}(n|n) = 0$ (e.g., 0-0.2 s and 2.2-2.52 s in Fig. 1 (e)). Conversely, during speech presence in the noisy speech, it is observed that $K_0(n)$ approaches 1 (e.g., 0.2-0.6 s of Fig. 1 (d)). With $K_0(n) \approx 1$, the first term in eq. (42) approaches 0, while in the second term the predicted noise, $\hat{v}(n|n-1)$, is subtracted from the observed noisy speech, $y(n)$; the measurement innovation $[y(n) - \hat{v}(n|n-1)]$ scaled by $K_0(n)$ almost retains the clean speech. As a result, the enhanced speech produced by AKF-Oracle (Fig. 1 (e)) is almost identical to the clean speech (Fig. 1 (a)).

In the non-oracle case, $(\{b_j\}, \sigma_u^2)$ are computed from the first noisy speech frame by assuming that it contains no speech. The computed $(\{b_j\}, \sigma_u^2)$ remain constant while processing all frames of a given utterance [19]. That means the total a priori prediction error of the noise model, $[\beta^2(n) + \sigma_u^2]$, also remains constant for all noisy speech frames. Conversely, the $(\{a_i\}, \sigma_w^2)$ computed from the noisy speech frames become biased, i.e., $(\{\tilde{a}_i\}, \tilde{\sigma}_w^2)$, which results in a biased total a priori prediction error of the speech model, $[\tilde{\alpha}^2(n) + \tilde{\sigma}_w^2]$. Since the silent frames of the noisy speech are completely filled with noise, this gives $[\tilde{\alpha}^2(n) + \tilde{\sigma}_w^2] \approx [\beta^2(n) + \sigma_u^2]$. According to eq. (41), this condition introduces a 0.5 bias in $\tilde{K}_0(n)$ (e.g., 0-0.2 s and 2.2-2.52 s of Fig. 1 (d)). With a 0.5-biased $\tilde{K}_0(n)$, eq. (42) implies that 50% of the measurement innovation, $[y(n) - \hat{v}(n|n-1)]$, leaks into the enhanced speech, $\hat{s}(n|n)$ (e.g., 0.2-0.6 s of Fig. 1 (f)). $(\{\tilde{a}_i\}, \tilde{\sigma}_w^2)$ also produce a biased $\tilde{K}_0(n)$ in speech regions. The biased $\tilde{K}_0(n)$ passes a significant residual noise to the enhanced speech, $\hat{s}(n|n)$, as shown in Fig. 1 (f).
In AKF-RMBT [19], the $J_2(n)$ metric is used to offset the bias in $\tilde{K}_0(n)$ as:
$$K_0'(n) = \tilde{K}_0(n)\,[1 - J_2(n)]. \qquad (47)$$
For the tuning of $\tilde{K}_0(n)$ using eq. (47) to be effective, $J_2(n) \approx 1$ is required during speech pauses. However, it is shown in [19, Fig. 4 (d)] that the colored noise effect in $(\{\tilde{a}_i\}, \tilde{\sigma}_w^2)$ changes the behaviour of $J_2(n)$ away from 1. To cope with this problem, George et al. employed a whitening filter, $H_w(z)$, on each noisy speech frame, yielding the pre-whitened speech $y_w(n, l)$. With the estimated noise LPCs $\{\hat{b}_j\}$, $H_w(z)$ is constructed as [19]:
$$H_w(z) = 1 + \sum_{j=1}^{q} \hat{b}_j\, z^{-j}. \qquad (48)$$
Now, $(\{a_i\}, \sigma_w^2)$ are computed from $y_w(n, l)$ using the autocorrelation method [21, Chapter 8]. It can be seen that the improved $(\{a_i\}, \sigma_w^2)$ enable $J_2(n) \approx 1$ during speech pauses, resulting in $K_0'(n) \approx 0$, as shown in Fig. 1 (c)-(d). However, the tuning process in eq. (47) results in a significantly reduced $K_0'(n)$ in speech regions as compared to the oracle $K_0(n)$ (Fig. 1 (d)). Therefore, $K_0'(n)$ causes over-suppression of the speech components, resulting in distorted speech, as shown in Fig. 1 (g).

To address this problem in AKF-RMBT [19], a $J_1(n)$ metric-based tuning of $\tilde{K}_0(n)$ was proposed in AKF-SMBT [20] as:
$$K_0''(n) = \tilde{K}_0(n) - J_1(n). \qquad (49)$$
It can be seen that $\tilde{K}_0(n)$ and $J_1(n)$ both approach 0.5 during speech pauses (e.g., 0-0.2 s and 2.2-2.52 s of Fig. 1 (c)-(d)), while $J_1(n) \approx 0$ in speech regions (e.g., 0.2-0.6 s of Fig. 1 (c)). Therefore, the subtraction of $J_1(n)$ from $\tilde{K}_0(n)$ (eq. (49)) results in $K_0''(n) \approx 0$ during speech pauses and $K_0''(n) \approx 1$ in speech regions. It is shown in Fig. 1 (d) that the under-estimation issue in the speech regions is minimized in $K_0''(n)$ as compared to $K_0'(n)$. As a result, AKF-SMBT [20] produced less distorted speech (Fig. 1 (h)) than that of [19] (Fig. 1 (g)).

Figure 1: Spectrograms of the: (a) clean speech (utterance sp27), (b) noisy speech ((a) corrupted with 5 dB factory noise), (c) $J_2(n)$ and $J_1(n)$ metrics, (d) oracle and non-oracle $K_0(n)$ with adjusted $K_0'(n)$ and $K_0''(n)$, and the spectrograms of enhanced speech produced by the: (e) AKF-Oracle, (f) AKF-Non-oracle, (g) AKF-RMBT [19], and (h) AKF-SMBT [20] methods, respectively.

Technically, $(\{b_j\}, \sigma_u^2)$ must be computed from each noisy speech frame in non-stationary noise conditions. Thus, the $(\{b_j\}, \sigma_u^2)$ computed from the first noisy speech frame in AKF-RMBT and AKF-SMBT [19, 20] do not adequately address non-stationary noise conditions. In addition, the whitening filter, $H_w(z)$ (eq. (48)), constructed with the constant $\{b_j\}$ in AKF-RMBT and AKF-SMBT [19, 20], fails to reduce the bias in the estimated $(\{a_i\}, \sigma_w^2)$ in such conditions. In light of these observations, AKF-RMBT and AKF-SMBT [19, 20] do not adequately address speech enhancement in non-stationary noise conditions. In Section 2.5, we further demonstrate the biasing impact of $K_0(n)$ on AKF-based speech enhancement in non-stationary noise conditions.
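The pre-whitening step of eq. (48) followed by the autocorrelation method amounts to one FIR filtering pass and one Levinson-Durbin recursion. The following is a minimal sketch under those assumptions; the helper names and the biased autocorrelation normalization are illustrative choices, not the authors' code.

import numpy as np
from scipy.signal import lfilter

def levinson_durbin(r, p):
    """Levinson-Durbin recursion [21, Ch. 8]: autocorrelation r ->
    LPCs {a_i} (sign convention of eqs. (3)-(4)) and error variance."""
    a = np.zeros(p + 1); a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:], err

def speech_lpcs_from_whitened(y_frame, b_hat, p=16):
    """Pre-whiten a noisy frame with Hw(z) = 1 + sum_j b_j z^{-j},
    eq. (48), then estimate ({a_i}, sigma_w^2) from the whitened frame
    via the autocorrelation method."""
    yw = lfilter(np.concatenate(([1.0], b_hat)), [1.0], y_frame)
    r = np.correlate(yw, yw, mode='full')[len(yw) - 1:] / len(yw)
    return levinson_durbin(r, p)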
2.5. Impact of biased $K_0(n)$ on AKF-based speech enhancement in non-stationary noise conditions

To analyze the impact of a biased $K_0(n)$ on AKF-based speech enhancement in non-stationary noise conditions, we repeat the experiment of Fig. 1, except that the utterance sp27 is corrupted with 5 dB babble noise (taken from the AURORA database [24]). In this study, a 32 ms rectangular window with 50% overlap [25, Section 7.2.1] was used for converting $y(n)$ into frames, $y(n, l)$ (as in eq. (2)). We have also used $p = 16$ and $q = 40$.

As demonstrated in Section 2.4, in the oracle case the silent frames of $y(n, l)$ give $s(n, l) = 0$, such that $a_i = 0$ for $i = 1, 2, \ldots, p$, which makes $\hat{s}(n|n-1) = 0$ as well as $[\alpha^2(n) + \sigma_w^2] = 0$ (e.g., 0-0.2 s and 2.2-2.52 s of Fig. 2 (d)). Substituting $[\alpha^2(n) + \sigma_w^2] = 0$ into eq. (41) gives $K_0(n) = 0$, which in turn gives $\hat{s}(n|n) = 0$ (eq. (42)), i.e., nothing is passed to the enhanced speech (e.g., 0-0.2 s and 2.2-2.52 s of Fig. 2 (c) and (g)). Conversely, $[\alpha^2(n) + \sigma_w^2] > [\beta^2(n) + \sigma_u^2]$ for speech-dominated frames, resulting in $K_0(n) \approx 1$ (e.g., 0.2-0.6 s of Fig. 2 (c)). As demonstrated in Section 2.4, $K_0(n) \approx 1$ almost passes the clean speech to the output. Therefore, the enhanced speech produced by AKF-Oracle (Fig. 2 (g)) is almost identical to the clean speech (Fig. 2 (a)).

In the non-oracle case, $(\{a_i\}, \sigma_w^2)$ and $(\{b_j\}, \sigma_u^2)$ are computed from each noisy speech frame, resulting in biased $(\{\tilde{a}_i\}, \tilde{\sigma}_w^2)$ and $(\{\tilde{b}_j\}, \tilde{\sigma}_u^2)$, which in turn give $[\alpha^2(n) + \sigma_w^2] \approx [\beta^2(n) + \sigma_u^2]$ during silence (e.g., 0-0.2 s and 2.2-2.52 s of Fig. 2 (e)). According to eq. (41), this condition introduces around a 0.5 bias in $K_0(n)$ (e.g., 0-0.2 s and 2.2-2.52 s of Fig. 2 (c)). During speech activity of $y(n, l)$, it is observed that $[\alpha^2(n) + \sigma_w^2] \geq [\beta^2(n) + \sigma_u^2]$, resulting in an under-estimated $K_0(n)$ as compared to the oracle case (e.g., 0.2-0.6 s of Fig. 2 (c)). The 0.5-biased $K_0(n)$ in silent regions leaks 50% of $[y(n) - \hat{v}(n|n-1)]$ into the enhanced speech (Fig. 2 (h)). In addition, the under-estimated $K_0(n)$ in speech regions produced distorted speech (Fig. 2 (h)). Also, the $J_2(n)$ and $J_1(n)$ metrics (Fig. 2 (f)) do not achieve characteristics similar to those found in the colored noise condition (Fig. 1 (c)), which leaves them inappropriate for tuning the biased $K_0(n)$ (Fig. 2 (c)) using equations (47) and (49).

In light of the observations in this section, the objective of the proposed SEA is twofold: firstly, to improve the estimates of $(\{a_i\}, \sigma_w^2)$ and $(\{b_j\}, \sigma_u^2)$ in real-life noise conditions so that $J_2(n)$ and $J_1(n)$ achieve characteristics similar to those found in colored noise conditions (Fig. 1 (c)); and secondly, to incorporate both the improved $J_2(n)$ and $J_1(n)$ metrics for tuning the biased $K_0(n)$ to achieve better noise reduction by the AKF, even for conditions that have time-varying amplitudes.

3. Proposed speech enhancement algorithm

Fig. 3 shows the block diagram of the proposed SEA. Firstly, $y(n)$ is converted into frames, $y(n, l)$, with the same setup as used in Section 2.5. The next step of the proposed SEA is the $(\{a_i\}, \sigma_w^2)$ and $(\{b_j\}, \sigma_u^2)$ estimation, as described in Section 3.1.
3.1. Proposed $(\{a_i\}, \sigma_w^2)$ and $(\{b_j\}, \sigma_u^2)$ estimation method

Figure 2: Biasing effect demonstration of $K_0(n)$; spectrogram of: (a) clean speech (utterance sp27), (b) noisy speech (sp27 corrupted with 5 dB babble noise), (c) $K_0(n)$ computed in the oracle and non-oracle cases, (d)-(e) $[\alpha^2(n) + \sigma_w^2]$ and $[\beta^2(n) + \sigma_u^2]$ computed in the oracle and non-oracle cases, (f) $J_2(n)$ and $J_1(n)$ metrics computed from the noisy speech in (b), and the spectrogram of enhanced speech produced by: (g) the AKF-Oracle method, and (h) the AKF-Non-oracle method.

The speech and noise LPC parameters, $(\{a_i\}, \sigma_w^2)$ and $(\{b_j\}, \sigma_u^2)$, are sensitive to noise. Since the clean speech, $s(n, l)$, and the noise, $v(n, l)$, are unobserved in practice, it is difficult to accurately estimate $(\{a_i\}, \sigma_w^2)$ and $(\{b_j\}, \sigma_u^2)$ from the noisy speech, $y(n, l)$. It has already been demonstrated that the $(\{a_i\}, \sigma_w^2)$ and $(\{b_j\}, \sigma_u^2)$ estimates in AKF-RMBT and AKF-SMBT [19, 20] do not address conditions that have time-varying amplitudes.

To cope with this problem, in this paper we first estimate the noise PSD, $\hat{P}_v(l, m)$, from each noisy speech frame using an SPP method [26] (described in Section 3.2). We then apply the inverse Fourier transform to $\hat{P}_v(l, m)$, yielding an estimate of the noise autocorrelation, $\hat{R}_{vv}(\tau)$, where $\tau$ is the autocorrelation lag. Solving $\hat{R}_{vv}(\tau)$ using the Levinson-Durbin recursion [21, Chapter 8] gives $(\{b_j\}, \sigma_u^2)$ ($q = 40$). As in [19], to reduce the bias in the estimated $(\{a_i\}, \sigma_w^2)$ for each noisy speech frame, we compute them from the corresponding pre-whitened speech, $y_w(n, l)$, using the autocorrelation method [21, Chapter 8]. The framewise $y_w(n, l)$ is obtained by applying a whitening filter, $H_w(z)$, to $y(n, l)$; with the estimated $\{b_j\}$, $H_w(z)$ is constructed as in eq. (48). Unlike AKF-RMBT and AKF-SMBT [19, 20], since $H_w(z)$ is constructed with the $\{b_j\}$ of each noisy speech frame, the estimates of $(\{a_i\}, \sigma_w^2)$ address conditions that have time-varying amplitudes.

Figure 3: Block diagram of the proposed AKF-based SEA.

3.2. Noise PSD estimation

In this paper, we incorporate an SPP method [26] to estimate the noise PSD from each noisy speech frame. For this purpose, the noisy speech, $y(n)$ (eq. (1)), is analyzed frame-wise using the short-time Fourier transform (STFT):
$$Y_l(m) = S_l(m) + V_l(m), \qquad (50)$$
where $Y_l(m)$, $S_l(m)$, and $V_l(m)$ denote the complex-valued STFT coefficients of the noisy speech, clean speech, and noise signal, respectively, for time-frame index $l$ and frequency bin $m \in \{0, 1, 2, \ldots, M-1\}$, with $M$ being the total number of frequency bins within each frame. A Hamming window with 50% overlap is used in the STFT analysis [25, Section 7.2.1]. In polar form, $Y_l(m) = R_l(m)e^{j\phi_l(m)}$, $S_l(m) = A_l(m)e^{j\varphi_l(m)}$, and $V_l(m) = D_l(m)e^{j\theta_l(m)}$, where $R_l(m)$, $A_l(m)$, and $D_l(m)$ are the magnitude spectra of the noisy speech, clean speech, and noise signal, respectively, and $\phi_l(m)$, $\varphi_l(m)$, and $\theta_l(m)$ are the corresponding phase spectra. We process each frequency bin of the single-sided noisy speech power spectrum, $R_l^2(m)$ (where $m \in \{0, 1, \ldots, 128\}$, containing the DC and Nyquist frequency components), to estimate the noise power spectrum, $\hat{D}_l^2(m)$. To initialize the algorithm, we assume the first frame ($l = 0$) of $R_0^2(m)$ to be silent, which gives an estimate of the noise power as $\hat{D}_0^2(m) = R_0^2(m)$. The noise PSD, $\hat{\lambda}_0(m)$, is also initialized as $\hat{\lambda}_0(m) = \hat{D}_0^2(m)$. For $l \geq 1$, using the speech presence uncertainty principle, an MMSE estimate of $\hat{D}_l^2(m)$ at the $m$th frequency bin is given by [26]:
$$\hat{D}_l^2(m) = P(H_0^m|R_l(m))\, R_l^2(m) + P(H_1^m|R_l(m))\, \hat{\lambda}_{l-1}(m), \qquad (51)$$
where $P(H_0^m|R_l(m))$ and $P(H_1^m|R_l(m))$ are the conditional probabilities of speech absence and speech presence, given $R_l(m)$, at the $m$th frequency bin.

The simplified $P(H_1^m|R_l(m))$ estimate is given by¹ [26]:
$$P(H_1^m|R_l(m)) = \left[1 + (1 + \xi_{\mathrm{opt}}) \exp\!\left(-\frac{R_l^2(m)}{\hat{\lambda}_{l-1}(m)} \cdot \frac{\xi_{\mathrm{opt}}}{1 + \xi_{\mathrm{opt}}}\right)\right]^{-1}, \qquad (52)$$
where $\xi_{\mathrm{opt}}$ is the optimal a priori SNR.

¹The simplification is a result of assuming the a priori probabilities of speech absence and presence, $P(H_0)$ and $P(H_1)$, to be equal: $P(H_0) = P(H_1)$.
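Eq. (52) is a per-bin closed-form expression; the following is a minimal sketch of it in NumPy. The function name and the small floor on the previous PSD estimate (to guard against division by zero) are assumptions for illustration.

import numpy as np

XI_OPT = 10.0 ** (15.0 / 10.0)   # 10*log10(xi_opt) = 15 dB [26]

def speech_presence_prob(R2, lam_prev):
    """P(H1|R_l(m)) of eq. (52), under P(H0) = P(H1). R2 is the noisy
    power spectrum R_l^2(m); lam_prev is lambda_{l-1}(m)."""
    ratio = R2 / np.maximum(lam_prev, 1e-12)   # guard against division by zero
    return 1.0 / (1.0 + (1.0 + XI_OPT) *
                  np.exp(-ratio * XI_OPT / (1.0 + XI_OPT)))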
the corresponding pre-whitened speech, yw (n, l) using the The optimal choice for ξopt is found as 10 log10 (ξopt ) = autocorrelation method [21, Chapter 8]. The framewise 15 dB [26], and P (H0m |Rl (m)) is given by P (H0m |Rl (m)) = yw (n, l) is obtained by employing a whitening filter, Hw (z) 1 − P (H1m |Rl (m)). However, if P (H1m |Rl (m)) = 1 occurs to y(n, l). With estimated {bj }; Hw (z) is constructed as at mth frequency bin, it causes stagnation, which stops in eq. (48). Unlike AKF-RMBT and AKF-SMBT [19, 20], updating D̂l2 (m) (eq. (51)). Unlike monitoring the status since Hw (z) is constructed with {bj } for each noisy speech of P (H1m |Rl (m)) = 1 for a long time as reported in [26], we 2 frame, the estimates of ({ai }, σw ) address conditions that have time-varying amplitudes. 1 The simplification is a result of assuming the a priori probability of the speech absence and presence, P (H0 ) and P (H1 ) as: P (H0 ) = P (H1 ). simply resolve this issue by setting P (H1m |Rl (m)) = 0.99 We observed that the total a priori prediction errors once this condition occurs prior to update D̂l2 (m). of the speech and noise AR models; [α2 (n) + σw 2 ] and It is observed that Rl2 (m) is completely filled with ad- 2 2 [β (n) + σu ] can be adopted as a speech activity detec- ditive noise during silent activity, thus giving an estimate tor for each sample of y(n, l). For example, during speech of noise power. Therefore, unlike updating D̂l2 (m) using pauses, the condition [β 2 (n) + σu2 ] ≥ [α2 (n) + σw 2 ] holds eq. (51) by existing method [26], we do it differently de- (e.g., 0-0.2 s or 2.2-2.52 s of Fig. 4 (a)). Conversely, pending on the silent/speech activity of Rl2 (m) (for each [α2 (n) + σw 2 ] >> [β 2 (n) + σu2 ] is found in speech regions frequency bin m). Specifically, at mth frequency bin (l ≥ (e.g., 0.2-0.6 s of Fig. 4 (a)). Therefore, at sample n, if 1), if P (H1m |Rl (m)) < 0.5, Rl2 (m) yields silent activity, re- [β 2 (n) + σu2 ] ≥ [α2 (n) + σw2 ]; y(n, k) is termed as silent and sulting in D̂l2 (m) = Rl2 (m), otherwise, D̂l2 (m) is estimated set the decision parameter (denoted by ζ) as ζ(n) = 0; oth- using eq. (51). With estimated D̂l2 (m), λ̂l (m) is updated erwise speech activity occurs and ζ(n) = 1. It can be seen as: λ̂l (m) = η λ̂l−1 (m) + (1 − η)D̂l2 (m), Detected Reference (53) 1 where the smoothing constant, η is set to 0.9. Amplitude The 256-point noise PSD is given as: P̂v (l, m) = λ̂l (m), 0 where the components of P̂v (l, m) at mǫ{1, 2, . . . , 127} are flipped to that of the mǫ{129, 130, . . . , 255} of P̂v (l, m). -1 0.5 1 1.5 2 2.5 3.3. Proposed K0 (n) tuning method Time (s) Firstly, the AKF is constructed with the estimated 2 ({ai }, σw ) and ({bj }, σu2 ). Then we extract the tuning pa- Figure 5: Comparing the detected flags from noisy speech in Fig. 2 rameters as shown in Fig. 4. It can be seen from Fig. 4 (a) (b) to that of the reference corresponding to Fig. 2 (a). that [α2 (n) + σw 2 ] and [β 2 (n) + σu2 ] achieves very similar characteristics as like AKF-Oracle method (Fig. 2 (d)). from Fig. 5 that the detected flags (0/1: silent/speech) by The improvement of these parameters also causes J2 (n) proposed method is closely similar to that of the reference and J1 (n) metrics (Fig. 4 (b)) to achieve the similar char- (0/-1: silent/speech). The reference flags are generated by acteristics as appear in the colored noise condition (Fig. 1 visually inspecting the corresponding clean speech (Fig. 2 (c)). Therefore, J2 (n) and J1 (n) metrics (Fig. 
3.3. Proposed $K_0(n)$ tuning method

Firstly, the AKF is constructed with the estimated $(\{a_i\}, \sigma_w^2)$ and $(\{b_j\}, \sigma_u^2)$. Then we extract the tuning parameters, as shown in Fig. 4. It can be seen from Fig. 4 (a) that $[\alpha^2(n) + \sigma_w^2]$ and $[\beta^2(n) + \sigma_u^2]$ achieve characteristics very similar to those of the AKF-Oracle method (Fig. 2 (d)). The improvement of these parameters also causes the $J_2(n)$ and $J_1(n)$ metrics (Fig. 4 (b)) to achieve characteristics similar to those that appear in the colored noise condition (Fig. 1 (c)). Therefore, the $J_2(n)$ and $J_1(n)$ metrics (Fig. 4 (b)) are now eligible to dynamically offset the bias in $K_0(n)$, even in non-stationary noise conditions. However, as demonstrated in Section 2.4, the $J_2(n)$ metric is useful for tuning $K_0(n)$ during speech pauses of the noisy speech, since it results in an under-estimated $K_0(n)$ during speech presence. On the contrary, since the $J_1(n)$ metric approaches 0 in speech regions of the noisy speech, according to eq. (49), it minimizes the under-estimation of $K_0(n)$. In light of these observations, for each sample of $y(n, l)$, we incorporate the $J_2(n)$ metric during speech pauses and the $J_1(n)$ metric during speech presence to dynamically offset the bias in $K_0(n)$.

Figure 4: Comparing the estimated: (a) $[\alpha^2(n) + \sigma_w^2]$ and $[\beta^2(n) + \sigma_u^2]$, and (b) $J_2(n)$ and $J_1(n)$ metrics from the noisy speech in Fig. 2 (b).

We observed that the total a priori prediction errors of the speech and noise AR models, $[\alpha^2(n) + \sigma_w^2]$ and $[\beta^2(n) + \sigma_u^2]$, can be adopted as a speech activity detector for each sample of $y(n, l)$. For example, during speech pauses, the condition $[\beta^2(n) + \sigma_u^2] \geq [\alpha^2(n) + \sigma_w^2]$ holds (e.g., 0-0.2 s and 2.2-2.52 s of Fig. 4 (a)). Conversely, $[\alpha^2(n) + \sigma_w^2] \gg [\beta^2(n) + \sigma_u^2]$ is found in speech regions (e.g., 0.2-0.6 s of Fig. 4 (a)). Therefore, at sample $n$, if $[\beta^2(n) + \sigma_u^2] \geq [\alpha^2(n) + \sigma_w^2]$, then $y(n, l)$ is deemed silent and the decision parameter (denoted by $\zeta$) is set to $\zeta(n) = 0$; otherwise, speech activity occurs and $\zeta(n) = 1$. It can be seen from Fig. 5 that the flags detected from the noisy speech by the proposed method (0/1: silent/speech) closely match the reference flags (0/-1: silent/speech). The reference flags are generated by visually inspecting the corresponding clean speech frames (Fig. 2 (a)).

Figure 5: Comparing the detected flags from the noisy speech in Fig. 2 (b) to the reference flags corresponding to Fig. 2 (a).

At sample $n$, if $\zeta(n) = 0$, the adjusted $K_0'(n)$ in the proposed SEA is given by:
$$K_0'(n) = K_0(n)\,[1 - J_2(n)] = \left[\frac{\alpha^2(n) + \sigma_w^2}{\alpha^2(n) + \beta^2(n) + \sigma_w^2 + \sigma_u^2}\right]\left[\frac{\alpha^2(n)}{\alpha^2(n) + \sigma_w^2}\right] = \frac{\alpha^2(n)}{\alpha^2(n) + \beta^2(n) + \sigma_w^2 + \sigma_u^2}. \qquad (54)$$
To justify the validity of $K_0'(n)$, Fig. 6 (a) shows the numerator and the denominator of eq. (54) computed from the noisy speech in Fig. 2 (b). It can be seen that $\alpha^2(n) \approx 0$ during speech pauses (e.g., 0-0.2 s and 2.2-2.52 s of Fig. 6 (a)). According to eq. (54), this results in $K_0'(n) \approx 0$. Since $[\alpha^2(n) + \beta^2(n) + \sigma_w^2 + \sigma_u^2] \gg \alpha^2(n)$ occurs during speech presence (e.g., 0.2-0.6 s of Fig. 6 (a)), it may result in an under-estimated $K_0'(n)$, as in the colored noise experiment (Fig. 1 (d)). Thus, $J_2(n)$ metric-based tuning of $K_0(n)$ during speech activity of $y(n, l)$ is inappropriate.

As discussed earlier, we employ the $J_1(n)$ metric to offset the bias in $K_0(n)$ during speech activity of $y(n, l)$. However, our further investigation of the $J_1(n)$ metric-based tuning in eq. (49) reveals that the subtraction of $J_1(n)$ from the biased $K_0(n)$ still produces an under-estimated gain, as shown in Fig. 1 (d). To cope with this problem, at sample $n$, if $\zeta(n) = 1$, we propose the tuning of the biased $K_0(n)$ using the $J_1(n)$ metric as:
$$K_0'(n) = K_0(n)\,[1 - J_1(n)] = \left[\frac{\alpha^2(n) + \sigma_w^2}{\alpha^2(n) + \beta^2(n) + \sigma_w^2 + \sigma_u^2}\right]\left[\frac{\alpha^2(n) + \sigma_w^2}{\alpha^2(n) + \beta^2(n) + \sigma_w^2 + \sigma_u^2}\right] = \frac{[\alpha^2(n) + \sigma_w^2]^2}{[\alpha^2(n) + \beta^2(n) + \sigma_w^2 + \sigma_u^2]^2}. \qquad (55)$$
To justify the validity of this $K_0'(n)$, the numerator and the denominator of eq. (55) are shown in Fig. 6 (b). It can be seen that $[\alpha^2(n) + \beta^2(n) + \sigma_w^2 + \sigma_u^2]^2 \geq [\alpha^2(n) + \sigma_w^2]^2$ during speech presence of $y(n, l)$ (e.g., 0.2-0.6 s of Fig. 6 (b)), which results in $K_0'(n)$ approaching 1.

Figure 6: $K_0'(n)$ responses in terms of: (a) $\alpha^2(n)$ and $\alpha^2(n) + \beta^2(n) + \sigma_w^2 + \sigma_u^2$, and (b) $[\alpha^2(n) + \sigma_w^2]^2$ and $[\alpha^2(n) + \beta^2(n) + \sigma_w^2 + \sigma_u^2]^2$, where the same experimental setup as in Fig. 2 is used.

To evaluate the performance of the proposed tuning method in non-stationary noise conditions, we conduct an experiment with the same setup as in Fig. 2. It can be seen from Fig. 7 (a) that the adjusted $K_0(n)$ produced by the proposed method shows significantly less bias and is closely similar to the oracle $K_0(n)$. Specifically, it maintains a smooth transition at the edges, and the temporal changes in speech regions closely match those of the oracle $K_0(n)$. Amongst the benchmark methods, the adjusted $K_0(n)$ of AKF-SMBT [20] shows less bias than that of AKF-RMBT [19]. However, AKF-SMBT [20] still produces an under-estimated $K_0(n)$ in speech regions. We also repeat the experiment of Fig. 7 (a), except that the utterance sp27 is corrupted with 5 dB factory noise, to evaluate the performance of the proposed tuning method in colored noise conditions. Fig. 7 (b) reveals that the biasing effect is reduced significantly in the adjusted $K_0(n)$ of the proposed method, which is closely similar to the oracle $K_0(n)$. As in the previous experiment, AKF-SMBT [20] also produces an under-estimated $K_0(n)$ in speech regions, and the AKF-RMBT method [19] produced the most under-estimated $K_0(n)$ amongst the competing methods.

Figure 7: Comparing the $K_0(n)$ trajectories corresponding to the AKF-Oracle method, the proposed method, the AKF-RMBT method [19], and the AKF-SMBT method [20], where the utterance sp27 is corrupted with 5 dB: (a) babble and (b) factory noises.

In light of this comparative study, the reduced-biased $K_0(n)$ achieved by the proposed tuning algorithm will be of benefit to the AKF for speech enhancement in various noise conditions.
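The complete decision rule of Section 3.3 reduces to a short branch over the same four scalars used earlier. The sketch below is illustrative (the function name is an assumption); it returns the adjusted gain of eq. (54) when the detector flags silence and of eq. (55) otherwise.

def proposed_tuned_gain(alpha2, beta2, var_w, var_u):
    """Proposed dynamic tuning of Section 3.3: the zeta(n) decision
    selects eq. (54) during silence and eq. (55) during speech."""
    total = alpha2 + beta2 + var_w + var_u
    if beta2 + var_u >= alpha2 + var_w:          # zeta(n) = 0: silent
        return alpha2 / total                    # eq. (54)
    return ((alpha2 + var_w) / total) ** 2       # eq. (55)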
Speech corpus α2 (n) + σw 2 For the objective experiments, 30 phonetically balanced 2 2 2 + σ2 , utterances belonging to six speakers (three male and three α (n) + β (n) + σw u female) are taken from the NOIZEUS corpus [22, Chapter [α2 (n) + σw 2 2 ] 12]. The noisy speech for the test set is generated by mix- = . (55) [α2 (n) 2 + β (n) + σw 2 + σ 2 ]2 u ing the clean speech with real-world non-stationary (babble and street) and colored (factory and f16 ) noises at multiple To justify the validity of K0′ (n), the numerator and the SNR levels (from -5dB to +15 dB, in 5 dB increments). denominator of eq. (55) are shown in Fig. 6 (b). It can be The babble noise is taken from AURORA database [24], seen that [α2 (n)+β 2 (n)+σw2 +σu2 ]2 ≥ [α2 (n)+σw 2 2 ] during the street noise is taken from Nonspeech database [27], the speech presence of y(n, l) (e.g., 0.2-0.6 s of Fig. 6 (b)), and the factory and f16 noises are taken from RSG-10 which results in K0′ (n) approaching 1. database [23]. All the clean speech and noise recordings To evaluate the performance of the proposed tuning are single-channel with a sampling frequency of 8 kHz. The method in non-stationary noise conditions, we conduct noisy speech dataset provides 30 examples per condition an experiment with the same setup as in Fig. 2. It can with 20 total conditions. be seen from Fig. 7 (a) that the adjusted K0 (n) by the proposed method shows significantly less bias and closely 4.2. Objective evaluation similar to that of the oracle K0 (n). Specifically, it main- Objective measures are used to evaluate the quality a 1 and intelligibility of the enhanced speech with respect to 0.8 the corresponding clean speech. The following objective 0.6 0.4 evaluation metrics have been used in this paper: 0.2 0 • Perceptual Evaluation of Speech Quality (PESQ) for b 1 objective quality evaluation [28]. It ranges between 0.8 0.6 -0.5 and 4.5. A higher PESQ score indicates better 0.4 speech quality. 0.2 0 0.5 1 1.5 2 2.5 • The short-time objective intelligibility (STOI) mea- Time (s) sure for objective intelligibility evaluation [29]. It ranges between 0 and 1 (or 0 to 100%). A higher STOI score indicates better speech intelligibility. Figure 7: Comparing K0 (n) trajectories corresponding to the AKF- Oracle method, proposed method , AKF-RMBT method [19], and We also analyzed the spectrograms of enhanced speech AKF-SMBT method [20], where the utterance sp27 is corrupted with produced by the competing SEAs to visually quantify the 5 dB: (a) babble and (b) factory noises. level of residual noise as well as distortion. 4.3. Subjective evaluation v. AKF-IT [12]: AKF operates with two iterations, 2 The subjective evaluation was carried out through a where the initial ({ai }, σw ) and ({bj }, σu2 ) are com- series of blind AB listening tests [4, Section 3.3.4]. To per- puted from each frame of the noisy speech followed form the tests, we generate a set of stimuli by corrupting by re-estimation of them from the processed speech the utterances sp05 and sp27 from the NOIZEUS corpus frame after first iteration, p = 10, q = 10, wf = 20 [22, Chapter 12]. The reference transcript for utterance ms, sf = 0 ms, and rectangular window is used for sp05 is: “Wipe the grease off his dirty face”, and is cor- framing. rupted with 5 dB factory noise. The reference transcript vi. 
vi. SBIT-KF [15]: Subband iterative KF with two iterations, where the initial $(\{a_i\}, \sigma_w^2)$ are computed from each frame of the noisy speech, followed by re-estimation from the processed speech frame after the first iteration; $\sigma_v^2$ is estimated from each noisy speech frame at the first iteration, $p = 8$, $w_f = 32$ ms, $s_f = 0$ ms, and a rectangular window is used for framing.

vii. AKF-RMBT [19]: Robustness metric-based tuning of the AKF, where $(\{b_j\}, \sigma_u^2)$ are computed from the first noisy speech frame, which is considered silent; $(\{a_i\}, \sigma_w^2)$ are computed from the pre-whitened speech frame; $p = 10$, $q = 40$, $w_f = 20$ ms, $s_f = 0$ ms, and a rectangular window is used for framing.

viii. AKF-SMBT [20]: Sensitivity metric-based tuning of the AKF, where $(\{b_j\}, \sigma_u^2)$ are computed from the first noisy speech frame, which is considered silent; $(\{a_i\}, \sigma_w^2)$ are computed from the pre-whitened speech frame; $p = 10$, $q = 40$, $w_f = 32$ ms, $s_f = 16$ ms, and a rectangular window is used for framing.

ix. Proposed: Robustness and sensitivity metrics-based tuning of the AKF, where $(\{b_j\}, \sigma_u^2)$ are computed from each frame of the estimated noise and $(\{a_i\}, \sigma_w^2)$ are computed from each frame of the pre-whitened speech; $p = 16$, $q = 40$, $w_f = 32$ ms, $s_f = 16$ ms, a rectangular window is used for generating the time-domain frames, and a Hamming window is used for acoustic-domain analysis and synthesis.
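For reproducing the objective scores of Section 4.2, open-source implementations of PESQ and STOI are available; the paper does not specify its scoring tooling, so the third-party pesq and pystoi packages and the file names below are illustrative assumptions only.

# pip install pesq pystoi soundfile   (one possible tooling, not the authors')
import soundfile as sf
from pesq import pesq
from pystoi import stoi

clean, fs = sf.read('sp27.wav')              # hypothetical file names
enhanced, _ = sf.read('sp27_enhanced.wav')

pesq_score = pesq(fs, clean, enhanced, 'nb')     # narrow-band PESQ at 8 kHz [28]
stoi_score = 100.0 * stoi(clean, enhanced, fs)   # STOI in percent [29]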
5. Results and discussions

5.1. Objective quality evaluation

Fig. 8 shows the average PESQ score for each SEA. It can be seen that the AKF-Oracle method attained the highest average PESQ score for all of the tested conditions. This is due to $(\{a_i\}, \sigma_w^2)$ and $(\{b_j\}, \sigma_u^2)$ being computed from the clean speech and the noise signal, which are unobserved in practice. Thus, AKF-Oracle provides an indication of the upper bound for the AKF in terms of average PESQ score. Conversely, the average PESQ score for Noisy indicates the lower bound for each of the tested conditions. The proposed SEA consistently produces a higher average PESQ score than the competing SEAs across the tested conditions. The average PESQ score for the proposed method is also very similar to that of the AKF-Oracle method. This is likely due to the reduced-biased AKF gain achieved by the proposed tuning algorithm (Fig. 7). Amongst the competing methods, AKF-SMBT [20] produced relatively higher average PESQ scores for each of the tested conditions (Fig. 8). In light of this study, it is evident that the proposed SEA produces higher-quality enhanced speech than the competing SEAs across the tested conditions.

Figure 8: Average PESQ score for each SEA found over all frames for each condition described in Section 4.1.

Figure 9: Average STOI score for each SEA found over all frames for each condition described in Section 4.1.

Figure 10: Spectrograms of: (a) clean speech (utterance sp27), (b) noisy speech ((a) corrupted with 5 dB babble noise) (PESQ=1.72), and enhanced speech produced by: (c) AKF-IT [12] (PESQ=1.91), (d) MMSE-STSA [8] (PESQ=2.06), (e) SBIT-KF [15] (PESQ=2.17), (f) AKF-RMBT [19] (PESQ=2.22), (g) AKF-SMBT [20] (PESQ=2.30), (h) Proposed (PESQ=2.49), and (i) AKF-Oracle (PESQ=2.69).

5.2. Objective intelligibility evaluation

Fig. 9 shows the average STOI score for each SEA. As in Section 5.1, the AKF-Oracle method achieves the highest average STOI score for each tested condition. On the other hand, Noisy shows the lowest average STOI score in every tested condition. The proposed method attained the highest average STOI score for each tested condition, apart from the AKF-Oracle method. Amongst the competing methods, AKF-SMBT [20] attained the highest average STOI scores. In light of this study, it is evident that the proposed SEA produces more intelligible enhanced speech than the competing SEAs across the tested conditions.

5.3. Spectrogram analysis of each SEA

Fig. 10 (a) shows the spectrogram of the clean speech (female utterance sp27). The clean speech is corrupted by babble noise at a 5 dB SNR level to create the noisy speech (Fig. 10 (b)). This is a particularly tough condition for speech enhancement, since the background noise exhibits characteristics similar to the speech produced by the target speaker. The enhanced speech produced by AKF-IT [12] is shown in Fig. 10 (c); it suffers from significant speech distortion, and a significant residual background noise is also present. Fig. 10 (d) shows the enhanced speech produced by MMSE-STSA [8]. This method produced less distorted speech than AKF-IT (Fig. 10 (c)); however, residual background noise still remains. Less residual background noise is present in the enhanced speech produced by SBIT-KF [15] (Fig. 10 (e)) than MMSE-STSA (Fig. 10 (d)); however, the speech is more distorted. The AKF-RMBT [19] method produced less distorted speech (Fig. 10 (f)) than SBIT-KF (Fig. 10 (e)). The enhanced speech produced by AKF-SMBT [20] (Fig. 10 (g)) shows comparatively less distortion as well as less residual background noise than that of AKF-RMBT (Fig. 10 (f)). The enhanced speech produced by the proposed method is shown in Fig. 10 (h); there is less residual background noise in the enhanced speech than with AKF-SMBT (Fig. 10 (g)). Finally, the enhanced speech produced by the AKF-Oracle method is shown in Fig. 10 (i), which is the most similar to the clean speech in Fig. 10 (a).
This is because AKF-Oracle uses the clean speech and noise (unobserved in practice) for the LPC parameter estimation.

5.4. Subjective evaluation by AB listening test

The mean subjective preference score (%) for each SEA is shown in Figs. 11-12. The colored (factory) noise experiment in Fig. 11 reveals that the enhanced speech produced by the proposed method is widely preferred by the listeners (74%) over the competing methods, apart from the clean speech (100%) and the AKF-Oracle method (84%). AKF-SMBT [20] is found to be the most preferred method (around 67%) amongst the benchmark methods. For the non-stationary (babble) noise experiment (Fig. 12), the listeners again preferred the proposed method (72%) over the competing methods, with only the clean speech (100%) and AKF-Oracle (82%) being more preferred. As in the previous experiment, AKF-SMBT [20] was the most preferred (65%) amongst the competing methods, with AKF-RMBT [19] (around 58%) being the next most preferred. In light of the blind AB listening tests, it is evident that the enhanced speech of the proposed method exhibits the best perceived quality amongst all tested methods for both male and female utterances corrupted by real-life colored as well as non-stationary noises.

Figure 11: The mean preference score (%) comparison between the proposed and benchmark SEAs for the utterance sp05 corrupted with 5 dB colored factory noise.

Figure 12: The mean preference score (%) comparison between the proposed and benchmark SEAs for the utterance sp27 corrupted with 5 dB non-stationary babble noise.

6. Conclusion

This paper investigates robustness and sensitivity metrics-based tuning of the AKF gain for single-channel speech enhancement in real-life noise conditions. At first, an SPP method estimates the noise PSD from each noisy speech frame to compute the noise LPC parameters. A whitening filter is also constructed with the estimated noise LPCs to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. The AKF is then constructed with the estimated speech and noise LPC parameters. To achieve better noise reduction, the robustness metric is employed to offset the bias in the AKF gain during speech pauses of the noisy speech, and the sensitivity metric during speech presence. The speech and noise model parameters are adopted as a speech activity detector. It is shown that the reduced-biased AKF gain achieved by the proposed tuning algorithm provides the capability of speech enhancement in various noise conditions. Objective and subjective scores on the NOIZEUS corpus demonstrate that the proposed method outperforms the benchmark methods in real-life noise conditions for a wide range of SNR levels.

CRediT authorship contribution statement

Sujan Kumar Roy: Preliminary experiments, Experiment design, Conducted the experiments, Code writing, Analysis of results, Literature review, and Manuscript writing. Kuldip K. Paliwal: Supervision and aided with editing the final manuscript.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References
[1] S. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Transactions on Acoustics, Speech, and Signal Processing 27 (1979) 113–120. doi:10.1109/TASSP.1979.1163209.
[2] M. Berouti, R. Schwartz, J. Makhoul, Enhancement of speech corrupted by acoustic noise, IEEE International Conference on Acoustics, Speech, and Signal Processing 4 (1979) 208–211.
[3] S. Kamath, P. Loizou, A multi-band spectral subtraction method for enhancing speech corrupted by colored noise, IEEE International Conference on Acoustics, Speech, and Signal Processing 4 (2002) 4160–4164. doi:10.1109/ICASSP.2002.5745591.
[4] K. Paliwal, K. Wójcicki, B. Schwerin, Single-channel speech enhancement using spectral subtraction in the short-time modulation domain, Speech Communication 52 (5) (2010) 450–475. doi:10.1016/j.specom.2010.02.004.
[5] J. S. Lim, A. V. Oppenheim, Enhancement and bandwidth compression of noisy speech, Proceedings of the IEEE 67 (12) (1979) 1586–1604. doi:10.1109/PROC.1979.11540.
[6] P. Scalart, J. V. Filho, Speech enhancement based on a priori signal to noise estimation, IEEE International Conference on Acoustics, Speech, and Signal Processing 2 (1996) 629–632.
[7] C. Plapous, C. Marro, L. Mauuary, P. Scalart, A two-step noise reduction technique, IEEE International Conference on Acoustics, Speech, and Signal Processing 1 (2004) 289–292.
[8] Y. Ephraim, D. Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (6) (1984) 1109–1121. doi:10.1109/TASSP.1984.1164453.
[9] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing 33 (2) (1985) 443–445. doi:10.1109/TASSP.1985.1164550.
[10] K. Paliwal, B. Schwerin, K. Wójcicki, Speech enhancement using a minimum mean-square error short-time spectral modulation magnitude estimator, Speech Communication 54 (2) (2012) 282–305. doi:10.1016/j.specom.2011.09.003.
the noisy speech to that of the sensitivity metrics during 003. speech presence. The speech and noise model parameters [11] K. Paliwal, A. Basu, A speech enhancement method based on kalman filtering, IEEE International Conference on Acoustics, are adopted as a speech activity detector. It is shown that Speech, and Signal Processing 12 (1987) 177–180. doi:10.1109/ the reduced-biased AKF gain achieved by the proposed ICASSP.1987.1169756. [12] J. D. Gibson, B. Koo, S. D. Gray, Filtering of colored noise for speech enhancement and coding, IEEE Transactions on Signal Processing 39 (8) (1991) 1732–1742. doi:10.1109/78.91144. [13] G. J. Brown, D. Wang, Separation of Speech by Computational Auditory Scene Analysis, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005. doi:10.1007/3-540-27489-8_16. [14] Y. Xu, J. Du, L. Dai, C. Lee, An experimental study on speech enhancement based on deep neural networks, IEEE Signal Pro- cessing Letters 21 (1) (2014) 65–68. doi:10.1109/LSP.2013. 2291240. [15] S. K. Roy, W. P. Zhu, B. Champagne, Single channel speech enhancement using subband iterative kalman filter, IEEE In- ternational Symposium on Circuits and Systems (2016) 762– 765doi:10.1109/ISCAS.2016.7527352. [16] M. Saha, R. Ghosh, B. Goswami, Robustness and sensitivity metrics for tuning the extended kalman filter, IEEE Transac- tions on Instrumentation and Measurement 63 (4) (2014) 964– 971. doi:10.1109/TIM.2013.2283151. [17] S. So, A. E. W. George, R. Ghosh, K. K. Paliwal, A non- iterative kalman filtering algorithm with dynamic gain ad- justment for single-channel speech enhancement, International Journal of Signal Processing Systems 4 (4) (2016) 263–268. doi:10.18178/ijsps.4.4.263-268. [18] S. So, A. E. W. George, R. Ghosh, K. K. Paliwal, Kalman filter with sensitivity tuning for improved noise reduction in speech, Circuits, Systems, and Signal Processing 36 (4) (2017) 1476– 1492. doi:10.1007/s00034-016-0363-y. [19] A. E. George, S. So, R. Ghosh, K. K. Paliwal, Robustness metric-based tuning of the augmented kalman filter for the enhancement of speech corrupted with coloured noise, Speech Communication 105 (2018) 62–76. doi:https://doi.org/10. 1016/j.specom.2018.10.002. [20] S. K. Roy, K. K. Paliwal, Sensitivity metric-based tuning of the augmented Kalman filter for speech enhancement, 14th In- ternational Conference on Signal Processing and Communica- tion Systems (ICSPCS) 2020doi:10.1109/ICSPCS50536.2020. 9310005. [21] S. V. Vaseghi, Linear prediction models, in: Advanced Digital Signal Processing and Noise Reduction, John Wiley & Sons, 2009, Ch. 8, pp. 227–262. [22] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd Edition, CRC Press, Inc., Boca Raton, FL, USA, 2013. [23] H. J. Steeneken, F. W. Geurtsen, Description of the RSG-10 noise database, Report IZF 1988-3, TNO Institute for Percep- tion, Soesterberg, The Netherlands. [24] D. Pearce, H.-G. Hirsch, The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions., in: INTERSPEECH, ISCA, 2000, pp. 29–32. [25] A. V. Oppenheim, R. W. Schafer, Discrete-Time Signal Pro- cessing, 3rd Edition, Prentice Hall Press, Upper Saddle River, NJ, USA, 2009. [26] T. Gerkmann, R. C. Hendriks, Unbiased MMSE-based noise power estimation with low complexity and low tracking delay, IEEE Transactions on Audio, Speech, and Language Processing 20 (4) (2012) 1383–1393. doi:10.1109/TASL.2011.2180896. [27] G. 
Hu, 100 nonspeech environmental sounds, The Ohio State University, Department of Computer Science and Engineering. [28] A. W. Rix, J. G. Beerends, M. P. Hollier, A. P. Hekstra, Per- ceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, IEEE International Conference on Acoustics, Speech, and Sig- nal Processing 2 (2001) 749–752. doi:10.1109/ICASSP.2001. 941023. [29] C. H. Taal, R. C. Hendriks, R. Heusdens, J. Jensen, An algo- rithm for intelligibility prediction of time–frequency weighted noisy speech, IEEE Transactions on Audio, Speech, and Lan- guage Processing 19 (7) (2011) 2125–2136. doi:10.1109/TASL. 2011.2114881. CHAPTER 4. ROBUSTNESS AND SENSITIVITY METRICS-BASED TUNING OF THE AUGMENTED KALMAN FILTER FOR SINGLE-CHANNEL SPEECH 70 ENHANCEMENT Part IV Kalman Filtering and Machine Learning Methods for Speech Enhancement 71 Chapter 5 Deep Learning-Based Kalman Filter and Augmented Kalman Filter for Speech Enhancement 73 STATEMENT OF CONTRIBUTION TO CO-AUTHORED PUBLISHED PAPER This chapter includes a co-authored paper. The bibliographic details of the co-authored paper, including all authors, are: Sujan Kumar Roy, Aaron Nicolson, Kuldip K. Paliwal, "A Deep Learning-Based Kalman Filter for Speech Enhancement”, Proc. Interspeech 2020, 2692-2696, DOI: 10.21437/Interspeech.2020-1551. My contribution to the paper involved: • Preliminary experiments. • Experiment design. • Conducted the experiments. • Code writing. • Design of models. • Analysis of results. • Literature review. • Manuscript writing. Aaron Nicolson aided with drafting the final manuscript. Professor Kuldip K. Paliwal provided supervision and aided with editing the final manuscript. (Signed) _____________ ___________ (Date) 02/04/2021 Sujan Kumar Roy (Countersigned) ______ _______ (Date) 02/04/2021 Aaron Nicolson (Countersigned) ___ __ (Date) 02/04/2021 Supervisor: Professor Kuldip K. Paliwal INTERSPEECH 2020 October 25–29, 2020, Shanghai, China A Deep Learning-based Kalman Filter for Speech Enhancement Sujan Kumar Roy, Aaron Nicolson, and Kuldip K. Paliwal Signal Processing Laboratory, Griffith School of Engineering, Griffith University, Brisbane, QLD, Australia, 4111 {sujankumar.roy, aaron.nicolson}@griffithuni.edu.au,

[email protected]

Abstract

The existing Kalman filter (KF) suffers from poor estimates of the noise variance and the linear prediction coefficients (LPCs) in real-world noise conditions. This results in degraded speech enhancement performance. In this paper, a deep learning approach is used to more accurately estimate the noise variance and LPCs, enabling the KF to enhance speech in various noise conditions. Specifically, a deep learning approach to MMSE-based noise power spectral density (PSD) estimation, called DeepMMSE, is used. The estimated noise PSD is used to compute the noise variance. We also construct a whitening filter with its coefficients computed from the estimated noise PSD. It is then applied to the noisy speech, yielding pre-whitened speech for computing the LPCs. The improved noise variance and LPC estimates enable the KF to minimise the residual noise and distortion in the enhanced speech. Experimental results show that the proposed method exhibits higher quality and intelligibility in the enhanced speech than the benchmark methods in various noise conditions for a wide range of SNR levels.

Index Terms: Speech enhancement, Kalman filter, DeepMMSE, Deep Xi, noise PSD, LPC.

1. Introduction

The objective of a speech enhancement algorithm (SEA) is to eliminate the embedded noise from a noisy speech signal. It can be used as a front-end tool for many applications, such as voice communication systems, hearing-aid devices, and speech recognition. Various SEAs, namely spectral subtraction (SS) [1, 2], MMSE [3, 4], Wiener filter (WF) [5, 6], and Kalman filter (KF) [7], have been introduced over the decades.

The SS method heavily depends on the accuracy of the noise PSD estimate [8]. The MMSE and WF-based SEAs rely upon accurate estimation of the a priori SNR [9]. In [3], a decision-directed (DD) approach was proposed to estimate the a priori SNR. Since this approach uses the speech and noise power estimates from the previous frame, it is difficult to estimate the a priori SNR accurately for the current frame.

The efficiency of a KF-based SEA depends on how accurately the noise variance and the LPCs are estimated. In [7], the LPCs are computed from the clean speech, which is unavailable in practice. It is also limited to enhancing speech corrupted with additive white Gaussian noise (AWGN). A sub-band iterative KF for enhancing speech in different noise conditions was proposed in [10]. The noisy speech is first decomposed into 16 sub-bands (SBs). An iterative KF is then employed to enhance the partially reconstructed high-frequency (HF) SBs. It is assumed that the low-frequency (LF) SBs are less affected by noise and they are left unprocessed. The noise variance for the sub-band iterative KF is estimated using a derivative-based method.

Nowadays, deep neural networks (DNNs) are widely used for speech enhancement [11]. DNN-based SEAs typically give an estimate of a time-frequency mask, which is used to compute the spectrum of the clean speech [11, 12]. A comparative study on six different masks has also been performed in [13] to identify an optimal mask for speech enhancement. However, the masking technique usually introduces residual and musical noise in the enhanced speech [13].

In [14], a fully convolutional neural network (FCNN)-based SEA was introduced. This method is particularly designed to enhance babble-noise-corrupted speech. In [15], a raw waveform-based SEA using an FCNN was proposed. Since the input/output of [15] is a raw waveform, the enhanced speech is not affected by the phase issues that are characteristic of magnitude spectrum-based SEAs [11, 13, 14]. Zheng et al. introduced a phase-aware DNN for speech enhancement [16]. Here, the phase information (converted to the instantaneous frequency deviation (IFD)) is used jointly with a time-frequency mask. The enhanced speech is reconstructed with the estimated mask and the phase information extracted from the IFD. Yu et al. introduced a KF-based SEA, where the LPCs are estimated using a deep neural network [17]. However, the noise covariance is estimated during speech pauses, which is not effective in non-stationary noise conditions. In addition, the silence detection process was unspecified.

In this paper, a deep learning technique is used to resolve the noise variance and LPC estimation issues of the KF, enabling speech enhancement in various noise conditions. Firstly, the noise PSD is estimated using DeepMMSE [18], which is then used to compute the noise variance. We also construct a whitening filter with its coefficients computed from the estimated noise PSD. The LPCs are then computed from the pre-whitened speech, which is obtained by applying the whitening filter to the noisy speech signal. With the improved noise variance and LPCs, the KF is found to be effective at minimising the residual noise as well as the distortion in the enhanced speech. The efficiency of the proposed method is evaluated against benchmark methods using objective and subjective testing.
2. KF for speech enhancement

At discrete-time sample n, the noisy speech, y(n), can be represented as:

y(n) = s(n) + v(n),  (1)

where s(n) and v(n) denote the clean speech and uncorrelated additive noise, respectively. The clean speech can be modeled using a pth order linear predictor, as in [19, Chapter 8]:

s(n) = -\sum_{i=1}^{p} a_i s(n-i) + w(n),  (2)

where {a_i; i = 1, 2, ..., p} are the LPCs, and w(n) is assumed to be white noise with zero mean and variance σ_w².

Eqs. (1)-(2) can be used to form the following state-space model (SSM) of the KF, as in [7]:

s(n) = Φ s(n-1) + d w(n),  (3)
y(n) = c⊤ s(n) + v(n).  (4)

The SSM is comprised of the following:

1. s(n) is a p × 1 state vector at sample n, represented as:

s(n) = [s(n) s(n-1) ... s(n-p+1)]⊤,  (5)

2. Φ is a p × p state-transition matrix that relates the process states at samples n and n-1, represented as:

\Phi = \begin{bmatrix} -a_1 & -a_2 & \cdots & -a_{p-1} & -a_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix},  (6)

3. d and c are the p × 1 measurement vectors for the excitation noise and observation, represented as: d = c = [1 0 ... 0]⊤,

4. y(n) represents the noisy observation at sample n.

Firstly, y(n) is windowed into non-overlapped, short (e.g., 20 ms) frames. For a particular frame, the KF computes an unbiased linear MMSE estimate ŝ(n|n) at sample n, given y(n), by using the following recursive equations [7]:

ŝ(n|n-1) = Φ ŝ(n-1|n-1),  (7)
Ψ(n|n-1) = Φ Ψ(n-1|n-1) Φ⊤ + σ_w² d d⊤,  (8)
K(n) = Ψ(n|n-1) c (c⊤ Ψ(n|n-1) c + σ_v²)^{-1},  (9)
ŝ(n|n) = ŝ(n|n-1) + K(n)[y(n) - c⊤ ŝ(n|n-1)],  (10)
Ψ(n|n) = [I - K(n) c⊤] Ψ(n|n-1).  (11)

For a noisy speech frame, the error covariances (Ψ(n|n-1) and Ψ(n|n), corresponding to ŝ(n|n-1) and ŝ(n|n)) and the Kalman gain K(n) are continually updated on a sample-wise basis, while σ_v² and ({a_i}, σ_w²) remain constant. At sample n, c⊤ ŝ(n|n) gives the estimated speech, ŝ(n|n), as in [20]:

ŝ(n|n) = [1 - K_0(n)] ŝ(n|n-1) + K_0(n) y(n),  (12)

where K_0(n) is the first component of K(n), given by [20]:

K_0(n) = (α²(n) + σ_w²) / (α²(n) + σ_w² + σ_v²),  (13)

where α²(n) = c⊤ Φ Ψ(n-1|n-1) Φ⊤ c is the transmission of the a posteriori mean squared error from the previous sample, n-1, to the total a priori mean prediction squared error [20].

Eq. (12) implies that K_0(n) has a significant impact on ŝ(n|n), which is the output of the KF. In practice, poor estimates of σ_v² and ({a_i}, σ_w²) introduce bias in K_0(n), which affects ŝ(n|n). In the proposed SEA, DeepMMSE is used to accurately estimate σ_v² and ({a_i}, σ_w²), leading to a more accurate ŝ(n|n).
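To make the recursion of Eqs. (7)-(11) concrete, the following minimal NumPy sketch filters one frame of noisy speech given the LPC parameters. It is illustrative only, not part of the original paper: the function name and the identity initialisation of the error covariance are our assumptions.

import numpy as np

def kf_frame(y, a, var_w, var_v):
    """Sketch of the KF recursion of Eqs. (7)-(11) for one frame."""
    y = np.asarray(y, dtype=float)
    p = len(a)
    # State-transition matrix of Eq. (6): first row holds the negated
    # LPCs; the sub-diagonal shifts the state.
    Phi = np.zeros((p, p))
    Phi[0, :] = -np.asarray(a, dtype=float)
    Phi[1:, :-1] = np.eye(p - 1)
    d = np.zeros((p, 1)); d[0] = 1.0       # measurement vectors, d = c
    c = d.copy()

    s_est = np.zeros((p, 1))               # a posteriori state estimate
    P = np.eye(p)                          # a posteriori error covariance
    s_hat = np.zeros_like(y)
    for n, yn in enumerate(y):
        s_pred = Phi @ s_est                               # Eq. (7)
        P_pred = Phi @ P @ Phi.T + var_w * (d @ d.T)       # Eq. (8)
        K = P_pred @ c / float(c.T @ P_pred @ c + var_v)   # Eq. (9)
        s_est = s_pred + K * (yn - float(c.T @ s_pred))    # Eq. (10)
        P = (np.eye(p) - K @ c.T) @ P_pred                 # Eq. (11)
        s_hat[n] = s_est[0, 0]             # c^T s(n|n), enhanced sample
    return s_hat

The first component of K in each iteration corresponds to K_0(n) of Eq. (13), which is the quantity biased by poor parameter estimates.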
3. Proposed speech enhancement system

Fig. 1 shows the block diagram of the proposed SEA. [Figure 1: Block diagram of the proposed SEA.] Unlike the traditional KF method (Section 2), in the proposed SEA a 32 ms rectangular window with 50% overlap is used for converting y(n) into frames, i.e., y(n, l) = s(n, l) + v(n, l), where l ∈ {0, 1, 2, ..., N-1} is the frame index, N is the total number of frames, and M is the total number of samples within each frame, i.e., n ∈ {0, 1, 2, ..., M-1}. The noisy speech y(n) is also analyzed frame-wise using the short-time Fourier transform (STFT):

Y(l, m) = S(l, m) + V(l, m),  (14)

where Y(l, m), S(l, m), and V(l, m) denote the complex-valued STFT coefficients of the noisy speech, the clean speech, and the noise signal, respectively, for time-frame index l and discrete-frequency bin m.

It is assumed that S(l, m) and V(l, m) follow a Gaussian distribution with zero mean and variances E{|S(l, m)|²} = λ_s(l, m) and E{|V(l, m)|²} = λ_v(l, m), where E{·} represents the statistical expectation operator.

3.1. Proposed σ_v² and ({a_i}, σ_w²) estimation method

Firstly, the frame-wise noise PSD, λ̂_v(l, m), is estimated using DeepMMSE [18], which is described in the following subsection. An estimate of the noise, v̂(n, l), is given by taking the inverse DFT of \sqrt{λ̂_v(l, m)} exp[j∠Y(l, m)]. The noise variance, σ_v², is then computed from v̂(n, l) frame-wise as:

σ_v² = (1/M) \sum_{n=0}^{M-1} v̂²(n, l).  (15)

The LPC parameters, ({a_i}, σ_w²), are sensitive to noise. We therefore compute ({a_i}, σ_w²) (p = 10) frame-wise from the pre-whitened speech, y_w(n, l), using the autocorrelation method [19]; this reduces the bias in ({a_i}, σ_w²). y_w(n, l) is obtained by applying a whitening filter, H_w(z), to y(n, l). H_w(z) is found, as in [19, Section 8.1.7]:

H_w(z) = 1 + \sum_{k=1}^{q} b_k z^{-k},  (16)

where the whitening filter coefficients, ({b_k}; q = 40), are computed from v̂(n, l) using the autocorrelation method [19].
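A minimal sketch of the parameter estimation just described is given below. It is illustrative only: the function names are ours, the framing/IDFT steps are assumed to have already produced the noise estimate v_hat for the frame, and the autocorrelation method is realised with SciPy's Toeplitz solver, which is mathematically equivalent to the Levinson-Durbin recursion.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_autocorr(x, order):
    """Autocorrelation method: biased autocorrelation, then the normal
    equations solved with a Toeplitz solver. Returns ({a_i}, sigma^2)
    under the sign convention of Eq. (2)."""
    x = np.asarray(x, dtype=float)
    r = np.array([x[: len(x) - i] @ x[i:] for i in range(order + 1)]) / len(x)
    a = solve_toeplitz(r[:order], -r[1 : order + 1])
    var = r[0] + a @ r[1 : order + 1]      # prediction error variance
    return a, var

def estimate_kf_parameters(y_frame, v_hat, p=10, q=40):
    """Sketch of Section 3.1: noise variance (Eq. (15)), whitening filter
    (Eq. (16)), and speech LPCs from the pre-whitened frame."""
    var_v = np.mean(np.asarray(v_hat, dtype=float) ** 2)   # Eq. (15)
    b, _ = lpc_autocorr(v_hat, q)          # whitening coefficients {b_k}
    y_w = lfilter(np.concatenate(([1.0], b)), [1.0], y_frame)
    a, var_w = lpc_autocorr(y_w, p)        # ({a_i}, sigma_w^2)
    return var_v, a, var_w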
3.2. DeepMMSE

DeepMMSE is an MMSE-based noise PSD estimator that employs the Deep Xi framework for a priori SNR estimation [18]. DeepMMSE does not exploit any underlying assumptions about the speech or noise, and produces a noise PSD estimate with negligible bias, unlike other MMSE-based noise PSD estimators [21, 22]. DeepMMSE includes the following four stages:

1. The a priori SNR estimate, ξ̂(l, m), of |Y(l, m)| is first found using Deep Xi-ResNet, described in the following subsection. The a priori SNR is defined as ξ(l, m) = λ_s(l, m)/λ_v(l, m).

2. Next, the maximum-likelihood (ML) a posteriori SNR estimate is computed using the a priori SNR [23]: γ̂(l, m) = ξ̂(l, m) + 1.

3. Using ξ̂ and γ̂, the noise periodogram estimate is found using the MMSE estimator [21, 22]:

|V̂(l, m)|² = \left( \frac{ξ̂(l, m)}{(1 + ξ̂(l, m))²} + \frac{1}{(1 + ξ̂(l, m)) γ̂(l, m)} \right) |Y(l, m)|².

4. The final noise PSD estimate, λ̂_v(l, m), is found by applying a first-order temporal recursive smoothing operation: λ̂_v(l, m) = α λ̂_v(l-1, m) + (1 - α)|V̂(l, m)|², where α is the smoothing factor. In this work, α = 0 was used, i.e., the instantaneous noise power spectrum estimate from DeepMMSE was used.
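Once ξ̂ is available, stages 2-4 reduce to a few array operations. The following NumPy sketch (illustrative; variable names are ours, not the reference implementation) mirrors them:

import numpy as np

def deepmmse_noise_psd(Y_mag, xi_hat, alpha=0.0):
    """Sketch of DeepMMSE stages 2-4.

    Y_mag  : |Y(l, m)|, noisy magnitude spectrogram (frames x bins)
    xi_hat : a priori SNR estimate from Deep Xi-ResNet (same shape)
    alpha  : smoothing factor (alpha = 0 in this paper)
    """
    # Stage 2: ML a posteriori SNR, gamma = xi + 1
    gamma_hat = xi_hat + 1.0
    # Stage 3: MMSE noise periodogram estimate
    gain = xi_hat / (1.0 + xi_hat) ** 2 + 1.0 / ((1.0 + xi_hat) * gamma_hat)
    V_hat2 = gain * Y_mag ** 2
    # Stage 4: first-order recursive smoothing over frames
    lambda_v = np.empty_like(V_hat2)
    lambda_v[0] = V_hat2[0]
    for l in range(1, V_hat2.shape[0]):
        lambda_v[l] = alpha * lambda_v[l - 1] + (1.0 - alpha) * V_hat2[l]
    return lambda_v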
3.3. Deep Xi-ResNet for ξ(l, m) estimation

Deep Xi-ResNet is used to estimate ξ(l, m) for DeepMMSE (available at https://github.com/anicolson/DeepXi). Deep Xi is a deep learning approach to a priori SNR estimation [24]. During training, the clean speech and noise of the noisy speech are available. This allows the instantaneous case of the a priori SNR to be used as the training target. To compute the instantaneous a priori SNR, λ_s(l, m) and λ_v(l, m) are replaced with the squared magnitudes of the clean-speech and noise spectral components, respectively.

The observation and target of a training example for a deep neural network (DNN) in the Deep Xi framework are |Y_l| and the mapped a priori SNR, ξ̄_l, respectively. The mapped a priori SNR is a mapped version of the instantaneous a priori SNR. The instantaneous a priori SNR is mapped to the interval [0, 1] in order to improve the rate of convergence of the stochastic gradient descent algorithm. The cumulative distribution function (CDF) of ξ_dB(l, m) = 10 log10(ξ(l, m)) is used as the map. As shown in [24, Fig. 2 (top)], the distribution of ξ_dB(l, m) for the kth frequency component follows a normal distribution. It is thus assumed that ξ_dB(l, m) is distributed normally with mean μ_k and variance σ_k²: ξ_dB(l, m) ∼ N(μ_k, σ_k²). Thus, the mapped a priori SNR is found by applying the normal CDF to ξ_dB(l, m):

ξ̄(l, m) = \frac{1}{2} \left[ 1 + \mathrm{erf}\left( \frac{ξ_dB(l, m) - μ_k}{σ_k \sqrt{2}} \right) \right],  (17)

where μ_k and σ_k² found in [24] are used in this work. During inference, the a priori SNR estimate, ξ̂(l, m), is found from the mapped estimate ξ̄̂(l, m) using:

ξ̂(l, m) = 10^{(σ_k \sqrt{2} \, \mathrm{erf}^{-1}(2 ξ̄̂(l, m) - 1) + μ_k)/10}.

Deep Xi-ResNet utilises a residual network consisting of 1-D causal dilated convolutional units within the Deep Xi framework [18], as shown in Fig. 2 (a). It consists of E = 40 bottleneck residual blocks, where e ∈ {1, 2, ..., E} is the block index. Each block contains three convolutional units (CUs) [25], where each CU is pre-activated by layer normalisation [26] followed by the ReLU activation function [27]. The 1st and 3rd CUs have a kernel size of r = 1, compared to r = 3 for the 2nd CU. The 2nd CU employs a dilation rate (DR) of d, providing a contextual field over previous time steps. As in [18], d is cycled as the block index e increases: d = 2^{(e-1) mod (log2(D)+1)}, where mod is the modulo operation and D is the maximum DR. An example of how the DR is cycled is shown in Fig. 2 (b), with D = 4 and E = 6. It can be seen that the DR is reset after block three. This also demonstrates the contextual field gained by the use of causal dilated CUs. For Deep Xi-ResNet, D is set to 16. The 1st and 2nd CUs have an output size of d_f = 64, compared to d_model = 256 for the 3rd CU. FC is a fully-connected layer with an output size of d_model, where layer normalisation is applied to the output of FC, followed by the ReLU activation function. The output layer O is a fully-connected layer with sigmoidal units.

[Figure 2: (a) Deep Xi-ResNet and (b) an example of the contextual field of Deep Xi-ResNet with D = 4, E = 6, and r = 3.]
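A sketch of the map of Eq. (17) and of its inverse used at inference is given below (illustrative; function names are ours, and the per-frequency statistics mu and sigma are assumed given, e.g., taken from [24]):

import numpy as np
from scipy.special import erf, erfinv

def map_xi(xi, mu, sigma):
    """Eq. (17): map the instantaneous a priori SNR to [0, 1] via the
    normal CDF with per-frequency statistics (mu_k, sigma_k)."""
    xi_db = 10.0 * np.log10(np.maximum(xi, 1e-12))
    return 0.5 * (1.0 + erf((xi_db - mu) / (sigma * np.sqrt(2.0))))

def unmap_xi(xi_bar, mu, sigma):
    """Inverse map used at inference to recover xi_hat from the network
    output xi_bar_hat."""
    xi_bar = np.clip(xi_bar, 1e-7, 1.0 - 1e-7)   # keep erfinv finite
    xi_db = sigma * np.sqrt(2.0) * erfinv(2.0 * xi_bar - 1.0) + mu
    return 10.0 ** (xi_db / 10.0)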
4. Speech enhancement experiment

4.1. Training set

For training the ResNet, a total of 74,250 clean speech recordings belonging to the train-clean-100 set from the Librispeech corpus [28] (28,539), the CSTR VCTK corpus [29] (42,015), and the si* and sx* training sets from the TIMIT corpus [30] (3,696) are used. 5% of the clean speech recordings are randomly selected and used as a validation set. Thus, 70,537 clean speech recordings are used in the training set and 3,713 in the validation set. The 2,382 noise recordings adopted in [24] are used as the noise training set. All clean speech and noise recordings are single-channel, with a sampling frequency of 16 kHz.

4.2. Training strategy

The ResNet is trained using cross-entropy as the loss function and the Adam algorithm [31] with default hyper-parameters. The gradients are also clipped between [-1, 1]. The selection order of the clean speech recordings is randomised for each epoch. 175 epochs are used to train the ResNet, with a mini-batch size of 10 noisy speech signals. The noisy signals are created as follows: each clean speech recording selected for the mini-batch is mixed with a random section of a randomly selected noise recording at a randomly selected SNR level (-10 to 20 dB, in 1 dB increments).

4.3. Test set

For objective experiments, 30 utterances belonging to six speakers are taken from the NOIZEUS corpus and are sampled at 16 kHz [9, Chapter 12]. We generate a noisy data set that has been corrupted by the passing car and café babble noise recordings adopted in [24], at SNR levels from -5 dB to 15 dB, in 5 dB increments. Note that these clean speech and noise recordings are not used during training.

4.4. Evaluation metrics

The objective quality and intelligibility evaluation was carried out using the perceptual evaluation of speech quality (PESQ) [32] and quasi-stationary speech transmission index (QSTI) [33] measures. We also analyse the enhanced speech spectrograms of the SEAs. The subjective evaluation was carried out through blind AB listening tests [34, Section 3.3.4]. Five English-speaking listeners participated in the tests, where the utterance sp05 ("Wipe the grease off his dirty face"), corrupted with 5 dB passing car noise, was used as the stimulus.

The proposed method is compared with benchmark methods: the raw waveform processing using FCNN (RWF-FCN) method [15], the phase-aware DNN (IAM+IFD) method [16], the deep learning KF (DNN-KF) method [17], the KF-Oracle method (where ({a_i}, σ_w²) and σ_v² are computed from the clean speech and noise signal), and Noisy (noise-corrupted speech).

5. Results and discussion

Fig. 3 (a)-(b) demonstrates that the proposed method consistently shows improved PESQ scores over the benchmark methods, except for the KF-Oracle method, for all noise conditions and SNR levels. The IAM+IFD method [16] attained the highest PESQ scores amongst the benchmark methods. The Noisy speech shows the worst PESQ score for all conditions.

[Figure 3: Performance of each SEA in terms of: (a) PESQ for passing car, (b) PESQ for café babble, (c) QSTI for passing car, and (d) QSTI for café babble.]

Fig. 3 (c)-(d) also shows that the proposed method demonstrates a consistent QSTI improvement across the noise experiments as well as the SNR levels, apart from the KF-Oracle method. The existing IAM+IFD method [16] is found to be competitive with the proposed method in terms of QSTI, typically at low SNR levels. However, the QSTI of each method at high SNR levels is competitive.

It can be seen that the enhanced speech produced by the proposed method (Fig. 4 (f)) exhibits significantly less residual noise than that of the benchmark methods (Fig. 4 (c)-(e)) and is similar to the KF-Oracle method (Fig. 4 (g)). The informal listening tests also confirm that the benchmark methods produce enhanced speech with significantly more disturbances than the proposed method.

[Figure 4: (a) Clean speech, (b) noisy speech (sp05 corrupted with 5 dB passing car noise), and the enhanced speech spectrograms produced by the: (c) RWF-FCN, (d) DNN-KF, (e) IAM+IFD, (f) proposed, and (g) KF-Oracle methods.]

Fig. 5 shows that the enhanced speech produced by the proposed method is more widely preferred by the listeners (76.33%) than the benchmark methods, apart from the KF-Oracle (83.22%) and the clean speech. The IAM+IFD method [16] is found to be the most preferred (66.67%) amongst the benchmark methods.

[Figure 5: The mean preference score (%) for each SEA on sp05 corrupted with 5 dB passing car noise.]

6. Conclusions

This paper introduced a deep learning and Kalman filter-based speech enhancement algorithm. Specifically, DeepMMSE is used to estimate the noise PSD for computing the noise variance. A whitening filter is also constructed using coefficients estimated from the noise PSD. It is applied to the noisy speech signal, yielding pre-whitened speech. The LPCs are computed from the pre-whitened signal. The large training set of DeepMMSE yields more accurate estimates of the noise variance and the LPCs in various noise conditions. As a result, the KF constructed with the improved parameters minimises the residual noise as well as the distortion in the resultant enhanced speech. Extensive objective and subjective testing shows that the proposed method outperforms the benchmark methods in various noise conditions for a wide range of SNR levels.
7. References

[1] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113-120, April 1979.
[2] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 208-211, April 1979.
[3] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109-1121, December 1984.
[4] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443-445, April 1985.
[5] P. Scalart and J. V. Filho, "Speech enhancement based on a priori signal to noise estimation," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 629-632, May 1996.
[6] C. Plapous, C. Marro, L. Mauuary, and P. Scalart, "A two-step noise reduction technique," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 289-292, May 2004.
[7] K. Paliwal and A. Basu, "A speech enhancement method based on Kalman filtering," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 12, pp. 177-180, April 1987.
[8] N. Upadhyay and A. Karmakar, "Speech enhancement using spectral subtraction-type algorithms: A comparison and simulation study," Procedia Computer Science, vol. 54, pp. 574-584, 2015.
[9] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd ed. Boca Raton, FL, USA: CRC Press, Inc., 2013.
[10] S. K. Roy, W. P. Zhu, and B. Champagne, "Single channel speech enhancement using subband iterative Kalman filter," IEEE International Symposium on Circuits and Systems, pp. 762-765, May 2016.
[11] Y. Xu, J. Du, L. Dai, and C. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65-68, 2014.
[12] A. Nicolson and K. K. Paliwal, "Bidirectional long-short term memory network-based estimation of reliable spectral component locations," in Proc. Interspeech 2018, 2018, pp. 1606-1610.
[13] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849-1858, 2014.
[14] S. R. Park and J. Lee, "A fully convolutional neural network for speech enhancement," Proceedings of Interspeech, pp. 1993-1997, 2017.
[15] S. Fu, T. Wang, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1570-1584, 2018.
[16] N. Zheng and X. Zhang, "Phase-aware speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 63-76, 2019.
[17] H. Yu, Z. Ouyang, W. Zhu, B. Champagne, and Y. Ji, "A deep neural network based Kalman filter for time domain speech enhancement," IEEE International Symposium on Circuits and Systems, pp. 1-5, May 2019.
[18] Q. Zhang, A. M. Nicolson, M. Wang, K. Paliwal, and C. Wang, "DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020.
[19] S. V. Vaseghi, "Linear prediction models," in Advanced Digital Signal Processing and Noise Reduction. John Wiley & Sons, 2009, ch. 8, pp. 227-262.
[20] S. So, A. E. W. George, R. Ghosh, and K. K. Paliwal, "Kalman filter with sensitivity tuning for improved noise reduction in speech," Circuits, Systems, and Signal Processing, vol. 36, no. 4, pp. 1476-1492, April 2017.
[21] R. C. Hendriks, R. Heusdens, and J. Jensen, "MMSE based noise PSD tracking with low complexity," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4266-4269, March 2010.
[22] T. Gerkmann and R. C. Hendriks, "Unbiased MMSE-based noise power estimation with low complexity and low tracking delay," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1383-1393, December 2012.
[23] P. Scalart and J. V. Filho, "Speech enhancement based on a priori signal to noise estimation," in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 2, 1996, pp. 629-632.
[24] A. Nicolson and K. K. Paliwal, "Deep learning for minimum mean-square error approaches to speech enhancement," Speech Communication, vol. 111, pp. 44-55, August 2019.
[25] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," ArXiv, vol. abs/1803.01271, 2018.
[26] J. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," ArXiv, vol. abs/1607.06450, 2016.
[27] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," 27th International Conference on Machine Learning, pp. 807-814, June 2010.
[28] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5206-5210, April 2015.
[29] C. Veaux, J. Yamagishi, and K. MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.
[30] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, Feb. 1993.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," ArXiv, vol. abs/1412.6980, 2014.
[32] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 749-752, May 2001.
[33] B. Schwerin and K. K. Paliwal, "An improved speech transmission index for intelligibility prediction," Speech Communication, vol. 65, pp. 9-19, December 2014.
[34] K. K. Paliwal, K. Wójcicki, and B. Schwerin, "Single-channel speech enhancement using spectral subtraction in the short-time modulation domain," Speech Communication, vol. 52, no. 5, pp. 450-475, May 2010.

STATEMENT OF CONTRIBUTION TO CO-AUTHORED PUBLISHED PAPER

This chapter includes a co-authored paper. The bibliographic details of the co-authored paper, including all authors, are:

Sujan Kumar Roy, Aaron Nicolson, Kuldip K. Paliwal, "Deep Learning with Augmented Kalman Filter for Single-Channel Speech Enhancement," 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Seville, Spain, 2020, pp. 1-5, doi: 10.1109/ISCAS45731.2020.9180820.

My contribution to the paper involved:
• Preliminary experiments.
• Experiment design.
• Conducted the experiments.
• Code writing.
• Design of models.
• Analysis of results.
• Literature review.
• Manuscript writing.

Aaron Nicolson aided with drafting the final manuscript. Professor Kuldip K. Paliwal provided supervision and aided with editing the final manuscript.
(Signed) Sujan Kumar Roy (Date) 02/04/2021
(Countersigned) Aaron Nicolson (Date) 02/04/2021
(Countersigned) Supervisor: Professor Kuldip K. Paliwal (Date) 02/04/2021

Deep Learning with Augmented Kalman Filter for Single-Channel Speech Enhancement

Sujan Kumar Roy, Aaron Nicolson, Kuldip K. Paliwal

School of Engineering, Griffith University, Brisbane, QLD, 4111, Australia
Emails: {sujankumar.roy, aaron.nicolson}@griffithuni.edu.au,

[email protected]

Abstract—The existing augmented Kalman filter (AKF) suffers from poor LPC estimates in real-world noise conditions, which degrades the speech enhancement performance. In this paper, a deep learning technique is used to improve the LPC estimates for the AKF, enabling it to enhance speech in various noise conditions. Specifically, a deep residual network is used to estimate the noise PSD, from which the noise LPCs are computed. A whitening filter is also implemented with the noise LPCs to pre-whiten the noisy speech signal prior to estimating the speech LPCs. It is shown that the improved speech and noise LPCs enable the AKF to minimize the residual noise as well as the distortion in the enhanced speech. Experimental results show that the enhanced speech produced by the proposed method exhibits higher quality and intelligibility than the benchmark methods in various noise conditions for a wide range of SNR levels.

Index Terms—Speech enhancement, augmented Kalman filter, Deep Xi, noise PSD, LPC.

I. INTRODUCTION

The aim of a speech enhancement algorithm (SEA) is to eliminate embedded noise from a noisy speech signal. SEAs are used in many applications, such as voice communication systems, hearing aid devices, and speech recognition. Various SEAs, namely spectral subtraction (SS) [1], [2], MMSE [3], [4], Wiener filter (WF) [5], [6], and Kalman filter (KF) [7], have been introduced in the literature.

The SS method heavily depends on the accuracy of the noise estimate [8]. The MMSE and WF-based SEAs rely upon accurate estimation of the a priori SNR [9]. In [3], the decision-directed (DD) approach was proposed to estimate the a priori SNR. However, the use of speech and noise power estimates from the previous frame makes it inefficient at computing the a priori SNR for the current frame.

In the KF-based SEA of [7], Paliwal and Basu computed the LPCs from clean speech for enhancing white-noise-corrupted speech. Gibson et al. introduced an augmented KF (AKF) to iteratively suppress colored noise [10]. The LPCs for the AKF of the current iteration are estimated from the filtered signal of the previous iteration. The enhanced speech (after 3-4 iterations) suffers from musical noise and distortion. Roy et al. proposed a sub-band iterative KF-based SEA. Due to enhancing the high-frequency sub-bands (SBs) only, the low-frequency SBs may still be affected by noise.

In [11], a robustness metric-based tuning offsets the bias of the KF gain caused by poor LPC estimates. In [12], it was shown that the robustness metric gives an under-estimated Kalman gain, resulting in distorted speech, which can be resolved by sensitivity tuning of the KF gain. Both [11], [12] operate in stationary noise conditions. George et al. introduced a robustness metric-based tuning of the AKF for colored noise suppression [13]. The robustness metric still gives distorted speech. Yu et al. introduced a KF-based SEA, where the LPCs are estimated using a deep neural network [14]. However, the noise covariance estimated during speech pauses makes the KF ineffective at dealing with non-stationary noise conditions. The silence detection process was also unspecified.

In this paper, a deep learning technique is used to resolve the LPC estimation issues of the AKF, enabling speech enhancement in various noise conditions. Firstly, the noise PSD is estimated using a deep residual network (ResNet) [15], from which the noise LPCs are computed. The noise LPCs are then used to implement a whitening filter to pre-whiten the noisy speech signal prior to computing the speech LPCs. With the improved speech and noise LPCs, the AKF is found to be effective at minimizing the residual noise as well as the distortion in the enhanced speech. The efficiency of the proposed method is evaluated against the benchmark methods using objective and subjective testing.
II. AKF FOR SPEECH ENHANCEMENT

Assuming that the colored noise v(n) is additive and uncorrelated with the speech s(n), the noisy speech y(n) at sample n ∈ {0, 1, 2, ..., M-1} can be represented as:

y(n) = s(n) + v(n).  (1)

Both s(n) and v(n) can be modeled using pth and qth order linear predictors, as in [16]:

s(n) = -\sum_{i=1}^{p} a_i s(n-i) + w(n),  (2)
v(n) = -\sum_{k=1}^{q} b_k v(n-k) + u(n),  (3)

where {a_i; i = 1, 2, ..., p} and {b_k; k = 1, 2, ..., q} are the LPCs, and w(n) and u(n) are assumed to be white noise with zero mean and variances σ_w² and σ_u², respectively.

Eqs. (1)-(3) can be used to form the following augmented state-space model (ASSM) of the AKF, as in [13]:

x(n) = Φ x(n-1) + d z(n),  (4)
y(n) = c⊤ x(n),  (5)

where x(n) = [s(n) ... s(n-p+1) v(n) ... v(n-q+1)]⊤ is a (p+q) × 1 state vector, Φ = \begin{bmatrix} Φ_s & 0 \\ 0 & Φ_v \end{bmatrix} is a (p+q) × (p+q) state-transition matrix constructed with the {a_i} and {b_k}, d = \begin{bmatrix} d_s & 0 \\ 0 & d_v \end{bmatrix} with d_s = [1 0 ... 0]⊤ (p × 1) and d_v = [1 0 ... 0]⊤ (q × 1), z(n) = [w(n) u(n)]⊤, and c = [1 0 ... 0 1 0 ... 0]⊤ is a (p+q) × 1 vector [13].

Firstly, y(n) is windowed into non-overlapped, short (e.g., 20 ms) frames. For a particular frame, the AKF computes an unbiased, linear MMSE estimate x̂(n|n) at sample n, given y(n), by using the following recursive equations [10]:

x̂(n|n-1) = Φ x̂(n-1|n-1),  (6)
Ψ(n|n-1) = Φ Ψ(n-1|n-1) Φ⊤ + d Q d⊤,  (7)
K(n) = Ψ(n|n-1) c (c⊤ Ψ(n|n-1) c)^{-1},  (8)
x̂(n|n) = x̂(n|n-1) + K(n)[y(n) - c⊤ x̂(n|n-1)],  (9)
Ψ(n|n) = [I - K(n) c⊤] Ψ(n|n-1),  (10)

where Q = \begin{bmatrix} σ_w² & 0 \\ 0 & σ_u² \end{bmatrix} is the process noise covariance.

For a noisy speech frame, the error covariances Ψ(n|n-1) and Ψ(n|n), corresponding to x̂(n|n-1) and x̂(n|n), and the Kalman gain K(n) are continually updated on a sample-wise basis, while ({a_i}, σ_w²) and ({b_k}, σ_u²) remain constant. At sample n, g⊤ x̂(n|n) gives the estimated speech, ŝ(n|n), where g = [1 0 0 ... 0]⊤ is a (p+q) × 1 column vector. As in [13], ŝ(n|n) is given by:

ŝ(n|n) = [1 - K_0(n)] ŝ(n|n-1) + K_0(n)[y(n) - v̂(n|n-1)],  (11)

where K_0(n) is the first component of K(n), given by [13]:

K_0(n) = (α²(n) + σ_w²) / (α²(n) + σ_w² + β²(n) + σ_u²),  (12)

where α²(n) and β²(n) are the transmissions of the a posteriori error variances (of the speech and noise, respectively) by the augmented dynamic model from the previous sample, n-1 [13].

Eq. (11) implies that K_0(n) has a significant impact on the ŝ(n|n) estimates, which form the output of the AKF. In practice, poor estimates of ({a_i}, σ_w²) and ({b_k}, σ_u²) introduce bias in K_0(n), which affects ŝ(n|n). In the proposed SEA, a deep learning technique is used to estimate the LPCs for the AKF, leading to an improved ŝ(n|n) estimate.
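For illustration, a minimal NumPy sketch of the ASSM construction and the recursion of Eqs. (6)-(10) is given below. It is a sketch under stated assumptions, not the paper's implementation: the function name and the identity initialisation of the error covariance are ours.

import numpy as np

def akf_frame(y, a, b, var_w, var_u):
    """Sketch of the ASSM and the AKF recursion of Eqs. (6)-(10)."""
    y = np.asarray(y, dtype=float)
    p, q = len(a), len(b)

    def companion(coeffs):
        # First row holds the negated LPCs; sub-diagonal shifts the state.
        k = len(coeffs)
        F = np.zeros((k, k))
        F[0, :] = -np.asarray(coeffs, dtype=float)
        F[1:, :-1] = np.eye(k - 1)
        return F

    # Augmented state-transition matrix: block-diagonal in Phi_s, Phi_v.
    Phi = np.zeros((p + q, p + q))
    Phi[:p, :p] = companion(a)
    Phi[p:, p:] = companion(b)
    d = np.zeros((p + q, 2)); d[0, 0] = 1.0; d[p, 1] = 1.0
    c = np.zeros((p + q, 1)); c[0] = 1.0; c[p] = 1.0
    Q = np.diag([var_w, var_u])            # process noise covariance

    x_est = np.zeros((p + q, 1))
    P = np.eye(p + q)
    s_hat = np.zeros_like(y)
    for n, yn in enumerate(y):
        x_pred = Phi @ x_est                               # Eq. (6)
        P_pred = Phi @ P @ Phi.T + d @ Q @ d.T             # Eq. (7)
        K = P_pred @ c / float(c.T @ P_pred @ c)           # Eq. (8)
        x_est = x_pred + K * (yn - float(c.T @ x_pred))    # Eq. (9)
        P = (np.eye(p + q) - K @ c.T) @ P_pred             # Eq. (10)
        s_hat[n] = x_est[0, 0]                             # g^T x(n|n)
    return s_hat

Note that, unlike the KF, the gain of Eq. (8) has no separate measurement-noise term, since the noise is part of the augmented state.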
III. PROPOSED SPEECH ENHANCEMENT SYSTEM

Fig. 1 shows the block diagram of the proposed SEA. [Figure 1: Block diagram of the proposed deep learning AKF-based SEA.] Firstly, a 32 ms rectangular window with 50% overlap is used for converting y(n) into frames, i.e., y(n, l) = s(n, l) + v(n, l), where l ∈ {0, 1, 2, ..., N-1} is the frame index and N is the total number of frames. The DFT coefficients Y(l, m), S(l, m), and V(l, m) are found using the square-root-Hann window and correspond to y(n), s(n), and v(n). These can be represented as:

Y(l, m) = S(l, m) + V(l, m),  (13)

where m is the discrete-frequency index.

It is assumed that S(l, m) and V(l, m) follow a Gaussian distribution with zero mean and variances E{|S(l, m)|²} = λ_s(l, m) and E{|V(l, m)|²} = λ_v(l, m), where E{·} represents the statistical expectation operator.

A. Proposed ({b_k}, σ_u²) and ({a_i}, σ_w²) Estimation Method

The ({b_k}, σ_u²) estimates from the initial speech pauses used by the existing AKF [13] make it limited to suppressing only colored noise. In the proposed SEA, the noise PSD estimate, λ̂_v(l, m), is used to compute ({b_k}, σ_u²). Specifically, the noise power estimate, |V̂(l, m)|², is obtained through a simplified version¹ of the MMSE method as described in [17], [18]:

|V̂(l, m)|² = \left( \frac{1}{1 + ξ(l, m)} \right) |Y(l, m)|²,  (14)

ξ(l, m) = λ_s(l, m) / λ_v(l, m),  (15)

where ξ(l, m) is the a priori SNR.

¹The simplification is a result of setting the a posteriori SNR to γ̂(l, m) = ξ̂(l, m) + 1, which is the maximum-likelihood estimate.

In practice, the existing decision-directed approach [17], [18] gives a biased estimate of ξ̂(l, m), which affects the |V̂(l, m)|² estimate. To resolve this, we employ a ResNet [15] within the Deep Xi framework (Deep Xi-ResNet) [19] to estimate ξ̂(l, m), as described in Section III-B. The smoothed noise PSD estimate, λ̂_v(l, m), is obtained as:

λ̂_v(l, m) = η λ̂_v(l-1, m) + (1 - η)|V̂(l, m)|²,  (16)

where η is a smoothing constant, set to 0.9.

The |IDFT| of λ̂_v(l, m) yields an estimate of the noise autocorrelation, R̂_vv(τ), where τ is the autocorrelation lag. By solving R̂_vv(τ) using the Levinson-Durbin recursion [16], the ({b_k}, σ_u²) (q = 40) estimates are obtained. The {b_k} are then used to design the whitening filter, H_w(z), as in [16]:

H_w(z) = 1 + \sum_{k=1}^{q} b_k z^{-k}.  (17)

Applying H_w(z) to y(n, l) gives the whitened speech, y_w(n, l), for computing ({a_i}, σ_w²) (p = 10) [16].
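The noise-LPC computation just described (IDFT of the noise PSD, then autocorrelation, then Levinson-Durbin, then the whitening filter) can be sketched as follows. This is illustrative only: the helper names and the assumption that the PSD is given over rfft bins are ours.

import numpy as np
from scipy.signal import lfilter

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: autocorrelation lags r[0..order] ->
    coefficients of A(z) = 1 + b_1 z^-1 + ... + b_order z^-order and the
    prediction error variance."""
    a = np.zeros(order + 1); a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def noise_lpcs_from_psd(noise_psd, q=40):
    """Sketch: IDFT of the smoothed noise PSD gives R_vv(tau); solve for
    ({b_k}, sigma_u^2) with the Levinson-Durbin recursion."""
    r_vv = np.fft.irfft(noise_psd)[: q + 1].real
    b_full, var_u = levinson_durbin(r_vv, q)
    return b_full, var_u       # b_full = [1, b_1, ..., b_q]

# Whitening filter H_w(z) of Eq. (17) applied to a noisy frame:
# b_full, var_u = noise_lpcs_from_psd(lambda_v_frame, q=40)
# y_w = lfilter(b_full, [1.0], y_frame)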
B. Deep Xi-ResNet for ξ̂(l, m) Estimation

Deep Xi-ResNet is used to estimate ξ̂(l, m) (model 3e from https://github.com/anicolson/DeepXi). Specifically, it takes |Y_l| (which contains all frequency components of the lth frame) as its input and gives an estimate of the mapped a priori SNR, ξ̄̂_l, as described in Section III-C. Deep Xi-ResNet is shown in Fig. 2 (a). It consists of E = 40 bottleneck residual blocks, where e ∈ {1, 2, ..., E} is the block index. Each block contains three one-dimensional causal dilated convolutional units (CDCUs) [20], where each convolutional unit (CU) is pre-activated by layer normalisation [21] followed by the ReLU activation function [22]. The 1st and 3rd CUs have a kernel size of r = 1, compared to r = 3 for the 2nd CU. The 2nd CU employs a dilation rate (DR) of d, providing a contextual field over previous time steps. As in [23], d is cycled as the block index e increases: d = 2^{(e-1) mod (log2(D)+1)}, where mod is the modulo operation and D is the maximum DR. An example of how the DR is cycled is shown in Fig. 2 (b), with D = 4 and E = 6. It can be seen that the DR is reset after block three. This also demonstrates the contextual field gained by the use of CDCUs. For Deep Xi-ResNet, D is set to 16. The 1st and 2nd CUs have an output size of d_f = 64, compared to d_model = 256 for the 3rd CU [24]. FC is a fully-connected layer with an output size of d_model, where layer normalisation is applied to the output of FC, followed by the ReLU activation function. The output layer O is a fully-connected layer with sigmoidal units.

[Figure 2: (a) Deep Xi-ResNet and (b) an example of the contextual field of Deep Xi-ResNet with D = 4, E = 6, and r = 3.]

C. Mapped a priori SNR Training Target

The training target for the ResNet is a mapped version of the instantaneous a priori SNR. For the instantaneous case, |S(l, m)| and |V(l, m)| in Eq. (15) are known to compute λ_s(l, m) and λ_v(l, m). In [19], ξ_dB(l, m) = 10 log10[ξ(l, m)] was mapped to the interval [0, 1] in order to improve the rate of convergence of the stochastic gradient descent algorithm. The cumulative distribution function of ξ_dB(l, m) was used as the map. It can be seen from [19, Fig. 2 (top)] that the distribution of ξ_dB for a given frequency component m follows a normal distribution. Thus, it was assumed that ξ_dB(l, m) is distributed normally with mean μ_m and variance σ_m²: ξ_dB(l, m) ∼ N(μ_m, σ_m²). The mapped a priori SNR ξ̄(l, m) is given by:

ξ̄(l, m) = \frac{1}{2} \left[ 1 + \mathrm{erf}\left( \frac{ξ_dB(l, m) - μ_m}{σ_m \sqrt{2}} \right) \right].  (18)

Following [19], the statistics of ξ_dB(l, m) for each noisy speech spectral component are found over a sample of 1,000 noisy speech files from the training set. During inference, ξ̂(l, m) is found from ξ̂_dB(l, m) as follows:

ξ̂(l, m) = 10^{(ξ̂_dB(l, m)/10)},  (19)

where ξ̂_dB(l, m) is computed from ξ̄̂(l, m) as follows:

ξ̂_dB(l, m) = σ_m \sqrt{2} \, \mathrm{erf}^{-1}(2 ξ̄̂(l, m) - 1) + μ_m.  (20)

IV. SPEECH ENHANCEMENT EXPERIMENT

A. Training Set

For training Deep Xi-ResNet, a total of 74,250 clean speech recordings belonging to the train-clean-100 set from the Librispeech corpus [25] (28,539), the CSTR VCTK corpus [26] (42,015), and the si* and sx* training sets from the TIMIT corpus [27] (3,696) are used. 5% of the clean speech recordings are randomly selected and used as a validation set. Thus, 70,537 clean speech recordings are used in the training set and 3,713 in the validation set. The 2,382 noise recordings adopted in [19] are used as the noise training set. All clean speech and noise recordings are single-channel, with a sampling frequency of 16 kHz.

B. Training Strategy

The following strategy was employed to train the ResNet:
• Cross-entropy as the loss function.
• The Adam algorithm [28] with default hyper-parameters is used for gradient descent optimisation.
• Gradients are clipped between [-1, 1].
• The selection order of the clean speech recordings is randomised for each epoch.
• 175 epochs are used to train the ResNet.
• A mini-batch size of 10 noisy speech signals.
• The noisy signals are created as follows: each clean speech recording selected for the mini-batch is mixed with a random section of a randomly selected noise recording at a randomly selected SNR level (-10 to 20 dB, in 1 dB increments); see the mixing sketch after the next subsection.

C. Test Set

For objective experiments, 30 utterances belonging to six speakers are taken from the NOIZEUS corpus and are sampled at 16 kHz [9, Chapter 12]. We generate a noisy data set that has been corrupted by non-stationary (babble) and colored (factory2) noises [29] at SNR levels from -5 dB to 15 dB, in 5 dB increments.
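A sketch of the noisy-signal creation step referred to in Section IV-B is given below (illustrative; the function name and the RMS-based scaling are our assumptions):

import numpy as np

def mix_at_snr(clean, noise, snr_db, rng=None):
    """Mix a clean recording with a random section of a noise recording
    at a target SNR (dB)."""
    rng = rng or np.random.default_rng()
    start = rng.integers(0, len(noise) - len(clean) + 1)
    seg = np.asarray(noise[start:start + len(clean)], dtype=float)
    p_clean = np.mean(np.asarray(clean, dtype=float) ** 2)
    p_noise = np.mean(seg ** 2)
    # Scale the noise so that 10*log10(p_clean / p_noise') = snr_db.
    seg *= np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + seg

# e.g., a random SNR level in {-10, ..., 20} dB (1 dB increments):
# noisy = mix_at_snr(clean, noise, np.random.default_rng().integers(-10, 21))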
D. Evaluation Metrics

The objective quality and intelligibility evaluation is carried out using the perceptual evaluation of speech quality (PESQ) [30] and quasi-stationary speech transmission index (QSTI) [31] measures. We also analyze the enhanced speech spectrograms of the SEAs. The subjective evaluation was carried out through blind AB listening tests [32, Section 3.3.4]. Five English-speaking listeners participated in the tests, where the utterance sp05 ("Wipe the grease off his dirty face"), corrupted with 5 dB babble noise, was used as the stimulus.

The proposed method is compared with benchmark methods: MMSE-STSA [3], AKF-IT [10], the robustness metric-based tuning of the AKF (AKF-RMBT) [13], AKF-Oracle (where ({a_i}, σ_w²) and ({b_k}, σ_u²) are computed from the clean speech and noise signal), and Noisy (noise-corrupted speech).

V. RESULTS AND DISCUSSION

Fig. 3 (a)-(b) demonstrates that the proposed method consistently shows improved PESQ scores over the benchmark methods, except for AKF-Oracle. The AKF-RMBT method [13] exhibits PESQ scores competitive with the proposed method for babble noise (Fig. 3 (a)); however, for factory2 noise, its efficiency is reduced and it is only competitive with the other benchmark methods (Fig. 3 (b)).

[Figure 3: Performance comparison of the SEAs in terms of average: PESQ; (a) babble, (b) factory2, and QSTI; (c) babble, (d) factory2 noise conditions.]

Fig. 3 (c)-(d) shows that the proposed method demonstrates a consistent QSTI improvement across the noise experiments, apart from AKF-Oracle. The existing AKF-RMBT method [13] is also competitive with the proposed method. The QSTI scores of the MMSE-STSA [3] and Noisy methods are significantly lower than the AKF-IT [10] at low SNR levels.

It can be seen that the enhanced speech produced by the proposed method (Fig. 4 (f)) exhibits significantly less residual noise than that of the benchmark methods (Fig. 4 (c)-(e)) and is similar to that of AKF-Oracle (Fig. 4 (g)). Some distortion and noise-flooring is found for the AKF-RMBT method [13] (Fig. 4 (e)). The enhanced speech of the MMSE-STSA method [3] contains significant residual noise (Fig. 4 (c)).

[Figure 4: (a) Clean speech, (b) noisy speech (sp05 corrupted with 5 dB babble noise), and the enhanced speech spectrograms produced by the: (c) MMSE-STSA, (d) AKF-IT, (e) AKF-RMBT, (f) proposed, and (g) AKF-Oracle methods.]

Fig. 5 shows that the enhanced speech produced by the proposed method is more widely preferred by the listeners (78%) than the benchmark methods, apart from AKF-Oracle (81.75%) and the clean speech. The AKF-RMBT method [13] is found to be the most preferred (54%) amongst the benchmark methods.

[Figure 5: The mean preference score (%) for each SEA on sp05 corrupted with 5 dB babble noise.]
VI. CONCLUSIONS

This paper introduced a deep learning and augmented Kalman filter-based single-channel speech enhancement algorithm. Specifically, Deep Xi-ResNet is used to estimate the noise PSD for computing the noise LPCs. A whitening filter is then constructed with the noise LPCs to pre-whiten the noisy speech signal prior to the speech LPC estimation. The large training set of Deep Xi-ResNet enables the LPC estimates to be effective in various noise conditions. As a result, the improved speech and noise LPCs enable the AKF to minimize the residual noise as well as the distortion in the resultant enhanced speech. Extensive objective and subjective testing implies that the proposed method outperforms the benchmark methods in various noise conditions for a wide range of SNR levels.

REFERENCES

[1] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113-120, April 1979.
[2] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 208-211, April 1979.
[3] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109-1121, December 1984.
[4] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443-445, April 1985.
[5] P. Scalart and J. V. Filho, "Speech enhancement based on a priori signal to noise estimation," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 629-632, May 1996.
[6] C. Plapous, C. Marro, L. Mauuary, and P. Scalart, "A two-step noise reduction technique," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 289-292, May 2004.
[7] K. Paliwal and A. Basu, "A speech enhancement method based on Kalman filtering," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 12, pp. 177-180, April 1987.
[8] N. Upadhyay and A. Karmakar, "Speech enhancement using spectral subtraction-type algorithms: A comparison and simulation study," Procedia Computer Science, vol. 54, pp. 574-584, 2015.
[9] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd ed. Boca Raton, FL, USA: CRC Press, Inc., 2013.
[10] J. D. Gibson, B. Koo, and S. D. Gray, "Filtering of colored noise for speech enhancement and coding," IEEE Transactions on Signal Processing, vol. 39, no. 8, pp. 1732-1742, August 1991.
[11] S. So, A. E. W. George, R. Ghosh, and K. K. Paliwal, "A non-iterative Kalman filtering algorithm with dynamic gain adjustment for single-channel speech enhancement," International Journal of Signal Processing Systems, vol. 4, pp. 263-268, August 2016.
[12] S. So, A. E. W. George, R. Ghosh, and K. K. Paliwal, "Kalman filter with sensitivity tuning for improved noise reduction in speech," Circuits, Systems, and Signal Processing, vol. 36, no. 4, pp. 1476-1492, April 2017.
[13] A. E. W. George, S. So, R. Ghosh, and K. K. Paliwal, "Robustness metric-based tuning of the augmented Kalman filter for the enhancement of speech corrupted with coloured noise," Speech Communication, vol. 105, pp. 62-76, December 2018.
[14] H. Yu, Z. Ouyang, W. Zhu, B. Champagne, and Y. Ji, "A deep neural network based Kalman filter for time domain speech enhancement," IEEE International Symposium on Circuits and Systems, pp. 1-5, May 2019.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, June 2016.
[16] S. V. Vaseghi, "Linear prediction models," in Advanced Digital Signal Processing and Noise Reduction. John Wiley & Sons, 2009, ch. 8, pp. 227-262.
[17] R. C. Hendriks, R. Heusdens, and J. Jensen, "MMSE based noise PSD tracking with low complexity," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4266-4269, March 2010.
[18] T. Gerkmann and R. C. Hendriks, "Unbiased MMSE-based noise power estimation with low complexity and low tracking delay," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1383-1393, December 2012.
[19] A. Nicolson and K. K. Paliwal, "Deep learning for minimum mean-square error approaches to speech enhancement," Speech Communication, vol. 111, pp. 44-55, August 2019.
[20] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," ArXiv, vol. abs/1803.01271, 2018.
[21] J. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," ArXiv, vol. abs/1607.06450, 2016.
[22] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," 27th International Conference on Machine Learning, pp. 807-814, June 2010.
[23] Y. Luo and N. Mesgarani, "TasNet: Surpassing ideal time-frequency masking for speech separation," ArXiv, vol. abs/1809.07454, 2018.
[24] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. Kavukcuoglu, "Neural machine translation in linear time," ArXiv, vol. abs/1610.10099, 2016.
[25] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5206-5210, April 2015.
[26] C. Veaux, J. Yamagishi, and K. MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.
[27] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, Feb. 1993.
[28] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," ArXiv, vol. abs/1412.6980, 2014.
[29] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247-251, July 1993.
[30] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 749-752, May 2001.
[31] B. Schwerin and K. K. Paliwal, "An improved speech transmission index for intelligibility prediction," Speech Communication, vol. 65, pp. 9-19, December 2014.
[32] K. K. Paliwal, K. Wójcicki, and B. Schwerin, "Single-channel speech enhancement using spectral subtraction in the short-time modulation domain," Speech Communication, vol. 52, no. 5, pp. 450-475, May 2010.
Chapter 6

DeepLPC: A Deep Learning Approach to Augmented Kalman Filter-Based Single-Channel Speech Enhancement

STATEMENT OF CONTRIBUTION TO CO-AUTHORED PUBLISHED PAPER

This chapter includes a co-authored paper. The bibliographic details of the co-authored paper, including all authors, are:

Sujan Kumar Roy, Aaron Nicolson, Kuldip K. Paliwal, "DeepLPC: A Deep Learning Approach to Augmented Kalman Filter-Based Single-Channel Speech Enhancement", IEEE Access (revisions submitted 18 March 2021).

My contribution to the paper involved:
• Preliminary experiments.
• Experiment design.
• Conducted the experiments.
• Code writing.
• Design of models.
• Analysis of results.
• Literature review.
• Manuscript writing.

Aaron Nicolson aided with drafting the final manuscript. Professor Kuldip K. Paliwal provided supervision and aided with editing the final manuscript.

(Signed) Sujan Kumar Roy (Date) 02/04/2021
(Countersigned) Aaron Nicolson (Date) 02/04/2021
(Countersigned) Supervisor: Professor Kuldip K. Paliwal (Date) 02/04/2021

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier ***

DeepLPC: A Deep Learning Approach to Augmented Kalman Filter-Based Single-Channel Speech Enhancement

SUJAN KUMAR ROY¹, AARON NICOLSON², KULDIP K. PALIWAL¹
¹Signal Processing Laboratory, Griffith University, Nathan Campus, Brisbane, QLD, 4111, Australia
²Australian eHealth Research Centre, CSIRO, Herston, QLD, 4006, Australia

Corresponding author: Sujan Kumar Roy (e-mail:

[email protected]

).

ABSTRACT Current deep learning approaches to linear prediction coefficient (LPC) estimation for the augmented Kalman filter (AKF) produce biased estimates, due to the use of a whitening filter. This severely degrades the perceived quality and intelligibility of enhanced speech produced by the AKF. In this paper, we propose a deep learning framework that produces clean speech and noise LPC estimates with significantly less bias than previous methods, by avoiding the use of a whitening filter. The proposed framework, called DeepLPC, jointly estimates the clean speech and noise LPC power spectra. The estimated clean speech and noise LPC power spectra are passed through the inverse Fourier transform to form autocorrelation matrices, which are then solved by the Levinson-Durbin recursion to form the LPCs and prediction error variances of the speech and noise for the AKF. The performance of DeepLPC is evaluated on the NOIZEUS and DEMAND Voice Bank datasets using subjective AB listening tests, as well as seven different objective measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR). DeepLPC is compared to six existing deep learning-based methods. Compared to other deep learning approaches to clean speech LPC estimation, DeepLPC produces a lower spectral distortion (SD) level than existing methods, confirming that it exhibits less bias. DeepLPC also produced higher objective scores than any of the competing methods (with an improvement of 0.11 for CSIG, 0.15 for CBAK, 0.14 for COVL, 0.13 for PESQ, 2.66% for STOI, 1.11 dB for SegSNR, and 1.05 dB for SI-SDR, over the next best method). The enhanced speech produced by DeepLPC was also the most preferred by listeners. By producing less biased clean speech and noise LPC estimates, DeepLPC enables the AKF to produce enhanced speech at a higher quality and intelligibility.

INDEX TERMS Speech enhancement, Kalman filter, augmented Kalman filter, deep neural network, temporal convolutional network, LPC.

I. INTRODUCTION

The main objective of a speech enhancement algorithm (SEA) is to improve the quality and intelligibility of noise-corrupted speech (or noisy speech) [1]. This can be achieved by eliminating the embedded noise from a noisy speech signal without distorting the speech. Many applications, such as speech communication systems, hearing aid devices, and speech recognition systems, typically rely upon speech enhancement algorithms for robustness. Various SEAs, including spectral subtraction (SS) [2]-[5], the Wiener filter (WF) [6], [7], minimum mean square error (MMSE) estimators [8]-[11], the Kalman filter (KF) [12], the augmented KF (AKF) [13], computational auditory scene analysis (CASA) [14], and deep learning approaches [15] have been introduced over the decades. This paper focuses on deep learning for the AKF.

Paliwal and Basu introduced the Kalman filter for SEA [12]. For the KF, the clean speech signal is represented by an auto-regressive (AR) process, whose parameters comprise the linear prediction coefficients (LPCs) and prediction error variance. The LPC parameters and the additive noise variance are used to form the recursive equations of the KF. The KF gives a linear MMSE estimate of the current state of the clean speech given the observed noisy speech for each sample within a frame using the recursive equations. Therefore, the performance of the KF depends on how accurately the LPC parameters and additive noise variance are estimated.
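As a sketch of the post-processing described in the abstract above (LPC power spectra passed through the inverse Fourier transform to give autocorrelations, which are then solved by the Levinson-Durbin recursion), the following illustrative Python uses SciPy's Toeplitz solver, which is mathematically equivalent to the Levinson-Durbin recursion; the function name and the assumption that the power spectrum is given over rfft bins are ours.

import numpy as np
from scipy.linalg import solve_toeplitz

def lpcs_from_lpc_power_spectrum(ps, order):
    """Estimated LPC power spectrum -> autocorrelation lags -> LPCs and
    prediction error variance (normal equations via a Toeplitz solver)."""
    r = np.fft.irfft(ps)[: order + 1].real        # autocorrelation R(0..order)
    a = solve_toeplitz(r[:order], -r[1 : order + 1])
    var = r[0] + a @ r[1 : order + 1]             # prediction error variance
    return a, var

# Applied once to the speech spectrum (order p) and once to the noise
# spectrum (order q) to obtain ({a_i}, sigma_w^2) and ({b_k}, sigma_u^2).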
However, estimating the LPC parameters and additive noise variance from the noisy speech is difficult in practice, with poor estimates degrading the quality and intelligibility of the enhanced speech produced by the KF. In [12], it was demonstrated that the KF performs well for stationary white noise when the LPC parameters are computed from the clean speech.

Gibson et al. introduced an augmented KF (AKF) for enhancing speech corrupted by coloured noise [13]. In this SEA, both the clean speech and noise signal are represented by AR processes. The speech and noise LPC parameters are incorporated in an augmented matrix to construct the recursive equations of the AKF. In [13], the LPC parameters for the current frame are computed from the corresponding enhanced speech frame of the previous AKF iteration. Although the enhanced speech (obtained after 3-4 iterations) of the AKF demonstrates an improvement in signal-to-noise ratio (SNR), it suffers from musical noise and speech distortion. Thus, the AKF method [13] is not robust to inaccurate LPC parameter estimates in practice. Roy et al. introduced a sub-band (SB) iterative KF (SBIT-KF)-based SEA [16]. SBIT-KF employs an iterative KF to enhance only the high-frequency sub-bands (SBs) of the 16 decomposed SBs of a given noisy speech signal. However, the low-frequency SBs can also be affected by noise, typically when operating in real-life noise conditions. As demonstrated in [13], SBIT-KF also produced distorted speech. George et al. introduced a robustness metric-based tuning of the AKF gain for enhancing speech corrupted by coloured noise [17]. The authors showed that inaccurate estimates of the speech and noise LPC parameters introduce bias in the AKF gain, leading to a degradation in speech enhancement performance. In particular, the adjusted AKF gain is under-estimated in regions of speech, resulting in distorted speech.

As of late, deep learning has been used widely for speech enhancement. Motivated by the time-frequency (T-F) masking in CASA [14], Wang and Wang proposed the use of multi-layer perceptrons (MLPs) to estimate the ideal binary mask (IBM) [18]. The estimated IBM is used to reconstruct the clean speech spectrum from the noisy speech spectrum. Subsequently, it was demonstrated that the ideal ratio mask (IRM) is able to attain better speech quality than the IBM [19]. In [20], post-processing was applied after masking with the IBM, IRM, or ideal amplitude mask (IAM) [21], resulting in an improvement in objective quality and intelligibility. In [22], Williamson et al. introduced a complex ideal ratio mask (cIRM), which is able to estimate both the amplitude and phase spectra of the clean speech. In [23], Zheng et al. combine the instantaneous frequency deviation (IFD) with the IAM to form the phase sensitive mask (PSM). The clean speech spectrum is then reconstructed using the estimated mask and the phase information (extracted from the IFD).

Different from masking-based methods, mapping-based methods employ a DNN to extract the spectral features of the clean speech from those of the noisy speech. In [15], Xu et al. proposed a DNN to map the noisy speech log power spectra (LPS) to the clean speech LPS. In [24], Han et al. trained a DNN to learn a spectral mapping from the magnitude spectrum of noisy speech to that of clean speech.

Deep learning methods have also been proposed to improve the performance of statistical model-based SEAs, such as the MMSE short-time spectral amplitude (MMSE-STSA) estimator [8], the MMSE log-spectral amplitude (MMSE-LSA) estimator [9], the WF [1], and the square-root WF (SRWF) [1]. Specifically, the performance of these SEAs relies upon the accuracy of the a priori SNR estimation. Recently, a deep learning framework was proposed to estimate the a priori SNR directly from the noisy speech spectral magnitude, called Deep Xi [25]. The estimated a priori SNR is then employed by the MMSE-STSA estimator [8], the MMSE-LSA estimator [9], the WF [1], or the SRWF [1]. In [26], Zhang et al. proposed the DeepMMSE framework, which employed a ResNet temporal convolutional network (ResNet-TCN) for MMSE-based noise power spectral density (PSD) estimation. DeepMMSE demonstrates an improvement in noise PSD tracking over previous methods.

Deep learning has also been employed for time-domain speech enhancement. In [27], end-to-end utterance enhancement using a fully-convolutional network (EEUE-FCNN) was proposed. The authors claimed that the discontinuities present at the boundaries of framed speech are detrimental to the enhancement process. In this SEA, an FCNN facilitates a direct mapping of the noisy speech waveform to the clean speech waveform. The authors claim that the processing of the whole noisy speech waveform results in enhanced speech with an improvement in intelligibility.
However, this SEA shows only a moderate intelligibility improvement over some benchmark methods. Also, the FCNN model is constructed with ten one-dimensional convolutional layers, each comprising 30 filters with a filter size of 55. Thus, the processing of the whole waveform using the FCNN is computationally expensive, which may be inappropriate for systems requiring low delay, such as telephony.

A. RELATED WORK

In this section, we briefly review the existing deep learning methods that address LPC parameter estimation for the KF as well as AKF-based speech enhancement. In [28], Pickersgill et al. employed a similar DNN to that used in [15] for LPC estimation. They evaluated the LPC estimation performance in terms of the spectral distortion (SD) level. However, the performance of LPC estimation at low SNR levels was unspecified. In [29], Yu et al. proposed a deep learning-based KF for speech enhancement. A DNN containing three hidden layers is adopted for estimating the LPCs for each noisy speech frame. For training the DNN model, only 10 720 samples, constructed from 670 speech recordings with four noise recordings and four SNR levels, were used. This small amount of training data reduces the generalization capability of this model to estimate LPCs for a wide range of noise conditions. In addition, the noise covariance is estimated during speech pauses of the noisy speech, which does not account for noise conditions that have time-varying amplitudes.
Motivated by the performance of Deep Xi in combination with statistical model-based SEAs [25], a residual network (ResNet) [30] was incorporated within the Deep Xi framework to estimate parameters for the AKF [31]. Later on, the DeepMMSE framework [26] was used to estimate parameters for the KF [32]. In both methods [31], [32], the noise parameters for the AKF and KF are computed from the estimated noise PSD derived from Deep Xi and DeepMMSE, respectively. However, Deep Xi and DeepMMSE do not address speech LPC estimation directly from the noisy speech. Rather, a whitening filter is constructed with its coefficients computed from the estimated noise. The whitening filter is then applied to each noisy speech frame, yielding pre-whitened speech. The speech LPC parameters are then computed from the pre-whitened speech. It is shown that the estimated speech LPC parameters [32] produce a higher SD level than competing methods. This means that the whitening techniques in [31], [32] do not adequately address speech LPC parameter estimation. It is also demonstrated that the biased speech LPC estimates in [31], [32] impact the quality and intelligibility of the enhanced speech in real-life noise conditions.

In [33], Yu et al. adopted a DNN [29] and an LSTM network to estimate the speech and noise LPCs, respectively, for AKF-based speech enhancement. To estimate the prediction error variances for the AR processes of the AKF, the authors employed a maximum likelihood (ML) approach [34]. It was claimed in [33] that the LSTM-CKFS method performs better speech enhancement than other methods. However, due to training the LSTM network with a small amount of training data [29], LSTM-CKFS lacks the ability to estimate LPCs accurately in various noise conditions. The bias of the LPC estimates of LSTM-CKFS is reflected in the constructed AKF, which produces significant residual noise in the enhanced speech [33]. To minimize the residual noise, the authors employed a multi-band spectral subtraction (MB-SS) method [4]. For MB-SS, the noise spectrum is updated during speech pauses, which is not appropriate for noise conditions that have time-varying amplitudes. The state-of-the-art deep learning-based KF and AKF in the literature, such as [31], [32], are able to adequately suppress the background noise without any post-processing. In light of these observations, it is evident that LSTM-CKFS does not produce accurate enough LPC parameter estimates for the AKF.

TABLE 1: Summary of existing deep learning-based LPC estimation methods for the KF as well as AKF.

Methods | Summary | Limitations
DNN-LPC [28] | A DNN [15] is used to estimate the speech LPC parameters. | Due to training the DNN with a small dataset, its generalization capabilities may be reduced.
DeepXi-AKF [31] | The AKF is constructed with the noise and speech LPC parameters derived from a Deep Xi-ResNet framework and the whitening technique [17], respectively. | The whitening technique gives a biased estimate of the speech LPC parameters, which impacts the quality and intelligibility of enhanced speech.
DeepXi-KF [32] | The KF is constructed with the noise variance and speech LPC parameters derived from the DeepMMSE framework [26] and the whitening technique [17], respectively. | As in [31], the biased speech LPC parameters derived from the whitening technique impact the quality and intelligibility of enhanced speech.
LSTM-CKFS [33] | The AKF is constructed with the speech and noise LPC parameters derived from an LSTM network and an ML-based approach [34]. | LSTM-CKFS [33] exhibits a high amount of bias.

In light of the shortcomings of the existing deep learning-based KF and AKF methods presented in Table 1, this paper introduces DeepLPC, a deep learning framework for accurately estimating the speech and noise LPC parameters. Specifically, DeepLPC maps each frame of the noisy speech magnitude spectrum to the speech and noise LPC power spectra. The autocorrelation matrices (constructed from the estimated LPC power spectra using the inverse Fourier transform) are then solved by the Levinson-Durbin recursion, yielding the speech and noise LPC parameters. The proposed method aims to mitigate the weaknesses of previously proposed deep learning-based KFs and AKFs by providing an improved estimate of the speech LPCs. The motivation for this is to produce enhanced speech of higher quality and intelligibility in real-world noise conditions.

The structure of this paper is as follows: background knowledge is presented in Section II, including the signal model, the AKF for speech enhancement, and an overview of our previous works on the deep learning-based KF as well as AKF. In Section III, we describe the proposed SEA, which includes the proposed DeepLPC framework for LPC estimation. Following this, Section IV describes the experimental setup in terms of speech corpora, objective and subjective evaluation metrics, and specifications of the competing SEAs. The experimental results are then presented in Section V. Finally, Section VI gives some concluding remarks.
II. BACKGROUND

A. SIGNAL MODEL

The noisy speech y(n), at discrete-time sample n, is assumed to be given by:

y(n) = s(n) + v(n),  (1)

where s(n) is the clean speech and v(n) is uncorrelated additive coloured noise. Since the AKF operates on a frame-by-frame basis for speech enhancement, a 32 ms rectangular window with 50% overlap is first used to convert y(n) into frames, denoted by y(n, l):

y(n, l) = s(n, l) + v(n, l),  (2)

where l ∈ {0, 1, ..., L−1} is the frame index, with L being the total number of frames in an utterance, and n ∈ {0, 1, ..., N−1}, where N is the total number of samples within each frame.

The noisy speech y(n) in Eq. (1) is next analysed frame-wise using the short-time Fourier transform (STFT):

Y(l, m) = S(l, m) + V(l, m),  (3)

where Y(l, m), S(l, m), and V(l, m) denote the complex-valued STFT coefficients of the noisy speech, clean speech, and noise, respectively, for time-frame index l and discrete-frequency bin m. The Hamming window is used for analysis and synthesis.

B. AKF FOR SPEECH ENHANCEMENT

For simplicity, the frame index is omitted in the AKF recursive equations. Each frame of the clean speech and noise signal in (2) can be represented with p-th and q-th order AR models, as in [35, Chapter 8]:

s(n) = −Σ_{i=1}^{p} a_i s(n−i) + w(n),  (4)
v(n) = −Σ_{k=1}^{q} b_k v(n−k) + u(n),  (5)

where {a_i; i = 1, 2, ..., p} and {b_k; k = 1, 2, ..., q} are the LPCs. w(n) and u(n) are assumed to be white noise with zero mean and variances σ_w² and σ_u², respectively.

Equations (2), (4)-(5) can be used to form the following augmented state-space model (ASSM) of the AKF, as in [13]:

x(n) = Φx(n−1) + rg(n),  (6)
y(n) = c^⊤ x(n).  (7)

In the above ASSM,

1) x(n) = [s(n) ... s(n−p+1) v(n) ... v(n−q+1)]^⊤ is a (p+q) × 1 state vector,

2) Φ = [[Φ_s, 0], [0, Φ_v]] is a (p+q) × (p+q) state-transition matrix, with

\Phi_s = \begin{bmatrix} -a_1 & -a_2 & \cdots & -a_{p-1} & -a_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix},  (8)

\Phi_v = \begin{bmatrix} -b_1 & -b_2 & \cdots & -b_{q-1} & -b_q \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix},  (9)

3) r = [[r_s, 0], [0, r_v]], where r_s = [1 0 ... 0]^⊤ and r_v = [1 0 ... 0]^⊤ are p × 1 and q × 1 vectors,

4) g(n) = [w(n) u(n)]^⊤,  (10)

5) c^⊤ = [c_s^⊤ c_v^⊤], where c_s = [1 0 ... 0]^⊤ and c_v = [1 0 ... 0]^⊤ are p × 1 and q × 1 vectors,

6) y(n) is the noisy measurement at sample n.

For each frame, the AKF recursively computes an unbiased linear MMSE estimate, x̂(n|n), at sample n, given y(n), using the following equations [17]:

x̂(n|n−1) = Φ x̂(n−1|n−1),  (11)
Ψ(n|n−1) = Φ Ψ(n−1|n−1) Φ^⊤ + rQr^⊤,  (12)
K(n) = Ψ(n|n−1) c (c^⊤ Ψ(n|n−1) c)^{−1},  (13)
x̂(n|n) = x̂(n|n−1) + K(n)[y(n) − c^⊤ x̂(n|n−1)],  (14)
Ψ(n|n) = [I − K(n) c^⊤] Ψ(n|n−1),  (15)

where Q = [[σ_w², 0], [0, σ_u²]] is the process noise covariance. For a noisy speech frame, the error covariances (Ψ(n|n−1) and Ψ(n|n), corresponding to x̂(n|n−1) and x̂(n|n)) and the Kalman gain K(n) are continually updated on a sample-wise basis, while ({a_i}, σ_w²) and ({b_k}, σ_u²) remain constant.
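The recursion in Eqs. (11)-(15) is compact enough to state directly in code. The following is a minimal NumPy sketch of the AKF for a single frame; the function name and the initialisation of x̂(0|0) and Ψ(0|0) are illustrative choices, not taken from the paper.

import numpy as np

def akf_frame(y, a, sw2, b, su2):
    """Enhance one noisy frame y with the AKF of Eqs. (6)-(15).
    a, b: speech/noise LPCs {a_i}, {b_k}; sw2, su2: excitation variances."""
    y = np.asarray(y, dtype=float)
    p, q = len(a), len(b)
    def companion(coeffs):                              # Eqs. (8)-(9)
        F = np.zeros((len(coeffs), len(coeffs)))
        F[0, :] = -np.asarray(coeffs)
        F[1:, :-1] = np.eye(len(coeffs) - 1)
        return F
    Phi = np.block([[companion(a), np.zeros((p, q))],
                    [np.zeros((q, p)), companion(b)]])
    r = np.zeros((p + q, 2)); r[0, 0] = r[p, 1] = 1.0   # injects g(n) = [w(n) u(n)]^T
    c = np.zeros(p + q); c[0] = c[p] = 1.0              # reads y(n) = s(n) + v(n)
    Q = np.diag([sw2, su2])                             # process noise covariance
    x, P = np.zeros(p + q), np.eye(p + q)               # assumed initial state/covariance
    s_hat = np.empty_like(y)
    for n, yn in enumerate(y):
        x = Phi @ x                                     # Eq. (11)
        P = Phi @ P @ Phi.T + r @ Q @ r.T               # Eq. (12)
        K = P @ c / (c @ P @ c)                         # Eq. (13); scalar innovation
        x = x + K * (yn - c @ x)                        # Eq. (14)
        P = (np.eye(p + q) - np.outer(K, c)) @ P        # Eq. (15)
        s_hat[n] = x[0]                                 # s_hat(n|n) = h^T x_hat(n|n)
    return s_hat

Note that because the noise is part of the augmented state, the innovation c^⊤Ψ(n|n−1)c is a scalar and no separate measurement-noise term appears in Eq. (13).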
At sample n, h^⊤ x̂(n|n) gives the output of the AKF, ŝ(n|n), where h = [1 0 0 ... 0]^⊤ is a (p+q) × 1 column vector. As in [17], ŝ(n|n) is given by:

ŝ(n|n) = [1 − K_0(n)] ŝ(n|n−1) + K_0(n)[y(n) − v̂(n|n−1)],  (16)

where K_0(n) is the first component of K(n), given by [17]:

K_0(n) = (α²(n) + σ_w²) / (α²(n) + σ_w² + β²(n) + σ_u²),  (17)

where α²(n) = c_s^⊤ Φ_s Ψ_s(n−1|n−1) Φ_s^⊤ c_s and β²(n) = c_v^⊤ Φ_v Ψ_v(n−1|n−1) Φ_v^⊤ c_v are the transmissions of the a posteriori error variances of the speech and noise augmented dynamic models from the previous sample, n−1, respectively [17].

Equation (16) reveals that K_0(n) has a significant impact on ŝ(n|n). In practice, inaccurate estimates of ({a_i}, σ_w²) and ({b_k}, σ_u²) introduce bias into K_0(n), which impacts ŝ(n|n). In our previous works, Deep Xi-AKF [31] and Deep Xi-KF [32], the DeepMMSE framework [26] is used to obtain the parameter estimates, as described in the following section.

C. REVIEW OF Deep Xi-AKF AND Deep Xi-KF

This section briefly summarises the Deep Xi-AKF and Deep Xi-KF methods [31], [32], including their limitations. Deep Xi-AKF and Deep Xi-KF both employ an MMSE-based noise PSD estimator, called DeepMMSE [26]. DeepMMSE utilises deep learning to estimate the noise PSD, λ̂_v(l, m). Specifically, DeepMMSE leverages the Deep Xi framework to estimate the a priori and a posteriori SNR for the MMSE noise periodogram estimator [36], [37], where |V̂(l, m)|² is the noise periodogram estimate. To obtain λ̂_v(l, m) from |V̂(l, m)|², DeepMMSE employs first-order recursive smoothing:

λ̂_v(l, m) = η λ̂_v(l−1, m) + (1 − η)|V̂(l, m)|²,  (18)

where η is the smoothing factor.

For Deep Xi-AKF [31], ({b_k}, σ_u²) are computed from λ̂_v(l, m), where η = 0.9. However, ({a_i}, σ_w²) are still unknown. As in [17], ({a_i}, σ_w²) (p = 10) are computed frame-wise from pre-whitened speech, y_w(n, l), as presented in [31, Section III-A]. The AKF is then constructed with the estimated ({a_i}, σ_w²) and ({b_k}, σ_u²) for speech enhancement. In Deep Xi-KF [32], the noise variance σ_v² is estimated from λ̂_v(l, m), where η = 0 is used in Eq. (18) [32, Section 3.1]. As in [31], ({a_i}, σ_w²) (p = 10) are computed frame-wise from pre-whitened speech, y_w(n, l). The KF is then constructed with the estimated σ_v² and ({a_i}, σ_w²) for speech enhancement.

It was observed in [17] that biased speech LPC estimates are produced with the whitening filter [32]. This indicates that Deep Xi-AKF and Deep Xi-KF also produce biased speech LPC estimates.
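To make the whitening step concrete, the sketch below applies the inverse noise model A_v(z) = 1 + Σ_k b_k z^{−k} to a frame before fitting the speech LPCs, roughly as described for [31], [32]. The lpc() helper is a generic autocorrelation-method fit, and the random frames are stand-ins for actual noisy speech and estimated noise; none of this is taken from the authors' code.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(frame, order):
    """Generic autocorrelation-method LPC fit: returns ({a_i}, error variance)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), -r[1:order + 1])
    return a, r[0] + r[1:order + 1] @ a

# Placeholder frames; in [31], [32] these come from the noisy speech and
# from the DeepMMSE noise estimate, not from random samples.
rng = np.random.default_rng(0)
y_frame, noise_frame = rng.standard_normal(512), rng.standard_normal(512)

b, _ = lpc(noise_frame, 16)                     # noise LPCs {b_k}
y_w = lfilter(np.r_[1.0, b], [1.0], y_frame)    # pre-whitened speech y_w(n, l)
a, sw2 = lpc(y_w, 10)                           # speech LPCs ({a_i}, sigma_w^2), p = 10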
The biased estimates of ({a_i}, σ_w²) will thus impact the quality and intelligibility of the enhanced speech produced by the AKF and the KF. Moreover, while the noise LPC estimates produced by DeepMMSE are adequate, direct estimation of the noise LPCs could result in less bias.

III. PROPOSED SPEECH ENHANCEMENT ALGORITHM

To address the shortcomings of Deep Xi-AKF and Deep Xi-KF highlighted in the previous section, we propose the DeepLPC framework. DeepLPC jointly estimates the clean speech and noise LPC power spectra (LPC-PS), denoted as P_s(l, m) and P_v(l, m), respectively. The clean speech and noise LPC-PS estimates are used to compute the clean speech and noise LPC estimates, which display less bias than those produced by Deep Xi-AKF and Deep Xi-KF.

Fig. 1 shows the block diagram of the proposed SEA.

FIGURE 1: Block diagram of the proposed SEA.

As in Section II-A, y(n) is first converted into frames, y(n, l). Following this, the short-time noisy speech magnitude spectrum, |Y_l|, is computed. Next, DeepLPC jointly estimates P_s(l, m) and P_v(l, m) from |Y_l|. During training, P_s(l, m) and P_v(l, m) are computed as in [35, Chapter 9]:

P_s(l, m) = σ_w² / |1 + Σ_{i=1}^{p} a_i e^{−j2πim/M}|²,  (19)

P_v(l, m) = σ_u² / |1 + Σ_{k=1}^{q} b_k e^{−j2πkm/M}|²,  (20)

where ({a_i}, σ_w²) (p = 16) and ({b_k}, σ_u²) (q = 16) are computed from the clean speech (s(n, l)) and noise (v(n, l)) using the autocorrelation method [35]. The chosen speech and noise LPC orders (p = 16 and q = 16, respectively) are based on the finding that higher-order LPCs are required to accurately estimate the short-term correlation information of wideband (16 kHz) speech [38, Section 6.2.1-6.2.2 and Fig. 6.2].

To facilitate the convergence of the stochastic gradient descent algorithm, the dynamic range of P_s(l, m) and P_v(l, m) must be compressed. For this, we follow the same method used to compress the dynamic range of the instantaneous a priori SNR in [25]. We first convert P_s(l, m) and P_v(l, m) into the log-spectral domain, i.e., P_s(l, m)[dB] = 10 log10(P_s(l, m)) and P_v(l, m)[dB] = 10 log10(P_v(l, m)). Next, we utilise the cumulative distribution function (CDF) of P_s(l, m)[dB] and P_v(l, m)[dB] to compress their dynamic range to the interval [0, 1]. As an example, we observe that P_s(l, 64)[dB] and P_v(l, 64)[dB] follow a Gaussian distribution, as shown in Figs. 2 (a) and 3 (a), respectively. Therefore, we assume that P_s(l, m)[dB] and P_v(l, m)[dB] are distributed normally with means µ_s and µ_v, and variances σ_s² and σ_v², respectively (P_s(l, m)[dB] ∼ N(µ_s, σ_s²) and P_v(l, m)[dB] ∼ N(µ_v, σ_v²)). The statistics of P_s(l, m)[dB] and P_v(l, m)[dB], i.e., (µ_s, σ_s²) and (µ_v, σ_v²) for each frequency bin m, were found over a sample of the training set, as described in Section IV-B. The CDFs of P_s(l, 64)[dB] and P_v(l, 64)[dB] over the sample are shown in Fig. 2 (b) and Fig. 3 (b), respectively.

FIGURE 2: (a) The distribution of P_s(l, 64)[dB] over a sample of the training set. (b) The CDF of P_s(l, 64)[dB], assuming that P_s(l, 64)[dB] is distributed normally. The sample of the training set is described in Section IV-B.

FIGURE 3: (a) The distribution of P_v(l, 64)[dB] over a sample of the training set. (b) The CDF of P_v(l, 64)[dB], assuming that P_v(l, 64)[dB] is distributed normally. The sample of the training set is described in Section IV-B.

The CDFs of P_s(l, m)[dB] and P_v(l, m)[dB] are used to form the training targets for DeepLPC:

P̄_s(l, m) = (1/2)[1 + erf((P_s(l, m)[dB] − µ_s)/(σ_s √2))],  (21)

P̄_v(l, m) = (1/2)[1 + erf((P_v(l, m)[dB] − µ_v)/(σ_v √2))].  (22)

P̄_s(l, m) and P̄_v(l, m) are concatenated to form the final training target for DeepLPC:

ζ_l = {P̄_s(l, 0), P̄_s(l, 1), ..., P̄_s(l, M−1), P̄_v(l, 0), P̄_v(l, 1), ..., P̄_v(l, M−1)},  (23)

where ζ_l is of size M × 2.
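Under the assumptions of Eqs. (19)-(22), the target construction reduces to a few lines. The sketch below uses placeholder per-bin statistics (µ, σ) and an arbitrary first-order model; M = 257 is an assumed number of frequency bins.

import numpy as np
from scipy.special import erf

M = 257   # assumed number of frequency bins

def lpc_ps(a, var, M):
    """LPC power spectrum, Eqs. (19)-(20): var / |1 + sum_i a_i e^{-j2pi i m/M}|^2."""
    A = np.fft.rfft(np.r_[1.0, a], n=2 * (M - 1))   # denominator polynomial on the unit circle
    return var / np.abs(A) ** 2

def compress(P_db, mu, sigma):
    """Gaussian-CDF compression to [0, 1], Eqs. (21)-(22)."""
    return 0.5 * (1.0 + erf((P_db - mu) / (sigma * np.sqrt(2.0))))

# Arbitrary first-order speech model and placeholder statistics:
Ps = lpc_ps(np.array([-0.9]), 1.0, M)
Ps_bar = compress(10.0 * np.log10(Ps), mu=0.0, sigma=10.0)
# Eq. (23): the final target concatenates the speech and noise halves,
# e.g. zeta = np.concatenate([Ps_bar, Pv_bar]).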
During inference, ζ̂_l is first split into the estimates P̄̂_s(l, m) and P̄̂_v(l, m). The clean speech and noise LPC-PS estimates are then computed from P̄̂_s(l, m) and P̄̂_v(l, m):

P̂_s(l, m) = 10^{(σ_s √2 erf^{−1}(2P̄̂_s(l, m) − 1) + µ_s)/10},  (24)

P̂_v(l, m) = 10^{(σ_v √2 erf^{−1}(2P̄̂_v(l, m) − 1) + µ_v)/10}.  (25)

The |IDFT| of P̂_s(l, m) and P̂_v(l, m) yields estimates of the autocorrelation matrices, R̂(τ) and Ĥ(τ), where τ is the autocorrelation lag. With R̂(τ) and Ĥ(τ), {a_i} and {b_k} can be represented using the Yule-Walker equations [35, Chapter 8]:

\begin{bmatrix} \hat{R}(0) & \hat{R}(1) & \cdots & \hat{R}(p-1) \\ \hat{R}(1) & \hat{R}(0) & \cdots & \hat{R}(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ \hat{R}(p-1) & \hat{R}(p-2) & \cdots & \hat{R}(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} -\hat{R}(1) \\ -\hat{R}(2) \\ \vdots \\ -\hat{R}(p) \end{bmatrix},  (26)

\begin{bmatrix} \hat{H}(0) & \hat{H}(1) & \cdots & \hat{H}(q-1) \\ \hat{H}(1) & \hat{H}(0) & \cdots & \hat{H}(q-2) \\ \vdots & \vdots & \ddots & \vdots \\ \hat{H}(q-1) & \hat{H}(q-2) & \cdots & \hat{H}(0) \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_q \end{bmatrix} = \begin{bmatrix} -\hat{H}(1) \\ -\hat{H}(2) \\ \vdots \\ -\hat{H}(q) \end{bmatrix}.  (27)

The matrices R̂ and Ĥ in Equations (26)-(27) are Toeplitz matrices, in which each diagonal is constant. This special Toeplitz structure of the autocorrelation matrix allows the use of the Levinson-Durbin recursion to solve the Yule-Walker equations [35, Chapter 8]. Solving Equations (26) and (27) using the Levinson-Durbin recursion [35, Chapter 8] yields ({â_i}, σ̂_w²) (p = 16) and ({b̂_k}, σ̂_u²) (q = 16) for the AKF. For more about solving the Yule-Walker equations using the Levinson-Durbin recursion, we refer the reader to [35, Section 8.2.2].
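The inference path of Eqs. (24)-(27) can be sketched similarly. Here scipy.linalg.solve_toeplitz stands in for the Levinson-Durbin recursion (it likewise exploits the Toeplitz structure of Eqs. (26)-(27)); the network output and statistics are again placeholders.

import numpy as np
from scipy.special import erfinv
from scipy.linalg import solve_toeplitz

def lpc_from_compressed(P_bar, mu, sigma, order):
    """Recover ({a_i}, error variance) from a compressed LPC-PS estimate, Eqs. (24)-(27)."""
    P_db = sigma * np.sqrt(2.0) * erfinv(2.0 * P_bar - 1.0) + mu   # Eqs. (24)-(25)
    P = 10.0 ** (P_db / 10.0)
    R = np.abs(np.fft.irfft(P))              # |IDFT| yields the autocorrelation R(tau)
    a = solve_toeplitz((R[:order], R[:order]), -R[1:order + 1])    # Eqs. (26)-(27)
    var = R[0] + R[1:order + 1] @ a          # prediction error variance
    return a, var

# Placeholder network output for M = 257 bins and placeholder statistics:
a_hat, var_hat = lpc_from_compressed(np.full(257, 0.5), mu=0.0, sigma=10.0, order=16)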
A. DeepLPC-ResNet-TCN

The ResNet-TCN from [26] is chosen to estimate ζ_l from |Y_l| for DeepLPC, and is detailed in this section. The ResNet-TCN is shown in Fig. 4.

FIGURE 4: ResNet-TCN within the proposed DeepLPC framework. The ResNet-TCN consists of a fully-connected first layer, FC, followed by B residual blocks, and then a fully-connected output layer, O, that employs sigmoidal units.

The input, |Y_l|, is first passed through FC, a fully-connected layer of size d_model, followed by layer normalization (LN) [39] and the rectified linear unit (ReLU) activation function [40]. FC is followed by B bottleneck residual blocks, where j ∈ {1, 2, ..., B} is the block index. As in [41], each block contains three one-dimensional convolutional units. Each convolutional unit is pre-activated by LN [39] followed by the ReLU activation function [40]. The kernel size, output size, and dilation rate for each convolutional unit (Fig. 4) are denoted as (kernel size, output size, dilation rate).

The first and third convolutional units in each block have a kernel size of one, whilst the second convolutional unit has a kernel size of k_s. The output size of the first and second convolutional units is d_f, while that of the third is d_model. A dilation rate of one is set for the first and third convolutional units, and d for the second convolutional unit. The second convolutional unit provides a contextual field over previous time steps. The dilation rate d is cycled as the block index j increases: d = 2^{(j−1) mod (log2(D)+1)}, where mod is the modulo operation and D is the maximum dilation rate. An example of how the dilation rate is cycled is shown in Fig. 5, with D = 4 and B = 6.

FIGURE 5: Example of the contextual field of DeepLPC-ResNet-TCN, where D = 4, B = 6, and k_s = 3 are used.

It can be seen that the dilation rate is reset after block three. This also demonstrates the contextual field gained by the use of causal dilated convolutional units. The last residual block is followed by the output layer, O, which is a fully-connected layer with sigmoidal units. The O layer gives the estimate ζ̂_l. The hyperparameters used in [26] were used in this study: d_model = 256, d_f = 64, B = 40, k_s = 3, and D = 16. The training strategy, as well as a complexity and convergence analysis of the ResNet-TCN, are detailed in Sections IV-D and IV-E.
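The dilation-rate schedule is easy to verify numerically; the following sketch reproduces the cycling formula above (for D = 4 and B = 6 it matches Fig. 5, with the rate resetting after block three).

import math

def dilation_rate(j, D=16):
    """Dilation rate of block j: d = 2^((j-1) mod (log2(D)+1))."""
    return 2 ** ((j - 1) % (int(math.log2(D)) + 1))

# With D = 4, as in Fig. 5, the rate resets after block three:
print([dilation_rate(j, D=4) for j in range(1, 7)])   # [1, 2, 4, 1, 2, 4]
# With the hyperparameters used here (D = 16), blocks cycle through 1..16:
print([dilation_rate(j) for j in range(1, 11)])       # [1, 2, 4, 8, 16, 1, 2, 4, 8, 16]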
IV. SETUP OF THE SPEECH ENHANCEMENT EXPERIMENT

A. TRAINING & VALIDATION SET

In this paper, two datasets are used for training DeepLPC: the DEMAND Voice Bank [51] and Deep Xi datasets (specifically, an earlier version of the open-source Deep Xi dataset [53]), which have been used previously in [25], [26], [31], [32]. The details of the datasets are summarised in Table 2. All clean speech and noise recordings are single-channel, with a sampling frequency of 16 kHz. For the Deep Xi dataset, the noisy speech for the validation set is created by corrupting each of the 3 713 clean speech recordings with a random section of a randomly selected noise recording (from the set of 813 noise recordings) at a randomly selected SNR level (-10 to +20 dB, in 1 dB increments). The number of validation examples is thus equal to the number of clean speech recordings (3 713). As in [51], we do not use a validation set for the DEMAND Voice Bank dataset during training.

TABLE 2: Summary of the training and validation datasets used in this paper.

Dataset | No. of Speech Recordings | No. of Noise Recordings | No. of Train Data | No. of Validation Data
Deep Xi dataset (training set) | 74 250 clean speech recordings are taken from the Librispeech corpus [42] (28 539), CSTR VCTK corpus [43] (42 015), and TIMIT corpus [44] (3 696). | 16 243 noise recordings are taken from the following datasets: QUT-NOISE [45], Nonspeech [46], Environmental Background Noise [47], [48], MUSAN [49], FreeSound packs*, and coloured noise recordings. | 70 537 clean speech recordings and 15 430 noise recordings are used to construct the training set, as described in Section IV-D. | 5% of the speech and noise recordings, i.e., 3 713 clean speech recordings and 813 noise recordings, are used to construct the validation set.
DEMAND Voice Bank dataset (training set) | The Voice Bank corpus comprises 11 572 clean speech recordings [50]. | 10 noise recordings, including 2 synthetic noises (speech shaped and babble) [51] and 8 real-world noises from the DEMAND dataset [52]. | 11 572 training examples are generated, as described in Section IV-D. | As in [51], no validation data is used in training.

*Freesound packs (https://freesound.org/) that were used: 147, 199, 247, 379, 622, 643, 1 133, 1 563, 1 840, 2 432, 4 366, 4 439, 15 046, 15 598, 21 558.

B. TRAINING SET SAMPLE

For the Deep Xi dataset, 2 500 randomly selected clean speech recordings were mixed with 2 500 randomly selected noise recordings at SNR levels from -10 dB to +20 dB in 1 dB increments, giving 2 500 noisy speech signals. For each frequency bin m, the sample means and variances, (µ_s, σ_s²) and (µ_v, σ_v²), were computed from the 2 500 concatenated clean speech recordings and scaled noises, respectively. This sample is used in Figures 2 and 3. The same is done to produce the sample for the DEMAND Voice Bank dataset, except that a sample size of 1 500 is used.

C. TEST SET

For the objective experiments, the NOIZEUS dataset was used to evaluate the performance of DeepLPC trained with the Deep Xi dataset. In addition, DeepLPC trained with the DEMAND Voice Bank dataset is evaluated using the DEMAND Voice Bank test set.
The details of the NOIZEUS and DEMAND Voice Bank test sets are given in Table 3. All of the clean speech and noise recordings in Table 3 are single-channel, with a sampling frequency of 16 kHz. Note that the speech and noise recordings in both test sets are different from those used in the training and validation sets.

TABLE 3: Summary of the test datasets used in this paper.

Dataset | No. of Speech Recordings | No. of Noise Recordings | Generation of Test Dataset
NOIZEUS dataset | 30 clean speech utterances belonging to six speakers (three male and three female) are taken from the NOIZEUS corpus [1, Chapter 12]. | 2 real-world non-stationary (voice babble, street) and 2 real-world coloured (factory, f16) noise recordings are taken from [47], [48]. | The noisy speech for the test set is formed by mixing the clean speech with voice babble, street, factory, and f16 noise recordings at multiple SNR levels varying from -5 dB to +15 dB, in 5 dB increments. This provides 30 examples per condition, with 20 total conditions.
DEMAND Voice Bank (test set) | 824 clean speech recordings of two speakers from the Voice Bank corpus: 393 from p232 and 431 from p257 [50]. | 5 noise recordings selected from the DEMAND dataset [52]. | The noisy speech for the test set is formed by mixing the 824 clean speech recordings with the 5 noise recordings at multiple SNR levels: {2.5, 7.5, 12.5, 17.5} dB. This provides 824 examples per condition, with 20 total conditions.

D. TRAINING STRATEGY

The following training strategy was employed to train DeepLPC-ResNet-TCN:

• The mean square error between the predicted and true values is used as the loss function.
• The Adam algorithm with default hyperparameters is adopted for gradient descent optimisation [54].
• Gradients are clipped to the interval [−1, 1].
• The number of training examples in an epoch is equal to the number of clean speech recordings in the training set, i.e., 70 537 for the Deep Xi dataset and 11 572 for the DEMAND Voice Bank dataset.
• A mini-batch size of 8 training examples is used.
• For the Deep Xi dataset, the noisy speech signals are generated on the fly as follows: each clean speech recording is corrupted with a randomly selected noise recording at a randomly selected SNR level (-10 to +20 dB, in 1 dB increments), as sketched in the example after this list.
• For the DEMAND Voice Bank dataset, the noisy speech signals for the training set are formed by mixing each clean speech recording with a random section of a randomly selected noise recording at a random SNR level from the set {0, 5, 10, 15} (dB). This creates a set of 11 572 noisy speech signals for training.
• For the Deep Xi dataset, we employ early stopping with a patience of 30 epochs. Using this strategy, training terminated at epoch 168.
• No early stopping is set for the DEMAND Voice Bank dataset, as it has no validation set (Table 2). Instead, a total of 125 epochs was used to train DeepLPC.
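The on-the-fly corruption in the list above amounts to drawing a noise section and scaling it to a target SNR. A minimal sketch, with illustrative names and random stand-ins for the recordings:

import numpy as np

rng = np.random.default_rng(0)

def mix_at_snr(s, d, snr_db):
    """Corrupt clean recording s with a random section of noise d at snr_db."""
    start = rng.integers(0, len(d) - len(s) + 1)         # random noise section
    d = d[start:start + len(s)]
    # Scale the noise so that 10*log10(P_s / P_d) equals the target SNR:
    gain = np.sqrt((s @ s) / (d @ d) * 10.0 ** (-snr_db / 10.0))
    return s + gain * d

# Random stand-ins for a clean recording and a (longer) noise recording:
s, d = rng.standard_normal(16000), rng.standard_normal(48000)
y = mix_at_snr(s, d, snr_db=rng.integers(-10, 21))       # -10 to +20 dB, 1 dB steps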
E. COMPLEXITY AND CONVERGENCE ANALYSIS OF DeepLPC-ResNet-TCN

The complexity of a DNN depends on the number of training parameters; the ResNet-TCN has 2.1 million parameters. This is markedly less than other models used for speech enhancement, such as the residual LSTM (ResLSTM) from [25], which employs 10 million parameters. This also allows for a significant speedup in training time, with the ResNet-TCN taking 40 minutes per epoch, compared to 8 hours per epoch for the ResLSTM network, on the Deep Xi dataset (using an NVIDIA GTX 1080 Ti GPU).

Next, we analyse the convergence of the mean square error between the predicted and true values for the training and validation sets of DeepLPC-ResNet-TCN, as shown in Fig. 6. It can be seen that the mean square error reduces for both the training and the validation data after each epoch, until converging at around epoch 140. As the early stopping criterion with a patience of 30 is used, epoch 138 is chosen for testing.

FIGURE 6: Mean square error between the predicted and true values for the training and validation data sets of DeepLPC-ResNet-TCN.

F. SD LEVEL EVALUATION

The frame-wise spectral distortion (SD) (dB) [28] is used to evaluate the accuracy of the LPC estimates produced by DeepLPC. Specifically, the estimated clean speech LPCs are evaluated. The SD for the l-th frame, D_l (in dB), is defined as the root-mean-square difference between the LPC power spectrum estimate in dB, P̂_s(l, m)[dB], and the oracle case in dB, P_s(l, m)[dB]:

D_l = √( (1/M) Σ_{m=0}^{M−1} (P_s(l, m)[dB] − P̂_s(l, m)[dB])² ).  (28)
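Eq. (28) amounts to the RMS difference of two dB-domain spectra; a minimal sketch:

import numpy as np

def spectral_distortion(P_hat, P_oracle):
    """Frame-wise SD (dB), Eq. (28): RMS difference of the dB-domain LPC-PS."""
    diff = 10.0 * np.log10(P_hat) - 10.0 * np.log10(P_oracle)
    return np.sqrt(np.mean(diff ** 2))

The tables in Section V report this value averaged over all frames of a test condition.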
G. OBJECTIVE QUALITY AND INTELLIGIBILITY MEASURES

Objective measures are used to evaluate the quality and intelligibility of the enhanced speech with respect to the corresponding clean speech. The objective quality and intelligibility measures used in this paper are given in Table 4.

TABLE 4: Objective measures, what each assesses, and the range of their scores. For each measure, higher is better.

Measure | Assesses | Range
CSIG [55] | Quality | [1, 5]
CBAK [55] | Quality | [1, 5]
COVL [55] | Quality | [1, 5]
PESQ [56] | Quality | [−0.5, 4.5]
STOI [57] | Intelligibility | [0, 100]%
SI-SDR [58] | Quality | (−∞, ∞)
SegSNR [59] | Quality | (−∞, ∞)
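Of the measures in Table 4, PESQ and STOI have widely used open-source implementations; the sketch below assumes the pesq and pystoi Python packages, which are not necessarily the implementations used to produce the results in this paper.

from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

def objective_scores(clean, enhanced, fs=16000):
    """Wideband PESQ and STOI (%) for one utterance.
    clean and enhanced are 1-D float arrays sampled at fs."""
    return pesq(fs, clean, enhanced, 'wb'), 100.0 * stoi(clean, enhanced, fs)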
H. SPECTROGRAM EVALUATION

We also compare the enhanced speech spectrograms of the proposed SEA to those of recent AKF SEAs, to visually analyse the level of residual noise as well as distortion. For this purpose, we generate a set of stimuli by corrupting utterances sp05 and sp27 from the NOIZEUS corpus [1, Chapter 12]. The reference transcript for utterance sp05 is: "Wipe the grease off his dirty face"; it is corrupted with voice babble noise at 5 dB. The reference transcript for utterance sp27 is: "Bring your best compass to the third class"; it is corrupted with factory noise at 5 dB. Utterances sp05 and sp27 were uttered by a male and a female speaker, respectively.

I. SUBJECTIVE EVALUATION

Subjective evaluation was carried out through a series of blind AB listening tests [5, Section 3.3.4]. To perform the tests, the stimuli set described in Section IV-H was used. The enhanced speech produced by eight SEAs, as well as the corresponding clean speech and noisy speech signals, was played as stimuli pairs to the listeners. Specifically, the test is performed on a total of 180 stimuli pairs (90 for each utterance), played in a random order to each listener, excluding comparisons between the same method.

After listening to a stimuli pair, the listeners' preference was determined by the selection of one of three options. The first and second options indicated a preference for one of the two stimuli, while the third option indicated an equal preference for both stimuli. For pairwise scoring, 100% of the score is given to the preferred method, whilst 0% is given to the other. 50% of the score is given to both methods for an equal preference. The participants could re-listen to the stimuli pair if required. Ten English-speaking listeners participated in the blind AB listening tests (the tests were conducted with approval from Griffith University's Human Research Ethics Committee: database protocol number 2018/671). The average of the scores given by the listeners, termed the mean subjective preference score (%), is used to subjectively compare the SEAs.

J. SPECIFICATIONS OF THE COMPETING SEAs

The performance of the proposed SEA is compared to the following SEAs (the following notation is used for convenience: (p, q) is the order of {a_i} and {b_k}, (σ_w², σ_u²) are the prediction error variances of the speech and noise AR models, w_f is the analysis frame duration (ms), and s_f is the analysis frame shift (ms)):

1) Noisy: speech corrupted with additive noise.
2) AKF-Oracle: AKF, where ({a_i}, σ_w²) and ({b_k}, σ_u²) are computed from the clean speech and the noise signal, where p = 16, q = 16, w_f = 32 ms, s_f = 16 ms, and a rectangular window is used for framing.
3) DNN-LPC-KF: KF-based SEA, where ({â_i}, σ̂_w²) are estimated using a DNN (C-DNN3) [28] and σ̂_v² is computed from the first noisy speech frame, y(0, l) (with the assumption that it is silent), p = 12, w_f = 20 ms, s_f = 0 ms. A rectangular window is used for framing.
4) LSTM-CKFS [33]: AKF constructed using ({â_i}, σ̂_w²) and ({b̂_k}, σ̂_u²) computed using LSTM and ML-based approaches, followed by post subtraction using the multi-band SS method [4], where p = 12, q = 12, w_f = 20 ms, s_f = 0 ms, and a rectangular window is used for framing.
5) EEUE-FCNN [27]: End-to-end utterance enhancement using a fully convolutional neural network.
6) IAM-IFD [23]: Phase-aware DNN for speech enhancement, where w_f = 20 ms, s_f = 5 ms, and the Hamming window is used for analysis and synthesis.
7) Deep Xi-KF [32]: KF-based SEA, where σ̂_v² is estimated using the DeepMMSE framework [26] and ({â_i}, σ̂_w²) are computed from pre-whitened speech corresponding to each noisy speech frame, where p = 10, w_f = 32 ms, s_f = 16 ms, and a rectangular window is used for framing.
8) Deep Xi-ResNet-TCN-MMSE-LSA: A ResNet [30] is incorporated within the Deep Xi framework [25], Deep Xi-ResNet [31], to estimate the a priori SNR. The estimated a priori SNR is then employed by the MMSE-LSA estimator [9], where w_f = 32 ms, s_f = 16 ms, and a square-root-Hann window is used for analysis and synthesis.
9) Proposed: AKF constructed from ({â_i}, σ̂_w²) and ({b̂_k}, σ̂_u²) computed using the DeepLPC framework, where p = 16, q = 16, w_f = 32 ms, s_f = 16 ms, and a rectangular window is used for framing.

V. RESULTS AND DISCUSSIONS

A. SD LEVEL COMPARISON

The average SD levels (found over all frames for each test condition on the NOIZEUS dataset) attained by DeepLPC are given in Table 5. It can be seen that for both real-world non-stationary (voice babble and street) and coloured (factory and f16) noise conditions, DeepLPC is able to produce lower SD levels than the competing deep learning-based LPC estimation methods. Amongst the competing methods, Deep Xi-KF [32] produced the lowest SD levels. The SD levels for noisy speech indicate the upper bounds of the SD level.

TABLE 5: Average SD (dB) level comparison for each of the LPC estimation methods on the NOIZEUS dataset as described in Table 3, at SNR levels of -5 to 15 dB.

Voice babble | -5 dB | 0 dB | 5 dB | 10 dB | 15 dB
Noisy | 22.05 | 18.29 | 14.86 | 13.80 | 11.87
DNN-LPC [28] | 16.72 | 15.98 | 13.24 | 12.76 | 10.79
LSTM-CKFS [33] | 15.91 | 14.51 | 12.11 | 11.89 | 9.23
Deep Xi-KF [32] | 14.95 | 13.88 | 11.81 | 10.31 | 9.11
DeepLPC | 11.89 | 10.49 | 8.73 | 7.33 | 6.51

Street | -5 dB | 0 dB | 5 dB | 10 dB | 15 dB
Noisy | 20.21 | 16.39 | 14.43 | 13.88 | 12.45
DNN-LPC [28] | 13.41 | 12.25 | 11.68 | 11.18 | 10.87
LSTM-CKFS [33] | 12.57 | 11.05 | 10.78 | 10.35 | 9.86
Deep Xi-KF [32] | 11.66 | 10.51 | 9.74 | 9.21 | 8.95
DeepLPC | 9.21 | 8.74 | 7.59 | 6.91 | 5.89

Factory | -5 dB | 0 dB | 5 dB | 10 dB | 15 dB
Noisy | 29.46 | 25.21 | 21.16 | 18.36 | 16.83
DNN-LPC [28] | 18.74 | 17.15 | 16.47 | 15.79 | 14.67
LSTM-CKFS [33] | 16.39 | 15.91 | 14.61 | 13.60 | 13.12
Deep Xi-KF [32] | 15.10 | 14.98 | 13.87 | 12.72 | 12.33
DeepLPC | 12.29 | 10.89 | 9.48 | 8.21 | 7.89

F16 | -5 dB | 0 dB | 5 dB | 10 dB | 15 dB
Noisy | 28.81 | 24.56 | 20.54 | 17.78 | 15.32
DNN-LPC [28] | 18.93 | 17.78 | 16.55 | 15.23 | 13.22
LSTM-CKFS [33] | 16.78 | 15.36 | 14.65 | 13.13 | 12.78
Deep Xi-KF [32] | 14.21 | 13.01 | 12.59 | 11.96 | 10.81
DeepLPC | 12.13 | 10.46 | 9.49 | 8.63 | 7.83

The average SD level for each method (found on the DEMAND Voice Bank test set) is shown in Table 6. It can be seen that DeepLPC produced a lower SD level than the competing methods. Amongst the competing methods, Deep Xi-KF [32] exhibits the lowest SD level, followed by LSTM-CKFS [33] and DNN-LPC [28]. In light of this comparative study, the lower SD levels attained by DeepLPC demonstrate that it produces clean speech and noise LPC estimates with less bias than previous methods.

TABLE 6: Average SD (dB) level comparison for each of the LPC estimation methods on the DEMAND Voice Bank test set as described in Table 3.

Methods | Average SD Level (dB)
Noisy | 20.67
DNN-LPC [28] | 15.13
LSTM-CKFS [33] | 13.26
Deep Xi-KF [32] | 11.35
DeepLPC | 8.13

B. OBJECTIVE QUALITY EVALUATION

Fig. 7 shows the average PESQ score for each SEA over a range of conditions (FIGURE 7: Average PESQ score for each SEA found over all frames for each condition in the NOIZEUS dataset (Table 3)). It can be seen that AKF-Oracle exhibits the highest PESQ score for all of the tested conditions. This is due to ({a_i}, σ_w²) and ({b_k}, σ_u²) being computed from the clean speech and the noise signal, which are unobserved in practice. Thus, AKF-Oracle provides an indication of the upper bound for the AKF in terms of PESQ score. Conversely, the average PESQ score for Noisy indicates the lower bound of the PESQ score for each of the tested conditions. The proposed SEA consistently produces a higher PESQ score than the competing SEAs across the tested conditions. Amongst the competing methods, Deep Xi-ResNet-TCN-MMSE-LSA produced the highest PESQ scores for each of the tested conditions (Fig. 7). In light of this comparative study, it is evident that the proposed SEA produces higher quality enhanced speech than the competing SEAs across the tested conditions. This is due to the lower bias exhibited by the clean speech and noise LPC estimates of DeepLPC, as demonstrated in the previous section.

C. OBJECTIVE INTELLIGIBILITY EVALUATION

Fig. 8 shows the average STOI score for each SEA (FIGURE 8: Average STOI score for each SEA found over all frames for each condition in the NOIZEUS dataset (Table 3)). As in Section V-B, the AKF-Oracle method achieves the highest STOI score for each tested condition. Amongst the SEAs, the proposed method attained the highest average STOI score for each tested condition. When analysing the performance of the competing SEAs, Deep Xi-ResNet-TCN-MMSE-LSA attained the highest average STOI scores for the tested conditions. Again, the proposed method outperformed the competing SEAs, producing more intelligible enhanced speech for the tested conditions. The lower bias exhibited by the clean speech and noise LPC estimates of DeepLPC results in not only higher quality enhanced speech, but also more intelligible enhanced speech than that of the competing SEAs.
D. OBJECTIVE EVALUATION FOR MULTIPLE OBJECTIVE MEASURES

In this section, we perform an analysis of the performance improvement of the proposed method over the competing methods for the objective measures described in Table 4, including CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR. The mean objective evaluation results for the NOIZEUS dataset and the DEMAND Voice Bank test set are shown in Tables 7 and 8, respectively. It can be seen that AKF-Oracle produces the highest scores for all measures, which can be thought of as the upper boundary of performance. Noisy produced the lowest scores for all measures, indicating the lower boundary of performance. The proposed method shows a consistent CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR score improvement over the competing methods. This again demonstrates that DeepLPC produces enhanced speech at a higher quality and intelligibility than any of the competing methods.

TABLE 7: Mean objective scores on the NOIZEUS dataset in terms of CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR. Apart from AKF-Oracle, the highest score amongst the methods for each measure is attained by the proposed method.

Methods | CSIG | CBAK | COVL | PESQ | STOI | SegSNR | SI-SDR
Noisy speech | 2.41 | 2.27 | 2.12 | 1.64 | 67.87 | 0.89 | 6.39
DNN-LPC-KF | 2.59 | 2.49 | 2.34 | 1.89 | 74.60 | 6.33 | 10.72
LSTM-CKFS | 2.63 | 2.55 | 2.42 | 1.99 | 77.58 | 6.54 | 11.15
EEUE-FCNN | 2.76 | 2.66 | 2.56 | 2.05 | 79.45 | 6.93 | 11.59
IAM-IFD | 2.95 | 2.72 | 2.63 | 2.11 | 80.64 | 7.01 | 11.88
Deep Xi-KF | 3.11 | 2.83 | 2.72 | 2.16 | 81.89 | 7.14 | 12.15
Deep Xi-ResNet-TCN-MMSE-LSA | 3.38 | 3.02 | 2.81 | 2.22 | 82.05 | 7.67 | 13.39
Proposed | 3.49 | 3.17 | 2.95 | 2.35 | 84.71 | 8.78 | 14.44
AKF-Oracle | 4.21 | 4.07 | 3.97 | 2.74 | 95.18 | 10.87 | 16.43

TABLE 8: Mean objective scores on the DEMAND Voice Bank test set in terms of CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR. Apart from AKF-Oracle, the highest score amongst the methods for each measure is attained by the proposed method.

Methods | CSIG | CBAK | COVL | PESQ | STOI | SegSNR | SI-SDR
Noisy speech | 3.50 | 2.47 | 2.73 | 1.99 | 91.53 | 1.71 | 8.39
DNN-LPC-KF | 3.61 | 2.79 | 2.91 | 2.57 | 91.79 | 7.33 | 16.68
LSTM-CKFS | 3.63 | 2.85 | 2.93 | 2.61 | 91.87 | 7.44 | 16.75
EEUE-FCNN | 3.67 | 2.87 | 3.02 | 2.64 | 92.03 | 7.64 | 17.19
IAM-IFD | 3.74 | 2.96 | 3.11 | 2.69 | 92.13 | 7.67 | 17.32
Deep Xi-KF | 3.90 | 3.09 | 3.23 | 2.74 | 92.34 | 8.21 | 17.55
Deep Xi-ResNet-TCN-MMSE-LSA | 4.19 | 3.40 | 3.52 | 2.83 | 93.31 | 9.03 | 17.72
Proposed | 4.27 | 3.48 | 3.61 | 2.97 | 94.44 | 9.19 | 17.98
AKF-Oracle | 4.54 | 4.12 | 4.19 | 3.21 | 96.13 | 11.43 | 20.22

E. SPECTROGRAM ANALYSIS

This section analyses the enhanced speech spectrograms produced by each of the SEAs for the test conditions specified in Section IV-H. Specifically, Fig. 9 (i) shows the spectrogram of the clean speech (male utterance sp05).

FIGURE 9: Spectrograms of: (i) clean speech (utterance sp05), (ii) noisy speech (corrupted with 5 dB voice babble noise), (iii)-(x) enhanced speech produced by each SEA.

The clean speech is corrupted by voice babble noise at an SNR level of 5 dB to create the noisy speech shown in Fig. 9 (ii). This is a particularly tough condition for speech enhancement, since the background noise exhibits characteristics similar to the speech produced by the target speaker. The enhanced speech produced by DNN-LPC-KF is shown in Fig. 9 (iii). It can be seen that DNN-LPC-KF significantly reduces the amount of background noise in the noisy speech. The enhanced speech produced by LSTM-CKFS [33] (Fig. 9 (iv)) contains less residual background noise than that of DNN-LPC-KF (Fig. 9 (iii)); however, it suffers from significant speech distortion. Fig. 9 (v) shows the enhanced speech produced by EEUE-FCNN [27]. This method produced less distorted speech than LSTM-CKFS [33] (Fig. 9 (iv)); however, residual background noise still remains. The enhanced speech produced by IAM-IFD [23] (Fig. 9 (vi)) shows less speech distortion and residual background noise than EEUE-FCNN (Fig. 9 (v)). Less residual background noise is present in the enhanced speech produced by Deep Xi-KF [32] (Fig. 9 (vii)) than IAM-IFD [23] (Fig. 9 (vi)); however, the speech is more distorted. Deep Xi-ResNet-TCN-MMSE-LSA produced less distorted speech (Fig. 9 (viii)) than Deep Xi-KF [32] (Fig. 9 (vii)). The enhanced speech produced by the proposed method is shown in Fig. 9 (ix). It can be seen that it produces less residual background noise and speech distortion than Deep Xi-ResNet-TCN-MMSE-LSA (Fig. 9 (viii)). Finally, the enhanced speech produced by the AKF-Oracle method is shown in Fig. 9 (x). The enhanced speech of AKF-Oracle is most similar to the clean speech in Fig. 9 (i). This is due to AKF-Oracle using the clean speech and noise (unobserved in practice) for LPC parameter estimation.
Fig. 10 shows the enhanced speech spectrograms of each SEA for a real-world coloured noise condition.

FIGURE 10: Spectrograms of: (i) clean speech (utterance sp27), (ii) noisy speech (corrupted with 5 dB factory noise), (iii)-(x) enhanced speech produced by each SEA.

The spectrogram of the clean speech (female utterance sp27) is shown in Fig. 10 (i). The clean speech is corrupted by factory noise at an SNR level of 5 dB to generate the noisy speech shown in Fig. 10 (ii). The enhanced speech produced by DNN-LPC-KF (Fig. 10 (iii)) contains significant residual background noise. The enhanced speech produced by LSTM-CKFS [33] (Fig. 10 (iv)) shows less residual background noise than DNN-LPC-KF (Fig. 10 (iii)); however, it still suffers from significant speech distortion. The enhanced speech of EEUE-FCNN has less speech distortion and residual background noise (Fig. 10 (v)) than LSTM-CKFS [33] (Fig. 10 (iv)). IAM-IFD [23] produced enhanced speech with less background noise (Fig. 10 (vi)) than EEUE-FCNN (Fig. 10 (v)), although noticeable speech distortion remains. The enhanced speech produced by Deep Xi-KF [32] (Fig. 10 (vii)) contains less residual background noise, as well as less speech distortion, than IAM-IFD [23] (Fig. 10 (vi)). Less residual noise as well as speech distortion remains in the enhanced speech produced by Deep Xi-ResNet-TCN-MMSE-LSA (Fig. 10 (viii)) than that of Deep Xi-KF [32] (Fig. 10 (vii)). The enhanced speech produced by the proposed method (Fig. 10 (ix)) contains the least amount of residual background noise, as well as speech distortion, amongst the SEAs, and is most similar to the enhanced speech produced by AKF-Oracle (Fig. 10 (x)).
F. SUBJECTIVE EVALUATION

The mean subjective preference scores (%) for each SEA are shown in Figures 11 and 12.

FIGURE 11: The mean preference score (%) comparison between the proposed and benchmark SEAs for male utterance sp05 corrupted with 5 dB non-stationary voice babble noise. The error bars indicate the standard deviation of the scores.

FIGURE 12: The mean preference score (%) comparison between the proposed and benchmark SEAs for female utterance sp27 corrupted with 5 dB coloured factory noise. The error bars indicate the standard deviation of the scores.

The non-stationary (voice babble) noise experiment in Fig. 11 reveals that the proposed method is widely preferred (71%) by the listeners over the competing methods, apart from the clean speech (100%) and AKF-Oracle (81%). Deep Xi-ResNet-TCN-MMSE-LSA is found to be the most preferred method (66%) amongst the competing SEAs, with Deep Xi-KF (62%) being the next most preferred method, followed by LSTM-CKFS (57%), IAM-IFD (56%), and then EEUE-FCNN (52%). LSTM-CKFS [33] was preferred by the listeners more than EEUE-FCNN [27], even though EEUE-FCNN attained higher objective scores (Sections V-B and V-C). This may be due to the fact that LSTM-CKFS [33] demonstrates superior noise suppression in regions of speech compared to EEUE-FCNN [27], as indicated in [17]. DNN-LPC-KF was given the lowest preference score (46%) amongst the SEAs.

The blind AB listening test results for the coloured (factory) noise condition are shown in Fig. 12. It can be seen that the proposed method achieves a better preference score (74%) than the competing methods, except for clean speech (100%) and AKF-Oracle (84%). As in the previous experiment, Deep Xi-ResNet-TCN-MMSE-LSA was the most preferred amongst the competing methods (70%), with Deep Xi-KF [32] (63%) being the next most preferred method, followed by IAM-IFD [23] (61%), LSTM-CKFS [33] (59%), EEUE-FCNN [27] (56%), and then DNN-LPC-KF (48%). In light of the blind AB listening tests, it is evident that the enhanced speech produced by the proposed method exhibits the best perceived quality amongst all of the competing SEAs, for both male and female utterances corrupted by real-life non-stationary and coloured noise sources.
VI. CONCLUSION

We propose the DeepLPC framework to estimate LPCs for the AKF in real-life noise conditions. Specifically, DeepLPC maps each frame of the noisy speech magnitude spectrum to the LPC power spectra of the clean speech and noise signal. Applying the inverse Fourier transform to the estimated LPC power spectra yields the corresponding autocorrelation matrices. The application of the Levinson-Durbin recursion to the autocorrelation matrices then yields the LPC estimates and the prediction error variances of the speech and noise signal. The Deep Xi and DEMAND Voice Bank datasets were used to train DeepLPC separately, to ensure generalisation capability to unseen conditions. The NOIZEUS dataset is used to evaluate the performance of DeepLPC trained with the Deep Xi dataset. In addition, DeepLPC trained with the DEMAND Voice Bank dataset is evaluated using the DEMAND Voice Bank test set. It was shown that the estimated LPCs produced by the proposed DeepLPC framework exhibit a lower SD level than recent deep learning methods for both of the test sets. Moreover, the AKF constructed with the speech and noise LPC parameters derived from DeepLPC is capable of speech enhancement in real-life noise conditions. Extensive objective and subjective testing on the NOIZEUS and DEMAND Voice Bank test sets demonstrates that the proposed method outperforms the competing deep learning-based methods in various noise conditions for a wide range of SNR levels.

Though the proposed DeepLPC framework achieves the lowest SD level for LPC estimation, there is still room for improvement. For example, in the proposed DeepLPC framework, we have incorporated the ResNet-TCN. However, the multi-head self-attention network (MHANet) has been shown to outperform the ResNet-TCN for speech enhancement [60]. Motivated by this, the DeepLPC framework will employ the MHANet in future work, to facilitate a further improvement in LPC estimation for AKF-based speech enhancement.

REFERENCES
[1] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd ed. Boca Raton, FL, USA: CRC Press, Inc., 2013.
[2] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113–120, April 1979.
[3] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 208–211, April 1979.
[4] S. Kamath and P. Loizou, "A multi-band spectral subtraction method for enhancing speech corrupted by colored noise," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 4160–4164, May 2002.
[5] K. Paliwal, K. Wójcicki, and B. Schwerin, "Single-channel speech enhancement using spectral subtraction in the short-time modulation domain," Speech Communication, vol. 52, no. 5, pp. 450–475, May 2010.
[6] P. Scalart and J. V. Filho, "Speech enhancement based on a priori signal to noise estimation," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 629–632, May 1996.
[7] C. Plapous, C. Marro, L. Mauuary, and P. Scalart, "A two-step noise reduction technique," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 289–292, May 2004.
[8] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, December 1984.
[9] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, April 1985.
[10] K. Paliwal, B. Schwerin, and K. Wójcicki, "Speech enhancement using a minimum mean-square error short-time spectral modulation magnitude estimator," Speech Communication, vol. 54, no. 2, pp. 282–305, February 2012.
[11] B. M. Mahmmod, A. R. Ramli, T. Baker, F. Al-Obeidat, S. H. Abdulhussain, and W. A. Jassim, "Speech enhancement algorithm based on super-gaussian modeling and orthogonal polynomials," IEEE Access, vol. 7, pp. 103 485–103 504, 2019.
[12] K. Paliwal and A. Basu, "A speech enhancement method based on Kalman filtering," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 12, pp. 177–180, April 1987.
[13] J. D. Gibson, B. Koo, and S. D. Gray, "Filtering of colored noise for speech enhancement and coding," IEEE Transactions on Signal Processing, vol. 39, no. 8, pp. 1732–1742, August 1991.
[14] G. J. Brown and D. Wang, Separation of Speech by Computational Auditory Scene Analysis. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005.
[15] Y. Xu, J. Du, L. Dai, and C. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, 2014.
[16] S. K. Roy, W. P. Zhu, and B. Champagne, "Single channel speech enhancement using subband iterative Kalman filter," IEEE International Symposium on Circuits and Systems, pp. 762–765, May 2016.
[17] A. E. George, S. So, R. Ghosh, and K. K. Paliwal, "Robustness metric-based tuning of the augmented Kalman filter for the enhancement of speech corrupted with coloured noise," Speech Communication, vol. 105, pp. 62–76, October 2018.
[18] Y. Wang and D. Wang, "Towards scaling up classification-based speech separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1381–1390, 2013.
[19] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849–1858, 2014.
[20] N. Saleem, M. I. Khattak, M. Al-Hasan, and A. B. Qazi, "On learning spectral masking for single channel speech enhancement using feedforward and recurrent neural networks," IEEE Access, vol. 8, pp. 160581–160595, 2020.
[21] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 708–712.
[22] D. S. Williamson, Y. Wang, and D. Wang, "Complex ratio masking for monaural speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 483–492, 2016.
[23] N. Zheng and X. Zhang, "Phase-aware speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 63–76, 2019.
[24] K. Han, Y. Wang, D. Wang, W. S. Woods, I. Merks, and T. Zhang, "Learning spectral mapping for speech dereverberation and denoising," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 6, pp. 982–992, 2015.
[25] A. Nicolson and K. K. Paliwal, "Deep learning for minimum mean-square error approaches to speech enhancement," Speech Communication, vol. 111, pp. 44–55, 2019.
[26] Q. Zhang, A. Nicolson, M. Wang, K. K. Paliwal, and C. Wang, "DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1404–1415, 2020.
[27] S. Fu, T. Wang, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1570–1584, 2018.
[28] C. Pickersgill, S. So, and B. Schwerin, "Investigation of DNN prediction of power spectral envelopes for speech coding & ASR," 17th Speech Science and Technology Conference (SST2018), Sydney, Australia, Dec 2018.
[29] H. Yu, Z. Ouyang, W. Zhu, B. Champagne, and Y. Ji, "A deep neural network based Kalman filter for time domain speech enhancement," IEEE International Symposium on Circuits and Systems, pp. 1–5, May 2019.
[30] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, June 2016.
[31] S. K. Roy, A. Nicolson, and K. K. Paliwal, "Deep learning with augmented Kalman filter for single-channel speech enhancement," in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020, pp. 1–5.
[32] S. K. Roy, A. Nicolson, and K. K. Paliwal, "A deep learning-based Kalman filter for speech enhancement," Proc. Interspeech 2020, pp. 2692–2696, October 2020.
[33] H. Yu, W.-P. Zhu, and B. Champagne, "Speech enhancement using a DNN-augmented colored-noise Kalman filter," Speech Communication, vol. 125, pp. 142–151, 2020.
[34] S. Srinivasan, J. Samuelsson, and W. B. Kleijn, "Codebook driven short-term predictor parameter estimation for speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 163–176, 2006.
[35] S. V. Vaseghi, Advanced Digital Signal Processing and Noise Reduction. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2006.
[36] R. C. Hendriks, R. Heusdens, and J. Jensen, "MMSE based noise PSD tracking with low complexity," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4266–4269, March 2010.
[37] T. Gerkmann and R. C. Hendriks, "Unbiased MMSE-based noise power estimation with low complexity and low tracking delay," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1383–1393, December 2012.
[38] S. So, "Efficient block quantisation for image and speech coding," PhD dissertation, Griffith University, 2005. [Online]. Available: https://research-repository.griffith.edu.au/handle/10072/366625
[39] L. J. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," CoRR, vol. abs/1607.06450, 2016.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," CoRR, vol. abs/1502.01852, 2015.
[41] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. Kavukcuoglu, "Neural machine translation in linear time," CoRR, vol. abs/1610.10099, 2016.
[42] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5206–5210, April 2015.
[43] C. Veaux, J. Yamagishi, and K. MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.
[44] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, Feb. 1993.
[45] D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason, "The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms," in Proceedings Interspeech 2010, 2010, pp. 3110–3113.
[46] G. Hu, "100 nonspeech environmental sounds," The Ohio State University, Department of Computer Science and Engineering, 2004.
[47] F. Saki, A. Sehgal, I. Panahi, and N. Kehtarnavaz, "Smartphone-based real-time classification of noise signals using subband features and random forest classifier," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 2204–2208.
[48] F. Saki and N. Kehtarnavaz, "Automatic switching between noise classification and speech enhancement for hearing aid devices," in 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Aug 2016, pp. 736–739.
[49] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," CoRR, vol. abs/1510.08484, 2015.
[50] C. Veaux, J. Yamagishi, and S. King, "The voice bank corpus: Design, collection and data analysis of a large regional accent speech database," in 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013, pp. 1–4.
[51] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust text-to-speech," in 9th ISCA Speech Synthesis Workshop, 2016, pp. 146–152.
[52] J. Thiemann, N. Ito, and E. Vincent, "The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings," Proceedings of Meetings on Acoustics, vol. 19, no. 1, p. 035081, 2013.
[53] A. Nicolson, "Deep Xi dataset," IEEE Dataport, 2020. [Online]. Available: https://dx.doi.org/10.21227/3adt-pb04
[54] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2015.
[55] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238, 2008.
[56] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 749–752, May 2001.
[57] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time–frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[58] J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR – half-baked or well done?" in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 626–630.
[59] P. Mermelstein, "Evaluation of a segmental SNR measure as an indicator of the quality of ADPCM coded speech," The Journal of the Acoustical Society of America, vol. 66, no. 6, pp. 1664–1667, 1979.
[60] A. Nicolson and K. K. Paliwal, "Masked multi-head self-attention for causal speech enhancement," Speech Communication, vol. 125, pp. 80–96, 2020.

SUJAN KUMAR ROY (Graduate Student Member, IEEE) received B.Sc. and M.Sc. degrees in Computer Science and Engineering from the University of Rajshahi, Bangladesh, in 2008 and 2010, respectively. He also received a Master of Applied Science (M.A.Sc) degree in Electrical and Computer Engineering from Concordia University, Canada, in May 2016. He is currently a Ph.D candidate in the School of Engineering, Griffith University, Brisbane, Australia. His research interests include speech processing, machine learning, and data science.

KULDIP K. PALIWAL was born in Aligarh, India, in 1952. He received the B.S. degree from Agra University, Agra, India, in 1969, the M.S. degree from Aligarh Muslim University, Aligarh, India, in 1971, and the Ph.D degree from Bombay University, Bombay, India, in 1978. He has been carrying out research in the area of speech processing since 1972. He has worked at a number of organizations, including Tata Institute of Fundamental Research, Bombay, India, Norwegian Institute of Technology, Trondheim, Norway, University of Keele, U.K., AT&T Bell Laboratories, Murray Hill, New Jersey, U.S.A., AT&T Shannon Laboratories, Florham Park, New Jersey, U.S.A., and Advanced Telecommunication Research Laboratories, Kyoto, Japan. Since July 1993, he has been a professor at Griffith University, Brisbane, Australia, in the School of Microelectronic Engineering. His current research interests include speech recognition, speech coding, speaker recognition, speech enhancement, face recognition, image coding, pattern recognition and artificial neural networks. He has published more than 300 papers in these research areas. Prof. Paliwal is a Fellow of the Acoustical Society of India. He served the IEEE Signal Processing Society's Neural Networks Technical Committee as a founding member from 1991 to 1995 and the Speech Processing Technical Committee from 1999 to 2003. He was an Associate Editor of the IEEE Transactions on Speech and Audio Processing during the periods 1994-1997 and 2003-2004, and served as an Associate Editor of the IEEE Signal Processing Letters from 1997 to 2000. He is on the Editorial Board of the IEEE Signal Processing Magazine. He was the General Co-Chair of the Tenth IEEE Workshop on Neural Networks for Signal Processing (NNSP2000). He has co-edited two books: "Speech Coding and Synthesis" (published by Elsevier) and "Speech and Speaker Recognition: Advanced Topics" (published by Kluwer). He received the IEEE Signal Processing Society's best (senior) paper award in 1995 for his paper on LPC quantization. He served as the Editor-in-Chief of the Speech Communication journal (published by Elsevier) during 2005-2011.

AARON NICOLSON was born in Brisbane, Australia in 1994. He received a BEng (Class 1A Hons.) and PhD degree from Griffith University, Brisbane, Australia, in 2016 and 2020, respectively. He is currently a postdoctoral research fellow at the Australian eHealth Research Centre, CSIRO. His research interests include speech, natural language, image, and multimodal processing using deep learning.

Chapter 7
DeepLPC-MHANet: Multi-Head Self-Attention for Augmented Kalman Filter-Based Speech Enhancement

STATEMENT OF CONTRIBUTION TO CO-AUTHORED PUBLISHED PAPER

This chapter includes a co-authored paper. The bibliographic details of the co-authored paper, including all authors, are:

Sujan Kumar Roy, Aaron Nicolson, Kuldip K. Paliwal, "DeepLPC-MHANet: Multi-Head Self-Attention for Augmented Kalman Filter-based Speech Enhancement", under review with IEEE Access (submitted 08 April 2021).

My contribution to the paper involved:
• Preliminary experiments.
• Experiment design.
• Conducted the experiments.
• Code writing.
• Design of models.
• Analysis of results.
• Literature review.
• Manuscript writing.

Aaron Nicolson aided with drafting the final manuscript. Professor Kuldip K. Paliwal provided supervision and aided with editing the final manuscript.
(Signed) Sujan Kumar Roy, (Date) 02/04/2021
(Countersigned) Aaron Nicolson, (Date) 02/04/2021
(Countersigned) Supervisor: Professor Kuldip K. Paliwal, (Date) 02/04/2021

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000. Digital Object Identifier ***

DeepLPC-MHANet: Multi-Head Self-Attention for Augmented Kalman Filter-based Speech Enhancement

SUJAN KUMAR ROY (1), AARON NICOLSON (2), KULDIP K. PALIWAL (1)
(1) Signal Processing Laboratory, Griffith University, Nathan Campus, Brisbane, QLD, 4111, Australia
(2) Australian eHealth Research Centre, CSIRO, Herston, QLD, 4006, Australia
Corresponding author: Sujan Kumar Roy (e-mail: [email protected]).

ABSTRACT Current augmented Kalman filter (AKF)-based speech enhancement algorithms (SEAs) utilise a temporal convolutional network (TCN) to estimate the clean speech and noise linear prediction coefficients (LPCs). However, the multi-head attention network (MHANet) has demonstrated the ability to more efficiently model the long-term dependencies of noisy speech than TCNs. Motivated by this, we investigate the MHANet for LPC estimation. We aim to produce clean speech and noise LPC parameters with the least bias to date, and with this, to produce higher quality and more intelligible enhanced speech than any current KF or AKF-based SEA. Here, we investigate the MHANet within the DeepLPC framework. DeepLPC is a deep learning framework for jointly estimating the clean speech and noise LPC power spectra. DeepLPC is selected as it exhibits significantly less bias than other frameworks, by avoiding the use of whitening filters and post-processing. DeepLPC-MHANet is evaluated on the NOIZEUS corpus using subjective AB listening tests, as well as seven different objective measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR). DeepLPC-MHANet is compared to five existing deep learning-based methods. Compared to other deep learning approaches, DeepLPC-MHANet produced clean speech LPC estimates with the least amount of bias. DeepLPC-MHANet-AKF also produced higher objective scores than any of the competing methods (with an improvement of 0.17 for CSIG, 0.15 for CBAK, 0.19 for COVL, 0.24 for PESQ, 3.70% for STOI, 1.03 dB for SegSNR, and 1.04 dB for SI-SDR over the next best method). The enhanced speech produced by DeepLPC-MHANet-AKF was also the most preferred amongst ten listeners. By producing LPC estimates with the least amount of bias to date, DeepLPC-MHANet enables the AKF to produce enhanced speech at a higher quality and intelligibility than any previous method.

INDEX TERMS Speech enhancement, Kalman filter, augmented Kalman filter, LPC, temporal convolutional network, multi-head attention network.

I. INTRODUCTION
Speech corrupted by background noise (or noisy speech) can reduce the efficiency of communication between speaker and listener. A speech enhancement algorithm (SEA) can be used to suppress the embedded background noise and increase the quality and intelligibility of noisy speech [1]. SEAs are useful in many applications where noisy speech is undesirable and unavoidable. For example, speech communication systems, hearing aid devices, and speech recognition systems typically rely upon SEAs for robustness. Various SEAs, namely spectral subtraction (SS) [2]–[4], Wiener filter (WF) [5], minimum mean square error (MMSE) [6]–[8], Kalman filter (KF) [9], augmented KF (AKF) [10], computational auditory scene analysis (CASA) [11], and deep learning-based methods [12], have been introduced over the decades. This paper focuses on the AKF, constructed from parameters estimated using deep learning.

The KF is an unbiased linear MMSE estimator, which was first introduced as a SEA by Paliwal and Basu [9]. In this seminal work, each frame of the uncorrupted speech signal (i.e., clean speech) is represented by an auto-regressive (AR) process, whose parameters comprise the linear prediction coefficients (LPCs) and the prediction error variance.
The LPC parameters as well as the additive noise variance are inherent to the KF recursive equations. For simplicity, the background noise was assumed to be stationary and white. Given a frame of noisy speech samples, the recursive equations of the KF estimate the clean speech samples. As demonstrated by Paliwal and Basu, it is difficult to accurately estimate the clean speech LPC parameters and additive noise variance in practice, with poor estimates resulting in enhanced speech with low quality and intelligibility.

In [10], Gibson et al. introduced the augmented KF (AKF) for speech enhancement in coloured noise conditions. For the AKF, both the clean speech and additive background noise are represented by AR processes. The clean speech and noise LPC parameters form an augmented matrix, which is used to construct the recursive equations of the AKF. In [10], the AKF processes the noisy speech iteratively (usually three to four iterations) to suppress the coloured background noise, yielding the enhanced speech. During this, the clean speech and noise LPC parameters for the current frame are estimated from the corresponding filtered speech frame of the previous iteration. Although this method demonstrated the ability to improve the signal-to-noise ratio (SNR) of noisy speech, the resultant enhanced speech suffered from musical noise and speech distortion. This is because the AKF is not robust to inaccurate LPC estimates [13], [14].

In [15], Roy et al. introduced a sub-band iterative KF (SBIT-KF) SEA. With the assumption that the impact of noise in low-frequency sub-bands (SBs) is negligible, SBIT-KF enhances only the high-frequency SBs of the noisy speech using two KF iterations. However, low-frequency SBs can also be affected by noise, typically when operating in real-life noise conditions. Moreover, the iterative processing employed by SBIT-KF produces speech distortion [10]. George et al. used a robustness metric to tune the AKF for coloured noise [13]. The authors demonstrated that inaccurate estimates of the clean speech and noise LPC parameters introduce bias in the AKF gain, leading to a degradation in speech enhancement performance. Typically, the adjusted AKF gain is under-estimated in speech regions, resulting in distorted speech.

In recent years, deep learning-based supervised methods have been used for speech enhancement. Many approaches utilise a time-frequency (T-F) representation derived from the unobserved clean speech and noise as the training target [11]. Inspired by T-F masking in CASA [11], Wang and Wang proposed to use a deep neural network (DNN) to estimate the ideal binary mask (IBM) [16]. The estimated IBM can be used to estimate the T-F components of the clean speech. Later on, researchers found that the ideal ratio mask (IRM) produces higher objective quality scores than the IBM [17]. In [18], a post-processing method was employed after masking with the IBM, IRM, or ideal amplitude mask (IAM) [19], resulting in an improvement in objective quality and intelligibility. In [20], Williamson et al. introduced the complex ideal ratio mask (cIRM), which is capable of recovering both the amplitude and the phase spectrum of the clean speech. In [21], Zheng et al. introduced a phase-aware SEA. Here, the phase information (converted to the instantaneous frequency deviation (IFD)) is jointly used with the IAM to form the phase-sensitive mask (PSM). The clean speech spectrum is then reconstructed using the estimated mask and the phase information (extracted from the IFD). Unlike masking-based methods, mapping-based methods utilise a DNN to estimate the clean speech spectrum. In [12], Xu et al. employed a DNN to map the noisy speech log-power spectra (LPS) to the clean speech LPS. In [22], Han et al. trained a DNN to learn a spectral mapping from the magnitude spectrum of noisy speech to that of clean speech.

Deep learning methods have also been proposed to improve the performance of statistical model-based SEAs, such as the MMSE short-time spectral amplitude (MMSE-STSA) estimator [6], the MMSE log-spectral amplitude (MMSE-LSA) estimator [7], the WF [1], and the square-root WF (SRWF) [1]. Generally, the performance of these SEAs relies upon the accuracy of the a priori SNR estimate. In [23], Nicolson and Paliwal proposed Deep Xi, a deep learning framework to estimate the a priori SNR. In [24], Zhang et al. proposed the DeepMMSE framework for noise power spectral density (PSD) estimation. DeepMMSE uses the Deep Xi framework with a residual network temporal convolutional network (ResNet-TCN) to estimate parameters for the MMSE-based noise periodogram estimator. DeepMMSE was able to demonstrate better noise PSD tracking than other benchmark methods in various noise conditions.

In [25], an attention-based network was investigated for speech enhancement, namely the multi-head attention network (MHANet). This was motivated by the ability of multi-head attention to more efficiently model long-term dependencies than recurrent neural networks (RNNs) and TCNs [26]. The experimental results demonstrated that the MHANet was able to attain significantly higher objective quality and intelligibility scores than a TCN and a long short-term memory (LSTM) network. This indicated that multi-head attention is more apt at modelling the long-term dependencies of the clean speech and background noise present in noisy speech than RNNs and TCNs.

Deep learning has also been employed for time-domain speech enhancement. In [27], Fu et al. proposed raw waveform-based speech enhancement using a fully convolutional neural network (RWF-FCNN). Different from noisy speech spectral mapping [22], RWF-FCNN maps each frame of the noisy speech waveform to the corresponding clean speech waveform. By estimating time-domain samples, RWF-FCNN also implicitly estimates the phase, unlike spectral magnitude estimation methods [12]. In [28], the authors claimed that the discontinuities present at the boundaries of framed speech are detrimental to the enhancement process in [27]. Motivated by this, they proposed end-to-end utterance enhancement using an FCNN (EEUE-FCNN). In this SEA, an FCNN directly maps the noisy speech to the clean speech. It was shown that EEUE-FCNN [28] produces more intelligible enhanced speech than RWF-FCNN [27].

Deep learning has also been used for LPC estimation, a key stage of KF and AKF-based SEAs [9], [10]. In [29], Pickersgill et al. proposed an LPC estimation method using a DNN.
One drawback of this study is that results weren't given for lower SNR levels (below 10 dB). Moreover, only six noise recordings were used for training, reducing its generalisation capability to unseen noise conditions. For the KF SEA in [30], Yu et al. utilised a DNN to estimate LPC parameters from noisy speech frames. For training, only 10 720 examples constructed from 670 speech recordings, four noise recordings, and four SNR levels were used. This limits the number of conditions observed by the DNN during training, thus reducing its generalisation capability to unseen conditions. Also, the additive noise variance is computed from the first noisy speech frame by assuming that no speech is present. However, this does not account for conditions that have time-varying amplitudes. In [31], Yu et al. adopted a fully-connected feed-forward DNN (denoted as FNN) and an LSTM network to estimate the clean speech and noise LPCs, respectively, as well as multi-band spectral subtraction (MB-SS) post-processing [3], for coloured-noise AKF-based speech enhancement (FNN-CKFS, LSTM-CKFS). To estimate the prediction error variances for the AR processes of the AKF, the authors employed a maximum likelihood (ML) approach [32]. However, FNN-CKFS and LSTM-CKFS lack the ability to accurately estimate LPCs in various noise conditions, leading to the use of MB-SS for post-processing. This could be due to the small amount of training data used when fitting the FNN and LSTM networks.

Motivated by the performance improvement that Deep Xi offers to statistical model-based SEAs [23], the AKF in [14] employed the Deep Xi framework to estimate its parameters (named Deep Xi-AKF). Improving upon Deep Xi-AKF, the KF in [33] utilised the DeepMMSE framework [24] to estimate its parameters (named Deep Xi-KF, as DeepMMSE uses Deep Xi). This was motivated by DeepMMSE's ability to significantly reduce MMSE-based noise PSD estimation bias. Deep Xi-AKF and Deep Xi-KF also used significantly larger training sets than previous methods. For Deep Xi-AKF and Deep Xi-KF, the noise parameters are computed from the estimated noise PSD derived from Deep Xi and DeepMMSE, respectively. However, Deep Xi-AKF and Deep Xi-KF do not directly estimate the clean speech LPC parameters from the noisy speech. Rather, a whitening filter is constructed with its coefficients computed from the estimated noise PSD. The whitening filter is then applied to each noisy speech frame, yielding pre-whitened speech, from which the speech LPC parameters are computed. This leads to biased clean speech LPC estimates, thus impacting the quality and intelligibility of the enhanced speech produced by the AKF and KF.

Recently, a deep learning framework was proposed to estimate the clean speech and noise LPC power spectra (LPC-PS), called DeepLPC [34]. The clean speech and noise LPC-PS estimates are then used to compute the LPC estimates required to construct the AKF. As a result, DeepLPC produces clean speech LPCs with significantly less bias than the aforementioned methods. This leads to the production of the highest quality and most intelligible enhanced speech amongst current KF and AKF SEAs, outperforming Deep Xi-KF while using the same training set. However, a ResNet-TCN [24] was used to estimate the clean speech and noise LPC-PS (DeepLPC-ResNet-TCN). As mentioned previously, a ResNet-TCN is suboptimal for modelling the long-term dependencies of noisy speech.

Motivated by the shortcomings of previous deep learning-based KF and AKF SEAs, we propose DeepLPC-MHANet for AKF-based speech enhancement. DeepLPC-MHANet aims to produce clean speech and noise LPC parameters with the least bias to date. With this, we also aim to produce higher quality and more intelligible enhanced speech than any KF or AKF-based SEA. DeepLPC is selected as it avoids the issues associated with previous deep learning frameworks for KF and AKF SEAs, including the use of whitening filters, post-processing, and small training sets. The MHANet is selected as it is better suited than TCNs to modelling the long-term dependencies of noisy speech. Together, DeepLPC and the MHANet form an improved map from the noisy speech to the clean speech and noise LPC parameters.

The structure of this paper is as follows: background knowledge is presented in Section II, including the AKF for speech enhancement, an overview of the DeepLPC framework, and the MHANet. In Section III, we describe the proposed DeepLPC-MHANet. Following this, Section IV describes the experimental setup. The experimental results are then presented in Section V, along with a discussion. Finally, Section VI gives some concluding remarks.

II. BACKGROUND
A. AKF FOR SPEECH ENHANCEMENT
In this section, we overview the AKF for speech enhancement.
First, we describe the signal model. The noisy speech $y(n)$, at discrete-time sample $n$, is given by:

$$y(n) = s(n) + v(n), \quad (1)$$

where $s(n)$ is the clean speech and $v(n)$ is assumed to be uncorrelated additive coloured noise. Next, a 32 ms rectangular window with 50% overlap is used to convert $y(n)$ into frames, denoted by $y(n,l)$:

$$y(n,l) = s(n,l) + v(n,l), \quad (2)$$

where $l \in \{0, 1, \ldots, L-1\}$ is the frame index with $L$ being the total number of frames, and $n \in \{0, 1, \ldots, N-1\}$, where $N$ is the total number of samples within each frame. For simplicity, the frame index is omitted from the following AKF recursive equations.

Each frame of the clean speech and noise signal in Equation (2) can be represented by $p$th and $q$th order AR models, as in [35, Chapter 8]:

$$s(n) = -\sum_{i=1}^{p} a_i s(n-i) + w(n), \quad (3)$$

$$v(n) = -\sum_{k=1}^{q} b_k v(n-k) + u(n), \quad (4)$$

where $\{a_i; i = 1, 2, \ldots, p\}$ and $\{b_k; k = 1, 2, \ldots, q\}$ are the LPCs, and $w(n)$ and $u(n)$ are Gaussian-distributed excitation noises with zero mean and variances $\sigma_w^2$ and $\sigma_u^2$, respectively.

Equations (2)-(4) form the augmented state-space model (ASSM) of the AKF [10], given by:

$$\mathbf{x}(n) = \mathbf{\Phi}\mathbf{x}(n-1) + \mathbf{r}\mathbf{z}(n), \quad (5)$$

$$y(n) = \mathbf{c}^\top \mathbf{x}(n). \quad (6)$$

In the above ASSM,
1) $\mathbf{x}(n) = [s(n) \ldots s(n-p+1)\ v(n) \ldots v(n-q+1)]^\top$ is a $(p+q) \times 1$ state vector,
2) $\mathbf{\Phi} = \begin{bmatrix} \mathbf{\Phi}_s & \mathbf{0} \\ \mathbf{0} & \mathbf{\Phi}_v \end{bmatrix}$ is a $(p+q) \times (p+q)$ state-transition matrix with:

$$\mathbf{\Phi}_s = \begin{bmatrix} -a_1 & -a_2 & \ldots & -a_{p-1} & -a_p \\ 1 & 0 & \ldots & 0 & 0 \\ 0 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \ldots & 1 & 0 \end{bmatrix}, \quad (7)$$

$$\mathbf{\Phi}_v = \begin{bmatrix} -b_1 & -b_2 & \ldots & -b_{q-1} & -b_q \\ 1 & 0 & \ldots & 0 & 0 \\ 0 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \ldots & 1 & 0 \end{bmatrix}, \quad (8)$$

3) $\mathbf{r} = \begin{bmatrix} \mathbf{r}_s & \mathbf{0} \\ \mathbf{0} & \mathbf{r}_v \end{bmatrix}$, where $\mathbf{r}_s = [1\ 0 \ldots 0]^\top$ and $\mathbf{r}_v = [1\ 0 \ldots 0]^\top$,
4) $\mathbf{z}(n) = \begin{bmatrix} w(n) \\ u(n) \end{bmatrix}$, (9)
5) $\mathbf{c}^\top = [\mathbf{c}_s^\top\ \mathbf{c}_v^\top]$, where $\mathbf{c}_s = [1\ 0 \ldots 0]^\top$ and $\mathbf{c}_v = [1\ 0 \ldots 0]^\top$ are $p \times 1$ and $q \times 1$ vectors,
6) $y(n)$ is the noisy measurement at sample $n$.

For each frame, the AKF recursively computes an unbiased linear MMSE estimate $\hat{\mathbf{x}}(n|n)$ at sample $n$, given $y(n)$, by using the following equations [13]:

$$\hat{\mathbf{x}}(n|n-1) = \mathbf{\Phi}\hat{\mathbf{x}}(n-1|n-1), \quad (10)$$

$$\mathbf{\Psi}(n|n-1) = \mathbf{\Phi}\mathbf{\Psi}(n-1|n-1)\mathbf{\Phi}^\top + \mathbf{r}\mathbf{Q}_n\mathbf{r}^\top, \quad (11)$$

$$\mathbf{G}(n) = \mathbf{\Psi}(n|n-1)\mathbf{c}\left(\mathbf{c}^\top\mathbf{\Psi}(n|n-1)\mathbf{c}\right)^{-1}, \quad (12)$$

$$\hat{\mathbf{x}}(n|n) = \hat{\mathbf{x}}(n|n-1) + \mathbf{G}(n)\left[y(n) - \mathbf{c}^\top\hat{\mathbf{x}}(n|n-1)\right], \quad (13)$$

$$\mathbf{\Psi}(n|n) = \left[\mathbf{I} - \mathbf{G}(n)\mathbf{c}^\top\right]\mathbf{\Psi}(n|n-1), \quad (14)$$

where $\mathbf{Q}_n = \begin{bmatrix} \sigma_w^2 & 0 \\ 0 & \sigma_u^2 \end{bmatrix}$ is the process noise covariance.

For a noisy speech frame, the error covariances $\mathbf{\Psi}(n|n-1)$ and $\mathbf{\Psi}(n|n)$, corresponding to $\hat{\mathbf{x}}(n|n-1)$ and $\hat{\mathbf{x}}(n|n)$ respectively, and the Kalman gain $\mathbf{G}(n)$ are continually updated on a samplewise basis, while $(\{a_i\}, \sigma_w^2)$ and $(\{b_k\}, \sigma_u^2)$ remain constant. At sample $n$, $\mathbf{g}^\top\hat{\mathbf{x}}(n|n)$ gives the output of the AKF, $\hat{s}(n|n)$, where $\mathbf{g} = [1\ 0\ 0 \ldots 0]^\top$ is a $(p+q) \times 1$ column vector. As in [13], $\hat{s}(n|n)$ is given by:

$$\hat{s}(n|n) = [1 - G_0(n)]\hat{s}(n|n-1) + G_0(n)[y(n) - \hat{v}(n|n-1)], \quad (15)$$

where $G_0(n)$ is the 1st component of $\mathbf{G}(n)$, given by [13]:

$$G_0(n) = \frac{\alpha^2(n) + \sigma_w^2}{\alpha^2(n) + \sigma_w^2 + \beta^2(n) + \sigma_u^2}, \quad (16)$$

where

$$\alpha^2(n) = \mathbf{c}_s^\top\mathbf{\Phi}_s\mathbf{\Psi}_s(n-1|n-1)\mathbf{\Phi}_s^\top\mathbf{c}_s, \quad (17)$$

$$\beta^2(n) = \mathbf{c}_v^\top\mathbf{\Phi}_v\mathbf{\Psi}_v(n-1|n-1)\mathbf{\Phi}_v^\top\mathbf{c}_v \quad (18)$$

are the transmission of the a posteriori error variances of the speech and noise augmented dynamic model from the previous sample $n-1$, respectively [13].

Equation (15) reveals that $G_0(n)$ has a significant impact on $\hat{s}(n|n)$. In practice, inaccurate estimates of $(\{a_i\}, \sigma_w^2)$ and $(\{b_k\}, \sigma_u^2)$ introduce bias into $G_0(n)$, which impacts $\hat{s}(n|n)$. In our previous work, we proposed the DeepLPC framework [34] to estimate $(\{a_i\}, \sigma_w^2)$ and $(\{b_k\}, \sigma_u^2)$ for the AKF, as described in the following section.
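To make the recursion in Equations (10)-(14) concrete, the following sketch runs the AKF over one frame in NumPy. It is a minimal illustration under stated assumptions, not the authors' implementation: the error covariance is initialised to the identity for simplicity (in practice it is carried over between frames), and all variable names are hypothetical.

```python
import numpy as np

def akf_enhance_frame(y, a, b, var_w, var_u):
    """Sketch of the AKF recursion, Eqs. (10)-(14): y is one noisy frame,
    a (length p) and b (length q) are speech/noise LPCs, and var_w/var_u
    are the excitation variances. Returns the enhanced frame."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    p, q = len(a), len(b)
    # State-transition matrices in companion form, Eqs. (7)-(8).
    Phi_s = np.vstack([-a, np.eye(p)[:-1]])
    Phi_v = np.vstack([-b, np.eye(q)[:-1]])
    Phi = np.block([[Phi_s, np.zeros((p, q))],
                    [np.zeros((q, p)), Phi_v]])
    # r maps the 2-D excitation z(n) into the state, Eq. (9).
    r = np.zeros((p + q, 2)); r[0, 0] = 1.0; r[p, 1] = 1.0
    Q = np.diag([var_w, var_u])                   # process noise covariance
    c = np.zeros(p + q); c[0] = 1.0; c[p] = 1.0   # observation vector
    x = np.zeros(p + q)                           # state estimate
    Psi = np.eye(p + q)                           # covariance init (assumption)
    s_hat = np.zeros(len(y))
    for n in range(len(y)):
        x = Phi @ x                                      # Eq. (10)
        Psi = Phi @ Psi @ Phi.T + r @ Q @ r.T            # Eq. (11)
        G = (Psi @ c) / (c @ Psi @ c)                    # Eq. (12)
        x = x + G * (y[n] - c @ x)                       # Eq. (13)
        Psi = (np.eye(p + q) - np.outer(G, c)) @ Psi     # Eq. (14)
        s_hat[n] = x[0]       # first state component = clean speech estimate
    return s_hat
```

Note that the enhanced sample is simply the first component of the updated state, matching $\mathbf{g}^\top\hat{\mathbf{x}}(n|n)$ in the text.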
FIGURE 1: Block diagram of the DeepLPC framework. [Diagram not reproduced.]

B. DeepLPC FRAMEWORK
In this section, we review the DeepLPC framework [34]. DeepLPC was able to produce speech and noise LPC estimates with significantly less bias than previous methods by avoiding the whitening filter used in earlier methods [14], [33]. This was accomplished by using deep learning to jointly estimate the clean speech and noise LPC-PS, denoted as $\hat{\lambda}_s(l) = \{\hat{\lambda}_s(l,0), \hat{\lambda}_s(l,1), \ldots, \hat{\lambda}_s(l,M-1)\}$ and $\hat{\lambda}_v(l) = \{\hat{\lambda}_v(l,0), \hat{\lambda}_v(l,1), \ldots, \hat{\lambda}_v(l,M-1)\}$, respectively, where $M$ is the total number of discrete-frequency bins.

The DeepLPC framework is shown in Figure 1. DeepLPC is fed as input the single-sided noisy speech magnitude spectrum $|\mathbf{Y}(l)| = \{|Y(l,0)|, |Y(l,1)|, \ldots, |Y(l,M-1)|\}$. This is computed from the noisy speech in Equation (1) using the short-time Fourier transform (STFT):

$$Y(l,m) = S(l,m) + V(l,m), \quad (19)$$

where $Y(l,m)$, $S(l,m)$, and $V(l,m)$ denote the complex-valued STFT coefficients of the noisy speech, clean speech, and noise, respectively, for time-frame index $l$ and discrete-frequency bin $m$. The Hamming window is used for analysis and synthesis.

DeepLPC then estimates the clean speech and noise LPC-PS in two stages. For the first stage, a DNN jointly estimates a mapped version of the speech and noise LPC-PS, $\boldsymbol{\zeta}_l = \{\bar{\lambda}_s(l); \bar{\lambda}_v(l)\}$, of size $M \times 2$, where $\{\cdot\,;\cdot\}$ denotes the concatenation operation, and $\bar{\lambda}_s(l)$ and $\bar{\lambda}_v(l)$ are computed from $\lambda_s(l)$ and $\lambda_v(l)$, respectively, by using a mapping function. As in [34], the cumulative distribution functions (CDFs) of $\lambda_s(l)$ and $\lambda_v(l)$ are used as the mapping functions to compute $\bar{\lambda}_s(l)$ and $\bar{\lambda}_v(l)$, respectively. A description of how $\bar{\lambda}_s(l)$ and $\bar{\lambda}_v(l)$ are computed is provided in Appendix A. In [34], a ResNet-TCN was used to estimate $\hat{\boldsymbol{\zeta}}_l$. For the second stage, $\hat{\boldsymbol{\zeta}}_l$ is first split into the mapped clean speech and noise LPC-PS, $\hat{\bar{\lambda}}_s(l,m)$ and $\hat{\bar{\lambda}}_v(l,m)$, respectively. Next, the inverse mapping of $\hat{\bar{\lambda}}_s(l,m)$ and $\hat{\bar{\lambda}}_v(l,m)$ yields $\hat{\lambda}_s(l,m)$ and $\hat{\lambda}_v(l,m)$. The inverse mapping is described in Appendix B.

The $|\mathrm{IDFT}|$ of $\hat{\lambda}_s(l,m)$ and $\hat{\lambda}_v(l,m)$ yields an estimate of the autocorrelation matrices, $\hat{R}_{ss}(\tau)$ and $\hat{R}_{vv}(\tau)$, where $\tau$ is the autocorrelation lag. As in [34, eq. (26)-(27)], we construct Yule-Walker equations with the estimated $\hat{R}_{ss}(\tau)$ and $\hat{R}_{vv}(\tau)$. These are solved using the Levinson-Durbin recursion [35, Chapter 8], giving $(\{\hat{a}_i\}, \hat{\sigma}_w^2)$ ($p = 16$) and $(\{\hat{b}_k\}, \hat{\sigma}_u^2)$ ($q = 16$) for constructing the AKF.
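The second stage (estimated LPC-PS to AKF parameters) can be illustrated with a short sketch. Assuming a single-sided LPC-PS of $M = 257$ bins taken from a 512-point spectrum, the inverse DFT gives the autocorrelation sequence, and the Levinson-Durbin recursion solves the Yule-Walker equations. This is a hedged re-implementation of the steps named above, not the released DeepLPC code.

```python
import numpy as np

def lpc_ps_to_lpc(lpc_ps, order=16):
    """Sketch: map a single-sided LPC power spectrum (M bins) to LPCs
    {a_i} and a prediction error variance via Levinson-Durbin."""
    # Rebuild the full (even-length) spectrum from the single-sided LPC-PS;
    # its inverse DFT is the autocorrelation sequence.
    full_ps = np.concatenate([lpc_ps, lpc_ps[-2:0:-1]])
    autocorr = np.real(np.fft.ifft(full_ps))[: order + 1]
    # Levinson-Durbin recursion on the Yule-Walker equations.
    a = np.zeros(order + 1); a[0] = 1.0
    err = autocorr[0]
    for i in range(1, order + 1):
        k = -(autocorr[i] + a[1:i] @ autocorr[i-1:0:-1]) / err  # reflection coeff.
        a[1:i] = a[1:i] + k * a[i-1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a[1:], err  # ({a_i}, prediction error variance)
```

Under this sketch, the speech and noise LPC-PS estimates are each passed through the same routine (with $p = q = 16$) to obtain $(\{\hat{a}_i\}, \hat{\sigma}_w^2)$ and $(\{\hat{b}_k\}, \hat{\sigma}_u^2)$.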
FIGURE 2: (left) DeepLPC-MHANet, (middle) the multi-head attention (MHA) module, and (right) masked scaled dot-product attention. [Diagram not reproduced.]

C. MHANet
The MHANet proposed in [25] is overviewed from input to output in this section. For a detailed description of the MHANet, we refer the reader to [25]. The MHANet is shown in Figure 2 (left). The first layer is used to project the input to a size of $d_{model}$, and is realised as follows: $\max(0, \mathrm{LN}(|\mathbf{Y}|\mathbf{W}^I + \mathbf{b}_s^I))$, where LN is frame-wise layer normalisation [36], $\mathbf{W}^I \in \mathbb{R}^{M \times d_{model}}$, and $\mathbf{b}_s^I \in \mathbb{R}^{d_{model}}$. Next, the positional encoding from [25] is added after the first layer, where the time-frame index indicates the position. The position encoding is learned using the weight matrix $\mathbf{W}_p$, with a maximum length of 2048 time-frames (i.e. $\mathbf{W}_p \in \mathbb{R}^{2048 \times 256}$). This is followed by $B$ cascading blocks identical to those from the encoder of the Transformer [26], except that masked multi-head attention (MHA) is employed to ensure causality.

Each block includes an MHA module, a two-layer feed-forward neural network (FNN) [12], residual connections [37], and frame-wise LN [36]. The MHA module of each block is shown in Figure 2 (middle). The MHA module takes three inputs belonging to a set of $L$ queries ($\mathbf{Q}_s \in \mathbb{R}^{L \times d_{model}}$), keys ($\mathbf{K}_s \in \mathbb{R}^{L \times d_{model}}$), and values ($\mathbf{V}_s \in \mathbb{R}^{L \times d_{model}}$), where $L$ is the number of frames and $d_{model}$ is the size of each query, key, and value. Each MHA module includes a total of $H$ heads of masked scaled dot-product attention, where $h = \{1, 2, \cdots, H\}$ is the head index. For head $h$, $\mathbf{Q}_s$, $\mathbf{K}_s$, and $\mathbf{V}_s$ are linearly projected as: $\mathcal{Q}_h = \mathbf{Q}_s\mathbf{W}_h^Q$, $\mathcal{K}_h = \mathbf{K}_s\mathbf{W}_h^K$, and $\mathcal{V}_h = \mathbf{V}_s\mathbf{W}_h^V$, where $\mathbf{W}_h^Q \in \mathbb{R}^{d_{model} \times d_k}$, $\mathbf{W}_h^K \in \mathbb{R}^{d_{model} \times d_k}$, and $\mathbf{W}_h^V \in \mathbb{R}^{d_{model} \times d_v}$ are learned weight matrices. The projected queries and keys are of size $d_k$, and the projected values are of size $d_v$, where $d_k = d_v = d_{model}/H$.

Figure 2 (right) shows the masked scaled dot-product attention mechanism for head $h$, which takes as input $\mathcal{Q}_h$, $\mathcal{K}_h$, and $\mathcal{V}_h$. Masked scaled dot-product attention is computed as:

$$\mathrm{Attention}(\mathcal{Q}_h, \mathcal{K}_h, \mathcal{V}_h) = \mathrm{softmax}\left(\mathbf{M}_s + \frac{\mathcal{Q}_h\mathcal{K}_h^\top}{\sqrt{d_k}}\right)\mathcal{V}_h. \quad (20)$$

The outputs from all of the heads are then concatenated and linearly projected using the learned weight matrix $\mathbf{W}^O \in \mathbb{R}^{Hd_v \times d_{model}}$, forming the final output of the MHA module:

$$\mathrm{MHA}(\mathbf{Q}_s, \mathbf{K}_s, \mathbf{V}_s) = \mathrm{concat}(\mathbf{A}_1, \mathbf{A}_2, \cdots, \mathbf{A}_H)\mathbf{W}^O, \quad (21)$$

where $\mathbf{A}_h$ denotes the output of head $h$. A residual connection is applied from the input to the output of the MHA module, which is followed by frame-wise LN. The second half of the block includes a two-layer FNN:

$$\mathrm{FNN}(\mathbf{Z}) = \max(0, \mathbf{Z}\mathbf{W}_1 + \mathbf{b}_{1s})\mathbf{W}_2 + \mathbf{b}_{2s}, \quad (22)$$

where $\mathbf{Z} \in \mathbb{R}^{L \times d_{model}}$ is the input, $\mathbf{W}_1 \in \mathbb{R}^{d_{model} \times d_f}$, $\mathbf{b}_{1s} \in \mathbb{R}^{d_f}$, $\mathbf{W}_2 \in \mathbb{R}^{d_f \times d_{model}}$, and $\mathbf{b}_{2s} \in \mathbb{R}^{d_{model}}$. Hence, the inner layer has a size of $d_f$. A residual connection is applied from the input to the output of the FNN, which is followed by frame-wise LN. The last block is followed by the output layer, which is a sigmoidal feed-forward layer.
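The following sketch implements Equations (20) and (21) for a single MHA module in NumPy, with the causal mask $\mathbf{M}_s$ realised as $-\infty$ entries above the diagonal. It is a minimal, hypothetical rendering of the mechanism described above (the per-head projections are stored as slices of combined weight matrices), not the authors' training code.

```python
import numpy as np

def masked_mha(Y, W_q, W_k, W_v, W_o, H):
    """Sketch of Eqs. (20)-(21): causal (masked) multi-head attention over
    L frames. Y: (L, d_model); W_q, W_k, W_v: (d_model, d_model) holding the
    H per-head projections as column slices; W_o: (d_model, d_model)."""
    L, d_model = Y.shape
    d_k = d_model // H
    # Causal mask M_s: 0 on/below the diagonal, -inf above (future frames).
    mask = np.triu(np.full((L, L), -np.inf), k=1)
    heads = []
    for h in range(H):
        sl = slice(h * d_k, (h + 1) * d_k)
        Q, K, V = Y @ W_q[:, sl], Y @ W_k[:, sl], Y @ W_v[:, sl]
        scores = Q @ K.T / np.sqrt(d_k) + mask           # scaled + masked
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        heads.append(weights @ V)                        # Eq. (20)
    return np.concatenate(heads, axis=-1) @ W_o          # Eq. (21)
```

The mask is what makes the module causal: frame $l$ can only attend to frames $0, \ldots, l$, so the LPC estimates for a frame depend only on present and past noisy speech.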
FIGURE 3: Block diagram of the proposed SEA. [Diagram not reproduced.]

III. PROPOSED DeepLPC-MHANet
Current deep learning-based AKF methods employ a TCN, for example, Deep Xi-KF, Deep Xi-AKF, and DeepLPC-ResNet-TCN-AKF. However, TCNs demonstrate deficiencies when modelling the long-term dependencies of noisy speech, unlike attention-based networks [25]. Hence, we investigate whether an attention-based network can produce clean speech and noise LPC estimates with less bias and obtain higher quality and intelligibility scores than current deep learning-based KF and AKF SEAs. To this end, we compare the ResNet-TCN to the MHANet within the DeepLPC framework, as DeepLPC has been shown to outperform all other KF and AKF deep learning frameworks to date [34].

The block diagram of the proposed SEA, DeepLPC-MHANet-AKF, is shown in Figure 3. It can be seen that DeepLPC-MHANet estimates $\boldsymbol{\zeta} = \{\bar{\lambda}_s; \bar{\lambda}_v\}$ from $|\mathbf{Y}|$. The hyperparameters for DeepLPC-MHANet are the same as used in [25]: $B = 5$, $d_f = 1\,024$, $d_{model} = 256$, $H = 8$, and $\Gamma = 40\,000$. With this set of hyperparameters, the MHANet comprises approximately 4.27 million parameters. The training time for DeepLPC-MHANet is approximately 45 minutes per epoch on an NVIDIA GeForce GTX 1080. Section IV-B details the training strategy of DeepLPC-MHANet.

IV. SPEECH ENHANCEMENT EXPERIMENT
A. TRAINING & VALIDATION SET
The noisy speech for the training and validation sets is formed from clean speech and noise recordings. For the clean speech recordings, the train-clean-100 set of the Librispeech corpus [38] (28 539), the CSTR VCTK corpus [39] (42 015), and the si* and sx* training sets of the TIMIT corpus [40] (3 696) were used, giving a total of 74 250 clean speech recordings. To form the validation set, 5% of the clean speech recordings (3 713) are randomly selected. Thus, 70 537 of the clean speech recordings are used for the training set. For the noise recordings, the QUT-NOISE dataset [41], the Nonspeech dataset [42], the Environmental Background Noise dataset [43], [44], the noise set from the MUSAN corpus [45], multiple FreeSound packs (https://freesound.org/)(1), and coloured noise recordings (with an α value ranging from −2 to 2 in increments of 0.25) were used, giving a total of 16 243 noise recordings. For the validation set, 5% of the noise recordings (813) are randomly selected. The remaining 15 430 noise recordings are used for the training set. All the clean speech and noise recordings are single-channel with a sampling frequency of 16 kHz. To create the noisy speech for the validation set, each of the 3 713 clean speech recordings is corrupted by a random section of a randomly selected noise recording (from the set of 813 noise recordings) at a randomly selected SNR level (-10 to +20 dB, in 1 dB increments). The noisy speech for the training set was created using the method described in Section IV-B.

(1) Freesound packs that were used: 147, 199, 247, 379, 622, 643, 1 133, 1 563, 1 840, 2 432, 4 366, 4 439, 15 046, 15 598, 21 558.

B. TRAINING STRATEGY
The following training strategy was employed to train DeepLPC-MHANet:
• Mean squared error is used as the loss function.
• The Adam optimiser [46] with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$ is used for stochastic gradient descent optimisation, where the learning rate, $\alpha_r$, is controlled over the course of training as in [26]:

$$\alpha_r = d_{model}^{-0.5} \cdot \min(\gamma^{-0.5}, \gamma \cdot \Gamma^{-1.5}), \quad (23)$$

where $\gamma$ is the training step and $\Gamma$ is the number of warmup steps.
• Gradients are clipped between $[-1, 1]$.
• The number of training examples in an epoch is equal to the number of clean speech recordings used in the training set, i.e., 70 537.
• A mini-batch size of 8 training examples is used.
• The noisy speech signals are generated on the fly as follows: each clean speech recording is randomly selected and corrupted with a randomly selected noise recording at a randomly selected SNR level (-10 to +20 dB, in 1 dB increments).
• During training, we employ early stopping, which monitors the validation loss with a patience of 30 epochs. Using this strategy, training was terminated at epoch 180.

C. TEST SET
For the objective experiments, 30 clean speech recordings (sampled at 16 kHz) belonging to six speakers (three male and three female) are taken from the NOIZEUS corpus [1, Chapter 12]. The noisy speech for the test set is formed by mixing the clean speech with real-world non-stationary (voice babble, street) and coloured (factory and f16) noise recordings selected from [43], [44] at multiple SNR levels varying from -5 dB to +15 dB, in 5 dB increments. This provides 30 examples per condition with 20 total conditions. Note that both the speech and noise recordings in the test set are different from those used in the training and validation sets.

D. SD LEVEL EVALUATION
The frame-wise spectral distortion (SD) (dB) [47] is used to evaluate the accuracy of the LPC estimates produced by DeepLPC-MHANet. Specifically, the estimated clean speech LPCs are evaluated. The SD for the $l$th frame, $D_l$ (in dB), is defined as the root-mean-square difference between the LPC-PS estimate in dB, $\hat{\lambda}_s(l,m)_{[dB]}$, and the oracle case in dB, $\lambda_s(l,m)_{[dB]}$, as [47]:

$$D_l = \sqrt{\frac{1}{M}\sum_{m=0}^{M-1}\left(\lambda_s(l,m)_{[dB]} - \hat{\lambda}_s(l,m)_{[dB]}\right)^2}. \quad (24)$$
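Equation (24) translates directly into code. The sketch below computes the frame-wise SD between the estimated and oracle clean speech LPC-PS; the array names are hypothetical and it assumes both inputs are strictly positive power spectra.

```python
import numpy as np

def spectral_distortion(lpc_ps_est, lpc_ps_oracle):
    """Sketch of Eq. (24): frame-wise spectral distortion (dB) between the
    estimated and oracle clean speech LPC-PS, each of shape (L, M)."""
    est_db = 10.0 * np.log10(lpc_ps_est)
    oracle_db = 10.0 * np.log10(lpc_ps_oracle)
    # Root-mean-square difference over the M frequency bins of each frame.
    return np.sqrt(np.mean((oracle_db - est_db) ** 2, axis=-1))
```

The SD levels reported below are averages of $D_l$ over all frames of a test condition.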
TABLE 1: Objective measures, what each assesses, and the range of their scores. For each measure, higher is better.

Measure | Assesses | Range
CSIG [48] | Quality | [1, 5]
CBAK [48] | Quality | [1, 5]
COVL [48] | Quality | [1, 5]
PESQ [49] | Quality | [−0.5, 4.5]
STOI [50] | Intelligibility | [0, 100]%
SI-SDR [51] | Quality | [−∞, ∞]
SegSNR [52] | Quality | [−∞, ∞]

E. OBJECTIVE QUALITY AND INTELLIGIBILITY MEASURES
Objective measures are used to evaluate the quality and intelligibility of the enhanced speech with respect to the corresponding clean speech. The objective quality and intelligibility measures used in this paper are given in Table 1. We also analyse the enhanced speech spectrogram of the proposed SEA, to determine if it causes speech distortion, if any background noise is not suppressed (i.e. residual background noise), and if it introduces any musical noise.

F. SUBJECTIVE EVALUATION
The subjective evaluation was carried out through a series of blind AB listening tests [4, Section 3.3.4]. To perform the tests, we generated a set of stimuli by corrupting recordings sp05 and sp27 from the NOIZEUS corpus [1, Chapter 12]. The reference transcript for recording sp05 is: "Wipe the grease off his dirty face", and it is corrupted with voice babble at 5 dB. The reference transcript for recording sp27 is: "Bring your best compass to the third class", and it is corrupted with factory noise at 5 dB. Utterances sp05 and sp27 were uttered by a male and a female, respectively. In this test, the enhanced speech produced by seven SEAs, as well as the corresponding clean speech and noisy speech signals, were played as stimuli pairs to the listeners. Specifically, the test is performed on a total of 144 stimuli pairs (72 for each recording), played in a random order to each listener, excluding comparisons between the same method.

The listener gives the following ratings for each stimuli pair: perceptual preference for the first or second stimuli, or a third response indicating no preference. For pairwise scoring, a score of 100% is given to the preferred method and 0% to the other. A score of 50% is given to both methods when there is no preference. The participants could re-listen to stimuli if required. Ten English-speaking listeners participated in the blind AB listening tests(2). The mean subjective preference score (%) is used to compare the SEAs, which is the average of the preference scores given by the listeners.

(2) The AB listening tests were conducted on the approval of Griffith University Human Research Ethics: database protocol number 2018/671.

G. SPECIFICATIONS OF THE COMPETITIVE SEAs
The performance of the proposed SEA is compared to the following SEAs (the following notation is used for convenience: $(p, q)$ is the order of $\{a_i\}$ and $\{b_k\}$, $(\sigma_w^2, \sigma_u^2)$ are the prediction error variances of the speech and noise AR models, $w_f$ is the analysis frame duration (ms), and $s_f$ is the analysis frame shift (ms)):
1) Noisy: speech corrupted with additive noise.
2) AKF-Oracle: AKF, where $(\{a_i\}, \sigma_w^2)$ and $(\{b_k\}, \sigma_u^2)$ are computed from the clean speech and the noise signal, where $p = 16$, $q = 16$, $w_f = 32$ ms, $s_f = 16$ ms, and a rectangular window is used for framing.
3) LSTM-CKFS [31]: AKF constructed using $(\{\hat{a}_i\}, \hat{\sigma}_w^2)$ and $(\{\hat{b}_k\}, \hat{\sigma}_u^2)$ computed using LSTM and ML-based approaches, where $p = 12$, $q = 12$, $w_f = 20$ ms, $s_f = 0$ ms, and a rectangular window is used for framing. LSTM-CKFS utilises multi-band SS post-processing [3].
4) EEUE-FCNN [28]: End-to-end utterance enhancement using a fully convolutional neural network.
5) Deep Xi-KF [33]: KF-based SEA, where $\hat{\sigma}_v^2$ is estimated using the DeepMMSE framework [24] and $(\{\hat{a}_i\}, \hat{\sigma}_w^2)$ are computed from pre-whitened speech corresponding to each noisy speech frame, where $p = 10$, $w_f = 32$ ms, $s_f = 16$ ms, and a rectangular window is used for framing.
6) Deep Xi-ResNet-TCN-MMSE-LSA: The ResNet-TCN from [24] is used to form Deep Xi-ResNet-TCN [24]. Deep Xi-ResNet-TCN estimates the a priori SNR for the MMSE-LSA estimator [7], where $w_f = 32$ ms, $s_f = 16$ ms, and a square-root-Hann window is used for analysis and synthesis.
7) DeepLPC-ResNet-TCN-AKF [34]: AKF constructed with $(\{\hat{a}_i\}, \hat{\sigma}_w^2)$ and $(\{\hat{b}_k\}, \hat{\sigma}_u^2)$ derived from the DeepLPC framework, where $p = 16$, $q = 16$, $w_f = 32$ ms, $s_f = 16$ ms, and a rectangular window is used for framing. The ResNet-TCN from [24] is incorporated within DeepLPC.
8) Proposed DeepLPC-MHANet-AKF: Proposed SEA, where the AKF is constructed from $(\{\hat{a}_i\}, \hat{\sigma}_w^2)$ and $(\{\hat{b}_k\}, \hat{\sigma}_u^2)$ computed using DeepLPC-MHANet, where $p = 16$, $q = 16$, $w_f = 32$ ms, $s_f = 16$ ms, and a rectangular window is used for framing.

TABLE 2: Average SD (dB) level comparison for each of the LPC estimation methods. Boldface represents the lowest SD level. The test set used is described in Section IV-C.

Noise | Method | -5 dB | 0 dB | 5 dB | 10 dB | 15 dB
Voice babble | Noisy | 22.05 | 18.29 | 14.86 | 13.80 | 11.87
Voice babble | DNN-LPC [29] | 16.72 | 15.98 | 13.24 | 12.76 | 10.79
Voice babble | LSTM-CKFS [31] | 15.91 | 14.51 | 12.11 | 11.89 | 9.23
Voice babble | Deep Xi-KF [33] | 14.95 | 13.88 | 11.81 | 10.31 | 9.11
Voice babble | DeepLPC-ResNet-TCN [34] | 11.89 | 10.49 | 8.73 | 7.33 | 6.51
Voice babble | Proposed | 10.84 | 8.49 | 6.71 | 5.60 | 4.89
Street | Noisy | 20.21 | 16.39 | 14.43 | 13.88 | 12.45
Street | DNN-LPC [29] | 13.41 | 12.25 | 11.68 | 11.18 | 10.87
Street | LSTM-CKFS [31] | 12.57 | 11.05 | 10.78 | 10.35 | 9.86
Street | Deep Xi-KF [33] | 11.66 | 10.51 | 9.74 | 9.21 | 8.95
Street | DeepLPC-ResNet-TCN [34] | 9.21 | 8.74 | 7.59 | 6.91 | 5.89
Street | Proposed | 7.84 | 6.32 | 4.88 | 4.56 | 4.49
Factory | Noisy | 29.46 | 25.21 | 21.16 | 18.36 | 16.83
Factory | DNN-LPC [29] | 18.74 | 17.15 | 16.47 | 15.79 | 14.67
Factory | LSTM-CKFS [31] | 16.39 | 15.91 | 14.61 | 13.60 | 13.12
Factory | Deep Xi-KF [33] | 15.10 | 14.98 | 13.87 | 12.72 | 12.33
Factory | DeepLPC-ResNet-TCN [34] | 12.29 | 10.89 | 9.48 | 8.21 | 7.89
Factory | Proposed | 10.15 | 8.23 | 7.01 | 6.11 | 5.52
F16 | Noisy | 28.81 | 24.56 | 20.54 | 17.78 | 15.32
F16 | DNN-LPC [29] | 18.93 | 17.78 | 16.55 | 15.23 | 13.22
F16 | LSTM-CKFS [31] | 16.78 | 15.36 | 14.65 | 13.13 | 12.78
F16 | Deep Xi-KF [33] | 14.21 | 13.01 | 12.59 | 11.96 | 10.81
F16 | DeepLPC-ResNet-TCN [34] | 12.13 | 10.46 | 9.49 | 8.63 | 7.83
F16 | Proposed | 9.76 | 8.09 | 6.12 | 5.82 | 5.48

V. RESULTS AND DISCUSSION
A. SD LEVEL COMPARISON
The average SD levels (found over all frames for each test condition) attained by the proposed method are given in Table 2. It can be seen that for both real-world non-stationary (voice babble and street) and coloured (factory and f16) noise conditions, the proposed method produced lower SD levels than DeepLPC-ResNet-TCN [34]. This demonstrates that an attention-based network is able to produce clean speech LPC estimates with less bias, indicating that attention-based networks are more apt for clean speech LPC estimation than TCNs. It also suggests that the AKF constructed from the clean speech LPC estimates of the proposed method will produce enhanced speech at a higher quality and intelligibility than the competing methods.

B. OBJECTIVE EVALUATION
In this section, we evaluate the objective quality and intelligibility scores attained by the proposed method. The mean objective scores attained by each SEA on the NOIZEUS corpus are shown in Table 3. It can be seen that AKF-Oracle produces the highest scores for all measures, which can be thought of as the upper boundary of performance. Noisy speech produced the lowest scores for all measures, indicating the lower boundary of performance. When comparing the proposed method to DeepLPC-ResNet-TCN-AKF, it can be seen that it attains a higher score for each objective measure. This demonstrates that the MHANet is better suited to the AKF than the ResNet-TCN. The proposed method also achieves higher objective scores than any of the competing methods, showing that it is currently the leading AKF in the literature.

Figures 4 and 5 show the PESQ and STOI scores, respectively, of each SEA for multiple conditions. The proposed method produced higher PESQ and STOI scores than the competing SEAs for each condition. This demonstrates that the proposed method is able to produce higher objective quality and intelligibility scores than the competing methods across multiple SNR levels and noise sources. The results also indicate that the MHANet is better able to generalise to different conditions than the ResNet-TCN.

FIGURE 4: PESQ score for each SEA for each condition specified in Section IV-C. [Line plots of PESQ versus input SNR (-5 to 15 dB) per noise source; methods: Noisy, LSTM-CKFS, EEUE-FCNN, Deep Xi-KF, Deep Xi-ResNet-TCN-MMSE-LSA, DeepLPC-ResNet-TCN-AKF, Proposed, AKF-Oracle. Plots not reproduced.]
C. SPECTROGRAM ANALYSIS
In this section, we analyse the enhanced speech spectrograms produced by each SEA. Figure 6 (a) shows the spectrogram of the clean speech recording (male recording sp05). The clean speech is corrupted by voice babble noise at an SNR level of 5 dB to create the noisy speech shown in Figure 6 (b). This is a particularly tough condition for speech enhancement since the background noise exhibits characteristics similar to the speech produced by the target speaker.

The enhanced speech produced by LSTM-CKFS is shown in Figure 6 (c). It can be seen that LSTM-CKFS significantly reduces the amount of background noise in the noisy speech, although it suffers from significant speech distortion. Figure 6 (d) shows the enhanced speech produced by EEUE-FCNN. This method produced less distorted speech than LSTM-CKFS (Figure 6 (c)); however, residual background noise remains. Less background noise is present in the enhanced speech produced by Deep Xi-KF (Figure 6 (e)) than in the enhanced speech produced by EEUE-FCNN (Figure 6 (d)); however, the speech is more distorted. Deep Xi-ResNet-TCN-MMSE-LSA produced less distorted speech (Figure 6 (f)) than that of Deep Xi-KF (Figure 6 (e)).

The enhanced speech produced by DeepLPC-ResNet-TCN-AKF is shown in Figure 6 (g). It can be seen that the enhanced speech of DeepLPC-ResNet-TCN-AKF has less residual background noise and speech distortion than that of Deep Xi-ResNet-TCN-MMSE-LSA (Figure 6 (f)). The enhanced speech produced by the proposed method is shown in Figure 6 (h). It can be seen that there is less residual background noise in the enhanced speech than in that of DeepLPC-ResNet-TCN-AKF (Figure 6 (g)). Finally, the enhanced speech produced by the AKF-Oracle method is shown in Figure 6 (i). The enhanced speech of AKF-Oracle is most similar to the clean speech in Figure 6 (a). This is due to AKF-Oracle using the clean speech and noise LPC parameters (which are unobserved in practice).

TABLE 3: Mean objective scores on the NOIZEUS dataset in terms of CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR. Apart from AKF-Oracle, the highest score amongst the methods for each measure is given in boldface.

Methods | CSIG | CBAK | COVL | PESQ | STOI | SegSNR | SI-SDR
Noisy speech | 2.41 | 2.27 | 2.12 | 1.64 | 67.87 | 0.89 | 6.39
LSTM-CKFS | 2.63 | 2.55 | 2.42 | 1.99 | 77.58 | 6.54 | 11.15
EEUE-FCNN | 2.76 | 2.66 | 2.56 | 2.05 | 79.45 | 6.93 | 11.59
Deep Xi-KF | 3.11 | 2.83 | 2.72 | 2.16 | 81.89 | 7.14 | 12.15
Deep Xi-ResNet-TCN-MMSE-LSA | 3.38 | 3.02 | 2.81 | 2.22 | 82.05 | 7.67 | 13.39
DeepLPC-ResNet-TCN-AKF | 3.49 | 3.17 | 2.95 | 2.35 | 84.71 | 8.78 | 14.44
Proposed | 3.66 | 3.32 | 3.14 | 2.59 | 88.41 | 9.21 | 15.01
AKF-Oracle | 4.21 | 4.07 | 3.97 | 2.74 | 95.18 | 10.87 | 16.43

FIGURE 5: STOI score for each SEA for each condition specified in Section IV-C. [Line plots of STOI (%) versus input SNR (-5 to 15 dB) per noise source; same methods as Figure 4. Plots not reproduced.]

FIGURE 6: Spectrograms of: (a) clean speech (recording sp05), (b) noisy speech ((a) corrupted with 5 dB of voice babble noise), (c)-(i) enhanced speech produced by each SEA. [Spectrograms not reproduced; axes: time (s) versus frequency (kHz).]

D. SUBJECTIVE EVALUATION
The mean subjective preference score (%) for each SEA is shown in Figures 7-8. The non-stationary (voice babble) noise experiment in Figure 7 reveals that the proposed method is widely preferred (72.23%) by the listeners over the competing methods, apart from the clean speech (100%) and the AKF-Oracle method (82.86%). DeepLPC-ResNet-TCN-AKF is found to be the most preferred method (68.43%) amongst the competing SEAs. Amongst the remaining SEAs, the listeners preferred the enhanced speech produced by Deep Xi-ResNet-TCN-MMSE-LSA (62.22%) the most, followed by Deep Xi-KF (53.71%), LSTM-CKFS (40%), and then EEUE-FCNN (38%). LSTM-CKFS was preferred by the listeners more than EEUE-FCNN, even though EEUE-FCNN attained higher objective scores. This may be due to the fact that LSTM-CKFS demonstrates superior noise suppression in speech regions compared to EEUE-FCNN, as indicated in [13].

For the coloured (factory) noise experiment (Figure 8), the listeners again preferred the proposed method (75%) over the competing SEAs, with only clean speech (100%) and AKF-Oracle (84.86%) being more preferred. As in the previous experiment, DeepLPC-ResNet-TCN-AKF was the most preferred amongst the competing methods (71.41%), with Deep Xi-ResNet-TCN-MMSE-LSA being the next most preferred (67.22%), followed by Deep Xi-KF (58.71%). As with the scores in Figure 7, the enhanced speech of LSTM-CKFS was preferred (42%) more than that of EEUE-FCNN (41%). In light of the blind AB listening tests, it is evident that the enhanced speech of the proposed method exhibits the best perceived quality amongst all tested methods for both male and female recordings corrupted by real-life non-stationary as well as coloured noises.

FIGURE 7: The mean preference score (%) comparison between the proposed and benchmark SEAs for the recording sp05 corrupted with 5 dB non-stationary voice babble noise. [Bar chart not reproduced.]

FIGURE 8: The mean preference score (%) comparison between the proposed and benchmark SEAs for the recording sp27 corrupted with 5 dB coloured factory noise. [Bar chart not reproduced.]

VI. CONCLUSION
In this study, we investigated whether an attention-based network is more appropriate for AKF-based speech enhancement than a TCN. To this end, we replaced the ResNet-TCN used in the DeepLPC framework with the MHANet. Compared to DeepLPC-ResNet-TCN, the proposed method, DeepLPC-MHANet, produces LPC estimates with less bias. Moreover, the AKF constructed with the clean speech and noise LPC parameters estimated by DeepLPC-MHANet is able to attain higher quality and intelligibility scores.
APPENDIX A
DeepLPC TRAINING TARGET
Here, we describe the training targets for DeepLPC [34]. The clean speech and noise LPC-PS are denoted as λs(l, m) and λv(l, m), respectively. During training, λs(l, m) and λv(l, m) are computed as in [35, Chapter 9]:

λs(l, m) = σw² / |1 + Σ_{i=1}^{p} a_i e^{−j2πim/M}|²,  (25)

λv(l, m) = σu² / |1 + Σ_{k=1}^{q} b_k e^{−j2πkm/M}|²,  (26)

where ({a_i}, σw²) and ({b_k}, σu²) are computed from the clean speech, s(n, l), and the noise signal, v(n, l), using the autocorrelation method [35, Chapter 8], and m ∈ {0, 1, ..., M−1} (M = 257). As in [34], we used speech and noise LPC orders of p = 16 and q = 16, respectively.
Next, the dynamic range of λs(l, m) and λv(l, m) is compressed to the interval [0, 1] by using the cumulative distribution function (CDF) of λs(l, m)[dB] and λv(l, m)[dB], where λs(l, m)[dB] = 10 log10(λs(l, m)) and λv(l, m)[dB] = 10 log10(λv(l, m)) [34]. As shown in Figures 9 (a) and (c), λs(l, 64)[dB] and λv(l, 64)[dB] follow a Gaussian distribution. Hence, we assume that λs(l, m)[dB] and λv(l, m)[dB] are distributed normally with means µs and µv and variances σs² and σv², respectively (λs(l, m)[dB] ~ N(µs, σs²) and λv(l, m)[dB] ~ N(µv, σv²)).
The statistics of λs(l, m)[dB] and λv(l, m)[dB], i.e., (µs, σs²) and (µv, σv²) for each frequency bin, m, were found over a sample of the training set: 2 500 randomly selected clean speech recordings were mixed with 2 500 randomly selected noise recordings from the training set (Section IV-A) at SNR levels ranging from -10 dB to +20 dB in 1 dB increments, giving 2 500 noisy speech signals. For each frequency bin, m, the sample means and variances, (µs, σs²) and (µv, σv²), were computed from the 2 500 concatenated clean speech recordings and scaled noise recordings, respectively. This sample was also used as the sample for Figure 9.
The CDF of λs(l, 64)[dB] over the sample is shown in Figure 9 (b), and is used to compress the dynamic range of λs(l, 64)[dB]. Similarly, the CDF of λv(l, 64)[dB] over the sample is shown in Figure 9 (d), and is used to compress the dynamic range of λv(l, 64)[dB]. The CDFs of λs(l, m)[dB] and λv(l, m)[dB] are defined as [34]:

λ̄s(l, m) = (1/2)[1 + erf((λs(l, m)[dB] − µs) / (σs√2))],  (27)

λ̄v(l, m) = (1/2)[1 + erf((λv(l, m)[dB] − µv) / (σv√2))].  (28)

APPENDIX B
DeepLPC INFERENCE
During inference, ζ̂_l is first split into λ̄̂s(l, m) and λ̄̂v(l, m). The LPC-PS of the clean speech and the noise signal are then computed from λ̄̂s(l, m) and λ̄̂v(l, m) as:

λ̂s(l, m) = 10^((σs√2 erf⁻¹(2λ̄̂s(l, m) − 1) + µs)/10),  (29)

λ̂v(l, m) = 10^((σv√2 erf⁻¹(2λ̄̂v(l, m) − 1) + µv)/10).  (30)

Next, ({â_i}, σ̂w²) and ({b̂_k}, σ̂u²) are computed from λ̂s(l, m) and λ̂v(l, m), as described in [34].

FIGURE 9: (Color online) The distribution of (a) λs(l, 64)[dB] and (c) λv(l, 64)[dB]. The CDF of (b) λs(l, 64)[dB] and (d) λv(l, 64)[dB], where the sample mean and variance were found over the sample of the training set.
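As a compact illustration of equations (25)-(30), the Python sketch below (NumPy and SciPy assumed available) computes the LPC-PS of one frame and applies the CDF-based compression and its inverse; the per-bin arrays mu and sigma stand in for the training-sample statistics described above, and the sketch is illustrative rather than the released DeepLPC code:

import numpy as np
from scipy.special import erf, erfinv

def lpc_ps(a, var, M=257):
    # Eq. (25)-(26): var / |1 + sum_i a_i exp(-j 2 pi i m / M)|^2 over M single-sided bins.
    A = np.fft.rfft(np.concatenate(([1.0], a)), n=2 * (M - 1))
    return var / np.abs(A) ** 2

def compress(ps_db, mu, sigma):
    # Eq. (27)-(28): the Gaussian CDF maps dB values to the interval [0, 1].
    return 0.5 * (1.0 + erf((ps_db - mu) / (sigma * np.sqrt(2.0))))

def decompress(target, mu, sigma):
    # Eq. (29)-(30): the inverse CDF map recovers the LPC-PS from [0, 1] values.
    return 10.0 ** ((sigma * np.sqrt(2.0) * erfinv(2.0 * target - 1.0) + mu) / 10.0)

For example, compress(10 * np.log10(lpc_ps(a, var)), mu, sigma) produces the compressed target λ̄s(l, m) for a frame with speech LPCs a and prediction error variance var; decompress inverts the mapping at inference.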
REFERENCES
[1] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd ed. Boca Raton, FL, USA: CRC Press, Inc., 2013.
[2] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113–120, April 1979.
[3] S. Kamath and P. Loizou, “A multi-band spectral subtraction method for enhancing speech corrupted by colored noise,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 4160–4164, May 2002.
[4] K. Paliwal, K. Wójcicki, and B. Schwerin, “Single-channel speech enhancement using spectral subtraction in the short-time modulation domain,” Speech Communication, vol. 52, no. 5, pp. 450–475, May 2010.
[5] P. Scalart and J. V. Filho, “Speech enhancement based on a priori signal to noise estimation,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 629–632, May 1996.
[6] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, December 1984.
[7] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, April 1985.
[8] B. M. Mahmmod, A. R. Ramli, T. Baker, F. Al-Obeidat, S. H. Abdulhussain, and W. A. Jassim, “Speech enhancement algorithm based on super-gaussian modeling and orthogonal polynomials,” IEEE Access, vol. 7, pp. 103485–103504, 2019.
[9] K. Paliwal and A. Basu, “A speech enhancement method based on Kalman filtering,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 12, pp. 177–180, April 1987.
[10] J. D. Gibson, B. Koo, and S. D. Gray, “Filtering of colored noise for speech enhancement and coding,” IEEE Transactions on Signal Processing, vol. 39, no. 8, pp. 1732–1742, August 1991.
[11] G. J. Brown and D. Wang, Separation of Speech by Computational Auditory Scene Analysis. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005.
[12] Y. Xu, J. Du, L. Dai, and C. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, 2014.
[13] A. E. George, S. So, R. Ghosh, and K. K. Paliwal, “Robustness metric-based tuning of the augmented Kalman filter for the enhancement of speech corrupted with coloured noise,” Speech Communication, vol. 105, pp. 62–76, October 2018.
[14] S. K. Roy, A. Nicolson, and K. K. Paliwal, “Deep learning with augmented Kalman filter for single-channel speech enhancement,” in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020, pp. 1–5.
[15] S. K. Roy, W. P. Zhu, and B. Champagne, “Single channel speech enhancement using subband iterative Kalman filter,” IEEE International Symposium on Circuits and Systems, pp. 762–765, May 2016.
[16] Y. Wang and D. Wang, “Towards scaling up classification-based speech separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1381–1390, 2013.
[17] Y. Wang, A. Narayanan, and D. Wang, “On training targets for supervised speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849–1858, 2014.
[18] N. Saleem, M. I. Khattak, M. Al-Hasan, and A. B. Qazi, “On learning spectral masking for single channel speech enhancement using feedforward and recurrent neural networks,” IEEE Access, vol. 8, pp. 160581–160595, 2020.
[19] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 708–712.
[20] D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio masking for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 483–492, 2016.
[21] N. Zheng and X. Zhang, “Phase-aware speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 63–76, 2019.
[22] K. Han, Y. Wang, D. Wang, W. S. Woods, I. Merks, and T. Zhang, “Learning spectral mapping for speech dereverberation and denoising,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 6, pp. 982–992, 2015.
[23] A. Nicolson and K. K. Paliwal, “Deep learning for minimum mean-square error approaches to speech enhancement,” Speech Communication, vol. 111, pp. 44–55, 2019.
[24] Q. Zhang, A. Nicolson, M. Wang, K. K. Paliwal, and C. Wang, “DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1404–1415, 2020.
[25] A. Nicolson and K. K. Paliwal, “Masked multi-head self-attention for causal speech enhancement,” Speech Communication, vol. 125, pp. 80–96, 2020.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 5998–6008.
[27] S.-W. Fu, Y. Tsao, X. Lu, and H. Kawai, “Raw waveform-based speech enhancement by fully convolutional networks,” 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 006–012, 2017.
[28] S. Fu, T. Wang, Y. Tsao, X. Lu, and H. Kawai, “End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1570–1584, 2018.
[29] C. Pickersgill, S. So, and B. Schwerin, “Investigation of DNN prediction of power spectral envelopes for speech coding & ASR,” 17th Speech Science and Technology Conference (SST2018), Sydney, Australia, Dec 2018.
[30] H. Yu, Z. Ouyang, W. Zhu, B. Champagne, and Y. Ji, “A deep neural network based Kalman filter for time domain speech enhancement,” IEEE International Symposium on Circuits and Systems, pp. 1–5, May 2019.
[31] H. Yu, W.-P. Zhu, and B. Champagne, “Speech enhancement using a DNN-augmented colored-noise Kalman filter,” Speech Communication, vol. 125, pp. 142–151, 2020.
[32] S. Srinivasan, J. Samuelsson, and W. B. Kleijn, “Codebook driven short-term predictor parameter estimation for speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 163–176, 2006.
[33] S. K. Roy, A. Nicolson, and K. K. Paliwal, “A deep learning-based Kalman filter for speech enhancement,” in Proc. Interspeech 2020, October 2020.
[34] S. K. Roy, A. Nicolson, and K. K. Paliwal, “DeepLPC: A deep learning approach to augmented Kalman filter-based single-channel speech enhancement,” TechRxiv, 2021.
[35] S. V. Vaseghi, Advanced Digital Signal Processing and Noise Reduction. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2006.
[36] L. J. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” CoRR, vol. abs/1607.06450, 2016.
[37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[38] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5206–5210, April 2015.
[39] C. Veaux, J. Yamagishi, and K. MacDonald, “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.
[40] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon Technical Report N, vol. 93, Feb. 1993.
[41] D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason, “The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms,” in Proceedings Interspeech 2010, 2010, pp. 3110–3113.
[42] G. Hu, “100 nonspeech environmental sounds,” The Ohio State University, Department of Computer Science and Engineering, 2004.
[43] F. Saki, A. Sehgal, I. Panahi, and N. Kehtarnavaz, “Smartphone-based real-time classification of noise signals using subband features and random forest classifier,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 2204–2208.
[44] F. Saki and N. Kehtarnavaz, “Automatic switching between noise classification and speech enhancement for hearing aid devices,” in 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Aug 2016, pp. 736–739.
[45] D. Snyder, G. Chen, and D. Povey, “MUSAN: A music, speech, and noise corpus,” CoRR, vol. abs/1510.08484, 2015.
[46] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014.
[47] A. Gray and J. Markel, “Distance measures for speech processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 5, pp. 380–391, 1976.
[48] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238, 2008.
[49] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 749–752, May 2001.
[50] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[51] J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 626–630.
[52] P. Mermelstein, “Evaluation of a segmental SNR measure as an indicator of the quality of ADPCM coded speech,” The Journal of the Acoustical Society of America, vol. 66, no. 6, pp. 1664–1667, 1979.

SUJAN KUMAR ROY (Student Member, IEEE) received the B.Sc. and M.Sc. degrees in Computer Science and Engineering from the University of Rajshahi, Bangladesh, in 2008 and 2010, respectively. He also received a Master of Applied Science (M.A.Sc) degree in Electrical and Computer Engineering from Concordia University, Canada, in May 2016. He is currently a Ph.D candidate in the School of Engineering, Griffith University, Brisbane, Australia. His research interests include speech processing, machine learning, and data science.

KULDIP K. PALIWAL was born in Aligarh, India, in 1952. He received the B.S. degree from Agra University, Agra, India, in 1969, the M.S. degree from Aligarh Muslim University, Aligarh, India, in 1971, and the Ph.D degree from Bombay University, Bombay, India, in 1978. He has been carrying out research in the area of speech processing since 1972. He has worked at a number of organizations, including Tata Institute of Fundamental Research, Bombay, India, Norwegian Institute of Technology, Trondheim, Norway, University of Keele, U.K., AT&T Bell Laboratories, Murray Hill, New Jersey, U.S.A., AT&T Shannon Laboratories, Florham Park, New Jersey, U.S.A., and Advanced Telecommunication Research Laboratories, Kyoto, Japan. Since July 1993, he has been a professor at Griffith University, Brisbane, Australia, in the School of Microelectronic Engineering. His current research interests include speech recognition, speech coding, speaker recognition, speech enhancement, face recognition, image coding, pattern recognition, and artificial neural networks. He has published more than 300 papers in these research areas. Prof. Paliwal is a Fellow of the Acoustical Society of India. He served the IEEE Signal Processing Society's Neural Networks Technical Committee as a founding member from 1991 to 1995 and the Speech Processing Technical Committee from 1999 to 2003. He was an Associate Editor of the IEEE Transactions on Speech and Audio Processing during the periods 1994-1997 and 2003-2004. He is on the Editorial Board of the IEEE Signal Processing Magazine. He also served as an Associate Editor of the IEEE Signal Processing Letters from 1997 to 2000. He was the General Co-Chair of the Tenth IEEE Workshop on Neural Networks for Signal Processing (NNSP2000). He has co-edited two books: "Speech Coding and Synthesis" (published by Elsevier) and "Speech and Speaker Recognition: Advanced Topics" (published by Kluwer). He received the IEEE Signal Processing Society's best (senior) paper award in 1995 for his paper on LPC quantization. He served as the Editor-in-Chief of the Speech Communication journal (published by Elsevier) during 2005-2011.
He served as the Editor-in-Chief of the fellow at the Australian eHealth Research Centre, Speech Communication journal (published by Elsevier) during 2005-2011. CSIRO. His research interests include speech, nat- ural language, image, and multimodal processing using deep learning. 14 VOLUME **, **** Chapter 8 On Training Targets for Supervised LPC Estimation to Augmented Kalman Filter-Based Speech Enhancement 121 STATEMENT OF CONTRIBUTION TO CO-AUTHORED PUBLISHED PAPER This chapter includes a co-authored paper. The bibliographic details of the co-authored paper, including all authors, are: Sujan Kumar Roy, Aaron Nicolson, Kuldip K. Paliwal, "On training targets for supervised LPC estimation to augmented Kalman filter-based speech enhancement", Under review with: Speech Communication (Submitted at 12 April 2021). My contribution to the paper involved: • Preliminary experiments. • Experiment design. • Conducted the experiments. • Code writing. • Design of models. • Analysis of results. • Literature review. • Manuscript writing. Aaron Nicolson aided with drafting the final manuscript. Professor Kuldip K. Paliwal provided supervision and aided with editing the final manuscript. (Signed) _____________ ____________ (Date) 02/04/2021 Sujan Kumar Roy (Countersigned) _______ _________ (Date) 02/04/2021 Aaron Nicolson (Countersigned) ____ _____ (Date) 02/04/2021 Supervisor: Professor Kuldip K. Paliwal On Training Targets for Supervised LPC Estimation to Augmented Kalman Filter-based Speech Enhancement Sujan Kumar Roya,∗, Aaron Nicolsonb , Kuldip K. Paliwala a Signal Processing Laboratory, Griffith University, Nathan Campus, Brisbane, QLD, 4111, Australia b Australian eHealth Research Centre, CSIRO, Herston, QLD, 4006, Australia Abstract The performance of speech coding, speech recognition, and speech enhancement largely depends upon the accuracy of the linear prediction coefficient (LPC) of clean speech and noise in practice. Formulation of speech and noise LPC estimation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised technique, typically a deep neural network (DNN) is trained to learn a mapping from noisy speech features to clean speech and noise LPCs. Training targets for DNN to clean speech and noise LPC estimation fall into four categories: line spectrum frequency (LSF), LPC power spectrum (LPC-PS), power spectrum (PS), and magnitude spectrum (MS). The choice of appropriate training target as well as the DNN method can have a significant impact on LPC estimation in practice. Motivated by this, we perform a comprehensive study on the training targets using two state-of-the-art DNN methods— residual network and temporal convolutional network (ResNet-TCN) and multi-head attention network (MHANet). This study aims to determine which training target as well as DNN method produces more accurate LPCs in practice. We train the ResNet-TCN and MHANet for each training target with a large data set. Experiments on the NOIZEUS corpus demonstrate that the LPC-PS training target with MHANet produces a lower spectral distortion (SD) level in the estimated speech LPCs in real-life noise conditions. We also construct the AKF with the estimated speech and noise LPC parameters from each training target using ResNet-TCN and MHANet. 
Subjective AB listening tests and seven different objective quality and intelligibility evaluation measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR) on the NOIZEUS corpus demonstrate that the AKF constructed with MHANet-LPC-PS-driven speech and noise LPC parameters produces enhanced speech with higher quality and intelligibility than the competing methods.

Keywords: Speech enhancement, augmented Kalman filter, LPC, residual network, temporal convolutional network, multi-head attention network.

1. Introduction

Speech processing applications, such as low-bit rate audio coding, speech enhancement, and speech recognition, rely upon the accuracy of linear prediction coefficient (LPC) estimates of clean speech and noise in practice (Vaseghi, 2006, Chapter 8). For example, inaccurate estimates of speech and noise LPC parameters impact the quality and intelligibility of the enhanced speech produced by an augmented Kalman filter (AKF) (Gibson et al., 1991). To address this, deep learning assisted LPC estimation has been integrated with the Kalman filter (KF) and AKF for speech enhancement. For instance, DNN assisted LPC estimation (DNN-LPC) (Pickersgill et al., 2018), a fully-connected feed-forward DNN (denoted as FNN) assisted KF (FNN-KF) (Yu et al., 2019), FNN and LSTM-based LPC estimation for the colored KF (FNN-CKFS and LSTM-CKFS) (Yu et al., 2020), and a deep learning framework (denoted as Deep Xi (Nicolson and Paliwal, 2019)) and whitening filter assisted KF (Deep Xi-KF) (Roy et al., 2020a) and AKF (Deep Xi-AKF) (Roy et al., 2020b) have been introduced. This paper focuses on training targets for supervised LPC estimation with an application to AKF-based speech enhancement.

Paliwal and Basu (1987) introduced the KF for speech enhancement in white noise conditions. For the KF, each clean speech frame is represented by an auto-regressive (AR) process, whose parameters comprise the LPCs and prediction error variance. The LPC parameters and additive noise variance are used to construct the KF recursive equations. Given a frame of noisy speech samples, the KF gives a linear MMSE estimate of the clean speech samples using the recursive equations. Paliwal and Basu (1987) demonstrated that inaccurate estimates of the LPC parameters and noise variance result in poor quality and intelligibility in the enhanced speech produced by the KF.

* Corresponding author. Email addresses: [email protected] (Sujan Kumar Roy), [email protected] (Aaron Nicolson), [email protected] (Kuldip K. Paliwal).
Gibson et al. (1991) introduced an AKF for speech enhancement in coloured noise conditions. In the AKF, both the clean speech and additive noise are represented by two AR processes. The speech and noise LPC parameters form an augmented matrix, which is used to construct the recursive equations of the AKF. In this speech enhancement algorithm (SEA), the AKF processes the noisy speech iteratively (usually three to four iterations) to eliminate the colored background noise, yielding the enhanced speech. During this, the LPC parameters for the current frame are computed from the corresponding filtered speech frame of the previous iteration (Gibson et al., 1991). Although the iterative AKF improves the signal-to-noise ratio (SNR) of noisy speech, the resultant enhanced speech suffers from musical noise and speech distortion. Therefore, the AKF method (Gibson et al., 1991) does not adequately address the LPC parameter estimates in practice.

George et al. (2018) proposed a robustness metric-based tuning of the AKF for enhancing colored noise corrupted speech (denoted as AKF-RMBT). They demonstrated that inaccurate estimates of the clean speech and noise LPC parameters introduce bias in the AKF gain, leading to a degradation in speech enhancement performance. To address this, the noise LPC parameters are computed from the first noisy speech frame by assuming that it contains no speech. The computed noise LPC parameters remain constant while processing all of the noisy speech frames for a given utterance. A whitening filter is also constructed with the constant noise LPCs to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. As a result, the speech and noise LPC parameter estimation process in AKF-RMBT does not adequately address noise conditions that have time-varying amplitudes. Moreover, the adjusted AKF gain is under-estimated in speech regions, resulting in distorted speech. Roy and Paliwal (2020) proposed an extension of AKF-RMBT (George et al., 2018) by employing a sensitivity metric-based tuning of the AKF (denoted as AKF-SMBT). In AKF-SMBT, the speech and noise LPC parameters are computed with a similar process as in AKF-RMBT (George et al., 2018). It is demonstrated that the application of the sensitivity metric in AKF-SMBT reduces the under-estimation of the AKF gain, particularly in speech regions, as compared to AKF-RMBT. It is also shown that the reduced-bias AKF gain in AKF-SMBT reduces the amount of residual noise and distortion in the enhanced speech as compared to AKF-RMBT. However, AKF-SMBT also does not adequately address speech enhancement for conditions that have time-varying amplitudes.

Deep learning has been used for LPC estimation, a key parameter for KF and AKF-based SEAs (Gibson et al., 1991; Paliwal and Basu, 1987). Pickersgill et al. (2018) proposed a deep learning-based LPC estimation method (DNN-LPC). In DNN-LPC, a traditional DNN (originally proposed in (Xu et al., 2014)) learns a mapping from each frame of the noisy speech log power spectra (LPS) to the log LPC power spectra of clean speech. The estimated log LPC power spectra are then converted to LPC power spectra (LPC-PS), and an inverse Fourier transform of the LPC-PS gives the autocorrelation matrix. Solving the autocorrelation matrix using the Levinson-Durbin recursion (Vaseghi, 2006, Chapter 8) yields the LPC parameters of the clean speech. The spectral distortion (SD) (dB) measure was used to evaluate the performance of the estimated LPCs. However, the SD evaluation results were not given for low SNR levels (below 10 dB). Moreover, only six noise recordings were used for training the DNN, reducing its generalisation capability for unseen noise conditions.

Yu et al. (2019) proposed a deep learning assisted KF for speech enhancement (FNN-KF). A traditional FNN (containing an input layer, three hidden layers, and an output layer) is used as the deep learning method. The line spectrum frequency (LSF) (Itakura, 1975) representation of clean speech (12th order) is used as the training target for the FNN. Specifically, the FNN learns a mapping from each frame of the noisy speech LSFs to the clean speech LSFs. During inference, the estimated LSFs are converted to clean speech LPCs. For training the FNN, only 10 720 samples constructed from 670 speech recordings, four noise recordings, and four SNR levels were used. The small training set reduces the generalization capability of the FNN to unseen conditions. In addition, the additive noise variance is computed from the first noisy speech frame by assuming that it contains no speech. However, this does not account for conditions that have time-varying amplitudes. Later on, Yu et al. (2020) used FNN and LSTM networks to estimate the speech and noise LPCs for colored KF-based speech enhancement (FNN-CKFS and LSTM-CKFS). Specifically, the authors used a similar FNN as in (Yu et al., 2019), while the LSTM network is constructed by stacking an input layer, two LSTM layers (with 512 units in each layer), one feed-forward layer with 512 units, and an output layer. The FNN and LSTM learn a mapping from each frame of the noisy speech LSFs to the clean speech and noise LSFs, and the estimated LSFs are then converted to the clean speech and noise LPCs. In addition, a maximum likelihood (ML) approach (Srinivasan et al., 2006) is employed to estimate the prediction error variances of the speech and noise AR processes. However, FNN-CKFS and LSTM-CKFS do not adequately address speech and noise LPC parameter estimation in various noise conditions, leading to the use of multi-band spectral subtraction (MB-SS) (Kamath and Loizou, 2002) for post-processing. This could be due to training the FNN and LSTM with a small dataset (Yu et al., 2019).

Roy et al. (2020a) incorporated the DeepMMSE framework (Zhang et al., 2020) with the KF (denoted as Deep Xi-KF, since DeepMMSE uses Deep Xi (Nicolson and Paliwal, 2019)) for speech enhancement. In Deep Xi-KF, the DeepMMSE framework (Zhang et al., 2020) within a residual network and temporal convolutional network (ResNet-TCN) (Bai et al., 2018; He et al., 2016) is used to estimate the noise power spectral density (PSD) for each noisy speech frame. Roy et al. (2020b) proposed a deep learning-based AKF for speech enhancement (Deep Xi-AKF). In this method, a ResNet-TCN within the Deep Xi framework (Nicolson and Paliwal, 2019), Deep Xi-ResNet-TCN, is used to estimate the noise PSD for each noisy speech frame. Roy and Paliwal (2020a) proposed a causal convolutional encoder-decoder (CCED)-based AKF for speech enhancement (denoted as CCED-AKF). In this method, the CCED maps each frame of the noisy speech magnitude spectrum (MS) to the noise magnitude spectrum, from which the noise PSD is computed. Roy and Paliwal (2020b) proposed a deep residual network assisted AKF for speech enhancement (denoted as ResNet-AKF). Different from MS or PSD estimation of the noise, the ResNet learns a mapping from the noisy speech waveform to the noise waveform. In Deep Xi-KF, Deep Xi-AKF, CCED-AKF, and ResNet-AKF (Roy et al., 2020a,b; Roy and Paliwal, 2020a,b), the noise parameters are computed from the estimated noise. A whitening filter is also constructed, with its coefficients computed from the estimated noise, to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. As demonstrated in (George et al., 2018), the whitening filter only partially reduces the bias in the speech LPC parameters. Thus, Deep Xi-KF, Deep Xi-AKF, CCED-AKF, and ResNet-AKF (Roy et al., 2020a,b; Roy and Paliwal, 2020a,b) still do not adequately address the speech LPC parameter estimates in practice.

Recently, a DeepLPC framework (Roy et al., 2021a) within ResNet-TCN was proposed to jointly estimate the speech and noise LPC power spectra (LPC-PS). The clean speech and noise LPC parameters are computed from the LPC-PS estimates. DeepLPC-ResNet-TCN demonstrates a lower SD level in the estimated speech LPCs than the competing methods. This leads to the production of the highest quality and most intelligible enhanced speech amongst current KF and AKF SEAs, outperforming Deep Xi-KF while using the same training set. However, the multi-head attention network (MHANet) has been shown to outperform ResNet-TCN for speech enhancement. Motivated by this, Roy et al. (2021b) proposed an extension of the DeepLPC framework (Roy et al., 2021a) within MHANet, called DeepLPC-MHANet, to further improve the speech and noise LPC parameter estimates for the AKF. DeepLPC-MHANet demonstrates a lower SD level in the estimated speech LPCs than DeepLPC-ResNet-TCN (Roy et al., 2021a) in various noise conditions. In addition, the AKF constructed with the DeepLPC-MHANet-driven speech and noise LPC parameters produces higher quality and more intelligible enhanced speech than DeepLPC-ResNet-TCN-AKF (Roy et al., 2021a).

Motivated by the recent improvements in deep learning assisted LPC estimation, this paper aims to perform a comprehensive study of the LSF, LPC-PS, power spectrum (PS), and magnitude spectrum (MS) training targets using two state-of-the-art deep learning methods, ResNet-TCN and MHANet. The aim of this study is to determine which training target as well as deep learning method produces more accurate speech and noise LPC parameters in real-life noise conditions. For this purpose, we train the ResNet-TCN and MHANet for each training target using a large data set, as found in (Roy et al., 2021b). We compare the SD level of the estimated speech LPCs for each training target and deep learning method. We also verify the performance of the estimated LPCs in a speech enhancement context, where the AKF is constructed with the estimated speech and noise LPCs from the different training targets using MHANet and ResNet-TCN. Specifically, we evaluate the performance of the enhanced speech produced by the AKF using subjective AB listening tests and seven different objective quality and intelligibility measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR). The SD level evaluation as well as the subjective and objective experiments are conducted on the NOIZEUS corpus in the presence of real-world non-stationary and colored noise conditions for a wide range of SNR levels.

The structure of this paper is as follows: background knowledge is presented in Section 2, including the signal model and the AKF for speech enhancement. In Section 3, we present the supervised LPC estimation framework, including the training targets. Following this, Section 4 describes the details of the DNN methods used in evaluating the training targets. After that, Section 5 describes the experimental setup in terms of the speech corpus and the objective and subjective evaluation measures. The experimental results are then presented in Section 6. Finally, Section 7 gives some concluding remarks.

2. Background

2.1. Signal model

The noisy speech y(n), at discrete-time sample n, is assumed to be given by:

y(n) = s(n) + v(n),  (1)

where s(n) is the clean speech and v(n) is uncorrelated additive coloured noise. Since the LPC estimation as well as the AKF-based speech enhancement operates on a frame-by-frame basis, a 32 ms rectangular window with 50% overlap is first used to convert y(n) into frames, denoted by y(n, l):

y(n, l) = s(n, l) + v(n, l),  (2)

where l ∈ {0, 1, ..., L−1} is the frame index, L is the total number of frames in an utterance, and N is the total number of samples within each frame, i.e., n ∈ {0, 1, ..., N−1}.

2.2. AKF for speech enhancement

For simplicity, the frame index is omitted in the AKF recursive equations. Each frame of the clean speech and noise signal in (2) can be represented with pth and qth order AR models, as in (Vaseghi, 2006, Chapter 8):

s(n) = − Σ_{i=1}^{p} a_i s(n−i) + w(n),  (3)

v(n) = − Σ_{k=1}^{q} b_k v(n−k) + u(n),  (4)

where {a_i; i = 1, 2, ..., p} and {b_k; k = 1, 2, ..., q} are the LPCs. w(n) and u(n) are assumed to be white noise with zero mean and variances σw² and σu², respectively.
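As a minimal sketch of the AR modelling in equations (3)-(4), the following Python fragment (NumPy assumed) computes the LPC parameters of one analysis frame with the autocorrelation method via the Levinson-Durbin recursion; it is an illustration of the standard procedure, not the authors' code:

import numpy as np

def levinson(r, p):
    # Levinson-Durbin recursion: LPCs and prediction error variance
    # from the autocorrelation coefficients r[0..p].
    a = np.zeros(p)
    err = r[0]
    for i in range(p):
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err  # reflection coefficient
        a_prev = a[:i].copy()
        a[:i] = a_prev - k * a_prev[::-1]
        a[i] = k
        err *= 1.0 - k * k
    return -a, err  # negated to match eq. (3): s(n) = -sum_i a_i s(n-i) + w(n)

def lpc_frame(x, p=16):
    # Autocorrelation method (Vaseghi, 2006, Ch. 8) applied to one frame x.
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + p]
    return levinson(r, p)

Applying lpc_frame to each 32 ms frame (512 samples at 16 kHz) of the clean speech and noise with p = q = 16 yields the oracle parameters ({a_i}, σw²) and ({b_k}, σu²) used throughout this section.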
Equations (2)-(4) can be used to form the following augmented state-space model (ASSM) of the AKF, as in (Gibson et al., 1991):

x(n) = Φ x(n−1) + r g(n),  (5)

y(n) = c⊤ x(n).  (6)

In the above ASSM,
1. x(n) = [s(n) ... s(n−p+1) v(n) ... v(n−q+1)]⊤ is a (p+q) × 1 state vector,
2. Φ = [Φs 0; 0 Φv] is a (p+q) × (p+q) state-transition matrix with:

Φs = [ −a1 −a2 ... −a_{p−1} −a_p ;
        1    0  ...    0      0  ;
        0    1  ...    0      0  ;
        .    .  ...    .      .  ;
        0    0  ...    1      0  ],  (7)

Φv = [ −b1 −b2 ... −b_{q−1} −b_q ;
        1    0  ...    0      0  ;
        0    1  ...    0      0  ;
        .    .  ...    .      .  ;
        0    0  ...    1      0  ],  (8)

3. r = [rs 0; 0 rv], where rs = [1 0 ... 0]⊤ and rv = [1 0 ... 0]⊤ are p × 1 and q × 1 vectors,
4. g(n) = [w(n) u(n)]⊤,  (9)
5. c⊤ = [cs⊤ cv⊤], where cs = [1 0 ... 0]⊤ and cv = [1 0 ... 0]⊤ are p × 1 and q × 1 vectors,
6. y(n) is the noisy measurement at sample n.

For each frame, the AKF computes an unbiased linear MMSE estimate, x̂(n|n), at sample n, given y(n), by using the following recursive equations (Gibson et al., 1991):

x̂(n|n−1) = Φ x̂(n−1|n−1),  (10)

Ψ(n|n−1) = Φ Ψ(n−1|n−1) Φ⊤ + Q rr⊤,  (11)

K(n) = Ψ(n|n−1) c (c⊤ Ψ(n|n−1) c)⁻¹,  (12)

x̂(n|n) = x̂(n|n−1) + K(n)[y(n) − c⊤ x̂(n|n−1)],  (13)

Ψ(n|n) = [I − K(n) c⊤] Ψ(n|n−1),  (14)

where Q = [σw² 0; 0 σu²] is the process noise covariance.

For a noisy speech frame, the error covariances (Ψ(n|n−1) and Ψ(n|n), corresponding to x̂(n|n−1) and x̂(n|n)) and the Kalman gain K(n) are continually updated on a sample-by-sample basis, while ({a_i}, σw²) and ({b_k}, σu²) remain constant. At sample n, h⊤x̂(n|n) gives the output of the AKF, ŝ(n|n), where h = [1 0 0 ... 0]⊤ is a (p+q) × 1 column vector. As demonstrated in AKF-RMBT (George et al., 2018), ŝ(n|n) is given by:

ŝ(n|n) = [1 − K0(n)] ŝ(n|n−1) + K0(n)[y(n) − v̂(n|n−1)],  (15)

where K0(n) is the 1st component of K(n), given by (George et al., 2018):

K0(n) = (α²(n) + σw²) / (α²(n) + σw² + β²(n) + σu²),  (16)

where

α²(n) = cs⊤ Φs Ψs(n−1|n−1) Φs⊤ cs,  (17)

β²(n) = cv⊤ Φv Ψv(n−1|n−1) Φv⊤ cv  (18)

are the transmissions of the a posteriori error variances of the speech and noise augmented dynamic models from the previous sample, n−1, respectively (George et al., 2018).

Equation (15) reveals that K0(n) has a significant impact on ŝ(n|n). In practice, inaccurate estimates of ({a_i}, σw²) and ({b_k}, σu²) introduce bias into K0(n), which impacts ŝ(n|n). In this paper, we investigate both the appropriate training target as well as the deep learning method for more accurate estimates of ({a_i}, σw²) and ({b_k}, σu²), leading to an improved ŝ(n|n).
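The recursion of equations (10)-(14) amounts to a few matrix operations per sample. The sketch below is a minimal illustration rather than an optimised implementation: Phi is the augmented state-transition matrix of equation (5), c the observation vector, and Q_rr the (p+q) × (p+q) process-noise term of equation (11), which carries σw² at entry (0, 0) and σu² at entry (p, p), with zeros elsewhere:

import numpy as np

def akf_frame(y, Phi, c, Q_rr, x, P):
    # One frame of the AKF; the state x and covariance P carry over between frames.
    s_hat = np.empty(len(y))
    for n in range(len(y)):
        x = Phi @ x                                # eq. (10): a priori state estimate
        P = Phi @ P @ Phi.T + Q_rr                 # eq. (11): a priori error covariance
        K = P @ c / (c @ P @ c)                    # eq. (12): Kalman gain
        x = x + K * (y[n] - c @ x)                 # eq. (13): a posteriori state estimate
        P = (np.eye(len(x)) - np.outer(K, c)) @ P  # eq. (14): a posteriori error covariance
        s_hat[n] = x[0]                            # h^T x(n|n): the clean speech estimate
    return s_hat, x, P

Phi can be assembled from the companion matrices of equations (7)-(8), for example with scipy.linalg.block_diag(Phi_s, Phi_v), and the filter is re-run per frame with that frame's LPC parameters.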
3. Supervised LPC estimation

The supervised LPC estimation framework is shown in Fig. 1. The framework is fed as input the single-sided noisy speech magnitude spectrum |Y_l| = {|Y(l, 0)|, |Y(l, 1)|, ..., |Y(l, M−1)|}. This is computed from the noisy speech in Equation (1) using the short-time Fourier transform (STFT):

Y(l, m) = S(l, m) + V(l, m),  (19)

where Y(l, m), S(l, m), and V(l, m) denote the complex-valued STFT coefficients of the noisy speech, clean speech, and noise, respectively, for time-frame index l and discrete-frequency bin m. The Hamming window is used for analysis and synthesis. In this framework, a DNN learns to map from |Y_l| to the speech and noise features (in concatenated form), ζ_l. In this study, the LSF, LPC-PS, PS, and MS (Sections 3.1-3.4) of the speech and noise are used as the training targets for the DNN. In addition, ResNet-TCN and MHANet are used as the DNN methods. During inference, ζ̂_l is split into the mapped features of speech and noise, followed by inverse mapping of them, yielding the estimated features for speech and noise. Then ({â_i}, σ̂w²) and ({b̂_k}, σ̂u²) are computed from the estimated speech and noise features. As in the DeepLPC-ResNet-TCN framework (Roy et al., 2021a), we have used speech and noise LPC orders of p = 16 and q = 16, respectively.

Figure 1: (Color online) Supervised LPC estimation framework.

3.1. LSF training target

The LSFs of speech and noise (denoted as {ρ_i} and {η_k}) are used as training targets for the DNN in the LPC estimation framework (Fig. 1). During training of the DNN, the clean speech and noise are known (the oracle case). Therefore, we first compute the speech and noise LPC parameters, ({a_i}, σw²) (p = 16) and ({b_k}, σu²) (q = 16), from s(n, l) and v(n, l) using the autocorrelation method as in (Vaseghi, 2006, Chapter 8). We then compute {ρ_i} and {η_k} from {a_i} and {b_k}. We briefly demonstrate the LPC-to-LSF conversion process for clean speech as in (McLoughlin, 2008). Each s(n, l) under the linear prediction analysis model can be generated as the output of a finite impulse response filter, A(z). The {a_i} computed from s(n, l) are used to generate A(z) as (McLoughlin, 2008):

A(z) = 1 + a1 z⁻¹ + a2 z⁻² + ··· + ap z⁻ᵖ.  (20)

To compute the LSFs, A(z) is decomposed into symmetrical and anti-symmetrical parts, represented by the polynomials P(z) and Q(z), as in McLoughlin (2008):

P(z) = A(z) + z⁻⁽ᵖ⁺¹⁾ A(z⁻¹),  (21)

Q(z) = A(z) − z⁻⁽ᵖ⁺¹⁾ A(z⁻¹).  (22)

{ρ_i} are expressed as the zeros (or complex roots, denoted by {θ_i}) of P(z) and Q(z) in terms of angular frequency. Then {ρ_i} are computed as (McLoughlin, 2008):

ρ_i = tan⁻¹(Re{θ_i} / Im{θ_i}),  i = 1, 2, ..., p,  (23)

where {ρ_i} are expressed in radians (between [0, π]). Using equations (20)-(23), the LSFs of the noise signal, {η_k}, are computed from {b_k}. To improve the rate of convergence of the stochastic gradient descent algorithm, the dynamic range of {ρ_i} and {η_k} is compressed to the interval [0, 1] as: ρ̄_l = {ρ1/π, ρ2/π, ..., ρp/π} and η̄_l = {η1/π, η2/π, ..., ηq/π}.

The traditional FNN and LSTM in (Yu et al., 2020) learn a mapping from the noisy speech LSFs to the clean speech and noise LSFs, and then convert the estimated LSFs to {â_i} and {b̂_k}. The prediction error variances, σ̂w² and σ̂u², are estimated using an ML approach (Srinivasan et al., 2006) with the estimated {â_i} and {b̂_k}. In this study, we jointly estimate ({â_i}, σ̂w²) and ({b̂_k}, σ̂u²) using the LPC estimation framework (Fig. 1). For this purpose, for the lth frame, we concatenate ρ̄_l, η̄_l, σw², and σu² to form ζ_l of size 34 as:

ζ_l = {ρ̄1, ρ̄2, ..., ρ̄p, η̄1, η̄2, ..., η̄q, σw², σu²}.  (24)

Then ζ_l is used as the final training target for the DNN in the supervised LPC estimation framework (Fig. 1). During inference, ζ̂_l is split into ρ̄̂_l, η̄̂_l, σ̂w², and σ̂u². Multiplying ρ̄̂_l and η̄̂_l by π (inverse mapping) yields ρ̂_l and η̂_l. Finally, ρ̂_l and η̂_l are converted into {â_i} and {b̂_k} using the LSF-to-LPC conversion method as in (McLoughlin, 2008).
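A minimal sketch of the LPC-to-LSF conversion of equations (20)-(23) follows, taking the angles of the unit-circle roots of P(z) and Q(z) with numpy.roots; the trivial roots at z = ±1 are discarded, and a tolerance guards against numerical round-off (an illustration under these assumptions, not a production converter):

import numpy as np

def lpc_to_lsf(a):
    # Eq. (20): coefficients of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, zero-padded.
    A = np.concatenate(([1.0], a, [0.0]))
    P = A + A[::-1]  # eq. (21): symmetric part A(z) + z^-(p+1) A(z^-1)
    Q = A - A[::-1]  # eq. (22): anti-symmetric part A(z) - z^-(p+1) A(z^-1)
    ang = np.angle(np.concatenate((np.roots(P), np.roots(Q))))
    # Eq. (23): keep one angle per conjugate root pair, sorted within (0, pi).
    return np.sort(ang[(ang > 1e-9) & (ang < np.pi - 1e-9)])

For an even order p (p = 16 here), this returns exactly p interlaced LSFs in (0, π); dividing by π gives the compressed values ρ̄_l used in equation (24).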
3.2. LPC-PS training target

The LPC-PS of the speech and noise signal, Ps(l, m) and Pv(l, m), were used as the training targets in DeepLPC-ResNet-TCN and DeepLPC-MHANet (Roy et al., 2021a,b). During training, Ps(l, m) and Pv(l, m) are computed as in (Vaseghi, 2006, Chapter 9):

Ps(l, m) = σw² / |1 + Σ_{i=1}^{p} a_i e^{−j2πim/M}|²,  (25)

Pv(l, m) = σu² / |1 + Σ_{k=1}^{q} b_k e^{−j2πkm/M}|²,  (26)

where m ∈ {0, 1, ..., M−1} (M = 257).

As demonstrated in DeepLPC-ResNet-TCN (Roy et al., 2021a), the dynamic range of Ps(l, m) and Pv(l, m) is compressed to the interval [0, 1] through the cumulative distribution function (CDF) of Ps(l, m)[dB] and Pv(l, m)[dB], where Ps(l, m)[dB] = 10 log10(Ps(l, m)) and Pv(l, m)[dB] = 10 log10(Pv(l, m)). For example, Figs. 2(a) and (c) show that Ps(l, 64)[dB] and Pv(l, 64)[dB] follow a Gaussian distribution. In light of this analysis, we assume that Ps(l, m)[dB] and Pv(l, m)[dB] are distributed normally with means µs and µv and variances σs² and σv², respectively (Ps(l, m)[dB] ~ N(µs, σs²) and Pv(l, m)[dB] ~ N(µv, σv²)). The statistics of Ps(l, m)[dB] and Pv(l, m)[dB], i.e., (µs, σs²) and (µv, σv²) for each frequency bin, m, were found over a sample of the training set¹. As shown in Figs. 2(b) and (d), the CDFs of Ps(l, 64)[dB] and Pv(l, 64)[dB] computed over the training sample¹ were used to compress the dynamic range of Ps(l, 64)[dB] and Pv(l, 64)[dB], respectively. In light of this observation, the CDFs of Ps(l, m)[dB] and Pv(l, m)[dB] are used to compress the dynamic range of the training targets, as demonstrated in DeepLPC-ResNet-TCN (Roy et al., 2021a):

P̄s(l, m) = (1/2)[1 + erf((Ps(l, m)[dB] − µs) / (σs√2))],  (27)

P̄v(l, m) = (1/2)[1 + erf((Pv(l, m)[dB] − µv) / (σv√2))].  (28)

To facilitate training, the final training target, ζ_l, of size M × 2 is formed by concatenating P̄s(l, m) and P̄v(l, m) as:

ζ_l = {P̄s(l, 0), P̄s(l, 1), ..., P̄s(l, M−1), P̄v(l, 0), P̄v(l, 1), ..., P̄v(l, M−1)}.  (29)

Then ζ_l is used as the final training target in the supervised LPC estimation framework (Fig. 1). During inference, ζ̂_l is first split into P̄̂s(l, m) and P̄̂v(l, m), and the LPC-PS are computed as:

P̂s(l, m) = 10^((σs√2 erf⁻¹(2P̄̂s(l, m) − 1) + µs)/10),  (30)

P̂v(l, m) = 10^((σv√2 erf⁻¹(2P̄̂v(l, m) − 1) + µv)/10).  (31)

The |IDFT| of P̂s(l, m) and P̂v(l, m) gives an estimate of the autocorrelation matrices, R̂ss(τ) and R̂vv(τ). As in DeepLPC-ResNet-TCN (Roy et al., 2021a, eq. (26)-(27)), we construct the Yule-Walker equations with the estimated R̂ss(τ) and R̂vv(τ), followed by solving them using the Levinson-Durbin recursion (Vaseghi, 2006, Chapter 8), yielding ({â_i}, σ̂w²) (p = 16) and ({b̂_k}, σ̂u²) (q = 16).

Figure 2: (Color online) The distribution of (a) Ps(l, 64)[dB] and (c) Pv(l, 64)[dB]. The CDF of (b) Ps(l, 64)[dB] and (d) Pv(l, 64)[dB], where the sample mean and variance were found over the sample of the training set¹.

¹ 2 500 randomly selected clean speech recordings were mixed with 2 500 randomly selected noise recordings from the training set (Section 5.1) at SNR levels from -10 dB to +20 dB in 1 dB increments, giving 2 500 noisy speech signals. For each frequency bin, m, the sample means and variances, (µs, σs²) and (µv, σv²), were computed from the 2 500 concatenated clean speech recordings and scaled noise recordings, respectively.
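The inference path of Section 3.2, from the decompressed LPC-PS of equations (30)-(31) back to ({â_i}, σ̂w²), can be sketched as below, reusing the levinson helper from the Section 2.2 sketch; the inverse DFT of a power spectrum yields its autocorrelation sequence, from which the Yule-Walker equations are solved:

import numpy as np

def lpc_from_ps(ps, p=16):
    # ps: single-sided LPC-PS estimate with M = 257 bins.
    r = np.fft.irfft(ps)           # autocorrelation sequence via the inverse DFT
    return levinson(r[:p + 1], p)  # Yule-Walker solved by the Levinson-Durbin recursion

Only lags 0..p of the autocorrelation sequence are needed for a pth-order model, so the remainder of the irfft output is discarded.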
3.3. PS training target

As with the LPC-PS in Section 3.2, the PS of the clean speech and noise (denoted as λs(l, m) and λv(l, m)) can also be used as the training targets for supervised LPC estimation. For this purpose, in the training stage, λs(l, m) and λv(l, m) are computed directly from the squared magnitude of the clean speech and noise spectral components (single-sided) as: λs(l, m) = |S(l, m)|² and λv(l, m) = |V(l, m)|². Motivated by the dynamic range compression of the LPC-PS in Section 3.2, we also utilise the CDFs of the training targets (in dB), λs(l, m)[dB] and λv(l, m)[dB], to compress their dynamic range to the interval [0, 1], where λs(l, m)[dB] = 10 log10(λs(l, m)) and λv(l, m)[dB] = 10 log10(λv(l, m)). For instance, we observed in Figs. 3(a) and (c) that λs(l, 64)[dB] and λv(l, 64)[dB] follow a Gaussian distribution. In light of this analysis, we assume that λs(l, m)[dB] and λv(l, m)[dB] are also distributed normally with means µs and µv and variances σs² and σv², respectively (λs(l, m)[dB] ~ N(µs, σs²) and λv(l, m)[dB] ~ N(µv, σv²)). The statistics of λs(l, m)[dB] and λv(l, m)[dB], i.e., (µs, σs²) and (µv, σv²) for each frequency bin, m, were found over a sample of the training set¹. The CDFs of λs(l, 64)[dB] and λv(l, 64)[dB] computed over the training set¹ (Figs. 3(b) and (d)) are used to compress the dynamic range of λs(l, 64)[dB] and λv(l, 64)[dB], respectively. Thus, the CDFs of λs(l, m)[dB] and λv(l, m)[dB] are used to compress the dynamic range of the training targets, as in equations (27)-(28):

λ̄s(l, m) = (1/2)[1 + erf((λs(l, m)[dB] − µs) / (σs√2))],  (32)

λ̄v(l, m) = (1/2)[1 + erf((λv(l, m)[dB] − µv) / (σv√2))].  (33)

To facilitate training, for the lth frame, the final training target, ζ_l, of size M × 2 is formed by concatenating λ̄s(l, m) and λ̄v(l, m) as:

ζ_l = {λ̄s(l, 0), λ̄s(l, 1), ..., λ̄s(l, M−1), λ̄v(l, 0), λ̄v(l, 1), ..., λ̄v(l, M−1)}.  (34)

Then ζ_l is used as the training target for the LPC estimation framework (Fig. 1). During inference, ζ̂_l is first split into λ̄̂s(l, m) and λ̄̂v(l, m), and the PSs are computed as:

λ̂s(l, m) = 10^((σs√2 erf⁻¹(2λ̄̂s(l, m) − 1) + µs)/10),  (35)

λ̂v(l, m) = 10^((σv√2 erf⁻¹(2λ̄̂v(l, m) − 1) + µv)/10).  (36)

The |IDFT| of λ̂s(l, m) and λ̂v(l, m) yields an estimate of the autocorrelation matrices, R̂ss(τ) and R̂vv(τ). As demonstrated in (Roy et al., 2021a, eq. (26)-(27)), we construct the Yule-Walker equations with the estimated R̂ss(τ) and R̂vv(τ), followed by solving them using the Levinson-Durbin recursion (Vaseghi, 2006, Chapter 8), yielding ({â_i}, σ̂w²) (p = 16) and ({b̂_k}, σ̂u²) (q = 16).

3.4. MS training target

Motivated by CCED-AKF (Roy and Paliwal, 2020a), in this study the MS of the clean speech and noise signal (denoted as Cs(l, m) and Cv(l, m)) can also be used as the training targets for supervised LPC estimation. In the training stage, Cs(l, m) and Cv(l, m) are computed directly from the magnitude of the clean speech and noise spectral components as: Cs(l, m) = |S(l, m)| and Cv(l, m) = |V(l, m)|. As demonstrated in CCED-AKF (Roy and Paliwal, 2020a), we have also used min-max normalization (Han et al., 2011, Section 3.5.2) to compress the dynamic range of Cs(l, m) and Cv(l, m) as:

C̄s(l, m) = (Cs(l, m) − Smin(m)) / (Smax(m) − Smin(m)),  (37)

C̄v(l, m) = (Cv(l, m) − Vmin(m)) / (Vmax(m) − Vmin(m)),  (38)

where (Smax, Smin) and (Vmax, Vmin) for each frequency bin, m, were found over a sample of the training set¹.

As in Section 3.3, for the lth frame, the final training target, ζ_l, of size M × 2 is formed by concatenating C̄s(l, m) and C̄v(l, m) as:

ζ_l = {C̄s(l, 0), C̄s(l, 1), ..., C̄s(l, M−1), C̄v(l, 0), C̄v(l, 1), ..., C̄v(l, M−1)}.  (39)

Then ζ_l is used as the training target for the LPC estimation framework (Fig. 1). During inference, ζ̂_l is first split into C̄̂s(l, m) and C̄̂v(l, m). The clean speech and noise MSs are then computed from C̄̂s(l, m) and C̄̂v(l, m) using the inverse min-max normalization as (Han et al., 2011, Section 3.5.2):

Ĉs(l, m) = Smin(m) + C̄̂s(l, m)(Smax(m) − Smin(m)),  (40)

Ĉv(l, m) = Vmin(m) + C̄̂v(l, m)(Vmax(m) − Vmin(m)).  (41)

Taking the square of Ĉs(l, m) and Ĉv(l, m), followed by the |IDFT| of them, yields the autocorrelation matrices, R̂ss(τ) and R̂vv(τ). Then ({â_i}, σ̂w²) (p = 16) and ({b̂_k}, σ̂u²) (q = 16) are computed from R̂ss(τ) and R̂vv(τ) using the same procedure as in Section 3.3.

Figure 3: (Color online) The distribution of (a) λs(l, 64)[dB] and (c) λv(l, 64)[dB]. The CDF of (b) λs(l, 64)[dB] and (d) λv(l, 64)[dB], where the sample mean and variance were found over the sample of the training set¹.
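Equations (37)-(38) and their inversion in equations (40)-(41) reduce to per-bin min-max scaling; a minimal sketch, assuming per-bin arrays smin and smax gathered over the training sample as described above:

import numpy as np

def minmax_compress(ms, smin, smax):
    # Eq. (37)-(38): per-bin normalisation of a magnitude spectrum to [0, 1].
    return (ms - smin) / (smax - smin)

def minmax_invert(target, smin, smax):
    # Eq. (40)-(41): recover the magnitude spectrum from the normalised target.
    return smin + target * (smax - smin)

Squaring the recovered magnitude spectrum then gives a PS estimate, which is processed exactly as in Section 3.3 to obtain the LPC parameters.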
4. Deep neural network

The DNN methods used in this study, ResNet-TCN and MHANet, are briefly described below.

4.1. ResNet-TCN

The ResNet-TCN from the DeepLPC framework (Roy et al., 2021a) is also used in this study to estimate ζ_l (Equations (24), (29), (34), and (39)) from |Y_l|. The ResNet-TCN is shown in Fig. 4. For each training target (Section 3), the input, |Y_l|, is first passed through FC, a fully-connected layer of size d_model, followed by layer normalization (LN) (Ba et al., 2016) and the rectified linear unit (ReLU) activation function (He et al., 2015). FC is followed by B bottleneck residual blocks, where j ∈ {1, 2, ..., B} is the block index. Each block comprises three one-dimensional causal convolutional units. Each convolutional unit (CU) is pre-activated by LN (Ba et al., 2016) followed by the ReLU activation function (He et al., 2015). The kernel size, output size, and dilation rate for each convolutional unit are denoted as (kernel size, output size, dilation rate). The first and third CUs in each block have a kernel size of one, whilst the second CU has a kernel size of k_s. The output size of the first and second CUs is d_f, while that of the third is d_model. A dilation rate of one is set for the first and third CUs, and d for the second CU. The second CU provides a contextual field over previous time steps. The dilation rate, d, is cycled as the block index j increases: d = 2^((j−1) mod (log2(D)+1)), where mod is the modulo operation and D is the maximum dilation rate. In (Roy et al., 2021a, Fig. 5), the contextual field gained by the use of causal dilated CUs was demonstrated. The last residual block is followed by the output layer, O, which is a fully-connected layer with sigmoidal units. The O layer gives an estimate of ζ̂_l.

For the LPC-PS, PS, and MS training targets, the hyperparameters used in DeepLPC (Roy et al., 2021a) are also used in this study: d_model = 256, d_f = 64, B = 40, k_s = 3, and D = 16. With this set of hyperparameters, the ResNet-TCN exhibits approximately 2.1 million parameters. For LSF, all of the above hyperparameters were used except d_f = 34, giving around 1.91 million parameters. Section 5.2 details the training strategy of the ResNet-TCN.

Figure 4: (Color online) ResNet-TCN. The kernel size, output size, and dilation rate for each convolutional unit are denoted as (kernel size, output size, dilation rate).

Figure 5: (Color online) MHANet within the multi-head attention network for LPC estimation.
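The dilation cycling rule d = 2^((j−1) mod (log2(D)+1)) is easy to misread; the one-line sketch below (for block indices j = 1, ..., B) makes the resulting pattern explicit:

import math

def dilation(j, D=16):
    # Dilation rate of the middle conv. unit in block j: 1, 2, 4, 8, 16, 1, 2, ...
    return 2 ** ((j - 1) % (int(math.log2(D)) + 1))

# With D = 16 the rates repeat every five blocks, e.g. for j = 1..10:
# [dilation(j) for j in range(1, 11)] == [1, 2, 4, 8, 16, 1, 2, 4, 8, 16]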
4.2. MHANet

The MHANet from (Nicolson and Paliwal, 2020) is used in this study. Since the MHANet is a large framework, we briefly summarize it in this section; for a detailed description, we refer the reader to (Nicolson and Paliwal, 2020). The simplest form of the MHANet is shown in Fig. 5. The processing steps of the MHANet from input to output are described as follows. For each training target, ζ_l (Equations (24), (29), (34), and (39)), |Y_l| is fed as the input to the MHANet. The first layer in the MHANet is used to project the input to a size of d_model. As in (Nicolson and Paliwal, 2019), the first layer is formed as max(0, LN(|X|W_I + b_I)), where LN is frame-wise layer normalisation (Ba et al., 2016), W_I ∈ R^(M×d_model), and b_I ∈ R^(d_model). Next, the positional encoding from (Nicolson and Paliwal, 2020) is added after the first layer, where the time-frame index indicates the position. The positional encoding is learned using the weight matrix W_p, with a maximum length of 2048 time-frames (i.e., W_p ∈ R^(2048×256)). This is followed by B cascading blocks. Each block includes an MHA module, a two-layer feed-forward neural network (FNN) (Xu et al., 2014), residual connections (He et al., 2016), and frame-wise LN (Ba et al., 2016). For a detailed description of the blocks, we refer the reader to (Nicolson and Paliwal, 2020, Section 3.1). The last block is followed by the output layer, which is a sigmoidal feed-forward layer as in (Nicolson and Paliwal, 2019). The output layer gives an estimate of ζ̂_l.

For all training targets, the hyperparameters used in DeepLPC-MHANet (Roy et al., 2021b) are also used in this study: B = 5, d_f = 1024, d_model = 256, H = 8, P_drop = 0.0, and Γ = 40 000. With this set of hyperparameters, the MHANet exhibits approximately 4.27 million parameters. Section 5.2 details the training strategy of the MHANet.

5. Experimental setup

5.1. Training and validation set

The noisy speech for the training and validation sets is formed from clean speech and noise recordings. For the clean speech recordings, the train-clean-100 set of the Librispeech corpus (Panayotov et al., 2015) (28 539), the CSTR VCTK corpus (Veaux et al., 2017) (42 015), and the si* and sx* training sets of the TIMIT corpus (Garofolo et al., 1993) (3 696) were used, giving a total of 74 250 clean speech recordings. To form the validation set, 5% of the clean speech recordings (3 713) are randomly selected. Thus, 70 537 of the clean speech recordings are used for the training set. For the noise recordings, the QUT-NOISE dataset (Dean et al., 2010), the Nonspeech dataset (Hu, 2004), the Environmental Background Noise dataset (Saki and Kehtarnavaz, 2016; Saki et al., 2016), the noise set from the MUSAN corpus (Snyder et al., 2015), multiple FreeSound packs (https://freesound.org/)², and coloured noise recordings (with a value ranging from −2 to 2 in increments of 0.25) were used, giving a total of 16 243 noise recordings. For the validation set, 5% of the noise recordings (813) are randomly selected. The remaining 15 430 noise recordings are used for the training set. All of the clean speech and noise recordings are single-channel with a sampling frequency of 16 kHz. To create the noisy speech for the validation set, each of the 3 713 clean speech recordings is corrupted by a random section of a randomly selected noise recording (from the set of 813 noise recordings) at a randomly selected SNR level (-10 to +20 dB, in 1 dB increments). The noisy speech for the training set was created using the method described in Section 5.2.

² Freesound packs that were used: 147, 199, 247, 379, 622, 643, 1 133, 1 563, 1 840, 2 432, 4 366, 4 439, 15 046, 15 598, 21 558.

5.2. Training strategy

The following training strategy was employed for training the ResNet-TCN and MHANet:
• Mean squared error is used as the loss function.
• The Adam optimiser (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.98, and ε = 10⁻⁹ is used for stochastic gradient descent optimisation, where the learning rate, αr, is controlled over the course of training as in (Vaswani et al., 2017) (see the sketch after this list):

  αr = d_model^(−0.5) · min(γ^(−0.5), γ · Γ^(−1.5)),  (42)

  where γ is the training step and Γ is the number of warmup steps.
• Gradients are clipped between [−1, 1].
• The number of training examples in an epoch is equal to the number of clean speech recordings used in the training set, i.e., 70 537.
• A mini-batch size of eight training examples is used.
• The noisy speech signals are generated on the fly as follows: each clean speech recording is randomly selected and corrupted with a randomly selected noise recording at a randomly selected SNR level (-10 to +20 dB, in 1 dB increments).
• Both the ResNet-TCN and MHANet are trained for 150 epochs.
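Equation (42) is the Transformer-style warmup schedule of (Vaswani et al., 2017); a minimal sketch with the hyperparameters used here (d_model = 256, Γ = 40 000):

def learning_rate(step, d_model=256, warmup=40_000):
    # Eq. (42): alpha_r = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5); step >= 1.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The rate rises linearly over the first 40 000 steps, peaks at
# 256**-0.5 * 40000**-0.5 (about 3.1e-4), then decays proportionally to step^-0.5.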
5.3. Test set

For the objective experiments, 30 clean speech utterances belonging to six speakers (three male and three female) are taken from the NOIZEUS corpus (Loizou, 2013, Chapter 12). The noisy speech for the test set is formed by mixing the clean speech with real-world non-stationary (voice babble, street) and coloured (factory and f16) noise recordings selected from (Saki and Kehtarnavaz, 2016; Saki et al., 2016) at SNR levels ranging from −5 dB to +15 dB, in 5 dB increments. This gives 30 examples per condition, with 20 conditions in total. All the clean speech and noise recordings are single-channel with a sampling frequency of 16 kHz. Note that the speech and noise recordings in the test set are different from those used in the training and validation sets.

5.4. SD level evaluation

The frame-wise spectral distortion (SD) (dB) (Gray and Markel, 1976) is used to evaluate the accuracy of the LPC estimates obtained using ResNet-TCN and MHANet for the training targets LSF, LPC-PS, PS, and MS. Specifically, the estimated clean speech LPCs are evaluated. The SD for the $l$-th frame, denoted by $D_l$ (in dB), is defined as the root-mean-square difference between the LPC-PS estimate in dB, $\hat{P}_s(l,m)_{[\text{dB}]}$, and the oracle case in dB, $P_s(l,m)_{[\text{dB}]}$ (Gray and Markel, 1976):

$$D_l = \sqrt{\frac{1}{M} \sum_{m=0}^{M-1} \left( P_s(l,m)_{[\text{dB}]} - \hat{P}_s(l,m)_{[\text{dB}]} \right)^2}. \qquad (43)$$
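The SD measure of Eq. (43) is straightforward to compute. The following minimal Python sketch is an illustration, not the authors' evaluation code; it assumes the oracle and estimated LPC power spectra are given as (frames × bins) arrays of linear power values:

```python
import numpy as np

def spectral_distortion(ps_oracle: np.ndarray, ps_est: np.ndarray) -> np.ndarray:
    """Eq. (43): frame-wise SD in dB between oracle and estimated LPC power
    spectra. Both inputs are assumed to be (L, M) arrays of linear power."""
    eps = 1e-12  # guard against log of zero
    diff_db = 10.0 * np.log10(ps_oracle + eps) - 10.0 * np.log10(ps_est + eps)
    return np.sqrt(np.mean(diff_db ** 2, axis=-1))  # one SD value per frame
```

Averaging the returned per-frame values over all frames of a test condition gives the average SD levels reported in Fig. 6.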
5.5. Speech enhancement methods

We also evaluate the performance of the LPC estimation in a speech enhancement context. For this purpose, the specifications of the competing SEAs are as follows:

1. Noisy: speech corrupted with additive noise.
2. Oracle-AKF: AKF, where $(\{a_i\}, \sigma_w^2)$ and $(\{b_k\}, \sigma_u^2)$ are computed from the clean speech and the noise signal, with $p = 16$, $q = 16$, a window length of 32 ms, a frame shift of 16 ms, and a rectangular window used for framing.
3. AKF constructed with speech and noise LPC parameters derived from the training targets LSF, LPC-PS, PS, and MS using ResNet-TCN and MHANet. This gives eight AKF methods, each with $p = 16$, $q = 16$, a window length of 32 ms, a frame shift of 16 ms, and a rectangular window used for framing.
4. Deep Xi-ResNet-TCN-MMSE-LSA: Deep Xi-ResNet-TCN (Zhang et al., 2020) estimates the a priori SNR for the MMSE-LSA estimator (Ephraim and Malah, 1985), with $w_f = 32$ ms, $s_f = 16$ ms, and a square-root-Hann window used for analysis and synthesis.

5.6. Objective quality and intelligibility measures

Objective measures are used to evaluate the quality and intelligibility of the enhanced speech with respect to the corresponding clean speech. Table 1 shows the objective quality and intelligibility measures used in this study.

Table 1: Objective measures, what each assesses, and the range of their scores. For each measure, higher is better.

    Measure                         Assesses         Range
    CSIG (Hu and Loizou, 2008)      Quality          [1, 5]
    CBAK (Hu and Loizou, 2008)      Quality          [1, 5]
    COVL (Hu and Loizou, 2008)      Quality          [1, 5]
    PESQ (Rix et al., 2001)         Quality          [−0.5, 4.5]
    STOI (Taal et al., 2011)        Intelligibility  [0, 100]%
    SI-SDR (Roux et al., 2019)      Quality          [−∞, ∞]
    SegSNR (Mermelstein, 1979)      Quality          [−∞, ∞]

5.7. Subjective evaluation for speech enhancement

The subjective evaluation was carried out through a series of blind AB listening tests (Paliwal et al., 2010, Section 3.3.4). The test is performed on utterance sp05 ("Wipe the grease off his dirty face") from the NOIZEUS corpus (Loizou, 2013, Chapter 12), corrupted by 5 dB voice babble noise as in Section 5.3. In this test, the enhanced speech produced by the ten SEAs, as well as the corresponding clean speech and noisy speech signals, were played as stimuli pairs to the listeners. The twelve stimuli give 12 × 11 = 132 stimuli pairs, which are played in a random order to each listener, excluding comparisons between the same method. For each pair, the listener selects whichever stimulus is perceptually better, or gives a third response indicating that no difference is found between them. For pairwise scoring, 100% is awarded to the preferred method, 0% to the other, and 50% to each for a "no difference" response (an illustrative tally of this scoring rule is sketched after this section). The participants could re-listen to stimuli if required. Ten English-speaking listeners participated in the blind AB listening tests (conducted with approval from Griffith University's Human Research Ethics Committee: database protocol number 2018/671). The preference scores given by the listeners are averaged to give the mean subjective preference score (%), which is used to compare the SEAs.
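As an illustration of the pairwise scoring rule above (not the authors' evaluation code; the stimulus names and responses are invented for the example), the following sketch tallies AB responses into mean preference scores:

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical subset of stimuli; 12 stimuli would give 12 * 11 = 132 pairs.
stimuli = ["clean", "noisy", "MHANet-LPC-PS-AKF"]
pairs = list(permutations(stimuli, 2))  # ordered pairs, no same-method pairs

def mean_preference(responses):
    """responses: {(a, b): 'A' | 'B' | 'same'} -> mean preference score (%)."""
    score, count = defaultdict(float), defaultdict(int)
    for (a, b), r in responses.items():
        # 100% to the preferred stimulus, 0% to the other, 50% each if 'same'.
        sa, sb = (100.0, 0.0) if r == "A" else ((0.0, 100.0) if r == "B" else (50.0, 50.0))
        score[a] += sa; score[b] += sb
        count[a] += 1; count[b] += 1
    return {m: score[m] / count[m] for m in count}

# Example: every listener prefers clean speech over everything else.
responses = {p: ("A" if p[0] == "clean" else ("B" if p[1] == "clean" else "same"))
             for p in pairs}
print(mean_preference(responses))  # clean scores 100%, the rest 25% here
```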
6. Results and discussion

6.1. SD level comparison

The average SD levels (in dB) attained by the ResNet-TCN and MHANet for the training targets LSF, LPC-PS, PS, and MS are shown in Fig. 6. It can be seen that MHANet-LPC-PS produces lower SD levels than the other competing methods for both the real-world non-stationary and the coloured noise conditions. Amongst the other competing methods, ResNet-TCN-LPC-PS produced the next lowest SD level. The SD levels for noisy speech indicate the upper bound of the SD level. In light of this comparative study, the best training target and DNN method for accurate LPC estimation in practice are LPC-PS and MHANet, respectively. The low SD level attained by MHANet-LPC-PS will be of benefit to the AKF for speech enhancement.

[Figure 6: (Color online) Average SD (dB) level for each SEA found over all frames for each test condition in Section 5.3. Four panels (one per test noise) plot SD (dB) against input SNR (dB) from −5 to 15 dB for: Noisy, ResNet-TCN-LSF, MHANet-LSF, ResNet-TCN-MS, ResNet-TCN-PS, MHANet-MS, MHANet-PS, ResNet-TCN-LPC-PS, and MHANet-LPC-PS.]

6.2. Objective evaluation

In this section, we analyze the performance improvement among the competing methods (Section 5.5) in terms of the objective measures described in Table 1. The mean objective evaluation results for the NOIZEUS corpus are shown in Table 2. It can be seen that Oracle-AKF produces the highest scores for all measures, which can be thought of as the upper bound on performance. Noisy speech produces the lowest scores for all measures, indicating the lower bound on performance. MHANet-LPC-PS-AKF shows a consistent CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR score improvement over the competing methods. This demonstrates that MHANet-LPC-PS-AKF produces enhanced speech of higher quality and intelligibility than any competing method.

Table 2: Mean objective scores on the NOIZEUS corpus in terms of CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR. Apart from Oracle-AKF, the highest score amongst the competing methods for each measure is given in boldface.

    Methods                        CSIG  CBAK  COVL  PESQ  STOI   SegSNR  SI-SDR
    Noisy speech                   2.41  2.27  2.12  1.64  67.67   0.89    6.39
    ResNet-TCN-LSF-AKF             2.91  2.72  2.51  2.03  74.73   7.14   10.19
    MHANet-LSF-AKF                 2.97  2.76  2.55  2.11  76.96   7.35   11.52
    ResNet-TCN-MS-AKF              3.23  2.81  2.70  2.25  79.17   7.45   12.27
    ResNet-TCN-PS-AKF              3.32  2.89  2.77  2.29  80.29   7.53   12.69
    Deep Xi-ResNet-TCN-MMSE-LSA    3.38  3.02  2.81  2.34  81.53   7.67   13.39
    MHANet-MS-AKF                  3.39  3.06  2.84  2.38  82.59   7.73   13.52
    MHANet-PS-AKF                  3.42  3.09  2.88  2.42  83.84   7.83   13.67
    ResNet-TCN-LPC-PS-AKF          3.49  3.17  2.95  2.51  85.34   8.78   14.44
    MHANet-LPC-PS-AKF              3.66  3.32  3.14  2.63  88.64   9.81   15.48
    Oracle-AKF                     4.21  4.07  3.97  2.74  95.61  10.87   16.43

Figures 7 and 8 show the PESQ and STOI scores, respectively, of each SEA for each condition. The MHANet-LPC-PS-AKF method produced higher PESQ and STOI scores than the other competing SEAs for each condition, demonstrating that it attains higher objective quality and intelligibility scores than the competing methods across multiple SNR levels and noise sources.

[Figure 7: PESQ score for each SEA for each condition specified in Section 5.3.]

[Figure 8: STOI score for each SEA for each condition specified in Section 5.3.]

6.3. Subjective evaluation by AB listening test

The mean subjective preference score (%) for each SEA is shown in Fig. 9. It can be seen that the MHANet-LPC-PS-AKF method is widely preferred by the listeners (around 73%) over the competing methods, apart from the clean speech (100%) and the Oracle-AKF method (83%). ResNet-TCN-LPC-PS-AKF is the next most preferred (71%), followed by MHANet-PS-AKF (69%), MHANet-MS-AKF (67%), Deep Xi-ResNet-TCN-MMSE-LSA (64%), ResNet-TCN-PS-AKF (58%), and ResNet-TCN-MS-AKF (55%). MHANet-LSF-AKF (49%) and ResNet-TCN-LSF-AKF (44%) achieve the lowest preference scores among the competing methods; this is because the MHANet-LSF and ResNet-TCN-LSF methods produced higher SD levels in the estimated speech LPCs (Fig. 6). In light of the blind AB listening tests, the enhanced speech produced by MHANet-LPC-PS-AKF exhibits the best perceived quality amongst all tested methods for the condition specified in Section 5.7.

[Figure 9: (Color online) The mean preference score (%) comparison between the proposed and benchmark SEAs for the utterance sp05 corrupted with 5 dB voice babble noise.]
7. Conclusion

This paper performs a comprehensive study on the training targets LSF, LPC-PS, PS, and MS using two state-of-the-art DNNs, ResNet-TCN and MHANet, for LPC estimation. This study aims to find which training target and DNN method estimate more accurate speech and noise LPC parameters in real-life noise conditions. The performance of the LPC estimation for each training target using ResNet-TCN and MHANet was evaluated using SD level comparisons. The accuracy of the estimated LPCs was also verified in a speech enhancement context. Specifically, the AKF is constructed with the speech and noise LPC parameters estimated from the training targets LSF, LPC-PS, PS, and MS using ResNet-TCN and MHANet. Experiments on the NOIZEUS corpus demonstrate that MHANet-LPC-PS produces a lower SD level in the estimated speech LPCs than the other estimators in real-world non-stationary and coloured noise conditions. Objective and subjective scores on the NOIZEUS corpus also indicate that the AKF constructed with the speech and noise LPC parameters derived from MHANet-LPC-PS exhibits higher quality and intelligibility in the enhanced speech than the competing methods.

CRediT authorship contribution statement

Sujan Kumar Roy: Conceptualization, Methodology, Software, Data Curation, Writing - review & editing, Investigation, Visualization. Aaron Nicolson: Writing - review & editing. Kuldip K. Paliwal: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Ba, L.J., Kiros, J.R., Hinton, G.E. Layer normalization. CoRR 2016;abs/1607.06450. arXiv:1607.06450.
Bai, S., Kolter, J.Z., Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. ArXiv 2018;abs/1803.01271.
Dean, D.B., Sridharan, S., Vogt, R.J., Mason, M.W. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. In: Proceedings Interspeech 2010. 2010. p. 3110–3113.
Ephraim, Y., Malah, D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 1985;33(2):443–445. doi:10.1109/TASSP.1985.1164550.
Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N 1993;93.
George, A.E., So, S., Ghosh, R., Paliwal, K.K. Robustness metric-based tuning of the augmented Kalman filter for the enhancement of speech corrupted with coloured noise. Speech Communication 2018;105:62–76. doi:10.1016/j.specom.2018.10.002.
Gibson, J.D., Koo, B., Gray, S.D. Filtering of colored noise for speech enhancement and coding. IEEE Transactions on Signal Processing 1991;39(8):1732–1742. doi:10.1109/78.91144.
Gray, A., Markel, J. Distance measures for speech processing. IEEE Transactions on Acoustics, Speech, and Signal Processing 1976;24(5):380–391. doi:10.1109/TASSP.1976.1162849.
Han, J., Pei, J., Kamber, M. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems. Elsevier Science, 2011.
He, K., Zhang, X., Ren, S., Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR 2015;abs/1502.01852. arXiv:1502.01852.
He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition 2016:770–778. doi:10.1109/CVPR.2016.90.
Hu, G. 100 nonspeech environmental sounds. The Ohio State University, Department of Computer Science and Engineering 2004.
Hu, Y., Loizou, P.C. Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing 2008;16(1):229–238. doi:10.1109/TASL.2007.911054.
Itakura, F. Line spectrum representation of linear predictor coefficients of speech signals. The Journal of the Acoustical Society of America 1975;57. doi:10.1121/1.1995189.
Kamath, S., Loizou, P. A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. IEEE International Conference on Acoustics, Speech, and Signal Processing 2002;4:4160–4164. doi:10.1109/ICASSP.2002.5745591.
Kingma, D.P., Ba, J. Adam: A method for stochastic optimization. 2014. arXiv:1412.6980.
Loizou, P.C. Speech Enhancement: Theory and Practice. 2nd ed. Boca Raton, FL, USA: CRC Press, Inc., 2013.
McLoughlin, I.V. Line spectral pairs. Signal Processing 2008;88(3):448–467. doi:10.1016/j.sigpro.2007.09.003.
Mermelstein, P. Evaluation of a segmental SNR measure as an indicator of the quality of ADPCM coded speech. The Journal of the Acoustical Society of America 1979;66(6):1664–1667. doi:10.1121/1.383638.
Nicolson, A., Paliwal, K.K. Deep learning for minimum mean-square error approaches to speech enhancement. Speech Communication 2019;111:44–55. doi:10.1016/j.specom.2019.06.002.
Nicolson, A., Paliwal, K.K. Masked multi-head self-attention for causal speech enhancement. Speech Communication 2020;125:80–96. doi:10.1016/j.specom.2020.10.004.
Paliwal, K., Basu, A. A speech enhancement method based on Kalman filtering. IEEE International Conference on Acoustics, Speech, and Signal Processing 1987;12:177–180. doi:10.1109/ICASSP.1987.1169756.
Paliwal, K., Wójcicki, K., Schwerin, B. Single-channel speech enhancement using spectral subtraction in the short-time modulation domain. Speech Communication 2010;52(5):450–475. doi:10.1016/j.specom.2010.02.004.
Panayotov, V., Chen, G., Povey, D., Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. IEEE International Conference on Acoustics, Speech and Signal Processing 2015:5206–5210. doi:10.1109/ICASSP.2015.7178964.
Pickersgill, C., So, S., Schwerin, B. Investigation of DNN prediction of power spectral envelopes for speech coding & ASR. 17th Speech Science and Technology Conference (SST2018), Sydney, Australia 2018.
Rix, A.W., Beerends, J.G., Hollier, M.P., Hekstra, A.P. Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. IEEE International Conference on Acoustics, Speech, and Signal Processing 2001;2:749–752. doi:10.1109/ICASSP.2001.941023.
Roux, J.L., Wisdom, S., Erdogan, H., Hershey, J.R. SDR – half-baked or well done? In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2019. p. 626–630. doi:10.1109/ICASSP.2019.8683855.
Roy, S.K., Nicolson, A., Paliwal, K.K. A deep learning-based Kalman filter for speech enhancement. In: Proc. Interspeech 2020. 2020a. p. 2692–2696. doi:10.21437/Interspeech.2020-1551.
Roy, S.K., Nicolson, A., Paliwal, K.K. Deep learning with augmented Kalman filter for single-channel speech enhancement. In: 2020 IEEE International Symposium on Circuits and Systems (ISCAS). 2020b. p. 1–5. doi:10.1109/ISCAS45731.2020.9180820.
Roy, S.K., Nicolson, A., Paliwal, K.K. DeepLPC: A deep learning approach to augmented Kalman filter-based single-channel speech enhancement. TechRxiv 2021a. doi:10.36227/techrxiv.14384672.v1.
Roy, S.K., Nicolson, A., Paliwal, K.K. DeepLPC-MHANet: Multi-head self-attention for augmented Kalman filter-based speech enhancement. TechRxiv 2021b. doi:10.36227/techrxiv.14384909.v1.
Roy, S.K., Paliwal, K.K. Causal convolutional encoder decoder-based augmented Kalman filter for speech enhancement. In: 2020 14th International Conference on Signal Processing and Communication Systems (ICSPCS). 2020a. p. 1–7. doi:10.1109/ICSPCS50536.2020.9310011.
Roy, S.K., Paliwal, K.K. Deep residual network-based augmented Kalman filter for speech enhancement. In: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). 2020b. p. 667–673.
Roy, S.K., Paliwal, K.K. Sensitivity metric-based tuning of the augmented Kalman filter for speech enhancement. 14th International Conference on Signal Processing and Communication Systems (ICSPCS) 2020. doi:10.1109/ICSPCS50536.2020.9310005.
Saki, F., Kehtarnavaz, N. Automatic switching between noise classification and speech enhancement for hearing aid devices. In: 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). 2016. p. 736–739. doi:10.1109/EMBC.2016.7590807.
Saki, F., Sehgal, A., Panahi, I., Kehtarnavaz, N. Smartphone-based real-time classification of noise signals using subband features and random forest classifier. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2016. p. 2204–2208. doi:10.1109/ICASSP.2016.7472068.
Snyder, D., Chen, G., Povey, D. MUSAN: A Music, Speech, and Noise Corpus. 2015. arXiv:1510.08484.
Srinivasan, S., Samuelsson, J., Kleijn, W.B. Codebook driven short-term predictor parameter estimation for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing 2006;14(1):163–176. doi:10.1109/TSA.2005.854113.
Taal, C.H., Hendriks, R.C., Heusdens, R., Jensen, J. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing 2011;19(7):2125–2136. doi:10.1109/TASL.2011.2114881.
Vaseghi, S.V. Advanced Digital Signal Processing and Noise Reduction. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2006.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I. Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., editors. Advances in Neural Information Processing Systems 30. Curran Associates, Inc.; 2017. p. 5998–6008.
Veaux, C., Yamagishi, J., MacDonald, K. CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit. University of Edinburgh, The Centre for Speech Technology Research (CSTR) 2017.
Xu, Y., Du, J., Dai, L., Lee, C. An experimental study on speech enhancement based on deep neural networks. IEEE Signal Processing Letters 2014;21(1):65–68. doi:10.1109/LSP.2013.2291240.
Yu, H., Ouyang, Z., Zhu, W., Champagne, B., Ji, Y. A deep neural network based Kalman filter for time domain speech enhancement. IEEE International Symposium on Circuits and Systems 2019:1–5. doi:10.1109/ISCAS.2019.8702161.
Yu, H., Zhu, W.P., Champagne, B. Speech enhancement using a DNN-augmented colored-noise Kalman filter. Speech Communication 2020;125:142–151. doi:10.1016/j.specom.2020.10.007.
Zhang, Q., Nicolson, A., Wang, M., Paliwal, K.K., Wang, C. DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2020;28:1404–1415. doi:10.1109/TASLP.2020.2987441.

Part V

Conclusion

Chapter 9

Summary, Conclusions, and Future Work

9.1 Summary and Conclusions

This section summarizes the main findings and conclusions presented in the research-contributing Chapters 2-8 of this dissertation.

9.1.1 Chapter 2: An Iterative Kalman Filter with Reduced-Biased Kalman Gain for Single Channel Speech Enhancement in Non-stationary Noise Condition

In this chapter, an IT-KF with reduced-bias Kalman gain has been proposed for single-channel speech enhancement in non-stationary noise conditions (NNCs). Unlike the existing IT-KF, the speech LPC parameters are computed from pre-smoothed speech. The noise variance is also estimated from each noisy speech frame. The KF is then constructed with the estimated parameters and applied to the noisy speech in the first iteration. In the second iteration, the parameters are re-estimated from the processed speech and the KF is re-constructed. It is shown that the parameters re-estimated at the second iteration result in a significant reduction of the bias in the KF gain, which improves speech enhancement in non-stationary noise conditions. Objective and subjective scores on the NOIZEUS corpus demonstrate that the proposed method outperforms other benchmark methods in NNCs for a wide range of SNR levels. This chapter also shows that an opportunity for further research lies in dynamically offsetting the bias of the KF gain for speech enhancement in NNCs.

9.1.2 Chapter 3: Robustness and Sensitivity Tuning of the Kalman Filter for Speech Enhancement

This chapter presents robustness and sensitivity metrics-based tuning of the Kalman filter gain for single-channel speech enhancement. First, a speech presence probability (SPP) method yields an estimate of the noise PSD from each noisy speech frame, from which the noise variance is computed. A whitening filter is also constructed to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. To achieve better noise reduction, the robustness and sensitivity metrics are incorporated differently depending on the speech activity of the noisy speech, dynamically offsetting the bias in the KF gain. For this purpose, the noise variance and the speech AR model parameters are adopted as a speech activity detector. It is shown that the significantly reduced bias in the KF gain achieved by the proposed tuning algorithm improves speech enhancement in real-life noise conditions. Objective and subjective scores on the NOIZEUS corpus demonstrate that the proposed method outperforms the benchmark methods in real-life noise conditions for a wide range of SNR levels. The study in this chapter reveals that robustness and sensitivity metrics-based tuning can also be adopted with an augmented Kalman filter for speech enhancement. A minimal sketch of the classical KF recursion that Chapters 2-4 build upon is given below.
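The chapters above all operate on the classical state-space formulation of KF-based speech enhancement (Paliwal and Basu, 1987): the clean speech in a frame follows an AR(p) model, the state collects the last p speech samples, and the noisy observation is the newest sample plus noise. The following Python sketch is an illustration under these standard assumptions (white observation noise; the function name and interface are ours), not the thesis code:

```python
import numpy as np

def kalman_filter_frame(y, lpc, var_w, var_v):
    """Minimal KF for one frame of noisy speech `y`, assuming a clean-speech
    AR(p) model s(n) ~ sum_i lpc[i-1]*s(n-i) with excitation variance `var_w`
    and white observation-noise variance `var_v`."""
    y = np.asarray(y, dtype=float)
    lpc = np.asarray(lpc, dtype=float)
    p = len(lpc)
    A = np.eye(p, k=1)                    # shift the state by one sample
    A[-1, :] = lpc[::-1]                  # last row: [a_p, ..., a_1]
    c = np.zeros(p); c[-1] = 1.0          # observe the newest sample
    x, P = np.zeros(p), np.eye(p)         # initial state estimate/covariance
    s_hat = np.empty_like(y)
    for n, yn in enumerate(y):
        x = A @ x                         # a priori (predicted) state
        P = A @ P @ A.T
        P[-1, -1] += var_w                # excitation drives the newest state
        K = P @ c / (c @ P @ c + var_v)   # Kalman gain
        x = x + K * (yn - c @ x)          # a posteriori (corrected) state
        P = (np.eye(p) - np.outer(K, c)) @ P
        s_hat[n] = x[-1]                  # enhanced sample estimate
    return s_hat
```

The bias addressed in Chapters 2-4 enters through the gain K: inaccurate estimates of `lpc`, `var_w`, or `var_v` pull K away from its oracle value, which is what the iterative re-estimation and the robustness/sensitivity tuning offset.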
9.1.3 Chapter 4: Robustness and Sensitivity Metrics-Based Tuning of the Augmented Kalman Filter for Single-Channel Speech Enhancement

This chapter investigates a tuning algorithm for the augmented Kalman filter (AKF) gain for speech enhancement in real-life noise conditions. First, an SPP method estimates the noise PSD from each noisy speech frame, from which the noise LPC parameters are computed. A whitening filter is also constructed with the estimated noise LPCs to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. The AKF is then constructed with the estimated speech and noise LPC parameters. To achieve better noise reduction, the robustness metric is employed to offset the bias in the AKF gain during speech absence, and the sensitivity metric during speech presence. The speech and noise AR model parameters are adopted as a speech activity detector. It is shown that the reduced-bias AKF gain achieved by the proposed tuning algorithm improves speech enhancement in real-life noise conditions. Objective and subjective scores on the NOIZEUS corpus demonstrate that the proposed method outperforms the benchmark methods in real-life noise conditions for a wide range of SNR levels. This chapter's study shows that the performance of AKF-based speech enhancement can be improved further by more accurate speech and noise LPC parameter estimates in practice.

9.1.4 Chapter 5: Deep Learning-Based Kalman Filter and Augmented Kalman Filter for Speech Enhancement

This chapter investigates deep learning and whitening filter assisted KF and AKF for speech enhancement. Specifically, a deep learning framework estimates the noise PSD from each noisy speech frame, from which the noise parameters are computed. A whitening filter is also constructed (with its coefficients computed from the estimated noise PSD) to pre-whiten each noisy speech frame prior to computing the speech LPC parameters. The KF and AKF are then constructed with the estimated parameters for speech enhancement. Subjective and objective scores on the NOIZEUS corpus demonstrate that the proposed KF and AKF methods outperform other competing methods in various noise conditions for a wide range of SNR levels. However, the study in this chapter reveals that the whitening filter produces biased speech LPC parameters for the KF and AKF, which impacts the quality and intelligibility of the enhanced speech. Therefore, an opportunity for further research lies in applying deep learning methods to jointly estimate the speech and noise parameters for the KF and AKF in practice.

9.1.5 Chapter 6: DeepLPC: A Deep Learning Approach to Augmented Kalman Filter-Based Single-Channel Speech Enhancement

In this chapter, we propose the DeepLPC framework, built upon ResNet-TCN, to jointly estimate the speech and noise LPC parameters for the AKF in real-life noise conditions. Specifically, DeepLPC learns a mapping from each frame of the noisy speech magnitude spectrum to the clean speech and noise LPC-PS. The inverse Fourier transform of the estimated speech and noise LPC-PS yields the autocorrelation matrices, which are solved with the Levinson-Durbin recursion, giving the LPCs and the prediction error variances of the speech and noise (a minimal sketch of this conversion is given below). The DeepLPC framework produces a lower SD level in the estimated speech LPCs than competing deep learning methods on the NOIZEUS and DEMAND Voice Bank test sets.
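The LPC-PS-to-LPC conversion described above can be made concrete with a short sketch. The following Python example is illustrative only (the function name is ours, and it assumes the estimated LPC-PS is a full two-sided power spectrum of length N > p); it recovers the LPCs and prediction error variance via the inverse FFT and the Levinson-Durbin recursion:

```python
import numpy as np

def lpc_from_lpc_ps(lpc_ps, p):
    """Recover order-p LPCs and the prediction error variance from a
    (two-sided) LPC power spectrum: the inverse FFT gives the autocorrelation
    sequence, which the Levinson-Durbin recursion then solves."""
    r = np.fft.ifft(np.asarray(lpc_ps, dtype=float)).real  # autocorrelation
    a = np.zeros(p + 1); a[0] = 1.0       # prediction polynomial coefficients
    err = r[0]                            # prediction error variance
    for i in range(1, p + 1):             # Levinson-Durbin recursion
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err   # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * np.concatenate((a[i - 1:0:-1], [1.0]))
        err *= (1.0 - k * k)              # update the error variance
    # Sign convention: return a_i such that s(n) ~ sum_i a_i * s(n - i).
    return -a[1:], err
```

The noise LPC parameters ({b_k}, sigma_u^2) would be obtained in the same way from the estimated noise LPC-PS, and the two parameter sets then form the augmented state-space model of the AKF.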
Objective and subjective scores on the NOIZEUS and DEMAND Voice Bank test sets also demonstrate that the AKF constructed with the DeepLPC-ResNet-TCN-driven speech and noise LPC parameters outperforms the competing deep learning-based methods in various noise conditions for a wide range of SNR levels. Recently, MHANet has demonstrated the ability to model the long-term dependencies of noisy speech more efficiently than ResNet-TCN. Thus, applying MHANet within DeepLPC should yield more accurate estimates of the LPC parameters for the AKF.

9.1.6 Chapter 7: DeepLPC-MHANet: Multi-Head Self-Attention for Augmented Kalman Filter-Based Speech Enhancement

In this chapter, we propose an extension of the DeepLPC framework with MHANet, called DeepLPC-MHANet, to further improve the LPC parameter estimates of the AKF for speech enhancement in real-life noise conditions. Specifically, DeepLPC-MHANet learns a mapping from each frame of the noisy speech magnitude spectrum to the LPC-PS of the clean speech and noise signal. The speech and noise LPC parameters are computed from the estimated LPC-PSs using a procedure similar to that of DeepLPC (Chapter 6). DeepLPC-MHANet produces a lower SD level in the estimated speech LPCs than DeepLPC-ResNet-TCN. Objective and subjective scores on the NOIZEUS corpus demonstrate that the AKF constructed with the DeepLPC-MHANet-driven speech and noise LPC parameters outperforms the competing deep learning methods in various noise conditions for a wide range of SNR levels. For the LPC-PS training target, the MHANet within DeepLPC shows a significant improvement in LPC estimation over the ResNet-TCN. However, other training targets have also been investigated in existing deep learning methods. Therefore, further research scope lies in a comprehensive study of training targets for LPC estimation using ResNet-TCN and MHANet.

9.1.7 Chapter 8: On Training Targets for Supervised LPC Estimation to Augmented Kalman Filter-based Speech Enhancement

This chapter performs a comprehensive study of the LSF, LPC-PS, PS, and MS training targets for LPC estimation using ResNet-TCN and MHANet. We aim to determine which training target and DNN method produce accurate speech and noise LPC parameter estimates in real-life noise conditions. For this purpose, we train the ResNet-TCN and MHANet for each training target with a large data set. Experiments on the NOIZEUS corpus demonstrate that MHANet-LPC-PS produces a lower SD level in the estimated speech LPCs than the other competing methods. In a speech enhancement context, objective and subjective scores on the NOIZEUS corpus also demonstrate that the AKF constructed with the MHANet-LPC-PS-driven speech and noise LPC parameters exhibits higher quality and intelligibility in the enhanced speech than competing deep learning-based methods in real-life noise conditions.

9.2 Future Work

This dissertation addresses single-channel speech enhancement using Kalman filtering in real-life noise conditions. In particular, the application of machine learning methods with Kalman filtering shows substantial improvement in noise reduction over some of the latest deep learning methods. However, in practice, clean speech can be corrupted by background noise as well as reverberation from surface reflections (noisy-reverberant speech).
In this particular case, multiple microphones placed in different directions can capture the clean speech, along with the other interfering signals, more precisely than a single channel. Therefore, a multi-channel system may be better suited to speech enhancement in conditions with additive background noise as well as reverberation. Although some machine learning methods have been investigated in this particular area of speech enhancement, there is still room for improvement. Kalman filtering with machine learning methods can be applied to multi-channel speech enhancement. Specifically, the DeepLPC framework can be adopted for LPC estimation from noisy-reverberant speech to construct the AKF, which can then be employed to extract the clean speech from the noisy-reverberant speech.

Bibliography

[1] Sujan Kumar Roy and Kuldip K. Paliwal. An iterative Kalman filter with reduced-biased Kalman gain for single channel speech enhancement in non-stationary noise condition. International Journal of Signal Processing Systems, 7(1):7-13, March 2019.

[2] Sujan Kumar Roy and Kuldip K. Paliwal. Robustness and sensitivity tuning of the Kalman filter for speech enhancement. Under review with: Signals (submitted 26 Feb. 2021).

[3] Sujan Kumar Roy and Kuldip K. Paliwal. Robustness and sensitivity metrics-based tuning of the augmented Kalman filter for single-channel speech enhancement. Under review with: Applied Acoustics (submitted 4 March 2021).

[4] Sujan Kumar Roy, Aaron Nicolson, and Kuldip K. Paliwal. A deep learning-based Kalman filter for speech enhancement. Proc. Interspeech 2020, pages 2692-2696, October 2020.

[5] Sujan Kumar Roy, Aaron Nicolson, and Kuldip K. Paliwal. Deep learning with augmented Kalman filter for single-channel speech enhancement. IEEE International Symposium on Circuits and Systems (ISCAS), pages 1-5, October 2020.

[6] Sujan Kumar Roy, Aaron Nicolson, and Kuldip K. Paliwal. DeepLPC: A deep learning approach to augmented Kalman filter-based single-channel speech enhancement. IEEE Access (revisions submitted 18 March 2021).

[7] Sujan Kumar Roy, Aaron Nicolson, and Kuldip K. Paliwal. DeepLPC-MHANet: Multi-head self-attention for augmented Kalman filter-based speech enhancement. Under review with: IEEE Access (submitted 8 April 2021).

[8] Sujan Kumar Roy and Kuldip K. Paliwal. On training targets for supervised LPC estimation to augmented Kalman filter-based speech enhancement. Under review with: Speech Communication (submitted 12 April 2021).

[9] Sujan Kumar Roy and Kuldip K. Paliwal. A non-iterative Kalman filter for single channel speech enhancement in non-stationary noise condition. 12th International Conference on Signal Processing and Communication Systems (ICSPCS), pages 1-7, December 2018.

[10] Sujan Kumar Roy and Kuldip K. Paliwal. Sensitivity metric-based tuning of the augmented Kalman filter for speech enhancement. 14th International Conference on Signal Processing and Communication Systems (ICSPCS), pages 1-6, December 2020.

[11] Sujan Kumar Roy and Kuldip K. Paliwal. Causal convolution encoder decoder-based augmented Kalman filter for speech enhancement. 14th International Conference on Signal Processing and Communication Systems (ICSPCS), pages 1-7, December 2020.
[12] Sujan Kumar Roy and Kuldip K. Paliwal. Deep residual network-based augmented Kalman filter for speech enhancement. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 667-673, December 2020.

[13] Sujan Kumar Roy and Kuldip K. Paliwal. Causal convolutional neural network-based Kalman filter for speech enhancement. Asia-Pacific Conference on Computer Science and Data Engineering, December 2020.