Causal Convolutional Encoder-Decoder-Based Augmented Kalman Filter for Speech Enhancement

Sujan Kumar Roy, Kuldip K. Paliwal
Signal Processing Laboratory, Griffith School of Engineering
Griffith University, Brisbane, QLD, Australia, 4111

[email protected], [email protected]

Abstract—Speech enhancement using the augmented Kalman filter (AKF) suffers from biased estimates of the linear prediction coefficients (LPCs) of the speech and noise signal in noisy conditions. The existing AKF was particularly designed to enhance colored-noise-corrupted speech. In this paper, a causal convolutional encoder-decoder (CCED)-based method improves the LPC estimates of the AKF for speech enhancement. Specifically, a CCED network is used to estimate the instantaneous noise spectrum for computing the noise LPCs on a framewise basis. Each noise-corrupted speech frame is pre-whitened by a whitening filter, which is constructed with the noise LPCs. The speech LPCs are computed from the pre-whitened speech. The improved speech and noise LPCs enable the AKF to minimize the residual noise as well as the distortion in the enhanced speech. Objective and subjective testing on the NOIZEUS corpus reveal that the enhanced speech produced by the proposed method exhibits higher quality and intelligibility than the benchmark methods in various noise conditions over a wide range of SNR levels.

Index Terms—Speech enhancement, augmented Kalman filter, convolutional neural network, LPC, whitening filter.

I. INTRODUCTION

The objective of a speech enhancement algorithm (SEA) is to estimate the clean speech from the noisy speech signal. SEAs can be used as a pre-processor for many speech processing systems, such as voice communication systems, hearing-aid devices, and speech recognition. Various SEAs, such as spectral subtraction (SS) [1], [2], MMSE [3], [4], the Wiener filter (WF) [5], [6], and the Kalman filter (KF) [7], have been introduced over the decades. However, it remains a demanding task to develop an efficient SEA for real-world noise conditions.

The SS-based SEA heavily depends on the accuracy of the noise power spectral density (PSD) estimate [8]. Under- or over-estimation of the noise PSD introduces musical noise and distortion in the enhanced speech [9, Chapter 5]. The performance of the MMSE- and WF-based SEAs likewise depends on the accuracy of the a priori SNR estimate in practice. In [3], Ephraim and Malah proposed a decision-directed (DD) approach to compute the a priori SNR in noisy conditions. However, this approach uses the speech and noise power spectra estimated from the previous noisy speech frame, leading to an inaccurate estimate of the a priori SNR for the current frame. The biased estimate of the a priori SNR in the MMSE-based SEA typically introduces musical noise and spectral distortion in the enhanced speech [9].

The efficiency of a KF-based SEA depends on how accurately its key parameters, the LPCs, are estimated in noisy conditions. Paliwal and Basu first introduced a KF-based SEA for enhancing stationary-noise-corrupted speech [7]. However, the LPCs are computed from the clean speech signal, which is unavailable in practice. In [10], Gibson et al. introduced an augmented KF (AKF) for enhancing colored-noise-corrupted speech. In this method, the LPC estimates for the current noisy speech frame are computed from the filtered signal of the previous AKF iteration. Although the enhanced speech (after 2-3 iterations) shows SNR improvement, it suffers from spectral distortion as well as musical noise. In [11], Roy et al. proposed a sub-band iterative KF-based SEA. Since only the high-frequency sub-bands (SBs) among the 16 decomposed SBs of a given noise-corrupted utterance are processed, some noise components may still remain in the low-frequency SBs, and the enhanced speech also suffers from distortion. In [12], George et al. introduced a robustness-metric-based tuning of the AKF. This SEA is particularly designed for colored noise suppression. In addition, the robustness-metric-based tuning of the AKF gain causes distortion in the enhanced speech.

Over the past decade, deep neural networks (DNNs) have been used widely for speech enhancement [13]. A DNN usually gives an estimate of the ideal binary mask (IBM), which is used to compute the clean speech spectrum [13]. It has been shown that the ideal ratio mask (IRM) [14] yields better speech quality than the IBM. In [15], Williamson et al. introduced a complex ideal ratio mask (cIRM), which is capable of recovering both the amplitude and the phase spectrum of clean speech. However, masking techniques usually introduce musical noise in the enhanced speech [14].

In [16], a convolutional encoder-decoder (CED)-based SEA was proposed, particularly designed to enhance babble-noise-corrupted speech. In [17], a long short-term memory (LSTM) was incorporated with a CED to form a convolutional recurrent network (CRN) for speech enhancement. The CRN [17] is constructed with 2D convolution (Conv2D) layers, which are normally required for processing image data. Since the speech signal is 1D, it can be processed with 1D convolution (Conv1D) layers, as in the CED [16]. Thus, the CRN [17] requires a huge number of training parameters, which increases the training time accordingly. In [18], a fully convolutional neural network (FCNN)-based SEA was introduced. It processes the raw waveform of noise-corrupted speech, yielding an estimate of the clean speech waveform. The enhanced speech therefore does not depend on the phase spectrum, which has a significant impact on other acoustic-domain SEAs [13], [14], [16] (these keep the phase spectrum unprocessed). In [19], Zheng et al. introduced a phase-aware SEA using a DNN. Here, the phase information (converted to the instantaneous frequency deviation (IFD)) is used jointly with different masks, namely the ideal amplitude mask (IAM), as the training target. The clean speech spectrum is then reconstructed with the estimated mask and the phase information (extracted from the IFD). Yu et al. introduced a KF-based SEA where the LPCs are estimated using a traditional DNN [20]. However, the training is performed with only four noise recordings at four SNR levels, which limits the performance of this SEA over a wide range of noise conditions and SNR levels. Also, the noise covariance is estimated from the initial frames of the noisy speech (assumed to be silent), which disregards non-stationary noise conditions.

The direct estimation of the speech spectrum using the benchmark deep learning methods reported in the literature may suffer from musical noise and distortion. Our investigation reveals that estimating the noise spectrum with a deep learning technique is more beneficial, since it is a crucial parameter for most SEAs in the literature. For example, the AKF-based SEA suffers from poor noise LPC estimates in practice. In this paper, a causal convolutional encoder-decoder (CCED) network addresses the speech and noise LPC estimation of the AKF for speech enhancement. Specifically, the CCED network gives an estimate of the instantaneous noise spectrum for computing the noise LPCs on a framewise basis. A whitening filter is then constructed with the noise LPCs to pre-whiten the noise-corrupted speech frame prior to speech LPC estimation. With the improved speech and noise LPCs, the AKF is found to be effective in minimizing the residual noise as well as the distortion in the enhanced speech. The efficiency of the proposed SEA is compared against the benchmark SEAs using objective and subjective testing on the NOIZEUS corpus.

II. AKF FOR COLORED NOISE SUPPRESSION

Assuming the colored noise v(n) to be additive with the speech s(n) and uncorrelated with it, the noisy speech y(n) at sample n is given by:

y(n) = s(n) + v(n).  (1)

The s(n) and v(n) of Eq. (1) can be modeled with pth- and qth-order linear predictors as [21]:

s(n) = −Σ_{i=1}^{p} a_i s(n − i) + w(n),  (2)

v(n) = −Σ_{j=1}^{q} b_j v(n − j) + u(n),  (3)

where {a_i; i = 1, 2, . . . , p} and {b_j; j = 1, 2, . . . , q} are the LPCs, and w(n) and u(n) are assumed to be white noise with zero mean and variances σ_w² and σ_u², respectively.

Eqs. (1)-(3) can be used to form the following augmented state-space model (ASSM) of the AKF as [12]:

x(n) = Φx(n − 1) + dz(n),  (4)
y(n) = cᵀx(n).  (5)

In the above ASSM,

1) x(n) = [s(n) . . . s(n − p + 1)  v(n) . . . v(n − q + 1)]ᵀ is a (p + q) × 1 state vector,

2) Φ = [ Φ_s  0 ; 0  Φ_v ] is a (p + q) × (p + q) state-transition matrix, with the companion matrices:

Φ_s = [ −a₁  −a₂  …  −a_{p−1}  −a_p
          1    0   …     0       0
          0    1   …     0       0
          ⋮    ⋮   ⋱     ⋮       ⋮
          0    0   …     1       0 ],

Φ_v = [ −b₁  −b₂  …  −b_{q−1}  −b_q
          1    0   …     0       0
          0    1   …     0       0
          ⋮    ⋮   ⋱     ⋮       ⋮
          0    0   …     1       0 ],

3) d = [ d_s  0 ; 0  d_v ], where d_s = [1 0 . . . 0]ᵀ and d_v = [1 0 . . . 0]ᵀ,

4) z(n) = [ w(n) ; u(n) ],

5) cᵀ = [c_sᵀ  c_vᵀ], where c_s = [1 0 . . . 0]ᵀ and c_v = [1 0 . . . 0]ᵀ are p × 1 and q × 1 vectors,

6) y(n) is the noisy measurement at sample n.

Firstly, y(n) is windowed into non-overlapped, short (e.g., 20 ms) frames. For a particular frame, the AKF computes an unbiased, linear MMSE estimate x̂(n|n) at sample n, given y(n), using the following recursive equations [12]:

x̂(n|n − 1) = Φx̂(n − 1|n − 1),  (6)
Ψ(n|n − 1) = ΦΨ(n − 1|n − 1)Φᵀ + dQdᵀ,  (7)
K(n) = Ψ(n|n − 1)c (cᵀΨ(n|n − 1)c)⁻¹,  (8)
x̂(n|n) = x̂(n|n − 1) + K(n)[y(n) − cᵀx̂(n|n − 1)],  (9)
Ψ(n|n) = [I − K(n)cᵀ]Ψ(n|n − 1),  (10)

where Q = [ σ_w²  0 ; 0  σ_u² ] is the process noise covariance.

For a noisy speech frame, the error covariances Ψ(n|n − 1) and Ψ(n|n) corresponding to x̂(n|n − 1) and x̂(n|n), and the Kalman gain K(n), are continually updated on a samplewise basis, while ({a_i}, σ_w²) and ({b_k}, σ_u²) remain constant. At sample n, gᵀx̂(n|n) gives the estimated speech ŝ(n|n), where g = [1 0 0 . . . 0]ᵀ is a (p + q) × 1 column vector. As in [12], ŝ(n|n) is given by:

ŝ(n|n) = [1 − K₀(n)]ŝ(n|n − 1) + K₀(n)[y(n) − v̂(n|n − 1)],  (11)

where K₀(n) is the first component of K(n).
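To make the recursion concrete, the ASSM of Eqs. (4)-(5) and the update equations (6)-(10) can be sketched in a few lines of numpy. This is a minimal sketch with our own function and variable names, not the authors' implementation:

```python
import numpy as np

def akf_enhance(y, a, b, sig_w2, sig_u2):
    """Minimal AKF sketch: builds the ASSM of Eqs. (4)-(5) from speech
    LPCs a (length p) and noise LPCs b (length q), then runs the
    recursion of Eqs. (6)-(10) over one frame y."""
    p, q = len(a), len(b)
    ns = p + q
    # Block-diagonal state-transition matrix Phi (companion form).
    Phi = np.zeros((ns, ns))
    Phi[0, :p] = -np.asarray(a, float)
    Phi[1:p, :p - 1] = np.eye(p - 1)
    Phi[p, p:] = -np.asarray(b, float)
    Phi[p + 1:, p:-1] = np.eye(q - 1)
    # Input matrix d and measurement vector c.
    d = np.zeros((ns, 2)); d[0, 0] = 1.0; d[p, 1] = 1.0
    c = np.zeros(ns); c[0] = 1.0; c[p] = 1.0
    Q = np.diag([sig_w2, sig_u2])          # process noise covariance
    x = np.zeros(ns)
    P = np.eye(ns)                         # Psi(0|0)
    s_hat = np.empty(len(y))
    for n, yn in enumerate(y):
        x = Phi @ x                        # Eq. (6)
        P = Phi @ P @ Phi.T + d @ Q @ d.T  # Eq. (7)
        K = P @ c / (c @ P @ c)            # Eq. (8): scalar innovation
        x = x + K * (yn - c @ x)           # Eq. (9)
        P = (np.eye(ns) - np.outer(K, c)) @ P  # Eq. (10)
        s_hat[n] = x[0]                    # g^T x(n|n) picks s(n)
    return s_hat
```

With matched LPCs (as in the AKF-Ideal configuration later in the paper), the filtered output is closer to the clean speech than the noisy observation.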
Specifically, K₀(n) is given by [12]:

K₀(n) = (α²(n) + σ_w²) / (α²(n) + σ_w² + β²(n) + σ_u²),  (12)

where α²(n) and β²(n) are the transmissions of the a posteriori error variances (of the speech and measurement-noise samples) by the augmented dynamic model from the previous time sample, n − 1 [12].

Eq. (11) reveals that K₀(n) has a significant impact on the ŝ(n|n) estimate (the output of the AKF). In practice, poor estimates of ({a_i}, σ_w²) and ({b_k}, σ_u²) introduce bias in K₀(n), which affects the ŝ(n|n) estimate. In the proposed SEA, a CCED network improves the speech and noise LPC estimates of the AKF, leading to an improved ŝ(n|n) estimate.

III. PROPOSED SPEECH ENHANCEMENT SYSTEM

In the proposed SEA, we introduce a CCED-based method (described in Section III-B) to estimate the instantaneous noise spectrum, |V̂(l, m)|, for a given noisy speech spectrum, |Y(l, m)|, on a framewise basis. Fig. 1 shows the block diagram of the proposed SEA. Firstly, a 32 ms rectangular window with 50% overlap is used to convert y(n) (Eq. (1)) into frames y(n, l).

A. Proposed ({b_k}, σ_u²) and ({a_i}, σ_w²) Estimation Method

The existing AKF-based SEA [12] estimates the noise from the initial noise-corrupted speech frames by assuming that they contain no speech, and then computes ({b_k}, σ_u²) from the estimated noise; these remain constant while processing all frames of a given noise-corrupted utterance. This concept may be effective for enhancing colored-noise-corrupted speech. Due to the time-varying amplitude of non-stationary noise, however, ({b_k}, σ_u²) needs to be updated continuously while processing each noise-corrupted speech frame. The ({b_k}, σ_u²) estimation process in [12] therefore disregards non-stationary noise conditions.
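The 32 ms, 50%-overlap analysis framing described above can be sketched as follows (a minimal numpy sketch; the function name and defaults are ours, not from the paper):

```python
import numpy as np

def frame_signal(y, fs=16000, frame_ms=32, overlap=0.5):
    """Split y into frame_ms-long frames with the given fractional
    overlap (rectangular window): 512-sample frames, 256-sample hop
    at 16 kHz for the 32 ms / 50% setting used here."""
    flen = int(fs * frame_ms / 1000)
    hop = int(flen * (1 - overlap))
    n_frames = 1 + max(0, (len(y) - flen) // hop)
    return np.stack([y[l * hop : l * hop + flen] for l in range(n_frames)])
```

Each row of the returned array is one frame y(n, l), ready for per-frame spectral analysis.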
Here, y(n, l) = s(n, l) + v(n, l), where l ∈ {0, 1, 2, . . . , N − 1} is the frame index, N is the total number of frames in an utterance, and n ∈ {0, 1, 2, . . . , M − 1} indexes the M samples within each frame. The DFT coefficients Y(l, m), S(l, m), and V(l, m) are found using the Hamming window and correspond to y(n), s(n), and v(n).

By taking the square of |V̂(l, m)|, i.e., |V̂(l, m)|², we get the instantaneous noise PSD from which ({b_k}, σ_u²) are computed. Specifically, the |IDFT| of |V̂(l, m)|² yields an estimate of the noise autocorrelation, R̂_vv(τ), where τ is the autocorrelation lag. By solving R̂_vv(τ) using the Levinson-Durbin recursion [21], the ({b_k}, σ_u²) (q = 20) estimates are obtained. The {b_k} are then used to design the whitening filter H_w(z) as [21]:

H_w(z) = 1 + Σ_{k=1}^{q} b_k z^{−k}.  (14)

Applying H_w(z) to y(n, l) gives the pre-whitened speech, y_w(n, l). Then ({a_i}, σ_w²) (p = 10) are computed from y_w(n, l) using the autocorrelation method [21].

Fig. 1. Block diagram of the proposed SEA.

B. CCED for Noise Spectrum Estimation

We propose a CCED-based method to estimate |V̂(l, m)|. The proposed CCED network structure is shown in Fig. 2. It consists of a convolutional encoder followed by a corresponding decoder. The encoder consists of a stack of five convolution layers. Unlike the 2-dimensional convolution (Conv2D) layers in [17], we use 1-dimensional convolution (Conv1D) layers, since they are appropriate for processing the 1D speech signal. The Conv1D layer also greatly reduces the number of training parameters as well as the training time. The decoder likewise consists of a stack of five Conv1D layers. In addition, we use the causal Conv1D layer [22]. Fig. 3 demonstrates the operating principle of the standard and causal Conv1D layers. The standard Conv1D layers (Fig. 3 (a)) are comprised of filters that capture the local correlation of nearby data points, thus leaking future information into the current data during operation.
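The noise-LPC computation of Section III-A (autocorrelation, the Levinson-Durbin recursion, then the whitening filter of Eq. (14)) can be sketched as follows; this is our own minimal implementation, not the authors' code:

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for the LPCs {b_k} and the
    excitation variance, given autocorrelation values r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                    # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1][:i]  # order-update of the LPCs
        err *= 1.0 - k * k                # updated prediction error
    return a[1:], err

def whiten(y_frame, b):
    """Apply H_w(z) = 1 + sum_k b_k z^{-k} (Eq. (14)) as an FIR filter."""
    h = np.concatenate(([1.0], b))
    return np.convolve(y_frame, h)[:len(y_frame)]
```

For an AR(1) process v(n) = 0.8 v(n − 1) + u(n), the recursion recovers b₁ = −0.8 from the autocorrelation sequence, and the whitening filter cancels the predictable part of the signal.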
Conversely, in the causal Conv1D layer (Fig. 3 (b)), the output at any time step t uses only the information from the previous time steps, i.e., 0 to t − 1 [22]. This allows the CCED network to perform real-time noise spectrum estimation.

Fig. 3. Working principle of: (a) standard and (b) causal Conv1D layers.

In the STFT domain, the noisy speech can also be represented as:

Y(l, m) = S(l, m) + V(l, m),  (13)

where m is the discrete-frequency index. It is assumed that S(l, m) and V(l, m) follow a Gaussian distribution with zero mean and variances E{|S(l, m)|²} = λ_s(l, m) and E{|V(l, m)|²} = λ_v(l, m), where E{·} denotes the statistical expectation operator.

The CCED network maps the single-sided magnitude spectrum (257-point DFT coefficients, including the Nyquist frequency component) of the noisy speech, |Y(l, m)|, to that of the noise spectrum, |V̂(l, m)|. Therefore, the output size of the first Conv1D layer in the encoder is 257. Specifically, the output sizes of the encoder Conv1D layers decrease in the order 257, 128, 64, 32, 16, and increase in the decoder Conv1D layers in the order 16, 32, 64, 128, 257. We also use a symmetric kernel in each Conv1D layer: the kernel size in the encoder Conv1D layers increases gradually as 1, 3, 5, 7, 9, and decreases as 9, 7, 5, 3, 1 in the decoder Conv1D layers. The proposed CCED network thus encodes the features into a lower dimension along the encoder and achieves decompression along the decoder.

Fig. 2. Architecture of the proposed CCED network for noise spectrum estimation.

Each encoder-decoder layer is passed through layer normalization (LN) [23] followed by the SELU activation function [24], except the last layer, which passes through a sigmoid activation function [25], as it is the output layer. The reason for using the SELU activation is that it suffers less from vanishing gradients than ReLU [26] and ELU [27]. Also, SELUs learn faster and better than ReLU and ELU, even when those are combined with layer normalization [24]. Unlike [17], the Conv1D layers in the CCED network make pooling and up-sampling unnecessary in the encoder and decoder layers.

To improve the flow of information and gradients throughout the network, we also utilize skip connections between the causal Conv1D units of the encoder and decoder, which resolve the so-called vanishing gradient issue in deep neural networks. The skip connections are represented by arrows (drawn in the same color as the corresponding Conv1D units) in Fig. 2.

IV. SPEECH ENHANCEMENT EXPERIMENT

A. Training Set

For training the proposed CCED network, a total of 30,000 clean speech recordings are randomly selected from the train-clean-100 set of the Librispeech corpus [28] (28,539 recordings), the CSTR VCTK corpus [29] (42,015), and the si* and sx* training sets of the TIMIT corpus [30] (3,696). Of the 30,000 recordings, 5%, i.e., 1,500, are randomly selected for cross-validation of the CCED network during training; the remaining 28,500 are used for training. Also, a total of 500 noise recordings are randomly selected from the QUT-NOISE dataset [31], the Nonspeech dataset [32], the Environmental Background Noise dataset [33], [34], and the noise set of the MUSAN corpus [35]. Of these 500, 5%, i.e., 25, are selected for cross-validation, while the remaining 475 are used for training. All clean speech and noise recordings are single-channel, with a sampling frequency of 16 kHz.

Fig. 4. Performance comparison of the proposed SEA with the benchmark SEAs in terms of the average PESQ in (a) passing car and (b) restaurant noise conditions, and the average QSTI in (c) passing car and (d) cafe babble noise conditions.
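The difference between the standard and causal Conv1D padding illustrated in Fig. 3 can be demonstrated with a toy 1D convolution (our own sketch, unrelated to the actual network weights):

```python
import numpy as np

def conv1d(x, w, causal=True):
    """1D convolution with kernel w over signal x.
    causal=True: left-pad only, so the output at step t depends on
    x[0..t] alone. causal=False ('same' padding): symmetric padding,
    which leaks future samples into the current output."""
    k = len(w)
    if causal:
        xp = np.concatenate([np.zeros(k - 1), x])
    else:
        lp = (k - 1) // 2
        xp = np.concatenate([np.zeros(lp), x, np.zeros(k - 1 - lp)])
    return np.array([xp[t:t + k] @ w for t in range(len(x))])
```

Perturbing a future sample changes the standard convolution's earlier outputs but leaves the causal convolution's earlier outputs untouched, which is the property that enables real-time operation.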
Fig. 5. (a) Clean speech, (b) noisy speech (sp05 corrupted with 5 dB passing car noise), and the enhanced speech spectrograms produced by the: (c) RWF-FCN [18], (d) DNN-KF [20], (e) IAM+IFD [19], (f) proposed, and (g) AKF-Ideal methods.

B. Training Strategy

The following training strategy was employed to train the proposed CCED network for noise spectrum estimation:
• The mean square error is chosen as the loss function.
• The Adam algorithm [36] with default hyperparameters is selected for gradient-descent optimisation.
• Gradients are clipped to [−1, 1].
• 120 epochs are used to train the CCED network.
• The number of training examples in an epoch is equal to the number of clean speech recordings in the training set (28,500).
• The noisy speech signals are generated as follows: each randomly selected clean speech recording (without replacement) is corrupted with a randomly selected noise recording (without replacement) at a randomly selected SNR level (−10 to +20 dB, in 1 dB increments).

C. Test Set

For the objective experiments, 30 clean speech utterances belonging to six speakers (3 male and 3 female) are taken from the NOIZEUS corpus. The speech recordings are sampled at 16 kHz [9, Chapter 12]. We generate a noisy speech data set by corrupting the speech recordings with passing car and cafe babble noise recordings selected from [33], [34] at SNR levels varying from −5 dB to +15 dB, in 5 dB increments. It is important to note that both the speech and noise recordings are unseen, i.e., not used in training the CCED network.

D. Evaluation Metrics

The objective quality and intelligibility evaluations are carried out through the perceptual evaluation of speech quality (PESQ) [37] and quasi-stationary speech transmission index (QSTI) [38] measures. We also analyze the spectrograms of the enhanced speech produced by the proposed and benchmark SEAs to quantify the level of residual noise and distortion.

The subjective evaluation was carried out through a blind AB listening test [39, Section 3.3.4]. It is conducted on the utterance sp05 ("Wipe the grease off his dirty face") corrupted with 5 dB passing car noise. The enhanced speech produced by the five SEAs, together with the corresponding clean and noisy speech recordings, gives a total of 42 stimulus pairs, played in random order to each listener, excluding comparisons between the same method. For each pair, the listener selects the first or the second stimulus as perceptually better, or gives a third response indicating that no difference is found between them. A 100% score is awarded to the preferred method, 0% to the other, and 50% to each method for a 'no difference' response. Participants could re-listen to stimuli if required. Five English-speaking listeners participated in the AB listening tests. The average of the preference scores given by the listeners is termed the mean preference score (%).

The performance of the proposed method is evaluated by comparing it with the benchmark methods: the raw-waveform-processing FCNN (RWF-FCN) method [18], the phase-aware DNN (IAM+IFD) method [19], the deep-learning-based KF (DNN-KF) method [20], the AKF-Ideal method (where ({a_i}, σ_w²) and ({b_k}, σ_u²) are computed from the clean speech and noise signals), and Noisy (the noise-corrupted speech).

E. Results and Discussion

Fig. 4 (a)-(b) demonstrates that the proposed SEA consistently shows improved PESQ scores over the benchmark SEAs, except the AKF-Ideal method, for all noise conditions and SNR levels. The IAM+IFD method [19] exhibits relatively better PESQ scores among the benchmark methods across the noise experiments. The Noisy speech shows the worst PESQ score in all noise conditions.

Fig. 4 (c)-(d) also shows that the proposed method demonstrates a consistent QSTI score improvement across the noise experiments and SNR levels, apart from the AKF-Ideal method. The existing IAM+IFD method [19] is found to be competitive with the proposed method in QSTI improvement, typically at low SNR levels. However, at high SNR levels, all the SEAs, and even the noisy speech signal, show comparable QSTI scores in all noise conditions.

It can be seen that the proposed SEA (Fig. 5 (f)) exhibits significantly less residual noise in the enhanced speech than the benchmark SEAs (Fig. 5 (c)-(e)) and is closely similar to the AKF-Ideal method (Fig. 5 (g)). Going from the RWF-FCN method [18] to the IAM+IFD method [19] (Fig. 5 (c)-(e)), the noise floor is seen to decrease. Informal listening tests conducted on the enhanced speech also confirm that the benchmark SEAs produce relatively annoying sounds, compared to the negligible audio artifacts of the proposed method.

Fig. 6. The mean preference score (%) for each SEA on sp05 corrupted with 5 dB passing car noise.

The outcome of the AB listening tests in terms of the mean preference score (%) is shown in Fig. 6. It can be seen that the enhanced speech produced by the proposed SEA is widely preferred by the listeners (around 74%) over the benchmark methods, apart from the AKF-Ideal method (around 84%) and the clean speech signal (100%). The IAM+IFD method [19] is found to be the most preferred (62%) among the benchmark methods, followed by the DNN-KF method [20] (50%) and the RWF-FCN method [18] (35%).
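As an aside, the noisy-mixture generation used for training and testing (a clean recording corrupted by a noise recording at a chosen SNR) follows the usual recipe of scaling the noise to hit the target SNR. The paper does not give its exact code; this is a hedged sketch of that recipe:

```python
import numpy as np

def mix_at_snr(s, v, snr_db):
    """Scale noise v so that the mixture s + g*v has the requested
    SNR in dB with respect to the clean speech s (our own sketch)."""
    ps = np.mean(s ** 2)                          # speech power
    pv = np.mean(v ** 2)                          # noise power
    g = np.sqrt(ps / (pv * 10 ** (snr_db / 10)))  # noise gain
    return s + g * v
```

The achieved SNR, 10·log10(ps / E{(g·v)²}), equals the requested value by construction.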
V. CONCLUSION

This paper introduced a causal convolutional encoder-decoder-based augmented Kalman filter for speech enhancement in various noise conditions. Specifically, the proposed CCED network gives an estimate of the instantaneous noise magnitude spectrum, from which the noise PSD is computed. The noise LPCs are then computed from the estimated noise PSD. A whitening filter constructed with the estimated noise LPCs is applied to the noise-corrupted speech, yielding pre-whitened speech, from which the speech LPCs are computed. The large training set of the CCED network enables the speech and noise LPC estimates to be effective in various noise conditions. As a result, the AKF constructed with the improved LPCs of the speech and noise signal minimizes the residual noise as well as the distortion in the enhanced speech. Extensive objective and subjective testing imply that the proposed method outperforms the benchmark methods in various noise conditions over a wide range of SNR levels.

REFERENCES

[1] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113–120, April 1979.
[2] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 208–211, April 1979.
[3] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, December 1984.
[4] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, April 1985.
[5] P. Scalart and J. V. Filho, "Speech enhancement based on a priori signal to noise estimation," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 629–632, May 1996.
[6] C. Plapous, C. Marro, L. Mauuary, and P. Scalart, "A two-step noise reduction technique," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 289–292, May 2004.
[7] K. Paliwal and A. Basu, "A speech enhancement method based on Kalman filtering," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 12, pp. 177–180, April 1987.
[8] N. Upadhyay and A. Karmakar, "Speech enhancement using spectral subtraction-type algorithms: A comparison and simulation study," Procedia Computer Science, vol. 54, pp. 574–584, 2015.
[9] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd ed. Boca Raton, FL, USA: CRC Press, Inc., 2013.
[10] J. D. Gibson, B. Koo, and S. D. Gray, "Filtering of colored noise for speech enhancement and coding," IEEE Transactions on Signal Processing, vol. 39, no. 8, pp. 1732–1742, August 1991.
[11] S. K. Roy, W. P. Zhu, and B. Champagne, "Single channel speech enhancement using subband iterative Kalman filter," IEEE International Symposium on Circuits and Systems, pp. 762–765, May 2016.
[12] A. E. W. George, S. So, R. Ghosh, and K. K. Paliwal, "Robustness metric-based tuning of the augmented Kalman filter for the enhancement of speech corrupted with coloured noise," Speech Communication, vol. 105, pp. 62–76, December 2018.
[13] Y. Xu, J. Du, L. Dai, and C. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, 2014.
[14] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849–1858, 2014.
[15] D. S. Williamson, Y. Wang, and D. Wang, "Complex ratio masking for monaural speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 483–492, 2016.
[16] S. R. Park and J. Lee, "A fully convolutional neural network for speech enhancement," Proceedings of Interspeech, pp. 1993–1997, 2017.
[17] K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement," Proceedings of Interspeech, pp. 3229–3233, 2018.
[18] S. Fu, T. Wang, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1570–1584, 2018.
[19] N. Zheng and X. Zhang, "Phase-aware speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 63–76, 2019.
[20] H. Yu, Z. Ouyang, W. Zhu, B. Champagne, and Y. Ji, "A deep neural network based Kalman filter for time domain speech enhancement," IEEE International Symposium on Circuits and Systems, pp. 1–5, May 2019.
[21] S. V. Vaseghi, "Linear prediction models," in Advanced Digital Signal Processing and Noise Reduction. John Wiley & Sons, 2009, ch. 8, pp. 227–262.
[22] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. Kavukcuoglu, "Neural machine translation in linear time," 2016.
[23] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016.
[24] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, "Self-normalizing neural networks," 2017.
[25] C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall, "Activation functions: Comparison of trends in practice and research for deep learning," 2018.
[26] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," 2015.
[27] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," 2015.
[28] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5206–5210, April 2015.
[29] C. Veaux, J. Yamagishi, and K. MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.
[30] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, Feb. 1993.
[31] D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason, "The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms," in Proceedings Interspeech 2010, 2010, pp. 3110–3113.
[32] G. Hu, "100 nonspeech environmental sounds," The Ohio State University, Department of Computer Science and Engineering, 2004.
[33] F. Saki, A. Sehgal, I. Panahi, and N. Kehtarnavaz, "Smartphone-based real-time classification of noise signals using subband features and random forest classifier," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 2204–2208.
[34] F. Saki and N. Kehtarnavaz, "Automatic switching between noise classification and speech enhancement for hearing aid devices," in 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Aug 2016, pp. 736–739.
[35] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," CoRR, vol. abs/1510.08484, 2015. [Online]. Available: http://arxiv.org/abs/1510.08484
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014.
[37] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 749–752, May 2001.
[38] B. Schwerin and K. K. Paliwal, "An improved speech transmission index for intelligibility prediction," Speech Communication, vol. 65, pp. 9–19, December 2014.
[39] K. K. Paliwal, K. Wójcicki, and B. Schwerin, "Single-channel speech enhancement using spectral subtraction in the short-time modulation domain," Speech Communication, vol. 52, no. 5, pp. 450–475, May 2010.