A method for detecting spoken keywords in a given speech utterance is proposed, called joint Dynamic Time Warping (DTW)-Convolutional Neural Network (CNN). It combines the DTW approach with a strong classifier, a CNN. The two methods have independently shown significant results on optimal sequence alignment and object recognition, respectively. The proposed method modifies the original DTW formulation and converts the warping matrix into a grayscale image. A CNN is trained on these images to classify the presence or absence of a keyword by identifying the texture of the warping matrix. Experiments were conducted on the TIMIT corpus, and our method shows significant improvement over existing techniques.
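As a rough illustration of the idea of turning a warping matrix into an image, the sketch below computes a classical accumulated-cost DTW matrix between a keyword template and an utterance and rescales it to 8-bit gray levels. It uses the textbook DTW recursion, not the modified formulation the abstract refers to (which is not described here), and the function names `dtw_cost_matrix` and `to_grayscale` are hypothetical.

```python
import numpy as np

def dtw_cost_matrix(query, utterance):
    """Accumulated DTW cost for two feature sequences of shape (frames, dims).

    Assumption: classical DTW with Euclidean local cost, not the paper's
    modified formulation.
    """
    n, m = len(query), len(utterance)
    # Local cost: distance between every query frame and every utterance frame.
    local = np.linalg.norm(query[:, None, :] - utterance[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_prev
    return acc[1:, 1:]

def to_grayscale(matrix):
    """Min-max scale a matrix to unsigned 8-bit gray levels (0..255)."""
    lo, hi = matrix.min(), matrix.max()
    if hi == lo:
        return np.zeros(matrix.shape, dtype=np.uint8)
    return np.round(255 * (matrix - lo) / (hi - lo)).astype(np.uint8)

# Toy one-dimensional "features"; real inputs would be e.g. MFCC frames.
query = np.array([[0.0], [1.0]])
utterance = np.array([[0.0], [1.0], [2.0]])
image = to_grayscale(dtw_cost_matrix(query, utterance))
```

The resulting `image` array is what an image classifier such as a CNN could take as input; a low-cost diagonal band would be the visual cue for a matching keyword.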
We investigate a number of Deep Neural Network (DNN) architectures for emotion identification with the IEMOCAP database. First, we compare different feature extraction frontends: high-dimensional MFCC input (equivalent to filterbanks) versus frequency-domain and time-domain approaches to learning filters as part of the network. We obtain the best results with the time-domain filter-learning approach. Next, we investigate different ways to aggregate information over the duration of an utterance. We tried approaches with a single label per utterance and time aggregation inside the network, and approaches where the label is repeated for each frame. Having a separate label per frame seemed to work best, and the best architecture that we tried interleaves TDNN-LSTM with time-restricted self-attention, achieving a weighted accuracy of 70.6%, versus 61.8% for the best previously published system, which used 257-dimensional Fourier log-energies as input.
In this work, we present a simple and elegant approach to language modeling for bilingual code-switched text. Since code-switching is a blend of two or more different languages, a standard bilingual language model can be improved upon by using the structure of the monolingual language models. We propose a novel technique called dual language models, which involves building two complementary monolingual language models and combining them using a probabilistic model for switching between the two. We evaluate the efficacy of our approach using a conversational Mandarin-English speech corpus. We demonstrate the robustness of our model by showing significant improvements in perplexity over the standard bilingual language model without the use of any external information. Similar consistent improvements are also reflected in automatic speech recognition error rates.
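One way to picture combining two monolingual models through a switching probability is the toy sketch below. It is a simplified assumption, not the construction used in the paper: each language has a small bigram table, a hypothetical `<sw>` context stands for the position right after a switch, and `P_SWITCH` holds made-up probabilities of leaving each language. All words, values, and function names are illustrative.

```python
import math

# Toy monolingual bigram tables (hypothetical values): P(word | previous word).
# "<sw>" is an assumed context used for the first word after a switch.
LM = {
    "en": {("<sw>", "hello"): 0.5, ("<sw>", "world"): 0.5, ("hello", "world"): 1.0},
    "zh": {("<sw>", "ni_hao"): 1.0, ("ni_hao", "ni_hao"): 1.0},
}
# Hypothetical probability of switching out of each language at any word.
P_SWITCH = {"en": 0.2, "zh": 0.3}

def next_word_prob(prev, prev_lang, word, word_lang):
    """Probability of `word` given the previous word and its language tag."""
    if word_lang == prev_lang:
        # Stay in the same language: use that language's own bigram model.
        return (1.0 - P_SWITCH[prev_lang]) * LM[word_lang].get((prev, word), 0.0)
    # Switch: enter the other language's model through its "<sw>" context.
    return P_SWITCH[prev_lang] * LM[word_lang].get(("<sw>", word), 0.0)

def perplexity(tagged_words):
    """Perplexity over all transitions in a list of (word, language) pairs."""
    log_prob = 0.0
    for (prev, prev_lang), (word, word_lang) in zip(tagged_words, tagged_words[1:]):
        log_prob += math.log(next_word_prob(prev, prev_lang, word, word_lang))
    return math.exp(-log_prob / (len(tagged_words) - 1))

ppl = perplexity([("hello", "en"), ("world", "en")])
```

In this toy setup the probabilities of all possible next words after `hello` sum to one (0.8 for staying in English, 0.2 for switching), which is the property that lets the two monolingual models act as a single bilingual model.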
In recent years, harmonic-percussive source separation methods have been gaining importance because of their potential applications in many music information retrieval tasks. The goal of these decomposition methods is to achieve near real-time separation and to produce distortion- and artifact-free component spectrograms, along with their equivalent time-domain signals, for potential music applications. In this paper, we propose a decomposition method that uses a modified moving average filter in the Fourier frequency domain to suppress the impulsive interference of the percussive source on the harmonic components and of the harmonic source on the percussive components. The significant advantage of the proposed method is that it minimizes the artifacts in the separated signal spectrograms. We also propose Affine and Gain masking methods to separate the harmonic and percussive components with minimal spectral leakage. The objective measures and separated spectrograms show that the proposed method performs better than the existing rank-order filtering based harmonic-percussive separation methods.
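The general principle behind filtering-based harmonic-percussive separation can be sketched as follows: smoothing a magnitude spectrogram along time favors horizontal (harmonic) structure, while smoothing along frequency favors vertical (percussive) structure. The sketch uses a plain moving average and a generic ratio mask as stand-ins; the paper's modified moving average filter and its Affine and Gain masks are not specified in the abstract and are not reproduced here. The function names and the `width` parameter are assumptions.

```python
import numpy as np

def moving_average(x, width, axis):
    """Plain centered moving average along one axis (zero-padded at the edges)."""
    kernel = np.ones(width) / width
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, x
    )

def separate(magnitude, width=5, eps=1e-10):
    """Split a (freq_bins, frames) magnitude spectrogram into two components.

    Assumption: generic soft ratio mask, not the paper's Affine/Gain masks.
    Both spectrogram dimensions should be at least `width`.
    """
    harmonic_est = moving_average(magnitude, width, axis=1)    # smooth over time
    percussive_est = moving_average(magnitude, width, axis=0)  # smooth over frequency
    mask_h = harmonic_est / (harmonic_est + percussive_est + eps)
    return mask_h * magnitude, (1.0 - mask_h) * magnitude

# Toy spectrogram: one sustained tone (row 4) and one click (column 9).
spec = np.zeros((16, 16))
spec[4, :] = 1.0
spec[:, 9] = 1.0
harmonic, percussive = separate(spec)
```

Because the two masks sum to one, the two outputs add back up to the input spectrogram; time-domain signals would then be obtained by combining each component with the mixture phase and inverting the transform.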
Added on December 12, 2019
Contributed by : Consortium
Product Type : Research Paper
License Type : Freeware
Author : Gurunath Reddy M, K. Sreenivasa Rao, Partha Pratim Das
We introduce a monaural audio source separation framework using a latent generative model. Traditionally, discriminative training for source separation has been proposed using deep neural networks or non-negative matrix factorization. In this paper, we propose a principled generative approach using variational autoencoders (VAE) for audio source separation. A VAE performs efficient Bayesian inference, which leads to a continuous latent representation of the input data (spectrogram). It contains a probabilistic encoder, which projects input data to the latent space, and a probabilistic decoder, which projects data from the latent space back to the input space. This allows us to learn a robust latent representation of sources corrupted with noise and other sources. The latent representation is then fed to the decoder to yield the separated source. Both encoder and decoder are implemented via multilayer perceptrons (MLP). In contrast to prevalent techniques, we argue that VAE is a more principled approach to source separation. Experimentally, we find that the proposed framework yields reasonable improvements when compared to baseline methods available in the literature, i.e., DNN and RNN with different masking functions, and autoencoders. We show that our method performs better than the best of the relevant methods, with approximately 2 dB improvement in the source-to-distortion ratio.
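The encoder-latent-decoder pipeline described above can be sketched with a minimal NumPy forward pass. This is an illustration under stated assumptions, not the paper's implementation: the layer sizes, the `tanh` activations, the squared-error reconstruction term, and the choice to feed the mixture spectrogram in while scoring the output against the clean source are all assumptions, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Small random weights and zero biases for one dense layer."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def init_vae(n_bins, n_hidden, n_latent):
    """Parameters of a one-hidden-layer MLP encoder and decoder (assumed sizes)."""
    return {
        "enc": init_layer(n_bins, n_hidden),
        "mu": init_layer(n_hidden, n_latent),
        "log_var": init_layer(n_hidden, n_latent),
        "dec": init_layer(n_latent, n_hidden),
        "out": init_layer(n_hidden, n_bins),
    }

def dense(x, layer):
    w, b = layer
    return x @ w + b

def vae_loss(mixture, source, p):
    """Forward pass on mixture frames; returns estimate, mean loss, per-frame KL."""
    # Probabilistic encoder: mean and log-variance of the latent Gaussian.
    h = np.tanh(dense(mixture, p["enc"]))
    mu, log_var = dense(h, p["mu"]), dense(h, p["log_var"])
    # Reparameterization: sample z as mu + sigma * noise.
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    # Decoder maps the latent sample back to spectrogram space.
    estimate = dense(np.tanh(dense(z, p["dec"])), p["out"])
    # Squared-error reconstruction term against the clean source (assumption).
    recon = np.sum((estimate - source) ** 2, axis=1)
    # KL divergence between the latent Gaussian and a standard normal prior.
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return estimate, np.mean(recon + kl), kl

params = init_vae(n_bins=8, n_hidden=16, n_latent=4)
mixture = np.abs(rng.standard_normal((5, 8)))  # 5 toy spectrogram frames
source = 0.5 * mixture                         # toy "clean source" target
estimate, loss, kl = vae_loss(mixture, source, params)
```

Training would minimize `loss` with respect to the parameters using an automatic differentiation framework; the sketch only shows how the reconstruction and KL terms are assembled.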