Automatic Speech Recognition Using Deep Neural Networks: New Possibilities
Abdel-Hamid, Ossama Abdel-Hamid Mohamed
MetadataShow full item record
Recently, automatic speech recognition (ASR) systems that use deep neural networks (DNNs) for acoustic modeling have attracted huge research interest. This is due to the recent results that have significantly raised the state of the art performance of ASR systems. This dissertation proposes a number of new methods to improve the state of the art ASR performance by exploiting the power of DNNs. The first method exploits domain knowledge in designing a special neural network (NN) structure called a convolutional neural network (CNN). This dissertation proposes to use the CNN in a way that applies convolution and pooling operations along frequency to handle frequency variations that commonly happen due to speaker and pronunciation differences in speech signals. Moreover, a new CNN structure called limited weight sharing is proposed to better suit special spectral characteristics of speech signals. Our experimental results have shown that the use of a CNN leads to 6-9% relative reduction in error rate. The second proposed method deals with speaker variations in a more explicit way through using a new speaker code based adaptation. This method adapts the speech acoustic model to a new speaker by learning a suitable speaker representation based on a small amount of adaptation data from each target speaker. This method alleviates the need to modify any model parameters as is done with other commonly used adaptation methods for neural networks. This greatly reduces the number of parameters to estimate during adaptation; hence, it allows rapid speaker adaptation. The third proposed method aims to handle the temporal structure within speech segments by using a deep segmental neural network (DSNN). The DSNN model alleviates the need to use an HMM model as it directly models the posterior probability of the label sequence. Moreover, a segment-aware NN structure has been proposed. It is able to model the dependency among speech frames within each segment and performs better than the conventional frame based DNNs. Experimental results show that the proposed DSNN can significantly improve recognition performance as compared with the conventional frame based models.