Graduation Thesis: Research on Deep-Learning-Based Acoustic Models for Speech Recognition
2021-03-15 20:02:34
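Abstract (Chinese)
Abstract
1. Introduction
1.1 Research Purpose and Significance
1.2 Research Status at Home and Abroad
1.3 Main Content and Structure
2. Fundamentals of Speech Recognition
2.1 Feature Extraction
2.2 Speech Modeling
2.3 Traditional Speech Recognition Models
2.4 Artificial Neural Networks (ANN)
2.5 Chapter Summary
3. Acoustic Models and Deep Neural Networks
3.1 Recurrent Neural Networks
3.2 Connectionist Temporal Classification
3.3 Time Delay Neural Networks
3.4 Chapter Summary
4. Experiments and Result Analysis
4.1 Experimental Environment Setup
4.2 Preparation
4.3 Model Training
4.4 Recognition Results and Analysis
5. Conclusion and Outlook
References
Acknowledgments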
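Abstract (Chinese)
Speech recognition technology has advanced rapidly over the past decade. Under noise-free, close-range conditions, machine recognition accuracy has reached 97%, surpassing human-level performance. The traditional GMM-HMM framework was for a time a bottleneck in speech recognition and struggled to make breakthroughs on long-sequence recognition tasks. The advent of deep learning has not only sped up the training of speech recognition models on large amounts of data, but has also brought new progress in recognizing free-form conversational speech.
This thesis first discusses conventional MFCC feature extraction and two acoustic models, GMM-HMM and the basic deep neural network model DNN-HMM. On that basis, it introduces a deep neural network learning algorithm that uses pre-training followed by fine-tuning, along with two further acoustic models: an end-to-end LSTM model combined with CTC, and TDNN. In addition, the thesis proposes a data preprocessing method for recognizing bilingual data. It describes data preparation and how acoustic models are trained with the Kaldi speech recognition toolkit, and builds a training environment for the above acoustic models on Kaldi. The experiments show that, on the same speech recognition task, the recognition rate with the bilingual training set is 1%-2% lower than with the pure-Chinese training set. On Chinese and bilingual long-sequence recognition tasks, the deep acoustic model that combines LSTM and TDNN improves on the traditional GMM-HMM model by 25.75% and on the DNN-HMM model by 2.85%.
Keywords: speech recognition; deep learning; bilingual; LSTM; TDNN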
Abstract
Over the past decade, speech recognition technology has developed by leaps and bounds. Under noise-free, close-range conditions, machine recognition accuracy has reached 97%, surpassing human-level performance. The traditional GMM-HMM framework was for a time a bottleneck in speech recognition tasks, and it struggled to make breakthroughs in long-sequence recognition. The emergence of deep learning has not only sped up the training of speech recognition models on large amounts of data, but has also brought new progress in recognizing free-form conversational speech.
Building on a discussion of the conventional MFCC feature extraction process and of two acoustic models, GMM-HMM and the basic deep neural network model DNN-HMM, this thesis introduces a deep neural network learning algorithm based on pre-training plus fine-tuning, and two further acoustic models: an end-to-end LSTM model combined with CTC, and TDNN. In addition, it proposes a data preprocessing method for recognizing bilingual data. It describes data preparation and the use of the Kaldi speech recognition toolkit to train acoustic models, and builds the training environment for the above acoustic models on Kaldi. The experimental results show that, on the same speech recognition task, the recognition rate with the bilingual training set is 1%-2% lower than with the pure-Chinese training set. On Chinese and bilingual long-sequence recognition tasks, the deep acoustic model combining LSTM and TDNN improves on the traditional GMM-HMM model by 25.75% and on the DNN-HMM model by 2.85%.
Keywords: Speech Recognition; Deep Learning; Bilingual; LSTM; TDNN
1. Introduction
1.1 Research Purpose and Significance
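Research on speech recognition began in the 1950s, and, riding the wave of deep learning and artificial intelligence, the technology has improved markedly over the past two decades [1]. Represented today by products such as Siri on Apple's iPhone and iFLYTEK's voice input method, it is regarded by many experts and scholars as one of the ten fields with the greatest development potential in the 21st century. Speech recognition is a branch of pattern recognition. Combined with deep learning, a subfield of machine learning, automatic speech recognition (ASR), in which a computer automatically transcribes speech into text, has become a popular research direction among researchers at home and abroad. This thesis trains five acoustic models on the Kaldi platform, using pure-Chinese data and bilingual speech data: GMM-HMM (Gaussian Mixture Model-Hidden Markov Model), DNN-HMM (Deep Neural Network-Hidden Markov Model), LSTM (Long Short-Term Memory network), TDNN (Time Delay Neural Network), and a model combining TDNN and LSTM. The test results are used to compare the recognition rates of the five models on the two types of data and thereby assess their relative strengths and weaknesses.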
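On the theoretical side, this thesis discusses the speech recognition process and the back-propagation algorithm used in the DNN-HMM acoustic model, the greedy layer-wise training algorithm for deep neural networks, and the LSTM network structure optimized with the CTC (Connectionist Temporal Classification) algorithm. On the experimental side, it describes in detail how model training is carried out with nnet3 in Kaldi, covering three aspects: data preparation, feature extraction, and model training.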