Research on CMN-based recognition of Kirgiz with less resources

CLC number: TN711-34; TP391                 Document code: A                    Article ID: 1004-373X(2018)24-0132-05

SUN Jie1,2, Wushour Silamu1, Reyiman Tursun1

(1. School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China;

2. Department of Physics, Changji University, Changji 831100, China)

Abstract: To address the low recognition rates caused by sparse training data in minority-language speech recognition, this paper constructs a cross-language acoustic model based on convolutional maxout networks (CMNs) for low-resource Kirgiz recognition. In the CMN model, the local-receptive-field and weight-sharing techniques of the convolutional neural network (CNN) reduce the number of network parameters, and the CNN's convolution kernels are replaced by maxout neurons to improve the network's ability to extract abstract features. The cross-language CMN is pre-trained on Uygur, a language with relatively rich resources, and Dropout regularization is used during training to prevent over-fitting. Exploiting the similarity of the two languages, a phoneme mapping set is created from the forced alignment of synonyms and used to label the Kirgiz data to be recognized. The CMN parameters are then fine-tuned with the limited corpus of the target language. Experimental results show that the word error rate of the proposed CMN acoustic model is 8.3% lower than that of the baseline CNN acoustic model.

Keywords: speech recognition; less resource; Kirgiz; cross-language acoustic model; CMN; phoneme mapping

0  Introduction

The Belt and Road Initiative has made trade and cultural exchange between China and its neighboring countries increasingly frequent. Automatic translation devices for multiple languages, especially low-resource minority languages, have become an urgent need for interregional connectivity.

The difficulty facing minority-language speech recognition is that the scarcity of labeled data makes it hard to build a robust acoustic model. Building cross-language acoustic models under low-resource conditions is therefore a current research hotspot. Schultz et al. proposed using Bootstrap to integrate several monolingual acoustic models into a cross-language universal phoneme set, achieving a phoneme error rate as low as 34.3% on Swedish, but this method cannot transfer the phoneme context relationships of the resource-rich language to the target-language acoustic model [1]. To this end, Imseng et al. used the Kullback-Leibler (KL) divergence to construct a multilingual triphone hidden Markov model (HMM). The main idea of this model is to estimate phoneme posterior probabilities with a multi-layer perceptron (MLP), describe each HMM state with a multinomial distribution, and use relative entropy as the loss function measuring the distance between the two [2]. Experimental results show that on smaller data sets the KL-HMM model outperforms the GMM-HMM model [3]. However, this method assumes that the state-transition probability of every phone in the model is fixed, which reduces decoding accuracy. Building on the subspace Gaussian mixture model (SGMM), Miao, Joy et al. proposed a shared SGMM, in which multilingual corpora train the model's shared parameters and the limited-resource corpus trains the state-specific vectors, improving word recognition accuracy by 5% over the monolingual SGMM [4-5]. Because deep neural networks (DNNs) [6] have powerful abstract feature-extraction capability, Huang et al. applied the hidden layers of a DNN trained on multiple languages to low-resource language recognition [7], a technique known as shared hidden layers (SHL). This method achieves good recognition results, but requires large amounts of data from multiple languages to train the model sufficiently.
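The KL-HMM idea described above reduces to a simple computation: the relative entropy between the MLP's estimated phoneme posterior and the multinomial distribution attached to an HMM state. A minimal sketch is given below; the two distributions are invented purely for illustration and do not come from the cited experiments.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D_KL(p || q) between two discrete distributions.

    A small epsilon guards against log(0) when a class has zero mass.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical example: a multinomial distribution describing one HMM
# state, compared against an MLP posterior over 4 phoneme classes.
state_multinomial = np.array([0.60, 0.20, 0.10, 0.10])
mlp_posterior = np.array([0.70, 0.15, 0.10, 0.05])

# Used as a loss, this value is minimized during KL-HMM training.
loss = kl_divergence(state_multinomial, mlp_posterior)
```

The divergence is non-negative and reaches zero only when the two distributions coincide, which is what makes it usable as a training loss measuring the state-to-posterior mismatch.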
