
Computer Engineering ›› 2013, Vol. 39 ›› Issue (8): 1-4. doi: 10.3969/j.issn.1000-3428.2013.08.001

• Special Column •

Research on Speaker Classification Method Based on Acoustic Merging Feature

YANG Yi 1, CHEN Guo-shun 2, BAO Chang-chun 3

  (1. Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China; 2. Electronic Room, Shijiazhuang Mechanical Technology Institute, Shijiazhuang 050000, China; 3. School of Electronic Information and Control Engineering, Beijing University of Technology, Beijing 100022, China)
  • Received: 2012-04-01 Online: 2013-08-15 Published: 2013-08-13
  • About the authors: YANG Yi (born 1978), female, assistant researcher, Ph.D.; main research interests: speaker recognition and speech enhancement. CHEN Guo-shun, researcher, Ph.D. BAO Chang-chun, professor, Ph.D.
  • Supported by:

    National Natural Science Foundation of China (61105017); Beijing Natural Science Foundation (KZ201110005005)

Abstract:

A speaker classification system segments audio data and groups the segments by speaker. Extracting multi-delay features from multiple distance microphones for each speaker can further improve the performance of such a system, but the dimension of the multi-delay feature vector grows rapidly as the number of microphones increases. To address this problem while preserving the manifold structure of the features and reducing the computational cost, this paper proposes a multi-component discriminant locality preserving projection algorithm based on acoustic merging features from multiple distance microphones. A Support Vector Machine (SVM) classifier is used to train and test a two-speaker classification system, which performs speaker classification in meeting scenarios. Experimental results show that, compared with traditional algorithms such as DLPP, the proposed algorithm achieves better classification performance on most data sets and reduces the Diarization Error Rate (DER) to below 20%.
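The abstract reports results as a Diarization Error Rate but does not spell out how it is scored. As a rough illustration only, the sketch below computes a frame-level DER as (missed speech + false alarm + speaker confusion) divided by the number of reference speech frames, after choosing the best one-to-one mapping between system and reference speaker labels. The function name `diarization_error_rate`, the per-frame label lists, and the use of `None` for non-speech are all assumptions made for this example; the paper's actual scoring protocol (for instance any collar or overlap handling) may differ.

```python
from itertools import permutations


def diarization_error_rate(reference, hypothesis):
    """Frame-level DER sketch. Inputs are per-frame labels; None marks non-speech."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    ref_speech = sum(1 for r in reference if r is not None)
    if ref_speech == 0:
        raise ValueError("reference contains no speech frames")

    # Missed speech and false alarms do not depend on the label mapping.
    missed = sum(1 for r, h in zip(reference, hypothesis)
                 if r is not None and h is None)
    false_alarm = sum(1 for r, h in zip(reference, hypothesis)
                      if r is None and h is not None)

    ref_ids = list({r for r in reference if r is not None})
    hyp_ids = list({h for h in hypothesis if h is not None})
    # Pad with placeholder targets so every hypothesis label can be mapped
    # even when the system outputs more speakers than the reference contains.
    extra = max(0, len(hyp_ids) - len(ref_ids))
    targets = ref_ids + [("unmatched", i) for i in range(extra)]

    # Brute-force search over one-to-one label mappings (fine for a few speakers).
    best_confusion = None
    for assignment in permutations(targets, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, assignment))
        confusion = sum(1 for r, h in zip(reference, hypothesis)
                        if r is not None and h is not None and mapping[h] != r)
        if best_confusion is None or confusion < best_confusion:
            best_confusion = confusion

    return (missed + false_alarm + best_confusion) / ref_speech
```

For a two-speaker recording, calling `diarization_error_rate(ref_labels, sys_labels)` with equal-length label lists returns a fraction, so a value under 0.20 would correspond to the "below 20%" figure quoted above. The brute-force mapping search is chosen for clarity and is only practical for a handful of speakers.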

Key words: speaker classification, multiple distance microphone, multi-delay feature, acoustic merging feature, multi-component discriminant locality preserving projection, Diarization Error Rate (DER)

CLC Number: