
Computer Engineering (计算机工程)

• Special Column •

基于相似性混合模型的蛋白质交互识别

王宇伟,牛耘,魏欧   

  1. (南京航空航天大学计算机科学与技术学院,南京 210016)
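  • Received: 2014-08-05 Online: 2015-07-15 Published: 2015-07-15
  • About the authors: WANG Yuwei (born 1989), male, master's student, main research interest: natural language processing; NIU Yun and WEI Ou, associate professors, Ph.D.
  • Funding:
    National Natural Science Foundation of China (61202132, 61170043).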

Identification of Protein-protein Interaction Based on Hybrid Similarity Model

WANG Yuwei, NIU Yun, WEI Ou

  1. (School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China)
  • Received: 2014-08-05 Online: 2015-07-15 Published: 2015-07-15

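摘要 (Abstract): Existing machine-learning-based systems for identifying protein-protein interactions rely on evidence from a single sentence only, and the scarcity of annotated data leaves them with small training sets. To address this, a new protein interaction identification method based on a hybrid similarity model is proposed. A basic relational similarity (RS) model makes the initial judgment; similarities between word features are computed from large-scale text; and a word similarity model is introduced on top of the basic RS model through feature clustering, yielding a hybrid model. Experimental results show that the method achieves relatively high and well-balanced precision and recall, and that introducing word similarity further improves the F-score. Because it directly uses existing interaction information, it avoids additional manual annotation.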

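关键词 (Key words): protein-protein interaction, relational similarity, word similarity, K-nearest neighbor classification, hierarchical clustering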

Abstract: Current machine learning-based Protein-protein Interaction (PPI) identification systems make predictions solely on evidence within a single sentence and suffer from small training sets. In this paper, a hybrid similarity model-based approach is proposed to address these issues. A basic Relational Similarity (RS) model is established to make initial predictions. Word similarity matrices are constructed using a corpus-based approach. A clustering algorithm is applied to group words according to their similarity. The obtained word clusters are introduced into the basic RS model to build a hybrid model. Experimental results show that the basic RS model achieves relatively high and well-balanced precision and recall, and that the introduction of the word similarity model further improves the F-score. The approach makes use of known PPI information and thus relieves the burden of manual annotation.

Key words: Protein-protein Interaction (PPI), Relational Similarity (RS), word similarity, K-nearest Neighbor (KNN) classification, hierarchical clustering
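
The pipeline outlined in the abstract (word similarity, word clustering, a hybrid feature representation, and K-nearest neighbor classification over protein pairs) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the linkage criterion, or the feature set, so cosine similarity, average linkage, the similarity threshold, the toy context counts, and every function and variable name below are assumptions chosen for illustration only.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_words(word_vecs, threshold):
    """Average-linkage agglomerative clustering (assumed linkage).

    Merging stops once the best pair of clusters is less similar
    than `threshold`. Returns a word -> cluster-ID mapping.
    """
    clusters = [[w] for w in sorted(word_vecs)]

    def linkage(c1, c2):
        sims = [cosine(word_vecs[a], word_vecs[b]) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return {w: f"C{n}" for n, c in enumerate(clusters) for w in c}


def hybrid_features(words, word2cluster):
    """Bag of words plus bag of cluster IDs for one protein pair."""
    feats = Counter(words)
    feats.update(word2cluster[w] for w in words if w in word2cluster)
    return feats


def knn_predict(query, training, k):
    """Majority label among the k training pairs most similar to query."""
    ranked = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical context counts standing in for corpus statistics.
word_vecs = {
    "binds": {"protein": 3, "complex": 2},
    "interacts": {"protein": 3, "complex": 1},
    "expressed": {"tissue": 4, "cell": 1},
    "detected": {"tissue": 3, "cell": 2},
}
w2c = cluster_words(word_vecs, threshold=0.8)

# Hypothetical labeled protein pairs, each described by its context words.
train = [
    (hybrid_features(["expressed", "together"], w2c), "no-interaction"),
    (hybrid_features(["binds", "directly"], w2c), "interaction"),
]
query_words = ["interacts", "strongly"]
query = hybrid_features(query_words, w2c)

word_only_sim = cosine(Counter(query_words), Counter(["binds", "directly"]))
hybrid_sim = cosine(query, train[1][0])
prediction = knn_predict(query, train, k=1)
```

In this toy setting the query shares no surface word with either training pair, so a word-only comparison gives zero similarity to both. Because "interacts" and "binds" fall into the same cluster, the hybrid representation gives the query a nonzero similarity to the interacting pair, which is the kind of effect the abstract attributes to introducing word similarity.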

CLC number: