Research on Clustering and Sparse-Point Detection Algorithms Based on Protein-Protein Interaction Networks

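English title: The Clustering and the Isolated Points' Detection Based on the Protein-protein Interaction Network
Author: Peng Lihong (彭利红)
Degree level: Master's
Discipline: Computer Architecture
Keywords (translated from Chinese): PPI network; K-means clustering; similarity; AAMV method; sparse-point detection
Keywords (English): PPI network; k-means clustering algorithm; similarity; AAMV method; isolated-point detection
Degree year: 2008
Advisor: Liao Bo (廖波)
Discipline code: 081201
Degree-granting institution: Hunan University (湖南大学)
Thesis submission date: 2008-05-10
Defense committee chair: Kuang Jishun (邝继顺)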

Abstract

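With the completion of human genome sequencing, the study of protein structure and function has become a major focus of genomics research. Studies have shown that proteins rarely act as isolated individuals within their functional groups; they generally interact with proteins of similar function. The function of a protein can therefore be predicted by studying the protein-protein interaction (PPI) network. This thesis investigates two topics based on the PPI network, clustering and sparse-point detection: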
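     A protein function prediction algorithm is proposed: a K-means clustering algorithm based on the Arithmetic Average Minimum Value (AAMV). First, from the interactions among proteins in the PPI network graph related to human Alzheimer's disease (AD), the association matrix of the proteins is derived. Next, the AAMV method is used to obtain the similarity matrix. Then, because the sum-of-squared-errors criterion does not make the clustering converge well, a weighted sum-of-squared-errors criterion is proposed. Finally, on the basis of the similarity matrix, the weighted criterion is used to achieve effective convergence, and K-means clustering is applied to cluster the proteins in the PPI network and predict their functions.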
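The record does not give the AAMV formula. As a minimal sketch, assuming AAMV is the arithmetic-mean-minimum similarity commonly used in fuzzy clustering, s(i, j) = Σ_k min(a_ik, a_jk) / (½ Σ_k (a_ik + a_jk)), the first two steps could look as follows. The function names, the toy edge list, and the choice to put 1 on the diagonal of the matrix are illustrative assumptions, not details taken from the thesis.

```python
def association_matrix(n, edges):
    """Build a symmetric 0/1 matrix from undirected interaction pairs."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1
    for i in range(n):
        a[i][i] = 1  # assumption: each protein counts as its own neighbour
    return a


def aamv_similarity(a):
    """Arithmetic-mean-minimum similarity between every pair of rows."""
    n = len(a)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            num = sum(min(x, y) for x, y in zip(a[i], a[j]))
            den = 0.5 * sum(x + y for x, y in zip(a[i], a[j]))
            s[i][j] = num / den if den else 0.0
    return s


# Hypothetical toy network: proteins 0-3 interact, protein 4 has no partner.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
sim = aamv_similarity(association_matrix(5, edges))
```

Proteins 0 and 1 have identical interaction profiles here, so their similarity is 1, while protein 4 shares no neighbour with the others and scores 0 against them. The weighted sum-of-squared-errors criterion is not specified in the record, so the K-means step is left out of the sketch.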
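     A sparse-point detection algorithm for PPI networks is proposed: the weighted similarity-coefficient-sum algorithm. First, from the interactions among proteins in the PPI network graph related to human Alzheimer's disease (AD), the association matrix of the proteins is derived. Next, the max-min method is used to obtain the similarity coefficient matrix. Then, because the plain similarity coefficient sum does not detect sparse points in the PPI network well, a weighted similarity-coefficient-sum method is proposed on the basis of the similarity coefficients. Finally, given an input threshold, the similarity-coefficient-sum algorithm identifies the sparse proteins in the PPI network.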
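The steps of this paragraph can be illustrated with a small sketch. It assumes that the max-min method is the usual fuzzy-clustering coefficient r(i, j) = Σ_k min(a_ik, a_jk) / Σ_k max(a_ik, a_jk). The thesis's weighting scheme is not given in the record, so uniform weights of 1/(n-1) are used as a stand-in; the function names and the toy matrix are likewise hypothetical.

```python
def max_min_similarity(a):
    """Max-min similarity coefficient between every pair of rows."""
    n = len(a)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            num = sum(min(x, y) for x, y in zip(a[i], a[j]))
            den = sum(max(x, y) for x, y in zip(a[i], a[j]))
            r[i][j] = num / den if den else 0.0
    return r


def sparse_points(r, threshold, weights=None):
    """Flag points whose weighted similarity sum falls below the threshold."""
    n = len(r)
    if weights is None:
        # Assumption: uniform weights, i.e. the mean similarity to the others.
        weights = [1.0 / max(n - 1, 1)] * n
    scores = [
        sum(weights[j] * r[i][j] for j in range(n) if j != i)
        for i in range(n)
    ]
    flagged = [i for i, s in enumerate(scores) if s < threshold]
    return flagged, scores


# Hypothetical association matrix with 1 on the diagonal: proteins 0-3
# interact, protein 4 has no interaction partner.
a = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
flagged, scores = sparse_points(max_min_similarity(a), threshold=0.1)
```

With this toy input only protein 4 falls below the threshold. Other weightings can be supplied through the `weights` argument, and the resulting score scale, and therefore the useful threshold range, depends on that choice.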
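     Using the human AD-related PPI network graph, the AAMV-based K-means algorithm is applied to cluster the proteins in the graph; the results are very similar to those obtained with the Maryland Bridge method and the Korbel method. The clustering results are then used to predict the functions of four isolated proteins. In addition, the weighted similarity-coefficient-sum algorithm is used to detect sparse proteins in the graph; the experiments show that accuracy is relatively high when the input threshold lies between 0.01 and 0.16.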
With the completion of a draft sequence of the human genome, the field of genetics stands on the threshold of significant advances. Crucial to furthering these investigations is a comprehensive understanding of the structure and function of proteins. It has been observed that proteins seldom act as single isolated species in the performance of their functions; rather, proteins involved in the same cellular processes often interact with each other. Therefore, the functions of unknown proteins can be predicted through comparison with the interactions of similar known proteins in the protein-protein interaction (PPI) network.
     Clustering is the process of grouping data objects into clusters such that objects in the same cluster are more similar to one another than to objects in different clusters. The PPI network contains a large number of interactions. Clustering results can suggest possible functions for cluster members whose functions were previously unknown.
     This thesis discusses clustering and isolated-point detection based on the PPI network. It begins with PPI-based clustering algorithms and evaluates a novel technique for clustering and detecting isolated points in PPI networks, which iteratively refines clusters by combining similarity-based k-means clustering with the arithmetic average minimum value. The association matrix and the similarity matrix are obtained. If the similarity between two elements is high, the spatial distance between them should be short. The algorithm is found to be effective at detecting clusters and identifying isolated points in the PPI network graph related to human Alzheimer's disease.
     The algorithm outperforms competing approaches and can effectively predict the functions of proteins whose function is unknown.
     The isolated-point detection algorithm is then applied to the PPI network to find its isolated points.

