最优化方法在生物序列比对中的应用与研究

设为首页

收藏本站

网站地图 | English | 公务邮箱

远程访问

NSTL服务站

最优化方法在生物序列比对中的应用与研究

详细信息本馆镜像全文| 推荐本文 | | 获取CNKI官网全文

英文题名：Application and Study of Optimal Methods in Bio-Sequences Alignment
作者：陈科
论文级别：博士
学科专业名称：计算机应用技术
中文关键词：生物序列比对 ; 自适应LCNN ; 空位种子 ; 插入-删除种子
英文关键词：bio-sequence alignment ; adaptive Lagrange Constraint Neural Network ; spaced seed ; in-del seed
学位年度：2010
导师：朱清新
学科代码：081203
学位授予单位：电子科技大学
论文提交日期：2010-09-08

摘要

序列比对是生物信息学中最重要和最基础的研究方向,是研究物种间同源性关系的重要手段。随着生物序列数据的飞速增长,如何提高序列比对速度和灵敏度成为生物序列比对研究需要迫切解决的问题。
     本文将最优化方法应用到生物序列比对中,从而提高序列比对的效率。主要研究内容和取得的成果如下:
     1.提出基于拉格朗日约束神经网络(LCNN)的自适应生物序列比对方法。把数字信号处理与生物序列分析融入到一起,通过建立风险函数并根据最优原则获得生物序列相关性指标,得到序列比对结果。
     2.研究空位种子(Spaced Seed)理论和灵敏度计算模型,并在此基础上提出了基于最优搜索的空位种子寻找和计算方法,实现在有限时间资源限制条件下以最大概率寻找到具有最高灵敏度的空位种子,从而大大提高空位种子的计算效率。
     3.构造与空位种子相关的重叠有向图(Overlap Digraph)模型,根据重叠有向图权值函数提出空位种子优劣判断准则。通过实验可以证明重叠有向图模型可以在很短的时间内得到灵敏度最优或者接近最优的空位种子。
     4.在前人研究的基础上,进一步对插入-删除种子进行更为深入的研究,并从数学上对插入-删除种子(In-del Seed)进行定义,建立插入-删除种子灵敏度计算模型。提出了基于种子重叠复杂度的计算方法,并通过flip函数对候选种子进行构造。该方法能够在较短时间内找到给定权值和相似度等参数下的最优插入-删除种子,并从实验上证明插入-删除种子具有更高的灵敏度,同时给出在权值从9到15的最优插入-删除种子的计算结果。
     本文研究的内容主要是针对生物序列比对,将最优化理论和方法应用到比对过程中,并在现有算法的基础上,提出新的序列比对算法和模型,为实现快速、高效的生物序列比对提供新的思路和方法。经过实验测试,算法在灵敏度上等同于最优或非常接近最优结果,但在计算时间和效率上大大提高,可以为生物信息学的相关研究提供一定的支持和帮助。
Biological sequences alignment is the most important and basic research field in bioinformatics. It is also an important method of studying the homology between species. With the rapid increase of bio-sequence data, how to improve both efficiency and sensitivity has become an urgent problem in sequence alignment.
     Optimal methods are applied to improve the efficiency of sequence alignment in this paper. The main contents and contributions can be briefly summarized as follows:
     1. Self-adaptive sequence alignment method, which is based on Lagrange Constraint Neural Network (LCNN) and digital signal methods, is applied to sequence alignment. Based on risk function and optimal criteria, performance index is calculated to measure the homology of the sequences.
     2. Theory of spaced seeds and sensitivity model are studied. Optimal search method is proposed to find optimal spaced seeds with maximum probability with limited time resources. By this method, calculation efficiency of spaced seeds can be dramatically improved.
     3. The Overlap Digraph model, which is dependent on the structure of spaced seed, is constructed and criterion of judging spaced seeds’quality is proposed based on overlap digraph weight function. The experimental results show that overlap digraph model can obtain optimal or near-optimal spaced seeds in very short time.
     4. With the former achievements and deeper study of in-del (insertion and deletion) seeds, mathematical definitions about indel seeds are proposed, and sensitivity calculation model is also constructed. Overlap complexity method is proposed to calculate in-del seeds, and“flip”function is utilized to select candidate seeds. Given the parameters of weight and similarity, using this method can find optimal seed in quite short time. Experimental results show that in-del seeds are more sensitive than spaced seeds. Based on the method, the optimal in-del seeds with weight from 9 to 15 are calculated.
     The main contents of this paper are about applying the optimal theories and methods to bio-sequence alignment. Based on the known algorithms, new alignment methods and models are proposed to provide new ideas to realize high speed and effective sequence alignment. By testing on biologic data, the arithmetic is equal or very close to the optimal results, but the calculation time and efficiency are improved. It can suggest some supports and helps for bioinformatics research.

引文

[1] E.Krane, L.Raymer. Fundamental Concepts of Bioinformatics. New York: Pearson Education. 2002, 2-20
    [2] J.D.Watson, F.H.C.Crick. Genetical implications of the structure of deoXyribonucleic acid. Nature, 1953, 171(4361):964-967
    [3] M. Lesk. Introduction to Bioinformatics. New York: OXford University Press Inc. 2002, 4-9
    [4] National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/
    [5] D.W. Mount. Bioinformatics: Sequence and Genome Analysis. New York: Cold Spring Harbor Laboratory Press, 2001, 2-5
    [6] S. Needleman, C Wunsch. A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 1970, 48(3): 443-453
    [7] D. Sankoff. Matching sequences under deletion/insertion constraints. Proc. Natl. Acad. Sci. USA, 1972, 69(1): 4-6
    [8] T. Smith, M. Waterman. Identification of common molecular sequence. Journal of Molecular Biology, 1981, 147(1): 195-197
    [9] W. Goad, M. Kanehisa. Pattern recognition in nucleic acid sequences. I. A general method for finding local homologies and symmetries. Nucleic Acids Research, 1982, 10(1): 247-263
    [10] W. Wilbur, D. Lipman. Rapid similarity searches of nucleic acid and protein data banks. Proc.Natl. Acad. Sci. USA, 1983, 80(3):726-730
    [11] D. Lipman, W. Pearson. Rapid and sensitive protein similarity searches. Science, 1985, 227(4693): 1435-1441
    [12] S. Altschul, W. Gish, W. Miller, et al. Basic local alignment search tool. Journal of Molecular Biology, 1990, 215(3):403-410
    [13] S. Altschul, T. Madden, A. Schaaeffer, et al. Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Research, 1997, 25(17): 3389-3402
    [14]李昭.生物序列相似性比较算法的研究: [博士学位论文].北京:中国科学院计算所, 2003
    [15] M. Li, B. Ma. PatternHunter: faster and more sensitive homology search. Bioinformatics, 2002, 18(3): 440-445
    [16] M. Li, B. Ma, D. Kisman, et al. PatternHunter II: Highly Sensitive and Fast Homology Search. J Bioinform Comput Biol, 2004, 2(3):164-175
    [17] EMBL北大镜像点. http://www.cbi.pku.edu.cn/ GenBank
    [18]李昭.存储约束条件下的序列联配算法.微电子学与计算机, 2002, 19 (6): 1-5
    [19] Z. Jun, S. Jun and E. L. Charles. Bayesian adaptive sequence alignment algorithms. Bioinformatics. 1998, 14(1):25-39
    [20] U. Muckstein, I. L. Hofacker and P. F. Stadler. Stochastic Pairwise Alignment, Bioinformatics. 2002, 18:153-160
    [21] P.D Cristea. Independent Component Analysis for Genetic Signals. International Biomedical Optics Symposium, San Jose: USA, 2001, 20-26.
    [22] P. D. Cristea. Conversion of nucleotides sequences into genomic signals. J. Cell .Mol. Med. 2002, 6(2):279-303
    [23] P. D. Cristea. Large scale features in DNA genomic signals. Signal Processing. 2003, 83(4):871-888
    [24] H. Chafia, I. Cosic. Protein sequence comparison based on the wavelet transforms approach. Protein Engineering, 2002, 15(3):193-203
    [25] I.Cosic, Macromolecular Bioactivity: Is It Resonant Interaction Between Macromolecules? Theory and Applications. IEEE Trans. Biomed. Eng., 1994, 41(12):1101-1114
    [26] Pierre Baldi. Bioinformatics: The Machine Learning Approach. Canterbury: MIT Press, 2001, 113-163
    [27] S. Rajko and S. Aluru. Space and Time Optimal Parallel Sequence Alignments. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2004, 15(12):1070-1081
    [28] S. Aluru, N. Futamura, and K. Mehrotra. Parallel Biological Sequence Comparison Using PrefiX Computations. Journal of Parallel and Distributed Computing. 2003, 63(3): 264-272
    [29] C.E.R. Alves, E.N. Caceres, and F. Dehne. Parallel Dynamic Programming for Solving the String Editing Problem on a CGM/ BSP. Proc. ACM Symp. Parallel Algorithms and Architectures, 2002, 275-281
    [30] D. Feng, R. Doolittle. Progressive alignment of amino acid sequence and construction of phylogenetic trees from them. Methods in Enzymology, 1996, 266: 368-382
    [31] D. Higgins, A. Bleasby, R. Fuchs. CLUSTAL V: improved software for multiple sequence alignment. Comput. Appl. Biosci., 1992, 8: 189-191
    [32] H. G. Higgins, J. D. Thompson, T. J. Gibson. Using clustal for multiple sequence alignments. Methods Enaymol, 1996, 266:383-402
    [33]计宏凯,周晴,闻芳.针对基因选择性剪接的多序列比对算法研究.清华大学学报(自然科学版), 2001, 41 (9): 111-114
    [34]袁激光,金人超,李红涛.基于A*算法的启发式算法求解多序列比对问题.华中科技大等学报(自然科学版), 2003, 31(9) : 50-52
    [35] H, Szu. ICA-an enabling technology for intelligent sensory processing. IEEE Circuits and System, 1999, 12(3): 14-41
    [36] H, Szu, I.Kopriva. Comparison of Lagrange constrained neural network with traditional ICA methods. Neural Networks, 2002, 2(13): 466-471
    [37]余堃,蒲红梅,周明天.自适应多目独立成分分析.电子科技大学学报, 2007, 36(1): 11-13
    [38] Q.X.Zhu, M.T.Zhou, John Oommen. Some Results on Optimal Search in Discrete and Continuous Spaces. Journal of Software, 2001,12(12):1748-1751
    [39] Q.X.Zhu. Optimal Search Problems in Discrete and Continuous Spaces. Ottawa: Carleton University Press, 1996, 25-40
    [40]朱清新,《离散和连续空间中的最优搜索理论》.北京:科学出版社, 2005, 24-36
    [41] D. Mak, Y. Gelfand, G. Benson. Indel seeds for homology search. Bioinformatics, 2006. 22(14): 341-349
    [42] M. Dayhoff, R.M.Schwatz and B.C.Orutt, A model of evolutionary change in proteins, Atlas of Protein Sequence and Structure, 1978, 5(3):345-352
    [43] S. Henikoff and J.G.Henikoff. Amino acid substitution matrices from protein blocks. Proc. Nat. Acad. Sci., 1992, 89(22):10915-10919
    [44] E. Joseph, O. Ian and A. Mark, Basic Local Alignment Search Tool, California: O'Reilly, 2003
    [45] G. Kauer, H. Blocker, Applying signal theory to the analysis of biomolecules. Bioinformatics, 2003, 19:6, 2016-2021
    [46] Z. Karu, Signals and Systems. ZiZi press, 1995, 20-40
    [47]饶妮妮,邱丽君. DNA序列数值映射方法的研究.生物医学工程学杂志, 2005, 22(4): 681-685
    [48]王宏漫,欧宗瑛.一种新的DNA序列映射规则及其分析应用.信号处, 2002, 18(2): 133-136
    [49] R. Edward, R. Dougherty and I. Shmulevich et al. Genomic Signal Processing and Statistics. Hindawi Publishing Corporation, 2005, 15-60
    [50] I. Cosic. The Resonant Recognition Model of Macromolecular Activity. Bioactivity. Birkhauser, 1997, 27-45
    [51] I.Cosic, Macromolecular Bioactivity: Resonant interaction between macromolecules. IEEE Trans. Biomed. Eng., 1994, 41(12):1101-1114
    [52] V. Veljkovic and I. Slavic. Simple General-Model Pseudopotential. Phys. Rev. Lett., 1972 29(5):105-107
    [53] Pierre Baldi. Bioinformatics: The Machine Learning Approach. MIT Press, 2001: 108-154
    [54] A. J. Bell and T.J.Sejnowski. An information maXimization approach to blind separation and blind deconvolution. Neural Computation, 1995, 7(6): 1129-1159
    [55] S.Amari, A.Cichocki. Adaptive Blind Signal and Image Processing. Wiley Press,2002, 296-355.
    [56] Hyvarinen, A.Karhunen, Oja. Independent Component Analysis. Wiley Press, 2001:154-188
    [57] P. Comon. Independent Component Analysis, a new concept? 1994, Signal Processing 36(3):287-314
    [58] S. Amari, A. Cichocki and H.H. Yang. Recurrent neural networks for blind separation of sources. Symposium Nonlinear Theory and its Applications. 1995, 1:37-42
    [59] Szu H. ICA-An enabling tech. for intelligent sensory processing [J]. IEEE Circuits and System. 1999, 12: 14-41
    [60] Szu H,I.Kopriva. Comparison of Lagrange constrained neural network with traditional ICA methods. Neural Networks,2002, 2(13):466-471
    [61] Kun She,Pu Hongmei etc. Adaptive Independent Component Analysis Under Multisensing, Journal of University of Electronic Science and Technology of China, 2007, 36(1):11-13
    [62] J.Fan, R.Li, Variable selection via nonconcave penalized likelihood and its oracle properties [J]. Journal of the American Statistical Association, 2001, 96(456):1348-1360
    [63] G.E.Crooks, R.E.Green and S.E.Brenner, Pairwise alignment incorporating dipeptide covariation, Bioinformatics, 2005, 21(19):3704-3710
    [64] Amari. Natural gradient works efficiently in learning. Neural Computation, 1998, 10(2): 251-276
    [65] S.Amari, A.Cichocki. Adaptive Blind Signal and Image Processing. Wiley Press,2002, 343-346
    [66] J.Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 2001, 17(5):419-428
    [67] S.Burhardt, J.Karkkainen. Better filtering with gapped q-grams. Proceeding of the 12th AnnualSymposium on Combinatorial Pattern Matching, 2001, 73-85
    [68] A. Califano, I.Rigoutos. FLASH: fast look-up algorithm for string homology. Technical Report, IBM T. J. Watson Research Center, 1995
    [69] P.Pevzner, M.S.Waterman. Multiple filtration and approXimate pattern matching. Algorithmica, 1995, 13(1-2), 135-145
    [70] K.P.Choi, L.Zhang. Sensitivity analysis and efficient method for identifying optimal spaced seeds. Journal of Computer and System Science, 2004, 68(1), 22-40
    [71] R.Karp, M.O.Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Develop. 1987, 31(202), 249-260
    [72] B.E.Blaisdell, A measure of the similarity of the sets of sequences not requiring sequence alignment. Proc. Natl. Acad. Sci. USA, 1986, 83(14):5155-5159
    [73] J.P.Dumas, J.Ninio. Efficient algorithms for folding and comparing nucleic acid sequences. Nucleic Acid Res. 1982, 10(1):197-206
    [74] B.Ma, J.Tromp. PatternHunter-faster and more sensitive homology search. Bioinformatics, 2002, 18(3):440-445
    [75] B.Ma, M.Li. PatternHunterII: Highly Sensitive and Fast Homology Search. Genome Informatics, 2003, 14(3):164-175
    [76] U.Keich, M.Li, B.Ma. On Spaced Seeds for Similarity Search. Discrete Appl. Math., 2004, 138(3):253-263
    [77] L.Zhang. Superiority of Spaced Seeds. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2007, 4(3): 496-505
    [78] M.Li, B.Ma. Superiority and CompleXity of the Spaced Seeds. SODA'06, 2006, 444-453
    [79] G.Kucherov, L.Noe, and Y.Ponty. Estimating seed sensitivity on homogeneous alignments, Proc. IEEEE 4th Symp. On Bioinformatics and bioengineering, Taiwan, 2004, 387-394
    [80] D.Kisman, M.Li, B.Ma. Gapped fast and sensitive translated homology search. Bioinformatics, 2005, 21(4):542-544
    [81] L.Noe and G..Kucherov. YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acid. Res. 2005, 33(13):540-543
    [82] Z.Ning, A.J.CoX. SSAHA: A fast search method for large DNA database. Genome Res. 2001, 11(4):1725-1729
    [83] J.Xu, D.Brown. Optimizing multiple multiple spaced seeds for homology similarity search. Proc of CPM’04, 2004, 47-58
    [84] Y.Sun and J.Buhler. Designing multiple simultaneous seeds for DNA similarity search. Proc of RECOMP’04, 2004, 76-85
    [85] K.Choi, F.F.Zeng and L.Zhang. Good spaced seeds for homology search. Bioinformatics, 2004, 20(7):1053-1059
    [86] L.Zhang. Superiority of Spaced Seeds for Homology Search. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2007, 4(3):496-505
    [87] L.Ilie, S.Ilie. Long Spaced Seeds for finding similarities between biological sequences. 2007, BIOCOMP’07, 30:3-8
    [88] Mak D, Gelfand Y, Benson G, Indel seeds for homology search. Bioinformatics, 2006. 22(14): 341-349.
    [89] L.Ilie, S.Ilie. Multiple spaced seeds for homology search. Bioinformatics, 2007. 23(22), 2969-2977.
    [90] S.Aki, H.Kuboki, K.Hirano. On discrete distributions of order k, Ann. Inst. Statist. 1984, 36(1):431-440
    [91] S.F.Altschul, T.L.Madden, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 1997, 25(17):389-402
    [92] F.P.Prearata, L.Zhang and K.P.Choi, Quick, practical selection of effective seeds of homology search, J.Comput. Biol, 2005, 12(11):37-52
    [93] Y.Kong. Generalized correlation functions and their applications in selection of optimal multiple spaced seeds for homology search. J. Comput. Biol., 2007, 14:238-254
    [94] L.Noe and G.Kucherov. Improved hit criteria for DNA local alignment.BMC Bioinformatics, 2004, 5(1):149-160
    [95] J.Buhler, U.Keich and Y.Sun. Designing Seeds for Similarity Search in Genomic DNA. Proc. of 7th annual international conference on Res. in computational molecular biology, 2003, 67-75

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700