Research on Gene Recognition and Microarray Data Recognition Algorithms
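Abstract
This dissertation first describes and analyzes the current state of research, active topics, and development trends in gene recognition, essential gene recognition, and microarray data recognition. On that basis, it focuses on gene recognition algorithms and on algorithms for recognizing noisy microarray data. For gene recognition, it proposes a radial basis function (RBF) neural network model and algorithm based on equitable weights and applies the algorithm to gene recognition. For essential gene recognition, it applies artificial neural networks and support vector machines to the sequence features of essential genes and obtains good results. For microarray noise data recognition, it studies the identification of mislabeled samples and abnormal samples in microarray gene-expression cancer data and proposes two effective recognition algorithms, Generalized CL-stability and Generalized CL-stability with exclusion. Experiments verify the effectiveness and feasibility of the new methods. The main contributions and contents of this dissertation are as follows:
     (1) A systematic review of research on gene recognition, essential gene recognition, and microarray noise data recognition in bioinformatics.
     (2) A description of the machine learning theory underlying gene recognition and microarray noise data recognition.
     (3) For protein-coding genes, a gene recognition model and algorithm based on equitable weights and an RBF neural network.
     (4) Recognition of essential genes using artificial neural networks and support vector machines.
     (5) For the problem of identifying mislabeled and abnormal samples in microarray gene-expression cancer data, two new algorithms based on support vector machine theory: Generalized CL-stability and Generalized CL-stability with exclusion.
     The results of this dissertation enrich applied research in machine learning theory. The work has theoretical significance and practical value in combining evolutionary computation with neural networks, in the structural design and parameter learning of neural networks, and in improving and optimizing support vector machine learning. It provides useful methods for practical research on gene recognition, essential gene recognition, and microarray noise data recognition, and it helps advance related research in molecular biology and medicine.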
Bioinformatics is an interdisciplinary field that draws on biology, computer science, statistics, mathematics, physics, and chemistry. It arose with the Human Genome Project at the end of the 1980s and has since passed through three periods: the pre-genome era, the genome era, and the post-genome era. Current bioinformatics research has shifted from data storage, classification, and search toward genome analysis, proteome analysis, comparative analysis, data integration, and systems biology approaches. Genome annotation is a long-term goal, and gene recognition and gene function recognition are both parts of it. Microarrays are a powerful tool for gene function annotation, but microarray datasets typically have few samples, high dimensionality, and noise. A reliable tool for recognizing noisy microarray data would therefore help advance molecular biology. The aim of this dissertation is to find good solutions to gene recognition, essential gene recognition, and mislabeled sample recognition.
     Machine learning studies how to simulate human learning. It draws on several sources, including artificial intelligence, computational intelligence, statistics, mathematics, psychology, philosophy, adaptive control theory, informatics, and biology. It touches nearly every domain of human cognition and has been well received. Combining related machine learning methods so that their strengths complement one another, and then proposing new models and algorithms on that basis, can effectively advance gene recognition and microarray data recognition. This dissertation addresses gene recognition and microarray data recognition, and the proposed algorithms belong to the field of machine learning.
     Based on a comprehensive analysis of the current state of research, open topics, and development trends in gene recognition and microarray data recognition, we focus on recognition algorithms for gene recognition and for noisy microarray data. For gene recognition, we propose a novel algorithm model based on equitable weights and apply it, together with machine learning algorithms, to gene recognition. For essential gene recognition, we use several artificial neural networks and support vector machines and obtain good results. After analyzing the features of microarray noise data, we propose two algorithms for recognizing noisy data in microarrays, called Generalized CL-stability and Generalized CL-stability with exclusion. The main contributions and contents are as follows:
     (1) We review research on gene recognition and microarray noise data recognition in bioinformatics, covering the background, applications, current state of research, challenges, and development trends of each. This review lays the foundation for the rest of the work.
     (2) We introduce the machine learning theory relevant to gene prediction and microarray noise data recognition, including the structural design and learning theory of RBF neural networks, the principles of evolutionary computation, and statistical theory.
     (3) For protein-coding genes, we propose a gene recognition algorithm model based on equitable weights and an RBF neural network. The algorithm adopts a fusion strategy that integrates three well-known gene recognition programs. Tests on three standard datasets show that the proposed algorithm is feasible and effective.
     (4) For essential genes, we apply two machine learning approaches, artificial neural networks (ANN) and support vector machines (SVM), and obtain good results. Six types of ANN and two types of SVM are used. The experimental results show that these algorithms can be used to recognize essential genes.
     (5) For microarray noise data recognition, we propose two SVM-based algorithm models that recognize and correct mislabeled samples and abnormal samples in microarray data: Generalized CL-stability and Generalized CL-stability with exclusion. Both algorithms are based on the stability of each sample. Their benefit lies not only in accuracy but also in the additional information they provide about which samples are mislabeled and which are abnormal. We test the algorithms on microarray datasets and a synthetic dataset. The experimental results show that the algorithms achieve good accuracy, better than that of other existing algorithms.
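     The idea of scoring each sample by its stability, mentioned in item (5), can be illustrated with a small sketch. This is not the dissertation's Generalized CL-stability algorithm, whose details are not given in this abstract. It only shows one plausible reading of per-sample stability: hold a sample out, flip the label of one other sample at a time, retrain, and count how often the held-out sample is still classified with its given label. A nearest-centroid classifier stands in for the SVM so the sketch needs no external libraries; the function names, the 0/1 label encoding, and the toy data are all assumptions made for illustration.

```python
# Illustrative sketch only: per-sample stability under label perturbation.
# Assumptions: binary labels 0/1, and a nearest-centroid classifier used
# in place of the SVM named in the abstract.

def centroid_predict(train_X, train_y, x):
    """Return the label whose class centroid is closest to x."""
    sums, counts = {}, {}
    for point, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(point))
        for d, value in enumerate(point):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label in sorted(sums):
        centroid = [s / counts[label] for s in sums[label]]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def stability_scores(X, y):
    """For each sample i, the fraction of single-label flips (of other
    samples) after which i, held out of training, is still predicted
    to have its given label y[i]. Low scores mark suspect samples."""
    n = len(X)
    scores = []
    for i in range(n):
        agree = 0
        for j in range(n):
            if j == i:
                continue
            perturbed = list(y)
            perturbed[j] = 1 - perturbed[j]  # flip one other label
            train_X = [X[k] for k in range(n) if k != i]
            train_y = [perturbed[k] for k in range(n) if k != i]
            if centroid_predict(train_X, train_y, X[i]) == y[i]:
                agree += 1
        scores.append(agree / (n - 1))
    return scores


# Toy data: two well-separated clusters; sample 4 lies in the first
# cluster but carries the second cluster's label.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
     (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

scores = stability_scores(X, y)
suspect = scores.index(min(scores))  # sample 4 gets the lowest score
```

     On this toy data the deliberately mislabeled sample is the only one whose given label is never reproduced, so it receives the lowest score. Real microarray data would need a proper classifier, feature selection, and a threshold on the score, none of which this sketch addresses.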
     The research in this dissertation enriches the study of applied machine learning theory. It is significant for applications such as combining evolutionary computation with neural networks, designing neural network structures and learning their parameters, and improving and optimizing support vector machines. Furthermore, it provides useful methods and strategies for applying gene prediction and microarray noise data recognition. We hope these algorithms will benefit research in biology and medicine.