Research on Statistical Methods Based on Cancer Genome Sequencing Data
Abstract
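In recent years, with the rapid development of next-generation sequencing technology, several sequencers based on massively parallel cyclic-array sequencing have come to market and been widely adopted. Small laboratories can now independently carry out projects that previously only large sequencing centers could undertake. Next-generation sequencing has been widely applied in biological research and has produced notable scientific results. Compared with traditional sequencing technology, it greatly reduces sequencing cost and markedly improves efficiency. However, it still has certain disadvantages in read length and sequencing accuracy.
     Large volumes of genome sequencing data are now easier to obtain than ever, but how to analyze such massive data efficiently, and how to obtain accurate statistical inference and effective statistical tests, remains a major challenge for statistical methodology. This work uses exome sequencing data from lung cancer samples to study the relationship between somatic gene mutations and cancer. It comprises: inferring the genotype of a study subject or individual from sequencing data; estimating the mutation rate or loss-of-heterozygosity rate at each genomic locus; testing whether each locus carries a somatic mutation; and identifying driver genes whose mutation may directly affect cancer onset, together with exploring interactions among driver genes.
     Genotype inference from sequencing data faces two main difficulties: sequencing error and sample mixture. Existing software generally relies on Bayesian discriminant analysis based on the binomial distribution and usually ignores sample mixture, which can lead to underestimated mutation rates and missed true variant loci. Our method introduces several parameters, including the mutation rate and sequencing error rate of each locus and the mixing ratio of each tumor sample. It builds binomial likelihood models along both the locus dimension and the tumor-sample dimension, obtains maximum likelihood estimates of the parameters with the EM method, and uses posterior probabilities to call genotypes. Simulations show that our method is more accurate than the traditional Bayesian method and that the EM algorithm runs faster than other estimation methods. The simulations also confirm that introducing the sample mixing-ratio parameter is necessary and reasonable. On real data, when the sample mixing ratio is taken into account, our method recovers most of the mutation loci found by existing software and additionally finds new loci that may be mutated.
     Alongside genotype inference, our model also yields an estimate of the mutation rate at each locus. We use a likelihood ratio test on the mutation-rate parameter to decide whether a locus is a somatic single nucleotide variant (somatic SNV). Simulations illustrate the factors that affect the power of the test and confirm both the soundness of the test and the effectiveness of our iterative algorithm based on maximum likelihood estimation. On real data, we found some new SNVs that may be somatic mutations. Similarly, once a loss-of-heterozygosity (LOH) rate is introduced into the likelihood, we can estimate the LOH rate at a locus and carry out the corresponding statistical test. The effectiveness of including LOH on real data, however, still merits further investigation.
     To find driver genes that may directly cause cancer, we classified somatic mutations by their effect on protein function and by the mutated base pair, and counted, for each tumor sample, the number of mutations of each type in each gene, which gives the count data. Taking mutation type, gene length, and sample-specific background mutation rate into account, we built a likelihood model based on the Poisson distribution, introduced an offset coefficient for the χ² mixture distribution under the null hypothesis, and constructed a multivariate likelihood ratio test under boundary conditions. Simulations show that our method has higher power than existing methods based on the Bernoulli distribution. On real data, we find more driver genes, and the findings are biologically interpretable. The driver-gene test can also be applied flexibly to biological pathways or gene sets.
     The study of interactions between genes has been a challenging and active topic in recent years. We run Monte Carlo simulations using the parameters estimated in the driver-gene test and thereby obtain a test for pairwise gene interactions. Testing interaction effects of order three or higher is more difficult; multifactor dimensionality reduction is one approach that could be tried. Simulations show that our method removes the confounding effect of gene length more effectively than a simple permutation test, and real-data analysis shows that interaction tests can help identify key genes in certain cancer pathways.
     The data studied in this work comprise exome sequencing data from 249 lung cancer patients (source: TCGA) and the count data of gene variant loci derived from them.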
Over the past several years, with the development of next-generation sequencing (NGS) technology, several platforms based on cyclic-array sequencing have appeared and been widely adopted. Individual investigators can now pursue projects that were once accessible only to major genome centers. NGS has been widely used in biological research and has produced significant scientific results. Compared with traditional sequencing technology, NGS dramatically reduces sequencing cost and significantly improves sequencing efficiency. It still has disadvantages, however, such as shorter read length and a higher sequencing error rate.
     Large amounts of sequencing data are now easier to obtain than ever. As NGS experiments continue to generate huge volumes of cancer genome sequencing data, analyzing these data poses substantial challenges, including how to perform efficient statistical analysis and how to obtain accurate statistical inference and effective statistical tests. To characterize the landscape of somatic mutations in lung cancer from whole-exome sequencing data, we carried out the following work: inferring the genotype at a specified locus in a given sample from sequencing data; estimating the mutation rate and loss-of-heterozygosity rate at a specified locus; testing a specified locus for somatic mutation; identifying driver genes whose somatic mutations may play important roles in maintaining the cancer phenotype; and exploring interaction effects between driver genes.
     Genotype inference from sequencing data faces two main difficulties: sequencing error and mixture of the DNA sample. Existing software for genotype inference generally uses Bayesian discriminant analysis based on the binomial distribution and does not take sample mixture into account, which may underestimate the mutation rate and miss real variant loci. Our approach introduces several parameters, including the mutation rate of each locus, the sequencing error rate of each locus, and the mixing ratio of each tumor sample, and develops a likelihood model based on the binomial distribution for each locus and for each sample. The maximum likelihood estimates of the parameters are then obtained by the expectation-maximization (EM) method, and genotypes are inferred from posterior probabilities. Simulation results show that our method is more accurate than the traditional Bayesian method and that the EM algorithm runs faster than other estimation methods. They also support the necessity and soundness of estimating the mixing-ratio parameter. To demonstrate the utility of the method, we applied it to whole-exome sequencing data from 249 lung tumor samples. When the mixing ratio is taken into account, our method not only found most of the somatic mutations detected by existing methods but also identified a large number of novel variants that may be involved in lung tumorigenesis.
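To make the EM step in the paragraph above concrete, the following Python sketch fits a deliberately simplified version of such a model at a single locus. It is an illustration under stated assumptions, not the thesis's actual model: each sample is either reference or carries a heterozygous somatic variant, the tumor mixing ratio (`purities`) and the sequencing error rate (`error`) are treated as known inputs rather than estimated, and only the locus mutation rate `pi` is updated. All function and parameter names are hypothetical.

```python
import math


def variant_read_prob(genotype, purity, error):
    """Probability that one read shows the variant allele.

    genotype: 0 = reference, 1 = heterozygous somatic variant in tumor cells.
    purity:   assumed tumor fraction of the mixed sample (hypothetical input).
    error:    assumed per-read sequencing error rate, strictly between 0 and 1.
    """
    frac = 0.5 * purity if genotype == 1 else 0.0  # expected variant allele fraction
    return frac * (1.0 - error) + (1.0 - frac) * error


def log_kernel(x, n, p):
    # Binomial log-likelihood without the binomial coefficient,
    # which cancels when the posteriors are normalized.
    return x * math.log(p) + (n - x) * math.log(1.0 - p)


def em_mutation_rate(variant_reads, depths, purities, error=0.01,
                     max_iter=200, tol=1e-10):
    """EM for the locus mutation rate pi, with purity and error held fixed.

    Returns (pi_hat, posteriors), where posteriors[j] is the posterior
    probability that sample j carries the variant (from the last E-step).
    """
    pi = 0.5
    post = []
    for _ in range(max_iter):
        # E-step: posterior probability of the variant genotype per sample.
        post = []
        for x, n, a in zip(variant_reads, depths, purities):
            l1 = math.log(pi) + log_kernel(x, n, variant_read_prob(1, a, error))
            l0 = math.log(1.0 - pi) + log_kernel(x, n, variant_read_prob(0, a, error))
            m = max(l0, l1)
            w1, w0 = math.exp(l1 - m), math.exp(l0 - m)
            post.append(w1 / (w0 + w1))
        # M-step: the mutation rate is the mean posterior, kept off the boundary.
        new_pi = min(max(sum(post) / len(post), 1e-9), 1.0 - 1e-9)
        converged = abs(new_pi - pi) < tol
        pi = new_pi
        if converged:
            break
    return pi, post
```

A call such as `em_mutation_rate([0, 1, 12, 0, 9], [40, 50, 60, 30, 40], [0.6, 0.6, 0.5, 0.7, 0.6])` returns the estimated rate together with per-sample posteriors, and a genotype can be called by thresholding each posterior at 0.5. Lowering a sample's purity lowers the expected variant read fraction, which is why ignoring mixture tends to miss true variants.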
     While inferring the genotype at each locus, our approach also yields the maximum likelihood estimate of the mutation rate. A likelihood ratio test on the mutation-rate parameter is then used to test whether a locus is a somatic SNV (single nucleotide variant). Simulation results identify several factors that influence the power of the test and support both the soundness of the test and the effectiveness of our iterative algorithm based on maximum likelihood estimation. On real data, we found some new loci that might be somatic SNVs. Similarly, we introduce the loss-of-heterozygosity (LOH) rate as a parameter for each locus in our model and then perform the corresponding maximum likelihood estimation and likelihood ratio test. Taking the LOH rate into account in real data analysis still needs further research.
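The likelihood ratio test on the mutation rate can be sketched as follows, again for a simplified single-locus mixture model. The assumptions are ours, not the thesis's: tumor purity and sequencing error are taken as known, the null hypothesis is `pi = 0`, and because that value lies on the boundary of the parameter space the p-value uses the standard 50:50 mixture of a point mass at zero and a chi-square distribution with one degree of freedom. Function names are hypothetical.

```python
import math


def genotype_logliks(x, n, purity, error):
    """Binomial log-likelihood kernels under 'no variant' and 'heterozygous variant'."""
    p0 = error
    frac = 0.5 * purity
    p1 = frac * (1.0 - error) + (1.0 - frac) * error
    l0 = x * math.log(p0) + (n - x) * math.log(1.0 - p0)
    l1 = x * math.log(p1) + (n - x) * math.log(1.0 - p1)
    return l0, l1


def locus_loglik(pi, pairs):
    """Mixture log-likelihood of one locus across samples as a function of pi."""
    total = 0.0
    for l0, l1 in pairs:
        if pi <= 0.0:
            total += l0
        elif pi >= 1.0:
            total += l1
        else:
            a = math.log(1.0 - pi) + l0
            b = math.log(pi) + l1
            m = max(a, b)
            total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total


def somatic_snv_lrt(variant_reads, depths, purities, error=0.01):
    """Return (pi_hat, LRT statistic, p-value) for H0: pi = 0 against pi > 0."""
    pairs = [genotype_logliks(x, n, a, error)
             for x, n, a in zip(variant_reads, depths, purities)]
    # The log-likelihood is concave in pi, so ternary search finds the maximum.
    lo, hi = 0.0, 1.0
    for _ in range(200):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if locus_loglik(m1, pairs) < locus_loglik(m2, pairs):
            lo = m1
        else:
            hi = m2
    pi_hat = 0.5 * (lo + hi)
    stat = max(0.0, 2.0 * (locus_loglik(pi_hat, pairs) - locus_loglik(0.0, pairs)))
    if stat <= 1e-12:
        return pi_hat, 0.0, 1.0
    # Boundary null: half of the chi-square(1) tail, P(chi2_1 > t) = erfc(sqrt(t / 2)).
    return pi_hat, stat, 0.5 * math.erfc(math.sqrt(stat / 2.0))
```

Read depth, tumor purity, and the number of samples carrying the variant all enter the statistic through the per-sample likelihoods, which is consistent with the abstract's remark that several factors influence the power of the test.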
     To find driver genes that may drive the cancer phenotype, we first classify somatic mutations into categories based on their functional consequences and the mutated base pair, then count the number of mutations of each type in each gene for each tumor sample. We take the mutation type, the gene length, and the individual background mutation rate into account and build a likelihood model based on the Poisson distribution. We introduce an offset coefficient for the mixture χ² distribution, which is the null distribution of the multivariate likelihood ratio test statistic under boundary conditions. Simulation results show that our method has higher power than some existing methods based on the Bernoulli distribution. Real data results show that our method finds more driver genes, and these findings are biologically interpretable. The method is also flexible and can be extended to test driver pathways or gene sets.
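The structure of a Poisson likelihood ratio test with boundary constraints can be illustrated with the sketch below. It is a simplified stand-in, not the thesis's test: the expected background counts `expected[i][k]` (sample i, mutation type k) are assumed to be supplied already, for example as background rate times gene length; each mutation type has a relative rate `theta_k` constrained to be at least 1; and the null distribution is taken to be the textbook mixture of chi-square distributions with binomial weights, which applies here because the types contribute independent Poisson terms. The offset coefficient mentioned in the abstract is not reproduced. All names are hypothetical.

```python
import math


def chi2_sf(t, df):
    """Survival function of a chi-square variable with integer df >= 1."""
    x = 0.5 * t
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(x))
    else:
        a, q = 1.0, math.exp(-x)
    # Recurrence Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1).
    while a < 0.5 * df:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def driver_gene_lrt(counts, expected):
    """Test H0: theta_k = 1 for all mutation types against theta_k >= 1.

    counts[i][k]   observed mutations of type k in sample i for one gene.
    expected[i][k] expected background count for the same cell (assumed given).
    Returns (theta_hat, LRT statistic, p-value).
    """
    n_types = len(counts[0])
    theta_hat = []
    stat = 0.0
    for k in range(n_types):
        y = sum(row[k] for row in counts)
        mu = sum(row[k] for row in expected)
        theta = max(1.0, y / mu)  # constrained maximum likelihood estimate
        theta_hat.append(theta)
        stat += 2.0 * (y * math.log(theta) - (theta - 1.0) * mu)
    if stat <= 0.0:
        return theta_hat, 0.0, 1.0
    # Mixture of chi-square distributions with weights C(K, j) / 2**K.
    p_value = sum(math.comb(n_types, j) / 2.0 ** n_types * chi2_sf(stat, j)
                  for j in range(1, n_types + 1))
    return theta_hat, stat, p_value
```

Because gene length and sample-specific background rates enter only through `expected`, a long gene with many mutations is not flagged unless its counts exceed what its length and the background predict. Applying the same function to counts summed over the genes of a pathway gives a rough analogue of the gene-set extension mentioned above.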
     Studying interactions between genes has been a challenging and active problem in recent years. We introduce a Monte Carlo simulation procedure, based on the parameters estimated in the driver-gene test, to test the interaction between two genes. Finding interaction effects of third or higher order is more difficult; the multifactor dimensionality reduction method may be introduced to handle this problem. Simulation results show that our method performs better than a permutation test because it excludes the confounding effect of gene length. Real data results show that the interaction analysis can help us find the key genes in some cancer pathways.
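One way a Monte Carlo test for pairwise interaction could look is sketched below. The details are assumptions for illustration only: `rates_a[i]` and `rates_b[i]` stand for the expected number of mutations in genes A and B for sample i under a fitted Poisson model, the two genes are simulated as independent under the null, and the test is one-sided for excess co-mutation. The function name and arguments are hypothetical.

```python
import math
import random


def comutation_test(rates_a, rates_b, observed_both, n_sim=2000, seed=0):
    """Monte Carlo p-value for excess co-mutation of two genes.

    rates_a[i], rates_b[i]: assumed Poisson mean mutation counts of each gene
    in sample i, so gene length and sample background are already included.
    observed_both: number of samples observed with both genes mutated.
    """
    rng = random.Random(seed)
    # Under independence, a sample is co-mutated with probability
    # P(A mutated) * P(B mutated), where P(mutated) = 1 - exp(-rate).
    p_both = [(1.0 - math.exp(-ra)) * (1.0 - math.exp(-rb))
              for ra, rb in zip(rates_a, rates_b)]
    hits = 0
    for _ in range(n_sim):
        simulated = sum(rng.random() < p for p in p_both)
        if simulated >= observed_both:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (1 + hits) / (1 + n_sim)
```

Since the simulated null uses per-sample, per-gene rates rather than shuffled labels, two long genes that are each mutated often by chance do not automatically appear to interact, which is the gene-length confounding the abstract refers to. Counting simulations with `simulated <= observed_both` instead would give the analogous test for mutual exclusivity.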
     The data studied in this paper include the whole-exome sequencing data of 249 lung cancer patients from The Cancer Genome Atlas (TCGA) and the corresponding count data of the number of mutations for each gene and each sample.
