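Research on Statistical Inference Methods for Genome-Wide Region-Based Association Analysis Using Data Dimension-Reduction Techniques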

Many common human diseases, such as cancer, schizophrenia, essential hypertension, type 2 diabetes, and cardiovascular disease, are complex diseases. Complex diseases, also known as multifactorial diseases, are controlled by multiple genetic and environmental factors. Although they often cluster in families, they do not follow a clear-cut pattern of inheritance, which makes it difficult to determine a person's risk of inheriting or passing on these disorders. With recent rapid improvements in high-throughput genotyping techniques and the growing number of available markers, genome-wide association studies (GWAS), which genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) in thousands of participants, have emerged as a promising approach for identifying SNPs that are marginally associated with complex diseases. In addition, research on gene-gene interactions (epistasis) in GWAS has shed some light on disease-associated pathways and networks and has improved our understanding of the genetic basis of complex diseases, despite the computational challenge. However, many analytic and interpretation challenges remain in GWAS. It is customary to run SNP-based association or interaction tests across the whole genome to identify causal or associated SNPs with strong marginal or joint epistatic effects on diseases or traits. In other words, the unit of association is the SNP. Such SNP-based analysis usually leads to a heavy computational burden and the well-known multiplicity problem, with a highly inflated risk of type I error and a reduced ability to detect modest effects. In the present study, higher-level units, such as genes or genomic regions, were considered to deal with these and related challenges.
Under this framework, we proposed four methods to detect disease-associated genes or gene-gene interactions in the genome, presented in four chapters as follows:
Chapter 1 A new method to test the nonlinear feature in nonlinear principal component analysis
Given the SNPs allocated to genes or regions, the question remains of how to evaluate genetic association for each candidate gene or genomic region. As powerful multi-marker analysis methods, principal component analysis (PCA)-based methods are often applied in gene- or region-based association studies. PCA can capture linkage disequilibrium information and avoid multicollinearity between SNPs within a candidate gene or region. However, it extracts only linear relationships between SNPs. When the relationships are nonlinear, PCA-based methods lose power and a nonlinear PCA model should be used. Therefore, in the present study we introduced a nonlinearity measure to determine whether the underlying relationship within a given variable set can be described by a linear PCA model or whether a nonlinear PCA model must be used for further study. Applications to two simulated data sets and to data from GAW16 demonstrate its performance. In the two simulated examples, as expected, no violations of the accuracy bounds arise in the linear example, while some of the residual variances fall outside the accuracy bounds in the nonlinear example. For the real data, at least one of the residual variances falls outside the accuracy bounds, implying that a nonlinear PCA model is required for this data set. These results show that the new nonlinearity measure is effective in detecting the relationships between variables in a given data set.
With this measure, we can choose a more suitable model to make optimal use of all the information available in the given data set.
Chapter 2 Gene- or region-based association study via kernel principal component analysis
For linear data, PCA-based methods are the better choice for the subsequent association study, while nonlinear approaches should be applied to nonlinear data. Among the nonlinear variants of PCA, kernel PCA (KPCA) is the best known and most widely adopted. In this study, we proposed combining KPCA with a logistic regression test (LRT) to detect association between multiple SNPs in a candidate gene or genomic region and diseases or traits. The algorithm first conducts KPCA to account for between-SNP relationships in a candidate region, and then applies the LRT to test the association between kernel principal component (KPC) scores and disease. Simulation results showed that KPCA-LRT was consistently more powerful than principal component analysis combined with the logistic regression test (PCA-LRT) across the sample sizes, significance levels, and relative risks considered, especially at the gene-wide level (1E-5) and at lower relative risks (RR = 1.2, 1.3). Application to four regions of the rheumatoid arthritis (RA) data from Genetic Analysis Workshop 16 (GAW16) indicated that KPCA-LRT performed better than the single-locus test and PCA-LRT. KPCA-LRT is a valid and powerful gene- or region-based method for analyzing GWAS data sets, especially under lower relative risks and lower significance levels.
Chapter 3 Exhaustive sliding-window scan approach for genome-wide association study via PCA-based logistic model
The gene- or region-based approaches mentioned above, including our newly proposed KPCA-based method, can improve our understanding of the genetic basis of complex diseases. However, all of these approaches handle only a gene or genomic region of several to tens of markers.
With a large number of SNPs across the candidate region or the human genome, the performance of these methods is unsatisfactory. In recent years, sliding-window methods, in which several neighboring SNPs are grouped into a "window", have become a popular strategy for automated GWAS data analysis. In these approaches, the candidate region or the whole genome is divided into many contiguous, overlapping windows, and a gene- or region-based multi-locus association method is applied within each window. A sliding-window approach can be implemented with a fixed window size or with variable sizes. However, it is not certain whether window sizes set in advance or chosen by specific methods are statistically sufficient to achieve optimal detection power. Lin et al. proposed that an exhaustive search of all possible windows of SNPs at the genome level is not only computationally practical but also statistically sufficient to detect common or rare genetic risk alleles. With the development and widespread application of multiprocessor and multithreading techniques, such "exhaustive" methods have become more feasible in practice. In the present study, under the framework of exhaustive search, we first conducted simulations to assess statistical power at different window sizes, and then evaluated performance on real data to test whether the exhaustive strategy can be extended to GWAS data analysis. Results from both the simulations and the real data analysis indicated that power and p-values differed considerably across window sizes. Furthermore, combined with cluster computing, the proposed exhaustive strategy is computationally efficient and feasible for analyzing GWAS data, so it should be more widely adopted in GWAS data analysis.
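The exhaustive window enumeration described for Chapter 3 can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the function names, the half-open index convention, and the cap `max_size` on window length are all assumptions made here, and `toy_test` is a placeholder standing in for a real window-level test such as the PCA-based logistic regression.

```python
from typing import Callable, Iterator, List, Tuple

Window = Tuple[int, int]  # half-open index range [start, stop) over SNP positions


def exhaustive_windows(n_snps: int, max_size: int) -> Iterator[Window]:
    """Yield every contiguous window containing 1..max_size adjacent SNPs."""
    for size in range(1, min(max_size, n_snps) + 1):
        for start in range(n_snps - size + 1):
            yield (start, start + size)


def n_windows(n_snps: int, max_size: int) -> int:
    """Closed-form count of the windows produced by exhaustive_windows."""
    w = min(max_size, n_snps)
    return w * n_snps - w * (w - 1) // 2


def scan_windows(
    n_snps: int,
    max_size: int,
    window_test: Callable[[int, int], float],
) -> List[Tuple[int, int, float]]:
    """Apply a window-level test to every window; return (start, stop, p-value)."""
    return [(s, e, window_test(s, e)) for s, e in exhaustive_windows(n_snps, max_size)]


def toy_test(start: int, stop: int) -> float:
    # Placeholder only: pretends the SNP at index 2 carries the signal.
    # A real analysis would fit a multi-locus model on the SNPs in [start, stop).
    return 0.001 if start <= 2 < stop else 0.5


results = scan_windows(n_snps=5, max_size=3, window_test=toy_test)
```

Because each window is tested independently, the list comprehension in `scan_windows` is the natural place to distribute work across processes or cluster nodes. The window count grows roughly as the number of SNPs times `max_size`, which is why parallel hardware matters for this strategy.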
Chapter 4 A new gene- or region-based method for detecting gene-gene interactions between two unlinked loci via kernel canonical correlation analysis
For GWAS data sets, it is often of interest to identify SNPs that jointly have an epistatic (interaction) effect on complex diseases. However, most current methods take the SNP as the unit of association, which leads to several well-known limitations such as multiple testing. Under the gene- or region-based framework, our group previously proposed a gene-based statistic (the CCU statistic) for detecting gene-gene co-association based on canonical correlation analysis (CCA). When the two genes of interest are unlinked, the co-association between them is the same as their interaction effect. The CCU statistic has been shown to perform well in detecting gene-gene co-associations or interactions. However, CCA can detect only linear structure in the data; if the genomic data contain nonlinear structure, CCA will not detect it. In recent years, kernel CCA (KCCA), a generalization of CCA, has been studied intensively in machine learning, face recognition, and data classification, and has been reported to succeed in many applications. We therefore proposed using KCCA rather than CCA to construct a revised version of the CCU statistic, the kernel CCU (KCCU) statistic, for detecting gene-gene interactions in association studies. Simulation results showed that the power of the KCCU statistic was higher than that of the CCU statistic at all given significance levels, sample sizes, and relative risks. Application to the RA data in GAW16 Problem 1 showed that the CCU statistic detected only the interaction between the PTPN22 and C5 genes, while the KCCU statistic identified all pairwise interactions among the four genes. In summary, the KCCU statistic performed better than the CCU statistic.
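As a rough illustration of the kernel step that the KPCA-based method of Chapter 2 relies on, the sketch below extracts kernel principal component scores from a genotype matrix with NumPy. It is a minimal sketch under stated assumptions: the abstract does not say which kernel or parameters the thesis uses, so the Gaussian (RBF) kernel, the default `gamma`, the function names, and the simulated 0/1/2 genotypes are all choices made here. The resulting scores would then serve as covariates in a logistic regression test, which is not shown.

```python
import numpy as np


def rbf_kernel(X: np.ndarray, gamma: float) -> np.ndarray:
    """Gaussian (RBF) Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kpca_scores(X, n_components=2, gamma=None):
    """Return (scores, eigenvalues) for the leading kernel principal components."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if gamma is None:
        gamma = 1.0 / X.shape[1]  # assumed default, not taken from the thesis
    K = rbf_kernel(X, gamma)
    # Center the Gram matrix, i.e. center the data in feature space.
    ones = np.full((n, n), 1.0 / n)
    Kc = K - ones @ K - K @ ones + ones @ K @ ones
    # Kc is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Projection of each training subject onto component k is sqrt(lambda_k) * v_k.
    scores = eigvecs * np.sqrt(np.maximum(eigvals, 0.0))
    return scores, eigvals


# Simulated genotypes for illustration only: 50 subjects, 8 SNPs coded 0/1/2.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(50, 8))
scores, eigvals = kpca_scores(genotypes, n_components=3)
```

The same centered Gram matrix construction is also the starting point for kernel CCA, where one such matrix is built per gene. The regularized eigenproblem that kernel CCA then solves is not sketched here.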
ABSTRACT
ABBREVIATION
BACKGROUND
Chapter 1 A new method to test the nonlinear feature in nonlinear principal component analysis
    Introduction
    Methods
        PCA
        K-means cluster
        Nonlinearity measure
    Examples
        Linear example
        Nonlinear example
        Real data
    Discussion
Chapter 2 Gene- or region-based association study via kernel principal component analysis
    Introduction
    Methods
        PCA
        KPCA
        KPCA-LRT model
        Data simulation
        Application
    Results
        Data simulation
        Application
    Discussion
    Conclusion
Chapter 3 Exhaustive sliding-window scan approach for genome-wide association study via PCA-based logistic model
    Background
    Methods
        Exhaustive sliding-window procedure
        PCA-based logistic regression procedure
        Data simulation
        Application
    Results
        Data simulation
        Application
    Discussion
Chapter 4 A new gene- or region-based method for detecting gene-gene interactions between two unlinked loci via kernel canonical correlation analysis
    Introduction
    Methods
        Test statistic
        Data simulation
        Application
    Results
        Data simulation
        Application
    Discussion
References
Acknowledgement
Publications
Thesis review and defense record
Thesis no. ABS725371; 68 pages in total.