Statistical Inference Methods for Genome-Wide Region-Based Association Analysis Using Dimension Reduction Techniques

Thesis details

Many common human diseases, such as cancer, schizophrenia, essential hypertension, type 2 diabetes, and cardiovascular disease, are complex diseases. Complex diseases, also known as multifactorial diseases, are controlled by multiple genetic and environmental factors. Although they often aggregate in families, they do not follow a clear-cut pattern of inheritance, which makes it difficult to determine a person's risk of inheriting or passing on these disorders. With rapid improvements in high-throughput genotyping techniques and the growing number of available markers, genome-wide association studies (GWAS), which genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) in thousands of participants, have emerged as a promising approach for identifying SNPs that are marginally associated with complex diseases. In addition, research on gene-gene interactions (epistasis) in GWAS has shed light on some disease-associated pathways and networks and, despite the computational challenge, has improved our understanding of the genetic basis of complex diseases.

However, many analytic and interpretive challenges remain in GWAS. It is customary to run SNP-based association or interaction tests across the whole genome to identify causal or associated SNPs with strong marginal or joint epistatic effects on diseases or traits. In other words, the unit of association is the SNP. Such an SNP-based analysis usually leads to a heavy computational burden and the well-known multiplicity problem, with a highly inflated risk of type I error and a decreased ability to detect modest effects. In the present study, higher-level units, such as genes or genome regions, were considered to deal with these and related challenges.
Under this framework, we proposed four methods to detect disease-associated genes or gene-gene interactions in the genome, presented in four chapters as follows:

Chapter 1 A new method to test the nonlinear feature in nonlinear principal component analysis
Once SNPs have been allocated to genes or regions, the question remains of how to evaluate the genetic association of each candidate gene or genome region. As powerful multi-marker analysis methods, PCA-based methods are often applied in gene- or region-based association studies. PCA can capture linkage disequilibrium information and avoid multicollinearity between SNPs within a candidate gene or region. However, it extracts only the linear relationships between SNPs. When the relationships are nonlinear, PCA-based methods lose power, and a nonlinear PCA model should be used. Therefore, in the present study, we introduced a nonlinearity measure to determine whether the underlying relationship within a given variable set can be described by a linear PCA model or whether a nonlinear PCA model must be used for further study. Applications to two simulated data sets and to data from GAW16 are described to demonstrate its performance. In the two simulated examples, as expected, no violations of the accuracy bounds arise in the linear example, while some of the residual variances fall outside the accuracy bounds in the nonlinear example. For the real data, at least one of the residual variances falls outside the accuracy bounds, implying that a nonlinear PCA model is required for this data set. These results show that the new nonlinearity measure is effective in detecting nonlinear relationships between variables in a given data set.
With this measure, we can choose a more suitable model to make optimal use of all the information available in the given data set.

Chapter 2 Gene- or region-based association study via kernel principal component analysis
For linear data, PCA-based methods are the better choice for the subsequent association study, while nonlinear approaches should be applied to nonlinear data. Among the modified nonlinear PCA methods, kernel PCA (KPCA) is the best known and most widely adopted. In this study, we proposed combining KPCA with the logistic regression test (LRT) to detect the association between multiple SNPs in a candidate gene or genome region and diseases or traits. The algorithm first conducts KPCA to account for between-SNP relationships in a candidate region, and then applies the LRT to test the association between the kernel principal component (KPC) scores and disease. Simulation results showed that KPCA-LRT was always more powerful than principal component analysis combined with the logistic regression test (PCA-LRT) across different sample sizes, significance levels, and relative risks, especially at the gene-wide level (1E-5) and at lower relative risks (RR=1.2, 1.3). Application to four regions of the rheumatoid arthritis (RA) data from Genetic Analysis Workshop 16 (GAW16) indicated that KPCA-LRT performed better than the single-locus test and PCA-LRT. KPCA-LRT is a valid and powerful gene- or region-based method for the analysis of GWAS data sets, especially under lower relative risks and lower significance levels.

Chapter 3 Exhaustive sliding-window scan approach for genome-wide association study via PCA-based logistic model
The gene- or region-based approaches mentioned above, including our newly proposed KPCA-based method, can improve our understanding of the genetic basis of complex diseases. However, all of these approaches can only handle a gene or genome region of several to tens of markers.
For a large number of SNPs across a candidate region or the human genome, the performance of these methods will not be satisfactory. In recent years, sliding-window methods, in which several neighboring SNPs are grouped together in a "window", have become a popular strategy for automated GWAS data analysis. In these approaches, the candidate region or the whole genome is divided into many contiguous overlapping windows, and a gene- or region-based multi-locus association method is then applied within each window. A sliding-window approach can be implemented with a fixed window size or with variable sizes. However, it is not certain whether window sizes set in advance or chosen by specific methods are statistically sufficient to achieve optimal detection power. Lin et al. proposed that an exhaustive search of all possible windows of SNPs at the genome level is not only computationally practical but also statistically sufficient to detect common or rare genetic-risk alleles. With the development and extensive application of multiprocessor and multithreading computational techniques, such "exhaustive" methods have become more feasible in practice. In the present study, under the framework of "exhaustive" search, we first conducted simulations to assess statistical power at different window sizes, and then evaluated performance on real data to test whether the exhaustive strategy can be extended to GWAS data analysis. Results from both the simulations and the real data analysis indicated that power and p-values differed considerably across window sizes. Furthermore, when combined with cluster computing, the proposed exhaustive strategy is computationally efficient and feasible for analyzing GWAS data, so it should be more widely adopted in GWAS data analysis.
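As a rough illustration of the Chapter 3 procedure, the following minimal sketch scans every contiguous window up to a maximum size and applies a PCA-based logistic likelihood-ratio test in each one. It is not the thesis implementation: the function names, the 80% variance-explained rule for choosing principal components, the cap `max_size`, and the Newton-Raphson fit are all assumptions made for this example.

```python
import numpy as np
from scipy.stats import chi2


def logistic_loglik(X, y, n_iter=50):
    """Maximized log-likelihood of a logistic model with intercept (Newton-Raphson)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        eta = np.clip(Z @ beta, -30, 30)
        p = 1.0 / (1.0 + np.exp(-eta))
        # Tiny ridge term keeps the Hessian invertible (an assumption for stability).
        hess = Z.T @ (Z * (p * (1 - p))[:, None]) + 1e-8 * np.eye(Z.shape[1])
        step = np.linalg.solve(hess, Z.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    eta = np.clip(Z @ beta, -30, 30)
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))


def pca_lrt_pvalue(G, y, var_explained=0.8):
    """P-value of a likelihood-ratio test of case status on the leading PCs of a window."""
    G = np.asarray(G, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    U, s, _ = np.linalg.svd(G - G.mean(axis=0), full_matrices=False)
    var = s ** 2
    if var.sum() <= 0:  # monomorphic window: nothing to test
        return 1.0
    cum = np.cumsum(var) / var.sum()
    # Keep the fewest PCs reaching the assumed variance-explained threshold.
    k = min(int(np.searchsorted(cum, var_explained)) + 1, len(s))
    scores = U[:, :k] * np.sqrt(n)  # rescaling does not change the test
    pbar = y.mean()
    ll_null = n * (pbar * np.log(pbar) + (1 - pbar) * np.log(1 - pbar))
    stat = max(2.0 * (logistic_loglik(scores, y) - ll_null), 0.0)
    return float(chi2.sf(stat, df=k))


def exhaustive_window_scan(G, y, max_size):
    """Test every contiguous SNP window of size 1..max_size; return (start, size, p)."""
    m = G.shape[1]
    return [
        (start, size, pca_lrt_pvalue(G[:, start:start + size], y))
        for start in range(m)
        for size in range(1, min(max_size, m - start) + 1)
    ]
```

Each window is tested independently, which is why such a scan distributes naturally across processors or cluster nodes. The sketch reports raw p-values only and leaves out any multiple-testing correction.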
Chapter 4 A new gene- or region-based method for detecting gene-gene interactions between two unlinked loci via kernel canonical correlation analysis
For GWAS data sets, it is often of interest to identify SNPs that jointly have an epistatic (interaction) effect on complex diseases. However, most current methods treat the SNP as the unit of association, which leads to several well-known limitations such as multiple testing. Under the gene- or region-based framework, our group has previously proposed a gene-based statistic (the CCU statistic) for detecting gene-gene co-association based on canonical correlation analysis (CCA). When the two genes of interest are unlinked, the co-association between them is the same as their interaction effect. The CCU statistic has been shown to perform well in detecting gene-gene co-associations or interactions. However, CCA can detect only linear structure in a data set; if the genomic data contain nonlinear structure, CCA will not be able to detect it. In recent years, kernel CCA (KCCA), a generalization of CCA, has been studied intensively in machine learning, face recognition, and data classification, and has been reported to be successful in many applications. We therefore proposed using KCCA rather than CCA to construct a revised version of the CCU statistic, the kernel CCU (KCCU) statistic, for detecting gene-gene interactions in association studies. Simulation results showed that the power of the KCCU statistic was higher than that of the CCU statistic at all given significance levels, sample sizes, and relative risks. Application to the RA data in GAW16 Problem 1 showed that the CCU statistic detected only the interaction between the PTPN22 and C5 genes, while the KCCU statistic identified all the pairwise interactions among the four genes. In summary, the KCCU statistic performed better than the CCU statistic.
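To illustrate why a kernel version of CCA can pick up structure that linear CCA misses, here is a minimal sketch of a regularized first kernel canonical correlation between two genotype matrices. It does not reproduce the CCU or KCCU statistics, whose exact form is not given in this abstract. The function names, the RBF kernel, its `gamma` value, and the regularization `reg` are assumptions chosen for the example.

```python
import numpy as np


def center_gram(K):
    """Double-center a Gram matrix (centering in feature space)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def linear_gram(X):
    """Linear kernel: with this Gram matrix the procedure reduces to regularized CCA."""
    X = np.asarray(X, dtype=float)
    return X @ X.T


def rbf_gram(X, gamma=0.5):
    """Gaussian (RBF) kernel matrix; gamma is an assumed bandwidth parameter."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)


def first_kernel_canonical_corr(Kx, Ky, reg=0.01):
    """Largest regularized kernel canonical correlation between two Gram matrices.

    Computed as the top singular value of (Kx + rI)^-1 Kx (Ky + rI)^-1 Ky
    with r = reg * n. The regularization shrinks the value below 1.
    """
    n = Kx.shape[0]
    Kx, Ky = center_gram(Kx), center_gram(Ky)
    ridge = reg * n * np.eye(n)
    Rx = np.linalg.solve(Kx + ridge, Kx)
    Ry = np.linalg.solve(Ky + ridge, Ky)
    return float(np.linalg.svd(Rx @ Ry, compute_uv=False)[0])


# Toy data: genotypes in "gene B" track heterozygosity in "gene A",
# a relationship with no linear correlation when the allele frequency is 0.5.
rng = np.random.default_rng(0)
n = 300
gene_a = rng.binomial(2, 0.5, size=(n, 2)).astype(float)
gene_b = 2.0 * (gene_a == 1)
flip = rng.random((n, 2)) < 0.05            # 5% noise
gene_b = np.where(flip, 2.0 - gene_b, gene_b)

rho_linear = first_kernel_canonical_corr(linear_gram(gene_a), linear_gram(gene_b))
rho_rbf = first_kernel_canonical_corr(rbf_gram(gene_a), rbf_gram(gene_b))
print(f"linear: {rho_linear:.3f}  rbf: {rho_rbf:.3f}")
```

In this toy setting the RBF-kernel correlation should come out clearly larger than the linear one, because the dependence runs through heterozygosity rather than allele count. A usable test statistic would also need a null reference distribution, for example from permutation, which the sketch omits.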

ABSTRACT
ABBREVIATION
BACKGROUND
Chapter 1 A new method to test the nonlinear feature in nonlinear principal component analysis
  Introduction
  Methods
    PCA
    K-means cluster
    Nonlinearity measure
  Examples
    Linear example
    Nonlinear example
    Real data
  Discussion
Chapter 2 Gene- or region-based association study via kernel principal component analysis
  Introduction
  Methods
    PCA
    KPCA
    KPCA-LRT model
    Data simulation
    Application
  Results
    Data simulation
    Application
  Discussion
  Conclusion
Chapter 3 Exhaustive sliding-window scan approach for genome-wide association study via PCA-based logistic model
  Background
  Methods
    Exhaustive sliding-window procedure
    PCA-based logistic regression procedure
    Data simulation
    Application
  Results
    Data simulation
    Application
  Discussion
Chapter 4 A new gene- or region-based method for detecting gene-gene interactions between two unlinked loci via kernel canonical correlation analysis
  Introduction
  Methods
    Test statistic
    Data simulation
    Application
  Results
    Data simulation
    Application
  Discussion
References
Acknowledgement
Publications
Thesis review and defense record

Thesis No. ABS725371; the thesis has 68 pages in total.
