Type

Text

Type

Dissertation

Advisor

Wu, Song | Zhu, Wei, Ahn, Hongshik | Li, Ellen.

Date

2011-12-01

Keywords

Statistics--Biostatistics | Canonical correlation, Clustering analysis, Network analysis, Pearson residuals, SNP

Department

Department of Applied Mathematics and Statistics

Language

en_US

Source

This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree.

Identifier

http://hdl.handle.net/11401/71191

Publisher

The Graduate School, Stony Brook University: Stony Brook, NY.

Format

application/pdf

Abstract

The goal of genome-wide association studies (GWAS) is to investigate the relationships between disease phenotypes and genotypes, which are usually determined by a large number of single nucleotide polymorphisms (SNPs). Currently, GWAS are often underpowered to identify SNPs with small to moderate effect sizes. To overcome this difficulty, two major approaches are often adopted: (1) meta-analysis, which increases the sample size, and (2) SNP pre-selection by dimension reduction. Dimension reduction for SNP data has been difficult because the categorical nature of SNPs renders most association measures, such as the Pearson correlation or the Euclidean distance, inappropriate. In this thesis, we propose a novel (partial) canonical correlation association measure for categorical data that can be used in major dimension reduction approaches, including cluster analysis (CA) and partial correlation network analysis (PCNA), for the analysis of GWAS data. Its performance is examined and compared with that of existing association measures. Network analysis methods such as PCNA and Bayesian networks serve not only as dimension reduction approaches but also as data-driven pathway discovery tools. A key objective in modern genetic studies is to discover the regulatory causal relationships between genetic mutations measured by SNPs and the resulting functional changes, often gauged by gene expression levels. With the former being categorical and the latter continuous numerical data, we now face the problem of mixed data types. Our novel partial canonical correlation measure developed for categorical data can be readily extended to PCNA with mixed variables. This new approach is illustrated with a real data example from a study on inflammatory bowel diseases conducted at Stony Brook University Medical Center and Washington University in St. Louis. A comparison is also made with Bayesian network analysis for mixed data, and guidelines are provided on the pros and cons of each method. | 102 pages

Recommended Citation

Chen, Hongyan, "Clustering and Network Analysis with Single Nucleotide Polymorphism (SNP)" (2011). Stony Brook Theses and Dissertations Collection, 2006-2020 (closed to submissions). 398.
https://commons.library.stonybrook.edu/stony-brook-theses-and-dissertations-collection/398
