Type
Text | Dissertation
Advisor
Finch, Stephen | Ahn, Hongshik | Xing, Haipeng | Hong, Sangjin
Date
2016-12-01
Keywords
Statistics
Department
Department of Applied Mathematics and Statistics
Language
en_US
Source
This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree.
Identifier
http://hdl.handle.net/11401/77177
Publisher
The Graduate School, Stony Brook University: Stony Brook, NY.
Format
application/pdf
Abstract
The purpose of this study is to develop a statistical model to predict the risk of developing disease. To enrich our general understanding of schizophrenia, several clustering techniques are used in a preliminary study. Schizophrenia is a heterogeneous disease with great variability in symptoms, cognition, biology, and course of illness. Some of this variability may be explained by latent subgroups that differ in etiology and key features. Individuals with paternal age-related schizophrenia (PARS) may represent such a subgroup, as evidence suggests a distinct symptom profile. Using K-means and hierarchical clustering on a large sample of schizophrenia patients, this study examines the demographic and clinical features and the distinctiveness of latent PARS subgroups. Despite the wide use of K-means clustering, several issues remain about how best to implement it. One of the main problems in K-means clustering is how to determine the number of clusters in a data set. We propose a method for choosing the optimal number of clusters. The performance of the proposed method is compared with that of existing methods through simulation experiments. In this study, the performance of several classification models on the same schizophrenia data set is also evaluated. Four predictive classification models, Random Forest (RF), Support Vector Machines (SVM), Linear Discriminant Analysis, and AdaBoost, are trained and their performance is compared. These models are then used to identify patients who might be at higher risk of developing schizophrenia. For RF and SVM, an adjusted decision threshold is used for a fair comparison. One of the most critical factors in medical diagnosis is the individual's condition with respect to a given disease, which varies from one patient to another. It is difficult to make an appropriate medical decision about a treatment that works for every patient. This study focuses on developing a statistical method to classify patients into two groups: those who are at risk of a potential disease and those who are not. The successful completion of this study will lead to dramatic improvement in medical diagnosis, which will help the development of decision support systems and personalized treatments that focus on specific patient needs. | 135 pages
Recommended Citation
Lee, Hyejoo, "Clustering and Classification Methods for Prediction of the risk for Developing Disease" (2016). Stony Brook Theses and Dissertations Collection, 2006-2020 (closed to submissions). 3011.
https://commons.library.stonybrook.edu/stony-brook-theses-and-dissertations-collection/3011