Type
Text
Type
Dissertation
Advisor
Zhu, Wei | Wu, Song | Gao, Yi | Li, Ellen.
Date
2015-05-01
Keywords
Statistics | classification, random forest, ROC analysis, supervised learning
Department
Department of Applied Mathematics and Statistics.
Language
en_US
Source
This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree.
Identifier
http://hdl.handle.net/11401/77477
Publisher
The Graduate School, Stony Brook University: Stony Brook, NY.
Format
application/pdf
Abstract
Classification algorithms that optimize overall accuracy or class distribution purity often struggle with class-imbalanced data: most cases in the testing set are assigned to the majority class. For imbalanced data, however, one usually cares more about the accuracy in identifying the minority class (e.g. diseased samples), that is, the sensitivity, than about the overall accuracy, so low sensitivity is highly undesirable. The receiver operating characteristic (ROC) is a two-dimensional graph that plots sensitivity against specificity, i.e. the accuracy in identifying the majority class (e.g. normal samples). A curve is formed by varying the decision threshold, and the area under the ROC curve (AUC) is used as an accuracy measure to evaluate classification performance. Random Forest, a modern ensemble classifier, is gaining increasing attention in the community because of its good classification capability. Each base learner is a decision tree built on a bootstrap (bagged) sample of the data, with each node split based on a randomly selected feature subset. As a result, each base learner is relatively "independent" of the others, and thus the ensemble's classification accuracy improves overall. In this dissertation, we combine ROC analysis and Random Forest to establish the proposed ROC Random Forest algorithm. The algorithm has two goals: (1) improving the AUC value, and (2) producing balanced classification results. Verification was carried out using 18 public data sets from the UCI repository, and the results show that the ROC Random Forest not only improves classification accuracy in terms of a higher AUC value but also delivers a more balanced classification result compared to other Random Forest settings. One drawback of the ROC Random Forest lies in its difficulty in processing categorical predictors.
Given the importance of categorical predictors in many classification problems, we have further combined the ROC Random Forest with optimal node splitting algorithms other than ROC for categorical predictors. The resulting Hybrid ROC Random Forest is further evaluated on 8 UCI data sets. | 139 pages
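Illustrative note: the abstract's description of ROC analysis (a curve traced by varying the decision threshold, summarized by the AUC) can be sketched in a few lines of Python. This is only a minimal sketch on made-up toy data and is not the dissertation's ROC Random Forest algorithm; the function names, scores, and labels are all hypothetical. The curve is computed on the conventional axes, with the false positive rate (1 - specificity) on the horizontal axis and sensitivity on the vertical axis.

```python
def roc_points(scores, labels):
    """Sweep the decision threshold and return (1 - specificity, sensitivity) pairs."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when predicting positive for score >= threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return tp / pos, tn / neg


# Hypothetical imbalanced toy data: 2 minority (1) and 8 majority (0) cases.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.45, 0.40, 0.30, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05, 0.02]

toy_auc = auc(roc_points(scores, labels))
default_cut = sensitivity_specificity(scores, labels, 0.5)    # every case called majority
lowered_cut = sensitivity_specificity(scores, labels, 0.25)   # minority cases recovered
```

On this toy data, the default cutoff of 0.5 assigns every case to the majority class (80% overall accuracy but zero sensitivity), while a lower cutoff recovers both minority cases at a small cost in specificity. This is the imbalance the abstract describes.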
Recommended Citation
Song, Bowen, "ROC Random Forest and Its Application" (2015). Stony Brook Theses and Dissertations Collection, 2006-2020 (closed to submissions). 3289.
https://commons.library.stonybrook.edu/stony-brook-theses-and-dissertations-collection/3289