Authors

Bowen Song

Type

Text

Type

Dissertation

Advisor

Zhu, Wei | Wu, Song | Gao, Yi | Li, Ellen.

Date

2015-05-01

Keywords

Statistics | classification, random forest, ROC analysis, supervised learning

Department

Department of Applied Mathematics and Statistics.

Language

en_US

Source

This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree.

Identifier

http://hdl.handle.net/11401/77477

Publisher

The Graduate School, Stony Brook University: Stony Brook, NY.

Format

application/pdf

Abstract

Classification algorithms that optimize the overall accuracy or class distribution purity often suffer from difficulties in classifying class imbalanced data, in which most cases in the testing set will be classified to the majority class. However for imbalanced data classification, one usually cares more about the accuracy for identifying the minority class (e.g. diseased samples), that is, the sensitivity, other than the overall accuracy and therefore low sensitivity is highly undesirable. Receiver operating characteristic (ROC) is a 2 dimensional graph by plotting sensitivity versus specificity, i.e. | accuracy in identifying the majority class (e.g. normal samples). A curve is formed by varying the decision threshold and the area under ROC (AUC) is employed as an accuracy measurement to evaluate the performance of classification. Random Forest, a modern ensemble classifier, is gaining increasing attention in the community because of its good classification capability. Each single learner is a decision tree, built on a bagging data with each node split based on a randomly selected feature subset. As a result, each base learner is relatively " independent" to the others and thus the ensemble's classification accuracy improves overall. In this dissertation, we combine the ROC analysis and the Random Forest to establish the proposed ROC Random Forest algorithm. There are two goals to this algorithm: (1) improving the AUC value, and (2) producing balanced classification result. Verification was carried out using 18 public data sets from the UCI and the results show that the ROC Random Forest not only improves the classification accuracy in terms of higher AUC value but also delivers a more balanced classification result comparing to other Random Forest settings. One draw-back of the ROC Random Forest lies in its difficulty in processing categorical predictors. Given the importance of categorical predictors in many classification problems, we have further combined the ROC Random Forest with optimal node splitting algorithms other than ROC for categorical predictors. The resulting Hybrid ROC Random Forest is further evaluated on 8 UCI data sets. | 139 pages

Share

COinS
 
 

To view the content in your browser, please download Adobe Reader or, alternately,
you may Download the file to your hard drive.

NOTE: The latest versions of Adobe Reader do not support viewing PDF files within Firefox on Mac OS and if you are using a modern (Intel) Mac, there is no official plugin for viewing PDF files within the browser window.