
Comparative evaluation of three statistical data mining techniques (CART, MARS, and Random Forest) and their applications in analyzing the NHANES VOC project data

S.W. Wang, Y. Yan, P.G. Georgopoulos

Environmental & Occupational Health Sciences Institute, Piscataway, NJ

This study presents the application of three statistical data mining techniques, Classification And Regression Trees (CART), Multivariate Adaptive Regression Splines (MARS), and Random Forest, to identify the “best predictors” of personal exposures to VOCs using data collected in the 1999-2000 NHANES VOC Project. The three techniques are employed to address limitations and challenges in the complex NHANES VOC dataset, such as missing values, collinearity, nonlinearity, and interaction effects. CART models consist of threshold functions of individual predictors, applied in sequence, to predict the value of a response variable. Whereas CART splits on threshold values of predictors, MARS replaces these hard binary splits with “smooth” basis functions. The Random Forest approach is an ensemble technique that can improve the accuracy of tree-based models such as CART in classification and regression. The performance of the three techniques is examined in characterizing the relationship between personal exposures to selected VOCs and demographic, socioeconomic, and behavioral variables using the 1999-2000 NHANES VOC Project dataset. This dataset contains measurements of personal exposures to 10 VOCs for 659 subjects between the ages of 20 and 59 years. Data on individual demographic and socioeconomic status, as well as time and activity patterns during the exposure period, are also available for these subjects. The outcomes of the analysis provide valuable information for identifying the significant exposure factors, among demographic, socioeconomic, and activity variables, that affect personal exposures to VOCs.
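The contrast between the three techniques described above can be illustrated with a minimal sketch. It is not the authors' implementation and does not use NHANES data: the single predictor, the toy values, and all function names (`fit_stump`, `hinge_pair`, `fit_bagged_stumps`) are hypothetical. The ensemble shown is plain bagging of one-split trees, a simplification of Random Forest, which additionally samples a random subset of predictors at each split.

```python
import random


def fit_stump(xs, ys):
    """One CART-style split: choose the threshold on a single predictor
    that minimizes the summed squared error of the two leaf means."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical predictor values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        sse = (sum((y - mean_l) ** 2 for y in left)
               + sum((y - mean_r) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, mean_l, mean_r)
    _, t, mean_l, mean_r = best
    return t, mean_l, mean_r


def stump_predict(stump, x):
    """Threshold function: a piecewise-constant prediction."""
    t, mean_l, mean_r = stump
    return mean_l if x <= t else mean_r


def hinge_pair(x, knot):
    """MARS-style mirrored hinge basis functions max(0, x - knot) and
    max(0, knot - x): continuous in x, unlike the step made by a stump."""
    return max(0.0, x - knot), max(0.0, knot - x)


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Fit one stump per bootstrap resample (bagging). Random Forest would
    also draw a random subset of predictors at each split."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # a resample with one distinct value cannot be split
        forest.append(fit_stump(bx, by))
    return forest


def forest_predict(forest, x):
    """Average the predictions of all stumps in the ensemble."""
    return sum(stump_predict(s, x) for s in forest) / len(forest)


# Hypothetical toy data: the response jumps between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

stump = fit_stump(xs, ys)
forest = fit_bagged_stumps(xs, ys)
```

Here `fit_stump` recovers a threshold midway between 4 and 5, `hinge_pair` shows the piecewise-linear terms that a MARS model would combine linearly, and `forest_predict` averages many stumps fitted to resampled data, which is the mechanism by which an ensemble can reduce the variance of a single tree.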

This work is funded in part by the Mickey Leland National Urban Air Toxics Research Center (NUATRC) and the U.S. Environmental Protection Agency (Cooperative Agreement CR-83162501). Viewpoints expressed here are the responsibility of the authors and do not necessarily reflect the views of NUATRC, USEPA, or their contractors.