Addressing the problem of missing data in decision tree modeling
Authors:Saiedeh Haji-Maghsoudi  Azam Rastegari  Behshid Garrusi
Institution:1. Dept. of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran;2. Modeling in Health Research Center, Institute for Futures Studies in Health, Kerman University of Medical Sciences, Kerman, Iran;3. Kerman Neuroscience Research Center, Institute of Neuropharmacology, Kerman University of Medical Sciences, Kerman, Iran
Abstract:Tree-based models (TBMs) can handle missing data using the surrogate approach (SUR). The aim of this study is to compare the performance of statistical imputation against the performance of SUR in TBMs. A TBM was constructed from empirical data. Thereafter, 10%, 20%, and 40% of the values of the variable that appeared as the first split were deleted and then imputed without and with the outcome variable in the imputation model (IMP− and IMP+, respectively). This was repeated one thousand times. An absolute relative bias above 0.10 was defined as severe (SARB). Subsequently, in a series of simulations, the following parameters were changed: the degree of correlation among variables, the number of variables truly associated with the outcome, and the missing rate. At a 10% missing rate, the proportion of times SARB was observed in either SUR or IMP− was two times higher than in IMP+ (28% versus 13%). When the missing rate was increased to 20%, all these proportions approximately doubled. Irrespective of the missing rate, IMP+ was about 65% less likely to produce SARB than SUR. Results of IMP− and SUR were comparable up to a 20% missing rate. At a high missing rate, IMP− was 76% more likely to produce SARB estimates. Statistical imputation of missing data, with the outcome variable included in the imputation model, is recommended, even in the context of TBMs.
Keywords:Tree  missing  surrogate  imputation  prediction
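
The SARB criterion in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The abstract does not say which estimate the bias is computed on, so `estimate` and `reference` are placeholder quantities, and all function names, as well as the assumption that values are deleted completely at random, are hypothetical.

```python
import random


def absolute_relative_bias(estimate, reference):
    """Absolute relative bias of an estimate against a non-zero reference value."""
    return abs(estimate - reference) / abs(reference)


def is_severe(estimate, reference, threshold=0.10):
    """Flag severe absolute relative bias (SARB): bias above the 0.10 threshold."""
    return absolute_relative_bias(estimate, reference) > threshold


def delete_at_random(values, rate, rng):
    """Replace a fraction `rate` of values with None.

    Assumption: deletion is completely at random; the abstract does not
    state the missingness mechanism.
    """
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def sarb_proportion(estimates, reference, threshold=0.10):
    """Proportion of repetitions whose estimate shows SARB."""
    flags = [is_severe(e, reference, threshold) for e in estimates]
    return sum(flags) / len(flags)
```

In a replication of the design, `delete_at_random` would be applied to the first-split variable at each missing rate (0.10, 0.20, 0.40). Each of the one thousand repetitions would contribute one estimate per method (SUR, IMP−, IMP+), and `sarb_proportion` would be computed for each method.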