A new variable importance measure for random forests with missing data
Authors: Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm, Carolin Strobl
Institution: 1. Institut für Medizinische Statistik und Epidemiologie, Technische Universität München, Ismaninger Str. 22, 81675, München, Germany
2. Institut für Statistik, Ludwig-Maximilians-Universität, Ludwigstraße 33, 80539, München, Germany
3. Department of Psychology, University of Zurich, Binzmühlestrasse 14, 8050, Zurich, Switzerland
Abstract: Random forests are widely used in many research fields for prediction and interpretation purposes. Their popularity is rooted in several appealing characteristics, such as their ability to deal with high-dimensional data, complex interactions, and correlations between variables. Another important feature is that random forests provide variable importance measures that can be used to identify the most important predictor variables. Although alternatives such as complete case analysis and imputation exist, existing methods for computing such measures cannot be applied in a straightforward way when the data contain missing values. This paper presents a solution to this pitfall by introducing a new variable importance measure that is applicable to any kind of data, whether or not it contains missing values. An extensive simulation study shows that the new measure meets sensible requirements and has good variable ranking properties. An application to two real data sets also indicates that the new approach may provide a more sensible variable ranking than the widespread complete case analysis. The new measure takes the occurrence of missing values into account, which also makes its results differ from those obtained under multiple imputation.
This article is indexed in SpringerLink and other databases.