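Missing Value Imputation for Categorical Data Based on the Random Forest Model
Cite this article: MENG Jie, LI Chun-lin. Missing value imputation for categorical data based on the random forest model [J]. Statistics & Information Forum (统计与信息论坛), 2014(9): 86-90.
Authors: MENG Jie; LI Chun-lin
Affiliations: [1] China Center of Economics and Statistics Research, Tianjin University of Finance and Economics, Tianjin 300222; [2] School of Mathematics and Statistics, Hebei University of Economics and Business, Shijiazhuang, Hebei 050061
Funding: National Social Science Fund of China project "Research on Survey Data Quality Control Based on Data Mining Techniques" (13BTJ007)
Abstract: Missing data are an important factor affecting the quality of survey questionnaire data, and imputing the missing values in a questionnaire can significantly improve the quality of the survey data. Questionnaire data are predominantly categorical. Classification algorithms from data mining are a common way to handle attribute classification problems, and the random forest model is among the more accurate of these algorithms. This paper introduces the random forest model into research on imputing missing questionnaire data, proposes a random-forest-based method for imputing missing values in categorical data, and discusses the corresponding imputation steps for different missing-data patterns. An empirical simulation comparing it with other methods shows that the random forest imputation method yields imputed values that are more accurate and more reliable.

Keywords: missing value imputation; survey questionnaire; categorical data; random forest; data mining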

Missing Data Imputation for Categorical Data Based on Random Forest Model
Authors and institutions: MENG Jie, LI Chun-lin (1. China Center of Economics and Statistics Research, Tianjin University of Finance and Economics, Tianjin 300222, China; 2. School of Mathematics and Statistics, Hebei University of Economics and Business, Shijiazhuang 050061, China)
Abstract: Missing data are an important factor that degrades the quality of survey questionnaire data, and imputing missing values can significantly improve data quality. Categorical data are the main data type in surveys. Classification algorithms from data mining are often used to handle classification problems, and the random forest is one of the classification models with high predictive accuracy. This paper introduces the random forest model into research on missing data imputation for survey data and proposes a missing data imputation method for categorical data based on the random forest model. Imputation procedures are also designed for different patterns of missing data. An empirical simulation comparing the proposed method with other imputation methods shows that it produces more accurate and more reliable results.
Keywords: missing data imputation; survey questionnaire; categorical data; random forest; data mining
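
The abstract does not spell out the imputation steps, so the following is only a minimal sketch of the general idea and not the authors' procedure. It assumes the simplest missing-data pattern: a single categorical variable has missing entries (marked `None`) and all other variables are fully observed. A small random forest is trained on the complete rows and its majority vote fills the gaps. The function names (`build_tree`, `rf_impute`), the toy survey columns, and all parameter values are hypothetical choices made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, features, rng, depth=0, max_depth=4):
    """Grow one classification tree with multiway splits on categorical features."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1 or not features:
        return majority  # leaf: predict the majority class
    # Consider only a random subset of the features at this node.
    k = max(1, int(len(features) ** 0.5))

    def split_impurity(f):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())

    best = min(rng.sample(features, k), key=split_impurity)
    parts = {}
    for row, y in zip(rows, labels):
        part_rows, part_labels = parts.setdefault(row[best], ([], []))
        part_rows.append(row)
        part_labels.append(y)
    if len(parts) == 1:
        return majority  # chosen feature is constant here; stop (a simplification)
    rest = [f for f in features if f != best]
    children = {v: build_tree(r, l, rest, rng, depth + 1, max_depth)
                for v, (r, l) in parts.items()}
    return {"feature": best, "children": children, "default": majority}

def predict_tree(node, row):
    """Walk down the tree; unseen category values fall back to the node majority."""
    while isinstance(node, dict):
        node = node["children"].get(row[node["feature"]], node["default"])
    return node

def rf_impute(data, target, n_trees=50, seed=0):
    """Return a copy of `data` with None entries of column `target` imputed."""
    rng = random.Random(seed)
    features = [c for c in data[0] if c != target]
    train = [r for r in data if r[target] is not None]
    forest = []
    for _ in range(n_trees):
        boot = [rng.choice(train) for _ in train]  # bootstrap sample of complete rows
        forest.append(build_tree(boot, [r[target] for r in boot], features, rng))
    filled = []
    for r in data:
        r = dict(r)
        if r[target] is None:
            votes = Counter(predict_tree(tree, r) for tree in forest)
            r[target] = votes.most_common(1)[0][0]  # majority vote of the forest
        filled.append(r)
    return filled

# Toy questionnaire (hypothetical): "commute" depends only on "employed".
survey = [
    {"employed": e, "region": g, "commute": "drive" if e == "yes" else "none"}
    for e in ("yes", "no") for g in ("north", "south") for _ in range(10)
]
survey.append({"employed": "yes", "region": "south", "commute": None})
survey.append({"employed": "no", "region": "north", "commute": None})

filled = rf_impute(survey, "commute")
for row in filled[-2:]:
    print(row)
```

When several variables have missing values, one possible extension (again an assumption, not taken from the paper) is to impute the variables one at a time, treating already imputed columns as predictors for the next.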
This article is indexed in VIP (维普) and other databases.