
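An EM-like Imputation Algorithm for Multivariate Missing Data and Its Application in Credit Scoring
Cite this article:JIANG Hui, MA Chao-qun, XU Xu-qing, LAN Qiu-jun. An EM-like Imputation Algorithm for Multivariate Missing Data and Its Application in Credit Scoring[J]. Chinese Journal of Management Science, 2019, 27(3): 11-19.
Authors:JIANG Hui  MA Chao-qun  XU Xu-qing  LAN Qiu-jun
Affiliation:Business School of Hunan University, Changsha, Hunan 410082, China
Funding:Key Program of the National Natural Science Foundation of China (71431008); Emergency Management Program of the National Natural Science Foundation of China (71850012); Humanities and Social Sciences Research Planning Fund of the Ministry of Education (18YJAZH038)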
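Abstract:Missing data can significantly reduce the accuracy and usability of credit scoring models, especially when several variables have missing values at the same time. This paper addresses multivariate missing data at the model application stage and proposes a new imputation algorithm. The algorithm consists of two stages: a preparation stage and an imputation stage. In the preparation stage, the algorithm trains on the initial data set with the naive Bayes method and builds a single-variable prediction model for each variable that may be missing. The imputation stage borrows the idea of the EM algorithm: using the single-variable prediction models built earlier, it iterates alternately over a given sample with several missing variables and updates the filled values step by step. The algorithm is proved to converge monotonically. Using the Renrendai data set and the two credit scoring benchmark data sets provided by UCI (German and Australian), the method is compared experimentally with mode imputation and EM imputation. The results show that the proposed method is clearly better in both data recovery performance and credit scoring accuracy after imputation. This offers a better way to handle multivariate missing data in credit scoring.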

Keywords:EM algorithm  credit scoring  missing data  data mining
Received:2018-01-17
Revised:2018-05-25

An EM-similar Imputation Algorithm for Multivariable Data Missing and its Application in Credit Scoring
JIANG Hui,MA Chao-qun,XU Xu-qing,LAN Qiu-jun.An EM-similar Imputation Algorithm for Multivariable Data Missing and its Application in Credit Scoring[J].Chinese Journal of Management Science,2019,27(3):11-19.
Authors:JIANG Hui  MA Chao-qun  XU Xu-qing  LAN Qiu-jun
Institution:Business School of Hunan University, Changsha 410082, China
Abstract:Missing data can significantly reduce the accuracy and usability of a credit scoring model, especially when several variables are missing at once. The classical way to fill in missing data is substitution by the mean or the mode, and the EM algorithm has become popular more recently.
To handle data that are missing at the credit scoring (model application) stage, this paper proposes a new multivariate imputation method whose idea is similar to that of the EM algorithm. Unlike EM, it does not require the distribution functions of the missing variables, so it is more widely applicable. The method consists of two stages: a model preparation stage and a data filling stage. In the model preparation stage, the naive Bayes method is used to train a prediction model on the initial data set for every variable that may be missing. In the data filling stage, the missing variables of a sample are filled in using the prediction models built in the first stage, by alternating iteration over the missing variables. The algorithm is proved to be monotonically convergent.
Three data sets are used in the experiments. One was downloaded from Renrendai, a well-known P2P financial company, and the other two (German and Australian) are benchmark data sets provided by UCI. The experimental results show that, on all three data sets, both the accuracy of data recovery and the accuracy of credit evaluation after filling are clearly better for the proposed method than for mode filling and EM filling. This indicates that the proposed method is better able to handle multivariate missing data in credit evaluation.
Keywords:EM algorithm  credit scoring  data missing  data mining  
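
The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the initial guess for missing values, the smoothing scheme, or the stopping rule, so the sketch assumes categorical variables, Laplace smoothing, initialisation by the column mode, and a cap on the number of passes. The function names (`train_nb`, `predict_nb`, `impute`) and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
import math

def train_nb(rows, target):
    """Preparation stage: fit a categorical naive Bayes model that
    predicts column `target` from all other columns."""
    features = [j for j in range(len(rows[0])) if j != target]
    class_counts = Counter(r[target] for r in rows)
    # cond_counts[j][c][v]: rows with class c whose column j equals v
    cond_counts = {j: defaultdict(Counter) for j in features}
    values = {j: set() for j in features}
    for r in rows:
        for j in features:
            cond_counts[j][r[target]][r[j]] += 1
            values[j].add(r[j])
    return {"features": features, "class_counts": class_counts,
            "cond_counts": cond_counts, "values": values, "n": len(rows)}

def predict_nb(model, row):
    """Return the most probable class for `row` (log-space scores)."""
    best, best_score = None, -math.inf
    for c, n_c in sorted(model["class_counts"].items()):  # sorted: stable ties
        score = math.log(n_c / model["n"])
        for j in model["features"]:
            count = model["cond_counts"][j][c][row[j]]
            # Laplace smoothing (an assumption of this sketch)
            score += math.log((count + 1) / (n_c + len(model["values"][j])))
        if score > best_score:
            best, best_score = c, score
    return best

def impute(row, models, modes, max_iter=20):
    """Filling stage: alternately re-estimate each missing variable from
    the current values of all others until no estimate changes."""
    missing = [j for j, v in enumerate(row) if v is None]
    filled = list(row)
    for j in missing:
        filled[j] = modes[j]  # initial guess: column mode (assumed)
    for _ in range(max_iter):
        changed = False
        for j in missing:
            new_value = predict_nb(models[j], filled)
            if new_value != filled[j]:
                filled[j] = new_value
                changed = True
        if not changed:
            break
    return filled

# Toy data, invented for illustration: employment, savings, income, housing
train = ([("stable", "yes", "high", "own")] * 4
         + [("unstable", "no", "low", "rent")] * 5)
n_cols = len(train[0])
models = {j: train_nb(train, j) for j in range(n_cols)}
modes = {j: Counter(r[j] for r in train).most_common(1)[0][0]
         for j in range(n_cols)}

print(impute(("stable", "yes", None, None), models, modes))
# → ['stable', 'yes', 'high', 'own']
```

In this toy run the mode-based initial guess (`low`, `rent`) is overturned on the first pass and the second pass changes nothing, so the loop stops. The sketch does not by itself demonstrate the monotone convergence that the paper proves for its own algorithm, which is why the number of passes is capped.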