Development of a Methodology for Imbalanced Data Classification using Improved Mahalanobis-Taguchi System


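Cite this article:	NIU Jun-lei (牛俊磊), CHENG Long-sheng (程龙生). Development of a Methodology for Imbalanced Data Classification using Improved Mahalanobis-Taguchi System [J]. Journal of Industrial Engineering and Engineering Management (管理工程学报), 2012, 26(2): 85-93.

Authors:	NIU Jun-lei (牛俊磊), CHENG Long-sheng (程龙生)

Institution:	School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094, Jiangsu, China

Funding:	Humanities and Social Sciences Research Planning Fund of the Ministry of Education; National Natural Science Foundation of China; Jiangsu Provincial Social Science Foundation; Nanjing University of Science and Technology Independent Research Special Program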

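Abstract:	In classification problems, class imbalance biases classifier training and reduces diagnostic sensitivity for minority-class samples. The Mahalanobis-Taguchi System (MTS) is a diagnostic and forecasting technique for multivariate data. It builds a continuous measurement scale rather than learning directly from the training samples, a property that is expected to make it insensitive to the data distribution and so able to overcome class imbalance. To address the shortcomings of MTS threshold calculation and the requirements of imbalanced data classification, this paper studies a probabilistic threshold model for computing the MTS threshold. To address several other weaknesses of MTS, it replaces orthogonal arrays and signal-to-noise ratios with an optimization model for screening key variables, solved with an omni-optimization algorithm. Experiments on 8 UCI data sets show that the improved MTS not only classifies imbalanced data well but also screens key variables, with a marked dimensionality-reduction effect.
Keywords:	Mahalanobis-Taguchi System; classification; imbalanced data; probabilistic threshold model; omni-optimization algorithm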
Development of a Methodology for Imbalanced Data Classification using Improved Mahalanobis-Taguchi System

NIU Jun-lei, CHENG Long-sheng. Development of a Methodology for Imbalanced Data Classification using Improved Mahalanobis-Taguchi System [J]. Journal of Industrial Engineering and Engineering Management, 2012, 26(2): 85-93.

Authors:	NIU Jun-lei, CHENG Long-sheng

Institution:	School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094, China

Abstract:	In binary classification, imbalanced data means that one class is represented by a large number of examples while the other class, usually the more important one, is represented by only a few. Traditional classification techniques assume that training examples are evenly distributed among classes, which biases the classifier toward poorly predicting the minority class. Researchers have addressed the class imbalance problem at both the data level and the algorithm level. However, data-level methods can remove important information or introduce noise, and algorithm-level methods lack a systematic foundation and may end up with rules that overfit the training data. The Mahalanobis-Taguchi System (MTS) is a collection of methods for diagnosis and forecasting with multivariate data. MTS combines the Mahalanobis distance (MD) with Taguchi's robust engineering: MD is used to construct a multidimensional measurement scale, whereas Taguchi's robust engineering is applied to determine important variables and optimize the system. MTS establishes a classification model by constructing a continuous measurement scale from samples of a single class rather than by learning directly from the whole training set. This property seems useful for class imbalance problems. This study investigates whether MTS classifies better than other techniques when facing class imbalance. The paper develops a probabilistic threshold model (PTM) to determine the classification threshold of MTS. To address the inadequacies of MTS, the authors propose an improved MTS optimization model. The core idea is to define several optimization objectives based on the purpose of the classification problem and to use an optimization model, instead of orthogonal arrays and signal-to-noise ratios, to screen important variables.
In the first section, a PTM of MTS for imbalanced data classification is studied. In MTS, the MD distributions of normal and abnormal examples usually overlap. An effective threshold can enhance the diagnostic and forecasting ability of MTS, so finding a threshold that effectively separates normal from abnormal examples is an important issue. The traditional quadratic loss function is impractical. Real applications instead rely on exhaustive search, which may cause overfitting and lower classification reliability. This study develops the PTM, which employs Chebyshev's theorem to estimate the probabilities of the two types of classification error. The PTM balances the probabilities of the two types of error in an optimization model and then computes the threshold value.
In the second section, this study develops a Probabilistic-Optimization-MTS (POMTS) that screens useful variables from the original variable set for imbalanced data classification, replacing orthogonal arrays and signal-to-noise ratios. The g-means and F-value metrics are used to evaluate classifiers on imbalanced data sets. The goals of the optimization model are to maximize the g-means and F-value metrics, the dimensionality reduction, and the SN ratios of the normal and abnormal group samples. The model is a multi-objective, nonlinear, 0-1 programming problem. An omni-optimization algorithm, a generational genetic algorithm designed to handle any number of conflicting objectives, constraints, and variables, is employed to solve it. The global criterion method, which searches for the solution closest to the ideal point, is used to integrate the multiple objectives.
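The measurement scale and the threshold idea described above can be illustrated with a minimal Python sketch. It is not the authors' PTM, whose exact formulation the abstract does not give. The scaled MD follows the usual MTS convention (standardize with the normal group, then divide the quadratic form by the number of variables). The threshold rule is an assumption made for illustration: it equates the two Chebyshev bounds σ₁²/(t − μ₁)² and σ₂²/(μ₂ − t)² on the error probabilities of the two groups. The function names `mahalanobis_scale` and `chebyshev_balanced_threshold` and the synthetic data are hypothetical.

```python
import numpy as np


def mahalanobis_scale(normal, samples):
    """Scaled Mahalanobis distance of `samples` measured against the normal group."""
    mean = normal.mean(axis=0)
    std = normal.std(axis=0, ddof=1)
    # Correlation matrix of the normal (reference) group only.
    corr_inv = np.linalg.inv(np.corrcoef(normal, rowvar=False))
    z = (samples - mean) / std
    k = normal.shape[1]
    # Row-wise quadratic form z_i^T R^{-1} z_i, divided by the number of variables.
    return np.einsum("ij,jk,ik->i", z, corr_inv, z) / k


def chebyshev_balanced_threshold(md_normal, md_abnormal):
    """Threshold t that equalizes the two Chebyshev error bounds (illustrative assumption).

    Solves sigma1 / (t - mu1) = sigma2 / (mu2 - t), assuming mu1 < t < mu2.
    """
    mu1, s1 = md_normal.mean(), md_normal.std(ddof=1)
    mu2, s2 = md_abnormal.mean(), md_abnormal.std(ddof=1)
    return (mu1 * s2 + mu2 * s1) / (s1 + s2)


# Synthetic, imbalanced toy data: 200 normal examples and 20 abnormal ones.
rng = np.random.default_rng(0)
normal = rng.normal(size=(200, 4))
abnormal = rng.normal(loc=3.0, size=(20, 4))

md_normal = mahalanobis_scale(normal, normal)
md_abnormal = mahalanobis_scale(normal, abnormal)
threshold = chebyshev_balanced_threshold(md_normal, md_abnormal)

# An example is flagged as abnormal when its MD exceeds the threshold.
flagged = md_abnormal > threshold
print(f"threshold={threshold:.3f}, abnormal examples flagged={flagged.mean():.2%}")
```

Only the normal group enters the mean, standard deviation, and correlation matrix, which is why the scale itself does not depend on how few abnormal examples are available. The abnormal group is used only to place the threshold.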
In the next section, eight UCI data sets are used to evaluate the capability of POMTS on imbalanced data. The performance of POMTS is compared with that of other popular classification techniques: MTS, logistic regression, support vector machines, multilayer perceptron, and decision tree analysis. The experiment uses five-fold cross-validation, and the g-means and F-value metrics are employed to evaluate the classifiers. The experimental results show that POMTS achieves the highest value of both metrics on almost all data sets, which verifies its ability to deal with imbalanced data. In addition, selecting a suitable lower-dimensional subset of variables with the omni-optimization algorithm identifies the key variables, which improves imbalanced data classification and lowers cost. The results also indicate that using the PTM to determine the threshold helps MTS attain better classification.
In summary, the proposed method is effective not only for imbalanced data classification but also for dimensionality reduction. Further research could combine similarity metrics with the Mahalanobis distance for broader applications and extend the method to multi-class recognition problems.
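For readers unfamiliar with the two evaluation metrics named above, the following Python sketch computes them from predicted labels. It assumes the common definitions: g-means as the geometric mean of the true-positive and true-negative rates, and F-value as the balanced F1 score with the minority class treated as positive. The abstract does not state which F-value weighting the authors used, and the function name `g_mean_and_f_value` is hypothetical.

```python
import math


def g_mean_and_f_value(y_true, y_pred, positive=1):
    """Return (g-mean, F1) with `positive` taken as the minority class label."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1

    recall = tp / (tp + fn) if tp + fn else 0.0        # true-positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true-negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0

    g_mean = math.sqrt(recall * specificity)
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    return g_mean, f_value


# Toy labels: 2 minority (1) and 4 majority (0) examples.
g, f = g_mean_and_f_value([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
print(f"g-mean={g:.3f}, F-value={f:.3f}")
```

Unlike plain accuracy, both metrics drop sharply when a classifier ignores the minority class, which is why they suit imbalanced data sets.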

Keywords:	Mahalanobis-Taguchi System; data classification; imbalanced data; probabilistic thresholding model; Omni-optimizer
This article is indexed in CNKI, Wanfang Data, and other databases.
