Similar Articles
19 similar articles found.
1.
The modelling sample of a credit scoring model is imbalanced data composed of bad customers (a rare event) and good customers (a common event). This paper characterizes the difficulty of identifying the rare event from the perspective of the variance of the model residuals, borrows methods for handling imbalanced data from machine learning to apply special sampling to the rare event before modelling, and proves that after such sampling the resulting sample bias must be corrected with an empirical formula. Empirical analysis shows that this is an effective way to improve the accuracy of credit scoring models.
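The abstract does not give the empirical correction formula; the sketch below illustrates the overall workflow under one standard choice, the prior-correction (King-Zeng) adjustment of the logistic intercept after oversampling the rare bad-customer class. The population bad rate `true_bad_rate` and the oversampling factor are assumed inputs, not the paper's settings.

```python
# A minimal sketch: oversample the rare "bad" class before fitting, then
# correct the logistic intercept for the induced sampling bias with the
# standard prior-correction formula. `true_bad_rate` is an assumed input.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_with_prior_correction(X, y, true_bad_rate, oversample_factor=5, seed=0):
    rng = np.random.default_rng(seed)
    bad_idx = np.flatnonzero(y == 1)
    # Oversample the rare event (bad customers) with replacement.
    extra = rng.choice(bad_idx, size=(oversample_factor - 1) * len(bad_idx))
    X_bal = np.vstack([X, X[extra]])
    y_bal = np.concatenate([y, y[extra]])

    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

    # Prior correction (King-Zeng style): shift the intercept back so that
    # predicted probabilities reflect the true population bad rate.
    tau, ybar = true_bad_rate, y_bal.mean()
    model.intercept_ -= np.log(((1 - tau) / tau) * (ybar / (1 - ybar)))
    return model
```

Without the intercept shift, the predicted default probabilities would be inflated roughly in proportion to the oversampling.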

2.
To address the low classification accuracy of minority-class samples in imbalanced data sets, this paper proposes an undersampling method based on the natural nearest neighbors of the majority-class samples. The natural nearest neighbor algorithm dynamically selects a different number of natural nearest neighbors for each sample according to its distribution, so that the number of natural neighbors reflects how densely the samples are distributed. The proposed method first computes the natural nearest neighbors of the majority-class samples within the whole data set and, based on them, removes noise samples and samples of low local density from the majority class; it then computes the similarity of the remaining samples, retains representative samples in dense regions, and discards some redundant samples to obtain a balanced data set. The computation requires no pre-specified parameters and reduces the loss of majority-class information during undersampling. In comparative experiments, a support vector machine was used to classify 12 data sets balanced by different undersampling methods; the results show that the method achieves better classification performance on most data sets and improves the classification accuracy of the minority class.
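A rough sketch of the natural-neighbor mechanism described above, under simplifying assumptions: grow k until every point has at least one reverse neighbor, use the reverse-neighbor count as a density proxy, and drop sparse (noisy or low-density) majority points. The `keep_quantile` threshold and the stopping rule are my simplifications, not the paper's exact procedure.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def natural_neighbor_counts(X, max_k=30):
    """Grow k until every point is someone's neighbor; return, for each point,
    how many points count it among their k nearest (a density proxy)."""
    n = len(X)
    nn = NearestNeighbors(n_neighbors=min(max_k + 1, n)).fit(X)
    _, idx = nn.kneighbors(X)           # idx[:, 0] is the point itself
    in_degree = np.zeros(n, dtype=int)
    for k in range(1, idx.shape[1]):
        in_degree += np.bincount(idx[:, k], minlength=n)
        if (in_degree > 0).all():       # every point has a reverse neighbor
            break
    return in_degree

def undersample_majority(X_maj, X_min, keep_quantile=0.5):
    X_all = np.vstack([X_maj, X_min])
    deg = natural_neighbor_counts(X_all)[: len(X_maj)]
    # Drop majority points in sparse regions (noise / low local density),
    # keeping the denser, more representative part.
    keep = deg >= np.quantile(deg, keep_quantile)
    return X_maj[keep]
```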

3.
This paper applies the AdaBoost ensemble algorithm to the classification problem in credit scoring and improves the algorithm to address its shortcomings on imbalanced classification. A new credit scoring model is built with the improved AdaBoost algorithm and analyzed empirically. The results show that the model can effectively reduce losses caused by misclassification.
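The abstract does not specify the modification; one common way to adapt AdaBoost to imbalance, shown below, is an AdaCost-style update in which misclassified minority (bad-customer) samples receive a larger weight boost. `minority_cost` is an assumed tuning parameter, not the paper's.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cost_sensitive_adaboost(X, y, n_rounds=50, minority_cost=2.0):
    """y in {-1, +1}, with +1 the rare (bad-customer) class."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    cost = np.where(y == 1, minority_cost, 1.0)
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # Misclassified minority samples are up-weighted more strongly.
        w *= np.exp(-alpha * y * pred * np.where(pred != y, cost, 1.0))
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)

    def predict(X_new):
        scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
        return np.sign(scores)
    return predict
```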

4.
Customer Credit Evaluation Based on Fuzzy Support Vector Machines
This paper first compares support vector machines with traditional classification methods for bank customer credit evaluation; the results show that SVMs are better suited to personal credit evaluation in Chinese commercial banks at present. It then introduces a fuzzy support vector machine to handle the class imbalance in bank customer credit samples. Empirical results show that the SVM with fuzzy memberships reduces the impact of unequal class sizes in the training data on the decision machine and effectively improves the accuracy of credit evaluation.
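A minimal approximation of the fuzzy SVM idea with scikit-learn: per-sample membership values weight each point's slack penalty via `sample_weight`. The membership rule below (closer to the class center means membership nearer 1, with the minority class softened) is illustrative, not the paper's definition.

```python
import numpy as np
from sklearn.svm import SVC

def fuzzy_svm(X, y, minority_label=1, C=1.0):
    memberships = np.empty(len(y))
    for cls in np.unique(y):
        mask = y == cls
        d = np.linalg.norm(X[mask] - X[mask].mean(axis=0), axis=1)
        # Closer to the class center -> membership nearer 1.
        m = 1.0 - d / (d.max() + 1e-12)
        if cls == minority_label:      # soften the imbalance
            m = 0.5 + 0.5 * m
        memberships[mask] = m
    return SVC(C=C, kernel="rbf").fit(X, y, sample_weight=memberships)
```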

5.
An Improved SMOTE Resampling Algorithm for Imbalanced Data Sets
薛薇 《统计研究》2012,29(6):95-98
On imbalanced data sets, imbalanced learning typically shows up as poor classification of the negative class. This paper improves the SMOTE resampling algorithm by organically combining oversampling and undersampling, selecting nearest neighbors in a targeted way, and synthesizing samples with different strategies. Experiments show that, on imbalanced data sets processed by this algorithm, classifiers achieve satisfactory performance on both the positive and negative classes.
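The targeted neighbor selection is the paper's own contribution; as a stand-in, the imbalanced-learn library implements the same over-plus-under pattern, e.g. SMOTE oversampling combined with Edited Nearest Neighbours cleaning (SMOTEENN):

```python
from collections import Counter
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification

# Toy imbalanced data set (95% / 5%), then combined over- and under-sampling.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```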

6.
Class imbalance within data sets is common in classification problems in data mining, and most classification methods achieve unsatisfactory accuracy on the minority class of an imbalanced data set. This paper analyzes the performance of the multi-criteria linear programming (MCLP) classification method on imbalanced data sets and then, at the model level, proposes a weighted MCLP classification model for such data. The effectiveness of the weighted MCLP model is analyzed theoretically, and it is compared empirically with other methods.

7.
In the era of big data, online lending platforms generate massive transaction data every day. To make full use of these data for credit risk control, a credit risk assessment model was built with data mining algorithms. Because online lending data are mostly imbalanced, the SMOTE algorithm was applied, after repeated trials, to improve the model's performance. The study finds that the random forest model is best suited for credit risk assessment, followed by CART, ANN, and C4.5. Information such as a borrower's marital status and home/car ownership (or related loans) is of low importance, while information such as company size and length of employment, together with credit-file information such as borrowing history and credit score, is especially important in credit risk assessment.
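A sketch of the pipeline the abstract describes, assuming scikit-learn and imbalanced-learn: SMOTE rebalances each training fold, a random forest predicts default, and impurity-based importances rank the variables. Feature names and hyperparameters are placeholders.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def credit_model(X, y, feature_names):
    pipe = Pipeline([
        ("smote", SMOTE(random_state=0)),           # resamples training folds only
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
    ])
    auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
    pipe.fit(X, y)
    # Rank variables by impurity-based importance, most important first.
    importances = sorted(
        zip(feature_names, pipe.named_steps["rf"].feature_importances_),
        key=lambda t: -t[1],
    )
    return auc, importances
```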

8.
This paper illustrates, in three scenarios, the sample bias that arises in the development and application of credit scoring models, argues that reject inference is needed to correct it, and proposes a kernel-function method for reject inference. A corresponding empirical analysis yields satisfactory results. According to the study, external data such as the People's Bank of China credit registry are the most effective basis for reject inference; when such data are unavailable, the kernel-function method is an effective alternative.
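The abstract does not state the kernel estimator's form; one plausible reading, sketched below, scores each rejected applicant by a Nadaraya-Watson kernel-weighted average of accepted applicants' outcomes (`y_acc` coded 1 = good) and returns inferred good probabilities for augmenting the modelling sample. The Gaussian kernel and bandwidth are assumptions.

```python
import numpy as np

def kernel_reject_inference(X_acc, y_acc, X_rej, bandwidth=1.0):
    # Pairwise squared distances between rejects and accepted applicants.
    d2 = ((X_rej[:, None, :] - X_acc[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2 * bandwidth ** 2))          # Gaussian kernel weights
    p_good = (K * y_acc).sum(axis=1) / K.sum(axis=1)
    return p_good                                   # inferred P(good) for rejects
```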

9.
In regression problems, penalizing features, i.e., regularization, is a common form of feature processing, but penalizing features to improve training in ensemble classification has received little attention. This paper proposes an ensemble model that weights (penalizes) each sample's features by SHAP values obtained from training a GBDT model, thereby improving classification accuracy; the SHAP values of a test sample are approximated from the training-set SHAP matrix using distance-based weights between the test sample and the training samples. Experiments show that prediction accuracy improves significantly after GBDT_SHAP feature penalization, validating the algorithm. Taking the GBDT_SHAP_GBDT model as an example, it classifies well on several classic data sets and performs outstandingly on imbalanced data sets; several simulation experiments show that the method quickly reaches a good and stable fit and is fairly robust.
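A simplified rendition of the GBDT_SHAP scheme under stated assumptions: features are scaled by per-sample |SHAP| values from a first-stage GBDT, a second GBDT is fitted on the weighted features, and test-sample SHAP rows are approximated by inverse-distance-weighted averages over the training SHAP matrix. The inverse-distance weighting formula is an assumption, since the abstract does not give the exact form.

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

def shap_weighted_gbdt(X_tr, y_tr, X_te):
    base = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    shap_tr = shap.TreeExplainer(base).shap_values(X_tr)     # (n_train, p)
    W_tr = np.abs(shap_tr)

    # Approximate test SHAP as an inverse-distance-weighted mean of train SHAP.
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    w = 1.0 / (d + 1e-8)
    W_te = np.abs((w / w.sum(axis=1, keepdims=True)) @ shap_tr)

    # Second-stage GBDT on SHAP-weighted features.
    final = GradientBoostingClassifier(random_state=0).fit(X_tr * W_tr, y_tr)
    return final.predict(X_te * W_te)
```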

10.
To address the repeated training of samples and the drop in classification accuracy in multi-class Mahalanobis-Taguchi system (MTS) algorithms, this paper proposes a partial binary tree MTS multi-class algorithm based on an improved number of inter-class similarity directions (NISD). The algorithm uses the Mahalanobis distance to improve the NISD measure, obtaining a more scientific ordering of the classes; following this order, the partial binary tree is generated top-down, with a binary MTS classifier constructed at each non-leaf node to form the final classification model. For a data set with k classes, the algorithm trains only k-1 binary classifiers to obtain the multi-class MTS model, and peeling off samples layer by layer reduces repeated training. Experiments on UCI data sets show that the algorithm is more efficient while also achieving high classification accuracy.
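The improved NISD criterion is not defined in the abstract; the sketch below keeps only the skeleton under assumptions: order classes by Mahalanobis separation between class means, then peel them off one at a time so that exactly k-1 binary classifiers are trained. Logistic models stand in for the MTS binary classifiers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def separation(X, y, c, classes, VI):
    """Smallest squared Mahalanobis distance from class c's mean to any other."""
    mu = X[y == c].mean(axis=0)
    dists = []
    for c2 in classes:
        if c2 == c:
            continue
        diff = mu - X[y == c2].mean(axis=0)
        dists.append(float(diff @ VI @ diff))
    return min(dists)

def partial_binary_tree(X, y):
    classes = list(np.unique(y))
    VI = np.linalg.pinv(np.cov(X.T))               # pooled inverse covariance
    order = sorted(classes, key=lambda c: -separation(X, y, c, classes, VI))
    nodes, active = [], np.ones(len(y), dtype=bool)
    for c in order[:-1]:                           # exactly k-1 binary classifiers
        clf = LogisticRegression(max_iter=1000).fit(
            X[active], (y[active] == c).astype(int))
        nodes.append((c, clf))
        active &= (y != c)                         # peel this class off
    return nodes, order[-1]                        # classifier chain + residual class
```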

11.
The problem of constructing classification methods based on both labeled and unlabeled data sets is considered for analyzing data with complex structures. We introduce a semi-supervised logistic discriminant model with Gaussian basis expansions. The unknown parameters of the logistic model are estimated by a regularization method together with the EM algorithm. To select the adjustment parameters, we derive a model selection criterion from a Bayesian viewpoint. Numerical studies are conducted to investigate the effectiveness of the proposed modeling procedures.
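A compact EM-style sketch of the semi-supervised logistic idea, omitting the paper's Gaussian basis expansion, regularization term, and Bayesian selection criterion: the E-step soft-labels the unlabeled pool with the current model, and the M-step refits a weighted logistic regression. Labels are assumed coded 0/1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def semi_supervised_logistic(X_lab, y_lab, X_unlab, n_iter=20):
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    # Each unlabeled point enters twice, once per class, weighted by its
    # current class responsibility.
    X_all = np.vstack([X_lab, X_unlab, X_unlab])
    for _ in range(n_iter):
        p = model.predict_proba(X_unlab)[:, 1]              # E-step
        y_all = np.concatenate([y_lab, np.ones_like(p), np.zeros_like(p)])
        w_all = np.concatenate([np.ones(len(y_lab)), p, 1.0 - p])
        model = LogisticRegression(max_iter=1000).fit(       # M-step
            X_all, y_all, sample_weight=w_all)
    return model
```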

12.
陈凯 《统计教育》2008,(12):3-7
Ensemble learning has become a major focus of machine learning research, and many improved ensemble learning algorithms have been proposed. This paper proposes a selective ensemble learning algorithm, SE-BagBoosting Trees, that combines features of the Boosting and Bagging algorithms. Compared with several common machine learning algorithms, it tends to achieve smaller generalization error and higher prediction accuracy.

13.
We motivate the success of AdaBoost (ADA) in classification problems by appealing to an importance sampling perspective. Based on this insight, we propose the Weighted Bagging (WB) algorithm, a regularization method that naturally extends ADA to solve both classification and regression problems. WB uses one part of the available data to build models and a separate part to modify the weights of observations. The method is used with categorical and regression trees and is compared with ADA, Boosting, Bagging, Random Forest and Support Vector Machine. We apply these methods to several real data sets and report simulation results; both the applications and the simulations show the effectiveness of WB.

14.
方匡南  赵梦峦 《统计研究》2018,35(12):92-101
With the development of information technology, data sources are proliferating. On the one hand, this allows personal credit to be characterized more precisely and scientifically; on the other, the multiplicity of sources and the complexity of data structures challenge traditional credit scoring techniques. This paper proposes a personal credit model based on multi-source data fusion that can perform modelling and variable selection on several data sets simultaneously while accounting for both the similarity and the heterogeneity across data sets. Simulation experiments show that the proposed integrated model has clear advantages in variable selection and classification. Finally, the integrated model is applied to personal credit scoring on an urban and a rural data set.

15.
Clustering gene expression time course data is an important problem in bioinformatics because understanding which genes behave similarly can lead to the discovery of important biological information. Statistically, the problem of clustering time course data is a special case of the more general problem of clustering longitudinal data. In this paper, a very general and flexible model-based technique is used to cluster longitudinal data. Mixtures of multivariate t-distributions are utilized, with a linear model for the mean and a modified Cholesky-decomposed covariance structure. Constraints are placed upon the covariance structure, leading to a novel family of mixture models, including parsimonious models. In addition to model-based clustering, these models are also used for model-based classification, i.e., semi-supervised clustering. Parameters, including the component degrees of freedom, are estimated using an expectation-maximization algorithm and two different approaches to model selection are considered. The models are applied to simulated data to illustrate their efficacy; this includes a comparison with their Gaussian analogues—the use of these Gaussian analogues with a linear model for the mean is novel in itself. Our family of multivariate t mixture models is then applied to two real gene expression time course data sets and the results are discussed. We conclude with a summary, suggestions for future work, and a discussion about constraining the degrees of freedom parameter.

16.
Traditional credit risk assessment models do not consider the time factor; they ask only whether a customer will default, not when, so their results cannot help a manager make profit-maximizing decisions. In fact, even if a customer defaults, the financial institution can still profit under some conditions. Much recent research applies the Cox proportional hazards model in credit scoring to predict when a customer is most likely to default. However, fully exploiting the dynamic capability of the Cox proportional hazards model requires time-varying macroeconomic variables, which demand more advanced data collection. Since short-term defaults are the cases that inflict great losses on a financial institution, when approving loan applications a loan manager is less interested in predicting when a loan will default than in identifying applications likely to default within a short period. This paper proposes a decision-tree-based short-term-default credit risk assessment model: the decision tree filters short-term defaults to produce a highly accurate model that can distinguish default lending. The model integrates bootstrap aggregating (Bagging) with the synthetic minority over-sampling technique (SMOTE) to improve the decision tree's stability and its performance on unbalanced data. Finally, a real case of small and medium enterprise loan data drawn from a local financial institution in Taiwan further illustrates the proposed approach. Compared with the logistic regression and Cox proportional hazards models, the classification recall and precision of the proposed model were clearly superior.
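A condensed sketch of the approach, assuming scikit-learn and imbalanced-learn: SMOTE rebalances the training data before a Bagging ensemble of depth-limited decision trees, and the model is scored on recall and precision for the default class. Hyperparameters are placeholders, not the paper's settings.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

def short_term_default_model(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    model = Pipeline([
        ("smote", SMOTE(random_state=0)),
        ("bagging", BaggingClassifier(
            DecisionTreeClassifier(max_depth=5),
            n_estimators=100, random_state=0)),
    ])
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # Recall and precision for the default (positive) class.
    return recall_score(y_te, pred), precision_score(y_te, pred)
```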

17.
Mixture distribution models are more useful than pure distributions for modeling heterogeneous data sets. The aim of this paper is to propose, for the first time, a mixture of Weibull-Poisson (WP) distributions to model heterogeneous data sets, creating a powerful alternative mixture distribution for this purpose. The study examines many features of the proposed mixture of WP distributions. The expectation-maximization (EM) algorithm is used to obtain the maximum-likelihood estimates of the parameters, and a simulation study evaluates the performance of the proposed EM scheme. Applications to two real heterogeneous data sets show the flexibility and potential of the new mixture distribution.

18.
In this study, our aim was to investigate how different data structures and sample sizes affect structural equation modeling and its model fit measures. A simulation study evaluated the fit measures of structural equation models constructed under different data structures and sample sizes. The simulations revealed optimization and negative-variance-estimation problems, depending on the sample size and the changing correlations; these problems disappeared when either the sample size or the correlations between the variables within a factor were increased. For future studies, the RMSEA and IFI model fit measures can be recommended for all sample sizes, provided the correlation structure of the data sets satisfies the multivariate normality assumption.

19.
侯成琪  王频 《统计研究》2008,25(11):73-78
This paper uses copulas to model the joint distribution of different risk types in integrated risk management, and proposes a copula-based integrated risk measure, Copula-VaR, together with a Monte Carlo simulation algorithm for computing it. Taking Shenzhen Development Bank and Shanghai Pudong Development Bank as subjects, Copula-VaR is compared empirically with N-VaR and Add-VaR, two approximate integrated risk measures commonly used in industry. The comparison finds that, relative to Copula-VaR, N-VaR and Add-VaR tend to overestimate risk, mainly because they make unrealistic assumptions about the dependence structure between credit returns and market returns.
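A minimal Monte Carlo sketch of a copula-based integrated VaR, under assumptions the paper may not share: a Gaussian copula calibrated from Kendall's tau couples the two return series, empirical quantiles serve as marginals, and VaR is read off the simulated portfolio loss distribution. The copula family, weights, and confidence level are placeholders.

```python
import numpy as np
from scipy import stats

def copula_var(credit_ret, market_ret, weights=(0.5, 0.5),
               n_sims=100_000, alpha=0.99, seed=0):
    rng = np.random.default_rng(seed)
    # Rank correlation -> Gaussian copula parameter.
    tau, _ = stats.kendalltau(credit_ret, market_ret)
    rho = np.sin(np.pi * tau / 2)
    z = rng.multivariate_normal([0, 0], [[1.0, rho], [rho, 1.0]], size=n_sims)
    u = stats.norm.cdf(z)                          # copula samples in [0,1]^2
    # Invert the empirical marginals (quantile transform).
    sims = np.column_stack([
        np.quantile(credit_ret, u[:, 0]),
        np.quantile(market_ret, u[:, 1]),
    ])
    port = sims @ np.asarray(weights)
    return -np.quantile(port, 1 - alpha)           # loss quantile as VaR
```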
