首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
It is well known that statistical classifiers trained from imbalanced data lead to low true positive rates and select inconsistent significant variables. In this article, an improved method is proposed to enhance the classification accuracy for the minority class by differentiating misclassification cost for each group. The overall error rate is replaced by an alternative composite criterion. Furthermore, we propose an approach to estimate the tuning parameter, the composite criterion, and the cut-point simultaneously. Simulations show that the proposed method achieves a high true positive rate on prediction and a good performance on variable selection for both continuous and categorical predictors, even with highly imbalanced data. An illustrative example of the analysis of the suboptimal health state data in traditional Chinese medicine is discussed to show the reasonable application of the proposed method.  相似文献   

2.
Imbalanced data brings biased classification and causes the low accuracy of the classification of the minority class. In this article, we propose a methodology to select grouped variables using the area under the ROC with an adjustable prediction cut point. The proposed method enhance the accuracy of classification for the minority class by maximizing the true positive rate. Simulation results show that the proposed method is appropriate for both the categorical and continuous covariates. An illustrative example of the analysis of the SHS data in TCM is discussed to show the reasonable application of the proposed method.  相似文献   

3.
A method for the national assessment of the biological quality of river sites is developed. Multivariate discrimination, based on site environmental characteristics, is used on a biological classification of reference sites to derive a procedure to predict the fauna to be expected in the absence of environmental stress. Various quality indices, based on a comparison of the observed with the expected fauna, are proposed. The sizes of the various sources of error and variation, and their effects on the rates of misclassification to quality bands, are examined.  相似文献   

4.
A method for the national assessment of the biological quality of river sites is developed. Multivariate discrimination, based on site environmental characteristics, is used on a biological classification of reference sites to derive a procedure to predict the fauna to be expected in the absence of environmental stress. Various quality indices, based on a comparison of the observed with the expected fauna, are proposed. The sizes of the various sources of error and variation, and their effects on the rates of misclassification to quality bands, are examined.  相似文献   

5.
ABSTRACT

In this study, Monte Carlo simulation experiments were employed to examine the performance of four statistical two-group classification methods when the data distributions are skewed and misclassification costs are unequal, conditions frequently encountered in business and economic applications. The classification methods studied are linear and quadratic parametric, nearest neighbor and logistic regression methods. It was found that when skewness is moderate, the parametric methods tend to give best results. Depending on the specific data condition, when skewness is high, either the linear parametric, logistic regression, or the nearest-neighbor method gives the best results. When misclassification costs differ widely across groups, the linear parametric method is favored over the other methods for many of the data conditions studied.  相似文献   

6.
In this paper, we consider the classification of high-dimensional vectors based on a small number of training samples from each class. The proposed method follows the Bayesian paradigm, and it is based on a small vector which can be viewed as the regression of the new observation on the space spanned by the training samples. The classification method provides posterior probabilities that the new vector belongs to each of the classes, hence it adapts naturally to any number of classes. Furthermore, we show a direct similarity between the proposed method and the multicategory linear support vector machine introduced in Lee et al. [2004. Multicategory support vector machines: theory and applications to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99 (465), 67–81]. We compare the performance of the technique proposed in this paper with the SVM classifier using real-life military and microarray datasets. The study shows that the misclassification errors of both methods are very similar, and that the posterior probabilities assigned to each class are fairly accurate.  相似文献   

7.
将AdaBoost组合算法应用于信用评分模型中的分类问题,并针对该算法在解决不平衡分类问题上的一些不足,对算法进行了改进。应用此改进的AdaBoost算法,创建了新的信用评分模型,并进行了实证分析。实证结果表明,基于改进的AdaBoost算法的信用评分模型可以有效降低由于模型错判而导致的损失。  相似文献   

8.
We study multiple-class classification problems. Both ordinal and categorical labeled cases are discussed. The common approaches for multiple-class classification are built on binary classifiers, in which one-versus-one and one-versus-rest are typical approaches. When the number of classes is large, then these binary-classifier-based methods may suffer from either computational costs or the highly imbalanced sample sizes in their training stage. In order to alleviate the computational burden and the imbalanced training data issue in multiple-class classification problems, we propose a method that has competitive performance and retains the ease of model interpretation, which is essential for a prognostic/predictive model.  相似文献   

9.
The support vector machine (SVM) has been successfully applied to various classification areas with great flexibility and a high level of classification accuracy. However, the SVM is not suitable for the classification of large or imbalanced datasets because of significant computational problems and a classification bias toward the dominant class. The SVM combined with the k-means clustering (KM-SVM) is a fast algorithm developed to accelerate both the training and the prediction of SVM classifiers by using the cluster centers obtained from the k-means clustering. In the KM-SVM algorithm, however, the penalty of misclassification is treated equally for each cluster center even though the contributions of different cluster centers to the classification can be different. In order to improve classification accuracy, we propose the WKM–SVM algorithm which imposes different penalties for the misclassification of cluster centers by using the number of data points within each cluster as a weight. As an extension of the WKM–SVM, the recovery process based on WKM–SVM is suggested to incorporate the information near the optimal boundary. Furthermore, the proposed WKM–SVM can be successfully applied to imbalanced datasets with an appropriate weighting strategy. Experiments show the effectiveness of our proposed methods.  相似文献   

10.
In discriminant analysis it is often desirable to find a small subset of the variables that were measured on the individuals of known origin, to be used for classifying individuals of unknown origin. In this paper a Bayesian approach to variable selection is used that includes an additional subset of variables for future classification if the additional measurement costs for this subsst are lower than the resulting reduction in expected misclassification costs.  相似文献   

11.
Estimated associations between an outcome variable and misclassified covariates tend to be biased when the methods of estimation that ignore the classification error are applied. Available methods to account for misclassification often require the use of a validation sample (i.e. a gold standard). In practice, however, such a gold standard may be unavailable or impractical. We propose a Bayesian approach to adjust for misclassification in a binary covariate in the random effect logistic model when a gold standard is not available. This Markov Chain Monte Carlo (MCMC) approach uses two imperfect measures of a dichotomous exposure under the assumptions of conditional independence and non-differential misclassification. A simulated numerical example and a real clinical example are given to illustrate the proposed approach. Our results suggest that the estimated log odds of inpatient care and the corresponding standard deviation are much larger in our proposed method compared with the models ignoring misclassification. Ignoring misclassification produces downwardly biased estimates and underestimate uncertainty.  相似文献   

12.
Multiple discriminant analysis (MDA) is a frequently used statistical technique. Although the dependence of this technique on the underlying assumptions concerning population priors and misclassification costs is well known, the assumption most often made by researchers is that both population priors and misclassification costs are equal. The purpose of this paper is to demonstrate the magnitude of the effect of these assumptions on statistical results. In the savings and loan case used here, the population priors are known:however, the relative misclassification costs are not. To test the sensitivity of the results to the unknown misclassification costs several different misclassification cost assumptions are used.  相似文献   

13.
Using 1998 and 1999 singleton birth data of the State of Florida, we study the stability of classification trees. Tree stability depends on both the learning algorithm and the specific data set. In this study, test samples are used in statistical learning to evaluate both stability and predictive performance. We also use the resampling technique bootstrap, which can be regarded as data self-perturbation, to evaluate the sensitivity of the modeling algorithm with respect to the specific data set. We demonstrate that the selection of the cost function plays an important role in stability. In particular, classifiers with equal misclassification costs and equal priors are less stable compared to those with unequal misclassification costs and equal priors.  相似文献   

14.
ABSTRACT

In this paper, we propose an adaptive stochastic gradient boosting tree for classification studies with imbalanced data. The adjustment of cost-sensitivity and the predictive threshold are integrated together with a composite criterion into the original stochastic gradient boosting tree to deal with the issues of the imbalanced data structure. Numerical study shows that the proposed method can significantly enhance the classification accuracy for the minority class with only a small loss in the true negative rate for the majority class. We discuss the relation of the cost-sensitivity to the threshold manipulation using simulations. An illustrative example of the analysis of suboptimal health-state data in traditional Chinese medicine is discussed.  相似文献   

15.
We study the problem of classifying an individual into one of several populations based on mixed nominal, continuous, and ordinal data. Specifically, we obtain a classification procedure as an extension to the so-called location linear discriminant function, by specifying a general mixed-data model for the joint distribution of the mixed discrete and continuous variables. We outline methods for estimating misclassification error rates. Results of simulations of the performance of proposed classification rules in various settings vis-à-vis a robust mixed-data discrimination method are reported as well. We give an example utilizing data on croup in children.  相似文献   

16.
The problem of classification into two univariate normal populations with a common mean is considered. Several classification rules are proposed based on efficient estimators of the common mean. Detailed numerical comparisons of probabilities of misclassifications using these rules have been carried out. It is shown that the classification rule based on the Graybill-Deal estimator of the common mean performs the best. Classification rules are also proposed for the case when variances are assumed to be ordered. Comparison of these rules with the rule based on the Graybill-Deal estimator has been done with respect to individual probabilities of misclassification.  相似文献   

17.
Consider classifying an n × I observation vector as coming from one of two multivariate normal distributions which differ both in mean vectors and covariance matrices. A class of dis-crimination rules based upon n independent univariate discrim-inate functions is developed yielding exact misclassification probabilities when the population parameters are known. An efficient search of this class to select the procedure with minimum expected misclassification is made by employing an algorithm of the implicit enumeration type used in integer programming. The procedure is applied to the classification of male twins as either monozygotic or dizygotic.  相似文献   

18.
Summary.  An authentic food is one that is what it purports to be. Food processors and consumers need to be assured that, when they pay for a specific product or ingredient, they are receiving exactly what they pay for. Classification methods are an important tool in food authenticity studies where they are used to assign food samples of unknown type to known types. A classification method is developed where the classification rule is estimated by using both the labelled and the unlabelled data, in contrast with many classical methods which use only the labelled data for estimation. This methodology models the data as arising from a Gaussian mixture model with parsimonious covariance structure, as is done in model-based clustering. A missing data formulation of the mixture model is used and the models are fitted by using the EM and classification EM algorithms. The methods are applied to the analysis of spectra of food-stuffs recorded over the visible and near infra-red wavelength range in food authenticity studies. A comparison of the performance of model-based discriminant analysis and the method of classification proposed is given. The classification method proposed is shown to yield very good misclassification rates. The correct classification rate was observed to be as much as 15% higher than the correct classification rate for model-based discriminant analysis.  相似文献   

19.
Abstract

We consider the classification of high-dimensional data under the strongly spiked eigenvalue (SSE) model. We create a new classification procedure on the basis of the high-dimensional eigenstructure in high-dimension, low-sample-size context. We propose a distance-based classification procedure by using a data transformation. We also prove that our proposed classification procedure has consistency property for misclassification rates. We discuss performances of our classification procedure in simulations and real data analyses using microarray data sets.  相似文献   

20.
We develop classification rules for data that have an autoregressive circulant covariance structure under the assumption of multivariate normality. We also develop classification rules assuming a general circulant covariance structure. The new classification rules are efficient in reducing the misclassification error rates when the number of observations is not large enough to estimate the unknown variance–covariance matrix. The proposed classification rules are demonstrated by simulation study for their validity and illustrated by a real data analysis for their use. Analyses of both simulated data and real data show the effectiveness of our new classification rules.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号