1.

Regularized receiver operating characteristic-based logistic regression for grouped variable selection with composite criterion

Yang Li Chenqun Yu Yichen Qin Limin Wang Jiaxu Chen Danhui Yi 《Journal of Statistical Computation and Simulation》2015,85(13):2582-2595

It is well known that statistical classifiers trained from imbalanced data lead to low true positive rates and select inconsistent significant variables. In this article, an improved method is proposed to enhance the classification accuracy for the minority class by differentiating misclassification cost for each group. The overall error rate is replaced by an alternative composite criterion. Furthermore, we propose an approach to estimate the tuning parameter, the composite criterion, and the cut-point simultaneously. Simulations show that the proposed method achieves a high true positive rate on prediction and a good performance on variable selection for both continuous and categorical predictors, even with highly imbalanced data. An illustrative example of the analysis of the suboptimal health state data in traditional Chinese medicine is discussed to show the reasonable application of the proposed method.

2.

Grouped Variable Selection Using Area under the ROC with Imbalanced Data

Yang Li Yichen Qin Limin Wang Jiaxu Chen Shuangge Ma 《Communications in Statistics - Simulation and Computation》2016,45(4):1268-1280

Imbalanced data lead to biased classification and to low classification accuracy for the minority class. In this article, we propose a methodology to select grouped variables using the area under the ROC curve with an adjustable prediction cut point. The proposed method enhances the accuracy of classification for the minority class by maximizing the true positive rate. Simulation results show that the proposed method is appropriate for both categorical and continuous covariates. An illustrative example of the analysis of the suboptimal health state (SHS) data in traditional Chinese medicine (TCM) is discussed to show the reasonable application of the proposed method.
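As a rough illustration of the two ingredients named in this abstract, the sketch below computes the area under the ROC curve as a rank statistic and then scans candidate cut points. It is not the authors' procedure: the function names, the toy data, and the use of Youden's J (true positive rate minus false positive rate) as the cut-point criterion are assumptions made for the example.

```python
def auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_cut_point(scores, labels):
    """Scan the observed scores as cut points (predict 1 when score >= cut).

    Returns (cut, J, TPR, FPR) for the cut with the largest Youden's J.
    Youden's J is an assumed stand-in criterion, not the paper's.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for cut in sorted(set(scores)):
        tpr = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut) / n_pos
        fpr = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cut) / n_neg
        j = tpr - fpr
        if best is None or j > best[1]:
            best = (cut, j, tpr, fpr)
    return best


# Hypothetical imbalanced toy data: 3 positives, 5 negatives.
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8]
labels = [0, 0, 0, 1, 0, 0, 1, 1]
print(auc(scores, labels))
print(best_cut_point(scores, labels))
```

A criterion that rewards the true positive rate more heavily than Youden's J would move the selected cut point lower, which is the kind of adjustment the abstract describes for the minority class.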

3.

Derivation of a biological quality index for river sites: Comparison of the observed with the expected fauna

R. T. Clarke M. T. Furse J. F. Wright D. Moss 《Journal of applied statistics》1996,23(2-3):311-332

A method for the national assessment of the biological quality of river sites is developed. Multivariate discrimination, based on site environmental characteristics, is used on a biological classification of reference sites to derive a procedure to predict the fauna to be expected in the absence of environmental stress. Various quality indices, based on a comparison of the observed with the expected fauna, are proposed. The sizes of the various sources of error and variation, and their effects on the rates of misclassification to quality bands, are examined.

4.

Derivation of a biological quality index for river sites: comparison of the observed with the expected fauna

R. T. Clarke M. T. Furse J. F. Wright D. Moss 《Journal of applied statistics》1996,23(2):311-332

A method for the national assessment of the biological quality of river sites is developed. Multivariate discrimination, based on site environmental characteristics, is used on a biological classification of reference sites to derive a procedure to predict the fauna to be expected in the absence of environmental stress. Various quality indices, based on a comparison of the observed with the expected fauna, are proposed. The sizes of the various sources of error and variation, and their effects on the rates of misclassification to quality bands, are examined.

5.

THE EFFECTS OF MISCLASSIFICATION COSTS AND SKEWED DISTRIBUTIONS IN TWO-GROUP CLASSIFICATION

《Communications in Statistics - Simulation and Computation》2013,42(3):401-423

ABSTRACT

In this study, Monte Carlo simulation experiments were employed to examine the performance of four statistical two-group classification methods when the data distributions are skewed and misclassification costs are unequal, conditions frequently encountered in business and economic applications. The classification methods studied are the linear parametric, quadratic parametric, nearest-neighbor, and logistic regression methods. It was found that when skewness is moderate, the parametric methods tend to give the best results. Depending on the specific data condition, when skewness is high, either the linear parametric, logistic regression, or nearest-neighbor method gives the best results. When misclassification costs differ widely across groups, the linear parametric method is favored over the other methods for many of the data conditions studied.

6.

SVM-like decision theoretical classification of high-dimensional vectors

David J. Bradshaw Marianna Pensky 《Journal of statistical planning and inference》2010

In this paper, we consider the classification of high-dimensional vectors based on a small number of training samples from each class. The proposed method follows the Bayesian paradigm, and it is based on a small vector which can be viewed as the regression of the new observation on the space spanned by the training samples. The classification method provides posterior probabilities that the new vector belongs to each of the classes, hence it adapts naturally to any number of classes. Furthermore, we show a direct similarity between the proposed method and the multicategory linear support vector machine introduced in Lee et al. [2004. Multicategory support vector machines: theory and applications to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99 (465), 67–81]. We compare the performance of the technique proposed in this paper with the SVM classifier using real-life military and microarray datasets. The study shows that the misclassification errors of both methods are very similar, and that the posterior probabilities assigned to each class are fairly accurate.

7.

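A credit scoring model based on an improved AdaBoost algorithm

Haijiang Yang Qiuping Wei Jingxiao Zhang 《Statistics & Information Forum》2011,26(2):27-31

The AdaBoost ensemble algorithm is applied to the classification problem in credit scoring models, and the algorithm is improved to address some of its shortcomings in handling imbalanced classification. Using the improved AdaBoost algorithm, a new credit scoring model is built and an empirical analysis is carried out. The empirical results show that the credit scoring model based on the improved AdaBoost algorithm can effectively reduce the losses caused by model misclassification.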

8.

Multiple-class classification: Ordinal and categorical labels

Yuan-chin Ivan Chang 《Communications in Statistics - Simulation and Computation》2017,46(10):7561-7581

We study multiple-class classification problems. Both ordinal and categorical labeled cases are discussed. The common approaches for multiple-class classification are built on binary classifiers, in which one-versus-one and one-versus-rest are typical approaches. When the number of classes is large, then these binary-classifier-based methods may suffer from either computational costs or the highly imbalanced sample sizes in their training stage. In order to alleviate the computational burden and the imbalanced training data issue in multiple-class classification problems, we propose a method that has competitive performance and retains the ease of model interpretation, which is essential for a prognostic/predictive model.

9.

Weighted Support Vector Machine Using k-Means Clustering

Sungwan Bang 《Communications in Statistics - Simulation and Computation》2013,42(10):2307-2324

The support vector machine (SVM) has been successfully applied to various classification areas with great flexibility and a high level of classification accuracy. However, the SVM is not suitable for the classification of large or imbalanced datasets because of significant computational problems and a classification bias toward the dominant class. The SVM combined with the k-means clustering (KM-SVM) is a fast algorithm developed to accelerate both the training and the prediction of SVM classifiers by using the cluster centers obtained from the k-means clustering. In the KM-SVM algorithm, however, the penalty of misclassification is treated equally for each cluster center even though the contributions of different cluster centers to the classification can be different. In order to improve classification accuracy, we propose the WKM–SVM algorithm which imposes different penalties for the misclassification of cluster centers by using the number of data points within each cluster as a weight. As an extension of the WKM–SVM, the recovery process based on WKM–SVM is suggested to incorporate the information near the optimal boundary. Furthermore, the proposed WKM–SVM can be successfully applied to imbalanced datasets with an appropriate weighting strategy. Experiments show the effectiveness of our proposed methods.
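One way to read the weighting idea is as a cost-weighted soft-margin problem over the cluster centers. The formulation below is only a sketch under that reading. The notation is assumed here rather than taken from the paper: c_j is the j-th cluster center, y_j its class label, n_j the number of data points in its cluster, m the number of centers, and C the usual penalty constant. The paper's exact weights may be normalized differently.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{j=1}^{m} n_j\,\xi_j
\quad\text{subject to}\quad
y_j\,\bigl(w^{\top}c_j + b\bigr)\ \ge\ 1-\xi_j,\qquad \xi_j\ \ge\ 0,\qquad j=1,\dots,m.
```

Setting every n_j to 1 recovers the equal-penalty problem that the abstract attributes to KM-SVM.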

10.

A decision-theoretic approach to variable selection in discriminant analysis

Ulrich Menzefricke 《Communications in Statistics - Theory and Methods》2013,42(7):669-686

In discriminant analysis it is often desirable to find a small subset of the variables that were measured on the individuals of known origin, to be used for classifying individuals of unknown origin. In this paper a Bayesian approach to variable selection is used that includes an additional subset of variables for future classification if the additional measurement costs for this subset are lower than the resulting reduction in expected misclassification costs.

11.

Improving logistic regression on the imbalanced data by a novel penalized log-likelihood function

Lili Zhang Trent Geisler Herman Ray Ying Xie 《Journal of applied statistics》2022,49(13):3257

Logistic regression is estimated by maximizing the log-likelihood objective function formulated under the assumption of maximizing the overall accuracy. That assumption does not suit imbalanced data. The resulting models tend to be biased towards the majority class (i.e. non-event), which can bring great loss in practice. One strategy for mitigating such bias is to penalize the misclassification costs of observations differently in the log-likelihood function. Existing solutions require either difficult hyperparameter estimation or high computational complexity. We propose a novel penalized log-likelihood function by including penalty weights as decision variables for observations in the minority class (i.e. event) and learning them from data along with model coefficients. In the experiments, the proposed logistic regression model is compared with the existing ones on the area under the receiver operating characteristic (ROC) curve from 10 public datasets and 16 simulated datasets, as well as on training time. A detailed analysis is conducted on an imbalanced credit dataset to examine the estimated probability distributions, additional performance measurements (i.e. type I error and type II error) and model coefficients. The results demonstrate that both the discrimination ability and computation efficiency of logistic regression models are improved using the proposed log-likelihood function as the learning objective.
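The cost-weighting strategy mentioned in this abstract can be illustrated with a weighted log-likelihood in which each minority-class (event) observation carries a penalty weight. The sketch below fixes a single weight `w_event` by hand, whereas the paper learns per-observation weights from the data. The function names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_log_likelihood(beta, X, y, w_event=1.0):
    """Sum of w_i * [y_i log p_i + (1 - y_i) log(1 - p_i)].

    Events (y_i = 1) get weight w_event, non-events get weight 1.
    """
    total = 0.0
    for row, label in zip(X, y):
        p = sigmoid(sum(b * x for b, x in zip(beta, row)))
        weight = w_event if label == 1 else 1.0
        total += weight * (label * math.log(p) + (1 - label) * math.log(1.0 - p))
    return total


def weighted_gradient(beta, X, y, w_event=1.0):
    """Gradient of the weighted log-likelihood: sum of w_i * (y_i - p_i) * x_i."""
    grad = [0.0] * len(beta)
    for row, label in zip(X, y):
        p = sigmoid(sum(b * x for b, x in zip(beta, row)))
        weight = w_event if label == 1 else 1.0
        for j, x in enumerate(row):
            grad[j] += weight * (label - p) * x
    return grad


# Hypothetical toy data: an intercept column plus one feature, 2 events in 6 cases.
X = [[1.0, float(x)] for x in range(6)]
y = [0, 0, 0, 0, 1, 1]
beta0 = [0.0, 0.0]
print(weighted_gradient(beta0, X, y, w_event=1.0))
print(weighted_gradient(beta0, X, y, w_event=3.0))
```

At `beta0` the intercept component of the gradient is negative with `w_event=1.0` and positive with `w_event=3.0`, so up-weighting the events pushes the fitted probabilities upward from the same starting point.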

12.

A Bayesian Adjustment for Covariate Misclassification with Correlated Binary Outcome Data

Dianxu Ren Roslyn A. Stone 《Journal of applied statistics》2007,34(9):1019-1034

Estimated associations between an outcome variable and misclassified covariates tend to be biased when estimation methods that ignore the classification error are applied. Available methods to account for misclassification often require the use of a validation sample (i.e. a gold standard). In practice, however, such a gold standard may be unavailable or impractical. We propose a Bayesian approach to adjust for misclassification in a binary covariate in the random effect logistic model when a gold standard is not available. This Markov Chain Monte Carlo (MCMC) approach uses two imperfect measures of a dichotomous exposure under the assumptions of conditional independence and non-differential misclassification. A simulated numerical example and a real clinical example are given to illustrate the proposed approach. Our results suggest that the estimated log odds of inpatient care and the corresponding standard deviation are much larger in our proposed method compared with the models ignoring misclassification. Ignoring misclassification produces downwardly biased estimates and underestimates uncertainty.

13.

The effect of unequal priors and unequal misclassification costs on MDA

Patricia M. Rudolph Marvin Karson 《Journal of applied statistics》1988,15(1):69-83

Multiple discriminant analysis (MDA) is a frequently used statistical technique. Although the dependence of this technique on the underlying assumptions concerning population priors and misclassification costs is well known, the assumption most often made by researchers is that both population priors and misclassification costs are equal. The purpose of this paper is to demonstrate the magnitude of the effect of these assumptions on statistical results. In the savings and loan case used here, the population priors are known; however, the relative misclassification costs are not. To test the sensitivity of the results to the unknown misclassification costs, several different misclassification cost assumptions are used.
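For the two-group case, the dependence on priors and costs can be written out with the standard minimum expected-cost rule. The notation is assumed here rather than taken from the paper: \pi_i is the prior of group i, f_i its density, c(j | i) the cost of assigning a group-i case to group j, and P(j | i) the corresponding misclassification probability.

```latex
\mathrm{ECM} = c(2\mid 1)\,P(2\mid 1)\,\pi_1 + c(1\mid 2)\,P(1\mid 2)\,\pi_2,
\qquad
\text{assign } x \text{ to group 1 if}\quad
\frac{f_1(x)}{f_2(x)}\ \ge\ \frac{c(1\mid 2)}{c(2\mid 1)}\cdot\frac{\pi_2}{\pi_1}.
```

With equal priors and equal costs the right-hand side is 1, which is the default assumption the abstract describes. Any other cost ratio shifts the boundary between the groups.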

14.

Assessing the stability of classification trees using Florida birth data

Panagiota Kitsantas Myles Hollander Lei M. Li 《Journal of statistical planning and inference》2007

Using 1998 and 1999 singleton birth data of the State of Florida, we study the stability of classification trees. Tree stability depends on both the learning algorithm and the specific data set. In this study, test samples are used in statistical learning to evaluate both stability and predictive performance. We also use the bootstrap resampling technique, which can be regarded as data self-perturbation, to evaluate the sensitivity of the modeling algorithm with respect to the specific data set. We demonstrate that the selection of the cost function plays an important role in stability. In particular, classifiers with equal misclassification costs and equal priors are less stable compared to those with unequal misclassification costs and equal priors.

15.

Adaptive stochastic gradient boosting tree with composite criterion

《Journal of Statistical Computation and Simulation》2012,82(10):1901-1911

ABSTRACT

In this paper, we propose an adaptive stochastic gradient boosting tree for classification studies with imbalanced data. The adjustment of cost-sensitivity and the predictive threshold are integrated together with a composite criterion into the original stochastic gradient boosting tree to deal with the issues of the imbalanced data structure. Numerical study shows that the proposed method can significantly enhance the classification accuracy for the minority class with only a small loss in the true negative rate for the majority class. We discuss the relation of the cost-sensitivity to the threshold manipulation using simulations. An illustrative example of the analysis of suboptimal health-state data in traditional Chinese medicine is discussed.

16.

Classification with discrete and continuous variables via general mixed-data models

A. R. de Leon A. Soo T. Williamson 《Journal of applied statistics》2011,38(5):1021-1032

We study the problem of classifying an individual into one of several populations based on mixed nominal, continuous, and ordinal data. Specifically, we obtain a classification procedure as an extension to the so-called location linear discriminant function, by specifying a general mixed-data model for the joint distribution of the mixed discrete and continuous variables. We outline methods for estimating misclassification error rates. Results of simulations of the performance of proposed classification rules in various settings vis-à-vis a robust mixed-data discrimination method are reported as well. We give an example utilizing data on croup in children.

17.

Classification into two normal populations with a common mean and unequal variances

Nabakumar Jana Somesh Kumar 《Communications in Statistics - Simulation and Computation》2017,46(1):546-558

The problem of classification into two univariate normal populations with a common mean is considered. Several classification rules are proposed based on efficient estimators of the common mean. Detailed numerical comparisons of probabilities of misclassifications using these rules have been carried out. It is shown that the classification rule based on the Graybill-Deal estimator of the common mean performs the best. Classification rules are also proposed for the case when variances are assumed to be ordered. Comparison of these rules with the rule based on the Graybill-Deal estimator has been done with respect to individual probabilities of misclassification.
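The Graybill-Deal estimator combines the two sample means with weights proportional to n_i / s_i^2, where n_i is the sample size and s_i^2 the sample variance of population i. The sketch below computes it and plugs it into a likelihood-ratio rule with equal priors. That plug-in rule, the function names, and the toy samples are assumptions for illustration and are not necessarily the rules compared in the paper.

```python
import math
from statistics import mean, variance


def graybill_deal(sample1, sample2):
    """Common-mean estimate: sample means weighted by n_i / s_i^2."""
    w1 = len(sample1) / variance(sample1)
    w2 = len(sample2) / variance(sample2)
    return (w1 * mean(sample1) + w2 * mean(sample2)) / (w1 + w2)


def classify(x, sample1, sample2):
    """Assign x to population 1 or 2 by comparing plug-in normal log-densities.

    Both densities share the estimated common mean, so only the variances
    separate them (assumed rule, equal priors and equal costs).
    """
    mu = graybill_deal(sample1, sample2)
    v1, v2 = variance(sample1), variance(sample2)
    log_f1 = -0.5 * math.log(v1) - (x - mu) ** 2 / (2.0 * v1)
    log_f2 = -0.5 * math.log(v2) - (x - mu) ** 2 / (2.0 * v2)
    return 1 if log_f1 > log_f2 else 2


# Hypothetical training samples: same mean, very different spread.
tight = [9.0, 10.0, 11.0]
wide = [4.0, 10.0, 16.0]
print(graybill_deal(tight, wide))
print(classify(10.2, tight, wide), classify(20.0, tight, wide))
```

Because the means coincide, the rule reduces to a distance check: observations close to the estimated common mean go to the low-variance population and distant ones to the high-variance population.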

18.

Combinatoric classification of multivariate normal variates

C.L. Dunn W.B. Smith 《Communications in Statistics - Theory and Methods》2013,42(13):1317-1340

Consider classifying an n × 1 observation vector as coming from one of two multivariate normal distributions which differ both in mean vectors and covariance matrices. A class of discrimination rules based upon n independent univariate discriminant functions is developed, yielding exact misclassification probabilities when the population parameters are known. An efficient search of this class to select the procedure with minimum expected misclassification is made by employing an algorithm of the implicit enumeration type used in integer programming. The procedure is applied to the classification of male twins as either monozygotic or dizygotic.

19.

Using unlabelled data to update classification rules with applications in food authenticity studies

Nema Dean Thomas Brendan Murphy Gerard Downey 《Journal of the Royal Statistical Society. Series C, Applied statistics》2006,55(1):1-14

Summary. An authentic food is one that is what it purports to be. Food processors and consumers need to be assured that, when they pay for a specific product or ingredient, they are receiving exactly what they pay for. Classification methods are an important tool in food authenticity studies where they are used to assign food samples of unknown type to known types. A classification method is developed where the classification rule is estimated by using both the labelled and the unlabelled data, in contrast with many classical methods which use only the labelled data for estimation. This methodology models the data as arising from a Gaussian mixture model with parsimonious covariance structure, as is done in model-based clustering. A missing data formulation of the mixture model is used and the models are fitted by using the EM and classification EM algorithms. The methods are applied to the analysis of spectra of food-stuffs recorded over the visible and near infra-red wavelength range in food authenticity studies. A comparison of the performance of model-based discriminant analysis and the method of classification proposed is given. The classification method proposed is shown to yield very good misclassification rates. The correct classification rate was observed to be as much as 15% higher than the correct classification rate for model-based discriminant analysis.

20.

A classifier under the strongly spiked eigenvalue model in high-dimension, low-sample-size context

Aki Ishii 《Communications in Statistics - Theory and Methods》2020,49(7):1561-1577

Abstract

We consider the classification of high-dimensional data under the strongly spiked eigenvalue (SSE) model. We create a new classification procedure on the basis of the high-dimensional eigenstructure in the high-dimension, low-sample-size context. We propose a distance-based classification procedure by using a data transformation. We also prove that our proposed classification procedure has the consistency property for misclassification rates. We discuss the performance of our classification procedure in simulations and real data analyses using microarray data sets.