Similar documents
20 similar documents were retrieved.
1.
Decision making is often supported by decision models. This study suggests that the negative impact of poor data quality (DQ) on decision making is often mediated by biased model estimation. To highlight this perspective, we develop an analytical framework that links three quality levels – data, model, and decision. The general framework is first developed at a high level and then extended toward understanding the effect of incomplete datasets on Linear Discriminant Analysis (LDA) classifiers. The interplay between the three quality levels is evaluated analytically – initially for a one-dimensional case, and then for multiple dimensions. The impact is then analyzed further through several simulation experiments with artificial and real-world datasets. The experimental results support the analytical development and reveal a nearly exponential decline in the decision error as the completeness level increases. To conclude, we discuss the framework and the empirical findings, and elaborate on the implications of our model for data quality management and for the use of data in estimating decision models.
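A minimal simulation sketch in the spirit of this framework (not the paper's analytical development): training data for an LDA classifier are degraded to different completeness levels under MCAR, mean-imputed, and the resulting decision error is measured on a large test set. The Gaussian data-generating process and the imputation choice are illustrative assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

def make_data(n, d=4, delta=1.0):
    """Two Gaussian classes separated by `delta` in every dimension (assumed setup)."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d)) + delta * y[:, None]
    return X, y

X_test, y_test = make_data(5000)

for completeness in (0.5, 0.7, 0.9, 1.0):
    X_train, y_train = make_data(500)
    # Delete entries completely at random (MCAR) down to the target completeness level.
    mask = rng.random(X_train.shape) > completeness
    X_miss = X_train.copy()
    X_miss[mask] = np.nan
    X_imp = SimpleImputer(strategy="mean").fit_transform(X_miss)
    err = 1 - LinearDiscriminantAnalysis().fit(X_imp, y_train).score(X_test, y_test)
    print(f"completeness={completeness:.1f}  decision error={err:.3f}")
```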

2.
Although the effect of missing data on regression estimates has received considerable attention, their effect on predictive performance has been neglected. We studied the effect of three missing-data strategies (omission of records with missing values, replacement with a mean, and imputation based on regression) on the predictive performance of logistic regression (LR), classification tree (CT) and neural network (NN) models in the presence of data missing completely at random (MCAR). Models were constructed using datasets of size 500 simulated from a joint distribution of binary and continuous predictors including nonlinearities, collinearity and interactions between variables. Though omission produced models that fit better on the data from which the models were developed, imputation was superior on average to omission for all models when evaluating the receiver operating characteristic (ROC) curve area, mean squared error (MSE), pooled variance across outcome categories and calibration χ2 on an independently generated test set. However, in about one-third of simulations, omission performed better. Performance was also more variable with omission, including quite a few instances of extremely poor performance. Replacement and imputation generally produced similar results, except with neural networks, for which replacement, the strategy typically used in neural network algorithms, was inferior to imputation. Missing data affected simpler models much less than they did more complex models such as generalized additive models that focus on local structure. For moderate-sized datasets, logistic regressions that use simple nonlinear structures such as quadratic terms and piecewise linear splines appear to be at least as robust to randomly missing values as neural networks and classification trees.
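A hedged sketch of the three strategies compared above (omission, mean replacement, regression-based imputation), evaluated by test-set ROC AUC for a logistic regression under MCAR. The simulated data-generating process and the 20% missingness rate are assumptions for illustration, not the paper's design.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def simulate(n, d=5):
    # Toy joint distribution with an interaction term (assumed, for illustration).
    X = rng.normal(size=(n, d))
    logits = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 0.5 * X[:, 0] * X[:, 1]
    y = rng.random(n) < 1 / (1 + np.exp(-logits))
    return X, y.astype(int)

X_tr, y_tr = simulate(500)
X_te, y_te = simulate(5000)

X_miss = X_tr.copy()
X_miss[rng.random(X_tr.shape) < 0.2] = np.nan   # 20% of entries MCAR

strategies = {
    "omission": lambda X, y: (X[~np.isnan(X).any(axis=1)], y[~np.isnan(X).any(axis=1)]),
    "mean replacement": lambda X, y: (SimpleImputer(strategy="mean").fit_transform(X), y),
    "regression imputation": lambda X, y: (IterativeImputer(random_state=0).fit_transform(X), y),
}
for name, prep in strategies.items():
    Xp, yp = prep(X_miss, y_tr)
    model = LogisticRegression(max_iter=1000).fit(Xp, yp)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name:>22s}: test AUC = {auc:.3f}")
```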

3.
Many algorithms originating from decision trees have been developed for classification problems. Although they are regarded as good algorithms, most of them suffer from a loss of prediction accuracy, namely high misclassification rates, when there are many irrelevant variables. We propose multi-step classification trees with adaptive variable selection (the multi-step GUIDE classification tree (MG) and the multi-step CRUISE classification tree (MC)) to handle this problem. The multi-step method comprises a variable selection step and a fitting step.

We compare the performance of classification trees in the presence of irrelevant variables. MG and MC perform better than Random Forest and C4.5 on an extremely noisy dataset. Furthermore, the prediction accuracy of our proposed algorithms remains relatively stable even as the number of irrelevant variables increases, while that of the other algorithms worsens.
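Not the MG or MC algorithms themselves, but a minimal two-step analogue of the idea: a variable-selection step (univariate F-test screening) followed by a fitting step (an ordinary classification tree on the retained variables), compared against a plain tree when most variables are pure noise. The data-generating process is an illustrative assumption.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n, d_signal, d_noise = 500, 3, 100          # many irrelevant variables
y = rng.integers(0, 2, size=n)
X = np.hstack([rng.normal(size=(n, d_signal)) + y[:, None],   # informative variables
               rng.normal(size=(n, d_noise))])                # pure noise variables

plain_tree = DecisionTreeClassifier(random_state=0)
two_step = make_pipeline(SelectKBest(f_classif, k=10),        # variable selection step
                         DecisionTreeClassifier(random_state=0))  # fitting step

print("plain tree :", cross_val_score(plain_tree, X, y, cv=5).mean())
print("two-step   :", cross_val_score(two_step, X, y, cv=5).mean())
```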

4.
For time series data with obvious periodicity (e.g., electric motor systems and cardiac monitors) or vague periodicity (e.g., earthquake and explosion signals, speech, and stock data), frequency-based techniques using spectral analysis can usually capture the features of the series. This approach not only reduces the data dimension by moving to the frequency domain but also allows these frequencies to be used by general classification methods such as linear discriminant analysis (LDA) and k-nearest-neighbor (KNN) classifiers to classify the time series, combining two classical approaches. However, using LDA and KNN in the frequency domain is difficult because of the excessive dimension of the data. We overcome this obstacle by using singular value decomposition to select the essential frequencies. Two data sets are used to illustrate our approach. The classification error rates of our simple approach are comparable to those of several more complicated methods.
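A rough sketch of this kind of pipeline on assumed synthetic data: periodogram features from the FFT, an SVD step that keeps only a few dominant frequency directions, then off-the-shelf LDA and KNN classifiers. The class structure (two dominant frequencies) and the number of retained components are assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n, length = 200, 256
y = rng.integers(0, 2, size=n)
t = np.arange(length)
# Two classes of noisy periodic series that differ in their dominant frequency.
freqs = np.where(y == 0, 5, 8)[:, None]
series = np.sin(2 * np.pi * freqs * t / length) + rng.normal(scale=1.0, size=(n, length))

spectra = np.abs(np.fft.rfft(series, axis=1)) ** 2        # periodogram features
# SVD-based reduction: project the spectra onto the leading right singular vectors.
U, s, Vt = np.linalg.svd(spectra - spectra.mean(axis=0), full_matrices=False)
Z = spectra @ Vt[:5].T                                     # keep 5 "essential" directions

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    print(name, cross_val_score(clf, Z, y, cv=5).mean())
```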

5.
We present an algorithm for learning oblique decision trees, called HHCART(G). Our decision tree combines learning concepts from two classification trees, HHCART and Geometric Decision Tree (GDT). HHCART(G) is a simplified HHCART algorithm that uses linear structure in the training examples, captured by a modified GDT angle bisector, to define splitting directions. At each node, we reflect the training examples with respect to the modified angle bisector to align this linear structure with the coordinate axes. Searching axis-parallel splits in this reflected feature space provides an efficient and effective way of finding oblique splits in the original feature space. Our method is much simpler than HHCART because it considers only one reflected feature space for node splitting; HHCART considers multiple reflected feature spaces, making it more computationally intensive to build. Experimental results show that HHCART(G) is an effective classifier, producing compact trees with results similar to or better than several other decision trees, including GDT and HHCART trees.
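A simplified illustration of the reflection idea, under assumptions: the direction to align is taken here as the difference of the class means (not the modified GDT angle bisector that HHCART(G) uses), it is mapped onto the first coordinate axis with a Householder reflection, and an ordinary axis-parallel stump is fitted in the reflected space, which corresponds to an oblique split in the original space.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, [1.5, 1.5], [0.0, 0.0])

d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)    # direction to align (assumed choice)
d = d / np.linalg.norm(d)
e1 = np.zeros_like(d)
e1[0] = 1.0
u = d - e1
# Householder reflection H = I - 2uu'/(u'u); it maps d onto the first axis e1.
H = np.eye(len(d)) if np.allclose(u, 0) else np.eye(len(d)) - 2 * np.outer(u, u) / (u @ u)

X_reflected = X @ H.T
stump = DecisionTreeClassifier(max_depth=1).fit(X_reflected, y)   # axis-parallel split
print("training accuracy of one oblique split:", stump.score(X_reflected, y))
```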

6.
Kleiner and Hartigan (1981) introduced trees and castles (graphical techniques for representing multidimensional data) in which the variables are assigned to components of the display on the basis of hierarchical clustering. An experiment was performed to assess the efficacy of trees and castles in discrimination. The graphs were compared to two types of histograms: one with variables assigned randomly, the other with variables assigned according to hierarchical clustering. Trees tended to give the best results.

7.
Boosting is a new, powerful method for classification. It is an iterative procedure which successively classifies a weighted version of the sample and then reweights this sample depending on how successful the classification was. In this paper we review some of the commonly used methods for performing boosting and show how they can be fitted into a Bayesian setup at each iteration of the algorithm. We demonstrate how this formulation gives rise to a new splitting criterion when using a domain-partitioning classification method such as a decision tree. Further, we can improve the predictive performance of simple decision trees, known as stumps, by using a posterior weighted average of them to classify at each step of the algorithm, rather than just a single stump. The main advantage of this approach is to reduce the number of boosting iterations required to produce a good classifier, with only a minimal increase in the computational complexity of the algorithm.
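To make the generic reweighting mechanism concrete, here is a bare-bones boosting loop with decision stumps. This is plain AdaBoost-style boosting on an assumed toy dataset; it is not the Bayesian formulation or the posterior-weighted stump averaging proposed in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
n = 500
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)            # labels in {-1, +1}

w = np.full(n, 1.0 / n)                               # sample weights
stumps, alphas = [], []
for _ in range(50):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)             # the stump's vote weight
    w *= np.exp(-alpha * y * pred)                    # upweight misclassified points
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

ensemble = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training error of boosted stumps:", np.mean(ensemble != y))
```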

8.
In simulation studies for discriminant analysis, misclassification errors are often computed using the Monte Carlo method, by testing a classifier on large samples generated from known populations. Although large samples are expected to follow the underlying distributions closely, they may not do so within a small interval or region, and thus may lead to unexpected results. We demonstrate with an example that the LDA misclassification error computed via the Monte Carlo method may often be smaller than the Bayes error. We give a rigorous explanation and recommend a method to properly compute misclassification errors.
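A small numerical illustration of the comparison being discussed, not the paper's own example or explanation: the Monte Carlo estimate of the error of a plug-in LDA rule for two univariate normal populations is compared with the exact Bayes error, and can fall below it simply through sampling variability. The separation and sample size are assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
delta = 1.0                                  # mean separation, unit variances, equal priors
bayes_error = norm.cdf(-delta / 2)           # exact error of the optimal rule

n = 100_000                                  # "large" Monte Carlo test sample
x0 = rng.normal(0.0, 1.0, n)                 # population 1
x1 = rng.normal(delta, 1.0, n)               # population 2
threshold = delta / 2                        # plug-in LDA rule with known parameters
mc_error = 0.5 * (np.mean(x0 > threshold) + np.mean(x1 <= threshold))

print(f"Bayes error       : {bayes_error:.4f}")
print(f"Monte Carlo error : {mc_error:.4f}   (can fall below the Bayes error by chance)")
```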

9.
Classification is an important task in data mining, and Bayesian classification and decision trees are two effective and widely used methods for building classifiers. This paper applies both methods to the classification and prediction of customer satisfaction, compares and analyzes the two approaches, and concludes that decision-tree classification for predicting customer satisfaction is simple and efficient.

10.
The reconstruction of phylogenetic trees is one of the most important and interesting problems in evolutionary studies. Many methods have been proposed in the literature for constructing phylogenetic trees. Each approach is based on different criteria and evolutionary models, and the topologies of trees constructed by different methods may be quite different. These topological errors may be due to unsuitable criteria or evolutionary models. Since there are many tree construction approaches, we are interested in selecting a better tree to fit the true model. In this study, we propose an adjusted k-means approach and a misclassification error score criterion to solve this problem. The simulation study shows that this method can select better trees among the potential candidates, providing a useful way to choose among phylogenetic trees.

11.
The effects of applying the normal classificatory rule to a nonnormal population are studied here. These are assessed through the distribution of the misclassification errors in the case of the Edgeworth-type distribution. Both theoretical and empirical results are presented. An examination of the latter shows that the effects of this type of nonnormality are marginal. The probability of misclassifying an observation from Π1, using the appropriate LR rule, is always larger than the one obtained using the normal approximation (μ1 < μ2); the converse holds for the misclassification of an observation from Π2. Overall error rates are not affected by the skewness factor to any great extent.

12.
The importance of discrete spatial models cannot be overemphasized, especially when measuring living standards. The battery of measurements is generally categorical, with nearer geo-referenced observations featuring stronger dependencies. This study presents a Clipped Gaussian Geo-Classification (CGG-C) model for spatially dependent ordered data and compares its performance with that of existing methods for classifying household poverty using Ghana Living Standards Survey (GLSS 6) data. Bayesian inference was performed using MCMC sampling. Model evaluation was based on measures of classification and prediction accuracy. Spatial associations, given some household features, were quantified, and a poverty classification map for Ghana was developed. Overall, the estimation results showed that many of the statistically significant covariates were strongly related to the ordered response variable. Households at specific locations tended to uniformly experience specific levels of poverty, thus providing an empirical spatial character of poverty in Ghana. A comparative analysis of validation results showed that the CGG-C model (with a 14.2% misclassification rate) outperformed the Cumulative Probit (CP) model, which had a misclassification rate of 17.4%. This approach to poverty analysis is relevant for policy design and the implementation of cost-effective programmes to reduce category- and site-specific poverty incidence, and to monitor changes in both category and geographical trends thereof.
Keywords: ordered responses; spatial correlation; Bayesian estimation via MCMC; Gaussian random fields; poverty classification

13.
A note on using the F-measure for evaluating record linkage algorithms
Record linkage is the process of identifying and linking records about the same entities from one or more databases. Record linkage can be viewed as a classification problem where the aim is to decide whether a pair of records is a match (i.e. the two records refer to the same real-world entity) or a non-match (the two records refer to two different entities). Various classification techniques (supervised, unsupervised, semi-supervised and active-learning based) have been employed for record linkage. If ground truth data in the form of known true matches and non-matches are available, the quality of classified links can be evaluated. Due to the generally high class imbalance in record linkage problems, neither standard accuracy nor the misclassification rate is meaningful for assessing the quality of a set of linked records. Instead, precision and recall, as commonly used in information retrieval and machine learning, are used. These are often combined into the popular F-measure, which is the harmonic mean of precision and recall. We show that the F-measure can also be expressed as a weighted sum of precision and recall, with weights that depend on the linkage method being used. This reformulation reveals that the F-measure has a major conceptual weakness: the relative importance assigned to precision and recall should be an aspect of the problem and of the researcher or user, not of the particular linkage method being used. We suggest alternative measures which do not suffer from this fundamental flaw.
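A numerical check of one algebraic form of this reformulation (an identity consistent with the claim above, not a restatement of the paper's exact notation): for counts TP, FP and FN, the F-measure F = 2PR/(P+R) equals λP + (1-λ)R with λ = (TP+FP)/(2TP+FP+FN), i.e. weights that depend on how many pairs the linkage method declares to be matches.

```python
TP, FP, FN = 800, 50, 3000          # heavily imbalanced counts, as typical in record linkage

P = TP / (TP + FP)                  # precision
R = TP / (TP + FN)                  # recall
F = 2 * P * R / (P + R)             # harmonic mean of precision and recall

lam = (TP + FP) / (2 * TP + FP + FN)
F_weighted = lam * P + (1 - lam) * R

print(f"F (harmonic mean) = {F:.6f}")
print(f"F (weighted sum)  = {F_weighted:.6f}   with weight lambda = {lam:.3f}")
```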

14.
Several methods have been proposed to estimate the misclassification probabilities when a linear discriminant function is used to classify an observation into one of several populations. We describe the application of bootstrap sampling to this problem. The proposed method has the advantage of furnishing not only the estimates of the misclassification probabilities but also an estimate of the standard error of the estimate. The method is illustrated by a small simulation experiment. It is then applied to three published, readily accessible data sets, which are typical of the large, medium and small data sets encountered in practice.
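A compact sketch of the idea, under assumed data: resample the training sample with replacement, refit the linear discriminant function each time, and use the spread of the resampled error estimates as a standard error. The apparent-error estimator used below is an illustrative choice, not necessarily the paper's.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
n = 200
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 3)) + y[:, None]

def apparent_error(X, y):
    """Resubstitution misclassification rate of the fitted LDA rule."""
    lda = LinearDiscriminantAnalysis().fit(X, y)
    return 1 - lda.score(X, y)

boot_errors = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)          # bootstrap resample with replacement
    boot_errors.append(apparent_error(X[idx], y[idx]))

boot_errors = np.array(boot_errors)
print(f"estimated misclassification probability: {boot_errors.mean():.3f}")
print(f"bootstrap standard error               : {boot_errors.std(ddof=1):.3f}")
```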

15.
Many methodologies have been developed for zero-inflated data in the field of statistics. However, there is little literature in the data mining field, even though zero-inflated data are easily found in real applications. In fact, there is no decision tree method suitable for zero-inflated responses. To analyze a continuous target variable with decision trees, one of the standard data mining techniques, F-statistics (CHAID) or variance reduction (CART) is used as the criterion to find the best split. These criteria, however, are only appropriate for a continuous target variable. If the target variable consists of rare events or zero-inflated count data, they may not give good results because of the distributional characteristics of such data. In this paper, we propose a decision tree for zero-inflated count data that uses the maximized zero-inflated Poisson (ZIP) likelihood as the split criterion. In addition, we compare the performance of the split criteria using well-known data sets. When the analyst is interested in low-value groups (e.g. areas with no defects, or customers who do not file claims), the suggested ZIP tree is more efficient.
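A hedged sketch of such a split criterion: for a candidate split, fit an intercept-only zero-inflated Poisson model in each child node and score the split by the sum of the maximized ZIP log-likelihoods. The optimisation, the toy data, and the split search below are illustrative assumptions; the paper's tree-growing details are not reproduced.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def zip_loglik(params, y):
    """Intercept-only ZIP log-likelihood; params = (logit of pi, log of lambda)."""
    pi = 1 / (1 + np.exp(-params[0]))
    lam = np.exp(params[1])
    zero = y == 0
    ll_zero = np.log(pi + (1 - pi) * np.exp(-lam))
    ll_pos = np.log(1 - pi) - lam + y * np.log(lam) - gammaln(y + 1)
    return np.sum(np.where(zero, ll_zero, ll_pos))

def max_zip_loglik(y):
    res = minimize(lambda p: -zip_loglik(p, y), x0=np.array([0.0, 0.0]),
                   method="Nelder-Mead")
    return -res.fun

def split_score(x, y, threshold):
    """Score a candidate split by the summed maximized ZIP log-likelihood of the children."""
    left, right = y[x <= threshold], y[x > threshold]
    return max_zip_loglik(left) + max_zip_loglik(right)

# Toy zero-inflated data: the split variable x drives both the zero inflation and the Poisson mean.
rng = np.random.default_rng(8)
x = rng.uniform(0, 1, 1000)
structural_zero = rng.random(1000) < np.where(x < 0.5, 0.7, 0.2)
y = np.where(structural_zero, 0, rng.poisson(np.where(x < 0.5, 1.0, 3.0)))

for t in (0.3, 0.5, 0.7):
    print(f"threshold {t:.1f}: summed ZIP log-likelihood = {split_score(x, y, t):.1f}")
```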

16.
17.
Errors of misclassification and their probabilities are studied for classification problems associated with univariate inverse Gaussian distributions. The effects of applying the linear discriminant function (LDF), based on normality, to inverse Gaussian populations are assessed by comparing probabilities (optimum and conditional) based on the LDF with those based on the likelihood ratio rule (LR) for the inverse Gaussian. Both theoretical and empirical results are presented.

18.
This paper develops a method for handling two-class classification problems with highly unbalanced class sizes and asymmetric misclassification costs. When the class sizes are highly unbalanced and the minority class represents a rare event, conventional classification methods tend to strongly favour the majority class, resulting in very low detection of the minority class. A method is proposed to determine the optimal cut-off for asymmetric misclassification costs and unbalanced class sizes. Monte Carlo simulations show that this proposal performs better than the method based on the notion of classification accuracy. Finally, the proposed method is applied to empirical data on Italian small and medium enterprises to classify them into default and non-default groups.
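A generic illustration of cut-off selection under asymmetric costs, not the paper's specific estimator: sweep the classification cut-off on predicted probabilities and pick the one minimising total misclassification cost when false negatives on a rare minority class are assumed much more expensive than false positives. The data, costs, and in-sample cut-off search are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
n = 5000
y = (rng.random(n) < 0.03).astype(int)          # rare event: 3% minority class
X = rng.normal(size=(n, 3)) + 1.5 * y[:, None]

p = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
cost_fn, cost_fp = 20.0, 1.0                    # asymmetric misclassification costs (assumed)

def total_cost(cutoff):
    pred = (p >= cutoff).astype(int)
    fn = np.sum((pred == 0) & (y == 1))         # missed minority cases
    fp = np.sum((pred == 1) & (y == 0))         # false alarms
    return cost_fn * fn + cost_fp * fp

cutoffs = np.linspace(0.01, 0.99, 99)
best = cutoffs[np.argmin([total_cost(c) for c in cutoffs])]
print(f"cost-optimal cut-off: {best:.2f}  (vs the conventional 0.5)")
print(f"cost at best cut-off: {total_cost(best):.0f},  cost at 0.5: {total_cost(0.5):.0f}")
```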

19.
We describe a simple method for nonparametric estimation of a distribution function based on current status data where observations of current status information are subject to misclassification. Nonparametric maximum likelihood techniques lead to a straightforward set of adjustments to the familiar pool-adjacent-violators estimator used when misclassification is assumed absent. The methods consider alternative misclassification models and are extended to regression models for the underlying survival time. The ideas are motivated by and applied to an example on human papilloma virus (HPV) infection status of a sample of women examined in San Francisco.
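For reference, a short sketch of the misclassification-free building block mentioned above: with current status data (a monitoring time per subject and an indicator of whether the event has already occurred), the nonparametric MLE of the distribution function is the isotonic regression of the indicators on the monitoring times, computed by the pool-adjacent-violators algorithm. The simulated data are assumptions, and the paper's misclassification adjustments are not shown.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(10)
n = 300
event_times = rng.exponential(scale=2.0, size=n)     # unobserved survival times
monitor_times = rng.uniform(0, 6, size=n)            # one monitoring time per subject
d = (event_times <= monitor_times).astype(float)     # current status indicator

pava = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=True)
F_hat = pava.fit_transform(monitor_times, d)         # NPMLE of F at the monitoring times

order = np.argsort(monitor_times)
for t, f in list(zip(monitor_times[order], F_hat[order]))[::60]:
    print(f"t = {t:4.2f}   F_hat(t) = {f:.2f}   true F(t) = {1 - np.exp(-t / 2):.2f}")
```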

20.
Model-based classification using latent Gaussian mixture models
A novel model-based classification technique is introduced based on parsimonious Gaussian mixture models (PGMMs). PGMMs, which were introduced recently as a model-based clustering technique, arise from a generalization of the mixtures of factor analyzers model and are based on a latent Gaussian mixture model. In this paper, this mixture modelling structure is used for model-based classification, and the particular area of application is food authenticity. Model-based classification is performed by jointly modelling data with known and unknown group memberships within a likelihood framework and then estimating parameters, including the unknown group memberships, within an alternating expectation-conditional maximization framework. Model selection is carried out using the Bayesian information criterion, and the quality of the maximum a posteriori classifications is summarized using the misclassification rate and the adjusted Rand index. This new model-based classification technique gives excellent classification performance when applied to real food authenticity data on the chemical properties of olive oils from nine areas of Italy.
