7 similar documents found; search took 0 ms
1.
Learning classification trees    Cited by: 11 (self-citations: 0, others: 11)
Wray Buntine. Statistics and Computing, 1992, 2(2): 63-73
Algorithms for learning classification trees have had successes in artificial intelligence and statistics over many years. This paper outlines how a tree learning algorithm can be derived using Bayesian statistics, introducing Bayesian techniques for splitting, smoothing, and tree averaging. The splitting rule is similar to Quinlan's information gain, while smoothing and averaging replace pruning. Comparative experiments with reimplementations of a minimum encoding approach, C4 (Quinlan et al., 1987) and CART (Breiman et al., 1984), show that the full Bayesian algorithm can produce more accurate predictions than versions of these other approaches, though it pays a computational price.
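As context for the comparison above, a minimal sketch of the information-gain splitting criterion attributed to Quinlan (not Buntine's Bayesian splitting rule itself): the gain is the parent node's entropy minus the size-weighted entropy of the candidate child nodes. The toy class counts are made up for illustration.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class-count vector."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent-node entropy minus the size-weighted entropy of the
    child nodes produced by a candidate split."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Example: 10 cases (6 positive, 4 negative) split into two children.
gain = information_gain([6, 4], [[5, 1], [1, 3]])
```

A split that separates the classes well drives the weighted child entropy toward zero, so the gain approaches the parent entropy.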
2.
Measuring players' performance in team sports is fundamental, since managers need to evaluate players' ability to score during crucial moments of the game. Using Classification and Regression Trees (CART) and play-by-play basketball data, we estimate the probability of scoring a shot with respect to a selection of game covariates related to game pressure. We use these scoring probabilities to develop a player-specific shooting performance index that accounts for the difficulty of scoring different types of shots. By applying this procedure to a large sample of 2016–2017 Basketball Champions League (BCL) and 2017–2018 National Basketball Association (NBA) games, we compare the factors affecting shooting performance in Europe and in the United States, and we evaluate a selection of players in terms of the proposed shooting performance index, with the final aim of providing useful guidelines for team strategy.
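The abstract does not give the index formula, so the following is a hypothetical sketch only: assume each shot carries a CART-estimated scoring probability p, credit a make with 1 − p (more credit for harder shots) and debit a miss with p, then average per player. Both the crediting rule and the toy data are assumptions, not the paper's definition.

```python
def shooting_index(shots):
    """Hypothetical difficulty-adjusted index.
    shots: list of (made: bool, p: float) pairs, where p is a
    model-estimated probability of scoring that shot."""
    if not shots:
        return 0.0
    # A make on a hard shot (low p) earns close to 1; a miss on an
    # easy shot (high p) costs close to 1.
    return sum((1.0 - p) if made else -p for made, p in shots) / len(shots)

# One made easy shot, one missed medium shot, one made hard shot.
shots = [(True, 0.9), (False, 0.4), (True, 0.3)]
idx = shooting_index(shots)
```

A player who converts exactly at the model-expected rate would score near zero under this convention.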
3.
Families of splitting criteria for classification trees    Cited by: 6 (self-citations: 0, others: 6)
Several splitting criteria for binary classification trees can be written as weighted sums of two values of divergence measures. This weighted-sum approach is then used to form two families of splitting criteria: one contains the chi-squared and entropy criteria, the other contains the mean posterior improvement criterion. Members of both families are shown to have the property of exclusive preference. Furthermore, the optimal splits based on the proposed families are studied; we find that the best splits depend on the parameters of the families. The results reveal interesting differences among the various criteria, and examples are given to demonstrate the usefulness of both families.
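As a familiar special case of the shared template the abstract alludes to, the Gini and entropy criteria both take the form of an impurity decrease Δ = i(t) − p_L·i(t_L) − p_R·i(t_R), differing only in the impurity function i(·). The paper's divergence-based parametrization is more general; this sketch shows only the common structure, with made-up counts.

```python
import math

def gini(counts):
    """Gini impurity of a class-count vector."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy impurity (bits) of a class-count vector."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def impurity_decrease(impurity, left, right):
    """Shared template: parent impurity minus the proportion-weighted
    impurities of the two children of a binary split."""
    parent = [l + r for l, r in zip(left, right)]
    n, nl = sum(parent), sum(left)
    p_l = nl / n
    return impurity(parent) - p_l * impurity(left) - (1 - p_l) * impurity(right)

d_gini = impurity_decrease(gini, [5, 1], [1, 3])
d_ent = impurity_decrease(entropy, [5, 1], [1, 3])
```

Swapping the impurity function changes which split the criterion prefers, which is exactly the kind of difference the families in the paper are built to expose.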
4.
Various aspects of the classification tree methodology of Breiman et al. (1984) are discussed. A method of displaying classification trees, called block diagrams, is developed. Block diagrams give a clear presentation of the classification and are useful both for pointing out features of the particular data set under consideration and for highlighting deficiencies in the classification method being used. Various splitting criteria are discussed; the usual Gini-Simpson criterion presents difficulties when there is a relatively large number of classes, and improved splitting criteria are obtained. One particular improvement is the introduction of adaptive anti-end-cut factors that take advantage of highly asymmetrical splits where appropriate: they use the number and mix of classes in the current node of the tree to identify whether or not it is likely to be advantageous to create a very small offspring node. A number of data sets are used as examples.
5.
Journal of Statistical Computation and Simulation, 2012, 82(2): 115-140
Although the effect of missing data on regression estimates has received considerable attention, their effect on predictive performance has been neglected. We studied the effect of three missing data strategies—omission of records with missing values, replacement with a mean, and imputation based on regression—on the predictive performance of logistic regression (LR), classification tree (CT), and neural network (NN) models in the presence of data missing completely at random (MCAR). Models were constructed using datasets of size 500 simulated from a joint distribution of binary and continuous predictors including nonlinearities, collinearity, and interactions between variables. Though omission produced models that fit better on the data from which the models were developed, imputation was superior on average to omission for all models when evaluating the receiver operating characteristic (ROC) curve area, mean squared error (MSE), pooled variance across outcome categories, and calibration χ² on an independently generated test set. However, in about one-third of simulations, omission performed better. Performance was also more variable with omission, including quite a few instances of extremely poor performance. Replacement and imputation generally produced similar results, except with neural networks, for which replacement, the strategy typically used in neural network algorithms, was inferior to imputation. Missing data affected simpler models much less than they did more complex models, such as generalized additive models that focus on local structure. For moderate-sized datasets, logistic regressions that use simple nonlinear structures such as quadratic terms and piecewise linear splines appear to be at least as robust to randomly missing values as neural networks and classification trees.
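Two of the three strategies the study compares can be sketched in a few lines, applied to one predictor column where `None` marks an MCAR entry (regression-based imputation, the third strategy, is omitted here for brevity):

```python
def omit(rows):
    """Omission strategy: drop records with a missing value
    (complete-case analysis), shrinking the sample."""
    return [r for r in rows if r is not None]

def mean_replace(rows):
    """Replacement strategy: fill missing entries with the mean
    of the observed values, keeping the sample size fixed."""
    observed = [r for r in rows if r is not None]
    m = sum(observed) / len(observed)
    return [m if r is None else r for r in rows]

data = [1.0, 2.0, None, 4.0, None]
kept = omit(data)            # three complete records remain
filled = mean_replace(data)  # missing slots take the observed mean, 7/3
```

The trade-off in the abstract is visible even at this scale: omission loses records (hence the higher variance the study reports), while replacement preserves the sample size at the cost of distorting the predictor's distribution.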
6.
Daniela M. Witten, Robert Tibshirani. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 2009, 71(3): 615-636
Summary. We propose covariance-regularized regression, a family of methods for prediction in high dimensional settings that uses a shrunken estimate of the inverse covariance matrix of the features to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing the log-likelihood of the data, under a multivariate normal model, subject to a penalty; it is then used to estimate coefficients for the regression of the response onto the features. We show that ridge regression, the lasso and the elastic net are special cases of covariance-regularized regression, and we demonstrate that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations. The covariance-regularized regression framework is extended to generalized linear models and linear discriminant analysis, and is used to analyse gene expression data sets with multiple class and survival outcomes.
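Of the special cases the summary names, ridge regression is the easiest to sketch: a ridge-type penalty on the (inverse) covariance leads to coefficients β = (XᵀX + λI)⁻¹Xᵀy. The two-feature design, the 2×2 hand-inversion, and the choice λ = 0.1 below are illustrative assumptions, not the paper's estimator.

```python
def ridge_2d(X, y, lam):
    """Ridge coefficients for a two-column design matrix, solving
    (X'X + lam*I) beta = X'y with an explicit 2x2 inverse."""
    a = sum(x[0] * x[0] for x in X) + lam   # (X'X)[0,0] + lam
    b = sum(x[0] * x[1] for x in X)         # (X'X)[0,1] = (X'X)[1,0]
    d = sum(x[1] * x[1] for x in X) + lam   # (X'X)[1,1] + lam
    u = sum(x[0] * t for x, t in zip(X, y)) # (X'y)[0]
    v = sum(x[1] * t for x, t in zip(X, y)) # (X'y)[1]
    det = a * d - b * b
    return ((d * u - b * v) / det, (a * v - b * u) / det)

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
beta = ridge_2d(X, y, lam=0.1)  # shrunk slightly toward zero
```

With λ = 0 this recovers ordinary least squares, (1, 2) for the toy data; increasing λ shrinks both coefficients, which is the sense in which regularizing the covariance estimate regularizes the regression.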
7.
Using the calibration approach, an estimator based on the Hansen and Hurwitz (1946) technique is developed for the situation where information on the auxiliary variable is assumed known for all population units. The double-sampling case has also been dealt with. Expressions for the estimator of the population total, its variance, and the variance estimator are developed. The theoretical results are illustrated with the help of simulation studies, which show that the proposed calibration-based estimator outperforms the Hansen and Hurwitz estimator.
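A minimal sketch of the two ingredients the abstract combines: the classical Hansen–Hurwitz estimator for with-replacement PPS draws, ŷ = (1/n)Σ yᵢ/pᵢ, and a simple one-constraint ratio calibration of its weights to a known auxiliary total. The paper's calibration is a general constrained-distance setup; this single-constraint ratio adjustment, and the toy draw probabilities, are illustrative assumptions.

```python
def hansen_hurwitz(y, p):
    """Hansen-Hurwitz estimator of the population total from n
    with-replacement draws with selection probabilities p."""
    n = len(y)
    return sum(yi / pi for yi, pi in zip(y, p)) / n

def calibrated_total(y, x, p, x_total):
    """Rescale the HH weights w_i = 1/(n*p_i) by a single factor g
    so the weighted auxiliary total matches the known x_total,
    then apply the calibrated weights to y."""
    n = len(y)
    w = [1.0 / (n * pi) for pi in p]
    g = x_total / sum(wi * xi for wi, xi in zip(w, x))
    return sum(g * wi * yi for wi, yi in zip(w, y))

y = [10.0, 8.0, 12.0]      # study variable on the sampled units
x = [5.0, 4.0, 6.0]        # auxiliary variable, known for all units
p = [0.05, 0.04, 0.06]     # draw probabilities (assumed, proportional to x)
t_hh = hansen_hurwitz(y, p)
t_cal = calibrated_total(y, x, p, x_total=100.0)
```

When y is exactly proportional to x, as in this toy data, calibration leaves the estimate unchanged; its gains appear when the proportionality is only approximate, which is the setting the simulations explore.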