Found 20 similar articles (search time: 15 ms)
The quadratic discriminant function (QDF) is commonly used for the two-group classification problem when the covariance matrices in the two populations are substantially unequal. This procedure is optimal when both populations are multivariate normal with known means and covariance matrices. This study examined the robustness of the QDF to non-normality. Sampling experiments were conducted to estimate expected actual error rates for the QDF when sampling from a variety of non-normal distributions. Results indicated that the QDF was robust to non-normality except when the distributions were highly skewed, in which case relatively large deviations from optimal were observed. In all cases studied, the average probabilities of misclassification were relatively stable, while the individual population error rates exhibited considerable variability.
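As a rough illustration of the rule this abstract studies, the sketch below evaluates the quadratic discriminant function for two bivariate normal populations with known parameters. The function names, the restriction to two dimensions, and the example means and covariances are assumptions made for illustration; they are not taken from the study.

```python
import math


def inv_det_2x2(sigma):
    """Return (inverse, determinant) of a 2x2 covariance matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must have a positive determinant")
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return inv, det


def mahalanobis_sq(x, mu, inv):
    """Squared Mahalanobis distance (x - mu)' inv (x - mu) in two dimensions."""
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    return (d0 * (inv[0][0] * d0 + inv[0][1] * d1)
            + d1 * (inv[1][0] * d0 + inv[1][1] * d1))


def qdf(x, mu1, sigma1, mu2, sigma2):
    """Log density ratio log f1(x) - log f2(x) for two bivariate normals."""
    inv1, det1 = inv_det_2x2(sigma1)
    inv2, det2 = inv_det_2x2(sigma2)
    return (-0.5 * math.log(det1 / det2)
            - 0.5 * mahalanobis_sq(x, mu1, inv1)
            + 0.5 * mahalanobis_sq(x, mu2, inv2))


def classify(x, mu1, sigma1, mu2, sigma2, prior1=0.5):
    """Assign x to group 1 when the QDF is at least the log prior odds for group 2."""
    threshold = math.log((1.0 - prior1) / prior1)
    return 1 if qdf(x, mu1, sigma1, mu2, sigma2) >= threshold else 2


# Hypothetical populations with clearly unequal covariance matrices.
MU1, SIGMA1 = (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]
MU2, SIGMA2 = (2.0, 2.0), [[4.0, 0.0], [0.0, 4.0]]

if __name__ == "__main__":
    for point in [(0.0, 0.0), (2.0, 2.0), (-5.0, -5.0)]:
        print(point, classify(point, MU1, SIGMA1, MU2, SIGMA2))
```

With these example parameters, a point far from both means on the side of group 1, such as (-5, -5), is still assigned to group 2, because the population with the larger covariance accounts for remote observations better. That behaviour comes from the quadratic boundary and would not arise with a linear rule.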

In this paper, we propose an asymptotic approximation for the expected probabilities of misclassification (EPMC) in the linear discriminant function on the basis of k-step monotone missing training data for general k. We derive certain relations of the statistics in order to obtain the approximation. Finally, we perform a Monte Carlo simulation to evaluate the accuracy of our result and to compare it with existing approximations.

We present results of a Monte Carlo study comparing four methods of estimating the parameters of the logistic model logit(Pr(Y = 1 | X, Z)) = α0 + α1 X + α2 Z, where X and Z are continuous covariates, X is always observed, and Z is sometimes missing. The four methods examined are (1) logistic regression using complete cases, (2) logistic regression with filled-in values of Z obtained from the regression of Z on X and Y, (3) logistic regression with filled-in values of Z and random error added, and (4) maximum likelihood estimation assuming the distribution of Z given X and Y is normal. Effects of different percentages of missing Z values and different missing value mechanisms on the bias and mean absolute deviation of the estimators are examined for data sets of N = 200 and N = 400.

Multiple imputation has emerged as a popular approach to handling data sets with missing values. For incomplete continuous variables, imputations are usually produced using multivariate normal models. However, this approach might be problematic for variables with a strongly non-normal shape, as it would generate imputations incoherent with the actual distributions and thus lead to incorrect inferences. For non-normal data, we consider a multivariate extension of Tukey's gh distribution/transformation [38] to accommodate skewness and/or kurtosis and capture the correlation among the variables. We propose an algorithm to fit the model to the incomplete data and generate imputations. We apply the method to a national data set for hospital performance on several standard quality measures, which are highly skewed to the left and substantially correlated with each other. We use Monte Carlo studies to assess the performance of the proposed approach. We discuss possible generalizations and offer some advice to practitioners on how to handle non-normal incomplete data.

We propose an ℓ1-regularized likelihood method for estimating the inverse covariance matrix in the high-dimensional multivariate normal model in the presence of missing data. Our method is based on the assumption that the data are missing at random (MAR), which also covers the missing completely at random (MCAR) case. The implementation of the method is non-trivial, as the observed negative log-likelihood is generally a complicated and non-convex function. We propose an efficient EM algorithm for optimization with provable numerical convergence properties. Furthermore, we extend the methodology to handle missing values in a sparse regression context. We demonstrate both methods on simulated and real data.

Although the effect of missing data on regression estimates has received considerable attention, their effect on predictive performance has been neglected. We studied the effect of three missing data strategies (omission of records with missing values, replacement with a mean, and imputation based on regression) on the predictive performance of logistic regression (LR), classification tree (CT) and neural network (NN) models in the presence of data missing completely at random (MCAR). Models were constructed using datasets of size 500 simulated from a joint distribution of binary and continuous predictors including nonlinearities, collinearity and interactions between variables. Though omission produced models that fit better on the data from which the models were developed, imputation was superior on average to omission for all models when evaluating the receiver operating characteristic (ROC) curve area, mean squared error (MSE), pooled variance across outcome categories and calibration χ² on an independently generated test set. However, in about one-third of simulations, omission performed better. Performance was also more variable with omission, including quite a few instances of extremely poor performance. Replacement and imputation generally produced similar results except with neural networks, for which replacement, the strategy typically used in neural network algorithms, was inferior to imputation. Missing data affected simpler models much less than they did more complex models, such as generalized additive models, that focus on local structure. For moderate-sized datasets, logistic regressions that use simple nonlinear structures such as quadratic terms and piecewise linear splines appear to be at least as robust to randomly missing values as neural networks and classification trees.

Restricted maximum likelihood (REML) methods are traditionally used for analyzing mixed models. Based on a multivariate normal likelihood, these analyses are sensitive to outliers. Recently developed robust rank-based procedures offer a complete analysis of mixed models: estimation of fixed effects, standard errors, and estimation of variance components. The results of a large Monte Carlo study are presented, comparing these two analyses for many situations over multivariate normal and contaminated normal distributions. The rank-based analyses are much more powerful and efficient than the REML analyses over all non-normal situations, while losing little power for normal errors.


Missing data are commonly encountered in self-reported measurements and questionnaires. It is crucial to treat missing values using an appropriate method to avoid bias and loss of power. Various types of imputation methods exist, but it is not always clear which method is preferred for imputation of data with non-normal variables. In this paper, we compared four imputation methods: mean imputation, quantile imputation, multiple imputation, and quantile regression multiple imputation (QRMI), using both simulated data and real data from a study of factors affecting self-efficacy in breast cancer survivors. The results showed an advantage for multiple imputation, especially QRMI, when the data are not normal.

Marginal posterior distributions, when not available analytically, can at present be numerically inaccessible if the number of parameters for integration exceeds 7 to 10. For the normal multivariate regression model with data missing in a monotone pattern, some integrations have been accomplished analytically (for example, Guttman and Menzefricke, 1983; Bartlett, 1983).

In this note we show how monotonely missing data support an extended prior-likelihood factorization, and how the needed posterior can be obtained directly using standard results.

We derive the optimal regression function (i.e., the best approximation in the L2 sense) when the vector of covariates has a random dimension. Furthermore, we consider applications of these results to problems in statistical regression and classification with missing covariates. It will be seen, perhaps surprisingly, that the correct regression function for the case with missing covariates can sometimes perform better than the usual regression function corresponding to the case with no missing covariates. This is because even if some of the covariates are missing, an indicator random variable δ, which is always observable and is equal to 1 if there are no missing values (and 0 otherwise), may carry far more information and predictive power about the response variable Y than the missing covariates do. We also propose kernel-based procedures for estimating the correct regression function nonparametrically. As an alternative estimation procedure, we also consider the least-squares method.

Multivariate mixture regression models can be used to investigate the relationships between two or more response variables and a set of predictor variables by taking into consideration unobserved population heterogeneity. It is common to take multivariate normal distributions as mixing components, but this mixing model is sensitive to heavy-tailed errors and outliers. Although normal mixture models can approximate any distribution in principle, the number of components needed to account for heavy-tailed distributions can be very large. Mixture regression models based on multivariate t distributions can be considered a robust alternative. Missing data are inevitable in many situations, and parameter estimates could be biased if the missing values are not handled properly. In this paper, we propose a multivariate t mixture regression model with missing information to model heterogeneity in the regression function in the presence of outliers and missing values. Along with the robust parameter estimation, our proposed method can be used for (i) visualization of the partial correlation between response variables across latent classes and heterogeneous regressions, and (ii) outlier detection and robust clustering even in the presence of missing values. We also propose a multivariate t mixture regression model using MM-estimation with missing information that is robust to high-leverage outliers. The proposed methodologies are illustrated through simulation studies and real data analysis.


Fisher's linear discriminant analysis (FLDA) is known as a method to find a discriminative feature space for multi-class classification. As a theory extending FLDA to its ultimate nonlinear form, optimal nonlinear discriminant analysis (ONDA) has been proposed. ONDA indicates that the best theoretical nonlinear map for maximizing Fisher's discriminant criterion is formulated using the Bayesian a posteriori probabilities. In addition, the theory proves that FLDA is equivalent to ONDA when the Bayesian a posteriori probabilities are approximated by linear regression (LR). Because of some limitations of the linear model, there is room to modify FLDA by using stronger approximation/estimation methods. For the purpose of probability estimation, multinomial logistic regression (MLR) is more suitable than LR. Along this line, in this paper we develop a nonlinear discriminant analysis (NDA) in which the posterior probabilities in ONDA are estimated by MLR. We also develop a way to introduce sparseness into discriminant analysis. By applying L1 or L2 regularization to LR or MLR, we can incorporate sparseness in FLDA and our NDA to increase generalization performance. The performance of these methods is evaluated by benchmark experiments using last_exam17 standard datasets and a face classification experiment.

This article derives the likelihood ratio statistic to test the independence between (X_1, …, X_r) and (X_{r+1}, …, X_k) under the assumption that (X_1, …, X_k) has a multivariate normal distribution and that a sample of size n is available, where for N observation vectors all components are available, while for M = n − N observation vectors the data on the last q components, (X_{k−q+1}, …, X_k), are missing (k − q ≥ r).

A random vector is assumed to have one of three known multivariate normal distributions with equal covariance matrices. It is desired to separate the three distributions by means of a single linear discriminant function. Such a function can lead to a classification rule. The function whose classification rule minimizes the average of the three probabilities of misclassification is found. Also the function is found whose rule minimizes the maximum of the three probabilities of misclassification.

The study of multivariate outliers raises many problems of definition, principle and manipulation. Well-authenticated tests of discordancy exist only for the multivariate normal distribution. Detection of outliers in non-normal distributions involves the adoption of appropriate criteria to represent 'extremeness' of observations in a sample; corresponding tests of discordancy usually require tedious, or even intractable, distributional and computational manipulations. A class of transformations of the data is considered with a view to transferring some of the familiar and desirable features of discordancy tests for normal samples to non-normal situations.

We present an algorithm for multivariate robust Bayesian linear regression with missing data. The iterative algorithm computes an approximate posterior for the model parameters based on the variational Bayes (VB) method. Compared to the EM algorithm, the VB method has the advantage that the variance of the model parameters is also computed directly by the algorithm. We consider three families of Gaussian scale mixture models for the measurements, which include as special cases the multivariate t distribution, the multivariate Laplace distribution, and the contaminated normal model. The observations can contain missing values, assuming that the missing data mechanism can be ignored. A Matlab/Octave implementation of the algorithm is presented and applied to solve three reference examples from the literature.

Tukey proposed a class of distributions, the g-and-h family (gh family), based on a transformation of a standard normal variable to accommodate different skewness and elongation in the distribution of variables arising in practical applications. It is easy to draw values from this distribution even though it is hard to explicitly state the probability density function. Given this flexibility, the gh family may be extremely useful in creating multiple imputations for missing data. This article demonstrates how this family, as well as its generalizations, can be used in the multiple imputation analysis of incomplete data. The focus of this article is on a scalar variable with missing values. In the absence of any additional information, data are missing completely at random, and hence the correct analysis is the complete-case analysis. Thus, the application of the gh multiple imputation to the scalar cases affords comparison with the correct analysis and with other model-based multiple imputation methods. Comparisons are made using simulated datasets and the data from a survey of adolescents ascertaining driving after drinking alcohol.
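The remark that drawing from the gh family is easy can be made concrete with a short sketch. It uses the standard form of Tukey's transformation, Y = A + B · ((exp(gZ) − 1)/g) · exp(hZ²/2) with Z standard normal, where g governs skewness and h governs tail elongation. The function names, default arguments, and parameter values are illustrative assumptions, and the sketch covers only the sampling step, not the imputation procedure of the article.

```python
import math
import random


def gh_transform(z, g=0.0, h=0.0, loc=0.0, scale=1.0):
    """Apply Tukey's g-and-h transformation to a standard normal value z."""
    if g == 0.0:
        core = z  # limit of (exp(g*z) - 1) / g as g -> 0
    else:
        core = math.expm1(g * z) / g
    return loc + scale * core * math.exp(h * z * z / 2.0)


def draw_gh(n, g, h, loc=0.0, scale=1.0, seed=0):
    """Draw n values from a g-and-h distribution by transforming normal draws."""
    rng = random.Random(seed)
    return [gh_transform(rng.gauss(0.0, 1.0), g, h, loc, scale) for _ in range(n)]


if __name__ == "__main__":
    sample = draw_gh(5, g=0.5, h=0.1)
    print(sample)
```

Setting g = 0 and h = 0 returns the normal variable unchanged, so the normal distribution is a special case of the family. The density has no simple closed form because the transformation generally cannot be inverted analytically, which is why sampling is easier than likelihood evaluation.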

This paper proposes two asymptotic expansions relating to discrimination based on two-step monotone missing samples. The corresponding expansions were obtained by Okamoto (1963) and McLachlan (1973) for complete data under multivariate normality. This paper extends each of those results, up to terms of the first order, to the case of two-step monotone missing samples. These asymptotic expansions play an important role in obtaining asymptotic approximations for the probabilities of misclassification in discriminant analysis. Simulation studies were also conducted to evaluate the accuracy of the approximations derived in this paper.

This article describes a method for simulating n-dimensional multivariate non-normal data, with emphasis on count-valued data. Dependence is characterized by either Pearson correlations or Spearman correlations. The simulation is accomplished by simulating a vector of correlated standard normal variates. The elements of this vector are then transformed to achieve the target marginal distributions. We prove that the method corresponds to simulating data from a multivariate Gaussian copula. The simulation method does not restrict pairwise dependence beyond the limits imposed by the marginal distributions and can achieve any Pearson or Spearman correlation within those limits. Two examples are included. In the first example, marginal means, variances, Pearson correlations, and Spearman correlations are estimated from the epileptic seizure data set of Diggle et al. [P. Diggle, P. Heagerty, K.Y. Liang, and S. Zeger, Analysis of Longitudinal Data, Oxford University Press, Oxford, 2002]. Data with these means and variances are simulated to first achieve the estimated Pearson correlations and then achieve the estimated Spearman correlations. The second example is of a hypothetical time series of Poisson counts with seasonal mean ranging between 1 and 9 and an autoregressive(1) dependence structure.
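The two steps named in this abstract, simulating correlated standard normal variates and then transforming them to the target marginals, can be sketched for a pair of Poisson counts as follows. This is a minimal illustration under stated assumptions: the function names, the Poisson marginals, and the parameter values are hypothetical, and `rho` here is the correlation of the latent normals. The article's calibration of that latent correlation to reach a target Pearson or Spearman correlation of the counts is not reproduced.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, lam, max_k=10_000):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(lam)."""
    u = min(u, 1.0 - 1e-12)  # guard against u == 1.0 from rounding
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u and k < max_k:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def simulate_poisson_pair(n, lam1, lam2, rho, seed=0):
    """Simulate n pairs of Poisson counts linked by a bivariate Gaussian copula."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    pairs = []
    for _ in range(n):
        # Step 1: correlated standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Step 2: map to uniforms, then to the Poisson marginals.
        pairs.append((poisson_quantile(phi(z1), lam1),
                      poisson_quantile(phi(z2), lam2)))
    return pairs


if __name__ == "__main__":
    data = simulate_poisson_pair(2000, lam1=3.0, lam2=5.0, rho=0.7)
    mean1 = sum(x for x, _ in data) / len(data)
    mean2 = sum(y for _, y in data) / len(data)
    print(round(mean1, 2), round(mean2, 2))
```

Because the marginals are discrete, the Pearson correlation of the simulated counts is generally somewhat smaller in magnitude than `rho`, which is why a calibration step is needed when a specific target correlation is required.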
