1.

Stability of feature selection in classification issues for high-dimensional correlated data

Émeline Perthame Chloé Friguet David Causeur 《Statistics and Computing》2016,26(4):783-796

Whether to handle dependence in feature selection is still an open question in supervised classification problems where the number of covariates exceeds the number of observations. Some recent papers surprisingly show the superiority of naive Bayes approaches based on an obviously erroneous assumption of independence, whereas others recommend inferring the dependence structure in order to decorrelate the selection statistics. In the classical linear discriminant analysis (LDA) framework, the present paper first highlights the impact of dependence in terms of instability of feature selection. A second objective is to revisit the above issue using a flexible factor model for the covariance. This framework introduces latent components of dependence, conditionally on which a new Bayes consistency is defined. A procedure is then proposed for the joint estimation of the expectation and variance parameters of the model. The present method is compared to recent regularized diagonal discriminant analysis approaches, which assume independence among features, and to regularized LDA procedures, both in terms of classification performance and stability of feature selection. The proposed method is implemented in the R package FADA, freely available from the R repository CRAN.

2.

Identification of Influential Cases in Kernel Fisher Discriminant Analysis

Nelmarie Louw Morne M. C. Lamont 《Communications in Statistics - Simulation and Computation》2013,42(10):2050-2062

We study the influence of a single data case on the results of a statistical analysis. This problem has been addressed in several articles for linear discriminant analysis (LDA). Kernel Fisher discriminant analysis (KFDA) is a kernel based extension of LDA. In this article, we study the effect of atypical data points on KFDA and develop criteria for identification of cases having a detrimental effect on the classification performance of the KFDA classifier. We find that the criteria are successful in identifying cases whose omission from the training data prior to obtaining the KFDA classifier results in reduced error rates.

3.

Asymptotic Optimality of Sparse Linear Discriminant Analysis with Arbitrary Number of Classes

Ruiyan Luo Xin Qi 《Scandinavian Journal of Statistics》2017,44(3):598-616

Many sparse linear discriminant analysis (LDA) methods have been proposed to overcome the major problems of the classic LDA in high‐dimensional settings. However, the asymptotic optimality results are limited to the case with only two classes. When there are more than two classes, the classification boundary is complicated and no explicit formulas for the classification errors exist. We consider the asymptotic optimality in the high‐dimensional settings for a large family of linear classification rules with arbitrary number of classes. Our main theorem provides easy‐to‐check criteria for the asymptotic optimality of a general classification rule in this family as dimensionality and sample size both go to infinity and the number of classes is arbitrary. We establish the corresponding convergence rates. The general theory is applied to the classic LDA and the extensions of two recently proposed sparse LDA methods to obtain the asymptotic optimality.

4.

Asymptotic properties of the EPMC for modified linear discriminant analysis when sample size and dimension are both large

Masashi Hyodo Takayuki Yamada 《Journal of statistical planning and inference》2010

We deal with the problem of classifying a new observation vector into one of two known multivariate normal distributions when the dimension $p$ and training sample size $N$ are both large with $p < N$. Modified linear discriminant analysis (MLDA) was suggested by Xu et al. [10]. The error rate of MLDA is smaller than that of LDA. However, if $p$ and $N$ are moderately large, the error rate of MLDA is close to that of LDA. These results are conditional ones, so we should investigate whether they hold unconditionally. In this paper, we give two types of asymptotic approximations of the expected probability of misclassification (EPMC) for MLDA as $n \to \infty$ with $p = O(n^{\delta})$, $0 < \delta < 1$. One of the two is the same as the asymptotic approximation for LDA, and the other is a corrected version of that approximation. Simulation reveals that the corrected approximation has good accuracy when $p$ and $N$ are moderately large.
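
For orientation, the classical (unmodified) two-group LDA rule and the EPMC it induces can be written as below. This is a textbook sketch, not the MLDA rule of Xu et al., which the abstract does not spell out. The symbols $\bar{x}_1, \bar{x}_2$ (training sample means), $S$ (pooled sample covariance), $\Pi_1, \Pi_2$ (the two populations) and the equal-prior weighting are assumptions introduced here for illustration.

$$
W(x) = (\bar{x}_1 - \bar{x}_2)^{\top} S^{-1} \left( x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) \right),
\qquad \text{assign } x \text{ to } \Pi_1 \text{ if } W(x) > 0, \text{ otherwise to } \Pi_2 .
$$

$$
\mathrm{EPMC} = \tfrac{1}{2} \left[ P\big(W(x) \le 0 \mid x \in \Pi_1\big) + P\big(W(x) > 0 \mid x \in \Pi_2\big) \right].
$$

Here the probabilities are taken over both the new observation and the training sample, which is what makes the quantity unconditional. The inverse $S^{-1}$ exists only when $p$ does not exceed the pooled degrees of freedom, which is consistent with the abstract's requirement $p < N$.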

5.

Penalized classification using Fisher's linear discriminant

Witten DM Tibshirani R 《Journal of the Royal Statistical Society. Series B, Statistical methodology》2011,73(5):753-772

We consider the supervised classification setting, in which the data consist of p features measured on n observations, each of which belongs to one of K classes. Linear discriminant analysis (LDA) is a classical method for this problem. However, in the high-dimensional setting where p ≫ n, LDA is not appropriate for two reasons. First, the standard estimate for the within-class covariance matrix is singular, and so the usual discriminant rule cannot be applied. Second, when p is large, it is difficult to interpret the classification rule obtained from LDA, since it involves all p features. We propose penalized LDA, a general approach for penalizing the discriminant vectors in Fisher's discriminant problem in a way that leads to greater interpretability. The discriminant problem is not convex, so we use a minorization-maximization approach in order to efficiently optimize it when convex penalties are applied to the discriminant vectors. In particular, we consider the use of L1 and fused lasso penalties. Our proposal is equivalent to recasting Fisher's discriminant problem as a biconvex problem. We evaluate the performances of the resulting methods on a simulation study, and on three gene expression data sets. We also survey past methods for extending LDA to the high-dimensional setting, and explore their relationships with our proposal.

6.

Robust rank screening for ultrahigh dimensional discriminant analysis

Guosheng Cheng Xingxiang Li Peng Lai Fengli Song Jun Yu 《Statistics and Computing》2017,27(2):535-545

In this paper, we consider sure independence feature screening for ultrahigh dimensional discriminant analysis. We propose a new method named robust rank screening based on the conditional expectation of the rank of predictor’s samples. We also establish the sure screening property for the proposed procedure under simple assumptions. The new procedure has some additional desirable characteristics. First, it is robust against heavy-tailed distributions, potential outliers and the sample shortage for some categories. Second, it is model-free without any specification of a regression model and directly applicable to the situation with many categories. Third, it is simple in theoretical derivation due to the boundedness of the resulting statistics. Fourth, it is relatively inexpensive in computational cost because of the simple structure of the screening index. Monte Carlo simulations and real data examples are used to demonstrate the finite sample performance.

7.

Comparison of the restricted mean survival time with the hazard ratio in superiority trials with a time‐to‐event end point

Bo Huang Pei‐Fen Kuan 《Pharmaceutical statistics》2018,17(3):202-213

With the emergence of novel therapies exhibiting distinct mechanisms of action compared to traditional treatments, departure from the proportional hazard (PH) assumption in clinical trials with a time‐to‐event end point is increasingly common. In these situations, the hazard ratio may not be a valid statistical measurement of treatment effect, and the log‐rank test may no longer be the most powerful statistical test. The restricted mean survival time (RMST) is an alternative robust and clinically interpretable summary measure that does not rely on the PH assumption. We conduct extensive simulations to evaluate the performance and operating characteristics of RMST‐based inference against hazard ratio–based inference, under various scenarios and design parameter setups. The log‐rank test is generally a powerful test when there is evident separation favoring one treatment arm at most of the time points across the Kaplan‐Meier survival curves, but the performance of the RMST test is similar. Under non‐PH scenarios where late separation of survival curves is observed, the RMST‐based test has better performance than the log‐rank test when the truncation time is reasonably close to the tail of the observed curves. Furthermore, when a flat survival tail (or low event rate) in the experimental arm is expected, selecting the minimum of the maximum observed event time as the truncation timepoint for the RMST is not recommended. In addition, we recommend the inclusion of analysis based on the RMST curve over the truncation time in clinical settings where there is suspicion of substantial departure from the PH assumption.
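
As a rough illustration of the summary measure discussed in this abstract, the RMST up to a truncation time tau is the area under the Kaplan‐Meier survival curve from 0 to tau. The sketch below is a minimal pure-Python version written for this listing, not the authors' simulation code; the function names `kaplan_meier` and `rmst` and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  -- observed follow-up times
    events -- 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def rmst(times, events, tau):
    """Area under the Kaplan-Meier step function on [0, tau]."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in kaplan_meier(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)  # rectangle up to the next drop
        prev_t, prev_s = t, s
    area += prev_s * (tau - prev_t)    # final rectangle up to tau
    return area


if __name__ == "__main__":
    # Toy data (made up): four subjects, all with observed events.
    print(rmst([1, 2, 3, 4], [1, 1, 1, 1], tau=4))
```

Comparing two arms would then amount to computing `rmst` per arm at a common tau and taking the difference. The abstract's caution about the choice of truncation time applies directly to the `tau` argument.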

8.

On some classifiers based on multivariate ranks

Olusola Makinde Biman Chakraborty 《Communications in Statistics - Theory and Methods》2018,47(16):3955-3969

Nonparametric approaches to classification have gained significant attention in the last two decades. In this paper, we propose a classification methodology based on the multivariate rank functions and show that it is a Bayes rule for spherically symmetric distributions with a location shift. We show that a rank-based classifier is equivalent to the optimal Bayes rule under suitable conditions. We also present an affine invariant version of the classifier. To accommodate different covariance structures, we construct a classifier based on the central rank region. Asymptotic properties of these classification methods are studied. We illustrate the performance of our proposed methods in comparison to some other depth-based classifiers using simulated and real data sets.

9.

A test for the composite hypothesis that a population has a gamma distribution

Charles Locke 《Communications in Statistics - Theory and Methods》2013,42(4):351-384

A test of the composite hypothesis that a population has a gamma distribution is presented. The test is conducted by using a rank test of bivariate independence, such as the one based on Kendall's sample tau coefficient. The performance of the test is examined by means of a Monte Carlo study.

10.

Linear Signed Rank Test for Model Selection

Abdolreza Sayyareh 《Communications in Statistics - Theory and Methods》2014,43(21):4492-4502

In this article, we consider a linear signed rank test for non-nested distributions in the context of model selection. Introducing a new test, we show that it is asymptotically more efficient than the Vuong test and the test based on the B statistic introduced by Clarke. Here, however, we let the magnitude of the data improve the performance of the test statistic. We show that this test is unbiased. The results of simulations show that the rank test has greater statistical power than the Vuong test when the underlying distributions are symmetric.

11.

Sequential correction of linear classifiers

T. Górecki 《Journal of applied statistics》2013,40(4):763-776

In this article, a sequential correction of two linear methods: linear discriminant analysis (LDA) and perceptron is proposed. This correction relies on sequential joining of additional features on which the classifier is trained. These new features are posterior probabilities determined by a basic classification method such as LDA and perceptron. In each step, we add the probabilities obtained on a slightly different data set, because the vector of added probabilities varies at each step. We therefore have many classifiers of the same type trained on slightly different data sets. Four different sequential correction methods are presented based on different combining schemas (e.g. mean rule and product rule). Experimental results on different data sets demonstrate that the improvements are efficient, and that this approach outperforms classical linear methods, providing a significant reduction in the mean classification error rate.

12.

Rank correlation methods for missing data

Mayer Alvo Paul Cabilio 《Revue canadienne de statistique》1995,23(4):345-358

The subject of rank correlation has had a rich history. It has been used in numerous applications in tests for trend and for independence. However, little has been said about how to define rank correlation when the data are incomplete. The practice has often been to ignore missing observations and to define rank correlation for the smaller complete record. We propose a new class of measures of rank correlation which are based on a notion of distance between incomplete rankings. There is the potential for a significant increase in efficiency over the approach which ignores missing observations as demonstrated by a specific case.

13.

A class of multivariate distribution-free tests of independence based on graphs

R. Heller M. Gorfine Y. Heller 《Journal of statistical planning and inference》2012

A class of distribution-free tests is proposed for the independence of two subsets of response coordinates. The tests are based on the pairwise distances across subjects within each subset of the response. A complete graph is induced by each subset of response coordinates, with the sample points as nodes and the pairwise distances as the edge weights. The proposed test statistic depends only on the rank order of edges in these complete graphs. The response vector may be of any dimension. In particular, the number of samples may be smaller than the dimension of the response. The test statistic is shown to have a normal limiting distribution with known expectation and variance under the null hypothesis of independence. The exact distribution-free null distribution of the test statistic is given for a sample of size 14, and its Monte-Carlo approximation is considered for larger sample sizes. We demonstrate in simulations that this new class of tests has good power properties for very general alternatives.

14.

Mixture Model Analysis of Partially Rank‐Ordered Set Samples: Age Groups of Fish from Length‐Frequency Data

Armin Hatefi Mohammad Jafari Jozani Omer Ozturk 《Scandinavian Journal of Statistics》2015,42(3):848-871

We present a novel methodology for estimating the parameters of a finite mixture model (FMM) based on partially rank‐ordered set (PROS) sampling and use it in a fishery application. A PROS sampling design first selects a simple random sample of fish and creates partially rank‐ordered judgement subsets by dividing units into subsets of prespecified sizes. The final measurements are then obtained from these partially ordered judgement subsets. The traditional expectation–maximization algorithm is not directly applicable for these observations. We propose a suitable expectation–maximization algorithm to estimate the parameters of the FMMs based on PROS samples. We also study the problem of classification of the PROS sample into the components of the FMM. We show that the maximum likelihood estimators based on PROS samples perform substantially better than their simple random sample counterparts even with small samples. The results are used to classify a fish population using the length‐frequency data.

15.

Risk Prediction for Prostate Cancer Recurrence Through Regularized Estimation with Simultaneous Adjustment for Nonlinear Clinical Effects

Long Q Chung M Moreno CS Johnson BA 《The annals of applied statistics》2011,5(3):2003-2023

In biomedical studies, it is of substantial interest to develop risk prediction scores using high-dimensional data such as gene expression data for clinical endpoints that are subject to censoring. In the presence of well-established clinical risk factors, investigators often prefer a procedure that also adjusts for these clinical variables. While accelerated failure time (AFT) models are a useful tool for the analysis of censored outcome data, they assume that covariate effects on the logarithm of time-to-event are linear, which is often unrealistic in practice. We propose to build risk prediction scores through regularized rank estimation in partly linear AFT models, where high-dimensional data such as gene expression data are modeled linearly and important clinical variables are modeled nonlinearly using penalized regression splines. We show through simulation studies that our model has better operating characteristics compared to several existing models. In particular, we show that there is a non-negligible effect on prediction as well as feature selection when nonlinear clinical effects are misspecified as linear. This work is motivated by a recent prostate cancer study, where investigators collected gene expression data along with established prognostic clinical variables and the primary endpoint is time to prostate cancer recurrence.
We analyzed the prostate cancer data and evaluated prediction performance of several models based on the extended c statistic for censored data, showing that 1) the relationship between the clinical variable, prostate specific antigen (PSA), and prostate cancer recurrence is likely nonlinear, i.e., the time to recurrence decreases as PSA increases and starts to level off when PSA becomes greater than 11; 2) correct specification of this nonlinear effect improves performance in prediction and feature selection; and 3) addition of gene expression data does not seem to further improve the performance of the resultant risk prediction scores.

16.

Rank-based outlier detection

Huaming Huang Chilukuri K. Mohan 《Journal of Statistical Computation and Simulation》2013,83(3):518-531

We propose a new approach for outlier detection, based on a ranking measure that focuses on the question of whether a point is ‘central’ for its nearest neighbours. Using our notations, a low cumulative rank implies that the point is central. For instance, a point centrally located in a cluster has a relatively low cumulative sum of ranks because it is among the nearest neighbours of its own nearest neighbours, but a point at the periphery of a cluster has a high cumulative sum of ranks because its nearest neighbours are closer to each other than the point. Use of ranks eliminates the problem of density calculation in the neighbourhood of the point and this improves the performance. Our method performs better than several density-based methods on some synthetic data sets as well as on some real data sets.
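
The ranking idea in this abstract can be sketched roughly as follows: for each point p, take its k nearest neighbours, and for each such neighbour q look up where p falls in q's own nearest-first ordering; summing those positions gives a cumulative rank that is low for central points and high for peripheral ones. This is a minimal sketch of that description only, and the authors' exact score and any normalisation may differ. The function name `rank_outlier_scores`, the parameter `k`, and the toy points are hypothetical.

```python
import math


def rank_outlier_scores(points, k):
    """Cumulative rank of each point among its k nearest neighbours' orderings.

    Higher score = the point ranks poorly from its neighbours' point of view,
    i.e. it is more outlier-like. Brute force, O(n^2 log n).
    """
    n = len(points)
    # order[q]: indices of all other points, nearest to q first.
    order = []
    for q in range(n):
        others = [j for j in range(n) if j != q]
        others.sort(key=lambda j: math.dist(points[q], points[j]))
        order.append(others)
    # rank[q][p]: 1-based position of p in q's nearest-first ordering.
    rank = [{p: pos + 1 for pos, p in enumerate(order[q])} for q in range(n)]
    scores = []
    for p in range(n):
        neighbours = order[p][:k]
        scores.append(sum(rank[q][p] for q in neighbours))
    return scores


if __name__ == "__main__":
    # Toy data (made up): a unit square plus one distant point.
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    scores = rank_outlier_scores(pts, k=2)
    print(scores.index(max(scores)))  # index of the most outlier-like point
```

No density estimate appears anywhere in the computation, which is the property the abstract highlights. Only distance orderings are used, so the score is unchanged by any monotone rescaling of the distances.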

17.

Time Series Classification Based on Spectral Analysis

Shuen-Lin Jeng Ya-Ti Huang 《Communications in Statistics - Simulation and Computation》2013,42(1):132-142

For time series data with obvious periodicity (e.g., electric motor systems and cardiac monitors) or vague periodicity (e.g., earthquakes and explosions, speech, and stock data), frequency-based techniques using spectral analysis can usually capture the features of the series. With this approach, we are able not only to reduce the data dimensions by moving to the frequency domain but also to use these frequencies in general classification methods such as linear discriminant analysis (LDA) and k-nearest-neighbor (KNN) to classify the time series. This is a combination of two classical approaches. However, using LDA and KNN in the frequency domain is difficult because of the excessive dimension of the data. We overcome this obstacle by using singular value decomposition to select essential frequencies. Two data sets are used to illustrate our approach. The classification error rates of our simple approach are comparable to those of several more complicated methods.

18.

Rank tests for independence based on partially right-censored pairs

Shu-Chen Wu 《Communications in Statistics - Theory and Methods》2013,42(19):2207-2216

Linear rank procedures are developed for testing independence with right-censored matched pairs. It is assumed that censoring is independent of the random variables under study. The test statistics are derived as score statistics (Hajek and Sidak, 1967) based on the probability of the generalised rank vectors (Prentice, 1978). Applications to survival data analysis are also discussed.

19.

Sparse subspace linear discriminant analysis

Yanfang Li Jing Lei 《Statistics》2018,52(4):782-800

We study high dimensional multigroup classification from a sparse subspace estimation perspective, unifying linear discriminant analysis (LDA) with other recent developments in high dimensional multivariate analysis that use similar tools, such as penalization methods. We develop two two-stage sparse LDA models, where in the first stage, convex relaxation is used to convert two classical formulations of LDA to semidefinite programs, and the subspace perspective allows for straightforward regularization and estimation. After the initial convex relaxation, we use a refinement stage to improve the accuracy. For the first model, a penalized quadratic program with group lasso penalty is used for refinement, whereas a sparse version of the power method is used for the second model. We carefully examine the theoretical properties of both methods, alongside simulations and real data analysis.

20.

Robust Model-Free Multiclass Probability Estimation

Wu Y Zhang HH Liu Y 《Journal of the American Statistical Association》2010,105(489):424-436

Classical statistical approaches for multiclass probability estimation are typically based on regression techniques such as multiple logistic regression, or density estimation approaches such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). These methods often make certain assumptions on the form of probability functions or on the underlying distributions of subclasses. In this article, we develop a model-free procedure to estimate multiclass probabilities based on large-margin classifiers. In particular, the new estimation scheme is employed by solving a series of weighted large-margin classifiers and then systematically extracting the probability information from these multiple classification rules. A main advantage of the proposed probability estimation technique is that it does not impose any strong parametric assumption on the underlying distribution and can be applied for a wide range of large-margin classification methods. A general computational algorithm is developed for class probability estimation. Furthermore, we establish asymptotic consistency of the probability estimates. Both simulated and real data examples are presented to illustrate competitive performance of the new approach and compare it with several other existing methods.