Similar Literature
1.
Biplots represent a widely used statistical tool for visualizing the loadings and scores that result from a dimension reduction technique applied to multivariate data. If the underlying data carry only relative information (i.e. compositional data expressed in proportions, mg/kg, etc.), they have to be pre-processed with a logratio transformation before the dimension reduction is carried out. In the context of principal component analysis, the resulting biplot is called a compositional biplot. We introduce an alternative, the ilr biplot, which is based on a special choice of orthonormal coordinates resulting from an isometric logratio (ilr) transformation. This also allows external non-compositional variables to be incorporated and their relations to the compositional variables to be studied. The methodology is demonstrated on real data sets.
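A minimal sketch of the coordinate construction referred to above, assuming simulated compositions and one standard "pivot" choice of ilr basis; scikit-learn's PCA then yields the scores that would be plotted in an ilr biplot. This is an illustration of the idea, not the authors' implementation:

```python
import numpy as np
from sklearn.decomposition import PCA

def ilr(X):
    """Pivot-style isometric logratio coordinates for compositions (rows of X, all parts > 0)."""
    X = np.asarray(X, dtype=float)
    n, D = X.shape
    logX = np.log(X)
    Z = np.empty((n, D - 1))
    for j in range(1, D):
        gm = logX[:, :j].mean(axis=1)                     # log geometric mean of the first j parts
        Z[:, j - 1] = np.sqrt(j / (j + 1.0)) * (gm - logX[:, j])
    return Z

# simulated compositional data: 100 samples with 4 parts summing to 1
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=(100, 4))
comp = raw / raw.sum(axis=1, keepdims=True)

scores = PCA(n_components=2).fit_transform(ilr(comp))     # ilr-biplot scores
print(scores[:3])
```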

2.
In high-dimensional data, one often seeks a few interesting low-dimensional projections that reveal important aspects of the data. Projection pursuit for classification finds projections that reveal differences between classes. Even though projection pursuit is used to bypass the curse of dimensionality, most indexes will not work well when there is a small number of observations relative to the number of variables, known as a large p (dimension), small n (sample size) problem. This paper discusses the effects of sample size and dimensionality on classification and proposes a new projection pursuit index that overcomes the small-sample-size problem for exploratory classification.

3.
Multivariate mixture regression models can be used to investigate the relationships between two or more response variables and a set of predictor variables while taking into consideration unobserved population heterogeneity. It is common to take multivariate normal distributions as mixing components, but this mixing model is sensitive to heavy-tailed errors and outliers. Although normal mixture models can in principle approximate any distribution, the number of components needed to account for heavy-tailed distributions can be very large. Mixture regression models based on multivariate t distributions can be considered a robust alternative. Missing data are inevitable in many situations, and parameter estimates can be biased if the missing values are not handled properly. In this paper, we propose a multivariate t mixture regression model with missing information to model heterogeneity in the regression function in the presence of outliers and missing values. Along with robust parameter estimation, the proposed method can be used for (i) visualization of the partial correlation between response variables across latent classes and heterogeneous regressions, and (ii) outlier detection and robust clustering even in the presence of missing values. We also propose a multivariate t mixture regression model using MM-estimation with missing information that is robust to high-leverage outliers. The proposed methodologies are illustrated through simulation studies and real data analysis.

4.
This work focuses on the estimation of distribution functions with incomplete data, where the variable of interest Y has ignorable missingness but the covariate X is always observed. When X is high dimensional, parametric approaches to incorporating the X-information are encumbered by the risk of model misspecification, and nonparametric approaches by the curse of dimensionality. We propose a semiparametric approach, developed within a nonparametric kernel regression framework but with a parametric working index that condenses the high-dimensional X-information into a reduced dimension. This kernel dimension reduction estimator is doubly robust to model misspecification and is most efficient when the working index adequately conveys the X-information about the distribution of Y. Numerical studies indicate better performance of the semiparametric estimator over its parametric and nonparametric counterparts. We apply kernel dimension reduction estimation to an HIV study of the effect of antiretroviral therapy on HIV virologic suppression.
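A toy illustration of the idea, not the paper's estimator: the indicator 1{Y ≤ y} is regressed on a one-dimensional working index β'X with a Nadaraya-Watson smoother fitted to the complete cases, and the fitted values are averaged over all observations. The index direction, bandwidth, and data-generating model below are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:2] = [1.0, -1.0]                # hypothetical working index direction
Y = X @ beta + rng.normal(size=n)
observed = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))     # missingness depends on X only (ignorable)

def F_hat(y, X, Y, observed, beta, h=0.5):
    """Kernel dimension reduction style estimate of F(y) = P(Y <= y)."""
    t = X @ beta                                          # working index for everyone
    t_obs = t[observed]
    ind_obs = (Y[observed] <= y).astype(float)
    # Nadaraya-Watson regression of 1{Y <= y} on the index, evaluated at every t_i
    w = np.exp(-0.5 * ((t[:, None] - t_obs[None, :]) / h) ** 2)
    m = (w * ind_obs).sum(axis=1) / (w.sum(axis=1) + 1e-12)   # guard against empty neighbourhoods
    return m.mean()                                       # average over all X, observed or not

print(F_hat(0.0, X, Y, observed, beta))                   # should be close to 0.5 in this setup
```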

5.
In the past decades, the number of variables explaining observations in practical applications has increased steadily. This has led to heavy computational tasks, despite the widespread use of provisional variable selection methods in data processing. Consequently, more methodological techniques have appeared to reduce the number of explanatory variables without losing much of the information. Among these techniques, two distinct approaches are apparent: 'shrinkage regression' and 'sufficient dimension reduction'. Surprisingly, there has been hardly any communication or comparison between these two methodological categories, and it is not clear when each approach is appropriate. In this paper, we fill some of this gap by first reviewing each category briefly, paying special attention to the most commonly used methods in each. We then compare commonly used methods from both categories based on their accuracy, computation time, and ability to select effective variables. A simulation study of the performance of the methods in each category is presented as well. The selected methods are then tested on two sets of real data, which allows us to recommend conditions under which one approach is more appropriate for high-dimensional data.

6.
A Gaussian process (GP) can be thought of as an infinite collection of random variables with the property that any subset, say of dimension n, has a multivariate normal distribution of dimension n with mean vector β and covariance matrix Σ [O'Hagan, A., 1994, Kendall's Advanced Theory of Statistics, Vol. 2B, Bayesian Inference (John Wiley & Sons, Inc.)]. The elements of the covariance matrix are routinely specified as the product of a common variance and a correlation function. It is important to use a correlation function that yields a valid (positive definite) covariance matrix. Further, it is well known that the smoothness of a GP is directly related to the specification of its correlation function. Also, from a Bayesian point of view, a prior distribution must be assigned to the unknowns of the model. Therefore, when using a GP to model a phenomenon, the researcher faces two challenges: specifying a correlation function and choosing a prior distribution for its parameters. The literature contains many classes of correlation functions that provide a valid covariance structure, as well as many suggested prior distributions for the parameters involved in these functions. We aim to investigate how sensitive GPs are to the (sometimes arbitrary) choices of their correlation functions. To this end, we simulated 25 data sets, each of size 64, over the square [0, 5] × [0, 5] with a specific correlation function and fixed values of the GP's parameters. We then fit different correlation structures to these data, with different prior specifications, and assess the performance of the fitted models using different model comparison criteria.
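A hedged scikit-learn sketch of the kind of sensitivity study described: one realization of size 64 on [0, 5] × [0, 5] is simulated from a Matérn correlation with fixed parameters, and competing correlation structures are then fitted and compared. Maximum-likelihood fits and the log marginal likelihood stand in for the full Bayesian treatment and the model comparison criteria used in the paper:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern

rng = np.random.default_rng(2)

# 8 x 8 grid over [0, 5] x [0, 5] (size 64, as in the abstract)
g = np.linspace(0.0, 5.0, 8)
X = np.array([(a, b) for a in g for b in g])

# simulate one realization from a GP with a Matern(nu=1.5) correlation and fixed parameters
true_kernel = 1.0 * Matern(length_scale=1.0, nu=1.5)
K = true_kernel(X) + 1e-8 * np.eye(len(X))
y = rng.multivariate_normal(np.zeros(len(X)), K)

# refit under different correlation assumptions and compare the fits
candidates = {
    "Matern 1/2 (exponential)": Matern(length_scale=1.0, nu=0.5),
    "Matern 3/2":               Matern(length_scale=1.0, nu=1.5),
    "Gaussian (RBF)":           RBF(length_scale=1.0),
}
for name, k in candidates.items():
    gp = GaussianProcessRegressor(kernel=1.0 * k, alpha=1e-8).fit(X, y)
    print(f"{name:28s} log marginal likelihood = {gp.log_marginal_likelihood_value_:.2f}")
```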

7.
In recent years, effective monitoring of data quality has increasingly attracted the attention of researchers in the area of statistical process control. Among the relevant research on this topic, none has used multivariate methods to control the multidimensional data quality process; instead, multiple univariate control charts have been relied upon. Based on a novel one-sided multivariate exponentially weighted moving average (MEWMA) chart, we propose a conditional false discovery rate-adjusted scheme for on-line monitoring of the data quality of high-dimensional data streams. With thousands of input data streams, the average run length loses its usefulness because out-of-control signals are likely at each time period. Hence, we first control the percentage of signals that are false alarms. Then, we compare the power of the proposed MEWMA scheme with that of two alternative methods. Numerical results show that the proposed MEWMA scheme has higher average power than the two competitors.
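The paper's one-sided MEWMA and its false discovery rate adjustment across thousands of streams are not reproduced here; the sketch below only shows the standard (two-sided) MEWMA statistic that such schemes monitor for a single multivariate stream, with illustrative parameters:

```python
import numpy as np

def mewma_statistics(X, Sigma, lam=0.1):
    """Standard MEWMA statistics T2_t for a stream of p-variate observations.

    X     : (T, p) array of observations, assumed to have in-control mean 0
    Sigma : (p, p) in-control covariance matrix
    lam   : smoothing constant
    """
    T, p = X.shape
    z = np.zeros(p)
    stats = np.empty(T)
    for t in range(T):
        z = lam * X[t] + (1 - lam) * z
        # exact covariance of the EWMA vector at time t+1 (Lowry et al. form)
        sigma_z = (lam / (2 - lam)) * (1 - (1 - lam) ** (2 * (t + 1))) * Sigma
        stats[t] = z @ np.linalg.solve(sigma_z, z)
    return stats

rng = np.random.default_rng(8)
p = 5
Sigma = np.eye(p)
incontrol = rng.multivariate_normal(np.zeros(p), Sigma, size=50)
shifted = rng.multivariate_normal(np.full(p, 0.5), Sigma, size=50)   # mean shift after t = 50
print(mewma_statistics(np.vstack([incontrol, shifted]), Sigma).round(2))
```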

8.
The effect of nonstationarity in the time series columns of input data on principal components analysis is examined. Nonstationarity is very common among economic indicators collected over time, which are subsequently summarized into fewer indices for monitoring purposes. Due to the simultaneous drifting of the nonstationary time series, usually caused by the trend, the first component averages all the variables without necessarily reducing dimensionality. Sparse principal components analysis can be used instead, but attainment of sparsity among the loadings (and hence dimension reduction) is influenced by the choice of the parameters λ1,j. Simulated data with more variables than observations and with different patterns of cross-correlations and autocorrelations are used to illustrate the advantages of sparse principal components analysis over ordinary principal components analysis. Sparse component loadings for nonstationary time series data can be achieved provided that appropriate values of λ1,j are used. We provide the range of values of λ1,j that ensures convergence of the sparse principal components algorithm and consequently achieves sparsity of the component loadings.
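An illustrative scikit-learn comparison in the spirit of the simulation described: a few columns share a common trend, p exceeds n, and the l1 penalty alpha of SparsePCA plays the role of λ1,j. The penalty value here is an arbitrary choice, not one drawn from the range recommended in the paper:

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(3)
n, p = 30, 50                                    # more variables than observations
t = np.arange(n)

# the first 5 columns share a common linear trend (nonstationary); the rest are white noise
trend = 0.2 * t
X = rng.normal(scale=0.5, size=(n, p))
X[:, :5] += trend[:, None]
X = X - X.mean(axis=0)

dense_load = PCA(n_components=1).fit(X).components_[0]
sparse_load = SparsePCA(n_components=1, alpha=2.0, random_state=0).fit(X).components_[0]

print("nonzero loadings, ordinary PCA:", np.sum(np.abs(dense_load) > 1e-6))
print("nonzero loadings, sparse PCA:  ", np.sum(np.abs(sparse_load) > 1e-6))
```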

9.
We summarize, review and comment upon three papers which discuss the use of discrete, noisy, incomplete, scattered pairwise dissimilarity data in statistical model building. Convex cone optimization codes are used to embed the objects into a Euclidean space which respects the dissimilarity information while controlling the dimension of the space. A “newbie” algorithm is provided for embedding new objects into this space. This allows the dissimilarity information to be incorporated into a smoothing spline ANOVA penalized likelihood model, a support vector machine, or any model that will admit reproducing kernel Hilbert space components, for nonparametric regression, supervised learning, or semisupervised learning. Future work and open questions are discussed. The papers are:

10.
The uniform scores test is a rank-based method for testing the homogeneity of k populations in circular data problems. The influence of ties on the uniform scores test has been emphasized by several authors in articles and books, and it has been suggested that the test be used with caution if ties are present in the data. This paper investigates the influence of ties on the uniform scores test by computing the power of the test when ties are broken by the average, randomization, permutation, minimum, and maximum methods. Monte Carlo simulation is performed to compute the power of the test under several scenarios, such as 5% or 10% ties and different tie group structures in the data. The simulation study shows no significant difference among the methods in the presence of ties, but the test loses power when there are many ties or complicated group structures. Thus, the randomization and average methods are equally powerful ways to break ties when applying the uniform scores test. It can also be concluded that the k-sample uniform scores test can be used safely without sacrificing power if there are fewer than 5% ties or at most two small groups of ties.
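A sketch of the k-sample uniform scores (Wheeler-Watson) statistic with two of the tie-breaking rules compared in the paper (average ranks versus random tie-breaking); the chi-square reference with 2(k − 1) degrees of freedom is the usual large-sample approximation, and the example data are rounded uniform angles rather than the paper's simulation design:

```python
import numpy as np
from scipy import stats

def uniform_scores_test(samples, ties="average", rng=None):
    """k-sample uniform scores (Wheeler-Watson) test for circular data.

    samples : list of 1-D arrays of angles in radians
    ties    : 'average' ranks or 'random' tie-breaking
    Returns the statistic W and the chi-square p-value with 2(k-1) df.
    """
    data = np.concatenate(samples).astype(float)
    N = len(data)
    if ties == "random":
        if rng is None:
            rng = np.random.default_rng()
        data = data + 1e-9 * rng.random(N)        # tiny jitter breaks ties at random
    ranks = stats.rankdata(data, method="average")
    beta = 2.0 * np.pi * ranks / N                # uniform scores on the circle
    W, start = 0.0, 0
    for s in samples:
        b = beta[start:start + len(s)]
        start += len(s)
        W += (np.cos(b).sum() ** 2 + np.sin(b).sum() ** 2) / len(s)
    W *= 2.0
    df = 2 * (len(samples) - 1)
    return W, stats.chi2.sf(W, df)

# example with ties created by rounding angles to one decimal place
rng = np.random.default_rng(4)
grp1 = np.round(rng.uniform(0, 2 * np.pi, 40), 1)
grp2 = np.round(rng.uniform(0, 2 * np.pi, 40), 1)
print(uniform_scores_test([grp1, grp2], ties="average"))
print(uniform_scores_test([grp1, grp2], ties="random", rng=rng))
```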

11.
Most methods for describing the relationship among random variables require specific probability distributions and assumptions on the random variables. Mutual information, an entropy-based measure of the dependency among random variables, needs no specific distribution or assumptions. Redundancy, an analogous version of mutual information, has also been proposed. In this paper, the concepts of redundancy and mutual information are explored as applied to multi-dimensional categorical data. We found that mutual information and redundancy for categorical data can be expressed as functions of the generalized likelihood ratio statistic under several kinds of independence log-linear models. As a consequence, mutual information and redundancy can also be used to analyze contingency tables stochastically. Whereas the generalized likelihood ratio statistic for testing the goodness-of-fit of log-linear models is sensitive to the sample size, the redundancy for categorical data does not depend on the sample size but only on the cell probabilities.
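A small numerical check of the connection described above for a two-way table, assuming the usual definitions: the likelihood ratio statistic G² for the independence log-linear model equals 2N times the empirical mutual information (in nats), while redundancy, here normalized by the smaller marginal entropy, is free of the sample size:

```python
import numpy as np

table = np.array([[30.0, 10.0,  5.0],
                  [20.0, 40.0, 15.0]])        # observed counts, 2 x 3 contingency table
N = table.sum()
p = table / N                                 # joint cell probabilities
px, py = p.sum(axis=1), p.sum(axis=0)         # marginals

# empirical mutual information (nats) and redundancy (MI / smaller marginal entropy)
nz = p > 0
MI = np.sum(p[nz] * np.log(p[nz] / np.outer(px, py)[nz]))
Hx = -np.sum(px * np.log(px))
Hy = -np.sum(py * np.log(py))
redundancy = MI / min(Hx, Hy)

# likelihood ratio (G^2) statistic for the independence log-linear model
expected = N * np.outer(px, py)
G2 = 2.0 * np.sum(table[nz] * np.log(table[nz] / expected[nz]))

print(f"G^2 = {G2:.4f},  2*N*MI = {2 * N * MI:.4f}")    # identical
print(f"redundancy = {redundancy:.4f}  (does not grow with N)")
```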

12.
Most methods for survival prediction from high-dimensional genomic data combine the Cox proportional hazards model with some technique of dimension reduction, such as partial least squares regression (PLS). Applying PLS to the Cox model is not entirely straightforward, and multiple approaches have been proposed. The method of Park et al. (Bioinformatics 18(Suppl. 1):S120–S127, 2002) uses a reformulation of the Cox likelihood as a Poisson-type likelihood, thereby enabling estimation by iteratively reweighted partial least squares for generalized linear models. We propose a modification of the method of Park et al. (2002) such that estimates of the baseline hazard and the gene effects are obtained in separate steps. The resulting method has several advantages over the method of Park et al. (2002) and other existing Cox PLS approaches: it allows estimation of survival probabilities for new patients, enables a less memory-demanding estimation procedure, and allows incorporation of lower-dimensional non-genomic variables such as disease grade and tumor thickness. We also propose to combine our Cox PLS method with an initial gene selection step in which genes are ordered by their Cox score and only the highest-ranking k% of the genes are retained, yielding a so-called supervised partial least squares regression method. In simulations, both the unsupervised and the supervised versions outperform other Cox PLS methods.
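A sketch of the supervised screening step only, on simulated data: each gene is ranked by the absolute Wald z-statistic from a univariate Cox fit (a stand-in for the Cox score used in the paper) and the top k% are retained; the subsequent PLS step on the retained genes is not shown. The lifelines package is assumed to be available:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(5)
n, p = 120, 200                                           # patients, genes
genes = pd.DataFrame(rng.normal(size=(n, p)),
                     columns=[f"g{j}" for j in range(p)])
time = rng.exponential(scale=np.exp(-0.5 * genes["g0"]))  # g0 is truly prognostic
event = (rng.random(n) < 0.7).astype(int)                 # roughly 30% censoring

# univariate Cox fit per gene; rank by the absolute Wald z-statistic
scores = {}
for g in genes.columns:
    df = pd.DataFrame({"T": time, "E": event, g: genes[g]})
    cph = CoxPHFitter().fit(df, duration_col="T", event_col="E")
    scores[g] = abs(cph.summary.loc[g, "z"])

k = 0.05                                                  # keep the top 5% of genes
n_keep = max(1, int(k * p))
selected = sorted(scores, key=scores.get, reverse=True)[:n_keep]
print("retained genes:", selected)
```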

13.
This paper proposes an exponential class of dynamic binary choice panel data models for the analysis of short T (time dimension), large N (cross-section dimension) panel data sets that allow unobserved heterogeneity (fixed effects) to be arbitrarily correlated with the covariates. The paper derives moment conditions that are invariant to the fixed effects, which are then used to identify and estimate the parameters of the model. Accordingly, generalized method of moments (GMM) estimators are proposed that are consistent and asymptotically normally distributed at the root-N rate. We also study the conditional likelihood approach and show that, under the exponential specification, it can identify the effect of state dependence but not the effects of the other covariates. Monte Carlo experiments show satisfactory finite sample performance of the proposed estimators and investigate their robustness to misspecification.

14.
Left-truncation often arises when patient information, such as time of diagnosis, is gathered retrospectively. In some cases, the distribution function, say G(x), of the left-truncated variable can be parameterized as G(x; θ), where θ ∈ Θ ⊂ R^q is a q-dimensional parameter vector. Under semiparametric transformation models, we demonstrate that the approach of Chen et al. (Semiparametric analysis of transformation models with censored data. Biometrika. 2002;89:659–668) can be used to analyse this type of data. The asymptotic properties of the proposed estimators are derived. A simulation study is conducted to investigate the performance of the proposed estimators.

15.
The Cox proportional frailty model with a random effect has been proposed for the analysis of right-censored data which consist of a large number of small clusters of correlated failure time observations. For right-censored data, Cai et al. [3] proposed a class of semiparametric mixed-effects models which provides useful alternatives to the Cox model. We demonstrate that the approach of Cai et al. [3] can be used to analyze clustered doubly censored data when both left- and right-censoring variables are always observed. The asymptotic properties of the proposed estimator are derived. A simulation study is conducted to investigate the performance of the proposed estimator.

16.
We consider the situation where there is a known regression model that can be used to predict an outcome, Y, from a set of predictor variables X. A new variable B is expected to enhance the prediction of Y. A dataset of size n containing Y, X and B is available, and the challenge is to build an improved model for Y|X,B that uses both the available individual-level data and some summary information obtained from the known model for Y|X. We propose a synthetic data approach, which consists of creating m additional synthetic data observations and then analyzing the combined dataset of size n + m to estimate the parameters of the Y|X,B model. This combined dataset of size n + m has missing values of B for m of the observations and is analyzed using methods that can handle missing data (e.g., multiple imputation). We present simulation studies and illustrate the method using data from the Prostate Cancer Prevention Trial. Though the synthetic data method is applicable to a general regression context, to provide some justification, we show in two special cases that the asymptotic variances of the parameter estimates in the Y|X,B model are identical to those from an alternative constrained maximum likelihood estimation approach. This correspondence in special cases and the method's broad applicability make it appealing for use across diverse scenarios. The Canadian Journal of Statistics 47: 580–603; 2019 © 2019 Statistical Society of Canada
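A rough sketch of the synthetic data idea under a simple linear working model: m synthetic rows are generated from the assumed-known Y|X model with B set to missing, and the combined n + m rows are analyzed after imputation. A single scikit-learn IterativeImputer pass stands in for proper multiple imputation, and every model and parameter below is illustrative:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)

# n individual-level records containing Y, X and the new variable B
n, m = 200, 400
X = rng.normal(size=(n, 1))
B = 0.8 * X[:, 0] + rng.normal(scale=0.5, size=n)
Y = 1.0 + 2.0 * X[:, 0] + 1.5 * B + rng.normal(size=n)

# summary information from the established Y|X model (coefficients assumed known;
# values implied by the data-generating model above)
known_intercept, known_slope, known_sigma = 1.0, 3.2, 1.25

# create m synthetic observations from the known Y|X model; B is missing for them
X_syn = rng.normal(size=(m, 1))
Y_syn = known_intercept + known_slope * X_syn[:, 0] + rng.normal(scale=known_sigma, size=m)
B_syn = np.full(m, np.nan)

# combined data set of size n + m with missing B, analyzed after imputation
combined = np.column_stack([np.r_[Y, Y_syn], np.r_[X[:, 0], X_syn[:, 0]], np.r_[B, B_syn]])
imputed = IterativeImputer(random_state=0).fit_transform(combined)

fit = LinearRegression().fit(imputed[:, 1:], imputed[:, 0])   # regress Y on (X, B)
print("estimated (X, B) coefficients:", fit.coef_)
```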

17.
We describe inferactive data analysis, so named to denote an interactive approach to data analysis with an emphasis on inference after data analysis. Our approach is a compromise between Tukey's exploratory and confirmatory data analysis, and it also allows for Bayesian data analysis. We see this as a useful concrete step in providing tools (with statistical guarantees) for current data scientists. The basis of inference we use is (a conditional approach to) selective inference, in particular its randomized form. The relevant reference distributions are constructed from what we call a DAG-DAG, a Data Analysis Generative DAG, and a selective change-of-variables formula is crucial to any practical implementation of inferactive data analysis via sampling from these distributions. We discuss a canonical example of an incomplete cross-validation test statistic to discriminate between black-box models, and a real HIV dataset example to illustrate inference after making multiple queries on the data.

18.
In this paper, we consider ultrahigh-dimensional sufficient dimension reduction (SDR) for censored data with measurement error in the covariates. We first propose a feature screening procedure for censored data with covariates subject to measurement error. With a suitable correction for mismeasurement, the error-contaminated variables detected by the proposed feature screening procedure are the same as the truly important variables. Based on the selected active variables, we develop an SDR method to estimate the central subspace and the structural dimension with both censoring and measurement error incorporated. Theoretical results for the proposed method are established. Simulation studies are reported to assess the performance of the proposed method, and the method is applied to the NKI breast cancer data.

19.
Panel data models with factor structures in both the errors and the regressors have received considerable attention recently. In these models, the errors and the regressors are correlated, and the standard estimators are inconsistent. This paper shows that, for such models, a modified first-difference estimator (in which the time and cross-sectional dimensions are interchanged) is consistent as the cross-sectional dimension grows while the time dimension remains small. Although the estimator has a nonstandard asymptotic distribution, t and F tests have standard asymptotic distributions under the null hypothesis.

20.
Mihyun Kim, Statistics, 2019, 53(4): 699–720
Functional principal component scores are commonly used to reduce mathematically infinite-dimensional functional data to finite-dimensional vectors. In certain applications, most notably in finance, these scores exhibit tail behaviour consistent with the assumption of regular variation. Knowledge of the index of regular variation, α, is needed to apply methods of extreme value theory. The most commonly used method for estimating α is the Hill estimator. We derive conditions under which the Hill estimator computed from the sample scores is consistent for the tail index of the unobservable population scores.
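A numpy sketch of the plug-in procedure the consistency result concerns: curves with heavy-tailed scores are simulated, sample scores are extracted by PCA of the discretized curves (standing in for functional PCA), and the Hill estimator is applied to the k largest scores. The choice of k and the data-generating model are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

def hill(x, k):
    """Hill estimator of the tail index alpha from the k largest positive values."""
    x = np.sort(x[x > 0])
    return k / np.sum(np.log(x[-k:] / x[-k - 1]))

rng = np.random.default_rng(7)
n, grid = 2000, np.linspace(0, 1, 101)

# curves X_i(t) = xi_i * sqrt(2) sin(pi t) + noise; the scores xi_i are symmetric
# with Pareto tails of index alpha = 3, so FPCA centering barely shifts them
alpha_true = 3.0
xi = rng.choice([-1.0, 1.0], size=n) * (rng.pareto(alpha_true, size=n) + 1.0)
curves = (xi[:, None] * np.sqrt(2) * np.sin(np.pi * grid)[None, :]
          + 0.1 * rng.normal(size=(n, len(grid))))

scores = PCA(n_components=1).fit_transform(curves)[:, 0]   # sample FPC scores
scores *= np.sign(np.corrcoef(scores, xi)[0, 1])           # resolve the arbitrary PCA sign
print("Hill estimate:", round(hill(scores, k=100), 2), "(true tail index 3)")
```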
