Similar Documents
20 similar documents were retrieved (search time: 15 ms).
1.
The idea of searching for orthogonal projections from a multidimensional space onto a linear subspace, as an aid to detecting nonlinear structure, has been named exploratory projection pursuit. Most approaches are tied to the idea of searching for interesting projections; typically, an interesting projection is one where the distribution of the projected data differs from the normal distribution. In this paper we define two projection indices aimed specifically at finding projections that best reveal grouped structure in the plane, if such structure exists in the multidimensional space. These involve a numerical optimization problem that is tackled in two stages, the projection and the pursuit: the first is based on a procedure for generating pseudo-random rotation matrices in the sense of the grand tour of D. Asimov (1985), and the second is a local numerical optimization procedure. One artificial and one real example illustrate the performance of the suggested indices.
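
A minimal sketch of the projection stage may help. The snippet below draws a pseudo-random rotation matrix via QR decomposition — a standard construction for random orthogonal matrices, not Asimov's grand tour path itself — and uses it to project hypothetical data onto a plane, which would then seed the local pursuit stage.

```python
# Sketch: random rotation + planar projection (illustrative, not the paper's code).
import numpy as np

def random_rotation(p, rng):
    """Sample a p x p rotation matrix (Haar-uniform) via QR decomposition."""
    A = rng.standard_normal((p, p))
    Q, R = np.linalg.qr(A)
    Q *= np.sign(np.diag(R))          # fix column signs so Q is unique
    if np.linalg.det(Q) < 0:          # force det = +1 (a proper rotation)
        Q[:, 0] = -Q[:, 0]
    return Q

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))     # hypothetical 5-dimensional data
Q = random_rotation(5, rng)
plane = X @ Q[:, :2]                  # 2-D projection used as a starting
print(plane.shape)                    # point for the local "pursuit" stage
```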

2.
In this paper, a new method for robust principal component analysis (PCA) is proposed. PCA is a widely used tool for dimension reduction without substantial loss of information; however, classical PCA is vulnerable to outliers because it depends on the empirical covariance matrix. To avoid this weakness, several alternative approaches based on robust scatter matrices have been suggested. A popular choice is ROBPCA, which combines projection pursuit ideas with robust covariance estimation via a variance maximization criterion. Our approach is based on the fact that PCA can be formulated as a regression-type optimization problem, which is the main difference from previous approaches. The proposed robust PCA is derived by replacing the squared loss with a robust penalty, the Huber loss function. A practical algorithm is proposed to carry out the optimization, and the convergence properties of the algorithm are investigated. Results from a simulation study and a real data example demonstrate the promising empirical properties of the proposed method.
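
Not the authors' exact algorithm, but an illustrative sketch in the same spirit: the first principal component computed by iteratively reweighted least squares, where each point's weight comes from a Huber-type function applied to the norm of its reconstruction residual. The cutoff c and the MAD-style scale estimate are conventional choices, assumed here for illustration.

```python
# Sketch: Huber-weighted first principal component via IRLS (illustrative).
import numpy as np

def huber_pc1(X, c=1.345, n_iter=50):
    X = X - np.median(X, axis=0)              # robust centering
    w = np.ones(len(X))
    for _ in range(n_iter):
        C = (X * w[:, None]).T @ X / w.sum()  # weighted covariance
        vals, vecs = np.linalg.eigh(C)
        v = vecs[:, -1]                       # leading eigenvector
        resid = X - np.outer(X @ v, v)        # reconstruction residuals
        r = np.linalg.norm(resid, axis=1)
        scale = np.median(r) / 0.6745 + 1e-12 # robust residual scale
        w = np.minimum(1.0, c * scale / np.maximum(r, 1e-12))  # Huber weights
    return v

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3)) @ np.diag([3.0, 1.0, 0.5])
X[:5] += 20.0                                 # a few gross outliers
print(huber_pc1(X))                           # direction barely moved by them
```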

3.
Projection techniques for nonlinear principal component analysis
Principal Components Analysis (PCA) is traditionally a linear technique for projecting multidimensional data onto lower-dimensional subspaces with minimal loss of variance. However, in several applications the data lie in a lower-dimensional subspace that is not linear; in these cases linear PCA is not the optimal method for recovering this subspace and thus accounting for the largest proportion of variance in the data. Nonlinear PCA addresses this problem by relaxing the linearity restrictions of standard PCA. We investigate linear and nonlinear approaches to PCA, both separately and in combination; in particular, we introduce a combination of projection pursuit and nonlinear regression for nonlinear PCA. We compare the success of PCA techniques in variance recovery by applying linear, nonlinear and hybrid methods to simulated and real data sets. We show that the best linear projection that captures the structure in the data (in the sense that the original data can be reconstructed from the projection) is not necessarily a (linear) principal component. We also show that the ability of certain nonlinear projections to capture data structure is affected by the choice of constraint in the eigendecomposition of a nonlinear transform of the data. Similar success in recovering data structure was observed for both linear and nonlinear projections.
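
A toy sketch of the hybrid idea (not the paper's method): project onto the leading linear component, then reconstruct each coordinate as a nonlinear (here, quadratic) function of that score. On curved data this recovers more structure than the purely linear rank-1 reconstruction. The data-generating curve is hypothetical.

```python
# Sketch: linear rank-1 PCA reconstruction vs. a hybrid nonlinear one.
import numpy as np

rng = np.random.default_rng(13)
t = rng.uniform(-3, 3, 300)
X = np.column_stack([t, 0.3 * t ** 2]) + 0.05 * rng.standard_normal((300, 2))

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
score = Xc @ Vt[0]                                 # 1-D score on the leading PC
linear = X.mean(axis=0) + np.outer(score, Vt[0])   # rank-1 linear reconstruction
hybrid = np.column_stack([np.polyval(np.polyfit(score, X[:, j], 2), score)
                          for j in range(2)])      # nonlinear map: score -> data

for name, R in (("linear PCA", linear), ("hybrid", hybrid)):
    print(name, "MSE:", np.mean((X - R) ** 2).round(4))
```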

4.
Concerning the task of integrating census and survey data from different sources, as carried out by supranational statistical agencies, a formal metadata approach is investigated that supports data integration and table processing simultaneously. To this end, a metadata model is devised so that statistical query processing is accomplished by means of symbolic reasoning on machine-readable, operative metadata. As in databases, statistical queries are stated as formal expressions that specify declaratively what the intended output is; the operations necessary to retrieve appropriate available source data and to aggregate them into the requested macrodata are derived mechanically. Using simple mathematics, this paper focuses particularly on the metadata model devised to harmonize semantically related data sources, as well as on the table model providing the principal data structure of the proposed system. Only an outline of the general design of a statistical information system based on the proposed metadata model is given, and the state of development is summarized briefly.
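
A toy illustration of the declarative idea (the schema and data are entirely hypothetical, and this does not reproduce the paper's metadata model): the query states which macrodata are wanted, and the aggregation over harmonized microdata is carried out mechanically from that specification.

```python
# Sketch: a declarative table query executed mechanically over microdata.
import pandas as pd

micro = pd.DataFrame({
    "country": ["AT", "AT", "DE", "DE", "DE"],
    "sex":     ["f", "m", "f", "f", "m"],
    "income":  [30.0, 34.0, 31.0, 29.0, 36.0],
})
query = {"rows": "country", "cols": "sex",
         "measure": "income", "op": "mean"}        # declarative specification
table = micro.pivot_table(index=query["rows"], columns=query["cols"],
                          values=query["measure"], aggfunc=query["op"])
print(table)                                        # the requested macrodata
```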

5.
Recent evidence indicates that regressions on multiple forward rates sharply predict future excess returns on U.S. Treasury bonds, with R² values around 30%. The projection coefficients in these regressions exhibit a distinct pattern related to the maturity of the forward rate. These dimensions of the data, in conjunction with the transition dynamics of bond yields, pose a serious challenge to term structure models. In this article we show that a regime-shifting term structure model can empirically account for these challenging features of the data; alternative models, such as affine specifications, fail to do so. We find that regimes in the model are intimately related to bond risk premia and real business cycles.
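
A schematic of the predictive regressions described above: one-period excess returns regressed on several forward rates. The data, coefficients and noise level below are simulated placeholders; with real yield data the abstract reports R² values of roughly 30%.

```python
# Sketch: excess bond returns regressed on multiple forward rates.
import numpy as np

rng = np.random.default_rng(2)
T = 300
forwards = rng.standard_normal((T, 5))        # f^(1)..f^(5), hypothetical
beta = np.array([-1.0, 0.5, 1.5, 0.4, -1.2])  # maturity pattern (illustrative)
xret = forwards @ beta + rng.standard_normal(T) * 2.0

Z = np.column_stack([np.ones(T), forwards])   # add an intercept
coef, *_ = np.linalg.lstsq(Z, xret, rcond=None)
fitted = Z @ coef
r2 = 1 - np.sum((xret - fitted) ** 2) / np.sum((xret - xret.mean()) ** 2)
print("projection coefficients:", np.round(coef[1:], 2))
print("R^2:", round(r2, 2))
```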

6.
This article considers an approach to estimating and testing a new Kronecker product covariance structure for three-level multivariate data (multiple time points (p), multiple sites (u) and multiple response variables (q)). Testing such a covariance structure is potentially important for high-dimensional multilevel multivariate data. The hypothesis testing procedure developed in this article can not only test hypotheses for three-level multivariate data, but also, as special cases, many different hypotheses for two-level multivariate data, such as blocked compound symmetry. The tests are illustrated on two real data sets.
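
A minimal sketch of the covariance structure under test: for three-level data the (p·u·q) × (p·u·q) covariance is modelled as a Kronecker product of three component matrices. The dimensions and the AR(1) component blocks below are hypothetical.

```python
# Sketch: simulating data with a three-factor Kronecker covariance.
import numpy as np

p, u, q = 3, 2, 2
def ar1(n, rho):                       # simple AR(1) correlation block
    i = np.arange(n)
    return rho ** np.abs(i[:, None] - i[None, :])

U, V, W = ar1(p, 0.6), ar1(u, 0.3), ar1(q, 0.5)
Sigma = np.kron(U, np.kron(V, W))      # the Kronecker product structure
rng = np.random.default_rng(3)
X = rng.multivariate_normal(np.zeros(p * u * q), Sigma, size=100)
print(Sigma.shape, X.shape)            # (12, 12) (100, 12)
```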

7.
Six years of rainfall-event pH measurements from the nine-station MAP3S/PCN monitoring network in the eastern United States were analyzed. The initial objective was an attempted validation of the model developed by Eynon and Switzer (1983, Canad. J. Statist. 11, 11–24) on this independent data set. Because some features of the structure presumed in that model are not evident in this data set, the underlying structure of the data was then explored in some detail. Both aspects of the investigation confirmed that identifying an appropriate statistical model for such data is a difficult undertaking: anticipated structure may not be evident, and the data for specific stations or years may exhibit anomalous behavior.

8.
Recurrent events involve repeated occurrences of the same type of event over time and are commonly encountered in longitudinal studies; examples include seizures in epilepsy studies or the occurrence of cancer tumors. In such studies, interest lies in the number of events that occur over a fixed period of time. One considerable challenge in analyzing such data arises when a large proportion of patients discontinues before the end of the study, for example because of adverse events, leading to partially observed data. In this situation, data are often modeled using a negative binomial distribution with time in study as offset. Such an analysis assumes that data are missing at random (MAR). As the adequacy of MAR cannot be tested, sensitivity analyses that assess the robustness of conclusions across a range of different assumptions need to be performed. Sophisticated sensitivity analyses are frequently performed for continuous data, but less often for recurrent event or count data. We present a flexible approach to performing clinically interpretable sensitivity analyses for recurrent event data. Our approach fits into the framework of reference-based imputation, where information from reference arms can be borrowed to impute post-discontinuation data, and different assumptions can be made about the future behavior of dropouts depending on the reason for dropout and the treatment received. The imputation model is based on a flexible model that allows for time-varying baseline intensities. We assess the performance in a simulation study and provide an illustration with a clinical trial in patients who suffer from bladder cancer.
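
A sketch of the primary analysis described above: event counts modelled with a negative binomial GLM using log time-in-study as an offset. The data are simulated placeholders, and a fixed dispersion alpha is assumed here, whereas in practice it would be estimated; the reference-based imputation itself is not reproduced.

```python
# Sketch: negative binomial model for event counts with a time offset.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
treat = rng.integers(0, 2, n)                  # 0 = reference, 1 = treated
time = rng.uniform(0.2, 1.0, n)                # years on study (early dropout)
mu = time * np.exp(0.5 - 0.7 * treat)          # lower event rate under treatment
y = rng.poisson(mu * rng.gamma(2.0, 0.5, n))   # overdispersed counts

X = sm.add_constant(treat.astype(float))
model = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=0.5),
               offset=np.log(time))            # log time-in-study as offset
res = model.fit()
print(res.params)                              # intercept, treatment effect
```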

9.
A supersaturated design (SSD) is a design whose run size is not large enough to estimate all the main effects. The goal in conducting such a design is to identify, presumably only a few, relatively dominant active effects at as low a cost as possible. However, data analysis for such designs remains underdeveloped: traditional approaches are not appropriate in this situation, and several methods proposed in the literature in recent years are effective only for analyzing two-level SSDs. In this paper, we introduce a variable selection procedure, called PLSVS, to screen active effects in mixed-level SSDs based on the variable importance in projection, an important concept in partial least-squares regression. Simulation studies show that this procedure is effective.
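
A sketch of the variable-importance-in-projection (VIP) scores on which the screening is based, using the standard VIP formula and scikit-learn's PLS implementation; the PLSVS screening rule itself is not reproduced, and the data are simulated with more factors than runs, as in an SSD.

```python
# Sketch: VIP scores from a fitted PLS regression (standard formula).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls, X):
    t = pls.x_scores_                      # (n, A) latent scores
    w = pls.x_weights_                     # (p, A) weight vectors
    q = pls.y_loadings_                    # (1, A) for a single response
    ss = (q[0] ** 2) * np.sum(t ** 2, axis=0)       # variance explained per comp.
    wnorm2 = (w / np.linalg.norm(w, axis=0)) ** 2   # normalised squared weights
    p = X.shape[1]
    return np.sqrt(p * (wnorm2 @ ss) / ss.sum())

rng = np.random.default_rng(5)
X = rng.standard_normal((14, 10))          # 14 runs, 10 factors
y = 2 * X[:, 0] - 3 * X[:, 4] + 0.5 * rng.standard_normal(14)
pls = PLSRegression(n_components=2).fit(X, y)
print(np.round(vip_scores(pls, X), 2))     # factors 1 and 5 should stand out
```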

10.
Shi, Wang, Murray-Smith and Titterington (Biometrics 63:714–723, 2007) proposed a Gaussian process functional regression (GPFR) model for functional response curves with a set of functional covariates. Their method addresses two main problems: modelling a nonlinear and nonparametric regression relationship, and modelling the covariance structure and mean structure simultaneously. It gives very good results for curve fitting and prediction but side-steps the problem of heterogeneity. In this paper we present a new method for modelling functional data with 'spatially' indexed data, i.e., where the heterogeneity depends on factors such as region and individual patient information. For data collected from different sources, we assume that the data corresponding to each curve (or batch) follow a Gaussian process functional regression model as a lower-level model, and we introduce an allocation model for the latent indicator variables as a higher-level model that depends on the information related to each batch. This method takes advantage of both GPFR and mixture models and therefore improves the accuracy of predictions. The mixture model is also used for curve clustering, focusing on clustering the functional relationships between response curve and covariates, i.e., clustering based on the shape of the functional response against the set of functional covariates. The model is examined on simulated and real data.
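
A sketch of the lower-level model only: a Gaussian process regression fitted to one batch's curve, here with scikit-learn and a generic RBF-plus-noise kernel. The higher-level allocation model over latent indicators described in the paper is not reproduced, and the curve is a synthetic placeholder.

```python
# Sketch: a per-batch Gaussian process curve fit (lower-level model only).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(14)
t = np.sort(rng.uniform(0, 10, 40))[:, None]      # functional input
y = np.sin(t).ravel() + 0.1 * rng.standard_normal(40)

gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(0.01),
                              normalize_y=True).fit(t, y)
grid = np.linspace(0, 10, 200)[:, None]
mean, sd = gp.predict(grid, return_std=True)      # curve fit with uncertainty
print(mean[:3].round(2), sd[:3].round(2))
```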

11.
Many kinds of directional data, such as wind directions, can be collected extremely easily, so experiments typically yield a huge number of sequentially collected data points. With such big data, traditional nonparametric techniques rapidly become very time-consuming to compute and are therefore useless in practice when real-time or online forecasts are expected. In this paper, we propose a recursive kernel density estimator for directional data that (i) can be updated extremely easily when a new set of observations is available and (ii) asymptotically keeps the nice features of the traditional kernel density estimator. Our methodology is based on Robbins–Monro stochastic approximation ideas. We show that our estimator outperforms the traditional techniques in terms of computational time while remaining extremely competitive in terms of efficiency with respect to its competitors in the sequential context considered here. We obtain expressions for its asymptotic bias and variance, together with an almost sure convergence rate and an asymptotic normality result. Our technique is illustrated on a wind dataset collected in Spain. A Monte Carlo study confirms the nice properties of our recursive estimator with respect to its non-recursive counterpart.
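
A minimal sketch of the recursive idea on the circle: the density estimate on a fixed grid is updated with a Robbins–Monro step each time a new direction arrives, so no pass over past data is needed. The von Mises kernel and the step-size and concentration schedules are illustrative choices, not the paper's exact specification.

```python
# Sketch: recursive circular kernel density estimate, updated online.
import numpy as np
from scipy.special import i0               # modified Bessel function I_0

grid = np.linspace(0, 2 * np.pi, 256, endpoint=False)
f = np.zeros_like(grid)                    # running density estimate

def vm_kernel(grid, x, kappa):
    """von Mises density on the grid, centred at the new observation x."""
    return np.exp(kappa * np.cos(grid - x)) / (2 * np.pi * i0(kappa))

rng = np.random.default_rng(6)
for n in range(1, 5001):                   # stream of wind-like directions
    x = rng.vonmises(np.pi / 3, 4.0)
    gamma = 1.0 / n                        # Robbins-Monro step size
    kappa = 2.0 * n ** 0.4                 # slowly increasing concentration
    f = (1 - gamma) * f + gamma * vm_kernel(grid, x, kappa)

print(grid[np.argmax(f)])                  # mode near pi/3 ~ 1.047
```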

12.
Among the diverse frameworks that have been proposed for regression analysis of angular data, the projected multivariate linear model provides a particularly appealing and tractable methodology. In this model, the observed directional responses are assumed to correspond to the angles formed by latent bivariate normal random vectors that depend on covariates through a linear model. This implies an angular normal distribution for the observed angles and incorporates a regression structure through a familiar and convenient relationship. In this paper we extend this methodology to accommodate clustered data (e.g., longitudinal or repeated measures data) by formulating a marginal version of the model and basing estimation on an EM-like algorithm in which correlation among within-cluster responses is taken into account by incorporating a working correlation matrix into the M step. A sandwich estimator is used for the covariance matrix of the parameter estimates. The methodology is motivated and illustrated using an example involving clustered measurements of microfibril angle on loblolly pine (Pinus taeda L.). Simulation studies evaluate the finite-sample properties of the proposed fitting method. In addition, the relationship between within-cluster correlation on the latent Euclidean vectors and the corresponding correlation structure for the observed angles is explored.
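
A sketch of the data-generating mechanism behind the projected linear model: a latent bivariate normal vector whose mean is linear in the covariates, with the observed response being the angle that vector forms. The coefficients below are hypothetical, and the EM-type fitting with a working correlation matrix is not reproduced here.

```python
# Sketch: simulating angles from a projected (bivariate normal) linear model.
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = rng.uniform(0, 1, n)                     # a single covariate
B = np.array([[1.0, 2.0],                    # latent mean: (1 + 2x, -0.5 + 1.5x)
              [-0.5, 1.5]])
mean = B @ np.vstack([np.ones(n), x])        # shape (2, n)
latent = mean + rng.standard_normal((2, n))  # unit-variance latent errors
theta = np.arctan2(latent[1], latent[0])     # observed angles in (-pi, pi]
print(theta[:5])
```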

13.
In this article, we study methods for two-sample hypothesis testing of high-dimensional data from a multivariate binary distribution. We examine the random projection method and apply an Edgeworth expansion to improve it. Additionally, we propose new statistics that are especially useful for sparse data. We compare the performance of these tests in various scenarios through simulations run in a parallel computing environment, and we apply them to the 20 Newsgroups data, showing that our proposed tests have considerably higher power than the others for differentiating groups of news articles with different topics.
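
A basic variant of the random projection test discussed above: both high-dimensional binary samples are projected onto a random direction, and a two-sample t-test is applied to the projections. The Edgeworth correction and the sparse-data statistics are not reproduced here; dimensions and success probabilities are illustrative.

```python
# Sketch: a one-direction random projection two-sample test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
d, n, m = 500, 80, 80
X = rng.binomial(1, 0.10, size=(n, d))       # group 1
Y = rng.binomial(1, 0.12, size=(m, d))       # group 2, slightly shifted

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                       # random unit direction
t, pval = stats.ttest_ind(X @ u, Y @ u, equal_var=False)
print(f"t = {t:.2f}, p = {pval:.3f}")
```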

14.
Conditional information measures the information in a sample about a parameter of interest in the presence of a nuisance parameter. In the context of Gaussian likelihoods, this paper first derives conditions under which a projection of the data may reduce the conditional information to zero. These conditions are then applied to time series regressions and to inference on a covariance parameter, such as with autoregressive or moving average errors. It is shown that regressing out very common regressors, such as a linear trend or a dummy variable, can imply that the conditional information is zero in the case of non-stationary autoregressions or non-invertible moving averages, respectively.

15.
Data on the timing of events such as births, residential moves and changes in employment status are collected in many longitudinal surveys. These data often have a highly complex structure, with events of several types occurring repeatedly over time for an individual and with interdependencies between different event processes (e.g. births and employment transitions). The aim of this paper is to review a general class of multilevel discrete-time event history models for handling recurrent events and transitions between multiple states. It is also shown how standard methods can be extended to allow for time-varying covariates that are outcomes of an event process jointly determined with the process of interest. The considerable potential of these methods for studying transitions through the life course is illustrated in analyses of the effect of the presence and age of children on women's employment transitions, using data from the British Household Panel Survey.
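
A minimal sketch of the discrete-time setup: each spell is expanded into one record per time interval with a binary event indicator, and the hazard is then fitted by logistic regression. The multilevel random effects and the joint modelling of interdependent processes described in the paper are beyond this illustration; the hazard coefficients below are hypothetical.

```python
# Sketch: person-period expansion and a discrete-time logistic hazard model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
rows = []
for i in range(300):                        # individuals
    child = rng.integers(0, 2)              # time-constant covariate
    for t in range(1, 11):                  # discrete time intervals
        h = 1 / (1 + np.exp(-(-2.0 + 0.1 * t - 0.8 * child)))
        event = rng.random() < h
        rows.append((t, child, int(event)))
        if event:
            break                           # spell ends at the event

data = np.array(rows, dtype=float)
X = sm.add_constant(data[:, :2])            # intercept, duration, covariate
res = sm.Logit(data[:, 2], X).fit(disp=0)
print(res.params)                           # log-odds effects on the hazard
```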

16.
17.
It is of essential importance that researchers have access to linked employer–employee data, but such data sets are rarely available to researchers or the public. Even when survey data have been made available, the evaluation of estimation methods is usually done via complex design-based simulation studies. For this purpose, population-level data are needed so that the true parameters are known and can be compared with estimates derived from complex samples, which are drawn from the population under various sampling designs, missing-value and outlier scenarios. The structural earnings statistics sample survey provides accurate and harmonized data on the level and structure of remuneration of employees, their individual characteristics, and the enterprise or place of employment to which they belong, in EU member states and candidate countries. On the basis of this data set, we show how to simulate a synthetic close-to-reality population representing the employer and employee structure of Austria. The proposed simulation builds on the work of A. Alfons, S. Kraft, M. Templ, and P. Filzmoser [On the simulation of complex universes in the case of applying the German microcensus, DACSEIS research paper series No. 4, University of Tübingen, 2003] and R. Münnich and J. Schürle [Simulation of close-to-reality population data for household surveys with application to EU-SILC, Statistical Methods & Applications 20(3) (2011), pp. 383–407]. However, new challenges arise in accommodating the special structure of employer–employee data and the complexity induced by the underlying two-stage design of the survey. Using quality measures in the form of simple summary statistics, benchmarking indicators and visualizations, the simulated population is analysed and evaluated. An accompanying literature study was carried out to select the most important benchmarking indicators.

18.
Multivariate Poisson regression with covariance structure
In recent years, applications of multivariate Poisson models have increased, mainly because of the gradual improvement in computer performance. The multivariate Poisson model used in practice is based on a common covariance term for all pairs of variables; this is rather restrictive and does not allow the covariance structure of the data to be modelled flexibly. In this paper we propose inference for a multivariate Poisson model with a richer structure, i.e. a different covariance term for each pair of variables. Both maximum likelihood and Bayesian estimation methods are proposed, based on a data augmentation scheme that reflects the multivariate reduction derivation of the joint probability function. To broaden the applicability of the model, we allow for covariates in the specification of both the mean and the covariance parameters. An extension to models with complete structure, with many multi-way covariance terms, is discussed. The method is demonstrated by analyzing a real-life data set.
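
A sketch of the multivariate-reduction construction underlying the model: each pair of counts shares a common latent Poisson term, which induces a pairwise covariance, and log-linear covariate effects enter the rates. The rates below are hypothetical, and the estimation machinery is not reproduced.

```python
# Sketch: bivariate Poisson counts via a shared latent Poisson term.
import numpy as np

rng = np.random.default_rng(9)
n = 1000
z = rng.uniform(0, 1, n)                      # a covariate
lam1 = np.exp(0.2 + 0.8 * z)                  # variable-specific rates
lam2 = np.exp(-0.1 + 0.5 * z)
lam12 = np.exp(-1.0 + 0.3 * z)                # pair-specific covariance term

y1, y2, y12 = (rng.poisson(l) for l in (lam1, lam2, lam12))
x1, x2 = y1 + y12, y2 + y12                   # observed correlated counts
print(np.cov(x1, x2)[0, 1])                   # positive covariance from y12
```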

19.
Summary: One specific problem statistical offices and research institutes face when releasing microdata is the preservation of confidentiality. Traditional methods of disclosure avoidance often destroy the structure of the data, and the information loss is potentially high. In this paper an alternative technique for creating scientific-use files is discussed, which reproduces the characteristics of the original data quite well. It is based on Fienberg (1994, 1997), who estimates and resamples from the empirical multivariate cumulative distribution function of the data in order to obtain synthetic data. The procedure creates data sets (the resample) that have the same characteristics as the original survey data. The paper includes applications of this method to (a) simulated data and (b) innovation survey data from the Mannheim Innovation Panel (MIP), as well as a comparison between resampling and a common method of disclosure control (disturbance with multiplicative error) with regard to confidentiality on the one hand and the suitability of the disturbed data for different kinds of analyses on the other. The results show that univariate distributions can be reproduced better by unweighted resampling. Parameter estimates can be reproduced quite well if the resampling procedure incorporates the correlation structure of the original data as a scale, or if the data are multiplicatively perturbed and a correction term is used. On average, anonymization of data with multiplicatively perturbed values protects better against re-identification than the various resampling methods used.
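
A sketch of the multiplicative-perturbation approach mentioned above, under assumptions of my own for illustration: a regressor is multiplied by independent noise with mean 1, and a correction factor, computable when the noise variance is published, undoes the attenuation of the regression slope. The noise level and the method-of-moments correction are illustrative, not the paper's exact procedure.

```python
# Sketch: multiplicative perturbation and a slope-attenuation correction.
import numpy as np

rng = np.random.default_rng(12)
n = 20000
x = rng.normal(10, 2, n)
y = 3.0 * x + rng.normal(0, 1, n)

s2 = 0.05                                   # published noise variance, E[e] = 1
e = rng.lognormal(-np.log(1 + s2) / 2, np.sqrt(np.log(1 + s2)), n)
xp = x * e                                  # anonymised (perturbed) variable

b_naive = np.cov(xp, y)[0, 1] / np.var(xp)  # attenuated slope on perturbed data
ex2 = np.mean(xp ** 2) / (1 + s2)           # recovers E[x^2] from E[xp^2]
b_corr = b_naive * np.var(xp) / (np.var(xp) - s2 * ex2)
print(round(b_naive, 2), round(b_corr, 2))  # attenuated vs. corrected (~3.0)
```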

20.
The posterior corneal curvature, like many other medical, environmental and ecological variables, is measured as an angle whose range is less than π. Such data are called axial or half-circular data, and their modeling has not received much attention from researchers. This paper proposes a new half-circular distribution based on an inverse stereographic projection of the Burr XII distribution. The maximum likelihood estimates of the parameters are obtained, and a simulation study is carried out to evaluate their performance. An application to the posterior corneal curvature of 23 patients shows that the proposed distribution fits the data well.
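
A sketch of the construction: Burr XII variates on (0, ∞) are mapped to half-circular angles on (0, π) via an inverse stereographic projection, θ = 2·arctan(x). The shape parameters are illustrative, and the paper's maximum likelihood fitting is not reproduced.

```python
# Sketch: half-circular angles from a Burr XII distribution by projection.
import numpy as np

def rburr12(size, c, k, rng):
    """Burr XII variates by inversion: F(x) = 1 - (1 + x^c)^(-k)."""
    u = rng.uniform(size=size)
    return ((1 - u) ** (-1.0 / k) - 1) ** (1.0 / c)

rng = np.random.default_rng(10)
x = rburr12(1000, c=2.0, k=3.0, rng=rng)
theta = 2 * np.arctan(x)                   # half-circular angles in (0, pi)
print(theta.min(), theta.max())            # all values lie inside (0, pi)
```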
