Similar Literature
20 similar documents were retrieved (search time: 31 ms).
1.
We describe inferactive data analysis, so named to denote an interactive approach to data analysis with an emphasis on inference after data analysis. Our approach is a compromise between Tukey's exploratory and confirmatory data analysis, while also allowing for Bayesian data analysis. We see this as a useful step toward concretely providing tools (with statistical guarantees) for current data scientists. The basis of inference we use is (a conditional approach to) selective inference, in particular its randomized form. The relevant reference distributions are constructed from what we call a DAG-DAG, a Data Analysis Generative DAG, and a selective change-of-variables formula is crucial to any practical implementation of inferactive data analysis via sampling from these distributions. We discuss a canonical example of an incomplete cross-validation test statistic to discriminate between black-box models, and a real HIV dataset example to illustrate inference after making multiple queries on the data.

2.
Clustering of Variables Around Latent Components
Abstract

Clustering of variables around latent components is investigated as a means to organize multivariate data into meaningful structures. The coverage includes (i) the case where it is desirable to lump together correlated variables regardless of whether the correlation coefficient is positive or negative; (ii) the case where negative correlation indicates strong disagreement among variables; (iii) an extension of the clustering techniques which makes it possible to explain the clustering of variables while taking account of external data. The strategy basically consists in performing a hierarchical cluster analysis, followed by a partitioning algorithm. Both algorithms aim at maximizing the same criterion, which reflects the extent to which the variables in each cluster are related to the latent variable associated with that cluster. Illustrations are outlined using real data sets from sensory studies.
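For reference, one common way to write the criterion mentioned above (the notation is ours, and the exact normalization may differ in the paper): with clusters G_1, ..., G_K and one unit-norm latent component c_k per cluster, the directional variant (i) maximizes

    T = \sum_{k=1}^{K} \sum_{j \in G_k} \operatorname{cov}^2(x_j, c_k), \qquad c_k^{\top} c_k = 1,

so that the optimal c_k is the first principal component of the variables in G_k; the signed variant (ii) replaces the squared covariance by the covariance (or correlation) itself, which pushes negatively correlated variables into different clusters.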

3.
This paper focuses on smoothed functional canonical correlation analysis (SFCCA) to investigate the relationships and changes in large, seasonal, and long-term data sets. The aim of this study is to introduce guidelines for SFCCA for functional data and to give some insight into the fine-tuning of the methodology for long-term periodic data. The guidelines are applied to temperature and humidity data for the 11 years between 2000 and 2010, and the results are interpreted. Seasonal changes or periodic shifts are studied visually through yearly comparisons. The effects of the number of basis functions and the selection of the smoothing parameter on the general variability structure and on the correlations between the curves are examined. It is concluded that the number of time points (knots), the number of basis functions, and the time span of evaluation (monthly, daily, etc.) should all be chosen harmoniously. Changing the smoothing parameter is found to have no significant effect on the structure of the curves and correlations, whereas the number of basis functions is the main factor affecting both the individual and the correlation weight functions.
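As background for the tuning choices reported above, the standard functional-data smoothing step (given here in its textbook form, which we assume is what the paper builds on) expands each curve in K basis functions and penalizes roughness:

    x(t) = \sum_{k=1}^{K} c_k \phi_k(t), \qquad \min_{c} \; \sum_{i} \bigl[ y_i - x(t_i) \bigr]^2 + \lambda \int \bigl[ x''(t) \bigr]^2 \, dt,

so the number of basis functions K and the smoothing parameter \lambda are exactly the two quantities whose influence on the canonical correlations is examined.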

4.
5.
Compositional data are characterized by values containing relative information, and thus the ratios between the data values are of interest for the analysis. Due to specific features of compositional data, standard statistical methods should be applied to compositions expressed in a proper coordinate system with respect to an orthonormal basis. It is discussed how three-way compositional data can be analyzed with the Parafac model. When data are contaminated by outliers, robust estimates for the Parafac model parameters should be employed. It is demonstrated how robust estimation can be done in the context of compositional data and how the results can be interpreted. A real data example from macroeconomics underlines the usefulness of this approach.
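For readers unfamiliar with the coordinate representation mentioned above, one standard choice of orthonormal coordinates for a D-part composition x = (x_1, ..., x_D) is the pivot (ilr) coordinates; the formula below is the textbook version and is offered only as an illustration of the kind of coordinates meant:

    z_i = \sqrt{\frac{D-i}{D-i+1}} \; \ln \frac{x_i}{\left( \prod_{j=i+1}^{D} x_j \right)^{1/(D-i)}}, \qquad i = 1, \dots, D-1.

Standard (robust) multiway methods such as Parafac can then be applied to the z-coordinates.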

6.
This article analyzes impulse response functions in the context of vector fractionally integrated time series. We derive analytically the restrictions required to identify the structural-form system. As an illustration of the recommended procedure, we carry out an empirical application based on a bivariate system including real output in the USA and, in turn, in one of the four Scandinavian countries (Denmark, Finland, Norway, and Sweden). The empirical results appear to be somewhat sensitive to the specification of the stochastic process driving the disturbances, but in general a positive shock to US output has a positive effect on the Scandinavian countries, an effect that tends to disappear in the long run.
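The long-run behaviour described above follows from the moving-average expansion of the fractional difference operator; in the simplest univariate case (stated here only as the standard building block, not the paper's bivariate system),

    (1-L)^{d} y_t = \varepsilon_t \;\Longrightarrow\; y_t = \sum_{j=0}^{\infty} \psi_j \, \varepsilon_{t-j}, \qquad \psi_j = \frac{\Gamma(j+d)}{\Gamma(j+1)\,\Gamma(d)} \sim \frac{j^{\,d-1}}{\Gamma(d)},

so for 0 < d < 1 the impulse responses decay hyperbolically rather than exponentially: shocks are persistent but eventually die out.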

7.
Ensemble methods using the same underlying algorithm trained on different subsets of observations have recently received increased attention as practical prediction tools for massive data sets. We propose Subsemble: a general subset ensemble prediction method, which can be used for small, moderate, or large data sets. Subsemble partitions the full data set into subsets of observations, fits a specified underlying algorithm on each subset, and uses a clever form of V-fold cross-validation to output a prediction function that combines the subset-specific fits. We give an oracle result that provides a theoretical performance guarantee for Subsemble. Through simulations, we demonstrate that Subsemble can be a beneficial tool for small- to moderate-sized data sets, and often has better prediction performance than the underlying algorithm fit just once on the full data set. We also describe how to include Subsemble as a candidate in a SuperLearner library, providing a practical way to evaluate the performance of Subsemble relative to the underlying algorithm fit just once on the full data set.
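A minimal sketch of the Subsemble idea, included for concreteness; the function names, the decision-tree base learner, and the non-negative least squares combiner below are our own choices, not the authors' reference implementation or their SuperLearner wrapper:

    import numpy as np
    from scipy.optimize import nnls
    from sklearn.base import clone
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeRegressor

    def subsemble_fit(X, y, base=None, J=3, V=5, seed=0):
        # Partition the rows into J subsets, fit `base` on each subset, and learn
        # convex combination weights from V-fold cross-validated predictions.
        base = base if base is not None else DecisionTreeRegressor(max_depth=3)
        rng = np.random.default_rng(seed)
        part = rng.integers(J, size=len(y))                  # random subset labels
        Z = np.zeros((len(y), J))                            # cross-validated predictions
        for train, test in KFold(V, shuffle=True, random_state=seed).split(X):
            for j in range(J):
                idx = train[part[train] == j]                # subset j minus the held-out fold
                if idx.size:
                    Z[test, j] = clone(base).fit(X[idx], y[idx]).predict(X[test])
        w, _ = nnls(Z, np.asarray(y, dtype=float))           # non-negative combiner weights
        w = w / w.sum() if w.sum() > 0 else np.full(J, 1.0 / J)
        fits = [clone(base).fit(X[part == j], y[part == j]) for j in range(J)]
        return fits, w

    def subsemble_predict(fits, w, X):
        return np.column_stack([f.predict(X) for f in fits]) @ w

Calling subsemble_fit on the full data and subsemble_predict on new rows reproduces the basic partition-fit-combine logic described in the abstract; the oracle guarantee and the SuperLearner comparison are, of course, specific to the paper.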

8.
Cluster analysis is an important technique of explorative data mining. It refers to a collection of statistical methods for learning the structure of data by solely exploring pairwise distances or similarities. Often meaningful structures are not detectable in these high-dimensional feature spaces. Relevant features can be obfuscated by noise from irrelevant measurements. These observations led to the design of subspace clustering algorithms, which can identify clusters that originate from different subsets of features. Hunting for clusters in arbitrary subspaces is intractable due to the infinite search space spanned by all feature combinations. In this work, we present a subspace clustering algorithm that can be applied for exhaustively screening all feature combinations of small- or medium-sized datasets (approximately 30 features). Based on a robustness analysis via subsampling we are able to identify a set of stable candidate subspace cluster solutions.
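To make the scale of "exhaustively screening all feature combinations" concrete (a back-of-the-envelope count of ours, restricted to axis-parallel subspaces): a data set with p features has

    2^{p} - 1 \;\text{non-empty feature subsets}, \qquad 2^{30} - 1 \approx 1.07 \times 10^{9},

which is roughly the limit of brute-force enumeration and explains why about 30 features is quoted as the practical ceiling.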

9.
Bayesian classification of Neolithic tools
The classification of Neolithic tools by using cluster analysis enables archaeologists to understand the function of the tools and the technological and cultural conditions of the societies that made them. In this paper, Bayesian classification is adopted to analyse data which raise the question of whether the observed variability, e.g. the shape and dimensions of the tools, is related to their use. The data present technical difficulties for the practitioner, such as the presence of mixed-mode data, missing data and errors in variables. These complications are overcome by employing a finite mixture model and Markov chain Monte Carlo methods. The analysis uses prior information which expresses the archaeologist's belief that there are two tool groups that are similar to contemporary adzes and axes. The resulting mixing densities provide evidence that the morphological and dimensional variability among tools is related to the existence of these two tool groups.

10.
We critically review the Better Life Index (BLI) recently introduced by the Organization for Economic Co-operation and Development (OECD). We discuss methodological issues in the definition of the criteria used to rank the countries, as well as in their aggregation method. Moreover, we explore the unique option offered by the BLI to apply one's own weight set to 11 criteria. Although 16 countries can be ranked first by choosing ad hoc weightings, only Canada, Australia and Sweden do so over a substantial fraction of the parameter space defined by all possible weight sets. Furthermore, most pairwise comparisons between countries are insensitive to the choice of the weights. Therefore, the BLI establishes a hierarchy among the evaluated countries, independent of the chosen set of weights.
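As a reminder of how the user-chosen weights enter (our generic rendering, included only to fix notation; the normalization is an assumption about the BLI's published methodology): each country c receives a composite score

    \mathrm{BLI}_c(w) = \sum_{k=1}^{11} w_k \, \tilde{I}_{ck}, \qquad w_k \ge 0, \quad \sum_{k=1}^{11} w_k = 1,

where \tilde{I}_{ck} is the normalized value of criterion k for country c; the finding above is that the induced country ranking is largely insensitive to the choice of w.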

11.
We first compare correspondence analysis, which uses chi-square distance, and an alternative approach using Hellinger distance, for representing categorical data in a contingency table. We propose a coefficient which globally measures the similarity between these two approaches. This coefficient can be decomposed into several components, one component for each principal dimension, indicating the contribution of the dimensions to the difference between the two representations. We also make comparisons with the logratio approach based on compositional data. These three methods of representation can produce quite similar results. Two illustrative examples are given.
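For concreteness, the two row-profile distances being compared are, in standard notation with correspondence matrix P = (p_{ij}), row masses r_i and column masses c_j:

    d_{\chi^2}^{2}(i,i') = \sum_j \frac{1}{c_j} \left( \frac{p_{ij}}{r_i} - \frac{p_{i'j}}{r_{i'}} \right)^{2}, \qquad d_{H}^{2}(i,i') = \sum_j \left( \sqrt{\frac{p_{ij}}{r_i}} - \sqrt{\frac{p_{i'j}}{r_{i'}}} \right)^{2};

the first underlies classical correspondence analysis and the second its Hellinger-distance alternative.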

12.
Scanner data are a new source of statistical data that offer great advantages for compiling the consumer price index (CPI). National statistical offices abroad have attached great importance to applied research on, and practical use of, this new data source from the outset. This paper reviews how the Netherlands, Norway, Switzerland, and Sweden currently use scanner data to compile the CPI. Taking cheese and beer as two elementary aggregates, elementary CPI indices are compiled with the Jevons and Törnqvist formulas and compared against the RYGEKS benchmark index. The analysis shows that the chained Jevons index exhibits a sizeable downward bias relative to the RYGEKS index, whereas the chained Törnqvist index differs from the RYGEKS index only slightly and in no clear direction. Whether the bias is upward or downward depends not only on the choice of formula but also on the characteristics of the products. Based on international experience and the results of this study, research directions and recommendations are proposed for applying scanner data to CPI compilation in China.
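The two elementary index formulas referred to above are standard; for matched items i = 1, ..., n with prices p_i^0, p_i^t and expenditure shares s_i^0, s_i^t,

    P_J^{0,t} = \prod_{i=1}^{n} \left( \frac{p_i^{t}}{p_i^{0}} \right)^{1/n}, \qquad P_T^{0,t} = \prod_{i=1}^{n} \left( \frac{p_i^{t}}{p_i^{0}} \right)^{(s_i^{0} + s_i^{t})/2},

that is, the Jevons index is an unweighted geometric mean of price relatives, while the Törnqvist index weights each relative by the average expenditure share; scanner data, which record quantities as well as prices, are precisely what make the Törnqvist formula feasible at the elementary level.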

13.
The forward search is a method of robust data analysis in which outlier free subsets of the data of increasing size are used in model fitting; the data are then ordered by closeness to the model. Here the forward search, with many random starts, is used to cluster multivariate data. These random starts lead to the diagnostic identification of tentative clusters. Application of the forward search to the proposed individual clusters leads to the establishment of cluster membership through the identification of non-cluster members as outlying. The method requires no prior information on the number of clusters and does not seek to classify all observations. These properties are illustrated by the analysis of 200 six-dimensional observations on Swiss banknotes. The importance of linked plots and brushing in elucidating data structures is illustrated. We also provide an automatic method for determining cluster centres and compare the behaviour of our method with model-based clustering. In a simulated example with eight clusters our method provides more stable and accurate solutions than model-based clustering. We consider the computational requirements of both procedures.
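A bare-bones sketch of a single forward-search run on multivariate data, written by us purely for illustration; the paper's procedure adds many random starts, linked plots and brushing, and a cluster-confirmation step:

    import numpy as np

    def forward_search(X, m0=None, seed=0):
        # One forward-search run: grow a subset one unit at a time, refitting the
        # mean and covariance on the current subset and monitoring the minimum
        # Mahalanobis distance of the units still outside it.
        n, p = X.shape
        m0 = m0 if m0 is not None else p + 1
        rng = np.random.default_rng(seed)
        subset = rng.choice(n, size=m0, replace=False)               # random starting subset
        min_dist = []
        for m in range(m0, n):
            mu = X[subset].mean(axis=0)
            S = np.cov(X[subset], rowvar=False) + 1e-8 * np.eye(p)   # small ridge for stability
            diff = X - mu
            d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
            outside = np.setdiff1d(np.arange(n), subset)
            min_dist.append(np.sqrt(d2[outside].min()))              # monitoring statistic
            subset = np.argsort(d2)[:m + 1]                          # keep the m+1 closest units
        return np.array(min_dist)

One run returns the trajectory of the minimum Mahalanobis distance of units outside the subset; sharp peaks in that trajectory are the kind of diagnostic signal used to identify tentative clusters.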

14.
Model-based clustering of Gaussian copulas for mixed data
Clustering of mixed data is important yet challenging due to a shortage of conventional distributions for such data. In this article, we propose a mixture model of Gaussian copulas for clustering mixed data. Indeed copulas, and Gaussian copulas in particular, are powerful tools for easily modeling the distribution of multivariate variables. This model clusters data sets with continuous, integer, and ordinal variables (all having a cumulative distribution function) by considering the intra-component dependencies in a similar way to the Gaussian mixture. Indeed, each component of the Gaussian copula mixture produces a correlation coefficient for each pair of variables and its univariate margins follow standard distributions (Gaussian, Poisson, and ordered multinomial) depending on the nature of the variable (continuous, integer, or ordinal). As an interesting by-product, this model generalizes many well-known approaches and provides tools for visualization based on its parameters. The Bayesian inference is achieved with a Metropolis-within-Gibbs sampler. The numerical experiments, on simulated and real data, illustrate the benefits of the proposed model: flexible and meaningful parameterization combined with visualization features.
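For orientation, the Gaussian copula density with correlation matrix \Gamma that each mixture component relies on is, in the continuous-margin case,

    c_{\Gamma}(u) = |\Gamma|^{-1/2} \exp\!\left( -\tfrac{1}{2} \, q^{\top} (\Gamma^{-1} - I) \, q \right), \qquad q = \bigl( \Phi^{-1}(u_1), \dots, \Phi^{-1}(u_d) \bigr)^{\top},

so that a purely continuous component density factorizes as c_{\Gamma_k}(F_{k1}(x_1), \dots, F_{kd}(x_d)) \prod_j f_{kj}(x_j). With the integer and ordinal margins considered in the paper, the component likelihood involves copula probabilities of intervals rather than this density, which is one reason a Metropolis-within-Gibbs sampler is used; the display above is only meant to fix ideas.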

15.
This article discusses the use of mixture models in the analysis of longitudinal partially ranked data, where respondents, for example, choose only the preferred and second preferred out of a set of items. To model such data we convert it to a set of paired comparisons. Covariates can be incorporated into the model. We use a nonparametric mixture to account for unmeasured variability in individuals over time. The resulting multi-valued mass points can be interpreted as latent classes of the items. The work is illustrated by two questions on (post)materialism in three sweeps of the British Household Panel Survey.

16.
Using recently developed statistical tools for analyzing cointegrated I(2) data, this article models money, income, prices, and interest rates in Denmark. The final model describes the dynamic adjustment to short-run changes of the process, to deviations from long-run steady states, and to several political interventions. It provides new insights into the effects of the liberalization of trade and capital in a small open European economy.

17.
Do divorcing couples become happier by breaking up?
Summary.  Divorce is a leap in the dark. The paper investigates whether people who split up actually become happier. Using the British Household Panel Survey, we can observe an individual's level of psychological well-being in the years before and after divorce. Our results show that divorcing couples reap psychological gains from the dissolution of their marriages. Men and women benefit equally. The paper also studies the effects of bereavement, of having dependent children and of remarriage. We measure well-being by using General Health Questionnaire and life satisfaction scores.

18.
Repeated measures data collected at random observation times are quite common in clinical studies and are often difficult to analyze. A Monte Carlo comparison of four analysis procedures with respect to significance level and power is presented. The basic procedures compared are successive difference analyses and three procedures using the data as summarized in the estimated quadratic polynomial regression coefficients for each subject. These three procedures are (1) Hotelling's T-square, (2) Multivariate Multisample Rank Sum Test (MMRST) and (3) Multivariate Multisample Median Test (MMMT).

For the variety of dispersion structures, sample sizes, and treatment groups simulated, the MMRST and the successive difference analysis were the most satisfactory.

19.
Summary.  A fundamental focus of Government concern is to enhance well-being. Recently, policy makers in the UK and elsewhere have recognized the importance of the community and society to the well-being of the nation as a whole. We explore the extent to which economic and social factors influence the psychological well-being of individuals and their perceptions of the social support that they receive, using Health Survey for England data. We employ a random-effects ordered probit modelling approach and find that unobserved intrahousehold characteristics help to explain the variation in our dependent variables, particularly for co-resident females. Our results indicate that individuals with acute and chronic physical illness, who are female, unemployed or inactive in the labour market and who live in poor households or areas of multiple deprivation report lower levels of psychological well-being. Reduced perceptions of social support are associated with being male, single or post marriage, from an ethnic minority, having low educational attainment and living in a poor household, but are not statistically related to area deprivation measures. These findings may help to inform the contemporary policy debate surrounding the promotion of individual well-being and community, through the alleviation of social exclusion.
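The modelling approach named above has the familiar generic form (our notation, not a claim about the authors' exact specification): for individual i in household h,

    y_{ih}^{*} = x_{ih}^{\top} \beta + \alpha_h + \varepsilon_{ih}, \qquad \varepsilon_{ih} \sim N(0,1), \quad \alpha_h \sim N(0, \sigma_\alpha^2), \qquad y_{ih} = k \iff \mu_{k-1} < y_{ih}^{*} \le \mu_k,

so the household-level random effect \alpha_h is what captures the unobserved intrahousehold characteristics that the abstract reports as explaining part of the variation.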

20.
The latent class model, or multivariate multinomial mixture, is a powerful approach for clustering categorical data. It uses a conditional independence assumption given the latent class to which a statistical unit belongs. In this paper, we exploit the fact that a fully Bayesian analysis with Jeffreys non-informative prior distributions involves no technical difficulty, and propose an exact expression of the integrated complete-data likelihood, which is known to be a meaningful model selection criterion from a clustering perspective. Similarly, a Monte Carlo approximation of the integrated observed-data likelihood can be obtained in two steps: an exact integration over the parameters is followed by an approximation of the sum over all possible partitions through an importance sampling strategy. The exact and the approximate criteria are then compared experimentally with their standard asymptotic BIC approximations for choosing the number of mixture components. Numerical experiments on simulated data and a biological example show that the asymptotic criteria are usually dramatically more conservative than the non-asymptotic criteria presented here, not only for moderate sample sizes as expected but also for quite large sample sizes. This research highlights that standard asymptotic criteria can often fail to select some interesting structures present in the data.
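To indicate what the "exact expression" amounts to, note that with g classes, d categorical variables (variable j having m_j levels) and Jeffreys Dirichlet(1/2) priors on all proportions, Dirichlet-multinomial conjugacy gives a closed form along the following lines (our notation; the paper's statement may differ in detail):

    p(\mathbf{x}, \mathbf{z}) = \frac{\Gamma(g/2)}{\Gamma(1/2)^{g}} \, \frac{\prod_{k=1}^{g} \Gamma(n_k + 1/2)}{\Gamma(n + g/2)} \; \prod_{j=1}^{d} \prod_{k=1}^{g} \left[ \frac{\Gamma(m_j/2)}{\Gamma(1/2)^{m_j}} \, \frac{\prod_{h=1}^{m_j} \Gamma(n_{jkh} + 1/2)}{\Gamma(n_k + m_j/2)} \right],

where n_k is the number of units assigned to class k and n_{jkh} the number of those with level h on variable j; no asymptotic approximation is involved, which is the point of the comparison with BIC-type criteria.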
