Similar Documents
20 similar documents found.
1.
In this work we study a way to explore and extract more information from data sets with a hierarchical tree structure. We propose that any statistical study on this type of data should be made by group, after clustering. In this sense, the most adequate approach is to use the Mahalanobis–Wasserstein distance as a measure of similarity between the cases, to carry out clustering or unsupervised classification. This methodology allows for the clustering of cases, as well as the identification of their profiles, based on the distribution of all the variables that characterise each subject associated with each case. An application to a set of teenagers' interviews regarding their communication habits is described. The interviewees answered several questions about the kind of contacts they had on their phone, Facebook, email or messenger, as well as the frequency of communication with them. The results indicate that the methodology is adequate for clustering this kind of data set, since it allows us to identify and characterise different profiles in the data. We compare the results obtained with this methodology with those obtained using the entire database, and conclude that the two may lead to different findings.
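A loose illustration of the idea of clustering cases by the distance between their value distributions. This is not the paper's Mahalanobis–Wasserstein metric; it is a simplified sketch using the one-dimensional Wasserstein distance from SciPy on synthetic "cases", followed by hierarchical clustering of the pairwise distance matrix:

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
# Hypothetical data: each "case" is a set of observations (e.g., one
# interviewee's answers), so cases are compared as distributions.
cases = [rng.normal(0, 1, 50) for _ in range(5)] + \
        [rng.normal(4, 1, 50) for _ in range(5)]

n = len(cases)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = wasserstein_distance(cases[i], cases[j])

# Cluster the cases on their pairwise Wasserstein distances.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Cases drawn from the same distribution end up in the same cluster, which is the behaviour the methodology relies on.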

2.
When attributes are subject to uncertainty, common methods are not applicable for data clustering. In recent years, much research has used fuzzy concepts to model this uncertainty. But when the attributes of data elements follow probability distributions, the uncertainty cannot be captured by fuzzy theory. In this article, a new concept is proposed for clustering elements whose attributes have predefined probability distributions, so that each observation is a member of a cluster with a specific probability. Two metaheuristic algorithms are applied to deal with the problem. The squared Euclidean distance is used to measure the similarity of data elements to cluster centers. A sensitivity analysis shows that the proposed approach converges to the results of the classical approaches as the variance of each point tends to zero. Moreover, numerical analysis confirms that the proposed approach is efficient in clustering probabilistic data.
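A minimal sketch of the expected squared Euclidean distance for probabilistic points, assuming isotropic distributions with known means and variances (the paper's metaheuristic search is not reproduced; the data below are placeholders). It illustrates the convergence claim: as each point's variance tends to zero, the probabilistic objective reduces to the classic k-means objective.

```python
import numpy as np

def expected_sq_dist(mu, var, center):
    # For X ~ (mu, var * I) in d dimensions:
    # E||X - c||^2 = ||mu - c||^2 + d * var
    return np.sum((mu - center) ** 2) + mu.size * var

def objective(mus, variances, centers, labels):
    # Total expected within-cluster scatter under a given assignment
    return sum(expected_sq_dist(m, v, centers[l])
               for m, v, l in zip(mus, variances, labels))

# Hypothetical probabilistic points: known means and isotropic variances
mus = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centers = np.array([[0.1, 0.05], [5.05, 4.95]])
labels = [0, 0, 1, 1]

# As the variance of each point tends to zero, the probabilistic
# objective converges to the classic (deterministic) objective.
for var in [1.0, 0.1, 0.0]:
    print(var, round(objective(mus, np.full(4, var), centers, labels), 4))
```

Each point contributes an extra `d * var` to the objective, so the gap between the probabilistic and classic values shrinks linearly in the variance.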

3.
The last decade has seen an explosion of work on the use of mixture models for clustering. The use of the Gaussian mixture model has been common practice, with constraints sometimes imposed upon the component covariance matrices to give families of mixture models. Similar approaches have also been applied, albeit with less fecundity, to classification and discriminant analysis. In this paper, we begin with an introduction to model-based clustering and a succinct account of the state-of-the-art. We then put forth a novel family of mixture models wherein each component is modeled using a multivariate t-distribution with an eigen-decomposed covariance structure. This family, which is largely a t-analogue of the well-known MCLUST family, is known as the tEIGEN family. The efficacy of this family for clustering, classification, and discriminant analysis is illustrated with both real and simulated data. The performance of this family is compared to its Gaussian counterpart on three real data sets.

4.
Spectral clustering uses eigenvectors of the Laplacian of the similarity matrix. It is well suited to binary clustering problems. When applied to multi-way clustering, either binary spectral clustering is applied recursively, or the points are embedded into the spectral space and then clustered by some other method, such as K-means. Here we propose and study a K-way clustering algorithm, the spectral modular transformation, based on the fact that the graph Laplacian has an equivalent representation with a diagonal modular structure. The method first transforms the original similarity matrix into a new one that is nearly disconnected and reveals the cluster structure clearly; we then apply a linearized cluster assignment algorithm to split the clusters. In this way, we can find some samples for each cluster recursively using a divide-and-conquer method. To get the overall clustering result, we use the cluster assignment obtained in the previous step as the initialization of a multiplicative update method for spectral clustering. Examples show that our method outperforms spectral clustering with other initializations.
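For reference, the standard "embed into spectral space, then run K-means" baseline that the abstract contrasts with its modular transformation can be sketched as follows (synthetic two-blob data, Gaussian similarity, symmetric normalized Laplacian; the bandwidth and blob parameters are arbitrary choices):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two hypothetical well-separated Gaussian blobs
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])

# Gaussian similarity matrix and symmetric normalized Laplacian
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
W = np.exp(-d2 / (2 * 0.5 ** 2))
Dm = np.diag(W.sum(1) ** -0.5)
L = np.eye(len(X)) - Dm @ W @ Dm

# Embed into the eigenvectors of the two smallest eigenvalues,
# then cluster the embedded points with K-means.
vals, vecs = np.linalg.eigh(L)
emb = vecs[:, :2]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
print(labels)
```

Because the similarity graph is nearly disconnected between the blobs, the leading eigenvectors separate them almost exactly, which is the same near-disconnected structure the proposed transformation aims to expose.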

5.
We study the correlation structure for a mixture of ordinal and continuous repeated measures using a Bayesian approach. We assume a multivariate probit model for the ordinal variables and a normal linear regression for the continuous variables, where latent normal variables underlying the ordinal data are correlated with continuous variables in the model. Due to the probit model assumption, we are required to sample a covariance matrix with some of the diagonal elements equal to one. The key computational idea is to use parameter-extended data augmentation, which involves applying the Metropolis-Hastings algorithm to get a sample from the posterior distribution of the covariance matrix incorporating the relevant restrictions. The methodology is illustrated through a simulated example and through an application to data from the UCLA Brain Injury Research Center.

6.
Functional data analysis (FDA), the analysis of data that can be considered a set of observed continuous functions, is an increasingly common class of statistical analysis. One of the most widely used FDA methods is cluster analysis of functional data; however, little work has been done to compare the performance of clustering methods on functional data. In this article, a simulation study compares the performance of four major hierarchical methods for clustering functional data. The simulated data varied in three ways: the nature of the signal functions (periodic, nonperiodic, or mixed), the amount of noise added to the signal functions, and the pattern of the true cluster sizes. The Rand index was used to compare the performance of each clustering method. As a secondary goal, the clustering methods were also compared when the number of clusters is misspecified. To illustrate the results, a real set of functional data, for which the true clustering structure is believed to be known, was clustered. The results for the real data set confirmed the findings of the simulation. This study yields concrete suggestions to help future researchers determine the best method for clustering their functional data.
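The comparison design described above can be sketched in a few lines: simulate curves from known groups, cluster them with several hierarchical linkage methods, and score each method against the truth with a Rand-type index (here the adjusted Rand index; the curve shapes and noise level are illustrative placeholders, not the paper's simulation settings):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 50)
# Hypothetical functional data: noisy periodic vs. noisy linear curves
curves = np.vstack(
    [np.sin(2 * np.pi * t) + rng.normal(0, 0.2, 50) for _ in range(10)] +
    [2 * t + rng.normal(0, 0.2, 50) for _ in range(10)])
truth = np.repeat([0, 1], 10)

# Compare hierarchical linkage methods via the adjusted Rand index
for method in ["single", "complete", "average", "ward"]:
    labels = fcluster(linkage(curves, method=method), 2, criterion="maxclust")
    print(method, adjusted_rand_score(truth, labels))
```

Varying the signal shapes, the noise standard deviation, and the group sizes in this loop reproduces the three factors the simulation study manipulates.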

7.
The marginalized frailty model is often used for the analysis of correlated times in survival data. When only two correlated times are analyzed, this model is often referred to as the Clayton–Oakes model [7,22]. With time-to-event data, there may exist multiple end points (competing risks) suggesting that an analysis focusing on all available outcomes is of interest. The purpose of this work is to extend the single risk marginalized frailty model to the multiple risk setting via cause-specific hazards (CSH). The methods herein make use of the marginalized frailty model described by Pipper and Martinussen [24]. As such, this work uses the martingale theory to develop a likelihood based on estimating equations and observed histories. The proposed multivariate CSH model yields marginal regression parameter estimates while accommodating the clustering of outcomes. The multivariate CSH model can be fitted using a data augmentation algorithm described by Lunn and McNeil [21] or by fitting a series of single risk models for each of the competing risks. An example of the application of the multivariate CSH model is provided through the analysis of a family-based follow-up study of breast cancer with death in absence of breast cancer as a competing risk.

8.
Clustering gene expression time course data is an important problem in bioinformatics because understanding which genes behave similarly can lead to the discovery of important biological information. Statistically, the problem of clustering time course data is a special case of the more general problem of clustering longitudinal data. In this paper, a very general and flexible model-based technique is used to cluster longitudinal data. Mixtures of multivariate t-distributions are utilized, with a linear model for the mean and a modified Cholesky-decomposed covariance structure. Constraints are placed upon the covariance structure, leading to a novel family of mixture models, including parsimonious models. In addition to model-based clustering, these models are also used for model-based classification, i.e., semi-supervised clustering. Parameters, including the component degrees of freedom, are estimated using an expectation-maximization algorithm and two different approaches to model selection are considered. The models are applied to simulated data to illustrate their efficacy; this includes a comparison with their Gaussian analogues—the use of these Gaussian analogues with a linear model for the mean is novel in itself. Our family of multivariate t mixture models is then applied to two real gene expression time course data sets and the results are discussed. We conclude with a summary, suggestions for future work, and a discussion about constraining the degrees of freedom parameter.

9.
Multi-level models can be used to account for clustering in data from multi-stage surveys. In some cases, the intraclass correlation may be close to zero, so that it may seem reasonable to ignore clustering and fit a single-level model. This article proposes several adaptive strategies for allowing for clustering in regression analysis of multi-stage survey data. The approach is based on testing whether the PSU-level variance component is zero. If this hypothesis is retained, then variance estimates are calculated ignoring clustering; otherwise, clustering is reflected in variance estimation. A simple simulation study is used to evaluate the various procedures.

10.
We consider a non-centered parameterization of the standard random-effects model, which is based on the Cholesky decomposition of the variance-covariance matrix. The regression type structure of the non-centered parameterization allows us to use Bayesian variable selection methods for covariance selection. We search for a parsimonious variance-covariance matrix by identifying the non-zero elements of the Cholesky factors. With this method we are able to learn from the data for each effect whether it is random or not, and whether covariances among random effects are zero. An application in marketing shows a substantial reduction of the number of free elements in the variance-covariance matrix.

11.
Various methods to control the influence of a covariate on a response variable are compared. These methods are ANOVA with or without homogeneity of variances (HOV) of errors and Kruskal–Wallis (K–W) tests on (covariate-adjusted) residuals, and analysis of covariance (ANCOVA). Covariate-adjusted residuals are obtained from the overall regression line fit to the entire data set, ignoring the treatment levels or factors. It is demonstrated that the methods on covariate-adjusted residuals are only appropriate when the regression lines are parallel and covariate means are equal for all treatments. Empirical size and power performance of the methods are compared by extensive Monte Carlo simulations. We manipulated conditions such as the assumptions of normality and HOV, sample size, and clustering of the covariates. The parametric methods on residuals and ANCOVA exhibited similar size and power when error terms have symmetric distributions with variances having the same functional form for each treatment, and covariates have uniform distributions within the same interval for each treatment. In such cases, the parametric tests have higher power than the K–W test on residuals. When error terms have asymmetric distributions, or have heterogeneous variances with different functional forms for each treatment, the tests are liberal, with the K–W test having higher power than the others. The methods on covariate-adjusted residuals are severely affected by clustering of the covariates relative to the treatment factors when covariate means are very different across treatments. For clustered covariates, the ANCOVA method exhibits the appropriate size. However, such clustering might suggest dependence between the covariates and the treatment factors, which makes ANCOVA less reliable as well.
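The covariate-adjusted-residuals procedure can be sketched directly: fit one overall regression ignoring treatments, then test the residuals across groups. The example below uses SciPy's Kruskal–Wallis test on synthetic data deliberately built to satisfy the conditions the abstract identifies as necessary (parallel regression lines, identically distributed covariates across treatments):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical data: 3 treatments, common covariate slope, shifted means
x = rng.uniform(0, 10, 90)
g = np.repeat([0, 1, 2], 30)
y = 2.0 * x + np.array([0.0, 1.5, 3.0])[g] + rng.normal(0, 1, 90)

# Overall regression line fit to the entire data set, ignoring treatments
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# Kruskal-Wallis test on the covariate-adjusted residuals
H, p = stats.kruskal(resid[g == 0], resid[g == 1], resid[g == 2])
print(round(H, 2), p)
```

If the covariates were instead clustered by treatment (different covariate means per group), the pooled line would absorb part of the treatment effect into the slope, which is the failure mode the simulations document.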

12.
Gao Haiyan et al., 《统计研究》 (Statistical Research), 2020, 37(8): 91–103
Functional clustering algorithms involve two basic elements: projection and clustering. In general, the optimal projection does not necessarily preserve class information effectively, which degrades the subsequent clustering. This paper reviews the components and workflow of functional clustering and, exploiting the clustering property of non-negative matrix factorization (NMF), proposes an NMF-based functional clustering algorithm. It builds a framework in which projection and clustering are carried out in parallel, solves the model by alternating iterative updates, and analyses the algorithm's computational time complexity. Validation on simulated data and on a speech-recognition data set shows that the proposed functional clustering algorithm improves clustering performance. An application to hourly nitrogen dioxide (NO2) concentration data for Beijing shows that, when distinguishing types of air-quality monitoring stations, the algorithm fully identifies the spatial pattern of the station layout, demonstrating good practical value.
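The clustering property of NMF that this line of work exploits can be shown in a minimal sketch: factor the (non-negative) curve matrix as X ≈ WH and read each curve's cluster off the dominant coefficient in its row of W, so projection and assignment come from one factorization. This uses scikit-learn's generic NMF on synthetic curves, not the paper's alternating-update algorithm:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 40)
# Hypothetical non-negative curves from two groups (NMF requires X >= 0)
X = np.vstack(
    [np.exp(-(t - 0.25) ** 2 / 0.01) + rng.uniform(0, 0.05, 40) for _ in range(8)] +
    [np.exp(-(t - 0.75) ** 2 / 0.01) + rng.uniform(0, 0.05, 40) for _ in range(8)])

# Factor X ~ W H; the dominant basis coefficient in each row of W
# serves as that curve's cluster assignment.
W = NMF(n_components=2, init="nndsvda", max_iter=500,
        random_state=0).fit_transform(X)
labels = W.argmax(axis=1)
print(labels)
```

Here the two basis functions recovered in H align with the two bump shapes, so the argmax over W separates the groups without a separate clustering step.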

13.
Shi, Wang, Murray-Smith and Titterington (Biometrics 63:714–723, 2007) proposed a Gaussian process functional regression (GPFR) model to model functional response curves with a set of functional covariates. Two main problems are addressed by their method: modelling a nonlinear and nonparametric regression relationship, and modelling the covariance structure and mean structure simultaneously. The method gives very good results for curve fitting and prediction but side-steps the problem of heterogeneity. In this paper we present a new method for modelling 'spatially' indexed functional data, i.e., data whose heterogeneity depends on factors such as region and individual patients' information. For data collected from different sources, we assume that the data corresponding to each curve (or batch) follow a Gaussian process functional regression model as a lower-level model, and introduce an allocation model for the latent indicator variables as a higher-level model. This higher-level model depends on the information related to each batch. The method takes advantage of both GPFR and mixture models and therefore improves the accuracy of predictions. The mixture model has also been used for curve clustering, but here the focus is on clustering functional relationships between the response curve and the covariates, i.e. the clustering is based on the shape of the functional response against the set of functional covariates. The model is examined on simulated data and real data.

14.
In clustering problems where the variables are correlated, the traditional remedies are mainly to use the Mahalanobis distance or principal-component clustering, but these have poor operability or interpretability. We therefore propose a class of model-based clustering methods: first model the correlation structure among the variables (as auxiliary information), and then perform the cluster analysis. The main advantages of this approach are: it has a wider scope of application, handling not only (linear) correlation but also clustering of data generated by other complex structures among the variables; the importance of each variable is reflected in the model's regression coefficients; and it is more robust and more practical than the Mahalanobis distance, easier to interpret than principal-component clustering, and algorithmically simpler and more efficient.
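A loose two-step sketch of the "model the correlation first, then cluster" idea, with entirely synthetic data and ordinary least squares standing in for whatever correlation model the method would actually use: the between-variable relationship is fitted as auxiliary information, and the clustering is then run on what that relationship does not explain.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
# Hypothetical data: x2 depends linearly on x1 (correlated variables),
# and two latent groups differ only in the level of that relationship
x1 = rng.normal(0, 1, 100)
shift = np.repeat([0.0, 3.0], 50)
x2 = 1.5 * x1 + shift + rng.normal(0, 0.3, 100)

# Step 1: model the correlation structure between the variables
reg = LinearRegression().fit(x1.reshape(-1, 1), x2)
resid = x2 - reg.predict(x1.reshape(-1, 1))

# Step 2: cluster on the residuals, i.e. on the structure that the
# correlation model does not explain
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    resid.reshape(-1, 1))
print(labels)
```

A plain Euclidean clustering of (x1, x2) would be dominated by the shared linear trend; removing the modelled correlation first exposes the group structure.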

15.
Multivariate failure time data are commonly encountered in biomedical research, either because each study subject may experience multiple events or because subjects are clustered, so that failure times within the same cluster are correlated. In this article, we use the frailty approach to capture the correlation among the survival variables and assume that each event time is observed only as a discrete analogue, i.e., within an interval between periodic clinical examinations. For estimation, an Expectation–Maximization (EM) algorithm is developed and applied to the diabetic retinopathy study (DRS).

16.
We present a scalable Bayesian modelling approach for identifying brain regions that respond to a certain stimulus and use them to classify subjects. More specifically, we deal with multi‐subject electroencephalography (EEG) data with a binary response distinguishing between alcoholic and control groups. The covariates are matrix‐variate with measurements taken from each subject at different locations across multiple time points. EEG data have a complex structure with both spatial and temporal attributes. We use a divide‐and‐conquer strategy and build separate local models, that is, one model at each time point. We employ Bayesian variable selection approaches using a structured continuous spike‐and‐slab prior to identify the locations that respond to a certain stimulus. We incorporate the spatio‐temporal structure through a Kronecker product of the spatial and temporal correlation matrices. We develop a highly scalable estimation algorithm, using likelihood approximation, to deal with large number of parameters in the model. Variable selection is done via clustering of the locations based on their duration of activation. We use scoring rules to evaluate the prediction performance. Simulation studies demonstrate the efficiency of our scalable algorithm in terms of estimation and fast computation. We present results using our scalable approach on a case study of multi‐subject EEG data.

17.
In the framework of cluster analysis based on Gaussian mixture models, it is usually assumed that all the variables provide information about the clustering of the sample units. Several variable selection procedures are available in order to detect the structure of interest for the clustering when this structure is contained in a variable sub-vector. Currently, in these procedures a variable is assumed to play one of (up to) three roles: (1) informative, (2) uninformative and correlated with some informative variables, (3) uninformative and uncorrelated with any informative variable. A more general approach for modelling the role of a variable is proposed by taking into account the possibility that the variable vector provides information about more than one structure of interest for the clustering. This approach is developed by assuming that such information is given by non-overlapped and possibly correlated sub-vectors of variables; it is also assumed that the model for the variable vector is equal to a product of conditionally independent Gaussian mixture models (one for each variable sub-vector). Details about model identifiability, parameter estimation and model selection are provided. The usefulness and effectiveness of the described methodology are illustrated using simulated and real datasets.

18.
Currently, extremely large-scale genetic data present significant challenges for cluster analysis. Most existing clustering methods are typically built on the Euclidean distance and geared toward analyzing continuous responses. They work well for clustering, e.g., microarray gene expression data, but often perform poorly for clustering, e.g., large-scale single nucleotide polymorphism (SNP) data. In this paper, we study the penalized latent class model for clustering extremely large-scale discrete data. The penalized latent class model takes into account the discrete nature of the response using appropriate generalized linear models and adopts the lasso penalized likelihood approach for simultaneous model estimation and selection of important covariates. We develop very efficient numerical algorithms for model estimation based on the iterative coordinate descent approach, and further develop the expectation–maximization algorithm to incorporate and model missing values. We use simulation studies and applications to the international HapMap SNP data to illustrate the competitive performance of the penalized latent class model.

19.
In the health and social sciences, researchers often encounter categorical data for which complexities come from a nested hierarchy and/or cross-classification for the sampling structure. A common feature of these studies is a non-standard data structure with repeated measurements which may have some degree of clustering. In this paper, methodology is presented for the joint estimation of quantities of interest in the context of a stratified two-stage sample with bivariate dichotomous data. These quantities are the mean value π of an observed dichotomous response for a certain condition or time-point and a set of correlation coefficients for intra-cluster association for each condition or time period and for inter-condition correlation within and among clusters. The methodology uses the cluster means and pairwise joint probability parameters from each cluster. They together provide appropriate information across clusters for the estimation of the correlation coefficients.

20.
This article proposes a dynamic framework for modeling and forecasting realized covariance matrices using vine copulas to allow for more flexible dependencies between assets. Our model automatically guarantees positive definiteness of the forecast through the use of a Cholesky decomposition of the realized covariance matrix. We explicitly account for long-memory behavior by using fractionally integrated autoregressive moving average (ARFIMA) and heterogeneous autoregressive (HAR) models for the individual elements of the decomposition. Furthermore, our model incorporates non-Gaussian innovations and GARCH effects, accounting for volatility clustering and unconditional kurtosis. The dependence structure between assets is studied using vine copula constructions, which allow for nonlinearity and asymmetry without suffering from inflexible tail behavior or symmetry restrictions as in conventional multivariate models. Further, the copulas have a direct impact on the point forecasts of the realized covariance matrices, since these are computed as a nonlinear transformation of the forecasts for the Cholesky matrix. Besides studying in-sample properties, we assess the usefulness of our method in a one-day-ahead forecasting framework, comparing recent types of models for the realized covariance matrix based on a model confidence set approach. Additionally, we find that in Value-at-Risk (VaR) forecasting, vine models lead to lower capital requirements due to smoother and more accurate forecasts.
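The positive-definiteness guarantee is purely structural and easy to see in a minimal NumPy sketch: forecast the elements of the Cholesky factor (here a placeholder shrinkage "forecast"; the paper uses ARFIMA/HAR dynamics per element, and the matrix values below are made up) and reconstruct the covariance matrix from the factor.

```python
import numpy as np

# A (hypothetical) realized covariance matrix and its Cholesky factor
S = np.array([[0.040, 0.012],
              [0.012, 0.090]])
L = np.linalg.cholesky(S)

# Placeholder forecast of the Cholesky elements (stand-in for the
# per-element ARFIMA/HAR models of the decomposition)
L_hat = 0.95 * L

# Reconstructing S_hat = L_hat L_hat' is symmetric and positive
# definite by construction, whatever values the element forecasts take.
S_hat = L_hat @ L_hat.T
print(np.linalg.eigvalsh(S_hat))
```

Forecasting the raw covariance entries directly would offer no such guarantee; the Cholesky parameterization moves the constraint into the reconstruction step.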


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号