Similar literature
20 similar documents found (search time: 31 ms)
1.
Abstract

We propose a statistical method for clustering multivariate longitudinal data into homogeneous groups. This method relies on a time-varying extension of the classical K-means algorithm, where a multivariate vector autoregressive model is additionally assumed for modeling the evolution of clusters' centroids over time. Model inference is based on a least-squares method and on a coordinate descent algorithm. To illustrate our work, we consider a longitudinal dataset on human development. Three variables are modeled, namely life expectancy, education and gross domestic product.

2.
Clustering algorithms are now important tools in microarray data analysis. The clustering algorithms generally used in biological science do not take into consideration the underlying probability distribution of the data; in this sense, they are heuristic in nature. In this work we propose a clustering algorithm based on the EM algorithm. It gives 28% less misclassification than the K-means algorithm (which is the one most widely used in bioscience). We also show on a real data set that this algorithm can be used efficiently to detect the genes responsible for a particular disease.
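The abstract above does not state which probability model the EM algorithm is applied to. As a rough illustration only, the sketch below assumes a two-component one-dimensional Gaussian mixture; the function name `em_two_gaussians`, the initialisation and the variance floor are all hypothetical choices, not the authors' method.

```python
import math

def em_two_gaussians(data, iters=100, var_floor=1e-6):
    """Fit a two-component 1-D Gaussian mixture by EM; return (means, labels)."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    # Assumed initialisation: extremes as means, pooled variance, equal weights.
    means = [min(data), max(data)]
    variances = [max(overall_var, var_floor)] * 2
    weights = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            p = [weights[k] * pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300  # guard against underflow to zero
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk,
                var_floor,
            )
    # Hard assignment: each point goes to its most probable component.
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, labels
```

Unlike K-means, which assigns each point to the nearest centroid, this procedure keeps soft responsibilities during fitting and only makes a hard assignment at the end.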

3.
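The sparsity and infinite-dimensional nature of functional data cause traditional cluster analysis to fail. To address this problem, this paper defines the concept and scope of functional data and, on that basis, proposes an adaptive, iteratively updated cluster analysis. First, the transition from the infinite-dimensional function space to a finite-dimensional multivariate space is achieved using the parameter information of the data. On this basis, an adaptively weighted clustering statistic is constructed according to differences in the information content of the variables, and it is used as the similarity measure of the functional data to produce an initial partition into classes. Further, under a given threshold, the initial class membership of every function is adaptively and iteratively updated, and the converged optimized result is taken as the final partition. Random simulation and empirical tests show that, compared with existing functional clustering methods of the same kind, the proposed method achieves significantly higher classification accuracy, which reflects the relative superiority of the new method and its effectiveness in practical applications.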

4.
Interpretation of principal components is difficult due to their weights (loadings, coefficients) being of various sizes. Whereas very small weights or very large weights can give clear indication of the importance of particular variables, weights that are neither large nor small (‘grey area’ weights) are problematical. This is a particular problem in the fast-moving goods industries where a lot of multivariate panel data are collected on products. These panel data are subjected to univariate analyses and multivariate analyses where principal components (PCs) are key to the interpretation of the data. Several authors have suggested alternatives to PCs, seeking simplified components such as sparse PCs. Here components, termed simple components (SCs), are sought in conjunction with Thurstonian criteria that a component should have only a few variables highly weighted on it and each variable should be weighted heavily on just a few components. An algorithm is presented that finds SCs efficiently. Simple components are found for panel data consisting of the responses to a questionnaire on efficacy and other features of deodorants. It is shown that five SCs can explain an amount of variation within the data comparable to that explained by the PCs, but with easier interpretation.

5.
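In research on cluster analysis methods for panel data, the Euclidean distance function is improved on the basis that panel data have both a cross-sectional dimension and a time dimension. Indicator weights and time weights are taken into account in the clustering process, and a "weighted distance function" suitable for cluster analysis of panel data is proposed together with the corresponding Ward.D clustering method. First, Euclidean distance functions are defined that consider the absolute values of the indicators, the growth rates between adjacent time points, and the degree of fluctuation and variation. Then, the indicator weights and time weights are aggregated into a comprehensive weighted distance through a linear model, which finally yields the weighted clustering procedure for panel data. The empirical results show that the weighted cluster analysis method for panel data that considers indicator weights and time weights has better discriminating power and can improve the accuracy of sample clustering.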
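The abstract above names three ingredients (levels, growth rates between adjacent time points, and variability) and two sets of weights, but gives no formulas. The sketch below is one possible reading under stated assumptions: variability is measured by the coefficient of variation, all values are non-zero, and the three parts are combined linearly with hypothetical mixing coefficients `mix`. The function names are illustrative and do not come from the paper.

```python
import math

def _growth(series):
    # Growth rate between adjacent time points; assumes non-zero values.
    return [[(cur - prev) / prev for prev, cur in zip(series[t - 1], series[t])]
            for t in range(1, len(series))]

def _coef_var(series):
    # Coefficient of variation of each indicator over time; assumes non-zero means.
    n = len(series)
    out = []
    for k in range(len(series[0])):
        col = [row[k] for row in series]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        out.append(std / mean)
    return out

def _weighted_l2(a, b, index_w, time_w):
    # Euclidean distance with indicator weights inside each time point
    # and time weights across time points.
    return math.sqrt(sum(
        tw * sum(w * (x - y) ** 2 for w, x, y in zip(index_w, ra, rb))
        for tw, ra, rb in zip(time_w, a, b)))

def panel_distance(a, b, index_w, time_w, mix=(1 / 3, 1 / 3, 1 / 3)):
    """a, b: lists of rows, one row of indicator values per time point."""
    level = _weighted_l2(a, b, index_w, time_w)
    growth = _weighted_l2(_growth(a), _growth(b), index_w, time_w[1:])
    spread = _weighted_l2([_coef_var(a)], [_coef_var(b)], index_w, [1.0])
    # Hypothetical linear aggregation of the three partial distances.
    return mix[0] * level + mix[1] * growth + mix[2] * spread
```

A matrix of such pairwise distances could then be passed to any hierarchical clustering routine that accepts precomputed dissimilarities.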

6.
In this paper, we propose the MulticlusterKDE algorithm for classifying elements of a database into categories based on their similarity. MulticlusterKDE is centered on the multiple optimization of the kernel density estimator function with a multivariate Gaussian kernel. One of the main features of the proposed algorithm is that the number of clusters is an optional input parameter. Furthermore, it is very simple, easy to implement and well defined, stops after a finite number of steps, and always converges regardless of the data set. We illustrate our findings by implementing the algorithm in R software. The results indicate that the MulticlusterKDE algorithm is competitive when compared to the K-means, K-medoids, CLARA, DBSCAN and PdfCluster algorithms. Features such as simplicity and efficiency make the proposed algorithm an attractive and promising subject of research that can be used as a basis for its improvement and for the development of new density-based clustering algorithms.

7.
This paper addresses the problem of identifying groups that satisfy specific conditions on the means of feature variables. In this study, we refer to the identified groups as “target clusters” (TCs). To identify TCs, we propose a method based on the normal mixture model (NMM) restricted by a linear combination of means. We provide an expectation–maximization (EM) algorithm to fit the restricted NMM by using the maximum-likelihood method. The convergence property of the EM algorithm and a reasonable set of initial estimates are presented. We demonstrate the method's usefulness and validity through a simulation study and two well-known data sets. The proposed method provides several types of useful clusters, which would be difficult to achieve with conventional clustering or exploratory data analysis methods based on the ordinary NMM. A simple comparison with another target clustering approach shows that the proposed method is promising for identifying such clusters.

8.
Spectral clustering uses eigenvectors of the Laplacian of the similarity matrix. It is convenient for solving binary clustering problems. When applied to multi-way clustering, either binary spectral clustering is applied recursively, or the points are embedded in the spectral space and clustered with some other method, such as K-means. Here we propose and study a K-way clustering algorithm, spectral modular transformation, based on the fact that the graph Laplacian has an equivalent representation with a diagonal modular structure. The method first transforms the original similarity matrix into a new one, which is nearly disconnected and reveals a cluster structure clearly, and then applies a linearized cluster assignment algorithm to split the clusters. In this way, we can find some samples for each cluster recursively using the divide-and-conquer method. To get the overall clustering results, we use the cluster assignment obtained in the previous step as the initialization of the multiplicative update method for spectral clustering. Examples show that our method outperforms spectral clustering using other initializations.
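To make the binary case mentioned above concrete, the following sketch splits a weighted graph by the sign of the Fiedler vector (the eigenvector of the second-smallest eigenvalue of the unnormalised Laplacian L = D - W). It illustrates standard binary spectral clustering only, not the spectral modular transformation proposed in the abstract. The function name, the start vector and the use of shifted power iteration in place of a full eigensolver are assumptions made to keep the example dependency-free.

```python
def fiedler_bipartition(W, iters=500):
    """Split a weighted graph in two using the sign of the Fiedler vector."""
    n = len(W)
    degree = [sum(row) for row in W]
    shift = 2.0 * max(degree)  # upper bound on the Laplacian's eigenvalues

    def center_and_normalize(v):
        mean = sum(v) / n
        v = [x - mean for x in v]  # project out the constant eigenvector
        norm = sum(x * x for x in v) ** 0.5
        return [x / norm for x in v]

    # Assumed start vector; it must not be orthogonal to the Fiedler vector.
    v = center_and_normalize([float(i) for i in range(n)])
    for _ in range(iters):
        # Multiply by (shift*I - L), where L = D - W is the graph Laplacian.
        Lv = [degree[i] * v[i] - sum(W[i][j] * v[j] for j in range(n))
              for i in range(n)]
        v = center_and_normalize([shift * v[i] - Lv[i] for i in range(n)])
    # Nodes sharing the sign of node 0 get label 0, the rest label 1.
    first = v[0] >= 0
    return [0 if (x >= 0) == first else 1 for x in v]
```

For real similarity matrices a library eigensolver would normally replace the power iteration; the sign-based split is the part that corresponds to the binary clustering step.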

9.
Clustering algorithms are important methods widely used in mining data streams because of their ability to deal with infinite data flows. Although these algorithms perform well at mining latent relationships in data streams, most of them suffer from loss of cluster purity and become unstable when the input data streams have too many noisy variables. In this article, we propose a clustering algorithm to cluster data streams with noisy variables. The simulation results show that our proposed method improves on previous studies by adding a process of variable selection as a component of the clustering algorithm. The results of two experiments indicate that clustering data streams with the process of variable selection is more stable and has better purity than clustering without such a process. Another experiment on the KDD-CUP99 dataset also shows that our algorithm can generate more stable results.

10.
Compared to tests for localized clusters, the tests for global clustering only collect evidence for clustering throughout the study region without evaluating the statistical significance of the individual clusters. The weighted likelihood ratio (WLR) test based on the weighted sum of likelihood ratios represents an important class of tests for global clustering. Song and Kulldorff (Likelihood based tests for spatial randomness. Stat Med. 2006;25(5):825–839) developed a wide variety of weight functions with the WLR test for global clustering. However, these weight functions are often defined based on the cell population size or the geographic information such as area size and distance between cells. They do not make use of the information from the observed count, although the likelihood ratio of a potential cluster depends on both the observed count and its population size. In this paper, we develop a self-adjusted weight function to directly allocate weights onto the likelihood ratios according to their values. The power of the test was evaluated and compared with existing methods based on a benchmark data set. The comparison results favour the suggested test especially under global chain clustering models.

11.
In this paper, we present an algorithm for clustering based on univariate kernel density estimation, named ClusterKDE. It consists of an iterative procedure in which, at each step, a new cluster is obtained by minimizing a smooth kernel function. Although in our applications we have used the univariate Gaussian kernel, any smooth kernel function can be used. The proposed algorithm has the advantage of not requiring the number of clusters a priori. Furthermore, the ClusterKDE algorithm is very simple, easy to implement and well defined, and stops in a finite number of steps; that is, it always converges independently of the initial point. We also illustrate our findings with numerical experiments obtained by implementing our algorithm in Matlab and applying it to practical problems. The results indicate that the ClusterKDE algorithm is competitive and fast when compared with the well-known Clusterdata and K-means algorithms, which Matlab uses to cluster data.
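The abstract does not spell out the iterative minimisation used by ClusterKDE. As a simpler stand-in that conveys the density-based idea, the sketch below evaluates a univariate Gaussian kernel density estimate on a grid and cuts the data at the interior local minima of that estimate. The function name `kde_clusters_1d`, the grid search and the user-supplied `bandwidth` are assumptions, not part of the published algorithm.

```python
import math

def kde_clusters_1d(data, bandwidth, grid_size=201):
    """Label 1-D points by cutting at local minima of a Gaussian KDE."""
    lo, hi = min(data), max(data)
    if lo == hi:
        return [0] * len(data)
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]

    def density(x):
        # Unnormalised Gaussian KDE; the constant factor does not move minima.
        return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

    dens = [density(x) for x in grid]
    # Interior grid points strictly lower than both neighbours act as cut points.
    cuts = [grid[i] for i in range(1, grid_size - 1)
            if dens[i] < dens[i - 1] and dens[i] < dens[i + 1]]
    # A point's label is the number of cut points lying below it.
    return [sum(1 for c in cuts if c < x) for x in data]
```

As in the abstract, the number of clusters is not supplied in advance; here it follows from the number of density minima, which in turn depends on the chosen bandwidth.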

12.
In this article, we introduce a new weighted quantile regression method. Traditionally, the estimation of the parameters involved in quantile regression is obtained by minimizing a loss function based on absolute distances with weights independent of explanatory variables. Specifically, we study a new estimation method using a weighted loss function with the weights associated with explanatory variables so that the performance of the resulting estimation can be improved. In full generality, we derive the asymptotic distribution of the weighted quantile regression estimators for any uniformly bounded positive weight function independent of the response. Two practical weighting schemes are proposed, each for a certain type of data. Monte Carlo simulations are carried out for comparing our proposed methods with the classical approaches. We also demonstrate the proposed methods using two real-life data sets from the literature. Both our simulation study and the results from these examples show that our proposed method outperforms the classical approaches when the relative efficiency is measured by the mean-squared errors of the estimators.
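The weighted loss described above can be illustrated in the simplest setting, an intercept-only model, where minimising the weighted check loss reduces to computing a weighted quantile. The sketch below is an illustration under that assumption; the function names are hypothetical, and the weights are passed in directly rather than derived from explanatory variables as in the paper's weighting schemes.

```python
def check_loss(u, tau):
    """Quantile check loss: tau*u for u >= 0, (tau - 1)*u for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def weighted_objective(q, ys, ws, tau):
    # Weighted sum of check losses of the residuals y - q.
    return sum(w * check_loss(y - q, tau) for y, w in zip(ys, ws))

def weighted_quantile(ys, ws, tau):
    """Smallest y whose cumulative weight reaches tau times the total weight."""
    pairs = sorted(zip(ys, ws))
    target = tau * sum(ws)
    running = 0.0
    for y, w in pairs:
        running += w
        if running >= target:
            return y
    return pairs[-1][0]
```

With equal weights this returns an ordinary sample quantile; putting more weight on an observation pulls the estimate toward it, which is the mechanism the weighted loss exploits.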

13.
Calibration on the available auxiliary variables is widely used to increase the precision of parameter estimates. Singh and Sedory [Two-step calibration of design weights in survey sampling. Commun Stat Theory Methods. 2016;45(12):3510–3523.] considered the problem of calibrating design weights in two steps for a single auxiliary variable. In the first step, for a given sample, the design weights and calibrated weights are set proportional to each other. In the second step, the value of the proportionality constant is determined on the basis of the objectives of the individual investigator or user, for example to obtain minimum mean squared error or a reduction of bias. In this paper, we suggest using two auxiliary variables for two-step calibration of the design weights and compare the results with those for a single auxiliary variable for different sample sizes, based on simulated and real-life data sets. The simulated and real-life application results show that the two-step calibration estimator based on two auxiliary variables outperforms the estimator based on a single auxiliary variable in terms of minimum mean squared error.

14.
For an estimation with missing data, a crucial step is to determine if the data are missing completely at random (MCAR), in which case a complete-case analysis would suffice. Most existing tests for MCAR do not provide a method for a subsequent estimation once the MCAR is rejected. In the setting of estimating means, we propose a unified approach for testing MCAR and the subsequent estimation. Upon rejecting MCAR, the same set of weights used for testing can then be used for estimation. The resulting estimators are consistent if the missingness of each response variable depends only on a set of fully observed auxiliary variables and the true outcome regression model is among the user-specified functions for deriving the weights. The proposed method is based on the calibration idea from survey sampling literature and the empirical likelihood theory.

15.
In this article, a robust variable selection procedure based on the weighted composite quantile regression (WCQR) is proposed. Compared with the composite quantile regression (CQR), WCQR is robust to heavy-tailed errors and outliers in the explanatory variables. For the choice of the weights in the WCQR, we employ a weighting scheme based on the principal component method. To select variables with grouping effect, we consider WCQR with SCAD-L2 penalization. Furthermore, under some suitable assumptions, the theoretical properties, including the consistency and oracle property of the estimator, are established with a diverging number of parameters. In addition, we study the numerical performance of the proposed method in the case of ultrahigh-dimensional data. Simulation studies and real examples are provided to demonstrate the superiority of our method over the CQR method when there are outliers in the explanatory variables and/or the random error is from a heavy-tailed distribution.

16.
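An Exploration of Cluster Analysis Methods for Functional Data   Total citations: 3, self-citations: 0, citations by others: 3
Functional data are a newly emerging data type in data analysis. They have the characteristics of both time series and cross-sectional data, can usually be described as the graph of a function of some variable, and are highly useful in practical applications. This paper first briefly analyzes some basic characteristics of functional data and some functional data clustering methods proposed to date, such as the uniformly corrected K-means clustering method for functional data and hierarchical clustering for functional data. On this basis, it explores functional data clustering from the perspective of function feature analysis, proposes an interval clustering method for functional data based on derivative analysis, and applies the method empirically to employed-population data for the six provinces of central China, obtaining clustering results.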

17.
We propose a Random Splitting Model Averaging procedure, RSMA, to achieve stable predictions in high-dimensional linear models. The idea is to use split training data to construct and estimate candidate models and use test data to form second-level data. The second-level data are used to estimate optimal weights for candidate models by quadratic optimization under non-negative constraints. This procedure has three appealing features: (1) RSMA avoids model overfitting and, as a result, gives improved prediction accuracy. (2) By adaptively choosing optimal weights, we obtain more stable predictions via averaging over several candidate models. (3) Based on RSMA, a weighted importance index is proposed to rank the predictors to discriminate relevant predictors from irrelevant ones. Simulation studies and a real data analysis demonstrate that the RSMA procedure has excellent predictive performance and that the associated weighted importance index ranks the predictors well.

18.
This paper develops a novel weighted composite quantile regression (CQR) method for estimation of a linear model when some covariates are missing at random and the probability for missingness mechanism can be modelled parametrically. By incorporating the unbiased estimating equations of incomplete data into empirical likelihood (EL), we obtain the EL-based weights, and then re-adjust the inverse probability weighted CQR for estimating the vector of regression coefficients. Theoretical results show that the proposed method can achieve semiparametric efficiency if the selection probability function is correctly specified; therefore, the EL weighted CQR is more efficient than the inverse probability weighted CQR. Besides, our algorithm is computationally simple and easy to implement. Simulation studies are conducted to examine the finite sample performance of the proposed procedures. Finally, we apply the new method to analyse the US news College data.

19.

The purpose of this paper is to show, in regression clustering, how to choose the most relevant solutions, analyze their stability, and provide information about the best combinations of the optimal number of groups, the restriction factor among the error variances across groups and the level of trimming. The procedure is based on two steps. First, we generalize the information criteria of constrained robust multivariate clustering to the case of clustering weighted models. Unlike the traditional approaches, which are based on the choice of the best solution found by minimizing an information criterion (e.g. BIC), we concentrate our attention on the so-called optimal stable solutions. In the second step, using the monitoring approach, we select the best value of the trimming factor. Finally, we validate the solution using a confirmatory forward search approach. A motivating example based on a novel dataset concerning the European Union trade of face masks shows the limitations of the existing procedures. The suggested approach is initially applied to a set of well-known datasets in the literature of robust regression clustering. Then, we focus our attention on a set of international trade datasets and we provide a novel informative way of updating the subset in the random start approach. The Supplementary material, in the spirit of the Special Issue, deepens the analysis of trade data and compares the suggested approach with the existing ones available in the literature.


20.
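Research on Missing Data Handling Based on Clustering and Association Rules   Total citations: 2, self-citations: 1, citations by others: 2
This paper proposes a new method for handling missing data based on clustering and association rules. A clustering method groups similar records of the data set containing missing data into the same class; an improved association rule method then mines the associations among variables in each sub-data set, and these associations are used to fill in the missing data. A worked example shows that the method performs well for handling missing data, especially in massive data sets.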
