首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 734 毫秒
1.
Reduced k‐means clustering is a method for clustering objects in a low‐dimensional subspace. The advantage of this method is that both clustering of objects and low‐dimensional subspace reflecting the cluster structure are simultaneously obtained. In this paper, the relationship between conventional k‐means clustering and reduced k‐means clustering is discussed. Conditions ensuring almost sure convergence of the estimator of reduced k‐means clustering as unboundedly increasing sample size have been presented. The results for a more general model considering conventional k‐means clustering and reduced k‐means clustering are provided in this paper. Moreover, a consistent selection of the numbers of clusters and dimensions is described.  相似文献   

2.
This paper deals with the analysis of data from a HET‐CAMVT experiment. From a statistical perspective, such data yield many challenges. First of all, the data are typically time‐to‐event like data, which are at the same time interval censored and right truncated. In addition, one has to cope with overdispersion as well as clustering. Traditional analysis approaches ignore overdispersion and clustering and summarize the data into a continuous score that can be analysed using simple linear models. In this paper, a novel combined frailty model is developed that simultaneously captures all of the aforementioned statistical challenges posed by the data. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

3.
Shi, Wang, Murray-Smith and Titterington (Biometrics 63:714–723, 2007) proposed a Gaussian process functional regression (GPFR) model to model functional response curves with a set of functional covariates. Two main problems are addressed by their method: modelling nonlinear and nonparametric regression relationship and modelling covariance structure and mean structure simultaneously. The method gives very good results for curve fitting and prediction but side-steps the problem of heterogeneity. In this paper we present a new method for modelling functional data with ‘spatially’ indexed data, i.e., the heterogeneity is dependent on factors such as region and individual patient’s information. For data collected from different sources, we assume that the data corresponding to each curve (or batch) follows a Gaussian process functional regression model as a lower-level model, and introduce an allocation model for the latent indicator variables as a higher-level model. This higher-level model is dependent on the information related to each batch. This method takes advantage of both GPFR and mixture models and therefore improves the accuracy of predictions. The mixture model has also been used for curve clustering, but focusing on the problem of clustering functional relationships between response curve and covariates, i.e. the clustering is based on the surface shape of the functional response against the set of functional covariates. The model is examined on simulated data and real data.  相似文献   

4.
函数数据聚类分析方法探析   总被引:3,自引:0,他引:3  
函数数据是目前数据分析中新出现的一种数据类型,它同时具有时间序列和横截面数据的特征,通常可以描述为关于某一变量的函数图像,在实际应用中具有很强的实用性。首先简要分析函数数据的一些基本特征和目前提出的一些函数数据聚类方法,如均匀修正的函数数据K均值聚类方法、函数数据层次聚类方法等,并在此基础上,从函数特征分析的角度探讨了函数数据聚类方法,提出了一种基于导数分析的函数数据区间聚类分析方法,并利用中国中部六省的就业人口数据对该方法进行实证分析,取得了聚类结果。  相似文献   

5.
The authors propose a profile likelihood approach to linear clustering which explores potential linear clusters in a data set. For each linear cluster, an errors‐in‐variables model is assumed. The optimization of the derived profile likelihood can be achieved by an EM algorithm. Its asymptotic properties and its relationships with several existing clustering methods are discussed. Methods to determine the number of components in a data set are adapted to this linear clustering setting. Several simulated and real data sets are analyzed for comparison and illustration purposes. The Canadian Journal of Statistics 38: 716–737; 2010 © 2010 Statistical Society of Canada  相似文献   

6.
One of the most popular methods and algorithms to partition data to k clusters is k-means clustering algorithm. Since this method relies on some basic conditions such as, the existence of mean and finite variance, it is unsuitable for data that their variances are infinite such as data with heavy tailed distribution. Pitman Measure of Closeness (PMC) is a criterion to show how much an estimator is close to its parameter with respect to another estimator. In this article using PMC, based on k-means clustering, a new distance and clustering algorithm is developed for heavy tailed data.  相似文献   

7.
函数型数据的稀疏性和无穷维特性使得传统聚类分析失效。针对此问题,本文在界定函数型数据概念与内涵的基础上提出了一种自适应迭代更新聚类分析。首先,基于数据参数信息实现无穷维函数空间向有限维多元空间的过渡;在此基础上,依据变量信息含量的差异构建了自适应赋权聚类统计量,并依此为函数型数据的相似性测度进行初始类别划分;进一步地,在给定阈值限制下,对所有函数的初始类别归属进行自适应迭代更新,将收敛的优化结果作为最终的类别划分。随机模拟和实证检验表明,与现有的同类函数型聚类分析相比,文中方法的分类正确率显著提高,体现了新方法的相对优良性和实际问题应用中的有效性。  相似文献   

8.
Functional data analysis (FDA)—the analysis of data that can be considered a set of observed continuous functions—is an increasingly common class of statistical analysis. One of the most widely used FDA methods is the cluster analysis of functional data; however, little work has been done to compare the performance of clustering methods on functional data. In this article, a simulation study compares the performance of four major hierarchical methods for clustering functional data. The simulated data varied in three ways: the nature of the signal functions (periodic, non periodic, or mixed), the amount of noise added to the signal functions, and the pattern of the true cluster sizes. The Rand index was used to compare the performance of each clustering method. As a secondary goal, clustering methods were also compared when the number of clusters has been misspecified. To illustrate the results, a real set of functional data was clustered where the true clustering structure is believed to be known. Comparing the clustering methods for the real data set confirmed the findings of the simulation. This study yields concrete suggestions to future researchers to determine the best method for clustering their functional data.  相似文献   

9.
k-POD: A Method for k-Means Clustering of Missing Data   总被引:1,自引:0,他引:1  
The k-means algorithm is often used in clustering applications but its usage requires a complete data matrix. Missing data, however, are common in many applications. Mainstream approaches to clustering missing data reduce the missing data problem to a complete data formulation through either deletion or imputation but these solutions may incur significant costs. Our k-POD method presents a simple extension of k-means clustering for missing data that works even when the missingness mechanism is unknown, when external information is unavailable, and when there is significant missingness in the data.

[Received November 2014. Revised August 2015.]  相似文献   

10.
This article proposes a new model for right‐censored survival data with multi‐level clustering based on the hierarchical Kendall copula model of Brechmann (2014) with Archimedean clusters. This model accommodates clusters of unequal size and multiple clustering levels, without imposing any structural conditions on the parameters or on the copulas used at various levels of the hierarchy. A step‐wise estimation procedure is proposed and shown to yield consistent and asymptotically Gaussian estimates under mild regularity conditions. The model fitting is based on multiple imputation, given that the censoring rate increases with the level of the hierarchy. To check the model assumption of Archimedean dependence, a goodness‐of test is developed. The finite‐sample performance of the proposed estimators and of the goodness‐of‐fit test is investigated through simulations. The new model is applied to data from the study of chronic granulomatous disease. The Canadian Journal of Statistics 47: 182–203; 2019 © 2019 Statistical Society of Canada  相似文献   

11.
In this paper, we propose and study a new global test, namely, GPF test, for the one‐way anova problem for functional data, obtained via globalizing the usual pointwise F‐test. The asymptotic random expressions of the test statistic are derived, and its asymptotic power is investigated. The GPF test is shown to be root‐n consistent. It is much less computationally intensive than a parametric bootstrap test proposed in the literature for the one‐way anova for functional data. Via some simulation studies, it is found that in terms of size‐controlling and power, the GPF test is comparable with two existing tests adopted for the one‐way anova problem for functional data. A real data example illustrates the GPF test.  相似文献   

12.
Abstract A model is introduced here for multivariate failure time data arising from heterogenous populations. In particular, we consider a situation in which the failure times of individual subjects are often temporally clustered, so that many failures occur during a relatively short age interval. The clustering is modelled by assuming that the subjects can be divided into ‘internally homogenous’ latent classes, each such class being then described by a time‐dependent frailty profile function. As an example, we reanalysed the dental caries data presented earlier in Härkänen et al. [Scand. J. Statist. 27 (2000) 577], as it turned out that our earlier model could not adequately describe the observed clustering.  相似文献   

13.
Cluster analysis is one of the most widely used method in statistical analyses, in which homogeneous subgroups are identified in a heterogeneous population. Due to the existence of the continuous and discrete mixed data in many applications, so far, some ordinary clustering methods such as, hierarchical methods, k-means and model-based methods have been extended for analysis of mixed data. However, in the available model-based clustering methods, by increasing the number of continuous variables, the number of parameters increases and identifying as well as fitting an appropriate model may be difficult. In this paper, to reduce the number of the parameters, for the model-based clustering mixed data of continuous (normal) and nominal data, a set of parsimonious models is introduced. Models in this set are extended, using the general location model approach, for modeling distribution of mixed variables and applying factor analyzer structure for covariance matrices. The ECM algorithm is used for estimating the parameters of these models. In order to show the performance of the proposed models for clustering, results from some simulation studies and analyzing two real data sets are presented.  相似文献   

14.
In this work, we develop a method of adaptive non‐parametric estimation, based on ‘warped’ kernels. The aim is to estimate a real‐valued function s from a sample of random couples (X,Y). We deal with transformed data (Φ(X),Y), with Φ a one‐to‐one function, to build a collection of kernel estimators. The data‐driven bandwidth selection is performed with a method inspired by Goldenshluger and Lepski (Ann. Statist., 39, 2011, 1608). The method permits to handle various problems such as additive and multiplicative regression, conditional density estimation, hazard rate estimation based on randomly right‐censored data, and cumulative distribution function estimation from current‐status data. The interest is threefold. First, the squared‐bias/variance trade‐off is automatically realized. Next, non‐asymptotic risk bounds are derived. Lastly, the estimator is easily computed, thanks to its simple expression: a short simulation study is presented.  相似文献   

15.
The paper considers the clustering of two large sets of Internet traffic data consisting of information measured from headers of transmission control protocol packets collected on a busy arc of a university network connecting with the Internet. Packets are grouped into 'flows' thought to correspond to particular movements of information between one computer and another. The clustering is based on representing the flows as each sampled from one of a finite number of multinomial distributions and seeks to identify clusters of flows containing similar packet‐length distributions. The clustering uses the EM algorithm, and the data‐analytic and computational details are given.  相似文献   

16.
The k-means algorithm is one of the most common non hierarchical methods of clustering. It aims to construct clusters in order to minimize the within cluster sum of squared distances. However, as most estimators defined in terms of objective functions depending on global sums of squares, the k-means procedure is not robust with respect to atypical observations in the data. Alternative techniques have thus been introduced in the literature, e.g., the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. In this article, focus is on the error rate these clustering procedures achieve when one expects the data to be distributed according to a mixture distribution. Two different definitions of the error rate are under consideration, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. The consequence of this will be emphasized with the comparison of influence functions and breakdown points of these error rates.  相似文献   

17.
This study considers the binary classification of functional data collected in the form of curves. In particular, we assume a situation in which the curves are highly mixed over the entire domain, so that the global discriminant analysis based on the entire domain is not effective. This study proposes an interval-based classification method for functional data: the informative intervals for classification are selected and used for separating the curves into two classes. The proposed method, called functional logistic regression with fused lasso penalty, combines the functional logistic regression as a classifier and the fused lasso for selecting discriminant segments. The proposed method automatically selects the most informative segments of functional data for classification by employing the fused lasso penalty and simultaneously classifies the data based on the selected segments using the functional logistic regression. The effectiveness of the proposed method is demonstrated with simulated and real data examples.  相似文献   

18.
This paper introduces a method for clustering spatially dependent functional data. The idea is to consider the contribution of each curve to the spatial variability. Thus, we define a spatial dispersion function associated to each curve and perform a k-means like clustering algorithm. The algorithm is based on the optimization of a fitting criterion between the spatial dispersion functions associated to each curve and the representative of the clusters. The performance of the proposed method is illustrated by an application on real data and a simulation study.  相似文献   

19.
The authors propose a method for comparing two samples of curves. The notion of similarity between two curves is the basis of three statistics they suggest for testing the null hypothesis of no difference between the two groups. They exploit standard tools from functional data analysis to preprocess the observed curves and use the permutation distribution under the null hypothesis to obtain p‐values for their tests. They explore the operating characteristics of these tests through simulations and as an application, compare the ganglioside distribution in brain tissue between old and young rats.  相似文献   

20.
农业险定价中的核心问题是农业风险区划问题,为了体现农业区划中个体指标的动态发展特征,根据近邻传播改进自适应近邻传播聚类方法对数据进行优化,基于轮廓系数、归属度和吸引度得到最佳聚类中心和几何聚类中心,并将聚类转化为新数据集的聚类问题;选取代表性的棉花为例进行实证分析,通过计算生产、销售、收入、财政等指标进行棉花风险区划实例分析,计算最优棉花风险区划,结果表明对于具有动态特征的数据,本模型具有很好的有效性、实用性和解释性。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号