1.

Strong Consistency of Reduced K‐means Clustering

Yoshikazu Terada 《Scandinavian Journal of Statistics》2014,41(4):913-931

Reduced k‐means clustering is a method for clustering objects in a low‐dimensional subspace. The advantage of this method is that both the clustering of objects and a low‐dimensional subspace reflecting the cluster structure are obtained simultaneously. In this paper, the relationship between conventional k‐means clustering and reduced k‐means clustering is discussed. Conditions ensuring almost sure convergence of the estimator of reduced k‐means clustering as the sample size increases without bound are presented. Results for a more general model covering both conventional k‐means clustering and reduced k‐means clustering are also provided. Moreover, a consistent selection of the numbers of clusters and dimensions is described.

2.

Flexible modelling of simultaneously interval censored and truncated time‐to‐event data

Sammy Chebon Christel Faes Ann De Smedt Helena Geys 《Pharmaceutical statistics》2015,14(4):311-321

This paper deals with the analysis of data from a HET‐CAM^VT experiment. From a statistical perspective, such data pose many challenges. First of all, the data are typically time‐to‐event‐like data, which are at the same time interval censored and right truncated. In addition, one has to cope with overdispersion as well as clustering. Traditional analysis approaches ignore overdispersion and clustering and summarize the data into a continuous score that can be analysed using simple linear models. In this paper, a novel combined frailty model is developed that simultaneously captures all of the aforementioned statistical challenges posed by the data. Copyright © 2015 John Wiley & Sons, Ltd.

3.

Curve prediction and clustering with mixtures of Gaussian process functional regression models

J. Q. Shi B. Wang 《Statistics and Computing》2008,18(3):267-283

Shi, Wang, Murray-Smith and Titterington (Biometrics 63:714–723, 2007) proposed a Gaussian process functional regression (GPFR) model to model functional response curves with a set of functional covariates. Two main problems are addressed by their method: modelling nonlinear and nonparametric regression relationships, and modelling covariance structure and mean structure simultaneously. The method gives very good results for curve fitting and prediction but side-steps the problem of heterogeneity. In this paper we present a new method for modelling functional data with ‘spatially’ indexed data, i.e., the heterogeneity is dependent on factors such as region and individual patient’s information. For data collected from different sources, we assume that the data corresponding to each curve (or batch) follow a Gaussian process functional regression model as a lower-level model, and introduce an allocation model for the latent indicator variables as a higher-level model. This higher-level model is dependent on the information related to each batch. This method takes advantage of both GPFR and mixture models and therefore improves the accuracy of predictions. The mixture model has also been used for curve clustering, but focusing on the problem of clustering functional relationships between response curve and covariates, i.e., the clustering is based on the surface shape of the functional response against the set of functional covariates. The model is examined on simulated data and real data.

4.

An Exploration of Cluster Analysis Methods for Functional Data (cited by: 3; self-citations: 0; citations by others: 3)

Zeng Yuyu (曾玉钰) Weng Jinzhong (翁金钟) 《Statistics & Information Forum》2007,22(5):10-14

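Functional data are a newly emerging data type in data analysis. They combine features of time series and cross-sectional data, can usually be described as the graph of a function of some variable, and are highly useful in practical applications. This paper first briefly reviews some basic characteristics of functional data and several existing clustering methods for them, such as the uniformly corrected K-means clustering method for functional data and hierarchical clustering for functional data. On this basis, it examines functional data clustering from the perspective of analysing features of the functions, proposes an interval clustering method for functional data based on derivative analysis, and applies the method to employment data from six provinces of central China to obtain a clustering result.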

5.

Model‐based linear clustering

Guohua Yan William J. Welch Ruben H. Zamar 《Revue canadienne de statistique》2010,38(4):716-737

The authors propose a profile likelihood approach to linear clustering which explores potential linear clusters in a data set. For each linear cluster, an errors‐in‐variables model is assumed. The optimization of the derived profile likelihood can be achieved by an EM algorithm. Its asymptotic properties and its relationships with several existing clustering methods are discussed. Methods to determine the number of components in a data set are adapted to this linear clustering setting. Several simulated and real data sets are analyzed for comparison and illustration purposes. The Canadian Journal of Statistics 38: 716–737; 2010 © 2010 Statistical Society of Canada

6.

A Pitman measure of similarity in k-means for clustering heavy-tailed data

Arman Reybod Javad Etminan Adel Mohammadpour 《Communications in Statistics - Simulation and Computation》2019,48(6):1595-1605

One of the most popular algorithms for partitioning data into k clusters is the k-means clustering algorithm. Because this method relies on basic conditions such as the existence of the mean and a finite variance, it is unsuitable for data with infinite variance, such as data from heavy-tailed distributions. The Pitman Measure of Closeness (PMC) is a criterion that shows how close an estimator is to its parameter relative to another estimator. In this article, a new distance and clustering algorithm for heavy-tailed data is developed using PMC, based on k-means clustering.

7.

A Study of Functional Data Clustering Based on Adaptive Iterative Updating

Wang Deqing (王德青) et al. 《Statistical Research》2015,32(4):91-96

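The sparsity and infinite dimensionality of functional data cause traditional cluster analysis to fail. To address this problem, the paper defines the concept and scope of functional data and, on that basis, proposes an adaptive, iteratively updated cluster analysis. First, the transition from the infinite-dimensional function space to a finite-dimensional multivariate space is made using the parametric information of the data. Next, an adaptively weighted clustering statistic is constructed according to differences in the information content of the variables and is used as the similarity measure for functional data to form an initial partition. Then, under a given threshold, the initial class memberships of all functions are adaptively and iteratively updated, and the converged result is taken as the final partition. Simulations and an empirical study show that, compared with existing functional clustering methods of the same kind, the proposed method achieves a significantly higher rate of correct classification, which reflects its relative merit and its effectiveness in practical applications.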

8.

A Comparison of Hierarchical Methods for Clustering Functional Data

Laura Ferreira 《Communications in Statistics - Simulation and Computation》2013,42(9):1925-1949

Functional data analysis (FDA), the analysis of data that can be considered a set of observed continuous functions, is an increasingly common class of statistical analysis. One of the most widely used FDA methods is the cluster analysis of functional data; however, little work has been done to compare the performance of clustering methods on functional data. In this article, a simulation study compares the performance of four major hierarchical methods for clustering functional data. The simulated data varied in three ways: the nature of the signal functions (periodic, nonperiodic, or mixed), the amount of noise added to the signal functions, and the pattern of the true cluster sizes. The Rand index was used to compare the performance of each clustering method. As a secondary goal, clustering methods were also compared when the number of clusters has been misspecified. To illustrate the results, a real set of functional data was clustered where the true clustering structure is believed to be known. Comparing the clustering methods for the real data set confirmed the findings of the simulation. This study yields concrete suggestions to help future researchers determine the best method for clustering their functional data.
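The abstract above names the Rand index as its comparison criterion. As a reminder of what that index measures, here is a minimal sketch (not taken from the article): it counts the fraction of object pairs on which two partitions agree, that is, pairs placed together in both or apart in both. The function name `rand_index` and the toy labels are illustrative assumptions.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs treated the same way by two partitions.

    A pair counts as an agreement when both partitions put the two objects
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two objects")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Cluster names do not matter, only the grouping does.
same_grouping = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_grouping = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because only the grouping matters, relabelling the clusters leaves the index unchanged, which is why it suits comparing a fitted clustering with a known true structure.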

9.

k-POD: A Method for k-Means Clustering of Missing Data (cited by: 1; self-citations: 0; citations by others: 1)

Jocelyn T. Chi Eric C. Chi Richard G. Baraniuk 《The American statistician》2016,70(1):91-99

The k-means algorithm is often used in clustering applications but its usage requires a complete data matrix. Missing data, however, are common in many applications. Mainstream approaches to clustering missing data reduce the missing data problem to a complete data formulation through either deletion or imputation but these solutions may incur significant costs. Our k-POD method presents a simple extension of k-means clustering for missing data that works even when the missingness mechanism is unknown, when external information is unavailable, and when there is significant missingness in the data.

[Received November 2014. Revised August 2015.]

10.

Modelling hierarchical clustered censored data with the hierarchical Kendall copula

Chien‐Lin Su Johanna G. Nešlehová Weijing Wang 《Revue canadienne de statistique》2019,47(2):182-203

This article proposes a new model for right‐censored survival data with multi‐level clustering based on the hierarchical Kendall copula model of Brechmann (2014) with Archimedean clusters. This model accommodates clusters of unequal size and multiple clustering levels, without imposing any structural conditions on the parameters or on the copulas used at various levels of the hierarchy. A step‐wise estimation procedure is proposed and shown to yield consistent and asymptotically Gaussian estimates under mild regularity conditions. The model fitting is based on multiple imputation, given that the censoring rate increases with the level of the hierarchy. To check the model assumption of Archimedean dependence, a goodness‐of‐fit test is developed. The finite‐sample performance of the proposed estimators and of the goodness‐of‐fit test is investigated through simulations. The new model is applied to data from the study of chronic granulomatous disease. The Canadian Journal of Statistics 47: 182–203; 2019 © 2019 Statistical Society of Canada

11.

One‐Way ANOVA for Functional Data via Globalizing the Pointwise F‐test

Jin‐Ting Zhang Xuehua Liang 《Scandinavian Journal of Statistics》2014,41(1):51-71

In this paper, we propose and study a new global test, namely, the GPF test, for the one‐way ANOVA problem for functional data, obtained via globalizing the usual pointwise F‐test. The asymptotic random expressions of the test statistic are derived, and its asymptotic power is investigated. The GPF test is shown to be root‐n consistent. It is much less computationally intensive than a parametric bootstrap test proposed in the literature for the one‐way ANOVA for functional data. Via some simulation studies, it is found that in terms of size control and power, the GPF test is comparable with two existing tests adopted for the one‐way ANOVA problem for functional data. A real data example illustrates the GPF test.

12.

A Non-parametric Frailty Model for Temporally Clustered Multivariate Failure Times

Tommi Härkänen Hannu Hausen Jorma I. Virtanen Elja Arjas 《Scandinavian Journal of Statistics》2003,30(3):523-533

A model is introduced here for multivariate failure time data arising from heterogeneous populations. In particular, we consider a situation in which the failure times of individual subjects are often temporally clustered, so that many failures occur during a relatively short age interval. The clustering is modelled by assuming that the subjects can be divided into ‘internally homogeneous’ latent classes, each such class being then described by a time‐dependent frailty profile function. As an example, we reanalysed the dental caries data presented earlier in Härkänen et al. [Scand. J. Statist. 27 (2000) 577], as it turned out that our earlier model could not adequately describe the observed clustering.

13.

Mixtures of general location model with factor analyzer covariance structure for clustering mixed type data

Leila Amiri Mojtaba Ganjali 《Journal of applied statistics》2019,46(11):2075-2100

Cluster analysis is one of the most widely used methods in statistical analysis, in which homogeneous subgroups are identified in a heterogeneous population. Because mixed continuous and discrete data arise in many applications, some ordinary clustering methods, such as hierarchical methods, k-means and model-based methods, have been extended to the analysis of mixed data. However, in the available model-based clustering methods, the number of parameters increases with the number of continuous variables, and identifying as well as fitting an appropriate model may be difficult. In this paper, to reduce the number of parameters, a set of parsimonious models is introduced for model-based clustering of mixed data consisting of continuous (normal) and nominal variables. Models in this set are extended, using the general location model approach, for modeling the distribution of mixed variables and applying a factor analyzer structure for covariance matrices. The ECM algorithm is used for estimating the parameters of these models. To show the performance of the proposed models for clustering, results from some simulation studies and from the analysis of two real data sets are presented.

14.

Adaptive Warped Kernel Estimators

Gaëlle Chagny 《Scandinavian Journal of Statistics》2015,42(2):336-360

In this work, we develop a method of adaptive non‐parametric estimation, based on ‘warped’ kernels. The aim is to estimate a real‐valued function s from a sample of random couples (X,Y). We deal with transformed data (Φ(X),Y), with Φ a one‐to‐one function, to build a collection of kernel estimators. The data‐driven bandwidth selection is performed with a method inspired by Goldenshluger and Lepski (Ann. Statist., 39, 2011, 1608). The method can handle various problems such as additive and multiplicative regression, conditional density estimation, hazard rate estimation based on randomly right‐censored data, and cumulative distribution function estimation from current‐status data. The interest is threefold. First, the squared‐bias/variance trade‐off is automatically realized. Next, non‐asymptotic risk bounds are derived. Lastly, the estimator is easily computed, thanks to its simple expression: a short simulation study is presented.

15.

Using Multinomial Mixture Models to Cluster Internet Traffic

Murray Jorgensen 《Australian & New Zealand Journal of Statistics》2004,46(2):205-218

The paper considers the clustering of two large sets of Internet traffic data consisting of information measured from headers of transmission control protocol packets collected on a busy arc of a university network connecting with the Internet. Packets are grouped into 'flows' thought to correspond to particular movements of information between one computer and another. The clustering is based on representing the flows as each sampled from one of a finite number of multinomial distributions and seeks to identify clusters of flows containing similar packet‐length distributions. The clustering uses the EM algorithm, and the data‐analytic and computational details are given.

16.

Functional logistic regression with fused lasso penalty

Hyojoong Kim 《Journal of Statistical Computation and Simulation》2018,88(15):2982-2999

This study considers the binary classification of functional data collected in the form of curves. In particular, we assume a situation in which the curves are highly mixed over the entire domain, so that the global discriminant analysis based on the entire domain is not effective. This study proposes an interval-based classification method for functional data: the informative intervals for classification are selected and used for separating the curves into two classes. The proposed method, called functional logistic regression with fused lasso penalty, combines the functional logistic regression as a classifier and the fused lasso for selecting discriminant segments. The proposed method automatically selects the most informative segments of functional data for classification by employing the fused lasso penalty and simultaneously classifies the data based on the selected segments using the functional logistic regression. The effectiveness of the proposed method is demonstrated with simulated and real data examples.

17.

Impact of Contamination on Training and Test Error Rates in Statistical Clustering

C. Ruwet G. Haesbroeck 《Communications in Statistics - Simulation and Computation》2013,42(3):394-411

The k-means algorithm is one of the most common nonhierarchical methods of clustering. It aims to construct clusters that minimize the within-cluster sum of squared distances. However, like most estimators defined in terms of objective functions depending on global sums of squares, the k-means procedure is not robust with respect to atypical observations in the data. Alternative techniques have thus been introduced in the literature, e.g., the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. In this article, the focus is on the error rate these clustering procedures achieve when one expects the data to be distributed according to a mixture distribution. Two different definitions of the error rate are under consideration, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. The consequences of this are emphasized through the comparison of influence functions and breakdown points of these error rates.
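The robustness point in the abstract above can be illustrated with a small sketch. It is not the authors' analysis: it only evaluates the within-cluster sum of squares for one-dimensional data and compares a mean-based centre with a medoid when one atypical observation is present. The function names (`wcss`, `centroid`, `medoid`) and the toy numbers are assumptions made for illustration.

```python
def wcss(points, labels, centers):
    """Within-cluster sum of squared distances for 1-D points."""
    return sum((x - centers[lab]) ** 2 for x, lab in zip(points, labels))


def centroid(cluster):
    """k-means style centre: the arithmetic mean of the cluster."""
    return sum(cluster) / len(cluster)


def medoid(cluster):
    """k-medoids style centre: the member with the smallest total
    absolute distance to all other members."""
    return min(cluster, key=lambda c: sum(abs(c - x) for x in cluster))


# Two well-separated clusters: the objective is small at the true centres.
clean_objective = wcss([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1], [2, 11])

# One atypical value drags the mean far from the bulk of the data,
# while the medoid remains one of the typical observations.
contaminated = [1, 2, 3, 100]
mean_center = centroid(contaminated)
medoid_center = medoid(contaminated)
```

In this toy case the mean moves to 26.5 while the medoid stays at 2, which is the kind of sensitivity to atypical observations that motivates the alternatives mentioned in the abstract.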

18.

Spatial variability clustering for spatially dependent functional data

Elvira Romano Antonio Balzanella Rosanna Verde 《Statistics and Computing》2017,27(3):645-658

This paper introduces a method for clustering spatially dependent functional data. The idea is to consider the contribution of each curve to the spatial variability. Thus, we define a spatial dispersion function associated with each curve and perform a k-means-like clustering algorithm. The algorithm is based on the optimization of a fitting criterion between the spatial dispersion functions associated with each curve and the representatives of the clusters. The performance of the proposed method is illustrated by an application on real data and a simulation study.

19.

A similarity analysis of curves

Yolanda Muñoz Maldonado Joan G. Staniswalis Louis N. Irwin Donna Byers 《Revue canadienne de statistique》2002,30(3):373-381

The authors propose a method for comparing two samples of curves. The notion of similarity between two curves is the basis of three statistics they suggest for testing the null hypothesis of no difference between the two groups. They exploit standard tools from functional data analysis to preprocess the observed curves and use the permutation distribution under the null hypothesis to obtain p‐values for their tests. They explore the operating characteristics of these tests through simulations and, as an application, compare the ganglioside distribution in brain tissue between old and young rats.
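The permutation step described in the abstract above can be sketched generically. The abstract does not define the authors' three similarity statistics, so the sketch below substitutes a simple stand-in: the squared distance between the two group mean curves, evaluated on a common grid. The names `mean_curve`, `mean_curve_distance`, and `permutation_p_value`, the number of permutations, and the toy curves are all assumptions.

```python
import random


def mean_curve(curves):
    """Pointwise mean of curves sampled on a common grid."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]


def mean_curve_distance(group_a, group_b):
    """Stand-in statistic (an assumption, not the paper's): squared
    distance between the two group mean curves."""
    mean_a, mean_b = mean_curve(group_a), mean_curve(group_b)
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))


def permutation_p_value(group_a, group_b, n_perm=999, seed=0):
    """Monte Carlo permutation p-value for 'no difference between groups'."""
    rng = random.Random(seed)
    observed = mean_curve_distance(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign curves to groups at random
        if mean_curve_distance(pooled[:n_a], pooled[n_a:]) >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)


low = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.0, 0.1], [0.1, 0.1, 0.0]]
high = [[5.0, 5.1, 5.0], [5.1, 5.0, 5.1], [5.0, 5.0, 5.1], [5.1, 5.1, 5.0]]
p_separated = permutation_p_value(low, high)
p_identical = permutation_p_value(low, low)
```

Any other curve-level statistic could be passed through the same shuffling loop; only the function that scores a split into two groups would change.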

20.

Joint modeling of hierarchically clustered and overdispersed non‐gaussian continuous outcomes for comet assay data

Aklilu Habteab Ghebretinsae Christel Faes Geert Molenberghs Helena Geys Bas‐Jan Van der Leede 《Pharmaceutical statistics》2012,11(6):449-455

Multivariate longitudinal or clustered data are commonly encountered in clinical trials and toxicological studies. Typically, there is no single standard endpoint to assess the toxicity or efficacy of the compound of interest, but co‐primary endpoints are available to assess the toxic effects or the working of the compound. Modeling the responses jointly is thus appealing to draw overall inferences using all responses and to capture the association among the responses. Non‐Gaussian outcomes are often modeled univariately using exponential family models. To accommodate both the overdispersion and hierarchical structure in the data, Molenberghs et al. (A family of generalized linear models for repeated measures with normal and conjugate random effects. Statistical Science 2010; 25:325–347) proposed using two separate sets of random effects. This paper considers a model for multivariate data with hierarchically clustered and overdispersed non‐Gaussian data. A gamma random effect is used for the overdispersion and normal random effects for the clustering in the data. The two outcomes are jointly analyzed by assuming that the normal random effects for both endpoints are correlated. The association structure between the responses is analytically derived. The fit of the joint model to data from a so‐called comet assay is compared with the univariate analysis of the two outcomes. Copyright © 2012 John Wiley & Sons, Ltd.