首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Unsupervised Curve Clustering using B-Splines   总被引:5,自引:0,他引:5  
Data in many different fields come to practitioners through a process naturally described as functional. Although data are gathered as finite vector and may contain measurement errors, the functional form have to be taken into account. We propose a clustering procedure of such data emphasizing the functional nature of the objects. The new clustering method consists of two stages: fitting the functional data by B‐splines and partitioning the estimated model coefficients using a k‐means algorithm. Strong consistency of the clustering method is proved and a real‐world example from food industry is given.  相似文献   

2.
In this work it is shown how the k-means method for clustering objects can be applied in the context of statistical shape analysis. Because the choice of the suitable distance measure is a key issue for shape analysis, the Hartigan and Wong k-means algorithm is adapted for this situation. Simulations on controlled artificial data sets demonstrate that distances on the pre-shape spaces are more appropriate than the Euclidean distance on the tangent space. Finally, results are presented of an application to a real problem of oceanography, which in fact motivated the current work.  相似文献   

3.
One of the most popular methods and algorithms to partition data to k clusters is k-means clustering algorithm. Since this method relies on some basic conditions such as, the existence of mean and finite variance, it is unsuitable for data that their variances are infinite such as data with heavy tailed distribution. Pitman Measure of Closeness (PMC) is a criterion to show how much an estimator is close to its parameter with respect to another estimator. In this article using PMC, based on k-means clustering, a new distance and clustering algorithm is developed for heavy tailed data.  相似文献   

4.
k-POD: A Method for k-Means Clustering of Missing Data   总被引:1,自引:0,他引:1  
The k-means algorithm is often used in clustering applications but its usage requires a complete data matrix. Missing data, however, are common in many applications. Mainstream approaches to clustering missing data reduce the missing data problem to a complete data formulation through either deletion or imputation but these solutions may incur significant costs. Our k-POD method presents a simple extension of k-means clustering for missing data that works even when the missingness mechanism is unknown, when external information is unavailable, and when there is significant missingness in the data.

[Received November 2014. Revised August 2015.]  相似文献   

5.
Abstract. The focus of this article is on simultaneous confidence bands over a rectangular covariate region for a linear regression model with k>1 covariates, for which only conservative or approximate confidence bands are available in the statistical literature stretching back to Working & Hotelling (J. Amer. Statist. Assoc. 24 , 1929; 73–85). Formulas of simultaneous confidence levels of the hyperbolic and constant width bands are provided. These involve only a k‐dimensional integral; it is unlikely that the simultaneous confidence levels can be expressed as an integral of less than k‐dimension. These formulas allow the construction for the first time of exact hyperbolic and constant width confidence bands for at least a small k(>1) by using numerical quadrature. Comparison between the hyperbolic and constant width bands is then addressed under both the average width and minimum volume confidence set criteria. It is observed that the constant width band can be drastically less efficient than the hyperbolic band when k>1. Finally it is pointed out how the methods given in this article can be applied to more general regression models such as fixed‐effect or random‐effect generalized linear regression models.  相似文献   

6.
Clustering algorithms like types of k-means are fast, but they are inefficient for shape clustering. There are some algorithms, which are effective, but their time complexities are too high. This paper proposes a novel heuristic to solve large-scale shape clustering. The proposed method is effective and it solves large-scale clustering problems in fraction of a second.  相似文献   

7.
A vector of k positive coordinates lies in the k-dimensional simplex when the sum of all coordinates in the vector is constrained to equal 1. Sampling distributions efficiently on the simplex can be difficult because of this constraint. This paper introduces a transformed logit-scale proposal for Markov Chain Monte Carlo that naturally adjusts step size based on the position in the simplex. This enables efficient sampling on the simplex even when the simplex is high dimensional and/or includes coordinates of differing orders of magnitude. Implementation of this method is shown with the SALTSampler R package and comparisons are made to other simpler sampling schemes to illustrate the improvement in performance this method provides. A simulation of a typical calibration problem also demonstrates the utility of this method.  相似文献   

8.
Through random cut‐points theory, the author extends inference for ordered categorical data to the unspecified continuum underlying the ordered categories. He shows that a random cut‐point Mann‐Whitney test yields slightly smaller p‐values than the conventional test for most data. However, when at least P% of the data lie in one of the k categories (with P = 80 for k = 2, P = 67 for k = 3,…, P = 18 for k = 30), he also shows that the conventional test can yield much smaller p‐values, and hence misleadingly liberal inference for the underlying continuum. The author derives formulas for exact tests; for k = 2, the Mann‐Whitney test is but a binomial test.  相似文献   

9.
The k-means algorithm is one of the most common non hierarchical methods of clustering. It aims to construct clusters in order to minimize the within cluster sum of squared distances. However, as most estimators defined in terms of objective functions depending on global sums of squares, the k-means procedure is not robust with respect to atypical observations in the data. Alternative techniques have thus been introduced in the literature, e.g., the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. In this article, focus is on the error rate these clustering procedures achieve when one expects the data to be distributed according to a mixture distribution. Two different definitions of the error rate are under consideration, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. The consequence of this will be emphasized with the comparison of influence functions and breakdown points of these error rates.  相似文献   

10.
Abstract

Cluster analysis is the distribution of objects into different groups or more precisely the partitioning of a data set into subsets (clusters) so that the data in subsets share some common trait according to some distance measure. Unlike classification, in clustering one has to first decide the optimum number of clusters and then assign the objects into different clusters. Solution of such problems for a large number of high dimensional data points is quite complicated and most of the existing algorithms will not perform properly. In the present work a new clustering technique applicable to large data set has been used to cluster the spectra of 702248 galaxies and quasars having 1,540 points in wavelength range imposed by the instrument. The proposed technique has successfully discovered five clusters from this 702,248X1,540 data matrix.  相似文献   

11.
This paper focuses on unsupervised curve classification in the context of nuclear industry. At the Commissariat à l'Energie Atomique (CEA), Cadarache (France), the thermal-hydraulic computer code CATHARE is used to study the reliability of reactor vessels. The code inputs are physical parameters and the outputs are time evolution curves of a few other physical quantities. As the CATHARE code is quite complex and CPU time-consuming, it has to be approximated by a regression model. This regression process involves a clustering step. In the present paper, the CATHARE output curves are clustered using a k-means scheme, with a projection onto a lower dimensional space. We study the properties of the empirically optimal cluster centres found by the clustering method based on projections, compared with the ‘true’ ones. The choice of the projection basis is discussed, and an algorithm is implemented to select the best projection basis among a library of orthonormal bases. The approach is illustrated on a simulated example and then applied to the industrial problem.  相似文献   

12.
Liu and Singh (1993, 2006) introduced a depth‐based d‐variate extension of the nonparametric two sample scale test of Siegel and Tukey (1960). Liu and Singh (2006) generalized this depth‐based test for scale homogeneity of k ≥ 2 multivariate populations. Motivated by the work of Gastwirth (1965), we propose k sample percentile modifications of Liu and Singh's proposals. The test statistic is shown to be asymptotically normal when k = 2, and compares favorably with Liu and Singh (2006) if the underlying distributions are either symmetric with light tails or asymmetric. In the case of skewed distributions considered in this paper the power of the proposed tests can attain twice the power of the Liu‐Singh test for d ≥ 1. Finally, in the k‐sample case, it is shown that the asymptotic distribution of the proposed percentile modified Kruskal‐Wallis type test is χ2 with k ? 1 degrees of freedom. Power properties of this k‐sample test are similar to those for the proposed two sample one. The Canadian Journal of Statistics 39: 356–369; 2011 © 2011 Statistical Society of Canada  相似文献   

13.
In market research and some other areas, it is common that a sample of n judges (consumers, evaluators, etc.) are asked to independently rank a series of k objects or candidates. It is usually difficult to obtain the judges' full cooperation to completely rank all k objects. A practical way to overcome this difficulty is to give each judge the freedom to choose the number of top candidates he is willing to rank. A frequently encountered question in this type of survey is how to select the best object or candidate from the incompletely ranked data. This paper proposes a subset selection procedure which constructs a random subset of all the k objects involved in the survey such that the best object is included in the subset with a prespecified confidence. It is shown that the proposed subset selection procedure is distribution-free over a very broad class of underlying distributions. An example from a market research study is used to illustrate the proposed procedure.  相似文献   

14.
This paper deals with the problem of finding saturated designs for multivariate cubic regression on a cube which are nearly D-optimal. A finite class of designs is presented for the k dimensional cube having the property that the sequence of the best designs in this class for each k is asymptotically efficient as k increases. A method for constructing good designs in this class is discussed and the construction is carried out for 1?k?8. These numerical results are presented in the last section of the paper.  相似文献   

15.
Icicle Plots: Better Displays for Hierarchical Clustering   总被引:1,自引:0,他引:1  
An icicle plot is a method for presenting a hierarchical clustering. Compared with other methods of presentation, it is far easier in an icicle plot to read off which objects belong to which clusters, and which objects join or drop out from a cluster as we move up and down the levels of the hierarchy, though these benefits only appear when enough objects are being clustered. Icicle plots are described, and their benefits are illustrated using a clustering of 48 objects.  相似文献   

16.
Gnot et al. (J Statist Plann Inference 30(1):223–236, 1992) have presented the formulae for computing Bayes invariant quadratic estimators of variance components in normal mixed linear models of the form where the matrices V i , 1 ≤ ik − 1, are symmetric and nonnegative definite and V k is an identity matrix. These formulae involve a basis of a quadratic subspace containing MV 1 M,...,MV k-1 M,M, where M is an orthogonal projector on the null space of X′. In the paper we discuss methods of construction of such a basis. We survey Malley’s algorithms for finding the smallest quadratic subspace including a given set of symmetric matrices of the same order and propose some modifications of these algorithms. We also consider a class of matrices sharing some of the symmetries common to MV 1 M,...,MV k-1 M,M. We show that the matrices from this class constitute a quadratic subspace and describe its explicit basis, which can be directly used for computing Bayes invariant quadratic estimators of variance components. This basis can be also used for improving the efficiency of Malley’s algorithms when applied to finding a basis of the smallest quadratic subspace containing the matrices MV 1 M,...,MV k-1 M,M. Finally, we present the results of a numerical experiment which confirm the potential usefulness of the proposed methods. Dedicated to the memory of Professor Stanisław Gnot.  相似文献   

17.
For high dimensional data, the SigClust is developed for testing the significance of clustering. The cluster index (CI) for SigClust is conducted by the ratio of the within-cluster and total sum of squares. But its empirical size is too conservative to be over controlled. By removing the cumbrous terms in the CI, an improved index (BCI) is proposed in this paper. The coefficient of variation of the BCI can be significantly reduced, implying that the new index BCI is stable. Moreover, the new significance test (NewSig) maintains the size, meanwhile, provides a greater power. Simulation experiments and two real cancer data examples are analysed for illustrating the performance of the new methodology.  相似文献   

18.
Contours may be viewed as the 2D outline of the image of an object. This type of data arises in medical imaging as well as in computer vision and can be modeled as data on a manifold and can be studied using statistical shape analysis. Practically speaking, each observed contour, while theoretically infinite dimensional, must be discretized for computations. As such, the coordinates for each contour as obtained at k sampling times, resulting in the contour being represented as a k-dimensional complex vector. While choosing large values of k will result in closer approximations to the original contour, this will also result in higher computational costs in the subsequent analysis. The goal of this study is to determine reasonable values for k so as to keep the computational cost low while maintaining accuracy. To do this, we consider two methods for selecting sample points and determine lower bounds for k for obtaining a desired level of approximation error using two different criteria. Because this process is computationally inefficient to perform on a large scale, we then develop models for predicting the lower bounds for k based on simple characteristics of the contours.  相似文献   

19.
The authors consider the situation of incomplete rankings in which n judges independently rank ki ∈ {2, …, t} objects. They wish to test the null hypothesis that each judge picks the ranking at random from the space of ki! permutations of the integers 1, …, ki. The statistic considered is a generalization of the Friedman test in which the ranks assigned by each judge are replaced by real‐valued functions a(j, ki), 1 ≤ jkit of the ranks. The authors define a measure of pairwise similarity between complete rankings based on such functions, and use averages of such similarities to construct measures of the level of concordance of the judges' rankings. In the complete ranking case, the resulting statistics coincide with those defined by Hájek & ?idák (1967, p. 118), and Sen (1968). These measures of similarity are extended to the situation of incomplete rankings. A statistic is derived in this more general situation and its properties are investigated.  相似文献   

20.
We propose a novel method for tensorial‐independent component analysis. Our approach is based on TJADE and k‐JADE, two recently proposed generalizations of the classical JADE algorithm. Our novel method achieves the consistency and the limiting distribution of TJADE under mild assumptions and at the same time offers notable improvement in computational speed. Detailed mathematical proofs of the statistical properties of our method are given and, as a special case, a conjecture on the properties of k‐JADE is resolved. Simulations and timing comparisons demonstrate remarkable gain in speed. Moreover, the desired efficiency is obtained approximately for finite samples. The method is applied successfully to large‐scale video data, for which neither TJADE nor k‐JADE is feasible. Finally, an experimental procedure is proposed to select the values of a set of tuning parameters. Supplementary material including the R‐code for running the examples and the proofs of the theoretical results is available online.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号