Similar Articles
1.
Clustering algorithms are used in the analysis of gene expression data to identify groups of genes with similar expression patterns. These algorithms group genes with respect to a predefined dissimilarity measure, without using any prior classification of the data. Most clustering algorithms require the number of clusters as input, and all objects in the dataset are usually assigned to one of the clusters. We propose a clustering algorithm that finds clusters sequentially and allows for sporadic objects, i.e. objects that are not assigned to any cluster. The proposed sequential clustering algorithm has two steps. First, it finds candidates for cluster centers; multiple candidates are used to make the search for clusters more efficient. Second, it conducts a local search around the candidate centers to find the set of objects that defines a cluster. The candidate clusters are compared using a predefined score, the best cluster is removed from the data, and the procedure is repeated. We investigate the performance of this algorithm using simulated data, and we apply the method to analyze gene expression profiles in a study on the plasticity of dendritic cells.
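The two-step sequential idea (find a candidate center, carve out its cluster, repeat, leave the rest sporadic) can be sketched as follows. This is an illustrative toy, not the authors' algorithm: the density-based center choice, the fixed radius, and the size-based score are all our own assumptions.

```python
import numpy as np

def sequential_clusters(X, radius=1.0, min_size=5):
    """Greedy sketch of sequential clustering with sporadic objects:
    repeatedly take the densest remaining point as a candidate center,
    use its radius-neighbourhood as the next cluster, and stop when no
    candidate cluster is large enough; leftovers stay unassigned (-1)."""
    labels = np.full(len(X), -1)               # -1 marks sporadic objects
    remaining = np.arange(len(X))
    k = 0
    while len(remaining) >= min_size:
        D = np.linalg.norm(X[remaining, None] - X[None, remaining], axis=-1)
        counts = (D < radius).sum(axis=1)      # neighbourhood sizes
        best = counts.argmax()                 # densest point = candidate center
        in_ball = D[best] < radius
        if in_ball.sum() < min_size:           # no acceptable cluster remains
            break
        labels[remaining[in_ball]] = k
        remaining = remaining[~in_ball]
        k += 1
    return labels
```

On two tight blobs plus a far-away point, the blobs come out as clusters 0 and 1 and the stray point stays sporadic.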

2.
ABSTRACT

Various methods have been proposed to estimate intra-cluster correlation coefficients (ICCs) for correlated binary data, and many are very sensitive to the type of design and the underlying distributional assumptions. We propose a new method to estimate the ICC and its 95% confidence interval based on resampling principles and U-statistics, in which pairs of individuals are resampled with replacement from within and between clusters. Our simulation study shows that the resampling-based estimates approximate the population ICC more precisely than the analysis-of-variance and method-of-moments techniques across different event rates, numbers of clusters, and cluster sizes.
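A minimal sketch of the general idea, resampling plus a pairwise (moment-style) ICC estimate for binary outcomes. Note the simplifications: we use a plain within-cluster-pair correlation rather than the authors' U-statistic, and we bootstrap whole clusters rather than resampling pairs of individuals; all names are illustrative.

```python
import numpy as np

def pairwise_icc(clusters):
    """Pairwise moment ICC for binary data: average product of centered
    within-cluster pairs, scaled by the overall variance."""
    y = np.concatenate([np.asarray(c, float) for c in clusters])
    mu, var = y.mean(), max(y.var(), 1e-12)
    num, n_pairs = 0.0, 0
    for c in clusters:
        c = np.asarray(c, float)
        for i in range(len(c)):
            for j in range(i + 1, len(c)):
                num += (c[i] - mu) * (c[j] - mu)
                n_pairs += 1
    return num / (n_pairs * var)

def bootstrap_ci(clusters, B=500, alpha=0.05, seed=0):
    """Percentile interval from resampling whole clusters with replacement."""
    rng = np.random.default_rng(seed)
    stats = [pairwise_icc([clusters[i] for i in
                           rng.integers(0, len(clusters), len(clusters))])
             for _ in range(B)]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```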

3.
Abstract

Cluster analysis is the distribution of objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters) so that the data in each subset share some common trait according to a distance measure. Unlike classification, clustering requires one to first decide the optimum number of clusters and then assign the objects to the different clusters. Solving such problems for a large number of high-dimensional data points is quite complicated, and most existing algorithms will not perform properly. In the present work a new clustering technique applicable to large data sets has been used to cluster the spectra of 702,248 galaxies and quasars, each with 1,540 points in the wavelength range imposed by the instrument. The proposed technique successfully discovered five clusters in this 702,248 × 1,540 data matrix.

4.
Consider k independent exponential distributions, possibly with different location parameters and a common scale parameter. If the best population is defined to be the one having the largest mean, or equivalently the largest location parameter, we derive a set of simultaneous upper confidence bounds for all distances of the means from the largest one. These bounds not only serve as confidence intervals for all distances from the largest parameter, but can also be used to identify the best population. Relationships to ranking and selection procedures are pointed out. Cases in which the scale parameters are known or unknown, and samples are complete or type II censored, are considered. Tables to implement the procedure are given.

5.
Various estimators proposed for the estimation of a common mean are extended to the estimation of the common location parameters for two linear models, including estimators based on preliminary tests of equality of variances. The exact distribution of these estimates, simultaneous confidence bounds based on them, and bounds on their variances are obtained using different approaches.

6.
Distribution-free confidence bands for a distribution function are typically obtained by inverting a distribution-free hypothesis test. We propose an alternate strategy in which the upper and lower bounds of the confidence band are chosen to minimize a narrowness criterion. We derive necessary and sufficient conditions for optimality with respect to such a criterion, and we use these conditions to construct an algorithm for finding optimal bands. We also derive uniqueness results, with the Brunn–Minkowski Inequality from the theory of convex bodies playing a key role in this work. We illustrate the optimal confidence bands using some galaxy velocity data, and we also show that the optimal bands compare favorably to other bands both in terms of power and in terms of area enclosed.

7.
ABSTRACT

Among the statistical methods for modelling the stochastic behaviour of objects, clustering is a preliminary technique for recognizing similar patterns within a group of observations in a data set. Various distances for measuring differences among objects can be invoked to cluster data through numerous clustering methods. When the variables at hand contain geometrical information about the objects, such metrics should be adequately adapted; indeed, statistical methods for these data are endowed with a geometrical paradigm in a multivariate sense. In this paper, a procedure for clustering shape data employing appropriate metrics is suggested. The best candidate shape distance, as well as a suitable agglomerative method for clustering the simulated shape data, is then identified using cluster validation measures. The procedure is illustrated with a real-life application.

8.

Kaufman and Rousseeuw (1990) proposed the clustering algorithm Partitioning Around Medoids (PAM), which maps a distance matrix into a specified number of clusters. A particularly nice property is that PAM allows clustering with respect to any specified distance metric. In addition, the medoids are robust representations of the cluster centers, which is particularly important in the common situation that many elements do not belong well to any cluster. Based on our experience in clustering gene expression data, we have noticed that PAM has problems recognizing relatively small clusters in situations where good partitions around medoids clearly exist. In this paper, we propose to partition around medoids by maximizing the criterion "Average Silhouette" defined by Kaufman and Rousseeuw (1990). We also propose a fast-to-compute approximation of the Average Silhouette. We implement these two new partitioning-around-medoids algorithms and illustrate their performance relative to existing partitioning methods in simulations.
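The Average Silhouette criterion itself is standard and easy to compute from a distance matrix: for each object, s(i) = (b_i − a_i) / max(a_i, b_i), where a_i is the mean distance to the object's own cluster and b_i the smallest mean distance to another cluster. A self-contained sketch (the singleton convention and names are ours):

```python
import numpy as np

def average_silhouette(D, labels):
    """Average silhouette width of a partition, given a symmetric
    distance matrix D and an integer label per object."""
    s = []
    for i in range(len(D)):
        own = (labels == labels[i])
        own[i] = False                     # exclude the object itself
        if not own.any():
            s.append(0.0)                  # singleton clusters score 0
            continue
        a = D[i, own].mean()               # mean distance within own cluster
        b = min(D[i, labels == c].mean()   # nearest other cluster
                for c in set(labels) if c != labels[i])
        s.append((b - a) / max(a, b))
    return float(np.mean(s))
```

Maximizing this value over candidate medoid partitions is the selection rule the abstract describes: a good partition of well-separated data scores close to 1, a scrambled one much lower.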

9.
This paper introduces a method for clustering spatially dependent functional data. The idea is to consider the contribution of each curve to the spatial variability. Thus, we define a spatial dispersion function associated with each curve and perform a k-means-like clustering algorithm. The algorithm is based on the optimization of a fitting criterion between the spatial dispersion functions associated with the curves and the representatives of the clusters. The performance of the proposed method is illustrated by an application to real data and a simulation study.

10.
Counting by weighing is widely used in industry and is often more efficient than counting manually, which is time-consuming and prone to human error, especially when the number of items is large. Lower confidence bounds on the numbers of items in infinitely many future bags, based on the weights of the bags, were proposed recently in Liu et al. [Counting by weighing: Know your numbers with confidence, J. Roy. Statist. Soc. Ser. C 65(4) (2016), pp. 641–648]. These confidence bounds are constructed using the data from one calibration experiment and for different parameters (or numbers), but have a frequency interpretation similar to a usual confidence set for one parameter only. In this paper, the more challenging problem of constructing two-sided confidence intervals is studied. A simulation-based method for computing the critical constant is proposed. This method is proven to give the required critical constant as the number of simulations goes to infinity, and is shown to be easily implemented on an ordinary computer to compute the critical constant accurately and quickly. The methodology is illustrated with a real data example.

11.
Summary. Non-hierarchical clustering methods are frequently based on the idea of forming groups around 'objects'. The main exponent of this class of methods is the k-means method, where these objects are points. However, clusters in a data set may often be due to certain relationships between the measured variables. For instance, we can find linear structures such as straight lines and planes, around which the observations are grouped in a natural way. These structures are not well represented by points. We present a method that searches for linear groups in the presence of outliers. The method is based on the idea of impartial trimming. We search for the 'best' subsample containing a proportion 1 − α of the data and the best k affine subspaces fitting those non-discarded observations, measuring discrepancies through orthogonal distances. The population version of the sample problem is also considered. We prove the existence of solutions for the sample and population problems, together with their consistency. A feasible algorithm for solving the sample problem is described as well. Finally, some examples showing how the proposed method works in practice are provided.

12.
Inference for clusters of extreme values
Summary. Inference for clusters of extreme values of a time series typically requires the identification of independent clusters of exceedances over a high threshold. The choice of declustering scheme often has a significant effect on estimates of cluster characteristics. We propose an automatic declustering scheme that is justified by an asymptotic result for the times between threshold exceedances. The scheme relies on the extremal index, which we show may be estimated before declustering, and supports a bootstrap procedure for assessing the variability of estimates.
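The extremal index can indeed be estimated from inter-exceedance times alone, before any declustering. One standard way to do this is the intervals estimator of Ferro and Segers (2003); the sketch below implements that estimator as an illustration of the idea, and is not necessarily the exact estimator used in this paper.

```python
import numpy as np

def intervals_estimator(x, u):
    """Intervals estimator of the extremal index from the
    inter-exceedance times T of the threshold u (Ferro & Segers style):
    theta near 1 means exceedances behave independently; small theta
    means they arrive in clusters of mean size about 1/theta."""
    times = np.flatnonzero(np.asarray(x) > u)
    T = np.diff(times).astype(float)
    if len(T) == 0:
        return 1.0
    if T.max() <= 2:
        theta = 2 * T.sum() ** 2 / (len(T) * (T ** 2).sum())
    else:
        theta = (2 * (T - 1).sum() ** 2
                 / (len(T) * ((T - 1) * (T - 2)).sum()))
    return min(1.0, theta)
```

For i.i.d. data the estimate is near 1; a moving-maximum transform of the same series (true extremal index 1/3) gives a clearly smaller value.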

13.
We consider confidence bands for continuous distribution functions. Following a review of the literature we find that previously considered confidence bands, which have exact coverage, are all step-functions jumping only at the sample points. We find that the step-function bands can be constructed through rectangular tolerance regions for an ordered sample from the uniform distribution R(0, 1). We then construct a set of new bands. Two criteria for assessing confidence bands are presented. One is the power criterion, and the other is the average-width criterion that we propose. Numerical comparisons between our new bands and the old bands are carried out, and show that our new bands perform much better than the old ones.
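As a concrete example of a step-function band jumping only at the sample points, the classical Dvoretzky–Kiefer–Wolfowitz band places the empirical CDF plus/minus a fixed epsilon; it is conservative rather than exact, and is shown here only to fix ideas, not as one of the paper's new bands.

```python
import numpy as np

def dkw_band(sample, alpha=0.05):
    """DKW confidence band for a continuous CDF: at each order statistic,
    lower = F_n(x) - eps and upper = left-limit of F_n at x plus eps,
    with eps = sqrt(log(2/alpha) / (2n)), clipped to [0, 1]."""
    x = np.sort(sample)
    n = len(x)
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))
    Fn = np.arange(1, n + 1) / n
    lower = np.clip(Fn - eps, 0, 1)
    upper = np.clip(Fn - 1 / n + eps, 0, 1)   # left limit at the jump + eps
    return x, lower, upper
```

Away from the clipped tails the band has constant width 2·eps − 1/n, which is exactly the fixed-width behaviour that criteria such as average width are designed to improve on.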

14.
We consider Dirichlet process mixture models in which the observed clusters in any particular dataset are not viewed as belonging to a finite set of possible clusters, but rather as representatives of a latent structure in which objects belong to one of a potentially infinite number of clusters. As more information is revealed, the number of inferred clusters is allowed to grow. The precision parameter of the Dirichlet process is a crucial parameter that controls the number of clusters. We develop a framework for specifying the hyperparameters of the prior for the precision parameter that can be used in both the presence and absence of subjective prior information about the level of clustering. Our approach is illustrated in an analysis of clustering brands at the magazine Which?. The results are compared with the approach of Dorazio (2009) via a simulation study.

15.
Silhouette information evaluates the quality of the partition detected by a clustering technique. Since it is based on a measure of distance between the clustered observations, its standard formulation is not adequate when a density-based clustering technique is used. In this work we propose a suitable modification of the Silhouette information aimed at evaluating the quality of clusters in a density-based framework. It is based on the estimated posterior probabilities that the data belong to the clusters, and may be used both to measure our confidence in the allocation of data to clusters and to choose the best partition among several.
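One plausible formalization of a posterior-probability silhouette, offered purely as a sketch (this is our own guess at the construction, not the authors' definition): contrast each observation's best and second-best cluster posteriors on the log scale and rescale to [0, 1].

```python
import numpy as np

def density_silhouette(post):
    """Density-based silhouette sketch from an n x K matrix of posterior
    cluster-membership probabilities: log-ratio of the largest to the
    second-largest posterior, rescaled so the most confident point is 1.
    Confident allocations score near 1, ambiguous ones near 0."""
    post = np.asarray(post, float)
    srt = np.sort(post, axis=1)
    dbs = np.log(np.maximum(srt[:, -1], 1e-12)
                 / np.maximum(srt[:, -2], 1e-12))
    return dbs / max(dbs.max(), 1e-12)
```

Averaging these scores over the data gives a single partition-quality number that can be compared across candidate partitions, in the same spirit as the average silhouette width.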

16.
Icicle Plots: Better Displays for Hierarchical Clustering
An icicle plot is a method for presenting a hierarchical clustering. Compared with other methods of presentation, it is far easier in an icicle plot to read off which objects belong to which clusters, and which objects join or drop out from a cluster as we move up and down the levels of the hierarchy, though these benefits only appear when enough objects are being clustered. Icicle plots are described, and their benefits are illustrated using a clustering of 48 objects.

17.
The forward search is a method of robust data analysis in which outlier-free subsets of the data, of increasing size, are used in model fitting; the data are then ordered by closeness to the model. Here the forward search, with many random starts, is used to cluster multivariate data. These random starts lead to the diagnostic identification of tentative clusters. Applying the forward search to each tentative cluster then establishes cluster membership, by identifying non-members as outlying. The method requires no prior information on the number of clusters and does not seek to classify all observations. These properties are illustrated by the analysis of 200 six-dimensional observations on Swiss banknotes. The importance of linked plots and brushing in elucidating data structures is illustrated. We also provide an automatic method for determining cluster centres and compare the behaviour of our method with model-based clustering. In a simulated example with eight clusters, our method provides more stable and accurate solutions than model-based clustering. We consider the computational requirements of both procedures.
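The core forward-search mechanic, growing a clean subset one observation at a time and recording the entry order, can be sketched as below. This is a deliberately minimal version: it uses one deterministic robust start rather than the many random starts of the paper, and a greedy one-at-a-time growth rule.

```python
import numpy as np

def forward_search_order(X, m0=10):
    """Minimal forward-search sketch: start from the m0 points nearest the
    coordinate-wise median, refit mean and covariance at each step, and
    add the observation with the smallest Mahalanobis distance to the
    current fit.  Outliers enter last, so the entry order ranks the data
    by closeness to the model."""
    d0 = np.linalg.norm(X - np.median(X, axis=0), axis=1)
    subset = list(np.argsort(d0)[:m0])          # robust deterministic start
    while len(subset) < len(X):
        S = X[subset]
        mu = S.mean(axis=0)
        cov = np.cov(S.T) + 1e-9 * np.eye(X.shape[1])   # tiny ridge
        diff = X - mu
        d = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        d[subset] = np.inf                      # already in the subset
        subset.append(int(np.argmin(d)))
    return subset
```

With a few gross outliers appended to a clean sample, those outliers are the last observations to enter, which is exactly the diagnostic signal the search exploits.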

18.
Data analysts frequently calculate power and sample size for a planned study using mean and variance estimates from an initial trial. Hence power, or the sample size needed to achieve a fixed power, varies randomly. Such calculations can be very inaccurate in the General Linear Univariate Model (GLUM). Biased noncentrality estimators and censored power calculations create inaccuracy. Censoring occurs if only certain outcomes of an initial trial lead to a power calculation. For example, a confirmatory study may be planned (and a sample size estimated) only following a significant result in the initial trial.

Computing accurate point estimates or confidence bounds of GLUM noncentrality, power, or sample size in the presence of censoring involves truncated noncentral F distributions. We recommend confidence bounds, whether or not censoring occurs. A power analysis of data from humans exposed to carbon monoxide demonstrates the substantial impact on sample size that may occur. The results highlight potential biases and should aid study planning and interpretation.

19.
A proper understanding and modelling of the behaviour of heavily loaded, large-scale electrical transmission systems is essential for secure and uninterrupted operation. In this paper, we present methods to cluster electrical power networks into regions based on different criteria. These regions are useful for the efficient modelling of large transcontinental electricity networks, for switching operation decisions, or for the placement of redundant parts of the monitoring and control system. In alternating-current electricity networks, power oscillations are normal, but they can become dangerous if they build up. The first approach uses the correlation between results of a stability assessment for these oscillations at every node as the cluster criterion. The second method concentrates on the network topology and uses spectral clustering on the network graph to create clusters in which all nodes are interconnected. We also discuss how to choose the right number of clusters, and how the discussed clustering methods can be used for efficient modelling of large electricity networks or in protection and control systems.
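Spectral clustering on a network graph works by embedding the nodes with the smallest eigenvectors of the graph Laplacian and clustering in that embedding. A self-contained sketch of the standard normalized-Laplacian recipe (this is the generic method, not the paper's specific power-grid pipeline):

```python
import numpy as np

def spectral_partition(A, k):
    """Spectral clustering sketch on an adjacency matrix A: embed nodes
    with the k smallest eigenvectors of the normalized Laplacian
    L = I - D^{-1/2} A D^{-1/2}, row-normalize, then run a naive
    k-means loop in the embedding (no external clustering library)."""
    d = A.sum(axis=1)
    Dm = np.diag(1 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(A)) - Dm @ A @ Dm
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    centers = U[np.linspace(0, len(U) - 1, k).astype(int)].copy()
    for _ in range(50):                          # naive k-means iterations
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = U[labels == j].mean(axis=0)
    return labels
```

On a graph made of two disconnected cliques, the two components come out as the two clusters, matching the intuition that spectral clusters are internally well connected.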

20.
Cluster analysis is an important technique of explorative data mining. It refers to a collection of statistical methods for learning the structure of data by solely exploring pairwise distances or similarities. Often, meaningful structures are not detectable in these high-dimensional feature spaces, as relevant features can be obfuscated by noise from irrelevant measurements. These observations led to the design of subspace clustering algorithms, which can identify clusters that originate from different subsets of features. Hunting for clusters in arbitrary subspaces is intractable due to the infinite search space spanned by all feature combinations. In this work, we present a subspace clustering algorithm that can be applied to exhaustively screen all feature combinations of small- or medium-sized datasets (approximately 30 features). Based on a robustness analysis via subsampling, we are able to identify a set of stable candidate subspace cluster solutions.
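Exhaustive subspace screening amounts to enumerating feature subsets, clustering each projection, and scoring the result. A toy version of that loop is below; the between/within variance-ratio score and the k-means inner loop are our own illustrative choices, and the paper's additional subsampling-based stability analysis is omitted.

```python
import itertools
import numpy as np

def kmeans_labels(X, k, iters=50, seed=0):
    """Naive k-means used only as the inner clustering step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def screen_subspaces(X, k=2, max_dim=2):
    """Score every feature subset of size <= max_dim by the
    between/within sum-of-squares ratio of its k-means partition and
    return the best-scoring subspace."""
    best, best_score = None, -np.inf
    for r in range(1, max_dim + 1):
        for cols in itertools.combinations(range(X.shape[1]), r):
            Y = X[:, cols]
            labels = kmeans_labels(Y, k)
            within = sum(((Y[labels == j] - Y[labels == j].mean(0)) ** 2).sum()
                         for j in set(labels))
            total = ((Y - Y.mean(0)) ** 2).sum()
            score = (total - within) / max(within, 1e-12)
            if score > best_score:
                best, best_score = cols, score
    return best, best_score
```

With two informative features and two pure-noise features, the screening picks a subspace built from the informative ones, which is the behaviour the exhaustive search is designed to recover.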

