期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Clustering large number of extragalactic spectra of galaxies and quasars through canopies

Tuli De Didier Fraix Burnet Asis Kumar Chattopadhyay 《统计学通讯:理论与方法》2013,42(9):2638-2653

Abstract

Cluster analysis is the distribution of objects into different groups or more precisely the partitioning of a data set into subsets (clusters) so that the data in subsets share some common trait according to some distance measure. Unlike classification, in clustering one has to first decide the optimum number of clusters and then assign the objects into different clusters. Solution of such problems for a large number of high dimensional data points is quite complicated and most of the existing algorithms will not perform properly. In the present work a new clustering technique applicable to large data set has been used to cluster the spectra of 702248 galaxies and quasars having 1,540 points in wavelength range imposed by the instrument. The proposed technique has successfully discovered five clusters from this 702,248X1,540 data matrix. 相似文献

2.

Cluster analysis of massive datasets in astronomy

Woncheol Jang Martin Hendry 《Statistics and Computing》2007,17(3):253-262

Clusters of galaxies are a useful proxy to trace the distribution of mass in the universe. By measuring the mass of clusters of galaxies on different scales, one can follow the evolution of the mass distribution (Martínez and Saar, Statistics of the Galaxy Distribution, 2002). It can be shown that finding galaxy clusters is equivalent to finding density contour clusters (Hartigan, Clustering Algorithms, 1975): connected components of the level set S _c≡{f>c} where f is a probability density function. Cuevas et al. (Can. J. Stat. 28, 367–382, 2000; Comput. Stat. Data Anal. 36, 441–459, 2001) proposed a nonparametric method for density contour clusters, attempting to find density contour clusters by the minimal spanning tree. While their algorithm is conceptually simple, it requires intensive computations for large datasets. We propose a more efficient clustering method based on their algorithm with the Fast Fourier Transform (FFT). The method is applied to a study of galaxy clustering on large astronomical sky survey data. 相似文献

3.

Bayesian nonparametric clustering for large data sets

Zuanetti Daiane Aparecida Müller Peter Zhu Yitan Yang Shengjie Ji Yuan 《Statistics and Computing》2019,29(2):203-215

We propose two nonparametric Bayesian methods to cluster big data and apply them to cluster genes by patterns of gene–gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion which requires a single cycle (or few cycles) of simple deterministic calculations for each observation under study. The second scheme is an exact method that divides the data into smaller subsamples and involves local partitions that can be determined in parallel. In a second step, the method requires only the sufficient statistics of each of these local clusters to derive global clusters. Under simulated and benchmark data sets the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene–gene interactions extracted from the online search tool “Zodiac.”

相似文献

4.

A comparative study of the K-means algorithm and the normal mixture model for clustering: Bivariate homoscedastic case

Dingxi Qiu 《Journal of statistical planning and inference》2010

The K-means algorithm and the normal mixture model method are two common clustering methods. The K-means algorithm is a popular heuristic approach which gives reasonable clustering results if the component clusters are ball-shaped. Currently, there are no analytical results for this algorithm if the component distributions deviate from the ball-shape. This paper analytically studies how the K-means algorithm changes its classification rule as the normal component distributions become more elongated under the homoscedastic assumption and compares this rule with that of the Bayes rule from the mixture model method. We show that the classification rules of both methods are linear, but the slopes of the two classification lines change in the opposite direction as the component distributions become more elongated. The classification performance of the K-means algorithm is then compared to that of the mixture model method via simulation. The comparison, which is limited to two clusters, shows that the K-means algorithm provides poor classification performances consistently as the component distributions become more elongated while the mixture model method can potentially, but not necessarily, take advantage of this change and provide a much better classification performance. 相似文献

5.

The cluster correlation-network support vector machine for high-dimensional binary classification

Rachid Kharoubi Abdallah Mkhadri 《Journal of Statistical Computation and Simulation》2019,89(6):1020-1043

ABSTRACT

Identifying homogeneous subsets of predictors in classification can be challenging in the presence of high-dimensional data with highly correlated variables. We propose a new method called cluster correlation-network support vector machine (CCNSVM) that simultaneously estimates clusters of predictors that are relevant for classification and coefficients of penalized SVM. The new CCN penalty is a function of the well-known Topological Overlap Matrix whose entries measure the strength of connectivity between predictors. CCNSVM implements an efficient algorithm that alternates between searching for predictors’ clusters and optimizing a penalized SVM loss function using Majorization–Minimization tricks and a coordinate descent algorithm. This combining of clustering and sparsity into a single procedure provides additional insights into the power of exploring dimension reduction structure in high-dimensional binary classification. Simulation studies are considered to compare the performance of our procedure to its competitors. A practical application of CCNSVM on DNA methylation data illustrates its good behaviour. 相似文献

6.

Threshold knot selection for large-scale spatial models with applications to the Deepwater Horizon disaster

Casey M. Jelsema Richard K. Kwok Shyamal D. Peddada 《Journal of Statistical Computation and Simulation》2019,89(11):2121-2137

Large spatial datasets are typically modelled through a small set of knot locations; often these locations are specified by the investigator by arbitrary criteria. Existing methods of estimating the locations of knots assume their number is known a priori, or are otherwise computationally intensive. We develop a computationally efficient method of estimating both the location and number of knots for spatial mixed effects models. Our proposed algorithm, Threshold Knot Selection (TKS), estimates knot locations by identifying clusters of large residuals and placing a knot in the centroid of those clusters. We conduct a simulation study showing TKS in relation to several comparable methods of estimating knot locations. Our case study utilizes data of particulate matter concentrations collected during the course of the response and clean-up effort from the 2010 Deepwater Horizon oil spill in the Gulf of Mexico. 相似文献

7.

Subsemble: an ensemble method for combining subset-specific algorithm fits

Stephanie Sapp Mark J. van der Laan John Canny 《Journal of applied statistics》2014,41(6):1247-1259

Ensemble methods using the same underlying algorithm trained on different subsets of observations have recently received increased attention as practical prediction tools for massive data sets. We propose Subsemble: a general subset ensemble prediction method, which can be used for small, moderate, or large data sets. Subsemble partitions the full data set into subsets of observations, fits a specified underlying algorithm on each subset, and uses a clever form of V-fold cross-validation to output a prediction function that combines the subset-specific fits. We give an oracle result that provides a theoretical performance guarantee for Subsemble. Through simulations, we demonstrate that Subsemble can be a beneficial tool for small- to moderate-sized data sets, and often has better prediction performance than the underlying algorithm fit just once on the full data set. We also describe how to include Subsemble as a candidate in a SuperLearner library, providing a practical way to evaluate the performance of Subsemble relative to the underlying algorithm fit just once on the full data set. 相似文献

8.

Classification of multivariate non-stationary signals: The SLEX-shrinkage approach

Hilmar Böhm Hernando Ombao Rainer von Sachs Jerome Sanes 《Journal of statistical planning and inference》2010

相似文献

9.

Clustering work and family trajectories by using a divisive algorithm

Raffaella Piccarreta Francesco C. Billari 《Journal of the Royal Statistical Society. Series A, (Statistics in Society)》2007,170(4):1061-1078

Summary. We present an approach to the construction of clusters of life course trajectories and use it to obtain ideal types of trajectories that can be interpreted and analysed meaningfully. We represent life courses as sequences on a monthly timescale and apply optimal matching analysis to compute dissimilarities between individuals. We introduce a new divisive clustering algorithm which has features that are in common with both Ward's agglomerative algorithm and classification and regression trees. We analyse British Household Panel Survey data on the employment and family trajectories of women. Our method produces clusters of sequences for which it is straightforward to determine who belongs to each cluster, making it easier to interpret the relative importance of life course factors in distinguishing subgroups of the population. Moreover our method gives guidance on selecting the number of clusters. 相似文献

10.

Quantifying the similarity of 2D images using edge pixels: an application to the forensic comparison of footwear impressions

Soyoung Park Alicia Carriquiry 《Journal of applied statistics》2021,48(10):1833

We propose a novel method to quantify the similarity between an impression (Q) from an unknown source and a test impression (K) from a known source. Using the property of geometrical congruence in the impressions, the degree of correspondence is quantified using ideas from graph theory and maximum clique (MC). The algorithm uses the x and y coordinates of the edges in the images as the data. We focus on local areas in Q and the corresponding regions in K and extract features for comparison. Using pairs of images with known origin, we train a random forest to classify pairs into mates and non-mates. We collected impressions from 60 pairs of shoes of the same brand and model, worn over six months. Using a different set of very similar shoes, we evaluated the performance of the algorithm in terms of the accuracy with which it correctly classified images into source classes. Using classification error rates and ROC curves, we compare the proposed method to other algorithms in the literature and show that for these data, our method shows good classification performance relative to other methods. The algorithm can be implemented with the R package shoeprintr. 相似文献

11.

An online classification EM algorithm based on the mixture model

Allou Samé Christophe Ambroise Gérard Govaert 《Statistics and Computing》2007,17(3):209-218

Mixture model-based clustering is widely used in many applications. In certain real-time applications the rapid increase of data size with time makes classical clustering algorithms too slow. An online clustering algorithm based on mixture models is presented in the context of a real-time flaw-diagnosis application for pressurized containers which uses data from acoustic emission signals. The proposed algorithm is a stochastic gradient algorithm derived from the classification version of the EM algorithm (CEM). It provides a model-based generalization of the well-known online k-means algorithm, able to handle non-spherical clusters. Using synthetic and real data sets, the proposed algorithm is compared with the batch CEM algorithm and the online EM algorithm. The three approaches generate comparable solutions in terms of the resulting partition when clusters are relatively well separated, but online algorithms become faster as the size of the available observations increases. 相似文献

12.

A heteroscedastic structural errors-in-variables model with equation error

A.G. Patriota H. Bolfarine M. de Castro 《Statistical Methodology》2009,6(4):408-423

It is not uncommon with astrophysical and epidemiological data sets that the variances of the observations are accessible from an analytical treatment of the data collection process. Moreover, in a regression model, heteroscedastic measurement errors and equation errors are common situations when modelling such data. This article deals with the limiting distribution of the maximum-likelihood and method-of-moments estimators for the line parameters of the regression model. We use the delta method to achieve it, making it possible to build joint confidence regions and hypothesis testing. This technique produces closed expressions for the asymptotic covariance matrix of those estimators. In the moment approach we do not assign any distribution for the unobservable covariate while with the maximum-likelihood approach, we assume a normal distribution. We also conduct simulation studies of rejection rates for Wald-type statistics in order to verify the test size and power. Practical applications are reported for a data set produced by the Chandra observatory and also from the WHO MONICA Project on cardiovascular disease. 相似文献

13.

A sequential clustering algorithm with applications to gene expression data

Jongwoo Song Dan L. Nicolae 《Journal of the Korean Statistical Society》2009,38(2):175-184

Clustering algorithms are used in the analysis of gene expression data to identify groups of genes with similar expression patterns. These algorithms group genes with respect to a predefined dissimilarity measure without using any prior classification of the data. Most of the clustering algorithms require the number of clusters as input, and all the objects in the dataset are usually assigned to one of the clusters. We propose a clustering algorithm that finds clusters sequentially, and allows for sporadic objects, so there are objects that are not assigned to any cluster. The proposed sequential clustering algorithm has two steps. First it finds candidates for centers of clusters. Multiple candidates are used to make the search for clusters more efficient. Secondly, it conducts a local search around the candidate centers to find the set of objects that defines a cluster. The candidate clusters are compared using a predefined score, the best cluster is removed from data, and the procedure is repeated. We investigate the performance of this algorithm using simulated data and we apply this method to analyze gene expression profiles in a study on the plasticity of the dendritic cells. 相似文献

14.

Modified spline regression based on randomly right-censored data: A comparative study

Dursun Aydin 《统计学通讯:模拟与计算》2013,42(9):2587-2611

ABSTRACT

In this paper, we propose modified spline estimators for nonparametric regression models with right-censored data, especially when the censored response observations are converted to synthetic data. Efficient implementation of these estimators depends on the set of knot points and an appropriate smoothing parameter. We use three algorithms, the default selection method (DSM), myopic algorithm (MA), and full search algorithm (FSA), to select the optimum set of knots in a penalized spline method based on a smoothing parameter, which is chosen based on different criteria, including the improved version of the Akaike information criterion (AICc), generalized cross validation (GCV), restricted maximum likelihood (REML), and Bayesian information criterion (BIC). We also consider the smoothing spline (SS), which uses all the data points as knots. The main goal of this study is to compare the performance of the algorithm and criteria combinations in the suggested penalized spline fits under censored data. A Monte Carlo simulation study is performed and a real data example is presented to illustrate the ideas in the paper. The results confirm that the FSA slightly outperforms the other methods, especially for high censoring levels. 相似文献

15.

A weighted least-squares approach to clusterwise regression

Rainer Schlittgen 《AStA Advances in Statistical Analysis》2011,95(2):205-217

Clusterwise regression aims to cluster data sets where the clusters are characterized by their specific regression coefficients in a linear regression model. In this paper, we propose a method for determining a partition which uses an idea of robust regression. We start with some random weighting to determine a start partition and continue in the spirit of M-estimators. The residuals for all regressions are used to assign the observations to the different groups. As target function we use the determination coefficient R²_wR^{2}_{w} for the overall model. This coefficient is suitably defined for weighted regression. 相似文献

16.

Model-based clustering of multivariate skew data with circular components and missing values

Francesco Lagona Marco Picone 《Journal of applied statistics》2012,39(5):927-945

Motivated by classification issues that arise in marine studies, we propose a latent-class mixture model for the unsupervised classification of incomplete quadrivariate data with two linear and two circular components. The model integrates bivariate circular densities and bivariate skew normal densities to capture the association between toroidal clusters of bivariate circular observations and planar clusters of bivariate linear observations. Maximum-likelihood estimation of the model is facilitated by an expectation maximization (EM) algorithm that treats unknown class membership and missing values as different sources of incomplete information. The model is exploited on hourly observations of wind speed and direction and wave height and direction to identify a number of sea regimes, which represent specific distributional shapes that the data take under environmental latent conditions. 相似文献

17.

Genetic Algorithm in the Wavelet Domain for Large p Small n Regression

Eylem Deniz Howe Orietta Nicolis 《统计学通讯:模拟与计算》2015,44(5):1144-1157

Many areas of statistical modeling are plagued by the “curse of dimensionality,” in which there are more variables than observations. This is especially true when developing functional regression models where the independent dataset is some type of spectral decomposition, such as data from near-infrared spectroscopy. While we could develop a very complex model by simply taking enough samples (such that n > p), this could prove impossible or prohibitively expensive. In addition, a regression model developed like this could turn out to be highly inefficient, as spectral data usually exhibit high multicollinearity. In this article, we propose a two-part algorithm for selecting an effective and efficient functional regression model. Our algorithm begins by evaluating a subset of discrete wavelet transformations, allowing for variation in both wavelet and filter number. Next, we perform an intermediate processing step to remove variables with low correlation to the response data. Finally, we use the genetic algorithm to perform a stochastic search through the subset regression model space, driven by an information-theoretic objective function. We allow our algorithm to develop the regression model for each response variable independently, so as to optimally model each variable. We demonstrate our method on the familiar biscuit dough dataset, which has been used in a similar context by several researchers. Our results demonstrate both the flexibility and the power of our algorithm. For each response variable, a different subset model is selected, and different wavelet transformations are used. The models developed by our algorithm show an improvement, as measured by lower mean error, over results in the published literature. 相似文献

18.

A Pitman measure of similarity in k-means for clustering heavy-tailed data

Arman Reybod Javad Etminan Adel Mohammadpour 《统计学通讯:模拟与计算》2019,48(6):1595-1605

One of the most popular methods and algorithms to partition data to k clusters is k-means clustering algorithm. Since this method relies on some basic conditions such as, the existence of mean and finite variance, it is unsuitable for data that their variances are infinite such as data with heavy tailed distribution. Pitman Measure of Closeness (PMC) is a criterion to show how much an estimator is close to its parameter with respect to another estimator. In this article using PMC, based on k-means clustering, a new distance and clustering algorithm is developed for heavy tailed data. 相似文献

19.

Modular-transform based clustering

Gang Wang Jun Wang Mingyu Wang 《Journal of applied statistics》2013,40(12):2749-2759

Spectral clustering uses eigenvectors of the Laplacian of the similarity matrix. It is convenient to solve binary clustering problems. When applied to multi-way clustering, either the binary spectral clustering is recursively applied or an embedding to spectral space is done and some other methods, such as K-means clustering, are used to cluster the points. Here we propose and study a K-way clustering algorithm – spectral modular transformation, based on the fact that the graph Laplacian has an equivalent representation, which has a diagonal modular structure. The method first transforms the original similarity matrix into a new one, which is nearly disconnected and reveals a cluster structure clearly, then we apply linearized cluster assignment algorithm to split the clusters. In this way, we can find some samples for each cluster recursively using the divide and conquer method. To get the overall clustering results, we apply the cluster assignment obtained in the previous step as the initialization of multiplicative update method for spectral clustering. Examples show that our method outperforms spectral clustering using other initializations. 相似文献

20.

Computing exact clustering posteriors with subset convolution

Jukka Kohonen Jukka Corander 《统计学通讯:理论与方法》2013,42(10):3048-3058

ABSTRACT

An exponential-time exact algorithm is provided for the task of clustering n items of data into k clusters. Instead of seeking one partition, posterior probabilities are computed for summary statistics: the number of clusters and pairwise co-occurrence. The method is based on subset convolution and yields the posterior distribution for the number of clusters in O(n3ⁿ) operations or O(n³2ⁿ) using fast subset convolution. Pairwise co-occurrence probabilities are then obtained in O(n³2ⁿ) operations. This is considerably faster than exhaustive enumeration of all partitions. 相似文献