首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
ABSTRACT

In a changing climate, changes in timing of seasonal events such as floods and flowering should be assessed using circular methods. Six different methods for clustering on a circle and one linear method are compared across different locations, spreads, and sample sizes. Best results are obtained when clusters are well separated and the number of observations in each cluster is approximately equal. Simulations of flood-like distributions are used to assess and explore clustering methods. Generally, k-means provides results that are close to the expected results, some other methods perform well under specific conditions, but no single method is exemplary.  相似文献   

2.
The support vector machine (SVM) has been successfully applied to various classification areas with great flexibility and a high level of classification accuracy. However, the SVM is not suitable for the classification of large or imbalanced datasets because of significant computational problems and a classification bias toward the dominant class. The SVM combined with the k-means clustering (KM-SVM) is a fast algorithm developed to accelerate both the training and the prediction of SVM classifiers by using the cluster centers obtained from the k-means clustering. In the KM-SVM algorithm, however, the penalty of misclassification is treated equally for each cluster center even though the contributions of different cluster centers to the classification can be different. In order to improve classification accuracy, we propose the WKM–SVM algorithm which imposes different penalties for the misclassification of cluster centers by using the number of data points within each cluster as a weight. As an extension of the WKM–SVM, the recovery process based on WKM–SVM is suggested to incorporate the information near the optimal boundary. Furthermore, the proposed WKM–SVM can be successfully applied to imbalanced datasets with an appropriate weighting strategy. Experiments show the effectiveness of our proposed methods.  相似文献   

3.
Clustering gene expression time course data is an important problem in bioinformatics because understanding which genes behave similarly can lead to the discovery of important biological information. Statistically, the problem of clustering time course data is a special case of the more general problem of clustering longitudinal data. In this paper, a very general and flexible model-based technique is used to cluster longitudinal data. Mixtures of multivariate t-distributions are utilized, with a linear model for the mean and a modified Cholesky-decomposed covariance structure. Constraints are placed upon the covariance structure, leading to a novel family of mixture models, including parsimonious models. In addition to model-based clustering, these models are also used for model-based classification, i.e., semi-supervised clustering. Parameters, including the component degrees of freedom, are estimated using an expectation-maximization algorithm and two different approaches to model selection are considered. The models are applied to simulated data to illustrate their efficacy; this includes a comparison with their Gaussian analogues—the use of these Gaussian analogues with a linear model for the mean is novel in itself. Our family of multivariate t mixture models is then applied to two real gene expression time course data sets and the results are discussed. We conclude with a summary, suggestions for future work, and a discussion about constraining the degrees of freedom parameter.  相似文献   

4.
The authors consider the problem of simultaneous transformation and variable selection for linear regression. They propose a fully Bayesian solution to the problem, which allows averaging over all models considered including transformations of the response and predictors. The authors use the Box‐Cox family of transformations to transform the response and each predictor. To deal with the change of scale induced by the transformations, the authors propose to focus on new quantities rather than the estimated regression coefficients. These quantities, referred to as generalized regression coefficients, have a similar interpretation to the usual regression coefficients on the original scale of the data, but do not depend on the transformations. This allows probabilistic statements about the size of the effect associated with each variable, on the original scale of the data. In addition to variable and transformation selection, there is also uncertainty involved in the identification of outliers in regression. Thus, the authors also propose a more robust model to account for such outliers based on a t‐distribution with unknown degrees of freedom. Parameter estimation is carried out using an efficient Markov chain Monte Carlo algorithm, which permits moves around the space of all possible models. Using three real data sets and a simulated study, the authors show that there is considerable uncertainty about variable selection, choice of transformation, and outlier identification, and that there is advantage in dealing with all three simultaneously. The Canadian Journal of Statistics 37: 361–380; 2009 © 2009 Statistical Society of Canada  相似文献   

5.
研究面板数据聚类问题过程中,在相似性度量上,用Logistic回归模型构造相似系数和非对称相似矩阵。在聚类算法上,目前的聚类算法只适用于对称的相似矩阵。在非对称相似矩阵的聚类算法上,采用最佳优先搜索和轮廓系数,改进DBSCAN聚类方法,提出BF—DBSCAN方法。通过实例分析,比较了BF—DBSCAN和DBSCAN方法的聚类结果,以及不同参数设置对BF—DBSCAN聚类结果的影响,验证了该方法的有效性和实用性。  相似文献   

6.
Clustering Algorithms are nowadays really important tools in microarray data analysis. The different clustering algorithm generally used in biological science does not take into consideration the underlying probability distribution of the data. In this sense, they are heuristic in nature. In this work we proposed a clustering algorithm based on EM Algorithm. It gives 28% less misclassification than the K-means algorithm (which is mostly use in Bio science). We have also shown on a real data set that this algorithm can be efficiently used for detecting the genes which are responsible for a particular disease.  相似文献   

7.
Unsupervised Curve Clustering using B-Splines   总被引:5,自引:0,他引:5  
Data in many different fields come to practitioners through a process naturally described as functional. Although data are gathered as finite vector and may contain measurement errors, the functional form have to be taken into account. We propose a clustering procedure of such data emphasizing the functional nature of the objects. The new clustering method consists of two stages: fitting the functional data by B‐splines and partitioning the estimated model coefficients using a k‐means algorithm. Strong consistency of the clustering method is proved and a real‐world example from food industry is given.  相似文献   

8.
数据分布密度划分的聚类算法是数据挖掘聚类算法的主要方法之一。针对传统密度划分聚类算法存在运算复杂、运行效率不高等缺陷,设计高维分步投影的多重分区聚类算法;以高维分布投影密度为依据,对数据集进行多重分区,产生数据集的子簇空间,并进行子簇合并,形成理想的聚类结果;依据该算法进行实验,结果证明该算法具有运算简单和运行效率高等优良性。  相似文献   

9.
Clustering of Variables Around Latent Components   总被引:1,自引:0,他引:1  
Abstract

Clustering of variables around latent components is investigated as a means to organize multivariate data into meaningful structures. The coverage includes (i) the case where it is desirable to lump together correlated variables no matter whether the correlation coefficient is positive or negative; (ii) the case where negative correlation shows high disagreement among variables; (iii) an extension of the clustering techniques which makes it possible to explain the clustering of variables taking account of external data. The strategy basically consists in performing a hierarchical cluster analysis, followed by a partitioning algorithm. Both algorithms aim at maximizing the same criterion which reflects the extent to which variables in each cluster are related to the latent variable associated with this cluster. Illustrations are outlined using real data sets from sensory studies.  相似文献   

10.
In this article, we present a novel approach to clustering finite or infinite dimensional objects observed with different uncertainty levels. The novelty lies in using confidence sets rather than point estimates to obtain cluster membership and the number of clusters based on the distance between the confidence set estimates. The minimal and maximal distances between the confidence set estimates provide confidence intervals for the true distances between objects. The upper bounds of these confidence intervals can be used to minimize the within clustering variability and the lower bounds can be used to maximize the between clustering variability. We assign objects to the same cluster based on a min–max criterion and we separate clusters based on a max–min criterion. We illustrate our technique by clustering a large number of curves and evaluate our clustering procedure with a synthetic example and with a specific application.  相似文献   

11.
In document clustering, a document may be assigned to multiple clusters and the probabilities of a document belonging to different clusters are directly normalized. We propose a new Posterior Probabilistic Clustering (PPC) model that has this normalization property. The clustering model is based on Nonnegative Matrix Factorization (NMF) and flexible such that if we use class conditional probability normalization, the model reduces to Probabilistic Latent Semantic Indexing (PLSI). Systematic comparison and evaluation indicates that PPC is competitive with other state-of-art clustering methods. Furthermore, the results of PPC are more sparse and orthogonal, both of which are highly desirable.  相似文献   

12.
Summary.  Multilevel modelling is sometimes used for data from complex surveys involving multistage sampling, unequal sampling probabilities and stratification. We consider generalized linear mixed models and particularly the case of dichotomous responses. A pseudolikelihood approach for accommodating inverse probability weights in multilevel models with an arbitrary number of levels is implemented by using adaptive quadrature. A sandwich estimator is used to obtain standard errors that account for stratification and clustering. When level 1 weights are used that vary between elementary units in clusters, the scaling of the weights becomes important. We point out that not only variance components but also regression coefficients can be severely biased when the response is dichotomous. The pseudolikelihood methodology is applied to complex survey data on reading proficiency from the American sample of the 'Program for international student assessment' 2000 study, using the Stata program gllamm which can estimate a wide range of multilevel and latent variable models. Performance of pseudo-maximum-likelihood with different methods for handling level 1 weights is investigated in a Monte Carlo experiment. Pseudo-maximum-likelihood estimators of (conditional) regression coefficients perform well for large cluster sizes but are biased for small cluster sizes. In contrast, estimators of marginal effects perform well in both situations. We conclude that caution must be exercised in pseudo-maximum-likelihood estimation for small cluster sizes when level 1 weights are used.  相似文献   

13.
Label switching is one of the fundamental issues for Bayesian mixture modeling. It occurs due to the nonidentifiability of the components under symmetric priors. Without solving the label switching, the ergodic averages of component specific quantities will be identical and thus useless for inference relating to individual components, such as the posterior means, predictive component densities, and marginal classification probabilities. The author establishes the equivalence between the labeling and clustering and proposes two simple clustering criteria to solve the label switching. The first method can be considered as an extension of K-means clustering. The second method is to find the labels by minimizing the volume of labeled samples and this method is invariant to the scale transformation of the parameters. Using a simulation example and the application of two real data sets, the author demonstrates the success of these new methods in dealing with the label switching problem.  相似文献   

14.
Abstract

Document Clustering aims at organizing a large quantity of unlabeled documents into a smaller number of meaningful and coherent clusters. One of the main unsolved problems in the literature is the lack of a reliable methodology to evaluate the results, although a wide variety of validation measures has been proposed. Validation measures are often unsatisfactory with numerical data, and even underperforming with textual data. Our attention focuses on the use of cosine similarity into the clustering process. A new measure based on the same criterion is here proposed. The effectiveness of the proposal is shown by an extensive comparative study.  相似文献   

15.
对于一类变量非线性相关的面板数据,现有的基于线性算法的面板数据聚类方法并不能准确地度量样本间的相似性,且聚类结果的可解释性低。综合考虑变量非线性相关问题及聚类结果可解释性问题,提出一种非线性面板数据的聚类方法,通过非线性核主成分算法实现对样本相似性的测度,并基于混合高斯模型进行样本概率聚类,实证表明该方法的有效性及其对聚类结果的可解释性有所提高。  相似文献   

16.
ABSTRACT

In this paper, we develop an efficient wavelet-based regularized linear quantile regression framework for coefficient estimations, where the responses are scalars and the predictors include both scalars and function. The framework consists of two important parts: wavelet transformation and regularized linear quantile regression. Wavelet transform can be used to approximate functional data through representing it by finite wavelet coefficients and effectively capturing its local features. Quantile regression is robust for response outliers and heavy-tailed errors. In addition, comparing with other methods it provides a more complete picture of how responses change conditional on covariates. Meanwhile, regularization can remove small wavelet coefficients to achieve sparsity and efficiency. A novel algorithm, Alternating Direction Method of Multipliers (ADMM) is derived to solve the optimization problems. We conduct numerical studies to investigate the finite sample performance of our method and applied it on real data from ADHD studies.  相似文献   

17.
随着大数据时代的来临,近年来函数型数据分析方法成为研究的热点问题,针对曲线的聚类分析方法引起了学界的关注.给出一种曲线聚类的方法:以L2距离作为亲疏程度的度量,在B样条基底函数展开表述下,将曲线本身信息、曲线变化信息引入聚类算法构建,并实现了曲线聚类与传统多元统计聚类方法的对接.作为应用,以城乡收入函数聚类实例验证了该曲线聚类方法,结果表明,在引入曲线变化信息的情况下,比仅考虑曲线本身信息能够取得更好的聚类效果.  相似文献   

18.
In this paper,we propose a class of general partially linear varying-coefficient transformation models for ranking data. In the models, the functional coefficients are viewed as nuisance parameters and approximated by B-spline smoothing approximation technique. The B-spline coefficients and regression parameters are estimated by rank-based maximum marginal likelihood method. The three-stage Monte Carlo Markov Chain stochastic approximation algorithm based on ranking data is used to compute estimates and the corresponding variances for all the B-spline coefficients and regression parameters. Through three simulation studies and a Hong Kong horse racing data application, the proposed procedure is illustrated to be accurate, stable and practical.  相似文献   

19.
农业险定价中的核心问题是农业风险区划问题,为了体现农业区划中个体指标的动态发展特征,根据近邻传播改进自适应近邻传播聚类方法对数据进行优化,基于轮廓系数、归属度和吸引度得到最佳聚类中心和几何聚类中心,并将聚类转化为新数据集的聚类问题;选取代表性的棉花为例进行实证分析,通过计算生产、销售、收入、财政等指标进行棉花风险区划实例分析,计算最优棉花风险区划,结果表明对于具有动态特征的数据,本模型具有很好的有效性、实用性和解释性。  相似文献   

20.
ABSTRACT

Panel datasets have been increasingly used in economics to analyze complex economic phenomena. Panel data is a two-dimensional array that combines cross-sectional and time series data. Through constructing a panel data matrix, the clustering method is applied to panel data analysis. This method solves the heterogeneity question of the dependent variable, which belongs to panel data, before the analysis. Clustering is a widely used statistical tool in determining subsets in a given dataset. In this article, we present that the mixed panel dataset is clustered by agglomerative hierarchical algorithms based on Gower's distance and by k-prototypes. The performance of these algorithms has been studied on panel data with mixed numerical and categorical features. The effectiveness of these algorithms is compared by using cluster accuracy. An experimental analysis is illustrated on a real dataset using Stata and R package software.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号