Similar Literature
 20 similar documents found (search time: 62 ms)
1.
An aspect of cluster analysis which has been widely studied in recent years is the weighting and selection of variables. Procedures have been proposed which are able to identify the cluster structure present in a data matrix when that structure is confined to a subset of variables. Other methods assess the relative importance of each variable as revealed by a suitably chosen weight. But when a cluster structure is present in more than one subset of variables and differs from one subset to another, those solutions, as well as standard clustering algorithms, can lead to misleading results. Some very recent methodologies for finding consensus classifications of the same set of units can also be useful for identifying cluster structures in a data matrix, but each one seems to be only partly satisfactory for the purpose at hand. Therefore a new, more specific procedure is proposed and illustrated by analyzing two real data sets; its performance is evaluated by means of a simulation experiment.

2.
Kendall and Gehan estimating functions are commonly used to estimate the regression parameter in the accelerated failure time model with censored observations in survival analysis. In this paper, we apply the jackknife empirical likelihood method to overcome the computational difficulty of interval estimation. A Wilks' theorem of jackknife empirical likelihood for U-statistic type estimating equations is established and used to construct confidence intervals for the regression parameter. We carry out an extensive simulation study to compare the Wald-type procedure, the empirical likelihood method, and the jackknife empirical likelihood method. The proposed jackknife empirical likelihood method performs better than the existing methods. We also use a real data set to compare the proposed methods.

3.
Model selection methods are important for identifying the best approximating model. To identify the most meaningful model, the purpose of the model should be clearly stated in advance. The focus of this paper is model selection when the modelling purpose is classification. We propose a new model selection approach designed for logistic regression models whose main modelling purpose is classification. The method is based on the distance between two clustering trees. We also question and evaluate the performance of conventional model selection methods based on information-theoretic concepts in determining the best logistic regression classifier. An extensive simulation study is used to assess the finite-sample performance of the cluster-tree-based and the information-theoretic model selection methods. Simulations are adjusted for whether or not the true model is in the candidate set. Results show that the new approach is highly promising. Finally, the methods are applied to a real data set to select a binary model as a means of classifying the subjects with respect to their risk of breast cancer.

4.
Cluster analysis is one of the most widely used methods in statistical analysis, in which homogeneous subgroups are identified in a heterogeneous population. Because mixed continuous and discrete data arise in many applications, some ordinary clustering methods, such as hierarchical methods, k-means and model-based methods, have been extended to the analysis of mixed data. However, in the available model-based clustering methods, the number of parameters increases with the number of continuous variables, and identifying as well as fitting an appropriate model may be difficult. In this paper, to reduce the number of parameters, a set of parsimonious models is introduced for the model-based clustering of mixed continuous (normal) and nominal data. Models in this set are extended using the general location model approach for modeling the distribution of mixed variables and by applying a factor analyzer structure to the covariance matrices. The ECM algorithm is used for estimating the parameters of these models. To show the performance of the proposed models for clustering, results from some simulation studies and from the analysis of two real data sets are presented.

5.
This paper introduces a method for clustering spatially dependent functional data. The idea is to consider the contribution of each curve to the spatial variability. Thus, we define a spatial dispersion function associated with each curve and perform a k-means-like clustering algorithm. The algorithm is based on the optimization of a fitting criterion between the spatial dispersion functions associated with each curve and the representatives of the clusters. The performance of the proposed method is illustrated by an application to real data and a simulation study.

6.
The use of parametric linear mixed models and generalized linear mixed models to analyze longitudinal data collected during randomized controlled trials (RCTs) is conventional. The application of these methods, however, is restricted by the various assumptions these models require. When the number of observations per subject is sufficiently large and individual trajectories are noisy, functional data analysis (FDA) methods serve as an alternative to parametric longitudinal data analysis techniques. However, the use of FDA in RCTs is rare. In this paper, the effectiveness of FDA and linear mixed models (LMMs) was compared by analyzing data from rural persons living with HIV and comorbid depression enrolled in a depression treatment randomized clinical trial. Interactive voice response systems were used for weekly administrations of the 10-item Self-Administered Depression Scale (SADS) over 41 weeks. Functional principal component analysis and functional regression analysis methods detected a statistically significant difference in SADS between telephone-administered interpersonal psychotherapy (tele-IPT) and controls, but linear mixed effects model results did not. Additional simulation studies were conducted to compare FDA and LMMs under a different nonlinear trajectory assumption. In this clinical trial, with sufficient per-subject measured outcomes and individual trajectories that are noisy and nonlinear, we found FDA methods to be a better alternative to LMMs.

7.
When modeling correlated binary data in the presence of informative cluster sizes, generalized estimating equations with either resampling or inverse weighting are often used to correct for estimation bias. However, existing methods for the clustered longitudinal setting assume constant cluster sizes over time. We present a subject-weighted generalized estimating equations scheme that provides valid parameter estimation for the clustered longitudinal setting while allowing cluster sizes to change over time. We compare, via simulation, the performance of existing methods with that of our subject-weighted approach. The subject-weighted approach was the only method that showed negligible bias, with excellent coverage, for all model parameters.

8.
Five univariate divisive clustering methods for grouping means in analysis of variance are considered. Unlike pairwise multiple comparison procedures, cluster analysis has the advantage of producing non-overlapping groups of the treatment means. Comparisonwise Type I error rates and average numbers of clusters per experiment are examined for a heterogeneous set of 20 true treatment means with 11 embedded homogeneous subgroups of one or more treatments. The results of a simulation study clearly show that the observed comparisonwise error rate and number of clusters are determined to a far greater extent by the precision of the experiment (as determined by the magnitude of the standard deviation) than by either the stated significance level or the clustering method used.

9.
Among the statistical methods to model stochastic behaviours of objects, clustering is a preliminary technique to recognize similar patterns within a group of observations in a data set. Various distances that measure differences among objects can be used to cluster data through numerous clustering methods. When the variables at hand contain geometrical information about the objects, such metrics should be adapted accordingly. In fact, statistical methods for these typical data are endowed with a geometrical paradigm in a multivariate sense. In this paper, a procedure for clustering shape data is suggested employing appropriate metrics. Then, the best shape distance candidate as well as a suitable agglomerative method for clustering the simulated shape data are provided by considering cluster validation measures. The results are implemented in a real-life application.

10.
Panel datasets have been increasingly used in economics to analyze complex economic phenomena. Panel data is a two-dimensional array that combines cross-sectional and time series data. By constructing a panel data matrix, the clustering method can be applied to panel data analysis. This method addresses the heterogeneity of the dependent variable in the panel data before the analysis. Clustering is a widely used statistical tool for determining subsets in a given dataset. In this article, we cluster a mixed panel dataset using agglomerative hierarchical algorithms based on Gower's distance and using k-prototypes. The performance of these algorithms has been studied on panel data with mixed numerical and categorical features. The effectiveness of these algorithms is compared by using cluster accuracy. An experimental analysis is illustrated on a real dataset using the Stata and R software packages.
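As an illustration of the Gower's distance mentioned in this abstract, the following is a minimal sketch of the textbook definition for mixed numerical and categorical features: range-scaled absolute differences for numeric features, simple matching for categorical ones, averaged with equal weights. The function name `gower_distance`, the `ranges` argument and the toy rows are assumptions made for this example; the paper's own implementation in Stata and R is not shown in the abstract and may differ (for instance in weighting or missing-value handling).

```python
from typing import Optional, Sequence, Union

Value = Union[float, str]


def gower_distance(a: Sequence[Value], b: Sequence[Value],
                   ranges: Sequence[Optional[float]]) -> float:
    """Gower's distance between two records with mixed features.

    ranges[k] is the observed range (max - min) of numeric feature k,
    or None when feature k is categorical. Equal feature weights and
    no missing values are assumed (hypothetical simplifications).
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical feature: simple matching (0 if equal, 1 otherwise).
            total += 0.0 if x == y else 1.0
        elif r > 0:
            # Numeric feature: absolute difference scaled by the range.
            total += abs(x - y) / r
        # A numeric feature with zero range contributes 0.
    return total / len(ranges)


# Toy records: (age, income, region); the values are illustrative only.
rows = [
    (30.0, 50000.0, "urban"),
    (40.0, 60000.0, "rural"),
    (50.0, 50000.0, "urban"),
]
ranges = [20.0, 10000.0, None]  # ranges of age and income; region is categorical

d01 = gower_distance(rows[0], rows[1], ranges)
d02 = gower_distance(rows[0], rows[2], ranges)
```

A matrix of such pairwise distances is the kind of input an agglomerative hierarchical algorithm can take; every entry lies in [0, 1] because each per-feature term does.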

11.
We investigate the performance of stratified bivariate ranked set sampling (SBVRSS) relative to stratified simple random sampling (SSRS) for estimating the population mean with regression methods. The mean and variance of the proposed estimators are derived, and the estimators are shown to be unbiased. We perform a simulation study to compare the relative efficiency of SBVRSS to SSRS under various data-generating scenarios. We also compare the two sampling schemes on a real data set from trauma victims in a hospital setting. The results of our simulation study and the real data illustration indicate that using SBVRSS for regression estimation is more efficient than SSRS in most cases.

12.
In the present paper, we propose nonparametric estimators of the inaccuracy measure for the lifetime distribution based on censored data. This measure plays an important role in reliability and survival analysis in connection with the modeling and analysis of lifetime data. Asymptotic properties of the estimators are established under suitable regularity conditions. Monte Carlo simulation studies are carried out to compare the performance of the estimators using the mean-squared error. The methods are illustrated using a real data set.

13.
The performance of two clustering strategies for spatially correlated functional data based on the same measure of spatial dependence is examined and compared. In particular, the role of the spatial dependence computed by the trace-variogram function is analyzed. The main features of both procedures are shown through a simulation study based on a variety of practical scenarios easily encountered in the analysis of spatial functional data. An application to real data based on salinity curves is also presented.

14.
Binary outcome data with small clusters often arise in medical studies, and the size of clusters might be informative of the outcome. The authors conducted a simulation study to examine the performance of a range of statistical methods. The simulation results showed that all methods performed mostly comparably in the estimation of covariate effects. However, the standard logistic regression approach that ignores the clustering encountered an undercoverage problem when the degree of clustering was nontrivial. The performance of the random-effects logistic regression approach tended to be affected by low disease prevalence, relatively small cluster size, or informative cluster size.

15.
The receiver operating characteristic (ROC) curve is a graphical representation of the relationship between false positive and true positive rates. It is a widely used statistical tool for describing the accuracy of a diagnostic test. In this paper we propose a new nonparametric ROC curve estimator based on the smoothed empirical distribution functions. We prove its strong consistency and perform a simulation study to compare it with some other popular nonparametric estimators of the ROC curve. We also apply the proposed method to a real data set.
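To make the relationship between false positive and true positive rates concrete, here is a minimal sketch of the plain (unsmoothed) empirical ROC curve. It is not the smoothed estimator proposed in this paper, whose details the abstract does not give. The function names `empirical_roc` and `auc_trapezoid`, the convention that a score at or above the threshold counts as a positive test, and the toy scores are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def empirical_roc(healthy: Sequence[float],
                  diseased: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    A test is called positive when the score is >= the threshold;
    thresholds run over all observed scores from high to low.
    """
    thresholds = sorted(set(healthy) | set(diseased), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for c in thresholds:
        fpr = sum(s >= c for s in healthy) / len(healthy)
        tpr = sum(s >= c for s in diseased) / len(diseased)
        points.append((fpr, tpr))
    return points


def auc_trapezoid(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Illustrative scores only: overlapping groups with tied values.
roc_points = empirical_roc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
roc_area = auc_trapezoid(roc_points)

# Perfectly separated groups give an area of 1.
separated_area = auc_trapezoid(empirical_roc([1.0, 2.0], [3.0, 4.0]))
```

The empirical curve is a step-like function of the data, which is one common motivation for smoothed estimators of the kind the abstract describes.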

16.
In statistical data analysis it is often important to compare, classify, and cluster different time series. For these purposes various methods have been proposed in the literature, but they usually assume time series with the same sample size. In this article, we propose a spectral domain method for handling time series of unequal length. The method makes the spectral estimates comparable by producing statistics at the same frequency. The procedure is compared with other methods proposed in the literature by a Monte Carlo simulation study. As an illustrative example, the proposed spectral method is applied to cluster industrial production series of some developed countries.

17.
This research is dedicated to the study of periodic characteristics of periodically correlated time series, such as seasonal means, seasonal variances and autocovariance functions. Two bootstrap methods are used: the extension of the usual Moving Block Bootstrap (EMBB) and the Generalised Seasonal Block Bootstrap (GSBB). The first approach is proposed because the usual Moving Block Bootstrap does not preserve the periodic structure contained in the data and cannot be applied to the problems considered. For the aforementioned periodic characteristics, the bootstrap estimators are introduced and consistency of the EMBB in all cases is obtained. Moreover, the GSBB consistency results for seasonal variances and the autocovariance function are presented. Additionally, the bootstrap consistency of both considered techniques for smooth functions of the parameters of interest is obtained. Finally, simultaneous bootstrap confidence intervals are constructed. A simulation study to compare their actual coverage probabilities is provided. A real data example is presented.
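For orientation, the usual Moving Block Bootstrap that this abstract takes as its starting point can be sketched as follows. This is only the standard scheme, not the EMBB or GSBB studied in the paper, whose constructions the abstract does not spell out. The function name `moving_block_bootstrap`, the truncation of the last block and the toy series are assumptions made for this example.

```python
import random
from typing import List, Sequence


def moving_block_bootstrap(series: Sequence[float], block_length: int,
                           rng: random.Random) -> List[float]:
    """Draw one bootstrap replicate by concatenating overlapping blocks.

    Blocks of `block_length` consecutive observations are sampled with
    replacement from all possible start positions, then joined and
    truncated to the original length.
    """
    n = len(series)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(series)")
    n_starts = n - block_length + 1  # number of admissible block starts
    replicate: List[float] = []
    while len(replicate) < n:
        start = rng.randrange(n_starts)
        replicate.extend(series[start:start + block_length])
    return replicate[:n]


# Illustrative series: 0, 1, ..., 11 (think of 3 "years" with period 4).
toy_series = [float(t) for t in range(12)]
replicate = moving_block_bootstrap(toy_series, 4, random.Random(0))
```

Because a block may start at any position, an observation from one season can land in a slot belonging to another season in the replicate. This is one way to see why the usual scheme does not preserve periodic structure.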

18.

Kaufman and Rousseeuw (1990) proposed the clustering algorithm Partitioning Around Medoids (PAM), which maps a distance matrix into a specified number of clusters. A particularly nice property is that PAM allows clustering with respect to any specified distance metric. In addition, the medoids are robust representations of the cluster centers, which is particularly important in the common context that many elements do not belong well to any cluster. Based on our experience in clustering gene expression data, we have noticed that PAM does have problems recognizing relatively small clusters in situations where good partitions around medoids clearly exist. In this paper, we propose to partition around medoids by maximizing the criterion "Average Silhouette" defined by Kaufman and Rousseeuw (1990). We also propose a fast-to-compute approximation of "Average Silhouette". We implement these two new partitioning around medoids algorithms and illustrate their performance relative to existing partitioning methods in simulations.
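The "Average Silhouette" criterion named in this abstract can be computed directly from a distance matrix and a cluster assignment. The sketch below follows the standard silhouette definition and only evaluates the criterion; it does not implement the paper's search over medoids or its fast approximation, neither of which the abstract details. The function name `average_silhouette`, the convention that singleton clusters score 0, and the toy points are assumptions made for this example.

```python
from typing import Dict, List, Sequence


def average_silhouette(dist: Sequence[Sequence[float]],
                       labels: Sequence[int]) -> float:
    """Mean silhouette width over all elements.

    dist is a symmetric distance matrix and labels[i] is the cluster of
    element i. At least two clusters are required. For element i, a is
    the mean distance to its own cluster and b is the smallest mean
    distance to any other cluster; its silhouette is (b - a) / max(a, b).
    """
    clusters: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: silhouette taken as 0
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        scale = max(a, b)
        if scale > 0:
            total += (b - a) / scale
    return total / len(labels)


# Four points on a line; distances are absolute differences.
positions = [0.0, 1.0, 10.0, 11.0]
dist_matrix = [[abs(p - q) for q in positions] for p in positions]

good_partition = average_silhouette(dist_matrix, [0, 0, 1, 1])
poor_partition = average_silhouette(dist_matrix, [0, 1, 0, 1])
```

Values near 1 indicate compact, well-separated clusters and negative values indicate elements that sit closer to another cluster, which is why the criterion can serve as an objective to maximize over partitions.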

19.
The authors propose a profile likelihood approach to linear clustering which explores potential linear clusters in a data set. For each linear cluster, an errors‐in‐variables model is assumed. The optimization of the derived profile likelihood can be achieved by an EM algorithm. Its asymptotic properties and its relationships with several existing clustering methods are discussed. Methods to determine the number of components in a data set are adapted to this linear clustering setting. Several simulated and real data sets are analyzed for comparison and illustration purposes. The Canadian Journal of Statistics 38: 716–737; 2010 © 2010 Statistical Society of Canada

20.
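This paper studies the clustering of time series. Most real-world time series are nonlinear, whereas existing time series clustering methods are mostly based on linear time series models, so this paper proposes a clustering method that can be applied to nonlinear time series. The similarity between the two-dimensional kernel density estimates of the time series is used as the distance measure for nonlinear time series. This distance measure is nonparametric, takes differences in the autocorrelation structure of the time series into account, and can roughly identify similarity in the shape and the dynamic dependence structure of time series. Consistent with the theoretical results, our simulation experiments also confirm the effectiveness of this distance measure.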
