Similar Articles
20 similar articles found.
1.
Abstract

An aspect of cluster analysis which has been widely studied in recent years is the weighting and selection of variables. Procedures have been proposed which are able to identify the cluster structure present in a data matrix when that structure is confined to a subset of variables. Other methods assess the relative importance of each variable as revealed by a suitably chosen weight. But when a cluster structure is present in more than one subset of variables and differs from one subset to another, those solutions, as well as standard clustering algorithms, can lead to misleading results. Some very recent methodologies for finding consensus classifications of the same set of units can also be useful for identifying cluster structures in a data matrix, but each seems to be only partly satisfactory for the purpose at hand. Therefore a new, more specific procedure is proposed and illustrated by analyzing two real data sets; its performance is evaluated by means of a simulation experiment.

2.
Clustering algorithms are important methods widely used in mining data streams because of their ability to deal with infinite data flows. Although these algorithms perform well at mining latent relationships in data streams, most of them suffer from loss of cluster purity and become unstable when the input data streams contain many noisy variables. In this article, we propose a clustering algorithm for data streams with noisy variables. Simulation results show that the proposed method, which adds a variable-selection step as a component of the clustering algorithm, outperforms previous approaches. The results of two experiments indicate that clustering data streams with variable selection is more stable and achieves better purity than clustering without it. A further experiment on the KDD-CUP99 dataset also shows that our algorithm generates more stable results.

3.
This article reports the results of an extensive simulation study which investigated the performance of some commonly used methods of estimating error rates in discriminant analysis. Earlier research papers limited their comparisons of these methods to independent training data. This study allows for a simple autoregressive dependence among the training data. The results suggest that the estimation methods based on the normal distribution perform adequately well under conditions of negative or mild positive correlation in the data and small dimensions (p) of the observation vectors. For large p or strong positive correlation structures, the conclusion is that one of the better non-parametric methods should be used. Special circumstances and conditions which notably affect the relative performances of the methods are identified.

4.
To compare their performance on high-dimensional data, several regression methods are applied to data sets in which the number of explanatory variables greatly exceeds the sample size. The methods are stepwise regression, principal components regression, two forms of latent root regression, partial least squares, and a new method developed here. The data are four sample sets for which near-infrared reflectance spectra have been determined, and the regression methods use the spectra to estimate the concentration of various chemical constituents, the latter having been determined by standard chemical analysis. Thirty-two regression equations are estimated using each method and their performances are evaluated using validation data sets. Although it is the most widely used, stepwise regression was decidedly poorer than the other methods considered. Differences between the latter were small, with partial least squares performing slightly better than the other methods under all criteria examined, albeit not by a statistically significant amount.
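The kind of comparison described in this abstract can be mocked up in a few lines. The sketch below, which is not the paper's data or code, pits principal components regression against partial least squares on a synthetic data set with far more predictors than samples; the component counts, sample size, and noise level are arbitrary choices.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n, p = 40, 200                                 # far more predictors than samples
X = rng.normal(size=(n, p)).cumsum(axis=1)     # smooth, spectrum-like rows
beta = np.zeros(p)
beta[50:60] = 1.0                              # the response depends on one "band"
y = X @ beta + rng.normal(scale=0.5, size=n)

pcr = make_pipeline(PCA(n_components=5), LinearRegression())
pls = PLSRegression(n_components=5)

for name, model in [("PCR", pcr), ("PLS", pls)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean cross-validated R^2 = {scores.mean():.3f}")
```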

5.
A review of several statistical methods that are currently in use for outlier identification is presented, and their performances are compared theoretically for typical statistical distributions of experimental data, with values derived from the distribution of extreme order statistics taken as reference terms. A simple modification of a popular, widely used method based on the box-plot is introduced in order to overcome a major limitation concerning sample size. Examples are presented in which the methods considered are applied to two data sets: a historical one concerning the evaluation of an astronomical constant by a number of leading observatories, and a substantial database from an ongoing investigation on absolute measurement of gravity acceleration, which exhibits peculiar aspects concerning outliers. Some problems related to outlier treatment are examined, and the need for both statistical analysis and expert opinion in proper outlier management is underlined.
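For readers unfamiliar with the box-plot rule that the abstract modifies, the sketch below shows only the classic version: points falling outside the Tukey fences built from the quartiles are flagged. The paper's sample-size correction is not reproduced here, and the 1.5 multiplier is simply the conventional default.

```python
import numpy as np

def boxplot_outliers(x, k=1.5):
    """Return a boolean mask of points outside the Tukey fences."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 14.7])  # last value is suspect
print(data[boxplot_outliers(data)])                         # -> [14.7]
```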

6.
A simulation comparison is made of Mann–Whitney U test extensions recently proposed for simple cluster samples or for repeated ordinal responses. These are based on two approaches: the permutation approach of Fay and Gennings (four tests, two exact), and Edwardes' approach (two asymptotic tests, one new). Edwardes' approach permits confidence interval estimation, unlike the permutation approach. An appropriate parameter for estimation is P(X<Y)−P(X>Y), where X is the rank of a response from group 1 and Y is from group 2. The permutation tests are shown to be unsuitable for some survey data, since they are sensitive to a difference in cluster intra-correlations when there is no distribution difference between groups at the individual level. The exact permutation tests are of little use for fewer than seven clusters, precisely where they are most needed. Otherwise, the permutation tests perform well.
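The parameter mentioned in the abstract, P(X<Y)−P(X>Y), can be estimated naively by comparing every observation in one group with every observation in the other. The sketch below does exactly that and deliberately ignores clustering, so it illustrates the estimand rather than Edwardes' clustered estimator; the two small samples are made up.

```python
import numpy as np

def win_probability_difference(x, y):
    """Estimate P(X < Y) - P(X > Y) from two samples (clustering ignored)."""
    x = np.asarray(x)[:, None]
    y = np.asarray(y)[None, :]
    return np.mean(x < y) - np.mean(x > y)   # averages over all pairs

group1 = [2, 3, 3, 5, 7]
group2 = [4, 5, 6, 6, 8]
print(win_probability_difference(group1, group2))  # positive: group 2 tends to be larger
```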

7.
Among the innovations and improvements made in the past two decades to the techniques and tools used for statistical process control (SPC), adaptive control charts have been shown to substantially improve statistical and/or economic performance. Variable sampling interval (VSI) control charts are one of the most widely applied types of adaptive control chart and have been shown to be faster than traditional Shewhart control charts in identifying small shifts in the quality characteristics of interest. While the design procedure of VSI control charts assumes independent, normally distributed observations, in many real processes the validity of these assumptions is questionable. This article develops an economic-statistical design of a VSI X-bar control chart under non-normality and correlation. Since the proposed design involves a complex nonlinear cost model that cannot be solved using classical optimization methods, a genetic algorithm (GA) is employed to solve it. Moreover, to improve performance, response surface methodology (RSM) is employed to calibrate the GA parameters. The solution procedure, efficiency, and sensitivity analysis of the proposed design are demonstrated through a numerical illustration at the end.
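The defining feature of a VSI chart is that the waiting time until the next sample depends on where the current statistic falls. The sketch below illustrates that rule for an X-bar chart; the warning limit, control limit, and the two sampling intervals are illustrative values, not the economically optimized design the article derives.

```python
import numpy as np

def next_sampling_interval(xbar, mu0, sigma, n, w=1.0, L=3.0,
                           short_interval=0.25, long_interval=2.0):
    """Return the waiting time (in hours) before the next subgroup is sampled."""
    z = (xbar - mu0) / (sigma / np.sqrt(n))
    if abs(z) >= L:
        return 0.0                # out-of-control signal: react immediately
    elif abs(z) >= w:
        return short_interval     # warning zone: sample again soon
    else:
        return long_interval      # central zone: relax the sampling rate

print(next_sampling_interval(xbar=10.4, mu0=10.0, sigma=1.0, n=5))  # -> 2.0
```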

8.
In this study some new unbiased estimators based on order statistics are proposed for the scale parameter in certain families of scale distributions. These new estimators are suitable for complete (uncensored) and symmetric doubly Type-II censored samples. Further, they can be adapted to Type-II right- or left-censored samples. In addition, unbiased standard deviation estimators of the proposed estimators are also given. Moreover, unlike BLU estimators based on order statistics, the expectations and variances-covariances of the relevant order statistics are not required to compute these new estimators.

Simulation studies are conducted to compare the performance of the new estimators with their counterpart BLU estimators for small sample sizes. The simulation results show that most of the proposed estimators perform almost as well as the counterpart BLU estimators, and some of them are even better than the BLU estimators in certain cases. Furthermore, a real data set is used to illustrate the new estimators, and the results obtained parallel those of the BLUE methods.


9.
In this paper we study the class of augmented balanced incomplete block designs, which are used for comparing a control treatment with a set of test treatments. Under the A-criterion we establish a condition that enables us to determine the most efficient augmented design, and we suggest some methods to compute a lower bound for the efficiency of these designs. For 3 ≤ k ≤ 10 and v ≥ k, we list the parameters of the most efficient designs with a lower bound for their efficiency or, if known, mention their optimality.

10.
In high-dimensional regression problems, regularization methods have been a popular choice for addressing variable selection and multicollinearity. In this paper we study bridge regression, which adaptively selects the penalty order from the data and produces flexible solutions in various settings. We implement bridge regression based on local linear and quadratic approximations to circumvent the nonconvex optimization problem. Our numerical study shows that the proposed bridge estimators are a robust choice in various circumstances compared to other penalized regression methods such as the ridge, lasso, and elastic net. In addition, we propose group bridge estimators that select grouped variables and study their asymptotic properties when the number of covariates increases along with the sample size. These estimators are also applied to varying-coefficient models. Numerical examples show the superior performance of the proposed group bridge estimators in comparison with other existing methods.
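To make the local quadratic approximation idea concrete, here is a minimal sketch of bridge regression fitted by iterated ridge steps, in the spirit of what the abstract describes but not the authors' implementation: the penalty order q, the tuning parameter lam, and the small-coefficient safeguards are fixed illustrative choices rather than the data-adaptive selections studied in the paper.

```python
import numpy as np

def bridge_lqa(X, y, q=0.5, lam=1.0, n_iter=50, eps=1e-6):
    """Minimize ||y - X b||^2 / 2 + lam * sum(|b_j|^q) via a local quadratic approximation."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS starting value
    for _ in range(n_iter):
        absb = np.maximum(np.abs(beta), eps)           # guard against division by zero
        weights = lam * q * absb ** (q - 2.0)          # LQA turns the penalty into a ridge term
        beta_new = np.linalg.solve(X.T @ X + np.diag(weights), X.T @ y)
        if np.max(np.abs(beta_new - beta)) < 1e-8:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-4] = 0.0                    # snap tiny coefficients to exact zero
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
true_beta = np.array([3.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 2.0])
y = X @ true_beta + rng.normal(size=100)
print(np.round(bridge_lqa(X, y), 2))                   # non-zero entries roughly recovered
```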

11.
Model selection methods are important for identifying the best approximating model. To identify the best meaningful model, the purpose of the model should be clearly stated in advance. The focus of this paper is model selection when the modelling purpose is classification. We propose a new model selection approach for logistic regression when the main modelling purpose is classification. The method is based on the distance between two clustering trees. We also question and evaluate the performance of conventional model selection methods based on information-theoretic concepts in determining the best logistic regression classifier. An extensive simulation study is used to assess the finite-sample performance of the cluster-tree-based and information-theoretic model selection methods. Simulations are adjusted for whether or not the true model is in the candidate set. Results show that the new approach is highly promising. Finally, the methods are applied to a real data set to select a binary model as a means of classifying subjects with respect to their risk of breast cancer.

12.
We introduce the concept of snipping, complementing that of trimming, in robust cluster analysis. An observation is snipped when some of its dimensions are discarded, while the remaining ones are used for clustering and estimation. Snipped k-means is performed through a probabilistic optimization algorithm which is guaranteed to converge to the global optimum. We show global robustness properties of our snipped k-means procedure. Simulations and a real data application to optical recognition of handwritten digits are used to illustrate and compare the approach.
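The snipping idea can be illustrated with a rough cell-wise variant of k-means. The heuristic below is not the authors' probabilistic optimization algorithm and carries none of its global-optimality guarantees: it simply ignores the cells with the largest squared deviations from their cluster centroids when re-estimating the centroids. The number of clusters, the snipping fraction, and the iteration count are arbitrary.

```python
import numpy as np

def snipped_kmeans(X, k=2, snip_frac=0.05, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    centers = X[rng.choice(n, k, replace=False)].copy()
    keep = np.ones((n, p), dtype=bool)                    # True = cell is used
    for _ in range(n_iter):
        # assign each observation using only its retained cells
        d = np.array([np.where(keep, (X - c) ** 2, 0.0).sum(axis=1) for c in centers])
        labels = d.argmin(axis=0)
        # snip the cells with the largest squared deviation from their centroid
        dev = (X - centers[labels]) ** 2
        keep = dev <= np.quantile(dev, 1.0 - snip_frac)
        # recompute centroids from the retained cells only
        for j in range(k):
            mask, pts = keep[labels == j], X[labels == j]
            counts = np.maximum(mask.sum(axis=0), 1)
            centers[j] = (pts * mask).sum(axis=0) / counts
    return labels, centers, keep

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
X[0, 2] = 40.0                                            # one contaminated cell
labels, centers, keep = snipped_kmeans(X, k=2)
print(centers.round(1))                                   # centroids unaffected by the bad cell
print(keep[0])                                            # the contaminated cell is snipped
```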

13.
Clinical trials often use paired binomial data as their clinical endpoint. The confidence interval is frequently used to estimate the treatment performance. Tang et al. (2009) have proposed exact and approximate unconditional methods for constructing a confidence interval in the presence of incomplete paired binary data. The approach proposed by Tang et al. can be overly conservative, with large expected confidence interval width (ECIW), in some situations. We propose a profile likelihood-based method with a Jeffreys' prior correction to construct the confidence interval. This approach generates confidence intervals with much better coverage probability and shorter ECIWs. The performance of the method, along with the corrections, is demonstrated through extensive simulation. Finally, three real-world data sets are analyzed by all the methods. Statistical Analysis System (SAS) codes to execute the profile likelihood-based methods are also presented.

14.

Cluster point processes comprise a class of models that have been used for a wide range of applications. While several models have been studied for the probability density function of the offspring displacements and the parent point process, there are few examples of non-Poisson distributed cluster sizes. In this paper, we introduce a generalization of the Thomas process, which allows for the cluster sizes to have a variance that is greater or less than the expected value. We refer to this as the cluster sizes being over- and under-dispersed, respectively. To fit the model, we introduce minimum contrast methods and a Bayesian MCMC algorithm. These are evaluated in a simulation study. It is found that using the Bayesian MCMC method, we are in most cases able to detect over- and under-dispersion in the cluster sizes. We use the MCMC method to fit the model to nerve fiber data, and contrast the results to those of a fitted Thomas process.
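A generalized Thomas process of the kind described can be simulated in a few lines: Poisson-distributed parents in the unit square, a non-Poisson cluster-size distribution, and Gaussian offspring displacements. In the sketch below, negative-binomial cluster sizes give over-dispersion (a binomial choice would give under-dispersion); all parameter values are invented for illustration and are unrelated to the nerve fiber data.

```python
import numpy as np

def simulate_cluster_process(kappa=20, mean_size=6, dispersion=2.0, sigma=0.02, seed=0):
    """kappa: expected number of parents; mean_size: E[cluster size]; sigma: offspring sd."""
    rng = np.random.default_rng(seed)
    parents = rng.uniform(0, 1, size=(rng.poisson(kappa), 2))
    points = []
    for cx, cy in parents:
        # negative binomial with mean `mean_size`; variance mean_size + mean_size^2 / dispersion
        p = dispersion / (dispersion + mean_size)
        size = rng.negative_binomial(dispersion, p)
        points.append(rng.normal([cx, cy], sigma, size=(size, 2)))
    return np.vstack(points) if points else np.empty((0, 2))

pts = simulate_cluster_process()
print(pts.shape)   # roughly kappa * mean_size offspring points
```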


15.
Functional data analysis (FDA), the analysis of data that can be considered a set of observed continuous functions, is an increasingly common class of statistical analysis. One of the most widely used FDA methods is the cluster analysis of functional data; however, little work has been done to compare the performance of clustering methods on functional data. In this article, a simulation study compares the performance of four major hierarchical methods for clustering functional data. The simulated data varied in three ways: the nature of the signal functions (periodic, non-periodic, or mixed), the amount of noise added to the signal functions, and the pattern of the true cluster sizes. The Rand index was used to compare the performance of each clustering method. As a secondary goal, the clustering methods were also compared when the number of clusters is misspecified. To illustrate the results, a real set of functional data, for which the true clustering structure is believed to be known, was clustered. Comparing the clustering methods on the real data set confirmed the findings of the simulation. This study yields concrete suggestions to help future researchers determine the best method for clustering their functional data.
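The Rand index used as the comparison criterion is simple to compute directly: it is the proportion of object pairs on which two partitions agree, i.e. pairs placed together in both or apart in both. A small sketch with made-up labels:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

truth     = [0, 0, 0, 1, 1, 2]
clustered = [0, 0, 1, 1, 1, 2]
print(round(rand_index(truth, clustered), 3))   # 0.733
```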

16.
Empirical likelihood based variable selection
Information criteria form an important class of model/variable selection methods in statistical analysis, and the parametric likelihood is a crucial part of these methods. In some applications, such as generalized linear models, the model is specified only by a set of estimating functions. To overcome the non-availability of a well-defined likelihood function, information criteria under empirical likelihood are introduced. Under this setup, we solve the existence problem of the profile empirical likelihood caused by over-constraint in variable selection problems. The asymptotic properties of the new method are investigated, and the method is shown to be consistent in selecting the variables under mild conditions. Simulation studies find that the proposed method has performance comparable to the parametric information criteria when a suitable parametric model is available, and is superior when the parametric model assumption is violated. A real data set is also used to illustrate the usefulness of the new method.

17.
In this article, partially linear covariate-adjusted regression models are considered, and a penalized least-squares procedure is proposed to simultaneously select variables and estimate the parametric components. The rate of convergence and the asymptotic normality of the resulting estimators are established under some regularity conditions. With proper choices of the penalty functions and tuning parameters, it is shown that the proposed procedure can be as efficient as the oracle estimators. Monte Carlo simulation studies and a real data application are carried out to assess the finite-sample performance of the proposed method.

18.
Nonparametric regression models are often used to check or suggest a parametric model. Several methods have been proposed to test the hypothesis of a parametric regression function against an alternative smoothing spline model. Some tests, such as the locally most powerful (LMP) test of Cox et al. (Cox, D., Koh, E., Wahba, G. and Yandell, B. (1988). Testing the (parametric) null model hypothesis in (semiparametric) partial and generalized spline models. Ann. Stat., 16, 113–119), and the generalized maximum likelihood (GML) ratio test and the generalized cross validation (GCV) test of Wahba (Wahba, G. (1990). Spline models for observational data. CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM), were developed from the corresponding Bayesian models. Their frequentist properties have not been studied. We conduct simulations to evaluate and compare their finite-sample performance. Simulation results show that the performance of these tests depends on the shape of the true function: the LMP and GML tests are more powerful for low-frequency functions, while the GCV test is more powerful for high-frequency functions. For all test statistics, the distributions under the null hypothesis are complicated; computationally intensive Monte Carlo methods can be used to calculate the null distributions. We also propose approximations to these null distributions and evaluate their performance by simulation.

19.
The purpose of this article is to present a new method for predicting the response variable of an observation in a new cluster in multilevel logistic regression. The central idea is based on the empirical best estimator of the random effect. Two estimation methods for the multilevel model are compared: penalized quasi-likelihood and Gauss–Hermite quadrature. Performance measures for predicting the probability for a new-cluster observation under the multilevel logistic model, compared with the usual logistic model, are examined through simulations and an application.
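One ingredient behind such predictions can be shown in a few lines: when the cluster is entirely new, its random intercept is unknown, so a natural predicted probability integrates the logistic curve over the estimated random-effect distribution, here with Gauss–Hermite quadrature. This sketch shows only that ingredient, not the article's empirical-best-estimator method, and the fixed-effect value and variance are invented.

```python
import numpy as np

def predicted_prob_new_cluster(eta, sigma_u, n_nodes=30):
    """E[ 1 / (1 + exp(-(eta + u))) ] with u ~ N(0, sigma_u^2), via Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    u = np.sqrt(2.0) * sigma_u * nodes                 # change of variable for N(0, sigma_u^2)
    probs = 1.0 / (1.0 + np.exp(-(eta + u)))
    return np.sum(weights * probs) / np.sqrt(np.pi)

eta = 0.7        # x'beta for the new observation (illustrative)
sigma_u = 1.2    # estimated random-intercept standard deviation (illustrative)
print(round(predicted_prob_new_cluster(eta, sigma_u), 3))
```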

20.
Many applications require efficient sampling from Gaussian distributions. The method of choice depends on the dimension of the problem as well as on the structure of the covariance matrix (Σ) or precision matrix (Q). The most common black-box routine for computing a sample is based on Cholesky factorization. In high dimensions, computing the Cholesky factor of Σ or Q may be prohibitive, because the factor accumulates more non-zero entries than can be stored in memory. We compare different methods for computing the samples iteratively, adapting ideas from numerical linear algebra. These methods assume that matrix-vector products, Qv, are fast to compute. We show that some of the methods are competitive and faster than Cholesky sampling, and that a parallel version of one method on a Graphical Processing Unit (GPU) using CUDA can give a speed-up of up to 30x. Moreover, one method is used to sample from the posterior distribution of petroleum reservoir parameters in a North Sea field, given seismic reflection data on a large 3D grid.
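As a taste of what an iterative, factorization-free sampler looks like, the sketch below runs a component-wise Gibbs sweep for N(0, Q^{-1}), the stochastic analogue of a Gauss–Seidel iteration; it touches Q only row by row. It is offered as a generic illustration and is not necessarily one of the specific methods compared in the article; the small dense Q is purely for demonstration.

```python
import numpy as np

def gibbs_gaussian_sample(Q, n_sweeps=200, seed=0):
    """Approximate draw from N(0, Q^{-1}) using component-wise Gibbs updates."""
    rng = np.random.default_rng(seed)
    x = np.zeros(Q.shape[0])
    for _ in range(n_sweeps):
        for i in range(len(x)):
            # the full conditional of x_i given the rest is Gaussian with these moments
            cond_mean = -(Q[i] @ x - Q[i, i] * x[i]) / Q[i, i]
            cond_sd = 1.0 / np.sqrt(Q[i, i])
            x[i] = rng.normal(cond_mean, cond_sd)
    return x

Q = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])     # a small precision matrix with sparse structure
samples = np.array([gibbs_gaussian_sample(Q, seed=s) for s in range(2000)])
print(np.cov(samples.T).round(2))      # close to inv(Q)
```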
