Similar Literature
20 similar records found
1.
Boxplots are among the most widely used exploratory data analysis (EDA) tools in statistical practice. Typical applications of boxplots include eliciting information about the underlying distribution (shape, location, etc.) as well as identifying possible outliers. This article focuses on a modification using lower and upper fences similar in concept to those of the traditional boxplot; however, instead of constructing the fences from the lower and upper quartiles plus a multiple of the interquartile range (IQR), multiples of the lower and upper semi-interquartile ranges (SIQR), measured from the sample median, are used. Any observation beyond the proposed fences is labeled a potential outlier. An exact expression for the probability that at least one sample observation is wrongly classified as an outlier, the so-called "some-outside rate per sample" (Hoaglin et al., 1986), is derived for the family of location-scale distributions and is used to determine the fence constants. Tables of fence constants are provided for a number of well-known location-scale distributions, along with illustrations with data; the performance of the outlier detection rule is explored in a simulation study.
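A minimal sketch of the SIQR-based fence rule described above, using only the Python standard library. The fence constants here are arbitrary placeholders for illustration; the article derives distribution-specific constants that control the some-outside rate.

```python
import statistics

def siqr_fences(data, k_lower=3.0, k_upper=3.0):
    """Flag potential outliers using fences built from the lower and upper
    semi-interquartile ranges, measured from the sample median.
    The constants k_lower/k_upper are illustrative placeholders."""
    # Q1, median, Q3 via the statistics module's default (exclusive) convention.
    q1, med, q3 = statistics.quantiles(data, n=4)
    lower = med - k_lower * (med - q1)  # fence from the lower SIQR
    upper = med + k_upper * (q3 - med)  # fence from the upper SIQR
    return [x for x in data if x < lower or x > upper]

sample = [2.1, 2.4, 2.5, 2.6, 2.8, 3.0, 3.1, 3.3, 9.7]
print(siqr_fences(sample))  # the isolated large value is flagged
```

Because the two semi-ranges are scaled separately, the rule adapts to asymmetry in a way the single-IQR fence cannot.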

2.
The functional boxplot is an attractive technique for visualizing data that come from functions. We propose an alternative to the functional boxplot based on depth measures. Our proposal generalizes the usual construction of the boxplot in one dimension, which relies on the downward and upward orderings of the data, by considering two intuitive pre-orders in the functional context. These orderings are based on the epigraphs and hypographs of the data, which allow a new definition of functional quartiles that is more robust to shape outliers. Simulated and real examples show that this proposal provides a convenient visualization technique with great potential for analyzing functional data, and they illustrate its usefulness for detecting outliers that other procedures miss.
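The hypograph-based pre-order mentioned above can be sketched as a simple index: the fraction of sample curves whose graph lies entirely on or below a given curve. This is a deliberately simplified version for curves evaluated on a common grid, not the full construction from the paper.

```python
def hypograph_index(curve, sample):
    """Fraction of sample curves lying entirely on or below `curve`.
    A simplified hypograph-style ordering for gridded functional data."""
    below = sum(all(y <= x for x, y in zip(curve, other)) for other in sample)
    return below / len(sample)

curves = [
    [0.0, 0.1, 0.2],  # lowest curve
    [0.5, 0.6, 0.7],  # middle curve
    [1.0, 1.1, 1.2],  # highest curve
]
ranks = [hypograph_index(c, curves) for c in curves]
print(ranks)  # higher curves dominate more of the sample
```

Sorting curves by such an index gives the down-upward ordering from which functional quartiles can be read off.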

3.
This paper addresses the problem of identifying groups whose feature-variable means satisfy specific conditions. In this study, we refer to the identified groups as "target clusters" (TCs). To identify TCs, we propose a method based on the normal mixture model (NMM) restricted by a linear combination of means. We provide an expectation–maximization (EM) algorithm to fit the restricted NMM by maximum likelihood. The convergence property of the EM algorithm and a reasonable set of initial estimates are presented. We demonstrate the method's usefulness and validity through a simulation study and two well-known data sets. The proposed method yields several types of useful clusters that would be difficult to obtain with conventional clustering or exploratory data analysis methods based on the ordinary NMM. A simple comparison with another target-clustering approach shows that the proposed method is promising.

4.
The presence of extreme outliers in the upper-tail data of an income distribution affects Pareto tail modeling. A simulation study is carried out to compare the performance of three types of boxplot in detecting extreme outliers in Pareto data: the standard boxplot, the adjusted boxplot, and the generalized boxplot. The generalized boxplot is found to be the best method for identifying extreme outliers in Pareto-distributed data. As an application, the generalized boxplot is used to identify extreme outliers in the upper tail of the Malaysian income distribution. In addition, for this data set, a confidence-interval method is applied to examine the presence of dragon-kings: extreme outliers lying beyond the Pareto or power-law distribution.
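A quick demonstration of why the standard boxplot is a poor fit for Pareto data, motivating the adjusted and generalized variants compared in the abstract. This only simulates the classical 1.5-IQR rule on heavy-tailed data; it is not the generalized boxplot itself.

```python
import random
import statistics

random.seed(1)

def standard_boxplot_outliers(data, k=1.5):
    """Classical Tukey rule: points beyond Q1 - k*IQR or Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return [x for x in data if x < q1 - k * iqr or x > q3 + k * iqr]

# Heavy-tailed Pareto sample: the symmetric fences flag a sizable share of
# perfectly regular observations as "outliers".
sample = [random.paretovariate(1.5) for _ in range(1000)]
rate = len(standard_boxplot_outliers(sample)) / len(sample)
print(f"flagged as outliers: {rate:.1%}")
```

The nontrivial flagging rate on clean Pareto data is exactly the failure mode that skewness-aware fences are designed to fix.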

5.
The boxplot is an effective data-visualization tool useful in diverse applications and disciplines. Although more sophisticated graphical methods exist, the boxplot remains relevant due to its simplicity, interpretability, and usefulness, even in the age of big data. This article highlights the origins and developments of the boxplot that is now widely viewed as an industry standard as well as its inherent limitations when dealing with data from skewed distributions, particularly when detecting outliers. The proposed Ratio-Skewed boxplot is shown to be practical and suitable for outlier labeling across several parametric distributions.

6.
Model-based clustering clusters data under an assumed statistical model structure. In this paper, we propose a novel model-based hierarchical clustering method for a finite statistical mixture model based on the Fisher distribution. The main aims of the proposed method are: (a) to provide an efficient solution for estimating the parameters of a Fisher mixture model (FMM); (b) to generate a hierarchy of FMMs; and (c) to select the optimal model. To this end, we develop a Bregman soft clustering method for the FMM. Our model estimation strategy exploits Bregman divergence and hierarchical agglomerative clustering, while our model selection strategy comprises a parsimony-based approach and an evaluation-graph-based approach. We empirically validate the proposed method on simulated data and then apply it to real data for depth image analysis, demonstrating that it can serve as a potential tool for unsupervised depth image analysis.

7.
An important problem in network analysis is to identify significant communities. Most real-world data sets exhibit a certain topological structure between nodes and the attributes describing them. In this paper, we propose a new community detection criterion considering both structural similarities and attribute similarities. The clustering method integrates the cost of clustering node attributes with the cost of clustering the structural information via the normalized modularity. We show that the joint clustering problem can be formulated as a spectral relaxation problem. The proposed algorithm is capable of learning the degrees of contribution of individual node attributes. A number of numerical studies involving simulated and real data sets demonstrate the effectiveness of the proposed method.

8.
Missing-Data Processing Based on Clustering and Association Rules
This paper proposes a new method for handling missing data based on clustering and association rules. A clustering method first groups the records of a data set containing missing values into classes of similar records; an improved association-rule method then mines the relationships among variables within each sub-data set, and these relationships are used to impute the missing values. An example analysis shows that the method handles missing data well, especially for massive data sets.
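A greatly simplified sketch of the cluster-then-impute idea. Grouping on a complete attribute stands in for the clustering step, and the within-group median stands in for the mined association rules; the field names `segment` and `income` are hypothetical.

```python
import statistics
from collections import defaultdict

def impute_by_group(records, group_key, target):
    """Fill missing `target` values with the median of records that share
    the same `group_key` value (a stand-in for a learned cluster)."""
    groups = defaultdict(list)
    for r in records:
        if r[target] is not None:
            groups[r[group_key]].append(r[target])
    for r in records:
        if r[target] is None:
            r[target] = statistics.median(groups[r[group_key]])
    return records

data = [
    {"segment": "A", "income": 30}, {"segment": "A", "income": 34},
    {"segment": "A", "income": None}, {"segment": "B", "income": 80},
    {"segment": "B", "income": 84}, {"segment": "B", "income": None},
]
impute_by_group(data, "segment", "income")
print([r["income"] for r in data])
```

The payoff of grouping first is visible here: a global median would impute the same value for both missing entries, while the group-wise fill respects the two very different income levels.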

9.
Traditional methods of calculating quartiles for ungrouped data are based on interpolation. In this article we focus on three such methods of defining quartiles. We then present hinges, which divide the data into four parts by a lower hinge, a median, and an upper hinge; a hinge is "crudely, a quartile." The preceding four techniques may yield different numerical answers when applied to the same set of data. Two tests are proposed and used to evaluate the various methods of calculating quartiles and hinges. Finally, an alternative method of calculating quartiles is provided; it retains the desirable characteristics of quartiles and combines them with the advantages ascribed to hinges.
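The point that different conventions give different numbers is easy to see concretely. The sketch below compares two interpolation-based quartile conventions from the Python standard library with Tukey-style hinges (medians of the lower and upper halves, including the overall median in each half when n is odd).

```python
import statistics

def tukey_hinges(data):
    """Lower hinge, median, upper hinge."""
    s = sorted(data)
    n = len(s)
    half = (n + 1) // 2  # halves share the median when n is odd
    return (statistics.median(s[:half]),
            statistics.median(s),
            statistics.median(s[n - half:]))

data = [1, 2, 3, 4, 5, 6]
print(statistics.quantiles(data, n=4, method="exclusive"))  # one interpolation convention
print(statistics.quantiles(data, n=4, method="inclusive"))  # another convention
print(tukey_hinges(data))                                   # hinge-based answer
```

Even on six clean integers the three conventions disagree on both quartiles, which is precisely why the abstract proposes tests to compare them.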

10.
We consider regression analysis in generalized linear models when some covariates are incomplete, whether due to measurement error or to missingness for some study subjects. We assume there exists a validation sample in which the data are complete and which is a simple random subsample of the whole sample. Based on the projection-solution method of Heyde (1997, Quasi-Likelihood and its Applications: A General Approach to Optimal Parameter Estimation. Springer, New York), a class of estimating functions is proposed to estimate the regression coefficients from the whole data set. This method does not require a correctly specified parametric model for the incomplete covariates to yield a consistent estimate, and it avoids the 'curse of dimensionality' encountered in existing semiparametric methods. Simulation results show that the finite-sample performance and efficiency of the proposed estimates are satisfactory. The approach is also computationally convenient and hence suitable for routine data analysis.

11.
In this article, we present a novel approach to clustering finite- or infinite-dimensional objects observed with different uncertainty levels. The novelty lies in using confidence sets rather than point estimates to obtain cluster membership and the number of clusters, based on the distances between the confidence-set estimates. The minimal and maximal distances between the confidence-set estimates provide confidence intervals for the true distances between objects. The upper bounds of these confidence intervals can be used to minimize the within-cluster variability, and the lower bounds can be used to maximize the between-cluster variability. We assign objects to the same cluster based on a min–max criterion and separate clusters based on a max–min criterion. We illustrate the technique by clustering a large number of curves and evaluate the clustering procedure with a synthetic example and a specific application.
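A toy sketch of the interval-distance idea for one-dimensional objects whose confidence sets are intervals. The merge rule shown (join two objects when even the largest plausible distance between them is small) is a simplified reading of the min-max criterion, and the threshold is an arbitrary illustration value.

```python
def interval_dist_bounds(a, b):
    """Minimal and maximal distances between two interval estimates
    a = (lo, hi) and b = (lo, hi)."""
    lo = max(0.0, max(a[0], b[0]) - min(a[1], b[1]))  # 0 when intervals overlap
    hi = max(abs(a[1] - b[0]), abs(b[1] - a[0]))      # farthest endpoint pair
    return lo, hi

def same_cluster(a, b, threshold):
    """Simplified min-max rule: merge when even the maximal distance is small."""
    _, max_dist = interval_dist_bounds(a, b)
    return max_dist <= threshold

a, b, c = (0.0, 1.0), (0.5, 1.5), (5.0, 6.0)
print(same_cluster(a, b, threshold=2.0), same_cluster(a, c, threshold=2.0))
```

The pair of bounds is the key object: the upper bound drives conservative merging, while the lower bound can drive conservative separation.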

12.
In this article, we deal with a two-parameter exponentiated half-logistic distribution. We consider the estimation of the unknown parameters, the associated reliability function, and the hazard rate function under progressive Type II censoring. Maximum likelihood estimates (MLEs) are proposed for the unknown quantities. Bayes estimates are derived with respect to squared-error, LINEX, and entropy loss functions. Approximate explicit expressions for all Bayes estimates are obtained using the Lindley method, and an importance sampling scheme is also used to compute them. Markov chain Monte Carlo samples are further used to produce credible intervals for the unknown parameters. Asymptotic confidence intervals are constructed using the normality property of the MLEs; for comparison purposes, bootstrap-p and bootstrap-t confidence intervals are also constructed. A comprehensive numerical study is performed to compare the proposed estimates. Finally, a real-life data set is analysed to illustrate the proposed methods of estimation.

13.
The balanced iterative reducing and clustering using hierarchies (BIRCH) algorithm handles massive data sets by reading the data file only once, clustering the data as they are read, and retaining only a few clustering features to summarize the data read so far. BIRCH makes it possible to analyse data sets that are too large to fit in the computer's main memory. We propose estimates of Spearman's ρ and Kendall's τ that are calculated from a BIRCH output and assess their performance through Monte Carlo studies. The numerical results show that the BIRCH-based estimates can achieve the same efficiency as the usual estimates of ρ and τ while using only a fraction of the memory otherwise required.
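The "clustering feature" that lets BIRCH summarize data in one pass is a small triple: the count, the linear sum, and the sum of squares. The sketch below shows the one-dimensional case: two such triples can be merged by addition, and the mean and variance of the combined data are recoverable without revisiting any observation. (The rank-correlation estimates in the abstract build on this summarization idea; they are not reproduced here.)

```python
def cf_of(xs):
    """Clustering feature (N, linear sum, sum of squares) of a 1-D batch."""
    return (len(xs), sum(xs), sum(x * x for x in xs))

def merge_cf(cf1, cf2):
    """Merging two clustering features is component-wise addition."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, ls1 + ls2, ss1 + ss2)

def cf_mean_var(cf):
    """Mean and population variance recovered from a clustering feature."""
    n, ls, ss = cf
    mean = ls / n
    return mean, ss / n - mean * mean

left, right = [1.0, 2.0, 3.0], [4.0, 5.0]
merged = merge_cf(cf_of(left), cf_of(right))
print(cf_mean_var(merged))  # identical to computing on the full data
```

This additivity is what allows the data file to be read once: each new point updates one triple, and triples merge freely as the tree of subclusters is rebuilt.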

14.

Kaufman and Rousseeuw (1990) proposed the clustering algorithm Partitioning Around Medoids (PAM), which maps a distance matrix into a specified number of clusters. A particularly nice property is that PAM allows clustering with respect to any specified distance metric. In addition, the medoids are robust representations of the cluster centers, which is particularly important in the common situation that many elements do not belong well to any cluster. Based on our experience clustering gene expression data, we have noticed that PAM has trouble recognizing relatively small clusters in situations where good partitions around medoids clearly exist. In this paper, we propose to partition around medoids by maximizing the "average silhouette" criterion defined by Kaufman and Rousseeuw (1990), and we also propose a fast-to-compute approximation of it. We implement these two new partitioning-around-medoids algorithms and illustrate their performance relative to existing partitioning methods in simulations.
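The average-silhouette criterion being maximized above can be computed directly from its definition: for each point, a is the mean distance to its own cluster and b the mean distance to the nearest other cluster, and s = (b - a) / max(a, b). This is a bare-bones sketch for small data, not the paper's optimization algorithm.

```python
def silhouette(point, own_cluster, other_clusters, dist):
    """Silhouette width s = (b - a) / max(a, b) for a single point."""
    a = sum(dist(point, q) for q in own_cluster if q != point) / max(1, len(own_cluster) - 1)
    b = min(sum(dist(point, q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

def average_silhouette(clusters, dist):
    """Mean silhouette width over all points in all clusters."""
    scores = [silhouette(p, c, [o for o in clusters if o is not c], dist)
              for c in clusters for p in c]
    return sum(scores) / len(scores)

d = lambda x, y: abs(x - y)
good = [[1.0, 2.0], [10.0, 11.0]]  # tight, well-separated clusters
bad = [[1.0, 10.0], [2.0, 11.0]]   # the same points, badly partitioned
print(average_silhouette(good, d), average_silhouette(bad, d))
```

A well-separated partition scores near 1 while a scrambled one goes negative, which is why average silhouette serves as an objective for choosing partitions.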

15.

We propose two nonparametric Bayesian methods to cluster big data and apply them to cluster genes by patterns of gene–gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion that requires a single cycle (or a few cycles) of simple deterministic calculations for each observation under study. The second scheme is an exact method that divides the data into smaller subsamples and involves local partitions that can be determined in parallel; in a second step, it requires only the sufficient statistics of each local cluster to derive global clusters. On simulated and benchmark data sets the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes, and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene–gene interactions extracted from the online search tool "Zodiac."


16.
The paper provides a method for generating epoch estimates for time series survey data, allowing for different periods of time (or even point estimates) according to user demand. The method uses a modified kriging estimator, which suppresses the contribution of sampling error variability in order to guarantee that custom epoch estimates have an interpolation property. For the veteran population variable of the American Community Survey, we utilize a simple Brownian Motion model of the population process and derive the modified kriging estimator for this case. The tuning parameters of this population model can be calibrated to the data via simple formulas. We illustrate the application of this method to the generation of point estimates of veteran population, an important objective for Veterans Affairs.

17.
18.
It is important to identify outliers, since their inclusion, especially when parametric methods are used, can distort an analysis and lead to erroneous conclusions. One of the easiest and most useful detection methods is based on the boxplot, which is particularly appealing because potential outliers do not influence its measure of spread. Two methods, one by Carling and another by Schwertman and de Silva, adjust the boxplot method for sample size and skewness. In this paper, the two procedures are compared both theoretically and by Monte Carlo simulation. Simulations using both a symmetric distribution and an asymmetric distribution were performed on data sets with no, one, and several outliers. Based on the simulations, the Carling approach is superior at avoiding masking; that is, it is less likely to overlook an outlier. The Schwertman and de Silva procedure is much better at reducing swamping, that is, misclassifying a regular observation as an outlier. Carling's method relates to the Schwertman and de Silva procedure much as the comparisonwise error rate relates to the experimentwise error rate in multiple comparisons. The two methods, rather than being competitors, appear to complement each other; used in tandem, they give the data analyst a more complete perspective for identifying possible outliers.
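The swamping behavior these adjustments target can be measured by Monte Carlo: on clean data, how often does a boxplot rule flag at least one point anyway (the "some-outside rate per sample" from item 1)? The sketch below estimates this for the classical unadjusted 1.5-IQR rule on normal samples; it does not implement Carling's or Schwertman and de Silva's specific adjustments.

```python
import random
import statistics

random.seed(7)

def flags_any_outlier(sample, k=1.5):
    """True if the classical boxplot rule flags at least one point."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    iqr = q3 - q1
    return any(x < q1 - k * iqr or x > q3 + k * iqr for x in sample)

# Monte Carlo estimate of the some-outside rate per sample: the chance that
# a clean normal sample still contains at least one flagged point.
n_reps, n = 2000, 20
rate = sum(flags_any_outlier([random.gauss(0, 1) for _ in range(n)])
           for _ in range(n_reps)) / n_reps
print(f"some-outside rate (n={n}): {rate:.2f}")
```

That this rate is well above zero even for clean normal data is the motivation for sample-size-adjusted fence constants.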

19.
Clustering due to unobserved heterogeneity may seriously affect inference from binary regression models. We examined the performance of the logistic and the logistic-normal models for data with such clustering. For the logistic model, it is the total variance of the unobserved heterogeneity, rather than the level of clustering, that determines the size of the bias of the maximum likelihood (ML) estimator. Incorrectly specifying the clustering as level 2 in the logistic-normal model yields biased estimates of both the structural and the random parameters, while specifying level 1 yields unbiased estimates of the former and adequate estimates of the latter. The proposed procedure should appeal to many research areas.

20.
This paper is concerned with using the E-Bayesian method to compute estimates of the parameter of the exponentiated distribution family. Based on the LINEX loss function, formulas for the E-Bayesian estimates of the unknown parameter are given; these estimates are derived under a conjugate prior. Moreover, a property of E-Bayesian estimation, namely the relationship among E-Bayesian estimates under different prior distributions of the hyperparameters, is also provided. A comparison between the new method and the corresponding maximum likelihood technique is conducted via Monte Carlo simulation. Finally, the methods are applied to a practical problem involving golfers' income data; the results show that the proposed method is feasible and convenient to apply.
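Under the LINEX loss L(d, t) = exp(a(d - t)) - a(d - t) - 1, the Bayes estimate is d* = -(1/a) log E[exp(-a t)], where the expectation is over the posterior. The sketch below approximates this from posterior draws; the gamma posterior used is a hypothetical illustration, not the exponentiated-family posterior from the paper, and the E-Bayesian step (averaging over hyperparameter priors) is not shown.

```python
import math
import random

random.seed(3)

def linex_bayes_estimate(posterior_samples, a):
    """Bayes estimate under LINEX loss: d* = -(1/a) * log E[exp(-a*t)],
    with the posterior expectation replaced by a sample average."""
    m = sum(math.exp(-a * t) for t in posterior_samples) / len(posterior_samples)
    return -math.log(m) / a

# Hypothetical posterior draws for a positive parameter (gamma shape 5, scale 0.4).
draws = [random.gammavariate(5.0, 0.4) for _ in range(20000)]
mean = sum(draws) / len(draws)
print(mean, linex_bayes_estimate(draws, a=1.0))
```

With a > 0 the LINEX estimate sits below the posterior mean, reflecting the asymmetric penalty on overestimation; squared-error loss would return the posterior mean itself.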
