Similar Documents
20 similar documents found.
1.
In this paper, we address the problem of simulating from a data-generating process for which the observed data do not follow a regular probability distribution. One existing method for doing this is bootstrapping, but it is incapable of interpolating between observed data. For univariate or bivariate data, in which a mixture structure can easily be identified, we could instead simulate from a Gaussian mixture model. In general, though, we would face the problem of identifying and estimating the mixture model. Instead, we introduce a non-parametric method for simulating such datasets: Kernel Carlo Simulation. Our algorithm begins by using kernel density estimation to build a target probability distribution. An envelope function that is guaranteed to lie above the target distribution is then created, and simple accept–reject sampling is used. Our approach is more flexible than others, can simulate intelligently across gaps in the data, and requires no subjective modelling decisions. With several univariate and multivariate examples, we show that our method returns simulated datasets that retain the covariance structure of the observed data and have remarkably similar distributional characteristics.
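For illustration, the sketch below shows the basic KDE-plus-envelope accept–reject idea on univariate toy data; it is not the authors' implementation, and the flat envelope, the padding of the support, and all variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Illustrative sketch only: accept-reject sampling from a Gaussian KDE target
# with a flat envelope over a padded data range (not the paper's algorithm).
rng = np.random.default_rng(0)
data = rng.normal(loc=[0, 4], scale=1.0, size=(200, 2)).ravel()  # bimodal toy sample

kde = gaussian_kde(data)                          # target density estimate
lo, hi = data.min() - 3.0, data.max() + 3.0       # support padded beyond the data
M = 1.1 * kde(np.linspace(lo, hi, 1000)).max()    # envelope constant above the target

def sample_kde(n):
    out = []
    while len(out) < n:
        x = rng.uniform(lo, hi)                   # proposal from the flat envelope
        if rng.uniform(0.0, M) < kde(x).item():   # accept with probability f_hat(x) / M
            out.append(x)
    return np.array(out)

simulated = sample_kde(500)                       # can interpolate across gaps in the data
```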

2.
Effectively solving the label switching problem is critical for both Bayesian and frequentist mixture model analyses. In this article, a new relabeling method is proposed by extending a recently developed modal clustering algorithm. First, the posterior distribution is estimated by a kernel density estimate (KDE) built from permuted MCMC or bootstrap samples of the parameters. Second, a modal EM algorithm is used to find the m! symmetric modes of the KDE. Finally, samples that ascend to the same mode are assigned the same label. Simulations and real data applications demonstrate that the new method provides more accurate estimates than many existing relabeling methods.

3.
We propose a Bayesian nonparametric procedure for density estimation for data in a closed, bounded interval, say [0,1]. To this end, we use a prior based on Bernstein polynomials. This corresponds to expressing the density of the data as a mixture of given beta densities, with random weights and a random number of components. The density estimate is then obtained as the corresponding predictive density function. A comparison with classical and Bayesian kernel estimates is provided. The proposed procedure is illustrated in an example, and an MCMC algorithm for approximating the estimate is also discussed.
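As a point of reference, the sketch below computes the classical (frequentist) Bernstein-polynomial density estimator on [0,1], which has the same mixture-of-betas form; the Bayesian procedure in the paper instead places a prior on the number of components and the weights and reports the predictive density. The degree k, the toy data, and all names are illustrative.

```python
import numpy as np
from scipy.stats import beta

# Classical Bernstein-polynomial density estimator on [0, 1]: a mixture of
# Beta(j, k - j + 1) densities with weights given by empirical CDF increments.
def bernstein_density(x, data, k):
    ecdf = lambda t: np.mean(data <= t)                               # empirical CDF
    w = np.array([ecdf(j / k) - ecdf((j - 1) / k) for j in range(1, k + 1)])
    comps = np.array([beta.pdf(x, j, k - j + 1) for j in range(1, k + 1)])
    return w @ comps                                                  # weighted beta mixture

rng = np.random.default_rng(1)
data = rng.beta(2, 5, size=300)          # toy sample on [0, 1]
grid = np.linspace(0, 1, 101)
fhat = bernstein_density(grid, data, k=20)
```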

4.
In this paper, we present an algorithm for clustering based on univariate kernel density estimation, named ClusterKDE. It consists of an iterative procedure in which, at each step, a new cluster is obtained by minimizing a smooth kernel function. Although in our applications we have used the univariate Gaussian kernel, any smooth kernel function can be used. The proposed algorithm has the advantage of not requiring the number of clusters a priori. Furthermore, the ClusterKDE algorithm is very simple, easy to implement, well defined and stops in a finite number of steps; that is, it always converges regardless of the initial point. We also illustrate our findings with numerical experiments obtained by implementing the algorithm in Matlab and applying it to practical problems. The results indicate that the ClusterKDE algorithm is competitive and fast when compared with the well-known Clusterdata and K-means algorithms used by Matlab for clustering data.
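The abstract does not spell out the iteration, so the sketch below only illustrates the general idea of density-based clustering with a univariate Gaussian kernel (a mean-shift-style ascent to modes of the KDE); it should not be read as the ClusterKDE algorithm itself, and the bandwidth h, the rounding tolerance, and the toy data are illustrative assumptions.

```python
import numpy as np

# Generic density-mode clustering sketch (not ClusterKDE): each point ascends
# to a mode of the univariate Gaussian KDE; points reaching the same mode share a label.
def find_mode(x, data, h, tol=1e-8, max_iter=500):
    for _ in range(max_iter):
        w = np.exp(-0.5 * ((x - data) / h) ** 2)   # Gaussian kernel weights
        x_new = np.sum(w * data) / np.sum(w)       # weighted mean (mean-shift step)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x

def cluster_by_modes(data, h):
    modes = np.array([find_mode(x, data, h) for x in data])
    # ascents ending at (numerically) the same mode are merged into one cluster
    return np.unique(np.round(modes, 3), return_inverse=True)[1]

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(0, 0.3, 100), rng.normal(3, 0.3, 100)])
labels = cluster_by_modes(data, h=0.3)
```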

5.
Failure time models are considered when there is a subpopulation of individuals that is immune, or not susceptible, to an event of interest. Such models are of considerable interest in biostatistics. The most common approach is to postulate a proportion p of immunes or long-term survivors and to use a mixture model [5]. This paper introduces the defective inverse Gaussian model as a cure model and examines the use of the Gibbs sampler, together with a data augmentation algorithm, to study Bayesian inference for both the cured fraction and the regression parameters. The results of the Bayesian and likelihood approaches are illustrated on two real data sets.

6.
In economics and government statistics, aggregated data rather than individual-level data are usually reported, for data confidentiality and for simplicity. In this paper we develop a method of flexibly estimating the probability density function of the population using aggregated data obtained as group averages when individual-level data are grouped according to quantile limits. The kernel density estimator has commonly been applied to such data without taking the aggregation process into account and has been shown to perform poorly. Our method models the quantile function as an integral of the exponential of a spline function and deduces the density function from the quantile function. We match the aggregated data to their theoretical counterparts using least squares, and regularize the estimation by using the squared second derivative of the density function as the penalty. A computational algorithm is developed to implement the method. Applications to simulated data and US household income survey data show that our penalized spline estimator can accurately recover the density function of the underlying population, whereas the common use of kernel density estimation is severely biased. The method is applied to study the dynamics of China's urban income distribution using published interval-aggregated data for 1985–2010.

7.
The kernel density estimate of the household income density function is non-continuous, so the size of the population lying in a specific income interval cannot be computed by direct integration. Building on the kernel density estimate, a bisection recursive algorithm is therefore constructed to measure the size of specific income groups. Using micro survey data on the per capita net income of rural Chinese residents from the China Health and Nutrition Survey, the income distribution of rural Chinese residents is estimated by kernel density estimation, and the rural poverty headcount ratio in China is then computed with the bisection recursive algorithm. The results show that, allowing for some differences in the micro data source and data content, the computed rural poverty incidence follows the trend published by the National Bureau of Statistics, with only small numerical differences. Using a bisection recursive algorithm on top of a kernel density estimate is therefore an effective way to compute the size of specific income groups.
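A minimal sketch of the idea, under the assumption that the quantity of interest is the share of the population below a poverty line: a Gaussian KDE of income is integrated with a recursive bisection (adaptive midpoint) rule. The synthetic incomes, poverty line, tolerance, and all names are illustrative, not values taken from the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Recursive bisection quadrature: split an interval until the one-panel and
# two-panel midpoint estimates agree to within the tolerance.
def bisect_integrate(f, a, b, tol=1e-6):
    mid = (a + b) / 2
    whole = (b - a) * f(mid)
    halves = (mid - a) * f((a + mid) / 2) + (b - mid) * f((mid + b) / 2)
    if abs(whole - halves) < tol:
        return halves
    return (bisect_integrate(f, a, mid, tol / 2) +
            bisect_integrate(f, mid, b, tol / 2))

rng = np.random.default_rng(3)
income = rng.lognormal(mean=8.0, sigma=0.6, size=2000)   # synthetic per capita incomes
kde = gaussian_kde(income)
z = 2300.0                                               # illustrative poverty line
poverty_rate = bisect_integrate(lambda x: kde(x).item(), income.min(), z)
```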

8.
A copula can fully characterize the dependence of multiple variables. The purpose of this paper is to provide a Bayesian nonparametric approach to the estimation of a copula, which we do by mixing over a class of parametric copulas. In particular, we show that any bivariate copula density can be approximated arbitrarily well by an infinite mixture of Gaussian copula density functions. The model can be estimated by Markov chain Monte Carlo methods and is demonstrated on both simulated and real data sets.

9.
A method of regularized discriminant analysis for discrete data, denoted DRDA, is proposed. This method is related to the regularized discriminant analysis conceived by Friedman (1989) in a Gaussian framework for continuous data. Here, we are concerned with discrete data and consider the classification problem using the multinomial distribution. DRDA was conceived for the small-sample, high-dimensional setting. The method occupies an intermediate position between multinomial discrimination, the first-order independence model and kernel discrimination. DRDA is characterized by two parameters, the values of which are calculated by minimizing a sample-based estimate of future misclassification risk by cross-validation. The first parameter is a complexity parameter which provides class-conditional probabilities as a convex combination of those derived from the full multinomial model and the first-order independence model. The second parameter is a smoothing parameter associated with the discrete kernel of Aitchison and Aitken (1976). The optimal complexity parameter is calculated first; then, holding this parameter fixed, the optimal smoothing parameter is determined. A modified approach, in which the smoothing parameter is chosen first, is discussed. The efficiency of the method is compared with that of other classical methods through applications to data.
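To make the convex-combination idea concrete, the sketch below blends a full multinomial (cell-frequency) estimate with a first-order independence estimate of a class-conditional probability for binary features; the smoothing with the Aitchison–Aitken kernel and the cross-validated choice of both parameters are omitted, and the value of lam, the toy data, and all names are illustrative.

```python
import numpy as np

# Complexity-parameter idea in DRDA (illustrative): convex combination of the
# full multinomial model and the first-order independence model for one class.
def class_conditional(x, X_class, lam):
    # full multinomial: relative frequency of the exact cell x in this class
    p_full = np.mean(np.all(X_class == x, axis=1))
    # first-order independence: product of the marginal feature frequencies
    marg = np.where(x == 1, X_class.mean(axis=0), 1.0 - X_class.mean(axis=0))
    p_indep = np.prod(marg)
    return lam * p_full + (1.0 - lam) * p_indep

rng = np.random.default_rng(4)
X_class = (rng.random((30, 5)) < 0.4).astype(int)   # small sample, 5 binary features
x_new = np.array([1, 0, 1, 1, 0])
p_hat = class_conditional(x_new, X_class, lam=0.3)  # lam = 1 recovers the pure multinomial
```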

10.
Kernel density classification and boosting: an L2 analysis
Kernel density estimation is a commonly used approach to classification. However, most of the theoretical results for kernel methods apply to estimation per se and not necessarily to classification. In this paper we show that when estimating the difference between two densities, the optimal smoothing parameters are increasing functions of the sample size of the complementary group, and we provide a small simulation study which examines the relative performance of kernel density methods when the final goal is classification. A relative newcomer to the classification portfolio is boosting, and this paper proposes an algorithm for boosting kernel density classifiers. We note that boosting is closely linked to a previously proposed method of bias reduction in kernel density estimation and indicate how it will enjoy similar properties for classification. We show that boosting kernel classifiers reduces the bias whilst only slightly increasing the variance, with an overall reduction in error. Numerical examples and simulations are used to illustrate the findings, and we also suggest further areas of research.
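The sketch below shows plain two-class kernel density classification (assign a point to the class with the larger prior-weighted density estimate), the baseline classifier studied here; the boosting scheme and the L2 analysis of the smoothing parameters are not reproduced, and the toy samples and names are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Two-class kernel density classification: fit one KDE per class and pick the
# class with the larger prior-weighted density at the query point.
rng = np.random.default_rng(5)
x0 = rng.normal(0, 1, 150)               # class 0 training sample
x1 = rng.normal(2, 1, 100)               # class 1 training sample
f0, f1 = gaussian_kde(x0), gaussian_kde(x1)
pi0, pi1 = len(x0) / 250, len(x1) / 250  # empirical class priors

def classify(x):
    return (pi1 * f1(x) > pi0 * f0(x)).astype(int)   # 1 if class 1 wins

grid = np.linspace(-3, 5, 9)
print(classify(grid))
```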

11.
In many real-world applications, the traditional theory of analysis of covariance (ANCOVA) leads to inadequate and unreliable results because the response observations violate the essential Gaussian assumption, which may be due to population heterogeneity, the presence of outliers, or both. In this paper, we develop a Gaussian mixture ANCOVA model for modelling heterogeneous populations with a finite number of subpopulations. We provide the maximum likelihood estimates of the model parameters via an EM algorithm. We also derive the adjusted effect estimators for treatments and covariates. The Fisher information matrix of the model and asymptotic confidence intervals for the parameters are also discussed. We performed a simulation study to assess the performance of the proposed model. A real-world example is also worked out to illustrate the methodology.

12.
In this paper, we are concerned with nonparametric estimation of the density and failure rate functions of a random variable X which is at risk of being censored. First, we establish the asymptotic normality of a kernel density estimator in a general censoring setup. Then, we apply our result to derive the asymptotic normality of both the density and failure rate estimators in the cases of right-, twice- and doubly-censored data. Finally, the performance and the asymptotic Gaussian behaviour of the studied estimators, based on either doubly or twice censored data, are illustrated through a simulation study.

13.
This article is concerned with testing multiple hypotheses, one for each of a large number of small data sets. Such data are sometimes referred to as high-dimensional, low-sample-size data. Our model assumes that each observation within a randomly selected small data set follows a mixture of C shifted and rescaled versions of an arbitrary density f. A novel kernel density estimation scheme, in conjunction with clustering methods, is applied to estimate f. The Bayesian information criterion and a new criterion, the weighted mean of within-cluster variances, are used to estimate C, the number of mixture components or clusters. These results are applied to the multiple testing problem. The null sampling distribution of each test statistic is determined by f, and hence a bootstrap procedure that resamples from an estimate of f is used to approximate this null distribution.

14.
In this article we propose an automatic bandwidth selection for recursive kernel density estimators for spatial data defined by a stochastic approximation algorithm. We show that, using the selected bandwidth and the stepsize which minimize the MWISE (mean weighted integrated squared error), the recursive estimator is quite similar to the non-recursive one in terms of estimation error and much better in terms of computational cost. In addition, we obtain a central limit theorem for the nonparametric recursive density estimator under some mild conditions.
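For orientation, the sketch below runs a generic recursive (stochastic-approximation) kernel density estimator on a fixed grid, where each new observation updates the previous estimate without revisiting the whole sample; the stepsize and bandwidth sequences are common textbook choices, not the MWISE-optimal ones derived in the article, and all names are illustrative.

```python
import numpy as np

# Recursive KDE: f_n = (1 - gamma_n) * f_{n-1} + gamma_n * K_h_n(grid - X_n),
# so each observation is processed once and then discarded.
def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(6)
sample = rng.normal(0, 1, 1000)
grid = np.linspace(-4, 4, 201)
f = np.zeros_like(grid)
for n, x_n in enumerate(sample, start=1):
    gamma_n = 1.0 / n                 # illustrative stepsize sequence
    h_n = n ** (-1 / 5)               # illustrative bandwidth sequence
    f = (1 - gamma_n) * f + gamma_n * gauss((grid - x_n) / h_n) / h_n
```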

15.
We introduce a combined two-stage least-squares (2SLS)–expectation maximization (EM) algorithm for estimating vector-valued autoregressive conditional heteroskedasticity models with standardized errors generated by Gaussian mixtures. The procedure incorporates the identification of the parametric settings as well as the estimation of the model parameters. Our approach does not require a priori knowledge of the Gaussian densities. The parametric settings of the 2SLS_EM algorithm are determined by the genetic hybrid algorithm (GHA). We test the GHA-driven 2SLS_EM algorithm on some simulated cases and on international asset pricing data. The statistical properties of the estimated models and the derived mixture densities indicate good performance of the algorithm. We conduct tests on a massively parallel processor supercomputer to cope with situations involving numerous mixtures. We show that the algorithm is scalable.

16.
In this paper, we extend Choi and Hall's [Data sharpening as a prelude to density estimation. Biometrika. 1999;86(4):941–947] data sharpening algorithm for kernel density estimation to interval-censored data. Data sharpening has several advantages, including bias and mean integrated squared error (MISE) reduction as well as increased robustness to bandwidth misspecification. Several interval metrics are explored for use with the kernel function in the data sharpening transformation. A simulation study based on randomly generated data is conducted to assess and compare the performance of each interval metric. It is found that the bias is reduced by sharpening, often with little effect on the variance, thus maintaining or reducing overall MISE. Applications involving time to onset of HIV and running distances subject to measurement error are used for illustration.
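As background, the sketch below applies the original Choi–Hall sharpening step to ordinary (uncensored) data: each observation is replaced by a locally weighted average of its neighbours before the usual KDE is computed, which is the source of the bias reduction. The interval-censored extension and the interval metrics studied in the paper are not shown, and the bandwidth rule and toy data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Data sharpening (uncensored version): one kernel-weighted averaging step
# applied to the data before ordinary kernel density estimation.
def sharpen(data, h):
    diffs = (data[:, None] - data[None, :]) / h
    w = np.exp(-0.5 * diffs ** 2)                 # Gaussian kernel weights
    return (w @ data) / w.sum(axis=1)             # sharpened observations

rng = np.random.default_rng(7)
data = rng.gamma(shape=2.0, scale=1.0, size=300)
h = 1.06 * data.std() * len(data) ** (-1 / 5)     # rule-of-thumb bandwidth (illustrative)
kde_sharp = gaussian_kde(sharpen(data, h))        # KDE built on the sharpened data
```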

17.
In this paper, we propose the MulticlusterKDE algorithm, which classifies the elements of a database into categories based on their similarity. MulticlusterKDE is centered on repeated optimization of the kernel density estimate with a multivariate Gaussian kernel. One of the main features of the proposed algorithm is that the number of clusters is an optional input parameter. Furthermore, it is very simple, easy to implement, well defined, stops in a finite number of steps and always converges regardless of the data set. We illustrate our findings by implementing the algorithm in the R software. The results indicate that the MulticlusterKDE algorithm is competitive when compared with the K-means, K-medoids, CLARA, DBSCAN and PdfCluster algorithms. Features such as simplicity and efficiency make the proposed algorithm an attractive and promising basis both for its own improvement and for the development of new density-based clustering algorithms.

18.
We propose a modification of the regular kernel density estimation method that uses asymmetric kernels to circumvent the spill-over problem for densities with positive support. First, a pivoting method is introduced for the placement of the data relative to the kernel function. This yields a strongly consistent density estimator that integrates to one for each fixed bandwidth, in contrast to most density estimators based on asymmetric kernels proposed in the literature. Then a data-driven Bayesian local bandwidth selection method is presented, and lognormal, gamma, Weibull and inverse Gaussian kernels are discussed as useful special cases. Simulation results and a real-data example illustrate the advantages of the new methodology.
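For context, the sketch below evaluates a standard gamma-kernel density estimator for non-negative data, in which each observation contributes a gamma density whose shape depends on the evaluation point; this only illustrates the asymmetric-kernel idea the paper builds on. The pivoting step and the Bayesian local bandwidth selection proposed by the authors are not reproduced, and the bandwidth b and the data are illustrative.

```python
import numpy as np
from scipy.stats import gamma

# Gamma-kernel density estimate for data on [0, inf): average of gamma
# densities with shape x/b + 1 and scale b, evaluated at the observations.
def gamma_kde(x, data, b):
    return np.mean(gamma.pdf(data[None, :], a=x[:, None] / b + 1, scale=b), axis=1)

rng = np.random.default_rng(8)
data = rng.exponential(scale=2.0, size=400)   # toy positive-support sample
grid = np.linspace(0.01, 10, 100)
fhat = gamma_kde(grid, data, b=0.3)           # b is an illustrative bandwidth
```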

19.
In this article, we develop a Poisson-mixed inverse Gaussian (PMIG) distribution. The mixed inverse Gaussian distribution is a mixture of the inverse Gaussian distribution and its length-biased counterpart. A PMIG regression model is developed and maximum likelihood estimation of the parameters is studied. A dataset on the number of hospital stays among the elderly population is analyzed using the PMIG and the PIG (Poisson-inverse Gaussian) regression models, and it is shown that the PMIG model fits the data better than the PIG model.

20.
Integro-difference equations (IDEs) provide a flexible framework for dynamic modeling of spatio-temporal data. The choice of kernel in an IDE model relates directly to the underlying physical process being modeled, and it can affect model fit and predictive accuracy. We introduce Bayesian nonparametric methods to the IDE literature as a means of allowing flexibility in modeling the kernel. We propose a mixture of normal distributions for the IDE kernel, built from a spatial Dirichlet process for the mixing distribution, which can model kernels whose shape changes with location. This allows the IDE model to capture non-stationarity with respect to location and to reflect a changing physical process across the domain. We address the computational demands of inference by using Hermite polynomials as a basis for the representation of the process and the IDE kernel, and by incorporating Hamiltonian Markov chain Monte Carlo steps in the posterior simulation method. An example with synthetic data demonstrates that the model can successfully capture location-dependent dynamics. Moreover, using a data set of ozone pressure, we show that the spatial Dirichlet process mixture model outperforms several alternative models for the IDE kernel, including the state of the art in the IDE literature, namely a Gaussian kernel with location-dependent parameters.
