1.
In this paper, we describe some results of an ESPRIT project known as StatLog whose purpose is the comparison of classification algorithms. We give a brief summary of some of the algorithms in the project: discriminant analysis; nearest neighbours; decision trees; neural net methods; SMART; kernel methods and other Bayesian approaches. We focus on data sets derived from images, ranging from raw pixel data to features and summaries extracted from such data.

2.
We describe standard single-site Markov chain Monte Carlo methods (the Hastings and Metropolis algorithms, the Gibbs sampler and simulated annealing) for maximum a posteriori and marginal posterior mode image estimation. These methods can experience great difficulty in traversing the whole image space in a finite time when the target distribution is multi-modal. We present a survey of multiple-site update methods, including Swendsen and Wang's algorithm, coupled Markov chains and cascade algorithms designed to tackle the problem of moving between modes of the posterior image distribution. We compare the performance of some of these algorithms for sampling from degraded and non-degraded Ising models.
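To make the single-site machinery concrete, the following is a minimal sketch (not code from the survey itself) of one Gibbs sweep over a binary Ising image; the inverse temperature and lattice size are illustrative assumptions.

```python
import numpy as np

def gibbs_sweep(x, beta, rng):
    """One full sweep of single-site Gibbs updates on an Ising lattice.

    x    : 2-D array of spins in {-1, +1}
    beta : inverse temperature of the Ising prior
    """
    n, m = x.shape
    for i in range(n):
        for j in range(m):
            # Sum of the four nearest neighbours (free boundary).
            s = 0
            if i > 0:     s += x[i - 1, j]
            if i < n - 1: s += x[i + 1, j]
            if j > 0:     s += x[i, j - 1]
            if j < m - 1: s += x[i, j + 1]
            # Full conditional: P(x_ij = +1 | rest) = 1 / (1 + exp(-2*beta*s)).
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * s))
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x

rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(32, 32))
for _ in range(100):          # burn-in sweeps
    x = gibbs_sweep(x, beta=0.4, rng=rng)
```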

3.
Top coding of extreme values of variables like income is a common method of statistical disclosure control, but it creates problems for the data analyst. The paper proposes two alternatives to top coding for statistical disclosure control that are based on multiple imputation. We show in simulation studies that the multiple-imputation methods provide better inferences from the publicly released data than top coding, using straightforward multiple-imputation methods of analysis, while maintaining good statistical disclosure control properties. We illustrate the methods on data from the 1995 Chinese household income project.
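As a rough illustration of the imputation idea (not the paper's own model), the sketch below replaces top-coded incomes with draws from a Pareto tail fitted to the observed upper tail; the tail threshold, the Pareto model and all names are assumptions for the example.

```python
import numpy as np

def impute_topcoded(incomes, topcode, m=5, rng=None):
    """Create m completed data sets in which values at the top code are
    replaced by draws from a Pareto tail fitted below the top code.
    A sketch of the general idea only, not the paper's imputation model."""
    rng = rng or np.random.default_rng()
    below = incomes[incomes < topcode]
    u = np.quantile(below, 0.90)                   # tail threshold: an assumption
    tail = below[below >= u]
    alpha = tail.size / np.sum(np.log(tail / u))   # Hill / Pareto shape MLE
    hit = incomes >= topcode
    completed = []
    for _ in range(m):
        x = incomes.astype(float).copy()
        # Pareto draws conditioned to exceed the top code (Pareto tails are
        # self-similar, so the scale simply moves up to the top code).
        x[hit] = topcode * (rng.pareto(alpha, size=hit.sum()) + 1.0)
        completed.append(x)
    return completed
```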

4.
From an algorithmic perspective, this paper surveys in some detail the origin, evolution and frontier research of association rules, and on that basis identifies future research areas and development trends for the field. The paper first examines three classes of classical association rule algorithms in detail, and then summarizes extensions of association rule algorithms to complex data attributes, laying the groundwork for examining other algorithmic extensions and for introducing research on association rules in other disciplines.
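For readers unfamiliar with the classical algorithms such surveys cover, here is a minimal Apriori-style frequent-itemset search; the toy transactions and support threshold are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets whose support (fraction of transactions
    containing them) meets min_support.

    Classical level-wise search: candidates of size k are built by
    joining frequent itemsets of size k - 1, then pruned by counting.
    """
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in items
             if sum(s <= t for t in transactions) / n >= min_support}
    frequent, k = {}, 2
    while level:
        for s in level:
            frequent[s] = sum(s <= t for t in transactions) / n
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        level = {a | b for a in level for b in level if len(a | b) == k}
        level = {s for s in level
                 if sum(s <= t for t in transactions) / n >= min_support}
        k += 1
    return frequent

baskets = [frozenset(t) for t in
           [{"milk", "bread"}, {"milk", "eggs"},
            {"milk", "bread", "eggs"}, {"bread"}]]
print(apriori(baskets, min_support=0.5))
```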

5.
In this paper, we discuss the class of generalized Birnbaum–Saunders distributions, a very flexible family suitable for modeling lifetime data, as it allows for different degrees of kurtosis and asymmetry as well as both unimodality and bimodality. We describe the theoretical developments on this model, including properties, transformations and related distributions, lifetime analysis, and shape analysis. We also discuss methods of inference based on uncensored and censored data, diagnostic methods, goodness-of-fit tests, and random number generation algorithms for the generalized Birnbaum–Saunders model. Finally, we present some illustrative examples and show that this distribution fits the data better than the classical Birnbaum–Saunders model.
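One simple generation algorithm of the kind surveyed uses the normal representation of the classical model: if Z ~ N(0, 1), then T = beta * (alpha*Z/2 + sqrt((alpha*Z/2)^2 + 1))^2 follows BS(alpha, beta); generalized versions substitute another symmetric law for Z. A minimal sketch (parameter values illustrative):

```python
import numpy as np

def rbs(n, alpha, beta, rng=None):
    """Draw n variates from the classical Birnbaum-Saunders BS(alpha, beta)
    via its normal representation. Generalized BS versions replace Z with
    a heavier- or lighter-tailed symmetric variable."""
    rng = rng or np.random.default_rng()
    z = rng.standard_normal(n)
    w = alpha * z / 2.0
    return beta * (w + np.sqrt(w**2 + 1.0))**2

sample = rbs(10_000, alpha=0.5, beta=1.0)
print(sample.mean())   # should be near beta * (1 + alpha**2 / 2) = 1.125
```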

6.
Data augmentation is required for the implementation of many Markov chain Monte Carlo (MCMC) algorithms. The inclusion of augmented data can often lead to conditional distributions from well-known probability distributions for some of the parameters in the model. In such cases, collapsing (integrating out parameters) has been shown to improve the performance of MCMC algorithms. We show how integrating out the infection rate parameter in epidemic models leads to efficient MCMC algorithms for two very different epidemic scenarios: final outcome data from a multitype SIR epidemic, and longitudinal data from a spatial SI epidemic. The resulting MCMC algorithms give fresh insight into real-life epidemic data sets.

7.
Full likelihood-based inference for modern population genetics data presents methodological and computational challenges. The problem is of considerable practical importance and has attracted recent attention, with the development of algorithms based on importance sampling (IS) and Markov chain Monte Carlo (MCMC) sampling. Here we introduce a new IS algorithm. The optimal proposal distribution for these problems can be characterized, and we exploit a detailed analysis of genealogical processes to develop a practicable approximation to it. We compare the new method with existing algorithms on a variety of genetic examples. Our approach substantially outperforms existing IS algorithms, with efficiency typically improved by several orders of magnitude. The new method also compares favourably with existing MCMC methods in some problems, and less favourably in others, suggesting that both IS and MCMC methods have a continuing role to play in this area. We offer insights into the relative advantages of each approach, and we discuss diagnostics in the IS framework.
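A generic importance sampling estimator of the kind being tuned here, with the effective sample size as a basic diagnostic, can be sketched as follows; the proposal and target are placeholders, not the paper's genealogical constructions.

```python
import numpy as np

def importance_sample(log_target, log_proposal, sampler, n, rng):
    """Return the log of an importance sampling estimate plus the
    effective sample size (ESS) diagnostic.

    sampler(rng)     -> one draw x from the proposal q
    log_proposal(x)  -> log q(x)
    log_target(x)    -> log of the unnormalized target contribution
    """
    log_w = np.empty(n)
    for i in range(n):
        x = sampler(rng)
        log_w[i] = log_target(x) - log_proposal(x)
    # Log-sum-exp for numerical stability, then the average weight.
    m = log_w.max()
    log_est = m + np.log(np.mean(np.exp(log_w - m)))
    # ESS collapses toward 1 when the proposal is poor, which is the
    # kind of diagnostic the paper discusses in the IS framework.
    w = np.exp(log_w - m)
    ess = w.sum()**2 / (w**2).sum()
    return log_est, ess
```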

8.
The problem of computing the variance of a sample of N data points {x_i} may be difficult for certain data sets, particularly when N is large and the variance is small. We present a survey of possible algorithms and their round-off error bounds, including some new analysis for computations with shifted data. Experimental results confirm these bounds and illustrate the dangers of some algorithms. Specific recommendations are made as to which algorithm should be used in various contexts.
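For reference, the sketch below contrasts the hazardous textbook one-pass formula with Welford's stable update and a shifted two-pass computation of the kind the paper analyses; the test data are illustrative.

```python
import numpy as np

def variance_naive(x):
    """Textbook one-pass formula; suffers catastrophic cancellation
    when the variance is small relative to the mean."""
    n = len(x)
    return (np.sum(x**2) - np.sum(x)**2 / n) / (n - 1)

def variance_welford(x):
    """Welford's updating algorithm: one pass, numerically stable."""
    mean, m2 = 0.0, 0.0
    for k, xi in enumerate(x, start=1):
        delta = xi - mean
        mean += delta / k
        m2 += delta * (xi - mean)
    return m2 / (len(x) - 1)

def variance_shifted(x):
    """Two-pass computation on data shifted by a rough mean estimate,
    the kind of scheme analysed in the paper."""
    d = x - x[0]                     # any shift near the mean works
    return (np.sum(d**2) - np.sum(d)**2 / len(x)) / (len(x) - 1)

# Large mean, small variance: the naive formula loses nearly all digits.
x = 1e8 + np.random.default_rng(1).standard_normal(1_000)
print(variance_naive(x), variance_welford(x), variance_shifted(x))
```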

9.
The analysis of infectious disease data presents challenges arising from the dependence in the data and the fact that only part of the transmission process is observable. These difficulties are usually overcome by making simplifying assumptions. The paper explores the use of Markov chain Monte Carlo (MCMC) methods for the analysis of infectious disease data, with the hope that they will permit analyses to be made under more realistic assumptions. Two important kinds of data sets are considered, containing temporal and non-temporal information, from outbreaks of measles and influenza. Stochastic epidemic models are used to describe the processes that generate the data. MCMC methods are then employed to perform inference in a Bayesian context for the model parameters. The MCMC methods used include standard algorithms, such as the Metropolis–Hastings algorithm and the Gibbs sampler, as well as a new method that involves likelihood approximation. It is found that standard algorithms perform well in some situations but can exhibit serious convergence difficulties in others. The inferences that we obtain are in broad agreement with estimates obtained by other methods where they are available. However, we can also provide inferences for parameters which have not been reported in previous analyses.
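The standard algorithms referred to include random-walk Metropolis–Hastings, which in its generic form is only a few lines; this sketch assumes a user-supplied log-posterior and an illustrative Gaussian proposal, not the paper's epidemic models.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_iter, step, rng):
    """Random-walk Metropolis-Hastings: propose a Gaussian perturbation
    and accept with probability min(1, posterior ratio)."""
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    chain = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(prop)
        # Accept/reject on the log scale to avoid under/overflow.
        if np.log(rng.random()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain[t] = theta
    return chain

# Toy target: standard normal posterior (illustrative only).
rng = np.random.default_rng(0)
draws = metropolis_hastings(lambda th: -0.5 * np.sum(th**2),
                            theta0=[0.0], n_iter=5_000, step=1.0, rng=rng)
print(draws.mean(), draws.std())
```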

10.
We consider computational methods for evaluating and approximating multivariate chi-square probabilities in cases where the pertaining correlation matrix or blocks thereof have a low factorial representation. To this end, techniques from matrix factorization and probability theory are applied. We outline a variety of statistical applications of multivariate chi-square distributions and provide a system of MATLAB programs implementing the proposed algorithms. Computer simulations demonstrate the accuracy and the computational efficiency of our methods in comparison with Monte Carlo approximations, and a real data example from statistical genetics illustrates their usage in practice.
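The Monte Carlo baseline against which such methods are compared is straightforward to state for a one-factor correlation structure r_ij = a_i * a_j, under which each normal decomposes as a_i * W + sqrt(1 - a_i^2) * eps. A sketch (the factor loadings, degrees of freedom and threshold are illustrative; the paper's own algorithms exploit the factor structure analytically instead):

```python
import numpy as np

def mvchisq_prob_mc(a, nu, c, n_sim=50_000, rng=None):
    """Monte Carlo estimate of P(Q_1 <= c, ..., Q_m <= c), where
    Q_j = sum of nu squared standard normals and the underlying normals
    share the one-factor correlation corr(X_i, X_j) = a_i * a_j."""
    rng = rng or np.random.default_rng()
    a = np.asarray(a, dtype=float)
    m, hits = a.size, 0
    for _ in range(n_sim):
        w = rng.standard_normal(nu)                  # shared factor, one per dof
        eps = rng.standard_normal((m, nu))           # idiosyncratic parts
        x = a[:, None] * w + np.sqrt(1.0 - a[:, None]**2) * eps
        q = np.sum(x**2, axis=1)                     # m correlated chi-square(nu)
        hits += np.all(q <= c)
    return hits / n_sim

# c = 5.99 is roughly the 95th percentile of chi-square(2).
print(mvchisq_prob_mc(a=[0.5, 0.5, 0.5], nu=2, c=5.99))
```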

11.
We develop a flexible class of Metropolis–Hastings algorithms for drawing inferences about population histories and mutation rates from deoxyribonucleic acid (DNA) sequence data. Match probabilities for use in forensic identification are also obtained, which is particularly useful for mitochondrial DNA profiles. Our data augmentation approach, in which the ancestral DNA data are inferred at each node of the genealogical tree, simplifies likelihood calculations and permits a wide class of mutation models to be employed, so that many different types of DNA sequence data can be analysed within our framework. Moreover, simpler likelihood calculations imply greater freedom for generating tree proposals, so that algorithms with good mixing properties can be implemented. We incorporate the effects of demography by means of simple mechanisms for changes in population size and structure, and we estimate the corresponding demographic parameters, but we do not here allow for the effects of either recombination or selection. We illustrate our methods by application to four human DNA data sets, consisting of DNA sequences, short tandem repeat loci, single-nucleotide polymorphism sites and insertion sites. Two of the data sets are drawn from the male-specific Y-chromosome, one from maternally inherited mitochondrial DNA and one from the β-globin locus on chromosome 11.

12.
We propose two preprocessing algorithms suitable for climate time series. The first algorithm detects outliers based on an autoregressive cost update mechanism. The second is based on the wavelet transform, a method from pattern recognition. To benchmark the algorithms' performance, we compare them to existing methods on a synthetic data set. Finally, for illustration, the proposed methods are applied to a data set of high-frequency temperature measurements from Novi Sad, Serbia. The results show that both methods together form a powerful tool for signal preprocessing: in the case of solitary outliers the autoregressive cost update mechanism prevails, whereas the wavelet-based mechanism is the method of choice in the presence of multiple consecutive outliers.
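To convey the flavour of the autoregressive scheme, here is a deliberately simplified stand-in (not the paper's cost-update rule): fit an AR(1) coefficient, then flag points whose one-step prediction error is large on a robust scale.

```python
import numpy as np

def ar1_outliers(y, k=4.0):
    """Flag observations whose one-step AR(1) prediction error exceeds
    k robust standard deviations. A simplified stand-in for the paper's
    autoregressive cost-update mechanism."""
    y = np.asarray(y, dtype=float)
    yc = y - y.mean()
    # Lag-1 autocorrelation as the AR(1) coefficient estimate.
    phi = np.dot(yc[1:], yc[:-1]) / np.dot(yc[:-1], yc[:-1])
    resid = yc[1:] - phi * yc[:-1]
    # Robust scale via the median absolute deviation.
    sigma = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    flags = np.zeros(len(y), dtype=bool)
    flags[1:] = np.abs(resid) > k * sigma
    return flags

y = np.sin(np.linspace(0, 20, 200))
y[57] += 5.0                              # inject a solitary outlier
print(np.flatnonzero(ar1_outliers(y)))   # flags the spike (and often its successor)
```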

13.
Record data are commonly encountered in many fields such as sports, geography, finance, and reliability. In this article, we use the well-known Box–Muller transformation to develop an efficient method of simulating record data from the normal distribution. Another method based on exponential records is also discussed. Then, the performance of these algorithms is compared with some standard simulation methods.
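The Box–Muller transform and the naive scan-for-records baseline that efficient methods improve upon look as follows; this is a sketch of the ingredients, not the article's algorithm.

```python
import numpy as np

def box_muller(n, rng):
    """Generate n standard normal variates via the Box-Muller transform."""
    # Use 1 - U so the argument of log is strictly positive.
    u1 = 1.0 - rng.random((n + 1) // 2)
    u2 = rng.random((n + 1) // 2)
    r = np.sqrt(-2.0 * np.log(u1))
    z = np.concatenate([r * np.cos(2 * np.pi * u2),
                        r * np.sin(2 * np.pi * u2)])
    return z[:n]

def upper_records(x):
    """Extract the sequence of upper records by a single scan."""
    records, current = [], -np.inf
    for v in x:
        if v > current:
            records.append(v)
            current = v
    return np.array(records)

rng = np.random.default_rng(42)
print(upper_records(box_muller(10_000, rng)))
```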

14.
Inequality-restricted hypothesis testing methods, which include multivariate one-sided tests, are useful in practice, especially in multiple comparison problems. In practice, multivariate and longitudinal data often contain missing values, since it may be difficult to observe all values for each variable. However, although missing values are common for multivariate data, statistical methods for multivariate one-sided tests with missing values are quite limited. In this article, motivated by a dataset in a recent collaborative project, we develop two likelihood-based methods for multivariate one-sided tests with missing values, where the missing data patterns can be arbitrary and the missing data mechanisms may be non-ignorable. Although non-ignorable missing data are not testable based on observed data, statistical methods addressing this issue can be used for sensitivity analysis and might lead to more reliable results, since ignoring informative missingness may lead to biased analysis. We analyse the real dataset in detail under various possible missing data mechanisms and report interesting findings which were previously unavailable. We also derive some asymptotic results and evaluate our new tests using simulations.

15.
There is an increasing amount of literature focused on Bayesian computational methods for problems with intractable likelihoods. One approach is the set of algorithms known as approximate Bayesian computation (ABC) methods. A drawback of these algorithms is that their performance depends on the appropriate choice of summary statistics, distance measure and tolerance level. To circumvent this problem, an alternative method based on the empirical likelihood has been introduced. This method can be easily implemented when a set of constraints, related to the moments of the distribution, is specified. However, the choice of the constraints is sometimes challenging. To overcome this difficulty, we propose an alternative method based on a bootstrap likelihood approach. The method is easy to implement and in some cases is actually faster than the other approaches considered. We illustrate the performance of our algorithm with examples from population genetics, time series and stochastic differential equations. We also test the method on a real dataset.
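The basic ABC rejection sampler whose tuning choices (summary statistic, distance and tolerance) motivate the alternatives can be sketched as follows; the toy normal-mean model and all settings are illustrative.

```python
import numpy as np

def abc_rejection(observed_summary, prior_sampler, simulate, summary,
                  tol, n_draws, rng):
    """Basic ABC rejection: keep prior draws whose simulated summary
    lies within tol of the observed one. The summary, distance and
    tolerance are exactly the tuning choices the paper tries to avoid."""
    accepted = []
    while len(accepted) < n_draws:
        theta = prior_sampler(rng)
        s = summary(simulate(theta, rng))
        if np.abs(s - observed_summary) < tol:
            accepted.append(theta)
    return np.array(accepted)

# Toy example: infer a normal mean from its sample mean.
rng = np.random.default_rng(0)
obs = 1.3                                    # observed sample mean (illustrative)
post = abc_rejection(
    observed_summary=obs,
    prior_sampler=lambda r: r.normal(0, 5),
    simulate=lambda th, r: r.normal(th, 1, size=50),
    summary=np.mean,
    tol=0.05, n_draws=500, rng=rng)
print(post.mean(), post.std())
```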

16.
Statistical learning is emerging as a promising field in which a number of algorithms from machine learning are interpreted as statistical methods and vice versa. Owing to its good practical performance, boosting is one of the most studied machine learning techniques. We propose algorithms for multivariate density estimation and classification, generated by using traditional kernel techniques as weak learners in boosting algorithms. Our algorithms take the form of multistep estimators whose first step is a standard kernel method. Some strategies for bandwidth selection are also discussed, with regard both to the standard kernel density classification problem and to our 'boosted' kernel methods. Extensive experiments, using real and simulated data, show an encouraging practical relevance of the findings. Standard kernel methods are often outperformed by the first boosting iterations across several bandwidth values. In addition, the practical effectiveness of our classification algorithm is confirmed by a comparative study on two real datasets, the competitors being tree-based classifiers, including AdaBoost with trees.

17.
In statistical practice, inferences on standardized regression coefficients are often required but are complicated by the fact that the coefficients are nonlinear functions of the parameters, and thus standard textbook results are simply wrong. Within the frequentist domain, asymptotic delta methods can be used to construct confidence intervals for the standardized coefficients with proper coverage probabilities. Alternatively, Bayesian methods solve similar and other inferential problems by simulating data from the posterior distribution of the coefficients. In this paper, we present Bayesian procedures that provide comprehensive solutions for inference on the standardized coefficients. Simple computing algorithms are developed to generate posterior samples with no autocorrelation, based on both noninformative improper and informative proper prior distributions. Simulation studies show that Bayesian credible intervals constructed by our approaches have comparable and even better statistical properties than their frequentist counterparts, particularly in the presence of collinearity. In addition, our approaches solve some meaningful inferential problems that are difficult if not impossible from the frequentist standpoint, including identifying joint rankings of multiple standardized coefficients and making optimal decisions concerning their sizes and comparisons. We illustrate applications of our approaches through examples and make sample R functions available for implementing our proposed methods.
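Once posterior draws of the raw coefficients are available from any sampler, the standardization itself is a one-liner, and credible intervals and joint rankings follow directly from the transformed draws. A sketch, with the helper name and the commented usage purely illustrative:

```python
import numpy as np

def standardized_draws(beta_draws, X, y):
    """Convert posterior draws of raw regression coefficients into draws
    of standardized coefficients: beta_j * sd(x_j) / sd(y).

    beta_draws : (n_draws, p) array of posterior samples (intercept excluded)
    X          : (n, p) design matrix; y : (n,) response
    """
    scale = X.std(axis=0, ddof=1) / y.std(ddof=1)
    return beta_draws * scale          # broadcasts over the draws

# Downstream summaries come straight from the transformed draws, e.g.:
# std = standardized_draws(draws, X, y)
# ci = np.percentile(std, [2.5, 97.5], axis=0)   # per-coefficient credible intervals
# p_outranks = np.mean(np.abs(std[:, 0]) > np.abs(std[:, 1]))
#                                   # posterior P(coef 1 exceeds coef 2 in size)
```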

18.

We propose two nonparametric Bayesian methods to cluster big data and apply them to cluster genes by patterns of gene–gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion, which requires a single cycle (or a few cycles) of simple deterministic calculations for each observation under study. The second scheme is an exact method that divides the data into smaller subsamples and involves local partitions that can be determined in parallel. In a second step, the method requires only the sufficient statistics of each of these local clusters to derive global clusters. On simulated and benchmark data sets the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene–gene interactions extracted from the online search tool "Zodiac."

19.
In this paper, we propose maximum entropy in the mean methods for propensity score matching classification problems. We provide a new methodological approach and estimation algorithms to handle explicitly cases in which data are available (i) in interval form; (ii) with bounded measurement or observational errors; or (iii) both as intervals and with bounded errors. We show that entropy in the mean methods for these three cases generally outperform benchmark error-free approaches.

20.
Noting that several rule discovery algorithms in data mining can produce a large number of irrelevant or obvious rules from data, there has been substantial research in data mining addressing what makes rules truly 'interesting'. This has resulted in the development of a number of interestingness measures and algorithms that find all interesting rules from data. However, these approaches have the drawback that many of the discovered rules, while supposed to be interesting by definition, may actually (1) be obvious, in that they logically follow from other discovered rules, or (2) be expected, given some of the other discovered rules and some simple distributional assumptions. In this paper we argue that this is a paradox, since rules that are supposed to be interesting are, in reality, uninteresting for the above reason. We show that this paradox exists for various popular interestingness measures and present an abstract characterization of an approach to alleviate the paradox. We finally discuss existing work in data mining that addresses this issue and show how those approaches can be viewed with respect to the characterization presented here.
