期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Human immunodeficiency virus, DNA and statistics

E. C. Holmes 《Journal of the Royal Statistical Society. Series A, (Statistics in Society)》1998,161(2):199-208

Understanding the causes and consequences of genetic variation in human immunodeficiency virus (HIV) is one of the most important tasks facing medical and evolutionary biologists alike. A powerful analytical tool which is available to those working in this field is the phylogenetic tree, which describes the evolutionary relationships of the sequences in a sample and the history of the mutational events which separate them. Although phylogenetic trees of HIV are becoming commonplace, their use can be improved by tailoring the underlying statistical models to the idiosyncrasies of viral biology. The design and refinement of phylogenetic analyses consequently represents an important practical use of statistical methods in HIV research. 相似文献

2.

A Bayesian Approach to Modelling Reticulation Events with Application to the Ribosomal Protein Gene rps11 of Flowering Plants

Rosalba Radice 《Australian & New Zealand Journal of Statistics》2012,54(4):401-426

Traditional phylogenetic inference assumes that the history of a set of taxa can be explained by a tree. This assumption is often violated as some biological entities can exchange genetic material giving rise to non‐treelike events often called reticulations. Failure to consider these events might result in incorrectly inferred phylogenies. Phylogenetic networks provide a flexible tool which allows researchers to model the evolutionary history of a set of organisms in the presence of reticulation events. In recent years, a number of methods addressing phylogenetic network parameter estimation have been introduced. Some of them are based on the idea that a phylogenetic network can be defined as a directed acyclic graph. Based on this definition, we propose a Bayesian approach to the estimation of phylogenetic network parameters which allows for different phylogenies to be inferred at different parts of a multiple DNA alignment. The algorithm is tested on simulated data and applied to the ribosomal protein gene rps11 data from five flowering plants, where reticulation events are suspected to be present. The proposed approach can be applied to a wide variety of problems which aim at exploring the possibility of reticulation events in the history of a set of taxa. 相似文献

3.

MULTIPLE COMPARISONS OF LOG-LIKELIHOODS AND COMBINING NONNESTED MODELS WITH APPLICATIONS TO PHYLOGENETIC TREE SELECTION

《统计学通讯:理论与方法》2013,42(8-9):1751-1772

We consider multiple comparisons of log-likelihood's to take account of the multiplicity of testings in selection of nonnested models. A resampling version of the Gupta procedure for the selection problem is used to obtain a set of good models, which are not significantly worse than the maximum likelihood model; i.e., a confidence set of models. Our method is to test which model is better than the other, while the object of the classical testing methods is to find the correct model. Thus the null hypotheses behind these two approaches are very different. Our method and the other commonly used approaches, such as the approximate Bayesian posterior, the bootstrap selection probability, and the LR test against the full model, are applied to the selection of molecular phylogenetic tree of mammal species. Tree selection is a version of the model-based clustering, which is an example of nonnested model selection. It is shown that the structure of the tree selection problem is equivalent to that of the variable selection problem of the multiple regression with some constraints on the combinations of the variables. It turns out that the LR test rejects all the possible trees because of the misspecification of the models, whereas our method gives a reasonable confidence set. For a better understanding of the uncertainty in the selection, we combine the maximum likelihood estimates (MLE's) of the trees to obtain the full model that includes the trees as the submodels by using a linear approximation of the parametric models. The MLE of the phylogeny is then represented as a network of species rather than a tree. A geometrical interpretation of the problem is also discussed. 相似文献

4.

Bayesian estimation and case influence diagnostics for the zero-inflated negative binomial regression model 总被引：1，自引：0，他引：1

Aldo M. Garay Victor H. Lachos Heleno Bolfarine 《Journal of applied statistics》2015,42(6):1148-1165

In recent years, there has been considerable interest in regression models based on zero-inflated distributions. These models are commonly encountered in many disciplines, such as medicine, public health, and environmental sciences, among others. The zero-inflated Poisson (ZIP) model has been typically considered for these types of problems. However, the ZIP model can fail if the non-zero counts are overdispersed in relation to the Poisson distribution, hence the zero-inflated negative binomial (ZINB) model may be more appropriate. In this paper, we present a Bayesian approach for fitting the ZINB regression model. This model considers that an observed zero may come from a point mass distribution at zero or from the negative binomial model. The likelihood function is utilized to compute not only some Bayesian model selection measures, but also to develop Bayesian case-deletion influence diagnostics based on q-divergence measures. The approach can be easily implemented using standard Bayesian software, such as WinBUGS. The performance of the proposed method is evaluated with a simulation study. Further, a real data set is analyzed, where we show that ZINB regression models seems to fit the data better than the Poisson counterpart. 相似文献

5.

Bayesian neural tree models for nonparametric regression

Tanujit Chakraborty Gauri Kamat Ashis Kumar Chakraborty 《Australian & New Zealand Journal of Statistics》2023,65(2):101-126

Frequentist and Bayesian methods differ in many aspects but share some basic optimal properties. In real-life prediction problems, situations exist in which a model based on one of the above paradigms is preferable depending on some subjective criteria. Nonparametric classification and regression techniques, such as decision trees and neural networks, have both frequentist (classification and regression trees (CARTs) and artificial neural networks) as well as Bayesian counterparts (Bayesian CART and Bayesian neural networks) to learning from data. In this paper, we present two hybrid models combining the Bayesian and frequentist versions of CART and neural networks, which we call the Bayesian neural tree (BNT) models. BNT models can simultaneously perform feature selection and prediction, are highly flexible, and generalise well in settings with limited training observations. We study the statistical consistency of the proposed approaches and derive the optimal value of a vital model parameter. The excellent performance of the newly proposed BNT models is shown using simulation studies. We also provide some illustrative examples using a wide variety of standard regression datasets from a public available machine learning repository to show the superiority of the proposed models in comparison to popularly used Bayesian CART and Bayesian neural network models. 相似文献

6.

Hierarchical priors for Bayesian CART shrinkage

Chipman Hugh McCulloch Robert E. 《Statistics and Computing》2000,10(1):17-24

The Bayesian CART (classification and regression tree) approach proposed by Chipman, George and McCulloch (1998) entails putting a prior distribution on the set of all CART models and then using stochastic search to select a model. The main thrust of this paper is to propose a new class of hierarchical priors which enhance the potential of this Bayesian approach. These priors indicate a preference for smooth local mean structure, resulting in tree models which shrink predictions from adjacent terminal node towards each other. Past methods for tree shrinkage have searched for trees without shrinking, and applied shrinkage to the identified tree only after the search. By using hierarchical priors in the stochastic search, the proposed method searches for shrunk trees that fit well and improves the tree through shrinkage of predictions. 相似文献

7.

Joint Estimation of Intersecting Context Tree Models

ANTONIO GALVES AURÉLIEN GARIVIER ELISABETH GASSIAT 《Scandinavian Journal of Statistics》2013,40(2):344-362

We study a problem of model selection for data produced by two different context tree sources. Motivated by linguistic questions, we consider the case where the probabilistic context trees corresponding to the two sources are finite and share many of their contexts. In order to understand the differences between the two sources, it is important to identify which contexts and which transition probabilities are specific to each source. We consider a class of probabilistic context tree models with three types of contexts: those which appear in one, the other, or both sources. We use a BIC penalized maximum likelihood procedure that jointly estimates the two sources. We propose a new algorithm which efficiently computes the estimated context trees. We prove that the procedure is strongly consistent. We also present a simulation study showing the practical advantage of our procedure over a procedure that works separately on each data set. 相似文献

8.

Learning classification trees 总被引：11，自引：0，他引：11

Wray Buntine 《Statistics and Computing》1992,2(2):63-73

Algorithms for learning classification trees have had successes in artificial intelligence and statistics over many years. This paper outlines how a tree learning algorithm can be derived using Bayesian statistics. This introduces Bayesian techniques for splitting, smoothing, and tree averaging. The splitting rule is similar to Quinlan's information gain, while smoothing and averaging replace pruning. Comparative experiments with reimplementations of a minimum encoding approach,c4 (Quinlanet al., 1987) andcart (Breimanet al., 1984), show that the full Bayesian algorithm can produce more accurate predictions than versions of these other approaches, though pays a computational price. 相似文献

9.

Estimating finite mixture of continuous trees using penalized mutual information

Atefeh Khalili 《统计学通讯:理论与方法》2020,49(20):4974-4987

Abstract

In this paper we introduce continuous tree mixture model that is the mixture of undirected graphical models with tree structured graphs and is considered as multivariate analysis with a non parametric approach. We estimate its parameters, the component edge sets and mixture proportions through regularized maximum likalihood procedure. Our new algorithm, which uses expectation maximization algorithm and the modified version of Kruskal algorithm, simultaneosly estimates and prunes the mixture component trees. Simulation studies indicate this method performs better than the alternative Gaussian graphical mixture model. The proposed method is also applied to water-level data set and is compared with the results of Gaussian mixture model. 相似文献

10.

An Empirical Comparison of Multiple Imputation Methods for Categorical Data

Olanrewaju Akande Fan Li Jerome Reiter 《The American statistician》2017,71(2):162-170

Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. Supplementary material for this article is available online. 相似文献

11.

Tractable Bayesian learning of tree belief networks

Marina Meilă Tommi Jaakkola 《Statistics and Computing》2006,16(1):77-92

In this paper we present decomposable priors, a family of priors over structure and parameters of tree belief nets for which Bayesian learning with complete observations is tractable, in the sense that the posterior is also decomposable and can be completely determined analytically in polynomial time. Our result is the first where computing the normalization constant and averaging over a super-exponential number of graph structures can be performed in polynomial time. This follows from two main results: First, we show that factored distributions over spanning trees in a graph can be integrated in closed form. Second, we examine priors over tree parameters and show that a set of assumptions similar to Heckerman, Geiger and Chickering (1995) constrain the tree parameter priors to be a compactly parametrized product of Dirichlet distributions. Besides allowing for exact Bayesian learning, these results permit us to formulate a new class of tractable latent variable models in which the likelihood of a data point is computed through an ensemble average over tree structures. 相似文献

12.

An empirical comparison of ensemble methods based on classification trees

《Journal of Statistical Computation and Simulation》2012,82(8):629-643

In this paper, we perform an empirical comparison of the classification error of several ensemble methods based on classification trees. This comparison is performed by using 14 data sets that are publicly available and that were used by Lim, Loh and Shih [Lim, T., Loh, W. and Shih, Y.-S., 2000, A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40, 203–228.]. The methods considered are a single tree, Bagging, Boosting (Arcing) and random forests (RF). They are compared from different perspectives. More precisely, we look at the effects of noise and of allowing linear combinations in the construction of the trees, the differences between some splitting criteria and, specifically for RF, the effect of the number of variables from which to choose the best split at each given node. Moreover, we compare our results with those obtained by Lim et al. [Lim, T., Loh, W. and Shih, Y.-S., 2000, A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40, 203–228.]. In this study, the best overall results are obtained with RF. In particular, RF are the most robust against noise. The effect of allowing linear combinations and the differences between splitting criteria are small on average, but can be substantial for some data sets. 相似文献

13.

A Permutation Based Procedure for Classification Assessment

《统计学通讯:理论与方法》2012,41(16-17):3126-3137

This article proposes a permutation procedure for evaluating the performance of different classification methods. In particular, we focus on two of the most widespread and used classification methodologies: latent class analysis and k-means clustering. The classification performance is assessed by means of a permutation procedure which allows for a direct comparison of the methods, the development of a statistical test, and points out better potential solutions. Our proposal provides an innovative framework for the validation of the data partitioning and offers a guide in the choice of which classification procedure should be used 相似文献

14.

Bayesian Weibull tree models for survival analysis of clinico-genomic data

Clarke J West M 《Statistical Methodology》2008,5(3):238-262

An important goal of research involving gene expression data for outcome prediction is to establish the ability of genomic data to define clinically relevant risk factors. Recent studies have demonstrated that microarray data can successfully cluster patients into low- and high-risk categories. However, the need exists for models which examine how genomic predictors interact with existing clinical factors and provide personalized outcome predictions. We have developed clinico-genomic tree models for survival outcomes which use recursive partitioning to subdivide the current data set into homogeneous subgroups of patients, each with a specific Weibull survival distribution. These trees can provide personalized predictive distributions of the probability of survival for individuals of interest. Our strategy is to fit multiple models; within each model we adopt a prior on the Weibull scale parameter and update this prior via Empirical Bayes whenever the sample is split at a given node. The decision to split is based on a Bayes factor criterion. The resulting trees are weighted according to their relative likelihood values and predictions are made by averaging over models. In a pilot study of survival in advanced stage ovarian cancer we demonstrate that clinical and genomic data are complementary sources of information relevant to survival, and we use the exploratory nature of the trees to identify potential genomic biomarkers worthy of further study. 相似文献

15.

Phylogenetic trees via Hamming distance decomposition tests

《Journal of Statistical Computation and Simulation》2012,82(9):1287-1297

The paper considers the problem of phylogenetic tree construction. Our approach to the problem bases itself on a non-parametric paradigm seeking a model-free construction and symmetry on Types I and II errors. Trees are constructed through sequential tests using Hamming distance dissimilarity measures, from internal nodes to the tips. The method presents some novelties. The first, which is an advantage over the traditional methods, is that it is very fast, computationally efficient and feasible to be used for very large data sets. Two other novelties are its capacity to deal directly with multiple sequences per group (and built its statistical properties upon this richer information) and that the best tree will not have a predetermined number of tips, that is, the resulting number of tips will be statistically meaningful. We apply the method in two data sets of DNA sequences, illustrating that it can perform quite well even on very unbalanced designs. Computational complexities are also addressed. Supplemental materials are available online. 相似文献

16.

Inferences from DNA data: population histories, evolutionary processes and forensic match probabilities 总被引：9，自引：0，他引：9

Ian J. Wilson Michael E. Weale David J. Balding 《Journal of the Royal Statistical Society. Series A, (Statistics in Society)》2003,166(2):155-188

Summary. We develop a flexible class of Metropolis–Hastings algorithms for drawing inferences about population histories and mutation rates from deoxyribonucleic acid (DNA) sequence data. Match probabilities for use in forensic identification are also obtained, which is particularly useful for mitochondrial DNA profiles. Our data augmentation approach, in which the ancestral DNA data are inferred at each node of the genealogical tree, simplifies likelihood calculations and permits a wide class of mutation models to be employed, so that many different types of DNA sequence data can be analysed within our framework. Moreover, simpler likelihood calculations imply greater freedom for generating tree proposals, so that algorithms with good mixing properties can be implemented. We incorporate the effects of demography by means of simple mechanisms for changes in population size and structure, and we estimate the corresponding demographic parameters, but we do not here allow for the effects of either recombination or selection. We illustrate our methods by application to four human DNA data sets, consisting of DNA sequences, short tandem repeat loci, single-nucleotide polymorphism sites and insertion sites. Two of the data sets are drawn from the male-specific Y-chromosome, one from maternally inherited mitochondrial DNA and one from the β -globin locus on chromosome 11. 相似文献

17.

Impact of Contamination on Training and Test Error Rates in Statistical Clustering

C. Ruwet G. Haesbroeck 《统计学通讯:模拟与计算》2013,42(3):394-411

The k-means algorithm is one of the most common non hierarchical methods of clustering. It aims to construct clusters in order to minimize the within cluster sum of squared distances. However, as most estimators defined in terms of objective functions depending on global sums of squares, the k-means procedure is not robust with respect to atypical observations in the data. Alternative techniques have thus been introduced in the literature, e.g., the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. In this article, focus is on the error rate these clustering procedures achieve when one expects the data to be distributed according to a mixture distribution. Two different definitions of the error rate are under consideration, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. The consequence of this will be emphasized with the comparison of influence functions and breakdown points of these error rates. 相似文献

18.

Multi-step quantile regression tree

《Journal of Statistical Computation and Simulation》2012,82(3):663-682

Quantile regression (QR) proposed by Koenker and Bassett [Regression quantiles, Econometrica 46(1) (1978), pp. 33–50] is a statistical technique that estimates conditional quantiles. It has been widely studied and applied to economics. Meinshausen [Quantile regression forests, J. Mach. Learn. Res. 7 (2006), pp. 983–999] proposed quantile regression forests (QRF), a non-parametric way based on random forest. QRF performs well in terms of prediction accuracy, but it struggles with noisy data sets. This motivates us to propose a multi-step QR tree method using GUIDE (Generalized, Unbiased, Interaction Detection and Estimation) made by Loh [Regression trees with unbiased variable selection and interaction detection, Statist. Sinica 12 (2002), pp. 361–386]. Our simulation study shows that the multi-step QR tree performs better than a single tree or QRF especially when it deals with data sets having many irrelevant variables. 相似文献

19.

Invariant tests based on M-estimators,estimating functions,and the generalized method of moments

Jean-Marie Dufour Alain Trognon Purevdorj Tuvaandorj 《Econometric Reviews》2017,36(1-3):182-204

We study the invariance properties of various test criteria which have been proposed for hypothesis testing in the context of incompletely specified models, such as models which are formulated in terms of estimating functions (Godambe, 1960) or moment conditions and are estimated by generalized method of moments (GMM) procedures (Hansen, 1982), and models estimated by pseudo-likelihood (Gouriéroux, Monfort, and Trognon, 1984b,c) and M-estimation methods. The invariance properties considered include invariance to (possibly nonlinear) hypothesis reformulations and reparameterizations. The test statistics examined include Wald-type, LR-type, LM-type, score-type, and C(α)?type criteria. Extending the approach used in Dagenais and Dufour (1991), we show first that all these test statistics except the Wald-type ones are invariant to equivalent hypothesis reformulations (under usual regularity conditions), but all five of them are not generally invariant to model reparameterizations, including measurement unit changes in nonlinear models. In other words, testing two equivalent hypotheses in the context of equivalent models may lead to completely different inferences. For example, this may occur after an apparently innocuous rescaling of some model variables. Then, in view of avoiding such undesirable properties, we study restrictions that can be imposed on the objective functions used for pseudo-likelihood (or M-estimation) as well as the structure of the test criteria used with estimating functions and generalized method of moments (GMM) procedures to obtain invariant tests. In particular, we show that using linear exponential pseudo-likelihood functions allows one to obtain invariant score-type and C(α)?type test criteria, while in the context of estimating function (or GMM) procedures it is possible to modify a LR-type statistic proposed by Newey and West (1987) to obtain a test statistic that is invariant to general reparameterizations. The invariance associated with linear exponential pseudo-likelihood functions is interpreted as a strong argument for using such pseudo-likelihood functions in empirical work. 相似文献

20.

A spatial model for the needle losses of pine-trees in the forests of Baden-Württemberg: an application of Bayesian structured additive regression

Nicole H. Augustin Stefan Lang Monica Musio Klaus von Wilpert 《Journal of the Royal Statistical Society. Series C, Applied statistics》2007,56(1):29-50

Summary. The data that are analysed are from a monitoring survey which was carried out in 1994 in the forests of Baden-Württemberg, a federal state in the south-western region of Germany. The survey is part of a large monitoring scheme that has been carried out since the 1980s at different spatial and temporal resolutions to observe the increase in forest damage. One indicator for tree vitality is tree defoliation, which is mainly caused by intrinsic factors, age and stand conditions, but also by biotic (e.g. insects) and abiotic stresses (e.g. industrial emissions). In the survey, needle loss of pine-trees and many potential covariates are recorded at about 580 grid points of a 4 km × 4 km grid. The aim is to identify a set of predictors for needle loss and to investigate the relationships between the needle loss and the predictors. The response variable needle loss is recorded as a percentage in 5% steps estimated by eye using binoculars and categorized into healthy trees (10% or less), intermediate trees (10–25%) and damaged trees (25% or more). We use a Bayesian cumulative threshold model with non-linear functions of continuous variables and a random effect for spatial heterogeneity. For both the non-linear functions and the spatial random effect we use Bayesian versions of P -splines as priors. Our method is novel in that it deals with several non-standard data requirements: the ordinal response variable (the categorized version of needle loss), non-linear effects of covariates, spatial heterogeneity and prediction with missing covariates. The model is a special case of models with a geoadditive or more generally structured additive predictor. Inference can be based on Markov chain Monte Carlo techniques or mixed model technology. 相似文献