Similar Articles
20 similar articles found (search time: 31 ms).
1.
An important goal of research involving gene expression data for outcome prediction is to establish the ability of genomic data to define clinically relevant risk factors. Recent studies have demonstrated that microarray data can successfully cluster patients into low- and high-risk categories. However, the need exists for models which examine how genomic predictors interact with existing clinical factors and provide personalized outcome predictions. We have developed clinico-genomic tree models for survival outcomes which use recursive partitioning to subdivide the current data set into homogeneous subgroups of patients, each with a specific Weibull survival distribution. These trees can provide personalized predictive distributions of the probability of survival for individuals of interest. Our strategy is to fit multiple models; within each model we adopt a prior on the Weibull scale parameter and update this prior via Empirical Bayes whenever the sample is split at a given node. The decision to split is based on a Bayes factor criterion. The resulting trees are weighted according to their relative likelihood values and predictions are made by averaging over models. In a pilot study of survival in advanced stage ovarian cancer we demonstrate that clinical and genomic data are complementary sources of information relevant to survival, and we use the exploratory nature of the trees to identify potential genomic biomarkers worthy of further study.

2.
State-space models are widely used in ecology. However, it is well known that in practice it can be difficult to estimate both the process and observation variances that occur in such models. We consider this issue for integrated population models, which incorporate state-space models for population dynamics. To some extent, the mechanism of integrated population models protects against this problem, but it can still arise, and two illustrations are provided, in each of which the observation variance is estimated as zero. In the context of an extended case study involving data on British grey herons, we consider alternative approaches for dealing with the problem when it occurs. In particular, we consider penalised likelihood, a method based on fitting splines, and a method of pseudo-replication, which is undertaken via a simple bootstrap procedure. For the case study of the paper, it is shown that when it occurs, an estimate of zero observation variance is unimportant for inference relating to the model parameters of primary interest. This unexpected finding is supported by a simulation study.

3.
We consider multiple comparisons of log-likelihoods to take account of the multiplicity of tests in the selection of nonnested models. A resampling version of the Gupta procedure for the selection problem is used to obtain a set of good models, which are not significantly worse than the maximum likelihood model; i.e., a confidence set of models. Our method tests which model is better than another, while the object of the classical testing methods is to find the correct model. Thus the null hypotheses behind these two approaches are very different. Our method and the other commonly used approaches, such as the approximate Bayesian posterior, the bootstrap selection probability, and the LR test against the full model, are applied to the selection of a molecular phylogenetic tree of mammal species. Tree selection is a version of model-based clustering, which is an example of nonnested model selection. It is shown that the structure of the tree selection problem is equivalent to that of the variable selection problem of multiple regression with some constraints on the combinations of the variables. It turns out that the LR test rejects all the possible trees because of the misspecification of the models, whereas our method gives a reasonable confidence set. For a better understanding of the uncertainty in the selection, we combine the maximum likelihood estimates (MLEs) of the trees to obtain the full model that includes the trees as submodels by using a linear approximation of the parametric models. The MLE of the phylogeny is then represented as a network of species rather than a tree. A geometrical interpretation of the problem is also discussed.
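A rough sketch of the confidence-set idea (a simplified stand-in for the resampling Gupta procedure, not the authors' exact method): given per-observation log-likelihoods under each fitted model, keep every model whose bootstrapped log-likelihood deficit against the maximum likelihood model is not significantly positive. The toy candidate models below are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def bootstrap_confidence_set(loglik_per_obs, n_boot=2000, alpha=0.05):
    """loglik_per_obs: (n_models, n_obs) per-observation log-likelihoods
    under each fitted model. Keep every model whose bootstrapped
    log-likelihood deficit against the best model plausibly includes 0,
    i.e. a confidence set of models."""
    n_models, n_obs = loglik_per_obs.shape
    best = loglik_per_obs.sum(axis=1).argmax()
    keep = []
    for m in range(n_models):
        diffs = loglik_per_obs[best] - loglik_per_obs[m]
        boot = np.array([diffs[rng.integers(0, n_obs, n_obs)].sum()
                         for _ in range(n_boot)])
        if np.quantile(boot, alpha) <= 0:   # deficit not significantly positive
            keep.append(m)
    return keep

# toy demo: data from N(0,1); three candidate (nonnested-style) models
x = rng.normal(size=200)
candidates = [stats.norm(0, 1), stats.norm(0.1, 1), stats.norm(1, 1)]
ll = np.vstack([c.logpdf(x) for c in candidates])
print(bootstrap_confidence_set(ll))   # typically keeps the first two models
```

Note the contrast with classical testing: the comparison is always against the maximizer, asking which models are not significantly worse, rather than testing whether any single model is correct.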

4.
Stochastic Models, 2013, 29(3): 299–324
In this paper we consider a bottleneck link and buffer used by one or two fluid sources that are subject to feedback. The feedback is such that the model captures essential aspects of the behavior of the Transmission Control Protocol as used in the Internet. During overflow, the buffer sends negative feedback signals to the sources to indicate that the sending rate should be reduced. Otherwise the buffer sends positive signals so as to increase the rate. In this context we find closed form expressions for the solution of the one-source case. The two-source case extends the single-source model considerably: we can control the behavior and parameters of each source individually. This enables us to study the impact of these parameters on the sharing of links and buffers. For the two-source case we solve the related two-point boundary value problem in the stationary case. We also establish a numerically efficient procedure to compute the coefficients of the solution of the differential equations. The numerical results of this model are presented in an accompanying paper.

5.
Recombinant binomial trees are binary trees where each non-leaf node has two child nodes, but adjacent parents share a common child node. Such trees arise in option pricing in finance. For example, an option can be valued by evaluating the expected payoffs with respect to random paths in the tree. The cost to exactly compute expected values over random paths grows exponentially in the depth of the tree, rendering a serial computation of one branch at a time impractical. We propose a parallelization method that transforms the calculation of the expected value into an embarrassingly parallel problem by mapping the branches of the binomial tree to the processes in a multiprocessor computing environment. We also discuss a parallel Monte Carlo method and verify the convergence and the variance reduction behavior by simulation study. Performance results from R and Julia implementations are compared on a distributed computing cluster.
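For concreteness, a recombining tree can be priced serially by backward induction, and the Monte Carlo variant shows why the path-based computation is embarrassingly parallel: each random path is an independent task. This is a generic Python sketch, not the authors' R/Julia implementation.

```python
import numpy as np

def crr_call_price(S0, K, r, sigma, T, n):
    """European call on a recombining (CRR) binomial tree, priced by
    backward induction from the n + 1 terminal nodes."""
    dt = T / n
    u = np.exp(sigma * np.sqrt(dt))
    d = 1.0 / u
    p = (np.exp(r * dt) - d) / (u - d)          # risk-neutral up-probability
    disc = np.exp(-r * dt)
    j = np.arange(n + 1)                         # number of up-moves
    values = np.maximum(S0 * u**j * d**(n - j) - K, 0.0)
    for _ in range(n):                           # roll back one level at a time
        values = disc * (p * values[1:] + (1 - p) * values[:-1])
    return values[0]

def mc_call_price(S0, K, r, sigma, T, n, n_paths, seed=0):
    """Monte Carlo over random tree paths: every path is an independent
    task, which is what makes the problem embarrassingly parallel."""
    rng = np.random.default_rng(seed)
    dt = T / n
    u = np.exp(sigma * np.sqrt(dt)); d = 1.0 / u
    p = (np.exp(r * dt) - d) / (u - d)
    ups = rng.binomial(n, p, size=n_paths)       # up-moves along each path
    payoff = np.maximum(S0 * u**ups * d**(n - ups) - K, 0.0)
    return np.exp(-r * T) * payoff.mean()

print(crr_call_price(100, 100, 0.05, 0.2, 1.0, 500))          # ~10.45
print(mc_call_price(100, 100, 0.05, 0.2, 1.0, 500, 200_000))  # close to above
```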

6.
Stochastic Models, 2013, 29(3): 341–368
Abstract

We consider a flow of data packets from one source to many destinations in a communication network represented by a random oriented tree. Multicast transmission is characterized by the ability of some tree vertices to replicate received packets depending on the number of destinations downstream. We are interested in characteristics of multicast flows on Galton–Watson trees and trees generated by point aggregates of a Poisson process. Such stochastic settings are intended to represent tree shapes arising in the Internet and in some ad hoc networks. The main result in the branching process case is a functional equation for the joint probability generating function of flow volumes through a given vertex and in the whole tree. We provide conditions for the existence and uniqueness of the solution and a method to compute it using Picard iterations. In the point process case, we provide bounds on flow volumes using the technique of stochastic comparison from the theory of continuous percolation. We use these results to derive a number of characteristics of random trees and discuss their applications to analytical evaluation of the load induced on a network by a multicast session.
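The Picard-iteration idea can be shown on the simplest special case: the PGF f of the total number of vertices in a Galton–Watson tree satisfies the functional equation f(s) = s G(f(s)), with G the offspring PGF. The sketch below (not the paper's joint flow-volume equation) iterates this map numerically and checks the implied mean against the known value 1/(1 - m) for a subcritical tree.

```python
import numpy as np

def total_progeny_pgf(G, s, n_iter=300):
    """Picard iteration for the PGF f of the total number of vertices in
    a Galton-Watson tree: f(s) = s * G(f(s)), G the offspring PGF. For a
    subcritical tree (mean offspring < 1) the iteration converges."""
    f = np.asarray(s, dtype=float).copy()        # start from f_0(s) = s
    for _ in range(n_iter):
        f = s * G(f)
    return f

lam = 0.7                                        # Poisson offspring mean
G = lambda t: np.exp(lam * (t - 1.0))            # Poisson offspring PGF
f = total_progeny_pgf(G, np.array([1.0, 1.0 - 1e-6]))
print((f[0] - f[1]) / 1e-6)                      # ~ 1/(1 - lam) = 3.33
```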

7.
Finite memory sources and variable‐length Markov chains have recently gained popularity in data compression and mining, in particular, for applications in bioinformatics and language modelling. Here, we consider denser data compression and prediction with a family of sparse Bayesian predictive models for Markov chains in finite state spaces. Our approach lumps transition probabilities into classes composed of invariant probabilities, such that the resulting models need not have a hierarchical structure as in context tree‐based approaches. This can lead to a substantially higher rate of data compression, and such non‐hierarchical sparse models can be motivated for instance by data dependence structures existing in the bioinformatics context. We describe a Bayesian inference algorithm for learning sparse Markov models through clustering of transition probabilities. Experiments with DNA sequence and protein data show that our approach is competitive in both prediction and classification when compared with several alternative methods on the basis of variable memory length.
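As a toy illustration of what lumping transition probabilities into classes means, the sketch below estimates a first-order transition matrix from a DNA-like string and clusters its rows so that contexts in the same class share one transition distribution. The paper's method is a Bayesian clustering algorithm; the k-means step here is only a stand-in.

```python
import numpy as np
from sklearn.cluster import KMeans

def transition_matrix(seq, alphabet="ACGT", pseudo=0.5):
    """First-order Markov transition matrix estimated from a sequence,
    with a small pseudo-count for smoothing."""
    idx = {c: i for i, c in enumerate(alphabet)}
    counts = np.full((len(alphabet), len(alphabet)), pseudo)
    for a, b in zip(seq[:-1], seq[1:]):
        counts[idx[a], idx[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def lump_contexts(P, n_classes=2, seed=0):
    """Cluster the rows of P (the context-conditional distributions) so
    that contexts in one class share a single transition distribution."""
    km = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit(P)
    class_means = np.vstack([P[km.labels_ == c].mean(axis=0)
                             for c in range(n_classes)])
    return km.labels_, class_means[km.labels_]

seq = "ACGTGTGTACGTACACACGTGTAC" * 20
labels, P_lumped = lump_contexts(transition_matrix(seq))
print(labels)          # which contexts share a transition distribution
```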

8.
Summary.  We consider three sorts of diagnostics for random imputations: displays of the completed data, which are intended to reveal unusual patterns that might suggest problems with the imputations; comparisons of the distributions of observed and imputed data values; and checks of the fit of observed data to the model that is used to create the imputations. We formulate these methods in terms of sequential regression multivariate imputation, which is an iterative procedure in which the missing values of each variable are randomly imputed conditionally on all the other variables in the completed data matrix. We also consider a recalibration procedure for sequential regression imputations. We apply these methods to the 2002 environmental sustainability index, which is a linear aggregation of 64 environmental variables on 142 countries.
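The second diagnostic, comparing the distributions of observed and imputed values, is easy to sketch. The fragment below performs a single regression-based imputation draw on simulated data and compares marginal quantiles; it is a toy stand-in for a full sequential regression multivariate imputation cycle.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy data: y depends linearly on x; 30% of y is missing at random
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
miss = rng.random(n) < 0.3

# one regression-imputation draw: fit y ~ x on observed cases, then
# impute missing y by sampling from the fitted conditional model
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X[~miss], y[~miss], rcond=None)
resid = y[~miss] - X[~miss] @ beta
sigma = resid.std(ddof=2)
y_imp = X[miss] @ beta + rng.normal(scale=sigma, size=miss.sum())

# diagnostic: marginal quantiles of observed versus imputed values
for q in (0.1, 0.5, 0.9):
    print(q, round(np.quantile(y[~miss], q), 2), round(np.quantile(y_imp, q), 2))
```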

9.
For the analysis of binary data, various deterministic models have been proposed, which are generally simpler to fit and easier to understand than probabilistic models. We claim that corresponding to any deterministic model is an implicit stochastic model in which the deterministic model fits imperfectly, with errors occurring at random. In the context of binary data, we consider a model in which the probability of error depends on the model prediction. We show how to fit this model using a stochastic modification of deterministic optimization schemes. The advantages of fitting the stochastic model explicitly (rather than implicitly, by simply fitting a deterministic model and accepting the occurrence of errors) include quantification of uncertainty in the deterministic model’s parameter estimates, better estimation of the true model error rate, and the ability to check the fit of the model nontrivially. We illustrate this with a simple theoretical example of item response data and with empirical examples from archeology and the psychology of choice.
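A minimal instance of the construction, on synthetic data: let deterministic 0/1 predictions carry a class-dependent error probability and fit those probabilities by maximum likelihood (which here reduces to the per-class empirical error rate). The resulting log-likelihood is what supports the uncertainty quantification and model checking described above.

```python
import numpy as np

def fit_error_rates(pred, obs):
    """MLE of the stochastic wrapper P(error | prediction = c) = eps_c:
    in this simple case, the per-class empirical error rate."""
    return {c: float(np.mean(obs[pred == c] != c)) for c in (0, 1)}

def log_likelihood(pred, obs, eps):
    """Log-likelihood under the error model; the basis for uncertainty
    quantification and nontrivial model checking."""
    e = np.where(pred == 1, eps[1], eps[0])
    return float(np.sum(np.where(obs != pred, np.log(e), np.log1p(-e))))

rng = np.random.default_rng(2)
pred = rng.integers(0, 2, 300)                          # deterministic predictions
obs = np.where(rng.random(300) < 0.15, 1 - pred, pred)  # 15% random errors
eps = fit_error_rates(pred, obs)
print(eps, log_likelihood(pred, obs, eps))
```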

10.
Bayesian Additive Regression Trees (BART) is a statistical sum-of-trees model. It can be considered a Bayesian version of machine learning tree ensemble methods where the individual trees are the base learners. However, for datasets where the number of variables p is large the algorithm can become inefficient and computationally expensive. Another method which is popular for high-dimensional data is random forests, a machine learning algorithm which grows trees using a greedy search for the best split points. However, its default implementation does not produce probabilistic estimates or predictions. We propose an alternative fitting algorithm for BART called BART-BMA, which uses Bayesian model averaging and a greedy search algorithm to obtain a posterior distribution more efficiently than BART for datasets with large p. BART-BMA incorporates elements of both BART and random forests to offer a model-based algorithm which can deal with high-dimensional data. We have found that BART-BMA can be run in a reasonable time on a standard laptop for the “small n large p” scenario which is common in many areas of bioinformatics. We showcase this method using simulated data and data from two real proteomic experiments, one to distinguish between patients with cardiovascular disease and controls and another to classify aggressive from non-aggressive prostate cancer. We compare our results with those of the main competing methods. Open source code written in R and Rcpp to run BART-BMA can be found at: https://github.com/BelindaHernandez/BART-BMA.git.
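The model-averaging flavor can be imitated loosely with off-the-shelf tools: grow a few greedy trees and weight their predictions by approximate BIC-based posterior weights. The sketch below is emphatically not the BART-BMA algorithm (no sum-of-trees representation, no priors on tree structure); it only illustrates weighting greedily grown trees by an approximate marginal likelihood.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
n, p = 100, 200                               # "small n, large p"
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n)

models, bics = [], []
for depth in (1, 2, 3, 4):                    # a small greedy "model space"
    t = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    rss = np.sum((y - t.predict(X)) ** 2)
    k = t.get_n_leaves()                      # crude parameter count
    bics.append(n * np.log(rss / n) + k * np.log(n))
    models.append(t)

w = np.exp(-0.5 * (np.array(bics) - min(bics)))
w /= w.sum()                                  # approximate posterior weights
y_hat = sum(wi * m.predict(X) for wi, m in zip(w, models))
print(np.round(w, 3))                         # weight per candidate tree
```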

11.
We consider nonparametric estimation of a regression curve when the data are observed with Berkson errors or with a mixture of classical and Berkson errors. In this context, other existing nonparametric procedures can either estimate the regression curve consistently on a very small interval or require complicated inversion of an estimator of the Fourier transform of a nonparametric regression estimator. We introduce a new estimation procedure which is simpler to implement, and study its asymptotic properties. We derive convergence rates which are faster than those previously obtained in the literature, and we prove that these rates are optimal. We suggest a data-driven bandwidth selector and apply our method to some simulated examples.

12.
Abstract

We consider the classification of high-dimensional data under the strongly spiked eigenvalue (SSE) model. We create a new classification procedure based on the high-dimensional eigenstructure in the high-dimension, low-sample-size context. We propose a distance-based classification procedure by using a data transformation. We also prove that our proposed classification procedure has the consistency property for misclassification rates. We discuss the performance of our classification procedure in simulations and real data analyses using microarray data sets.

13.
We consider probabilistic inference of model parameters given error bars or confidence intervals on model output values, when the underlying data are unavailable. We introduce a class of algorithms in a Bayesian framework, relying on maximum entropy arguments and approximate Bayesian computation methods, to generate data sets consistent with the given summary statistics. Once we obtain consistent data sets, we pool the respective posteriors to arrive at a single, averaged density on the parameters. This approach allows us to perform accurate forward uncertainty propagation consistent with the reported statistics.
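A bare-bones version of the idea, assuming the report gives only a sample mean, its standard error and the sample size: draw parameters from vague priors, simulate data sets of the reported size, and keep draws whose simulated summaries match the reported ones within a tolerance (rejection ABC). The maximum entropy arguments and the pooling over multiple consistent data sets are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)

# all we are given: a sample mean, its standard error, and the sample size
reported_mean, reported_se, n = 4.2, 0.3, 25

def abc_posterior(n_draws=200_000, tol=0.1):
    """Rejection ABC: draw (mu, sigma) from vague priors, simulate a data
    set of size n, and keep draws whose simulated summaries match the
    reported ones within tolerance tol."""
    mu = rng.normal(0.0, 10.0, n_draws)
    sigma = rng.uniform(0.1, 10.0, n_draws)
    sims = rng.normal(mu[:, None], sigma[:, None], (n_draws, n))
    m = sims.mean(axis=1)
    se = sims.std(axis=1, ddof=1) / np.sqrt(n)
    keep = (np.abs(m - reported_mean) < tol) & (np.abs(se - reported_se) < tol)
    return mu[keep], sigma[keep]

mu_post, sigma_post = abc_posterior()
print(len(mu_post), mu_post.mean().round(2), sigma_post.mean().round(2))
```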

14.
Quantitative model validation is playing an increasingly important role in performance and reliability assessment of a complex system whenever computer modelling and simulation are involved. The foci of this paper are to pursue a Bayesian probabilistic approach to quantitative model validation with non-normal data, considering data uncertainty, and to investigate the impact of the normality assumption on validation accuracy. The Box–Cox transformation method is employed to convert the non-normal data, with the purpose of facilitating the overall validation assessment of computational models with higher accuracy. Explicit expressions for the interval hypothesis testing-based Bayes factor are derived for the transformed data in the context of univariate and multivariate cases. A Bayesian confidence measure is presented based on the Bayes factor metric. A generalized procedure is proposed to implement the proposed probabilistic methodology for model validation of complicated systems. A classical hypothesis testing method is employed to conduct a comparison study. The impact of the data normality assumption and decision threshold variation on model assessment accuracy is investigated by using both classical and Bayesian approaches. The proposed methodology and procedure are demonstrated with a univariate stochastic damage accumulation model, a multivariate heat conduction problem and a multivariate dynamic system.
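The transformation step and an interval-hypothesis Bayes factor can be sketched as follows. The Bayes factor used here is the generic posterior-odds to prior-odds ratio for |mu - mu0| <= eps under a conjugate normal model, not the explicit expressions derived in the paper, and the model prediction is a made-up number.

```python
import numpy as np
from scipy import stats

def interval_bayes_factor(z, mu0, eps, tau=1.0):
    """Bayes factor for the interval hypothesis |mu - mu0| <= eps under a
    normal likelihood (variance estimated from z) and a N(mu0, tau^2)
    prior on mu: posterior odds of the interval divided by prior odds."""
    n, s2 = len(z), z.var(ddof=1)
    post_var = 1.0 / (1.0 / tau**2 + n / s2)          # conjugate update
    post_mean = post_var * (mu0 / tau**2 + n * z.mean() / s2)
    post = stats.norm(post_mean, np.sqrt(post_var))
    prior = stats.norm(mu0, tau)
    p1 = post.cdf(mu0 + eps) - post.cdf(mu0 - eps)
    p0 = prior.cdf(mu0 + eps) - prior.cdf(mu0 - eps)
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

rng = np.random.default_rng(5)
raw = rng.lognormal(0.0, 0.4, 60)         # skewed "experimental" data
z, lam = stats.boxcox(raw)                # transform toward normality
model_pred = 1.05                         # made-up deterministic model output
mu0 = (model_pred**lam - 1) / lam         # the prediction, transformed alike
print(round(lam, 2), interval_bayes_factor(z, mu0, eps=0.2))
```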

15.
Abstract. Spatial Cox point processes are a natural framework for quantifying the various sources of variation governing the spatial distribution of rain forest trees. We introduce a general criterion for variance decomposition for spatial Cox processes and apply it to specific Cox process models with additive or log linear random intensity functions. We moreover consider a new and flexible class of pair correlation function models given in terms of normal variance mixture covariance functions. The proposed methodology is applied to point pattern data sets of locations of tropical rain forest trees.

16.
Abstract

In this paper we introduce a continuous tree mixture model, a mixture of undirected graphical models with tree-structured graphs, which can be viewed as a nonparametric approach to multivariate analysis. We estimate its parameters, the component edge sets and the mixture proportions, through a regularized maximum likelihood procedure. Our new algorithm, which uses the expectation maximization algorithm and a modified version of Kruskal's algorithm, simultaneously estimates and prunes the mixture component trees. Simulation studies indicate this method performs better than the alternative Gaussian graphical mixture model. The proposed method is also applied to a water-level data set and is compared with the results of a Gaussian mixture model.
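The single-tree building block is the classical Chow–Liu construction: a maximum-weight spanning tree over pairwise mutual information, computable with a Kruskal-type MST routine. The sketch below uses Gaussian mutual information -0.5 log(1 - rho^2); the EM loop over mixture components and the pruning step are not shown.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_edges(X):
    """Maximum-weight spanning tree over pairwise Gaussian mutual
    information MI_ij = -0.5 * log(1 - rho_ij^2), found with a
    Kruskal-type MST routine on inverted weights."""
    rho = np.corrcoef(X, rowvar=False)
    mi = -0.5 * np.log1p(-np.clip(rho**2, 0.0, 1.0 - 1e-12))
    np.fill_diagonal(mi, 0.0)
    w = mi.max() + 1e-6 - mi        # positive; minimizing w maximizes MI
    np.fill_diagonal(w, 0.0)        # zero entries mean "no edge"
    mst = minimum_spanning_tree(w)
    return [tuple(map(int, e)) for e in zip(*mst.nonzero())]

rng = np.random.default_rng(6)
n = 400
x0 = rng.normal(size=n)
X = np.column_stack([x0,
                     x0 + 0.3 * rng.normal(size=n),   # child of column 0
                     x0 + 0.3 * rng.normal(size=n),   # child of column 0
                     rng.normal(size=n)])             # unrelated variable
print(chow_liu_edges(X))    # expect edges attaching columns 1, 2 to column 0
```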

17.
This paper proposes a probabilistic frontier regression model for binary-type output data in a production process setup. We consider one of the two categories of outputs as the ‘selected’ category, and the reduction in the probability of falling in this category is attributed to the reduction in technical efficiency (TE) of the decision-making unit. An efficiency measure is proposed to determine the deviations of individual units from the probabilistic frontier. Simulation results show that the average estimated TE component is close to its true value. An application of the proposed method to data on the Indian public sector banking system is provided, where the output variable indicates the level of non-performing assets. Individual TE is obtained for each of the banks under consideration. Among the public sector banks, Andhra Bank is found to be the most efficient, whereas the United Bank of India is the least efficient.

18.
We propose a new generalized autoregressive conditional heteroscedastic (GARCH) model with tree-structured multiple thresholds for the estimation of volatility in financial time series. The approach relies on the idea of a binary tree where every terminal node parameterizes a (local) GARCH model for a partition cell of the predictor space. The fitting of such trees is constructed within the likelihood framework for non-Gaussian observations: it is very different from the well-known regression tree procedure which is based on residual sums of squares. Our strategy includes the classical GARCH model as a special case and allows us to increase model complexity in a systematic and flexible way. We derive a consistency result and conclude from simulation and real data analysis that the new method has better predictive potential than other approaches.
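The classical GARCH(1,1) special case, which such a tree fits locally in every partition cell, can be estimated by Gaussian quasi-maximum likelihood in a few lines. The sketch below simulates a series and recovers its parameters; the tree-structured thresholding of the predictor space is not implemented.

```python
import numpy as np
from scipy.optimize import minimize

def garch11_negloglik(params, r):
    """Negative Gaussian quasi log-likelihood of a GARCH(1,1):
    sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}."""
    omega, alpha, beta = params
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return np.inf                        # positivity / stationarity
    s2 = np.empty_like(r)
    s2[0] = r.var()
    for t in range(1, len(r)):
        s2[t] = omega + alpha * r[t - 1] ** 2 + beta * s2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi * s2) + r**2 / s2)

# simulate a GARCH(1,1) series, then recover the parameters by QML
rng = np.random.default_rng(7)
T, omega, alpha, beta = 3000, 0.1, 0.1, 0.8
r = np.empty(T)
s2 = omega / (1 - alpha - beta)
for t in range(T):
    r[t] = np.sqrt(s2) * rng.normal()
    s2 = omega + alpha * r[t] ** 2 + beta * s2
fit = minimize(garch11_negloglik, x0=np.array([0.05, 0.05, 0.9]),
               args=(r,), method="Nelder-Mead")
print(fit.x.round(3))                        # close to (0.1, 0.1, 0.8)
```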

19.
Tree-based methods are frequently used in studies with censored survival time. Their structure and ease of interpretability make them useful to identify prognostic factors and to predict conditional survival probabilities given an individual's covariates. The existing methods are tailor-made to deal with a survival time variable that is measured continuously. However, survival variables measured on a discrete scale are often encountered in practice. The authors propose a new tree construction method specifically adapted to such discrete-time survival variables. The splitting procedure can be seen as an extension, to the case of right-censored data, of the entropy criterion for a categorical outcome. The selection of the final tree is made through a pruning algorithm combined with a bootstrap correction. The authors also present a simple way of potentially improving the predictive performance of a single tree through bagging. A simulation study shows that single trees and bagged trees perform well compared to a parametric model. A real data example investigating the usefulness of personality dimensions in predicting early onset of cigarette smoking is presented. The Canadian Journal of Statistics 37: 17–32; 2009.

20.
We consider the fitting of a Bayesian model to grouped data in which observations are assumed normally distributed around group means that are themselves normally distributed, and consider several alternatives for accommodating the possibility of heteroscedasticity within the data. We consider the case where the underlying distribution of the variances is unknown, and investigate several candidate prior distributions for those variances. In each case, the parameters of the candidate priors (the hyperparameters) are themselves given uninformative priors (hyperpriors). The most mathematically convenient model for the group variances is to assign them inverse gamma distributed priors, the inverse gamma distribution being the conjugate prior distribution for the unknown variance of a normal population. We demonstrate that for a wide class of underlying distributions of the group variances, a model that assigns the variances an inverse gamma-distributed prior displays favorable goodness-of-fit properties relative to other candidate priors, and hence may be used as a standard for modeling such data. This allows us to take advantage of the elegant mathematical property of prior conjugacy in a wide variety of contexts without compromising model fitness. We test our findings on nine real-world publicly available datasets from different domains, and on a wide range of artificially generated datasets.
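With the inverse gamma prior, the full conditional of each group variance is again inverse gamma, so a Gibbs sampler is immediate. The sketch below is a minimal sampler for a simplified version of the model, with the hyperparameters a, b and the top-level mean and spread mu, tau held fixed rather than given hyperpriors as in the paper.

```python
import numpy as np

rng = np.random.default_rng(8)

# grouped data: J groups, normally distributed means, heteroscedastic noise
J, n_j, mu, tau = 8, 30, 0.0, 1.0
true_theta = rng.normal(mu, tau, J)
true_sig2 = rng.gamma(2.0, 0.5, J)            # unknown group variances
y = [rng.normal(true_theta[j], np.sqrt(true_sig2[j]), n_j) for j in range(J)]

a, b = 2.0, 1.0                                # fixed IG(a, b) hyperparameters
sums = np.array([yj.sum() for yj in y])
theta = sums / n_j
sig2 = np.array([yj.var() for yj in y])
draws = []
for it in range(3000):
    # theta_j | rest: conjugate normal update
    prec = 1.0 / tau**2 + n_j / sig2
    theta = rng.normal((mu / tau**2 + sums / sig2) / prec, np.sqrt(1.0 / prec))
    # sigma_j^2 | rest: inverse gamma IG(a + n_j/2, b + SS_j/2), drawn as
    # the reciprocal of a gamma variate (conjugacy in action)
    ss = np.array([np.sum((yj - theta[j])**2) for j, yj in enumerate(y)])
    sig2 = 1.0 / rng.gamma(a + n_j / 2, 1.0 / (b + ss / 2), J)
    if it >= 1000:
        draws.append(sig2)
print(np.mean(draws, axis=0).round(2))         # posterior means of variances
print(true_sig2.round(2))                      # simulated truth
```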
