Bayesian Additive Regression Trees (BART) is a statistical sum of trees model. It can be considered a Bayesian version of machine learning tree ensemble methods where the individual trees are the base learners. However, for datasets where the number of variables p is large the algorithm can become inefficient and computationally expensive. Another method which is popular for high-dimensional data is random forests, a machine learning algorithm which grows trees using a greedy search for the best split points. However, its default implementation does not produce probabilistic estimates or predictions. We propose an alternative fitting algorithm for BART called BART-BMA, which uses Bayesian model averaging and a greedy search algorithm to obtain a posterior distribution more efficiently than BART for datasets with large p. BART-BMA incorporates elements of both BART and random forests to offer a model-based algorithm which can deal with high-dimensional data. We have found that BART-BMA can be run in a reasonable time on a standard laptop for the “small n large p” scenario which is common in many areas of bioinformatics. We showcase this method using simulated data and data from two real proteomic experiments, one to distinguish between patients with cardiovascular disease and controls and another to classify aggressive from non-aggressive prostate cancer. We compare our results to their main competitors. Open source code written in R and Rcpp to run BART-BMA can be found at: https://github.com/BelindaHernandez/BART-BMA.git.

Let \(\mathbf {X} = (X_1,\ldots ,X_p)\) be a stochastic vector having joint density function \(f_{\mathbf {X}}(\mathbf {x})\) with partitions \(\mathbf {X}_1 = (X_1,\ldots ,X_k)\) and \(\mathbf {X}_2 = (X_{k+1},\ldots ,X_p)\). A new method for estimating the conditional density function of \(\mathbf {X}_1\) given \(\mathbf {X}_2\) is presented. It is based on locally Gaussian approximations, but simplified in order to tackle the curse of dimensionality in multivariate applications, where both response and explanatory variables can be vectors. We compare our method to some available competitors, and the error of approximation is shown to be small in a series of examples using real and simulated data, and the estimator is shown to be particularly robust against noise caused by independent variables. We also present examples of practical applications of our conditional density estimator in the analysis of time series. Typical values for k in our examples are 1 and 2, and we include simulation experiments with values of p up to 6. Large sample theory is established under a strong mixing condition.

In this work, the problem of transformation and simultaneous variable selection is thoroughly treated via objective Bayesian approaches by the use of default Bayes factor variants. Four uniparametric families of transformations (Box–Cox, Modulus, Yeo-Johnson and Dual), denoted by T, are evaluated and compared. The subjective prior elicitation for the transformation parameter \(\lambda _T\), for each T, is not a straightforward task. Additionally, little prior information for \(\lambda _T\) is expected to be available, and therefore, an objective method is required. The intrinsic Bayes factors and the fractional Bayes factors allow us to incorporate default improper priors for \(\lambda _T\). We study the behaviour of each approach using a simulated reference example as well as two real-life examples.
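
The four transformation families named in this abstract can be sketched as follows. The abstract does not state the formulas, so the code uses the standard textbook forms of each family; the paper's exact parameterization may differ, and the function names are illustrative only.

```python
import math


def box_cox(y, lam):
    """Box-Cox transformation; defined for y > 0."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam


def modulus(y, lam):
    """Modulus transformation; defined for all real y (sign-preserving)."""
    sign = math.copysign(1.0, y)
    shifted = abs(y) + 1.0
    if lam == 0:
        return sign * math.log(shifted)
    return sign * (shifted ** lam - 1.0) / lam


def yeo_johnson(y, lam):
    """Yeo-Johnson transformation; defined for all real y."""
    if y >= 0:
        return math.log1p(y) if lam == 0 else ((y + 1.0) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log1p(-y)
    return -((1.0 - y) ** (2.0 - lam) - 1.0) / (2.0 - lam)


def dual(y, lam):
    """Dual power transformation; defined for y > 0."""
    if y <= 0:
        raise ValueError("Dual transformation requires y > 0")
    return math.log(y) if lam == 0 else (y ** lam - y ** (-lam)) / (2.0 * lam)
```

Each family reduces to a logarithmic form at its boundary value of the parameter (lam = 0, or lam = 2 for the negative branch of Yeo-Johnson), which is why those cases are handled separately.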

Let \(\{X_n, n\geq 1\}\) be a sequence of independent and identically distributed non-degenerate random variables with common cumulative distribution function F. Suppose \(X_1\) is concentrated on \(0, 1, \ldots , N\) with \(N \le \infty \) and \(P(X_1 = 1) > 0\). Let \(X_{U_w(n)}\) be the n-th upper weak record value. In this paper we show that for any fixed \(m \ge 2\), \(X_1\) has a geometric distribution if and only if \(X_{U_{w}(m)}\mathop{=}\limits^{d} X_1+\cdots +X_m\), where \(\mathop{=}\limits^{d}\) denotes equality in distribution. Our result is a generalization of the case m = 2 obtained by Ahsanullah (J Stat Theory Appl 8(1):5–16, 2009).

This article deals with random projections applied as a data reduction technique for Bayesian regression analysis. We show sufficient conditions under which the entire d-dimensional distribution is approximately preserved under random projections by reducing the number of data points from n to \(k\in O({\text {poly}}(d/\varepsilon ))\) in the case \(n\gg d\). Under mild assumptions, we prove that evaluating a Gaussian likelihood function based on the projected data instead of the original data yields a \((1+O(\varepsilon ))\)-approximation in terms of the \(\ell _2\) Wasserstein distance. Our main result shows that the posterior distribution of Bayesian linear regression is approximated up to a small error depending on only an \(\varepsilon \)-fraction of its defining parameters. This holds when using arbitrary Gaussian priors or the degenerate case of uniform distributions over \(\mathbb {R}^d\) for \(\beta \). Our empirical evaluations involve different simulated settings of Bayesian linear regression. Our experiments underline that the proposed method is able to recover the regression model up to small error while considerably reducing the total running time.
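
A minimal sketch of the data-reduction idea, under stated assumptions: a dense Gaussian sketching matrix stands in for whichever random projection the article actually uses, the model has d = 2 coefficients, and the prior on \(\beta \) is flat, so the posterior mean coincides with the least-squares solution. All function names are hypothetical.

```python
import math
import random


def ols_2d(rows):
    """Least-squares coefficients (b0, b1) for y ~ b0*x0 + b1*x1.

    Each row is (x0, x1, y). Under a Gaussian likelihood and a flat prior
    on the coefficients this is also the posterior mean.
    """
    s00 = sum(r[0] * r[0] for r in rows)
    s01 = sum(r[0] * r[1] for r in rows)
    s11 = sum(r[1] * r[1] for r in rows)
    t0 = sum(r[0] * r[2] for r in rows)
    t1 = sum(r[1] * r[2] for r in rows)
    det = s00 * s11 - s01 * s01
    return ((s11 * t0 - s01 * t1) / det, (s00 * t1 - s01 * t0) / det)


def gaussian_sketch(rows, k, rng):
    """Reduce n rows to k rows by computing S @ [X, y].

    S is a k-by-n matrix with i.i.d. N(0, 1/k) entries (an assumption; it is
    generated entry by entry and never stored).
    """
    width = len(rows[0])
    scale = 1.0 / math.sqrt(k)
    sketched = []
    for _ in range(k):
        acc = [0.0] * width
        for r in rows:
            w = rng.gauss(0.0, scale)
            for c in range(width):
                acc[c] += w * r[c]
        sketched.append(acc)
    return sketched


def simulate_rows(n, beta, noise_sd, rng):
    """Draw n rows (x0, x1, y) from a linear model with Gaussian noise."""
    rows = []
    for _ in range(n):
        x0, x1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y = beta[0] * x0 + beta[1] * x1 + rng.gauss(0.0, noise_sd)
        rows.append((x0, x1, y))
    return rows


if __name__ == "__main__":
    rng = random.Random(1)
    data = simulate_rows(1000, (2.0, -1.0), 0.5, rng)
    print("full data   :", ols_2d(data))
    print("sketched 10x:", ols_2d(gaussian_sketch(data, 100, rng)))
```

Fitting on the 100 sketched rows instead of the 1000 original rows should give coefficients close to the full-data fit, which is the behaviour the abstract describes for the posterior as a whole.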

The aim of this paper is to study the asymptotic properties of a class of kernel conditional mode estimates whenever functional stationary ergodic data are considered. More precisely, in the ergodic data setting, we consider a random element \((X, Z)\) taking values in some semi-metric abstract space \(E\times F\). For a real function \(\varphi \) defined on the space F and \(x\in E\), we consider the conditional mode of the real random variable \(\varphi (Z)\) given the event “\(X=x\)”. While estimating the conditional mode function, say \(\theta _\varphi (x)\), using the well-known kernel estimator, we establish the strong consistency with rate of this estimate uniformly over Vapnik–Chervonenkis classes of functions \(\varphi \). Notice that the ergodic setting offers a more general framework than the usual mixing structure. Two applications to energy data are provided to illustrate the proposed approach in a time series forecasting framework. The first consists in forecasting the daily peak of electricity demand in France (measured in Giga-Watt). The second deals with the short-term forecasting of the electrical energy (measured in Giga-Watt per Hour) that may be consumed over some time intervals that cover the peak demand.

We develop a new robust stopping criterion for partial least squares regression (PLSR) component construction, characterized by a high level of stability. This new criterion is universal since it is suitable both for PLSR and extensions to generalized linear regression (PLSGLR). The criterion is based on a non-parametric bootstrap technique and must be computed algorithmically. It allows the testing of each successive component at a preset significance level \(\alpha \). In order to assess its performance and robustness with respect to various noise levels, we perform dataset simulations in which there is a preset and known number of components. These simulations are carried out for datasets characterized both by \(n>p\), with n the number of subjects and p the number of covariates, as well as for \(n<p\). We then use t-tests to compare the predictive performance of our approach with other common criteria. The stability property is in particular tested through re-sampling processes on a real allelotyping dataset. An important additional conclusion is that this new criterion gives globally better predictive performances than existing ones in both the PLSR and PLSGLR (logistic and Poisson) frameworks.

In this paper, we consider the problem of hypotheses testing about the drift parameter \(\theta \) in the process \(\text {d}Y^{\delta }_{t} = \theta \dot{f}(t)Y^{\delta }_{t}\text {d}t + b(t)\text {d}L^{\delta }_{t}\) driven by symmetric \(\delta \)-stable Lévy process \(L^{\delta }_{t}\) with \(\dot{f}(t)\) being the derivative of a known increasing function f(t) and b(t) being known as well. We consider the hypotheses testing \(H_{0}: \theta \le 0\) and \(K_{0}: \theta =0\) against the alternatives \(H_{1}: \theta >0\) and \(K_{1}: \theta \ne 0\), respectively. For these hypotheses, we propose inverse methods, which are motivated by sequential approach, based on the first hitting time of the observed process (or its absolute value) to a pre-specified boundary or two boundaries until some given time. The applicability of these methods is illustrated. For the case \(Y^{\delta }_{0}=0\), we are able to calculate the values of boundaries and finite observed times more directly. We are able to show the consistencies of proposed tests for \(Y^{\delta }_{0}\ge 0\) with \(\delta \in (1,2]\) and for \(Y^{\delta }_{0}=0\) with \(\delta \in (0,2]\) under quite mild conditions.

In this paper we consider an acceptance-rejection (AR) sampler based on deterministic driver sequences. We prove that the discrepancy of an N element sample set generated in this way is bounded by \(\mathcal {O} (N^{-2/3}\log N)\), provided that the target density is twice continuously differentiable with non-vanishing curvature and the AR sampler uses the driver sequence \(\mathcal {K}_M= \{( j \alpha , j \beta ) \bmod 1 \mid j = 1,\ldots ,M\},\) where \(\alpha ,\beta \) are real algebraic numbers such that \(1,\alpha ,\beta \) is a basis of a number field over \(\mathbb {Q}\) of degree 3. For the driver sequence \(\mathcal {F}_k= \{ ({j}/{F_k}, \{{jF_{k-1}}/{F_k}\} ) \mid j=1,\ldots , F_k\},\) where \(F_k\) is the k-th Fibonacci number and \(\{x\}=x-\lfloor x \rfloor \) is the fractional part of a non-negative real number x, we can remove the \(\log \) factor to improve the convergence rate to \(\mathcal {O}(N^{-2/3})\), where again N is the number of samples we accepted. We also introduce a criterion for measuring the goodness of driver sequences. The proposed approach is numerically tested by calculating the star-discrepancy of samples generated for some target densities using \(\mathcal {K}_M\) and \(\mathcal {F}_k\) as driver sequences. These results confirm that achieving a convergence rate beyond \(N^{-1/2}\) is possible in practice using \(\mathcal {K}_M\) and \(\mathcal {F}_k\) as driver sequences in the acceptance-rejection sampler.
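
The Fibonacci driver sequence \(\mathcal {F}_k\) defined in this abstract can be plugged into a plain acceptance-rejection step as in the sketch below. The one-dimensional unnormalised target \(\psi (x) = 1 - x^2/2\) on [0, 1], the bound 1, and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
def fibonacci_pair(k):
    """Return (F_{k-1}, F_k) with F_1 = F_2 = 1, for k >= 2."""
    prev, curr = 1, 1  # F_1, F_2
    for _ in range(k - 2):
        prev, curr = curr, prev + curr
    return prev, curr


def fibonacci_driver(k):
    """Points (j / F_k, frac(j * F_{k-1} / F_k)) for j = 1, ..., F_k."""
    f_prev, f_k = fibonacci_pair(k)
    # Integer arithmetic keeps the fractional part exact before dividing.
    return [(j / f_k, ((j * f_prev) % f_k) / f_k) for j in range(1, f_k + 1)]


def accept_reject(driver, psi, bound):
    """Keep x whenever the driver point (x, u) lies under the graph of psi.

    psi is an unnormalised target density on [0, 1] and bound >= sup psi.
    """
    return [x for x, u in driver if u * bound <= psi(x)]


def target(x):
    """Illustrative unnormalised target on [0, 1] (an assumption)."""
    return 1.0 - x * x / 2.0


if __name__ == "__main__":
    driver = fibonacci_driver(16)  # F_16 = 987 driver points
    sample = accept_reject(driver, target, 1.0)
    print("accepted:", len(sample), "of", len(driver))
    print("sample mean:", sum(sample) / len(sample))
```

For this target the exact acceptance probability is 5/6 and the exact mean of the normalised density is 0.45, so the printed values give a quick check of how well the deterministic driver fills the region under the curve.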

A typical problem in optimal design theory is finding an experimental design that is optimal with respect to some criteria in a class of designs. The most popular criteria include the A- and D-criteria. Regular graph designs occur in many optimality results, and if the number of blocks is large enough, an A-optimal (or D-optimal) design is among them (if any exist). To explore the landscape of designs with a large number of blocks, we introduce extensions of regular graph designs. These are constructed by adding the blocks of a balanced incomplete block design repeatedly to the original design. We present the results of an exact computer search for the best regular graph designs and the best extended regular graph designs with up to 20 treatments v, block size \(k \le 10\) and replication r \(\le 10\) and \(r(k-1)-(v-1)\lfloor r(k-1)/(v-1)\rfloor \le 9\).

This paper discusses the contribution of Cerioli et al. (Stat Methods Appl, 2018), where robust monitoring based on high breakdown point estimators is proposed for multivariate data. The results follow years of development in robust diagnostic techniques. We discuss the issues of extending data monitoring to other models with complex structure, e.g. factor analysis, mixed linear models for which S and MM-estimators exist or deviating data cells. We emphasise the importance of robust testing that is often overlooked despite robust tests being readily available once S and MM-estimators have been defined. We mention open questions like out-of-sample inference or big data issues that would benefit from monitoring.

The r largest order statistics approach is widely used in extreme value analysis because it may use more information from the data than just the block maxima. In practice, the choice of r is critical. If r is too large, bias can occur; if too small, the variance of the estimator can be high. The limiting distribution of the r largest order statistics, denoted by GEV\(_r\), extends that of the block maxima. Two specification tests are proposed to select r sequentially. The first is a score test for the GEV\(_r\) distribution. Due to the special characteristics of the GEV\(_r\) distribution, the classical chi-square asymptotics cannot be used. The simplest approach is to use the parametric bootstrap, which is straightforward to implement but computationally expensive. An alternative fast weighted bootstrap or multiplier procedure is developed for computational efficiency. The second test uses the difference in estimated entropy between the GEV\(_r\) and GEV\(_{r-1}\) models, applied to the r largest order statistics and the \(r-1\) largest order statistics, respectively. The asymptotic distribution of the difference statistic is derived. In a large scale simulation study, both tests held their size and had substantial power to detect various misspecification schemes. A new approach to address the issue of multiple, sequential hypotheses testing is adapted to this setting to control the false discovery rate or familywise error rate. The utility of the procedures is demonstrated with extreme sea level and precipitation data.

This paper addresses the issue of estimating the expectation of a real-valued random variable of the form \(X = g(\mathbf {U})\) where g is a deterministic function and \(\mathbf {U}\) can be a random finite- or infinite-dimensional vector. Using recent results on rare event simulation, we propose a unified framework for dealing with both probability and mean estimation for such random variables, i.e. linking algorithms such as Tootsie Pop Algorithm or Last Particle Algorithm with nested sampling. In particular, it extends nested sampling as follows: first, the random variable X does not need to be bounded any more: it gives the principle of an ideal estimator with an infinite number of terms that is unbiased and always better than a classical Monte Carlo estimator; in particular, it has a finite variance as soon as there exists \(k \in \mathbb {R}> 1\) such that \({\text {E}}\left[ X^k \right] < \infty \). Moreover we address the issue of nested sampling termination and show that a random truncation of the sum can preserve unbiasedness while increasing the variance only by a factor up to 2 compared to the ideal case. We also build an unbiased estimator with fixed computational budget which supports a Central Limit Theorem and discuss parallel implementation of nested sampling, which can dramatically reduce its running time. Finally we extensively study the case where X is heavy-tailed.

This article presents procedures for hypothesis testing and interval estimation of the common mean vector in MANOVA models when the covariance matrices are unknown and unequal. The methods are based on the concepts of generalized p-value and generalized confidence interval. Some important statistical properties of the exact test and confidence region are given. For two multivariate normal populations, a minor modification to the combined tests given by Zhou and Mathew (J. Multivariate Anal. 51:265–276, 1994a) is proposed. Some simulation results to compare the performance of the proposed tests with others are reported. The simulation results indicate that the new tests have a significant gain in power.

In this paper we design a sure independent ranking and screening procedure for censored regression (cSIRS, for short) with ultrahigh dimensional covariates. The inverse probability weighted cSIRS procedure is model-free in the sense that it does not specify a parametric or semiparametric regression function between the response variable and the covariates. Thus, it is robust to model mis-specification. This model-free property is very appealing in ultrahigh dimensional data analysis, particularly when there is a lack of information about the underlying regression structure. The cSIRS procedure is also robust in the presence of outliers or extreme values as it merely uses the rank of the censored response variable. We establish both the sure screening and the ranking consistency properties for the cSIRS procedure when the number of covariates p satisfies \(p=o\{\exp (an)\}\), where a is a positive constant and n is the available sample size. The advantages of cSIRS over existing competitors are demonstrated through comprehensive simulations and an application to the diffuse large-B-cell lymphoma data set.

Bayesian inference is considered for the precision matrix of the multivariate regression model with distribution of the random responses belonging to the multivariate scale mixtures of normal distributions. The posterior distribution and some identities involving expectations taken with respect to this posterior distribution are derived when the prior distribution of the parameters is from the conjugate family. The results are specialized to the case where the random responses have a matrix-t distribution, thus generalizing the results of Zellner (J. Amer. Statist. Assoc. 71:400–405, 1976) and Muirhead (Metrika 33:247–251, 1986).


Micheas and Dey (Sankhyā 65:158–178, 2003) reconciled classical and Bayesian p-values in the one-sided location parameter testing problem. In this article, the classical p-value is reconciled with the prior predictive p-value, for the two-sided location parameter testing problem, proving that the classical p-value coincides with the infimum of prior predictive p-values when the prior ranges in different classes of priors.

This article studies some properties of a mixture periodically correlated n-variate vector autoregressive (MPVAR) time series model, which extends the mixture time invariant parameter n-vector autoregressive (MVAR) model recently studied by Fong et al. (The Canadian Journal of Statistics 35:135–150, 2007). Our main contributions are, first, the second moment periodically stationary condition for an n-variate MPVARS(n; K; 2, …, 2) model, together with the closed form of the second moment, and second, the estimation, via the Expectation-Maximization (EM) algorithm, of the coefficient matrices and the error variance matrix.

The seminal work of Stein (Proc. Third Berkeley Symp. Mathemat. Statist. Probab. 1:197–206, 1956) showed that the maximum likelihood estimator (MLE) of the mean vector of a p-dimensional multivariate normal distribution is inadmissible under the squared error loss function when p ≥ 3 and proposed the Stein estimator that dominates the MLE. Later, James and Stein (Proc. Fourth Berkeley Symp. Mathemat. Statist. Probab. 1:361–379, 1961) proposed the James-Stein estimator for the same problem, which has received much more attention than the original Stein estimator. We re-examined the Stein estimator and conducted an analytic comparison with the James-Stein estimator. We found that the Stein estimator outperforms the James-Stein estimator under certain scenarios and derived the sufficient conditions.
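
The dominance result mentioned in this abstract can be checked numerically with a short simulation. The sketch below implements only the classical James-Stein estimator \((1 - (p-2)/\Vert x\Vert ^2)\,x\) for a single observation with identity covariance and compares its Monte Carlo risk with that of the MLE; the original 1956 Stein estimator is not implemented here, and the helper names, seed and replication count are assumptions.

```python
import random


def james_stein(x):
    """Classical James-Stein estimate from one draw x ~ N_p(theta, I), p >= 3."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    shrink = 1.0 - (p - 2) / norm_sq
    return [shrink * v for v in x]


def squared_error(estimate, theta):
    """Squared error loss between an estimate and the true mean vector."""
    return sum((e - t) ** 2 for e, t in zip(estimate, theta))


def compare_risks(theta, reps=2000, seed=0):
    """Monte Carlo risks (average squared error) of the MLE and James-Stein."""
    rng = random.Random(seed)
    mle_total = 0.0
    js_total = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]
        mle_total += squared_error(x, theta)  # the MLE is x itself
        js_total += squared_error(james_stein(x), theta)
    return mle_total / reps, js_total / reps


if __name__ == "__main__":
    mle_risk, js_risk = compare_risks([0.5] * 10)
    print("MLE risk:", mle_risk)
    print("James-Stein risk:", js_risk)
```

The MLE risk equals p for every mean vector, so with p = 10 the first number should be close to 10, while the James-Stein risk should be visibly smaller when the true mean is near the origin.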

Parametric model-based regression imputation is commonly applied to missing-data problems, but is sensitive to misspecification of the imputation model. Little and An (Statistica Sinica 14:949–968, 2004) proposed a semiparametric approach called penalized spline propensity prediction (PSPP), where the variable with missing values is modeled by a penalized spline (P-Spline) of the response propensity score, which is the logit of the estimated probability of being missing given the observed variables. Variables other than the response propensity are included parametrically in the imputation model. However, they only considered point estimation based on single imputation with PSPP. We consider here three approaches to standard error estimation incorporating the uncertainty due to nonresponse: (a) standard errors based on the asymptotic variance of the PSPP estimator, ignoring sampling error in estimating the response propensity; (b) standard errors based on the bootstrap method; and (c) multiple imputation-based standard errors using draws from the joint posterior predictive distribution of missing values under the PSPP model. Simulation studies suggest that the bootstrap and multiple imputation approaches yield good inferences under a range of simulation conditions, with multiple imputation showing some evidence of closer to nominal confidence interval coverage when the sample size is small.
