Similar articles
Found 20 similar articles (search time: 117 ms)
1.
In data sets with many predictors, algorithms for identifying a good subset of predictors are often used. Most such algorithms do not allow for any relationships between predictors. For example, stepwise regression might select a model containing an interaction AB but neither main effect A nor B. This paper develops mathematical representations of this and other relations between predictors, which may then be incorporated in a model selection procedure. A Bayesian approach that goes beyond the standard independence prior for variable selection is adopted, and preference for certain models is interpreted as prior information. Priors relevant to arbitrary interactions and polynomials, dummy variables for categorical factors, competing predictors, and restrictions on the size of the models are developed. Since the relations developed are for priors, they may be incorporated in any Bayesian variable selection algorithm for any type of linear model. The application of the methods here is illustrated via the stochastic search variable selection algorithm of George and McCulloch (1993), which is modified to utilize the new priors. The performance of the approach is illustrated with two constructed examples and a computer performance dataset.
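
The relation between an interaction and its main effects can be illustrated with a hard 0/1 rule. The sketch below is not the prior developed in the paper; it assumes a hypothetical encoding in which terms are strings and an interaction is written with colons (for example `"A:B"`), and it only checks that every interaction is accompanied by all of its main effects.

```python
def respects_strong_heredity(model_terms):
    """Return True if every interaction term has all of its main effects in the model.

    Terms are strings; an interaction is written with ':' (hypothetical encoding),
    e.g. "A:B" is the interaction of main effects "A" and "B".
    """
    terms = set(model_terms)
    for term in terms:
        parents = term.split(":")
        # A main effect has a single component and needs no parents.
        if len(parents) > 1 and not all(parent in terms for parent in parents):
            return False
    return True


# The stepwise-regression outcome mentioned above: AB without A or B.
print(respects_strong_heredity(["A:B"]))            # False
print(respects_strong_heredity(["A", "B", "A:B"]))  # True
```

A prior-based version would replace the True/False answer with a model probability that is small, rather than zero, when a parent term is missing.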

2.
The authors consider the problem of simultaneous transformation and variable selection for linear regression. They propose a fully Bayesian solution to the problem, which allows averaging over all models considered including transformations of the response and predictors. The authors use the Box‐Cox family of transformations to transform the response and each predictor. To deal with the change of scale induced by the transformations, the authors propose to focus on new quantities rather than the estimated regression coefficients. These quantities, referred to as generalized regression coefficients, have a similar interpretation to the usual regression coefficients on the original scale of the data, but do not depend on the transformations. This allows probabilistic statements about the size of the effect associated with each variable, on the original scale of the data. In addition to variable and transformation selection, there is also uncertainty involved in the identification of outliers in regression. Thus, the authors also propose a more robust model to account for such outliers based on a t‐distribution with unknown degrees of freedom. Parameter estimation is carried out using an efficient Markov chain Monte Carlo algorithm, which permits moves around the space of all possible models. Using three real data sets and a simulation study, the authors show that there is considerable uncertainty about variable selection, choice of transformation, and outlier identification, and that there is an advantage in dealing with all three simultaneously. The Canadian Journal of Statistics 37: 361–380; 2009 © 2009 Statistical Society of Canada
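
For readers unfamiliar with the Box-Cox family mentioned above, here is a minimal sketch of the standard one-parameter power transformation for a single positive value. The function name `box_cox` is chosen for illustration, and nothing here reproduces the authors' Bayesian procedure.

```python
import math


def box_cox(y, lam):
    """Standard one-parameter Box-Cox transform of a positive value y.

    (y**lam - 1) / lam for lam != 0, and log(y) in the limiting case lam == 0.
    """
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


# lam = 1 only shifts the data, lam = 0.5 is a rescaled square root, lam = 0 is the log.
for lam in (1.0, 0.5, 0.0):
    print(lam, box_cox(4.0, lam))
```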

3.
Simple nonparametric estimates of the conditional distribution of a response variable given a covariate are often useful for data exploration purposes or to help with the specification or validation of a parametric or semi-parametric regression model. In this paper we propose such an estimator in the case where the response variable is interval-censored and the covariate is continuous. Our approach consists in adding weights that depend on the covariate value in the self-consistency equation proposed by Turnbull (J R Stat Soc Ser B 38:290–295, 1976), which results in an estimator that is no more difficult to implement than Turnbull’s estimator itself. We show the convergence of our algorithm and that our estimator reduces to the generalized Kaplan–Meier estimator (Beran, Nonparametric regression with randomly censored survival data, 1981) when the data are either complete or right-censored. We demonstrate by simulation that the estimator, bootstrap variance estimation and bandwidth selection (by rule of thumb or cross-validation) all perform well in finite samples. We illustrate the method by applying it to a dataset from a study on the incidence of HIV in a group of female sex workers from Kinshasa.

4.
Variable selection is an important issue in all regression analysis, and in this article, we investigate simultaneous variable selection in joint location and scale models of the skew-t-normal distribution when the dataset under consideration involves heavy-tailed and asymmetric outcomes. We propose a unified penalized likelihood method which can simultaneously select significant variables in the location and scale models. Furthermore, the proposed variable selection method can simultaneously perform parameter estimation and variable selection in the location and scale models. With appropriate selection of the tuning parameters, we establish the consistency and the oracle property of the regularized estimators. These estimators are compared by simulation studies.

5.
The results of analyzing experimental data using a parametric model may depend heavily on the chosen model for the regression and variance functions, and also on any preliminary transformation of the variables. In this paper we propose and discuss a complex procedure which consists in a simultaneous selection of parametric regression and variance models from a relatively rich model class and of Box-Cox variable transformations by minimization of a cross-validation criterion. For this it is essential to introduce modifications of the standard cross-validation criterion adapted to each of the following objectives: 1. estimation of the unknown regression function, 2. prediction of future values of the response variable, 3. calibration, or 4. estimation of some parameter with a certain meaning in the corresponding field of application. Our idea of a criterion-oriented combination of procedures (which, if applied at all, are usually applied independently or sequentially) is expected to lead to more accurate results. We show how the accuracy of the parameter estimators can be assessed by a “moment oriented bootstrap procedure”, which is an essential modification of the “wild bootstrap” of Härdle and Mammen by use of more accurate variance estimates. This new procedure and its refinement by a bootstrap based pivot (“double bootstrap”) are also used for the construction of confidence, prediction and calibration intervals. Programs written in Splus which realize our strategy for nonlinear regression modelling and parameter estimation are described as well. The performance of the selected model is discussed, and the behaviour of the procedures is illustrated, e.g., by an application in radioimmunological assay.

6.
Here we consider a multinomial probit regression model where the number of variables substantially exceeds the sample size and only a subset of the available variables is associated with the response. Thus selecting a small number of relevant variables for classification has received a great deal of attention. Generally when the number of variables is substantial, sparsity-enforcing priors for the regression coefficients are called for on grounds of predictive generalization and computational ease. In this paper, we propose a sparse Bayesian variable selection method in the multinomial probit regression model for multi-class classification. The performance of our proposed method is demonstrated with one simulated data set and three well-known gene expression profiling data sets: breast cancer data, leukemia data, and small round blue-cell tumors. The results show that, compared with other methods, our method is able to select the relevant variables and can obtain competitive classification accuracy with a small subset of relevant genes.

7.
This paper suggests a new type of mixture regression model, in which each mixture component is explained by its own regressors. Thus, the dependent variable can be driven by one of several unobservable explanatory mechanisms, each of which has its own distinct variables. An extension of the simulated annealing algorithm is introduced to fit this general mixture model. The paper also suggests a new technique for estimating the covariance matrix of estimators in a mixture model. Finally, empirical studies of a labour supply example show that our proposed model can perform much better than conventional logistic or mixture models.

8.
Clustering algorithms are important methods widely used in mining data streams because of their ability to deal with infinite data flows. Although these algorithms perform well at mining latent relationships in data streams, most of them suffer from loss of cluster purity and become unstable when the input data streams have too many noisy variables. In this article, we propose a clustering algorithm to cluster data streams with noisy variables. Simulation results show that our proposed method, which adds a variable selection step as a component of the clustering algorithm, performs better than methods from previous studies. The results of two experiments indicate that clustering data streams with the variable selection step is more stable and yields better purity than clustering without it. Another experiment on the KDD-CUP99 dataset also shows that our algorithm can generate more stable results.

9.
We propose a methodology to analyse data arising from a curve that, over its domain, switches among J states. We consider a sequence of response variables, where each response y depends on a covariate x according to an unobserved state z. The states form a stochastic process and their possible values are j = 1, …, J. If z equals j the expected response of y is one of J unknown smooth functions evaluated at x. We call this model a switching nonparametric regression model. We develop an Expectation–Maximisation algorithm to estimate the parameters of the latent state process and the functions corresponding to the J states. We also obtain standard errors for the parameter estimates of the state process. We conduct simulation studies to analyse the frequentist properties of our estimates. We also apply the proposed methodology to the well-known motorcycle dataset treating the data as coming from more than one simulated accident run with unobserved run labels.

10.
This paper considers the analysis of linear models where the response variable is a linear function of observable component variables. For example, scores on two or more psychometric measures (the component variables) might be weighted and summed to construct a single response variable in a psychological study. A linear model is then fit to the response variable. The question addressed in this paper is how to optimally transform the component variables so that the response is approximately normally distributed. The transformed component variables, themselves, need not be jointly normal. Two cases are considered; in both cases, the Box-Cox power family of transformations is employed. In Case I, the coefficients of the linear transformation are known constants. In Case II, the linear function is the first principal component based on the matrix of correlations among the transformed component variables. For each case, an algorithm is described for finding the transformation powers that minimize a generalized Anderson-Darling statistic. The proposed transformation procedure is compared to likelihood-based methods by means of simulation. The proposed method rarely performed worse than likelihood-based methods and for many data sets performed substantially better. As an illustration, the algorithm is applied to a problem from rural sociology and social psychology; namely scaling family residences along an urban-rural dimension.

11.
In this article we propose a new mixed-effects regression model for fractional bounded response variables. Our model allows us to incorporate covariates directly into the expected value, so we can quantify exactly the influence of these covariates on the mean of the variable of interest rather than on the conditional mean. Estimation is carried out from a Bayesian perspective. Due to the complexity of the augmented posterior distribution, we use a Hamiltonian Monte Carlo algorithm, the No-U-Turn sampler, implemented using the Stan software. A simulation study was performed showing that our model has a better performance than other traditional longitudinal models for bounded variables. Finally, we applied our beta-inflated mean mixed-effects regression model to real data which consists of utilization of credit lines in the Peruvian financial system.

12.
In real‐data analysis, deciding the best subset of variables in regression models is an important problem. Akaike's information criterion (AIC) is often used in order to select variables in many fields. When the sample size is not so large, the AIC has a non‐negligible bias that will detrimentally affect variable selection. The present paper considers a bias correction of AIC for selecting variables in the generalized linear model (GLM). The GLM can express a number of statistical models by changing the distribution and the link function, such as the normal linear regression model, the logistic regression model, and the probit model, which are currently commonly used in a number of applied fields. In the present study, we obtain a simple expression for a bias‐corrected AIC (corrected AIC, or CAIC) in GLMs. Furthermore, we provide an ‘R’ code based on our formula. A numerical study reveals that the CAIC has better performance than the AIC for variable selection.
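
As a reminder of the uncorrected criterion that the paper starts from, the following sketch computes the ordinary AIC, −2 log L + 2k, and picks the candidate model with the smallest value. The bias-corrected CAIC formula is not given in the abstract and is not reproduced here; the function names and the log-likelihood values are made up for illustration.

```python
def aic(log_likelihood, n_params):
    """Ordinary (uncorrected) AIC: -2 * maximized log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params


def select_by_aic(candidates):
    """Return the name of the candidate with the smallest AIC.

    candidates maps a model name to (maximized log-likelihood, number of parameters).
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))


# Hypothetical fitted models: the larger model fits slightly better but pays for two more parameters.
candidates = {"small": (-12.0, 2), "large": (-11.5, 4)}
print(select_by_aic(candidates))
```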

13.
This paper introduces an alternating conditional expectation (ACE) algorithm: a non-parametric approach for estimating the transformations that lead to the maximal multiple correlation of a response and a set of independent variables in regression and correlation analysis. These transformations can give the data analyst insight into the relationships between these variables so that these can be best described and non-linear relationships uncovered. Using the Bayesian information criterion (BIC), we show how to find the best closed-form approximations for the optimal ACE transformations. By means of ACE and BIC, the model fit can be considerably improved compared with the conventional linear model, as demonstrated in the two simulated and two real datasets in this paper.

14.

This paper is motivated by our collaborative research and the aim is to model clinical assessments of upper limb function after stroke using 3D-position and 4D-orientation movement data. We present a new nonlinear mixed-effects scalar-on-function regression model with a Gaussian process prior focusing on the variable selection from a large number of candidates including both scalar and function variables. A novel variable selection algorithm has been developed, namely functional least angle regression. As it is essential for this algorithm, we studied the representation of functional variables with different methods and the correlation between a scalar and a group of mixed scalar and functional variables. We also propose a new stopping rule for practical use. This algorithm is efficient and accurate for both variable selection and parameter estimation even when the number of functional variables is very large and the variables are correlated. Thus, the prediction provided by the algorithm is accurate. Our comprehensive simulation study showed that the method is superior to other existing variable selection methods. When the algorithm was applied to the analysis of the movement data, the use of the nonlinear random-effect model and the function variables significantly improved the prediction accuracy for the clinical assessment.


15.
With reference to a specific dataset, we consider how to perform a flexible non‐parametric Bayesian analysis of an inhomogeneous point pattern modelled by a Markov point process, with a location‐dependent first‐order term and pairwise interaction only. A priori we assume that the first‐order term is a shot noise process, and that the interaction function for a pair of points depends only on the distance between the two points and is a piecewise linear function modelled by a marked Poisson process. Simulation of the resulting posterior distribution using a Metropolis–Hastings algorithm in the ‘conventional’ way involves evaluating ratios of unknown normalizing constants. We avoid this problem by applying a recently introduced auxiliary variable technique. In the present setting, the auxiliary variable used is an example of a partially ordered Markov point process model.

16.
Variable selection over a potentially large set of covariates in a linear model is quite popular. In the Bayesian context, common prior choices can lead to a posterior expectation of the regression coefficients that is a sparse (or nearly sparse) vector with a few nonzero components, those covariates that are most important. This article extends the “global‐local” shrinkage idea to a scenario where one wishes to model multiple response variables simultaneously. Here, we have developed a variable selection method for a K‐outcome model (multivariate regression) that identifies the most important covariates across all outcomes. The prior for all regression coefficients is a mean‐zero normal with a coefficient‐specific variance term that consists of a predictor‐specific factor (shared local shrinkage parameter) and a model‐specific factor (global shrinkage term) that differs in each model. The performance of our modeling approach is evaluated through simulation studies and a data example.

17.
We provide a method for simultaneous variable selection and outlier identification using the mean-shift outlier model. The procedure consists of two steps: the first step is to identify potential outliers, and the second step is to perform all possible subset regressions for the mean-shift outlier model containing the potential outliers identified in step 1. This procedure is helpful for model selection while simultaneously considering outlier identification, and can be used to identify multiple outliers. In addition, we can evaluate the impact on the regression model of simultaneous omission of variables and interesting observations. In an example, we provide detailed output from the R system, and compare the results with those using posterior model probabilities as proposed by Hoeting et al. [Comput. Stat. Data Anal. 22 (1996), pp. 252-270] for simultaneous variable selection and outlier identification.

18.
A number of articles have discussed the way lower order polynomial and interaction terms should be handled in linear regression models. Only if all lower order terms are included in the model will the regression model be invariant with respect to coding transformations of the variables. If lower order terms are omitted, the regression model will not be well formulated. In this paper, we extend this work to examine the implications of the ordering of variables in the linear mixed-effects model. We demonstrate how linear transformations of the variables affect the model and tests of significance of fixed effects in the model. We show how the transformations modify the random effects in the model, as well as their covariance matrix and the value of the restricted log-likelihood. We suggest a variable selection strategy for the linear mixed-effects model.

19.
We describe the use of perfect sampling algorithms for Bayesian variable selection in a linear regression model. Starting with a basic case solved by Huang and Djurić (EURASIP J. Appl. Si. Pr. 1 (2002) 38), where the model coefficients and noise variance are assumed to be known, we generalize the model step by step to allow for other sources of randomness. We specify perfect simulation algorithms that solve these cases by incorporating various techniques including Gibbs sampling, the perfect independent Metropolis–Hastings (IMH) algorithm, and recently developed “slice coupling” algorithms. Applications to simulated data sets suggest that our algorithms perform well in identifying relevant predictor variables.

20.
In this article, we consider the problem of variable selection in linear regression when multicollinearity is present in the data. It is well known that in the presence of multicollinearity, the performance of the least squares (LS) estimator of the regression parameters is not satisfactory. Consequently, subset selection methods, such as Mallows's Cp, which are based on LS estimates lead to the selection of inadequate subsets. To overcome the problem of multicollinearity in subset selection, a new subset selection algorithm based on the ridge estimator is proposed. It is shown that the new algorithm is a better alternative to Mallows's Cp when the data exhibit multicollinearity.
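
The ridge estimator that the abstract refers to can be sketched for the simplest case of two nearly collinear predictors and no intercept, solving (X'X + kI)b = X'y in closed form. This shows only the estimator, not the proposed subset selection algorithm; the function name, the data, and the choice k = 1 are illustrative assumptions.

```python
def ridge_two_predictors(x1, x2, y, k=0.0):
    """Solve (X'X + k I) b = X'y for two predictors without an intercept.

    k = 0 gives the ordinary least squares estimate; k > 0 gives the ridge estimate.
    """
    s11 = sum(a * a for a in x1) + k
    s22 = sum(b * b for b in x2) + k
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("singular system: predictors are exactly collinear and k is 0")
    # Closed-form inverse of the 2x2 matrix [[s11, s12], [s12, s22]].
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


# Made-up, nearly collinear predictors; y is exactly x1 + x2.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.1, 1.9, 3.2, 3.9]
y = [a + b for a, b in zip(x1, x2)]

ls = ridge_two_predictors(x1, x2, y)           # least squares (k = 0)
ridge = ridge_two_predictors(x1, x2, y, k=1.0)  # ridge shrinks the coefficient vector
print(ls, ridge)
```

Adding k to the diagonal moves the determinant away from zero, which is why the ridge solution is more stable than least squares when the predictors are close to collinear.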
