In this paper, we propose a conditional quantile independence screening approach for ultra-high-dimensional heterogeneous data given some known, significant and low-dimensional variables. The new method does not require imposing a specific model structure for the response and covariates and can detect additional features that contribute to conditional quantiles of the response given those already-identified important predictors. We also prove that the proposed procedure enjoys the ranking consistency and sure screening properties. Simulation studies are carried out to examine the performance of the proposed procedure. Finally, we illustrate it with a real data example.

Summary. Nearest neighbour algorithms are among the most popular methods used in statistical pattern recognition. The models are conceptually simple and empirical studies have shown that their performance is highly competitive against other techniques. However, the lack of a formal framework for choosing the size of the neighbourhood k is problematic. Furthermore, the method can only make discrete predictions by reporting the relative frequency of the classes in the neighbourhood of the prediction point. We present a probabilistic framework for the k-nearest-neighbour method that largely overcomes these difficulties. Uncertainty is accommodated via a prior distribution on k as well as in the strength of the interaction between neighbours. These prior distributions propagate uncertainty through to proper probabilistic predictions that have continuous support on (0, 1). The method makes no assumptions about the distribution of the predictor variables. The method is also fully automatic with no user-set parameters and empirically it proves to be highly accurate on many benchmark data sets.
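To make the limitation described in this abstract concrete, here is a minimal sketch of a conventional k-nearest-neighbour classifier, not the probabilistic method the abstract proposes. The function name, the Euclidean metric and the toy data are assumptions made for illustration only. With k neighbours, the reported class frequencies can only be multiples of 1/k.

```python
import math
from collections import Counter


def knn_class_frequencies(train, query, k):
    """Return {label: relative frequency} among the k nearest training points.

    `train` is a list of (feature_tuple, label) pairs. Euclidean distance is
    an assumption of this sketch; the abstract does not fix a metric.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}


# Hypothetical toy data: three points of class "a", two of class "b".
train = [
    ((0.0, 0.0), "a"),
    ((0.1, 0.2), "a"),
    ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"),
    ((0.9, 1.1), "b"),
]

# With k = 4 every reported frequency is a multiple of 1/4.
freqs = knn_class_frequencies(train, (0.15, 0.15), k=4)
```

The choice of k is left to the caller here, which is exactly the user-set parameter the abstract says its framework removes by placing a prior distribution on k.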

We develop a fast variational approximation scheme for Gaussian process (GP) regression, where the spectrum of the covariance function is subjected to a sparse approximation. Our approach enables uncertainty in covariance function hyperparameters to be treated without using Monte Carlo methods and is robust to overfitting. Our article makes three contributions. First, we present a variational Bayes algorithm for fitting sparse spectrum GP regression models that uses nonconjugate variational message passing to derive fast and efficient updates. Second, we propose a novel adaptive neighbourhood technique for obtaining predictive inference that is effective in dealing with nonstationarity. Regression is performed locally at each point to be predicted and the neighbourhood is determined using a measure defined based on lengthscales estimated from an initial fit. Weighting dimensions according to lengthscales downweights variables of little relevance, leading to automatic variable selection and improved prediction. Third, we introduce a technique for accelerating convergence in nonconjugate variational message passing by adapting step sizes in the direction of the natural gradient of the lower bound. Our adaptive strategy can be easily implemented and empirical results indicate significant speedups.

This article is concerned with the data sharpening (DS) technique in nonparametric regression under the setting where the multivariate predictor is embedded in an unknown low-dimensional manifold. The asymptotic bias is derived, which reveals that the proposed DS estimator has a reduced bias compared to the usual local linear estimator. The asymptotic normality of the DS estimator is also established. Simulations and applications to real data confirm that the bias reduction for the DS estimator supported on an unknown manifold is evident.

Mixed-effect models are very popular for analyzing data with a hierarchical structure. In medical applications, typical examples include repeated observations within subjects in a longitudinal design and patients nested within centers in a multicenter design. Recently, however, owing to medical advances, the number of fixed-effect covariates collected from each patient can be quite large, e.g., data on gene expressions of each patient, and not all of these variables are necessarily important for the outcome. It is therefore very important to choose the relevant covariates correctly to obtain the optimal inference for the overall study. On the other hand, the relevant random effects will often be low-dimensional and pre-specified. In this paper, we consider regularized selection of important fixed-effect variables in linear mixed-effect models along with maximum penalized likelihood estimation of both fixed and random-effect parameters based on general non-concave penalties. Asymptotic and variable selection consistency with oracle properties are proved for low-dimensional cases as well as for high dimensionality of non-polynomial order of sample size (number of parameters is much larger than sample size). We also provide a suitable computationally efficient algorithm for implementation. Additionally, all the theoretical results are proved for a general non-convex optimization problem that applies to several important situations well beyond the mixed model setup (such as finite mixtures of regressions), illustrating the wide applicability of our proposal.

For ultrahigh-dimensional data, independent feature screening has been demonstrated both theoretically and empirically to be an effective dimension reduction method with low computational demand. Motivated by the Buckley–James method to accommodate censoring, we propose a fused Kolmogorov–Smirnov filter to screen out the irrelevant dependent variables for ultrahigh-dimensional survival data. The proposed model-free screening method can work with many types of covariates (e.g. continuous, discrete and categorical variables) and is shown to enjoy the sure independent screening property under mild regularity conditions without requiring any moment conditions on covariates. In particular, the proposed procedure can still be powerful when covariates are strongly dependent on each other. We further develop an iterative algorithm to enhance the performance of our method in practical situations where some covariates may be marginally unrelated but jointly related to the response. We conduct extensive simulations to evaluate the finite-sample performance of the proposed method, showing that it performs favourably against existing typical methods. As an illustration, we apply the proposed method to the diffuse large-B-cell lymphoma study.

In biomedical research, profiling is now commonly conducted, generating high-dimensional genomic measurements (without loss of generality, say genes). An important analysis objective is to rank genes according to their marginal associations with a disease outcome/phenotype. Clinical covariates, including for example clinical risk factors and environmental exposures, usually exist and need to be properly accounted for. In this study, we propose conducting marginal ranking of genes using a receiver operating characteristic (ROC) based method. This method can accommodate categorical, censored survival, and continuous outcome variables in a very similar manner. Unlike logistic-model-based methods, it does not make very specific assumptions about the model, making it robust. In ranking genes, we account for both the main effects of clinical covariates and their interactions with genes, and develop multiple diagnostic accuracy improvement measurements. Using simulation studies, we show that the proposed method is effective in that genes, or gene–covariate interactions, associated with the outcome receive high rankings. In data analysis, we observe some differences between the rankings using the proposed method and the logistic-model-based method.

Summary. This paper investigates the effects of ordinal regressors in linear regression models and in limited dependent variable models. Each ordered categorical variable is interpreted as a rough measurement of an underlying continuous variable, as is often done in microeconometrics for the dependent variable. It is shown that using ordinal indicators only leads to correct answers in a few special cases. In most situations, the usual estimators are biased. In order to estimate the parameters of the model consistently, the indirect estimation procedure suggested by Gourieroux et al. (1993) is applied. To demonstrate this method, a simulation study is performed first, and then two real data sets are used. In the latter case, continuous regressors are transformed into categorical variables to study the behavior of the estimation procedure. The method is extended to the case of limited dependent variable models. In general, the indirect estimators lead to adequate results. Received: March 27, 2000; revised version: March 6, 2001

We propose a generalized estimating equations (GEE) approach to the estimation of the mean and covariance structure of bivariate time series processes of panel data. The one-step approach allows for mixed continuous and discrete dependent variables. A Monte Carlo study is presented to compare our particular GEE estimator with more standard GEE estimators. In the empirical illustration, we apply our estimator to the analysis of individual wage dynamics and the incidence of profit-sharing in West Germany. Our findings show that time-invariant unobserved individual ability jointly influences individual wages and participation in profit-sharing schemes.

A Kernel Variogram Estimator for Clustered Data
Abstract. The variogram provides an important method for measuring the dependence of attribute values between spatial locations. When the nature of the sampling process leads to clustered data, it is advisable to use a variogram estimator that adjusts for the clustering of samples. In this setting, a non-parametric weighted estimator, obtained by combining a weight inversely proportional to a given neighbourhood density with the kernel method, seems to behave satisfactorily in practice. This paper pursues a theoretical study of the cluster robust estimator, by proving that it is asymptotically unbiased as well as consistent and by providing criteria for selection of the bandwidth parameter and the neighbourhood radius. Numerical studies are also included to illustrate the performance of the considered estimator and the suggested approaches.

Mediation is a hypothesized causal chain among three variables. Mediation analysis for continuous response variables is well developed in the literature, and it can be shown that the indirect effect is equal to the total effect minus the direct effect. However, mediation analysis for categorical responses is still not fully developed. The purpose of this article is to propose a simpler method of analysing the mediation effect among three variables when the dependent and mediator variables are both dichotomous. We propose using the latent variable technique, which in turn adjusts for the necessary condition that the indirect effect is equal to the total effect minus the direct effect. An intensive simulation study is conducted to compare the proposed method with other methods in the literature. Our theoretical derivation and simulation study show that the proposed approach is simpler to use and at least as good as other approaches provided in the literature. We illustrate our approach by testing for potential mediators of the relationship between depression and obesity among children and adolescents, in comparison with the method in Winship and Mare, using National children health survey data 2011–2012.
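For the continuous-response case mentioned at the start of this abstract, the identity is usually written in terms of three linear regressions. The notation below (coefficients a, b, c, c' and intercepts i_1, i_2, i_3) is the conventional one and is an assumption here; it is not taken from the abstract.

```latex
\begin{align}
M &= i_1 + aX + \varepsilon_1, \\
Y &= i_2 + c'X + bM + \varepsilon_2, \\
Y &= i_3 + cX + \varepsilon_3, \\
\underbrace{ab}_{\text{indirect effect}}
  &= \underbrace{c}_{\text{total effect}} - \underbrace{c'}_{\text{direct effect}}.
\end{align}
```

With least-squares estimates from the same sample this identity holds exactly. It generally fails when Y or M is dichotomous and is modelled with logistic or probit regression, because the coefficients of the separate models are then on different scales; this is the gap the abstract sets out to address.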

We focus on principal differential analysis (PDA) of functional data for obtaining a low-dimensional representation of a collection of curves. PDA assumes there exists a linear differential operator that results in the zero-function when it is applied to each of the data curves, or equivalently, that the curves belong to a low-dimensional subspace of a normed linear space. PDA sets out to estimate this linear differential operator from the data and proceeds from there. Our contribution is to explain how subject covariates can be incorporated into a PDA analysis for graphical exploration of patterns in the data.

This article applies different approaches to distinguish state dependence from unobserved heterogeneity and serial correlation and, hence, test for state dependence in consumer brand choices. First, we apply a simple method proposed by Chamberlain, which involves lagged exogenous variables only. Second, we also estimate a lagged-dependent-variable specification proposed by Wooldridge. Third, we use the estimation approach suggested by Wooldridge to estimate a model with both lagged dependent and exogenous variables to distinguish between the two different sources of choice dynamics, state dependence and lagged effects of the exogenous variables. Our analysis reveals that the best approach is to use models with both lagged dependent and exogenous variables. Our findings include strong evidence for state dependence in five out of the six product categories studied in this article.

Many research proposals involve collecting multiple sources of information from a set of common samples, with the goal of performing an integrative analysis describing the associations between sources. We propose a method that characterizes the dominant modes of co-variation between the variables in two datasets while simultaneously performing variable selection. Our method relies on a sparse, low rank approximation of a matrix containing pairwise measures of association between the two sets of variables. We show that the proposed method shares a close connection with another group of methods for integrative data analysis – sparse canonical correlation analysis (CCA). Under some assumptions, the proposed method and sparse CCA aim to select the same subsets of variables. We show through simulation that the proposed method can achieve better variable selection accuracies than two state-of-the-art sparse CCA algorithms. Empirically, we demonstrate through the analysis of DNA methylation and gene expression data that the proposed method selects variables that have as high or higher canonical correlation than the variables selected by sparse CCA methods, which is a rather surprising finding given that the objective function of the proposed method does not actually maximize the canonical correlation.

Statistical simulation in survey statistics is usually based on repeatedly drawing samples from population data. Furthermore, population data may be used in courses on survey statistics to explain issues regarding, e.g., sampling designs. Since the availability of real population data is in general very limited, it is necessary to generate synthetic data for such applications. The simulated data need to be as realistic as possible, while at the same time ensuring data confidentiality. This paper proposes a method for generating close-to-reality population data for complex household surveys. The procedure consists of four steps for setting up the household structure, simulating categorical variables, simulating continuous variables and splitting continuous variables into different components. It is not required to perform all four steps so that the framework is applicable to a broad class of surveys. In addition, the proposed method is evaluated in an application to the European Union Statistics on Income and Living Conditions (EU-SILC).

Sliced regression is an effective dimension reduction method that replaces the original high-dimensional predictors with an appropriate low-dimensional projection. It is free from any probabilistic assumption and can exhaustively estimate the central subspace. In this article, we propose to incorporate shrinkage estimation into sliced regression so that variable selection can be achieved simultaneously with dimension reduction. The new method can improve the estimation accuracy and achieve better interpretability for the reduced variables. The efficacy of the proposed method is shown through both simulation and real data analysis.

Regression analyses are commonly performed with doubly limited continuous dependent variables; for instance, when modeling the behavior of rates, proportions and income concentration indices. Several models are available in the literature for use with such variables, one of them being the unit gamma regression model. In all such models, parameter estimation is typically performed using the maximum likelihood method and testing inferences on the model's parameters are usually based on the likelihood ratio test. Such a test can, however, deliver quite imprecise inferences when the sample size is small. In this paper, we propose two modified likelihood ratio test statistics for use with the unit gamma regressions that deliver much more accurate inferences when the number of data points is small. Numerical (i.e. simulation) evidence is presented for both fixed dispersion and varying dispersion models, and also for tests that involve nonnested models. We also present and discuss two empirical applications.

Using a multivariate latent variable approach, this article proposes some new general models to analyze correlated bounded continuous and categorical (nominal and/or ordinal) responses with and without non-ignorable missing values. First, we discuss regression methods for jointly analyzing continuous, nominal, and ordinal responses, motivated by the analysis of data from studies of toxicity development. Second, using the beta and Dirichlet distributions, we extend the models so that some continuous responses are replaced by bounded continuous responses. The joint distribution of the bounded continuous, nominal and ordinal variables is decomposed into a marginal multinomial distribution for the nominal variable and a conditional multivariate joint distribution for the bounded continuous and ordinal variables given the nominal variable. We estimate the regression parameters under the new general location models using the maximum-likelihood method. Sensitivity analysis is also performed to study the influence of small perturbations of the parameters of the missing mechanisms of the model on the maximal normal curvature. The proposed models are applied to two data sets: BMI, Steatosis and Osteoporosis data and Tehran household expenditure budgets.

This article uses a local-information, near-neighbor forecasting methodology as a prediction test for evidence of a noisy, chaotic data-generating process underlying the Divisia monetary-aggregate series. Using a nonparametric method known to perform well with low-dimensional chaotic processes infected by noise, accompanied by a robust test of forecast performance evaluation, we compare out-of-sample forecasting accuracy from the local-information method to forecasting accuracy from the best fitting global linear model. Our results fail to substantiate previous claims for determinism in the Divisia monetary-aggregate series because the degree of forecast improvement obtained by the local-information method is not consistent with the hypothesis of a low-dimensional attractor underlying the Divisia data.

Both continuous and categorical covariates are common in traditional Chinese medicine (TCM) research, especially in clinical syndrome identification and in risk prediction research. For groups of dummy variables that are generated by the same categorical covariate, it is important to penalize them group-wise rather than individually. In this paper, we discuss the group lasso method for a risk prediction analysis in TCM osteoporosis research. This is the first application of such a group-wise variable selection method in this field. It may lead to new insights into using the grouped penalization method to select appropriate covariates in TCM research. The introduced methodology can select categorical and continuous variables and estimate their parameters simultaneously. In our application to the osteoporosis data, four covariates (including both categorical and continuous covariates) are selected out of 52 covariates. The accuracy of the prediction model is excellent. Compared with prediction models with different covariates, the group lasso risk prediction model can significantly decrease the error rate and help TCM doctors to identify patients with a high risk of osteoporosis in clinical practice. Simulation results show that the application of the group lasso method is reasonable for the categorical covariates selection model in this TCM osteoporosis research.
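The group-wise penalization described in this abstract can be illustrated with the block soft-thresholding step that many group lasso solvers use. The abstract does not say which algorithm the authors applied, so the function name, the threshold values and the toy coefficient groups below are assumptions for illustration only. The point of the sketch is that all dummy variables of one categorical covariate are kept or dropped together.

```python
import math


def group_soft_threshold(coefs, threshold):
    """Shrink one coefficient group by its Euclidean norm.

    This is the proximal operator of the penalty threshold * ||coefs||_2.
    The whole group becomes zero when its norm does not exceed the threshold;
    otherwise every coefficient is scaled by the same factor.
    """
    norm = math.sqrt(sum(c * c for c in coefs))
    if norm <= threshold:
        return [0.0] * len(coefs)
    scale = 1.0 - threshold / norm
    return [scale * c for c in coefs]


# Hypothetical groups: three dummy variables from one categorical covariate.
weak_group = group_soft_threshold([0.3, 0.0, 0.4], 1.0)    # norm 0.5, removed
strong_group = group_soft_threshold([3.0, 0.0, 4.0], 2.5)  # norm 5.0, halved
```

A plain lasso would threshold each dummy variable separately and could keep only some levels of a categorical covariate; thresholding on the group norm avoids that.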

