
Variable selection is a fundamental challenge in statistical learning when one works with data sets containing a huge number of predictors. In this article we consider two procedures popular in model selection: Lasso and adaptive Lasso. Our goal is to investigate properties of estimators based on minimization of Lasso-type penalized empirical risk with a convex loss function, in particular a nondifferentiable one. We obtain theorems concerning the rate of convergence in estimation, consistency in model selection and oracle properties for Lasso estimators when the number of predictors is fixed, i.e. it does not depend on the sample size. Moreover, we study properties of Lasso and adaptive Lasso estimators on simulated and real data sets.
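
The two procedures named in this abstract can be illustrated with a minimal sketch for the special case of squared-error loss (the abstract itself covers general convex losses, which this sketch does not). The function names, the OLS initial estimate, the weight exponent `gamma` and the small constant `1e-8` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def weighted_lasso(X, y, lam, weights, n_iter=200):
    """Coordinate descent for
    (1 / (2n)) * ||y - X b||^2 + lam * sum_j weights[j] * |b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n          # (1/n) * x_j' x_j
    resid = np.asarray(y, dtype=float).copy()  # equals y - X @ beta for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]         # remove coordinate j from the fit
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam * weights[j]) / col_sq[j]
            resid -= X[:, j] * beta[j]         # put the updated coordinate back
    return beta

def adaptive_lasso(X, y, lam, gamma=1.0):
    """Adaptive Lasso: weights are inverse powers of an initial estimate.
    Assumption: OLS is used as the initial estimate (requires n > p)."""
    beta_init, *_ = np.linalg.lstsq(X, y, rcond=None)
    weights = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)
    return weighted_lasso(X, y, lam, weights)
```

Calling `weighted_lasso` with all weights equal to one gives the plain Lasso; the adaptive variant penalizes coefficients with a small initial estimate more heavily, which is the mechanism behind its model selection consistency in the fixed-dimension setting.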

We consider estimation in a high-dimensional linear model with strongly correlated variables. We propose to cluster the variables first and then perform sparse estimation, such as the Lasso for cluster-representatives or the group Lasso based on the structure from the clusters. Regarding the first step, we present a novel bottom-up agglomerative clustering algorithm based on canonical correlations, and we show that it finds an optimal solution and is statistically consistent. We also present some theoretical arguments that canonical correlation based clustering leads to a better-posed compatibility constant for the design matrix, which ensures identifiability and an oracle inequality for the group Lasso. Furthermore, we discuss circumstances where using cluster-representatives with the Lasso as subsequent estimator leads to improved results for prediction and detection of variables. We complement the theoretical analysis with various empirical results.

We propose two new procedures based on multiple hypothesis testing for correct support estimation in high‐dimensional sparse linear models. We conclusively prove that both procedures are powerful and do not require the sample size to be large. The first procedure tackles the atypical setting of ordered variable selection through an extension of a testing procedure previously developed in the context of a linear hypothesis. The second procedure is the main contribution of this paper. It enables data analysts to perform support estimation in the general high‐dimensional framework of non‐ordered variable selection. A thorough simulation study and applications to real datasets using the R package mht show that our non‐ordered variable procedure produces excellent results in terms of correct support estimation as well as in terms of mean square errors and false discovery rate, when compared to common methods such as the Lasso, the SCAD penalty, forward regression or the false discovery rate procedure (FDR).

The Lasso has sparked interest in the use of penalization of the log‐likelihood for variable selection, as well as for shrinkage. We are particularly interested in the more‐variables‐than‐observations case of characteristic importance for modern data. The Bayesian interpretation of the Lasso as the maximum a posteriori estimate of the regression coefficients, which have been given independent, double exponential prior distributions, is adopted. Generalizing this prior provides a family of hyper‐Lasso penalty functions, which includes the quasi‐Cauchy distribution of Johnstone and Silverman as a special case. The properties of this approach, including the oracle property, are explored, and an EM algorithm for inference in regression problems is described. The posterior is multi‐modal, and we suggest a strategy of using a set of perfectly fitting random starting values to explore modes in different regions of the parameter space. Simulations show that our procedure provides significant improvements on a range of established procedures, and we provide an example from chemometrics.
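
The Bayesian interpretation mentioned in this abstract can be written out for the Gaussian linear model. This is the standard textbook derivation, not the paper's hyper‐Lasso family; the symbols $y$, $X$, $\beta$, $\sigma^2$, $\tau$ and $\lambda$ are notation assumed here for illustration.

```latex
\begin{aligned}
& y \mid \beta \sim N(X\beta, \sigma^2 I), \qquad
  p(\beta_j) = \frac{1}{2\tau} \exp\!\left(-\frac{|\beta_j|}{\tau}\right), \quad j = 1, \dots, p, \\
& -\log p(\beta \mid y) = \frac{1}{2\sigma^2} \lVert y - X\beta \rVert_2^2
  + \frac{1}{\tau} \sum_{j=1}^{p} |\beta_j| + \text{const}, \\
& \hat\beta_{\mathrm{MAP}} = \arg\min_{\beta} \; \tfrac{1}{2} \lVert y - X\beta \rVert_2^2
  + \lambda \lVert \beta \rVert_1, \qquad \lambda = \sigma^2 / \tau .
\end{aligned}
```

Multiplying the negative log posterior by $\sigma^2$ gives the last line, so a tighter double exponential prior (smaller $\tau$) corresponds to a larger Lasso penalty $\lambda$.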

We extend our previous work on identifying the collection of significant factors that determine a random response variable which is, in general, nonbinary. Such identification is important, e.g., in biological and medical studies. Our approach is to examine the quality of response variable prediction by functions of (a certain part of) the factors. Estimating the prediction error requires a cross-validation procedure, a prediction algorithm, and estimation of the penalty function. Using simulated data, we demonstrate the efficiency of our method. We prove a new central limit theorem for the regularized estimates introduced, under some natural conditions for arrays of exchangeable random variables.

Semiparametric regression models with multiple covariates are commonly encountered. When there are covariates not associated with the response variable, variable selection may lead to sparser models, more lucid interpretations and more accurate estimation. In this study, we adopt a sieve approach for the estimation of nonparametric covariate effects in semiparametric regression models. We adopt a two-step iterated penalization approach for variable selection. In the first step, a mixture of the Lasso and group Lasso penalties is employed to conduct the first-round variable selection and obtain the initial estimate. In the second step, a mixture of the weighted Lasso and weighted group Lasso penalties, with weights constructed using the initial estimate, is employed for variable selection. We show that the proposed iterated approach has the variable selection consistency property, even when the number of unknown parameters diverges with the sample size. Numerical studies, including simulation and analysis of a diabetes dataset, show satisfactory performance of the proposed approach.

Non‐parametric estimation of functional relationships is an important part of data analysis, particularly in the exploratory stages. This paper considers non‐parametric estimation of the mean functions in family studies using weighted robust estimating equations while retaining a fully parametric model for the covariance structure. The proposed procedure allows an exploratory examination of complex pedigree data that is an invaluable aid in determining appropriate models. This is illustrated by an examination of the relationship between IQ and the level of a particular protein in individuals collected as part of a large family study.

In this paper, we study a nonparametric additive regression model suitable for a wide range of time series applications. Our model includes a periodic component, a deterministic time trend, various component functions of stochastic explanatory variables, and an AR(p) error process that accounts for serial correlation in the regression error. We propose an estimation procedure for the nonparametric component functions and the parameters of the error process based on smooth backfitting and quasimaximum likelihood methods. Our theory establishes convergence rates and the asymptotic normality of our estimators. Moreover, we are able to derive an oracle‐type result for the estimators of the AR parameters: Under fairly mild conditions, the limiting distribution of our parameter estimators is the same as when the nonparametric component functions are known. Finally, we illustrate our estimation procedure by applying it to a sample of climate and ozone data collected on the Antarctic Peninsula.

We propose a flexible functional approach for modelling generalized longitudinal data and survival time using principal components. In the proposed model the longitudinal observations can be continuous or categorical data, such as Gaussian, binomial or Poisson outcomes. We generalize the traditional joint models that treat categorical data, such as CD4 counts, as continuous data by using some transformations. The proposed model is data-adaptive: it does not require pre-specified functional forms for longitudinal trajectories and automatically detects characteristic patterns. The longitudinal trajectories observed with measurement error or random error are represented by flexible basis functions through a possibly nonlinear link function, combining dimension reduction techniques resulting from functional principal component (FPC) analysis. The relationship between the longitudinal process and event history is assessed using a Cox regression model. Although the proposed model inherits the flexibility of non-parametric methods, the estimation procedure based on the EM algorithm is still parametric in computation, and thus simple and easy to implement. The computation is simplified by dimension reduction for random coefficients or FPC scores. An iterative selection procedure based on the Akaike information criterion (AIC) is proposed to choose the tuning parameters, such as the knots of the spline basis and the number of FPCs, so that an appropriate degree of smoothness and fluctuation can be addressed. The effectiveness of the proposed approach is illustrated through a simulation study, followed by an application to longitudinal CD4 counts and survival data which were collected in a recent clinical trial to compare the efficiency and safety of two antiretroviral drugs.

In this paper we consider minimisation of U-statistics with the weighted Lasso penalty and investigate their asymptotic properties in model selection and estimation. We prove that the use of appropriate weights in the penalty leads to a procedure that behaves like the oracle that knows the true model in advance, i.e. it is model selection consistent and estimates nonzero parameters with the standard rate. For the unweighted Lasso penalty, we obtain sufficient and necessary conditions for model selection consistency of estimators. The obtained results rely strongly on the convexity of the loss function, which is the main assumption of the paper. Our theorems can be applied to the ranking problem as well as generalised regression models. Thus, using U-statistics we can study more complex models (better describing real problems) than the usually investigated linear or generalised linear models.

This article investigates nonparametric estimation of variance functions for functional data when the mean function is unknown. We obtain asymptotic results for the kernel estimator based on squared residuals. Similar to the finite dimensional case, our asymptotic result shows that the smoothness of the unknown mean function has an effect on the rate of convergence. Our simulation studies demonstrate that the estimator based on residuals performs much better than that based on the conditional second moment of the responses.

We propose a new estimator, the thresholded scaled Lasso, in high-dimensional threshold regressions. First, we establish an upper bound on the ℓ∞ estimation error of the scaled Lasso estimator of Lee, Seo, and Shin. This is a nontrivial task as the literature on high-dimensional models has focused almost exclusively on ℓ1 and ℓ2 estimation errors. We show that this sup-norm bound can be used to distinguish between zero and nonzero coefficients at a much finer scale than would have been possible using classical oracle inequalities. Thus, our sup-norm bound is tailored to consistent variable selection via thresholding. Our simulations show that thresholding the scaled Lasso yields substantial improvements in terms of variable selection. Finally, we use our estimator to shed further empirical light on the long-running debate on the relationship between the level of debt (public and private) and GDP growth. Supplementary materials for this article are available online.
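
The link between a sup-norm bound and variable selection by thresholding can be made concrete with a small sketch. If every coordinate of an estimate is within c of the truth and every nonzero true coefficient exceeds 2c in absolute value, then any threshold strictly between c and (smallest nonzero magnitude minus c) recovers the true support. The code below shows only this generic hard-thresholding step; it is not the paper's scaled Lasso estimator, and the names `hard_threshold`, `selected_support` and `tau` are hypothetical.

```python
import numpy as np

def hard_threshold(beta_hat, tau):
    """Set to zero every coefficient whose absolute value is below tau."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    return np.where(np.abs(beta_hat) >= tau, beta_hat, 0.0)

def selected_support(beta_hat, tau):
    """Indices of the coefficients that survive thresholding at level tau."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    return np.flatnonzero(np.abs(beta_hat) >= tau)

# Illustrative values only: an initial estimate with small spurious entries.
beta_initial = [0.8, -0.03, 0.0, -1.2, 0.05]
beta_thresholded = hard_threshold(beta_initial, tau=0.1)
support = selected_support(beta_initial, tau=0.1)
```

An ℓ1 or ℓ2 bound controls only the aggregate error, so it cannot rule out a single coordinate carrying most of that error; this is why the coordinate-wise bound permits a finer threshold.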

This paper describes inference methods for functional data under the assumption that the functional data of interest are smooth latent functions, characterized by a Gaussian process, which have been observed with noise over a finite set of time points. The methods we propose are completely specified in a Bayesian environment that allows for all inferences to be performed through a simple Gibbs sampler. Our main focus is on estimating and describing uncertainty in the covariance function. However, these models also encompass functional data estimation, functional regression where the predictors are latent functions, and an automatic approach to smoothing parameter selection. Furthermore, these models require minimal assumptions on the data structure as the time points for observations do not need to be equally spaced, the number and placement of observations are allowed to vary among functions, and special treatment is not required when the number of functional observations is less than the dimensionality of those observations. We illustrate the effectiveness of these models in estimating latent functional data, capturing variation in the functional covariance estimate, and in selecting appropriate smoothing parameters in both a simulation study and a regression analysis of medfly fertility data.

In practice, it is not uncommon to encounter a situation in which a discrete response is related to both a functional random variable and multiple real-valued random variables whose impact on the response is nonlinear. In this paper, we consider the generalized partial functional linear additive models (GPFLAM) and present the estimation procedure. In GPFLAM, the nonparametric functions are approximated by polynomial splines and the infinite-dimensional slope function is estimated based on the principal component basis function approximations. We obtain the estimator by maximizing the quasi-likelihood function. We investigate the finite sample properties of the estimation procedure via Monte Carlo simulation studies and illustrate our proposed model by a real data analysis.

We discuss the impact of tuning parameter selection uncertainty in the context of shrinkage estimation and propose a methodology to account for problems arising from this issue. Transferring established concepts from model averaging to shrinkage estimation yields the concept of shrinkage averaging estimation (SAE), which reflects the idea of using weighted combinations of shrinkage estimators with different tuning parameters to improve overall stability, predictive performance and standard errors of shrinkage estimators. Two distinct approaches for an appropriate weight choice, both of which are inspired by concepts from the recent literature on model averaging, are presented. The first approach relates to an optimal weight choice with regard to the predictive performance of the final weighted estimator, and its implementation can be realized via quadratic programming. The second approach has a fairly different motivation and considers the construction of weights via a resampling experiment. Focusing on Ridge, Lasso and Random Lasso estimators, the properties of the proposed shrinkage averaging estimators resulting from these strategies are explored by means of Monte Carlo studies and are compared to traditional approaches where the tuning parameter is simply selected via cross-validation criteria. The results show that the proposed SAE methodology can improve an estimator's overall performance and reveal and incorporate tuning parameter uncertainty. As an illustration, selected methods are applied to some recent data from a study on leadership behavior in life science companies.

The adaptive least absolute shrinkage and selection operator (Lasso) and least absolute deviation (LAD)-Lasso are two attractive shrinkage methods for simultaneous variable selection and regression parameter estimation. While the adaptive Lasso is efficient for small magnitude errors, LAD-Lasso is robust against heavy-tailed errors and severe outliers. In this article, we consider a data-driven convex combination of these two modern procedures to produce a robust adaptive Lasso, which not only enjoys the oracle properties, but synthesizes the advantages of the adaptive Lasso and LAD-Lasso. It fully adapts to different error structures including the infinite variance case and automatically chooses the optimal weight to achieve both robustness and high efficiency. Extensive simulation studies demonstrate a good finite sample performance of the robust adaptive Lasso. Two data sets are analyzed to illustrate the practical use of the procedure.

A number of nonstationary models have been developed to estimate extreme events as a function of covariates. A quantile regression (QR) model is a statistical approach intended to estimate and conduct inference about the conditional quantile functions. In this article, we focus on simultaneous variable selection and parameter estimation through penalized quantile regression. We conduct a comparison of regularized quantile regression models with B-splines in a Bayesian framework. Regularization is based on a penalty and aims to favor a parsimonious model, especially in the case of a high-dimensional space. The prior distributions related to the penalties are detailed. Five penalties (Lasso, Ridge, SCAD0, SCAD1 and SCAD2) are considered with their equivalent expressions in the Bayesian framework. The regularized quantile estimates are then compared to the maximum likelihood estimates with respect to the sample size. Markov chain Monte Carlo (MCMC) algorithms are developed for each hierarchical model to simulate the conditional posterior distribution of the quantiles. Results indicate that SCAD0 and Lasso have the best performance for quantile estimation according to the Relative Mean Bias (RMB) and the Relative Mean-Error (RME) criteria, especially in the case of heavy-tailed errors. A case study of the annual maximum precipitation at Charlo, Eastern Canada, with the Pacific North Atlantic climate index as covariate is presented.

When functional data are not homogeneous, for example, when there are multiple classes of functional curves in the dataset, traditional estimation methods may fail. In this article, we propose a new estimation procedure for the mixture of Gaussian processes, to incorporate both functional and inhomogeneous properties of the data. Our method can be viewed as a natural extension of high-dimensional normal mixtures. However, the key difference is that smoothed structures are imposed for both the mean and covariance functions. The model is shown to be identifiable, and can be estimated efficiently by a combination of ideas from the expectation-maximization (EM) algorithm, kernel regression, and functional principal component analysis. Our methodology is empirically justified by Monte Carlo simulations and illustrated by an analysis of a supermarket dataset.

In practical survey sampling, missing data are unavoidable due to nonresponse, rejected observations by editing, disclosure control, or outlier suppression. We propose a calibrated imputation approach so that valid point and variance estimates of the population (or domain) totals can be computed by the secondary users using simple complete‐sample formulae. This is especially helpful for variance estimation, which generally requires additional information and tools that are unavailable to the secondary users. Our approach is natural for continuous variables, where the estimation may be either based on reweighting or imputation, including possibly their outlier‐robust extensions. We also propose a multivariate procedure to accommodate the estimation of the covariance matrix between estimated population totals, which facilitates variance estimation of the ratios or differences among the estimated totals. We illustrate the proposed approach using simulation data in supplementary materials that are available online.

Determination of the best subset is an important step in vector autoregressive (VAR) modeling. Traditional methods either conduct subset selection and parameter estimation separately or are computationally expensive. In this article, we propose a VAR model selection procedure using the adaptive Lasso, because it is computationally efficient and can select the subset and estimate the parameters simultaneously. With a proper choice of tuning parameters, we can select the correct subset and obtain the asymptotic normality of the nonzero parameters. Simulation studies and real data analysis show that the adaptive Lasso performs better than existing methods in VAR model fitting and prediction.
