Found 20 similar articles (search time: 31 ms)
We consider estimation in a high-dimensional linear model with strongly correlated variables. We propose to cluster the variables first and then do sparse estimation, such as the Lasso for cluster-representatives or the group Lasso based on the structure from the clusters. Regarding the first step, we present a novel bottom-up agglomerative clustering algorithm based on canonical correlations, and we show that it finds an optimal solution and is statistically consistent. We also present some theoretical arguments that canonical correlation based clustering leads to a better-posed compatibility constant for the design matrix, which ensures identifiability and an oracle inequality for the group Lasso. Furthermore, we discuss circumstances where using cluster-representatives with the Lasso as subsequent estimator leads to improved results for prediction and detection of variables. We complement the theoretical analysis with various empirical results.

Empirical Bayes is a versatile approach to “learn from a lot” in two ways: first, from a large number of variables and, second, from a potentially large amount of prior information, for example, stored in public repositories. We review applications of a variety of empirical Bayes methods to several well‐known model‐based prediction methods, including penalized regression, linear discriminant analysis, and Bayesian models with sparse or dense priors. We discuss “formal” empirical Bayes methods that maximize the marginal likelihood but also more informal approaches based on other data summaries. We contrast empirical Bayes to cross‐validation and full Bayes and discuss hybrid approaches. To study the relation between the quality of an empirical Bayes estimator and p, the number of variables, we consider a simple empirical Bayes estimator in a linear model setting. We argue that empirical Bayes is particularly useful when the prior contains multiple parameters, which model a priori information on variables termed “co‐data”. In particular, we present two novel examples that allow for co‐data: first, a Bayesian spike‐and‐slab setting that facilitates inclusion of multiple co‐data sources and types and, second, a hybrid empirical Bayes–full Bayes ridge regression approach for estimation of the posterior predictive interval.


Among the statistical methods to model stochastic behaviours of objects, clustering is a preliminary technique to recognize similar patterns within a group of observations in a data set. Various distances that measure differences among objects can be used to cluster data through numerous clustering methods. When the variables at hand contain geometrical information about objects, such metrics should be adequately adapted. In fact, statistical methods for such data are endowed with a geometrical paradigm in a multivariate sense. In this paper, a procedure for clustering shape data is suggested employing appropriate metrics. Then, the best shape distance candidate as well as a suitable agglomerative method for clustering the simulated shape data are provided by considering cluster validation measures. The results are implemented in a real life application.

The current literature deals with the change-point problem only in the context of the observation of a single sequence. In this paper, inference will be based on the observation of N sequences of random variables, each sequence containing one change-point. This extension allows the effective use of bootstrap and empirical Bayes methods, both of which are not feasible in the single-path context. Two classes of these “multi-path” change-point problems are considered. If the change-point is assumed to occur at the same position in each sequence, then the terminology “fixed-tau multi-path change-point” will be used. In other cases, one may expect the change-point to occur at random positions in each sequence, according to some distribution, a “random-tau multi-path change-point” problem. Examples and simulations are given.

Multiple-response (or pick any/c) categorical variables summarize responses to survey questions that ask “pick any” from a set of item responses. Extensions to loglinear model methodology are proposed to model associations between these variables across all their items simultaneously. Because individual item responses to a multiple-response categorical variable are likely to be correlated, the usual chi-square distributional approximations for model-comparison statistics are not appropriate. Adjusted statistics and a new bootstrap procedure are developed to facilitate distributional approximations. Odds ratio and standardized Pearson residual measures are also developed to estimate specific associations and examine deviations from a specified model.

We study estimation and inference in settings where the interest is in the effect of a potentially endogenous regressor on some outcome. To address the endogeneity, we exploit the presence of additional variables. Like conventional instrumental variables, these variables are correlated with the endogenous regressor. However, unlike conventional instrumental variables, they also have direct effects on the outcome, and thus are “invalid” instruments. Our novel identifying assumption is that the direct effects of these invalid instruments are uncorrelated with the effects of the instruments on the endogenous regressor. We show that in this case the limited-information-maximum-likelihood (liml) estimator is no longer consistent, but that a modification of the bias-corrected two-stage-least-squares (tsls) estimator is consistent. We also show that conventional tests for over-identifying restrictions, adapted to the many instruments setting, can be used to test for the presence of these direct effects. We recommend that empirical researchers carry out such tests and compare estimates based on liml and the modified version of bias-corrected tsls. We illustrate in the context of two applications that such practice can be illuminating, and that our novel identifying assumption has substantive empirical content.

We propose a sequential test for predictive ability for recursively assessing whether some economic variables have explanatory content for another variable. In the forecasting literature it is common to assess predictive ability by using “one-shot” tests at each estimation period. We show that this practice leads to size distortions, selects overfitted models and provides spurious evidence of in-sample predictive ability, and may lower the forecast accuracy of the model selected by the test. The usefulness of the proposed test is shown in well-known empirical applications to the real-time predictive content of money for output and the selection between linear and nonlinear models.

Many survey questions allow respondents to pick any number out of c possible categorical responses or “items”. These kinds of survey questions often use the terminology “choose all that apply” or “pick any”. Often of interest is determining if the marginal response distributions of each item differ among r different groups of respondents. Agresti and Liu (1998, 1999) call this a test for multiple marginal independence (MMI). If respondents are allowed to pick only 1 out of c responses, the hypothesis test may be performed using the Pearson chi-square test of independence. However, since respondents may pick more or fewer than 1 response, the test's assumption that responses are made independently of each other is violated. Recently, a few MMI testing methods have been proposed. Loughin and Scherer (1998) propose using a bootstrap method based on a modified version of the Pearson chi-square test statistic. Agresti and Liu (1998, 1999) propose using marginal logit models, quasisymmetric loglinear models, and a few methods based on Pearson chi-square test statistics. Decady and Thomas (1999) propose using a Rao-Scott adjusted chi-squared test statistic. There has not been a full investigation of these MMI testing methods. The purpose here is to evaluate the proposed methods and propose a few new methods. Recommendations are given to guide the practitioner in choosing which MMI testing methods to use.
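For orientation, the single-response baseline mentioned in the abstract above (the Pearson chi-square test of independence for an r x c table) can be sketched as follows. This is a minimal illustration only: the function name `pearson_chi_square` and the example counts are assumptions made here, and none of the MMI methods cited in the abstract (bootstrap, marginal logit, Rao-Scott adjustment) is reproduced.

```python
def pearson_chi_square(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of counts.

    Illustrative helper (name is hypothetical): valid only when each
    respondent contributes exactly one response, i.e. the single-response
    case, not the pick-any case the abstract is about. Assumes every row
    and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and response.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Made-up counts: 2 groups of respondents, 2 response categories.
stat, df = pearson_chi_square([[10, 20], [20, 10]])
```

When respondents may pick several items, the cell counts are no longer independent multinomial counts, which is why the statistic above cannot simply be referred to a chi-square distribution in the pick-any setting.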

Nonparametric families of aging distributions have been the subject of investigation for more than three decades. Both probabilistic and statistical properties of these distributions were studied for such families as “increasing failure rate”, “new better than used”, “new better than used in expectation”, and “harmonic new better than used in expectation”. In the present work, moment inequalities are derived for the above-mentioned four families that demonstrate that if the mean life is finite for any of them then all higher-order moments exist. Next, based on these inequalities, new testing procedures for exponentiality against any one of the above classes are introduced and studied, showing that they are simpler than most earlier ones and have high relative efficiency for some commonly used alternatives.

Ghosh and Lahiri (1987a,b) considered simultaneous estimation of several strata means and variances where each stratum contains a finite number of elements, under the assumption that the posterior expectation of any stratum mean is a linear function of the sample observations, the so-called “posterior linearity” property. In this paper we extend their result by retaining the “posterior linearity” property of each stratum mean but allowing a superpopulation model whose mean as well as variance-covariance structure changes from stratum to stratum. The performance of the proposed empirical Bayes estimators is found to be satisfactory both in terms of “asymptotic optimality” (Robbins (1955)) and “relative savings loss” (Efron and Morris (1973)).

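Happiness is not only an important factor influencing people's behaviour but also the ultimate goal people pursue in life. Based on data from the Chinese General Social Survey (CGSS), this paper takes the novel perspective of a “two-dimensional” view of the environment, dividing environmental factors into two dimensions: objectively existing environmental pollution and residents' subjective environmental behaviour. Using an optimized two-stage ordered Probit regression model, the empirical analysis follows two lines of inquiry: from “objectively existing environmental pollution” to “residents' happiness”, and from “residents' happiness” to “residents' subjective environmental behaviour”. Control variables and instrumental variables are introduced in the empirical analysis, and group differences that may arise from economic and regional factors are examined and compared. The results show that both the effect of environmental pollution on residents' happiness and the effect of residents' happiness on their environmental behaviour are significant and heterogeneous. Environmental pollution affects residents' happiness through its influence on their physical health, quality of life, and social activities, while residents' environmental behaviour differs with individual income and level of happiness. The paper analyses in depth the mechanisms linking the environment and residents' happiness, laying a theoretical foundation for the government's formulation of environmental policy and for raising residents' happiness and the contribution of their environmental behaviour.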

In this article, the Brier score is used to investigate the importance of clustering for the frailty survival model. For this purpose, two versions of the Brier score are constructed, i.e., a “conditional Brier score” and a “marginal Brier score.” Both versions of the Brier score show how the clustering effects and the covariate effects affect the predictive ability of the frailty model separately. Using a Bayesian and a likelihood approach, point estimates and 95% credible/confidence intervals are computed. The estimation properties of both procedures are evaluated in an extensive simulation study for both versions of the Brier score. Further, a validation strategy is developed to calculate an internally validated point estimate and credible/confidence interval. The ensemble of the developments is applied to a dental dataset.
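As background for the abstract above, the basic Brier score for binary outcomes is the mean squared difference between predicted probabilities and observed 0/1 outcomes. The sketch below shows only this textbook form; the function name `brier_score` and the example numbers are assumptions made here, and the “conditional” and “marginal” survival versions constructed in the article (which would also need to handle censoring and frailty terms) are not reproduced.

```python
def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes.

    Illustrative helper (name is hypothetical). Lower is better: 0 means
    perfect predictions, and constant predictions of 0.5 score 0.25.
    """
    if len(predicted_probs) != len(outcomes) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predicted_probs, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Made-up example: three predictions against three observed outcomes.
score = brier_score([0.9, 0.1, 0.8], [1, 0, 1])
```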

The squared error loss function applied to Bayesian predictive distributions is investigated as a variable selection criterion in linear regression equations. It is illustrated that “cost-free” variables may be eliminated if they are poor predictors. Regression models where the predictors are fixed and where they are stochastic are both considered. An empirical examination of the criterion and a comparison with other techniques are presented.

This paper addresses the problem of identifying groups that satisfy specific conditions on the means of feature variables. In this study, we refer to the identified groups as “target clusters” (TCs). To identify TCs, we propose a method based on the normal mixture model (NMM) restricted by a linear combination of means. We provide an expectation–maximization (EM) algorithm to fit the restricted NMM by using the maximum-likelihood method. The convergence property of the EM algorithm and a reasonable set of initial estimates are presented. We demonstrate the method's usefulness and validity through a simulation study and two well-known data sets. The proposed method provides several types of useful clusters, which would be difficult to achieve with conventional clustering or exploratory data analysis methods based on the ordinary NMM. A simple comparison with another target clustering approach shows that the proposed method is promising for identifying TCs.

The promising methodology of the “Statistical Learning Theory” for the estimation of multimodal distribution is thoroughly studied. The “tail” is estimated through Hill's, UH and moment methods. The threshold value is determined by nonparametric bootstrap and the minimum mean square error criterion. Further, the “body” is estimated by the nonparametric structural risk minimization method of the empirical distribution function under the regression set-up. As an illustration, rainfall data for the meteorological subdivision of Orissa, India during the period 1871–2006 are used. It is shown that Hill's method performed best for the tail density. Finally, the combined estimated “body” and “tail” of the multimodal distribution is shown to capture the multimodality present in the data.
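Of the tail methods named in the abstract above, Hill's estimator is the simplest to state: with the sample sorted in decreasing order, it averages the log-excesses of the k largest observations over the (k+1)-th largest. The sketch below is a minimal version; the function name `hill_estimator` and the example sample are assumptions made here, the choice of k is left to the caller, and the bootstrap threshold selection described in the abstract is not reproduced.

```python
import math


def hill_estimator(sample, k):
    """Hill estimate of the extreme value index (reciprocal of the tail index).

    Illustrative helper (name is hypothetical). Uses the k largest
    observations, with the (k+1)-th largest acting as the threshold.
    """
    xs = sorted(sample, reverse=True)
    if not 1 <= k < len(xs):
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    threshold = xs[k]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the threshold observation must be positive")
    # Average log-excess of the top k observations over the threshold.
    return sum(math.log(x / threshold) for x in xs[:k]) / k


# Made-up positive sample; k = 2 uses the two largest values.
gamma_hat = hill_estimator([math.exp(3), math.exp(2), math.exp(1), 1.0], 2)
```

The estimate is sensitive to k, which is one reason a data-driven threshold rule, such as the bootstrap and minimum mean square error criterion mentioned in the abstract, is typically paired with it.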

The theoretical and empirical implications of omitted variables, particularly dynamic adjustment effects, are studied. In particular, the attempt to account for such omissions by including possibly irrelevant variables is investigated. This extends the existing knowledge of misspecification analysis in several directions. Ordinary least squares is the estimation technique under study, as has been the case in several recent and related studies. In our empirical example, the question of seasonal variation in interest rates is addressed. We deal with the related issue of deterministic versus stochastic detrending and demonstrate that it can be usefully cast in the context of “misspecification analysis” in dynamic models developed in this article.

In the framework of cluster analysis based on Gaussian mixture models, it is usually assumed that all the variables provide information about the clustering of the sample units. Several variable selection procedures are available in order to detect the structure of interest for the clustering when this structure is contained in a variable sub-vector. Currently, in these procedures a variable is assumed to play one of (up to) three roles: (1) informative, (2) uninformative and correlated with some informative variables, (3) uninformative and uncorrelated with any informative variable. A more general approach for modelling the role of a variable is proposed by taking into account the possibility that the variable vector provides information about more than one structure of interest for the clustering. This approach is developed by assuming that such information is given by non-overlapped and possibly correlated sub-vectors of variables; it is also assumed that the model for the variable vector is equal to a product of conditionally independent Gaussian mixture models (one for each variable sub-vector). Details about model identifiability, parameter estimation and model selection are provided. The usefulness and effectiveness of the described methodology are illustrated using simulated and real datasets.

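Wang Xing, Ma Xuan. Statistical Research (《统计研究》), 2015, 32(10): 74-81
This article studies a change-point estimation model for airfare price series under the influence of the airline industry's dynamic pricing mechanism. It analyses the structural characteristics of airfare price series data and proposes a multi-stage change-point estimation framework for step-shaped series with pronounced multiple change-points in high-noise data environments. The framework cascades several established data analysis methods, including the DBSCAN algorithm, EM Gaussian mixture model clustering, agglomerative hierarchical clustering, and change-point estimation based on the product partition model. An empirical analysis of flights on the “Beijing-Kunming” route verifies the effectiveness and general applicability of the framework.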

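Panel data clustering method and its application (total citations: 7; self-citations: 0; citations by others: 7)
Based on the time-series and cross-sectional features of panel data, and jointly considering the “absolute indicator”, “incremental indicator”, and “time-series fluctuation” features of panel data, a clustering method for panel data is proposed that builds on a reconstructed distance function for measuring panel-data similarity and on the Ward clustering algorithm. Taking fiscal and financial panel data for 2003-2007 as an example, a cluster analysis of China's 14 coastal open cities is carried out, which shows that the method is well suited to applications.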

We re-examine the criteria of “hyper-admissibility” and “necessary bestness”, for the choice of estimator, from the point of view of their relevance to the design of actual surveys. Both these criteria give rise to a unique choice of estimator (viz. the Horvitz-Thompson estimator) whatever be the character under investigation or sample design. However, we show here that the “principal hyper-surfaces” (or “domains”) of dimension one (which are practically uninteresting) play the key role in arriving at the unique choice. A variance estimator v1 (due to Horvitz-Thompson), which takes negative values “often”, is shown to be uniquely “hyperadmissible” in a wide class of unbiased estimators of the variance of the Horvitz-Thompson estimator. Extensive empirical evidence on the superiority of the Sen-Yates-Grundy variance estimator v2 over v1 is presented.
