Similar Literature
20 similar documents found.
1.
In real-data analysis, deciding on the best subset of variables in a regression model is an important problem. Akaike's information criterion (AIC) is often used for variable selection in many fields. When the sample size is not large, however, the AIC has a non-negligible bias that detrimentally affects variable selection. The present paper considers a bias correction of the AIC for selecting variables in the generalized linear model (GLM). By changing the distribution and the link function, the GLM can express a number of statistical models, such as the normal linear regression model, the logistic regression model, and the probit model, which are commonly used in many applied fields. In the present study, we obtain a simple expression for a bias-corrected AIC (corrected AIC, or CAIC) in GLMs, and we provide R code based on our formula. A numerical study reveals that the CAIC performs better than the AIC for variable selection.
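The abstract does not reproduce the paper's CAIC formula, so the sketch below uses the classical Hurvich–Tsai small-sample correction (AICc) on a logistic GLM purely as an illustrative stand-in; the data, sample size, and correction term are assumptions, not the paper's method.

```python
# Illustrative stand-in: a small-sample AIC correction (AICc) for a logistic GLM.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 40                                        # deliberately small sample
X = sm.add_constant(rng.normal(size=(n, 3)))
p = 1 / (1 + np.exp(-(X @ np.r_[0.5, 1.0, 0.0, 0.0])))
y = rng.binomial(1, p)

res = sm.GLM(y, X, family=sm.families.Binomial()).fit()
k = X.shape[1]                                # number of estimated parameters
aic = -2 * res.llf + 2 * k
aicc = aic + 2 * k * (k + 1) / (n - k - 1)    # small-sample bias correction
print(f"AIC = {aic:.2f}, AICc = {aicc:.2f}")
```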

2.
ABSTRACT

Inflated data are prevalent in many situations, and a variety of inflated models with extensions have been derived to fit data with excessive counts of particular responses. The family of information criteria (IC) is commonly used to compare the fit of competing models for selection purposes. Yet despite their common use in statistical applications, few studies have evaluated the performance of IC in inflated models. In this study, we examined the performance of IC for dual-inflated data. The new zero- and K-inflated Poisson (ZKIP) regression model and conventional inflated models, including Poisson regression and zero-inflated Poisson (ZIP) regression, were fitted to dual-inflated data, and the performance of the IC was compared. The effects of sample size and the proportion of inflated observations on selection performance were also examined. The results suggest that the Bayesian information criterion (BIC) and consistent Akaike information criterion (CAIC) are more accurate than the Akaike information criterion (AIC) in terms of model selection when the true model is simple (i.e. Poisson regression (POI)). For more complex models, such as ZIP and ZKIP, the AIC was consistently better than the BIC and CAIC, although it did not reach high levels of accuracy when the sample size and the proportion of zero observations were small. The AIC tended to over-fit the data for the POI, whereas the BIC and CAIC tended to under-parameterize the data for ZIP and ZKIP. Therefore, it is desirable to study other model selection criteria for dual-inflated data with small sample sizes.
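A minimal sketch of the kind of IC comparison described: hand-rolled Poisson and zero-inflated Poisson likelihoods fitted to simulated zero-inflated counts, then scored by AIC, BIC, and CAIC. The constant inflation probability and all simulation settings are illustrative assumptions; the ZKIP model itself is not reproduced.

```python
# Compare AIC/BIC/CAIC for Poisson vs zero-inflated Poisson on simulated data.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson
from scipy.special import expit

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.3 + 0.5 * x))
y[rng.random(n) < 0.3] = 0                    # inflate zeros

def nll_poisson(b):
    mu = np.exp(b[0] + b[1] * x)
    return -poisson.logpmf(y, mu).sum()

def nll_zip(b):
    mu = np.exp(b[0] + b[1] * x)
    pi = expit(b[2])                          # constant zero-inflation probability
    p0 = pi + (1 - pi) * np.exp(-mu)          # mixture mass at zero
    ll = np.where(y == 0, np.log(p0),
                  np.log(1 - pi) + poisson.logpmf(y, mu))
    return -ll.sum()

for name, nll, k in [("POI", nll_poisson, 2), ("ZIP", nll_zip, 3)]:
    fit = minimize(nll, np.zeros(k), method="BFGS")
    aic = 2 * fit.fun + 2 * k
    bic = 2 * fit.fun + k * np.log(n)
    caic = 2 * fit.fun + k * (np.log(n) + 1)  # consistent AIC
    print(f"{name}: AIC={aic:.1f} BIC={bic:.1f} CAIC={caic:.1f}")
```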

3.
Abstract

A convention in designing randomized clinical trials has been to choose sample sizes that yield specified statistical power when testing hypotheses about treatment response. Manski and Tetenov recently critiqued this convention and proposed enrollment of sufficiently many subjects to enable near-optimal treatment choices. This article develops a refined version of that analysis applicable to trials comparing aggressive treatment of patients with surveillance. The need for a refined analysis arises because the earlier work assumed that there is only a primary health outcome of interest, without secondary outcomes. An important aspect of choice between surveillance and aggressive treatment is that the latter may have side effects. One should then consider how the primary outcome and side effects jointly determine patient welfare. This requires new analysis of sample design. As a case study, we reconsider a trial comparing nodal observation and lymph node dissection when treating patients with cutaneous melanoma. Using a statistical power calculation, the investigators assigned 971 patients to dissection and 968 to observation. We conclude that assigning 244 patients to each option would yield findings that enable suitably near-optimal treatment choice. Thus, a much smaller sample size would have sufficed to inform clinical practice.
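For contrast with the near-optimality criterion, here is the conventional power-based per-arm sample size for comparing two proportions; the response rates used in the call are hypothetical.

```python
# Conventional power calculation: per-arm n for a two-sided test of two proportions.
import math
from scipy.stats import norm

def n_per_arm(p1, p2, alpha=0.05, power=0.9):
    za, zb = norm.ppf(1 - alpha / 2), norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((za + zb) ** 2 * var / (p1 - p2) ** 2)

print(n_per_arm(0.60, 0.70))   # hypothetical response rates
```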

4.
This paper is concerned with the problem of selecting variables in two-group discriminant analysis for high-dimensional data with fewer observations than the dimension. We consider a selection criterion based on an approximately unbiased estimator of an AIC-type risk. When the dimension is large compared with the sample size, the AIC-type risk cannot be defined. We therefore propose an AIC in which the maximum likelihood estimator is replaced by a ridge-type estimator. This idea follows Srivastava and Kubokawa (2008) and has been further extended by Yamamura et al. (2010). Simulations reveal that the proposed AIC performs well.
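A rough sketch of the ridge idea when p exceeds n: the pooled covariance matrix is singular, so a ridge-type estimate S + λI is inverted instead. The choice of λ below is an arbitrary illustration, not the tuning rule of Srivastava and Kubokawa (2008).

```python
# Ridge-type regularization of a singular pooled covariance (p > n setting).
import numpy as np

rng = np.random.default_rng(2)
p, n1, n2 = 200, 30, 30                     # dimension exceeds sample sizes
X1 = rng.normal(size=(n1, p))
X2 = rng.normal(0.3, 1.0, size=(n2, p))
S = ((n1 - 1) * np.cov(X1, rowvar=False) +
     (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
lam = np.trace(S) / p                        # simple ridge level (assumption)
w = np.linalg.solve(S + lam * np.eye(p), X1.mean(0) - X2.mean(0))
# w is a ridge-regularized linear discriminant direction
```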

5.
ABSTRACT

Recently, sponsors and regulatory authorities have paid much attention to multiregional trials because they can shorten the drug lag, that is, the time lag for approval, by allowing simultaneous drug development, submission, and approval around the world. However, many studies have shown that genetic determinants may mediate variability among persons in response to a drug, so some therapeutics benefit only part of the treated patients. This means that the assumption of a homogeneous effect size is not suitable for multiregional trials. In this paper, we determine the sample size of a multiregional clinical trial under fixed-effect and random-effect models, assuming heterogeneous effect sizes. The performance of the fixed-effect and random-effect approaches in allocating sample size to a specific region is compared using statistical criteria for consistency between the region of interest and the overall results.
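A stylized sketch of how between-region heterogeneity inflates the required sample size: under a random-effect model, the between-region variance τ²/R of the overall effect estimate leaves less room for sampling error. The formulas are textbook-style illustrations, not the paper's criteria.

```python
# Stylized fixed- vs random-effect per-arm sample size for a mean difference.
import math
from scipy.stats import norm

def n_per_arm(delta, sigma2, tau2=0.0, regions=1, alpha=0.05, power=0.9):
    """tau2 is between-region heterogeneity variance (0 gives the fixed effect)."""
    za, zb = norm.ppf(1 - alpha / 2), norm.ppf(power)
    target_var = (delta / (za + zb)) ** 2     # variance the estimate may have
    resid = target_var - tau2 / regions       # part left for sampling error
    if resid <= 0:
        raise ValueError("heterogeneity too large: no finite n suffices")
    return math.ceil(2 * sigma2 / resid)

print(n_per_arm(0.3, 1.0))                          # fixed effect
print(n_per_arm(0.3, 1.0, tau2=0.004, regions=5))   # heterogeneous regions
```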

6.
In this paper, we consider a regression analysis for a missing data problem in which the variables of primary interest are unobserved under a general biased sampling scheme, an outcome-dependent sampling (ODS) design. We propose a semiparametric empirical likelihood method for assessing the association between a continuous outcome response and unobserved factors of interest. Simulation results show that the ODS design can produce more efficient estimators than a simple random design of the same sample size. We demonstrate the proposed approach with a data set from an environmental study of genetic effects on human lung function in COPD smokers. The Canadian Journal of Statistics 40: 282–303; 2012 © 2012 Statistical Society of Canada

7.
Abstract

We propose a new class of two-stage parameter estimation methods for semiparametric ordinary differential equation (ODE) models. In the first stage, state variables are estimated using a penalized spline approach; in the second stage, the form of the numerical discretization algorithms used by ODE solvers is used to formulate estimating equations. The estimated state variables from the first stage provide additional data points for the second stage. Asymptotic properties of the proposed estimators are established. Simulation studies show that the method performs well, especially for small samples. Real-life use of the method is illustrated with an influenza-specific cell-trafficking study.
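A toy version of the two-stage scheme for dx/dt = -θx: stage 1 smooths noisy states with a penalized spline, stage 2 plugs the spline and its derivative into a least-squares estimating equation. The model, smoothing level, and evaluation grid are illustrative assumptions.

```python
# Two-stage ODE estimation sketch for dx/dt = -theta * x.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(3)
theta_true = 0.8
t = np.linspace(0, 5, 40)
x_obs = np.exp(-theta_true * t) + rng.normal(0, 0.02, t.size)

# Stage 1: penalized-spline state estimate (smoothing level s is a tuning choice)
spl = UnivariateSpline(t, x_obs, s=0.02)
tt = np.linspace(0, 5, 400)                 # extra evaluation points
xhat, dxhat = spl(tt), spl.derivative()(tt)

# Stage 2: least-squares estimating equation  dxhat ~ -theta * xhat
theta_hat = -(xhat @ dxhat) / (xhat @ xhat)
print(f"theta_hat = {theta_hat:.3f}")       # close to 0.8
```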

8.
Abstract

We develop an exact approach for determining the minimum sample size for estimating a Poisson parameter such that pre-specified levels of relative precision and confidence are guaranteed. The exact computation is made possible by reducing infinitely many evaluations of coverage probability to finitely many evaluations. The theory supporting this reduction is that the minimum of the coverage probability with respect to the parameter over an interval is attained on a discrete set of finitely many elements. Computational mechanisms are developed to further reduce the computational complexity, and an explicit bound for the minimum sample size is established.
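The easy, approximate half of the problem as a sketch: the normal-approximation bound on n for relative precision ε at confidence 1-α. The paper's contribution, an exact coverage check reduced to finitely many evaluations, is not reproduced here.

```python
# Normal-approximation sample size for estimating a Poisson mean to
# relative precision eps: z*sqrt(lam/n)/lam <= eps  =>  n >= z^2/(eps^2*lam).
import math
from scipy.stats import norm

def n_normal_approx(lam, eps=0.1, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    return math.ceil(z ** 2 / (eps ** 2 * lam))

print(n_normal_approx(2.0))    # hypothetical rate lambda = 2
```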

9.
Different longitudinal study designs require different statistical analysis methods and different methods of sample size determination. Statistical power analysis is a flexible approach to sample size determination for longitudinal studies, but different statistical tests require different power analyses. In this paper, simulation-based power calculations of F-tests with Containment, Kenward-Roger, or Satterthwaite approximations of the degrees of freedom are examined for sample size determination in a special case of the linear mixed model (LMM), which is frequently used in the analysis of longitudinal data. Essentially, the roles of several factors are examined together, which has not been considered previously: the variance–covariance structure of the random effects [unstructured (UN) or factor-analytic (FA0)], the autocorrelation structure among errors over time [independent (IND), first-order autoregressive (AR1), or first-order moving average (MA1)], the parameter estimation method [maximum likelihood (ML) or restricted maximum likelihood (REML)], and the iterative algorithm [ridge-stabilized Newton-Raphson or Quasi-Newton]. The variance–covariance structure of the random effects is found to be the factor with the greatest effect on statistical power. The simulation-based analysis in this study gives an interesting insight into the statistical power of approximate F-tests for fixed effects in LMMs for longitudinal data.
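A bare-bones simulation-based power calculation for the fixed slope in a random-intercept LMM, one of the many configurations compared in the paper. A Wald test on the REML fit stands in for the approximate F-tests; all design values are assumptions.

```python
# Simulation-based power for the slope in a random-intercept LMM.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n_subj, n_time, beta1, n_sim = 30, 5, 0.4, 200
time = np.tile(np.arange(n_time, dtype=float), n_subj)
groups = np.repeat(np.arange(n_subj), n_time)
X = sm.add_constant(time)

rejections = 0
for _ in range(n_sim):
    u = rng.normal(0, 1, n_subj)[groups]              # random intercepts
    y = 1.0 + beta1 * time + u + rng.normal(0, 1, time.size)
    res = sm.MixedLM(y, X, groups=groups).fit()
    rejections += np.asarray(res.pvalues)[1] < 0.05   # Wald test on the slope
print(f"estimated power: {rejections / n_sim:.2f}")
```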

10.
Chapter Notes
Tests for redundancy of variables in linear two-group discriminant analysis are well known and frequently used. We give a survey of similar tests, including the one-sample T² as a special case, in the situation in which only the mean vector (but no covariance matrix) is available in one sample. Then we show that a relation between linear regression and discriminant functions found by Fisher (1936) can be generalized to this situation. Relating regression and discriminant analysis to a multivariate linear model sheds more light on the relationship between them. Practical and didactical advantages of the regression approach to T² tests and discriminant analysis are outlined.
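A quick numerical check of the Fisher (1936) relation mentioned above, under assumed Gaussian data: regressing a group indicator on X gives slope coefficients proportional to the discriminant coefficients S⁻¹(x̄₁ − x̄₂).

```python
# Verify that regression slopes are proportional to discriminant coefficients.
import numpy as np

rng = np.random.default_rng(5)
n1, n2, p = 50, 60, 3
X1 = rng.normal(0.0, 1.0, (n1, p))
X2 = rng.normal(0.5, 1.0, (n2, p))
X = np.vstack([X1, X2])
y = np.r_[np.ones(n1), np.zeros(n2)]        # group indicator

Sp = ((n1 - 1) * np.cov(X1, rowvar=False) +
      (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
d = np.linalg.solve(Sp, X1.mean(0) - X2.mean(0))    # discriminant coefficients

A = np.column_stack([np.ones(n1 + n2), X])
b = np.linalg.lstsq(A, y, rcond=None)[0][1:]        # regression slopes
print(d / b)                                # entries are (nearly) all equal
```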

11.
Summary.  The problem motivating the paper is the determination of sample size in clinical trials under normal likelihoods and at the substantive testing stage of a financial audit where normality is not an appropriate assumption. A combination of analytical and simulation-based techniques within the Bayesian framework is proposed. The framework accommodates two different prior distributions: one is the general purpose fitting prior distribution that is used in Bayesian analysis and the other is the expert subjective prior distribution, the sampling prior which is believed to generate the parameter values which in turn generate the data. We obtain many theoretical results and one key result is that typical non-informative prior distributions lead to very small sample sizes. In contrast, a very informative prior distribution may either lead to a very small or a very large sample size depending on the location of the centre of the prior distribution and the hypothesized value of the parameter. The methods that are developed are quite general and can be applied to other sample size determination problems. Some numerical illustrations which bring out many other aspects of the optimum sample size are given.
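A conjugate-normal sketch of the two-prior framework: θ is generated from the informative sampling prior, while the posterior is computed under the diffuse fitting prior. The success criterion and all numbers are illustrative assumptions.

```python
# Two-prior Bayesian sample size sketch (conjugate normal-normal model).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
sigma = 1.0
mu_s, tau_s = 0.3, 0.1          # sampling prior: informative, centred at 0.3
mu_f, tau_f = 0.0, 10.0         # fitting prior: diffuse

def success_prob(n, n_sim=2000):
    theta = rng.normal(mu_s, tau_s, n_sim)            # sampling-prior draws
    ybar = rng.normal(theta, sigma / np.sqrt(n))
    post_var = 1 / (1 / tau_f**2 + n / sigma**2)      # fitting-prior posterior
    post_mean = post_var * (mu_f / tau_f**2 + n * ybar / sigma**2)
    # criterion: posterior probability that theta > 0 exceeds 0.95
    return np.mean(norm.sf(0, post_mean, np.sqrt(post_var)) > 0.95)

for n in (10, 30, 100):
    print(n, success_prob(n))
```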

12.
For a multivariate linear model, Wilks' likelihood ratio test (LRT) constitutes one of the cornerstone tools. However, the computation of its quantiles under the null or the alternative hypothesis requires complex analytic approximations, and more importantly, these distributional approximations are feasible only for a moderate dimension of the dependent variable, say p≤20. On the other hand, assuming that the data dimension p and the number q of regression variables are fixed while the sample size n grows, several asymptotic approximations have been proposed in the literature for Wilks' Λ, including the widely used chi-square approximation. In this paper, we consider necessary modifications to Wilks' test in a high-dimensional context, specifically assuming a high data dimension p and a large sample size n. Based on recent random matrix theory, the correction we propose to Wilks' test is asymptotically Gaussian under the null hypothesis, and simulations demonstrate that the corrected LRT has very satisfactory size and power, certainly in the large-p, large-n context, but also for moderately large data dimensions such as p=30 or p=50. As a byproduct, we explain why the standard chi-square approximation fails for high-dimensional data. We also introduce a new procedure for the classical multiple-sample significance test in multivariate analysis of variance that is valid for high-dimensional data.
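For reference, a sketch of classical Wilks' Λ for one-way MANOVA with Bartlett's chi-square approximation, i.e., the approximation the paper shows breaks down as p grows; the high-dimensional Gaussian correction itself is not reproduced.

```python
# Classical Wilks' Lambda with Bartlett's chi-square approximation.
import numpy as np
from scipy.stats import chi2

def wilks_test(groups):
    """groups: list of (n_i x p) arrays, one per group."""
    X = np.vstack(groups)
    n, p = X.shape
    g = len(groups)
    E = sum((len(G) - 1) * np.cov(G, rowvar=False) for G in groups)  # within SSCP
    xbar = X.mean(0)
    H = sum(len(G) * np.outer(G.mean(0) - xbar, G.mean(0) - xbar)
            for G in groups)                                         # between SSCP
    lam = np.linalg.det(E) / np.linalg.det(E + H)                    # Wilks' Lambda
    stat = -(n - 1 - (p + g) / 2) * np.log(lam)                      # Bartlett
    return lam, stat, chi2.sf(stat, p * (g - 1))

rng = np.random.default_rng(7)
print(wilks_test([rng.normal(size=(40, 5)) for _ in range(3)]))
```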

13.
Sample size calculation is a critical issue in clinical trials: a sample that is too small leads to biased inference, and one that is too large increases the cost. With the development of advanced medical technology, some patients can be cured of certain chronic diseases, and the proportional hazards mixture cure model has been developed to handle survival data with potential cure fractions. Given the need for survival trials with potential cure proportions, a corresponding sample size formula based on the log-rank test statistic has been proposed for binary covariates by Wang et al. [25]; however, a sample size formula for continuous variables has not been developed. Herein, we present sample size and power calculations for the mixture cure model with continuous variables based on the log-rank method, further modified by Ewell's method. The proposed approaches are evaluated using simulation studies with synthetic data from exponential and Weibull distributions. A program for calculating the necessary sample size for continuous covariates in a mixture cure model is implemented in R.

14.
Abstract

In this article, we propose a new projected PCA to determine the number of factors. We project the variables of interest onto the space spanned by cross-sectional averages of the variables, and then construct eigenvalue tests and information criteria to estimate the number of factors. We derive large-sample consistency and conduct finite-sample simulations to demonstrate the superior performance of our estimators. To show the advantage of our estimators in real-data analysis, we revisit a large house price data set for which the number of factors is hard to determine.
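The paper's projection step is specific to its construction and is not reproduced here; as a generic illustration of choosing the number of factors from eigenvalues, the sketch below applies the well-known eigenvalue-ratio rule to simulated factor-model data.

```python
# Eigenvalue-ratio estimator of the number of factors (generic illustration).
import numpy as np

rng = np.random.default_rng(8)
T, N, r_true = 200, 100, 2
F = rng.normal(size=(T, r_true))                 # latent factors
L = rng.normal(size=(N, r_true))                 # loadings
X = F @ L.T + rng.normal(size=(T, N))            # two-factor panel

eig = np.linalg.eigvalsh(X @ X.T / (T * N))[::-1]   # descending eigenvalues
ratios = eig[:-1] / eig[1:]
r_hat = int(np.argmax(ratios[:10]) + 1)          # largest eigenvalue gap
print(r_hat)                                      # typically 2
```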

15.
Abstract

For clinical trials, molecular heterogeneity has recently played an increasingly important role. Many novel clinical trial designs prospectively incorporate molecular information into the evaluation of treatment effects. In this paper, an adaptive procedure incorporating a non-pre-specified genomic biomarker is employed at the interim stage of a conventional trial. A non-pre-specified binary genomic biomarker, predictive of treatment effect, is used to classify study patients into two mutually exclusive subgroups at the interim review. Based on the observations at the interim stage, adaptations such as adjusting the sample size or shifting the eligibility of study patients are then made under the different scenarios.

16.
余壮雄, 王美今. 《统计研究》(Statistical Research), 2010, 27(12): 86-91.
Based on a general specification of two-sided data censoring, this paper investigates parameter estimation for regression equations that contain censored data. For linear models in which some variables are censored, the sample likelihood function is very complicated: the ordinary first-order optimality conditions have no analytical solution, and Newton-Raphson iteration is difficult to bring to convergence. We compute the maximum likelihood estimates of the parameters via the EM algorithm, derive the corresponding iterative equations, and give a closed-form solution for the parameters. In particular, when the two-sided censoring proportion reaches 100%, the censored continuous variable degenerates into a dummy variable; for this case, we suggest using the AIC or SC to identify whether a dummy variable in the regression equation reflects structural change or a censored variable.
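A generic EM iteration for a two-sided censored (Tobit-type) linear model, illustrating the kind of closed-form E- and M-steps the paper derives; the truncated-normal moment formulas are standard, but the specific iteration below is an illustrative assumption rather than the paper's exact equations.

```python
# EM for a linear model with two-sided censoring at L and U.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
n, L, U = 1000, -0.5, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
ystar = X @ np.array([0.3, 0.8]) + rng.normal(0, 1.0, n)
y = np.clip(ystar, L, U)                        # two-sided censoring
lo, hi = y <= L, y >= U

beta, sig2 = np.zeros(2), 1.0                   # starting values
for _ in range(200):
    mu, sig = X @ beta, np.sqrt(sig2)
    m, s2 = y.copy(), np.zeros(n)
    # E-step: truncated-normal moments for censored observations
    a = (L - mu[lo]) / sig
    h = norm.pdf(a) / norm.cdf(a)
    m[lo] = mu[lo] - sig * h
    s2[lo] = sig2 * (1 - a * h - h**2)
    b = (U - mu[hi]) / sig
    g = norm.pdf(b) / norm.sf(b)
    m[hi] = mu[hi] + sig * g
    s2[hi] = sig2 * (1 + b * g - g**2)
    Ey2 = m**2 + s2                             # s2 is zero for uncensored rows
    # M-step: complete-data least squares, then the variance update
    beta = np.linalg.solve(X.T @ X, X.T @ m)
    mu_new = X @ beta
    sig2 = np.mean(Ey2 - 2 * m * mu_new + mu_new**2)
print(beta, np.sqrt(sig2))                      # approaches (0.3, 0.8) and 1.0
```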

17.
ABSTRACT

This paper extends the classical methods of analysis of a two-way contingency table to the fuzzy environment for two cases: (1) when the available sample of observations is reported as imprecise data, and (2) when we prefer to categorize the variables using linguistic terms rather than crisp quantities. For this purpose, the α-cuts approach is used to extend the usual concepts of the test statistic and p-value to a fuzzy test statistic and fuzzy p-value. In addition, some measures of association are extended to fuzzy versions in order to evaluate the dependence in such contingency tables. Practical examples illustrate the applicability of the proposed methods to real-world problems.

18.
ABSTRACT

Researchers conducting medical research very often plan a balanced design for cluster randomization clinical trials, but unavoidable circumstances lead to unbalanced data. When adopting designs with three or more levels of nesting, they usually ignore the higher levels and consider only two, which leads to underestimation of the variance at the higher levels. When calculating the sample size for three-level nested designs, the intra-class correlation coefficients (ICCs) at the individual level as well as at higher levels must be considered, along with their respective standard errors, in order to achieve the desired power. In the present paper, the standard errors of the analysis of variance (ANOVA) estimates of the ICCs for a three-level unbalanced nested design are derived. To avoid the usual strong assumptions of a specific distribution, a balanced design, equality of variances between clusters, and a large sample, general expressions for the standard errors of the ICCs that can be deployed in unbalanced cluster randomization trials are postulated. The expressions are evaluated on real data as well as on highly unbalanced simulated data.
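As background for the three-level extension, a sketch of the standard ANOVA estimator of the ICC for an unbalanced two-level design; k₀ is the usual effective cluster size for unbalanced data, and the simulated cluster sizes are arbitrary.

```python
# ANOVA estimator of the ICC for an unbalanced two-level nested design.
import numpy as np

def anova_icc(groups):
    """groups: list of 1-d arrays, one per cluster."""
    a = len(groups)
    n_i = np.array([len(g) for g in groups])
    N = n_i.sum()
    grand = np.concatenate(groups).mean()
    msb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / (a - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - a)
    k0 = (N - (n_i ** 2).sum() / N) / (a - 1)       # effective cluster size
    return (msb - msw) / (msb + (k0 - 1) * msw)

rng = np.random.default_rng(10)
sim = [rng.normal(rng.normal(0, 0.5), 1.0, size=rng.integers(3, 15))
       for _ in range(40)]                           # unbalanced clusters
print(anova_icc(sim))                                # true ICC = 0.25/1.25 = 0.2
```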

19.
Abstract

We propose a unified approach for multilevel sample selection models using a generalized result on skew distributions arising from selection. If the underlying distributional assumption is normal, the resulting density for the outcome is the continuous component of the sample selection density and has links with the closed skew-normal (CSN) distribution. The CSN distribution provides a framework that simplifies the derivation of the conditional expectation of the observed data, generalizing Heckman's two-step method to a multilevel sample selection model. The finite-sample performance of the maximum likelihood estimator of this model is studied through a Monte Carlo simulation.
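For reference, the classical single-level Heckman two-step that the paper generalizes: a probit selection equation, the inverse Mills ratio, then OLS on the selected sample. All coefficients and the error correlation are simulated assumptions.

```python
# Classical Heckman two-step on simulated selected data.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(11)
n = 5000
z = rng.normal(size=n)                        # selection covariate
x = rng.normal(size=n)
e_s, e_o = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], n).T
selected = (0.5 + 1.0 * z + e_s) > 0          # selection equation
y = 1.0 + 2.0 * x + e_o                       # outcome equation (latent)

# Step 1: probit for selection, then inverse Mills ratio
Zmat = sm.add_constant(z)
gamma = sm.Probit(selected.astype(float), Zmat).fit(disp=0).params
idx = Zmat @ gamma
imr = norm.pdf(idx) / norm.cdf(idx)

# Step 2: OLS on the selected sample with the IMR as a control term
Xmat = sm.add_constant(np.column_stack([x, imr])[selected])
ols = sm.OLS(y[selected], Xmat).fit()
print(ols.params)                             # intercept ~1, slope ~2
```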

20.
ABSTRACT

Large-sample properties of the life-table estimator are discussed for interval-censored bivariate survival data. We restrict attention to the situation in which response times within pairs are not distinguishable and the univariate survival distribution is the same for any individual within any pair. The large-sample properties are applied to test for the equality of two distributions with correlated response times, where treatments are applied to different independent sets of cohorts. Data from an angioplasty study in which more than one procedure was performed on some patients, and which can be separated into two independent sets, are used to illustrate this methodology.
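A sketch of the standard univariate actuarial life-table estimator, the object whose bivariate large-sample theory the paper develops; the half-interval exposure convention for withdrawals is the usual assumption, and the counts are made up.

```python
# Actuarial life-table survival estimator with half-interval withdrawal exposure.
import numpy as np

def life_table(n_at_risk0, deaths, withdrawals):
    """deaths/withdrawals: counts per interval; returns interval-end survival."""
    surv, s, at_risk = [], 1.0, n_at_risk0
    for d, w in zip(deaths, withdrawals):
        eff = at_risk - w / 2.0            # effective number exposed
        s *= 1.0 - d / eff
        surv.append(s)
        at_risk -= d + w
    return np.array(surv)

print(life_table(100, deaths=[5, 8, 6], withdrawals=[2, 4, 3]))
```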
