Similar literature
20 similar records found.
1.
Logistic regression is estimated by maximizing a log-likelihood objective that implicitly targets overall accuracy, which is inappropriate for imbalanced data. The resulting models tend to be biased towards the majority class (i.e. non-event), which can cause substantial losses in practice. One strategy for mitigating this bias is to penalize the misclassification costs of observations differently in the log-likelihood function. Existing solutions require either difficult hyperparameter estimation or high computational complexity. We propose a novel penalized log-likelihood function that includes penalty weights as decision variables for observations in the minority class (i.e. event) and learns them from the data along with the model coefficients. In the experiments, the proposed logistic regression model is compared with existing ones on 10 public datasets and 16 simulated datasets in terms of the area under the receiver operating characteristic (ROC) curve and the training time. A detailed analysis is conducted on an imbalanced credit dataset to examine the estimated probability distributions, additional performance measures (type I and type II error) and model coefficients. The results demonstrate that both the discrimination ability and the computational efficiency of logistic regression models are improved by using the proposed log-likelihood function as the learning objective.
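
The abstract does not reproduce the proposed objective, in which the minority-class penalty weights are themselves decision variables. As a point of reference, the sketch below shows the conventional fixed-weight variant of a cost-sensitive log-likelihood, where every event observation receives one constant weight; the data, weight value and helper name are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def weighted_logistic_nll(beta, X, y, w_event=5.0):
    """Negative log-likelihood with a fixed penalty weight on minority-class
    (event, y = 1) observations; a baseline for the paper's learned-weight idea."""
    p = expit(X @ beta)
    eps = 1e-12
    ll = np.where(y == 1,
                  w_event * np.log(p + eps),   # up-weighted event terms
                  np.log(1.0 - p + eps))       # majority (non-event) terms
    return -ll.sum()

# toy imbalanced data (hypothetical): roughly 5% events
rng = np.random.default_rng(0)
n, d = 2000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
true_beta = np.array([-3.0, 1.0, -0.5])
y = rng.binomial(1, expit(X @ true_beta))

fit = minimize(weighted_logistic_nll, x0=np.zeros(d), args=(X, y), method="BFGS")
print("estimated coefficients:", fit.x)
```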

2.
Event counts are response variables with non-negative integer values representing the number of times an event occurs within a fixed domain such as a time interval, a geographical area or a cell of a contingency table. Analysing counts with Gaussian regression models ignores their discreteness, asymmetry and heteroscedasticity and is inefficient, producing unrealistic standard errors or possibly negative predictions of the expected number of events. Poisson regression is the standard model for count data, with underlying assumptions on the generating process that may be implausible in many applications. Statisticians have long recognized the limitation of imposing equidispersion under the Poisson regression model. A typical situation is when the conditional variance exceeds the conditional mean, in which case models allowing for overdispersion are routinely used. Less reported is the case of underdispersion, for which fewer modelling alternatives and assessments are available in the literature. One such alternative, the Gamma-count model, is adopted here in the analysis of an agronomic experiment designed to investigate the effect of levels of defoliation at different phenological states on the number of cotton bolls. The data set and code for the analysis are available as online supplements. Results show improvements over the Poisson model and the semi-parametric quasi-Poisson model in capturing the observed variability in the data. Estimating rather than assuming the underlying variance process leads to important insights into the process.
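
The abstract does not state the Gamma-count probabilities. Under the usual parameterization (gamma-distributed waiting times with shape alpha, so alpha > 1 gives underdispersion and alpha = 1 recovers the Poisson), the pmf can be written with regularized incomplete gamma functions, as in this sketch; the paper's exact parameterization is an assumption here.

```python
import numpy as np
from scipy.special import gammainc  # regularized lower incomplete gamma G(a, x)
from scipy.stats import poisson

def gamma_count_pmf(n, mu, alpha, T=1.0):
    """P(N(T) = n) when waiting times are Gamma(shape=alpha, rate=alpha*mu):
    P(N(T) = n) = G(n*alpha, alpha*mu*T) - G((n+1)*alpha, alpha*mu*T),
    with G(0, x) := 1. alpha > 1 gives underdispersion, alpha = 1 the Poisson."""
    lower = 1.0 if n == 0 else gammainc(alpha * n, alpha * mu * T)
    upper = gammainc(alpha * (n + 1), alpha * mu * T)
    return lower - upper

# sanity checks: alpha = 1 matches the Poisson pmf; probabilities sum to ~1
print(gamma_count_pmf(3, mu=2.0, alpha=1.0), poisson.pmf(3, 2.0))
print(sum(gamma_count_pmf(k, mu=2.0, alpha=2.0) for k in range(50)))
```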

3.
We propose a flexible functional approach for modelling generalized longitudinal data and survival times using principal components. In the proposed model the longitudinal observations can be continuous or categorical, such as Gaussian, binomial or Poisson outcomes. This generalizes traditional joint models, which handle categorical data by transforming them to a continuous scale (for example, transformed CD4 counts). The proposed model is data-adaptive: it does not require pre-specified functional forms for the longitudinal trajectories and automatically detects characteristic patterns. The longitudinal trajectories, observed with measurement or random error, are represented by flexible basis functions through a possibly nonlinear link function, combined with the dimension reduction provided by functional principal component (FPC) analysis. The relationship between the longitudinal process and the event history is assessed using a Cox regression model. Although the proposed model inherits the flexibility of non-parametric methods, the estimation procedure based on the EM algorithm remains parametric in its computation, and is thus simple and easy to implement. The computation is further simplified by the dimension reduction for the random coefficients, or FPC scores. An iterative selection procedure based on the Akaike information criterion (AIC) is proposed to choose tuning parameters such as the knots of the spline basis and the number of FPCs, so that an appropriate degree of smoothness and fluctuation is attained. The effectiveness of the proposed approach is illustrated through a simulation study, followed by an application to longitudinal CD4 counts and survival data collected in a recent clinical trial comparing the efficacy and safety of two antiretroviral drugs.

4.
In this article, we investigate quantile regression analysis for semi-competing risks data, in which a non-terminal event may be dependently censored by a terminal event. Because of the dependent censoring, estimating the quantile regression coefficients for the non-terminal event is difficult. To handle this problem, we assume an Archimedean copula to specify the dependence between the non-terminal and terminal events. Portnoy [Censored regression quantiles. J Amer Statist Assoc. 2003;98:1001–1012] considered the quantile regression model under right censoring. We extend his approach to construct a weight function, and then use this weight function to estimate the quantile regression parameters for the non-terminal event under semi-competing risks data. We also establish the consistency and asymptotic properties of the proposed estimator. Simulation studies show that the proposed method performs well. We also apply the suggested approach to the analysis of a real data set.
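
The abstract specifies an Archimedean copula without naming a member of the family. The Clayton copula is one common concrete choice; the sketch below gives its copula function, the closed-form Kendall's tau often used to calibrate dependence, and conditional sampling. This is an illustrative assumption, not the authors' specific model.

```python
import numpy as np

def clayton_cdf(u, v, theta):
    """Clayton (Archimedean) copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta), theta > 0."""
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

def clayton_kendall_tau(theta):
    """Closed-form Kendall's tau for the Clayton copula."""
    return theta / (theta + 2.0)

def sample_clayton(n, theta, seed=None):
    """Sample (U, V) from a Clayton copula via the conditional-distribution method."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)  # uniform draw used to invert the conditional C(v | u)
    v = ((w ** (-theta / (1.0 + theta)) - 1.0) * u ** (-theta) + 1.0) ** (-1.0 / theta)
    return u, v

u, v = sample_clayton(5000, theta=2.0, seed=1)
print("theoretical Kendall's tau:", clayton_kendall_tau(2.0))
```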

5.
Variable selection is an important issue in all regression analyses, and in this paper we discuss it in the context of regression analysis of panel count data. Panel count data often occur in long-term studies that concern the occurrence rate of a recurrent event, and their analysis has recently attracted a great deal of attention. However, there does not seem to be any established approach to variable selection for panel count data. For this problem, we adopt the idea behind the non-concave penalized likelihood approach and develop a non-concave penalized estimating function approach. The proposed methodology selects variables and estimates regression coefficients simultaneously, and an algorithm is presented for this process. We show that the proposed procedure performs as well as the oracle procedure, in that it yields the estimates one would obtain if the correct submodel were known. Simulation studies conducted to assess the performance of the proposed approach suggest that it works well in practical situations. An illustrative example from a cancer study is provided.

6.
Variable selection is an important issue in all regression analyses, and in this paper we discuss it in the context of regression analysis of recurrent event data. Recurrent event data often occur in long-term studies in which individuals may experience the events of interest more than once, and their analysis has recently attracted a great deal of attention (Andersen et al., Statistical models based on counting processes, 1993; Cook and Lawless, Biometrics 52:1311–1323, 1996, The analysis of recurrent event data, 2007; Cook et al., Biometrics 52:557–571, 1996; Lawless and Nadeau, Technometrics 37:158–168, 1995; Lin et al., J R Stat Soc B 69:711–730, 2000). However, there seem to be no established approaches to variable selection for recurrent event data. For this problem, we adopt the idea behind the nonconcave penalized likelihood approach proposed in Fan and Li (J Am Stat Assoc 96:1348–1360, 2001) and develop a nonconcave penalized estimating function approach. The proposed approach selects variables and estimates regression coefficients simultaneously, and an algorithm is presented for this process. We show that the proposed approach performs as well as the oracle procedure, in that it yields the estimates one would obtain if the correct submodel were known. Simulation studies conducted to assess the performance of the proposed approach suggest that it works well in practical situations. The proposed methodology is illustrated using data from a chronic granulomatous disease study.
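
The nonconcave penalty proposed in Fan and Li (2001), referenced in this and the previous record, is the SCAD penalty; its derivative is the ingredient a penalized estimating function needs. The sketch below gives the standard closed forms with the value a = 3.7 recommended in that paper; how the penalty enters the authors' estimating function is not spelled out in the abstract.

```python
import numpy as np

def scad_derivative(theta, lam, a=3.7):
    """Derivative p'_lambda(theta) of the SCAD penalty (Fan and Li, 2001), theta >= 0.
    It is flat (zero) beyond a*lambda, so large coefficients are not shrunk."""
    theta = np.asarray(theta, dtype=float)
    out = np.zeros_like(theta)
    small = theta <= lam
    middle = (theta > lam) & (theta <= a * lam)
    out[small] = lam
    out[middle] = (a * lam - theta[middle]) / (a - 1.0)
    return out

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty itself, obtained piecewise by integrating the derivative."""
    theta = np.abs(np.asarray(theta, dtype=float))
    return np.where(
        theta <= lam,
        lam * theta,
        np.where(
            theta <= a * lam,
            -(theta ** 2 - 2.0 * a * lam * theta + lam ** 2) / (2.0 * (a - 1.0)),
            (a + 1.0) * lam ** 2 / 2.0,
        ),
    )

print(scad_penalty(np.array([0.0, 0.1, 1.0, 5.0]), lam=0.5))
```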

7.
Varying-coefficient linear models arise from multivariate nonparametric regression, non-linear time series modelling and forecasting, functional data analysis, longitudinal data analysis and other settings. It has been common practice to assume that the varying coefficients are functions of a given variable, often called an index. To enlarge the modelling capacity substantially, this paper explores a class of varying-coefficient linear models in which the index is unknown and is estimated as a linear combination of regressors and/or other variables. We search for the index such that the derived varying-coefficient model provides the least squares approximation to the underlying unknown multidimensional regression function. The search is implemented through a newly proposed hybrid backfitting algorithm. The core of the algorithm is an alternating iteration between estimating the index through a one-step scheme and estimating the coefficient functions through one-dimensional local linear smoothing. Locally significant variables are selected through a combined use of the t-statistic and the Akaike information criterion. We further extend the algorithm to models with two indices. Simulation shows that the proposed methodology has appreciable flexibility for modelling complex multivariate non-linear structure and is practically feasible on average modern computers. The methods are further illustrated through the Canadian mink–muskrat data for 1925–1994 and the pound–dollar exchange rates for 1974–1983.
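
The coefficient-function step of the hybrid backfitting algorithm described above is one-dimensional local linear smoothing. The sketch below shows that basic smoother on its own (Epanechnikov kernel, illustrative bandwidth), independent of the index-estimation step, which is not reproduced here.

```python
import numpy as np

def local_linear(x, y, x0, h):
    """One-dimensional local linear estimate of E[Y | X = x0] with bandwidth h.
    A weighted least-squares line is fitted locally; its intercept is the fit at x0."""
    u = (x - x0) / h
    w = np.maximum(0.75 * (1.0 - u ** 2), 0.0)       # Epanechnikov kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])    # local design: intercept + slope
    sw = np.sqrt(w)
    coef = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
    return coef[0]

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
grid = np.linspace(0.5, 2 * np.pi - 0.5, 50)
fit = np.array([local_linear(x, y, g, h=0.4) for g in grid])
print(np.max(np.abs(fit - np.sin(grid))))  # close to the true function
```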

8.
Recurrent event data are often encountered in biomedical research, for example recurrent infections or recurrent hospitalizations of patients after renal transplant. In many studies, there is more than one type of event of interest. Cai and Schaubel (Lifetime Data Anal 10:121–138, 2004) advocated a proportional marginal rate model for multiple-type recurrent event data. In this paper, we propose a general additive marginal rate regression model. An estimating equation approach is used to obtain estimators of the regression coefficients and the baseline rate function. We prove the consistency and asymptotic normality of the proposed estimators. The finite sample properties of our estimators are demonstrated by simulations. The proposed methods are applied to the India renal transplant study to examine risk factors for bacterial, fungal and viral infections.

9.
Nonparametric seemingly unrelated regression provides a powerful alternative to parametric seemingly unrelated regression by relaxing the linearity assumption. Existing methods are limited, particularly when there are sharp changes in the relationship between the predictor variables and the corresponding response variable. We propose a new nonparametric method for seemingly unrelated regression that adopts a tree-structured regression framework, offers satisfactory prediction accuracy and interpretability, places no restriction on the inclusion of categorical variables, and is less vulnerable to the curse of dimensionality. Moreover, an important feature is the construction of a unified tree-structured model for multivariate data, even when the predictor variables corresponding to each response variable are entirely different. This unified model can offer revealing insights, such as underlying economic meaning. We address the key ingredients of tree-structured regression: an impurity function that detects complex nonlinear relationships between the predictor variables and the response variable, split rule selection with negligible selection bias, and tree size determination that resolves underfitting and overfitting. We demonstrate the proposed method using simulated data and illustrate it using data from the Korea stock exchange sector indices.
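
Of the three ingredients named above, the impurity function is easiest to illustrate. The sketch below uses the standard sum-of-squares (variance reduction) criterion for a single numeric split on one response; the unified multivariate impurity actually proposed in the paper is not given in the abstract, so this is a univariate stand-in.

```python
import numpy as np

def best_split(x, y, min_leaf=5):
    """Exhaustive search for the split point on a numeric predictor x that most
    reduces the sum-of-squares impurity of the response y."""
    order = np.argsort(x)
    x_s, y_s = x[order], y[order]
    n = len(y_s)
    parent_sse = np.sum((y_s - y_s.mean()) ** 2)
    best = (None, -np.inf)                      # (split value, impurity reduction)
    for i in range(min_leaf, n - min_leaf):
        if x_s[i] == x_s[i - 1]:
            continue                            # cannot split between tied values
        left, right = y_s[:i], y_s[i:]
        child_sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        gain = parent_sse - child_sse
        if gain > best[1]:
            best = (0.5 * (x_s[i - 1] + x_s[i]), gain)
    return best

rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = np.where(x > 0.6, 3.0, 0.0) + rng.normal(scale=0.5, size=200)
print(best_split(x, y))   # split value should land near 0.6
```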

10.
In longitudinal studies, the additive hazards model is often used to analyze covariate effects on the duration time, defined as the elapsed time between a first and a second event. In this article, we consider the situation in which the first event is partly interval-censored and the second event is subject to left truncation and right censoring. We propose a two-step estimation procedure for the regression coefficients of the additive hazards model. A simulation study is conducted to investigate the performance of the proposed estimator. The proposed method is applied to the Centers for Disease Control acquired immune deficiency syndrome blood transfusion data.

11.
This study considers a fully parametric but uncongenial multiple imputation (MI) approach to jointly analyzing incomplete binary response variables observed in a correlated data setting. The imputation model is specified as a fully parametric model based on a multivariate extension of mixed-effects models. The dichotomized imputed datasets are then analyzed using joint GEE models in which covariates are related to the marginal means of the responses through response-specific regression coefficients, and a Kronecker product accommodates both the cluster-specific correlation structure for a given response variable and the correlation structure between the multiple response variables. The validity of the proposed MI-based joint GEE (MI-JGEE) approach is assessed through a Monte Carlo simulation study under different scenarios. The simulation results, evaluated in terms of bias, mean-squared error and coverage rate, show that MI-JGEE has promising inferential properties even when the imputation model is misspecified. Finally, Adolescent Alcohol Prevention Trial data are used for illustration.
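
The Kronecker-product working correlation mentioned above combines a between-response correlation matrix with a within-cluster (repeated-measures) correlation matrix. A minimal numpy sketch follows; the exchangeable and AR(1) blocks and their dimensions are hypothetical choices, not necessarily those used in the paper.

```python
import numpy as np

def exchangeable(dim, rho):
    """Exchangeable correlation matrix: every off-diagonal entry equals rho."""
    return (1.0 - rho) * np.eye(dim) + rho * np.ones((dim, dim))

def ar1(dim, rho):
    """AR(1) correlation matrix: corr(t, s) = rho^|t - s|."""
    idx = np.arange(dim)
    return rho ** np.abs(np.subtract.outer(idx, idx))

# hypothetical setting: 2 response variables, 4 repeated measurements per subject
R_between = exchangeable(2, rho=0.4)    # correlation between the two responses
R_within = ar1(4, rho=0.6)              # within-subject correlation over time
R_joint = np.kron(R_between, R_within)  # 8 x 8 working correlation with Kronecker structure
print(R_joint.shape, np.allclose(R_joint, R_joint.T))
```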

12.
This paper studies a fast computational algorithm for variable selection with high-dimensional recurrent event data. Based on the lasso-penalized partial likelihood for the response process of recurrent event data, a coordinate descent algorithm is used to accelerate the estimation of the regression coefficients. The algorithm is capable of selecting important predictors in underdetermined problems where the number of predictors far exceeds the number of cases. The selection strength is controlled by a tuning constant determined by generalized cross-validation. Numerical experiments on simulated and real data demonstrate the good performance of penalized regression for model building with recurrent event data in high-dimensional settings.
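
The coordinate-descent idea is easiest to see on the lasso with squared-error loss, where each coefficient update is a closed-form soft-thresholding step. The paper applies the same cycling strategy to the lasso-penalized partial likelihood, which is not reproduced here; the sketch below is therefore a simplified stand-in with standardized predictors and hypothetical data.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for the lasso with squared-error loss.
    Assumes the columns of X have mean 0 and unit variance."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]                    # remove j-th contribution
            rho = X[:, j] @ resid / n                     # partial-residual correlation
            beta[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
            resid -= X[:, j] * beta[j]                    # restore with updated coefficient
    return beta

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(size=n)
print(np.round(lasso_cd(X, y, lam=0.1)[:6], 2))  # first three large, the rest near zero
```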

13.
We consider varying coefficient models, which extend classical linear regression models in that the regression coefficients are replaced by functions of certain variables (for example, time), and the covariates are also allowed to depend on other variables. Varying coefficient models are popular in longitudinal and panel data studies and have been applied in fields such as finance and the health sciences. We consider longitudinal data and estimate the coefficient functions using the flexible B-spline technique. An important question in a varying coefficient model is whether an estimated coefficient function is statistically different from a constant (or from zero). We develop testing procedures based on the estimated B-spline coefficients, making use of convenient properties of the B-spline basis. Our method allows longitudinal data in which repeated measurements on an individual may be correlated. We obtain the asymptotic null distribution of the test statistic. The power of the proposed testing procedures is illustrated on simulated data, where we highlight the importance of including the correlation structure of the response variable, and on real data.
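
A constant-versus-varying test of the kind described above starts from a B-spline design matrix for the coefficient function. The sketch below builds such a basis with scipy (cubic splines, illustrative interior knots, hypothetical helper name); the test itself, a statistic built on contrasts of these basis coefficients, is not shown.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, interior_knots, degree=3):
    """Evaluate a full clamped B-spline basis at the points x.
    Returns a (len(x), n_basis) design matrix."""
    lo, hi = np.min(x), np.max(x)
    t = np.r_[[lo] * (degree + 1), interior_knots, [hi] * (degree + 1)]  # clamped knots
    n_basis = len(t) - degree - 1
    eye = np.eye(n_basis)
    return np.column_stack([BSpline(t, eye[i], degree)(x) for i in range(n_basis)])

x = np.linspace(0.0, 1.0, 200)
B = bspline_basis(x, interior_knots=[0.25, 0.5, 0.75])
print(B.shape)                          # (200, 7): 3 interior knots + degree 3 + 1
print(B.sum(axis=1)[:3])                # rows sum to 1 (partition of unity)
```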

14.
Panel count data occur in many fields, and a number of approaches to their analysis have been developed. However, most of these approaches are for situations in which there is no terminal event and the observation process is independent of the underlying recurrent event process, either unconditionally or conditionally on the covariates. In this paper, we discuss a more general situation in which the observation process is informative and a terminal event precludes further occurrence of the recurrent events of interest. For the analysis, a semiparametric transformation model is presented for the mean function of the underlying recurrent event process among survivors. To estimate the regression parameters, an estimating equation approach is proposed that uses an inverse survival probability weighting technique. The asymptotic distribution of the proposed estimators is provided. Simulation studies suggest that the proposed approach works well in practical situations, and an illustrative example is provided. The Canadian Journal of Statistics 41: 174–191; 2013 © 2012 Statistical Society of Canada

15.
In this paper, the generalized log-gamma regression model is modified to allow for the possibility that long-term survivors are present in the data. This modification leads to a generalized log-gamma regression model with a cure rate, encompassing as special cases the log-exponential, log-Weibull and log-normal regression models with a cure rate typically used to model such data. The models simultaneously estimate the effects of explanatory variables on the acceleration or deceleration of the timing of a given event and on the surviving fraction, that is, the proportion of the population for which the event never occurs. The normal curvatures of local influence are derived under the usual perturbation schemes, and two martingale-type residuals are proposed to assess departures from the generalized log-gamma error assumption as well as to detect outlying observations. Finally, a data set from the medical area is analyzed.
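
The "surviving fraction" idea is usually expressed through the standard mixture cure model, in which a proportion of the population never experiences the event, so the population survival curve plateaus instead of decaying to zero. The sketch below uses a Weibull latency purely for illustration; the paper's generalized log-gamma specification may differ in detail.

```python
import numpy as np

def mixture_cure_survival(t, pi_cure, shape, scale):
    """Population survival under a mixture cure model:
        S_pop(t) = pi_cure + (1 - pi_cure) * S_u(t),
    where S_u is the survival function of the susceptible (uncured) subjects.
    A Weibull latency is used here only for illustration."""
    s_uncured = np.exp(-(np.asarray(t, dtype=float) / scale) ** shape)
    return pi_cure + (1.0 - pi_cure) * s_uncured

t = np.linspace(0.0, 20.0, 5)
print(mixture_cure_survival(t, pi_cure=0.3, shape=1.5, scale=4.0))
# the curve levels off at pi_cure = 0.3 rather than decaying to zero
```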

16.
We derive two methods for estimating the logistic regression coefficients in a meta-analysis when only 'aggregate' data (mean values) from each study are available: the discriminant function estimator and a reverse Taylor series approximation. The two methods gave similar estimates in an example with individual-level data; when aggregate data were used, however, the discriminant function estimates differed markedly from the others. A simulation study was then performed to evaluate the performance of these two estimators, as well as of the estimator obtained by simply entering the aggregate data into a logistic regression model. The simulation study showed that all three estimators are biased, with the bias increasing as the variance of the covariate increases; the distribution of the covariates also affects the bias. In general, the estimator from the logistic regression on the aggregate data has less bias and better coverage probabilities than the other two estimators. We conclude that analysts should be cautious in using aggregate data to estimate the parameters of a logistic regression model for the underlying individual data.
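
The discriminant function estimator referenced above exploits the fact that, when the covariates are normal with a common covariance matrix within each outcome group, the logistic slope equals the inverse covariance times the difference in group means. The sketch below computes the sample analogue with a pooled covariance on hypothetical individual-level data; the reverse Taylor series approximation is not attempted here.

```python
import numpy as np

def discriminant_logistic_slope(X, y):
    """Discriminant-function estimator of the logistic regression slopes:
    beta = S_pooled^{-1} (xbar_1 - xbar_0), valid when X | Y = j is normal
    with a common covariance matrix in both groups."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    S_pooled = ((n0 - 1) * np.cov(X0, rowvar=False) +
                (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    return np.linalg.solve(S_pooled, X1.mean(axis=0) - X0.mean(axis=0))

# hypothetical data consistent with the discriminant model: normal within each group
rng = np.random.default_rng(0)
mu0, mu1 = np.array([0.0, 0.0]), np.array([0.8, -0.5])
X = np.vstack([rng.normal(mu0, 1.0, size=(3000, 2)),
               rng.normal(mu1, 1.0, size=(1000, 2))])
y = np.r_[np.zeros(3000), np.ones(1000)]
print(discriminant_logistic_slope(X, y))  # close to Sigma^{-1}(mu1 - mu0) = (0.8, -0.5)
```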

17.
A parametric robust test is proposed for comparing several coefficients of variation. This test is derived by properly correcting the normal likelihood function according to the technique suggested by Royall and Tsou. The proposed test statistic is asymptotically valid for general random variables, as long as their underlying distributions have finite fourth moments.

Simulation studies and real data analyses are provided to demonstrate the effectiveness of the novel robust procedure.

18.
In many clinical studies, subjects are at risk of experiencing more than one type of potentially recurrent event. In some situations, however, the occurrence of an event is observed, but the specific type is not determined. We consider the analysis of this type of incomplete data when the objectives are to summarize features of conditional intensity functions and associated treatment effects, and to study the association between different types of event. Here we describe a likelihood approach based on joint models for the multi-type recurrent events, where parameter estimation is obtained from a Monte Carlo EM algorithm. Simulation studies show that the proposed method gives unbiased estimators for regression coefficients and variance–covariance parameters, and the coverage probabilities of confidence intervals for regression coefficients are close to the nominal level. When the distribution of the frailty variable is misspecified, the method still provides estimators of the regression coefficients with good properties. The proposed method is applied to a motivating data set from an asthma study in which exacerbations were to be sub-typed by cellular analysis of sputum samples as eosinophilic or non-eosinophilic.

19.
Here we consider a multinomial probit regression model in which the number of variables substantially exceeds the sample size and only a subset of the available variables is associated with the response; selecting a small number of relevant variables for classification has therefore received a great deal of attention. When the number of variables is substantial, sparsity-enforcing priors for the regression coefficients are generally called for on grounds of predictive generalization and computational ease. In this paper, we propose a sparse Bayesian variable selection method for the multinomial probit regression model for multi-class classification. The performance of our proposed method is demonstrated with one simulated dataset and three well-known gene expression profiling datasets: breast cancer, leukemia and small round blue-cell tumors. The results show that, compared with other methods, our method is able to select the relevant variables and obtains competitive classification accuracy with a small subset of relevant genes.

20.
This paper deals with a longitudinal semi-parametric regression model in a generalised linear model setup for repeated count data collected from a large number of independent individuals. To accommodate the longitudinal correlations, we consider a dynamic model for repeated counts that has auto-correlations decaying as the time lag between the repeated responses increases. The semi-parametric regression function involved in the model contains a specified regression function in some suitable time-dependent covariates and a non-parametric function in other time-dependent covariates. As far as inference is concerned, because the non-parametric function is of secondary interest, we estimate it consistently using the well-known quasi-likelihood approach based on the independence assumption. Next, the proposed longitudinal correlation structure and the estimate of the non-parametric function are used to develop a semi-parametric generalised quasi-likelihood approach for consistent and efficient estimation of the regression effects in the parametric regression function. The finite sample performance of the proposed estimation approach is examined through an intensive simulation study based on both large and small samples, incorporating both balanced and unbalanced cluster sizes. The asymptotic properties of the estimators are given. The estimation methodology is illustrated by reanalysing the well-known health care utilisation data, consisting of counts of yearly visits to a physician by 180 individuals over four years, together with several important primary and secondary covariates.
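
For contrast with the semi-parametric generalised quasi-likelihood approach described above, the sketch below fits an ordinary GEE for longitudinal counts with statsmodels, using a Poisson mean and an AR(1)-type working correlation whose strength decays with lag. The simulated data, covariate and working-correlation choice are hypothetical stand-ins, not the paper's dynamic model or its health care utilisation data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# hypothetical longitudinal count data: 180 subjects, 4 yearly visits each
rng = np.random.default_rng(0)
n_subj, n_time = 180, 4
subj = np.repeat(np.arange(n_subj), n_time)
year = np.tile(np.arange(n_time), n_subj)
x = rng.normal(size=n_subj * n_time)                        # a time-dependent covariate
b = np.repeat(rng.normal(scale=0.3, size=n_subj), n_time)   # subject effect -> within-subject correlation
y = rng.poisson(np.exp(0.5 + 0.4 * x + b))

df = pd.DataFrame({"y": y, "x": x, "subject": subj, "year": year})
exog = sm.add_constant(df[["x"]])
model = sm.GEE(
    df["y"], exog,
    groups=df["subject"],
    time=df["year"],
    family=sm.families.Poisson(),
    cov_struct=sm.cov_struct.Autoregressive(grid=True),  # correlation decays with lag
)
print(model.fit().params)
```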

