Similar Articles
20 similar articles found.
1.
A new modeling approach called ‘recursive segmentation’ is proposed to support the supervised exploration and identification of subgroups or clusters. It is based on the frameworks of recursive partitioning and the Patient Rule Induction Method (PRIM). By combining these methods, recursive segmentation aims to exploit their respective strengths while reducing their weaknesses. Consequently, recursive segmentation can be applied in a very general way, that is, in any (multivariate) regression, classification or survival (time-to-event) problem, using conditional inference, evolutionary learning or the CART algorithm, with predictor variables of any scale and with missing values. Furthermore, results of a synthetic example and a benchmark application study comprising 26 data sets suggest that recursive segmentation achieves competitive prediction accuracy and provides more accurate definitions of subgroups with less complex models than recursive partitioning and PRIM. An application to the German Breast Cancer Study Group data demonstrates the improved interpretability and reliability of results produced by the new approach. The method is made publicly available through the R-package rseg (http://rseg.r-forge.r-project.org/).
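The rseg package implements the full procedure; as a rough illustration of the PRIM ingredient alone, the following Python sketch performs greedy box peeling toward a high-mean subgroup (the function name, data, and tuning constants are illustrative, not taken from rseg):

```python
import numpy as np

def prim_peel(X, y, alpha=0.1, min_support=0.05):
    """Greedy PRIM-style peeling: repeatedly remove an alpha-fraction
    slice along one predictor edge so that the mean of y inside the
    remaining box increases the most."""
    inside = np.ones(len(y), dtype=bool)
    while inside.mean() > min_support:
        best_gain, best_trial = 0.0, None
        for j in range(X.shape[1]):
            xj = X[inside, j]
            lo, hi = np.quantile(xj, alpha), np.quantile(xj, 1 - alpha)
            for keep in (X[:, j] >= lo, X[:, j] <= hi):
                trial = inside & keep
                if 0 < trial.sum() < inside.sum():
                    gain = y[trial].mean() - y[inside].mean()
                    if gain > best_gain:
                        best_gain, best_trial = gain, trial
        if best_trial is None:
            break  # no peel improves the box mean any further
        inside = best_trial
    return inside  # boolean indicator of the final subgroup

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
y = (X[:, 0] > 0.7) + rng.normal(scale=0.3, size=500)
subgroup = prim_peel(X, y)
print(f"subgroup size: {subgroup.sum()}, mean inside: {y[subgroup].mean():.2f}")
```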

2.
It is common to have experiments in which it is not possible to observe exact lifetimes but only the intervals in which they occur. Such data exhibit a high number of ties and are called grouped or interval-censored survival data. Regression methods for grouped data are available in the statistical literature. The regression structure models the probability of a subject's survival past a visit time conditional on survival at the previous visit. Two approaches are common: assuming that lifetimes come from (1) a continuous proportional hazards model or (2) a logistic model. However, there may be situations in which neither model is adequate for a particular data set. This article proposes the generalized log-normal model as an alternative model for discrete survival data. This model was introduced by Chen (1995, Comput. Stat. Data Anal. 19:300–319) and is extended in this article to grouped survival data. A real example related to Chagas disease illustrates the proposed model.
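Under assumption (1), the grouped-data likelihood has a convenient closed form: the conditional probability of failing in interval j, given survival to its start, is 1 − exp(−exp(γ_j + x′β)), the complementary log-log model. A minimal sketch of fitting that discrete-time likelihood by direct optimization on simulated data (all names and constants are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, J = 400, 5                                  # subjects, inspection intervals
x = rng.normal(size=n)
t = rng.exponential(scale=np.exp(-0.5 * x))    # latent continuous lifetimes
interval = np.minimum(np.ceil(t / 0.4).astype(int), J)  # grouped into J bins
event = t <= 0.4 * J                           # False = censored at last visit

# Expand to one record per subject-interval at risk.
rows = [(j, xi, (j == gi) and ei)
        for gi, xi, ei in zip(interval, x, event)
        for j in range(1, gi + 1)]
jj, xx, dd = map(np.array, zip(*rows))

def negloglik(theta):
    gamma, beta = theta[:J], theta[J]
    # cloglog hazard: P(fail in interval j | at risk) = 1 - exp(-exp(eta))
    eta = gamma[jj - 1] + beta * xx
    p = np.clip(1.0 - np.exp(-np.exp(eta)), 1e-10, 1 - 1e-10)
    return -np.sum(dd * np.log(p) + (1 - dd) * np.log1p(-p))

fit = minimize(negloglik, np.zeros(J + 1), method="BFGS")
print(f"estimated beta: {fit.x[J]:.2f} (true 0.5)")
```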

3.
4.
Zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models are recommended for handling excessive zeros in count data. For various reasons, researchers may not address zero inflation. This paper helps educate researchers on (1) the importance of accounting for zero inflation and (2) the consequences of misspecifying the statistical model. Using simulations, we found that when the zero inflation in the data was ignored, estimation was poor and statistically significant findings were missed. When overdispersion within the zero-inflated data was ignored, poor estimation and inflated Type I errors resulted. Recommendations on when to use the ZINB and ZIP models are provided. In an illustration using a two-step model selection procedure (likelihood ratio test and the Vuong test), the procedure correctly identified the ZIP model only when the distributions had moderate means and sample sizes, and it failed to identify the ZINB model or the zero inflation in the ZIP and ZINB distributions.
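For reference, the ZIP probability mass mixes a point mass at zero with a Poisson: P(Y=0) = π + (1−π)e^(−λ) and P(Y=k) = (1−π)e^(−λ)λ^k/k! for k ≥ 1. A minimal sketch of fitting it by direct maximum likelihood on simulated data (covariate-free for brevity; names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, expit

rng = np.random.default_rng(2)
n, pi_true, lam_true = 1000, 0.3, 2.5
y = rng.poisson(lam_true, n) * (rng.uniform(size=n) > pi_true)

def zip_negloglik(theta):
    pi, lam = expit(theta[0]), np.exp(theta[1])  # keep pi in (0,1), lam > 0
    logp_pois = -lam + y * np.log(lam) - gammaln(y + 1)  # Poisson log-pmf
    ll_zero = np.log(pi + (1 - pi) * np.exp(-lam))  # structural + sampling zeros
    ll_pos = np.log(1 - pi) + logp_pois
    return -np.sum(np.where(y == 0, ll_zero, ll_pos))

fit = minimize(zip_negloglik, np.zeros(2), method="BFGS")
print(f"pi-hat: {expit(fit.x[0]):.2f}, lambda-hat: {np.exp(fit.x[1]):.2f}")
```

The same pointwise log-likelihoods, computed under a plain Poisson fit as well, are the ingredients of the Vuong test mentioned above.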

5.
Classification and regression trees have been useful in medical research for constructing algorithms for disease diagnosis or prognostic prediction. Jin et al. (2004, Med. Decis. Mak. 24:386–398) developed a robust and cost-saving tree (RACT) algorithm with application to classification of hip fracture risk after 5-year follow-up, based on data from the Study of Osteoporotic Fractures (SOF). Although conventional recursive partitioning algorithms are well developed, they still have limitations: binary splits may generate a big tree with many layers, while trinary splits may produce too many nodes. In this paper, we propose a classification approach combining trinary splits and binary splits to generate a trinary–binary tree. A new non-inferiority test of entropy is used to select binary or trinary splits. We apply the modified method to the SOF data to construct a trinary–binary classification rule for predicting risk of osteoporotic hip fracture. Our new classification tree has good statistical utility: it is statistically non-inferior to the optimum binary tree and the RACT on the testing sample, and it is also cost-saving. It may be useful in clinical applications: femoral neck bone mineral density, age, height loss and weight gain since age 25 can identify subjects with elevated 5-year hip fracture risk without loss of statistical efficiency.
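The core comparison is between the size-weighted entropy left after a binary cut and after a trinary cut; a toy sketch of that computation (the non-inferiority margin delta below is illustrative, not the paper's test statistic):

```python
import numpy as np

def entropy(labels):
    """Empirical Shannon entropy of a 0/1 label vector."""
    p = np.bincount(labels, minlength=2) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def weighted_entropy(labels, groups):
    """Size-weighted entropy after splitting labels by group index."""
    return sum(entropy(labels[groups == g]) * np.mean(groups == g)
               for g in np.unique(groups))

rng = np.random.default_rng(3)
x = rng.normal(size=600)
y = (x + rng.normal(scale=0.8, size=600) > 0).astype(int)

c = np.median(x)                      # binary split at one cut point
binary = weighted_entropy(y, (x > c).astype(int))
lo, hi = np.quantile(x, [1 / 3, 2 / 3])   # trinary split at two cut points
trinary = weighted_entropy(y, np.digitize(x, [lo, hi]))

delta = 0.02                          # illustrative non-inferiority margin
print(f"binary: {binary:.3f}, trinary: {trinary:.3f}")
print("binary split is non-inferior" if binary <= trinary + delta
      else "trinary split wins")
```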

6.
In this paper, we consider statistical inference for the success probability in start-up demonstration tests in which a unit is rejected when a pre-fixed number of failures is observed before the required number of consecutive successes is achieved for acceptance of the unit. Since the expected value of the stopping time is not a monotone function of the unknown parameter, the method of moments is not useful in this situation. We therefore discuss two estimation methods for the success probability: (1) maximum likelihood estimation (MLE) via the expectation-maximization (EM) algorithm and (2) Bayesian estimation with a beta prior. We examine the small-sample properties of the MLE and the Bayesian estimator. Finally, we present an example to illustrate the method of inference discussed here.
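Because the stopping rule depends only on the observed sequence, the likelihood is still proportional to p^s(1−p)^f, so a Beta(a, b) prior yields a Beta(a+s, b+f) posterior. A sketch of approach (2) on simulated tests (accept after k consecutive successes, reject at d total failures; the constants are illustrative):

```python
import numpy as np
from scipy.stats import beta

def run_test(p, k=5, d=3, rng=None):
    """Simulate one start-up demonstration test: accept after k
    consecutive successes, reject once d total failures accumulate.
    Returns (accepted, successes, failures)."""
    rng = rng or np.random.default_rng()
    run = s = f = 0
    while True:
        if rng.uniform() < p:
            run += 1; s += 1
            if run == k:
                return True, s, f
        else:
            run = 0; f += 1
            if f == d:
                return False, s, f

rng = np.random.default_rng(4)
a, b = 1.0, 1.0                     # uniform Beta prior on p
s_tot = f_tot = 0
for _ in range(20):                 # 20 units tested
    _, s, f = run_test(p=0.8, rng=rng)
    s_tot += s; f_tot += f
post = beta(a + s_tot, b + f_tot)   # conjugate posterior for p
print(f"posterior mean {post.mean():.3f}, 95% CI "
      f"({post.ppf(0.025):.3f}, {post.ppf(0.975):.3f})")
```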

7.
Econometric Reviews (2013), 32(4): 425–443
The integer-valued AR(1) model is generalized to encompass some of the more likely features of economic time series of count data. The generalizations come at the price of losing exact distributional properties. For most specifications, the first- and second-order moments, both conditional and unconditional, can be obtained. Hence estimation, testing and forecasting are feasible and can be based on least squares or GMM techniques. An illustration based on the number of plants within an industrial sector is considered.
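The baseline model in question is the INAR(1) recursion X_t = α∘X_{t−1} + ε_t, where ∘ denotes binomial thinning; since E[X_t | X_{t−1}] = αX_{t−1} + λ, conditional least squares gives a simple moment-based estimator. A sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, lam, T = 0.6, 2.0, 2000

# Simulate INAR(1): X_t = alpha ∘ X_{t-1} + eps_t with binomial thinning.
x = np.empty(T, dtype=int)
x[0] = rng.poisson(lam / (1 - alpha))          # start near stationarity
for t in range(1, T):
    x[t] = rng.binomial(x[t - 1], alpha) + rng.poisson(lam)

# Conditional least squares: E[X_t | X_{t-1}] = alpha * X_{t-1} + lam,
# so regress X_t on X_{t-1} with an intercept.
X = np.column_stack([np.ones(T - 1), x[:-1]])
lam_hat, alpha_hat = np.linalg.lstsq(X, x[1:], rcond=None)[0]
print(f"alpha-hat = {alpha_hat:.3f} (true {alpha}), "
      f"lambda-hat = {lam_hat:.3f} (true {lam})")
```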

8.
A suitable measure of association for two ordered variables is the doubly cumulative chi-squared statistic (Hirotsu, 1994, Proc. Third IEEE Conf. Control Applic. 2:1283–1288). This statistic is obtained by considering cumulative sums of cell frequencies across both variables. In this article, we explore a development of correspondence analysis that takes into account the presence of two ordered variables by partitioning the doubly cumulative chi-squared statistic.
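One concrete reading of the construction: cut both ordered margins at every pair of positions, collapse the table into 2×2 tables of cumulative frequencies, and accumulate their Pearson chi-squared values. The sketch below follows that simplified reading (it omits Hirotsu's exact weighting, so treat it as illustrative only):

```python
import numpy as np

def doubly_cumulative_chisq(table):
    """Sum of Pearson chi-squared statistics over all 2x2 collapsed
    tables obtained by cutting both ordered margins."""
    N = np.asarray(table, dtype=float)
    n = N.sum()
    total = 0.0
    for i in range(1, N.shape[0]):          # cut between rows i-1 and i
        for j in range(1, N.shape[1]):      # cut between cols j-1 and j
            a = N[:i, :j].sum(); b = N[:i, j:].sum()
            c = N[i:, :j].sum(); d = N[i:, j:].sum()
            # Pearson chi-squared of the collapsed 2x2 table
            total += n * (a * d - b * c) ** 2 / (
                (a + b) * (c + d) * (a + c) * (b + d))
    return total

table = [[20, 10,  5],
         [10, 15, 10],
         [ 5, 10, 20]]   # two ordered variables, positive association
print(f"doubly cumulative chi-squared: {doubly_cumulative_chisq(table):.2f}")
```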

9.
Econometric Reviews (2013), 32(3): 383–393

This paper considers computation of fitted values and marginal effects in the Box–Cox regression model. Two methods, (1) the “smearing” technique suggested by Duan (1983, J. Amer. Statist. Assoc. 78:605–610) and (2) direct numerical integration, are examined and compared with the “naive” method often used in econometrics.
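Duan's smearing estimator replaces the naive back-transform g^(−1)(x′β̂) with the average of retransformed residuals, (1/n) Σ_i g^(−1)(x′β̂ + ê_i). A sketch for the λ → 0 (log) case of the Box–Cox family, where the naive method's downward bias is easy to see (the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.uniform(1, 3, size=n)
y = np.exp(1.0 + 0.5 * x + rng.normal(scale=0.5, size=n))  # log-normal outcome

# OLS on the transformed scale (Box-Cox with lambda -> 0, i.e. log).
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
resid = np.log(y) - X @ beta

x0 = np.array([1.0, 2.0])            # point at which to predict E[Y|x]
naive = np.exp(x0 @ beta)            # "naive" retransformation, biased low
smear = np.mean(np.exp(x0 @ beta + resid))  # Duan's smearing estimate
truth = np.exp(1.0 + 0.5 * 2.0 + 0.5 ** 2 / 2)
print(f"naive {naive:.2f}  smearing {smear:.2f}  true {truth:.2f}")
```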

10.
In this paper, we propose a methodology to analyze longitudinal data through distances between pairs of observations (or individuals) with regard to the explanatory variables used to fit continuous response variables. Restricted maximum likelihood and generalized least squares are used to estimate the parameters in the model. We applied this new approach to study the effect of gender and exposure on a deviant behavior variable with respect to tolerance for a group of youths studied over a period of 5 years. We performed simulations comparing our distance-based method with classical longitudinal analysis under both AR(1) and compound symmetry correlation structures, evaluating the models by the Akaike and Bayesian information criteria and by the relative efficiency of the generalized variance of the errors of each model. We found small gains in fit for the proposed model relative to the classical methodology, particularly in small samples, regardless of the variance, correlation, autocorrelation structure and number of time measurements.
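The classical comparison arm is GLS with an assumed within-subject correlation structure, β̂ = (X′Σ⁻¹X)⁻¹X′Σ⁻¹y. A sketch with a known AR(1) structure (in practice ρ and σ are estimated, e.g. by REML; all constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n_subj, n_time, rho, sigma = 60, 5, 0.6, 1.0

# AR(1) correlation among the repeated measures of one subject.
lags = np.abs(np.subtract.outer(np.arange(n_time), np.arange(n_time)))
R = rho ** lags

time = np.tile(np.arange(n_time), n_subj)
group = np.repeat(rng.integers(0, 2, n_subj), n_time)   # e.g. gender
X = np.column_stack([np.ones(n_subj * n_time), group, time])
beta_true = np.array([1.0, 0.5, -0.3])

L = np.linalg.cholesky(sigma ** 2 * R)
errors = np.concatenate([L @ rng.normal(size=n_time) for _ in range(n_subj)])
y = X @ beta_true + errors

# GLS with block-diagonal Sigma: beta = (X' S^-1 X)^-1 X' S^-1 y.
Sinv = np.linalg.inv(sigma ** 2 * R)
XtSiX, XtSiy = np.zeros((3, 3)), np.zeros(3)
for s in range(n_subj):
    Xs = X[s * n_time:(s + 1) * n_time]
    ys = y[s * n_time:(s + 1) * n_time]
    XtSiX += Xs.T @ Sinv @ Xs
    XtSiy += Xs.T @ Sinv @ ys
print("GLS beta-hat:", np.linalg.solve(XtSiX, XtSiy).round(2),
      "(true", beta_true, ")")
```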

11.
In this article, we propose a robust statistical approach to select an appropriate error distribution in a classical multiplicative heteroscedastic model. In a first step, unlike the traditional approach, we do not use any GARCH-type estimation of the conditional variance. Instead, we propose to use a recently developed nonparametric procedure (Mercurio and Spokoiny, Ann. Stat. 32 (2004), pp. 577–602): local adaptive volatility estimation. The motivation for using this method is to avoid possible model misspecification of the conditional variance. In a second step, we suggest a set of estimation and model selection procedures (Berk–Jones tests, kernel density-based selection, the censored likelihood score, and coverage probability) based on the resulting residuals. These methods make it possible to assess the global fit of a set of distributions as well as to focus on their behaviour in the tails, allowing us to map the strengths and weaknesses of the candidate distributions. A bootstrap procedure is provided to compute the rejection regions in this semiparametric context. Finally, we illustrate our methodology through a small simulation study and an application to three time series of daily returns (UBS stock returns, BOVESPA returns and EUR/USD exchange rates).
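Of the residual-based criteria, the censored likelihood score is the most compact to illustrate: inside the tail region it uses log f(y), outside it only charges the total tail mass. A sketch comparing normal and Student-t candidates on heavy-tailed pseudo-residuals (a simplified score in the spirit of Diks et al., not the paper's full procedure):

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(13)
resid = t(df=4).rvs(2000, random_state=0) / np.sqrt(2)  # heavy-tailed "residuals"
u = np.quantile(resid, 0.05)                            # left-tail region y <= u

def censored_likelihood_score(dist, y, u):
    """Average censored likelihood score focused on the region y <= u:
    log f(y) inside the tail, log P(Y > u) outside."""
    inside = y <= u
    return np.mean(np.where(inside, dist.logpdf(y), dist.logsf(u)))

candidates = {"normal": norm(*norm.fit(resid)),
              "Student-t": t(*t.fit(resid))}
for name, dist in candidates.items():
    print(f"{name}: tail score {censored_likelihood_score(dist, resid, u):.4f}")
```

The candidate with the higher average score fits the tail better; here the Student-t should win, mirroring how the paper maps strengths and weaknesses of candidate error distributions.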

12.
Generalized degrees of freedom (GDF), as defined by Ye (1998, J. Amer. Statist. Assoc. 93:120–131), represent the sensitivity of model fits to perturbations of the data. GDF can be computed for any statistical model, making it possible, in principle, to derive the effective number of parameters in machine-learning approaches and thus compute information-theoretic measures of fit. We compare GDF with cross-validation and find that the latter provides a less computer-intensive and more robust alternative. For Bernoulli-distributed data, GDF estimates were unstable and inconsistently sensitive to the number of data points perturbed simultaneously. Cross-validation, in contrast, also performs well for binary data and for very different machine-learning approaches.
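Ye's GDF is Σ_i ∂E[ŷ_i]/∂y_i, estimated by perturbing the responses with small noise, refitting, and regressing each fitted value on its own perturbation. For OLS the answer should recover the trace of the hat matrix, i.e. the number of parameters, which makes a convenient check. A sketch:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, tau, n_reps = 100, 4, 0.25, 200
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def fit(X, y):
    """The modelling procedure whose GDF we want; OLS fitted values here."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return X @ beta

# Monte Carlo GDF (Ye 1998): perturb y, record the fits, then for each i
# regress yhat_i on delta_i across replicates and sum the slopes.
deltas = tau * rng.normal(size=(n_reps, n))
fits = np.array([fit(X, y + d) for d in deltas])
slopes = [np.polyfit(deltas[:, i], fits[:, i], 1)[0] for i in range(n)]
print(f"GDF estimate: {sum(slopes):.2f}  (exact value for OLS: {p})")
```

Swapping `fit` for a tree or boosting fit is what makes GDF attractive in principle, and the repeated refitting is exactly the computational burden the abstract weighs against cross-validation.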

13.
Rubin (1976, Biometrika 63(3):581–592) derived general conditions under which inferences that ignore missing data are valid. These conditions are sufficient but not generally necessary, and therefore may be relaxed in some special cases. We consider here the case of frequentist estimation of a conditional cdf subject to missing outcomes. We partition a set of data into outcome, conditioning, and latent variables, all of which potentially affect the probability of a missing response. We describe sufficient conditions under which a complete-case estimate of the conditional cdf of the outcome given the conditioning variable is unbiased. We use simulations on a renal transplant data set (Dienemann et al.) to illustrate the implications of these results.
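The estimator under study is simple to state: within each value of the conditioning variable, take the empirical cdf of the outcome over complete cases only. It is unbiased when, within those values, missingness does not depend on the outcome, as in the sketch below (simulated data; the names are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
n = 5000
z = rng.integers(0, 2, n)                 # conditioning variable
y = rng.normal(loc=z, size=n)             # outcome
# Missingness depends on z but not on y (within levels of z):
observed = rng.uniform(size=n) < np.where(z == 1, 0.5, 0.9)

def cc_conditional_cdf(y, z, observed, z0, t):
    """Complete-case estimate of P(Y <= t | Z = z0)."""
    keep = observed & (z == z0)
    return np.mean(y[keep] <= t)

t = 0.5
print(f"complete-case F(t|z=1): {cc_conditional_cdf(y, z, observed, 1, t):.3f}")
print(f"true F(t|z=1):          {norm.cdf(t, loc=1):.3f}")
```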

14.
This article presents some applications of time-series procedures to solve two typical problems that arise when analyzing demographic information in developing countries: (1) unavailability of annual time series of population growth rates (PGRs) and their corresponding population time series and (2) inappropriately defined population growth goals in official population programs. Both problems require combining information from several population time series. First, we suggest the use of temporal disaggregation techniques to combine census data with vital statistics information in order to estimate annual PGRs. Second, we apply multiple restricted forecasting to combine the official targets on future PGRs with the disaggregated series, and we propose a mechanism to evaluate the compatibility of the demographic goals with the annual data. We apply these procedures to data of the Mexico City Metropolitan Zone divided by concentric rings and conclude that the targets established in the official program are not feasible. Hence, we derive future PGRs that are in line both with the official targets and with historical demographic behavior. We conclude that population growth programs should be based on this kind of analysis to be supported empirically: through specialized multivariate time-series techniques, one first obtains an optimal estimate of a disaggregated vector of population time series and then produces restricted forecasts in agreement with data-based population policies.
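Multiple restricted forecasting can be phrased as conditioning: if the unrestricted forecast x̂ has covariance Σ and the targets impose Ax = b, the restricted forecast is x̃ = x̂ + ΣA′(AΣA′)⁻¹(b − Ax̂), and the size of b − Ax̂ measures the compatibility of the goals with the data. A toy sketch (the numbers are illustrative, not the Mexico City data):

```python
import numpy as np

# Unrestricted forecasts of PGRs for the next 4 years, with covariance.
xhat = np.array([1.8, 1.7, 1.6, 1.5])        # % growth, illustrative
lags = np.abs(np.subtract.outer(np.arange(4), np.arange(4)))
Sigma = 0.04 * np.exp(-0.5 * lags)

# Official target: average growth over the horizon equals 1.4%.
A = np.full((1, 4), 0.25)
b = np.array([1.4])

# Conditioning a multivariate normal on A x = b (restricted forecast).
K = Sigma @ A.T @ np.linalg.inv(A @ Sigma @ A.T)
x_restricted = xhat + K @ (b - A @ xhat)
print("restricted forecasts:", x_restricted.round(3))
print("compatibility gap (target minus forecast average):",
      (b - A @ xhat).item())
```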

15.
Kadilar and Cingi [Ratio estimators in simple random sampling, Appl. Math. Comput. 151 (3) (2004), pp. 893–902] introduced some ratio-type estimators of finite population mean under simple random sampling. Recently, Kadilar and Cingi [New ratio estimators using correlation coefficient, Interstat 4 (2006), pp. 1–11] have suggested another form of ratio-type estimators by modifying the estimator developed by Singh and Tailor [Use of known correlation coefficient in estimating the finite population mean, Stat. Transit. 6 (2003), pp. 655–560]. Kadilar and Cingi [Improvement in estimating the population mean in simple random sampling, Appl. Math. Lett. 19 (1) (2006), pp. 75–79] have suggested yet another class of ratio-type estimators by taking a weighted average of the two known classes of estimators referenced above. In this article, we propose an alternative form of ratio-type estimators which are better than the competing ratio, regression, and other ratio-type estimators considered here. The results are also supported by the analysis of three real data sets that were considered by Kadilar and Cingi.
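The benchmark all of these variants build on is the classical ratio estimator under simple random sampling, ȳ_R = (ȳ/x̄)·X̄, which exploits a known population mean of the auxiliary variable. A sketch:

```python
import numpy as np

rng = np.random.default_rng(10)
N, n = 10000, 200
X_pop = rng.gamma(5.0, 2.0, size=N)                 # auxiliary variable
Y_pop = 3.0 * X_pop + rng.normal(scale=4.0, size=N)  # strongly correlated Y
Xbar_pop = X_pop.mean()                              # known population mean of X

idx = rng.choice(N, size=n, replace=False)           # simple random sample
y_bar, x_bar = Y_pop[idx].mean(), X_pop[idx].mean()

ratio_est = (y_bar / x_bar) * Xbar_pop               # classical ratio estimator
print(f"sample mean:     {y_bar:.2f}")
print(f"ratio estimator: {ratio_est:.2f}")
print(f"true mean:       {Y_pop.mean():.2f}")
```

The ratio estimator gains over the plain sample mean precisely when Y and X are strongly positively correlated, which is the setting the modified estimators in the abstract all target.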

16.
The shared-parameter model and its so-called hierarchical or random-effects extension are widely used joint modeling approaches for the combinations of longitudinal continuous, binary, count, missing, and survival outcomes that naturally occur in many clinical and other studies. A random effect is introduced and shared, or allowed to differ, between two or more repeated measures or longitudinal outcomes, thereby acting as a vehicle to capture association between the outcomes in these joint models. It is generally known that parameter estimates in a linear mixed model (LMM) for continuous repeated measures or longitudinal outcomes allow for a marginal interpretation, even though a hierarchical formulation is employed. This is not the case for the generalized linear mixed model (GLMM), that is, for non-Gaussian outcomes. Joint models formulated for continuous and binary, or for two longitudinal binomial, outcomes using the LMM and GLMM will therefore have a marginal interpretation for parameters associated with the continuous outcome but a subject-specific interpretation for the fixed-effects parameters relating covariates to binary outcomes. To derive marginally meaningful parameters for the binary models in a joint model, we adopt the marginal multilevel model (MMM) due to Heagerty [13] and Heagerty and Zeger [14] and formulate a joint MMM for two longitudinal responses. This enables us to (1) capture association between the two responses and (2) obtain parameter estimates that have a population-averaged interpretation for both outcomes. The model is applied to two sets of data. The results are compared with those obtained from existing approaches such as generalized estimating equations, the GLMM, and the model of Heagerty [13]. Estimates were found to be very close to those from separate analyses of each outcome, but the joint model yields higher precision and allows the association between outcomes to be quantified. Parameters were estimated by maximum likelihood. The model is easy to fit using available tools such as the SAS NLMIXED procedure.
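The interpretation gap the MMM closes can be seen directly: integrating a normal random intercept out of a logistic model attenuates the slope, so conditional (subject-specific) and marginal (population-averaged) coefficients differ. A quadrature sketch of that attenuation (illustrating the motivation only, not the joint MMM itself):

```python
import numpy as np
from scipy.special import expit, roots_hermitenorm

beta_cond, sigma_b = 1.0, 2.0     # conditional slope, random-intercept SD

def marginal_prob(x, n_nodes=60):
    """P(Y=1 | x) after integrating b ~ N(0, sigma_b^2) out of
    logit P(Y=1 | x, b) = beta_cond * x + b (Gauss-Hermite quadrature)."""
    nodes, weights = roots_hermitenorm(n_nodes)
    probs = expit(beta_cond * x + sigma_b * nodes)
    return np.sum(weights * probs) / np.sqrt(2 * np.pi)

# Implied population-averaged slope on the logit scale:
xs = np.array([-0.5, 0.5])
logits = [np.log(marginal_prob(x) / (1 - marginal_prob(x))) for x in xs]
beta_marg = (logits[1] - logits[0]) / (xs[1] - xs[0])
print(f"conditional slope {beta_cond:.2f} -> marginal slope {beta_marg:.2f}")
# Well-known approximation: beta_marg ~ beta_cond / sqrt(1 + 0.346 sigma_b^2)
print(f"approximation: {beta_cond / np.sqrt(1 + 0.346 * sigma_b ** 2):.2f}")
```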

17.
We are concerned with cumulative regression models for an ordered categorical response variable Y. We propose two methods to build partial residuals from regression on a subset Z1 of the covariates Z that take into account the ordinal character of the response. The first method makes use of a multivariate GLM representation of the model and produces residual measures for diagnostic purposes. The second uses a latent continuous variable model and yields new (adjusted) ordinal data Y*. Both methods are illustrated with a data set from forestry.

18.
A study of the densities of ratios of independently distributed random variables following the pathway model of Mathai (2005, Linear Algebra Appl. 396:317–328) is carried out. The density functions of these random variables are obtained in terms of the H-function. Particular cases of the integral forms are shown to be associated with Tsallis statistics and Beck–Cohen superstatistics. Many other special functions arising as cases of the general density are also included. We plot the density function of the ratio of these random variables for different values of the pathway parameters. A real-life application of the results in communication theory, related to the signal-to-noise ratio, is illustrated.
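As a purely numerical illustration, the type-2 pathway density is often written (up to a normalizing constant) as f(x) ∝ x^(γ−1)[1 + a(q−1)x^δ]^(−η/(q−1)) for x > 0 and q > 1; the sketch below normalizes it numerically and evaluates it for several pathway parameters q (this parameterization is an assumption of the sketch, not taken from the paper):

```python
import numpy as np
from scipy.integrate import quad

def pathway_density(x, q, gamma=2.0, delta=1.0, a=1.0, eta=4.0):
    """Assumed (unnormalized) type-2 pathway form:
    x^(gamma-1) [1 + a(q-1) x^delta]^(-eta/(q-1)), x > 0, q > 1."""
    return x ** (gamma - 1) * (1 + a * (q - 1) * x ** delta) ** (-eta / (q - 1))

xs = np.array([0.5, 1.0, 2.0])
for q in (1.2, 1.5, 2.0):
    c = 1.0 / quad(pathway_density, 0, np.inf, args=(q,))[0]  # normalize
    print(f"q = {q}: f(x) at {list(xs)} ->", (c * pathway_density(xs, q)).round(3))
```

Moving q toward 1 tightens the tail toward a generalized gamma, while larger q gives the heavier, Tsallis-type tails the abstract associates with superstatistics.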

19.
In this paper we build on an approach proposed by Zou et al. (2014) for nonparametric changepoint detection. This approach defines the best segmentation of a data set as the one that minimises a penalised cost function, with the cost defined in terms of minus a nonparametric log-likelihood for the data within each segment. Minimising this cost function is possible using dynamic programming, but their algorithm has a computational cost that is cubic in the length of the data set. To speed up computation, Zou et al. (2014) resorted to a screening procedure, which means that the estimated segmentation is no longer guaranteed to be the global minimum of the cost function. We show that the screening procedure adversely affects the accuracy of the changepoint detection method, and show how a faster dynamic programming algorithm, pruned exact linear time (PELT) (Killick et al. 2012), can be used to find the optimal segmentation with a computational cost that can be close to linear in the amount of data. PELT requires a penalty to avoid under- or over-fitting the model, which can have a detrimental effect on the quality of the detected changepoints. To overcome this issue we use a relatively new method, changepoints over a range of penalties (Haynes et al. 2016), which finds all of the optimal segmentations for multiple penalty values over a continuous range. We apply our method to detect changes in heart rate during physical activity.
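A minimal version of the PELT recursion F(t) = min_s [F(s) + C(y_(s+1):t) + β], with candidates pruned once they can never again be optimal, using a Gaussian (RSS) segment cost in place of the paper's nonparametric one:

```python
import numpy as np

def pelt_mean(y, beta):
    """PELT for changes in mean, segment cost = within-segment RSS.
    Returns the optimal changepoint locations."""
    n = len(y)
    S = np.concatenate([[0.0], np.cumsum(y)])
    S2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

    def cost(s, t):   # RSS of segment y[s:t] (0-indexed, t exclusive)
        return S2[t] - S2[s] - (S[t] - S[s]) ** 2 / (t - s)

    F = np.full(n + 1, np.inf); F[0] = -beta
    last = np.zeros(n + 1, dtype=int)
    cands = [0]
    for t in range(1, n + 1):
        vals = [F[s] + cost(s, t) + beta for s in cands]
        best = int(np.argmin(vals))
        F[t], last[t] = vals[best], cands[best]
        # Pruning step: drop s that can never be optimal for a future t.
        cands = [s for s, v in zip(cands, vals) if v - beta <= F[t]] + [t]
    cps, t = [], n
    while t > 0:      # backtrack through the stored segment starts
        t = last[t]
        if t > 0:
            cps.append(t)
    return sorted(cps)

rng = np.random.default_rng(11)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100),
                    rng.normal(1, 1, 100)])
print("detected changepoints:", pelt_mean(y, beta=3 * np.log(len(y))))
```

Rerunning `pelt_mean` over a grid of `beta` values is a crude stand-in for the CROPS idea of tracing out all optimal segmentations across a continuous penalty range.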

20.
Cerciello and Giudici (2014, Commun. Stat. Theory Methods 43:867–878) proposed a Bayesian approach to improve ordinal variable selection in credit rating assessment. However, no comparison was made with other methods, and the predictive power was not tested. This study proposes an integrated framework of random forest (RF)-based methods and Bayesian model averaging (BMA) to validate and investigate ordinal variable importance in evaluating credit risk and predicting default in greater depth. The proposed approach was superior to the Cerciello and Giudici method in terms of predictive accuracy and interpretability when applied to a European credit risk database.
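The RF side of such a framework is straightforward to sketch: fit a forest to ordinal rating predictors and rank them by permutation importance on held-out data (scikit-learn; the simulated data and column names below are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
n = 2000
X = rng.integers(1, 6, size=(n, 4))          # four ordinal ratings, 1..5
logit = 0.8 * X[:, 0] + 0.3 * X[:, 1] - 4.0  # only first two are informative
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)  # default flag

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

imp = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
for j, name in enumerate(["rating_1", "rating_2", "rating_3", "rating_4"]):
    print(f"{name}: importance {imp.importances_mean[j]:.3f}")
print("test accuracy:", rf.score(X_te, y_te).round(3))
```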
