相似文献 (Similar documents): 20 similar documents found.
1.
ABSTRACT

In this paper, we propose an adaptive stochastic gradient boosting tree for classification with imbalanced data. Cost-sensitivity adjustment and the predictive threshold are integrated, together with a composite criterion, into the original stochastic gradient boosting tree to deal with the imbalanced data structure. A numerical study shows that the proposed method can significantly enhance classification accuracy for the minority class at only a small loss in the true negative rate for the majority class. We use simulations to discuss the relation of cost-sensitivity to threshold manipulation. An illustrative example, the analysis of suboptimal health-state data in traditional Chinese medicine, is discussed.
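The threshold-adjustment idea above can be sketched in a few lines: sweep candidate cut points and keep the one that maximizes a composite of the true positive and true negative rates. The geometric-mean criterion, the grid, and the toy scores below are illustrative assumptions, not the paper's actual composite criterion.

```python
# Hypothetical sketch: tune the prediction cut point of a scoring
# classifier on imbalanced data by maximizing sqrt(TPR * TNR).
import math

def tune_threshold(probs, labels, grid=None):
    """Return the cut point maximizing the geometric mean of TPR and TNR."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]
    best_t, best_g = 0.5, -1.0
    for t in grid:
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < t and y == 0)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        tpr = tp / (tp + fn) if tp + fn else 0.0
        tnr = tn / (tn + fp) if tn + fp else 0.0
        g = math.sqrt(tpr * tnr)
        if g > best_g:
            best_t, best_g = t, g
    return best_t

# Toy scores: minority (y=1) cases cluster at moderate probabilities,
# so the optimal cut point drops below the default 0.5.
probs  = [0.9, 0.8, 0.7, 0.6, 0.45, 0.4, 0.35, 0.3, 0.2, 0.1]
labels = [0,   0,   0,   1,   1,    1,   0,    0,   0,   0  ]
t = tune_threshold(probs, labels)
print(t)
```

Lowering the cut point below 0.5 is what lets the minority class recover sensitivity at a modest cost in the majority-class true negative rate.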

2.
Imbalanced data bias classification and lower the classification accuracy for the minority class. In this article, we propose a methodology for selecting grouped variables using the area under the ROC curve with an adjustable prediction cut point. The proposed method enhances classification accuracy for the minority class by maximizing the true positive rate. Simulation results show that the method is appropriate for both categorical and continuous covariates. An illustrative example, an analysis of suboptimal health-state (SHS) data in traditional Chinese medicine (TCM), shows a reasonable application of the proposed method.
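The building block of such a selection rule is the area under the ROC curve itself, which can be computed through its Mann-Whitney interpretation: the probability that a randomly chosen positive case scores above a randomly chosen negative one. The scores and labels below are hypothetical; selecting grouped variables would mean evaluating this AUC for each candidate group's score.

```python
# A minimal AUC computation via the Mann-Whitney pair-counting identity.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # concordant pairs count 1, ties count one half
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# toy example: two positives outrank every negative, one is mixed in
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))
```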

3.
This article considers panel data models in the presence of a large number of potential predictors and unobservable common factors. The model is estimated by the regularization method together with the principal components procedure. We propose a panel information criterion for selecting the regularization parameter and the number of common factors under a diverging number of predictors. Under the correct model specification, we show that the proposed criterion consistently identifies the true model. If the model is instead misspecified, the proposed criterion achieves asymptotically efficient model selection. Simulation results confirm these theoretical arguments.

4.
In binary regression, imbalanced data arise when responses equal to zero (or one) appear in a proportion significantly greater than the proportion of responses equal to one (or zero). In this work, we evaluate two methods developed to deal with imbalanced data and compare them with the use of asymmetric links. Results based on a simulation study show that the correction methods do not adequately correct the bias in the estimated regression coefficients, and that the models considered with power and reverse-power links produce better results for certain types of imbalanced data. Additionally, we present an application to imbalanced data, identifying the best model among the various ones proposed. The parameters are estimated using a Bayesian approach with the Hamiltonian Monte Carlo method and the No-U-Turn Sampler algorithm, and models are compared using different criteria for model comparison, predictive evaluation, and quantile residuals.

5.
Linear discriminant analysis between two populations is considered in this paper. Error rate is reviewed as a criterion for selection of variables, and a stepwise procedure is outlined that selects variables on the basis of empirical estimates of error. Problems with assessment of the selected variables are highlighted. A leave-one-out method is proposed for estimating the true error rate of the selected variables, or alternatively of the selection procedure itself. Monte Carlo simulations, of multivariate binary as well as multivariate normal data, demonstrate the feasibility of the proposed method and indicate its much greater accuracy relative to that of other available methods.
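The leave-one-out idea can be sketched directly: each observation is held out in turn, the classifier is refit on the rest, and the held-out point is predicted. A nearest-class-mean rule stands in here for linear discriminant analysis (the two coincide under equal spherical covariances); the two-class toy data are an assumption for illustration.

```python
# Leave-one-out error estimation with a nearest-class-mean classifier.
def nearest_mean_predict(train_x, train_y, x):
    dists = {}
    for c in set(train_y):
        pts = [p for p, y in zip(train_x, train_y) if y == c]
        mean = [sum(col) / len(pts) for col in zip(*pts)]
        dists[c] = sum((a - b) ** 2 for a, b in zip(x, mean))
    return min(dists, key=dists.get)

def loo_error_rate(xs, ys):
    wrong = 0
    for i in range(len(ys)):
        rest_x = xs[:i] + xs[i + 1:]   # drop observation i from training
        rest_y = ys[:i] + ys[i + 1:]
        wrong += nearest_mean_predict(rest_x, rest_y, xs[i]) != ys[i]
    return wrong / len(ys)

# two well-separated toy classes
xs = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
ys = [0, 0, 0, 1, 1, 1]
print(loo_error_rate(xs, ys))
```

Because the held-out point never influences its own fit, the resulting rate avoids the optimistic bias of the resubstitution error.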

6.
The generalized estimating equation is a popular method for analyzing correlated response data. It is important to determine a proper working correlation matrix when applying the generalized estimating equation, since an improper selection can result in inefficient parameter estimates. We propose a criterion for selecting an appropriate working correlation structure. The proposed criterion is based on a statistic for testing the hypothesis that the covariance matrix equals a given matrix, and it measures the discrepancy between the covariance matrix estimator and the specified working covariance matrix. We evaluated the performance of the proposed criterion through simulation studies in which each subject has the same number of observations. The results revealed that the proportion of times the true correlation structure was selected was generally higher under the proposed criterion than under competing approaches. The proposed criterion was applied to longitudinal wheeze data, and the resulting correlation structure appeared to be the most accurate.
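The flavor of such a selection rule can be sketched with a simple discrepancy score (not the paper's test statistic): each candidate working structure, with its parameter crudely estimated from the data, is compared to an estimated correlation matrix, and the closest structure wins. The matrix and the Frobenius discrepancy are illustrative assumptions.

```python
# Selecting a working correlation structure by Frobenius discrepancy.
def exchangeable(rho, m):
    return [[1.0 if i == j else rho for j in range(m)] for i in range(m)]

def ar1(rho, m):
    return [[rho ** abs(i - j) for j in range(m)] for i in range(m)]

def frobenius(A, B):
    return sum((a - b) ** 2
               for ra, rb in zip(A, B) for a, b in zip(ra, rb)) ** 0.5

def select_structure(R_hat):
    m = len(R_hat)
    rho = R_hat[0][1]  # crude moment estimate from one off-diagonal entry
    candidates = {
        "independence": exchangeable(0.0, m),
        "exchangeable": exchangeable(rho, m),
        "AR(1)": ar1(rho, m),
    }
    return min(candidates, key=lambda k: frobenius(candidates[k], R_hat))

# hypothetical estimated correlation matrix with geometric decay
R_hat = [[1.00, 0.50, 0.26, 0.12],
         [0.50, 1.00, 0.49, 0.27],
         [0.26, 0.49, 1.00, 0.51],
         [0.12, 0.27, 0.51, 1.00]]
print(select_structure(R_hat))
```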

7.
Most classification models exhibit imbalanced learning when dealing with imbalanced datasets. This article proposes a novel approach for learning from imbalanced datasets based on an improved SMOTE (Synthetic Minority Over-sampling Technique) algorithm. By organically combining over-sampling and under-sampling, the approach chooses neighbors in a targeted way and synthesizes samples under different strategies. Experiments show that, after imbalanced datasets are processed with our algorithm, most classifiers achieve an ideal performance on both the positive and the negative class.
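The SMOTE interpolation step that the improved algorithm builds on can be sketched as follows: each synthetic minority sample is placed at a random point on the segment between a minority observation and one of its k nearest minority neighbours. The targeted neighbour choice and the under-sampling side of the paper's hybrid are deliberately omitted; the toy points are assumptions.

```python
# Basic SMOTE-style synthesis of minority-class samples.
import random

def smote(minority, k=2, n_new=4, seed=0):
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority)
print(new_points)
```

Because each synthetic point is a convex combination of two minority observations, it stays inside the minority region rather than merely duplicating existing cases.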

8.
We study the focused information criterion and frequentist model averaging and their application to post-model-selection inference for weighted composite quantile regression (WCQR) in additive partial linear models. With the non-parametric functions approximated by polynomial splines, we show that, under certain conditions, the asymptotic distribution of the frequentist model averaging WCQR estimator of a focused parameter is a non-linear mixture of normal distributions. This asymptotic distribution is used to construct confidence intervals that achieve the nominal coverage probability. With properly chosen weights, the focused information criterion based WCQR estimators are not only robust to outliers and non-normal residuals but can also achieve efficiency close to that of the maximum likelihood estimator, without assuming the true error distribution. Simulation studies and a real data analysis illustrate the effectiveness of the proposed procedure.

9.
When a treatment has a positive average causal effect (ACE) on an intermediate variable or surrogate end point which in turn has a positive ACE on a true end point, the treatment may nevertheless have a negative ACE on the true end point owing to unobserved confounders; this is called the surrogate paradox. A criterion for surrogate end points based on ACEs has recently been proposed to avoid the surrogate paradox. For a continuous or ordinal discrete end point, the distributional causal effect (DCE) may be a more appropriate measure of a causal effect than the ACE. We discuss criteria for surrogate end points based on DCEs. We show that commonly used models, such as generalized linear models and Cox's proportional hazards model, can make the sign of the DCE of the treatment on the true end point determinable by the sign of the DCE of the treatment on the surrogate, even if the models include unobserved confounders. Furthermore, for a general distribution without any parametric model assumption, we give a sufficient condition for a distributionally consistent surrogate and prove that it is almost necessary.

10.
The cross-validation (CV) criterion is known to be a second-order unbiased estimator of the risk function measuring the discrepancy between the candidate model and the true model, as are the generalized information criterion (GIC) and the extended information criterion (EIC). In the present article, we show that a 2kth-order unbiased estimator can be obtained as a linear combination of the leave-one-out through leave-k-out CV criteria. The proposed scheme is unique in that a bias smaller than that of a jackknife method can be obtained without any analytic calculation; that is, it is not necessary to derive the explicit form of several terms in an asymptotic expansion of the bias. Furthermore, the proposed criterion can be regarded as a finite correction of a bias-corrected CV criterion, using the scalar coefficients of a bias-corrected EIC obtained by bootstrap iteration.
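The ingredients can be illustrated on the simplest possible model: the leave-j-out CV criterion for predicting by the training mean under squared-error loss, averaging over every size-j held-out subset. The paper's contribution is the linear combination of CV(1), ..., CV(k) with higher-order unbiasedness; the combination weights are not reproduced here, and the data are made up.

```python
# Leave-j-out cross-validation for the sample-mean predictor.
from itertools import combinations

def leave_j_out_cv(data, j):
    n = len(data)
    total = count = 0
    for held in combinations(range(n), j):
        held_set = set(held)
        train = [data[i] for i in range(n) if i not in held_set]
        mu = sum(train) / len(train)          # fit on the retained points
        for i in held:                        # score on the held-out points
            total += (data[i] - mu) ** 2
            count += 1
    return total / count

data = [2.0, 4.0, 3.0, 5.0, 1.0, 3.5]
cv_values = [leave_j_out_cv(data, j) for j in (1, 2, 3)]
print(cv_values)
```

Holding out more points leaves a noisier fitted mean, so the criterion grows with j; the linear-combination scheme exploits exactly this family of values.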

11.
Overdispersion is a common phenomenon in count data and is usually treated with the negative binomial model. This paper shows that measurement errors in covariates in general also induce overdispersion in the observed data when the true data-generating process is Poisson regression. This kind of overdispersion cannot be treated with the negative binomial model; doing so introduces bias. To provide consistent estimates, we propose a new type of corrected score estimator assuming that the distribution of the latent variables is known. The consistency and asymptotic normality of the proposed estimator are established. Simulation results show that the estimator has good finite sample performance. We also illustrate that the Akaike information criterion and the Bayesian information criterion work well for selecting the correct model when the true model is the errors-in-variables Poisson regression.
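The mechanism can be seen in a toy simulation (this is not the paper's corrected score estimator): the counts truly follow a Poisson law with mean exp(x), but x is latent, standing in for a covariate observed only with error. Pooled over the unobserved variation in x, the counts show variance well above their mean, which a single Poisson law forbids. The coefficient, noise law, and sample size are assumptions.

```python
# Latent covariate variation turns Poisson counts overdispersed.
import math
import random

def poisson_draw(lam, rng):
    # Knuth's multiplication method; fine for the small means used here
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(42)
ys = []
for _ in range(5000):
    x = rng.gauss(0.0, 1.0)              # latent (mismeasured) covariate
    ys.append(poisson_draw(math.exp(x), rng))

mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / (len(ys) - 1)
print(mean, var)
```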

12.
In practical settings such as microarray data analysis, multiple hypotheses often need to be tested with dependence within, but not between, equal-sized blocks. We consider an adaptive Benjamini-Hochberg (BH) procedure to test the hypotheses. Under positive regression dependence on the subset of true null hypotheses, the proposed adaptive procedure is shown to control the false discovery rate. The proposed approach is compared with existing methods in simulations under block dependence and totally uniform pairwise dependence, and it is observed to perform better in several situations.
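One common adaptive BH variant can be sketched as follows: a plug-in (Storey-type) estimate of the proportion of true nulls sharpens the usual step-up rule to level alpha divided by that estimate. The article's own null-proportion estimator may differ; the p-values below are hypothetical.

```python
# Adaptive Benjamini-Hochberg with a Storey-type null-proportion estimate.
def adaptive_bh(pvals, alpha=0.05, lam=0.5):
    m = len(pvals)
    # plug-in estimate of the fraction of true null hypotheses
    pi0 = min(1.0, (1 + sum(p > lam for p in pvals)) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvals[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / (m * pi0):
            n_reject = rank          # step-up: keep the largest passing rank
    return sorted(order[:n_reject])  # indices of rejected hypotheses

pvals = [0.001, 0.41, 0.62, 0.002, 0.73, 0.84, 0.003, 0.93, 0.95, 0.21]
print(adaptive_bh(pvals))
```

When many p-values are large, pi0 is estimated near one and the procedure reduces to ordinary BH; when true effects abound, pi0 shrinks and the threshold relaxes.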

13.
An improved SMOTE resampling algorithm for imbalanced data sets
薛薇 (Xue Wei), 《统计研究》 (Statistical Research), 2012, 29(6): 95-98
The imbalanced learning behaviour of imbalanced data sets typically shows up as unsatisfactory classification of the negative class. We improve the SMOTE resampling algorithm by organically combining over-sampling with under-sampling, selecting neighbours in a targeted way and synthesizing samples under different strategies. Experiments show that, on imbalanced data sets processed by this algorithm, classifiers achieve satisfactory performance on both the positive and the negative class.

14.
Regularized variable selection is a powerful tool for identifying the true regression model from a large number of candidates by applying penalties to the objective functions. The penalty functions typically involve a tuning parameter that controls the complexity of the selected model. The ability of the regularized variable selection methods to identify the true model critically depends on the correct choice of the tuning parameter. In this study, we develop a consistent tuning parameter selection method for regularized Cox's proportional hazards model with a diverging number of parameters. The tuning parameter is selected by minimizing the generalized information criterion. We prove that, for any penalty that possesses the oracle property, the proposed tuning parameter selection method identifies the true model with probability approaching one as sample size increases. Its finite sample performance is evaluated by simulations. Its practical use is demonstrated in The Cancer Genome Atlas breast cancer data.
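Schematically, tuning parameter selection by a generalized information criterion amounts to scoring each candidate fit by minus twice its log (partial) likelihood plus a penalty proportional to model size, and keeping the minimizer. The penalty weight a_n = log(log(n)) * log(p) is one choice studied for diverging dimension; the per-lambda fit values below are made up for illustration.

```python
# Tuning-parameter selection by minimizing a generalized information
# criterion: GIC(lambda) = -2 * loglik + a_n * model_size.
import math

def gic_select(fits, n, p):
    a_n = math.log(math.log(n)) * math.log(p)  # one penalty-weight choice
    return min(fits, key=lambda lam: -2.0 * fits[lam][0] + a_n * fits[lam][1])

# hypothetical (log-likelihood, number of nonzero coefficients) per lambda
fits = {
    0.01: (-480.0, 40),   # overfitted: many spurious coefficients
    0.05: (-492.0, 12),
    0.10: (-495.0, 6),    # parsimonious with nearly the same fit
    0.50: (-560.0, 1),    # underfitted
}
best = gic_select(fits, n=400, p=1000)
print(best)
```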

15.
ABSTRACT

In this paper, we study simultaneous robust variable selection and parametric component identification in varying coefficient models. The proposed estimator is based on spline approximation and two smoothly clipped absolute deviation (SCAD) penalties through rank regression, and is therefore robust to heavy-tailed errors and outliers in the response. Furthermore, when the tuning parameter is chosen by a modified BIC criterion, we show that the proposed procedure is consistent both in variable selection and in separating the varying and constant coefficients. In addition, the estimators of the varying coefficients attain the optimal convergence rate under some assumptions, and the estimators of the constant coefficients have the same asymptotic distribution as their counterparts obtained when the true model is known. Simulation studies and a real data example assess the finite sample performance of the proposed variable selection procedure.

16.
Abstract

In this article, we propose a new regression method called general composite quantile regression (GCQR), which relaxes the unrealistic finite error variance assumption imposed by the traditional least squares (LS) method. Unlike the recently proposed composite quantile regression (CQR), the GCQR allows any continuous non-uniform density/weight function; as a result, the number of uniform quantile positions need not be determined. Most importantly, the GCQR criterion can be readily transformed into a linear programming problem, which substantially reduces computing time. Our theoretical and empirical results show that the GCQR is generally more efficient than the CQR and LS when the weight function is appropriately chosen. The oracle properties of the penalized GCQR are also provided. Our simulation results are consistent with the derived theoretical findings. A real data example is analyzed to demonstrate our methodology.

17.
A placebo‐controlled randomized clinical trial is required to demonstrate that an experimental treatment is superior to its corresponding placebo on multiple coprimary endpoints. This is particularly true in the field of neurology. In fact, clinical trials for neurological disorders need to show the superiority of an experimental treatment over a placebo in two coprimary endpoints. Unfortunately, these trials often fail to detect a true treatment effect for the experimental treatment versus the placebo owing to an unexpectedly high placebo response rate. Sequential parallel comparison design (SPCD) can be used to address this problem. However, the SPCD has not yet been discussed in relation to clinical trials with coprimary endpoints. In this article, our aim was to develop a hypothesis‐testing method and a method for calculating the corresponding sample size for the SPCD with two coprimary endpoints. In a simulation, we show that the proposed hypothesis‐testing method achieves the nominal type I error rate and power and that the proposed sample size calculation method has adequate power accuracy. In addition, the usefulness of our methods is confirmed by returning to an SPCD trial with a single primary endpoint of Alzheimer disease‐related agitation.

18.
Borrowing historical control data can be an efficient way to improve the treatment effect estimate for the current control group in a randomized clinical trial. When the historical and current control data are consistent, borrowing historical data can increase power and reduce the Type I error rate. When these two sources of data are inconsistent, however, it may result in biased estimates, reduced power, and inflation of the Type I error rate. In some situations, inconsistency between historical and current control data may be caused by systematic variation in measured baseline prognostic factors, which can be appropriately addressed through statistical modeling. In this paper, we propose a Bayesian hierarchical model that incorporates patient-level baseline covariates to enhance the appropriateness of the exchangeability assumption between current and historical control data. The performance of the proposed method is shown through simulation studies, and its application to a clinical trial design for amyotrophic lateral sclerosis is described. The proposed method is developed for scenarios involving multiple imbalanced prognostic factors and thus has meaningful implications for clinical trials evaluating new treatments for heterogeneous diseases such as amyotrophic lateral sclerosis.

19.
The autoregressive (AR) model is a popular method for fitting and prediction with time-dependent data, and selecting an accurate model among candidate orders is a crucial issue. Two commonly used selection criteria are the Akaike information criterion and the Bayesian information criterion. However, the two criteria are known to suffer from potential overfitting and underfitting problems, respectively, so each performs well in some situations but poorly in others. In this paper, we propose a new criterion from the prediction perspective, based on the concept of generalized degrees of freedom, for AR model selection. We derive an approximately unbiased estimator of the mean squared prediction error based on a data perturbation technique for selecting the order parameter, taking into account the estimation uncertainty involved in the modeling procedure. Numerical experiments illustrate the superiority of the proposed method over some commonly used order selection criteria. Finally, the methodology is applied to a real data example, predicting the weekly rate of return on the stock price of Taiwan Semiconductor Manufacturing Company; the results indicate that the proposed method is satisfactory.

20.
The autoregressive model is a popular method for analysing time-dependent data, in which selection of the order parameter is imperative. Two commonly used selection criteria are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which are known to suffer from potential overfitting and underfitting problems, respectively. To our knowledge, no criterion in the literature performs satisfactorily under all situations. In this paper, we therefore focus on forecasting future values of an observed time series and propose an adaptive idea, based on the concept of generalized degrees of freedom, that combines the advantages of AIC and BIC while mitigating their weaknesses. Instead of applying a fixed criterion to select the order parameter, we propose an approximately unbiased estimator of the mean squared prediction error, based on a data perturbation technique, for fairly comparing AIC and BIC, and then use the selected criterion to determine the final order parameter. Numerical experiments show the superiority of the proposed method, and a real data set of the retail price index of China from 1952 to 2008 is analysed for illustration.
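The classical side of this comparison can be sketched concretely: the Levinson-Durbin recursion yields the innovation variance at each AR order from the autocovariances, and AIC or BIC penalizes the resulting fit. To keep the example deterministic, the theoretical autocovariances of an assumed AR(2) process are fed in rather than estimates from simulated data, and n is a nominal sample size for the penalties.

```python
# AR order selection by AIC/BIC via the Levinson-Durbin recursion.
import math

def levinson_variances(r, pmax):
    """r[0..pmax]: autocovariances. Return innovation variance per order."""
    sigma2 = [r[0]]
    a = []  # AR coefficients at the current order
    for p in range(1, pmax + 1):
        acc = r[p] - sum(a[j] * r[p - 1 - j] for j in range(p - 1))
        k = acc / sigma2[-1]                       # reflection coefficient
        a = [a[j] - k * a[p - 2 - j] for j in range(p - 1)] + [k]
        sigma2.append(sigma2[-1] * (1.0 - k * k))
    return sigma2

def select_order(r, pmax, n, penalty):
    sigma2 = levinson_variances(r, pmax)
    scores = [n * math.log(sigma2[p]) + penalty(p, n) for p in range(pmax + 1)]
    return min(range(pmax + 1), key=scores.__getitem__)

# theoretical autocovariances of x_t = 0.5 x_{t-1} + 0.3 x_{t-2} + e_t
phi1, phi2, s2 = 0.5, 0.3, 1.0
rho = [1.0, phi1 / (1.0 - phi2)]
for h in range(2, 7):
    rho.append(phi1 * rho[h - 1] + phi2 * rho[h - 2])
gamma0 = s2 / (1.0 - phi1 * rho[1] - phi2 * rho[2])
r = [gamma0 * rh for rh in rho]

bic = select_order(r, pmax=6, n=200, penalty=lambda p, n: p * math.log(n))
aic = select_order(r, pmax=6, n=200, penalty=lambda p, n: 2.0 * p)
print(bic, aic)
```

With exact AR(2) autocovariances both criteria agree; the over/underfitting tension the abstract describes only emerges once the autocovariances are noisy sample estimates.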


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号