Similar Articles
 20 similar articles found (search time: 78 ms)
1.
Threshold selection for regional peaks-over-threshold data   (Cited by: 1; self-citations: 0; others: 1)
A hurdle in the peaks-over-threshold approach for analyzing extreme values is the selection of the threshold. A method is developed to reduce this obstacle in the presence of multiple, similar data samples, as is the case in many environmental applications. The idea is to combine threshold selection methods into a regional method. Regionalized versions of the threshold stability and the mean excess plot are presented as graphical tools for threshold selection. Moreover, quantitative approaches based on the bootstrap distribution of the spatially averaged Kolmogorov–Smirnov and Anderson–Darling test statistics are introduced. It is demonstrated that the proposed regional method is more sensitive to thresholds that are set too low, compared to methods that do not take the regional information into account. The approach can be used with a wide range of univariate threshold selection methods. We test the methods using simulated data and present an application to rainfall data from the Dutch water board Vallei en Veluwe.
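As a toy illustration of one of the graphical tools mentioned above, the sketch below computes a regionalized mean excess plot by averaging the per-site mean excess functions e(u) = E[X − u | X > u] over a common threshold grid. All data, sites, and the grid are invented for illustration; this is a minimal sketch, not the paper's implementation.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Ten hypothetical rainfall-like samples, one per site in the region
sites = [rng.gumbel(loc=20, scale=5, size=500) for _ in range(10)]

def mean_excess(x, u):
    exc = x[x > u] - u
    return exc.mean() if exc.size else np.nan

thresholds = np.linspace(20, 40, 50)
# Regional mean excess: average the per-site curves at each threshold
regional_me = [np.nanmean([mean_excess(x, u) for x in sites]) for u in thresholds]

plt.plot(thresholds, regional_me)          # roughly linear above a valid threshold
plt.xlabel("threshold u")
plt.ylabel("regional mean excess")
plt.show()
```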

2.
Abstract

Measuring the accuracy of diagnostic tests is crucial in many application areas, including medicine, machine learning and credit scoring. The receiver operating characteristic (ROC) curve and surface are useful tools to assess the ability of diagnostic tests to discriminate between ordered classes or groups. Defining these diagnostic tests requires selecting the optimal thresholds that maximize their accuracy. One procedure commonly used to find the optimal thresholds is maximizing what is known as Youden's index. This article presents nonparametric predictive inference (NPI) for selecting the optimal thresholds of a diagnostic test. NPI is a frequentist statistical method that is explicitly aimed at using few modeling assumptions, enabled through the use of lower and upper probabilities to quantify uncertainty. Based on multiple future observations, the NPI approach is presented for selecting the optimal thresholds in two-group and three-group scenarios, and a pairwise approach is also presented for the three-group scenario. The article ends with an example illustrating the proposed methods and a simulation study of their predictive performance alongside classical methods such as Youden's index. The NPI-based methods show some interesting results that overcome some of the issues concerning the predictive performance of Youden's index.
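For reference, a minimal sketch of the classical Youden's index baseline for the two-group case: J(c) = sensitivity(c) + specificity(c) − 1 is maximized over candidate thresholds. The NPI variant in the paper works with lower and upper probabilities and is not reproduced here; the simulated scores below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
healthy = rng.normal(0.0, 1.0, 200)   # hypothetical test scores, non-diseased
diseased = rng.normal(1.5, 1.0, 200)  # hypothetical test scores, diseased

candidates = np.unique(np.concatenate([healthy, diseased]))
# Youden's J at threshold c: sensitivity(c) + specificity(c) - 1
sens = np.array([(diseased > c).mean() for c in candidates])
spec = np.array([(healthy <= c).mean() for c in candidates])
j = sens + spec - 1
c_opt = candidates[np.argmax(j)]
print(f"optimal threshold {c_opt:.3f}, J = {j.max():.3f}")
```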

3.
A threshold autoregressive (TAR) model is an important class of nonlinear time series models possessing many desirable features, such as asymmetric limit cycles and amplitude-dependent frequencies. However, statistical inference for the TAR model encounters a major difficulty in the estimation of thresholds. This article develops an efficient procedure to estimate the thresholds. The procedure first transforms multiple-threshold detection into a regression variable selection problem, and then employs a group orthogonal greedy algorithm to obtain the threshold estimates. Desirable theoretical results are derived to support the proposed methodology. Simulation experiments illustrate the empirical performance of the method, and an application to U.S. GNP data is investigated.
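To fix ideas, the sketch below estimates a single threshold in a two-regime TAR(1) model by conditional least squares over a grid of candidates; the paper's group orthogonal greedy algorithm for multiple thresholds is considerably more elaborate. The simulated SETAR data and grid bounds are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r_true = 500, 0.0
y = np.zeros(n)
for t in range(1, n):                     # simulate a two-regime SETAR(1) series
    phi = 0.6 if y[t - 1] <= r_true else -0.5
    y[t] = phi * y[t - 1] + rng.normal()

lag, resp = y[:-1], y[1:]

def sse(r):
    total = 0.0
    for m in (lag <= r, lag > r):         # fit AR(1) by least squares per regime
        x, z = lag[m], resp[m]
        phi_hat = (x @ z) / (x @ x)
        total += ((z - phi_hat * x) ** 2).sum()
    return total

grid = np.quantile(lag, np.linspace(0.15, 0.85, 200))  # avoid sparse regimes
r_hat = grid[np.argmin([sse(r) for r in grid])]
print("estimated threshold:", r_hat)
```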

4.
In longitudinal studies of biomarkers, an outcome of interest is the time at which a biomarker reaches a particular threshold. The CD4 count is a widely used marker of human immunodeficiency virus progression. Because of the inherent variability of this marker, a single CD4 count below a relevant threshold should be interpreted with caution. Several studies have applied persistence criteria, designating the outcome as the time to the occurrence of two consecutive measurements below the threshold. In this paper, we propose a method to estimate the time to attainment of two consecutive CD4 counts below a meaningful threshold, which takes into account the patient-specific trajectory and measurement error. An expression for the expected time to threshold is presented, as a function of the fixed effects, random effects and residual variance. We present an application to human immunodeficiency virus-positive individuals from a seroprevalent cohort in Durban, South Africa. Two thresholds are examined, and 95% bootstrap confidence intervals are presented for the estimated time to threshold. Sensitivity analysis revealed that results are robust to truncation of the series and to variation in the number of visits considered for most patients. Caution should be exercised when interpreting the estimated times for patients who exhibit very slow rates of decline and for patients who have fewer than three measurements. We also discuss the relevance of the methodology to the study of other diseases and present such applications. We demonstrate that the proposed method is computationally efficient and offers more flexibility than existing frameworks.
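The "two consecutive measurements below a threshold" outcome can be made concrete by simulation: generate a linear biomarker trajectory plus measurement error on a visit grid and record when two consecutive observations first fall below the threshold. The paper derives an analytic expression instead; all parameter values below (intercept, slope, error SD, visit schedule) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1 = 25.0, -0.15      # sqrt-CD4-like intercept and slope (hypothetical)
sigma_e = 1.5                   # residual (measurement error) SD
visits = np.arange(0, 120, 3)   # visits every 3 months over 10 years
threshold = 14.1                # e.g. sqrt(200) for a CD4 count of 200

def time_to_two_consecutive(n_sim=10_000):
    times = np.full(n_sim, np.nan)
    for i in range(n_sim):
        obs = beta0 + beta1 * visits + rng.normal(0, sigma_e, visits.size)
        below = obs < threshold
        hits = np.flatnonzero(below[:-1] & below[1:])
        if hits.size:
            times[i] = visits[hits[0] + 1]  # time of the second low measurement
    # mean over series that reach the threshold within the follow-up window
    return np.nanmean(times)

print("expected time to threshold (months):", time_to_two_consecutive())
```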

5.
With the advancement of technologies such as genomic sequencing, predictive biomarkers have become a useful tool for the development of personalized medicine. Predictive biomarkers can be used to select the subsets of patients most likely to benefit from a treatment. A number of approaches for subgroup identification have been proposed in recent years. Although overviews of subgroup identification methods are available, systematic comparisons of their performance in simulation studies are rare. Interaction trees (IT), model-based recursive partitioning, subgroup identification based on differential effect, the simultaneous threshold interaction modeling algorithm (STIMA), and adaptive refinement by directed peeling have been proposed for subgroup identification. We compared these methods in a simulation study using a structured approach. In order to identify a target population for subsequent trials, a selection of the identified subgroups is needed. We therefore propose a subgroup criterion leading to a target subgroup consisting of the identified subgroups with an estimated treatment difference no less than a pre-specified threshold. In our simulation study, we evaluated these methods using measures for binary classification, such as sensitivity and specificity. In settings with large effects or huge sample sizes, most methods perform well. For more realistic settings in drug development involving data from a single trial only, however, none of the methods seems suitable for selecting a target population. Using the subgroup criterion as an alternative to the proposed pruning procedures, STIMA and IT can improve their performance in some settings. The methods and the subgroup criterion are illustrated by an application in amyotrophic lateral sclerosis.
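A minimal sketch of the kind of subgroup criterion described above: keep the identified subgroups whose estimated treatment difference is at least a pre-specified threshold, and take their union as the target population. The subgroup records, rules, and threshold value are all invented for illustration.

```python
# Hypothetical output of a subgroup identification method
subgroups = [
    {"rule": "age <= 55", "effect": 0.8},
    {"rule": "biomarker > 1.2", "effect": 2.1},
    {"rule": "sex == male", "effect": 1.6},
]
delta = 1.5  # pre-specified minimal clinically relevant treatment difference
target = [s for s in subgroups if s["effect"] >= delta]
print("target population:", " OR ".join(s["rule"] for s in target))
```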

6.
Statistical learning is emerging as a promising field in which a number of algorithms from machine learning are interpreted as statistical methods, and vice versa. Owing to its good practical performance, boosting is one of the most studied machine learning techniques. We propose algorithms for multivariate density estimation and classification, generated by using traditional kernel techniques as weak learners in boosting algorithms. Our algorithms take the form of multistep estimators whose first step is a standard kernel method. Some strategies for bandwidth selection are also discussed, with regard both to the standard kernel density classification problem and to our 'boosted' kernel methods. Extensive experiments, using real and simulated data, show encouraging practical relevance of the findings. Standard kernel methods are often outperformed by the first boosting iterations, and across several bandwidth values. In addition, the practical effectiveness of our classification algorithm is confirmed by a comparative study on two real datasets, the competitors being tree-based methods, including AdaBoost with trees.

7.
Abstract

Research involving administrative healthcare data to study patient outcomes requires the investigator to account for the patient's disease burden in order to reduce the potential for biased results. Here we develop a comorbidity summary score based on variable importance measures derived from several statistical and machine learning methods, and show that it has superior predictive performance to the Elixhauser and Charlson indices when used to predict 1-year, 5-year, and 10-year mortality. We used two large Veterans Administration cohorts to develop and validate the summary score, and compared predictive performance using the area under the ROC curve (AUC) and the Brier score.
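A minimal sketch of the two performance measures used in the comparison above, computed with scikit-learn; the predicted risks and mortality outcomes are simulated for illustration and do not reflect the paper's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(4)
risk = rng.uniform(0, 1, 1000)    # hypothetical predicted 1-year mortality risks
death = rng.binomial(1, risk)     # outcomes generated consistently with the risks

print("AUC:  ", roc_auc_score(death, risk))       # discrimination
print("Brier:", brier_score_loss(death, risk))    # calibration + sharpness
```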

8.
Prediction error is critical for assessing model fit and evaluating model prediction. We propose cross-validation (CV) and approximated CV methods for estimating prediction error under the Bregman divergence (BD), which embeds nearly all of the loss functions commonly used in the regression, classification and machine learning literature. The approximated CV formulas are derived analytically, which facilitates fast estimation of prediction error under BD. We then study a data-driven optimal bandwidth selector for local-likelihood estimation that minimizes the overall prediction error, or equivalently the covariance penalty. It is shown that the covariance penalty and CV methods converge to the same mean prediction error criterion. We also propose a lower-bound scheme for computing local logistic regression estimates and demonstrate that the algorithm monotonically enhances the target local likelihood and converges. The idea and methods are extended to generalized varying-coefficient models and additive models.
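A minimal sketch of K-fold cross-validated prediction error under a Bregman divergence, BD_q(y, μ) = q(y) − q(μ) − q′(μ)(y − μ). With q(m) = m² the divergence reduces to squared error; the Bernoulli deviance is another member of the family. Here ridge regression stands in for the paper's local-likelihood estimator, and all data are simulated.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def bregman(y, mu, q=lambda m: m**2, dq=lambda m: 2*m):
    # BD_q(y, mu) = q(y) - q(mu) - q'(mu)(y - mu); squared error for q(m)=m^2
    return q(y) - q(mu) - dq(mu) * (y - mu)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=200)

errs = []
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    mu = Ridge(alpha=1.0).fit(X[tr], y[tr]).predict(X[te])
    errs.append(bregman(y[te], mu).mean())
print("CV prediction error under BD:", np.mean(errs))
```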

9.
Software packages usually report the results of statistical tests using p-values. Users often interpret these values by comparing them with standard thresholds, for example 0.1%, 1%, and 5%, which is sometimes reinforced by a star rating (***, **, and *, respectively). We consider an arbitrary statistical test whose p-value p is not available explicitly, but can be approximated by Monte Carlo samples, for example by bootstrap or permutation tests. The standard implementation of such tests usually draws a fixed number of samples to approximate p. However, the probability that the exact and the approximated p-value lie on different sides of a threshold (the resampling risk) can be high, particularly for p-values close to a threshold. We present a method to overcome this. We consider a finite set of user-specified intervals that cover [0, 1] and that can be overlapping. We call these p-value buckets. We present algorithms that, with arbitrarily high probability, return a p-value bucket containing p. We prove that overlapping buckets need to be employed for both a bounded resampling risk and a finite runtime, and that our methods both bound the resampling risk and guarantee a finite runtime for such overlapping buckets. To interpret decisions with overlapping buckets, we propose an extension of the star rating system. We demonstrate that our methods are suitable for use in standard software, including for the low p-value thresholds occurring in multiple testing settings, and that they can be computationally more efficient than standard implementations.
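A minimal sketch of the setting: a permutation p-value approximated from a fixed number of Monte Carlo samples, then mapped to (possibly overlapping) p-value buckets. The paper's sequential algorithms bound the resampling risk; a fixed sample size, as used here, does not. The data, bucket boundaries, and tags are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
x, y = rng.normal(0.3, 1, 30), rng.normal(0.0, 1, 30)
obs = x.mean() - y.mean()
pooled = np.concatenate([x, y])

n_mc = 10_000
stats = np.empty(n_mc)
for b in range(n_mc):                       # one-sided permutation test
    perm = rng.permutation(pooled)
    stats[b] = perm[:30].mean() - perm[30:].mean()
p_hat = (1 + np.sum(stats >= obs)) / (n_mc + 1)

# Overlapping p-value buckets (illustrative), each with an interpretation tag
buckets = [(0.0, 0.01, "**"), (0.008, 0.05, "*~"), (0.04, 1.0, "ns")]
print("p_hat =", p_hat, [tag for lo, hi, tag in buckets if lo <= p_hat <= hi])
```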

10.
We study the design of multi-armed parallel group clinical trials to estimate personalized treatment rules that identify the best treatment for a given patient with given covariates. Assuming that the outcomes in each treatment arm are given by a homoscedastic linear model, with possibly different variances between treatment arms, and that the trial subjects form a random sample from an unselected overall population, we optimize the (possibly randomized) treatment allocation, allowing the allocation rates to depend on the covariates. We find that, for the case of two treatments, the approximately optimal allocation rule does not depend on the value of the covariates but only on the variances of the responses. In contrast, for the case of three treatments or more, the optimal treatment allocation does depend on the values of the covariates as well as the true regression coefficients. The methods are illustrated with a recently published dietary clinical trial.
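The two-treatment result has a familiar flavor: when the allocation depends only on the response variances, a classical Neyman-type rule that allocates proportionally to the response standard deviations minimizes the variance of the estimated treatment difference. The sketch below illustrates that rule under assumed variances; it is consistent with, but not taken from, the paper's derivation.

```python
import numpy as np

sigma = np.array([1.0, 2.0])   # response SDs in arms A and B (hypothetical)
# Neyman-type allocation: patient fractions proportional to the SDs
alloc = sigma / sigma.sum()
print(dict(zip("AB", np.round(alloc, 3))))  # {'A': 0.333, 'B': 0.667}
```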

11.
This paper examines the use of a residual bootstrap for bias correction in machine learning regression methods. Accounting for bias is an important challenge in recent efforts to develop statistical inference for machine learning. We demonstrate empirically that the proposed bootstrap bias correction can lead to substantial improvements in both bias and predictive accuracy. In the context of ensembles of trees, we show that this correction can be approximated at only double the cost of training the original ensemble. Our method is shown to improve test set accuracy over random forests by up to 70% on example problems from the UCI repository.
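A minimal sketch of residual-bootstrap bias correction for a regression learner: refit the model on bootstrap responses formed from fitted values plus resampled residuals, average the apparent bias of the refits, and subtract it. A random forest stands in for the ensemble; the data, model settings, and number of replicates are invented, and the paper's cheaper approximation is not reproduced.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, size=(300, 3))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(0, 0.3, 300)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
fit = model.predict(X)
resid = y - fit

B, bias = 20, np.zeros_like(fit)
for _ in range(B):
    # Bootstrap responses: fitted values plus resampled residuals
    y_star = fit + rng.choice(resid, size=resid.size, replace=True)
    m_star = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y_star)
    bias += m_star.predict(X) - fit
corrected = fit - bias / B   # bias-corrected fitted values
```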

12.
Neural networks are a popular machine learning tool, particularly in applications such as protein structure prediction; however, overfitting can pose an obstacle to their effective use. Due to the large number of parameters in a typical neural network, one may obtain a network that fits the learning data perfectly yet fails to generalize to other data sets. One way of reducing the size of the parameter space is to alter the network topology so that some edges are removed; however, it is often not immediately apparent which edges should be eliminated. We propose a data-adaptive method of selecting an optimal network architecture using a deletion/substitution/addition algorithm. Results of this approach to classification are presented on simulated data and on the breast cancer data of Wolberg and Mangasarian [1990. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc. Nat. Acad. Sci. 87, 9193–9196].

13.
Feed-forward neural networks—also known as multi-layer perceptrons—are now widely used for regression and classification. In parallel, but slightly earlier, a family of methods for flexible regression and discrimination was developed in multivariate statistics, and tree-induction methods have been developed in both machine learning and statistics. We expound and compare these approaches in the context of a number of examples.

14.
Measuring the accuracy of diagnostic tests is crucial in many application areas, including medicine, machine learning, and credit scoring. The receiver operating characteristic (ROC) surface is a useful tool to assess the ability of a diagnostic test to discriminate among three ordered classes or groups. In this article, nonparametric predictive inference (NPI) for three-group ROC analysis for ordinal outcomes is presented. NPI is a frequentist statistical method that is explicitly aimed at using few modeling assumptions, enabled through the use of lower and upper probabilities to quantify uncertainty. This article also includes results on the volumes under the ROC surfaces and consideration of the choice of decision thresholds for the diagnosis. Two examples are provided to illustrate our method.
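A minimal sketch of the empirical volume under the ROC surface (VUS) for three ordered groups: the proportion of triples, one score per group, that are correctly ordered. Scores are simulated for illustration; the paper's NPI version works with lower and upper probabilities instead of this plug-in estimate.

```python
import numpy as np

rng = np.random.default_rng(8)
g1 = rng.normal(0.0, 1, 60)   # e.g. healthy
g2 = rng.normal(1.0, 1, 60)   # mild
g3 = rng.normal(2.0, 1, 60)   # severe

# Empirical VUS: P(X1 < X2 < X3) over all cross-group triples
ordered = (g1[:, None, None] < g2[None, :, None]) & (g2[None, :, None] < g3[None, None, :])
vus = ordered.mean()
print("empirical VUS:", vus)  # 1/6 would be chance level for three groups
```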

15.
Feed-forward neural networks—also known as multi-layer perceptrons—are now widely used for regression and classification. In parallel, but slightly earlier, a family of methods for flexible regression and discrimination was developed in multivariate statistics, and tree-induction methods have been developed in both machine learning and statistics. We expound and compare these approaches in the context of a number of examples.

16.
Stochastic Models, 2013, 29(2–3): 695–724
Abstract

We consider two variants of a two-station tandem network with blocking. In both variants the first server ceases to work when the queue length at the second station hits a 'blocking threshold.' In addition, in variant 2 the first server decreases its service rate when the second queue exceeds a 'slow-down threshold,' which is smaller than the blocking level. In both variants the arrival process is Poisson and the service times at both stations are exponentially distributed. Note, however, that in the case of slow-downs, server 1 works at a high rate, a slow rate, or not at all, depending on whether the second queue is below the slow-down threshold, above it, or at the blocking threshold, respectively. For variant 1, i.e., blocking only, we concentrate on the geometric decay rate of the number of jobs in the first buffer and prove that for increasing blocking thresholds the sequence of decay rates decreases monotonically, and at least geometrically fast, to max{ρ1, ρ2}, where ρi is the load at server i. The methods used in the proof also allow us to clarify the asymptotic queue length distribution at the second station. We then generalize the analysis to variant 2, i.e., slow-down and blocking, and establish analogous results.
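The geometric decay can be seen numerically: the sketch below simulates the blocking-only variant as a uniformized Markov chain and reads off the decay rate of the first-buffer queue length from the tail of its empirical distribution. The rates, blocking threshold, and tail-fitting range are invented for illustration; the paper's results are analytic.

```python
import numpy as np

rng = np.random.default_rng(9)
lam, mu1, mu2, K = 0.7, 1.0, 0.9, 10   # arrival rate, service rates, blocking threshold
LAM = lam + mu1 + mu2                   # uniformization constant
q1 = q2 = 0
counts = np.zeros(200)

for _ in range(1_000_000):
    u = rng.uniform(0, LAM)
    if u < lam:                          # Poisson arrival to station 1
        q1 += 1
    elif u < lam + mu1:                  # potential service completion at station 1
        if q1 > 0 and q2 < K:            # server 1 idles when empty or blocked
            q1 -= 1
            q2 += 1
    elif q2 > 0:                         # potential service completion at station 2
        q2 -= 1
    counts[min(q1, 199)] += 1            # uniformized chain shares the stationary law

p = counts / counts.sum()
ks = np.arange(10, 40)                   # fit the geometric tail over this range
mask = p[ks] > 0
slope = np.polyfit(ks[mask], np.log(p[ks][mask]), 1)[0]
print("estimated decay rate:", np.exp(slope))  # compare with max(rho1, rho2) ~ 0.78
```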

17.
Parameter estimation of the generalized Pareto distribution—Part II   (Cited by: 1; self-citations: 0; others: 1)
This is the second part of a paper reviewing methods for estimating the parameters of the generalized Pareto distribution (GPD). The GPD is a very important distribution in the extreme value context, commonly used for modeling observations that exceed very high thresholds. The ultimate success of the GPD in applications evidently depends on the parameter estimation process. Quite a few methods exist in the literature for estimating the GPD parameters. Estimation procedures such as maximum likelihood (ML), the method of moments (MOM) and the probability-weighted moments (PWM) method were described in Part I of the paper. Here we continue the review with methods that are robust and procedures that use the Bayesian methodology. As in Part I, we focus on those that are relatively simple and straightforward to apply to real-world data.
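For context, a minimal sketch of the ML procedure from Part I using SciPy: excesses over a high threshold are fitted with a GPD whose location is fixed at zero. The data and threshold choice are invented for illustration.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(10)
x = rng.pareto(3.0, 5000)              # heavy-tailed sample
u = np.quantile(x, 0.95)               # high threshold
exc = x[x > u] - u                     # peaks over threshold
shape, _, scale = genpareto.fit(exc, floc=0)  # ML estimates of (xi, sigma)
print(f"xi = {shape:.3f}, sigma = {scale:.3f}")
```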

18.
This paper presents a new Metropolis-adjusted Langevin algorithm (MALA) that uses convex analysis to simulate efficiently from high-dimensional densities that are log-concave, a class of probability distributions that is widely used in modern high-dimensional statistics and data analysis. The method is based on a new first-order approximation for Langevin diffusions that exploits log-concavity to construct Markov chains with favourable convergence properties. This approximation is closely related to Moreau–Yosida regularisations for convex functions and uses proximity mappings instead of gradient mappings to approximate the continuous-time process. The proposed method complements existing MALA methods in two ways. First, the method is shown to have very robust stability properties and to converge geometrically for many target densities for which other MALA methods are not geometrically ergodic, or are so only if the step size is sufficiently small. Second, the method can be applied to high-dimensional target densities that are not continuously differentiable, a class of distributions that is increasingly used in image processing and machine learning and that is beyond the scope of existing MALA and HMC algorithms. To use this method it is necessary to compute, or to approximate efficiently, the proximity mappings of the logarithm of the target density. For several popular models, including many Bayesian models used in modern signal and image processing and machine learning, this can be achieved with convex optimisation algorithms and with approximations based on proximal splitting techniques, which can be implemented in parallel. The proposed method is demonstrated on two challenging high-dimensional and non-differentiable models, related to image resolution enhancement and low-rank matrix estimation, that are not well addressed by existing MCMC methodology.
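A minimal sketch of a proximal MALA step for the non-differentiable log-concave target π(x) ∝ exp(−λ‖x‖₁), whose proximity mapping is soft thresholding: the proposal mean is the prox of g = −log π rather than a gradient step. Dimension, step size, and iteration count are invented, and no tuning is attempted; this illustrates the use of proximity mappings, not the paper's full algorithm.

```python
import numpy as np

rng = np.random.default_rng(11)
lam, delta, d = 1.0, 0.05, 50          # scale, step size, dimension (hypothetical)

def g(x):                              # g = -log pi, up to an additive constant
    return lam * np.abs(x).sum()

def prox_g(x, t):                      # prox_{t g}(x): soft thresholding
    return np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)

def log_q(xp, x):                      # log density of N(xp; prox_g(x, delta/2), delta I)
    m = prox_g(x, delta / 2)
    return -np.sum((xp - m) ** 2) / (2 * delta)

x, accept = rng.normal(size=d), 0
for _ in range(5000):
    prop = prox_g(x, delta / 2) + np.sqrt(delta) * rng.normal(size=d)
    log_a = (g(x) - g(prop)) + log_q(x, prop) - log_q(prop, x)
    if np.log(rng.uniform()) < log_a:  # Metropolis-Hastings correction
        x, accept = prop, accept + 1
print("acceptance rate:", accept / 5000)
```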

19.
Ordinal classification is an important area in statistical machine learning, where labels exhibit a natural order. One of the major goals in ordinal classification is to correctly predict the relative order of instances. We develop a novel concordance-based approach to ordinal classification, where a concordance function is introduced and a penalized smoothed method for optimization is designed. Variable selection using the L1 penalty is incorporated for sparsity considerations. Within the set of classification rules that maximize the concordance function, we find optimal thresholds to predict labels by minimizing a loss function. After building the classifier, we derive nonparametric estimation of class conditional probabilities. The asymptotic properties of the estimators as well as the variable selection consistency are established. Extensive simulations and real data applications show the robustness and advantage of the proposed method in terms of classification accuracy, compared with other existing methods.
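A minimal sketch of an empirical concordance function for ordinal labels: the fraction of pairs with strictly ordered labels whose scores are ordered the same way. Labels and the scoring function are simulated for illustration; the paper maximizes a smoothed, penalized version of this quantity rather than evaluating it directly.

```python
import numpy as np

rng = np.random.default_rng(12)
y = rng.integers(1, 4, 300)                 # ordinal labels 1 < 2 < 3
score = y + rng.normal(0, 1.0, 300)         # hypothetical scoring function

i, j = np.triu_indices(y.size, k=1)         # all unordered index pairs
mask = y[i] != y[j]                         # keep pairs with distinct labels
lo = np.where(y[i] < y[j], i, j)[mask]      # index with the smaller label
hi = np.where(y[i] < y[j], j, i)[mask]      # index with the larger label
concordance = np.mean(score[lo] < score[hi])
print("empirical concordance:", concordance)
```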

20.
Various methods for clustering mixed-mode data are compared. It is found that a method based on a finite mixture model, in which the observed categorical variables are generated from underlying continuous variables, outperforms more conventional methods when applied to artificially generated data. This method also performs best when applied to Fisher's iris data, in which two of the variables are categorized by applying thresholds.

