Similar Documents
20 similar documents found.
1.
Model selection methods are important for identifying the best approximating model. To identify the best meaningful model, the purpose of the model should be clearly stated in advance. The focus of this paper is model selection when the modelling purpose is classification. We propose a new model selection approach designed for logistic regression when the main modelling purpose is classification. The method is based on the distance between two clustering trees. We also question and evaluate the performance of conventional model selection methods based on information-theoretic concepts in determining the best logistic regression classifier. An extensive simulation study is used to assess the finite-sample performance of the cluster-tree-based and information-theoretic model selection methods. Simulations are adjusted for whether or not the true model is in the candidate set. Results show that the new approach is highly promising. Finally, the methods are applied to a real data set to select a binary model for classifying subjects with respect to their risk of breast cancer.

2.
In many clinical applications, understanding when measurement of new markers is necessary to provide added accuracy to existing prediction tools could lead to more cost-effective disease management. Many statistical tools for evaluating the incremental value (IncV) of the novel markers over the routine clinical risk factors have been developed in recent years. However, most existing literature focuses primarily on global assessment. Since the IncVs of new markers often vary across subgroups, it would be of great interest to identify subgroups for which the new markers are most/least useful in improving risk prediction. In this paper we provide novel statistical procedures for systematically identifying potential traditional-marker-based subgroups in whom it might be beneficial to apply a new model with measurements of both the novel and traditional markers. We consider various conditional time-dependent accuracy parameters for censored failure-time outcomes to assess the subgroup-specific IncVs. We provide non-parametric kernel-based estimation procedures to calculate the proposed parameters. Simultaneous interval estimation procedures are provided to account for sampling variation and adjust for multiple testing. Simulation studies suggest that our proposed procedures work well in finite samples. The proposed procedures are applied to the Framingham Offspring Study to examine the added value of an inflammation marker, C-reactive protein, on top of the traditional Framingham risk score for predicting 10-year risk of cardiovascular disease.

3.
A marker's capacity to predict risk of a disease depends on disease prevalence in the target population and its classification accuracy, i.e. its ability to discriminate diseased subjects from non-diseased subjects. The latter is often considered an intrinsic property of the marker; it is independent of disease prevalence and hence more likely to be similar across populations than risk prediction measures. In this paper, we are interested in evaluating the population-specific performance of a risk prediction marker in terms of positive predictive value (PPV) and negative predictive value (NPV) at given thresholds, when samples are available from the target population as well as from another population. A default strategy is to estimate PPV and NPV using samples from the target population only. However, when the marker's classification accuracy as characterized by a specific point on the receiver operating characteristic (ROC) curve is similar across populations, borrowing information across populations allows increased efficiency in estimating PPV and NPV. We develop estimators that optimally combine information across populations. We apply this methodology to a cross-sectional study in which we evaluate PCA3 as a risk prediction marker for prostate cancer among subjects with or without a previous negative biopsy.
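
As an illustration of why PPV and NPV are population-specific even when the ROC operating point is shared, here is a minimal sketch (not the paper's combined estimator) of the Bayes-rule relationship between prevalence, one ROC point, and the predictive values; all numbers are hypothetical:

```python
def predictive_values(prevalence, tpr, fpr):
    """PPV and NPV implied by disease prevalence and one ROC point.

    tpr: P(marker positive | diseased); fpr: P(marker positive | non-diseased).
    """
    ppv = prevalence * tpr / (prevalence * tpr + (1 - prevalence) * fpr)
    npv = ((1 - prevalence) * (1 - fpr)
           / ((1 - prevalence) * (1 - fpr) + prevalence * (1 - tpr)))
    return ppv, npv

# Same classification accuracy, two populations with different prevalence.
for rho in (0.05, 0.25):
    print(rho, predictive_values(rho, tpr=0.80, fpr=0.10))
```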

4.
Absolute risk is the probability that a cause-specific event occurs in a given time interval in the presence of competing events. We present methods to estimate population-based absolute risk from a complex survey cohort that can accommodate multiple exposure-specific competing risks. The hazard function for each event type consists of an individualized relative risk multiplied by a baseline hazard function, which is modeled nonparametrically or parametrically with a piecewise exponential model. An influence method is used to derive a Taylor-linearized variance estimate for the absolute risk estimates. We introduce novel measures of the cause-specific influences that can guide modeling choices for the competing event components of the model. To illustrate our methodology, we build and validate cause-specific absolute risk models for cardiovascular and cancer deaths using data from the National Health and Nutrition Examination Survey. Our applications demonstrate the usefulness of survey-based risk prediction models for predicting health outcomes and quantifying the potential impact of disease prevention programs at the population level.
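
To make the hazard structure concrete, the following is a minimal sketch, under a piecewise-exponential model with hypothetical baseline hazards and relative risks, of how a cause-specific absolute risk accounts for competing events; it is not the paper's survey-weighted estimator:

```python
import numpy as np

def absolute_risk(breaks, base_hazards, rel_risks, t):
    """Absolute risk of cause 0 by time t under piecewise-constant
    cause-specific hazards with competing events.

    breaks:       interval endpoints, e.g. [0, 5, 10] (years)
    base_hazards: array (n_causes, n_intervals) of baseline hazards
    rel_risks:    per-cause relative risks for one individual
    """
    hazards = base_hazards * np.asarray(rel_risks)[:, None]  # individualized
    total = hazards.sum(axis=0)                              # all-cause hazard
    risk, cum = 0.0, 0.0                 # cum = cumulative all-cause hazard so far
    for j in range(len(breaks) - 1):
        a, b = breaks[j], min(breaks[j + 1], t)
        if b <= a:
            break
        # integral of lambda_0(u) * exp(-cumulative all-cause hazard) over (a, b)
        risk += hazards[0, j] / total[j] * np.exp(-cum) * (1 - np.exp(-total[j] * (b - a)))
        cum += total[j] * (b - a)
    return risk

# Hypothetical: cause 0 = CVD death, cause 1 = cancer death, two 5-year intervals.
print(absolute_risk([0, 5, 10], np.array([[0.01, 0.02], [0.005, 0.01]]),
                    np.array([1.5, 1.0]), 10))
```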

5.
Markov regression models are useful tools for estimating the impact of risk factors on rates of transition between multiple disease states. Alzheimer's disease (AD) is an example of a multi-state disease process in which great interest lies in identifying risk factors for transition. In this context, non-homogeneous models are required because transition rates change as subjects age. In this report we propose a non-homogeneous Markov regression model that allows for reversible and recurrent disease states, transitions among multiple states between observations, and unequally spaced observation times. We conducted simulation studies to demonstrate the performance of estimators of covariate effects from this model and to compare performance with alternative models, both when the underlying non-homogeneous process was correctly specified and under model misspecification. In the simulation studies, we found that covariate effects were biased if non-homogeneity of the disease process was not accounted for. However, estimates from non-homogeneous models were robust to misspecification of the form of the non-homogeneity. We used our model to estimate risk factors for transition to mild cognitive impairment (MCI) and AD in a longitudinal study of subjects included in the National Alzheimer's Coordinating Center's Uniform Data Set. Using our model, we found that subjects with MCI affecting multiple cognitive domains were significantly less likely to revert to normal cognition.
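
As a sketch of the core computation in such models, the code below builds a toy three-state intensity matrix (states, covariates, and all parameter values are hypothetical, not estimates from the Uniform Data Set) and converts it to transition probabilities over an arbitrary observation gap via the matrix exponential, which is what allows multiple transitions between unequally spaced visits:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical 3-state process: normal (0), MCI (1), AD (2), with reversible
# normal <-> MCI transitions and log-linear covariate effects on intensities.
def intensity_matrix(age, apoe4, beta_age=0.03, beta_apoe=0.5):
    q01 = 0.10 * np.exp(beta_age * (age - 70) + beta_apoe * apoe4)  # normal -> MCI
    q10 = 0.05                                                      # MCI -> normal (reversion)
    q12 = 0.15 * np.exp(beta_age * (age - 70) + beta_apoe * apoe4)  # MCI -> AD
    return np.array([[-q01, q01, 0.0],
                     [q10, -(q10 + q12), q12],
                     [0.0, 0.0, 0.0]])   # AD absorbing in this toy example

# Transition probabilities over an unequally spaced gap of 2.5 years, treating
# the intensity as locally constant (a crude piecewise approximation to the
# non-homogeneous process).
P = expm(intensity_matrix(age=75, apoe4=1) * 2.5)
print(P.round(3))   # row i = P(state at next visit | state i now)
```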

6.
Observation of adverse drug reactions during drug development can cause closure of the whole programme. However, if an association between the genotype and the risk of an adverse event is discovered, then it might suffice to exclude patients of certain genotypes from future recruitment. Various sequential and non‐sequential procedures are available to identify an association between the whole genome, or at least a portion of it, and the incidence of adverse events. In this paper we start with a suspected association between the genotype and the risk of an adverse event and suppose that the genetic subgroups with elevated risk can be identified. Our focus is on determining whether the patients identified as being at risk should be excluded from further studies of the drug. We propose using a utility function to determine the appropriate action, taking into account the relative costs of suffering an adverse reaction and of failing to alleviate the patient's disease. Two illustrative examples are presented, one comparing patients who suffer from an adverse event with contemporary patients who do not, and the other making use of a reference control group. We also illustrate two classification methods, LASSO and CART, for identifying patients at risk, but we stress that any appropriate classification method could be used in conjunction with the proposed utility function. Our emphasis is on determining the action to take rather than on providing definitive evidence of an association.
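
A minimal sketch of the kind of utility trade-off described, with entirely hypothetical probabilities and costs (the paper's actual utility function is not reproduced here):

```python
def expected_utility(p_ae, p_benefit, cost_ae, gain_benefit):
    """Expected utility of treating a patient from a genotype subgroup.

    p_ae:        estimated probability of the adverse reaction in the subgroup
    p_benefit:   probability the drug alleviates the disease
    cost_ae, gain_benefit: relative (dis)utilities; all values are hypothetical.
    """
    return p_benefit * gain_benefit - p_ae * cost_ae

# Recruit the subgroup only if treating beats withholding (utility 0 here).
u = expected_utility(p_ae=0.25, p_benefit=0.60, cost_ae=3.0, gain_benefit=1.0)
print("include subgroup" if u > 0 else "exclude subgroup", round(u, 2))
```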

7.
We have developed a new approach to determine the threshold of a biomarker that maximizes the classification accuracy of a disease. We consider a Bayesian estimation procedure for this purpose and illustrate the method using a real data set. In particular, we determine the thresholds for Apolipoprotein B (ApoB), Apolipoprotein A1 (ApoA1) and their ratio for the classification of myocardial infarction (MI). We first conduct a literature review and construct prior distributions. We then develop classification rules based on the posterior distribution of the location and scale parameters for these biomarkers. We identify the thresholds for ApoB, ApoA1 and the ratio as 0.908 g/L, 1.138 g/L and 0.808, respectively. We also observe that the threshold for disease classification varies substantially across different age and ethnic groups. Next, we identify the most informative predictor for MI among the three biomarkers. Based on this analysis, ApoA1 appeared to be a stronger predictor than ApoB for MI classification. Given that we have used this data set for illustration only, the results will require further investigation before use in clinical applications. However, the approach developed in this article can be used to determine the threshold of any continuous biomarker for a binary disease classification.
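
As a simplified illustration, assuming normal biomarker distributions and plugging in hypothetical posterior-mean parameters (rather than integrating over the full posterior as the article does), a threshold maximizing classification accuracy can be found numerically:

```python
from scipy.stats import norm
from scipy.optimize import minimize_scalar

# Hypothetical posterior means for an ApoB-like biomarker (g/L).
mu_d, sd_d = 1.05, 0.25   # diseased (MI) group
mu_h, sd_h = 0.85, 0.20   # non-diseased group
w = 0.5                   # weight on sensitivity vs specificity (assumed equal)

def accuracy(c):
    sens = 1 - norm.cdf(c, mu_d, sd_d)   # P(marker > c | diseased)
    spec = norm.cdf(c, mu_h, sd_h)       # P(marker <= c | non-diseased)
    return w * sens + (1 - w) * spec

res = minimize_scalar(lambda c: -accuracy(c), bounds=(0.5, 1.5), method="bounded")
print("threshold:", round(res.x, 3), "accuracy:", round(accuracy(res.x), 3))
```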

8.
Differential analysis techniques are commonly used to offer scientists a dimension-reduction procedure and an interpretable gateway to variable selection, especially when confronting high-dimensional genomic data. Huang et al. used a gene expression profile of breast cancer cell lines to identify genomic markers that are highly correlated with in vitro sensitivity of the drug Dasatinib. They considered three statistical methods to identify differentially expressed genes and used the intersection of the resulting lists. However, the statistical methods used in that paper are not sufficient to select the genomic markers. In this paper we used three alternative statistical methods to select a combined list of genomic markers and compared it with the genes proposed by Huang et al. We then proposed using sparse principal component analysis (Sparse PCA) to identify a final list of genomic markers. Sparse PCA takes the correlation among the genes into account and thereby supports successful genomic marker discovery. We present a new, small set of genomic markers that effectively separates the group of patients who are sensitive to the drug Dasatinib from those who are not. The analysis procedure should also encourage scientists to identify genomic markers that help to separate two groups.
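
A minimal sketch of the sparse PCA selection step using scikit-learn's SparsePCA on simulated stand-in data (the real analysis uses the breast cancer expression profiles); genes receiving a nonzero loading on any sparse component form the candidate marker list:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))        # toy stand-in: 40 cell lines x 500 genes

spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
spca.fit(X)

# Genes with a nonzero loading on any sparse component are the candidates.
selected = np.where(np.abs(spca.components_).sum(axis=0) > 0)[0]
print(len(selected), "candidate marker genes:", selected[:10])
```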

9.
The area under the ROC curve (AUC) can be interpreted as the probability that the classification score of a diseased subject is larger than that of a non-diseased subject for a randomly sampled pair of subjects. From the perspective of classification, we want to find a way to separate the two groups as distinctly as possible via the AUC. When the difference between the scores of a marker is small, its impact on classification is less important. Thus, a new diagnostic/classification measure based on a modified area under the ROC curve (mAUC) is proposed. It is defined as a weighted sum of two AUCs, where the AUC associated with the smaller score difference is assigned a lower weight, and vice versa. The mAUC is robust in the sense that it gets larger as the AUC gets larger, as long as the two AUCs are not equal. Moreover, in many diagnostic situations, only a specific range of specificity is of interest. Under normal distributions, we show that if the AUCs of two markers are within similar ranges, the larger mAUC implies the larger partial AUC for a given specificity. This property of the mAUC helps to identify the marker with the higher partial AUC, even when the AUCs are similar. Two nonparametric estimates of the mAUC and their variances are given. We also suggest the use of the mAUC as the objective function for classification, and the use of the gradient Lasso algorithm for classifier construction and marker selection. Application to simulated datasets and real microarray gene expression datasets shows that our method finds a linear classifier with a higher ROC curve than some other existing linear classifiers, especially in the range of low false positive rates.
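
For reference, a minimal sketch of the standard nonparametric (Mann-Whitney) AUC estimate underlying the probability interpretation in the first sentence; the mAUC itself reweights this kind of pairwise comparison, but its exact form is not reproduced here:

```python
import numpy as np

def auc_concordance(scores_d, scores_h):
    """Nonparametric AUC: the proportion of (diseased, non-diseased) pairs in
    which the diseased subject scores higher (ties count 1/2)."""
    d = np.asarray(scores_d)[:, None]
    h = np.asarray(scores_h)[None, :]
    return ((d > h) + 0.5 * (d == h)).mean()

# Toy scores for 5 diseased and 4 non-diseased subjects.
print(auc_concordance([2.1, 3.0, 1.8, 2.6, 3.4], [1.5, 2.0, 1.2, 2.4]))
```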

10.
When evaluating potential interventions for cancer prevention, it is necessary to compare benefits and harms. With new study designs, new statistical approaches may be needed to facilitate this comparison. A case in point arose in a proposed genetic substudy of a randomized trial of tamoxifen versus placebo in asymptomatic women who were at high risk for breast cancer. Although the randomized trial showed that tamoxifen substantially reduced the risk of breast cancer, the harms from tamoxifen were serious and some were life threatening. In hopes of finding a subset of women with inherited risk genes who derive greater benefits from tamoxifen, we proposed a nested case–control study to test some trial subjects for various genes and new statistical methods to extrapolate benefits and harms to the general population. An important design question is whether or not the study should target common low-penetrance genes. Our calculations show that useful results are only likely with rare high-penetrance genes.

11.
Case-control family data are now widely used to examine the role of gene-environment interactions in the etiology of complex diseases. In these types of studies, exposure levels are obtained retrospectively and, frequently, information on most risk factors of interest is available on the probands but not on their relatives. In this work we consider correlated failure time data arising from population-based case-control family studies with missing genotypes of relatives. We present a new method for estimating the age-dependent marginalized hazard function. The proposed technique has two major advantages: (1) it is based on the pseudo full likelihood function rather than a pseudo composite likelihood function, which usually suffers from substantial efficiency loss; (2) the cumulative baseline hazard function is estimated using a two-stage estimator instead of an iterative process. We assess the performance of the proposed methodology with simulation studies, and illustrate its utility on a real data example.

12.
Herpes Simplex Virus Type 2 (HSV-2) facilitates the sexual acquisition and transmission of HIV-1 infection and is highly prevalent in most regions experiencing severe HIV epidemics. In sub-Saharan Africa, where HIV infection is a public health burden, the prevalence of HSV-2 is substantially high. The high prevalence of HSV-2 and the association between HSV-2 infection and HIV-1 acquisition could play a significant role in the spread of HIV-1 in the region. The objective of our study was to identify risk factors for HSV-2 and HIV-1 infections among men in sub-Saharan Africa. We used a joint response model that accommodates the interdependence between the two infections in assessing their risk factors. Simulation studies show the superiority of the joint response model over traditional models that ignore the dependence between the two infections. We found higher odds of having HSV-2/HIV-1 among older men and among men who had multiple sexual partners, abused alcohol, or reported symptoms of sexually transmitted infections. These findings suggest that interventions that identify and control the risk factors of the two infections should be part of HIV-1 prevention programs in sub-Saharan Africa, where antiretroviral therapy is not readily available.

13.
In this paper we present a perspective on the overall process of developing classifiers for real-world classification problems. Specifically, we identify, categorize and discuss the various problem-specific factors that influence the development process. Illustrative examples are provided to demonstrate the iterative nature of the process of applying classification algorithms in practice. In addition, we present a case study of a large-scale classification application using the process framework described, providing an end-to-end example of the iterative nature of the application process. The paper concludes that the process of developing classification applications for operational use involves many factors not normally considered in the typical discussion of classification models and algorithms.

14.
In this paper, we identified risk factors for chronic obstructive pulmonary disease (COPD) and proposed a nomogram for COPD. Data were from the 6th Korean National Health and Nutrition Examination Survey (2013–2015). First, a chi-square test was performed to identify risk factors associated with the incidence of COPD. A nomogram was then constructed using a naïve Bayesian classifier model in order to visualize the risk factors of COPD. The nomogram shows that asthma had the strongest effect on COPD incidence. We additionally compared the Bayesian nomogram with a logistic regression nomogram. Finally, a ROC curve and a calibration plot were used to assess the nomogram.
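
A minimal sketch of how a naïve Bayes nomogram assigns points: each risk-factor level contributes its log likelihood ratio to the log-odds of disease. The conditional probabilities and prevalence below are hypothetical, not the Korean survey estimates:

```python
import numpy as np

def nb_points(p_level_given_case, p_level_given_control):
    """Naive-Bayes nomogram 'points' for one risk-factor level: the log
    likelihood ratio it contributes to the log-odds of COPD."""
    return np.log(p_level_given_case / p_level_given_control)

# Hypothetical conditional probabilities for illustration only.
points = {
    "asthma=yes": nb_points(0.30, 0.05),
    "smoker=yes": nb_points(0.60, 0.35),
    "age>=65":    nb_points(0.50, 0.30),
}
prior_logodds = np.log(0.10 / 0.90)             # assumed 10% prevalence
logodds = prior_logodds + sum(points.values())  # a patient with all three factors
print({k: round(v, 2) for k, v in points.items()},
      "risk:", round(1 / (1 + np.exp(-logodds)), 3))
```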

15.
Coronary artery calcium is a marker of coronary artery disease and measures the progression of atherosclerosis. It is measured by electron beam computed tomography, and the measured amount of coronary artery calcium is highly right-skewed and left-censored. The distribution of coronary artery calcium appears to be Weibull. We propose a Weibull regression model and analyze the data using these techniques. Our analysis is based on data from the Spokane Heart Study, a cohort of about a thousand subjects who are assessed every two years for coronary artery calcium and risk factors of coronary artery disease. The major focus of the heart study is to determine the natural history of atherosclerosis in its early phase, and we analyze the data as a cross-sectional study with 859 subjects. We would also like to highlight the use of Weibull regression techniques in situations like this, where the data are extremely right-skewed. Our main emphasis is on examining the effects on coronary artery calcium of the traditional risk factors of age, gender, lipid profile (cholesterol and HDL), patient history of lipid abnormality, hypertension, and smoking, as well as family history risks. We found that the most important factors influencing the disease were age, sex, and patient history of smoking and lipid abnormality.
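
A minimal sketch, on simulated data, of a Weibull regression likelihood that handles the left censoring (values below a detection limit contribute the CDF rather than the density); the covariate set and all parameter values are illustrative, not the Spokane Heart Study fit:

```python
import numpy as np
from scipy.stats import weibull_min
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 200
age = rng.normal(60, 10, n)
X = np.column_stack([np.ones(n), (age - 60) / 10])   # intercept + scaled age
y = weibull_min.rvs(0.8, scale=np.exp(3.0 + 0.5 * X[:, 1]), random_state=1)
lod = 10.0                        # detection limit -> left censoring
cens = y < lod

def negloglik(theta):
    k = np.exp(theta[0])          # Weibull shape > 0
    scale = np.exp(X @ theta[1:]) # log-linear model for the Weibull scale
    ll = np.where(cens,
                  weibull_min.logcdf(lod, k, scale=scale),  # censored: P(Y < lod)
                  weibull_min.logpdf(y, k, scale=scale))    # observed: density
    return -ll.sum()

fit = minimize(negloglik, x0=np.zeros(3), method="Nelder-Mead")
print("shape:", np.exp(fit.x[0]).round(2), "coefficients:", fit.x[1:].round(2))
```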

16.
In recent years, many vaccines have been developed for the prevention of a variety of diseases. Although the primary objective of vaccination is to prevent disease, vaccination can also reduce the severity of disease in those individuals who develop breakthrough disease. Observations of apparent mitigation of breakthrough disease in vaccine recipients have been reported for a number of vaccine‐preventable diseases such as Herpes Zoster, Influenza, Rotavirus, and Pertussis. The burden‐of‐illness (BOI) score was developed to incorporate the incidence of disease as well as the severity and duration of disease. A severity‐of‐illness score S > 0 is assigned to individuals who develop disease and a score of 0 is assigned to uninfected individuals. In this article, we derive the vaccine efficacy statistic (which is the standard statistic for presenting efficacy outcomes in vaccine clinical trials) based on BOI scores, and we extend the method to adjust for baseline covariates. Also, we illustrate it with data from a clinical trial in which the efficacy of a Herpes Zoster vaccine was evaluated.
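
A minimal sketch of a BOI-based efficacy computation in one simple form, one minus the ratio of mean BOI scores between arms with uninfected subjects scored 0; this is an assumed simplification, and the article's derivation and covariate adjustment are not reproduced. The data are toy numbers:

```python
import numpy as np

def ve_boi(scores_vaccine, scores_placebo):
    """Burden-of-illness vaccine efficacy: 1 minus the ratio of mean BOI
    scores, where uninfected subjects contribute a score of 0."""
    return 1 - np.mean(scores_vaccine) / np.mean(scores_placebo)

# Toy trial: most subjects score 0 (no disease); breakthrough cases in the
# vaccine arm tend to have milder (lower) severity scores.
vacc = np.array([0] * 95 + [2, 3, 1, 2, 4], dtype=float)
plac = np.array([0] * 88 + [5, 7, 4, 6, 8, 5, 9, 6, 7, 4, 6, 5], dtype=float)
print(round(ve_boi(vacc, plac), 3))   # reflects both incidence and severity
```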

17.
Familial aggregation studies seek to identify diseases that cluster in families. These studies are often carried out as a first step in the search for hereditary factors affecting the risk of disease. It is necessary to account for age at disease onset to avoid potential misclassification of family members who are disease-free at the time of study participation or who die before developing disease. This is especially true for late-onset diseases, such as prostate cancer or Alzheimer's disease. We propose a discrete time model that accounts for the age at disease onset and allows the familial association to vary with age and to be modified by covariates, such as pedigree relationship. The parameters of the model have interpretations as conditional log-odds and log-odds ratios, which can be viewed as discrete time conditional cross hazard ratios. These interpretations are appealing for cancer risk assessment. Properties of this model are explored in simulation studies, and the method is applied to a large family study of cancer conducted by the National Cancer Institute-sponsored Cancer Genetics Network (CGN).

18.
Statistical agencies that own different databases on overlapping subjects can benefit greatly from combining their data. These benefits are passed on to secondary data analysts when the combined data are disseminated to the public. Sometimes combining data across agencies or sharing these data with the public is not possible: one or both of these actions may break promises of confidentiality that have been given to data subjects. We describe an approach that is based on two stages of multiple imputation that facilitates data sharing and dissemination under restrictions of confidentiality. We present new inferential methods that properly account for the uncertainty that is caused by the two stages of imputation. We illustrate the approach by using artificial and genuine data.

19.
To evaluate the clinical utility of new risk markers, a crucial step is to measure their predictive accuracy in prospective studies. However, it is often infeasible to obtain marker values for all study participants. The nested case-control (NCC) design is a useful cost-effective strategy for such settings. Under the NCC design, markers are only ascertained for cases and a fraction of controls sampled randomly from the risk sets. The outcome-dependent sampling generates a complex data structure and therefore poses a challenge for analysis. Existing methods for analyzing NCC studies focus primarily on association measures. Here, we propose a class of non-parametric estimators for commonly used accuracy measures. We derive asymptotic expansions for the accuracy estimators based on both finite-population and Bernoulli sampling and establish asymptotic equivalence between the two. Simulation results suggest that the proposed procedures perform well in finite samples. The new procedures are illustrated with data from the Framingham Offspring Study.
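
As a simplified illustration of handling the outcome-dependent sampling (inverse-probability weighting, not the authors' nonparametric estimators), accuracy measures can be estimated by weighting sampled controls inversely to their known inclusion probabilities:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
case = rng.binomial(1, 0.1, n)
marker = rng.normal(case, 1.0)               # cases score higher on average
p_incl = np.where(case == 1, 1.0, 0.2)       # all cases, 20% of controls measured
sampled = rng.binomial(1, p_incl)

def ipw_tpr_fpr(cutoff):
    """TPR/FPR with measured subjects weighted by 1 / inclusion probability."""
    w = sampled / p_incl
    pos = (marker >= cutoff) & (sampled == 1)   # marker used only where measured
    tpr = np.sum(w * pos * case) / np.sum(w * case)
    fpr = np.sum(w * pos * (1 - case)) / np.sum(w * (1 - case))
    return tpr, fpr

print(ipw_tpr_fpr(0.5))
```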

20.
Assessing the absolute risk for a future disease event in presently healthy individuals has an important role in the primary prevention of cardiovascular diseases (CVD) and other chronic conditions. In this paper, we study the use of non-parametric Bayesian hazard regression techniques and posterior predictive inferences in the risk assessment task. We generalize our previously published Bayesian multivariate monotonic regression procedure to a survival analysis setting, combined with a computationally efficient estimation procedure utilizing case-base sampling. To achieve parsimony in the model fit, we allow for multidimensional relationships within specified subsets of risk factors, determined either on an a priori basis or as part of the estimation procedure. We apply the proposed methods for 10-year CVD risk assessment in a Finnish population.
