Calculation of a confidence interval for intraclass correlation to assess inter-rater reliability is problematic when the number of raters is small and the rater effect is not negligible. Intervals produced by existing methods are uninformative: the lower bound is often close to zero, even in cases where the reliability is good and the sample size is large. In this paper, we show that this problem is unavoidable without extra assumptions and we propose two new approaches. The first approach assumes that the raters are sufficiently trained and is related to a sensitivity analysis. The second approach is based on a model with fixed rater effect. Using either approach, we obtain conservative and informative confidence intervals even from samples with only two raters. We illustrate our point with data on the development of neuromotor functions in children and adolescents.
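To make the role of the rater effect concrete, here is a minimal sketch of two standard single-rating intraclass correlation point estimates computed from a two-way ANOVA decomposition. It is not the interval method of the paper above: the function name `icc_two_way`, the toy data, and the choice of the Shrout and Fleiss forms ICC(2,1) (absolute agreement, random raters) and ICC(3,1) (consistency, fixed raters) are assumptions made for illustration only.

```python
def icc_two_way(ratings):
    """Return (ICC(2,1), ICC(3,1)) for a subjects-by-raters table of scores.

    Assumes a complete table: every rater scores every subject exactly once.
    """
    n = len(ratings)        # number of subjects (rows)
    k = len(ratings[0])     # number of raters (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                  # between-subject mean square
    msc = ss_cols / (k - 1)                  # between-rater mean square
    mse = ss_error / ((n - 1) * (k - 1))     # residual mean square

    icc_agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_consistency = (msr - mse) / (msr + (k - 1) * mse)
    return icc_agreement, icc_consistency


# Hypothetical data: the second rater always scores one point higher.
ratings = [[1, 2], [2, 3], [3, 4], [4, 5]]
icc2, icc3 = icc_two_way(ratings)
print(round(icc2, 4), round(icc3, 4))
```

With this toy table the raters order the subjects identically, so the consistency form equals 1, while the constant shift between raters pulls the absolute-agreement form down to 10/13 (about 0.77). With only two raters the between-rater mean square rests on a single degree of freedom, which is one way to see why intervals that must account for a random rater effect become so wide.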

Correlated binary data arise in many ophthalmological and otolaryngological clinical trials. Testing the homogeneity of prevalences among different groups is an important issue when conducting these trials. The equal correlation coefficients model proposed by Donner in 1989 is a popular model for handling correlated binary data. The asymptotic chi-square test works well when the sample size is large. However, it fails to maintain the type I error rate when the sample size is relatively small. In this paper, we propose several exact methods to deal with small-sample scenarios. Their performances are compared with respect to type I error rate and power. The ‘M approach’ and the ‘E + M approach’ seem to outperform the others. A real example is given to further explain how these approaches work. Finally, the computational efficiency of the exact methods is discussed as a pressing issue for future work.

Agreement among raters is an important issue in medicine, as well as in education and psychology. The agreement between two raters on a nominal or ordinal rating scale has been investigated in many articles. The multi-rater case with normally distributed ratings has also been explored at length. However, there is a lack of research on multiple raters using an ordinal rating scale. In this simulation study, several methods for analyzing rater agreement were compared. The focus was on the special case of multiple raters using a bounded ordinal rating scale. The proposed methods for agreement were compared within different settings. Three main ordinal data simulation settings were used (normal, skewed and shifted data). In addition, the proposed methods were applied to a real data set from dermatology. The simulation results showed that Kendall's W and mean gamma highly overestimated the agreement in data sets with shifts in the data. ICC4 for bounded data should be avoided in agreement studies with rating scales < 5, where this method highly overestimated the simulated agreement. The difference in bias for all methods under study, except the mean gamma and Kendall's W, decreased as the rating scale increased. The bias of ICC3 was consistent and small for nearly all simulation settings except the low agreement setting in the shifted data set. Researchers should be careful in selecting agreement methods, especially if shifts in ratings between raters exist, and may wish to apply more than one method before drawing any conclusions.
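The finding that Kendall's W overestimates agreement under shifted ratings can be illustrated with a small sketch. It assumes the textbook form of Kendall's W without a tie correction, and the three raters and their scores are hypothetical; it does not reproduce the simulation design of the study above.

```python
def kendalls_w(scores_by_rater):
    """Kendall's coefficient of concordance for m raters scoring the same n items.

    Assumes no tied scores within a rater, so no tie correction is applied.
    """
    m = len(scores_by_rater)
    n = len(scores_by_rater[0])
    rank_sums = [0] * n
    for scores in scores_by_rater:
        # Rank the items for this rater: 1 = lowest score, n = highest.
        order = sorted(range(n), key=lambda i: scores[i])
        for rank, item in enumerate(order, start=1):
            rank_sums[item] += rank
    mean_rank_sum = m * (n + 1) / 2
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical ratings on a bounded 1-7 scale; raters B and C are shifted upward.
rater_a = [1, 2, 3, 4, 5]
rater_b = [3, 4, 5, 6, 7]
rater_c = [2, 3, 4, 5, 6]

w = kendalls_w([rater_a, rater_b, rater_c])
exact_matches = sum(a == b == c for a, b, c in zip(rater_a, rater_b, rater_c))
print(w, exact_matches)
```

Because W depends only on within-rater ranks, a constant shift between raters leaves it at its maximum of 1 even though the three raters never assign the same score to any item. That is the sense in which a rank-based index can overstate agreement when raters differ systematically in level.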

An analysis of inter-rater agreement is presented. We study the problem with several raters using a Bayesian model based on the Dirichlet distribution. Inter-rater agreement, including global and partial agreement, is studied by determining the joint posterior distribution of the raters. Posterior distributions are computed with a direct resampling technique. Our method is illustrated with an example involving four residents, who are diagnosing 12 psychiatric patients suspected of having a thought disorder. Initially employing analytical and resampling methods, total agreement between the four is examined with a Bayesian testing technique. Later, partial agreement is examined by determining the posterior probability of certain orderings among the rater means. The power of resampling is revealed by its ability to compute complex multiple integrals that represent various posterior probabilities of agreement and disagreement between several raters.

In many clinical trials, the assessment of the response to interventions can include a large variety of outcome variables, which are generally correlated. The use of multiple significance tests is likely to increase the chance of detecting a difference in at least one of the outcomes between two treatments. Furthermore, univariate tests do not take into account the correlation structure. A new test is proposed that uses information from the interim analysis in a two-stage design to form the rejection region boundaries at the second stage. Initially, the test uses Hotelling’s T² at the end of the first stage, allowing only for early acceptance of the null hypothesis, and an O’Brien ‘type’ procedure at the end of the second stage. This test allows one to ‘cheat’ and look at the data at the interim analysis to form rejection regions at the second stage, provided one uses the correct distribution of the final test statistic. This distribution is derived and the power of the new test is compared to the power of three common procedures for testing multiple outcomes: Bonferroni’s inequality, Hotelling’s T² and O’Brien’s test. O’Brien’s test has the best power to detect a difference when the outcomes are thought to be affected in exactly the same direction and the same magnitude or in exactly the same relative effects as those proposed prior to data collection. However, the statistic is not robust to deviations in the alternative parameters proposed a priori, especially for correlated outcomes. The proposed new statistic and the derivation of its distribution allow investigators to consider information from the first stage of a two-stage design and consequently base the final test on the direction observed at the first stage or modify the statistic if the direction differs significantly from what was expected a priori.

In the traditional plan for assessing the reliability of a measurement system, a number of raters each measure the same group of subjects. If the system has a large number of raters, we recommend a new set of plans that has two advantages over the traditional plan. First, the proposed plans provide greater precision for estimating the intraclass correlation coefficient with the same total number of measurements. Second, the plans are flexible and can be adapted to constraints on the number of times any subject can be assessed or the number of times any rater can make an assessment. We provide a simple tool for planning a reliability study, access to the software for the planning in the case where there are constraints and an example to demonstrate the analysis of data from the proposed plans. The Canadian Journal of Statistics 39: 344–355; 2011 © 2011 Statistical Society of Canada

In recent years, spatial lattice data have attracted considerable research interest. The modeling of binary variables observed at locations on a spatial lattice has been thoroughly investigated, and the autologistic model is a popular tool for analyzing such data. However, there are many situations where binary responses are clustered in several uncorrelated lattices, and only a few studies have investigated the modeling of binary data distributed in such a spatial structure. Moreover, because of the spatial dependency in the data, exact likelihood analysis is not possible. Bayesian inference for the autologistic function is also often difficult because its normalizing constant is intractable. In this study, spatially correlated binary data clustered in uncorrelated lattices are modeled via autologistic regression, and the IBF (inverse Bayes formulas) sampler is extended, with the help of latent variables, to posterior analysis and parameter estimation. The proposed methodology is illustrated using simulated and real observations.

Two types of bivariate models for categorical response variables are introduced to deal with special categories such as ‘unsure’ or ‘unknown’ in combination with other ordinal categories, while taking additional hierarchical data structures into account. The latter is achieved by the use of different covariance structures for a trivariate random effect. The models are applied to data from the INSIDA survey, where interest lies in the effect of covariates on the association between HIV risk perception (quadrinomial with an ‘unknown risk’ category) and HIV infection status (binary). The final model combines continuation-ratio with cumulative link logits for the risk perception, together with partly correlated and partly shared trivariate random effects for the household level. The results indicate that only age has a significant effect on the association between HIV risk perception and infection status. The proposed models may be useful in various fields of application such as social and biomedical sciences, epidemiology and public health.

In behavioral, educational and medical practice, interventions are often personalized over time using strategies that are based on individual behaviors and characteristics and on changes in symptoms, severity, or adherence that result from one's treatment. Such strategies, which more closely mimic real practice, are known as dynamic treatment regimens (DTRs). A sequential multiple assignment randomized trial (SMART) is a multi-stage trial design that can be used to construct effective DTRs. This article reviews a simple-to-use ‘weighted and replicated’ estimation technique for comparing DTRs embedded in a SMART design using logistic regression for a binary, end-of-study outcome variable. Based on a Wald test that compares two embedded DTRs of interest from the ‘weighted and replicated’ regression model, a sample size calculation is presented with a corresponding user-friendly applet to aid in the process of designing a SMART. The analytic models and sample size calculations are presented for three of the more commonly used two-stage SMART designs. Simulations for the sample size calculation show that the empirical power reaches expected levels. A data analysis example with corresponding code is presented in the appendix using data from a SMART developing an effective DTR in autism.

A note on the correlation structure of transformed Gaussian random fields (cited by: 1; self-citations: 0; citations by others: 1)
Transformed Gaussian random fields can be used to model continuous time series and spatial data when the Gaussian assumption is not appropriate. The main features of these random fields are specified in a transformed scale, while for modelling and parameter interpretation it is useful to establish connections between these features and those of the random field in the original scale. This paper provides evidence that for many ‘normalizing’ transformations the correlation function of a transformed Gaussian random field is not very dependent on the transformation that is used. Hence many commonly used transformations of correlated data have little effect on the original correlation structure. The property is shown to hold for some kinds of transformed Gaussian random fields, and a statistical explanation based on the concept of parameter orthogonality is provided. The property is also illustrated using two spatial datasets and several ‘normalizing’ transformations. Some consequences of this property for modelling and inference are also discussed.

The authors describe a model-based kappa statistic for binary classifications which is interpretable in the same manner as Scott's pi and Cohen's kappa, yet does not suffer from the same flaws. They compare this statistic with the data-driven and population-based forms of Scott's pi in a population-based setting where many raters and subjects are involved, and inference regarding the underlying diagnostic procedure is of interest. The authors show that Cohen's kappa and Scott's pi seriously underestimate agreement between experts classifying subjects for a rare disease; in contrast, the new statistic is robust to changes in prevalence. The performance of the three statistics is illustrated with simulations and prostate cancer data.

Cohen’s kappa, a special case of the weighted kappa, is a chance-corrected index used extensively to quantify inter-rater agreement in validation and reliability studies. In this paper, it is shown that in inter-rater agreement for 2 × 2 tables, for two raters having the same number of opposite ratings, the weighted kappa, Cohen’s kappa, and the Peirce, Yule, Maxwell and Pilliner, and Fleiss indices are identical. This implies that the weights in the weighted kappa are less important under such assumptions. Equivalently, it is shown that for two partitions of the same data set, resulting from two clustering algorithms having the same number of clusters with equal cluster sizes, these similarity indices are identical. Hence, an important characterisation is formulated relating equal numbers of clusters with the same cluster sizes to the presence/absence of a trait in a reliability study. Two numerical examples that exemplify the implication of this relationship are presented.
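The identity described above can be checked numerically on a single 2 × 2 table. The sketch below is a hedged illustration rather than the paper's derivation: it uses Cohen's kappa together with the phi coefficient and the Maxwell and Pilliner index in their commonly quoted 2 × 2 forms (the exact definitions used in the paper are not given in the abstract, so these forms are assumptions), with cell counts `a`, `b`, `c`, `d` where `b` and `c` are the two kinds of opposite rating. The example counts are hypothetical.

```python
from math import isclose, sqrt


def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2 x 2 table [[a, b], [c, d]] of two raters' counts."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


def phi(a, b, c, d):
    """Phi coefficient of the same table (assumed textbook form)."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def maxwell_pilliner(a, b, c, d):
    """Maxwell and Pilliner index of the same table (assumed textbook form)."""
    return 2 * (a * d - b * c) / ((a + b) * (c + d) + (a + c) * (b + d))


# Equal numbers of opposite ratings (b == c): the three indices coincide.
balanced = (60, 10, 10, 20)
print([round(f(*balanced), 4) for f in (cohen_kappa, phi, maxwell_pilliner)])

# Unequal opposite ratings (b != c): kappa and phi no longer agree.
unbalanced = (60, 15, 5, 20)
print(round(cohen_kappa(*unbalanced), 4), round(phi(*unbalanced), 4))
```

When `b == c` the two raters have identical marginal totals, so the denominators of all three expressions reduce to the same product and the values match (about 0.524 for the balanced table). Once `b` and `c` differ, the marginals differ and the indices separate.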

This paper addresses the problem of comparing the fit of latent class and latent trait models when the indicators are binary and the contingency table is sparse. This problem is common in the analysis of data from large surveys, where many items are associated with an unobservable variable. A study of human resource data illustrates: (1) how the usual goodness-of-fit tests, model selection and cross-validation criteria can be inconclusive; (2) how model selection and evaluation procedures from time series and economic forecasting can be applied to extend residual analysis in this context.

A discrimination procedure based on the location model is described and suggested for use in situations where the discriminating variables are a mixture of continuous and binary variables. Procedures previously employed in similar situations, such as Fisher's linear discriminant function and logistic regression, were compared with this method using the error rate (ER). Optimal ERs for these procedures are reported using real and simulated data for varying sample sizes and numbers of continuous and binary variables, and were used as a measure for assessing the performance of the various procedures. The suggested procedure performed considerably better in the cases considered and never produced a poor result when compared with the other procedures. Hence, the suggested procedure might be considered for such situations.

Model-dependent and robust test statistics constructed using a generalized estimating equations extension of logistic regression, applicable to the analysis of correlated binary outcome data, are shown to have relatively simple algebraic expressions in stratified analyses where all variables are measured at the cluster level. These expressions are used to demonstrate the close relationship to standard procedures that assume that subjects' responses are independent, to prove that the asymptotic validity of model-dependent test statistics is assured if the average correlation between cluster members is constant, and to show that this assumption can be relaxed when there are the same number of subjects in each cluster.

Ensuring a standard of assessment in situations where marking panels are used is fraught with difficulties, particularly where essay-type responses are to be marked. This paper discusses statistical process control procedures, similar to those used in industry, to help moderate marking quality when ‘double-marking’ or ‘partial double-marking’ is used. When questions are assessed by the same two markers, the scores assigned to responses by each marker may be adjusted so that their assessments are on average equal in terms of location and scale. The paper also discusses methods of controlling sequential assessment, and demonstrates the application of these techniques in evaluating marker consistency, using data from school leaving examinations in geography.

In the psychosocial and medical sciences, some studies are designed to assess the agreement between different raters and/or different instruments. Often the same sample will be used to compare the agreement between two or more assessment methods for simplicity and to take advantage of the positive correlation of the ratings. Although sample size calculations have become an important element in the design of research projects, such methods for agreement studies are scarce. We adapt the generalized estimating equations approach for modelling dependent κ-statistics to estimate the sample size that is required for dependent agreement studies. We calculate the power based on a Wald test for the equality of two dependent κ-statistics. The Wald test statistic has a non-central χ²-distribution with non-centrality parameter that can be estimated with minimal assumptions. The method proposed is useful for agreement studies with two raters and two instruments, and is easily extendable to multiple raters and multiple instruments. Furthermore, the method proposed allows for rater bias. Power calculations for binary ratings under various scenarios are presented. Analyses of two biomedical studies are used for illustration.


Online consumer product ratings data are increasing rapidly. While most of the current graphical displays mainly represent the average ratings, Ho and Quinn proposed an easily interpretable graphical display based on an ordinal item response theory (IRT) model, which successfully accounts for systematic interrater differences. Conventionally, the discrimination parameters in IRT models are constrained to be positive, particularly in the modeling of scored data from educational tests. In this article, we use real-world ratings data to demonstrate that such a constraint can have a great impact on the parameter estimation. This impact on estimation was explained through rater behavior. We also discuss correlation among raters and assess the prediction accuracy for both the constrained and the unconstrained models. The results show that the unconstrained model performs better when a larger fraction of rater pairs exhibit negative correlations in ratings.


Scott’s pi and Cohen’s kappa are widely used for assessing the degree of agreement between two raters with binary outcomes. However, many authors have pointed out their paradoxical behavior, which comes from their dependence on the prevalence of the trait under study. To overcome this limitation, Gwet [Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology 61(1):29–48] proposed an alternative and more stable agreement coefficient referred to as the AC1. In this article, we discuss likelihood-based inference for the AC1 in the case of two raters with binary outcomes. The construction of confidence intervals is the main focus. In addition, hypothesis testing and sample size estimation are presented.
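The prevalence paradox mentioned above is easy to see on a small table. The following sketch computes point estimates of Scott's pi, Cohen's kappa and Gwet's AC1 for two raters with binary outcomes, using the usual chance-agreement terms for each coefficient. It covers only the point estimates, not the likelihood-based intervals the article develops, and the function name, cell labels and counts are hypothetical.

```python
def agreement_coefficients(a, b, c, d):
    """Scott's pi, Cohen's kappa and Gwet's AC1 for a 2 x 2 table.

    a = both raters say "yes", d = both say "no",
    b and c = the two kinds of disagreement.
    """
    n = a + b + c + d
    p_o = (a + d) / n                 # observed agreement
    p1 = (a + b) / n                  # rater 1's proportion of "yes"
    p2 = (a + c) / n                  # rater 2's proportion of "yes"
    p_bar = (p1 + p2) / 2             # average prevalence of "yes"

    # Chance-agreement terms that distinguish the three coefficients.
    pe_kappa = p1 * p2 + (1 - p1) * (1 - p2)
    pe_pi = p_bar ** 2 + (1 - p_bar) ** 2
    pe_ac1 = 2 * p_bar * (1 - p_bar)

    def corrected(p_e):
        return (p_o - p_e) / (1 - p_e)

    return {
        "pi": corrected(pe_pi),
        "kappa": corrected(pe_kappa),
        "ac1": corrected(pe_ac1),
    }


# Hypothetical high-prevalence table: 90% observed agreement, trait almost always present.
coeffs = agreement_coefficients(a=90, b=5, c=5, d=0)
print({name: round(value, 4) for name, value in coeffs.items()})
```

On this table the raters agree on 90 of 100 subjects, yet pi and kappa are both slightly negative (about -0.05) because their chance-agreement term is close to 1 when nearly every rating is "yes". AC1, whose chance term shrinks as prevalence becomes extreme, is about 0.89.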

We consider observational studies in pregnancy where the outcome of interest is spontaneous abortion (SAB). This at first sight is a binary ‘yes’ or ‘no’ variable, albeit there is left truncation as well as right-censoring in the data. Women who do not experience SAB by gestational week 20 are ‘cured’ from SAB by definition, that is, they are no longer at risk. Our data differ from the common cure data in the literature, where the cured subjects are always right-censored and not actually observed to be cured. We consider a commonly used cure rate model, with the likelihood function tailored specifically to our data. We develop a conditional nonparametric maximum likelihood approach. To tackle the computational challenge we adopt an EM algorithm making use of “ghost copies” of the data, and a closed-form variance estimator is derived. Under suitable assumptions, we prove the consistency of the resulting estimator, which involves an unbounded cumulative baseline hazard function, as well as its asymptotic normality. Simulations are carried out to evaluate the finite-sample performance. We present the analysis of the motivating SAB study to illustrate the advantages of our model addressing both occurrence and timing of SAB, as compared to existing approaches in practice.
