期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

The disagreeable behaviour of the kappa statistic

Laura Flight Steven A. Julious 《Pharmaceutical statistics》2015,14(1):74-78

It is often of interest to measure the agreement between a number of raters when an outcome is nominal or ordinal. The kappa statistic is used as a measure of agreement. The statistic is highly sensitive to the distribution of the marginal totals and can produce unreliable results. Other statistics such as the proportion of concordance, maximum attainable kappa and prevalence and bias adjusted kappa should be considered to indicate how well the kappa statistic represents agreement in the data. Each kappa should be considered and interpreted based on the context of the data being analysed. Copyright © 2014 JohnWiley & Sons, Ltd. 相似文献

2.

Estimation of symmetric disagreement using a uniform association model for ordinal agreement data

Serpil Aktaş Tülay Saraçbaşı 《AStA Advances in Statistical Analysis》2009,93(3):335-343

The Cohen kappa is probably the most widely used measure of agreement. Measuring the degree of agreement or disagreement in square contingency tables by two raters is mostly of interest. Modeling the agreement provides more information on the pattern of the agreement rather than summarizing the agreement by kappa coefficient. Additionally, the disagreement models in the literature they mentioned are proposed for the nominal scales. Disagreement and uniform association models are aggregated as a new model for the ordinal scale agreement data, thus in this paper, symmetric disagreement plus uniform association model that aims separating the association from the disagreement is proposed. Proposed model is applied to real uterine cancer data. 相似文献

3.

On Modelling Agreement and Category Distinguishability on an Ordinal Scale

Lianyan Fu Man-Lai Tang Ning-Zhong Shi 《统计学通讯:理论与方法》2013,42(24):4413-4426

It is quite common that raters may need to classify a sample of subjects on a categorical scale. Perfect agreement can rarely be observed partly because of different perceptions about the meanings of the category labels between raters and partly because of factors such as intrarater variability. Usually, category indistinguishability occurs between adjacent categories. In this article, we propose a simple log-linear model combining ordinal scale information and category distinguishability between ordinal categories for modelling agreement between two raters. For the proposed model, no score assignment is required to the ordinal categories. An algorithm and statistical properties will be provided. 相似文献

4.

Comparing the methods of measuring multi-rater agreement on an ordinal rating scale: a simulation study with an application to real data

Y. Sertdemir H. R. Burgut Z. N. Alparslan I. Unal S. Gunasti 《Journal of applied statistics》2013,40(7):1506-1519

Agreement among raters is an important issue in medicine, as well as in education and psychology. The agreement among two raters on a nominal or ordinal rating scale has been investigated in many articles. The multi-rater case with normally distributed ratings has also been explored at length. However, there is a lack of research on multiple raters using an ordinal rating scale. In this simulation study, several methods were compared with analyze rater agreement. The special case that was focused on was the multi-rater case using a bounded ordinal rating scale. The proposed methods for agreement were compared within different settings. Three main ordinal data simulation settings were used (normal, skewed and shifted data). In addition, the proposed methods were applied to a real data set from dermatology. The simulation results showed that the Kendall's W and mean gamma highly overestimated the agreement in data sets with shifts in data. ICC₄ for bounded data should be avoided in agreement studies with rating scales<5, where this method highly overestimated the simulated agreement. The difference in bias for all methods under study, except the mean gamma and Kendall's W, decreased as the rating scale increased. The bias of ICC₃ was consistent and small for nearly all simulation settings except the low agreement setting in the shifted data set. Researchers should be careful in selecting agreement methods, especially if shifts in ratings between raters exist and may apply more than one method before any conclusions are made. 相似文献

5.

A bootstrap method for comparing correlated kappa coefficients

《Journal of Statistical Computation and Simulation》2012,82(11):1009-1015

Cohen's kappa coefficient is traditionally used to quantify the degree of agreement between two raters on a nominal scale. Correlated kappas occur in many settings (e.g., repeated agreement by raters on the same individuals, concordance between diagnostic tests and a gold standard) and often need to be compared. While different techniques are now available to model correlated κ coefficients, they are generally not easy to implement in practice. The present paper describes a simple alternative method based on the bootstrap for comparing correlated kappa coefficients. The method is illustrated by examples and its type I error studied using simulations. The method is also compared with the generalized estimating equations of the second order and the weighted least-squares methods. 相似文献

6.

A BAYESIAN ANALYSIS FOR INTER-RATER AGREEMENT

《统计学通讯:模拟与计算》2013,42(3):437-446

An analysis of inter-rater agreement is presented. We study the problem with several raters using a Bayesian model based on the Dirichlet distribution. Inter-rater agreement, including global and partial agreement, is studied by determining the joint posterior distribution of the raters. Posterior distributions are computed with a direct resampling technique. Our method is illustrated with an example involving four residents, who are diagnosing 12 psychiatric patients suspected of having a thought disorder. Initially employing analytical and resampling methods, total agreement between the four is examined with a Bayesian testing technique. Later, partial agreement is examined by determining the posterior probability of certain orderings among the rater means. The power of resampling is revealed by its ability to compute complex multiple integrals that represent various posterior probabilities of agreement and disagreement between several raters. 相似文献

7.

Calculating power for the comparison of dependent κ-coefficients

Hung-Mo Lin John M. Williamson Stuart R. Lipsitz 《Journal of the Royal Statistical Society. Series C, Applied statistics》2003,52(4):391-404

Summary. In the psychosocial and medical sciences, some studies are designed to assess the agreement between different raters and/or different instruments. Often the same sample will be used to compare the agreement between two or more assessment methods for simplicity and to take advantage of the positive correlation of the ratings. Although sample size calculations have become an important element in the design of research projects, such methods for agreement studies are scarce. We adapt the generalized estimating equations approach for modelling dependent κ -statistics to estimate the sample size that is required for dependent agreement studies. We calculate the power based on a Wald test for the equality of two dependent κ -statistics. The Wald test statistic has a non-central χ ²-distribution with non-centrality parameter that can be estimated with minimal assumptions. The method proposed is useful for agreement studies with two raters and two instruments, and is easily extendable to multiple raters and multiple instruments. Furthermore, the method proposed allows for rater bias. Power calculations for binary ratings under various scenarios are presented. Analyses of two biomedical studies are used for illustration. 相似文献

8.

Confidence intervals for the interrater agreement measure kappa

Virginia Foard Flack 《统计学通讯:理论与方法》2013,42(4):953-968

The asympotic normal approximation to the distribution of the estimated measure [kcirc] for evaluating agreement between two raters has been shown to perform poorly for small sample sizes when the true kappa is nonzero. This paper examines the use of skewness corrections and transformations of [kcirc] on the attained confidence levels. Small sample simulations demonstrate the improvement in the agreement between the desired and actual levels of confidence intervals and hypothesis tests that incorporate these corrections. 相似文献

9.

A measure of interrater absolute agreement for ordinal categorical data

Bove Giuseppe Conti Pier Luigi Marella Daniela 《Statistical Methods and Applications》2021,30(3):927-945

Statistical Methods & Applications - A measure of interrater absolute agreement for ordinal scales is proposed capitalizing on the dispersion index for ordinal variables proposed by Giuseppe... 相似文献

10.

A simple method for estimating a regression model for κ between a pair of raters

Stuart R. Lipsitz John Williamson Neil Klar Joseph Ibrahim & Michael Parzen 《Journal of the Royal Statistical Society. Series A, (Statistics in Society)》2001,164(3):449-465

Agreement studies commonly occur in medical research, for example, in the review of X-rays by radiologists, blood tests by a panel of pathologists and the evaluation of psychopathology by a panel of raters. In these studies, often two observers rate the same subject for some characteristic with a discrete number of levels. The κ-coefficient is a popular measure of agreement between the two raters. The κ-coefficient may depend on covariates, i.e. characteristics of the raters and/or the subjects being rated. Our research was motivated by two agreement problems. The first is a study of agreement between a pastor and a co-ordinator of Christian education on whether they feel that the congregation puts enough emphasis on encouraging members to work for social justice (yes versus no). We wish to model the κ-coefficient as a function of covariates such as political orientation (liberal versus conservative) of the pastor and co-ordinator. The second example is a spousal education study, in which we wish to model the κ-coefficient as a function of covariates such as the highest degree of the father of the wife and the father of the husband. We propose a simple method to estimate the regression model for the κ-coefficient, which consists of two logistic (or multinomial logistic) regressions and one linear regression for binary data. The estimates can be easily obtained in any generalized linear model software program. 相似文献

11.

Weighted kappa as a function of unweighted kappas

N. Moradzadeh M. Ganjali 《统计学通讯:模拟与计算》2017,46(5):3769-3780

The kappa coefficient is a widely used measure for assessing agreement on a nominal scale. Weighted kappa is an extension of Cohen's kappa that is commonly used for measuring agreement on an ordinal scale. In this article, it is shown that weighted kappa can be computed as a function of unweighted kappas. The latter coefficients are kappa coefficients that correspond to smaller contingency tables that are obtained by merging categories. 相似文献

12.

Some properties of Lin–Wong divergence on the past lifetime data

M. Khalili F. Yousefzadeh 《统计学通讯:理论与方法》2018,47(14):3464-3476

Measures of statistical divergence are used to assess mutual similarities between distributions of multiple variables through a variety of methodologies including Shannon entropy and Csiszar divergence. Modified measures of statistical divergence are introduced throughout the present article. Those modified measures are related to the Lin–Wong (LW) divergence applied on the past lifetime data. Accordingly, the relationship between Fisher information and the LW divergence measure was explored when applied on the past lifetime data. Throughout this study, a number of relations are proposed between various assessment methods which implement the Jensen–Shannon, Jeffreys, and Hellinger divergence measures. Also, relations between the LW measure and the Kullback–Leibler (KL) measures for past lifetime data were examined. Furthermore, the present study discusses the relationship between the proposed ordering scheme and the distance interval between LW and KL measures under certain conditions. 相似文献

13.

Measuring association between nominal categorical variables: an alternative to the Goodman–Kruskal lambda

Tarald O. Kvålseth 《Journal of applied statistics》2018,45(6):1118-1132

As a measure of association between two nominal categorical variables, the lambda coefficient or Goodman–Kruskal's lambda has become a most popular measure. Its popularity is primarily due to its simple and meaningful definition and interpretation in terms of the proportional reduction in error when predicting a random observation's category for one variable given (versus not knowing) its category for the other variable. It is an asymmetric measure, although a symmetric version is available. The lambda coefficient does, however, have a widely recognized limitation: it can equal zero even when there is no independence between the variables and when all other measures take on positive values. In order to mitigate this problem, an alternative lambda coefficient is introduced in this paper as a slight modification of the Goodman–Kruskal lambda. The properties of the new measure are discussed and a symmetric form is introduced. A statistical inference procedure is developed and a numerical example is provided. 相似文献

14.

Statistical inference of agreement coefficient between two raters with binary outcomes

Tetsuji Ohyama 《统计学通讯:理论与方法》2020,49(10):2529-2539

Abstract

Scott’s pi and Cohen’s kappa are widely used for assessing the degree of agreement between two raters with binary outcomes. However, many authors have pointed out its paradoxical behavior, that comes from the dependence on the prevalence of a trait under study. To overcome the limitation, Gwet [Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology 61(1):29–48] proposed an alternative and more stable agreement coefficient referred to as the AC₁. In this article, we discuss a likelihood-based inference of the AC₁ in the case of two raters with binary outcomes. Construction of confidence intervals is mainly discussed. In addition, hypothesis testing, and sample size estimation are also presented. 相似文献

15.

Confidence intervals on intraclass correlation coefficients in a balanced two-factor random design

Kye Gilder Naitee Ting Lili Tian Joseph C. Cappelleri R. Choudary Hanumara 《Journal of statistical planning and inference》2007

A modified large-sample (MLS) approach and a generalized confidence interval (GCI) approach are proposed for constructing confidence intervals for intraclass correlation coefficients. Two particular intraclass correlation coefficients are considered in a reliability study. Both subjects and raters are assumed to be random effects in a balanced two-factor design, which includes subject-by-rater interaction. Computer simulation is used to compare the coverage probabilities of the proposed MLS approach (GiTTCH) and GCI approaches with the Leiva and Graybill [1986. Confidence intervals for variance components in the balanced two-way model with interaction. Comm. Statist. Simulation Comput. 15, 301–322] method. The competing approaches are illustrated with data from a gauge repeatability and reproducibility study. The GiTTCH method maintains at least the stated confidence level for interrater reliability. For intrarater reliability, the coverage is accurate in several circumstances but can be liberal in some circumstances. The GCI approach provides reasonable coverage for lower confidence bounds on interrater reliability, but its corresponding upper bounds are too liberal. Regarding intrarater reliability, the GCI approach is not recommended because the lower bound coverage is liberal. Comparing the overall performance of the three methods across a wide array of scenarios, the proposed modified large-sample approach (GiTTCH) provides the most accurate coverage for both interrater and intrarater reliability. 相似文献

16.

Cohen’s quadratically weighted kappa is higher than linearly weighted kappa for tridiagonal agreement tables

《Statistical Methodology》2012,9(3):440-444

相似文献

17.

Two simple measures of variability for categorical data

Erindi Allaj 《Journal of applied statistics》2018,45(8):1497-1516

This paper proposes two new variability measures for categorical data. The first variability measure is obtained as one minus the square root of the sum of the squares of the relative frequencies of the different categories. The second measure is obtained by standardizing the first measure. The measures proposed are functions of the variability measure proposed by Gini [Variabilitá e Mutuabilitá Contributo allo Studio delle Distribuzioni e delle Relazioni Statistiche, C. Cuppini, Bologna, 1912] and approximate the coefficient of nominal variation introduced by Kvålseth [Coefficients of variation for nominal and ordinal categorical data, Percept. Motor Skills 80 (1995), pp. 843–847] when the number of categories increases. Different mathematical properties of the proposed variability measures are studied and analyzed. Several examples illustrate how the variability measures can be interpreted and used in practice. 相似文献

18.

TESTING THE AGREEMENT OF TWO QUANTITATIVE ASSAYS IN INDIVIDUAL MEANS

《统计学通讯:理论与方法》2013,42(8):1283-1299

ABSTRACT

Existing approaches for the statistical evaluation of the agreement of two quantitative assays in terms of individual means are either based on a linear model and some stringent assumptions or comparisons of averages of individual means. Furthermore, the related statistical tests for some of these approaches are not valid in the sense that the sizes of these tests are not exactly the same as the nominal size even asymptotically. In this paper we propose a new method, which produces exact statistical tests that are easy to compute. When independent replicates are available, the proposed method requires very little or no assumption on the individual error variances. Simulation results show that the proposed tests perform better than some existing tests. Some examples are presented for illustration. 相似文献

19.

A Note on Item Response Theory Modeling for Online Customer Ratings

Chien-Lang Su Sun-Hao Chang Ruby Chiu-Hsing Weng 《The American statistician》2020,74(1):53-63

ABSTRACT

Online consumer product ratings data are increasing rapidly. While most of the current graphical displays mainly represent the average ratings, Ho and Quinn proposed an easily interpretable graphical display based on an ordinal item response theory (IRT) model, which successfully accounts for systematic interrater differences. Conventionally, the discrimination parameters in IRT models are constrained to be positive, particularly in the modeling of scored data from educational tests. In this article, we use real-world ratings data to demonstrate that such a constraint can have a great impact on the parameter estimation. This impact on estimation was explained through rater behavior. We also discuss correlation among raters and assess the prediction accuracy for both the constrained and the unconstrained models. The results show that the unconstrained model performs better when a larger fraction of rater pairs exhibit negative correlations in ratings. 相似文献

20.

Confidence Intervals for Intraclass Correlation in Inter-Rater Reliability

Valentin Rousson Theo Gasser Burkhardt Seifert 《Scandinavian Journal of Statistics》2003,30(3):617-624

Abstract Calculation of a confidence interval for intraclass correlation to assess inter‐rater reliability is problematic when the number of raters is small and the rater effect is not negligible. Intervals produced by existing methods are uninformative: the lower bound is often close to zero, even in cases where the reliability is good and the sample size is large. In this paper, we show that this problem is unavoidable without extra assumptions and we propose two new approaches. The first approach assumes that the raters are sufficiently trained and is related to a sensitivity analysis. The second approach is based on a model with fixed rater effect. Using either approach, we obtain conservative and informative confidence intervals even from samples with only two raters. We illustrate our point with data on the development of neuromotor functions in children and adolescents. 相似文献