Similar articles
Found 20 similar articles (search time: 390 ms)
1.
In the traditional plan for assessing the reliability of a measurement system, a number of raters each measure the same group of subjects. If the system has a large number of raters, we recommend a new set of plans that has two advantages over the traditional plan. First, the proposed plans provide greater precision for estimating the intraclass correlation coefficient with the same total number of measurements. Second, the plans are flexible and can be adapted to constraints on the number of times any subject can be assessed or the number of times any rater can make an assessment. We provide a simple tool for planning a reliability study, access to the software for the planning in the case where there are constraints and an example to demonstrate the analysis of data from the proposed plans. The Canadian Journal of Statistics 39: 344–355; 2011 © 2011 Statistical Society of Canada

2.
Cohen’s kappa, a special case of the weighted kappa, is a chance‐corrected index used extensively to quantify inter‐rater agreement in validation and reliability studies. In this paper, it is shown that in inter‐rater agreement for 2 × 2 tables, for two raters having the same number of opposite ratings, the weighted kappa, Cohen’s kappa, Peirce, Yule, Maxwell and Pilliner and Fleiss indices are identical. This implies that the weights in the weighted kappa are less important under such assumptions. Equivalently, it is shown that for two partitions of the same data set, resulting from two clustering algorithms having the same number of clusters with equal cluster sizes, these similarity indices are identical. Hence, an important characterisation is formulated relating equal numbers of clusters with the same cluster sizes to the presence/absence of a trait in a reliability study. Two numerical examples that exemplify the implication of this relationship are presented.
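
For orientation, the chance-corrected index discussed in this entry can be sketched in a few lines. This is a minimal illustration of the standard definition κ = (p_o − p_e)/(1 − p_e) for a 2 × 2 table, not the paper's derivation; the function name `cohen_kappa_2x2` and the cell labels `a`, `b`, `c`, `d` are assumptions made here for illustration.

```python
def cohen_kappa_2x2(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table (illustrative sketch).

    Assumed (hypothetical) cell layout:
      a = both raters say "present"
      b = rater 1 "present", rater 2 "absent"
      c = rater 1 "absent",  rater 2 "present"
      d = both raters say "absent"
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("table is empty")
    # Observed proportion of agreement (diagonal cells).
    p_o = (a + d) / n
    # Agreement expected by chance from the two raters' marginal totals.
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Example: 80 agreements out of 100, balanced marginals.
kappa = cohen_kappa_2x2(40, 10, 10, 40)
```

The function takes raw counts rather than proportions so that the marginal totals used in the chance term stay explicit.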

3.
In this paper, three analysis procedures for repeated correlated binary data with no a priori ordering of the measurements are described and subsequently investigated. An example of correlated binary data is the set of binary assessments of subjects obtained by several raters in the framework of a clinical trial. This topic is especially relevant when success criteria have to be defined for dedicated imaging trials involving several raters conducted for regulatory purposes. First, an analytical result on the expectation of the ‘Majority rater’ is presented when only the marginal distributions of the single raters are given. The paper provides a simulation study where all three analysis procedures are compared for a particular setting. It turns out that in many cases, ‘Average rater’ is associated with a gain in power. Settings were identified where ‘Majority significant’ has favorable properties. ‘Majority rater’ is in many cases difficult to interpret. Copyright © 2014 John Wiley & Sons, Ltd.

4.
In the psychosocial and medical sciences, some studies are designed to assess the agreement between different raters and/or different instruments. Often the same sample will be used to compare the agreement between two or more assessment methods for simplicity and to take advantage of the positive correlation of the ratings. Although sample size calculations have become an important element in the design of research projects, such methods for agreement studies are scarce. We adapt the generalized estimating equations approach for modelling dependent κ-statistics to estimate the sample size that is required for dependent agreement studies. We calculate the power based on a Wald test for the equality of two dependent κ-statistics. The Wald test statistic has a non-central χ²-distribution with a non-centrality parameter that can be estimated with minimal assumptions. The method proposed is useful for agreement studies with two raters and two instruments, and is easily extendable to multiple raters and multiple instruments. Furthermore, the method proposed allows for rater bias. Power calculations for binary ratings under various scenarios are presented. Analyses of two biomedical studies are used for illustration.

5.
A modified large-sample (MLS) approach and a generalized confidence interval (GCI) approach are proposed for constructing confidence intervals for intraclass correlation coefficients. Two particular intraclass correlation coefficients are considered in a reliability study. Both subjects and raters are assumed to be random effects in a balanced two-factor design, which includes subject-by-rater interaction. Computer simulation is used to compare the coverage probabilities of the proposed MLS approach (GiTTCH) and GCI approaches with the Leiva and Graybill [1986. Confidence intervals for variance components in the balanced two-way model with interaction. Comm. Statist. Simulation Comput. 15, 301–322] method. The competing approaches are illustrated with data from a gauge repeatability and reproducibility study. The GiTTCH method maintains at least the stated confidence level for interrater reliability. For intrarater reliability, the coverage is accurate in several circumstances but can be liberal in some circumstances. The GCI approach provides reasonable coverage for lower confidence bounds on interrater reliability, but its corresponding upper bounds are too liberal. Regarding intrarater reliability, the GCI approach is not recommended because the lower bound coverage is liberal. Comparing the overall performance of the three methods across a wide array of scenarios, the proposed modified large-sample approach (GiTTCH) provides the most accurate coverage for both interrater and intrarater reliability.

6.
Scott’s pi and Cohen’s kappa are widely used for assessing the degree of agreement between two raters with binary outcomes. However, many authors have pointed out their paradoxical behavior, which comes from their dependence on the prevalence of the trait under study. To overcome this limitation, Gwet [Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology 61(1):29–48] proposed an alternative and more stable agreement coefficient referred to as the AC1. In this article, we discuss likelihood-based inference for the AC1 in the case of two raters with binary outcomes. The construction of confidence intervals is the main focus. In addition, hypothesis testing and sample size estimation are also presented.
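
To make the prevalence paradox concrete, the sketch below computes a point estimate of Gwet's AC1 for two raters with binary outcomes, alongside Cohen's kappa for comparison. It covers only the coefficient itself, not the likelihood-based inference the article discusses. The function names, the cell labels `a`, `b`, `c`, `d`, and the example table are assumptions chosen here for illustration.

```python
def gwet_ac1_2x2(a, b, c, d):
    """Gwet's AC1 for two raters and a binary outcome (illustrative sketch).

    Assumed (hypothetical) cell layout: a = both "present",
    b and c = the two kinds of disagreement, d = both "absent".
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("table is empty")
    p_a = (a + d) / n                      # observed agreement
    pi = ((a + b) + (a + c)) / (2 * n)     # mean marginal rate of "present"
    p_e = 2 * pi * (1 - pi)                # AC1 chance agreement (at most 0.5)
    return (p_a - p_e) / (1 - p_e)


def _kappa_2x2(a, b, c, d):
    """Cohen's kappa, included only for comparison with AC1."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_o - p_e) / (1 - p_e)


# A skewed table: 90% observed agreement, trait present in 95% of ratings.
table = (90, 5, 5, 0)
ac1 = gwet_ac1_2x2(*table)
kappa = _kappa_2x2(*table)
```

On this made-up table the observed agreement is high, yet kappa comes out slightly negative while AC1 stays close to the observed agreement, which is the behavior the abstract calls paradoxical.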

7.
Online consumer product ratings data are increasing rapidly. While most of the current graphical displays mainly represent the average ratings, Ho and Quinn proposed an easily interpretable graphical display based on an ordinal item response theory (IRT) model, which successfully accounts for systematic interrater differences. Conventionally, the discrimination parameters in IRT models are constrained to be positive, particularly in the modeling of scored data from educational tests. In this article, we use real-world ratings data to demonstrate that such a constraint can have a great impact on the parameter estimation. This impact on estimation was explained through rater behavior. We also discuss correlation among raters and assess the prediction accuracy for both the constrained and the unconstrained models. The results show that the unconstrained model performs better when a larger fraction of rater pairs exhibit negative correlations in ratings.

8.
An analysis of inter-rater agreement is presented. We study the problem with several raters using a Bayesian model based on the Dirichlet distribution. Inter-rater agreement, including global and partial agreement, is studied by determining the joint posterior distribution of the raters. Posterior distributions are computed with a direct resampling technique. Our method is illustrated with an example involving four residents, who are diagnosing 12 psychiatric patients suspected of having a thought disorder. Initially employing analytical and resampling methods, total agreement between the four is examined with a Bayesian testing technique. Later, partial agreement is examined by determining the posterior probability of certain orderings among the rater means. The power of resampling is revealed by its ability to compute complex multiple integrals that represent various posterior probabilities of agreement and disagreement between several raters.

9.
Agreement among raters is an important issue in medicine, as well as in education and psychology. The agreement between two raters on a nominal or ordinal rating scale has been investigated in many articles. The multi-rater case with normally distributed ratings has also been explored at length. However, there is a lack of research on multiple raters using an ordinal rating scale. In this simulation study, several methods for analyzing rater agreement were compared. The special case that was focused on was the multi-rater case using a bounded ordinal rating scale. The proposed methods for agreement were compared within different settings. Three main ordinal data simulation settings were used (normal, skewed and shifted data). In addition, the proposed methods were applied to a real data set from dermatology. The simulation results showed that Kendall's W and mean gamma highly overestimated the agreement in data sets with shifts in data. ICC4 for bounded data should be avoided in agreement studies with rating scales < 5, where this method highly overestimated the simulated agreement. The difference in bias for all methods under study, except the mean gamma and Kendall's W, decreased as the rating scale increased. The bias of ICC3 was consistent and small for nearly all simulation settings except the low agreement setting in the shifted data set. Researchers should be careful in selecting agreement methods, especially if shifts in ratings between raters exist, and should consider applying more than one method before drawing any conclusions.

10.
Due to the high reliability and high testing cost of electro-explosive devices, even though an accelerated test is performed, one may observe very few failures or even no failures at all due to censoring. In this paper, we consider modelling the reliability of such devices by an exponential lifetime distribution in which the failure rate is assumed to be a function of some covariates and that the observed data are binary. The Bayesian approach, with three different prior settings, is used to develop inference on the failure rate, lifetime and the reliability under some settings. A Monte Carlo simulation study is carried out to show that this approach is quite useful and suitable for analysing data of the considered form, especially when the failure rates are very small. Finally, illustrative data are analysed using this approach.

11.
The goal of this study is to analyze the quality of ratings assigned to two constructed response questions for evaluating the written ability of essays in Portuguese language from the perspective of the many-facet Rasch (MFR) model [J.M. Linacre, Many-facet Rasch Measurement, 2nd ed., MESA Press, Chicago, 1994]. The analyzed data set comes from 350 written tests with two open-item tasks that were developed based on a rating process independently marked by two rater coordinators and a group of 42 raters. The MFR model analysis shows the measurement quality related to the examinees, raters, tasks and items, and classification scale that has been used for the task rating process. The findings indicate significant differences amongst the rater severities and show that the raters cannot be interchanged. The results also suggest that the comparison between the two task difficulties needs further investigation. An additional study has been done on the scale structure of the classification used by each rater for each item. The result suggests that there have been some similarities amongst the tasks and a need for revision of some criteria of the rating process. Overall, the scale of evaluation has been shown to be efficient for a classification of the examinees.

12.
In this paper, the hypothesis testing and interval estimation for the intraclass correlation coefficients are considered in a two-way random effects model with interaction. Two particular intraclass correlation coefficients are described in a reliability study. The tests and confidence intervals for the intraclass correlation coefficients are developed when the data are unbalanced. One approach is based on the generalized p-value and generalized confidence interval, the other is based on the modified large-sample idea. These two approaches simplify to the ones in Gilder et al. [2007. Confidence intervals on intraclass correlation coefficients in a balanced two-factor random design. J. Statist. Plann. Inference 137, 1199–1212] when the data are balanced. Furthermore, some statistical properties of the generalized confidence intervals are investigated. Finally, some simulation results to compare the performance of the modified large-sample approach with that of the generalized approach are reported. The simulation results indicate that the modified large-sample approach performs better than the generalized approach in the coverage probability and expected length of the confidence interval.

13.
In this article, we study inferences for reliability functions of a system having two components connected in series. Suppose that the lifetime of one component has a lognormal distribution. Lognormal, exponential, and Weibull distributions are considered for the lifetime of the other component. Using the generalized inference approach, we obtain confidence intervals with good coverage for the parameters of interest. Some frequentist properties in small-sample cases and large-sample cases are proved.

14.
In this article, we propose a novel approach for testing the equality of two log-normal populations using a computational approach test (CAT) that does not require explicit knowledge of the sampling distribution of the test statistic. Simulation studies demonstrate that the proposed approach can perform hypothesis testing with satisfactory actual size even at small sample sizes. Overall, it is superior to other existing methods. Also, a CAT is proposed for testing the reliability of two log-normal populations when the means are the same. Simulations show that the actual size of this new approach is close to the nominal level and better than that of the score test. Finally, the proposed methods are illustrated using two examples.

15.
In this article, the hypothesis testing and interval estimation for the reliability parameter are considered in balanced and unbalanced one-way random models. The tests and confidence intervals for the reliability parameter are developed using the concepts of generalized p-value and generalized confidence interval. Furthermore, some simulation results are presented to compare the performances between the proposed approach and the existing approach. For balanced models, the simulation results indicate that the proposed approach can provide satisfactory coverage probabilities and performs better than the existing approaches across the wide array of scenarios, especially for small sample sizes. For unbalanced models, the simulation results show that the two proposed approaches perform more satisfactorily than the existing approach in most cases. Finally, the proposed approaches are illustrated using two real examples.
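
As background for this entry, the sketch below computes a classical ANOVA point estimate of the intraclass correlation in a balanced one-way random model. It assumes the "reliability parameter" is the ratio σ_a²/(σ_a² + σ_e²), which the abstract does not state explicitly, and it does not implement the generalized p-value or generalized confidence interval methods the article proposes. The function name `icc_oneway` and the data layout are illustrative assumptions.

```python
def icc_oneway(data):
    """ANOVA estimate of the one-way random-effects ICC (balanced data only).

    data: list of n subjects, each a list of k repeated measurements.
    Returns (MSB - MSW) / (MSB + (k - 1) * MSW).
    """
    n = len(data)
    if n < 2:
        raise ValueError("need at least two subjects")
    k = len(data[0])
    if k < 2 or any(len(row) != k for row in data):
        raise ValueError("need a balanced design with k >= 2 measurements")

    grand_mean = sum(sum(row) for row in data) / (n * k)
    subject_means = [sum(row) / k for row in data]

    # Between-subject mean square, on n - 1 degrees of freedom.
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square, on n * (k - 1) degrees of freedom.
    msw = sum(
        (x - m) ** 2 for row, m in zip(data, subject_means) for x in row
    ) / (n * (k - 1))

    denom = msb + (k - 1) * msw
    if denom == 0:
        raise ValueError("ICC is undefined when all measurements are equal")
    return (msb - msw) / denom


# Three subjects measured twice each (made-up numbers).
icc = icc_oneway([[1, 2], [3, 4], [5, 6]])
```

The estimate can be negative when the within-subject mean square exceeds the between-subject one, which is one reason interval methods such as those in the article are of interest.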

16.
This paper considers constructing a new confidence interval for the slope parameter in the structural errors-in-variables model with known error variance associated with the regressors. Existing confidence intervals are so severely affected by the Gleser–Hwang effect that they tend to have poor empirical coverage probabilities and unsatisfactory lengths. Moreover, these problems get worse with decreasing reliability ratio, which also results in more frequent absence of some existing intervals. To ease these issues, this paper presents a fiducial generalized confidence interval which maintains the correct asymptotic coverage. Simulation results show that this fiducial interval is slightly conservative while often having average length comparable to or shorter than that of the other methods. Finally, we illustrate these confidence intervals with two real data examples, and in the second example some existing intervals do not exist.

17.
In past studies, various criteria have been proposed for evaluating the performance of a confidence set. However, each of these criteria often leads to unsatisfactory results even for standard models such as the location model, scale model and multinormal model. In this article, we propose a new criterion so that the procedure of confidence set estimation based on the criterion can lead to a desirable confidence set at least for the above models. The approach is based on an improvement of the Neyman shortness in two steps. The first step is a kind of theoretical improvement, referring to a proposal of Pratt. As a result, we get a solution to Pratt's paradox. In the second step, we adopt a kind of robust or minimax procedure without insisting on uniform optimality. In conclusion, it is shown that the procedure based on our criterion produces a desirable and acceptable confidence set.

18.
In the classical approach to qualitative reliability demonstration, system failure probabilities are estimated based on a binomial sample drawn from the running production. In this paper, we show how to take account of additional available sampling information for some or even all subsystems of a current system under test with serial reliability structure. In that connection, we present two approaches, a frequentist and a Bayesian one, for assessing an upper bound for the failure probability of serial systems under binomial subsystem data. In the frequentist approach, we introduce (i) a new way of deriving the probability distribution for the number of system failures, which might be randomly assembled from the failed subsystems and (ii) a more accurate estimator for the Clopper–Pearson upper bound using a beta mixture distribution. In the Bayesian approach, however, we infer the posterior distribution for the system failure probability on the basis of the system/subsystem testing results and a prior distribution for the subsystem failure probabilities. We propose three different prior distributions and compare their performances in the context of high reliability testing. Finally, we apply the proposed methods to reduce the efforts of semiconductor burn-in studies by considering synergies such as comparable chip layers, among different chip technologies.
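
The classical baseline that this entry builds on can be sketched as follows: a one-sided Clopper–Pearson upper bound for a failure probability from a single binomial sample. This is the textbook bound only, not the beta mixture estimator or the Bayesian approach proposed in the paper. The function names and the default `alpha` are assumptions made here, and the bound is found by bisection on the binomial CDF rather than through a beta quantile, to keep the sketch dependency-free.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k, n, alpha=0.1):
    """One-sided (1 - alpha) Clopper-Pearson upper bound for p.

    k: observed number of failures, n: sample size.
    Solves binom_cdf(k, n, p) = alpha for p by bisection.
    """
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    if not 0 < alpha < 1:
        raise ValueError("require 0 < alpha < 1")
    if k == n:
        return 1.0  # every unit failed: the bound is trivially 1
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        # The CDF decreases in p, so a CDF above alpha means p is too small.
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Zero failures in 100 units: the bound has the closed form 1 - alpha**(1/n).
upper = clopper_pearson_upper(0, 100, alpha=0.1)
```

With zero observed failures the bound reduces to 1 − α^(1/n), which gives a quick sanity check on the bisection.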

19.
When making inference on a normal distribution, one often seeks either a joint confidence region for the two parameters or a confidence band for the cumulative distribution function. A number of methods for constructing such confidence sets are available, but none of these methods guarantees a minimum-area confidence set. In this paper, we derive both a minimum-area joint confidence region for the two parameters and a minimum-area confidence band for the cumulative distribution function. The minimum-area joint confidence region is asymptotically equivalent to other confidence regions in the literature, but the minimum-area confidence band improves on existing confidence bands even asymptotically.

20.
Comparative lifetime experiments are of great importance when the interest is in ascertaining the relative merits of two competing products with regard to their reliability. In this article, we consider two exponential populations when joint progressive Type-II censoring is implemented on the two samples. We then derive the moment generating functions and the exact distributions of the maximum likelihood estimators (MLEs) of the mean lifetimes of the two exponential populations under such joint progressive Type-II censoring. We then discuss the exact lower confidence bounds, exact confidence intervals, and simultaneous confidence regions. Next, we discuss the corresponding approximate results based on the asymptotic normality of the MLEs as well as those based on the Bayesian method. All these confidence intervals and regions are then compared by means of Monte Carlo simulations with those obtained from bootstrap methods. Finally, an example is presented to illustrate all the methods of inference discussed here.
