期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Three Recommendations for Improving the Use of p-Values

Daniel J. Benjamin 《The American statistician》2019,73(1):186-191

ABSTRACT

Researchers commonly use p-values to answer the question: How strongly does the evidence favor the alternative hypothesis relative to the null hypothesis? p-Values themselves do not directly answer this question and are often misinterpreted in ways that lead to overstating the evidence against the null hypothesis. Even in the “post p?<?0.05 era,” however, it is quite possible that p-values will continue to be widely reported and used to assess the strength of evidence (if for no other reason than the widespread availability and use of statistical software that routinely produces p-values and thereby implicitly advocates for their use). If so, the potential for misinterpretation will persist. In this article, we recommend three practices that would help researchers more accurately interpret p-values. Each of the three recommended practices involves interpreting p-values in light of their corresponding “Bayes factor bound,” which is the largest odds in favor of the alternative hypothesis relative to the null hypothesis that is consistent with the observed data. The Bayes factor bound generally indicates that a given p-value provides weaker evidence against the null hypothesis than typically assumed. We therefore believe that our recommendations can guard against some of the most harmful p-value misinterpretations. In research communities that are deeply attached to reliance on “p?<?0.05,” our recommendations will serve as initial steps away from this attachment. We emphasize that our recommendations are intended merely as initial, temporary steps and that many further steps will need to be taken to reach the ultimate destination: a holistic interpretation of statistical evidence that fully conforms to the principles laid out in the ASA statement on statistical significance and p-values. 相似文献

2.

Why are p-Values Controversial?

Todd A. Kuffner Stephen G. Walker 《The American statistician》2019,73(1):1-3

While it is often argued that a p-value is a probability; see Wasserstein and Lazar, we argue that a p-value is not defined as a probability. A p-value is a bijection of the sufficient statistic for a given test which maps to the same scale as the Type I error probability. As such, the use of p-values in a test should be no more a source of controversy than the use of a sufficient statistic. It is demonstrated that there is, in fact, no ambiguity about what a p-value is, contrary to what has been claimed in recent public debates in the applied statistics community. We give a simple example to illustrate that rejecting the use of p-values in testing for a normal mean parameter is conceptually no different from rejecting the use of a sample mean. The p-value is innocent; the problem arises from its misuse and misinterpretation. The way that p-values have been informally defined and interpreted appears to have led to tremendous confusion and controversy regarding their place in statistical analysis. 相似文献

3.

The adaptive calibration of testing p-values

Guimei Zhao Li Wang 《统计学通讯:理论与方法》2013,42(4):922-932

Abstract

In statistical hypothesis testing, a p-value is expected to be distributed as the uniform distribution on the interval (0, 1) under the null hypothesis. However, some p-values, such as the generalized p-value and the posterior predictive p-value, cannot be assured of this property. In this paper, we propose an adaptive p-value calibration approach, and show that the calibrated p-value is asymptotically distributed as the uniform distribution. For Behrens–Fisher problem and goodness-of-fit test under a normal model, the calibrated p-values are constructed and their behavior is evaluated numerically. Simulations show that the calibrated p-values are superior than original ones. 相似文献

4.

Content Audit for p-value Principles in Introductory Statistics

Karsten Maurer Lynette Hudiburgh Lisa Werwinski John Bailer 《The American statistician》2019,73(1):385-391

ABSTRACT

Longstanding concerns with the role and interpretation of p-values in statistical practice prompted the American Statistical Association (ASA) to make a statement on p-values. The ASA statement spurred a flurry of responses and discussions by statisticians, with many wondering about the steps necessary to expand the adoption of these principles. Introductory statistics classrooms are key locations to introduce and emphasize the nuance related to p-values; in part because they engrain appropriate analysis choices at the earliest stages of statistics education, and also because they reach the broadest group of students. We propose a framework for statistics departments to conduct a content audit for p-value principles in their introductory curriculum. We then discuss the process and results from applying this course audit framework within our own statistics department. We also recommend meeting with client departments as a complement to the course audit. Discussions about analyses and practices common to particular fields can help to evaluate if our service courses are meeting the needs of client departments and to identify what is needed in our introductory courses to combat the misunderstanding and future misuse of p-values. 相似文献

5.

P-Value Precision and Reproducibility

Boos DD Stefanski LA 《The American statistician》2011,65(4):213-221

P-values are useful statistical measures of evidence against a null hypothesis. In contrast to other statistical estimates, however, their sample-to-sample variability is usually not considered or estimated, and therefore not fully appreciated. Via a systematic study of log-scale p-value standard errors, bootstrap prediction bounds, and reproducibility probabilities for future replicate p-values, we show that p-values exhibit surprisingly large variability in typical data situations. In addition to providing context to discussions about the failure of statistical results to replicate, our findings shed light on the relative value of exact p-values vis-a-vis approximate p-values, and indicate that the use of *, **, and *** to denote levels .05, .01, and .001 of statistical significance in subject-matter journals is about the right level of precision for reporting p-values when judged by widely accepted rules for rounding statistical estimates. 相似文献

6.

Abandon Statistical Significance

Blakeley B. McShane David Gal Andrew Gelman Christian Robert Jennifer L. Tackett 《The American statistician》2019,73(1):235-245

ABSTRACT

We discuss problems the null hypothesis significance testing (NHST) paradigm poses for replication and more broadly in the biomedical and social sciences as well as how these problems remain unresolved by proposals involving modified p-value thresholds, confidence intervals, and Bayes factors. We then discuss our own proposal, which is to abandon statistical significance. We recommend dropping the NHST paradigm—and the p-value thresholds intrinsic to it—as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences. Specifically, we propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence. We have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures. We offer recommendations for how our proposal can be implemented in the scientific publication process as well as in statistical decision making more broadly. 相似文献

7.

The widespread misinterpretation of p-values as error probabilities

Raymond Hubbard 《Journal of applied statistics》2011,38(11):2617-2626

The anonymous mixing of Fisherian (p-values) and Neyman–Pearsonian (α levels) ideas about testing, distilled in the customary but misleading p < α criterion of statistical significance, has led researchers in the social and management sciences (and elsewhere) to commonly misinterpret the p-value as a ‘data-adjusted’ Type I error rate. Evidence substantiating this claim is provided from a number of fronts, including comments by statisticians, articles judging the value of significance testing, textbooks, surveys of scholars, and the statistical reporting behaviours of applied researchers. That many investigators do not know the difference between p’s and α’s indicates much bewilderment over what those most ardently sought research outcomes—statistically significant results—means. Statisticians can play a leading role in clearing this confusion. A good starting point would be to abolish the p < α criterion of statistical significance. 相似文献

8.

Distribution of the Λ-Criterion for Sphericity Test and Its Connection to a Generalized Dirichlet Model

Seemon Thomas Alex Thannippara 《统计学通讯:模拟与计算》2013,42(7):1385-1395

It is shown that the exact null distribution of the likelihood ratio criterion for sphericity test in the p-variate normal case and the marginal distribution of the first component of a (p ? 1)-variate generalized Dirichlet model with a given set of parameters are identical. The exact distribution of the likelihood ratio criterion so obtained has a general format for every p. A novel idea is introduced here through which the complicated exact null distribution of the sphericity test criterion in multivariate statistical analysis is converted into an easily tractable marginal density in a generalized Dirichlet model. It provides a direct and easiest method of computation of p-values. The computation of p-values and a table of critical points corresponding to p = 3 and 4 are also presented. 相似文献

9.

A method for combining p-values in meta-analysis by gamma distributions

Li-Chu Chien 《Journal of applied statistics》2019,46(2):247-261

Combining p-values from statistical tests across different studies is the most commonly used approach in meta-analysis for evolutionary biology. The most commonly used p-value combination methods mainly incorporate the z-transform tests (e.g., the un-weighted z-test and the weighted z-test) and the gamma-transform tests (e.g., the CZ method [Z. Chen, W. Yang, Q. Liu, J.Y. Yang, J. Li, and M.Q. Yang, A new statistical approach to combining p-values using gamma distribution and its application to genomewide association study, Bioinformatics 15 (2014), p. S3]). However, among these existing p-value combination methods, no method is uniformly most powerful in all situations [Chen et al. 2014]. In this paper, we propose a meta-analysis method based on the gamma distribution, MAGD, by pooling the p-values from independent studies. The newly proposed test, MAGD, allows for flexible accommodating of the different levels of heterogeneity of effect sizes across individual studies. The MAGD simultaneously retains all the characters of the z-transform tests and the gamma-transform tests. We also propose an easy-to-implement resampling approach for estimating the empirical p-values of MAGD for the finite sample size. Simulation studies and two data applications show that the proposed method MAGD is essentially as powerful as the z-transform tests (the gamma-transform tests) under the circumstance with the homogeneous (heterogeneous) effect sizes across studies. 相似文献

10.

Limitations of P-Values and R-squared for Stepwise Regression Building: A Fairness Demonstration in Health Policy Risk Adjustment

Sherri Rose Thomas G. McGuire 《The American statistician》2019,73(1):152-156

ABSTRACT

Stepwise regression building procedures are commonly used applied statistical tools, despite their well-known drawbacks. While many of their limitations have been widely discussed in the literature, other aspects of the use of individual statistical fit measures, especially in high-dimensional stepwise regression settings, have not. Giving primacy to individual fit, as is done with p-values and R², when group fit may be the larger concern, can lead to misguided decision making. One of the most consequential uses of stepwise regression is in health care, where these tools allocate hundreds of billions of dollars to health plans enrolling individuals with different predicted health care costs. The main goal of this “risk adjustment” system is to convey incentives to health plans such that they provide health care services fairly, a component of which is not to discriminate in access or care for persons or groups likely to be expensive. We address some specific limitations of p-values and R² for high-dimensional stepwise regression in this policy problem through an illustrated example by additionally considering a group-level fairness metric. 相似文献

11.

A note on fiducial model averaging as an alternative to checking Bayesian and frequentist models

David R. Bickel 《统计学通讯:理论与方法》2018,47(13):3125-3137

Just as frequentist hypothesis tests have been developed to check model assumptions, prior predictive p-values and other Bayesian p-values check prior distributions as well as other model assumptions. These model checks not only suffer from the usual threshold dependence of p-values, but also from the suppression of model uncertainty in subsequent inference. One solution is to transform Bayesian and frequentist p-values for model assessment into a fiducial distribution across the models. Averaging the Bayesian or frequentist posterior distributions with respect to the fiducial distribution can reproduce results from Bayesian model averaging or classical fiducial inference. 相似文献

12.

A Proposed Hybrid Effect Size Plus p-Value Criterion: Empirical Evidence Supporting its Use

William M. Goodman Susan E. Spruill Eugene Komaroff 《The American statistician》2019,73(1):168-185

ABSTRACT

When the editors of Basic and Applied Social Psychology effectively banned the use of null hypothesis significance testing (NHST) from articles published in their journal, it set off a fire-storm of discussions both supporting the decision and defending the utility of NHST in scientific research. At the heart of NHST is the p-value which is the probability of obtaining an effect equal to or more extreme than the one observed in the sample data, given the null hypothesis and other model assumptions. Although this is conceptually different from the probability of the null hypothesis being true, given the sample, p-values nonetheless can provide evidential information, toward making an inference about a parameter. Applying a 10,000-case simulation described in this article, the authors found that p-values’ inferential signals to either reject or not reject a null hypothesis about the mean (α?=?0.05) were consistent for almost 70% of the cases with the parameter’s true location for the sampled-from population. Success increases if a hybrid decision criterion, minimum effect size plus p-value (MESP), is used. Here, rejecting the null also requires the difference of the observed statistic from the exact null to be meaningfully large or practically significant, in the researcher’s judgment and experience. The simulation compares performances of several methods: from p-value and/or effect size-based, to confidence-interval based, under various conditions of true location of the mean, test power, and comparative sizes of the meaningful distance and population variability. For any inference procedure that outputs a binary indicator, like flagging whether a p-value is significant, the output of one single experiment is not sufficient evidence for a definitive conclusion. Yet, if a tool like MESP generates a relatively reliable signal and is used knowledgeably as part of a research process, it can provide useful information. 相似文献

13.

p-Values,Bayes Factors,and Sufficiency

Jonathan Rougier 《The American statistician》2019,73(1):148-151

ABSTRACT

Various approaches can be used to construct a model from a null distribution and a test statistic. I prove that one such approach, originating with D. R. Cox, has the property that the p-value is never greater than the Generalized Likelihood Ratio (GLR). When combined with the general result that the GLR is never greater than any Bayes factor, we conclude that, under Cox’s model, the p-value is never greater than any Bayes factor. I also provide a generalization, illustrations for the canonical Normal model, and an alternative approach based on sufficiency. This result is relevant for the ongoing discussion about the evidential value of small p-values, and the movement among statisticians to “redefine statistical significance.” 相似文献

14.

Maximum studentized score tests for the detection of outliers in time series regression models

《Journal of Statistical Computation and Simulation》2012,82(12):1355-1372

Efficient score tests exist among others, for testing the presence of additive and/or innovative outliers that are the result of the shifted mean of the error process under the regression model. A sample influence function of autocorrelation-based diagnostic technique also exists for the detection of outliers that are the result of the shifted autocorrelations. The later diagnostic technique is however not useful if the outlying observation does not affect the autocorrelation structure but is generated due to an inflation in the variance of the error process under the regression model. In this paper, we develop a unified maximum studentized type test which is applicable for testing the additive and innovative outliers as well as variance shifted outliers that may or may not affect the autocorrelation structure of the outlier free time series observations. Since the computation of the p-values for the maximum studentized type test is not easy in general, we propose a Satterthwaite type approximation based on suitable doubly non-central F-distributions for finding such p-values [F.E. Satterthwaite, An approximate distribution of estimates of variance components, Biometrics 2 (1946), pp. 110–114]. The approximations are evaluated through a simulation study, for example, for the detection of additive and innovative outliers as well as variance shifted outliers that do not affect the autocorrelation structure of the outlier free time series observations. Some simulation results on model misspecification effects on outlier detection are also provided. 相似文献

15.

Estimating the proportion of true null hypotheses using the pattern of observed p-values

Tiejun Tong Zeny Feng Julia S. Hilton Hongyu Zhao 《Journal of applied statistics》2013,40(9):1949-1964

Estimating the proportion of true null hypotheses, π₀, has attracted much attention in the recent statistical literature. Besides its apparent relevance for a set of specific scientific hypotheses, an accurate estimate of this parameter is key for many multiple testing procedures. Most existing methods for estimating π₀ in the literature are motivated from the independence assumption of test statistics, which is often not true in reality. Simulations indicate that most existing estimators in the presence of the dependence among test statistics can be poor, mainly due to the increase of variation in these estimators. In this paper, we propose several data-driven methods for estimating π₀ by incorporating the distribution pattern of the observed p-values as a practical approach to address potential dependence among test statistics. Specifically, we use a linear fit to give a data-driven estimate for the proportion of true-null p-values in (λ, 1] over the whole range [0, 1] instead of using the expected proportion at 1?λ. We find that the proposed estimators may substantially decrease the variance of the estimated true null proportion and thus improve the overall performance. 相似文献

16.

Regularized proportional odds models

《Journal of Statistical Computation and Simulation》2012,82(2):251-268

The proportional odds model (POM) is commonly used in regression analysis to predict the outcome for an ordinal response variable. The maximum likelihood estimation (MLE) approach is typically used to obtain the parameter estimates. The likelihood estimates do not exist when the number of parameters, p, is greater than the number of observations n. The MLE also does not exist if there are no overlapping observations in the data. In a situation where the number of parameters is less than the sample size but p is approaching to n, the likelihood estimates may not exist, and if they exist they may have quite large standard errors. An estimation method is proposed to address the last two issues, i.e. complete separation and the case when p approaches n, but not the case when p>n. The proposed method does not use any penalty term but uses pseudo-observations to regularize the observed responses by downgrading their effect so that they become close to the underlying probabilities. The estimates can be computed easily with all commonly used statistical packages supporting the fitting of POMs with weights. Estimates are compared with MLE in a simulation study and an application to the real data. 相似文献

17.

Editorial Collaborators

Sander Greenland 《The American statistician》2019,73(1):106-108

Abstract

The present note explores sources of misplaced criticisms of P-values, such as conflicting definitions of “significance levels” and “P-values” in authoritative sources, and the consequent misinterpretation of P-values as error probabilities. It then discusses several properties of P-values that have been presented as fatal flaws: That P-values exhibit extreme variation across samples (and thus are “unreliable”), confound effect size with sample size, are sensitive to sample size, and depend on investigator sampling intentions. These properties are often criticized from a likelihood or Bayesian framework, yet they are exactly the properties P-values should exhibit when they are constructed and interpreted correctly within their originating framework. Other common criticisms are that P-values force users to focus on irrelevant hypotheses and overstate evidence against those hypotheses. These problems are not however properties of P-values but are faults of researchers who focus on null hypotheses and overstate evidence based on misperceptions that p?=?0.05 represents enough evidence to reject hypotheses. Those problems are easily seen without use of Bayesian concepts by translating the observed P-value p into the Shannon information (S-value or surprisal) –log₂(p). 相似文献

18.

Asymmetry models for square contingency tables: exact tests via algebraic statistics

Anne Krampe Maria Kateri Sonja Kuhnt 《Statistics and Computing》2011,21(1):55-67

Square contingency tables with the same row and column classification occur frequently in a wide range of statistical applications, e.g. whenever the members of a matched pair are classified on the same scale, which is usually ordinal. Such tables are analysed by choosing an appropriate loglinear model. We focus on the models of symmetry, triangular, diagonal and ordinal quasi symmetry. The fit of a specific model is tested by the chi-squared test or the likelihood-ratio test, where p-values are calculated from the asymptotic chi-square distribution of the test statistic or, if this seems unjustified, from the exact conditional distribution. Since the calculation of exact p-values is often not feasible, we propose alternatives based on algebraic statistics combined with MCMC methods. 相似文献

19.

The p-value Function and Statistical Inference

D. A. S. Fraser 《The American statistician》2019,73(1):135-147

ABSTRACT

This article has two objectives. The first and narrower is to formalize the p-value function, which records all possible p-values, each corresponding to a value for whatever the scalar parameter of interest is for the problem at hand, and to show how this p-value function directly provides full inference information for any corresponding user or scientist. The p-value function provides familiar inference objects: significance levels, confidence intervals, critical values for fixed-level tests, and the power function at all values of the parameter of interest. It thus gives an immediate accurate and visual summary of inference information for the parameter of interest. We show that the p-value function of the key scalar interest parameter records the statistical position of the observed data relative to that parameter, and we then describe an accurate approximation to that p-value function which is readily constructed. 相似文献

20.

On two approximations to the F-distribution: Application to testing for intraclass correlation in family studies

Allan Donner George A. Wells Michael Eliasziw 《Revue canadienne de statistique》1989,17(2):209-215

Two approximations to the F-distribution are evaluated in the context of testing for intraclass correlation in the analysis of family data. The evaluation is based on a computation of empirical significance levels and a comparison between p-values associated with these approximations and the corresponding exact p-values. It is found that the approximate methods may give very unsatisfactory results, and exact methods are therefore recommended for general use. 相似文献