1.

A Proposed Hybrid Effect Size Plus p-Value Criterion: Empirical Evidence Supporting its Use

William M. Goodman Susan E. Spruill Eugene Komaroff 《The American statistician》2019,73(1):168-185

ABSTRACT

When the editors of Basic and Applied Social Psychology effectively banned the use of null hypothesis significance testing (NHST) from articles published in their journal, it set off a fire-storm of discussions both supporting the decision and defending the utility of NHST in scientific research. At the heart of NHST is the p-value which is the probability of obtaining an effect equal to or more extreme than the one observed in the sample data, given the null hypothesis and other model assumptions. Although this is conceptually different from the probability of the null hypothesis being true, given the sample, p-values nonetheless can provide evidential information, toward making an inference about a parameter. Applying a 10,000-case simulation described in this article, the authors found that p-values’ inferential signals to either reject or not reject a null hypothesis about the mean (α = 0.05) were consistent for almost 70% of the cases with the parameter’s true location for the sampled-from population. Success increases if a hybrid decision criterion, minimum effect size plus p-value (MESP), is used. Here, rejecting the null also requires the difference of the observed statistic from the exact null to be meaningfully large or practically significant, in the researcher’s judgment and experience. The simulation compares performances of several methods: from p-value and/or effect size-based, to confidence-interval based, under various conditions of true location of the mean, test power, and comparative sizes of the meaningful distance and population variability. For any inference procedure that outputs a binary indicator, like flagging whether a p-value is significant, the output of one single experiment is not sufficient evidence for a definitive conclusion. Yet, if a tool like MESP generates a relatively reliable signal and is used knowledgeably as part of a research process, it can provide useful information.
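
The hybrid rule described in this abstract can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not taken from the article: a two-sided z-test with a known population standard deviation, and a hypothetical function name `mesp_reject` with a hypothetical `min_effect` argument standing for the researcher's minimum meaningful distance from the null.

```python
from math import sqrt
from statistics import NormalDist


def mesp_reject(sample_mean, null_mean, sigma, n, min_effect, alpha=0.05):
    """Return (reject, p_value) under a hybrid effect-size-plus-p-value rule.

    Assumption for illustration: two-sided z-test with known sigma.
    The null is rejected only if the p-value is below alpha AND the observed
    distance from the null is at least min_effect.
    """
    z = (sample_mean - null_mean) / (sigma / sqrt(n))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    significant = p_value < alpha
    meaningful = abs(sample_mean - null_mean) >= min_effect
    return significant and meaningful, p_value


# A large sample makes a tiny difference "significant", but not meaningful.
print(mesp_reject(100.5, 100.0, sigma=5.0, n=1000, min_effect=2.0))
# A difference that is both significant and at least as large as min_effect.
print(mesp_reject(103.0, 100.0, sigma=5.0, n=25, min_effect=2.0))
```

In the first call the p-value alone would reject the null, while the hybrid rule does not, because the observed difference of 0.5 is below the chosen minimum of 2.0.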

2.

The adaptive calibration of testing p-values

Guimei Zhao Li Wang 《Communications in Statistics - Theory and Methods》2013,42(4):922-932

Abstract

In statistical hypothesis testing, a p-value is expected to be distributed as the uniform distribution on the interval (0, 1) under the null hypothesis. However, some p-values, such as the generalized p-value and the posterior predictive p-value, cannot be assured of this property. In this paper, we propose an adaptive p-value calibration approach, and show that the calibrated p-value is asymptotically distributed as the uniform distribution. For the Behrens–Fisher problem and the goodness-of-fit test under a normal model, the calibrated p-values are constructed and their behavior is evaluated numerically. Simulations show that the calibrated p-values are superior to the original ones.

3.

Interval-wise testing for functional data

A. Pini S. Vantini 《Journal of nonparametric statistics》2017,29(2):407-424

In the framework of null hypothesis significance testing for functional data, we propose a procedure able to select the intervals of the domain responsible for the rejection of a null hypothesis. An unadjusted p-value function and an adjusted one are the output of the procedure, namely interval-wise testing. Depending on the sort and level α of type-I error control, significant intervals can be selected by thresholding the two p-value functions at level α. We prove that the unadjusted (adjusted) p-value function point-wise (interval-wise) controls the probability of type-I error and it is point-wise (interval-wise) consistent. To highlight the gain in terms of interpretation of the phenomenon under study, we applied the interval-wise testing to the analysis of a benchmark functional data set, i.e. Canadian daily temperatures. The new procedure provides insights that current state-of-the-art procedures do not, supporting similar advantages in the analysis of functional data with less prior knowledge.

4.

The p-Value Requires Context, Not a Threshold

Rebecca A. Betensky 《The American statistician》2019,73(1):115-117

Abstract

It is widely recognized by statisticians, though not as widely by other researchers, that the p-value cannot be interpreted in isolation, but rather must be considered in the context of certain features of the design and substantive application, such as sample size and meaningful effect size. I consider the setting of the normal mean and highlight the information contained in the p-value in conjunction with the sample size and meaningful effect size. The p-value and sample size jointly yield 95% confidence bounds for the effect of interest, which can be compared to the predetermined meaningful effect size to make inferences about the true effect. I provide simple examples to demonstrate that although the p-value is calculated under the null hypothesis, and thus seemingly may be divorced from the features of the study from which it arises, its interpretation as a measure of evidence requires its contextualization within the study. This implies that any proposal for improved use of the p-value as a measure of the strength of evidence cannot simply be a change to the threshold for significance.
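
As a rough illustration of how a p-value and a sample size jointly determine confidence bounds in the normal-mean setting, the sketch below back-computes the estimate and its 95% interval from a two-sided p-value. It assumes a known standard deviation `sigma`, a null mean of zero, and a positive observed effect; the function name `bounds_from_p` and the numbers are hypothetical and are not taken from the article.

```python
from math import sqrt
from statistics import NormalDist


def bounds_from_p(p_value, n, sigma=1.0, level=0.95):
    """Recover (lower, estimate, upper) implied by a two-sided p-value.

    Assumptions for illustration: z-test of H0: mean = 0 with known sigma,
    and an observed effect in the positive direction.
    """
    std_normal = NormalDist()
    se = sigma / sqrt(n)                               # standard error of the mean
    z_obs = std_normal.inv_cdf(1.0 - p_value / 2.0)    # |z| implied by the p-value
    z_crit = std_normal.inv_cdf(1.0 - (1.0 - level) / 2.0)
    estimate = z_obs * se
    return estimate - z_crit * se, estimate, estimate + z_crit * se


# Same p-value, different sample sizes: the implied interval changes with n.
for n in (25, 400):
    lower, estimate, upper = bounds_from_p(0.01, n)
    print(n, round(lower, 3), round(estimate, 3), round(upper, 3))
```

Comparing the resulting bounds with a predetermined meaningful effect size shows why the same p-value can support different conclusions at different sample sizes.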

5.

Three Recommendations for Improving the Use of p-Values

Daniel J. Benjamin 《The American statistician》2019,73(1):186-191

ABSTRACT

Researchers commonly use p-values to answer the question: How strongly does the evidence favor the alternative hypothesis relative to the null hypothesis? p-Values themselves do not directly answer this question and are often misinterpreted in ways that lead to overstating the evidence against the null hypothesis. Even in the “post p < 0.05 era,” however, it is quite possible that p-values will continue to be widely reported and used to assess the strength of evidence (if for no other reason than the widespread availability and use of statistical software that routinely produces p-values and thereby implicitly advocates for their use). If so, the potential for misinterpretation will persist. In this article, we recommend three practices that would help researchers more accurately interpret p-values. Each of the three recommended practices involves interpreting p-values in light of their corresponding “Bayes factor bound,” which is the largest odds in favor of the alternative hypothesis relative to the null hypothesis that is consistent with the observed data. The Bayes factor bound generally indicates that a given p-value provides weaker evidence against the null hypothesis than typically assumed. We therefore believe that our recommendations can guard against some of the most harmful p-value misinterpretations. In research communities that are deeply attached to reliance on “p < 0.05,” our recommendations will serve as initial steps away from this attachment. We emphasize that our recommendations are intended merely as initial, temporary steps and that many further steps will need to be taken to reach the ultimate destination: a holistic interpretation of statistical evidence that fully conforms to the principles laid out in the ASA statement on statistical significance and p-values.
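
The abstract does not state a formula for the bound. A commonly cited calibration, due to Sellke et al. (2001, cited in entry 19 of this list), is 1/(-e·p·ln p) for p < 1/e. The sketch below assumes that form; the function name `bayes_factor_bound` is hypothetical.

```python
from math import e, log


def bayes_factor_bound(p_value):
    """Upper bound on the odds for the alternative implied by a p-value.

    Assumes the calibration 1 / (-e * p * ln p), valid for 0 < p < 1/e.
    For p >= 1/e the bound is taken to be 1 (no evidence for the alternative).
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in (0, 1]")
    if p_value >= 1.0 / e:
        return 1.0
    return 1.0 / (-e * p_value * log(p_value))


for p in (0.05, 0.01, 0.005):
    print(p, round(bayes_factor_bound(p), 2))
```

Under this calibration, p = 0.05 corresponds to odds of at most about 2.5 to 1 in favor of the alternative, which is the sense in which a p-value provides weaker evidence than typically assumed.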

6.

Abandon Statistical Significance

Blakeley B. McShane David Gal Andrew Gelman Christian Robert Jennifer L. Tackett 《The American statistician》2019,73(1):235-245

ABSTRACT

We discuss problems the null hypothesis significance testing (NHST) paradigm poses for replication and more broadly in the biomedical and social sciences as well as how these problems remain unresolved by proposals involving modified p-value thresholds, confidence intervals, and Bayes factors. We then discuss our own proposal, which is to abandon statistical significance. We recommend dropping the NHST paradigm—and the p-value thresholds intrinsic to it—as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences. Specifically, we propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence. We have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures. We offer recommendations for how our proposal can be implemented in the scientific publication process as well as in statistical decision making more broadly.

7.

Modified p-Value of Two-Sided Test for Normal Distribution with Restricted Parameter Space

Hsiuying Wang 《Communications in Statistics - Theory and Methods》2013,42(8):1361-1374

This article proposes a modified p-value for the two-sided test of the location of the normal distribution when the parameter space is restricted. A commonly used test for the two-sided test of the normal distribution is the uniformly most powerful unbiased (UMPU) test, which is also the likelihood ratio test. The p-value of the test is used as evidence against the null hypothesis. Note that the usual p-value does not depend on the parameter space but only on the observation and the assumption of the null hypothesis. When the parameter space is known to be restricted, the usual p-value cannot sufficiently utilize this information to make a more accurate decision. In this paper, a modified p-value (also called the rp-value) dependent on the parameter space is proposed, and the test derived from the modified p-value is also shown to be the UMPU test.

8.

Practicing safe statistics with the mid-p

R.D. Routledge 《Revue canadienne de statistique》1994,22(1):103-110

The mid-p-value is the standard p-value for a test minus half the difference between it and the nearest lower possible value. Its smaller size lends it an obvious appeal to users — it provides a more significant-looking summary of the evidence against the null hypothesis. This paper examines the possibility that the user might overstate the significance of the evidence by using the smaller mid-p in place of the standard p-value. Routine use of the mid-p is shown to control a quantity related to the Type I error rate. This related quantity is appropriate to consider when the decision to accept or reject the null hypothesis is not always firm. The natural, subjective interpretation of a p-value as the probability that the null hypothesis is true is also examined. The usual asymptotic correspondence between these two probabilities for one-sided hypotheses is shown to be strengthened when the standard p-value is replaced by the mid-p.
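
The definition in the first sentence of this abstract can be made concrete with a small sketch. It assumes a one-sided (upper-tail) exact binomial test, where the nearest lower attainable p-value is the tail probability starting one count higher; the function names are hypothetical and the example is not taken from the paper.

```python
from math import comb


def binom_pmf(k, n, prob):
    """Probability of exactly k successes in n Bernoulli(prob) trials."""
    return comb(n, k) * prob**k * (1 - prob) ** (n - k)


def upper_tail_p_values(x, n, prob0):
    """Return (standard p, mid-p) for H0: prob = prob0 against prob > prob0."""
    # Standard exact p-value: P(X >= x) under the null.
    p_standard = sum(binom_pmf(k, n, prob0) for k in range(x, n + 1))
    # Nearest lower attainable p-value: P(X >= x + 1) under the null.
    p_next_lower = sum(binom_pmf(k, n, prob0) for k in range(x + 1, n + 1))
    # Mid-p: standard p minus half the gap to the nearest lower value.
    mid_p = p_standard - 0.5 * (p_standard - p_next_lower)
    return p_standard, mid_p


# 8 successes in 10 trials, null success probability 0.5.
print(upper_tail_p_values(8, 10, 0.5))
```

In this setting the mid-p equals P(X > x) plus half of P(X = x), which is why it is always smaller than the standard p-value.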

9.

A more powerful unconditional exact test of homogeneity for 2 × c contingency table analysis

Louis Ehwerhemuepha Heng Sok Cyril Rakovski 《Journal of applied statistics》2019,46(14):2572-2582

The classical unconditional exact p-value test can be used to compare two multinomial distributions with small samples. This general hypothesis requires parameter estimation under the null which makes the test severely conservative. A similar property has been observed for Fisher's exact test with Barnard and Boschloo providing distinct adjustments that produce more powerful testing approaches. In this study, we develop a novel adjustment for the conservativeness of the unconditional multinomial exact p-value test that produces a nominal type I error rate and increased power in comparison to all alternative approaches. We used a large simulation study to empirically estimate the 5th percentiles of the distributions of the p-values of the exact test over a range of scenarios and implemented a regression model to predict the values for two-sample multinomial settings. Our results show that the new test is uniformly more powerful than Fisher's, Barnard's, and Boschloo's tests with gains in power as large as several hundred percent in certain scenarios. Lastly, we provide a real-life data example where the unadjusted unconditional exact test wrongly fails to reject the null hypothesis and the corrected unconditional exact test rejects the null appropriately.

10.

A Bayesian analysis for the multivariate point null testing problem

Miguel A. Gómez–Villegas Paloma Maín Luis Sanz 《Statistics》2013,47(4):379-391

A Bayesian test for the point null testing problem in the multivariate case is developed. A procedure to get the mixed distribution using the prior density is suggested. For comparisons between the Bayesian and classical approaches, lower bounds on posterior probabilities of the null hypothesis, over some reasonable classes of prior distributions, are computed and compared with the p-value of the classical test. With our procedure, a better approximation is obtained because the p-value is in the range of the Bayesian measures of evidence.

11.

Will the ASA's Efforts to Improve Statistical Practice be Successful? Some Evidence to the Contrary

Raymond Hubbard 《The American statistician》2019,73(1):31-35

ABSTRACT

Recent efforts by the American Statistical Association to improve statistical practice, especially in countering the misuse and abuse of null hypothesis significance testing (NHST) and p-values, are to be welcomed. But will they be successful? The present study offers compelling evidence that this will be an extraordinarily difficult task. Dramatic citation-count data on 25 articles and books severely critical of NHST's negative impact on good science, underlining that this issue was/is well known, did nothing to stem its usage over the period 1960–2007. On the contrary, employment of NHST increased during this time. To be successful in this endeavor, as well as restoring the relevance of the statistics profession to the scientific community in the 21st century, the ASA must be prepared to dispense detailed advice. This includes specifying those situations, if they can be identified, in which the p-value plays a clearly valuable role in data analysis and interpretation. The ASA might also consider a statement that recommends abandoning the use of p-values.

12.

Testing linear hypotheses in high-dimensional regressions

Zhidong Bai Dandan Jiang Jian-feng Yao 《Statistics》2013,47(6):1207-1223

For a multivariate linear model, Wilks' likelihood ratio test (LRT) constitutes one of the cornerstone tools. However, the computation of its quantiles under the null or the alternative hypothesis requires complex analytic approximations, and more importantly, these distributional approximations are feasible only for moderate dimension of the dependent variable, say p≤20. On the other hand, assuming that the data dimension p as well as the number q of regression variables are fixed while the sample size n grows, several asymptotic approximations are proposed in the literature for Wilks' Λ including the widely used chi-square approximation. In this paper, we consider necessary modifications to Wilks' test in a high-dimensional context, specifically assuming a high data dimension p and a large sample size n. Based on recent random matrix theory, the correction we propose to Wilks' test is asymptotically Gaussian under the null hypothesis and simulations demonstrate that the corrected LRT has very satisfactory size and power, surely in the large p and large n context, but also for moderately large data dimensions such as p=30 or p=50. As a byproduct, we give a reason explaining why the standard chi-square approximation fails for high-dimensional data. We also introduce a new procedure for the classical multiple sample significance test in multivariate analysis of variance which is valid for high-dimensional data.

13.

p-Values, Bayes Factors, and Sufficiency

Jonathan Rougier 《The American statistician》2019,73(1):148-151

ABSTRACT

Various approaches can be used to construct a model from a null distribution and a test statistic. I prove that one such approach, originating with D. R. Cox, has the property that the p-value is never greater than the Generalized Likelihood Ratio (GLR). When combined with the general result that the GLR is never greater than any Bayes factor, we conclude that, under Cox’s model, the p-value is never greater than any Bayes factor. I also provide a generalization, illustrations for the canonical Normal model, and an alternative approach based on sufficiency. This result is relevant for the ongoing discussion about the evidential value of small p-values, and the movement among statisticians to “redefine statistical significance.”

14.

Statistical evidence and surprise unified under possibility theory

David R. Bickel 《Scandinavian Journal of Statistics》2023,50(3):923-928

Sander Greenland argues that reported results of hypothesis tests should include the surprisal, the base-2 logarithm of the reciprocal of a p-value. The surprisal measures how many bits of evidence in the data warrant rejecting the null hypothesis. A generalization of surprisal also can measure how much the evidence justifies rejecting a composite hypothesis such as the complement of a confidence interval. That extended surprisal, called surprise, quantifies how many bits of astonishment an agent believing a hypothesis would experience upon observing the data. While surprisal is a function of a point in hypothesis space, surprise is a function of a subset of hypothesis space. Satisfying the conditions of conditional min-plus probability, surprise inherits a wealth of tools from possibility theory. The equivalent compatibility function has been recently applied to the replication crisis, to adjusting p-values for prior information, and to comparing scientific theories.
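
The surprisal defined in the first sentence of this abstract is a one-line transformation of the p-value. A minimal sketch follows; the function name `surprisal_bits` is hypothetical, and the generalized "surprise" for composite hypotheses is not implemented here.

```python
from math import log2


def surprisal_bits(p_value):
    """Surprisal in bits: the base-2 logarithm of the reciprocal of a p-value."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in (0, 1]")
    return -log2(p_value)


for p in (0.5, 0.25, 0.05, 0.005):
    print(p, round(surprisal_bits(p), 2))
```

A p-value of 0.25 carries 2 bits, as surprising as two heads in two tosses of a fair coin, and p = 0.05 carries a little over 4 bits.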

15.

A modified two-sample t-test based on permutation method for large-scale data

Mohsen Salehi Mohammad Mohammadi Mina Aminghafari 《Communications in Statistics - Simulation and Computation》2019,48(2):372-384

Large-scale data analysis, for example the analysis of microarray data, which includes hypothesis testing for equality of means in order to discover differentially expressed genes, often deals with a large number of features but only a few replicates. Furthermore, some genes are differentially expressed and others are not. Thus, the usual permutation method applied in these situations estimates the p-value poorly, because the two types of genes are mixed. To overcome this obstacle, null permutation samples have been suggested in the literature. We propose a modified uniformly most powerful unbiased test for testing the null hypothesis.

16.

Blending Bayesian and Classical Tools to Define Optimal Sample-Size-Dependent Significance Levels

Mark Andrew Gannon Carlos Alberto de Bragança Pereira Adriano Polpo 《The American statistician》2019,73(1):213-222

ABSTRACT

This article argues that researchers do not need to completely abandon the p-value, the best-known significance index, but should instead stop using significance levels that do not depend on sample sizes. A testing procedure is developed using a mixture of frequentist and Bayesian tools, with a significance level that is a function of sample size, obtained from a generalized form of the Neyman–Pearson Lemma that minimizes a linear combination of α, the probability of rejecting a true null hypothesis, and β, the probability of failing to reject a false null, instead of fixing α and minimizing β. The resulting hypothesis tests do not violate the Likelihood Principle and do not require any constraints on the dimensionalities of the sample space and parameter space. The procedure includes an ordering of the entire sample space and uses predictive probability (density) functions, allowing for testing of both simple and compound hypotheses. Accessible examples are presented to highlight specific characteristics of the new tests.
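
To see why minimizing a linear combination of α and β makes the significance level depend on sample size, consider the simplest textbook case, which is only a special case and not the article's full procedure: a simple null (mean 0) against a simple alternative (mean `delta` > 0) for normal data with known `sigma`. Minimizing a·α + b·β leads to rejecting when the likelihood ratio exceeds a/b, that is, when the sample mean exceeds delta/2 + sigma²·ln(a/b)/(n·delta). The names and weights below are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def optimal_alpha(n, delta, sigma=1.0, a=1.0, b=1.0):
    """Significance level of the test minimising a*alpha + b*beta.

    Illustrative setting: H0: mean = 0 against H1: mean = delta > 0,
    normal data with known sigma. The likelihood-ratio rule rejects when
    the sample mean exceeds delta/2 + sigma^2 * ln(a/b) / (n * delta).
    """
    cutoff = delta / 2.0 + sigma**2 * log(a / b) / (n * delta)
    se = sigma / sqrt(n)  # standard error of the sample mean
    return 1.0 - NormalDist().cdf(cutoff / se)


# The implied significance level shrinks as the sample size grows.
for n in (10, 25, 100, 400):
    print(n, round(optimal_alpha(n, delta=0.5), 4))
```

With equal weights the cutoff sits halfway between the two means, so larger samples drive both error probabilities down together instead of holding α fixed.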

17.

The Detection of Nonnegligible Directional Effects With Associated Measures of Statistical Significance

Melinda H. McCann Joshua D. Habiger 《The American statistician》2020,74(3):213-217

ABSTRACT

When comparing two treatment groups, the objectives are often to (1) determine if the difference between groups (the effect) is of scientific interest, or nonnegligible, and (2) determine if the effect is positive or negative. In practice, a p-value corresponding to the null hypothesis that no effect exists is used to accomplish the first objective and a point estimate for the effect is used to accomplish the second objective. This article demonstrates that this approach is fundamentally flawed and proposes a new approach. The proposed method allows for claims regarding the size of an effect (nonnegligible vs. negligible) and its nature (positive vs. negative) to be made, and provides measures of statistical significance associated with each claim.

18.

The p-values for one-sided hypothesis testing in univariate linear calibration

Guimei Zhao 《Communications in Statistics - Simulation and Computation》2017,46(6):4828-4840

In this article, we focus on one-sided hypothesis testing for univariate linear calibration, where a normally distributed response variable and an explanatory variable are involved. The observations of the response variable corresponding to known values of the explanatory variable are used to make inferences on a single unknown value of the explanatory variable. We apply generalized inference to the calibration problem, and take the generalized p-value as the test statistic to develop a new p-value for one-sided hypothesis testing, which we refer to as the one-sided posterior predictive p-value. The behavior of the one-sided posterior predictive p-value is numerically compared with that of the generalized p-value, and simulations show that the proposed p-value is quite satisfactory in terms of frequentist performance.

19.

Significance test for linear regression: how to test without P-values?

Paravee Maneejuk Woraphon Yamaka 《Journal of applied statistics》2021,48(5):827

The discussion on the use and misuse of p-values in 2016 by the American Statistical Association was a timely assertion that statistical concepts should be properly used in science. Some researchers, especially economists, who adopt significance testing and p-values to report their results, may have felt confused by the statement, leading to misinterpretations of it. In this study, we aim to re-examine the accuracy of the p-value and introduce an alternative way of testing the hypothesis. We conduct a simulation study to investigate the reliability of the p-value. Apart from investigating the performance of the p-value, we also introduce some existing approaches, Minimum Bayes Factors (MBFs) and belief functions, for replacing the p-value. Results from the simulation study confirm that the p-value is unreliable in some cases and that our proposed approaches seem to be useful as substitute tools in statistical inference. Moreover, our results show that the plausibility approach is more accurate for making decisions about the null hypothesis than the traditionally used p-values when the null hypothesis is true. However, the MBFs of Edwards et al. [Bayesian statistical inference for psychological research. Psychol. Rev. 70(3) (1963), pp. 193–242]; Vovk [A logic of probability, with application to the foundations of statistics. J. Royal Statistical Soc. Series B (Methodological) 55 (1993), pp. 317–351] and Sellke et al. [Calibration of p values for testing precise null hypotheses. Am. Stat. 55(1) (2001), pp. 62–71] provide more reliable results compared to all other methods when the null hypothesis is false. KEYWORDS: Ban of P-value, Minimum Bayes Factors, belief functions

20.

Inference and Decision Making for 21st-Century Drug Development and Approval

Stephen J. Ruberg Frank E. Harrell Jr. Margaret Gamalo-Siebers Lisa LaVange J. Jack Lee Karen Price 《The American statistician》2019,73(1):319-327

ABSTRACT

The cost and time of pharmaceutical drug development continue to grow at rates that many say are unsustainable. These trends have enormous impact on what treatments get to patients, when they get them and how they are used. The statistical framework for supporting decisions in regulated clinical development of new medicines has followed a traditional path of frequentist methodology. Trials using hypothesis tests of “no treatment effect” are done routinely, and the p-value < 0.05 is often the determinant of what constitutes a “successful” trial. Many drugs fail in clinical development, adding to the cost of new medicines, and some evidence points blame at the deficiencies of the frequentist paradigm. An unknown number of effective medicines may have been abandoned because trials were declared “unsuccessful” due to a p-value exceeding 0.05. Recently, the Bayesian paradigm has shown utility in the clinical drug development process for its probability-based inference. We argue for a Bayesian approach that employs data from other trials as a “prior” for Phase 3 trials so that synthesized evidence across trials can be utilized to compute probability statements that are valuable for understanding the magnitude of treatment effect. Such a Bayesian paradigm provides a promising framework for improving statistical inference and regulatory decision making.