Similar Documents
20 similar documents were found (search time: 218 ms).
1.
Distance between two probability densities or two random variables is a well-established concept in statistics. The present paper considers generalizations of distances to separation measurements for three or more elements in a function space. Geometric intuition and examples from hypothesis testing suggest lower and upper bounds for such measurements in terms of pairwise distances; moreover, in Lp spaces some useful non-pairwise separation measurements always lie within these bounds. Examples of such separation measurements are the Bayes probability of correct classification among several arbitrary distributions, and the expected range among several random variables.
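As a concrete illustration of one such separation measurement, the minimal sketch below numerically evaluates the equal-prior Bayes probability of correct classification among several densities, P(correct) = ∫ max_i (1/k) f_i(x) dx; the three normal densities are hypothetical stand-ins, not distributions from the paper.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Equal-prior Bayes probability of correct classification among k arbitrary densities:
# P(correct) = integral of max_i (1/k) f_i(x) dx, evaluated on a fine grid.
densities = [stats.norm(0, 1), stats.norm(1.5, 1), stats.norm(3, 2)]   # hypothetical example
x = np.linspace(-12, 15, 20001)
pdfs = np.vstack([d.pdf(x) for d in densities]) / len(densities)
p_correct = trapezoid(pdfs.max(axis=0), x)   # a non-pairwise separation measurement
print(round(p_correct, 3))
```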

2.
We analyse the shapes of star-shaped objects that are prealigned. This is motivated by two examples studying the growth of leaves and the temporal evolution of tree rings. In the latter case measurements were taken at fixed angles, whereas in the former case the angles were free. This leads to different shape spaces, related to different concepts of size, for the analysis. Whereas several shape spaces already existed in the literature when the angles are fixed, a new shape space for free angles, called the spherical shape space, needed to be introduced. We compare these different shape spaces both regarding their mathematical properties and their adequacy for the data at hand; we then apply suitably defined principal component analysis to them. In both examples we find that the shapes evolve mainly along the first principal component during growth; this is the 'geodesic hypothesis' that was formulated by Le and Kume. Moreover, we could link change-points of this evolution to significant changes in environmental conditions.

3.
We investigate the problem of regression from multiple reproducing kernel Hilbert spaces by means of an orthogonal greedy algorithm. The greedy algorithm is appealing because it uses only a small portion of the candidate kernels to represent the approximation of the regression function, and can greatly reduce the computational burden of traditional multi-kernel learning. Satisfactory learning rates are obtained based on the Rademacher chaos complexity and data-dependent hypothesis spaces.
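A minimal sketch of the orthogonal greedy idea (essentially orthogonal matching pursuit) over a matrix whose columns stand in for candidate kernel features; the column matrix, stopping rule, and refitting scheme are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

def orthogonal_greedy(Phi, y, n_steps):
    """Toy sketch: at each step pick the column of Phi most correlated with the current
    residual, then refit y by least squares on all columns selected so far (the
    'orthogonal' step). Only a few candidates end up in the representation."""
    selected, residual, coef = [], y.copy(), None
    for _ in range(n_steps):
        scores = np.abs(Phi.T @ residual)
        if selected:
            scores[selected] = -np.inf            # do not reselect a column
        selected.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(Phi[:, selected], y, rcond=None)
        residual = y - Phi[:, selected] @ coef
    return selected, coef
```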

4.
The four-parameter kappa distribution (K4D) is a generalized form of some commonly used distributions such as the generalized logistic, generalized Pareto, generalized Gumbel, and generalized extreme value (GEV) distributions. Owing to its flexibility, the K4D is widely applied in modeling in several fields such as hydrology and climatic change. For the estimation of the four parameters, the maximum likelihood approach and the method of L-moments are usually employed. The L-moment estimator (LME) works well for some parameter spaces, with up to a moderate sample size, but it sometimes fails to yield feasible estimates. Meanwhile, the maximum likelihood estimator (MLE) performs substantially worse with small sample sizes, showing a large variance. We therefore propose a maximum penalized likelihood estimation (MPLE) of the K4D by adjusting existing penalty functions that restrict the parameter space. Eighteen combinations of penalties for the two shape parameters are considered and compared. The MPLE retains modeling flexibility and large-sample optimality while also improving small-sample properties. The properties of the proposed estimator are verified through a Monte Carlo simulation, and an application is demonstrated using Thailand's annual maximum temperature data.
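To make the penalization idea concrete, here is a minimal sketch of maximum penalized likelihood fitting for the GEV special case of the K4D (via scipy.stats.genextreme); the log-penalty keeping the shape parameter away from a boundary is a hypothetical choice for illustration only, not one of the eighteen penalty combinations compared in the paper.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def neg_penalized_loglik(params, x):
    """Negative penalized log-likelihood for the GEV stand-in (hypothetical penalty)."""
    c, loc, scale = params                       # scipy's GEV shape c, location, scale
    if scale <= 0 or not (-0.5 < c < 0.5):       # keep the shape in a plausible range
        return np.inf
    loglik = np.sum(stats.genextreme.logpdf(x, c, loc=loc, scale=scale))
    penalty = np.log(0.5 - abs(c))               # pushes the shape away from the boundary
    return -(loglik + penalty)

x = stats.genextreme.rvs(0.1, loc=30, scale=2, size=25, random_state=0)  # toy small sample
fit = minimize(neg_penalized_loglik, x0=[0.0, np.median(x), x.std()],
               args=(x,), method="Nelder-Mead")
```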

5.
Log-normal and log-logistic distributions are often used to analyze lifetime data. For certain ranges of the parameters, the shapes of the probability density functions or the hazard functions can be very similar, and it might be very difficult to discriminate between the two distribution functions. In this article, we consider a discrimination procedure between the two distribution functions based on the ratio of maximized likelihoods. The asymptotic properties of the proposed criterion are investigated. It is observed that the asymptotic distributions are independent of the unknown parameters. The asymptotic distributions are used to determine the minimum sample size needed to discriminate between these two distribution functions for a user-specified probability of correct selection. We perform some simulation experiments to see how the asymptotic results work for small sample sizes. For illustrative purposes, two data sets are analyzed.
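A minimal sketch of the ratio-of-maximized-likelihoods statistic, assuming scipy's lognorm and fisk (log-logistic) parameterizations with the location fixed at zero; the cut-off and sample-size calibration come from the asymptotics in the paper and are not reproduced here.

```python
import numpy as np
from scipy import stats

def log_rml(x):
    """Log ratio of maximized likelihoods (log-normal vs. log-logistic); a sketch only.
    A positive value favours the log-normal model for the positive data x."""
    s, loc, scale = stats.lognorm.fit(x, floc=0)
    ll_lognorm = np.sum(stats.lognorm.logpdf(x, s, loc=loc, scale=scale))
    c, loc2, scale2 = stats.fisk.fit(x, floc=0)          # fisk = log-logistic in scipy
    ll_loglogistic = np.sum(stats.fisk.logpdf(x, c, loc=loc2, scale=scale2))
    return ll_lognorm - ll_loglogistic

x = stats.lognorm.rvs(0.5, scale=np.exp(1.0), size=40, random_state=0)  # toy data
T = log_rml(x)   # expected to be positive on average when the data are log-normal
```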

6.
Partial least squares regression has been an alternative to ordinary least squares for handling multicollinearity in several areas of scientific research since the 1960s. It has recently gained much attention in the analysis of high dimensional genomic data. We show that the known asymptotic consistency of the partial least squares estimator for a univariate response does not hold under the very large p and small n paradigm. We derive a similar result for multivariate response regression with partial least squares. We then propose a sparse partial least squares formulation which aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors. We provide an efficient implementation of sparse partial least squares regression and compare it with well-known variable selection and dimension reduction approaches via simulation experiments. We illustrate the practical utility of sparse partial least squares regression in a joint analysis of gene expression and genome-wide binding data.
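A toy sketch of the sparsity mechanism: the ordinary first PLS direction is proportional to X'y, and soft-thresholding its entries yields a sparse linear combination of the original predictors. The threshold lam is an illustrative tuning parameter; the published sparse partial least squares formulation is more general than this one-direction sketch.

```python
import numpy as np

def sparse_pls_direction(X, y, lam):
    """Sparse first direction by soft-thresholding the ordinary PLS direction X'y
    (a toy illustration of the idea, not the authors' full formulation)."""
    z = X.T @ y
    w = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # soft-threshold each loading
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w
```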

7.
Scientific research of all kinds should be guided by statistical thinking: in the design and conduct of the study, in the disciplined exploration and enlightened display of the data, and to avoid statistical pitfalls in the interpretation of the results. However, formal, probability-based statistical inference should play no role in most scientific research, which is inherently exploratory, requiring flexible methods of analysis that inherently risk overfitting. The nature of exploratory work is that data are used to help guide model choice, and under these circumstances, uncertainty cannot be precisely quantified, because of the inevitable model selection bias that results. To be valid, statistical inference should be restricted to situations where the study design and analysis plan are specified prior to data collection. Exploratory data analysis provides the flexibility needed for most other situations, including statistical methods that are regularized, robust, or nonparametric. Of course, no individual statistical analysis should be considered sufficient to establish scientific validity: research requires many sets of data along many lines of evidence, with a watchfulness for systematic error. Replicating and predicting findings in new data and new settings is a stronger way of validating claims than blessing results from an isolated study with statistical inferences.

8.
Digits in statistical data produced by natural or social processes are often distributed in a manner described by ‘Benford's law’. Recently, a test against this distribution was used to identify fraudulent accounting data. This test is based on the supposition that first, second, third, and other digits in real data follow the Benford distribution while the digits in fabricated data do not. Is it possible to apply Benford tests to detect fabricated or falsified scientific data as well as fraudulent financial data? We approached this question in two ways. First, we examined the use of the Benford distribution as a standard by checking the frequencies of the nine possible first and ten possible second digits in published statistical estimates. Second, we conducted experiments in which subjects were asked to fabricate statistical estimates (regression coefficients). The digits in these experimental data were scrutinized for possible deviations from the Benford distribution. There were two main findings. First, both digits of the published regression coefficients were approximately Benford distributed or at least followed a pattern of monotonic decline. Second, the experimental results yielded new insights into the strengths and weaknesses of Benford tests. Surprisingly, first digits of faked data also exhibited a pattern of monotonic decline, while second, third, and fourth digits were distributed less in accordance with Benford's law. At least in the case of regression coefficients, there were indications that checks for digit-preference anomalies should focus less on the first (i.e. leftmost) and more on later digits.
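A generic first-digit check of the kind described, assuming a simple chi-square goodness-of-fit test against the Benford proportions log10(1 + 1/d); this is a minimal sketch, not the exact battery of digit tests used in the study.

```python
import numpy as np
from scipy.stats import chisquare

def benford_first_digit_test(values):
    """Chi-square goodness-of-fit test of leading digits against Benford's law,
    P(d) = log10(1 + 1/d) for d = 1..9 (a generic sketch of a 'Benford test')."""
    digits = np.array([int(f"{abs(v):e}"[0]) for v in values if v != 0])  # leading significant digit
    observed = np.bincount(digits, minlength=10)[1:]
    expected = np.log10(1 + 1 / np.arange(1, 10)) * digits.size
    return chisquare(observed, expected)

# usage on a hypothetical vector of regression coefficients:
stat, p_value = benford_first_digit_test([0.031, 1.27, 0.0042, 18.5, 0.93, 2.6])
```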

9.
Second-generation p-values preserve the simplicity that has made p-values popular while resolving critical flaws that promote misinterpretation of data, distraction by trivial effects, and unreproducible assessments of data. The second-generation p-value (SGPV) is an extension that formally accounts for scientific relevance by using a composite null hypothesis that captures both null and scientifically trivial effects. Because the majority of spurious findings are small effects that are technically nonnull but practically indistinguishable from the null, the second-generation approach greatly reduces the likelihood of a false discovery. SGPVs promote transparency, rigor, and reproducibility of scientific results by identifying a priori which candidate hypotheses are practically meaningful and by providing a more reliable statistical summary of when the data are compatible with the candidate hypotheses or null hypotheses, or when the data are inconclusive. We illustrate the importance of these advances using a dataset of 247,000 single-nucleotide polymorphisms, i.e., genetic markers that are potentially associated with prostate cancer.
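For readers who want the mechanics, here is a minimal sketch of the interval-overlap form of the SGPV as commonly presented: the fraction of the interval estimate I lying inside the composite (interval) null H0, with a correction factor so that a very wide, inconclusive interval yields a value near one half. The numeric example is hypothetical.

```python
def sgpv(ci_lo, ci_hi, null_lo, null_hi):
    """Second-generation p-value sketch: overlap fraction |I ∩ H0|/|I| times the
    correction max{|I|/(2|H0|), 1}, so a very wide interval covering H0 gives 1/2.
    (Sketch of the commonly stated definition; hypothetical numbers below.)"""
    overlap = max(0.0, min(ci_hi, null_hi) - max(ci_lo, null_lo))
    len_i, len_h0 = ci_hi - ci_lo, null_hi - null_lo
    return (overlap / len_i) * max(len_i / (2 * len_h0), 1.0)

p = sgpv(ci_lo=-0.2, ci_hi=0.8, null_lo=-0.3, null_hi=0.3)   # partial overlap -> 0 < p < 1
```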

10.
Inference for a generalized linear model is generally performed using asymptotic approximations for the bias and the covariance matrix of the parameter estimators. For small experiments, these approximations can be poor and result in estimators with considerable bias. We investigate the properties of designs for small experiments when the response is described by a simple logistic regression model and parameter estimators are to be obtained by the maximum penalized likelihood method of Firth [Firth, D., 1993, Bias reduction of maximum likelihood estimates. Biometrika, 80, 27–38]. Although this method achieves a reduction in bias, we illustrate that the remaining bias may be substantial for small experiments, and propose minimization of the integrated mean square error, based on Firth's estimates, as a suitable criterion for design selection. This approach is used to find locally optimal designs for two support points.
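A minimal sketch of Firth-type penalization for simple logistic regression, adding half the log-determinant of the Fisher information (the Jeffreys-prior term) to the log-likelihood and maximizing numerically; the toy two-point design and optimizer choice are illustrative assumptions, and the integrated mean square error design criterion itself is not implemented here.

```python
import numpy as np
from scipy.optimize import minimize

def firth_neg_penalized_loglik(beta, X, y):
    """Negative of Firth's penalized log-likelihood: l(beta) + 0.5 * log det(X'WX)."""
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 1e-10, 1 - 1e-10)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    W = p * (1 - p)                                   # logistic weights
    penalty = 0.5 * np.linalg.slogdet(X.T @ (W[:, None] * X))[1]
    return -(loglik + penalty)

rng = np.random.default_rng(0)                        # hypothetical two-support-point design
X = np.column_stack([np.ones(20), np.repeat([-1.0, 1.0], 10)])
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + X[:, 1]))), size=20).astype(float)
fit = minimize(firth_neg_penalized_loglik, np.zeros(2), args=(X, y))
```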

11.
We extend recent work on Laplace approximations (Tierney and Kadane 1986; Tierney, Kass, and Kadane 1989) from parameter spaces that are subspaces of R^k to those that are on circles, spheres, and cylinders. While such distributions can be mapped onto the real line (for example, a distribution on the circle can be thought of as a function of an angle θ, 0 ≤ θ ≤ 2π), that the end points coincide is not a feature of the real line, and requires special treatment. Laplace approximations on the real line make essential use of the normal integral in both the numerator and the denominator. Here that role is played by the von Mises integral on the circle, by the Bingham integrals on the spheres and hyperspheres, and by the normal-von Mises and normal-Bingham integrals on the cylinders and hypercylinders, respectively. We begin with a brief introduction to Laplace approximations and to previous Bayesian work on circles, spheres, and cylinders. We then develop the theory for parameter spaces that are hypercylinders, since all other shapes considered here are special cases. We compute some examples, which show reasonable accuracy even for small samples.
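For orientation, here is the basic real-line Laplace approximation that the paper generalizes: ∫ exp{n h(θ)} dθ ≈ exp{n h(θ̂)} √(2π / (n |h''(θ̂)|)), where θ̂ maximizes h. The sketch checks it on a toy Gaussian-shaped integrand (where it is exact); the circular and spherical versions replace the normal integral by von Mises and Bingham integrals as described above.

```python
import numpy as np
from scipy.integrate import quad

def laplace_log_integral(h, h2, theta_hat, n):
    """Laplace approximation to log of the integral of exp{n h(theta)} over the real line."""
    return n * h(theta_hat) + 0.5 * np.log(2 * np.pi / (n * abs(h2(theta_hat))))

# toy check with h(theta) = -theta**2/2, maximized at 0 (approximation is exact here)
n = 25
approx = laplace_log_integral(lambda t: -t**2 / 2, lambda t: -1.0, 0.0, n)
exact = np.log(quad(lambda t: np.exp(-n * t**2 / 2), -np.inf, np.inf)[0])
```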

12.
In the parametric setting, the notion of a likelihood function forms the basis for the development of tests of hypotheses and estimation of parameters. Tests in connection with the analysis of variance stem entirely from considerations of the likelihood function. On the other hand, nonparametric procedures have generally been derived without any formal mechanism and are often the result of clever intuition. In the present article, we propose a more formal approach for deriving tests involving the use of ranks. Specifically, we define a likelihood function motivated by characteristics of the ranks of the data and demonstrate that this leads to well-known tests of hypotheses. We also point to various areas of further exploration, such as how covariates may be incorporated.

13.
Functional data are being observed frequently in many scientific fields, and therefore most of the standard statistical methods are being adapted to functional data. We consider the multivariate analysis of variance (MANOVA) problem for functional data, which is of practical interest in much the same way as the one-way analysis of variance for such data. For the MANOVA problem for multivariate functional data, we propose permutation tests based on a basis function representation and tests based on random projections. Their performance is examined in comprehensive simulation studies, which provide an idea of the size control and power of the tests and identify differences between them. The simulation experiments are based on artificial data and on real labeled multivariate time series data found in the literature. The results suggest that the studied testing procedures can detect small differences between vectors of curves even with small sample sizes. Illustrative real data examples of the use of the proposed testing procedures in practice are also presented.
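A toy sketch of the random-projection idea, assuming each group is stored as an array of shape (n_g, p, T) (p functional variables observed on T grid points): observations are projected onto a single random direction and a between-group sum of squares is calibrated by permuting group labels. The paper combines many projections and basis-function representations; this shows only the core mechanism.

```python
import numpy as np

def projection_permutation_test(samples, n_perm=999, seed=0):
    """One-projection permutation test sketch for grouped multivariate functional data."""
    rng = np.random.default_rng(seed)
    X = np.concatenate([s.reshape(len(s), -1) for s in samples])        # flatten p x T
    labels = np.concatenate([np.full(len(s), g) for g, s in enumerate(samples)])
    u = rng.normal(size=X.shape[1])
    z = X @ (u / np.linalg.norm(u))                                      # projected scores

    def between_group_ss(lab):
        grand = z.mean()
        return sum((lab == g).sum() * (z[lab == g].mean() - grand) ** 2
                   for g in np.unique(lab))

    obs = between_group_ss(labels)
    perms = [between_group_ss(rng.permutation(labels)) for _ in range(n_perm)]
    return (1 + sum(p >= obs for p in perms)) / (n_perm + 1)             # permutation p-value
```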

14.
We introduce new criteria for model discrimination and use these and existing criteria to evaluate standard orthogonal designs. We show that the capability of orthogonal designs for model discrimination is surprisingly varied. In fact, for specified sample sizes, number of factors, and model spaces, many orthogonal designs are not model discriminating by the definition given in this paper, while others in the same class of orthogonal designs are. We also use these criteria to construct optimal two-level model-discriminating designs for screening experiments. The efficacy of these designs is studied, both in terms of estimation efficiency and discrimination success. Simulation studies indicate that the constructed designs result in substantively higher likelihoods of identifying the correct model.

15.
Recent approaches to the statistical analysis of adverse event (AE) data in clinical trials have proposed the use of groupings of related AEs, such as by system organ class (SOC). These methods have opened up the possibility of scanning large numbers of AEs while controlling for multiple comparisons, making the comparative performance of the different methods in terms of AE detection and error rates of interest to investigators. We apply two Bayesian models and two procedures for controlling the false discovery rate (FDR), which use groupings of AEs, to real clinical trial safety data. We find that while the Bayesian models are appropriate for the full data set, the error-controlling methods only give results similar to the Bayesian methods when low-incidence AEs are removed. A simulation study is used to compare the relative performances of the methods. We investigate the differences between the methods over full trial data sets and over data sets with low-incidence AEs and SOCs removed. We find that while the removal of low-incidence AEs increases the power of the error-controlling procedures, the estimated power of the Bayesian methods remains relatively constant over all data sizes. Automatic removal of low-incidence AEs, however, does have an effect on the error rates of all the methods, and a clinically guided approach to their removal is needed. Overall, we found that the Bayesian approaches are particularly useful for scanning the large amounts of AE data gathered.
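For context, a minimal sketch of the plainest FDR-controlling step-up procedure (Benjamini–Hochberg) applied to a vector of per-AE p-values; the group-based (e.g., by SOC) procedures studied in the paper are more elaborate, so this is only a baseline illustration with a hypothetical threshold q.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Plain Benjamini-Hochberg step-up: reject the k smallest p-values, where k is the
    largest i with p_(i) <= i*q/m (a generic FDR baseline, not the paper's procedures)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected
```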

16.
The confidence interval is a basic type of interval estimation in statistics. When dealing with samples from a normal population with unknown mean and variance, the traditional method of constructing a t-based confidence interval for the mean is to treat the n sampled units as n groups and build the interval. Here we propose a generalized method: we first divide the units into several equal-sized groups and then calculate the confidence interval from the mean values of these groups. If we define "better" in terms of the expected length of the confidence interval, then the first method is better, because the expected length of the confidence interval obtained from the first method is shorter; we prove this intuition theoretically. We also show that when the elements in each group are correlated, the first method is invalid, while the second can still give correct results in terms of coverage probability, and we illustrate this with analytical expressions. In practice, when the data set is extremely large and distributed across several data centers, the second method is a good tool for obtaining confidence intervals, in both the independent and the correlated case. Some simulations and real data analyses are presented to verify our theoretical results.
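A minimal sketch contrasting the two constructions on simulated data: the usual t-interval from all n units versus a t-interval built from equal-sized group means. The group size and simulated data are illustrative assumptions; on average the first interval is the shorter one, as stated above.

```python
import numpy as np
from scipy import stats

def t_interval(values, level=0.95):
    """Ordinary t-based confidence interval for the mean of 'values'."""
    m, s, k = values.mean(), values.std(ddof=1), len(values)
    half = stats.t.ppf(0.5 + level / 2, df=k - 1) * s / np.sqrt(k)
    return m - half, m + half

rng = np.random.default_rng(1)                       # toy normal sample, n = 600
x = rng.normal(loc=10.0, scale=2.0, size=600)

ci_full = t_interval(x)                              # method 1: treat the n units individually
ci_groups = t_interval(x.reshape(30, 20).mean(axis=1))   # method 2: 30 groups of 20, CI from group means
```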

17.
We show that for any sample size, any size of the test, and any weights matrix outside a small class of exceptions, there exists a positive measure set of regression spaces such that the power of the Cliff–Ord test vanishes as the autocorrelation increases in a spatial error model. This result extends to the tests that define the Gaussian power envelope of all invariant tests for residual spatial autocorrelation. In most cases, the regression spaces such that the problem occurs depend on the size of the test, but there also exist regression spaces such that the power vanishes regardless of the size. A characterization of such particularly hostile regression spaces is provided.

18.
This paper is devoted to the estimation of the derivative of the regression function in fixed-design nonparametric regression. We establish the almost sure convergence as well as the asymptotic normality of our estimate. We also provide concentration inequalities which are useful for small sample sizes. Numerical experiments on simulated data show that our nonparametric statistical procedure performs very well. We also illustrate our approach on high-frequency environmental data for the study of marine pollution.
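As a generic illustration (not the authors' estimator), the sketch below estimates the derivative of the regression function at a point by a kernel-weighted local linear fit on a fixed design: the slope coefficient of the local fit estimates m'(x0). The bandwidth h and the simulated curve are illustrative assumptions.

```python
import numpy as np

def local_slope(x, y, x0, h):
    """Local linear estimate of m'(x0): weighted least squares fit of y on (1, x - x0)
    with Gaussian kernel weights; the slope coefficient estimates the derivative."""
    w = np.sqrt(np.exp(-0.5 * ((x - x0) / h) ** 2))
    design = np.column_stack([np.ones_like(x), x - x0])
    coef, *_ = np.linalg.lstsq(w[:, None] * design, w * y, rcond=None)
    return coef[1]

x = np.linspace(0, 1, 200)                                     # fixed design
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=200)
d_hat = local_slope(x, y, x0=0.5, h=0.05)                      # true m'(0.5) = -2*pi
```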

19.
Population-parameter mapping (PPM) is a method for estimating the parameters of latent scientific models that describe the statistical likelihood function. The PPM method involves Bayesian inference in terms of the statistical parameters and the mapping from the statistical parameter space to the parameter space of the latent scientific parameters, and it obtains a model coherence estimate, P(coh). The P(coh) statistic can be valuable for designing experiments and comparing competing models, and it can be helpful in redesigning flawed models. Examples are provided where greater estimation precision was found for small sample sizes for the PPM point estimates relative to the maximum likelihood estimator (MLE).

20.
Data Science is one of the newest interdisciplinary areas. It is transforming our lives unexpectedly fast. This transformation is also happening in our learning styles and practicing habits. We advocate an approach to data science training that uses several types of computational tools, including R, bash, awk, regular expressions, SQL, and XPath, often used in tandem. We discuss ways for undergraduate mentees to learn about data science topics, at an early point in their training. We give some intuition for researchers, professors, and practitioners about how to effectively embed real-life examples into data science learning environments. As a result, we have a unified program built on a foundation of team-oriented, data-driven projects.
