Similar Literature
20 similar documents found.
1.
In recent years, growing attention has been paid to the increasingly common pattern of 'clumpy data' in many empirical areas, such as financial market microstructure, criminology, seismology, and digital media consumption, to name just a few; but a well-defined and careful measurement of clumpiness has remained somewhat elusive. The related 'hot hand' effect has long been a widespread belief in sports and has triggered a branch of interesting research that could shed some light on this domain. However, since many concerns have been raised about the low power of the existing 'hot hand' significance tests, we propose a new class of clumpiness measures that are shown to have higher statistical power in extensive simulations under a wide variety of statistical models for repeated outcomes. Finally, an empirical study is provided using a unique dataset obtained from Hulu.com, an increasingly popular video streaming provider. Our results provide evidence that the 'clumpiness phenomenon' is widely prevalent in digital content consumption, which supports the popular belief in the 'bingeability' of online content.
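To make the notion concrete, here is a hedged sketch of one entropy-based clumpiness measure discussed in this literature: rescale the n event times to the observation window [0, 1], form the n + 1 inter-event gaps x_i, and compute H = 1 + sum(x_i log x_i) / log(n + 1), which is near 0 for evenly spaced events and approaches 1 for tightly clustered ("binged") events. Whether this coincides with the specific class of measures proposed in the article is not claimed here; the viewing-time vectors are invented for illustration.

```python
import numpy as np

def clumpiness(event_times, window=1.0):
    """Entropy-based clumpiness of event times observed on [0, window]."""
    t = np.sort(np.asarray(event_times, dtype=float)) / window
    gaps = np.diff(np.concatenate([[0.0], t, [1.0]]))  # n + 1 inter-event gaps
    gaps = gaps[gaps > 0]                              # treat 0*log(0) as 0
    return 1.0 + np.sum(gaps * np.log(gaps)) / np.log(len(t) + 1)

even = np.linspace(0.1, 0.9, 9)            # evenly spread viewing sessions
binge = 0.5 + 0.01 * np.arange(9)          # nine sessions packed together
print("even:", round(clumpiness(even), 3), " binge:", round(clumpiness(binge), 3))
```

On this toy input the evenly spaced sessions score exactly 0 and the packed sessions score about 0.53, matching the intended ordering.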

2.
The data cloning method is a new computational tool for computing maximum likelihood estimates in complex statistical models such as mixed models. Here, the method is combined with integrated nested Laplace approximation to compute maximum likelihood estimates efficiently via a fast implementation for generalized linear mixed models. The asymptotic behavior of the hybrid data cloning method is discussed. The performance of the proposed method is illustrated through a simulation study and real examples, which show that it performs well and corroborates the theory. Supplemental materials for this article are available online.
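The core data cloning idea can be sketched on a toy model. The example below, a conjugate Gamma-Poisson model rather than the article's INLA-based generalized linear mixed model implementation, shows the two defining properties: as the number of clones K grows, the cloned posterior mean approaches the maximum likelihood estimate, and K times the posterior variance approaches the asymptotic variance of the MLE. The prior parameters and data are illustrative.

```python
# Data cloning on a toy Poisson(lam) model with a Gamma(a, b) prior.
# With K clones the posterior is Gamma(a + K*sum(y), b + K*n).
import numpy as np

rng = np.random.default_rng(0)
y = rng.poisson(lam=3.2, size=50)          # toy data
a, b = 0.5, 0.5                            # arbitrary Gamma(a, b) prior

for K in (1, 10, 100):
    shape = a + K * y.sum()                # cloned-posterior Gamma shape
    rate = b + K * y.size                  # cloned-posterior Gamma rate
    post_mean = shape / rate               # -> MLE as K -> infinity
    post_var = shape / rate**2             # K * post_var -> asymptotic var
    print(f"K={K:4d}  mean={post_mean:.4f}  K*var={K * post_var:.4f}")

print("MLE:", y.mean(), " asymptotic var of MLE:", y.mean() / y.size)
```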

3.
The article develops a semiparametric estimation method for the bivariate count data regression model. We develop a series expansion approach in which dependence between count variables is introduced by means of stochastically related unobserved heterogeneity components and in which, unlike commonly used existing models, positive as well as negative correlations are allowed. Extensions that accommodate excess zeros, censored data, and multivariate generalizations are also given. Monte Carlo experiments and an empirical application to tobacco use confirm that the model performs well relative to existing bivariate models, in terms of various statistical criteria and in capturing the range of correlation among dependent variables. This article has supplementary materials online.

4.
The availability of next generation sequencing (NGS) technology in today's biomedical research has provided new opportunities for the scientific discovery of genetic information. The high-throughput NGS technology, especially DNA-seq, is particularly useful in profiling a genome for the analysis of DNA copy number variants (CNVs). The read count (RC) data resulting from NGS technology are massive and information rich, and how to exploit them for accurate CNV detection has become a computational and statistical challenge. In this paper, we provide a statistical online change-point method to help detect CNVs in sequencing RC data. The method searches online for change points (or breakpoints) under a Markov chain assumption on the breakpoint loci, using an iterative computing process within a Bayesian framework. We illustrate that an online change-point detection method is particularly suitable for identifying CNVs in RC data. The algorithm is applied to the publicly available NCI-H2347 lung cancer cell line sequencing read data to locate the breakpoints. Extensive simulation studies have been carried out, and the results show the good performance of the proposed algorithm. The algorithm is implemented in R and the code is available upon request.
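For intuition, the sketch below implements a much simpler online detector than the one described above: a one-sided Poisson CUSUM that flags an increase in the read-count rate from lam0 to lam1. It shares the online, sequential flavor of the proposed method but omits the Markov chain model on breakpoint loci and the Bayesian iterations; the rates and threshold are illustrative.

```python
# Simplified online change detection in read-count data: one-sided Poisson
# CUSUM for a rate increase from lam0 to lam1 (not the paper's method).
import numpy as np

def poisson_cusum(counts, lam0, lam1, threshold):
    """Return the first index at which the CUSUM statistic crosses threshold."""
    llr_slope = np.log(lam1 / lam0)        # per-read log-likelihood ratio term
    s = 0.0
    for t, x in enumerate(counts):
        s = max(0.0, s + x * llr_slope - (lam1 - lam0))
        if s > threshold:
            return t                       # alarm: possible CNV breakpoint
    return None

rng = np.random.default_rng(1)
rc = np.concatenate([rng.poisson(10, 200), rng.poisson(15, 100)])  # shift at 200
print("alarm at position:", poisson_cusum(rc, lam0=10, lam1=15, threshold=8.0))
```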

5.
When genuine panel data samples are not available, repeated cross-sectional surveys can be used to form so-called pseudo panels. In this article, we investigate the properties of linear pseudo panel data estimators with a fixed number of cohorts and time observations. We extend the standard linear pseudo panel data setup to models with factor residuals by adapting the quasi-differencing approach developed for genuine panels. In a Monte Carlo study, we find that the proposed procedure has good finite sample properties in situations with endogeneity, cohort interactive effects, and near nonidentification. Finally, as an illustration, the proposed method is applied to data from Ecuador to study labor supply elasticity. Supplementary materials for this article are available online.

6.
The six recommendations made by the Guidelines for Assessment and Instruction in Statistics Education (GAISE) committee were first communicated in 2005 and more formally in 2010. In this article, 25 introductory statistics textbooks are examined to assess how well they have incorporated the three GAISE recommendations most relevant to implementation in textbooks (statistical literacy and thinking; use of real data; stressing concepts over procedures). The implementation of another recommendation (using technology) is described but not assessed. In general, most textbooks appear to be adopting the GAISE recommendations reasonably well in both exposition and exercises. The textbooks are particularly adept at using real data, using real data well, and promoting statistical literacy. They are less adept, though still rated reasonably well in general, at stressing concepts over procedures and promoting statistical thinking. In contrast, few textbooks have easy-to-use glossaries of statistical terms to assist with the understanding of statistical language and the development of literacy. Supplementary materials for this article are available online.

7.
Characteristics of Bidders' Bid Data in Online Auctions and Methods for Their Analysis
In traditional statistical analysis, researchers face numerical data in three forms: cross-sectional data, time series data, and pooled data. Data of these types are discrete, evenly spaced, and uniform in density, and they are the principal objects of analysis in traditional descriptive and inferential statistics. However, data collected from auction websites, such as bidders' bids, do not share these properties and thus pose a challenge to traditional statistical methods. It is therefore necessary to explain the generating mechanism of online auction data and to analyze its characteristics in terms of data volume, mixture of data types, unequally spaced distribution, and data density, and to present, with reference to real online auction records, methods and procedures for analyzing such data.

8.
Summary. A review of methods suggested in the literature for the sequential detection of changes in public health surveillance data is presented. Many researchers have noted the need for prospective methods, and in recent years there has been increased interest in this type of problem in both the statistical and the epidemiological literature. However, most of the vast literature on public health monitoring deals with retrospective methods, especially spatial methods, and evaluations with respect to the statistical properties of interest for prospective surveillance are rare. The special aspects of prospective statistical surveillance and different ways of evaluating such methods are described. Attention is given both to methods that use only the time domain and to detection methods for observations with a spatial structure. For the surveillance of a change in a Poisson process, the likelihood ratio method and the Shiryaev–Roberts method are derived.
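For the Poisson case mentioned above, the Shiryaev–Roberts recursion is simple enough to sketch directly: with likelihood ratio Λ_t of observation t under the post-change versus pre-change rate, the statistic evolves as R_t = (1 + R_{t-1})Λ_t, and an alarm is raised when it crosses a threshold. The rates and the threshold below are illustrative choices, not values calibrated in the review.

```python
# Shiryaev-Roberts surveillance of a rate shift in independent Poisson counts.
import numpy as np

def shiryaev_roberts(counts, lam0, lam1, threshold):
    r = 0.0
    for t, x in enumerate(counts):
        # likelihood ratio of observation x under lam1 versus lam0
        lr = np.exp(-(lam1 - lam0)) * (lam1 / lam0) ** x
        r = (1.0 + r) * lr                 # SR recursion R_t = (1 + R_{t-1}) * LR
        if r > threshold:
            return t                       # alarm time
    return None

rng = np.random.default_rng(2)
counts = np.concatenate([rng.poisson(4, 150), rng.poisson(7, 50)])  # shift at 150
print("alarm at:", shiryaev_roberts(counts, lam0=4, lam1=7, threshold=1e4))
```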

9.
Real-time monitoring is necessary for nanoparticle exposure assessment in order to characterize the exposure profile, but the data produced are autocorrelated. This study was conducted to compare three statistical methods for analyzing such autocorrelated time series and to investigate, using field data, the effect of averaging time on the reduction of autocorrelation. The first-order autoregressive (AR(1)) and autoregressive integrated moving average (ARIMA) models are alternative methods that remove autocorrelation; the classical regression method was compared with both. Three data sets from a scanning mobility particle sizer were used, and we compared the results of regression, AR(1), and ARIMA with averaging times of 1, 5, and 10 min. The AR(1) and ARIMA models had similar capacities to adjust for autocorrelation in real-time data, but because real-time monitoring data are non-stationary, ARIMA was more appropriate; when using AR(1), transformation into stationary data was necessary. Longer averaging times made no difference. This study suggests that the ARIMA model can be used to process real-time monitoring data, especially non-stationary data, and that the averaging time setting is flexible depending on the data interval required to capture the effects of processes in occupational and environmental nanoparticle measurements.
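A hedged sketch of this comparison is below: fit AR(1) and ARIMA(1,1,1) models to an autocorrelated series with a mild trend (simulated here, standing in for the field data) and inspect the lag-1 autocorrelation left in the residuals. The simulation parameters and model orders are illustrative.

```python
# Compare how AR(1) and ARIMA absorb autocorrelation in a trending series.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(3)
e = rng.normal(size=500)
x = np.zeros(500)
for t in range(1, 500):                    # autocorrelated "real-time" series
    x[t] = 0.8 * x[t - 1] + e[t]
x += 0.02 * np.arange(500)                 # mild trend -> non-stationarity

ar1 = ARIMA(x, order=(1, 0, 0)).fit()      # AR(1) assumes stationarity
arima = ARIMA(x, order=(1, 1, 1)).fit()    # differencing absorbs the trend

for name, res in [("AR(1)", ar1), ("ARIMA(1,1,1)", arima)]:
    lag1 = acf(res.resid, nlags=1)[1]
    print(f"{name}: lag-1 residual autocorrelation = {lag1:+.3f}")
```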

10.
It is widely accepted that some financial data exhibit long memory or long-range dependence, and that observed data usually contain noise. In the continuous time setting, the fractional Brownian motion B_H and its extensions are an important class of models for characterizing the long or short memory of data, and the Hurst parameter H is an index describing the degree of dependence. In this article, we estimate the Hurst parameter of a discretely sampled fractional integral process corrupted by noise. We use the preaveraging method to diminish the impact of noise and a filtering method to exclude the strong dependence, and we estimate the Hurst parameter from the resulting smoothed data. Asymptotic properties of the estimator, such as consistency and asymptotic normality, are established, and simulations evaluating its performance are conducted. Supplementary materials for this article are available online.
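The article's preaverage-plus-filter estimator is specialized, but the underlying scaling idea can be sketched with the textbook aggregated-variance method: for a self-similar increment series, the variance of block means of size m scales like m^(2H-2), so a log-log regression recovers H. The sanity check below uses white noise, whose true Hurst parameter is 0.5; it omits the preaveraging and filtering steps that handle noise and strong dependence.

```python
# Aggregated-variance Hurst estimation: Var(block mean of size m) ~ m^(2H-2).
import numpy as np

def hurst_aggvar(increments, block_sizes=(1, 2, 4, 8, 16, 32, 64)):
    logs_m, logs_v = [], []
    for m in block_sizes:
        n = len(increments) // m
        means = increments[: n * m].reshape(n, m).mean(axis=1)
        logs_m.append(np.log(m))
        logs_v.append(np.log(means.var()))
    slope = np.polyfit(logs_m, logs_v, 1)[0]   # slope = 2H - 2
    return 1.0 + slope / 2.0

rng = np.random.default_rng(4)
white = rng.normal(size=2**14)                 # increments with true H = 0.5
print("estimated H:", round(hurst_aggvar(white), 3))
```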

11.
Stochastic gradient descent (SGD) provides a scalable way to compute parameter estimates in applications involving large-scale or streaming data. An alternative version, averaged implicit SGD (AI-SGD), has been shown to be more stable and more efficient. Although the asymptotic properties of AI-SGD have been well established, statistical inference based on it, such as interval estimation, remains unexplored. The bootstrap method is not computationally feasible because it requires repeatedly resampling the entire data set, and the plug-in method is not applicable when there is no explicit covariance matrix formula. In this paper, we propose a scalable statistical inference procedure for conducting inference based on the AI-SGD estimator. The proposed procedure updates the AI-SGD estimate as well as many randomly perturbed AI-SGD estimates upon the arrival of each observation. We derive large-sample theoretical properties of the procedure and examine its performance via simulation studies.
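A hedged sketch of the perturbation idea follows, for implicit SGD on least squares (where the implicit update has a closed form). Each arriving observation updates the point iterate and B copies whose updates are scaled by mean-one exponential weights; the spread of the averaged perturbed iterates gives a rough standard error. The learning-rate schedule, weight distribution, and B are illustrative choices, not the article's prescriptions.

```python
# Online perturbation-based inference around averaged implicit SGD
# for a linear model y = x'theta + noise.
import numpy as np

rng = np.random.default_rng(5)
p, n, B = 3, 5000, 30
theta_true = np.array([1.0, -2.0, 0.5])

theta = np.zeros(p)                        # implicit-SGD point iterate
thetas = np.zeros((B, p))                  # B randomly perturbed copies
avg, avgs = np.zeros(p), np.zeros((B, p))  # Polyak-Ruppert averages

for t in range(1, n + 1):
    x = rng.normal(size=p)
    y = x @ theta_true + rng.normal()
    gamma = 0.5 * t ** -0.7                # decaying learning rate (a choice)
    # implicit update for squared loss has a closed form:
    #   theta += gamma / (1 + gamma*|x|^2) * (y - x'theta) * x
    theta = theta + gamma / (1 + gamma * (x @ x)) * (y - x @ theta) * x
    w = rng.exponential(size=B)            # mean-one perturbation weights
    step = (gamma * w) / (1 + gamma * w * (x @ x))
    thetas = thetas + step[:, None] * (y - thetas @ x)[:, None] * x
    avg += (theta - avg) / t               # running average of the iterates
    avgs += (thetas - avgs) / t

print("estimate:", np.round(avg, 3))
print("rough se:", np.round(avgs.std(axis=0), 3))
```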

12.
The big data era demands new statistical analysis paradigms, since traditional methods often break down when datasets are too large to fit on a single desktop computer. Divide and Recombine (D&R) is becoming a popular approach for big data analysis, in which results are combined over subanalyses performed on separate data subsets. In this article, we consider situations where unit record data cannot be made available by data custodians because of privacy concerns, and we explore the concept of statistical sufficiency and summary statistics for model fitting. The resulting approach represents a type of D&R strategy, which we refer to as summary statistics D&R, as opposed to the standard approach, which we refer to as horizontal D&R. We demonstrate the concept via an extended Gamma–Poisson model, where summary statistics are extracted from different databases and incorporated directly into the fitting algorithm without having to combine unit record data. By exploiting the natural hierarchy of data, our approach has major benefits in terms of privacy protection. Incorporating the proposed modelling framework into data extraction tools such as TableBuilder by the Australian Bureau of Statistics allows for analysis at a finer geographical level, which we illustrate with a multilevel analysis of Australian unemployment data. Supplementary materials for this article are available online.
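The privacy mechanism is easy to illustrate with a plain conjugate Gamma-Poisson model, a simpler stand-in for the article's extended version: each custodian releases only the sufficient statistics (total count and number of records), and the posterior assembled from these summaries is identical to the one computed from the pooled unit record data.

```python
# Summary-statistics D&R on a toy Gamma-Poisson model: fit from summaries
# alone, never touching the unit records held by each custodian.
import numpy as np

rng = np.random.default_rng(6)
databases = [rng.poisson(2.5, size=n) for n in (120, 340, 75)]  # held separately

# each custodian releases only (sum of counts, number of records)
summaries = [(int(d.sum()), d.size) for d in databases]

a, b = 1.0, 1.0                            # Gamma(a, b) prior on the rate
a_post = a + sum(s for s, _ in summaries)
b_post = b + sum(n for _, n in summaries)
print(f"posterior Gamma({a_post}, {b_post}), mean = {a_post / b_post:.3f}")

# identical to the posterior from the pooled unit record data:
pooled = np.concatenate(databases)
print("pooled check:", a + pooled.sum(), b + pooled.size)
```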

13.
A general dynamic panel data model is considered that incorporates individual and interactive fixed effects, allowing for contemporaneous correlation in the model innovations. The model accommodates general stationary or nonstationary long-range dependence through the interactive fixed effects and innovations, removing the need for a priori unit-root or stationarity testing. Moreover, persistence in the innovations and interactive fixed effects allows for cointegration; the innovations can also have vector-autoregressive dynamics, and deterministic trends can be included. Estimation is performed using conditional-sum-of-squares criteria based on projected series, by which the latent characteristics are proxied. The resulting estimates are consistent and asymptotically normal at standard parametric rates. A simulation study supports the reliability of the estimation method, which is then applied to the long-run relationship between debt and GDP. Supplementary materials for this article are available online.

14.
In a recent article in the Annals of Applied Statistics, Cox discussed the main phases of applied statistical research, ranging from clarifying study objectives to final data analysis and interpretation of results. As an incidental remark to these main phases, we advocate that beyond cleaning and preprocessing the data, it is good practice to audit the data to determine whether they can be trusted at all. A case study based on Ghanaian Official Fishery Statistics is used to illustrate this need, with Benford's law being the tool used to carry out the data audit. Supplementary materials for this article are available online.
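A minimal sketch of such a Benford audit is below: extract leading digits, compare their frequencies with the Benford proportions log10(1 + 1/d), and run a chi-square goodness-of-fit test. The data are simulated stand-ins, not the Ghanaian fishery figures.

```python
# Benford's-law data audit: chi-square test of observed first digits
# against the Benford distribution.
import numpy as np
from scipy.stats import chisquare

def first_digits(values):
    # leading digit via the fractional part of log10 (values must be > 0)
    return np.floor(10 ** (np.log10(values) % 1.0)).astype(int)

rng = np.random.default_rng(7)
data = rng.lognormal(mean=6.0, sigma=1.5, size=2000)   # spans many magnitudes
d = first_digits(data)

observed = np.array([(d == k).sum() for k in range(1, 10)])
benford = np.log10(1.0 + 1.0 / np.arange(1, 10))       # P(leading digit = k)
stat, pval = chisquare(observed, f_exp=benford * observed.sum())
print(f"chi-square = {stat:.2f}, p-value = {pval:.4f}")
```

A small p-value would flag a departure from Benford's law and hence data that merit closer scrutiny.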

15.

We propose two nonparametric Bayesian methods to cluster big data and apply them to cluster genes by patterns of gene–gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion that requires a single cycle (or a few cycles) of simple deterministic calculations for each observation under study. The second scheme is an exact method that divides the data into smaller subsamples and involves local partitions that can be determined in parallel; in a second step, it requires only the sufficient statistics of each local cluster to derive the global clusters. On simulated and benchmark data sets, the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes, and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene–gene interactions extracted from the online search tool “Zodiac.”


16.
Simpson's paradox is a challenging topic to teach in an introductory statistics course. To motivate students to understand this paradox both intuitively and statistically, this article introduces several new ways to teach it. We design an in-class paper toss activity between instructors and students to engage students in the learning process. We show that Simpson's paradox is widespread in basketball statistics; instructors may therefore consider looking for it in their own schools' basketball teams for examples that motivate students' interest. A new probabilistic explanation of Simpson's paradox is provided, which helps foster students' statistical understanding. Supplementary materials for this article are available online.
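A tiny numeric example of the basketball version makes the reversal concrete: player A shoots better in each half separately, yet worse overall. The numbers are invented for illustration and are not from the article's activity or datasets.

```python
# Simpson's paradox in shooting percentages: A beats B in both halves,
# but B beats A once the halves are aggregated.
makes_a = {"first half": (4, 10), "second half": (25, 30)}   # (made, attempts)
makes_b = {"first half": (1, 4),  "second half": (30, 37)}

for half in makes_a:
    (ma, na), (mb, nb) = makes_a[half], makes_b[half]
    print(f"{half}: A {ma/na:.0%} vs B {mb/nb:.0%}")

ma, na = map(sum, zip(*makes_a.values()))
mb, nb = map(sum, zip(*makes_b.values()))
print(f"overall:  A {ma/na:.0%} vs B {mb/nb:.0%}")
```

Here A shoots 40% vs 25% and 83% vs 81% in the two halves, yet 72.5% vs 75.6% overall, because B took most attempts in the easier second half.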

17.
The asymptotic results pertaining to the distribution of the log-likelihood ratio allow for the creation of a confidence region, which is a general extension of the confidence interval. Two- and three-dimensional regions can be displayed visually to describe the plausible region of the parameters of interest simultaneously. While most advanced statistical textbooks on inference discuss these asymptotic confidence regions, there is no exploration of how to compute them numerically for graphical purposes. This article demonstrates the application of a simple trigonometric transformation to compute two- and three-dimensional confidence regions: we transform the Cartesian coordinates of the parameters to create what we call the radial profile log-likelihood. The method is applicable to any distribution with a defined likelihood function, so it is not limited to specific data distributions or model paradigms. We describe the method along with the algorithm, follow with an example, and end with an examination of computation time. Supplementary materials for this article are available online.
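A hedged sketch of the radial-profile computation is below: starting from the MLE, move outward along each angle phi and solve for the radius at which the log-likelihood ratio statistic reaches the chi-square cutoff. The toy model is a bivariate normal mean with known identity covariance, for which the 95% region should come out as a circle of radius sqrt(chi2_{2,0.95}/n), giving a built-in check.

```python
# Trace a 2D likelihood confidence region by root-finding along rays.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

rng = np.random.default_rng(8)
x = rng.normal(loc=[1.0, -0.5], size=(100, 2))   # N(mu, I) toy data
mu_hat, n = x.mean(axis=0), x.shape[0]
cutoff = chi2.ppf(0.95, df=2)

def lr_stat(mu):
    """2*(l(mu_hat) - l(mu)) for a bivariate normal mean, known identity cov."""
    return n * np.sum((mu_hat - mu) ** 2)

boundary = []
for phi in np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False):
    u = np.array([np.cos(phi), np.sin(phi)])     # unit direction at angle phi
    r = brentq(lambda r: lr_stat(mu_hat + r * u) - cutoff, 0.0, 10.0)
    boundary.append(mu_hat + r * u)

radii = np.linalg.norm(np.array(boundary) - mu_hat, axis=1)
print("boundary radius:", radii.min().round(4), "to", radii.max().round(4))
print("theoretical radius:", round(np.sqrt(cutoff / n), 4))   # circular here
```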

18.
This article develops a vector autoregression (VAR) for time series observed at mixed frequencies, quarterly and monthly. The model is cast in state-space form and estimated with Bayesian methods under a Minnesota-style prior. We show how to evaluate the marginal data density to implement a data-driven hyperparameter selection. Using a real-time dataset, we evaluate forecasts from the mixed-frequency VAR and compare them with forecasts from a standard quarterly-frequency VAR and from MIDAS regressions. We document the extent to which information that becomes available within the quarter improves the forecasts in real time. This article has online supplementary materials.

19.
Empirical estimates of source statistical economic data such as trade flows, greenhouse gas emissions, or employment figures are always subject to uncertainty (stemming from measurement errors or confidentiality), but information about that uncertainty is often missing. This article uses concepts from Bayesian inference and the maximum entropy principle to estimate the prior probability distribution, uncertainty, and correlations of source data when such information is not explicitly provided. In the absence of additional information, an isolated datum is described by a truncated Gaussian distribution, and if an uncertainty estimate is missing, its prior equals the best guess. When the sum of a set of disaggregate data is constrained to match an aggregate datum, it is possible to determine the prior correlations among the disaggregate data. If the aggregate uncertainty is missing, all prior correlations are positive; if it is available, the prior correlations can be all positive, all negative, or a mix of both. An empirical example reports relative uncertainties and correlation priors for the County Business Patterns database; there, relative uncertainties range from 1% to 80%, and 20% of data pairs exhibit correlations below −0.9 or above 0.9. Supplementary materials for this article are available online.
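The prior construction for an isolated, non-negative datum can be sketched in a few lines with a truncated Gaussian; the 20% relative uncertainty used below when no uncertainty estimate is supplied is an illustrative assumption, not a value prescribed by the article.

```python
# Truncated-Gaussian prior for a non-negative datum centred at its best guess.
from scipy.stats import truncnorm

best_guess = 1250.0                        # a reported (hypothetical) figure
sigma = 0.2 * best_guess                   # assumed 20% relative uncertainty
a = (0.0 - best_guess) / sigma             # truncation at zero, standardized
prior = truncnorm(a=a, b=float("inf"), loc=best_guess, scale=sigma)
print("prior mean:", round(prior.mean(), 1), " sd:", round(prior.std(), 1))
```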

20.
Email marketing has become an increasingly important tool for today's businesses. In this article, we propose a counting-process-based Bayesian method for quantifying the effectiveness of email marketing campaigns in conjunction with customer characteristics. Our model explicitly addresses the seasonality of the data, accounts for the impact of customer characteristics on purchasing behavior, and evaluates the effects of email offers as well as their interactions with customer characteristics. Using the proposed method, together with a propensity-score-based unit-matching technique for alleviating potential confounding, we analyze a large email marketing dataset from an online ticket marketplace to evaluate the short- and long-term effectiveness of its email campaigns. Email offers are shown to increase the customer purchase rate both immediately and over a longer term. Customer characteristics such as length of shopping history, purchase recency, average ticket price, average ticket count, and number of genres purchased also affect the purchase rate. A strong positive interaction is uncovered between email offer and purchase recency, suggesting that customers who have been inactive recently are more likely to take advantage of promotional offers. Supplementary materials for this article are available online.
