Similar Literature
20 similar documents retrieved.
1.
林存洁  李扬 《统计研究》2016,33(11):109-112
In the era of big data, whether traditional statistics still has a role to play has become a matter of wide debate. Taking the ARGO model as a case study, this paper introduces the application of statistical methods in big data analysis and the results they have achieved, and proposes improvements from a statistical perspective. The analysis of the ARGO model shows that many fundamental problems in big data analysis are still statistical problems, and that the statistical regularities in the data remain the greatest value that data analysis seeks to mine, which means that statistical thinking will only become more important in big data analysis. For big data with complex structures and diverse sources, statistical methods also call for new exploration and experimentation, which is both an opportunity and a challenge for statistics.

2.
ABSTRACT

The broken-stick (BS) rule is a popular stopping rule in ecology for determining the number of meaningful components in principal component analysis. However, its properties have not been systematically investigated. The purpose of the current study is to evaluate its ability to detect the correct dimensionality of a data set and whether it tends to over- or underestimate it. A Monte Carlo protocol was carried out. Two main correlation matrices deemed usual in practice were used, with three levels of correlation (0, 0.10 and 0.30) between components (generating oblique structures) and with different sample sizes. Analyses of the population correlation matrices indicated that, even for extremely large sample sizes, the BS method could be correct for only one of the six simulated structures. It actually failed to identify the correct dimensionality half the time with orthogonal structures and did even worse with some oblique ones. In harder conditions, results show that the power of the BS decreases as sample size increases, weakening its usefulness in practice. Since the BS method seems unlikely to identify the underlying dimensionality of the data, and given that better stopping rules exist, it appears to be a poor choice when carrying out principal component analysis.
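For reference, a minimal sketch of the rule under study (the standard broken-stick formula, not code from the paper): under the broken-stick model the expected proportion of variance for the k-th of p components is b_k = (1/p) * sum_{i=k}^{p} 1/i, and components are retained as long as their observed eigenvalue proportions exceed these thresholds.

```python
import numpy as np

def broken_stick_retained(eigenvalues):
    """Return how many leading components exceed the broken-stick thresholds."""
    eig = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    p = eig.size
    observed = eig / eig.sum()                      # observed variance proportions
    # broken-stick expectation b_k = (1/p) * sum_{i=k}^{p} 1/i
    expected = np.array([np.sum(1.0 / np.arange(k, p + 1)) / p for k in range(1, p + 1)])
    keep = observed > expected
    # retain leading components until the first one that falls below its threshold
    return int(np.argmin(keep)) if not keep.all() else p

# example: eigenvalues of a 6-variable correlation matrix
print(broken_stick_retained([2.8, 1.4, 0.7, 0.5, 0.4, 0.2]))
```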

3.
秦磊  谢邦昌 《统计研究》2016,33(2):107-110
The big data era presents both opportunities and challenges, and how to handle big data with traditional methods deserves careful thought; blindly pursuing big data is not necessarily the right course. Taking Google Flu Trends (GFT) as a case study, this paper introduces the main techniques and results of big data in epidemic surveillance, discusses the key problems in its use and, drawing on sophisticated statistical tools, offers some improvement measures. The success of Google Flu Trends rests on the use of correlation, while its failures stem from problems such as model construction and the conflict between causation and correlation. The analysis of and lessons from the Google Flu Trends case have important theoretical and practical significance for government big data solutions in the future.

4.
Fan J  Lv J 《Statistica Sinica》2010,20(1):101-148
High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments in theory, methods, and implementations for high dimensional variable selection. Questions of what limits of dimensionality such methods can handle, what the role of penalty functions is, and what their statistical properties are rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods.
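For orientation, the penalized likelihood problem reviewed here is usually written as below (standard notation, not reproduced from the article); the SCAD penalty, defined through its derivative with tuning constant a > 2, is the canonical non-concave example:

```latex
\max_{\beta \in \mathbb{R}^p}\; \Big\{ \ell_n(\beta) - n \sum_{j=1}^{p} p_\lambda\!\left(|\beta_j|\right) \Big\},
\qquad
p'_\lambda(t) = \lambda \Big\{ I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda}\, I(t > \lambda) \Big\}, \quad a > 2 .
```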

5.
Ultra-high dimensional data arise in many fields of modern science, such as medical science, economics, genomics and image processing, and pose unprecedented challenges for statistical analysis. With the rapidly growing size of scientific data in various disciplines, feature screening becomes a primary step for reducing the high dimensionality to a moderate scale that can be handled by existing penalized methods. In this paper, we introduce a simple and robust feature screening method without any model assumption to tackle high dimensional censored data. The proposed method is model-free and hence applicable to a general class of survival models. The sure screening and ranking consistency properties are established without any finite moment condition on the predictors or the response. The computation of the proposed method is rather straightforward. Finite sample performance of the newly proposed method is examined via extensive simulation studies. An application is illustrated with a gene association study of mantle cell lymphoma.

6.
For big data analysis, the high computational cost of Bayesian methods often limits their application in practice. In recent years, there have been many attempts to improve the computational efficiency of Bayesian inference. Here we propose an efficient and scalable computational technique for a state-of-the-art Markov chain Monte Carlo method, namely Hamiltonian Monte Carlo. The key idea is to explore and exploit the structure and regularity in the parameter space of the underlying probabilistic model to construct an effective approximation of its geometric properties. To this end, we build a surrogate function to approximate the target distribution using properly chosen random bases and an efficient optimization process. The resulting method provides a flexible, scalable, and efficient sampling algorithm, which converges to the correct target distribution. We show that by choosing the basis functions and optimization process differently, our method can be related to other approaches for the construction of surrogate functions such as generalized additive models or Gaussian process models. Experiments based on simulated and real data show that our approach leads to substantially more efficient sampling algorithms compared to existing state-of-the-art methods.
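As a hedged illustration of the sampler being accelerated (plain Hamiltonian Monte Carlo on a toy Gaussian target, not the surrogate construction itself): the gradient call below is the expensive piece a surrogate would replace, while the Metropolis correction uses the exact log target so the chain still converges to the right distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def hmc_step(x, logp, grad_logp, eps=0.1, n_leapfrog=20):
    """One Hamiltonian Monte Carlo step; grad_logp may be an exact or surrogate gradient."""
    p = rng.standard_normal(x.shape)            # resample momentum
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * eps * grad_logp(x_new)       # half step for momentum
    for _ in range(n_leapfrog - 1):
        x_new += eps * p_new                    # full step for position
        p_new += eps * grad_logp(x_new)         # full step for momentum
    x_new += eps * p_new
    p_new += 0.5 * eps * grad_logp(x_new)       # final half step
    # Metropolis correction with the exact target keeps the correct stationary distribution
    log_accept = logp(x_new) - logp(x) - 0.5 * (p_new @ p_new - p @ p)
    return x_new if np.log(rng.uniform()) < log_accept else x

# toy target: standard bivariate normal
logp = lambda x: -0.5 * x @ x
grad_logp = lambda x: -x                        # a cheap surrogate would replace this call
x = np.zeros(2)
samples = []
for _ in range(1000):
    x = hmc_step(x, logp, grad_logp)
    samples.append(x)
```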

7.
黄恒君 《统计研究》2019,36(7):3-12
Big data has great potential in official statistics production and can help build a high-quality statistical production system, but the characteristics of data sources that fit statistical production goals, and the associated data quality issues, remain to be clarified. Building on the common ground between big data sources and traditional statistical data sources, this paper discusses big data sources for statistical production and their data quality issues, and then explores the integrated application of big data and traditional statistical production. It first delimits, in terms of data generation processes and data characteristics, the big data sources usable for statistical production; it then discusses data quality issues in big-data-based statistical production within a generalized data quality framework, identifying the quality control points and quality defects of the production process; finally, based on the data quality analysis, it proposes an approach to building a statistical system that integrates big data into traditional surveys.

8.
Big data is characterized by heterogeneous sources, high dimensionality, and sparsity; mining the heterogeneity and commonality across data sets while reducing dimensionality and removing noise is one of the goals and challenges of big data analysis. Integrative analysis, which analyzes multiple independent data sets simultaneously, avoids model instability caused by sample differences across regions, time periods, and other factors, and is an effective way to study the heterogeneity of big data. Its key feature is to treat the coefficients of each explanatory variable across all data sets as a group and to shrink the coefficient groups through penalty functions, thereby studying associations among variables and achieving dimension reduction. This paper reviews the principles, algorithms, and current research on penalized integrative analysis from three angles: integrative analysis of homogeneous data, integrative analysis of heterogeneous data, and integrative analysis that accounts for network structure. Simulations show that under weak, moderate, and strong correlation, Group Bridge, Group MCP, and Composite MCP all perform well, with Group Bridge having the lowest and most stable number of false positives. Finally, integrative analysis is applied to New Rural Cooperative Medical Scheme household medical expenditure data with source heterogeneity, and to cancer gene data with typical big data features such as ultra-high dimensionality and small sample size, yielding some meaningful conclusions.
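As a hedged sketch of the setup described (standard notation, not taken from the paper): with M independent data sets and beta_j^(m) denoting the coefficient of variable j in data set m, penalized integrative analysis solves

```latex
\min_{\beta}\; \sum_{m=1}^{M} L_m\!\big(\beta^{(m)}\big) \;+\; \sum_{j=1}^{p} \rho_\lambda\!\big(\beta_j\big),
\qquad \beta_j = \big(\beta_j^{(1)},\dots,\beta_j^{(M)}\big),
```

where L_m is the loss for data set m and rho_lambda is a group penalty applied to each coefficient group beta_j; for instance, the Group Bridge penalty uses rho_lambda(beta_j) = lambda (sum_m |beta_j^(m)|)^gamma with 0 < gamma < 1.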

9.
The Statistical Connotations of Data Science
Data science takes big data as its object of study, and the most direct impact of big data on statistical analysis is the transformation of data collection methods; at the same time, the scope of statistical analysis is no longer limited to traditional attribute data but includes richer data types such as relational data and unstructured and semi-structured data. With the open data movement, the value of linkage information across databases is gradually being realized. From a statistical perspective, this paper examines the statistical connotations of data science along three dimensions: scientific theoretical foundations, computer processing technology, and business applications; it discusses the direct impact of the data science paradigm on the statistical analysis process, as well as the opportunities and challenges facing the statistical perspective.

10.
A variable screening procedure via correlation learning was proposed in Fan and Lv (2008) to reduce dimensionality in sparse ultra-high dimensional models. Even when the true model is linear, the marginal regression can be highly nonlinear. To address this issue, we further extend correlation learning to marginal nonparametric learning. Our nonparametric independence screening, called NIS, is a specific member of the sure independence screening family. Several closely related variable screening procedures are proposed. Under general nonparametric models and some mild technical conditions, the proposed independence screening methods are shown to enjoy a sure screening property. The extent to which the dimensionality can be reduced by independence screening is also explicitly quantified. As a methodological extension, a data-driven thresholding and an iterative nonparametric independence screening (INIS) procedure are also proposed to enhance the finite sample performance for fitting sparse additive models. The simulation results and a real data analysis demonstrate that the proposed procedure works well with moderate sample size and large dimension and performs better than competing methods.
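A minimal sketch of the screening idea (illustrative only; the function name is hypothetical, a cubic polynomial stands in for the B-spline marginal smoother, and the utility measure is the variance of the fitted marginal component):

```python
import numpy as np

def nis_screen(X, y, top_d, degree=3):
    """Rank predictors by the size of a marginal smoother fit (polynomial proxy for
    the nonparametric marginal regression used in independence screening)."""
    n, p = X.shape
    utility = np.empty(p)
    for j in range(p):
        coefs = np.polyfit(X[:, j], y, degree)        # marginal fit of y on X_j
        fitted = np.polyval(coefs, X[:, j])
        utility[j] = np.var(fitted)                   # magnitude of the marginal component
    return np.argsort(utility)[::-1][:top_d]          # indices of the top-ranked predictors

# toy example: only the first two of 200 predictors matter
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 200))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * rng.standard_normal(100)
print(nis_screen(X, y, top_d=10))
```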

11.
米子川  姜天英 《统计研究》2016,33(11):11-18
In July 2014, ANZ Bank (澳盛银行) for the first time included the Alibaba family of indices among its inflation-watch indicators, signalling that big data indices have begun to question and challenge traditional survey-based statistical indices. Based on a comparative study of the Alibaba aSPI index and the officially published CPI, this paper is the first to identify some basic characteristics in which the aSPI significantly outperforms the CPI. It then empirically compares the synchronicity and decomposition characteristics of the two indices: first, a cointegration test is used to establish their synchronicity; second, both series are decomposed with an EMD model to obtain their cyclical components and growth trends; finally, based on the EMD decomposition of the aSPI, the CPI is estimated via Lasso regression. The study shows that, with the growing breadth and rigour of big data research and advances in methodology and software tools, the corroboration, complementation, and eventual integration of big data indices with traditional statistical surveys will become a new trend; producing new CPI compilation methods and analytical systems through empirical work, application, and development is the fundamental way forward for the theory and practice of big data indices.
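A hedged sketch of the two empirical steps described, on simulated stand-in series rather than the actual aSPI and CPI data; the EMD decomposition is omitted here and simple lags of the aSPI are used as regressors in its place:

```python
import numpy as np
from statsmodels.tsa.stattools import coint
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
# illustrative monthly series standing in for aSPI and CPI
aspi = np.cumsum(rng.normal(0.2, 1.0, 120))
cpi = 0.8 * aspi + rng.normal(0, 0.5, 120)

# step 1: Engle-Granger cointegration test for synchronicity
t_stat, p_value, _ = coint(cpi, aspi)
print(f"cointegration p-value: {p_value:.3f}")

# step 2: regress CPI on aSPI-derived regressors with the Lasso
# (the paper uses EMD components of the aSPI; lags are a simple stand-in)
X = np.column_stack([np.roll(aspi, k) for k in range(1, 7)])[6:]
y = cpi[6:]
model = LassoCV(cv=5).fit(X, y)
print("selected coefficients:", np.round(model.coef_, 3))
```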

12.
Research on the Compilation of a Coal Big Data Index and Its Empirical Mode Decomposition Model
A big data index compiled from open data sources and continuously observed multivariate data differs from traditional survey-based statistical indices not only in the unbounded expansion of the data themselves, but also in its compilation methods and in the rules and models used for decomposition. Against the big data background, this paper is an early attempt to define big data indices and state their data assumptions; it introduces an "internet big data index" into the coal transaction price index to compile a composite Taiyuan coal transaction big data index that reflects movements in coal prices. An empirical mode decomposition (EMD) model is then applied to decompose the compiled coal big data index, and its differences from the traditional survey-based index are examined. The study shows that the newly compiled coal price big data index is more sensitive and responsive than the Taiyuan coal transaction price index and better reflects movements in coal prices. As the "Internet Plus" and big data strategies spread, composite indices compiled from internet big data will reach more fields and become barometers and indicators across economic management and social development; they will gradually fuse with, complement, or upgrade traditional survey-based indices and become an important component of macroeconomic big data indices.
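To make the EMD step concrete, here is a minimal sketch of a single sifting pass, the core operation of empirical mode decomposition, on a toy series (a full EMD iterates the sift to extract successive intrinsic mode functions, handles the series boundaries, and applies a stopping criterion; none of that is shown):

```python
import numpy as np
from scipy.signal import argrelextrema
from scipy.interpolate import CubicSpline

def sift_once(x, t):
    """One sifting pass of EMD: subtract the mean of the upper and lower
    cubic-spline envelopes (boundary handling omitted for brevity)."""
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    upper = CubicSpline(t[maxima], x[maxima])(t)
    lower = CubicSpline(t[minima], x[minima])(t)
    return x - (upper + lower) / 2.0

# toy "index" series: slow trend plus a fast oscillation
t = np.linspace(0, 10, 500)
x = 0.5 * t + np.sin(2 * np.pi * 2 * t)
candidate_imf = sift_once(x, t)           # repeated sifting would yield the first IMF
```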

13.
The variance-covariance matrix plays a central role in the inferential theories of high dimensional factor models in finance and economics. Popular regularization methods that directly exploit sparsity are not applicable to many financial problems. Classical methods of estimating the covariance matrix are based on strict factor models, which assume independent idiosyncratic components. This assumption, however, is restrictive in practical applications. By assuming a sparse error covariance matrix, we allow for cross-sectional correlation even after the common factors are taken out, which enables us to combine the merits of both methods. We estimate the sparse covariance using the adaptive thresholding technique as in Cai and Liu (2011), taking into account the fact that direct observations of the idiosyncratic components are unavailable. The impact of high dimensionality on covariance matrix estimation based on the factor structure is then studied.
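A hedged sketch of the estimator family discussed (a POET-style construction with a constant soft threshold for brevity, whereas the article uses the adaptive, entry-dependent thresholds of Cai and Liu 2011):

```python
import numpy as np

def factor_thresholded_covariance(X, n_factors=3, tau=0.05):
    """Low-rank (factor) part from the leading principal components of the sample
    covariance, plus a soft-thresholded residual covariance for the idiosyncratic part."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / Xc.shape[0]                       # sample covariance
    eigval, eigvec = np.linalg.eigh(S)                # ascending eigenvalues
    lead_val, lead_vec = eigval[-n_factors:], eigvec[:, -n_factors:]
    low_rank = lead_vec @ np.diag(lead_val) @ lead_vec.T
    resid = S - low_rank                              # idiosyncratic covariance
    off = resid - np.diag(np.diag(resid))
    off = np.sign(off) * np.maximum(np.abs(off) - tau, 0.0)   # soft-threshold off-diagonals
    return low_rank + np.diag(np.diag(resid)) + off

rng = np.random.default_rng(3)
B = rng.standard_normal((50, 3))                      # 50 assets, 3 factors
F = rng.standard_normal((500, 3))
X = F @ B.T + 0.5 * rng.standard_normal((500, 50))
Sigma_hat = factor_thresholded_covariance(X)
```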

14.
This article considers the utility of the bounded cumulative hazard model in cure rate estimation, which is an appealing alternative to the widely used two-component mixture model. This approach has the following distinct advantages: (1) it allows for a natural way to extend the proportional hazards regression model, leading to a wide class of extended hazard regression models; (2) in some settings the model can be interpreted in terms of biologically meaningful parameters; and (3) the model structure is particularly suitable for semiparametric and Bayesian methods of statistical inference. Notwithstanding the fact that the model has been around for less than a decade, a large body of theoretical results and applications has been reported to date. This review article is intended to give a big-picture view of these modeling techniques and the associated statistical problems. These issues are discussed in the context of survival data in cancer.
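For reference, the bounded cumulative hazard (promotion time) cure model referred to here is usually written as follows, with the cure fraction appearing as the limit of the population survival function (standard formulation, not reproduced from the article):

```latex
S_{\mathrm{pop}}(t) = \exp\{-\theta F(t)\}, \qquad \lim_{t \to \infty} S_{\mathrm{pop}}(t) = e^{-\theta},
```

where F is a proper distribution function and covariates typically enter through theta(x) = exp(x'beta), which yields the proportional hazards extension mentioned in point (1).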

15.
Multivariate data are present in many research areas. Their analysis is challenging when assumptions of normality are violated and the data are discrete. Poisson counts are perhaps the most common discrete type, but their inflated and doubly inflated counterparts are gaining popularity (Sengupta, Chaganty, and Sabo 2015; Lee, Jung, and Jin 2009; Agarwal, Gelfand, and Citron-Pousty 2002).

Our aim is to build a tractable statistical model and use it to estimate the parameters of the multivariate doubly inflated Poisson distribution. To preserve the correlation structure, we incorporate ideas from copula distributions. A multivariate doubly inflated Poisson distribution based on the Gaussian copula is introduced. Data simulation and parameter estimation algorithms are also provided. Residual checks are carried out to assess any substantial biases. The model dimensionality has been increased to test the performance of the proposed estimation method. All results show high efficiency and promising outcomes in the modeling of discrete data, particularly doubly inflated Poisson count data, under a novel modified algorithm.
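A minimal simulation sketch of the copula idea used here (a Gaussian copula with Poisson margins; the double inflation, i.e. the extra point masses at (0,0) and at a second count pair, is omitted):

```python
import numpy as np
from scipy.stats import norm, poisson

def gaussian_copula_poisson(n, lambdas, rho, seed=None):
    """Simulate correlated Poisson counts via a Gaussian copula:
    draw correlated normals, map to uniforms, then invert the Poisson CDFs."""
    rng = np.random.default_rng(seed)
    d = len(lambdas)
    corr = np.full((d, d), rho) + (1 - rho) * np.eye(d)   # exchangeable correlation
    z = rng.multivariate_normal(np.zeros(d), corr, size=n)
    u = norm.cdf(z)
    return np.column_stack([poisson.ppf(u[:, j], lambdas[j]) for j in range(d)]).astype(int)

counts = gaussian_copula_poisson(1000, lambdas=[2.0, 5.0], rho=0.6, seed=4)
print(np.corrcoef(counts.T))
```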


16.
Different longitudinal study designs require different statistical analysis methods and different methods of sample size determination. Statistical power analysis is a flexible approach to sample size determination for longitudinal studies, but different statistical tests require different power analyses. In this paper, simulation-based power calculations for F-tests with Containment, Kenward-Roger or Satterthwaite approximations of the degrees of freedom are examined for sample size determination in a special case of linear mixed models (LMMs) frequently used in the analysis of longitudinal data. Essentially, the roles of several factors are examined together, which has not been considered previously: the variance-covariance structure of the random effects [unstructured (UN) or factor analytic (FA0)], the autocorrelation structure among errors over time [independent (IND), first-order autoregressive (AR1) or first-order moving average (MA1)], the parameter estimation method [maximum likelihood (ML) or restricted maximum likelihood (REML)], and the iterative algorithm [ridge-stabilized Newton-Raphson or Quasi-Newton]. The greatest factor affecting statistical power is found to be the variance-covariance structure of the random effects in the LMM. The simulation-based analysis in this study appears to give an interesting insight into the statistical power of approximate F-tests for fixed effects in LMMs for longitudinal data.
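A hedged sketch of a simulation-based power calculation for a fixed effect in an LMM (statsmodels reports Wald tests rather than the Containment/Kenward-Roger/Satterthwaite F-tests studied in the article, but the simulation logic is the same; all design values and variable names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def simulate_trial(n_subj=40, n_time=4, slope_diff=0.5, seed=None):
    """One simulated longitudinal data set with a random intercept per subject."""
    rng = np.random.default_rng(seed)
    rows = []
    for i in range(n_subj):
        group = i % 2
        b0 = rng.normal(0, 1.0)                       # random intercept
        for t in range(n_time):
            y = 1.0 + 0.3 * t + slope_diff * t * group + b0 + rng.normal(0, 1.0)
            rows.append((i, group, t, y))
    return pd.DataFrame(rows, columns=["id", "group", "time", "y"])

# simulation-based power for the time-by-group effect
hits, n_sim = 0, 100
for s in range(n_sim):
    df = simulate_trial(seed=s)
    fit = smf.mixedlm("y ~ time * group", df, groups=df["id"]).fit(reml=True)
    hits += fit.pvalues["time:group"] < 0.05
print(f"estimated power: {hits / n_sim:.2f}")
```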

17.
Several methods based on smoothing or statistical criteria have been used for deriving disaggregated values compatible with observed annual totals. The present method is based on artificial neural networks (ANNs). This article evaluates the use of ANNs for the disaggregation of annual US GDP data to quarterly time increments. A feed-forward neural network trained with the back-propagation algorithm was used. An ANN model is introduced and evaluated; the proposed method is a temporal disaggregation method without related series, and it is compared with previous temporal disaggregation methods without related series. The disaggregated quarterly GDP data compared well with the observed quarterly data. In addition, they preserved all the basic statistics, such as summing to the annual value and the cross-correlation structure among quarterly flows.
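A hedged sketch of the disaggregation idea (an off-the-shelf multilayer perceptron predicting quarterly shares from a sliding window of annual values, then rescaling so the quarters sum to the annual total; the article's network design and inputs differ, and the data below are simulated, not GDP):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
years = 40
annual = 100 + np.cumsum(rng.normal(2, 1, years))                  # toy annual series
shares = rng.dirichlet(np.array([10, 11, 12, 13]), size=years)     # "true" quarterly shares

# features: previous, current and next annual values (simple sliding window)
X = np.column_stack([annual[:-2], annual[1:-1], annual[2:]])
Y = shares[1:-1]

net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0).fit(X, Y)
pred_shares = np.clip(net.predict(X), 1e-6, None)
pred_shares /= pred_shares.sum(axis=1, keepdims=True)              # re-normalize shares
quarterly = pred_shares * annual[1:-1, None]                       # quarters sum to annual total
```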

18.
Missing data in clinical trials is a well-known problem, and the classical statistical methods used can be overly simple. This case study shows how well-established missing data theory can be applied to efficacy data collected in a long-term open-label trial with a discontinuation rate of almost 50%. Satisfaction with treatment in chronically constipated patients was the efficacy measure, assessed at baseline and every 3 months post-baseline. The improvement in treatment satisfaction from baseline was originally analyzed with a paired t-test, ignoring missing data and discarding the correlation structure of the longitudinal data. As the original analysis started from missing completely at random assumptions about the missing data process, the satisfaction data were re-examined, and several missing at random (MAR) and missing not at random (MNAR) techniques yielded adjusted estimates of the improvement in satisfaction over 12 months. Throughout the different sensitivity analyses, the effect sizes remained significant and clinically relevant. Thus, even for an open-label trial design, sensitivity analysis with different assumptions about the nature of the dropouts (MAR or MNAR) and with different classes of models (selection, pattern-mixture, or multiple imputation models) has been found useful and provides evidence for the robustness of the original analyses; additional sensitivity analyses could be undertaken to further qualify robustness. Copyright © 2012 John Wiley & Sons, Ltd.
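A hedged sketch of one common MNAR sensitivity device, delta adjustment after imputation, on simulated stand-in data (this illustrates the general approach, not the selection or pattern-mixture models fitted in the case study):

```python
import numpy as np
from scipy import stats
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(6)
n = 200
baseline = rng.normal(3.0, 1.0, n)                 # satisfaction at baseline (toy scale)
month12 = baseline + rng.normal(0.8, 1.0, n)       # true improvement of about 0.8
dropout = rng.random(n) < 0.5                      # roughly 50% discontinuation
observed12 = np.where(dropout, np.nan, month12)

data = np.column_stack([baseline, observed12])
for delta in [0.0, -0.5, -1.0]:                    # MNAR shift applied to imputed values
    imputed = IterativeImputer(random_state=0).fit_transform(data)
    y12 = imputed[:, 1].copy()
    y12[np.isnan(observed12)] += delta             # delta adjustment: dropouts do worse
    t, p = stats.ttest_rel(y12, baseline)
    print(f"delta={delta:+.1f}: mean improvement {np.mean(y12 - baseline):.2f}, p={p:.1e}")
```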

19.
ABSTRACT

Parallel analysis (PA; Horn 1965) and the minimum average partial correlation (MAP; Velicer 1976) have become widely used as optimal solutions for identifying the correct number of axes in principal component analysis. Previous results showed, however, that they become inefficient when variables belonging to different components are strongly correlated. Simulations are used to assess their power to detect the dimensionality of data sets with oblique structures. Overall, MAP had the best performance, as it was more powerful and accurate than PA when the component structure was modestly oblique. However, both stopping rules performed poorly in the presence of highly oblique factors.
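A minimal sketch of the first of the two rules, Horn's parallel analysis (MAP is omitted): components are retained while their observed eigenvalues exceed the chosen quantile of eigenvalues obtained from random data of the same dimensions.

```python
import numpy as np

def parallel_analysis(X, n_iter=100, quantile=0.95, seed=None):
    """Horn's parallel analysis on the correlation matrix of X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    rand = np.empty((n_iter, p))
    for b in range(n_iter):
        R = rng.standard_normal((n, p))               # random data of the same size
        rand[b] = np.linalg.eigvalsh(np.corrcoef(R, rowvar=False))[::-1]
    threshold = np.quantile(rand, quantile, axis=0)
    keep = obs > threshold
    return int(np.argmin(keep)) if not keep.all() else p

# toy example with a 2-component structure
rng = np.random.default_rng(7)
F = rng.standard_normal((300, 2))
X = F @ rng.standard_normal((2, 10)) + rng.standard_normal((300, 10))
print(parallel_analysis(X))   # expected to retain about 2 components
```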

20.
Classification of high-dimensional data sets is a big challenge for statistical learning and data mining algorithms. To apply classification methods effectively to high-dimensional data sets, feature selection is an indispensable pre-processing step of the learning process. In this study, we consider the problem of constructing an effective feature selection and classification scheme for data sets that have a small sample size and a large number of features. A novel feature selection approach, named Four-Staged Feature Selection, is proposed to overcome the high-dimensional classification problem by selecting informative features. The proposed method first selects candidate features with a number of filtering methods based on different metrics, and then applies semi-wrapper, union and voting stages, respectively, to obtain the final feature subsets. Several statistical learning and data mining methods are carried out to verify the efficiency of the selected features. To test the adequacy of the proposed method, 10 different microarray data sets are employed due to their high number of features and small sample sizes.
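A hedged sketch of the filter, union and voting stages on simulated data (the semi-wrapper stage and the specific metrics of the proposed method are not reproduced; three standard scikit-learn filters stand in):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, chi2
from sklearn.preprocessing import MinMaxScaler

# toy "microarray-like" data: few samples, many features
X, y = make_classification(n_samples=60, n_features=1000, n_informative=15, random_state=0)
X_pos = MinMaxScaler().fit_transform(X)            # chi2 requires non-negative features

# stage 1: candidate features from several filter metrics
filters = {
    "anova": SelectKBest(f_classif, k=50).fit(X, y),
    "mutual_info": SelectKBest(mutual_info_classif, k=50).fit(X, y),
    "chi2": SelectKBest(chi2, k=50).fit(X_pos, y),
}
candidate_sets = {name: set(np.flatnonzero(f.get_support())) for name, f in filters.items()}

# union and majority-voting stages
union = set().union(*candidate_sets.values())
votes = {j: sum(j in s for s in candidate_sets.values()) for j in union}
voted = sorted(j for j, v in votes.items() if v >= 2)
print(len(union), "features in the union,", len(voted), "after voting")
```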
