Similar Documents
20 similar documents retrieved (search time: 15 ms)
1.
ABSTRACT

Incremental modelling of data streams is of great practical importance, as shown by its applications in advertising and financial data analysis. We propose two incremental covariance matrix decomposition methods for compositional data. The first, exact incremental covariance decomposition of compositional data (C-EICD), gives an exact decomposition result. The second, covariance-free incremental covariance decomposition of compositional data (C-CICD), is an approximate algorithm that scales efficiently to high-dimensional cases. Based on these two methods, many frequently used compositional statistical models can be calculated incrementally. We take multiple linear regression and principal component analysis as examples to illustrate the utility of the proposed methods via extensive simulation studies.
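The abstract does not spell out the algorithms, but the core idea of incremental covariance estimation for compositional data can be illustrated with a running (Welford-style) update on centred-log-ratio-transformed observations. This is a minimal sketch of that idea, not the C-EICD or C-CICD algorithm itself; the class name and the Dirichlet stream below are made up for illustration.

```python
import numpy as np

class IncrementalClrCovariance:
    """Running mean/covariance of clr-transformed compositions (Welford update)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.M2 = np.zeros((dim, dim))  # accumulated outer-product deviations

    def update(self, composition):
        x = np.log(composition) - np.mean(np.log(composition))  # clr transform
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.M2 += np.outer(delta, x - self.mean)

    @property
    def cov(self):
        return self.M2 / (self.n - 1)

# stream three-part compositions one at a time
rng = np.random.default_rng(0)
inc = IncrementalClrCovariance(3)
for _ in range(1000):
    inc.update(rng.dirichlet([2.0, 3.0, 5.0]))
print(inc.cov)  # an eigendecomposition of this matrix can be refreshed per batch
```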

2.
In many complex diseases such as cancer, a patient passes through various disease stages before reaching a terminal state (say, disease free or death). This fits a multistate model framework in which a prognosis amounts to predicting state occupation at a future time t. With the advent of high-throughput genomic and proteomic assays, a clinician may intend to use such high-dimensional covariates to make better predictions of state occupation. In this article, we offer a practical solution to this problem by combining a useful technique, called pseudo-value (PV) regression, with a latent factor or penalized regression method such as partial least squares (PLS), the least absolute shrinkage and selection operator (LASSO), or their variants. We explore the predictive performance of these combinations in various high-dimensional settings via extensive simulation studies. Overall, this strategy works fairly well provided the models are tuned properly, and PLS turns out to be slightly better than LASSO in most of the settings we investigated for temporal prediction of future state occupation. We illustrate the utility of these PV-based high-dimensional regression methods on a lung cancer data set, using the patients' baseline gene expression values.
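As a rough illustration of the pseudo-value-plus-penalization strategy, the sketch below computes jackknife pseudo-values for a Kaplan-Meier survival probability at a fixed time t (a single-state simplification; a real multistate analysis would use, e.g., Aalen-Johansen state occupation estimates) and regresses them on high-dimensional covariates with the LASSO. The simulated data, the penalty level, and the simple tie handling in the KM estimator are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def km_survival(time, event, t):
    """Kaplan-Meier estimate of S(t); ties handled sequentially (sketch only)."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    s, at_risk = 1.0, len(time)
    for ti, di in zip(time, event):
        if ti > t:
            break
        if di:
            s *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return s

def jackknife_pseudo_values(time, event, t):
    """PV_i = n*S(t) - (n-1)*S_{-i}(t), the usual jackknife pseudo-value."""
    n = len(time)
    full = km_survival(time, event, t)
    pv = np.empty(n)
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False
        pv[i] = n * full - (n - 1) * km_survival(time[mask], event[mask], t)
    return pv

# hypothetical data: two informative covariates among p = 500
rng = np.random.default_rng(1)
n, p = 200, 500
X = rng.normal(size=(n, p))
t_true = rng.exponential(np.exp(-0.5 * (X[:, 0] - X[:, 1])))
c = rng.exponential(2.0, size=n)
obs, event = np.minimum(t_true, c), (t_true <= c).astype(int)

pv = jackknife_pseudo_values(obs, event, t=np.median(obs))
fit = Lasso(alpha=0.05).fit(X, pv)   # proper tuning matters, as the abstract notes
print(np.flatnonzero(fit.coef_)[:10])
```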

3.
Sun Yifan et al. 《统计研究》 (Statistical Research), 2021, 38(5): 136-146
With the development of information technology, high-dimensional data are increasingly abundant. In practice, many high-dimensional data sets are formed by merging several data sets from heterogeneous sources, and accurately identifying the similarities and differences across such data sets has become one of the goals of big data analysis. This paper proposes an integrative analysis method for high-dimensional data under varying-coefficient models. The method performs variable selection and coefficient estimation for multiple data sets simultaneously, and automatically identifies whether variable coefficients agree or differ across data sets. Simulation results show that the proposed method clearly outperforms competing methods in identifying such similarities and differences and in variable selection, coefficient estimation, and prediction. In an application to identifying disease genes for lung cancer, the method detects biologically interpretable disease genes and uncovers similarities and differences between two subtypes.

4.
Risk Preferences of Chinese Investors
Ma Lili and Li Quan 《统计研究》 (Statistical Research), 2011, 28(8): 63-72
How investors allocate household assets provides important information for the study of individual risk preferences. Using survey results from the 《城市投资者行为调查问卷》 (Urban Investor Behavior Survey) of the 奥尔多投资研究中心 (Aordo Investment Research Center), this paper examines in detail, from the angle of investors' demand for risky assets, which factors significantly affect investors' risk preferences and how risk preferences vary across investor groups. The results show that an investor's wealth, education, health status, income, and whether the investor is raising children are all important determinants of risk preference, and that risk preferences differ across groups. Studying the heterogeneity of Chinese investors' risk preferences provides a basis for further research on investors' financial asset decisions, saving behavior, wealth accumulation, and the responses of different groups to macroeconomic policy.

5.
Two separate structure-discovery properties of Fisher's LDF are derived in a mixture multivariate normal setting. One property relates to Fisher information and is proved using Stein's identity; the other concerns lack of unimodality. These properties are used to give three selection rules for choosing informative projections of high-dimensional data, not necessarily multivariate normal. Their usefulness in the two-group classification problem is studied theoretically and by means of examples. Extensions and various issues of practical implementation are discussed.
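For readers unfamiliar with Fisher's LDF, here is a minimal numpy sketch of the discriminant direction and the one-dimensional projection that such selection rules examine. The synthetic data and function name are assumptions; the paper's actual selection rules are not reproduced.

```python
import numpy as np

def fisher_ldf_direction(X0, X1):
    """Fisher's discriminant direction: w proportional to Sw^{-1} (m1 - m0)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = ((len(X0) - 1) * np.cov(X0, rowvar=False)
          + (len(X1) - 1) * np.cov(X1, rowvar=False)) / (len(X0) + len(X1) - 2)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(2)
X0 = rng.normal(0.0, 1.0, size=(100, 5))
X1 = rng.normal(0.8, 1.0, size=(100, 5))
w = fisher_ldf_direction(X0, X1)
proj = np.vstack([X0, X1]) @ w   # 1-D projection; a histogram of proj can be
                                 # inspected for bimodality, echoing the paper's
                                 # unimodality-based selection idea
```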

6.
Summary.  To obtain information about the contribution of individual and area level factors to population health, it is desirable to use both data collected on areas, such as censuses, and on individuals, e.g. survey and cohort data. Recently developed models allow us to carry out simultaneous regressions on related data at the individual and aggregate levels. These can reduce 'ecological bias' that is caused by confounding, model misspecification or lack of information and increase power compared with analysing the data sets singly. We use these methods in an application investigating individual and area level sociodemographic predictors of the risk of hospital admissions for heart and circulatory disease in London. We discuss the practical issues that are encountered in this kind of data synthesis and demonstrate that this modelling framework is sufficiently flexible to incorporate a wide range of sources of data and to answer substantive questions. Our analysis shows that the variations that are observed are mainly attributable to individual level factors rather than the contextual effect of deprivation.

7.
Summary.  We present models for the combined analysis of evidence from randomized controlled trials categorized as being at either low or high risk of bias due to a flaw in their conduct. We formulate a bias model that incorporates between-study and between-meta-analysis heterogeneity in bias, and uncertainty in overall mean bias. We obtain algebraic expressions for the posterior distribution of the bias-adjusted treatment effect, which provide limiting values for the information that can be obtained from studies at high risk of bias. The parameters of the bias model can be estimated from collections of previously published meta-analyses. We explore alternative models for such data, and alternative methods for introducing prior information on the bias parameters into a new meta-analysis. Results from an illustrative example show that the bias-adjusted treatment effect estimates are sensitive to the way in which the meta-epidemiological data are modelled, but that using point estimates for bias parameters provides an adequate approximation to using a full joint prior distribution. A sensitivity analysis shows that the gain in precision from including studies at high risk of bias is likely to be low, however numerous or large their size, and that little is gained by incorporating such studies, unless the information from studies at low risk of bias is limited. We discuss approaches that might increase the value of including studies at high risk of bias, and the acceptability of the methods in the evaluation of health care interventions.

8.
Health Risk and Portfolio Choice
This article investigates the role of self-perceived risky health in explaining continued reductions in financial risk taking after retirement. If future adverse health shocks threaten to increase the marginal utility of consumption, either by absorbing wealth or by changing the utility function, then health risk should prompt individuals to lower their exposure to financial risk. I examine individual-level data from the Study of Assets and Health Dynamics Among the Oldest Old (AHEAD), which reveal that risky health prompts safer investment. Elderly singles respond the most to health risk, consistent with a negative cross partial deriving from health shocks that impede home production. Spouses and planned bequests provide some degree of hedging. Risky health may explain 20% of the age-related decline in financial risk taking after retirement.

9.
Ning Hanwen and Tu Xueyong 《统计研究》 (Statistical Research), 2019, 36(10): 58-73
Volatility is a central topic in financial risk management. Drawing on complex network theory and data mining techniques, this paper proposes a high-dimensional volatility network model for the stock market. First, mutual information is used to measure the dependence between the price fluctuations of different stocks. Second, network topology indicators such as degree centralization, average path length, and power-law degree distribution are constructed for market volatility over different periods. Third, based on these indicators, the high-dimensional volatility network is built with Prim's algorithm. Finally, the Newman-Girvan algorithm is applied to study the dependence structure of stock price volatility hierarchically. The model breaks through the dimensionality limits of traditional volatility models: relying on few assumptions, it can uncover the interrelationships among many market participants and reflect the risk characteristics and network topology of financial markets. The empirical results show that the mutual information framework captures the nonlinear dependence of price volatility better than the commonly used Pearson correlation approach; that aggregate market volatility and the cross-stock volatility dependence move in opposite directions, so portfolio diversification works better when the market is in a high-volatility period; that the network contains a small number of high-degree key and central nodes through which risk can spread rapidly to the whole market; that the market shows clear industry clustering; and that the hierarchical analysis further visualizes how risk propagates between layers and the industry features of each layer. The high-dimensional volatility network model thus offers a new tool for uncovering the risk characteristics of stock markets and managing financial risk.
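A compact sketch of the network-construction step: discretize returns, estimate pairwise mutual information, convert similarity to distance, and extract a spanning-tree backbone. scipy's MST routine stands in for the Prim construction described in the abstract, the Newman-Girvan layering step is omitted, and the return series are synthetic.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.metrics import mutual_info_score

def mi_matrix(returns, bins=10):
    """Pairwise mutual information between discretized return series."""
    edges = [np.histogram(r, bins)[1][:-1] for r in returns.T]
    disc = np.stack([np.digitize(r, e) for r, e in zip(returns.T, edges)], axis=1)
    k = returns.shape[1]
    mi = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            mi[i, j] = mi[j, i] = mutual_info_score(disc[:, i], disc[:, j])
    return mi

rng = np.random.default_rng(3)
returns = rng.normal(size=(500, 8))     # 500 days x 8 hypothetical stocks
mi = mi_matrix(returns)
dist = np.exp(-mi)                      # high mutual information -> short distance
np.fill_diagonal(dist, 0.0)
backbone = minimum_spanning_tree(dist)  # spanning-tree backbone of the network
print(backbone.toarray().round(3))
```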

10.
Gene regulation plays a fundamental role in biological activities. The gene regulation network (GRN) is a high-dimensional complex system that can be represented by various mathematical or statistical models. The ordinary differential equation (ODE) model is one of the popular dynamic GRN models. We propose a comprehensive statistical procedure for the ODE model to identify the dynamic GRN. In this article, we apply this model to different segments of time-course gene expression data from a simulation experiment and a yeast cell cycle study. We found that the two-cell-cycle and one-cell-cycle data gave consistent results, whereas the half-cell-cycle data produced biased estimates. We therefore conclude that the proposed model can quantify gene expression dynamics over two cell cycles or one cell cycle, but not over half a cycle. The findings suggest that the model can identify the dynamic GRN correctly when the time-course gene expression data are sufficient to capture the overall dynamics of the underlying biological mechanism.
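The abstract does not give the model details, but a bare-bones version of ODE-based GRN identification can be sketched by fitting a linear system dx/dt = Ax to time-course data, using finite-difference derivatives and ridge-regularised least squares. The three-gene oscillator, the tuning constant, and the fitting shortcut below are illustrative assumptions, not the authors' procedure.

```python
import numpy as np

def fit_linear_grn(X, t, alpha=1e-2):
    """Estimate A in dx/dt = A x from time-course expression X (times x genes).

    Finite-difference derivatives regressed on expression levels with a small
    ridge penalty; a bare-bones stand-in for a full ODE estimation procedure.
    """
    dX = np.gradient(X, t, axis=0)           # dx/dt at each time point
    G = X.shape[1]
    XtX = X.T @ X + alpha * np.eye(G)
    return np.linalg.solve(XtX, X.T @ dX).T  # row g of A: regulators of gene g

# simulate one "cell cycle" of a 3-gene oscillator and recover A
A_true = np.array([[0.0, -1.0, 0.0],
                   [1.0,  0.0, 0.0],
                   [0.5,  0.0, -0.2]])
t = np.linspace(0, 2 * np.pi, 200)
x = np.zeros((len(t), 3))
x[0] = [1.0, 0.0, 0.5]
for k in range(1, len(t)):                   # forward-Euler integration
    x[k] = x[k - 1] + (t[k] - t[k - 1]) * (A_true @ x[k - 1])
print(np.round(fit_linear_grn(x, t), 2))
```

Refitting on only the first half of the trajectory (x[:100], t[:100]) visibly degrades the estimate of A, which parallels the abstract's half-cycle finding.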

11.
Abstract

We consider the classification of high-dimensional data under the strongly spiked eigenvalue (SSE) model. We create a new classification procedure based on the high-dimensional eigenstructure in the high-dimension, low-sample-size (HDLSS) context, proposing a distance-based classification procedure that uses a data transformation. We also prove that the proposed procedure has a consistency property for the misclassification rate. We assess its performance in simulations and in real data analyses using microarray data sets.
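As a loose illustration only (not the paper's actual procedure under the SSE model), the sketch below applies one simple data transformation, projecting out the leading principal direction, and then uses a nearest-centroid distance rule in a p >> n setting. The simulation design and all names are assumptions.

```python
import numpy as np

def despike_projector(X, k=1):
    """Projector removing the k leading principal directions (the 'spiked' part)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return np.eye(X.shape[1]) - Vt[:k].T @ Vt[:k]

def classify(x, X0, X1, k=1):
    """Nearest-centroid rule after the spike-removing transformation."""
    P = despike_projector(np.vstack([X0, X1]), k)
    d0 = np.linalg.norm(P @ (x - X0.mean(axis=0)))
    d1 = np.linalg.norm(P @ (x - X1.mean(axis=0)))
    return int(d1 < d0)

rng = np.random.default_rng(4)
p, n = 200, 20                                # HDLSS: p >> n
spike = rng.normal(size=p)
spike /= np.linalg.norm(spike)                # one strongly spiked direction
X0 = rng.normal(0.0, 1, (n, p)) + np.outer(rng.normal(0, 10, n), spike)
X1 = rng.normal(0.6, 1, (n, p)) + np.outer(rng.normal(0, 10, n), spike)
test = rng.normal(0.6, 1, p) + rng.normal(0, 10) * spike
print(classify(test, X0, X1))                 # expect 1
```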

12.
The basic business register of the official statistical system is the cornerstone of statistical data quality, yet its existing data sources have drawbacks in cost, timeliness, and the burden placed on data providers. This paper therefore proposes an approach to updating and maintaining the register from the perspective of integrating internet big data: it argues, from the angles of participant behavior and data quality, for using heterogeneous internet sources as update sources for the register; discusses the technical means of acquiring the register's basic information, attribute information, and geolocation information; and presents an example application.

13.
Variable screening for censored survival data is most challenging when both the survival and censoring times are correlated with an ultrahigh-dimensional vector of covariates. Existing approaches to handling censoring often use inverse probability weighting, assuming that censoring is independent of both the survival time and the covariates. This is a convenient but rather restrictive assumption that may fail in real applications, especially when the censoring mechanism is complex and the number of covariates is large. To accommodate the heterogeneous (covariate-dependent) censoring that is often present in high-dimensional survival data, we propose a Gehan-type rank screening method to select features relevant to the survival time. The method is invariant to monotone transformations of the response and of the predictors, and works robustly for a general class of survival models. We establish the sure screening property of the proposed methodology. Simulation studies and a lymphoma data analysis demonstrate its favorable performance and practical utility.
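One plausible reading of a marginal Gehan-type screening statistic, evaluated at zero coefficients, is sketched below: for each covariate, sum the covariate differences over comparable pairs (an observed event time and any subject still under observation at that time) and rank covariates by the absolute value. The paper's exact construction may differ, and the covariate-dependent censoring in the simulation is an illustrative assumption.

```python
import numpy as np

def gehan_screening(X, time, event):
    """Marginal Gehan-type statistic per covariate, evaluated at beta = 0."""
    n, p = X.shape
    # comp[i, k] = 1 if subject i has an observed event and time[k] >= time[i]
    comp = (time[None, :] >= time[:, None]) & (event[:, None] == 1)
    stats = np.empty(p)
    for j in range(p):
        diff = X[:, j][:, None] - X[:, j][None, :]   # X_ij - X_kj over pairs
        stats[j] = abs((diff * comp).sum()) / (n * (n - 1))
    return stats

rng = np.random.default_rng(5)
n, p = 150, 1000
X = rng.normal(size=(n, p))
t_true = np.exp(-X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, n))
c = np.exp(0.3 * X[:, 0] + rng.normal(0, 1, n))      # covariate-dependent censoring
time, event = np.minimum(t_true, c), (t_true <= c).astype(int)
u = gehan_screening(X, time, event)
print(np.argsort(u)[::-1][:5])                        # top-ranked features
```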

14.
Principal component analysis (PCA) is widely used to analyze high-dimensional data, but it is very sensitive to outliers. Robust PCA methods seek fits that are unaffected by the outliers and can therefore be trusted to reveal them. FastHCS (high-dimensional congruent subsets) is a robust PCA algorithm suitable for high-dimensional applications, including cases where the number of variables exceeds the number of observations. After detailing the FastHCS algorithm, we carry out an extensive simulation study and three real data applications, the results of which show that FastHCS is systematically more robust to outliers than state-of-the-art methods.
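FastHCS itself is considerably more refined, but the flavour of subset-based robust PCA can be conveyed in a few lines: fit PCA on many random half-samples, keep the fit with the smallest median residual, and flag observations with large residuals as outliers. Everything below (subset count, residual criterion, data) is an illustrative assumption, not the FastHCS algorithm.

```python
import numpy as np

def subset_pca(X, q=2, n_subsets=200, seed=0):
    """Random-subset robust PCA sketch: keep the half-sample fit whose
    q-dimensional PCA has the smallest median residual over all points."""
    rng = np.random.default_rng(seed)
    n = len(X)
    h = n // 2 + 1
    best, best_score = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(n, h, replace=False)
        mu = X[idx].mean(axis=0)
        _, _, Vt = np.linalg.svd(X[idx] - mu, full_matrices=False)
        V = Vt[:q].T
        resid = np.linalg.norm((X - mu) - (X - mu) @ V @ V.T, axis=1)
        score = np.median(resid)
        if score < best_score:
            best_score, best = score, (mu, V, resid)
    return best

rng = np.random.default_rng(10)
X = rng.normal(size=(100, 50)) @ np.diag(np.r_[5, 3, np.ones(48)])
X[:10] += 10.0                      # 10 gross outliers
mu, V, resid = subset_pca(X)
print(np.argsort(resid)[-10:])      # outliers stand out by their residuals
```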

15.
ABSTRACT

Identifying homogeneous subsets of predictors in classification can be challenging in the presence of high-dimensional data with highly correlated variables. We propose a new method, the cluster correlation-network support vector machine (CCNSVM), that simultaneously estimates clusters of predictors relevant for classification and the coefficients of a penalized SVM. The new CCN penalty is a function of the well-known Topological Overlap Matrix, whose entries measure the strength of connectivity between predictors. CCNSVM implements an efficient algorithm that alternates between searching for predictor clusters and optimizing a penalized SVM loss function using Majorization-Minimization tricks and a coordinate descent algorithm. Combining clustering and sparsity in a single procedure provides additional insight into the power of exploiting dimension-reduction structure in high-dimensional binary classification. Simulation studies compare the performance of our procedure with that of its competitors. A practical application of CCNSVM to DNA methylation data illustrates its good behaviour.
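The Topological Overlap Matrix that the CCN penalty builds on has a standard closed form (as used in WGCNA): TOM_ij = (sum_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij), where k_i is the connectivity of node i. A short numpy sketch, with a soft-thresholded correlation adjacency as an assumed input; the penalized SVM itself is not reproduced here.

```python
import numpy as np

def topological_overlap(A):
    """Topological Overlap Matrix from an adjacency matrix A (zero diagonal)."""
    L = A @ A                                   # shared-neighbour connectivity
    k = A.sum(axis=1)
    denom = np.minimum.outer(k, k) + 1.0 - A
    tom = (L + A) / denom
    np.fill_diagonal(tom, 1.0)
    return tom

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 30))                  # 100 samples x 30 predictors
X[:, :5] += rng.normal(size=(100, 1)) * 2.0     # a correlated block of predictors
A = np.abs(np.corrcoef(X, rowvar=False)) ** 6   # soft-thresholded adjacency
np.fill_diagonal(A, 0.0)
tom = topological_overlap(A)
print(tom[:5, :5].round(2))                      # the block shows high overlap
```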

16.
In this article we propose methodology for inferring binary-valued adjacency matrices from various measures of the strength of association between pairs of network nodes, or more generally pairs of variables. This strength of association can be quantified by sample covariance and correlation matrices, and more generally by test statistics and hypothesis test p-values from arbitrary distributions. Community detection methods such as block modeling typically require binary-valued adjacency matrices as a starting point, so a main motivation for the proposed methodology is to obtain such matrices from these pairwise measures of strength of association. The methodology is applicable to large high-dimensional data sets and is based on computationally efficient algorithms. We illustrate its utility in a range of contexts and data sets.
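One simple instance of the task: threshold a matrix of pairwise p-values with a Benjamini-Hochberg rule to obtain a binary adjacency matrix. The article develops more general inferential machinery; the correlation-test setup below is an assumption for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

def adjacency_from_pvalues(P, q=0.05):
    """Binary adjacency via Benjamini-Hochberg thresholding of pairwise p-values."""
    n = P.shape[0]
    iu = np.triu_indices(n, k=1)
    pvals = P[iu]
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                  # reject the k smallest p-values
    A = np.zeros((n, n), dtype=int)
    A[iu] = reject
    return A + A.T

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 6))
X[:, 1] += 0.5 * X[:, 0]                      # one genuine edge
P = np.ones((6, 6))
for i in range(6):
    for j in range(i + 1, 6):
        P[i, j] = P[j, i] = pearsonr(X[:, i], X[:, j])[1]
print(adjacency_from_pvalues(P))
```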

17.
18.
Decision making with adaptive utility generalises classical Bayesian decision theory, allowing a normative theory of decision selection when preferences are initially uncertain. In this paper we address some of the foundational issues of adaptive utility from the perspective of a Bayesian statistician. We also explore the implications of this generalisation for the traditional utility concepts of value of information and risk aversion, and introduce a new concept of trial aversion: similar to risk aversion, it concerns a decision maker's aversion to selecting decisions whose resulting utility is highly uncertain.

19.
Variable and model selection problems are fundamental to high-dimensional statistical modeling in diverse fields of science. In health studies especially, many potential factors are usually introduced to determine an outcome variable. This paper addresses high-dimensional statistical modeling through an analysis of the annual trauma data in Greece for 2005. The data set is divided into an experiment set and a control set and consists of 6334 observations and 112 factors, including demographic, transport, and intrahospital data used to detect possible risk factors of death. In our study, different model selection techniques are applied to the experiment set, and the notion of deviance is used on the control set to assess the fit of the overall selected model. The statistical methods employed were the non-concave penalized likelihood methods (the smoothly clipped absolute deviation, the least absolute shrinkage and selection operator, and hard thresholding penalties), generalized linear logistic regression, and best-subset variable selection. We discuss how significant variables are identified in large medical data sets, along with the performance and the pros and cons of the various statistical techniques used. The analysis reveals the distinct advantages of the non-concave penalized likelihood methods over the traditional model selection techniques.
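Of the penalties compared, the LASSO is the easiest to reproduce with standard tooling (SCAD and hard thresholding need specialised solvers). A sketch of L1-penalized logistic regression with a deviance check follows, on synthetic data matching the paper's dimensions (6334 observations, 112 factors); the data, effect sizes, and tuning grid are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(8)
n, p = 6334, 112
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:6] = [1.2, -0.8, 0.6, 0.5, -0.5, 0.4]           # six true risk factors
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta - 2.0))))

model = LogisticRegressionCV(penalty="l1", solver="saga", Cs=10,
                             cv=5, max_iter=5000).fit(X, y)
print("selected factors:", np.flatnonzero(model.coef_.ravel()))

# deviance of the fitted model; computing it on a held-out control set
# would mirror the paper's experiment/control design
proba = model.predict_proba(X)
deviance = -2 * np.sum(y * np.log(proba[:, 1] + 1e-12)
                       + (1 - y) * np.log(proba[:, 0] + 1e-12))
print("deviance:", round(deviance, 1))
```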

20.
He Qiang and Dong Zhiyong 《统计研究》 (Statistical Research), 2020, 37(12): 91-104
Big data open up an important avenue for innovation in forecasting quarterly GDP. Using internet big data from Baidu and other websites, this paper carries out an in-depth predictive analysis of China's quarterly GDP growth over 2011-2018, based on representative high-dimensional machine learning (and deep learning) models. The study finds that imposing distributional assumptions on the random disturbances in a model helps reduce forecast error, whereas simply letting a model learn mechanically from large amounts of data does not always improve its predictive ability; that adding penalty constraints on the set of explanatory variables effectively handles the thorny problem of the high dimensionality of internet big data; and that the optimal big-data explanatory-variable set for forecasting quarterly GDP growth is fairly stable.
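A minimal sketch of the penalised-regression idea the paper highlights: with only 32 quarters of GDP growth but hundreds of web-index predictors, an L1 penalty selects a sparse explanatory set, and an expanding-window split respects the time ordering. All features, effect sizes, and the GDP series below are fabricated for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(9)
quarters, n_features = 32, 300                 # 2011-2018 at quarterly frequency
X = rng.normal(size=(quarters, n_features))    # standardized web search indices
y = 7.0 + 0.4 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 0.2, quarters)

# penalised regression handles p >> n; expanding-window CV respects time order
cv = TimeSeriesSplit(n_splits=5)
model = LassoCV(cv=cv, max_iter=50000).fit(X, y)
print("active predictors:", np.flatnonzero(model.coef_))
print("latest-quarter fit:", model.predict(X[-1:])[0].round(2))
```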

