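A curve clustering method based on B-spline basis expansion   Cited by: 4 (self-citations: 1, citations by others: 3)
With the arrival of the big data era, functional data analysis has become an active research topic in recent years, and methods for clustering curves have attracted academic attention. This paper presents a curve clustering method: taking the L2 distance as the measure of similarity and representing curves by B-spline basis expansions, it incorporates both the information in the curves themselves and the information in their variation into the clustering algorithm, and connects curve clustering with traditional multivariate clustering methods. As an application, the method is validated on a functional clustering example of urban and rural income. The results show that incorporating curve variation information yields better clustering than considering the curves themselves alone.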
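To make the link between the L2 distance and ordinary multivariate clustering concrete, the following is a sketch of one standard formulation, not necessarily the authors' exact construction. The symbols $c_i$, $W$, $W^{(1)}$, $L$, and the weight $\lambda$ are assumptions introduced here for illustration.

```latex
% Each curve is expanded in a common B-spline basis B(t) = (B_1(t), ..., B_K(t))^T
x_i(t) = \sum_{k=1}^{K} c_{ik} B_k(t) = c_i^{\top} B(t)

% The squared L2 distance between two curves is a quadratic form in the coefficients
d^2(x_i, x_j) = \int \bigl(x_i(t) - x_j(t)\bigr)^2 \, dt
             = (c_i - c_j)^{\top} W \, (c_i - c_j),
\qquad W = \int B(t) B(t)^{\top} \, dt

% Variation information can enter through the first derivatives (lambda >= 0 is a hypothetical weight)
D^2(x_i, x_j) = (c_i - c_j)^{\top} \bigl( W + \lambda W^{(1)} \bigr) (c_i - c_j),
\qquad W^{(1)} = \int B'(t) B'(t)^{\top} \, dt

% With the Cholesky factorization W = L L^T, the transformed vectors u_i = L^T c_i
% satisfy d(x_i, x_j) = \lVert u_i - u_j \rVert, so any Euclidean clustering method applies
```

Under this formulation, clustering the curves by L2 distance reduces to clustering the vectors $u_i$ with a conventional method such as hierarchical clustering or k-means.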

In this article, we consider the estimation of the covariation of two asset prices that contain jumps and microstructure noise, based on high-frequency data. We propose a realized covariance estimator that combines the pre-averaging method, to remove the microstructure noise, with the threshold method, to reduce the effect of jumps. Asymptotic properties, such as consistency and asymptotic normality, are investigated. The estimator allows a very general jump structure, for example, infinite activity or even infinite variation. A simulation study is also included to illustrate the performance of the proposed procedure.

The B-spline representation is a common tool for improving the fit of smooth nonlinear functions; it provides a fit in the form of a piecewise polynomial. The regions that define the pieces are separated by a sequence of knots. The main difficulty in this type of modeling is the choice of the number and the locations of these knots. The Reversible Jump Markov Chain Monte Carlo (RJMCMC) algorithm provides a way to select both simultaneously by treating the knots as free parameters. This algorithm belongs to the family of MCMC techniques that allow simulation from target distributions on spaces of varying dimension. The aim of the present investigation is to use this algorithm in the analysis of survival times, in particular for the Cox model. The Cox model assumes that the relation between the hazard ratio function and the covariates is log-linear, an assumption that is often too restrictive. We therefore propose to use the RJMCMC algorithm to model the log hazard ratio function by a B-spline representation with an unknown number of knots at unknown locations. The method is illustrated with two real data sets: the Stanford heart transplant data and lung cancer survival data. Another application of RJMCMC is the selection of significant covariates, for which a simulation study is performed.


An aspect of cluster analysis that has been widely studied in recent years is the weighting and selection of variables. Procedures have been proposed that can identify the cluster structure present in a data matrix when that structure is confined to a subset of variables. Other methods assess the relative importance of each variable as revealed by a suitably chosen weight. But when a cluster structure is present in more than one subset of variables and differs from one subset to another, those solutions, as well as standard clustering algorithms, can give misleading results. Some very recent methodologies for finding consensus classifications of the same set of units can also be useful for identifying cluster structures in a data matrix, but each seems only partly satisfactory for the purpose at hand. A new, more specific procedure is therefore proposed and illustrated by analyzing two real data sets; its performance is evaluated by means of a simulation experiment.

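This article attempts to combine statistical ideas with rough set theory and proposes several effective methods for compressing attribute items in transactional databases: importance-based attribute compression, dependence-based attribute compression, generalized linear analysis and compression of attribute items, and attribute compression based on multiple correlation, with the aim of compressing the database.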

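Compiling China's CPI from scanner data in the context of "big data"   Cited by: 1 (self-citations: 6, citations by others: 1)
Scanner data offer a new technical paradigm for the informatization reform of source data in government statistics and for macroeconomic measurement. Based on a review of how countries around the world currently use scanner data to compile the CPI, and in view of the current state of scanner data in China and the characteristics of government price statistics, this paper proposes an approach to compiling China's CPI from scanner data, aiming to provide a theoretical and practical reference for the "big data"-based informatization reform of source data in government statistics.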

In applications of item response theory (IRT), many examinees often omit a substantial proportion of item responses. This can occur for various reasons, though it may be due to nothing more than an incomplete design. In such circumstances, the literature not infrequently refers to various types of estimation problems, often in terms of generic "convergence problems" in the software used to estimate model parameters. With reference to the Partial Credit Model and to data missing at random, this article demonstrates that as the number of missing responses increases, so does the number of anomalous datasets, understood as those that do not correspond to a finite estimate of (the parameter vector that identifies) the model. Moreover, necessary and sufficient conditions are given for the existence and uniqueness of the maximum likelihood estimate of the Partial Credit Model (and hence, in particular, of the Rasch model) with incomplete data, for the model in its more general form, in which the number of response categories varies by item. A taxonomy of possible cases of anomaly is then presented, together with an algorithm useful for diagnostics.

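Xu Feng (徐凤), Li Shi (黎实). 《统计研究》 (Statistical Research), 2018, 35(3): 112-128
In large-dimensional panel data, the cross-sectional units are likely to exhibit partial heterogeneity: the parameters have a group structure across units, being identical within a group and different between groups. Ignoring partial heterogeneity and applying methods that assume complete heterogeneity or homogeneity may lead to inconsistent estimates and invalid statistical inference. Because existing studies of partial heterogeneity either impose cross-sectional independence or are limited to the strong-factor case, this paper examines the identification of partially heterogeneous structures in panel data within the more general framework of Reese and Westerlund (2015)[1], which allows strong or non-strong factors. The CCE (Common Correlated Effects) method of Pesaran (2006)[2] is used to handle common factors of different strengths and, drawing on the C-Lasso (Classifier-Least Absolute Shrinkage and Selection Operator) method of Su et al. (2016)[3], a penalized least squares criterion with an additive-multiplicative penalty is constructed for the CCE-transformed equation; optimizing it yields the grouping and the parameter estimates simultaneously. Theoretical analysis shows that, with strong or semi-strong factors, the proposed method classifies consistently: the probability that all individuals are correctly grouped tends to 1 asymptotically. Both the Lasso estimator and the post-Lasso estimator of the parameters are asymptotically normal. The analysis also shows that factor strength does not affect classification consistency but does affect the asymptotic normality of the two estimators: the stronger the factors, the faster the two estimators converge. Simulations show that, in finite samples, the proposed method performs well in grouping, parameter estimation, and determination of the number of groups. Specifically, with strong factors and with various semi-strong factors, the rates of correct grouping and of correctly determining the number of groups rise quickly to 100% as N and T increase, while the root mean squared error and bias of both parameter estimators fall markedly. Finally, the proposed method is used to study the partial heterogeneity of the effect of human capital on economic growth.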

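Data quality directly affects the efficiency of data analysis and the reliability of its results. Data quality covers the quality of the data structure and, given the data structure, the authenticity, consistency, and completeness of the data. This article focuses on how to identify potential quality problems in data once they have been obtained, from three perspectives (cells, records, and variables), and uses case studies to illustrate the problems that may arise.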

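Concept drift in data stream classification is a frontier and difficult problem in data mining; the central issue is that class labels may drift as the data sequence evolves. Although algorithms for estimating dynamic drift and adjusting the classifier have been proposed, existing algorithms do not perform well at estimating concept drift because instances from the target distribution are lacking, and the number of instances strongly affects estimation quality. This paper therefore proposes a new parameter estimation method, called the transfer estimation method, which uses data from the target distribution together with similar-distribution theory to improve existing algorithms, so that concept drift in data stream classification can be correctly detected and estimated. Simulation experiments on synthetic and real data sets show that the improved algorithm outperforms existing algorithms in estimating concept drift in data stream classification.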

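Characteristics and analysis methods of bidders' bid data in online auctions   Cited by: 4 (self-citations: 2, citations by others: 2)
In traditional statistical analysis, researchers deal with three forms of numerical data: cross-sectional data, time series data, and pooled data. Such data are discrete, equally spaced, and of uniform density, and they are the main objects of analysis in traditional descriptive and inferential statistics. However, data collected from auction websites, such as bidders' bids, do not have these characteristics and pose a challenge to traditional statistical methods. It is therefore necessary to explain the mechanism that generates online auction data in terms of data volume, the mixed nature of the data, unequal spacing, and data density, to analyze their characteristics, and, using actual online auction data, to present methods and procedures for analyzing data of this kind.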

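Because the data come from different sources, many inconsistencies arise when a social accounting matrix is compiled. Most researchers in China and abroad currently balance the matrix with methods such as RAS and CE, but these are purely mathematical linear adjustment and balancing methods that do not take into account the real economic meaning of the specific data. This paper therefore designs an "item-by-item correspondence balancing method", which, on the basis of the specific economic meaning of the elements in the matrix, treats all unbalanced items one by one to bring the whole matrix into balance. This balancing method is also used to compile the 2007 social accounting matrix for China.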

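Functional data mining and its application to the analysis of China's consumption function   Cited by: 1 (self-citations: 0, citations by others: 1)
Following the ideas of data mining, this paper proposes a method for constructing general functional data using the Bernstein basis. On this basis, consumption function data are constructed from the per capita annual income and consumption expenditure of urban residents in China's 31 provinces (autonomous regions and municipalities), and an error analysis is carried out. The first and second derivatives of the consumption function are then computed to further mine its rate of development, with good results.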
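As an illustration of why a Bernstein-basis representation makes first and second derivatives easy to obtain, here is a minimal Python sketch. It assumes the argument has been rescaled to [0, 1]; the function names and the example coefficients are hypothetical and do not come from the paper.

```python
from math import comb


def bernstein_basis(n, k, t):
    """Value of the k-th Bernstein basis polynomial of degree n at t in [0, 1]."""
    return comb(n, k) * t**k * (1 - t) ** (n - k)


def bernstein_curve(coeffs, t):
    """Evaluate sum_k coeffs[k] * B_{k,n}(t), where n = len(coeffs) - 1."""
    n = len(coeffs) - 1
    return sum(c * bernstein_basis(n, k, t) for k, c in enumerate(coeffs))


def derivative_coeffs(coeffs):
    """Bernstein coefficients (degree n - 1) of the derivative of the curve.

    Uses the identity d/dt sum_k c_k B_{k,n}(t) = n * sum_k (c_{k+1} - c_k) B_{k,n-1}(t).
    """
    n = len(coeffs) - 1
    return [n * (coeffs[k + 1] - coeffs[k]) for k in range(n)]


# Hypothetical coefficients: with [0, 0.5, 1] the degree-2 curve is simply f(t) = t.
coeffs = [0.0, 0.5, 1.0]
first = derivative_coeffs(coeffs)   # coefficients of f'(t)
second = derivative_coeffs(first)   # coefficients of f''(t)
value_at_03 = bernstein_curve(coeffs, 0.3)
```

Applying `derivative_coeffs` twice gives the second derivative in the same basis family. The derivatives are taken with respect to the rescaled argument, so they would need the usual chain-rule factor to be expressed in the original units.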

We develop functional data analysis techniques using the differential geometry of a manifold of smooth elastic functions on an interval in which the functions are represented by a log-speed function and an angle function. The manifold's geometry provides a method for computing a sample mean function and principal components on tangent spaces. Using tangent principal component analysis, we estimate probability models for functional data and apply them to functional analysis of variance, discriminant analysis, and clustering. We demonstrate these tasks using a collection of growth curves of children aged 1–18.

Crossover designs are often used in clinical trials. It is not uncommon for subjects to discontinue before completing all treatment periods in a crossover study. Despite the availability of statistical methodologies that use all available data, and of software for obtaining valid inferences under the assumption of missing at random (MAR), naïve approaches such as the complete case (CC) analysis, which is valid only under the strong assumption of missing completely at random, are still widely used in practice. In this article, we obtain the analytical form of the estimation bias of treatment effects with CC for linear mixed models. We use simulation studies to examine the inflation of Type I error and the efficiency loss in inferences with CC under MAR. The invalidity and inefficiency of two other commonly used approaches for defining the analyzed data in the presence of missing data, namely including data from at least two periods in a three-period crossover and using available cases for a specific comparison of interest, are also demonstrated through simulation studies.

In this article, we consider panel data regression models with one-way and two-way error components, for which no exact tests are available. Exact inferences using generalized p-values are obtained. When there are several groups of panel data, tests for equal coefficients under one-way and two-way error components are derived.

This study develops a new bias-corrected estimator for the fixed-effects dynamic panel data model and derives its limiting distribution for a finite number of time periods, T, and a large number of cross-section units, N. The bias-corrected estimator is derived as a bias correction of the least squares dummy variable (within) estimator. It does not share some of the drawbacks of recently developed instrumental variables and generalized method-of-moments estimators and is relatively easy to compute. Monte Carlo experiments provide evidence that the bias-corrected estimator performs well even in small samples. The proposed technique is applied in an empirical analysis of unemployment dynamics at the U.S. state level for the 1991–2000 period.

In this article, we propose various tests for serial correlation in fixed-effects panel data regression models with a small number of time periods. First, a simplified version of the test suggested by Wooldridge (2002) and Drukker (2003) is considered. The second test is based on the Lagrange Multiplier (LM) statistic suggested by Baltagi and Li (1995), and the third test is a modification of the classical Durbin–Watson statistic. Under the null hypothesis of no serial correlation, all tests possess a standard normal limiting distribution as N tends to infinity and T is fixed. Analyzing the local power of the tests, we find that the LM statistic has superior power properties. Furthermore, a generalization to test for autocorrelation up to some given lag order and a test statistic that is robust against time-dependent heteroskedasticity are proposed.
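The third test modifies the classical Durbin–Watson statistic, but the modification itself is not described in the abstract. As background only, the following minimal Python sketch computes the classical statistic for a single residual series; the function name is hypothetical and this is not the panel version proposed in the article.

```python
def durbin_watson(residuals):
    """Classical Durbin-Watson statistic for one residual series.

    DW = sum_{t>=2} (e_t - e_{t-1})^2 / sum_t e_t^2
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    # Numerator: squared first differences of consecutive residuals
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    # Denominator: residual sum of squares
    den = sum(e**2 for e in residuals)
    if den == 0:
        raise ValueError("residuals are all zero")
    return num / den


# Perfectly alternating residuals (strong negative autocorrelation)
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
# Constant residuals (strong positive autocorrelation)
dw_constant = durbin_watson([2.0, 2.0, 2.0])
```

The statistic lies between 0 and 4; values near 2 are consistent with no first-order serial correlation, values near 0 point to positive correlation, and values near 4 to negative correlation.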

Missing observations in both responses and covariates arise frequently in longitudinal studies. When data are missing not at random, inferences under the likelihood framework often require joint modelling of the response and covariate processes, as well as of the missing data processes associated with the incompleteness of responses and covariates. Specifying these four joint distributions is a nontrivial issue from the perspectives of both modelling and computation. To get around this problem, we employ pairwise likelihood formulations, which avoid the specification of third or higher order association structures. In this paper, we consider three specific missing data mechanisms that lead to further simplified pairwise likelihood (SPL) formulations. Under these missing data mechanisms, inference methods based on SPL formulations are developed. The resulting estimators are consistent and offer better robustness and computational convenience. Their performance is evaluated empirically through simulation studies. Longitudinal data from the National Population Health Survey and the Waterloo Smoking Prevention Project are analysed to illustrate the use of our methods.


Acquiring useful knowledge from known and systematized information (data) is a major challenge for researchers, users and, ultimately, decision makers. In this sense, the knowledge discovery from data (KDD) process is a valuable tool for information analysis. This work presents a KDD approach to identifying patterns in time series of anchovy and sardine fisheries data and environmental data in northern Chile. Time series analysis, multivariate analysis and data mining techniques were used, along with a review of the technical literature to validate the results. The KDD approach and the data mining techniques implemented integrated these variables, identifying relevant patterns associated with fluctuations in fisheries abundance and a strong association with environmental changes such as El Niño and long-term cold–warm regimes; periods of anchovy and sardine predominance associated with environmental conditions were also identified. These results lay the groundwork for studying underlying functional relationships that could reduce gaps in national management policies for those fisheries.

