Similar articles
20 similar articles found.
1.
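For panel data whose variables are nonlinearly correlated, existing panel-data clustering methods based on linear algorithms cannot measure the similarity between samples accurately, and their clustering results are hard to interpret. Addressing both the nonlinear correlation among variables and the interpretability of clustering results, this paper proposes a clustering method for nonlinear panel data: sample similarity is measured with nonlinear kernel principal component analysis, and samples are then clustered probabilistically with a Gaussian mixture model. An empirical study shows that the method is effective and that it improves the interpretability of the clustering results.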

2.
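Fuzzy clustering algorithms based on statistical models become computationally infeasible once the data set exceeds a certain order of magnitude, and sampling is an effective way to reduce this time complexity. By improving on static sampling, this paper designs a semi-static sampling method that lets the sample retain as much of the original data set's information as possible and keeps the clustering results undistorted. Finally, an empirical analysis compares the methods and shows that the proposed one is effective.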

3.
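An improved SMOTE resampling algorithm for imbalanced data sets   Cited by: 1 (self-citations: 0, citations by others: 1)
Xue Wei, Statistical Research (《统计研究》), 2012, 29(6): 95-98
In imbalanced learning, imbalanced data sets typically yield poor classification performance on the negative class. The improved SMOTE resampling algorithm combines oversampling and undersampling, selects nearest neighbours in a targeted way, and synthesizes samples using different strategies. Experiments show that, on imbalanced data sets processed by this algorithm, classifiers achieve fairly good classification performance on both the positive and the negative class.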
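The abstract does not give the details of the improved algorithm, so the following is only a minimal sketch of the standard SMOTE interpolation step it builds on: each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, the parameter `k`, and the toy data are illustrative assumptions; the targeted neighbour selection and the undersampling step described above are not reproduced here.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by linear interpolation.

    Plain SMOTE only (an assumption): the paper's targeted neighbour
    selection and undersampling are not modelled.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Hypothetical two-dimensional minority class
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because every synthetic point is a convex combination of two minority samples, it stays inside the bounding box of the minority class.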

4.
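Clustering by partitioning on data distribution density is one of the main approaches among clustering algorithms in data mining. To address the shortcomings of traditional density-partitioning clustering algorithms, such as complex computation and low runtime efficiency, a multi-partition clustering algorithm based on high-dimensional stepwise projection is designed. Using the high-dimensional distribution projection density as the criterion, the data set is partitioned repeatedly to produce a space of sub-clusters, and the sub-clusters are then merged to form the desired clustering result. Experiments with the algorithm show that it is computationally simple and runs efficiently.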

5.
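The gold-standard-data method used to screen out cheaters in crowdsourcing contests, and the clustering-based outlier detection algorithms K-means- and DBSCAN, depend on parameters given in advance and are ill-suited to detection on large data sets. To address this, an outlier detection algorithm based on a sample connectivity graph is proposed. First, with given parameters, an outlier detection algorithm is called repeatedly to identify the outliers and clusters in the data. Second, the number of connections and the connection strength between every pair of samples are computed, and, given a lower bound δ on connection strength, a connectivity graph over the samples is built from these strengths. Finally, each sample is labelled as a cluster node or an outlier according to its connectivity. Experimental results show that, with a wider range of parameter settings, the algorithm narrows the fluctuation in the number of detected outliers and improves outlier identification accuracy, outperforming the comparison algorithms and the classical gold-standard-data method.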

6.
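Clustering by partitioning on data distribution density is one of the main approaches among clustering algorithms in data mining. To address the shortcomings of traditional density-partitioning clustering algorithms, such as complex computation and low runtime efficiency, a multi-partition clustering algorithm based on high-dimensional stepwise projection is designed. Using the high-dimensional distribution projection density as the criterion, the data set is partitioned repeatedly to produce a space of sub-clusters, and the sub-clusters are merged to form the desired clustering result. Experiments with this algorithm show that it is computationally simple and runs efficiently.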

7.
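Li Yang et al., Statistical Research (《统计研究》), 2018, 35(7): 125-128
Massive data volume, the first characteristic of big data, poses the foremost computational challenge. A large sample cannot necessarily stand in for the whole population, so algorithms for big data analysis must not only reduce computational cost but also characterize the uncertainty of the estimates. Taking the divide-and-conquer bootstrap and the subset double bootstrap as examples, this paper discusses the design of parallelizable statistical algorithms for big data that both improve computational efficiency and assess uncertainty, and it explores design ideas and future research directions through a comparative analysis.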

8.
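Wan Shuchen, Statistical Research (《统计研究》), 2021, 38(6): 116-127
To advance the sample survey of below-scale industrial enterprises and resolve problems the survey currently faces, this paper studies improvements to the sampling design. First, it systematically reviews how the sampling design for below-scale industry has evolved and summarizes two features of the current design: full use of a dual sampling frame and combined use of three sampling methods. Second, given the high density of enterprises in the industrial-park stratum, it explores improving the area sampling design by incorporating the park factor, sampling the park and non-park strata separately. This addresses the non-sampling errors encountered in the survey, and the auxiliary variables are adjusted so that they correlate strongly with the core indicators, which secures the precision of inference and raises survey efficiency. An empirical simulation for a province in eastern China shows how the park-based sampling design improves the survey work. Third, in light of the management needs of governments at all levels in China and the adjusted division of work between statistical bureaus and survey teams, the paper presents the main theoretical and empirical results on sample supplementation for below-scale industry. Finally, against the background of the wide range of data sources in the big data era, the paper proposes ideas for further improving the sampling design through multiple sampling frames and through the use of auxiliary variables to improve the precision of inference under sample rotation.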

9.
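Taking the green resource and environment development of the 12 leagues and cities of the Inner Mongolia Autonomous Region as the object of study, this paper combines grey dynamic clustering with rough sets to build a green resource and environment indicator system for Inner Mongolia comprising 11 indicators, such as annual water supply. The main points are as follows. First, grey relational analysis is used to build a grey relational matrix between samples, from which the samples are grey-clustered to reflect the redundancy of information between them. Second, a dynamic clustering approach is adopted: after removing one indicator at a time, the samples are grey-clustered again using the grey relational matrix obtained from grey relational analysis, which supplies the data for rough set reduction. Third, rough set reduction theory is used to judge whether a candidate indicator has a significant effect on the clustering result: each clustering result is compared with the original one, and a candidate indicator is retained if the two results differ and the indicator significantly affects the classification of the evaluated samples. Fourth, the P-value of the nonparametric Kruskal-Wallis test is used to show that the indicator system built here is reasonable. A comparative analysis shows that the proposed grey dynamic clustering and rough set indicator screening model outperforms the clustering and grey relational indicator screening model of existing studies.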

10.
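In economic and social surveys, spatial correlation between population units is widespread, which challenges traditional sampling designs. To address this, a design is proposed that uses latitude and longitude coordinates as spatial auxiliary information and draws the sample with a spatially balanced sampling algorithm. The algorithm uses the spatial distances between population units to update inclusion probabilities so that units close together in space tend not to enter the sample at the same time, which spreads the sample units evenly over space. Empirical results show that, as the sample size increases, the standard deviation of the estimator under spatially balanced sampling is always smaller than under traditional sampling designs within a reasonable range of sampling fractions, so the design significantly improves estimation efficiency.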
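The abstract does not name the algorithm. One well-known scheme that matches its description (updating inclusion probabilities between nearby units so that neighbours rarely enter the sample together) is the local pivotal method, and the sketch below assumes that this is the kind of update meant. The function name, the 4 x 4 grid, and the equal inclusion probabilities are illustrative assumptions, not details from the paper.

```python
import math
import random


def local_pivotal_sample(coords, probs, seed=0, eps=1e-9):
    """Spatially balanced sample via pairwise inclusion-probability updates.

    Assumed variant: pick a random undecided unit, pair it with its nearest
    undecided neighbour, and push one of the two probabilities to 0 or 1
    while preserving each unit's expected inclusion probability.
    """
    rng = random.Random(seed)
    p = list(probs)
    active = [i for i, q in enumerate(p) if eps < q < 1 - eps]
    while len(active) > 1:
        i = rng.choice(active)
        j = min((u for u in active if u != i),
                key=lambda u: math.dist(coords[i], coords[u]))
        total = p[i] + p[j]
        if total < 1:
            # one unit is excluded, the other absorbs the combined probability
            if rng.random() < p[j] / total:
                p[i], p[j] = 0.0, total
            else:
                p[i], p[j] = total, 0.0
        else:
            # one unit is included, the other keeps the remainder
            if rng.random() < (1 - p[j]) / (2 - total):
                p[i], p[j] = 1.0, total - 1
            else:
                p[i], p[j] = total - 1, 1.0
        active = [u for u in active if eps < p[u] < 1 - eps]
    if active:  # a single undecided unit: settle it with one random draw
        u = active[0]
        p[u] = 1.0 if rng.random() < p[u] else 0.0
    return [i for i, q in enumerate(p) if q > 1 - eps]


# Hypothetical population: 16 units on a 4 x 4 grid, expected sample size 4
coords = [(float(r), float(c)) for r in range(4) for c in range(4)]
sample = local_pivotal_sample(coords, [0.25] * 16)
```

Each update decides at least one of two neighbouring units, which is what keeps close units from being selected together; when the probabilities sum to an integer, the sample size is fixed at that integer.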

11.
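Jin Yongjin, Liu Zhan, Statistical Research (《统计研究》), 2016, 33(3): 11-17
When sampling from big data, a sampling frame is often hard to construct, so the resulting samples are non-probability samples to which traditional sampling inference theory is difficult to apply. How to carry out statistical inference for non-probability sampling is a serious challenge for survey sampling in the big data era. This paper proposes a basic approach to the problem in three parts. The first is the sampling method: sample selection based on sample matching, link-tracing sampling, and similar methods can make the non-probability sample approximate a probability sample, so that inference theory for probability samples can be used. The second is the construction and adjustment of weights: pseudo-design-based, model-based, and propensity-score methods can yield base weights similar to those of a probability sample. The third is estimation: hybrid probability estimation based on pseudo-design, model, and Bayesian approaches can be considered. Finally, a concrete solution is discussed, taking sample selection based on sample matching as an example.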

12.
The random walk Metropolis algorithm is a simple Markov chain Monte Carlo scheme which is frequently used in Bayesian statistical problems. We propose a guided walk Metropolis algorithm which suppresses some of the random walk behavior in the Markov chain. This alternative algorithm is no harder to implement than the random walk Metropolis algorithm, but empirical studies show that it performs better in terms of efficiency and convergence time.
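A minimal one-dimensional sketch of a guided walk sampler, as the idea is commonly described, may help: the chain carries a direction of travel, proposes a step of random size in that direction, and reverses direction only when a proposal is rejected. The function name, the step size, and the standard normal target are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def guided_walk_metropolis(log_density, x0, n_steps, step=1.0, seed=0):
    """One-dimensional guided walk sampler (sketch under stated assumptions).

    The direction (+1 or -1) is kept after an accepted move and flipped
    after a rejection, which suppresses back-and-forth random walk moves.
    """
    rng = random.Random(seed)
    x, direction = x0, 1
    chain = []
    for _ in range(n_steps):
        # step length is the absolute value of a normal draw
        proposal = x + direction * abs(rng.gauss(0.0, step))
        log_ratio = log_density(proposal) - log_density(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal            # accept and keep moving the same way
        else:
            direction = -direction  # reject and turn around
        chain.append(x)
    return chain


# Hypothetical target: standard normal, log density up to a constant
chain = guided_walk_metropolis(lambda x: -0.5 * x * x, x0=0.0, n_steps=20000)
```

The acceptance rule is the usual Metropolis ratio, so the sketch is no harder to write than a random walk sampler; only the direction variable is new.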

13.
Given the random walk model, we show, for the traditional unrestricted regression used in testing stationarity, that no matter what the initial value of the random walk is or its drift or its error standard deviation, the sampling distributions of certain statistics remain unchanged. Using Monte Carlo simulations, we estimate, for different finite samples, the sampling distributions of these statistics. After smoothing the percentiles of the empirical sampling distributions, we come up with a new set of critical values for testing the existence of a random walk, if each statistic is being used on an individual basis. Combining the new sets of critical values, we finally suggest a general methodology for testing for a random walk model.

14.
The stochastic block model (SBM) is widely used for modelling network data by assigning individuals (nodes) to communities (blocks) with the probability of an edge existing between individuals depending upon community membership. In this paper, we introduce an autoregressive extension of the SBM, based on continuous-time Markovian edge dynamics. The model is appropriate for networks evolving over time and allows for edges to turn on and off. Moreover, we allow for the movement of individuals between communities. An effective reversible-jump Markov chain Monte Carlo algorithm is introduced for sampling jointly from the posterior distribution of the community parameters and the number and location of changes in community membership. The algorithm is successfully applied to a network of mice.

15.
Much recent methodological progress in the analysis of infectious disease data has been due to Markov chain Monte Carlo (MCMC) methodology. In this paper, it is illustrated that rejection sampling can also be applied to a family of inference problems in the context of epidemic models, avoiding the issues of convergence associated with MCMC methods. Specifically, we consider models for epidemic data arising from a population divided into households. The models allow individuals to be potentially infected both from outside and from within the household. We develop methodology for selection between competing models via the computation of Bayes factors. We also demonstrate how an initial sample can be used to adjust the algorithm and improve efficiency. The data are assumed to consist of the final numbers ultimately infected within a sample of households in some community. The methods are applied to data taken from outbreaks of influenza.

16.
Adaptive sampling without replacement of clusters   Cited by: 1 (self-citations: 0, citations by others: 1)
In a common form of adaptive cluster sampling, an initial sample of units is selected by random sampling without replacement and, whenever the observed value of the unit is sufficiently high, its neighboring units are added to the sample, with the process of adding neighbors repeated if any of the added units are also high valued. In this way, an initial selection of a high-valued unit results in the addition of the entire network of surrounding high-valued units and some low-valued “edge” units where sampling stops. Repeat selections can occur when more than one initially selected unit is in the same network or when an edge unit is shared by more than one added network. Adaptive sampling without replacement of networks avoids some of this repeat selection by sequentially selecting initial sample units only from the part of the population not already in any selected network. The design proposed in this paper carries this step further by selecting initial units only from the population, exclusive of any previously selected networks or edge units.
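The common form of adaptive cluster sampling described in the first sentences can be sketched on a grid, as below. This is not the design proposed in the paper. The grid values, the threshold standing in for "sufficiently high", and the four-cell neighbourhood are all illustrative assumptions.

```python
import random


def adaptive_cluster_sample(grid, n_initial, threshold, seed=0):
    """Basic adaptive cluster sampling on a rectangular grid (sketch).

    Starts from a simple random sample without replacement and keeps adding
    the four neighbours of every sampled cell whose value meets the threshold.
    """
    rng = random.Random(seed)
    rows, cols = len(grid), len(grid[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    initial = rng.sample(cells, n_initial)  # initial units, no replacement
    sampled = set()
    pending = list(initial)
    while pending:
        r, c = pending.pop()
        if (r, c) in sampled:
            continue
        sampled.add((r, c))
        if grid[r][c] >= threshold:
            # high-valued unit: its neighbours join the sample
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in sampled:
                    pending.append((nr, nc))
        # low-valued units are edge units: sampling stops there
    return initial, sampled


# Hypothetical population with one network of high values and one isolated high cell
grid = [
    [0, 0, 0, 0, 0],
    [0, 5, 6, 0, 0],
    [0, 7, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 9],
]
initial, sampled = adaptive_cluster_sample(grid, n_initial=3, threshold=5)
```

The designs discussed in the abstract differ only in where the next initial unit may be drawn from: the whole population, the population minus selected networks, or the population minus selected networks and their edge units.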

17.
Missing data are often problematic in social network analysis since what is missing may potentially alter the conclusions about what we have observed as tie-variables need to be interpreted in relation to their local neighbourhood and the global structure. Some ad hoc methods for dealing with missing data in social networks have been proposed but here we consider a model-based approach. We discuss various aspects of fitting exponential family random graph (or p-star) models (ERGMs) to networks with missing data and present a Bayesian data augmentation algorithm for the purpose of estimation. This involves drawing from the full conditional posterior distribution of the parameters, something which is made possible by recently developed algorithms. With ERGMs already having complicated interdependencies, it is particularly important to provide inference that adequately describes the uncertainty, something that the Bayesian approach provides. To the extent that we wish to explore the missing parts of the network, the posterior predictive distributions, immediately available at the termination of the algorithm, are at our disposal, which allows us to explore the distribution of what is missing unconditionally on any particular parameter values. Some important features of treating missing data and of the implementation of the algorithm are illustrated using a well-known collaboration network and a variety of missing data scenarios.

18.
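Sun Xu et al., Statistical Research (《统计研究》), 2019, 36(7): 119-128
An intergenerational mobility table records the cross-classified frequencies of paired data on the social status of children and their parents, and so reflects how advantages and disadvantages in access to social resources compare across the two generations. Empirical studies of how basic social characteristics such as wealth, class, and privilege evolve all rely on the quantitative analysis of such tables. The log-linear model is the basic tool for modelling mobility tables: by fitting the cell frequencies of the contingency table, it identifies strong and weak interaction effects between the row and column categories and describes the interaction structure between the social status of father and son. This paper uses a community detection algorithm from complex networks to analyse the association structure of father and son social status and, to address the insufficient fit of parsimonious log-linear models, proposes a new modelling approach: community detection is applied to the residual contingency table of the parsimonious log-linear model to mine association patterns, and the detected community effect is introduced into the original log-linear model as an additional parameter constraint to improve the fit. Because the method adds only one parameter constraint to the original parsimonious model, the result remains parsimonious and theoretically meaningful, while the community effect supplements the original model's reading of the empirical data structure. The method is applied to an empirical intergenerational occupational mobility table drawn from the Chinese General Social Survey, and it explains the association pattern between the occupational classes of children and parents well.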

19.
We present a novel methodology for estimating the parameters of a finite mixture model (FMM) based on partially rank‐ordered set (PROS) sampling and use it in a fishery application. A PROS sampling design first selects a simple random sample of fish and creates partially rank‐ordered judgement subsets by dividing units into subsets of prespecified sizes. The final measurements are then obtained from these partially ordered judgement subsets. The traditional expectation–maximization algorithm is not directly applicable for these observations. We propose a suitable expectation–maximization algorithm to estimate the parameters of the FMMs based on PROS samples. We also study the problem of classification of the PROS sample into the components of the FMM. We show that the maximum likelihood estimators based on PROS samples perform substantially better than their simple random sample counterparts even with small samples. The results are used to classify a fish population using the length‐frequency data.

20.
In this paper, order statistics from independent and non-identically distributed random variables are used to obtain ordered ranked set sampling (ORSS). Bayesian inference for the unknown parameters of the Pareto distribution under a squared error loss function is derived. We compute the minimum posterior expected loss (the posterior risk) of the derived estimates and compare them with those based on the corresponding simple random sample (SRS) to assess the efficiency of the obtained estimates. Two-sample Bayesian prediction for future observations is introduced by using SRS and ORSS for one- and m-cycle. A simulation study and real data are used to illustrate the proposed results.
