Similar literature
19 similar articles found
1.
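Cluster analysis under Euclidean distance does not account for correlation among indicators, and model-based clustering methods suffer from problems such as multicollinearity undermining parameter stability. To address these problems when clustering samples whose variables are correlated under Euclidean distance, the article first builds a regression model of the correlation structure among the variables, then removes the multicollinearity among the variables through difference analysis, and finally performs the cluster analysis. The method is applied to urban education investment in 9 provinces over 1996-2011, and the results show that the proposed clustering method is effective.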

2.
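The article discusses general problems in the main steps of principal component comprehensive evaluation, proposes feasible solutions, and analyzes them theoretically. Building on a summary of the important existing literature on principal component cluster analysis, it constructs an objectively weighted principal component distance as the clustering statistic, which effectively solves the problem that existing clustering models cannot handle indicator collinearity and large disparities in indicator importance. A comparison of classification efficiency between the extended clustering model and similar models finds that the objective rationality inherent in weighted principal component cluster analysis is the fundamental reason for its advantage.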

3.
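A cluster analysis method based on weighted principal component distance   Total citations: 1, self-citations: 0, citations by others: 1
Lü Yanwei, Li Ping. Statistical Research (《统计研究》), 2016, 33(11): 102-108
High correlation among indicators, together with differences in their importance, often prevents traditional cluster analysis methods from producing good classifications. After examining the limitations of traditional cluster analysis and its various improved versions, the paper uses mathematical methods to reconstruct the notion of distance in the definition of classification. By defining an adaptively weighted principal component distance as the classification statistic, it proposes a new, improved principal component clustering method: weighted principal component distance cluster analysis. Theoretical study shows that the method systematically integrates the strengths of existing cluster analysis methods and has a sufficient theoretical basis to ensure that it is sound. Simulation results show that it effectively resolves the distortion that existing cluster analysis methods exhibit in particular situations and yields better classifications.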
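The idea of a weighted principal component distance can be illustrated with a short sketch. The abstract does not give the paper's adaptive weighting scheme, so the version below assumes each component is weighted by its share of total variance; the function name `weighted_pc_distance` and the toy data are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def weighted_pc_distance(X):
    """Pairwise distances between rows of X on variance-weighted PC scores."""
    X = np.asarray(X, dtype=float)
    # Standardize each indicator (column) to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Eigen-decompose the correlation matrix of the indicators.
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    # ASSUMPTION: weight = share of total variance explained by the component.
    weights = eigvals / eigvals.sum()
    scores = Z @ eigvecs                       # principal component scores
    diff = scores[:, None, :] - scores[None, :, :]
    return np.sqrt((weights * diff ** 2).sum(axis=2))

# Hypothetical toy data: 5 samples, 3 indicators (the first two highly correlated).
X_demo = np.array([[1.0, 1.1, 5.0],
                   [2.0, 2.1, 3.0],
                   [3.0, 2.9, 4.0],
                   [4.0, 4.2, 1.0],
                   [5.0, 5.1, 2.0]])
D_demo = weighted_pc_distance(X_demo)          # 5 x 5 symmetric distance matrix
```

The resulting matrix can be passed to any distance-based clustering routine. With equal weights on all components the distance would reduce to the ordinary Euclidean distance on the standardized data, so the weights are what distinguish this statistic.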

4.
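For panel data whose variables are nonlinearly correlated, existing panel-data clustering methods based on linear algorithms cannot accurately measure the similarity between samples, and their clustering results are hard to interpret. Taking both the nonlinear correlation among variables and the interpretability of the clustering results into account, the article proposes a clustering method for nonlinear panel data: sample similarity is measured with a nonlinear kernel principal component algorithm, and samples are then clustered probabilistically with a Gaussian mixture model. Empirical results show that the method is effective and improves the interpretability of the clustering results.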

5.
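The article proposes a method for the multiple linear regression model based on the principal component transformation. The method builds correlation matrices from the fluctuations of the dependent variable and of each variable, and uses them to obtain the distribution of the multivariate correlations. It then obtains the principal components with maximal correlation through the principal component transformation. Finally, it determines the regression coefficients from the distances between the principal component matrix and the correlation matrices together with least squares estimation. The algorithm rests on fluctuation correlation analysis and reflects the statistical determinacy among related elements within the system. Because the principal component transformation is built on correlation statistics, it can remove the effect of collinearity on the regression coefficients and makes least squares estimation more reliable.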

6.
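The size and financial indicators of China's commercial banks differ considerably, and so do the factors that affect their lending efficiency. Using average-linkage clustering, the article groups banks that are close in Euclidean distance and share a similar economic background, which yields two classes: the five large state-owned banks and the non-state-owned banks. Stepwise regression is then used to remove, one at a time, the variables in the SFA model whose t-tests are not significant, keeping all explanatory variables that significantly affect the explained variable. On this basis, a model for evaluating commercial bank lending efficiency is built and analyzed empirically.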

7.
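This paper studies the clustering of time series. Most real-world time series are nonlinear, whereas most existing time series clustering methods are based on linear time series models, so the paper proposes a clustering method that can be applied to nonlinear time series. The distance between nonlinear time series is measured by the similarity between their two-dimensional kernel density estimates. This distance is nonparametric, takes differences in the autocorrelation structure of the series into account, and can coarsely identify similarity in the shape and dynamic dependence structure of time series. Consistent with the theoretical results, the simulation experiments also confirm the effectiveness of this distance measure.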

8.
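A preliminary study of a hedonic price model for automobiles based on principal component analysis   Total citations: 1, self-citations: 0, citations by others: 1
Selecting the characteristic variables is an important issue when building a hedonic price model. In empirical work, researchers usually screen variables by stepwise regression to remove multicollinearity among the characteristic variables, so relatively few characteristic variables enter the model. This paper therefore introduces principal component analysis into the hedonic price model. Using Chinese automobile data, it builds a hedonic price model based on principal component analysis of automobile characteristics, which not only resolves the multicollinearity among the automobile characteristic variables but also effectively mitigates the problem that stepwise regression retains too few variables.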

9.
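Diagnosis and improvement of multicollinearity in the logistic model   Total citations: 1, self-citations: 0, citations by others: 1
The article diagnoses and remedies multicollinearity in the logistic regression model. Two indicators, the condition index and the variance decomposition proportion, are used to diagnose collinearity, and two methods, principal component improvement and partial least squares regression, are used to treat the multicollinear variables. This removes the influence of multicollinearity among the variables in the regression model and yields a reasonably satisfactory model of the relationship. The results show that applying these methods to diagnose and handle multicollinearity in logistic regression analysis is effective and feasible.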
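The two diagnostics named above, condition indices and variance decomposition proportions, can be sketched as follows. This is a minimal illustration in the style of the standard Belsley-type procedure, applied to a plain design matrix; the abstract does not say how the article adapts the diagnostics to logistic regression (for example through a weighted design matrix), so that part is left out. The function name `collinearity_diagnostics` and the simulated data are hypothetical.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Return (condition indices, variance decomposition proportions) for X."""
    X = np.asarray(X, dtype=float)
    # Scale each column to unit Euclidean length (no centering).
    Xs = X / np.linalg.norm(X, axis=0)
    _, d, Vt = np.linalg.svd(Xs, full_matrices=False)
    # Condition index of component k: largest singular value over d_k.
    cond_idx = d.max() / d
    # phi[j, k] = v_jk^2 / d_k^2; each row j is then normalized to sum to 1.
    phi = (Vt.T ** 2) / d ** 2
    props = phi / phi.sum(axis=1, keepdims=True)
    return cond_idx, props

# Hypothetical data: x2 is almost a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.001 * rng.normal(size=200)
x3 = rng.normal(size=200)
cond_idx, props = collinearity_diagnostics(np.column_stack([x1, x2, x3]))
```

A common rule of thumb flags a component when its condition index is large (values above roughly 30 are often cited) and two or more coefficients have a high variance proportion on it. In the simulated data the first two columns load almost entirely on the component with the largest condition index.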

10.
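The article combines correlation analysis with directed cluster analysis and proposes a directed correlation clustering method. Variables are first merged according to their correlation and then clustered by directed clustering, which gives more reasonable results and a simpler clustering process. Applied to survey data on the factors influencing the healthy development of university students, the method yields more reasonable results.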

11.
Rong Zhu, Xinyu Zhang. Statistics, 2018, 52(1): 205-227
The theories and applications of model averaging have been developed comprehensively in the past two decades. In this paper, we consider model averaging for multivariate multiple regression models. In order to make full use of the correlation information of the dependent variables, we propose a model averaging method based on the Mahalanobis distance, which is related to the correlation of the dependent variables. We prove the asymptotic optimality of the resulting Mahalanobis Mallows model averaging (MMMA) estimators under certain assumptions. In the simulation study, we show that the proposed MMMA estimators compare favourably with model averaging estimators based on AIC and BIC weights and the Mallows model averaging estimators from the single dependent variable regression models. We further apply our method to real data on the urbanization rate and the proportion of non-agricultural population in ethnic minority areas of China.
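Several abstracts in this list rely on the Mahalanobis distance. As a reminder of how it accounts for correlation, here is a minimal sketch of the distance itself. It does not reproduce the MMMA estimator, and the function name `mahalanobis` and the example covariance matrix are hypothetical.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance sqrt((x - y)^T cov^{-1} (x - y))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov @ z = diff rather than forming the explicit inverse.
    z = np.linalg.solve(np.asarray(cov, dtype=float), diff)
    return float(np.sqrt(diff @ z))

# Hypothetical covariance with strong positive correlation between two variables.
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
d_along = mahalanobis([1.0, 1.0], [0.0, 0.0], cov)     # moves with the correlation
d_against = mahalanobis([1.0, -1.0], [0.0, 0.0], cov)  # moves against it
```

Both points are equally far from the origin in Euclidean terms, yet the point that runs against the correlation is much farther in Mahalanobis terms. With an identity covariance matrix the two distances coincide.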

12.
In this work we study a way to explore and extract more information from data sets with a hierarchical tree structure. We propose that any statistical study on this type of data should be made by group, after clustering. In this sense, the most adequate approach is to use the Mahalanobis–Wasserstein distance as a measure of similarity between the cases, to carry out clustering or unsupervised classification. This methodology allows for the clustering of cases, as well as the identification of their profiles, based on the distribution of all the variables that characterises each subject associated with each case. An application to a set of teenagers' interviews regarding their habits of communication is described. The interviewees answered several questions about the kind of contacts they had on their phone, Facebook, email or messenger as well as the frequency of communication between them. The results indicate that the methodology is adequate to cluster this kind of data sets, since it allows us to identify and characterise different profiles from the data. We compare the results obtained with this methodology with the ones obtained using the entire database, and we conclude that they may lead to different findings.

13.
In this article, we consider the performance of the principal component two-parameter estimator in the presence of multicollinearity for a misspecified linear regression model, where the misspecification is due to the omission of some relevant explanatory variables. The conditions of superiority of the principal component two-parameter estimator over some estimators under the Mahalanobis loss function by the average loss criterion are derived. Furthermore, a real data example and a Monte Carlo simulation study are provided to illustrate some of the theoretical results.

14.
In this article, we consider clustering based on principal component analysis (PCA) for high-dimensional mixture models. We present theoretical reasons why PCA is effective for clustering high-dimensional data. First, we derive a geometric representation of high-dimension, low-sample-size (HDLSS) data taken from a two-class mixture model. With the help of the geometric representation, we give geometric consistency properties of sample principal component scores in the HDLSS context. We develop ideas of the geometric representation and provide geometric consistency properties for multiclass mixture models. We show that PCA can cluster HDLSS data under certain conditions in a surprisingly explicit way. Finally, we demonstrate the performance of the clustering using gene expression datasets.

15.
Clustering algorithms are important methods widely used in mining data streams because of their ability to deal with infinite data flows. Although these algorithms perform well at mining latent relationships in data streams, most of them suffer from loss of cluster purity and become unstable when the input data streams have too many noisy variables. In this article, we propose a clustering algorithm to cluster data streams with noisy variables. The simulation results show that our proposed method outperforms those of previous studies by adding a variable selection process as a component of the clustering algorithm. The results of two experiments indicate that clustering data streams with variable selection is more stable and gives better purity than clustering without it. Another experiment on the KDD-CUP99 dataset also shows that our algorithm can generate more stable results.

16.
Birnbaum–Saunders (BS) models are receiving considerable attention in the literature. Multivariate regression models are a useful tool of the multivariate analysis, which takes into account the correlation between variables. Diagnostic analysis is an important aspect to be considered in the statistical modeling. In this paper, we formulate multivariate generalized BS regression models and carry out a diagnostic analysis for these models. We consider the Mahalanobis distance as a global influence measure to detect multivariate outliers and use it for evaluating the adequacy of the distributional assumption. We also consider the local influence approach and study how a perturbation may impact on the estimation of model parameters. We implement the obtained results in the R software, which are illustrated with real-world multivariate data to show their potential applications.

17.
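This paper discusses the failure of classical cluster analysis and ordinary principal component cluster analysis in extreme cases. By defining an objectively weighted principal component distance as the classification statistic, and on the basis of good results in empirical tests, it effectively resolves the problems that principal component cluster analysis cannot reveal in extreme cases.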

18.
We propose several diagnostic methods for checking the adequacy of marginal regression models for analyzing correlated binary data. We use a parametric marginal model based on latent variables and derive the projection (hat) matrix, Cook's distance, various residuals and Mahalanobis distance between the observed binary responses and the estimated probabilities for a cluster. Emphasized are several graphical methods including the simulated Q-Q plot, the half-normal probability plot with a simulated envelope, and the partial residual plot. The methods are illustrated with a real life example.

19.
We consider estimation in a high-dimensional linear model with strongly correlated variables. We propose to cluster the variables first and do subsequent sparse estimation such as the Lasso for cluster-representatives or the group Lasso based on the structure from the clusters. Regarding the first step, we present a novel and bottom-up agglomerative clustering algorithm based on canonical correlations, and we show that it finds an optimal solution and is statistically consistent. We also present some theoretical arguments that canonical correlation based clustering leads to a better-posed compatibility constant for the design matrix which ensures identifiability and an oracle inequality for the group Lasso. Furthermore, we discuss circumstances where cluster-representatives and using the Lasso as subsequent estimator leads to improved results for prediction and detection of variables. We complement the theoretical analysis with various empirical results.
