Similar Documents
20 similar documents retrieved.
1.
Using breast cancer survey data from Wisconsin, USA, as an example, this paper applies the SIS and TCS algorithms to reduce the dimension of the high-dimensional data and then fits an improved logistic generalized linear model to the reduced set of variables. Compared with the traditional general linear model and the ordinary logistic generalized linear model, the results show that the logistic generalized linear model built on algorithmically reduced variables has a smaller prediction error, and that the generalized linear model based on TCS dimension reduction clearly outperforms the one based on SIS dimension reduction.
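As a rough illustration of the screening-then-fit strategy described in this abstract, the sketch below applies sure independence screening (marginal-correlation ranking) followed by an ordinary logistic regression on scikit-learn's copy of the Wisconsin breast cancer data; the TCS algorithm and the paper's improved logistic GLM are not reproduced, and the number of retained predictors is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# scikit-learn's copy of the Wisconsin breast cancer data.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# SIS step: rank predictors by absolute marginal correlation with the response
# and keep the top d of them (d chosen ad hoc here).
d = 10
corr = np.abs([np.corrcoef(X_tr[:, j], y_tr)[0, 1] for j in range(X_tr.shape[1])])
keep = np.argsort(corr)[::-1][:d]

# Fit step: logistic GLM on the screened predictors only.
clf = LogisticRegression(max_iter=5000).fit(X_tr[:, keep], y_tr)
print("screened predictors:", keep)
print("test accuracy:", clf.score(X_te[:, keep], y_te))
```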

2.
In natural language processing, representing unstructured text as structured data is the foundation of all text processing, and the quality of the text representation directly affects downstream performance. This paper proposes a new structured text representation model, the structured tensor space model, which represents a text hierarchically according to its own levels of meaning and, compared with traditional text representation models, captures the structural information of the text more fully. Text classification based on the structured tensor space model is then studied; experimental results show that, with small training samples, classifiers combined with the structured tensor space model perform better.

3.
Statistical modelling in high-dimensional spaces usually runs into the "curse of dimensionality". One remedy is dimension reduction, and sufficient dimension reduction is an effective approach. For the dimension reduction subspace of a multivariate response, this paper proposes a class of moment generating function estimators together with an improved estimator, and establishes their large-sample properties: consistency and asymptotic normality. Simulations and a real-data analysis show that the improved estimator achieves a substantially better estimation performance.

4.
Using the idea of sufficient dimension reduction, this paper improves the BinomialBoosting (BBoosting) algorithm for classification and proposes a new method, Dimension Reduction BinomialBoosting (DRBBoosting). In each iteration, the algorithm uses sufficient dimension reduction to extract the information between X and Y, obtains a linear combination βᵀX, and runs the boosting iteration on βᵀX, thereby avoiding BBoosting's variable-by-variable analysis. Compared with BBoosting, it converges faster and predicts more accurately; simulation comparisons also confirm the advantages of DRBBoosting.
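A minimal sketch of the reduce-then-boost idea: a single linear direction β is estimated first (here by a plain logistic regression, standing in for a sufficient dimension reduction estimator), and boosting is then run on the one-dimensional projection βᵀX rather than on all predictors. This is not the DRBBoosting algorithm itself; the data and settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=50, n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Step 1: estimate a single index direction beta (a stand-in for a sufficient
# dimension reduction step); here the coefficients of a logistic fit are used.
beta = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).coef_.ravel()

# Step 2: boost on the projection beta'X instead of on all 50 predictors.
z_tr = (X_tr @ beta).reshape(-1, 1)
z_te = (X_te @ beta).reshape(-1, 1)
booster = GradientBoostingClassifier(n_estimators=200).fit(z_tr, y_tr)
print("test accuracy on the projected data:", booster.score(z_te, y_te))
```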

5.
Big data are characterized by heterogeneous sources, high dimensionality and sparsity; mining the heterogeneity and commonality across data sets while reducing dimension and removing noise is one of the goals and challenges of big data analysis. Integrative analysis analyses multiple independent data sets simultaneously, avoiding the model instability caused by sample differences due to region, time and other factors, and is an effective way to study heterogeneity in big data. Its key feature is to treat the coefficients of each explanatory variable across all data sets as one group and to shrink these coefficient groups through a penalty function, thereby studying the associations among variables and achieving dimension reduction. This paper reviews the principles, algorithms and current research on penalized integrative analysis from three angles: integrative analysis of homogeneous data, integrative analysis of heterogeneous data, and integrative analysis incorporating network structure. Simulations show that under weak, moderate and strong correlation, Group Bridge, Group MCP and Composite MCP all perform well, with Group Bridge giving the lowest and most stable number of false positives. Finally, integrative analysis is applied to New Rural Cooperative Medical Scheme household medical expenditure data with source heterogeneity, and to cancer gene data with typical big data features such as ultra-high dimensionality and small sample size, yielding some meaningful conclusions.
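The grouping idea (one group per explanatory variable across all data sets) can be sketched with a convex Group Lasso penalty fitted by proximal gradient descent, as below; the non-convex Group Bridge, Group MCP and Composite MCP penalties compared in the abstract are not implemented here, and all data, tuning and iteration choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, p, n = 3, 20, 100                                   # M data sets, p shared predictors
beta_true = np.zeros((p, M))
beta_true[:4, :] = rng.normal(1.0, 0.3, size=(4, M))   # first 4 variables active in every data set
Xs = [rng.normal(size=(n, p)) for _ in range(M)]
ys = [Xs[m] @ beta_true[:, m] + rng.normal(size=n) for m in range(M)]

def group_lasso_integrative(Xs, ys, lam, n_iter=500):
    """Proximal gradient for sum_m (1/2n_m)||y_m - X_m b_m||^2 + lam * sum_j ||B[j, :]||_2,
    where column m of B holds the coefficients for data set m."""
    p, M = Xs[0].shape[1], len(Xs)
    B = np.zeros((p, M))
    # Step size from the largest per-data-set Lipschitz constant of the smooth part.
    L = max(np.linalg.eigvalsh(X.T @ X / X.shape[0]).max() for X in Xs)
    t = 1.0 / L
    for _ in range(n_iter):
        G = np.column_stack([
            -(Xs[m].T @ (ys[m] - Xs[m] @ B[:, m])) / Xs[m].shape[0] for m in range(M)
        ])
        Z = B - t * G
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        shrink = np.clip(1.0 - t * lam / np.maximum(norms, 1e-12), 0.0, None)
        B = shrink * Z                                  # row-wise group soft-thresholding
    return B

B_hat = group_lasso_integrative(Xs, ys, lam=0.3)
print("selected variables:", np.where(np.linalg.norm(B_hat, axis=1) > 1e-8)[0])
```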

6.
Building on a rough-set description of multi-way cross-classified data, this paper proposes a method that uses an association information coefficient matrix to measure the association among multiple categorical variables. The study shows that the association information coefficient matrix can reveal the association structure among multiple variables more effectively.

7.
A comparison of random forests and Bagging classification trees for classification   Cited by: 1 (self-citations: 0, other citations: 1)
Using experimental data, this paper explains from two theoretical perspectives why the random forest algorithm outperforms the Bagging classification tree algorithm. Expressing the two algorithms in two different frameworks removes some of the ambiguity in existing analyses of them. The second framework, in particular, makes it clear that random forests outperform Bagging classification trees because random forests correspond to a smaller bias.
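A quick empirical comparison of the two ensembles on synthetic data with scikit-learn; it only illustrates the accuracy comparison, not the paper's two theoretical frameworks or its bias argument.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=30, n_informative=10, random_state=0)

# Random forest: bootstrap samples plus a random feature subset at every split.
rf = RandomForestClassifier(n_estimators=300, random_state=0)

# Bagged classification trees: bootstrap samples only (the default base estimator
# of BaggingClassifier is a decision tree), all features considered at every split.
bag = BaggingClassifier(n_estimators=300, random_state=0)

print("random forest CV accuracy:", cross_val_score(rf, X, y, cv=5).mean().round(3))
print("bagged trees  CV accuracy:", cross_val_score(bag, X, y, cv=5).mean().round(3))
```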

8.
张宸  韩夏 《统计与决策》2017,(14):45-48
Online public-opinion data are large in volume, fast moving and unstructured, which makes fast and accurate classification difficult. The SVM and naive Bayes algorithms are both well-performing traditional classifiers, but neither can process massive data quickly enough. Exploiting the Hadoop platform's ability to process distributed data storage in parallel, this paper proposes the HSVM_WNB classification algorithm: collected public-opinion documents are stored locally under the HDFS architecture, and classification is carried out in parallel through MapReduce processes. Experiments verify that the algorithm effectively improves both the classification capability and the classification efficiency for online public opinion.
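A single-machine sketch of pairing an SVM with a naive Bayes text classifier on TF-IDF features; the HDFS storage, the MapReduce parallelization and the exact HSVM_WNB weighting scheme are not reproduced, and the probability-averaging combination shown here is only an assumed, illustrative rule (the 20 newsgroups corpus stands in for public-opinion documents).

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Any labelled document collection works; 20 newsgroups stands in for opinion documents.
cats = ["sci.space", "rec.autos", "talk.politics.misc"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

vec = TfidfVectorizer(max_features=20000)
X_tr, X_te = vec.fit_transform(train.data), vec.transform(test.data)

svm = SVC(kernel="linear", probability=True).fit(X_tr, train.target)
nb = MultinomialNB().fit(X_tr, train.target)

# Combine the two classifiers by averaging their class probabilities.
proba = (svm.predict_proba(X_te) + nb.predict_proba(X_te)) / 2
pred = proba.argmax(axis=1)
print("combined accuracy:", accuracy_score(test.target, pred))
```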

9.
Stock price prediction based on a SOM network, principal components and a BP network   Cited by: 5 (self-citations: 1, other citations: 4)
This paper proposes a SOM network-principal component-BP network model for real-time prediction of stock closing prices. A SOM neural network first partitions the heterogeneous samples into distinct subclasses; principal component analysis then reduces the dimension of the many variables affecting the target data; on this basis, a BP neural network prediction model for the closing price is built, which greatly improves prediction accuracy and efficiency. Tests on collected stock market data demonstrate the effectiveness of the proposed method.
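A scikit-learn-only sketch of the cluster, reduce, then predict pipeline: KMeans stands in for the SOM clustering step and an MLPRegressor for the BP network, with synthetic data in place of stock prices, so this shows only the wiring of the three stages.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))                           # e.g. lagged prices / technical indicators
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=500)    # stand-in for the closing price

# Stage 1: split the samples into subclasses (a SOM in the paper, KMeans here).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Stages 2-3: per subclass, reduce with PCA and fit a BP-style neural network.
models = {}
for k in np.unique(labels):
    idx = labels == k
    models[k] = make_pipeline(
        StandardScaler(),
        PCA(n_components=5),
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000, random_state=0),
    ).fit(X[idx], y[idx])
    print(f"subclass {k}: n={idx.sum()}, in-sample R^2={models[k].score(X[idx], y[idx]):.3f}")
```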

10.
Economic regions are usually classified with methods such as principal component analysis. In studying the resource position of economic regions, this paper introduces a clustering algorithm and the support vector machine (SVM) algorithm into this field. The feasibility of applying them to economic region classification is explained from the algorithms' principles; the classification results obtained with the two algorithms are then compared, and the advantages and disadvantages of using clustering and SVM for economic region classification are analysed.

11.
In high-dimensional classification problems, the two-stage method of first reducing the dimension of the predictor and then applying a classification method is a natural solution and has been widely used in many fields. The consistency of the two-stage method is an important issue, since errors induced by the dimension reduction step inevitably affect the subsequent classification method. As an effective method for classification problems, boosting has been widely used in practice. In this paper, we study the consistency of a two-stage method, the dimension-reduction-based boosting algorithm (DRB), for classification. Theoretical results show that a Lipschitz condition on the base learner is required to guarantee the consistency of DRB. These theoretical findings provide a useful guideline for applications.

12.
A Bayesian multi-category kernel classification method is proposed. The algorithm performs the classification of the projections of the data to the principal axes of the feature space. The advantage of this approach is that the regression coefficients are identifiable and sparse, leading to large computational savings and improved classification performance. The degree of sparsity is regulated in a novel framework based on Bayesian decision theory. The Gibbs sampler is implemented to find the posterior distributions of the parameters, thus probability distributions of prediction can be obtained for new data points, which gives a more complete picture of classification. The algorithm is aimed at high dimensional data sets where the dimension of measurements exceeds the number of observations. The applications considered in this paper are microarray, image processing and near-infrared spectroscopy data.

13.
A composite endpoint consists of multiple endpoints combined in one outcome. It is frequently used as the primary endpoint in randomized clinical trials. There are two main disadvantages associated with the use of composite endpoints: a) in conventional analyses, all components are treated as equally important; and b) in time-to-event analyses, the first event considered may not be the most important component. Recently Pocock et al. (2012) introduced the win ratio method to address these disadvantages. This method has two alternative approaches: the matched pair approach and the unmatched pair approach. In the unmatched pair approach, the confidence interval is constructed based on bootstrap resampling, and the hypothesis testing is based on the non-parametric method by Finkelstein and Schoenfeld (1999). Luo et al. (2015) developed a closed-form variance estimator of the win ratio for the unmatched pair approach, based on a composite endpoint with two components and a specific algorithm determining winners, losers and ties. We extend the unmatched pair approach to provide a generalized analytical solution to both hypothesis testing and confidence interval construction for the win ratio, based on its logarithmic asymptotic distribution. This asymptotic distribution is derived via U-statistics following Wei and Johnson (1985). We perform simulations assessing the confidence intervals constructed based on our approach versus those per the bootstrap resampling and per Luo et al. We have also applied our approach to a liver transplant Phase III study. This application and the simulation studies show that the win ratio can be a better statistical measure than the odds ratio when the importance order among components matters, and that the methods per our approach and per Luo et al., although derived from large-sample theory, are not limited to large samples but also work well for relatively small sample sizes. Different from Pocock et al. and Luo et al., our approach is a generalized analytical method, which is valid for any algorithm determining winners, losers and ties.
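A small numerical illustration of the unmatched-pairs win ratio for a two-component composite in which the first component has priority; the interval below uses a basic bootstrap rather than the asymptotic U-statistic variance derived in the paper, and the data and the win/loss rule are purely illustrative (no censoring is handled).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-component composite per subject (larger is better, component 0 has priority),
# e.g. time free of the fatal event, then time free of hospitalization.
trt = np.column_stack([rng.exponential(12, 80), rng.exponential(8, 80)])
ctl = np.column_stack([rng.exponential(10, 90), rng.exponential(8, 90)])

def win_ratio(trt, ctl):
    """Unmatched-pairs win ratio: compare every treatment subject with every control."""
    a0, a1 = trt[:, 0][:, None], trt[:, 1][:, None]
    b0, b1 = ctl[:, 0][None, :], ctl[:, 1][None, :]
    wins = (a0 > b0) | ((a0 == b0) & (a1 > b1))
    losses = (a0 < b0) | ((a0 == b0) & (a1 < b1))
    return wins.sum() / losses.sum()          # remaining pairs are ties

wr = win_ratio(trt, ctl)

# Basic bootstrap CI on the log scale (resample subjects within each arm).
logs = [np.log(win_ratio(trt[rng.integers(0, len(trt), len(trt))],
                         ctl[rng.integers(0, len(ctl), len(ctl))]))
        for _ in range(1000)]
lo, hi = np.exp(np.percentile(logs, [2.5, 97.5]))
print(f"win ratio = {wr:.2f}, 95% bootstrap CI ({lo:.2f}, {hi:.2f})")
```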

14.
Parameters of a finite mixture model are often estimated by the expectation-maximization (EM) algorithm, which maximizes the observed-data log-likelihood function. This paper proposes an alternative approach for fitting finite mixture models. Our method, called iterative Monte Carlo classification (IMCC), is also an iterative fitting procedure. Within each iteration, it first estimates the membership probabilities for each data point, namely the conditional probability of the data point belonging to a particular mixing component given its observed value; it then classifies each data point into a component distribution using the estimated conditional probabilities and the Monte Carlo method; it finally updates the parameters of each component distribution based on the classified data. Simulation studies were conducted to compare IMCC with some other algorithms for fitting mixture normal and mixture t densities.
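A compact sketch of the IMCC idea for a two-component normal mixture: membership probabilities are computed as in an EM E-step, a component label is then drawn for each point from those probabilities (the Monte Carlo classification step), and each component is re-estimated from its classified points. The starting values, the fixed iteration count and the two-component restriction are simplifications.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Simulated two-component normal mixture.
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 200)])

# Crude starting values.
pi, mu, sd = np.array([0.5, 0.5]), np.array([x.min(), x.max()]), np.array([1.0, 1.0])

for it in range(200):
    # Step 1: membership probabilities (the same quantity as an EM E-step).
    dens = pi * norm.pdf(x[:, None], mu, sd)              # n x 2
    prob = dens / dens.sum(axis=1, keepdims=True)
    # Step 2: Monte Carlo classification - draw a component label for each point.
    z = (rng.random(len(x)) > prob[:, 0]).astype(int)
    # Step 3: update each component from its classified points.
    for k in (0, 1):
        xk = x[z == k]
        if len(xk) > 1:                                   # guard against an empty component
            pi[k], mu[k], sd[k] = len(xk) / len(x), xk.mean(), xk.std(ddof=1)

print("weights:", pi.round(2), "means:", mu.round(2), "sds:", sd.round(2))
```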

15.
In this paper we present a perspective on the overall process of developing classifiers for real-world classification problems. Specifically, we identify, categorize and discuss the various problem-specific factors that influence the development process. Illustrative examples are provided to demonstrate the iterative nature of the process of applying classification algorithms in practice. In addition, we present a case study of a large-scale classification application using the process framework described, providing an end-to-end example of the iterative nature of the application process. The paper concludes that the process of developing classification applications for operational use involves many factors not normally considered in typical discussions of classification models and algorithms.

16.
There are a variety of methods in the literature which seek to make iterative estimation algorithms more manageable by breaking the iterations into a greater number of simpler or faster steps. Those algorithms which deal at each step with a proper subset of the parameters are called in this paper partitioned algorithms. Partitioned algorithms in effect replace the original estimation problem with a series of problems of lower dimension. The purpose of the paper is to characterize some of the circumstances under which this process of dimension reduction leads to significant benefits. Four types of partitioned algorithms are distinguished: reduced objective function methods, nested (partial Gauss-Seidel) iterations, zigzag (full Gauss-Seidel) iterations, and leapfrog (non-simultaneous) iterations. Emphasis is given to Newton-type methods using analytic derivatives, but a nested EM algorithm is also given. Nested Newton methods are shown to be equivalent to applying the same Newton method to the reduced objective function, and are applied to separable regression and generalized linear models. Nesting is shown generally to improve the convergence of Newton-type methods, both by improving the quadratic approximation to the log-likelihood and by improving the accuracy with which the observed information matrix can be approximated. Nesting is recommended whenever a subset of parameters is relatively easily estimated. The zigzag method is shown to produce a stable but generally slow iteration; it is fast and recommended when the parameter subsets have approximately uncorrelated estimates. The leapfrog iteration has fewer guaranteed properties in general, but is similar to nesting and zigzagging when the parameter subsets are orthogonal.
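A tiny example of the reduced objective function idea for a separable nonlinear least-squares model y ≈ a·exp(bx): for fixed b the linear parameter a has a closed form, so only the scalar b is optimized over the profiled objective. The model, data and bounds are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = np.linspace(0, 2, 60)
y = 1.7 * np.exp(0.9 * x) + 0.1 * rng.normal(size=x.size)

def a_given_b(b):
    """Closed-form linear parameter for fixed b (least squares of y on exp(b*x))."""
    z = np.exp(b * x)
    return z @ y / (z @ z)

def reduced_objective(b):
    """Residual sum of squares after profiling out a."""
    a = a_given_b(b)
    return np.sum((y - a * np.exp(b * x)) ** 2)

# Only the scalar b is optimized; a is recovered from its closed form afterwards.
res = minimize_scalar(reduced_objective, bounds=(-3, 3), method="bounded")
b_hat = res.x
print(f"a_hat = {a_given_b(b_hat):.3f}, b_hat = {b_hat:.3f}")
```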

17.
Recently, a new ensemble classification method named Canonical Forest (CF) was proposed by Chen et al. [Canonical forest. Comput Stat. 2014;29:849–867]. CF has been shown to give consistently good results on many data sets, comparable to other widely used classification ensemble methods. However, CF requires adopting a feature reduction method before classifying high-dimensional data. Here, we extend CF to a high-dimensional classifier by incorporating a random feature subspace algorithm [Ho TK. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell. 1998;20:832–844]. This extended algorithm is called HDCF (high-dimensional CF) as it is specifically designed for high-dimensional data. We conducted an experiment using three data sets (gene imprinting, oestrogen, and leukaemia) to compare the performance of HDCF with several popular and successful classification methods for high-dimensional data sets, including Random Forest [Breiman L. Random forest. Mach Learn. 2001;45:5–32], CERP [Ahn H, et al. Classification by ensembles from random partitions of high-dimensional data. Comput Stat Data Anal. 2007;51:6166–6179], and support vector machines [Vapnik V. The nature of statistical learning theory. New York: Springer; 1995]. Besides classification accuracy, we also investigated the balance between sensitivity and specificity for all four classification methods.
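A bare-bones random subspace ensemble in the spirit of Ho (1998), built directly from scikit-learn decision trees to show the feature-subsampling mechanism that HDCF adds on top of Canonical Forest; it is not an implementation of CF or HDCF, and the data and subspace size are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# High-dimensional toy data: many more features than informative ones.
X, y = make_classification(n_samples=300, n_features=500, n_informative=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

n_trees, subspace = 100, 25                      # each tree sees a random 25-feature subspace
ensemble = []
for _ in range(n_trees):
    feats = rng.choice(X.shape[1], size=subspace, replace=False)
    tree = DecisionTreeClassifier().fit(X_tr[:, feats], y_tr)
    ensemble.append((feats, tree))

# Majority vote over the subspace-specific trees (labels are 0/1 here).
votes = np.array([tree.predict(X_te[:, feats]) for feats, tree in ensemble])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print("random subspace test accuracy:", (pred == y_te).mean())
```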

18.
The Buckley–James estimator (BJE) [J. Buckley and I. James, Linear regression with censored data, Biometrika 66 (1979), pp. 429–436] has been extended from right-censored (RC) data to interval-censored (IC) data by Rabinowitz et al. [D. Rabinowitz, A. Tsiatis, and J. Aragon, Regression with interval-censored data, Biometrika 82 (1995), pp. 501–513]. The BJE is defined to be a zero-crossing of a modified score function H(b), a point at which H(·) changes its sign. We discuss several approaches (for finding a BJE with IC data) which are extensions of the existing algorithms for RC data. However, these extensions may not be appropriate for some data, in particular, they are not appropriate for a cancer data set that we are analysing. In this note, we present a feasible iterative algorithm for obtaining a BJE. We apply the method to our data.

19.
Reshef et al. (Science 334:1518–1523, 2011) introduce the maximal information coefficient, or MIC, which captures a wide range of relationships between pairs of variables. We derive a useful property which can be employed either to substantially reduce the computing time needed to determine MIC, or to obtain a series of MIC values for different resolutions. Through studying the dependence of the MIC scores on the maximal resolution employed to partition the data, we show that relationships of different natures can be discerned more clearly. We also provide an iterative greedy algorithm, as an alternative to the ApproxMaxMI algorithm proposed by Reshef et al., to determine the value of MIC through iterative optimization, which can be conducted in parallel.
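A crude grid-based sketch of the quantity behind MIC: for each resolution (nx, ny) the mutual information of the binned data is normalized by log(min(nx, ny)), and the maximum over resolutions is reported. Real MIC implementations optimize the grid placement (e.g. ApproxMaxMI) and bound the total number of cells; equal-frequency bins and a small resolution cap are used here only to illustrate the resolution dependence discussed in the abstract.

```python
import numpy as np

def grid_mi(x, y, nx, ny):
    """Mutual information (in nats) of x, y binned on an nx-by-ny equal-frequency grid."""
    xe = np.quantile(x, np.linspace(0, 1, nx + 1))
    ye = np.quantile(y, np.linspace(0, 1, ny + 1))
    counts, _, _ = np.histogram2d(x, y, bins=[xe, ye])
    p = counts / counts.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz]))

def crude_mic(x, y, max_res=8):
    # Maximize normalized MI over a small range of grid resolutions.
    best = 0.0
    for nx in range(2, max_res + 1):
        for ny in range(2, max_res + 1):
            best = max(best, grid_mi(x, y, nx, ny) / np.log(min(nx, ny)))
    return best

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
print("linear   :", round(crude_mic(x, 2 * x + 0.1 * rng.normal(size=1000)), 2))
print("parabolic:", round(crude_mic(x, x**2 + 0.1 * rng.normal(size=1000)), 2))
print("noise    :", round(crude_mic(x, rng.normal(size=1000)), 2))
```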

20.
In most applications, the parameters of a mixture of linear regression models are estimated by maximum likelihood using the expectation maximization (EM) algorithm. In this article, we propose the comparison of three algorithms to compute maximum likelihood estimates of the parameters of these models: the EM algorithm, the classification EM algorithm and the stochastic EM algorithm. The comparison of the three procedures was done through a simulation study of the performance (computational effort, statistical properties of estimators and goodness of fit) of these approaches on simulated data sets.

Simulation results show that the choice of the approach depends essentially on the configuration of the true regression lines and the initialization of the algorithms.
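A compact EM sketch for a two-component mixture of linear regressions, the first of the three algorithms compared above; the classification EM and stochastic EM variants differ only in how the E-step responsibilities are used (hard assignment or a random draw). Starting values and the fixed iteration count are simplifications.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
# Two regression lines, mixed 50/50.
x = rng.uniform(0, 10, 400)
z = rng.random(400) < 0.5
y = np.where(z, 1.0 + 2.0 * x, 8.0 - 1.0 * x) + rng.normal(0, 1.0, 400)
X = np.column_stack([np.ones_like(x), x])

# Starting values.
pi = np.array([0.5, 0.5])
beta = np.array([[0.0, 1.0], [5.0, -0.5]])        # one (intercept, slope) row per component
sigma = np.array([1.0, 1.0])

for it in range(100):
    # E-step: responsibility of each component for each observation.
    dens = np.column_stack([pi[k] * norm.pdf(y, X @ beta[k], sigma[k]) for k in (0, 1)])
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted least squares and weighted residual scale per component.
    for k in (0, 1):
        w = r[:, k]
        W = X * w[:, None]
        beta[k] = np.linalg.solve(X.T @ W, W.T @ y)
        resid = y - X @ beta[k]
        sigma[k] = np.sqrt((w * resid**2).sum() / w.sum())
        pi[k] = w.mean()

print("mixing proportions:", pi.round(2))
print("component 0 (intercept, slope):", beta[0].round(2))
print("component 1 (intercept, slope):", beta[1].round(2))
```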
