Similar documents (20 results)
1.
The support vector machine (SVM) is sparse in that its classifier is expressed as a linear combination of only a few support vectors (SVs). Whenever an outlier is included as an SV in the classifier, it may have a serious impact on the estimated decision function. In this article, we propose a robust loss function that is convex. Our learning algorithm is more robust to outliers than the SVM. Moreover, the convexity of our loss function permits an efficient solution-path algorithm. Through simulated and real data analyses, we illustrate that our method can be useful in the presence of labeling errors.
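For concreteness, one well-known example of a convex loss whose bounded derivative makes it less sensitive to outliers than the plain hinge is the huberized hinge loss shown below; this is offered only as an illustration of the idea and is not necessarily the loss proposed in the article (the bend parameter δ > 0 is a generic tuning constant):

```latex
\ell_\delta(t) =
\begin{cases}
0, & t \ge 1,\\[2pt]
\dfrac{(1-t)^2}{2\delta}, & 1-\delta \le t < 1,\\[4pt]
1 - t - \dfrac{\delta}{2}, & t < 1-\delta,
\end{cases}
\qquad t = y\,f(x).
```

The quadratic zone keeps the loss differentiable, and the linear tail bounds the influence of any single misclassified point, while convexity is preserved so that path-following algorithms remain applicable.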

2.
In this paper, we consider the classification of high-dimensional vectors based on a small number of training samples from each class. The proposed method follows the Bayesian paradigm, and it is based on a small vector which can be viewed as the regression of the new observation on the space spanned by the training samples. The classification method provides posterior probabilities that the new vector belongs to each of the classes, hence it adapts naturally to any number of classes. Furthermore, we show a direct similarity between the proposed method and the multicategory linear support vector machine introduced in Lee et al. [2004. Multicategory support vector machines: theory and applications to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99 (465), 67–81]. We compare the performance of the technique proposed in this paper with the SVM classifier using real-life military and microarray datasets. The study shows that the misclassification errors of both methods are very similar, and that the posterior probabilities assigned to each class are fairly accurate.

3.
One of the major issues in the medical field is correct diagnosis, given the limits of human expertise in diagnosing disease manually. Nowadays, the use of machine learning classifiers, such as support vector machines (SVM), in medical diagnosis is increasing gradually. However, traditional classification algorithms can be limited in their performance when applied to highly imbalanced data sets, in which negative examples (i.e. negative for a disease) outnumber the positive examples (i.e. positive for a disease). The SVM constitutes a significant improvement, and its mathematical formulation allows the incorporation of different weights so as to deal with the problem of imbalanced data. In the present work an extensive study of four medical data sets is conducted using a variant of the SVM, the proximal support vector machine (PSVM) proposed by Fung and Mangasarian [9]. Additionally, in order to deal with the imbalanced nature of the medical data sets, we applied both a variant of the SVM, referred to as the two-cost support vector machine, and a modification of the PSVM, referred to as the modified PSVM. Both algorithms incorporate different weights, one for each class of examples.
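As a rough illustration of the class-weighting idea described above (not the exact two-cost SVM or modified PSVM of the paper), scikit-learn's SVC accepts per-class misclassification costs; the synthetic data set and the weights below are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for an imbalanced medical data set: 90% negatives, 10% positives.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# A larger penalty for misclassifying the minority (positive) class.
clf = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 9.0}).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```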

4.
In recent years, applications of the support vector machine (SVM) to classification and regression problems have been increasing, owing to its high performance and its ability to transform non-linear relationships among variables into linear form through the kernel idea (kernel functions). In this work, we develop a semi-parametric approach that fits single-index models to deal with high-dimensional problems. To achieve this goal, we use support vector regression (SVR) to estimate the unknown nonparametric link function, while the single index is determined by the semi-parametric least squares method (Ichimura 1993). This development enhances the ability of SVR to solve high-dimensional problems. We design three simulation examples with high-dimensional problems (linear and nonlinear). The simulations demonstrate the superior performance of the proposed method over the standard SVR method. This is further illustrated by an application to real data.
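The toy sketch below conveys the flavour of combining an index estimate with SVR: the index direction is chosen by a crude profile search over the in-sample fit, which is only a stand-in for the semi-parametric least squares estimator used in the paper; the data, kernel, and optimizer are arbitrary choices for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
theta_true = np.ones(p) / np.sqrt(p)
y = np.sin(X @ theta_true) + 0.1 * rng.normal(size=n)

def profile_loss(theta, X, y):
    # For a candidate index direction, fit SVR on the one-dimensional index
    # and return the in-sample residual sum of squares.
    theta = theta / np.linalg.norm(theta)
    u = (X @ theta).reshape(-1, 1)
    fit = SVR(kernel="rbf", C=1.0, epsilon=0.01).fit(u, y)
    return float(np.sum((y - fit.predict(u)) ** 2))

res = minimize(profile_loss, x0=np.ones(p), args=(X, y), method="Nelder-Mead",
               options={"maxiter": 300})
theta_hat = res.x / np.linalg.norm(res.x)
print(np.round(theta_hat, 2))   # should point roughly in the direction of theta_true
```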

5.
In this work we present a study on the analysis of a large data set from seismology. A set of different large-margin classifiers based on the well-known support vector machine (SVM) algorithm is used to classify the data into two classes according to their magnitude on the Richter scale. Because of the imbalance between the two classes, reweighing techniques are used to show the importance of reweighing algorithms. Moreover, we present an incremental algorithm to explore the possibility of predicting the strength of an earthquake with incremental techniques.
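A minimal sketch of the two ingredients mentioned here, class reweighing and incremental updating, using scikit-learn's SGDClassifier with a hinge loss as a linear-SVM stand-in; the streamed chunks, weights, and threshold are invented for illustration and are not the paper's classifiers or data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# Hinge loss gives a linear SVM-type classifier; the rare "strong event" class gets a larger weight.
clf = SGDClassifier(loss="hinge", class_weight={0: 1.0, 1: 10.0}, random_state=0)

classes = np.array([0, 1])
for _ in range(50):                               # data arriving in chunks, e.g. new seismic records
    Xb = rng.normal(size=(200, 8))
    yb = (Xb[:, 0] + 0.3 * rng.normal(size=200) > 1.5).astype(int)
    clf.partial_fit(Xb, yb, classes=classes)      # incremental (online) update

print(clf.coef_.round(2))
```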

6.
The support vector machine (SVM) has been successfully applied in various classification areas with great flexibility and a high level of classification accuracy. However, the SVM is not suitable for the classification of large or imbalanced datasets because of significant computational problems and a classification bias toward the dominant class. The SVM combined with k-means clustering (KM-SVM) is a fast algorithm developed to accelerate both the training and the prediction of SVM classifiers by using the cluster centers obtained from k-means clustering. In the KM-SVM algorithm, however, the penalty for misclassification is treated equally for each cluster center even though the contributions of different cluster centers to the classification can differ. In order to improve classification accuracy, we propose the WKM-SVM algorithm, which imposes different penalties for the misclassification of cluster centers by using the number of data points within each cluster as a weight. As an extension of WKM-SVM, a recovery process based on WKM-SVM is suggested to incorporate the information near the optimal boundary. Furthermore, the proposed WKM-SVM can be successfully applied to imbalanced datasets with an appropriate weighting strategy. Experiments show the effectiveness of our proposed methods.
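A rough sketch of the weighted-cluster-center idea (cluster each class with k-means, train an SVM on the centers, and weight each center by its cluster size); the data, kernel, and number of clusters per class are placeholders, and the recovery step near the boundary is omitted:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=10, weights=[0.9, 0.1], random_state=0)

centers, labels, weights = [], [], []
for cls, k in [(0, 30), (1, 30)]:                 # clusters per class: a tuning choice
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[y == cls])
    counts = np.bincount(km.labels_, minlength=k)  # cluster sizes become weights
    centers.append(km.cluster_centers_)
    labels.append(np.full(k, cls))
    weights.append(counts)

Xc, yc, wc = np.vstack(centers), np.concatenate(labels), np.concatenate(weights)

# SVM trained on the (few) cluster centers, with size-proportional misclassification penalties.
clf = SVC(kernel="rbf", C=1.0).fit(Xc, yc, sample_weight=wc)
print(clf.score(X, y))
```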

7.
Statistical process control tools have been used routinely to improve process capabilities through reliable on-line monitoring and diagnostics. In the present paper, we propose a novel multivariate control chart that integrates a support vector machine (SVM) algorithm, a bootstrap method, and a control chart technique to improve multivariate process monitoring. The proposed chart uses as its monitoring statistic the predicted probability of class (PoC) values from an SVM algorithm. The control limits of SVM-PoC charts are obtained by a bootstrap approach. A simulation study was conducted to evaluate the performance of the proposed SVM-PoC chart and to compare it with other data-mining-based control charts and Hotelling's T^2 control chart under various scenarios. The results showed that the proposed SVM-PoC charts outperformed the other multivariate control charts in nonnormal situations. Further, we developed an exponentially weighted moving average version of the SVM-PoC chart to increase sensitivity to small shifts.
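The sketch below is one plausible reading of the PoC-plus-bootstrap construction, not the paper's exact procedure: train an SVM with probability outputs on Phase-I data, take the PoC of the in-control observations, and set the upper control limit from bootstrap resamples of those PoC values; all data and constants are invented:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Phase-I training data: in-control (label 0) and out-of-control (label 1) observations.
X0 = rng.normal(0.0, 1.0, size=(300, 5))
X1 = rng.normal(1.0, 1.0, size=(300, 5))
X = np.vstack([X0, X1])
y = np.r_[np.zeros(300), np.ones(300)]

clf = SVC(kernel="rbf", probability=True).fit(X, y)
poc = clf.predict_proba(X0)[:, 1]                 # PoC of the in-control data

# Bootstrap the upper control limit as a high percentile of resampled PoC values.
B, alpha = 2000, 0.005
ucl = np.mean([np.quantile(rng.choice(poc, size=poc.size, replace=True), 1 - alpha)
               for _ in range(B)])

x_new = rng.normal(0.8, 1.0, size=(1, 5))         # a new observation to monitor
print(clf.predict_proba(x_new)[0, 1] > ucl)       # signal if PoC exceeds the control limit
```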

8.
魏瑾瑞 《统计研究》2015,32(2):90-96
The mixed-kernel approach does not actually solve the kernel-selection problem; it merely converts it into an equivalent problem of choosing weight parameters. At the same time, the method requires parameters to be tuned separately for the two kernels, which greatly increases the complexity of the algorithm and limits the generalization ability of the support vector machine. In fact, tuning a kernel's parameters affects the classification results far more than the choice of kernel type does, so the mixed-kernel approach addresses the lesser issue at a heavier cost. Empirical analysis shows that the proportion of support vectors shared by different kernel functions is very high, indicating a large degree of consistency, so a linear combination of kernels adds little; this is an important reason why the mixed-kernel approach cannot effectively improve classification performance.
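For reference, the mixed kernel discussed here is typically a convex combination of two base kernels. The sketch below builds such a combination (linear plus RBF, with an arbitrary weight w) as a callable Gram-matrix kernel for scikit-learn's SVC and compares the support-vector index sets of the individual kernels, in the spirit of the "common support vectors" observation; data and parameters are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def mixed_kernel(A, B, w=0.5, gamma=0.1):
    # Convex combination of a linear and an RBF kernel (still a valid PSD kernel).
    return w * linear_kernel(A, B) + (1 - w) * rbf_kernel(A, B, gamma=gamma)

fits = {
    "linear": SVC(kernel="linear").fit(X, y),
    "rbf": SVC(kernel="rbf", gamma=0.1).fit(X, y),
    "mixed": SVC(kernel=lambda A, B: mixed_kernel(A, B)).fit(X, y),
}
sv = {name: set(f.support_) for name, f in fits.items()}   # support-vector indices per kernel
print({name: len(s) for name, s in sv.items()}, len(sv["linear"] & sv["rbf"]))
```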

9.
In many scientific investigations, a large number of input variables are given at the early stage of modeling, and identifying the variables predictive of the response is often a main purpose of such investigations. Recently, the support vector machine has become an important tool in classification problems in many fields. Several variants of the support vector machine adopting different penalties in the objective function have been proposed. This paper deals with the Fisher consistency and the oracle property of support vector machines in the setting where the dimension of the inputs is fixed. First, we study the Fisher consistency of the support vector machine over the class of affine functions. It is shown that the function class for decision functions is crucial for Fisher consistency. Second, we study the oracle property of penalized support vector machines with the smoothly clipped absolute deviation (SCAD) penalty. Once the Fisher consistency of the support vector machine over the class of affine functions has been addressed, the oracle property becomes meaningful in the context of classification. A simulation study is provided to show the small-sample properties of penalized support vector machines with the SCAD penalty.
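For reference, the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li, applied to each coefficient of the decision function, has the standard piecewise form below (tuning parameters λ > 0 and a > 2, commonly a = 3.7):

```latex
p_{\lambda}(|\beta|) =
\begin{cases}
\lambda |\beta|, & |\beta| \le \lambda,\\[4pt]
\dfrac{2a\lambda|\beta| - \beta^{2} - \lambda^{2}}{2(a-1)}, & \lambda < |\beta| \le a\lambda,\\[6pt]
\dfrac{(a+1)\lambda^{2}}{2}, & |\beta| > a\lambda.
\end{cases}
```

Unlike the lasso penalty, SCAD levels off for large coefficients, which is what makes the oracle property attainable.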

10.
Identifying homogeneous subsets of predictors in classification can be challenging in the presence of high-dimensional data with highly correlated variables. We propose a new method, called the cluster correlation-network support vector machine (CCNSVM), that simultaneously estimates clusters of predictors relevant for classification and the coefficients of a penalized SVM. The new CCN penalty is a function of the well-known topological overlap matrix, whose entries measure the strength of connectivity between predictors. CCNSVM implements an efficient algorithm that alternates between searching for clusters of predictors and optimizing a penalized SVM loss function using majorization-minimization tricks and a coordinate descent algorithm. Combining clustering and sparsity in a single procedure provides additional insight into the power of exploring dimension-reduction structure in high-dimensional binary classification. Simulation studies are considered to compare the performance of our procedure with that of its competitors. A practical application of CCNSVM to DNA methylation data illustrates its good behaviour.
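For context, a commonly used form of the topological overlap measure (as popularized in weighted gene co-expression network analysis; the paper may use a variant) is, for a symmetric adjacency matrix A = (a_ij) between predictors:

```latex
\mathrm{TOM}_{ij} \;=\; \frac{\ell_{ij} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}},
\qquad
\ell_{ij} = \sum_{u \ne i,j} a_{iu}\,a_{uj},
\qquad
k_i = \sum_{u \ne i} a_{iu}.
```

Two predictors get a high topological overlap when they are strongly connected to the same neighbours, which is what allows the penalty to group correlated variables.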

11.
The support vector machine (SVM), first developed by Vapnik and his group at AT&T Bell Laboratories, is being used as a new technique for regression and classification problems. In this paper we present an approach to estimating prediction intervals for SVM regression based on posterior predictive densities. Furthermore, the method is illustrated with a data example.

12.
We study multiple-class classification problems; both ordinal and categorical labels are discussed. Common approaches to multiple-class classification are built on binary classifiers, with one-versus-one and one-versus-rest being typical choices. When the number of classes is large, these binary-classifier-based methods may suffer either from computational costs or from highly imbalanced sample sizes in the training stage. In order to alleviate the computational burden and the imbalanced-training-data issue in multiple-class classification problems, we propose a method that has competitive performance and retains the ease of model interpretation, which is essential for a prognostic/predictive model.
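To make the two standard reductions concrete, the scikit-learn sketch below builds both decompositions for a synthetic K = 5 class problem: one-versus-rest fits K binary classifiers (each roughly 1:(K-1) imbalanced), while one-versus-one fits K(K-1)/2 classifiers on pairwise subsets. This illustrates the baseline approaches discussed above, not the paper's proposed method:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

ovr = OneVsRestClassifier(LinearSVC(max_iter=5000)).fit(X, y)  # K binary problems, each imbalanced
ovo = OneVsOneClassifier(LinearSVC(max_iter=5000)).fit(X, y)   # K(K-1)/2 problems on pairwise subsets
print(len(ovr.estimators_), len(ovo.estimators_))              # 5 and 10 fitted binary classifiers
```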

13.
This article proposes a discriminant function and an algorithm for analyzing data in situations where the data are positively skewed. The performance of the algorithm based on the proposed discriminant function (LNDF) is compared with the conventional linear discriminant function (LDF) and quadratic discriminant function (QDF), as well as with the nonparametric support vector machine (SVM) and random forest (RF) classifiers, using real and simulated datasets. A maximum reduction of approximately 81% in the error rates compared with the QDF was noted for ten-variate data. The overall results indicate better performance of the proposed discriminant function under certain circumstances.

14.
王鹏  黄迅 《统计研究》2018,35(2):3-13
Using eleven years of 5-minute high-frequency trading data on the CSI 300 index (沪深300指数) as the research sample, this paper first proposes a method, based on multifractal features, for distinguishing the normal and alert states of a financial market, and then introduces a new artificial-intelligence SVM model, the twin SVM (Twin-SVM), to build an early-warning model for financial market risk under multifractal features. The empirical results show that: (1) price fluctuations in China's emerging financial market exhibit significant multifractal features; (2) the normal and alert states defined by the multifractal feature parameters are not only accurate but also statistically significant and practically meaningful; and (3) compared with the traditional SVM and the BP neural network (NN), Twin-SVM is not only significantly more accurate but also clearly more stable in prediction, i.e. Twin-SVM can effectively handle the asymmetric (imbalanced) sample problem that affects other early-warning models.
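For background, the twin SVM underlying this kind of early-warning model presumably follows the standard formulation of Jayadeva, Khemchandani and Chandra (2007), which seeks two nonparallel hyperplanes by solving two smaller quadratic programs. Writing A for the matrix of positive-class samples, B for the negative-class samples, and e_1, e_2 for vectors of ones, the first of the pair is (the second swaps the roles of A and B):

```latex
\min_{w_1,\, b_1,\, \xi}\ \ \tfrac{1}{2}\,\lVert A w_1 + e_1 b_1 \rVert^{2} + c_1\, e_2^{\top}\xi
\quad \text{s.t.}\quad -(B w_1 + e_2 b_1) + \xi \ge e_2,\qquad \xi \ge 0.
```

Each hyperplane is kept close to its own class and pushed away from the other, and a new point is assigned to the class whose hyperplane it is nearer to.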

15.
In this article, a variable selection procedure, called surrogate selection, is proposed which can be applied when a support vector machine or kernel Fisher discriminant analysis is used in a binary classification problem. Surrogate selection applies the lasso after substituting the kernel discriminant scores for the binary group labels, using the values of the input variable observations as predictors. Empirical results are reported, showing that surrogate selection performs well.
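A minimal sketch of the surrogate idea as described: fit a kernel SVM, take its decision scores as a continuous surrogate response, and run the lasso on the original inputs; the data and the use of a cross-validated lasso are illustrative choices rather than the paper's exact procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, n_informative=5, random_state=0)

svm = SVC(kernel="rbf", C=1.0).fit(X, y)
scores = svm.decision_function(X)             # kernel discriminant scores replace the 0/1 labels

lasso = LassoCV(cv=5).fit(X, scores)          # lasso on the surrogate response
selected = np.flatnonzero(lasso.coef_ != 0)   # retained input variables
print(selected)
```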

16.
The main models of machine learning are briefly reviewed and considered for building a classifier to identify the Fragile X Syndrome (FXS). We have analyzed 172 patients potentially affected by FXS in Andalusia (Spain) and, by means of a DNA test, each member of the data set is known to belong to one of two classes: affected, not affected. The whole predictor set, formed by 40 variables, and a reduced set with only nine predictors significantly associated with the response are considered. Four alternative base classification models have been investigated: logistic regression, classification trees, multilayer perceptron and support vector machines. For both predictor sets, the best accuracy, considering both the mean and the standard deviation of the test error rate, is achieved by the support vector machines, confirming the increasing importance of this learning algorithm. Three ensemble methods - bagging, random forests and boosting - were also considered, amongst which the bagged versions of support vector machines stand out, especially when they are constructed with the reduced set of predictor variables. The analysis of the sensitivity, the specificity and the area under the ROC curve agrees with the main conclusions extracted from the accuracy results. All of these models can be fitted by free R programs.

17.
Clustering algorithms based on partitioning by data-distribution density are among the main clustering methods in data mining. To address the drawbacks of traditional density-partition clustering algorithms, such as high computational complexity and low running efficiency, a multiple-partition clustering algorithm with stepwise high-dimensional projection is designed. Based on the high-dimensional projected distribution density, the data set is repeatedly partitioned to produce a space of sub-clusters, and the sub-clusters are then merged to form the desired clustering result. Experiments based on the algorithm show that it is computationally simple and runs efficiently.

18.
The sequential minimal optimization (SMO) algorithm is effective for solving large-scale support vector machines (SVM). Existing algorithms all assume that the kernel is positive definite (PD) or positive semi-definite (PSD) and satisfies the Mercer condition. Some kernels, however, such as the sigmoid kernel, which originates from neural networks and is widely used in SVMs, are only conditionally PD in certain circumstances; in addition, it is often difficult in practice to prove whether a kernel is PD or PSD, except for a few well-known kernels. The applications of the existing SMO algorithms are therefore limited. Considering this deficiency of the traditional algorithms, an algorithm for solving ε-SVR with non-positive semi-definite (non-PSD) kernels is proposed. Unlike the existing algorithms, which must consider four Lagrange multipliers, the algorithm proposed in this article needs to consider only two Lagrange multipliers in its implementation. The proposed algorithm simplifies the implementation by expanding the original dual program of ε-SVR and solving its KKT conditions, and it is thus easily applied to solving ε-SVR with non-PSD kernels. The presented algorithm is evaluated on five benchmark problems and one real-world problem. The results show that ε-SVR with a non-PSD kernel provides more accurate predictions than ε-SVR with a PD kernel.
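For reference, the standard dual program of ε-SVR that SMO operates on is reproduced below, together with the sigmoid kernel mentioned as an example of a kernel that is only conditionally PD; this is textbook material rather than the modified dual derived in the paper:

```latex
\max_{\alpha,\,\alpha^{*}}\ \
-\tfrac{1}{2}\sum_{i,j}(\alpha_i-\alpha_i^{*})(\alpha_j-\alpha_j^{*})\,K(x_i,x_j)
\;-\;\varepsilon\sum_i(\alpha_i+\alpha_i^{*})\;+\;\sum_i y_i(\alpha_i-\alpha_i^{*})
```
```latex
\text{s.t.}\quad \sum_i(\alpha_i-\alpha_i^{*})=0,\qquad 0\le\alpha_i,\alpha_i^{*}\le C,
\qquad K(x,z)=\tanh\!\left(\kappa\, x^{\top}z + c\right).
```

When K is not PSD, the quadratic form above need not be concave, which is exactly why the standard SMO working-set arguments break down and a modified treatment is required.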

19.
In this paper, we consider the estimation of both the parameters and the nonparametric link function in partially linear single‐index models for longitudinal data that may be unbalanced. In particular, a new three‐stage approach is proposed to estimate the nonparametric link function using marginal kernel regression and the parametric components with generalized estimating equations. The resulting estimators properly account for the within‐subject correlation. We show that the parameter estimators are asymptotically semiparametrically efficient. We also show that the asymptotic variance of the link function estimator is minimized when the working error covariance matrices are correctly specified. The new estimators are more efficient than estimators in the existing literature. These asymptotic results are obtained without assuming normality. The finite‐sample performance of the proposed method is demonstrated by simulation studies. In addition, two real‐data examples are analyzed to illustrate the methodology.
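For orientation, a partially linear single-index model for longitudinal data of the kind studied here is typically written as below, with subject i, visit j, an unknown link g, and the usual identifiability constraint on the index direction; the notation is generic rather than taken from the paper:

```latex
y_{ij} = x_{ij}^{\top}\beta + g\!\left(z_{ij}^{\top}\theta\right) + \varepsilon_{ij},
\qquad \lVert\theta\rVert = 1,\qquad i=1,\dots,n,\ \ j=1,\dots,m_i .
```

The linear part carries the parametric covariate effects, while the single index compresses the remaining covariates into one dimension before the nonparametric link is applied.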

20.
This paper proposes an algorithm for the classification of multi-dimensional datasets based on conjugate Bayesian multiple kernel grouping learning (BMKGL). Using a conjugate Bayesian framework improves computational efficiency. Using multiple kernels instead of a single kernel avoids the kernel selection problem, which is itself computationally expensive. Through grouping parameter learning, BMKGL can simultaneously integrate information from different dimensions and find the dimensions that contribute most to the variation in the outcome, for the sake of interpretability. Meanwhile, BMKGL can select the most suitable combination of kernels for different dimensions so as to extract the most appropriate measure for each dimension and improve the accuracy of the classification results. The simulation results illustrate that our learning process has better performance in prediction and stability than some popular classifiers, such as the k-nearest neighbours algorithm, the support vector machine and the naive Bayes classifier. BMKGL also outperforms previous methods in terms of accuracy and interpretation for the heart disease and EEG datasets.
