Similar literature
20 similar articles found.
1.
Summary: This paper describes common features in data sets from motor vehicle insurance companies and proposes a general approach which exploits knowledge of such features in order to model high-dimensional data sets with a complex dependency structure. The results of the approach can be a basis to develop insurance tariffs. The approach is applied to a collection of data sets from several motor vehicle insurance companies. As an example, we use a nonparametric approach based on a combination of two methods from modern statistical machine learning, i.e. kernel logistic regression and ε-support vector regression. *This work was supported by the Deutsche Forschungsgemeinschaft (SFB 475, Reduction of complexity in multivariate data structures) and by the Forschungsband Do-MuS from the University of Dortmund. I am grateful to Mr. A. Wolfstein and Dr. W. Terbeck from the Verband öffentlicher Versicherer in Düsseldorf, Germany, for making available the data set and for many helpful discussions.

2.
Correct diagnosis is one of the major issues in the medical field, in part because human expertise in diagnosing disease manually is limited. The use of machine learning classifiers, such as support vector machines (SVM), in medical diagnosis is gradually increasing. However, traditional classification algorithms can perform poorly on highly imbalanced data sets, in which negative examples (i.e. negative for a disease) outnumber positive examples (i.e. positive for a disease). The SVM is a significant improvement, and its mathematical formulation allows different weights to be incorporated to deal with imbalanced data. In the present work, an extensive study of four medical data sets is conducted using a variant of SVM called the proximal support vector machine (PSVM), proposed by Fung and Mangasarian (G.M. Fung and O.L. Mangasarian, Proximal support vector machine classifiers, in Proceedings KDD-2001: Knowledge Discovery and Data Mining, F. Provost and R. Srikant, eds., Association for Computing Machinery, San Francisco, CA, New York, 2001, pp. 77–86. Available at ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-02.ps). Additionally, to deal with the imbalanced nature of the medical data sets, we applied both a variant of SVM, referred to as the two-cost support vector machine, and a modification of PSVM, referred to as modified PSVM. Both algorithms assign a different weight to the examples of each class.
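
As a rough illustration of per-class weighting, the sketch below computes one weight per class that is inversely proportional to the class frequency. This is a common heuristic and an assumption here; the abstract does not say how the two-cost SVM or the modified PSVM choose their weights, and the function name `inverse_frequency_weights` is hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count).

    Assumed heuristic for illustration only: rarer classes get larger weights,
    so their misclassification costs more in a weighted SVM objective.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# One positive example against three negatives.
weights = inverse_frequency_weights([1, -1, -1, -1])
```

Here the single positive example receives three times the weight of each negative example, which is the kind of asymmetry a two-cost formulation relies on.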

3.
    
The support vector machine (SVM) and other reproducing kernel Hilbert space (RKHS) based classifier systems have recently drawn much attention owing to their robustness and generalization capability. The general theme is to construct classifiers from the training data in a high-dimensional space by using all available dimensions. The SVM achieves huge data compression by selecting only the few observations that lie close to the boundary of the classifier function. However, when the number of observations is not very large (small n) but the number of dimensions/features is large (large p), the available features are not necessarily of equal importance for classification. Selecting a useful fraction of the available features may result in huge data compression. In this paper, we propose an algorithmic approach by which such an optimal set of features can be selected. In short, we reverse the traditional sequential observation selection strategy of the SVM to one of sequential feature selection. To achieve this, we modify the solution proposed by Zhu and Hastie in the context of the import vector machine (IVM) to select an optimal sub‐dimensional model with which to build a final classifier of sufficient accuracy.

4.
The support vector machine (SVM) has been successfully applied to various classification areas with great flexibility and a high level of classification accuracy. However, the SVM is not suitable for the classification of large or imbalanced datasets because of significant computational problems and a classification bias toward the dominant class. The SVM combined with k-means clustering (KM-SVM) is a fast algorithm developed to accelerate both the training and the prediction of SVM classifiers by using the cluster centers obtained from k-means clustering. In the KM-SVM algorithm, however, the penalty of misclassification is treated equally for each cluster center even though the contributions of different cluster centers to the classification can be different. In order to improve classification accuracy, we propose the WKM-SVM algorithm, which imposes different penalties for the misclassification of cluster centers by using the number of data points within each cluster as a weight. As an extension of the WKM-SVM, a recovery process based on WKM-SVM is suggested to incorporate the information near the optimal boundary. Furthermore, the proposed WKM-SVM can be successfully applied to imbalanced datasets with an appropriate weighting strategy. Experiments show the effectiveness of our proposed methods.
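
The weighting idea can be sketched as follows, assuming the clusters have already been found: each cluster center stands in for its members, and its hinge penalty is multiplied by the cluster size. The helper names and the use of a plain linear decision function are assumptions made for illustration; the abstract does not give the exact WKM-SVM objective.

```python
from collections import defaultdict

def centers_and_sizes(points, assignments):
    """Return {cluster_id: (center, size)} for points with known cluster ids."""
    groups = defaultdict(list)
    for point, cluster in zip(points, assignments):
        groups[cluster].append(point)
    summary = {}
    for cluster, members in groups.items():
        dim = len(members[0])
        center = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        summary[cluster] = (center, len(members))
    return summary

def weighted_hinge_penalty(w, b, centers, labels, sizes):
    """Sum of size-weighted hinge losses of a linear classifier on the centers."""
    total = 0.0
    for x, y, n in zip(centers, labels, sizes):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        total += n * max(0.0, 1.0 - margin)  # cluster size scales the penalty
    return total

summary = centers_and_sizes([(0, 0), (0, 2), (4, 4)], [0, 0, 1])
penalty = weighted_hinge_penalty(
    w=(1.0, 0.0), b=0.0,
    centers=[summary[0][0], summary[1][0]],
    labels=[1, -1],
    sizes=[summary[0][1], summary[1][1]],
)
```

With equal weights every center would count once; here a misclassified center that represents many points costs proportionally more.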

5.
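Existing clustering methods all rely on complete behavioral information about consumers. For incompletely observed information, a three-stage clustering method is proposed. First, consumers are clustered using all the information in the sample data; next, a classification model is built using demographic variables only; finally, the results of the first two stages are corrected. The main advantage of the three-stage method is that individuals not included in the sample can be assigned to the behavioral clusters derived from the sampled individuals. Applied to the television industry, the method yields results of considerable practical value.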

6.
    
Kernel‐based classification methods, for example, support vector machines, map the data into a higher‐dimensional space via a kernel function. In practice, choosing the value of the hyperparameter in the kernel function is crucial to ensure good performance. We propose a method of selecting the hyperparameter in the Gaussian radial basis function (RBF) kernel by considering the geometry of the embedded feature space. This method is independent of the choice of the discrimination algorithm and is also computationally efficient. Its classification performance is competitive with existing methods, including cross‐validation. Using simulated and real‐data examples, we show that the proposed method is stable with respect to sampling variability. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 142‐148, 2010
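
For reference, a minimal sketch of the Gaussian RBF kernel and its single hyperparameter. The parameterization k(x, y) = exp(-gamma * ||x - y||^2) is one common convention and is assumed here; the abstract does not state which form it uses, and its selection method is not reproduced.

```python
import math

def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel, assuming the form exp(-gamma * squared distance)."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)

same_point = rbf_kernel((0.0, 0.0), (0.0, 0.0), gamma=1.0)
narrow = rbf_kernel((0.0, 0.0), (1.0, 0.0), gamma=10.0)  # large gamma: similarity decays fast
wide = rbf_kernel((0.0, 0.0), (1.0, 0.0), gamma=0.1)     # small gamma: points stay similar
```

The two extremes show why the choice matters: with a very large gamma nearly every pair of distinct points looks unrelated, while with a very small gamma nearly every pair looks alike.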

7.
    
Support vector machines (SVMs) are a family of machine learning methods, originally introduced for the problem of classification and later generalized to various other situations. They are based on principles of statistical learning theory and convex optimization, and are currently used in various domains of application, including bioinformatics, text categorization, and computer vision. Copyright © 2009 John Wiley & Sons, Inc. This article is categorized under:
  • Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification

8.
    
The field of machine learning provides useful means and tools for finding accurate solutions to complex and challenging biological problems. In recent years, a class of learning algorithms, namely kernel methods, has been successfully applied to various tasks in computational biology. In this article, we present an overview of kernel methods and support vector machines and focus on their applications to biological sequences. We also describe a new class of approaches termed deep learning. These techniques have desirable characteristics, and their use can be highly effective within the field of computational biology. WIREs Comput Stat 2012 doi: 10.1002/wics.1223 This article is categorized under:
  • Applications of Computational Statistics > Computational and Molecular Biology
  • Statistical Learning and Exploratory Methods of the Data Sciences > Neural Networks
  • Statistical Learning and Exploratory Methods of the Data Sciences > Support Vector Machines

9.
    
Prior knowledge over general nonlinear sets is incorporated into proximal nonlinear kernel classification problems as linear equalities. The key tool in this incorporation is the conversion of general nonlinear prior knowledge implications into linear equalities in the classification variables without the need to kernelize these implications. These equalities are then included into a proximal nonlinear kernel classification formulation (G. Fung and O. L. Mangasarian, Proximal support vector machine classifiers, in Proceedings KDD‐2001: Knowledge Discovery and Data Mining, F. Provost and R. Srikant (eds), San Francisco, CA, New York, Association for Computing Machinery) that is solvable as a system of linear equations. Effectiveness of the proposed formulation is demonstrated on a number of publicly available classification datasets. Nonlinear kernel classifiers for these datasets exhibit marked improvements upon the introduction of nonlinear prior knowledge compared with nonlinear kernel classifiers that do not utilize such knowledge. Copyright © 2009 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 1: 000‐000, 2009

10.
    
As a new treatment strategy that takes individual heterogeneity into consideration, personalized medicine is of growing interest. Discovering individualized treatment rules for patients who have heterogeneous responses to treatment is one of the important areas in developing personalized medicine. As more and more information per individual is being collected in clinical studies and not all of the information is relevant for treatment discovery, variable selection becomes increasingly important in discovering individualized treatment rules. In this article, we develop a variable selection method based on penalized outcome weighted learning, in which finding an optimal treatment rule is treated as a classification problem where each subject is weighted in proportion to his or her clinical outcome. We show that the resulting estimator of the treatment rule is consistent and establish variable selection consistency and the asymptotic distribution of the estimators. The performance of the proposed approach is demonstrated via simulation studies and an analysis of chronic depression data. Copyright © 2015 John Wiley & Sons, Ltd.

11.
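As environmental pressure in China has grown, the government has proposed supply-side reform, of which cutting overcapacity is the main component; because industry characteristics evolve continuously, however, the policy needs refinement. This article uses the fuzzy C-means algorithm and the support vector machine algorithm to identify the industries that currently need capacity reduction. The results show that most industries covered by the current capacity-reduction policy do need to cut capacity, but that coal mining and washing, and the manufacture of railway, ship, aerospace and other transport equipment, are no longer suitable for further capacity reduction, while the manufacture of chemical raw materials and chemical products should be added to the list.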

12.
A tutorial on support vector regression (cited 78 times: 0 self-citations, 78 by others)
In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines for function estimation. Furthermore, we include a summary of currently used algorithms for training SV machines, covering both the quadratic (or convex) programming part and advanced methods for dealing with large datasets. Finally, we mention some modifications and extensions that have been applied to the standard SV algorithm, and discuss the aspect of regularization from an SV perspective.
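
A small sketch of the ε-insensitive loss that underlies standard support vector regression: residuals smaller than ε cost nothing, and larger ones are penalized linearly. The function name is illustrative; the abstract itself does not spell out the loss.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Return max(0, |y_true - y_pred| - epsilon)."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

inside_tube = epsilon_insensitive_loss(1.0, 1.05, epsilon=0.1)   # residual within the tube
outside_tube = epsilon_insensitive_loss(2.0, 1.0, epsilon=0.25)  # residual exceeds the tube
```

Only points whose residual reaches or exceeds ε can become support vectors, which is what gives the regression function its sparse representation.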

13.
The sequential minimal optimization (SMO) algorithm is effective for training large-scale support vector machines (SVMs). The existing algorithms all assume that the kernels are positive definite (PD) or positive semi-definite (PSD) and satisfy the Mercer condition. Some kernels, however, such as the sigmoid kernel, which originates from neural networks and is extensively used in SVMs, are only conditionally PD in certain circumstances; in addition, in practice it is often difficult to prove whether a kernel is PD or PSD, except for some well-known kernels. The applications of the existing SMO algorithms are therefore limited. To address this deficiency, an algorithm for solving ε-SVR with nonpositive semi-definite (non-PSD) kernels is proposed. Unlike the existing algorithms, which must consider four Lagrange multipliers, the algorithm proposed in this article needs to consider only two Lagrange multipliers during implementation. The proposed algorithm simplifies the implementation by expanding the original dual program of ε-SVR and solving its KKT conditions, and is thus easily applied to solving ε-SVR with non-PSD kernels. The presented algorithm is evaluated using five benchmark problems and one real-world problem. The results show that ε-SVR with non-PSD kernels provides more accurate predictions than ε-SVR with PD kernels.

14.
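Current models for selecting new economic growth points cannot distinguish "existing" growth points from "new" ones. To address this, a support vector machine is used to uncover the potential of new economic growth points. The study shows that Shaanxi Province's 38 industrial sectors in 2010 can be divided into two classes, "new economic growth points" and "non-new economic growth points", and that the top ten sectors in the former class are consistent with the development of the cultural, high-tech, and new-energy industries set out in Shaanxi's 12th Five-Year Plan, which demonstrates the feasibility and reliability of support vector machines for selecting new economic growth points.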

15.
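Forecasting RMB exchange rate movements with an unequally weighted support vector machine (cited 1 time: 0 self-citations, 1 by others)
Because recent data in a financial time series influence the future more than earlier data do, the support vector machine method used for financial time series forecasting is modified, yielding an unequally weighted support vector machine (USVM) together with a polynomial smoothing treatment. The USVM is applied to subsets of the training sample to determine the forecasting model, and empirical analysis shows that USVM forecasts are effective.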

16.
Quantile regression (QR) models have received a great deal of attention in both the theoretical and applied statistical literature. In this paper we propose support vector quantile regression (SVQR) with monotonicity restriction, which is easily obtained via the dual formulation of the optimization problem. We also provide the generalized approximate cross validation method for choosing the hyperparameters which affect the performance of the proposed SVQR. The experimental results for the synthetic and real data sets confirm the successful performance of the proposed model.
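
Quantile regression methods are usually built on the check (pinball) loss, sketched below. This is the standard quantile loss and is shown only as background; the monotonicity restriction and the dual formulation mentioned in the abstract are not reproduced here.

```python
def pinball_loss(y_true, y_pred, tau):
    """Check loss for quantile level tau in (0, 1).

    Under-prediction is weighted by tau and over-prediction by (1 - tau),
    so minimizing the average loss targets the tau-th conditional quantile.
    """
    residual = y_true - y_pred
    return tau * residual if residual >= 0 else (tau - 1.0) * residual

under = pinball_loss(3.0, 1.0, tau=0.5)   # prediction too low
over = pinball_loss(1.0, 3.0, tau=0.25)   # prediction too high
```

At tau = 0.5 the loss is half the absolute error, so the median is recovered as a special case.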

17.
    
Because of its many practical applications, classifying functional data has received considerable attention over the last decades. Most classification approaches for functional data are extended from those for multivariate data. During the extension, two strategies, namely filtering and regularization, have commonly been employed to tackle the issues raised by the fact that functional data are intrinsically infinite‐dimensional. Because of space limitations, we focus on the filtering methods in this review. This article is categorized under:
  • Statistical and Graphical Methods of Data Analysis > Analysis of High Dimensional Data
  • Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification

18.
    
A main goal of regression is to derive statistical conclusions on the conditional distribution of the output variable Y given the input values x. Two of the most important characteristics of a single distribution are location and scale. Regularised kernel methods (RKMs), also called support vector machines in a wide sense, are well established to estimate location functions like the conditional median or the conditional mean. We investigate the estimation of scale functions by RKMs when the conditional median is unknown, too. Estimation of scale functions is important, e.g. to estimate the volatility in finance. We consider the median absolute deviation (MAD) and the interquantile range as measures of scale. Our main result shows the consistency of MAD-type RKMs.
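
For orientation, the sketch below computes the ordinary (unconditional) median absolute deviation of a sample. The abstract concerns the conditional MAD as a function of x, estimated by regularised kernel methods; that estimator is not implemented here, and this example only shows the scale measure itself.

```python
from statistics import median

def median_absolute_deviation(values):
    """Median of the absolute deviations from the sample median."""
    center = median(values)
    return median(abs(v - center) for v in values)

mad = median_absolute_deviation([1, 2, 3, 4, 100])
```

The outlier 100 barely moves the result, which is why MAD-type measures are attractive as robust estimates of scale.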

19.
Unbalanced data classification has been a long-standing issue in the field of medical vision science. We introduce support vector machines (SVM) with active learning (AL) to improve the prediction of unbalanced classes in the medical imaging field. A standard SVM algorithm is combined with four different AL approaches: (1) the first uses random sampling to select the initial pool for the AL algorithm; (2) the second doubles the training instances of the rare category to reduce the unbalanced ratio before the AL algorithm; (3) the third uses a balanced pool with an equal number of instances from each category; and (4) the fourth uses a balanced pool and implements balanced sampling throughout the AL algorithm. Grid pixel data of two scleroderma lung disease patterns, lung fibrosis (LF) and honeycomb (HC), were extracted from computed tomography images of 71 patients to produce a training set of 348 HC and 3009 LF instances and a test set of 291 HC and 2665 LF instances. SVM with AL using balanced sampling, compared with random sampling, increased the test sensitivity for HC by 56 percentage points (17.5% vs. 73.5%) and 47 percentage points (23% vs. 70%) for the original and denoised datasets, respectively. SVM with AL and balanced sampling can improve classification performance on unbalanced data.
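
The balanced initial pool of approach (3) could be drawn roughly as follows. This is a sketch under assumptions: the function name, the fixed seed, and the toy labels are hypothetical, and the abstract does not describe how the pool was actually sampled.

```python
import random
from collections import defaultdict

def balanced_initial_pool(labels, per_class, seed=0):
    """Pick `per_class` distinct indices from every class, uniformly at random.

    Raises ValueError (from random.sample) if a class has fewer than
    `per_class` instances.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    pool = []
    for label in sorted(by_class):
        pool.extend(rng.sample(by_class[label], per_class))
    return pool

# Toy labels with a rare "HC" class and a dominant "LF" class.
toy_labels = ["HC"] * 5 + ["LF"] * 40
pool = balanced_initial_pool(toy_labels, per_class=3)
```

Starting from an equal number of instances per class keeps the first classifier from being dominated by the majority class before any active-learning queries are made.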

20.
    
Data are generated at an unprecedented rate and scale these days across many disciplines. The field of streaming data analysis has emerged as a result of new data collection and storage technologies in various areas, such as air pollution monitoring, detection of traffic congestion, disease surveillance, and recommendation systems. In this paper, we consider the problem of model estimation for data streams in reproducing kernel Hilbert spaces. We propose an adaptive supervised learning method with a data sparsity constraint that uses limited storage space and can handle nonstationary models. We demonstrate the competitive performance of the proposed method using simulations and an analysis of the bike sharing dataset.
