Related Articles
20 related articles found (search time: 31 ms).
1.
The main models of machine learning are briefly reviewed and considered for building a classifier to identify Fragile X Syndrome (FXS). We analyzed 172 patients potentially affected by FXS in Andalusia (Spain); by means of a DNA test, each member of the data set is known to belong to one of two classes: affected or not affected. Both the full predictor set of 40 variables and a reduced set of only nine predictors significantly associated with the response are considered. Four alternative base classification models are investigated: logistic regression, classification trees, the multilayer perceptron and support vector machines. For both predictor sets, the best accuracy, considering both the mean and the standard deviation of the test error rate, is achieved by support vector machines, confirming the increasing importance of this learning algorithm. Three ensemble methods - bagging, random forests and boosting - were also considered, among which the bagged versions of support vector machines stand out, especially when they are built with the reduced set of predictor variables. Analysis of the sensitivity, the specificity and the area under the ROC curve agrees with the main conclusions drawn from the accuracy results. All of these models can be fitted with free R programs.
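The abstract notes that all of these models can be fitted with free R software. A minimal sketch of two of the base classifiers, not the authors' actual scripts, assuming a hypothetical data frame `fxs` with a binary factor column `affected` and the predictor variables:

```r
# Minimal sketch (hypothetical data frame `fxs` with factor `affected`).
library(e1071)   # svm()
library(rpart)   # classification trees

set.seed(1)
idx   <- sample(nrow(fxs), round(0.7 * nrow(fxs)))
train <- fxs[idx, ]
test  <- fxs[-idx, ]

svm_fit  <- svm(affected ~ ., data = train, kernel = "radial")
tree_fit <- rpart(affected ~ ., data = train, method = "class")

mean(predict(svm_fit, test) != test$affected)                  # SVM test error
mean(predict(tree_fit, test, type = "class") != test$affected) # tree test error
```

Bagging the SVM, which the abstract singles out, amounts to refitting `svm()` on bootstrap resamples of `train` and majority-voting the resulting predictions.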

2.
In many scientific investigations a large number of input variables is available at the early stage of modeling, and identifying the variables predictive of the response is often a main purpose of such investigations. Recently, the support vector machine has become an important tool in classification problems across many fields, and several variants adopting different penalties in the objective function have been proposed. This paper deals with the Fisher consistency and the oracle property of support vector machines in the setting where the dimension of the inputs is fixed. First, we study the Fisher consistency of the support vector machine over the class of affine functions, showing that the function class used for the decision functions is crucial for Fisher consistency. Second, we study the oracle property of penalized support vector machines with the smoothly clipped absolute deviation (SCAD) penalty; once Fisher consistency over affine functions has been established, the oracle property becomes meaningful in the classification context. A simulation study illustrates the small-sample properties of the SCAD-penalized support vector machines.
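For reference, the smoothly clipped absolute deviation (SCAD) penalty mentioned here has the standard form of Fan and Li (2001), applied to each coefficient magnitude θ = |β_j|:

```latex
p_\lambda(\theta) =
\begin{cases}
\lambda\theta, & 0 \le \theta \le \lambda,\\[4pt]
-\dfrac{\theta^2 - 2a\lambda\theta + \lambda^2}{2(a-1)}, & \lambda < \theta \le a\lambda,\\[4pt]
\dfrac{(a+1)\lambda^2}{2}, & \theta > a\lambda,
\end{cases}
\qquad a > 2 \;(\text{commonly } a = 3.7).
```

Unlike the L1 penalty, SCAD levels off for large coefficients, which is what makes the oracle property attainable.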

3.
Among many classification methods, linear discriminant analysis (LDA) is a favored tool due to its simplicity, robustness and predictive accuracy, but when the number of genes exceeds the number of observations it cannot be applied directly because the within-class covariance matrix is singular. Diagonal LDA (DLDA) is a simpler model than LDA and performs better in some cases, but it rests on the strong assumption of mutual independence among the genes. In this article, we propose a modified LDA (MLDA). MLDA starts from the independence assumption but additionally exploits the part of the dependence structure that affects classification performance. We suggest two approaches, one that uses gene rank and one that does not. We found that MLDA performs better than LDA, DLDA or k-nearest neighbours, and is comparable with support vector machines, in both real data analysis and a simulation study.
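A minimal sketch of plain DLDA (the independence-based baseline, not the proposed MLDA), assuming an n x p expression matrix `X` and a factor `y` of class labels:

```r
# Train: class means plus pooled per-gene variances (covariances ignored).
dlda_train <- function(X, y) {
  lev <- levels(y)
  mu  <- sapply(lev, function(k) colMeans(X[y == k, , drop = FALSE]))
  s2  <- Reduce(`+`, lapply(lev, function(k) {
    Xk <- X[y == k, , drop = FALSE]
    colSums(sweep(Xk, 2, colMeans(Xk))^2)
  })) / (nrow(X) - length(lev))
  list(mu = mu, s2 = s2, lev = lev)
}

# Predict: assign each row to the class with the smallest
# variance-scaled squared distance to the class mean.
dlda_predict <- function(fit, Xnew) {
  d <- sapply(seq_along(fit$lev), function(k)
    rowSums(sweep(sweep(Xnew, 2, fit$mu[, k])^2, 2, fit$s2, "/")))
  fit$lev[max.col(-d)]
}
```

MLDA, as described above, modifies this baseline by reintroducing the part of the dependence structure useful for classification; that step is not sketched here.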

4.
Gradient boosting (GB) was introduced as a powerful approach to both classification and regression problems. Boosting with the L2 loss has been studied intensively in both theory and practice, but the L2 loss is not appropriate for learning distributional functionals beyond the conditional mean, such as conditional quantiles. There is a large literature on conditional quantile prediction using various methods, including machine learning techniques such as random forests and boosting. Simulation studies reveal that the weakness of random forests lies in predicting central quantiles, while that of GB lies in predicting extreme ones. Is there an algorithm that enjoys the advantages of both, so that it performs well across all quantiles? In this article, we propose such a boosting algorithm, called random GB, which combines the merits of random forests and GB. Empirical results are presented to support the superiority of this algorithm in predicting conditional quantiles.
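A minimal sketch of ordinary quantile boosting with the gbm package (plain GB under the check loss, not the proposed random GB), assuming a data frame `dat` with numeric response `y`:

```r
library(gbm)
# Boost the 0.9 conditional quantile with the quantile (check) loss.
fit_q90 <- gbm(y ~ ., data = dat,
               distribution = list(name = "quantile", alpha = 0.9),
               n.trees = 1000, interaction.depth = 3, shrinkage = 0.05)
q90_hat <- predict(fit_q90, newdata = dat, n.trees = 1000)
```

Random GB, per the abstract, additionally injects random-forest-style randomization; see the article for that algorithm.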

5.
Recently, a new ensemble classification method named Canonical Forest (CF) was proposed by Chen et al. [Canonical forest. Comput Stat. 2014;29:849–867]. CF has been shown to give consistently good results on many data sets, comparable to other widely used classification ensemble methods. However, CF requires a feature reduction step before it can classify high-dimensional data. Here, we extend CF to a high-dimensional classifier by incorporating a random feature subspace algorithm [Ho TK. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell. 1998;20:832–844]. The extended algorithm is called HDCF (high-dimensional CF), as it is specifically designed for high-dimensional data. We conducted experiments on three data sets – gene imprinting, oestrogen, and leukaemia – to compare the performance of HDCF with several popular and successful classification methods for high-dimensional data, including random forest [Breiman L. Random forest. Mach Learn. 2001;45:5–32], CERP [Ahn H, et al. Classification by ensembles from random partitions of high-dimensional data. Comput Stat Data Anal. 2007;51:6166–6179], and support vector machines [Vapnik V. The nature of statistical learning theory. New York: Springer; 1995]. Besides classification accuracy, we also investigated the balance between sensitivity and specificity for all four classification methods.
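A minimal sketch of the random feature subspace idea that HDCF builds on (Ho, 1998), with rpart trees as stand-in base learners; `X` (an n x p matrix with column names), `y` (a factor) and the subspace size `m` are assumptions:

```r
library(rpart)

random_subspace_ensemble <- function(X, y, B = 100, m = floor(sqrt(ncol(X)))) {
  lapply(seq_len(B), function(b) {
    vars <- sample(colnames(X), m)           # random feature subspace
    d    <- data.frame(y = y, as.data.frame(X[, vars, drop = FALSE]))
    list(fit = rpart(y ~ ., data = d, method = "class"), vars = vars)
  })
}

subspace_predict <- function(ens, Xnew) {    # majority vote over the ensemble
  votes <- sapply(ens, function(e)
    as.character(predict(e$fit, as.data.frame(Xnew[, e$vars, drop = FALSE]),
                         type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}
```

HDCF itself grows canonical-forest base learners inside each subspace rather than plain trees.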

6.
This is a comparative study of various clustering and classification algorithms applied to differentiating cancer from non-cancer protein samples using mass spectrometry data. Our study demonstrates the usefulness of a feature selection step prior to applying a machine learning tool. A natural and common choice of feature selection tool is the collection of marginal p-values obtained from t-tests of the intensity differences at each m/z ratio in the cancer versus non-cancer samples. We study how the choice of p-value cutoff, in terms of overall Type I error rate control, affects the performance of the clustering and classification algorithms run on the significant features. For the classification problem, we also consider m/z selection using the importance measures computed by Breiman's Random Forest algorithm. Using a data set of proteomic analyses of serum from ovarian cancer patients and from cancer-free individuals in the Food and Drug Administration and National Cancer Institute Clinical Proteomics Database, we undertake a comparative study of the net effect of the combination of machine learning algorithm, feature selection tool and cutoff criterion on performance, as measured by an appropriate error rate.
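A minimal sketch of the marginal t-test filter described above, assuming an n x p intensity matrix `spectra` (columns are m/z ratios) and a binary factor `grp`; the Bonferroni adjustment and 0.05 cutoff are placeholders for the overall Type I error control being studied:

```r
pvals   <- apply(spectra, 2, function(x) t.test(x ~ grp)$p.value)
keep    <- which(p.adjust(pvals, method = "bonferroni") < 0.05)
reduced <- spectra[, keep, drop = FALSE]   # features passed to the classifier
```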

7.
Statistical learning is emerging as a promising field in which a number of algorithms from machine learning are interpreted as statistical methods and vice versa. Owing to its good practical performance, boosting is one of the most studied machine learning techniques. We propose algorithms for multivariate density estimation and classification, generated by using traditional kernel techniques as weak learners in boosting algorithms. Our algorithms take the form of multistep estimators whose first step is a standard kernel method. Some strategies for bandwidth selection are also discussed, both for the standard kernel density classification problem and for our 'boosted' kernel methods. Extensive experiments on real and simulated data show encouraging practical relevance of the findings: standard kernel methods are often outperformed by the first boosting iterations, and across a range of bandwidth values. In addition, the practical effectiveness of our classification algorithm is confirmed by a comparative study on two real datasets against tree-based competitors, including AdaBoost with trees.
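A minimal sketch of the standard kernel density classification baseline (the starting point that the boosted iterations improve on, not the boosted estimator itself), for a single feature `x_train` with binary factor labels `y_train`:

```r
kd_classify <- function(x_train, y_train, x_new, bw = "nrd0") {
  lev <- levels(y_train)
  f0  <- density(x_train[y_train == lev[1]], bw = bw)  # class-wise KDEs
  f1  <- density(x_train[y_train == lev[2]], bw = bw)
  d0  <- approx(f0$x, f0$y, xout = x_new, rule = 2)$y
  d1  <- approx(f1$x, f1$y, xout = x_new, rule = 2)$y
  p0  <- mean(y_train == lev[1]); p1 <- 1 - p0
  ifelse(p1 * d1 > p0 * d0, lev[2], lev[1])  # Bayes rule on the estimates
}
```

The `bw` argument is where the bandwidth-selection strategies discussed in the abstract enter.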

8.
For time series data with obvious periodicity (e.g., electric motor systems and cardiac monitors) or vague periodicity (e.g., earthquake and explosion recordings, speech, and stock data), frequency-based techniques using spectral analysis can usually capture the features of the series. With this approach we can not only reduce the data to the frequency domain but also feed these frequencies into general classification methods such as linear discriminant analysis (LDA) and k-nearest-neighbour (KNN) classifiers, a combination of two classical approaches. However, applying LDA and KNN in the frequency domain is difficult because of the excessive dimension of the data. We overcome this obstacle by using singular value decomposition to select the essential frequencies. Two data sets are used to illustrate our approach; the classification error rates of this simple approach are comparable to those of several more complicated methods.
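A minimal sketch of the pipeline: periodogram features, an SVD to select essential frequency directions, then KNN. `series` (an n x T matrix, one series per row), `y` (a factor) and `train_idx` are assumptions:

```r
library(class)                      # knn()

pgram  <- t(apply(series, 1, function(s) spec.pgram(s, plot = FALSE)$spec))
sv     <- svd(scale(pgram, scale = FALSE))   # SVD of the centred spectra
k      <- 10                                 # singular directions retained
scores <- sv$u[, 1:k] %*% diag(sv$d[1:k])

pred <- knn(train = scores[train_idx, ], test = scores[-train_idx, ],
            cl = y[train_idx], k = 3)
```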

9.
The purpose of this study was to apply support vector machines (SVMs) to bank bankruptcy analysis through a sequence of practical steps. Although the financial distress of companies has been predicted using several statistical and machine learning techniques, bank classification and bankruptcy prediction still need investigation, as few studies have been conducted in the banking field. In this study, SVMs were used to analyse financial ratios on data sets from Turkish commercial banks. The study shows that SVMs with the Gaussian kernel are capable of extracting useful information from financial data and can be used as part of an early warning system.

10.
In this paper, we perform an empirical comparison of the classification error of several ensemble methods based on classification trees, using 14 publicly available data sets previously analysed by Lim, Loh and Shih [Lim T, Loh W and Shih Y-S, 2000, A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40, 203–228]. The methods considered are a single tree, bagging, boosting (arcing) and random forests (RF), compared from several perspectives. More precisely, we look at the effects of noise and of allowing linear combinations in the construction of the trees, the differences between some splitting criteria and, specifically for RF, the effect of the number of variables from which the best split is chosen at each node. We also compare our results with those obtained by Lim et al. In this study, the best overall results are obtained with RF; in particular, RF are the most robust against noise. The effects of allowing linear combinations and of the different splitting criteria are small on average, but can be substantial for some data sets.
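A minimal sketch of the RF experiment on the number of split candidates per node, i.e. varying `mtry`, assuming a data frame `dat` with factor response `class`:

```r
library(randomForest)
# Out-of-bag error at 500 trees for several choices of mtry.
errs <- sapply(c(1, 2, 4, 8), function(m) {
  rf <- randomForest(class ~ ., data = dat, mtry = m, ntree = 500)
  rf$err.rate[500, "OOB"]
})
```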

11.
Consider using values of variables X1, X2, …, Xp to classify entities into one of two classes. Kernel-based procedures such as support vector machines (SVMs) are well suited for this task. In general, the classification accuracy of SVMs can be substantially improved if, instead of all p candidate variables, a smaller subset of (say m) variables is used. A new two-step approach to variable selection for SVMs is therefore proposed: best variable subsets of size k = 1, 2, …, p are first identified, and then a new data-dependent criterion is used to determine a value for m. The new approach is evaluated in a Monte Carlo simulation study and on a sample of data sets.
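A minimal sketch of the two-step flavour, with two simplifications: subsets are grown greedily rather than by exhaustive best-subset search, and 10-fold cross-validated SVM error stands in for the paper's data-dependent criterion. `X` (a numeric matrix) and `y` (a factor) are assumptions:

```r
library(e1071)

cv_err <- function(vars) {                 # 10-fold CV error of an SVM
  fit <- svm(x = X[, vars, drop = FALSE], y = y, cross = 10)
  1 - fit$tot.accuracy / 100
}

chosen <- integer(0)
err_k  <- numeric(ncol(X))
for (k in seq_len(ncol(X))) {              # step 1: best subset per size k
  cand     <- setdiff(seq_len(ncol(X)), chosen)
  scores   <- sapply(cand, function(j) cv_err(c(chosen, j)))
  chosen   <- c(chosen, cand[which.min(scores)])
  err_k[k] <- min(scores)
}
m <- which.min(err_k)                      # step 2: pick the subset size m
```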

12.
Bayesian Additive Regression Trees (BART) is a statistical sum-of-trees model. It can be considered a Bayesian version of machine learning tree ensemble methods, with the individual trees as the base learners. However, for datasets where the number of variables p is large, the algorithm can become inefficient and computationally expensive. Another method popular for high-dimensional data is random forests, a machine learning algorithm which grows trees using a greedy search for the best split points; its default implementation, however, does not produce probabilistic estimates or predictions. We propose an alternative fitting algorithm for BART called BART-BMA, which uses Bayesian model averaging and a greedy search algorithm to obtain a posterior distribution more efficiently than BART for datasets with large p. BART-BMA incorporates elements of both BART and random forests to offer a model-based algorithm that can deal with high-dimensional data. We have found that BART-BMA can be run in reasonable time on a standard laptop for the “small n, large p” scenario common in many areas of bioinformatics. We showcase the method using simulated data and data from two real proteomic experiments: one to distinguish between patients with cardiovascular disease and controls, and another to classify aggressive from non-aggressive prostate cancer. We compare our results to those of the main competitors. Open-source code written in R and Rcpp to run BART-BMA can be found at https://github.com/BelindaHernandez/BART-BMA.git.
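For comparison, standard BART (not BART-BMA) can be fitted with the BayesTree package; a minimal sketch for a continuous response, assuming a predictor matrix `X`, response vector `y` and test matrix `Xtest`:

```r
library(BayesTree)
fit  <- bart(x.train = X, y.train = y, x.test = Xtest)
yhat <- fit$yhat.test.mean   # posterior mean predictions at Xtest
```

For BART-BMA itself, use the R/Rcpp code at the GitHub link above.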

13.
The aim of this article is to assess and compare several statistical methods for supervised classification of hyperspectral images using only the spectral dimension. Since hyperspectral profiles may be viewed either as random vectors or as random curves, we confront various multivariate discrimination procedures with functional alternatives. Eight methods representing three important statistical communities (mixture models, machine learning and functional data analysis) are applied to three hyperspectral datasets under three protocols studying the influence of the size and composition of the learning sample, with and without noisy labels. Besides this comparative study, the work proposes a functional extension of the multinomial logit model as well as a fast-computing adaptation of nonparametric functional discrimination. As a by-product, it provides a comprehensive bibliography and supplemental material especially oriented towards practitioners.

14.
In this work we present a study of a large data set from seismology. A set of large-margin classifiers based on the well-known support vector machine (SVM) algorithm is used to classify the data into two classes according to magnitude on the Richter scale. Because of the natural imbalance between the two classes, reweighting techniques are applied, demonstrating the importance of reweighting algorithms. Moreover, we present an incremental algorithm to explore the possibility of predicting the strength of an earthquake with incremental techniques.
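A minimal sketch of the reweighting idea using the class-weight option of an off-the-shelf SVM, assuming a data frame `quakes_df` with a binary factor `big`; inverse-frequency weights are one common choice, not necessarily the authors':

```r
library(e1071)
tab <- table(quakes_df$big)                 # class counts
cw  <- as.numeric(sum(tab) / (2 * tab))     # upweight the rare class
names(cw) <- names(tab)
fit <- svm(big ~ ., data = quakes_df, kernel = "radial", class.weights = cw)
```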

15.
In this paper, we consider the classification of high-dimensional vectors based on a small number of training samples from each class. The proposed method follows the Bayesian paradigm and is based on a small vector which can be viewed as the regression of the new observation on the space spanned by the training samples. The classification method provides posterior probabilities that the new vector belongs to each of the classes, and hence adapts naturally to any number of classes. Furthermore, we show a direct similarity between the proposed method and the multicategory linear support vector machine introduced in Lee et al. [2004. Multicategory support vector machines: theory and applications to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99 (465), 67–81]. We compare the performance of the proposed technique with the SVM classifier using real-life military and microarray datasets. The study shows that the misclassification errors of the two methods are very similar, and that the posterior probabilities assigned to each class are fairly accurate.

16.
One of the major issues in the medical field is correct diagnosis, given the limitations of human expertise in diagnosing disease manually. Nowadays, the use of machine learning classifiers, such as support vector machines (SVMs), in medical diagnosis is increasing gradually. However, traditional classification algorithms can be limited in performance when applied to highly imbalanced data sets, in which negative examples (i.e. negative for a disease) outnumber the positive examples (i.e. positive for a disease). The SVM constitutes a significant improvement, and its mathematical formulation allows the incorporation of different weights to deal with the problem of imbalanced data. In the present work, an extensive study of four medical data sets is conducted using a variant of SVM called the proximal support vector machine (PSVM), proposed by Fung and Mangasarian [Fung GM, Mangasarian OL. Proximal support vector machine classifiers. In: Proceedings KDD-2001: Knowledge Discovery and Data Mining, Association for Computing Machinery, San Francisco, CA, 2001, pp. 77–86. Available at ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-02.ps]. Additionally, to deal with the imbalanced nature of the medical data sets, we applied both a variant of SVM, referred to as the two-cost support vector machine, and a modification of PSVM, referred to as the modified PSVM. Both algorithms incorporate a different weight for each class of examples.

17.
陈凯 (Chen Kai), 《统计教育》 [Statistical Education], 2008, (12): 3–7
Ensemble learning has become a major focus of machine learning research, and many improved ensemble learning algorithms have been proposed. This paper proposes a selective ensemble learning algorithm, SE-BagBoosting Trees, which combines characteristics of the Boosting and Bagging algorithms. A comparative study against several commonly used machine learning algorithms shows that the proposed algorithm often achieves smaller generalization error and higher predictive accuracy.

18.
This paper proposes an algorithm for the classification of multi-dimensional datasets based on conjugate Bayesian Multiple Kernel Grouping Learning (BMKGL). Using a conjugate Bayesian framework improves computational efficiency, and using multiple kernels instead of a single kernel avoids the kernel selection problem, which is itself computationally expensive. Through grouping-parameter learning, BMKGL can simultaneously integrate information from different dimensions and identify the dimensions that contribute most to the variation in the outcome, which aids interpretability. Meanwhile, BMKGL can select the most suitable combination of kernels for the different dimensions, extracting the most appropriate measure for each dimension and improving the accuracy of the classification results. Simulation results illustrate that the learning process gives better predictions and stability than some popular classifiers, such as the k-nearest neighbours algorithm, the support vector machine and the naive Bayes classifier. BMKGL also outperforms previous methods in terms of accuracy and interpretation on the heart disease and EEG datasets.
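A minimal sketch of the baseline comparison mentioned above (k-NN, SVM, naive Bayes), not the BMKGL procedure itself; `train` and `test` are assumed data frames with numeric predictors and a factor column `label`:

```r
library(e1071)   # svm(), naiveBayes()
library(class)   # knn()

nb  <- naiveBayes(label ~ ., data = train)
sv  <- svm(label ~ ., data = train)
xtr <- train[setdiff(names(train), "label")]
xte <- test[setdiff(names(test), "label")]

acc <- c(nb  = mean(predict(nb, test) == test$label),
         svm = mean(predict(sv, test) == test$label),
         knn = mean(knn(xtr, xte, cl = train$label, k = 5) == test$label))
```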

19.
As a special case of statistical learning, ensemble methods are well suited for the analysis of opportunistically collected data that involve many weak and sometimes specialized predictors, especially when subject-matter knowledge favours inductive approaches. We analyse data on the incidental mortality of dolphins in the purse-seine fishery for tuna in the eastern Pacific Ocean. The goal is to identify those rare purse-seine sets for which incidental mortality would be expected but none was reported. The ensemble method random forests is used to classify sets according to whether mortality was (response 1) or was not (response 0) reported. To identify questionable reporting practice, we construct 'residuals' as the difference between the categorical response (0, 1) and the proportion of trees in the forest that classify a given set as having mortality. Two uses of these residuals to identify suspicious data are illustrated. This approach shows promise as a means of identifying suspect data gathered for environmental monitoring.
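A minimal sketch of the 'residual' construction described above, assuming a data frame `sets` with a 0/1 factor `mortality` (levels "0" and "1"); the -0.8 flag threshold is a placeholder:

```r
library(randomForest)
rf   <- randomForest(mortality ~ ., data = sets)
prop <- predict(rf, type = "prob")[, "1"]  # OOB share of trees voting mortality
res  <- as.numeric(as.character(sets$mortality)) - prop

# Sets where the forest expects mortality but none was reported.
suspect <- sets[sets$mortality == "0" & res < -0.8, ]
```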

20.
This paper examines the use of a residual bootstrap for bias correction in machine learning regression methods. Bias is an important obstacle in recent efforts to develop statistical inference for machine learning. We demonstrate empirically that the proposed bootstrap bias correction can lead to substantial improvements in both bias and predictive accuracy. In the context of ensembles of trees, we show that this correction can be approximated at only double the cost of training the original ensemble. Our method is shown to improve test-set accuracy over random forests by up to 70% on example problems from the UCI repository.
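A minimal sketch of a generic residual-bootstrap bias correction around a random forest, assuming a data frame `dat` with numeric response `y`; the paper's cheaper ensemble-specific approximation (double the training cost) is not reproduced here:

```r
library(randomForest)
rf <- randomForest(y ~ ., data = dat)
f  <- predict(rf)                           # out-of-bag fitted values
e  <- dat$y - f                             # OOB residuals

B <- 20
fstar <- replicate(B, {                     # refit on residual-bootstrap data
  boot   <- dat
  boot$y <- f + sample(e, replace = TRUE)
  predict(randomForest(y ~ ., data = boot), dat)
})

bias      <- rowMeans(fstar) - predict(rf, dat)
corrected <- predict(rf, dat) - bias        # i.e. 2*f_hat - mean(f*)
```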
