Similar literature
 Found 20 similar documents (search time: 734 ms)
1.
Partially linear models (PLMs) are an important tool for modelling economic and biometric data and can be viewed as a flexible generalization of the linear model that includes a nonparametric component of some covariate in the linear predictor. Usually, the error component is assumed to follow a normal distribution. However, theory and application (through simulation or experimentation) often generate many data sets that are skewed. The objective of this paper is to extend PLMs by allowing the errors to follow a skew-normal distribution [A. Azzalini, A class of distributions which includes the normal ones, Scand. J. Statist. 12 (1985), pp. 171–178], increasing the flexibility of the model. In particular, we develop the expectation-maximization (EM) algorithm for linear regression models and diagnostic analysis via local influence as well as generalized leverage, following [H. Zhu and S. Lee, Local influence for incomplete-data models, J. R. Stat. Soc. Ser. B 63 (2001), pp. 111–126]. A simulation study is also conducted to evaluate the efficiency of the EM algorithm. Finally, a suitable transformation is applied to a data set on ragweed pollen concentration in order to fit PLMs under asymmetric distributions. An illustrative comparison is performed between normal and skew-normal errors.

2.
Recently, a new ensemble classification method named Canonical Forest (CF) was proposed by Chen et al. [Canonical forest. Comput Stat. 2014;29:849–867]. CF has been shown to give consistently good results on many data sets and to be comparable to other widely used classification ensemble methods. However, CF requires adopting a feature reduction method before classifying high-dimensional data. Here, we extend CF to a high-dimensional classifier by incorporating a random feature subspace algorithm [Ho TK. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell. 1998;20:832–844]. This extended algorithm is called HDCF (high-dimensional CF) as it is specifically designed for high-dimensional data. We conducted an experiment using three data sets – gene imprinting, oestrogen, and leukaemia – to compare the performance of HDCF with several popular and successful classification methods on high-dimensional data sets, including Random Forest [Breiman L. Random forest. Mach Learn. 2001;45:5–32], CERP [Ahn H, et al. Classification by ensembles from random partitions of high-dimensional data. Comput Stat Data Anal. 2007;51:6166–6179], and support vector machines [Vapnik V. The nature of statistical learning theory. New York: Springer; 1995]. Besides the classification accuracy, we also investigated the balance between sensitivity and specificity for all four classification methods.

3.
We consider the supervised classification setting, in which the data consist of p features measured on n observations, each of which belongs to one of K classes. Linear discriminant analysis (LDA) is a classical method for this problem. However, in the high-dimensional setting where p ≫ n, LDA is not appropriate for two reasons. First, the standard estimate for the within-class covariance matrix is singular, and so the usual discriminant rule cannot be applied. Second, when p is large, it is difficult to interpret the classification rule obtained from LDA, since it involves all p features. We propose penalized LDA, a general approach for penalizing the discriminant vectors in Fisher's discriminant problem in a way that leads to greater interpretability. The discriminant problem is not convex, so we use a minorization-maximization approach in order to efficiently optimize it when convex penalties are applied to the discriminant vectors. In particular, we consider the use of L1 and fused lasso penalties. Our proposal is equivalent to recasting Fisher's discriminant problem as a biconvex problem. We evaluate the performance of the resulting methods in a simulation study and on three gene expression data sets. We also survey past methods for extending LDA to the high-dimensional setting, and explore their relationships with our proposal.
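The first obstacle named in this abstract (a singular within-class covariance estimate when p exceeds n) can be checked on a toy example. The sketch below is only an illustration, not the authors' penalized LDA: the data, the helper names `within_class_scatter` and `matrix_rank`, and the use of exact fractions are all assumptions made here. With n = 4 observations, K = 2 classes and p = 6 features, the pooled scatter matrix can have rank at most n − K = 2, so it cannot be inverted.

```python
from fractions import Fraction

def within_class_scatter(X, y):
    """Pooled within-class scatter: sum over classes of (x - mean_k)(x - mean_k)^T."""
    p = len(X[0])
    S = [[Fraction(0)] * p for _ in range(p)]
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        mean = [Fraction(sum(col), len(rows)) for col in zip(*rows)]
        for x in rows:
            d = [Fraction(v) - m for v, m in zip(x, mean)]
            for a in range(p):
                for b in range(p):
                    S[a][b] += d[a] * d[b]
    return S

def matrix_rank(M):
    """Rank by Gaussian elimination; exact because the entries are Fractions."""
    A = [row[:] for row in M]
    n_rows, n_cols = len(A), len(A[0])
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, n_rows) if A[r][col] != 0), None)
        if pivot is None:
            continue
        A[rank], A[pivot] = A[pivot], A[rank]
        for r in range(rank + 1, n_rows):
            factor = A[r][col] / A[rank][col]
            A[r] = [a - factor * b for a, b in zip(A[r], A[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank

# Hypothetical toy data: n = 4 observations, p = 6 features, K = 2 classes.
X = [[1, 2, 0, 3, 1, 4], [2, 0, 1, 1, 3, 2],
     [0, 1, 4, 2, 2, 1], [3, 3, 1, 0, 1, 5]]
y = [0, 0, 1, 1]

S = within_class_scatter(X, y)
print("p =", len(S), "rank =", matrix_rank(S))  # rank is below p, so S is singular
```

Dividing the scatter matrix by n − K gives the usual covariance estimate, which has the same rank; this is why the standard discriminant rule, which needs the inverse, is unavailable in this regime.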

4.
Classification of high-dimensional data sets is a major challenge for statistical learning and data mining algorithms. To apply classification methods effectively to high-dimensional data sets, feature selection is an indispensable pre-processing step of the learning process. In this study, we consider the problem of constructing an effective feature selection and classification scheme for data sets with a small sample size and a large number of features. A novel feature selection approach, named four-Staged Feature Selection, is proposed to overcome the high-dimensional data classification problem by selecting informative features. The proposed method first selects candidate features with a number of filtering methods based on different metrics, and then applies semi-wrapper, union and voting stages, respectively, to obtain the final feature subsets. Several statistical learning and data mining methods are applied to verify the efficiency of the selected features. In order to test the adequacy of the proposed method, 10 different microarray data sets are employed because of their high number of features and small sample size.

5.
Fisher's linear discriminant analysis (FLDA) is known as a method to find a discriminative feature space for multi-class classification. As a theory extending FLDA to an ultimate nonlinear form, optimal nonlinear discriminant analysis (ONDA) has been proposed. ONDA indicates that the best theoretical nonlinear map for maximizing Fisher's discriminant criterion is formulated by using the Bayesian a posteriori probabilities. In addition, the theory proves that FLDA is equivalent to ONDA when the Bayesian a posteriori probabilities are approximated by linear regression (LR). Due to some limitations of the linear model, there is room to modify FLDA by using stronger approximation/estimation methods. For the purpose of probability estimation, multinomial logistic regression (MLR) is more suitable than LR. Along this line, we develop a nonlinear discriminant analysis (NDA) in which the posterior probabilities in ONDA are estimated by MLR. We also develop a way to introduce sparseness into discriminant analysis. By applying L1 or L2 regularization to LR or MLR, we can incorporate sparseness in FLDA and our NDA to increase generalization performance. The performance of these methods is evaluated by benchmark experiments using 17 standard datasets and a face classification experiment.

6.
It is often the case that high-dimensional data consist of only a few informative components. Standard statistical modeling and estimation in such a situation is prone to inaccuracies due to overfitting, unless regularization methods are used. In the context of classification, we propose a class of regularization methods through shrinkage estimators. The shrinkage is based on variable selection coupled with conditional maximum likelihood. Using Stein's unbiased estimator of the risk, we derive an estimator for the optimal shrinkage method within a certain class. We compare the optimal shrinkage methods in a classification context with the optimal shrinkage method for estimating a mean vector under squared loss. The latter problem has been studied extensively, but the results of those studies appear not to be completely relevant for classification. We demonstrate and examine our method on simulated data and compare it to the feature annealed independence rule and Fisher's rule.

7.
We consider the classification of high-dimensional data under the strongly spiked eigenvalue (SSE) model. We create a new classification procedure on the basis of the high-dimensional eigenstructure in the high-dimension, low-sample-size context. We propose a distance-based classification procedure that uses a data transformation. We also prove that the proposed classification procedure has the consistency property for misclassification rates. We discuss the performance of our classification procedure in simulations and real data analyses using microarray data sets.

8.
The problem of constructing classification methods based on both labeled and unlabeled data sets is considered for analyzing data with complex structures. We introduce a semi-supervised logistic discriminant model with Gaussian basis expansions. Unknown parameters included in the logistic model are estimated by a regularization method together with the EM algorithm. For selection of adjusted parameters, we derive a model selection criterion from a Bayesian viewpoint. Numerical studies are conducted to investigate the effectiveness of our proposed modeling procedures.

9.
The naïve Bayes rule (NBR) is a popular and often highly effective technique for constructing classification rules. This study examines the effectiveness of NBR as a method for constructing classification rules (credit scorecards) in the context of screening credit applicants (credit scoring). For this purpose, the study uses two real-world credit scoring data sets to benchmark NBR against linear discriminant analysis, logistic regression analysis, k-nearest neighbours, classification trees and neural networks. The first data set is taken from a major Greek bank, whereas the second is the Australian Credit Approval data set taken from the UCI Machine Learning Repository (available at http://www.ics.uci.edu/~mlearn/MLRepository.html). The predictive ability of scorecards is measured by the total percentage of correctly classified cases, the Gini coefficient and the bad rate amongst accepts. In each of the data sets, NBR is found to have a lower predictive ability than some of the other five methods under all measures used. Reasons that may negatively affect the predictive ability of NBR relative to that of alternative methods in the context of credit scoring are examined.
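One of the scorecard measures mentioned above, the Gini coefficient, can be sketched in a few lines. The abstract does not define it, so this assumes the definition commonly used in credit scoring, Gini = 2·AUC − 1, where AUC is the probability that a randomly chosen good applicant scores higher than a randomly chosen bad one (ties count one half). The function name and the scores are hypothetical.

```python
def gini_coefficient(good_scores, bad_scores):
    """Gini = 2 * AUC - 1, with AUC computed over all (good, bad) pairs."""
    wins = 0.0
    for g in good_scores:
        for b in bad_scores:
            if g > b:
                wins += 1.0   # good applicant ranked above bad applicant
            elif g == b:
                wins += 0.5   # ties count as half a win
    auc = wins / (len(good_scores) * len(bad_scores))
    return 2.0 * auc - 1.0

# Hypothetical scorecard outputs (higher score = more creditworthy).
goods = [0.9, 0.8, 0.6]
bads = [0.7, 0.3]
print(gini_coefficient(goods, bads))
```

In this toy case five of the six good/bad pairs are ordered correctly, so the AUC is 5/6 and the Gini coefficient is 2/3. A scorecard that separates the two groups perfectly would score 1, and a random one about 0.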

10.
Bayesian model averaging (BMA) is an effective technique for addressing model uncertainty in variable selection problems. However, current BMA approaches have computational difficulty dealing with data in which there are many more measurements (variables) than samples. This paper presents a method for combining ℓ1 regularization and Markov chain Monte Carlo model composition techniques for BMA. By treating the ℓ1 regularization path as a model space, we propose a method to resolve the model uncertainty issues arising in model averaging from solution path point selection. We show that this method is computationally and empirically effective for regression and classification in high-dimensional data sets. We apply our technique in simulations, as well as to some applications that arise in genomics.

11.
In this paper, we propose a new Bayesian inference approach for classification based on the traditional hinge loss used for classical support vector machines, which we call the Bayesian Additive Machine (BAM). Unlike existing approaches, the new model has a semiparametric discriminant function where some feature effects are nonlinear and others are linear. This separation of features is achieved automatically during model fitting without user pre-specification. Following the literature on sparse regression of high-dimensional models, we can also identify the irrelevant features. By introducing spike-and-slab priors using two sets of indicator variables, these multiple goals are achieved simultaneously and automatically, without any parameter tuning such as cross-validation. An efficient partially collapsed Markov chain Monte Carlo algorithm is developed for posterior exploration based on a data augmentation scheme for the hinge loss. Our simulations and three real data examples demonstrate that the new approach is a strong competitor to some recently proposed approaches for challenging classification problems with high dimensionality.

12.
In this paper, we introduce a new estimator of the entropy of a continuous random variable. We compare the proposed estimator with the existing estimators, namely, Vasicek [A test for normality based on sample entropy, J. Roy. Statist. Soc. Ser. B 38 (1976), pp. 54–59], van Es [Estimating functionals related to a density by class of statistics based on spacings, Scand. J. Statist. 19 (1992), pp. 61–72], Correa [A new estimator of entropy, Commun. Statist. Theory and Methods 24 (1995), pp. 2439–2449] and Wieczorkowski-Grzegorewski [Entropy estimators improvements and comparisons, Commun. Statist. Simulation and Computation 28 (1999), pp. 541–567]. We next introduce a new test for normality. By simulation, the powers of the proposed test under various alternatives are compared with those of the normality tests proposed by Vasicek (1976) and Esteban et al. [Monte Carlo comparison of four normality tests using different entropy estimates, Commun. Statist.–Simulation and Computation 30(4) (2001), pp. 761–785].
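For orientation, the baseline that this abstract compares against, Vasicek's spacing-based estimator, can be sketched as follows. This is not the new estimator proposed in the paper. The estimator averages log((n / 2m) · (X(i+m) − X(i−m))) over the ordered sample, with order statistics clamped at the sample minimum and maximum; the function name, the window choice m = 10 and the simulated data are assumptions for illustration.

```python
import math
import random

def vasicek_entropy(sample, m):
    """Vasicek (1976) spacing estimator of differential entropy.

    Assumes a continuous sample without ties; tied values inside a
    window would give a zero spacing and log(0) would raise ValueError.
    """
    n = len(sample)
    if not 1 <= m < n / 2:
        raise ValueError("window m must satisfy 1 <= m < n/2")
    xs = sorted(sample)
    total = 0.0
    for i in range(n):
        upper = xs[min(i + m, n - 1)]  # X_(i+m), clamped at the maximum
        lower = xs[max(i - m, 0)]      # X_(i-m), clamped at the minimum
        total += math.log(n / (2 * m) * (upper - lower))
    return total / n

# Hypothetical usage: standard normal data, whose true entropy is 0.5 * log(2*pi*e).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(500)]
print("estimate:", vasicek_entropy(data, m=10))
print("true value:", 0.5 * math.log(2 * math.pi * math.e))
```

Because the normal distribution maximizes entropy for a given variance, an entropy estimate that falls well below the normal value computed from the sample variance is evidence against normality, which is the idea behind the entropy-based normality tests cited above.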

13.
In classical discriminant analysis, when two multivariate normal distributions with equal variance–covariance matrices are assumed for two groups, the classical linear discriminant function is optimal with respect to maximizing the standardized difference between the means of the two groups. However, for a typical case‐control study, the distributional assumption for the case group often needs to be relaxed in practice. Komori et al. (Generalized t ‐statistic for two‐group classification. Biometrics 2015, 71: 404–416) proposed the generalized t ‐statistic to obtain a linear discriminant function, which allows for heterogeneity of the case group. Their procedure has an optimality property in the class under consideration. We study the problem further and show that additional improvement is achievable. The approach we propose does not require a parametric distributional assumption on the case group. We further show that the new estimator is efficient, in that no further improvement is possible in constructing the linear discriminant function. We conduct simulation studies and analyze real data examples to illustrate the finite sample performance and the gain that the method produces in comparison with existing methods.

14.
In this study, classical and Bayesian inference methods are introduced to analyze lifetime data sets in the presence of left censoring, considering two generalizations of the Lindley distribution: a first generalization proposed by Ghitany et al. [Power Lindley distribution and associated inference, Comput. Statist. Data Anal. 64 (2013), pp. 20–33], denoted as a power Lindley distribution, and a second generalization proposed by Sharma et al. [The inverse Lindley distribution: A stress–strength reliability model with application to head and neck cancer data, J. Ind. Prod. Eng. 32 (2015), pp. 162–173], denoted as an inverse Lindley distribution. In our approach, we use a distribution obtained from these two generalizations, denoted as an inverse power Lindley distribution. A numerical illustration is presented using a dataset of thyroglobulin levels in a group of individuals with differentiated thyroid cancer.

15.
In this paper, we consider the classification of high-dimensional vectors based on a small number of training samples from each class. The proposed method follows the Bayesian paradigm, and it is based on a small vector which can be viewed as the regression of the new observation on the space spanned by the training samples. The classification method provides posterior probabilities that the new vector belongs to each of the classes; hence, it adapts naturally to any number of classes. Furthermore, we show a direct similarity between the proposed method and the multicategory linear support vector machine introduced in Lee et al. [2004. Multicategory support vector machines: theory and applications to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99 (465), 67–81]. We compare the performance of the technique proposed in this paper with the SVM classifier using real-life military and microarray datasets. The study shows that the misclassification errors of both methods are very similar, and that the posterior probabilities assigned to each class are fairly accurate.

16.
Cross-validation has been widely used in the context of statistical linear models and multivariate data analysis. Recently, technological advances have made it possible to collect new types of data that are in the form of curves. Statistical procedures for analysing these data, which are of infinite dimension, are provided by functional data analysis. In functional linear regression, using statistical smoothing, estimation of the slope and intercept parameters is generally based on functional principal components analysis (FPCA), which allows for a finite-dimensional analysis of the problem. The estimators of the slope and intercept parameters in this context, proposed by Hall and Hosseini-Nasab [On properties of functional principal components analysis, J. R. Stat. Soc. Ser. B: Stat. Methodol. 68 (2006), pp. 109–126], are based on FPCA and depend on a smoothing parameter that can be chosen by cross-validation. The cross-validation criterion given there is time-consuming and hard to compute. In this work, we approximate this cross-validation criterion by another criterion so that, in some sense, we can turn to a multivariate data analysis tool. Then, we evaluate its performance numerically. We also treat a real dataset consisting of two variables, temperature and the amount of precipitation, and estimate the regression coefficients for the former variable in a model predicting the latter.

17.
For time series data with obvious periodicity (e.g., electric motor systems and cardiac monitors) or vague periodicity (e.g., earthquake and explosion, speech, and stock data), frequency-based techniques using spectral analysis can usually capture the features of the series. With this approach, we are able not only to reduce the data dimensions by moving to the frequency domain but also to use these frequencies in general classification methods such as linear discriminant analysis (LDA) and k-nearest-neighbor (KNN) to classify the time series. This is a combination of two classical approaches. However, it is difficult to use LDA and KNN in the frequency domain because of the excessive dimensions of the data. We overcome this obstacle by using Singular Value Decomposition to select essential frequencies. Two data sets are used to illustrate our approach. The classification error rates of our simple approach are comparable to those of several more complicated methods.
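The frequency-domain idea in this abstract can be illustrated with a small sketch: each series is mapped to its periodogram, and a new series is given the label of the training series whose periodogram is closest. This is a simplified stand-in, not the authors' procedure: the Singular Value Decomposition step for selecting essential frequencies is omitted, a 1-nearest-neighbour rule replaces general KNN, and the function names and synthetic sine data are assumptions.

```python
import cmath
import math

def periodogram(series):
    """Squared DFT magnitudes (divided by n) at frequencies 1..n//2 of the centered series."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    powers = []
    for k in range(1, n // 2 + 1):
        coef = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                   for t, v in enumerate(centered))
        powers.append(abs(coef) ** 2 / n)
    return powers

def nearest_label(train, query):
    """1-nearest-neighbour rule on periodogram features (squared Euclidean distance)."""
    query_features = periodogram(query)

    def distance(item):
        features = periodogram(item[0])
        return sum((a - b) ** 2 for a, b in zip(features, query_features))

    return min(train, key=distance)[1]

def sine(freq, phase, n=32):
    """Hypothetical synthetic series: one sinusoid with freq cycles over n points."""
    return [math.sin(2 * math.pi * freq * t / n + phase) for t in range(n)]

train = [(sine(2, 0.0), "slow"), (sine(7, 0.0), "fast")]
print(nearest_label(train, sine(7, 1.3)))  # phase-shifted series with 7 cycles
```

The periodogram discards phase, so a shifted copy of a periodic signal lands close to its unshifted training example even though the two series differ point by point in the time domain. That is the practical appeal of classifying in the frequency domain.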

18.
19.
In this article, a sequential correction of two linear methods, linear discriminant analysis (LDA) and the perceptron, is proposed. This correction relies on sequentially adding features on which the classifier is trained. These new features are posterior probabilities determined by a basic classification method such as LDA or the perceptron. In each step, we add the probabilities obtained on a slightly different data set, because the vector of added probabilities varies at each step. We therefore have many classifiers of the same type trained on slightly different data sets. Four different sequential correction methods are presented based on different combining schemes (e.g. mean rule and product rule). Experimental results on different data sets demonstrate that the improvements are efficient, and that this approach outperforms classical linear methods, providing a significant reduction in the mean classification error rate.

20.
High-dimensional sparse modeling with censored survival data is of great practical importance, as exemplified by applications in high-throughput genomic data analysis. In this paper, we propose a class of regularization methods, integrating both the penalized empirical likelihood and pseudoscore approaches, for variable selection and estimation in sparse and high-dimensional additive hazards regression models. When the number of covariates grows with the sample size, we establish asymptotic properties of the resulting estimator and the oracle property of the proposed method. It is shown that the proposed estimator is more efficient than that obtained from the non-concave penalized likelihood approach in the literature. Based on a penalized empirical likelihood ratio statistic, we further develop a nonparametric likelihood approach for testing linear hypotheses on the regression coefficients and for constructing the corresponding confidence regions. Simulation studies are carried out to evaluate the performance of the proposed methodology, and two real data sets are analyzed.
