首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 62 毫秒
1.

There has been increasing interest in using semi-supervised learning to form a classifier. As is well known, the (Fisher) information in an unclassified feature with unknown class label is less (considerably less for weakly separated classes) than that of a classified feature which has known class label. Hence in the case where the absence of class labels does not depend on the data, the expected error rate of a classifier formed from the classified and unclassified features in a partially classified sample is greater than that if the sample were completely classified. We propose to treat the labels of the unclassified features as missing data and to introduce a framework for their missingness as in the pioneering work of Rubin (Biometrika 63:581–592, 1976) for missingness in incomplete data analysis. An examination of several partially classified data sets in the literature suggests that the unclassified features are not occurring at random in the feature space, but rather tend to be concentrated in regions of relatively high entropy. It suggests that the missingness of the labels of the features can be modelled by representing the conditional probability of a missing label for a feature via the logistic model with covariate depending on the entropy of the feature or an appropriate proxy for it. We consider here the case of two normal classes with a common covariance matrix where for computational convenience the square of the discriminant function is used as the covariate in the logistic model in place of the negative log entropy. Rather paradoxically, we show that the classifier so formed from the partially classified sample may have smaller expected error rate than that if the sample were completely classified.

  相似文献   

2.
3.
This article enlarges the covariance configurations, on which the classical linear discriminant analysis is based, by considering the four models arising from the spectral decomposition when eigenvalues and/or eigenvectors matrices are allowed to vary or not between groups. As in the classical approach, the assessment of these configurations is accomplished via a test on the training set. The discrimination rule is then built upon the configuration provided by the test, considering or not the unlabeled data. Numerical experiments, on simulated and real data, have been performed to evaluate the gain of our proposal with respect to the linear discriminant analysis.  相似文献   

4.
The problem of updating a discriminant function on the basis of data of unknown origin is studied. There are observations of known origin from each of the underlying populations, and subsequently there is available a limited number of unclassified observations assumed to have been drawn from a mixture of the underlying populations. A sample discriminant function can be formed initially from the classified data. The question of whether the subsequent updating of this discriminant function on the basis of the unclassified data produces a reduction in the error rate of sufficient magnitude to warrant the computational effort is considered by carrying out a series of Monte Carlo experiments. The simulation results are contrasted with available asymptotic results.  相似文献   

5.
The current paradigm for the identification of candidate drugs within the pharmaceutical industry typically involves the use of high-throughput screens. High-content screening (HCS) is the term given to the process of using an imaging platform to screen large numbers of compounds for some desirable biological activity. Classification methods have important applications in HCS experiments, where they are used to predict which compounds have the potential to be developed into new drugs. In this paper, a new classification method is proposed for batches of compounds where the rule is updated sequentially using information from the classification of previous batches. This methodology accounts for the possibility that the training data are not a representative sample of the test data and that the underlying group distributions may change as new compounds are analysed. This technique is illustrated on an example data set using linear discriminant analysis, k-nearest neighbour and random forest classifiers. Random forests are shown to be superior to the other classifiers and are further improved by the additional updating algorithm in terms of an increase in the number of true positives as well as a decrease in the number of false positives.  相似文献   

6.
The main contribution of this paper is is updating a nonlinear discriminant function on the basis of data of unknown origin. Specifically a procedure is developed for updating the nonlinear discriminant function on the basis of two Burr Type III distributions (TBIIID) when the additional observations are mixed or classified. First the nonlinear discriminant function of the assumed model is obtained. Then the total probabilities of misclassification are calculated. In addition a Monte carlo simulation runs are used to compute the relative efficiencies in order to investigate the performance of the developed updating procedures. Finally the results obtained in this paper are illustrated through a real and simulated data set.  相似文献   

7.
The performance of Anderson's classification statistic based on a post-stratified random sample is examined. It is assumed that the training sample is a random sample from a stratified population consisting of two strata with unknown stratum weights. The sample is first segregated into the two strata by post-stratification. The unknown parameters for each of the two populations are then estimated and used in the construction of the plug-in discriminant. Under this procedure, it is shown that additional estimation of the stratum weight will not seriously affect the performance of Anderson's classification statistic. Furthermore, our discriminant enjoys a much higher efficiency than the procedure based on an unclassified sample from a mixture of normals investigated by Ganesalingam and McLachlan (1978).  相似文献   

8.
The sample linear discriminant function (LDF) is known to perform poorly when the number of features p is large relative to the size of the training samples, A simple and rarely applied alternative to the sample LDF is the sample Euclidean distance classifier (EDC). Raudys and Pikelis (1980) have compared the sample LDF with three other discriminant functions, including thesample EDC, when classifying individuals from two spherical normal populations. They have concluded that the sample EDC outperforms the sample LDF when p is large relative to the training sample size. This paper derives conditions for which the two classifiers are equivalent when all parameters are known and employs a Monte Carlo simulation to compare the sample EDC with the sample LDF no only for the spherical normal case but also for several nonspherical parameter configurations. Fo many practical situations, the sample EDC performs as well as or superior to the sample LDF, even for nonspherical covariance configurations.  相似文献   

9.
A new method of discrimination and classification based on a Hausdorff type distance is proposed. In two groups, the Hausdorff distance is defined as the sum of the furthest distance of the nearest elements of one set to another. This distance has some useful properties and is exploited in developing a discriminant criterion between individual objects belonging to two groups based on a finite number of classification variables. The discrimination criterion is generalized to more than two groups in a couple of ways. Several data sets are analysed and their classification accuracy is compared to that obtained from linear discriminant function and the results are encouraging. The method in simple, lends itself to parallel computation and imposes less stringent conditions on the data.  相似文献   

10.
In this article, a sequential correction of two linear methods: linear discriminant analysis (LDA) and perceptron is proposed. This correction relies on sequential joining of additional features on which the classifier is trained. These new features are posterior probabilities determined by a basic classification method such as LDA and perceptron. In each step, we add the probabilities obtained on a slightly different data set, because the vector of added probabilities varies at each step. We therefore have many classifiers of the same type trained on slightly different data sets. Four different sequential correction methods are presented based on different combining schemas (e.g. mean rule and product rule). Experimental results on different data sets demonstrate that the improvements are efficient, and that this approach outperforms classical linear methods, providing a significant reduction in the mean classification error rate.  相似文献   

11.
This article introduces BestClass, a set of SAS macros, available in the mainframe and workstation environment, designed for solving two-group classification problems using a class of recently developed nonparametric classification methods. The criteria used to estimate the classification function are based on either minimizing a function of the absolute deviations from the surface which separates the groups, or directly minimizing a function of the number of misclassified entities in the training sample. The solution techniques used by BestClass to estimate the classification rule use the mathematical programming routines of the SAS/OR software. Recently, a number of research studies have reported that under certain data conditions this class of classification methods can provide more accurate classification results than existing methods, such as Fisher's linear discriminant function and logistic regression. However, these robust classification methods have not yet been implemented in the major statistical packages, and hence are beyond the reach of those statistical analysts who are unfamiliar with mathematical programming techniques. We use a limited simulation experiment and an example to compare and contrast properties of the methods included in Best-Class with existing parametric and nonparametric methods. We believe that BestClass contributes significantly to the field of nonparametric classification analysis, in that it provides the statistical community with convenient access to this recently developed class of methods. BestClass is available from the authors.  相似文献   

12.
This paper discusses a supervised classification approach for the differential diagnosis of Raynaud's phenomenon (RP). The classification of data from healthy subjects and from patients suffering for primary and secondary RP is obtained by means of a set of classifiers derived within the framework of linear discriminant analysis. A set of functional variables and shape measures extracted from rewarming/reperfusion curves are proposed as discriminant features. Since the prediction of group membership is based on a large number of these features, the high dimension/small sample size problem is considered to overcome the singularity problem of the within-group covariance matrix. Results on a data set of 72 subjects demonstrate that a satisfactory classification of the subjects can be achieved through the proposed methodology.  相似文献   

13.
Several mathematical programming approaches to the classification problem in discriminant analysis have recently been introduced. This paper empirically compares these newly introduced classification techniques with Fisher's linear discriminant analysis (FLDA), quadratic discriminant analysis (QDA), logit analysis, and several rank-based procedures for a variety of symmetric and skewed distributions. The percent of correctly classified observations by each procedure in a holdout sample indicate that while under some experimental conditions the linear programming approaches compete well with the classical procedures, overall, however, their performance lags behind that of the classical procedures.  相似文献   

14.
In this paper, a simulation study is conducted to systematically investigate the impact of different types of missing data on six different statistical analyses: four different likelihood‐based linear mixed effects models and analysis of covariance (ANCOVA) using two different data sets, in non‐inferiority trial settings for the analysis of longitudinal continuous data. ANCOVA is valid when the missing data are completely at random. Likelihood‐based linear mixed effects model approaches are valid when the missing data are at random. Pattern‐mixture model (PMM) was developed to incorporate non‐random missing mechanism. Our simulations suggest that two linear mixed effects models using unstructured covariance matrix for within‐subject correlation with no random effects or first‐order autoregressive covariance matrix for within‐subject correlation with random coefficient effects provide well control of type 1 error (T1E) rate when the missing data are completely at random or at random. ANCOVA using last observation carried forward imputed data set is the worst method in terms of bias and T1E rate. PMM does not show much improvement on controlling T1E rate compared with other linear mixed effects models when the missing data are not at random but is markedly inferior when the missing data are at random. Copyright © 2009 John Wiley & Sons, Ltd.  相似文献   

15.
In this paper, we investigate the problem of updating a discriminant function on the basis of data of unknown origin. We consider the updating procedure for the nonlinear discriminant function on the basis of two inverse Weibull distributions in situations when the additional observations are mixed or classified. Then, we introduce the nonlinear discriminant function of the underlying model. Also, we calculate the total probabilities of misclassification. In addition, we investigate the performance of the updating procedures through series of simulation experiments by means of the relative efficiencies. Finally, we analyze a simulated data set by using the findings of the paper.  相似文献   

16.
This paper develops statistical inference for population quantiles based on a partially rank-ordered set (PROS) sample design. A PROS sample design is similar to a ranked set sample with some clear differences. This design first creates partially rank-ordered subsets by allowing ties whenever the units in a set cannot be ranked with high confidence. It then selects a unit for full measurement at random from one of these partially rank-ordered subsets. The paper develops a point estimator, confidence interval and hypothesis testing procedure for the population quantile of order p. Exact, as well as asymptotic, distribution of the test statistic is derived. It is shown that the null distribution of the test statistic is distribution-free, and statistical inference is reasonably robust against possible ranking errors in ranking process.  相似文献   

17.
A procedure is presented for finding maximum likelihood estimates of the parameters of a mixture of two random walk distributions in two cases, using classified and unclassified observations. Based on small sample size, estimation of nonlinear discriminant functions is considered. Throughout simulation experiments, the performance of the corresponding estimated nonlinear discriminant functions is investigated. The total probabilities of misclassification and percentage biases are evaluated and discussed.  相似文献   

18.
Typical joint modeling of longitudinal measurements and time to event data assumes that two models share a common set of random effects with a normal distribution assumption. But, sometimes the underlying population that the sample is extracted from is a heterogeneous population and detecting homogeneous subsamples of it is an important scientific question. In this paper, a finite mixture of normal distributions for the shared random effects is proposed for considering the heterogeneity in the population. For detecting whether the unobserved heterogeneity exits or not, we use a simple graphical exploratory diagnostic tool proposed by Verbeke and Molenberghs [34] to assess whether the traditional normality assumption for the random effects in the mixed model is adequate. In the joint modeling setting, in the case of evidence against normality (homogeneity), a finite mixture of normals is used for the shared random-effects distribution. A Bayesian MCMC procedure is developed for parameter estimation and inference. The methodology is illustrated using some simulation studies. Also, the proposed approach is used for analyzing a real HIV data set, using the heterogeneous joint model for this data set, the individuals are classified into two groups: a group with high risk and a group with moderate risk.  相似文献   

19.
DECORATE (Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples) is a classifier combination technique to construct a set of diverse base classifiers using additional artificially generated training instances. The predictions from the base classifiers are then integrated into one by the mean combination rule. In order to gain more insight about its effectiveness and advantages, this paper utilizes a large experiment to study the bias–variance analysis of DECORATE as well as some other widely used ensemble methods (such as bagging, AdaBoost, random forest) at different training sample sizes. The experimental results yield the following conclusions. For small training sets, DECORATE has a dominant advantage over its rivals and its success is attributed to the larger bias reduction achieved by it than the other algorithms. With increase in training data, AdaBoost benefits most and the bias reduced by it gradually turns to be significant while its variance reduction is also medium. Thus, AdaBoost performs best with large training samples. Moreover, random forest behaves always second best regardless of small or large training sets and it is seen to mainly decrease variance while maintaining low bias. Bagging seems to be an intermediate one since it reduces variance primarily.  相似文献   

20.
ABSTRACT

Classification of data consisting of both categorical and continuous variables between two groups is often handled by the sample location linear discriminant function confined to each of the locations specified by the observed values of the categorical variables. Homoscedasticity of across-location conditional dispersion matrices of the continuous variables is often assumed. Quite often, interactions between continuous and categorical variables cause across-location heteroscedasticity. In this article, we examine the effect of heterogeneous across-location conditional dispersion matrices on the overall expected and actual error rates associated with the sample location linear discriminant function. Performance of the sample location linear discriminant function is evaluated against the results for the restrictive classifier adjusted for across-location heteroscedasticity. Conclusions based on a Monte Carlo study are reported.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号