Similar literature
1.
The k nearest neighbors (k-NN) classifier is one of the most popular methods for statistical pattern recognition and machine learning. In practice, the size k, the number of neighbors used for classification, is usually set arbitrarily to one or some other small number, or chosen by cross-validation. In this study, we propose a novel alternative approach to decide the size k. Based on a k-NN-based multivariate multi-sample test, we assign each k a permutation-test-based Z-score. The number of nearest neighbors is set to the k with the highest Z-score. This approach is computationally efficient since we have derived the formulas for the mean and variance of the test statistic under the permutation distribution for multiple sample groups. Several simulated and real-world data sets are analyzed to investigate the performance of our approach. The usefulness of our approach is demonstrated through the evaluation of prediction accuracies using the Z-score as a criterion to select the size k. We also compare our approach to the widely used cross-validation approaches. The results show that the size k selected by our approach yields high prediction accuracies when informative features are used for classification, whereas the cross-validation approach may fail in some cases.

2.
Supersaturated designs (SSDs) are useful in examining many factors with a restricted number of experimental units. Many analysis methods have been proposed to analyse data from SSDs, with some methods performing better than others when data are normally distributed. It is possible that data sets violate assumptions of standard analysis methods used to analyse data from SSDs, and to date the performance of these analysis methods has not been evaluated using nonnormally distributed data sets. We conducted a simulation study with normally and nonnormally distributed data sets to compare the identification rates, power and coverage of the true models using a permutation test, the stepwise procedure and the smoothly clipped absolute deviation (SCAD) method. Results showed that at the level of significance α=0.01, the identification rates of the true models of the three methods were comparable; however, at α=0.05, both the permutation test and stepwise procedures had considerably lower identification rates than SCAD. For most cases, the three methods produced high power and coverage. The experimentwise error rates (EER) were close to the nominal level (11.36%) for the stepwise method, while they were somewhat higher for the permutation test. The EER for the SCAD method were extremely high (84–87%) for the normal and t-distributions, as well as for data with outliers.

3.
Consider testing multiple hypotheses using tests that can only be evaluated by simulation, such as permutation tests or bootstrap tests. This article introduces MMCTest, a sequential algorithm that gives, with arbitrarily high probability, the same classification as a specific multiple testing procedure applied to ideal p-values. The method can be used with a class of multiple testing procedures that includes the Benjamini and Hochberg false discovery rate procedure and the Bonferroni correction controlling the familywise error rate. One of the key features of the algorithm is that it stops sampling for all the hypotheses that can already be decided as being rejected or non-rejected. MMCTest can be interrupted at any stage and then returns three sets of hypotheses: the rejected, the non-rejected and the undecided hypotheses. A simulation study motivated by actual biological data shows that MMCTest is usable in practice and that, despite the additional guarantee, it can be computationally more efficient than other methods.
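As a point of reference for the abstract above, the following is a minimal sketch of the Benjamini and Hochberg step-up procedure applied to p-values that are assumed to be already known (the "ideal" case). It is not MMCTest itself, which works sequentially on simulated p-values; the function name `benjamini_hochberg` and the default `alpha` are illustrative choices, not taken from the article.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the sorted indices of hypotheses rejected by the BH step-up rule.

    Assumes the p-values are known exactly; this is the classification that a
    simulation-based procedure would try to reproduce.
    """
    m = len(pvalues)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value is below
        # its threshold rank * alpha / m.
        if pvalues[i] <= rank * alpha / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    return sorted(order[:k])
```

Note that the rule is step-up: a hypothesis can be rejected even if its own p-value exceeds its threshold, as long as a larger p-value further down the ordering meets its threshold.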

4.
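Comparison of two classification methods based on mean bottom-up time   Total citations: 1, self-citations: 1, citations by others: 0
金华 (Jin Hua), 《统计研究》 (Statistical Research), 2008, 25(1): 98-103
Abstract: Prognostic prediction and classification methods, such as disease classification systems, are often used to support clinical management decisions. Several classification methods are often available for the same disease population, so it is necessary to compare them in order to identify the best classification, or to find alternatives that are not inferior to the best one. Based on the restricted mean lifetime, this paper introduces a degree-of-separation index to measure the prognostic classification efficiency of a classification method. The index can be used to compare the performance of two classification methods with survival time as the outcome, in particular for non-inferiority and equivalence testing. We give estimation and testing methods for the two degrees of separation based on paired data. Simulation results suggest that the test controls the type I error when the sample size is adequate, and two examples illustrate its application in clinical medicine.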

5.
The F-ratio test for equality of dispersion in two samples is by no means robust, while non-parametric tests either assume a common median, or are not very powerful. Two new permutation tests are presented, which do not suffer from either of these problems. Algorithms for Monte Carlo calculation of P values and confidence intervals are given, and the performance of the tests is studied and compared using Monte Carlo simulations for a range of distributional types. The methods used to speed up Monte Carlo calculations, e.g. stratification, are of wider applicability.

6.
Sunset Salvo     
The Wilcoxon–Mann–Whitney test enjoys great popularity among scientists comparing two groups of observations, especially when measurements made on a continuous scale are non-normally distributed. Triggered by different results for the procedure from two statistics programs, we compared the outcomes from 11 PC-based statistics packages. The findings were that the delivered p values ranged from significant to nonsignificant at the 5% level, depending on whether a large-sample approximation or an exact permutation form of the test was used and, in the former case, whether or not a correction for continuity was used and whether or not a correction for ties was made. Some packages also produced pseudo-exact p values, based on the null distribution under the assumption of no ties. A further crucial point is that the variant of the algorithm used for computation by the packages is rarely indicated in the output or documented in the Help facility and the manuals. We conclude that the only accurate form of the Wilcoxon–Mann–Whitney procedure is one in which the exact permutation null distribution is compiled for the actual data.
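To make the last point of the abstract above concrete, here is a minimal sketch of an exact permutation form of the rank-sum test in which the null distribution is compiled from the actual data, ties included. The helper names `midranks` and `exact_wmw_pvalue`, and the choice of a two-sided p-value based on the absolute deviation of the rank sum from its permutation mean, are assumptions made for illustration; they are not described in the article. Full enumeration is only practical for small samples.

```python
from itertools import combinations


def midranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def exact_wmw_pvalue(x, y):
    """Two-sided exact p-value for the rank sum of x, enumerating all splits."""
    ranks = midranks(list(x) + list(y))
    n_x, n = len(x), len(x) + len(y)
    expected = n_x * (n + 1) / 2  # permutation mean of the rank sum of x
    observed = abs(sum(ranks[:n_x]) - expected)
    count = total = 0
    # Every way of choosing which n_x pooled observations are labelled "x".
    for idx in combinations(range(n), n_x):
        total += 1
        # Small tolerance guards against floating-point noise in the sums.
        if abs(sum(ranks[i] for i in idx) - expected) >= observed - 1e-9:
            count += 1
    return count / total
```

Because the ranks are computed from the pooled data before enumeration, tied observations are handled by the permutation distribution itself rather than by a large-sample correction.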

7.
Summary. An authentic food is one that is what it purports to be. Food processors and consumers need to be assured that, when they pay for a specific product or ingredient, they are receiving exactly what they pay for. Classification methods are an important tool in food authenticity studies where they are used to assign food samples of unknown type to known types. A classification method is developed where the classification rule is estimated by using both the labelled and the unlabelled data, in contrast with many classical methods which use only the labelled data for estimation. This methodology models the data as arising from a Gaussian mixture model with parsimonious covariance structure, as is done in model-based clustering. A missing data formulation of the mixture model is used and the models are fitted by using the EM and classification EM algorithms. The methods are applied to the analysis of spectra of foodstuffs recorded over the visible and near infra-red wavelength range in food authenticity studies. A comparison of the performance of model-based discriminant analysis and the method of classification proposed is given. The classification method proposed is shown to yield very good misclassification rates. The correct classification rate was observed to be as much as 15% higher than the correct classification rate for model-based discriminant analysis.

8.
Under a randomization model for a completely randomized design, permutation tests are considered based on the usual F statistic and on a multi-response permutation procedure statistic. For the first statistic, the first two moments are obtained so that a comparison with the distribution under the normal theory model can be made. The second statistic is shown to converge in distribution to an infinite weighted sum of chi-squared variates, the weights being the limits of the eigenvalues of a matrix depending on the distance measure used and the order statistics of the observations.

9.
10.
Model based labeling for mixture models   Total citations: 1, self-citations: 0, citations by others: 1
Label switching is one of the fundamental problems for Bayesian mixture model analysis. Due to the permutation invariance of the mixture posterior, we can consider that the posterior of an m-component mixture model is a mixture distribution with m! symmetric components and therefore the object of labeling is to recover one of the components. In order to do labeling, we propose to first fit a symmetric m!-component mixture model to the Markov chain Monte Carlo (MCMC) samples and then choose the label for each sample by maximizing the corresponding classification probabilities, which are the probabilities of all possible labels for each sample. Both parametric and semi-parametric ways are proposed to fit the symmetric mixture model for the posterior. Compared to the existing labeling methods, our proposed method aims to approximate the posterior directly and provides the labeling probabilities for all possible labels and thus has a model explanation and theoretical support. In addition, we introduce a situation in which the “ideally” labeled samples are available and thus can be used to compare different labeling methods. We demonstrate the success of our new method in dealing with the label switching problem using two examples.

11.
The depths, which have been used to detect outliers or to extract a representative subset, can be applied to classification. We propose a resampling-based classification method based on the fact that resampling techniques yield a consistent estimator of the distribution of a statistic. The performance of this method was evaluated with eight contaminated models in terms of Correct Classification Rates (CCRs), and the results were compared with other known methods. The proposed method consistently showed higher average CCRs and 4% higher CCR at the maximum compared to other methods. In addition, this method was applied to Berkeley data. The average CCRs were between 0.79 and 0.85.

12.
In this paper, we propose a nonparametric test for homogeneity of overall variabilities for two multi-dimensional populations. Comparisons between the proposed nonparametric procedure and the asymptotic parametric procedure and a permutation test based on standardized generalized variances are made when the underlying populations are multivariate normal. We also study the performance of these test procedures when the underlying populations are non-normal. We observe that the nonparametric procedure and the permutation test based on standardized generalized variances are not as powerful as the asymptotic parametric test under normality. However, they are reliable and powerful tests for comparing overall variability under other multivariate distributions such as the multivariate Cauchy, the multivariate Pareto and the multivariate exponential distributions, even with small sample sizes. A Monte Carlo simulation study is used to evaluate the performance of the proposed procedures. An example from an educational study is used to illustrate the proposed nonparametric test.

13.
To carry out a permutation test we have to examine the n! permutations of the observations. In order to make the permutation test feasible, Dwass (1957) proposed to examine only a sample of these permutations. With the help of sequential methods, we obtain a test which is never less efficient than that proposed by Dwass or the permutation test itself, in the sense that it is as powerful and never requires more permutations to make a decision. In practice, we can expect to gain much efficiency.
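The idea attributed to Dwass (1957) above, examining only a random sample of the permutations, can be sketched as follows. This is a minimal illustration under stated assumptions, not the sequential test proposed in the abstract: the test statistic (absolute difference in means between two groups), the function name `mc_permutation_pvalue`, and the defaults `n_perm=999` and `seed=0` are all choices made here for the example.

```python
import random


def mc_permutation_pvalue(x, y, n_perm=999, seed=0):
    """Monte Carlo permutation p-value for the absolute difference in means."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n_x = len(x)

    def stat(values):
        # Treat the first n_x entries as group x and the rest as group y.
        a, b = values[:n_x], values[n_x:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = stat(pooled)  # computed before any shuffling
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # one random relabelling of the observations
        # Small tolerance guards against floating-point noise in the means.
        if stat(pooled) >= observed - 1e-12:
            hits += 1
    # Count the observed arrangement itself, so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)
```

A fixed number of random permutations is used here; a sequential variant would instead stop sampling as soon as the decision is clear, which is where the efficiency gain described in the abstract would come from.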

14.
A new method of statistical classification (discrimination) is proposed. The method is most effective for high dimension, low sample size data. It uses a robust mean difference as the direction vector and locates the classification boundary by minimizing the error rates. Asymptotic results for assessment and comparison to several popular methods are obtained by using a type of asymptotics of finite sample size and infinite dimensions. The value of the proposed approach is demonstrated by simulations. Real data examples are used to illustrate the performance of different classification methods.

15.
This study compares empirical type I error and power of different permutation techniques that can be used for partial correlation analysis involving three data vectors and for partial Mantel tests. The partial Mantel test is a form of first-order partial correlation analysis involving three distance matrices which is widely used in such fields as population genetics, ecology, anthropology, psychometry and sociology. The methods compared are the following: (1) permute the objects in one of the vectors (or matrices); (2) permute the residuals of a null model; (3) correlate residualized vector 1 (or matrix A) to residualized vector 2 (or matrix B); permute one of the residualized vectors (or matrices); (4) permute the residuals of a full model. In the partial correlation study, the results were compared to those of the parametric t-test which provides a reference under normality. Simulations were carried out to measure the type I error and power of these permutation methods, using normal and non-normal data, without and with an outlier. There were 10 000 simulations for each situation (100 000 when n = 5); 999 permutations were produced per test where permutations were used. The recommended testing procedures are the following: (a) In partial correlation analysis, most methods can be used most of the time. The parametric t-test should not be used with highly skewed data. Permutation of the raw data should be avoided only when highly skewed data are combined with outliers in the covariable. Methods implying permutation of residuals, which are known to only have asymptotically exact significance levels, should not be used when highly skewed data are combined with small sample size. (b) In partial Mantel tests, method 2 can always be used, except when highly skewed data are combined with small sample size. (c) With small sample sizes, one should carefully examine the data before partial correlation or partial Mantel analysis. For highly skewed data, permutation of the raw data has correct type I error in the absence of outliers. When highly skewed data are combined with outliers in the covariable vector or matrix, it is still recommended to use the permutation of raw data. (d) Method 3 should never be used.

16.
Summary. This paper deals with nonparametric methods for combining dependent permutation or randomization tests. In particular, they are nonparametric with respect to the underlying dependence structure. The methods are based on a without-replacement resampling procedure (WRRP) conditional on the observed data, also called conditional simulation, which provides suitable estimates, as good as computing time permits, of the permutational distribution of any statistic. A class C of combining functions is characterized in such a way that all its members, under suitable and reasonable conditions, are found to be consistent and unbiased. Moreover, for some of its members, their almost sure asymptotic equivalence with respect to best tests, in particular cases, is shown. An example of application to a multivariate permutational t-paired test is also discussed.

17.
We consider the problem of constructing multi-class classification methods for analyzing data with complex structure. A nonlinear logistic discriminant model is introduced based on Gaussian basis functions constructed by the self-organizing map. In order to select adjusted parameters, we employ model selection criteria derived from information-theoretic and Bayesian approaches. Numerical examples are conducted to investigate the performance of the proposed multi-class discriminant procedure. Our modeling procedure is also applied to protein structure recognition in life science. The results indicate the effectiveness of our strategy in terms of prediction accuracy.

18.
This is a comparative study of various clustering and classification algorithms as applied to differentiate cancer and non-cancer protein samples using mass spectrometry data. Our study demonstrates the usefulness of a feature selection step prior to applying a machine learning tool. A natural and common choice of a feature selection tool is the collection of marginal p-values obtained from t-tests for testing the intensity differences at each m/z ratio in the cancer versus non-cancer samples. We study the effect of selecting a cutoff in terms of the overall Type 1 error rate control on the performance of the clustering and classification algorithms using the significant features. For the classification problem, we also considered m/z selection using the importance measures computed by the Random Forest algorithm of Breiman. Using a data set of proteomic analysis of serum from ovarian cancer patients and serum from cancer-free individuals in the Food and Drug Administration and National Cancer Institute Clinical Proteomics Database, we undertake a comparative study of the net effect of the machine learning algorithm–feature selection tool–cutoff criteria combination on the performance as measured by an appropriate error rate measure.

19.
Using Monte Carlo simulation, we compare the performance of five asymptotic test procedures and a randomized permutation test procedure for testing the homogeneity of the odds ratio under the stratified matched-pair design. We note that the weighted-least-squares test procedure is liberal, while Pearson's goodness-of-fit (PGF) test procedure with the continuity correction is conservative. We note that PGF without the continuity correction, the conditional likelihood ratio test procedure, and the randomized permutation test procedure can generally perform well with respect to Type I error. We use data taken from a previously published case–control study of endometrial cancer incidence to illustrate the use of these test procedures.

20.
ABSTRACT

In this study, Monte Carlo simulation experiments were employed to examine the performance of four statistical two-group classification methods when the data distributions are skewed and misclassification costs are unequal, conditions frequently encountered in business and economic applications. The classification methods studied are linear and quadratic parametric, nearest neighbor and logistic regression methods. It was found that when skewness is moderate, the parametric methods tend to give best results. Depending on the specific data condition, when skewness is high, either the linear parametric, logistic regression, or the nearest-neighbor method gives the best results. When misclassification costs differ widely across groups, the linear parametric method is favored over the other methods for many of the data conditions studied.
