Similar Articles
20 similar articles found (search time: 15 ms)
1.
The k nearest neighbors (k-NN) classifier is one of the most popular methods for statistical pattern recognition and machine learning. In practice, the size k, the number of neighbors used for classification, is usually set arbitrarily to one or another small value, or chosen by cross-validation. In this study, we propose a novel alternative approach to selecting the size k. Based on a k-NN-based multivariate multi-sample test, we assign each k a permutation-test-based Z-score and set the number of neighbors to the k with the highest Z-score. This approach is computationally efficient because we derive closed-form expressions for the mean and variance of the test statistic under the permutation distribution for multiple sample groups. Several simulated and real-world data sets are analyzed to investigate the performance of our approach. Its usefulness is demonstrated by evaluating prediction accuracies when the Z-score is used as the criterion to select the size k. We also compare our approach to the widely used cross-validation approaches. The results show that the size k selected by our approach yields high prediction accuracies when informative features are used for classification, whereas the cross-validation approach may fail in some cases.
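
As a rough illustration of the selection rule, the sketch below scores each candidate k by an empirical permutation Z-score of a simple k-NN class-agreement statistic; the paper instead derives the permutation mean and variance of its multi-sample test statistic in closed form, so the statistic, the toy data, and `n_perm` here are all illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_agreement(X, y, k):
    """Fraction of neighbour labels agreeing with the query label --
    a simple k-NN multi-sample statistic (illustrative stand-in)."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return np.mean(y[idx[:, 1:]] == y[:, None])   # column 0 is the point itself

def z_score_for_k(X, y, k, n_perm=200, seed=0):
    """Permutation Z-score for a given k; the paper computes the permutation
    mean and variance in closed form instead of simulating them."""
    rng = np.random.default_rng(seed)
    t_obs = knn_agreement(X, y, k)
    t_null = np.array([knn_agreement(X, rng.permutation(y), k)
                       for _ in range(n_perm)])
    return (t_obs - t_null.mean()) / t_null.std()

# Toy two-class data: choose the k with the highest Z-score.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1.5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
best_k = max(range(1, 16), key=lambda k: z_score_for_k(X, y, k))
print("selected k:", best_k)
```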

2.
This article extends REBMIX to multivariate data. Random variables may follow normal, lognormal, or Weibull parametric families and should be independent within components. The initial weights and component parameters are not required. Preprocessing of observations follows the histogram, Parzen window, or k-nearest neighbor approach. The number of components, the weights, and the component parameters are obtained iteratively using information measures of distance, such as the total of positive relative deviations and the information criterion. The number of classes or the number of nearest neighbors can also be optimized. The REBMIX software is available at http://www.fs.uni-lj.si/lavek.

3.
Caren Hasler & Yves Tillé, Statistics, 2016, 50(6):1310-1331
Random imputation is an interesting class of imputation methods for handling item nonresponse because it tends to preserve the distribution of the imputed variable. However, such methods amplify the total variance of the estimators because values are imputed at random. This increase in variance is called the imputation variance. In this paper, we propose a new random hot-deck imputation method based on the k-nearest-neighbour methodology: it replaces the missing value of a unit with the observed value of a similar unit. Calibration and balanced sampling are applied to minimize the imputation variance. Moreover, the proposed method provides triple protection against nonresponse bias: if at least one of three specified models holds, the resulting total estimator is unbiased. Finally, our approach allows the user to perform consistency edits and to impute simultaneously.
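
A minimal sketch of the random hot-deck step, assuming plain random selection among the k nearest donors; the calibration and balanced-sampling machinery that minimizes the imputation variance is omitted, and the names and toy data are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_hot_deck(x_aux, y, k=5, seed=None):
    """Impute missing y values by drawing the observed value of one of the
    k nearest donors in the auxiliary-variable space.  (The paper further
    applies calibration and balanced sampling to reduce imputation variance;
    this sketch uses plain random donor selection.)"""
    rng = np.random.default_rng(seed)
    y = y.astype(float).copy()
    miss = np.isnan(y)
    donors = np.where(~miss)[0]
    nn = NearestNeighbors(n_neighbors=k).fit(x_aux[donors])
    _, idx = nn.kneighbors(x_aux[miss])         # k donor candidates per recipient
    pick = idx[np.arange(idx.shape[0]), rng.integers(0, k, idx.shape[0])]
    y[miss] = y[donors[pick]]                   # map back to original row indices
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=200)
y[rng.random(200) < 0.2] = np.nan               # 20% item nonresponse
print(np.isnan(knn_hot_deck(x, y, k=5, seed=1)).sum())  # 0 after imputation
```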

4.
The trend test is often used for the analysis of 2×K ordered categorical data, in which K pre-specified increasing scores are used. How to assign these scores, and the impact of different score choices on the outcome, have been discussed in the literature. The scores are often assigned based on the data-generating model; when this model is unknown, the trend test is not robust. We discuss the weighted average of the trend test over all scientifically plausible choices of scores or models. This approach is more computationally efficient than the commonly used robust test MAX when K is large. Our discussion applies to any ordered 2×K table, but the simulations and applications to real data focus on case-control genetic association studies. Although no single test is optimal for all choices of scores, our numerical results show that some score-averaging tests can achieve the performance of MAX.
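
A hedged sketch of the score-averaging idea for a 2×K table: the Cochran-Armitage trend statistic is computed for several plausible score vectors and averaged with equal weights (the paper's weighting scheme may differ; the table and score set below are illustrative).

```python
import numpy as np

def trend_z(cases, controls, scores):
    """Cochran-Armitage trend statistic Z for a 2xK table with given scores."""
    cases, controls = np.asarray(cases, float), np.asarray(controls, float)
    scores = np.asarray(scores, float)
    n = cases + controls
    N, R = n.sum(), cases.sum()
    p = R / N
    t = np.dot(scores, cases - n * p)
    var = p * (1 - p) * (np.dot(n, scores ** 2) - np.dot(n, scores) ** 2 / N)
    return t / np.sqrt(var)

# Average the trend statistic over a set of plausible score vectors
# (here: recessive, additive, and dominant codings for K = 3 genotypes).
cases    = [100, 150, 60]
controls = [120, 140, 40]
score_set = [(0, 0, 1), (0, 1, 2), (0, 1, 1)]
z_avg = np.mean([trend_z(cases, controls, s) for s in score_set])
print(round(z_avg, 3))
```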

5.
Abstract

K-means inverse regression was developed as an easy-to-use dimension reduction procedure for multivariate regression. This approach is similar to the original sliced inverse regression method, except that the slices are explicitly produced by a K-means clustering of the response vectors. In this article, we propose K-medoids clustering as an alternative clustering approach for slicing and compare its performance to K-means in a simulation study. Although the two methods often produce comparable results, K-medoids tends to yield better performance in the presence of outliers. Besides isolating outliers, K-medoids clustering has the additional advantage of accommodating a broader range of dissimilarity measures, which could prove useful in other graphical regression applications where slicing is required.
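
The sketch below shows slicing-by-clustering for inverse regression, assuming a univariate response and K-means slices; the article's K-medoids variant would simply replace the clustering step. The whitening and back-transformation follow the standard SIR recipe; the toy model is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_sir(X, Y, n_slices=5, n_dir=1, seed=0):
    """Sliced inverse regression in which the slices come from a K-means
    clustering of the responses (a K-medoids step would be the robust swap)."""
    Xc = X - X.mean(axis=0)
    A = np.linalg.cholesky(np.cov(Xc, rowvar=False))   # cov = A @ A.T
    Z = np.linalg.solve(A, Xc.T).T                     # whitened predictors
    labels = KMeans(n_clusters=n_slices, n_init=10,
                    random_state=seed).fit_predict(Y.reshape(-1, 1))
    M = np.zeros((X.shape[1], X.shape[1]))
    for h in range(n_slices):
        w = labels == h
        m = Z[w].mean(axis=0)
        M += w.mean() * np.outer(m, m)                 # weighted slice means
    _, vecs = np.linalg.eigh(M)
    eta = vecs[:, -n_dir:][:, ::-1]                    # leading eigenvectors
    B = np.linalg.solve(A.T, eta)                      # back to original scale
    return B / np.linalg.norm(B, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
Y = np.sin(X[:, 0] + 0.5 * X[:, 1]) + 0.1 * rng.normal(size=500)
print(kmeans_sir(X, Y, n_slices=8).ravel().round(2))   # ~ ±(0.89, 0.45, 0, 0)
```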

6.
Most classification models reach an imbalanced learning state when trained on imbalanced datasets. This article proposes a novel approach for learning from imbalanced datasets based on an improved SMOTE (Synthetic Minority Over-sampling TEchnique) algorithm. By organically combining over-sampling and under-sampling, the approach selects neighbors in a targeted way and synthesizes samples under different strategies. Experiments show that, after the imbalanced datasets are processed with our algorithm, most classifiers achieve ideal performance on the two-class (positive versus negative) classification problem.
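
For reference, a minimal sketch of the basic SMOTE interpolation step that the article builds on; the improvement itself (targeted neighbor choice combined with under-sampling of the majority class) is not reproduced here, and the toy data are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=None):
    """Basic SMOTE: create synthetic minority samples by interpolating
    between a minority point and one of its k minority-class neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), n_new)            # seed samples
    neigh = idx[base, rng.integers(1, k + 1, n_new)]     # skip self (column 0)
    lam = rng.random((n_new, 1))                         # interpolation weights
    return X_min[base] + lam * (X_min[neigh] - X_min[base])

rng = np.random.default_rng(0)
X_min = rng.normal(loc=2.0, size=(30, 2))                # rare class
X_syn = smote(X_min, n_new=70, seed=1)
print(X_syn.shape)   # (70, 2) synthetic minority samples
```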

7.
ABSTRACT

Cylindrical data are bivariate data arising from the combination of a circular and a linear variable. However, until now no work has been done on the detection of outliers in cylindrical data. We introduce a definition of an outlier for cylindrical data and present a new test of discordancy, based on the k-nearest-neighbour distance, to detect outliers in this type of data. Cut-off points of the new test statistic based on the Johnson-Wehrly distribution are calculated, and its performance is examined by simulation. A practical example is presented using wind speed and wind direction data obtained from the Malaysian Meteorological Department.
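
A heavily hedged sketch of a k-nearest-neighbour discordancy score for cylindrical data: the circular component is embedded as (cos, sin) and the linear component standardised, which is an assumed metric, not the paper's; the Johnson-Wehrly cut-off points are likewise not reproduced, so in practice the threshold would be calibrated by simulation from a fitted null model.

```python
import numpy as np

def knn_distance_scores(theta, x, k=3):
    """Distance of each observation to its k-th nearest neighbour under a
    simple cylindrical metric (assumed here for illustration)."""
    emb = np.column_stack([np.cos(theta), np.sin(theta),
                           (x - x.mean()) / x.std()])
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)
    return d_sorted[:, k]            # column 0 is the zero self-distance

rng = np.random.default_rng(0)
theta = rng.vonmises(mu=0.0, kappa=2.0, size=100)   # wind direction
x = rng.weibull(2.0, size=100) * 5                  # wind speed
x[0], theta[0] = 40.0, np.pi                        # planted outlier
scores = knn_distance_scores(theta, x, k=3)
print("most discordant observation:", scores.argmax())   # -> 0
```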

8.
This paper proposes a new probabilistic classification algorithm using a Markov random field approach. The joint distribution of class labels is explicitly modelled using the distances between feature vectors. Intuitively, a class label should depend more on class labels that are closer in the feature space than on those that are further away. Our approach builds on previous work by Holmes and Adams (J. R. Stat. Soc. Ser. B 64:295–306, 2002; Biometrika 90:99–112, 2003) and Cucala et al. (J. Am. Stat. Assoc. 104:263–273, 2009), and shares many of the advantages of these approaches in providing a probabilistic basis for statistical inference. In comparison to previous work, we present a more efficient computational algorithm to overcome the intractability of the Markov random field model. The results of our algorithm are encouraging in comparison to the k-nearest neighbour algorithm.

9.
In this paper, we investigate the k-nearest neighbours (kNN) estimation of a nonparametric regression model for strongly mixing functional time series data. More precisely, we establish the uniform almost-complete convergence rate of the kNN estimator under some mild conditions. Furthermore, a simulation study and an empirical application to real sea surface temperature (SST) data are carried out to illustrate the finite-sample performance and the usefulness of the kNN approach.
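
A sketch of the point estimator only, assuming curves observed on a common grid and an L2 semi-metric; the paper's contribution (uniform almost-complete convergence rates under strong mixing) is theoretical, and the toy functional data below are illustrative.

```python
import numpy as np

def knn_functional_regression(curves, y, new_curve, k=10):
    """k-NN regression for functional covariates: average the responses of
    the k training curves closest to the new curve in the L2 semi-metric."""
    d = np.sqrt(np.trapz((curves - new_curve) ** 2, t, axis=1))  # L2 distances
    return y[np.argsort(d)[:k]].mean()

# Toy functional data: each covariate is a curve observed on a fine grid.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
a = rng.uniform(-2, 2, 300)
curves = np.sin(np.outer(a, t) * np.pi) + 0.05 * rng.normal(size=(300, 50))
y = a ** 2 + 0.1 * rng.normal(size=300)
print(round(knn_functional_regression(curves, y, np.sin(np.pi * t), k=15), 2))
# ~ 1.0, the value of a^2 at a = 1
```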

10.
Classes of higher-order kernels for estimation of a probability density are constructed by iterating the twicing procedure. Given a kernel K of order l, we build a family of kernels K_m of orders l(m + 1) with the attractive property that their Fourier transforms are simply 1 − {1 − κ(·)}^(m+1), where κ is the Fourier transform of K. These families of higher-order kernels are well suited when the fast Fourier transform is used to speed up the calculation of the kernel estimate or the least-squares cross-validation procedure for selection of the window width. We also compare the theoretical performance of the optimal polynomial-based kernels with that of the iterative twicing kernels constructed from some popular second-order kernels.
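
The construction can be checked numerically. Expanding 1 − {1 − κ(·)}^(m+1) shows K_m is a signed binomial combination of j-fold self-convolutions of K; for a standard Gaussian K these convolutions are N(0, j) densities, giving the closed form used below (the Gaussian choice and the grid are illustrative assumptions).

```python
import numpy as np
from math import comb
from scipy.stats import norm

def twicing_kernel(x, m):
    """Iterated-twicing kernel K_m built from the standard Gaussian kernel:
    1 - (1 - kappa)^{m+1} = sum_j C(m+1, j) (-1)^{j+1} kappa^j, and kappa^j
    is the Fourier transform of the N(0, j) density."""
    return sum((-1) ** (j + 1) * comb(m + 1, j) * norm.pdf(x, scale=np.sqrt(j))
               for j in range(1, m + 2))

# Check the order: for m = 1 and l = 2 the kernel should have order
# l(m + 1) = 4, i.e. zeroth moment 1 and second moment 0.
x = np.linspace(-15, 15, 20001)
k1 = twicing_kernel(x, m=1)
for p in (0, 2, 4):
    print(p, round(np.trapz(x ** p * k1, x), 4))
# -> moment 0 is 1.0, moment 2 is 0.0, moment 4 is nonzero
```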

11.
This paper proposes an algorithm for the classification of multi-dimensional datasets based on conjugate Bayesian Multiple Kernel Grouping Learning (BMKGL). Using a conjugate Bayesian framework improves computational efficiency, and using multiple kernels instead of a single kernel avoids the kernel selection problem, which is itself computationally expensive. Through grouping-parameter learning, BMKGL can simultaneously integrate information from different dimensions and identify the dimensions that contribute most to the variation in the outcome, which aids interpretability. Meanwhile, BMKGL can select the most suitable combination of kernels for the different dimensions, extracting the most appropriate measure for each dimension and improving the accuracy of the classification results. Simulations illustrate that our learning process achieves better prediction performance and stability than some popular classifiers, such as the k-nearest neighbours algorithm, support vector machines, and the naive Bayes classifier. BMKGL also outperforms previous methods in terms of accuracy and interpretation on the heart disease and EEG datasets.

12.
In this paper, we provide probabilistic predictions for soccer games of the 2010 FIFA World Cup by modelling the number of goals scored in a game by each team. We use a Poisson distribution for the number of goals scored by each team in a game, with an unknown scoring rate. We place a Gamma distribution on the scoring rate, with parameters chosen using historical data and differences among teams defined by a strength factor for each team. The strength factor is a measure of discrimination among the national teams obtained from their memberships in fuzzy clusters. The clusters are obtained by applying the fuzzy c-means algorithm to a vector of variables, most of them available on the official FIFA website. Static and dynamic models were used to predict the World Cup outcomes, and the performance of our predictions was evaluated using two comparison methods.
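
A small sketch of the predictive step under the stated Poisson-Gamma model; the Gamma parameters below are invented placeholders, whereas the paper derives them from historical data and the fuzzy-clustering strength factor.

```python
import numpy as np

def match_outcome_probs(alpha_a, beta_a, alpha_b, beta_b,
                        n_sim=100_000, seed=0):
    """Monte Carlo outcome probabilities under the Poisson-Gamma model:
    each team's scoring rate is Gamma(alpha, beta) (beta = rate parameter)
    and goals are Poisson given the rate."""
    rng = np.random.default_rng(seed)
    lam_a = rng.gamma(alpha_a, 1 / beta_a, n_sim)   # numpy uses scale = 1/rate
    lam_b = rng.gamma(alpha_b, 1 / beta_b, n_sim)
    goals_a, goals_b = rng.poisson(lam_a), rng.poisson(lam_b)
    return {"win A": np.mean(goals_a > goals_b),
            "draw":  np.mean(goals_a == goals_b),
            "win B": np.mean(goals_a < goals_b)}

print(match_outcome_probs(alpha_a=3.2, beta_a=2.0,    # stronger team
                          alpha_b=2.0, beta_b=2.0))   # weaker team
```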

13.
Variable selection over a potentially large set of covariates in a linear model is quite popular. In the Bayesian context, common prior choices can lead to a posterior expectation of the regression coefficients that is a sparse (or nearly sparse) vector with a few nonzero components corresponding to the most important covariates. This article extends the "global-local" shrinkage idea to a scenario where one wishes to model multiple response variables simultaneously. Here, we have developed a variable selection method for a K-outcome model (multivariate regression) that identifies the most important covariates across all outcomes. The prior for all regression coefficients is a mean-zero normal with a coefficient-specific variance term that consists of a predictor-specific factor (shared local shrinkage parameter) and a model-specific factor (global shrinkage term) that differs in each model. The performance of our modeling approach is evaluated through simulation studies and a data example.

14.
This paper examines team performance in the NBA over the last five decades. It was motivated by two previous observational studies: one studied the winning percentages of professional baseball teams over time, while the other examined individual player performance in the NBA. These studies viewed professional sports as evolving systems, a view proposed by the evolutionary biologist Stephen Jay Gould, who wrote extensively on the disappearance of .400 hitters in baseball. Gould argued that this disappearance is actually a sign of improvement in the quality of play, reflected in the reduced variability of hitting performance. The previous studies reached similar conclusions for the winning percentages of baseball teams and the performance of individual basketball players. This paper uses multivariate measures of team performance in the NBA to see whether similar characteristics of evolution can be observed. The conclusion does not appear to be as clearly affirmative as in the previous studies, and possible reasons for this are discussed.

15.
In a previous paper, it was demonstrated that distinctly different prediction methods, when applied to 2435 American college and professional football games, resulted in essentially the same fraction of correct selections of the winning team and essentially the same average absolute error for predicting the margin of victory. These results are now extended to 1446 Australian rules football games. Two distinctly different prediction methods are applied. A least-squares method provides a set of ratings; the predicted margin of victory in the next contest is less than the rating difference, corrected for home-ground advantage. A 0.75-power method shrinks the ratings relative to those found by the least-squares technique and then predicts from the rating difference and home-ground advantage. Both methods operate on past margins of victory corrected for home advantage to obtain the ratings. It is shown that both methods perform similarly in terms of the fraction of correct selections of the winning team and the average absolute error for predicting the margin of victory; that is, differing predictors using the same information tend to converge to a limiting level of accuracy. The least-squares approach also provides estimates of the accuracy of each prediction. The home advantage is evaluated for all teams collectively and also for individual teams, and the data permit comparisons with other sports in other countries. The home team appears to have an advantage (and the visiting team a disadvantage) due to three factors: travel fatigue suffered by the visiting team; crowd intimidation by the home team's fans; and the visiting team's lack of familiarity with the playing conditions.
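
A minimal sketch of the least-squares rating step, assuming a common home-advantage term and a sum-to-zero constraint for identifiability; the 0.75-power shrinkage variant and the per-team home advantages are not reproduced, and the toy fixture list is invented.

```python
import numpy as np

def least_squares_ratings(home, away, margin, n_teams):
    """Fit team ratings and a common home advantage by least squares:
    margin ~ r_home - r_away + h, with ratings summing to zero."""
    n_games = len(margin)
    A = np.zeros((n_games + 1, n_teams + 1))
    A[np.arange(n_games), home] = 1.0
    A[np.arange(n_games), away] = -1.0
    A[:n_games, -1] = 1.0                     # home-advantage column
    A[n_games, :n_teams] = 1.0                # sum-to-zero identifiability row
    b = np.append(np.asarray(margin, float), 0.0)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:n_teams], sol[-1]             # ratings, home advantage

# Toy data: (home team, away team, margin of victory for the home side).
home   = [0, 1, 2, 0, 2, 1]
away   = [1, 2, 0, 2, 1, 0]
margin = [8, 3, -4, 12, 10, -6]
ratings, h = least_squares_ratings(home, away, margin, n_teams=3)
print(ratings.round(1), round(h, 1))
# Predicted margin for team 0 at home to team 1: ratings[0] - ratings[1] + h
```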

16.
In this paper, we study the performance of a soccer player by analysing an incomplete data set. To this end, we fit the bivariate Rayleigh distribution to the soccer data set by maximum likelihood, accounting for the missing-data and right-censoring problems that usually arise in such studies. Our aim is to draw inferences about the performance of a soccer player in terms of stress and strength components: the first goal scored by the player of interest in a match is taken as the stress component, and the second goal of the match as the strength component. We propose several methods to overcome the incomplete-data problem and use them to draw inferences about the player's performance.
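
As a rough illustration of the stress-strength idea, the sketch below assumes independent Rayleigh components and complete data, a deliberate simplification of the paper's bivariate, incomplete-data setting; for Rayleigh scales s1 and s2 the closed form is P(X < Y) = s2²/(s1² + s2²).

```python
import numpy as np

def rayleigh_stress_strength(x, y):
    """Plug-in estimate of P(X < Y), the chance the 'strength' variable
    exceeds the 'stress' variable, assuming independent Rayleigh components
    with complete data (a simplification of the paper's bivariate model)."""
    s1_sq = np.mean(np.square(x)) / 2          # Rayleigh MLE of scale^2
    s2_sq = np.mean(np.square(y)) / 2
    return s2_sq / (s1_sq + s2_sq)

rng = np.random.default_rng(0)
x = rng.rayleigh(scale=30.0, size=200)   # e.g. minutes until the first goal
y = rng.rayleigh(scale=45.0, size=200)   # e.g. minutes until the second goal
print(round(rayleigh_stress_strength(x, y), 3))  # ~ 45^2/(30^2+45^2) = 0.692
```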

17.
Summary.  That an individual player or team enjoys periods of good form, and when these occur, is a widely observed phenomenon typically called 'streakiness'. It is interesting to assess which team is a streaky team, or which player is a streaky player, in sports. Such competitors might have a large number of successes during some periods and few or no successes during others, so their success rate is not constant over time. We provide a Bayesian binary segmentation procedure for locating changepoints and the associated success rates simultaneously for these competitors. The procedure is based on a series of nested hypothesis tests, each using the Bayes factor or the Bayesian information criterion. At each stage, we only need to compare a model with one changepoint against a model with a constant success rate; thus, the method circumvents the computational complexity that we would normally face in problems with an unknown number of changepoints. We apply the procedure to data on sports teams and players from basketball, golf and baseball.
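
A compact sketch of the binary segmentation recursion using BIC as the stage-wise criterion (the Bayes factor alternative is not shown); the minimum segment length and the toy success sequence are illustrative assumptions.

```python
import numpy as np

def log_lik(x):
    """Bernoulli log-likelihood at the MLE success rate."""
    n, s = len(x), x.sum()
    if s in (0, n):
        return 0.0
    p = s / n
    return s * np.log(p) + (n - s) * np.log(1 - p)

def segment(x, lo=0, min_len=10):
    """Binary segmentation for changepoints in a success rate: at each stage,
    compare a one-changepoint model with a constant-rate model via BIC and
    recurse on the two halves if the changepoint model wins."""
    n = len(x)
    if n <= 2 * min_len:
        return []
    bic0 = -2 * log_lik(x) + np.log(n)                     # one rate
    bic1, tau = min((-2 * (log_lik(x[:t]) + log_lik(x[t:])) + 3 * np.log(n), t)
                    for t in range(min_len, n - min_len))  # two rates + location
    if bic1 >= bic0:
        return []
    return segment(x[:tau], lo) + [lo + tau] + segment(x[tau:], lo + tau)

rng = np.random.default_rng(0)
x = np.concatenate([rng.random(80) < 0.30,    # cold streak
                    rng.random(60) < 0.65])   # hot streak
print(segment(x.astype(int)))                 # changepoint near index 80
```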

18.
Trimming principles play an important role in robust statistics. However, their use for clustering typically requires some preliminary information about the contamination rate and the number of groups. We suggest a fresh approach to trimming that does not rely on this knowledge and that proves particularly well suited to problems in robust cluster analysis. Our approach replaces the original K-population (robust) estimation problem with K distinct one-population steps, which take advantage of the good breakdown properties of trimmed estimators when the trimming level exceeds the usual bound of 0.5. In this setting, we prove that exact affine equivariance is lost, but that an arbitrarily high breakdown point can be achieved by "anchoring" the robust estimator. We also support the use of adaptive trimming schemes for inferring the contamination rate from the data. A further bonus of our methodology is its ability to provide a reliable choice of the usually unknown number of groups.
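
A toy version of a single one-population step, assuming simple concentration iterations on a trimmed mean; this is not the authors' estimator, but it shows how a trimming level above 0.5 (keeping fewer than half the points) lets a one-population fit lock onto one group at a time.

```python
import numpy as np

def one_population_trimmed_mean(X, start, keep=0.3, n_iter=25):
    """Concentration steps for a heavily trimmed mean: keep the fraction
    `keep` of points closest to the current centre and recompute the mean.
    With keep < 0.5 the trimming level exceeds 0.5, the regime the paper
    exploits; running this from several starts recovers the group centres
    one at a time."""
    centre = np.asarray(start, float)
    h = max(1, int(keep * len(X)))
    for _ in range(n_iter):
        d = np.linalg.norm(X - centre, axis=1)
        centre = X[np.argsort(d)[:h]].mean(axis=0)
    return centre

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(6, 1, (100, 2)),
               rng.uniform(-10, 10, (20, 2))])        # ~10% contamination
print(one_population_trimmed_mean(X, start=X[150]).round(1))  # ~ (6, 6)
```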

19.
In this paper, we propose two SUR-type estimators based on combining the SUR ridge regression and restricted least squares methods. In the sequel these estimators are designated as the restricted ridge Liu estimator and the restricted ridge HK estimator (see Liu in Commun. Statist. Theory Methods 22(2):393–402, 1993; Sarkar in Commun. Statist. A 21:1987–2000, 1992). The study uses Monte Carlo techniques (1,000 replications) under conditions in which a number of factors that may affect estimator performance are varied. The performance of the proposed and some existing estimators is evaluated by means of the TMSE and PR criteria. Our results indicate that the proposed SUR restricted ridge estimators based on K_SUR, K_Sratio, K_Mratio and K̈ produce smaller TMSE and/or PR values than the remaining estimators. In contrast with other ridge estimators, the components of K̈ are defined in terms of the eigenvalues of X*′X* and all lie in the open interval (0, 1).

20.
Making predictions of future realized values of random variables based on currently available data is a frequent task in statistical applications. In some applications, the interest is in obtaining a two-sided simultaneous prediction interval (SPI) that contains at least k out of m future observations with a certain confidence level, based on n previous observations from the same distribution. A closely related problem is to obtain a one-sided upper (or lower) simultaneous prediction bound (SPB) to exceed (or be exceeded by) at least k out of m future observations. In this paper, we provide a general approach for computing SPIs and SPBs based on data from a particular member of the (log-)location-scale family of distributions with complete or right-censored data. The proposed simulation-based procedure provides exact coverage probability for complete and Type II censored data. For Type I censored data, our simulation results show that the procedure performs satisfactorily in small samples. We use three applications to illustrate the proposed simultaneous prediction intervals and bounds.
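
A sketch of the simulation-based calibration for the complete-data normal case: by location-scale invariance it suffices to simulate standard normal samples and solve for the interval factor c by bisection. The censored and general (log-)location-scale extensions treated in the paper are not reproduced here.

```python
import numpy as np

def spi_factor(n, m, k, conf=0.95, n_sim=20_000, seed=0):
    """Monte Carlo calibration of the factor c so that the interval
    xbar ± c*s, computed from n past normal observations, contains at
    least k of m future observations with the stated confidence."""
    rng = np.random.default_rng(seed)
    past = rng.normal(size=(n_sim, n))
    future = rng.normal(size=(n_sim, m))
    xbar = past.mean(axis=1, keepdims=True)
    s = past.std(axis=1, ddof=1, keepdims=True)

    def coverage(c):
        inside = np.abs(future - xbar) <= c * s
        return np.mean(inside.sum(axis=1) >= k)

    lo, hi = 0.0, 20.0
    for _ in range(60):                  # bisection on the coverage curve
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if coverage(mid) < conf else (lo, mid)
    return hi

c = spi_factor(n=30, m=10, k=9, conf=0.95)
print(round(c, 3))   # SPI from the observed sample: xbar ± c*s
```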
