Similar Documents
20 similar documents found (search time: 31 ms).
1.
In this article, two classifiers that generalize the nearest neighbor method are introduced and studied. The first is based on calculating the distances to all objects in a learning sample; the second additionally considers the directions of the objects. Both have locally nonlinear classification borders. The classifiers are evaluated on a number of real and artificial datasets using several methods of error estimation.
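In its simplest form, the distance-to-all-objects idea reduces to nearest-neighbour classification. A minimal 1-NN sketch (the dataset, labels, and function name are illustrative, not from the paper):

```python
import math

def nearest_neighbor_predict(train, labels, x):
    """Classify x by the label of the closest training object (Euclidean distance)."""
    dists = [math.dist(p, x) for p in train]
    return labels[dists.index(min(dists))]

# illustrative two-class learning sample
train = [(0.0, 0.0), (0.1, 0.2), (3.0, 3.1), (2.9, 3.3)]
labels = ["a", "a", "b", "b"]

print(nearest_neighbor_predict(train, labels, (0.2, 0.1)))  # closer to class "a"
```

The generalizations in the paper replace the single nearest distance with functions of the full distance (and direction) profile, which is what produces the locally nonlinear borders.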

2.
The main problem with localized discriminant techniques is the curse of dimensionality, which seems to restrict their use to the case of few variables. However, if localization is combined with a reduction of dimension, the initial number of variables is less restricted. In particular, it is shown that localization yields powerful classifiers even in higher dimensions when combined with locally adaptive selection of predictors. A robust localized logistic regression (LLR) method is developed for which all tuning parameters are chosen data-adaptively. In an extended simulation study we evaluate the potential of the proposed procedure for various types of data and compare it to other classification procedures. In addition, we demonstrate that automatic choice of localization, predictor selection and penalty parameters based on cross-validation works well. Finally, the method is applied to real data sets and its real-world performance is compared to alternative procedures.

3.
Bayesian model learning based on a parallel MCMC strategy
We introduce a novel Markov chain Monte Carlo algorithm for estimation of posterior probabilities over discrete model spaces. Our learning approach is applicable to families of models for which the marginal likelihood can be analytically calculated, either exactly or approximately, given any fixed structure. It is argued that for certain model neighborhood structures, the ordinary reversible Metropolis-Hastings algorithm does not yield an appropriate solution to the estimation problem. Therefore, we develop an alternative, non-reversible algorithm which can avoid the scaling effect of the neighborhood. To efficiently explore a model space, a finite number of interacting parallel stochastic processes is utilized. Our interaction scheme enables exploration of several local neighborhoods of a model space simultaneously, while it prevents the absorption of any particular process to a relatively inferior state. We illustrate the advantages of our method by an application to a classification model. In particular, we use an extensive bacterial database and compare our results with results obtained by different methods for the same data.
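As background, the ordinary reversible Metropolis-Hastings sampler over a discrete space that the authors improve upon can be sketched as follows; the neighborhood-size correction in the acceptance ratio is exactly the scaling effect discussed above. This is a standard single-chain sketch on a toy state space, not the authors' non-reversible interacting-chains scheme:

```python
import random

def metropolis_discrete(target, neighbors, start, n_steps, rng):
    """Metropolis-Hastings over a discrete space: propose uniformly among the
    current state's neighbors, and correct the acceptance ratio for unequal
    neighborhood sizes so the chain targets `target`."""
    counts = {s: 0 for s in target}
    x = start
    for _ in range(n_steps):
        y = rng.choice(neighbors[x])
        # proposal q(x->y) = 1/|N(x)|, so the ratio needs |N(x)|/|N(y)|
        a = (target[y] / target[x]) * (len(neighbors[x]) / len(neighbors[y]))
        if rng.random() < a:
            x = y
        counts[x] += 1
    return {s: c / n_steps for s, c in counts.items()}

# toy path-graph model space with known target probabilities
target = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
freq = metropolis_discrete(target, neighbors, 0, 150_000, random.Random(1))
```

Without the `len(neighbors[x]) / len(neighbors[y])` factor, endpoint states of the path would be over- or under-visited, which is the neighborhood scaling problem the abstract refers to.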

4.
Feature selection is an important technique for ultrahigh-dimensional data analysis. Most feature selection methods, such as SIS and its variants, depend heavily on specified model structures. Furthermore, feature interactions are usually not taken into account in the existing literature. In this paper, we present a novel feature selection method for models with variable interactions that requires no structural assumption. The new ranking criterion is therefore flexible and can handle models that contain interactions. Moreover, the new screening procedures are simple; consequently, they are computationally efficient, and theoretical properties such as ranking consistency and the sure screening property are easily obtained. Several real and simulated examples are presented to illustrate the methodology.
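For contrast with the structure-free, interaction-aware criterion proposed here, classical SIS-style marginal screening simply ranks features by absolute marginal correlation with the response. A minimal sketch (the toy data are illustrative):

```python
def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

def sis_rank(X, y):
    """Rank features (column indices) by absolute marginal correlation with y."""
    p = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    return sorted(range(p), key=lambda j: -scores[j])

# toy data: feature 0 drives y, feature 1 is noise
X = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 9]]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(sis_rank(X, y))
```

Purely marginal criteria like this can miss features that matter only through interactions, which is the gap the abstract's method addresses.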

5.
Distance between two probability densities or two random variables is a well-established concept in statistics. The present paper considers generalizations of distances to separation measurements for three or more elements in a function space. Geometric intuition and examples from hypothesis testing suggest lower and upper bounds for such measurements in terms of pairwise distances; moreover, in Lp spaces some useful non-pairwise separation measurements always lie within these bounds. Examples of such separation measurements are the Bayes probability of correct classification among several arbitrary distributions, and the expected range among several random variables.
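The Bayes probability of correct classification among several known densities can be approximated by grid integration of the pointwise maximum of the prior-weighted densities. A sketch for equal-prior, unit-variance normals (the grid limits and step are arbitrary illustrative choices):

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_correct(mus, lo=-10.0, hi=12.0, step=0.001):
    """P(correct) of the Bayes rule among equal-prior unit-variance normals,
    via a Riemann sum of max_k (1/K) f_k(x) on a grid."""
    k = len(mus)
    total, x = 0.0, lo
    while x < hi:
        total += max(normal_pdf(x, m) / k for m in mus) * step
        x += step
    return total

p = bayes_correct([0.0, 2.0])
print(p)
```

For two unit normals at distance 2 with equal priors, the closed-form value is Phi(1), roughly 0.841, so the grid approximation can be checked directly.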

6.
Indices of Dependence Between Types in Multivariate Point Patterns
We propose new summary statistics quantifying several forms of dependence between points of different types in a multi-type spatial point pattern. These statistics are the multivariate counterparts of the J-function for point processes of a single type, introduced by Van Lieshout & Baddeley (1996). They are based on comparing the distances from a type i point to either the nearest type j point or the nearest point of any type with the same distances seen from an arbitrary point in space. Information about the range of interaction can also be inferred. Our statistics can be computed explicitly for a range of well-known multivariate point process models. Some applications to bivariate and trivariate data sets are presented as well.
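The empirical building block of such statistics is the set of distances from each type i point to the nearest type j point. A minimal sketch on a toy pattern, with no edge correction (the point coordinates and type labels are illustrative):

```python
import math

def cross_nn_distances(points, types, i, j):
    """For each point of type i, the distance to the nearest point of type j."""
    src = [p for p, t in zip(points, types) if t == i]
    dst = [p for p, t in zip(points, types) if t == j]
    return [min(math.dist(p, q) for q in dst) for p in src]

points = [(0, 0), (1, 0), (0, 1), (5, 5), (5, 6)]
types = ["i", "i", "j", "j", "i"]
d = cross_nn_distances(points, types, "i", "j")
print(d)
```

The empirical distribution of these distances, compared with the same distances measured from arbitrary locations, is what the proposed multivariate J-type statistics summarize.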

7.
Survey calibration methods minimally modify sample weights to satisfy domain-level benchmark constraints (BC), e.g. census totals. This allows exploitation of auxiliary information to improve the representativeness of sample data (addressing coverage limitations and non-response) and the quality of sample-based estimates of population parameters. Calibration methods may fail with samples presenting small or zero counts for some benchmark groups, or when range restrictions (RR), such as positivity, are imposed to avoid unrealistic or extreme weights. User-defined modifications of BC/RR performed after encountering non-convergence allow little control over the solution, and penalisation approaches modelling infeasibility may not guarantee convergence. Paradoxically, this has led to underuse in calibration of highly disaggregated information, when available. We present an always-convergent, flexible two-step global optimisation (GO) survey calibration approach. The feasibility of the calibration problem is assessed, and automatically controlled minimum errors in BC or changes in RR are allowed to guarantee convergence in advance, while preserving the good properties of calibration estimators. Modelling alternatives under different scenarios using various error/change and distance measures are formulated and discussed. The GO approach is validated by calibrating the weights of the 2012 Health Survey for England to a fine age–gender–region cross-tabulation (378 counts) from the 2011 Census in England and Wales.
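Classical calibration can be illustrated by raking (iterative proportional fitting), which rescales weights to match each set of benchmark totals in turn. This is a textbook baseline, not the authors' global-optimisation procedure, and the sample below is a toy:

```python
def rake(weights, groups_list, benchmarks_list, n_iter=50):
    """Iterative proportional fitting: cycle over the benchmark margins,
    rescaling weights so weighted group totals match each margin in turn."""
    w = list(weights)
    for _ in range(n_iter):
        for groups, bench in zip(groups_list, benchmarks_list):
            totals = {}
            for g, wi in zip(groups, w):
                totals[g] = totals.get(g, 0.0) + wi
            w = [wi * bench[g] / totals[g] for g, wi in zip(groups, w)]
    return w

# toy sample: sex and age group of 4 respondents, unit starting weights
sex = ["m", "m", "f", "f"]
age = ["young", "old", "young", "old"]
w = rake([1, 1, 1, 1], [sex, age], [{"m": 60, "f": 40}, {"young": 30, "old": 70}])
```

The failure modes discussed in the abstract are visible here: a benchmark group with zero sample count makes a `totals[g]` division impossible, and nothing prevents extreme weights without explicit range restrictions.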

8.
Although quantile regression estimators are robust against low leverage observations with atypically large responses (Koenker & Bassett 1978), they can be seriously affected by a few points that deviate from the majority of the sample covariates. This problem can be alleviated by downweighting observations with high leverage. Unfortunately, when the covariates are not elliptically distributed, Mahalanobis distances may not be able to correctly identify atypical points. In this paper the authors discuss the use of weights based on a new leverage measure constructed using Rosenblatt's multivariate transformation which is able to reflect nonelliptical structures in the covariate space. The resulting weighted estimators are consistent, asymptotically normal, and have a bounded influence function. In addition, the authors also discuss a selection criterion for choosing the downweighting scheme. They illustrate their approach with child growth data from Finland. Finally, their simulation studies suggest that this methodology has good finite-sample properties.
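As background for the leverage-based weighting, a plain Mahalanobis-distance downweighting scheme (the classical approach the paper improves on, not the Rosenblatt-transformation measure it proposes) looks like this in two dimensions; the cutoff value and data are illustrative:

```python
def mahalanobis_weights(X, cutoff):
    """Weight w_i = min(1, cutoff / d_i), where d_i is the Mahalanobis distance
    of x_i from the sample mean (2-D case, closed-form covariance inverse;
    covariance uses divisor n)."""
    n = len(X)
    mx = sum(x for x, _ in X) / n
    my = sum(y for _, y in X) / n
    sxx = sum((x - mx) ** 2 for x, _ in X) / n
    syy = sum((y - my) ** 2 for _, y in X) / n
    sxy = sum((x - mx) * (y - my) for x, y in X) / n
    det = sxx * syy - sxy ** 2
    ws = []
    for x, y in X:
        dx, dy = x - mx, y - my
        d = ((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det) ** 0.5
        ws.append(min(1.0, cutoff / d) if d > 0 else 1.0)
    return ws

X = [(0, 0), (1, 1), (2, 2), (1, 0), (0, 1), (10, -10)]  # last point: high leverage
w = mahalanobis_weights(X, cutoff=2.1)
```

With elliptical covariates this works well; the paper's point is that for nonelliptical covariate structures such distances can fail to flag the truly atypical points.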

9.
Most methods for variable selection work from the top down and steadily remove features until only a small number remain. They often rely on a predictive model, and there are usually significant disconnections in the sequence of methodologies that leads from the training samples to the choice of the predictor, then to variable selection, then to choice of a classifier, and finally to classification of a new data vector. In this paper we suggest a bottom-up approach that brings the choices of variable selector and classifier closer together, by basing the variable selector directly on the classifier, removing the need to involve predictive methods in the classification decision, and enabling the direct and transparent comparison of different classifiers in a given problem. Specifically, we suggest 'wrapper methods', determined by classifier type, for choosing variables that minimize the classification error rate. This approach is particularly useful for exploring relationships among the variables that are chosen for the classifier. It reveals which variables have a high degree of leverage for correct classification using different classifiers; it shows which variables operate in relative isolation, and which are important mainly in conjunction with others; it permits quantification of the authority with which variables are selected; and it generally leads to a reduced number of variables for classification, in comparison with alternative approaches based on prediction.
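A minimal wrapper method of this kind is greedy forward selection of the features minimizing the leave-one-out error of a chosen classifier. The sketch below uses a 1-NN classifier; the classifier choice, data, and stopping rule are illustrative, not the paper's:

```python
def loo_error(X, y, feats):
    """Leave-one-out error rate of a 1-NN classifier restricted to `feats`
    (squared Euclidean distance; the sqrt is unnecessary for ranking)."""
    err = 0
    for i in range(len(X)):
        best, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue
            d = sum((X[i][f] - X[j][f]) ** 2 for f in feats)
            if d < best_d:
                best, best_d = y[j], d
        err += best != y[i]
    return err / len(X)

def forward_select(X, y, max_feats=2):
    """Greedy wrapper: repeatedly add the feature that most reduces LOO error."""
    chosen = []
    for _ in range(max_feats):
        cand = [(loo_error(X, y, chosen + [f]), f)
                for f in range(len(X[0])) if f not in chosen]
        _, f = min(cand)
        chosen.append(f)
    return chosen

# feature 1 separates the classes; features 0 and 2 are noise
X = [[5, 0.0, 7], [1, 0.2, 3], [9, 0.1, 8],
     [2, 1.0, 1], [8, 1.2, 9], [4, 1.1, 2]]
y = [0, 0, 0, 1, 1, 1]
print(forward_select(X, y, max_feats=1))
```

Because the selection criterion is the classifier's own error rate, swapping in a different classifier gives a directly comparable variable ranking, which is the transparency the abstract emphasizes.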

10.
A joinpoint regression model identifies significant changes in the trends of the incidence, mortality, and survival of a specific disease in a given population. The purpose of the present study is to develop an age-stratified Bayesian joinpoint regression model to describe mortality trends, assuming that the observed counts are probabilistically characterized by the Poisson distribution. The proposed model is based on Bayesian model selection criteria, with the smallest number of joinpoints that is sufficient to explain the Annual Percentage Change. The prior probability distributions are chosen in such a way that they are automatically derived from the model index contained in the model space. The proposed model and methodology estimate age-adjusted mortality rates in different epidemiological studies to compare trends while accounting for the confounding effects of age. In developing the subject methods, we use the cancer mortality counts of adult lung and bronchus cancer patients, and brain and other central nervous system cancer patients, obtained from the Surveillance, Epidemiology, and End Results database of the National Cancer Institute.

11.
We introduce a flexible marginal modelling approach for statistical inference for clustered and longitudinal data under minimal assumptions. This estimated estimating equations approach is semiparametric and the proposed models are fitted by quasi-likelihood regression, where the unknown marginal means are a function of the fixed effects linear predictor with unknown smooth link, and variance–covariance is an unknown smooth function of the marginal means. We propose to estimate the nonparametric link and variance–covariance functions via smoothing methods, whereas the regression parameters are obtained via the estimated estimating equations. These are score equations that contain nonparametric function estimates. The proposed estimated estimating equations approach is motivated by its flexibility and easy implementation. Moreover, if data follow a generalized linear mixed model, with either a specified or an unspecified distribution of random effects and link function, the model proposed emerges as the corresponding marginal (population-average) version and can be used to obtain inference for the fixed effects in the underlying generalized linear mixed model, without the need to specify any other components of this generalized linear mixed model. Among marginal models, the estimated estimating equations approach provides a flexible alternative to modelling with generalized estimating equations. Applications of estimated estimating equations include diagnostics and link selection. The asymptotic distribution of the proposed estimators for the model parameters is derived, enabling statistical inference. Practical illustrations include Poisson modelling of repeated epileptic seizure counts and simulations for clustered binomial responses.

12.
Three linear prediction methods for a single missing value in a stationary first-order multiplicative spatial autoregressive model are proposed, based on the quarter observations, the observations in the first neighborhood, and the observations in the nearest neighborhood. Three different types of innovations are considered: Gaussian (symmetric and thin-tailed), exponential (skewed to the right), and asymmetric Laplace (skewed and heavy-tailed). In each case, the proposed predictors are compared using two well-known criteria: mean squared prediction error and Pitman's measure of closeness. Parameter estimation is performed by maximum likelihood, least squares, and Markov chain Monte Carlo (MCMC).
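The simplest neighbourhood-based predictor of this kind averages the available first-neighbourhood observations. This naive sketch ignores the autoregressive weighting of the model-based linear predictors studied in the paper and is only meant to show the neighbourhood structure (the grid values are illustrative):

```python
def predict_missing(grid, i, j):
    """Predict the missing value at (i, j) as the mean of its available
    first-neighborhood observations (up, down, left, right); `None` marks
    missing cells."""
    vals = []
    for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        ni, nj = i + di, j + dj
        if 0 <= ni < len(grid) and 0 <= nj < len(grid[0]) and grid[ni][nj] is not None:
            vals.append(grid[ni][nj])
    return sum(vals) / len(vals)

grid = [[1.0, 2.0, 3.0],
        [2.0, None, 4.0],
        [3.0, 4.0, 5.0]]
print(predict_missing(grid, 1, 1))  # mean of 2, 4, 2, 4 -> 3.0
```

The paper's predictors replace this flat average with weights derived from the multiplicative spatial autoregressive structure, and compare quarter, first-neighborhood, and nearest-neighborhood versions.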

13.
A flexible semi-parametric regression model is proposed for modelling the relationship between a response and multivariate predictor variables. The proposed multiple-index model includes smooth unknown link and variance functions that are estimated non-parametrically. Data-adaptive methods for automatic smoothing parameter selection and for the choice of the number of indices M are considered. The model adapts to complex data structures and provides efficient adaptive estimation through the variance function component, in the sense that the asymptotic distribution is the same as if the non-parametric components were known. We develop iterative estimation schemes, including a constrained projection method for the case where the regression parameter vectors are mutually orthogonal. The proposed methods are illustrated with the analysis of data from a growth bioassay and a reproduction experiment with medflies. Asymptotic properties of the estimated model components are also obtained.

14.
To address the problem of determining indicator weights in grey clustering, we define the classification distinguishability of whitening weight functions to measure each indicator's contribution to the classification of a clustering object, and use it to determine the weights of the classification indicators. On this basis, a variable-weight grey clustering method is proposed. The results show that the method can combine the sample information of the clustering objects with expert experience, effectively determine the indicator weights for different clustering objects, and is applicable when the clustering indicators have different units of measurement and widely differing orders of magnitude. Finally, a numerical example illustrates the practicality and effectiveness of variable-weight grey clustering.
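The grey clustering machinery underlying this can be sketched with triangular whitening weight functions and a weighted clustering coefficient per class. The functions, fixed weights, and data below are purely illustrative; the paper's contribution, deriving per-object weights from classification distinguishability, is not reproduced here:

```python
def tri(x, lo, peak, hi):
    """Triangular whitening weight function: 0 outside (lo, hi), 1 at peak."""
    if x <= lo or x >= hi:
        return 0.0
    if x <= peak:
        return (x - lo) / (peak - lo)
    return (hi - x) / (hi - peak)

def grey_class(values, funcs, weights):
    """Clustering coefficient per grey class = weighted sum of whitening
    weights; assign the object to the class with the largest coefficient."""
    scores = [sum(w * f(v) for w, f, v in zip(weights, cls_funcs, values))
              for cls_funcs in funcs]
    return scores.index(max(scores)), scores

# two indicators, two grey classes ("low" = 0, "high" = 1); weights are fixed here
funcs = [
    [lambda x: tri(x, 0, 2, 5), lambda x: tri(x, 0, 20, 50)],    # class "low"
    [lambda x: tri(x, 2, 8, 12), lambda x: tri(x, 20, 80, 120)], # class "high"
]
weights = [0.6, 0.4]
cls, scores = grey_class([7.0, 70.0], funcs, weights)
print(cls, scores)
```

Because each indicator enters only through its whitening weight in [0, 1], indicators with different units and orders of magnitude can be combined, which is the setting the abstract highlights.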

15.
Regression tends to give very unstable and unreliable regression weights when predictors are highly collinear. Several methods have been proposed to counter this problem, a subset of which do so by finding components that summarize the information in the predictors and the criterion variables. The present paper compares six such methods (two of which are almost completely new) to ordinary regression: partial least squares (PLS), principal component regression (PCR), principal covariates regression, reduced-rank regression, and two variants of what is called power regression. The comparison is mainly done by means of a series of simulation studies, in which data are constructed in various ways, with different degrees of collinearity and noise, and the methods are compared in terms of their ability to recover the population regression weights, as well as their prediction quality for the complete population. It turns out that recovery of regression weights in situations with collinearity is often very poor for all methods, unless the regression weights lie in the subspace spanned by the first few principal components of the predictor variables. In those cases, PLS and PCR typically give the best recoveries of regression weights. The picture is inconclusive, however, because, especially in the study with more realistic simulated data, PLS and PCR gave the poorest recoveries of regression weights in conditions with relatively low noise and collinearity. It seems that PLS and PCR are particularly indicated in cases with much collinearity, whereas in other cases it is better to use ordinary regression. As far as prediction is concerned, prediction suffers far less from collinearity than does recovery of the regression weights.
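The instability that motivates these component methods is easy to reproduce: with two nearly collinear predictors, a tiny change in a single response value swings the OLS weights wildly, even though fitted values barely move. A sketch using the closed-form 2x2 normal equations (the data are illustrative):

```python
def ols2(X, y):
    """OLS coefficients for two predictors (no intercept) via the 2x2
    normal equations, solved in closed form."""
    s11 = sum(x[0] * x[0] for x in X)
    s12 = sum(x[0] * x[1] for x in X)
    s22 = sum(x[1] * x[1] for x in X)
    c1 = sum(x[0] * yi for x, yi in zip(X, y))
    c2 = sum(x[1] * yi for x, yi in zip(X, y))
    det = s11 * s22 - s12 * s12  # nearly zero under collinearity
    return ((s22 * c1 - s12 * c2) / det, (s11 * c2 - s12 * c1) / det)

# x2 is almost a copy of x1, so the normal equations are nearly singular
X = [(1.0, 1.01), (2.0, 1.99), (3.0, 3.02), (4.0, 3.98)]
y1 = [2.0, 4.0, 6.0, 8.0]
y2 = [2.0, 4.0, 6.1, 8.0]  # one response perturbed by 0.1
b1, b2 = ols2(X, y1), ols2(X, y2)
print(b1, b2)
```

Here `b1` is essentially (2, 0), while the perturbed `b2` has coefficients of opposite sign and much larger magnitude: the near-zero determinant amplifies noise in exactly the way the abstract describes.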

16.
Multivariate extreme events are typically modelled using multivariate extreme value distributions. Unfortunately, there exists no finite parametrization for the class of multivariate extreme value distributions. One common approach is to model extreme events using some flexible parametric subclass. This approach has been limited to only two or three dimensions, primarily because suitably flexible high-dimensional parametric models have prohibitively complex density functions. We present an approach that allows a number of popular flexible models to be used in arbitrarily high dimensions. The approach easily handles missing and censored data, and can be employed when modelling componentwise maxima and multivariate threshold exceedances. The approach is based on a representation using conditionally independent marginal components, conditioning on positive stable random variables. We use Bayesian inference, where the conditioning variables are treated as auxiliary variables within Markov chain Monte Carlo simulations. We demonstrate these methods with an application to sea-levels, using data collected at 10 sites on the east coast of England.

17.
A class of distribution-free tests is proposed for the independence of two subsets of response coordinates. The tests are based on the pairwise distances across subjects within each subset of the response. A complete graph is induced by each subset of response coordinates, with the sample points as nodes and the pairwise distances as edge weights. The proposed test statistic depends only on the rank order of edges in these complete graphs. The response vector may be of any dimension; in particular, the number of samples may be smaller than the dimension of the response. The test statistic is shown to have a normal limiting distribution with known expectation and variance under the null hypothesis of independence. The exact distribution-free null distribution of the test statistic is given for a sample of size 14, and its Monte Carlo approximation is considered for larger sample sizes. We demonstrate in simulations that this new class of tests has good power properties for very general alternatives.

18.
Analysis of data in the form of a set of points irregularly distributed within a region of space usually involves the study of some property of the distribution of inter-event distances. One such function is G, the distribution of the distance from an event to its nearest neighbor. In practice, point processes are commonly observed through a bounded window, thus making edge effects an important component in the estimation of G. Several estimators have been proposed, all dealing with the edge effect problem in different ways. This paper proposes a new alternative for estimating the nearest neighbor distribution and compares it to other estimators. The result is an estimator with relatively small mean squared error for a wide variety of stationary processes.
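The reduced-sample (border-method) estimator is the simplest of the edge-corrected estimators being compared: only points further than r from the window boundary contribute, since their r-neighbourhood is fully observed. A sketch on a toy pattern in the unit square (the points are illustrative):

```python
import math

def g_border(points, r, window):
    """Border-corrected estimate of G(r): among points at least r from the
    window edge, the fraction whose nearest neighbour lies within distance r."""
    (x0, y0), (x1, y1) = window
    eligible = [p for p in points
                if min(p[0] - x0, x1 - p[0], p[1] - y0, y1 - p[1]) >= r]
    if not eligible:
        return float("nan")
    hits = 0
    for p in eligible:
        nn = min(math.dist(p, q) for q in points if q != p)
        hits += nn <= r
    return hits / len(eligible)

pts = [(0.2, 0.2), (0.25, 0.25), (0.5, 0.5), (0.8, 0.8)]
ghat = g_border(pts, 0.1, ((0.0, 0.0), (1.0, 1.0)))
print(ghat)
```

Discarding boundary points removes the edge bias but wastes data, which is why alternative corrections, and the estimator proposed here, can achieve smaller mean squared error.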

19.
Vine copulas are a flexible class of dependence models consisting of bivariate building blocks and have proven to be particularly useful in high dimensions. Classical model distance measures require multivariate integration and thus suffer from the curse of dimensionality. In this paper, we provide numerically tractable methods to measure the distance between two vine copulas even in high dimensions. For this purpose, we consecutively develop three new distance measures based on the Kullback–Leibler distance, using the result that it can be expressed as the sum over expectations of KL distances between univariate conditional densities, which can be easily obtained for vine copulas. To reduce numerical calculations, we approximate these expectations on adequately designed grids, outperforming Monte Carlo integration with respect to computational time. For the sake of interpretability, we provide a baseline calibration for the proposed distance measures. We further develop similar substitutes for the Jeffreys distance, a symmetrized version of the Kullback–Leibler distance. In numerous examples and applications, we illustrate the strengths and weaknesses of the developed distance measures.
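The grid idea can be illustrated in the simplest univariate case: approximating KL(p||q) = E_p[log(p/q)] by a Riemann sum and checking it against a closed form. The densities, grid limits, and step below are illustrative choices, not the adaptively designed grids of the paper:

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def kl_on_grid(p, q, lo, hi, step):
    """Approximate KL(p || q) = E_p[log(p(X)/q(X))] by a Riemann sum on a grid."""
    total, x = 0.0, lo
    while x < hi:
        px, qx = p(x), q(x)
        if px > 0 and qx > 0:
            total += px * math.log(px / qx) * step
        x += step
    return total

kl = kl_on_grid(lambda x: normal_pdf(x, 0.0), lambda x: normal_pdf(x, 1.0),
                -10.0, 11.0, 0.001)
print(kl)  # closed form for N(0,1) vs N(1,1): (mu1 - mu2)^2 / 2 = 0.5
```

The paper's measures apply this one-dimensional trick to the univariate conditional densities of a vine, which is what sidesteps the multivariate integration.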

20.
It is often the case that high-dimensional data consist of only a few informative components. Standard statistical modeling and estimation in such a situation is prone to inaccuracies due to overfitting, unless regularization methods are practiced. In the context of classification, we propose a class of regularization methods through shrinkage estimators. The shrinkage is based on variable selection coupled with conditional maximum likelihood. Using Stein's unbiased estimator of the risk, we derive an estimator for the optimal shrinkage method within a certain class. A comparison of the optimal shrinkage methods in a classification context, with the optimal shrinkage method when estimating a mean vector under a squared loss, is given. The latter problem is extensively studied, but it seems that the results of those studies are not completely relevant for classification. We demonstrate and examine our method on simulated data and compare it to feature annealed independence rule and Fisher's rule.
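A classical instance of SURE-tuned shrinkage for the mean-vector problem mentioned above (not the authors' classifier-specific estimator) is SureShrink: pick the soft-threshold level minimizing Stein's unbiased risk estimate for a Gaussian observation vector, then shrink. The data below are illustrative:

```python
def sure_soft_threshold(x):
    """SureShrink for x ~ N(theta, I): choose the soft-threshold t minimizing
    SURE(t) = n - 2*#{|x_i| <= t} + sum_i min(x_i^2, t^2), then apply
    soft thresholding with that t."""
    n = len(x)
    candidates = sorted(abs(v) for v in x) + [0.0]

    def sure(t):
        return (n - 2 * sum(abs(v) <= t for v in x)
                + sum(min(v * v, t * t) for v in x))

    t = min(candidates, key=sure)
    shrunk = [max(abs(v) - t, 0.0) * (1 if v > 0 else -1) for v in x]
    return t, shrunk

# a sparse mean observed with unit noise: most coordinates are pure noise
x = [5.2, -4.8, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2]
t, shrunk = sure_soft_threshold(x)
print(t, shrunk)
```

The data-driven threshold kills the small noise-level coordinates while only slightly shrinking the two clearly informative ones, the same few-informative-components regime the abstract describes.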


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号