Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
The canonical variates in canonical correlation analysis are often interpreted by looking at the weights or loadings of the variables in each canonical variate and effectively ignoring those variables whose weights or loadings are small. It is shown that such a procedure can be misleading. The related problem of selecting a subset of the original variables which preserves the information in the most important canonical variates is also examined. Because of different possible definitions of ‘the information in canonical variates’, any such subset selection needs very careful consideration.

2.
An aspect of cluster analysis which has been widely studied in recent years is the weighting and selection of variables. Procedures have been proposed which are able to identify the cluster structure present in a data matrix when that structure is confined to a subset of variables. Other methods assess the relative importance of each variable as revealed by a suitably chosen weight. But when a cluster structure is present in more than one subset of variables and is different from one subset to another, those solutions as well as standard clustering algorithms can lead to misleading results. Some very recent methodologies for finding consensus classifications of the same set of units can also be useful for the identification of cluster structures in a data matrix, but each one seems to be only partly satisfactory for the purpose at hand. Therefore a new, more specific procedure is proposed and illustrated by analyzing two real data sets; its performance is evaluated by means of a simulation experiment.

3.
Summary. A new procedure is proposed for clustering attribute value data. When used in conjunction with conventional distance-based clustering algorithms, this procedure encourages those algorithms to automatically detect subgroups of objects that preferentially cluster on subsets of the attribute variables rather than on all of them simultaneously. The relevant attribute subsets for each individual cluster can be different and partially (or completely) overlap with those of other clusters. Enhancements for increasing sensitivity for detecting especially low cardinality groups clustering on a small subset of variables are discussed. Applications in different domains, including gene expression arrays, are presented.

4.
In this paper, we examine the potential determinants of foreign direct investment. For this purpose, we apply new exact subset selection procedures, which are based on idealized assumptions, as well as their possibly more plausible empirical counterparts to an international data set to select the optimal set of predictors. Unlike the standard model selection procedures AIC and BIC, which penalize only the number of variables included in a model, and the subset selection procedures RIC and MRIC, which consider also the total number of available candidate variables, our data-specific procedures even take the correlation structure of all candidate variables into account. Our main focus is on a new procedure, which we have designed for situations where some of the potential predictors are certain to be included in the model. For a sample of 73 developing countries, this procedure selects only four variables, namely imports, net income from abroad, gross capital formation, and GDP per capita. An important secondary finding of our study is that the data-specific procedures, which are based on extensive simulations and are therefore very time-consuming, can be approximated reasonably well by the much simpler exact methods.
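
The remark that AIC and BIC penalize only the number of included variables can be illustrated with a minimal sketch. It assumes a Gaussian linear model and uses the common forms of the two criteria with additive constants dropped; the function names and the parameters `rss`, `n`, and `k` are illustrative and are not taken from the paper. RIC, MRIC, and the paper's data-specific procedures are not shown.

```python
import math


def gaussian_aic(rss, n, k):
    """AIC for a Gaussian linear model, up to an additive constant.

    rss: residual sum of squares, n: sample size,
    k: number of estimated regression coefficients.
    """
    return n * math.log(rss / n) + 2 * k


def gaussian_bic(rss, n, k):
    """BIC for the same model; the per-variable penalty is log(n), not 2."""
    return n * math.log(rss / n) + k * math.log(n)
```

In both functions the penalty depends on the model only through `k`; neither the number of candidate variables nor their correlation structure enters.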

5.
This paper describes an algorithm for determining the best subset of k variables in regression problems using any Lp-norm with 1≤p<∞. The procedure is based on partial enumeration. The paper also suggests a way of finding the equation with the least prediction error.
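
As a rough illustration of best-subset search for a fixed k, the sketch below makes two simplifications that are not the paper's method: it enumerates all subsets of size k instead of using partial enumeration, and it fits by ordinary least squares (the p = 2 case) instead of a general Lp-norm. All function names and the example data are hypothetical.

```python
from itertools import combinations


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def residual_sum_of_squares(rows, y, columns):
    """Fit y on an intercept plus the chosen columns; return the RSS."""
    design = [[1.0] + [row[j] for j in columns] for row in rows]
    p = len(design[0])
    # Normal equations (X'X) beta = X'y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    return sum(
        (yi - sum(b * v for b, v in zip(beta, d))) ** 2
        for d, yi in zip(design, y)
    )


def best_subset(rows, y, k):
    """Return (columns, rss) with the smallest RSS over all subsets of size k."""
    n_vars = len(rows[0])
    candidates = (
        (cols, residual_sum_of_squares(rows, y, cols))
        for cols in combinations(range(n_vars), k)
    )
    return min(candidates, key=lambda pair: pair[1])


# Hypothetical data: y is built exactly as 2 + 3*x0 - x2, so column 1 is irrelevant.
rows = [[1, 2, 1], [2, 1, 0], [3, 4, 2], [4, 3, 5], [5, 6, 1], [6, 5, 3]]
y = [2 + 3 * r[0] - r[2] for r in rows]
cols, rss = best_subset(rows, y, 2)
```

Full enumeration fits one model per subset of size k, which is why procedures such as the one in the abstract rely on partial enumeration when the number of candidate variables is large.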

6.
Consider a linear regression model with p-1 predictor variables which is taken as the "true" model. The goal is to select a subset of all possible reduced models such that all inferior models (to be defined) are excluded with a guaranteed minimum probability. A procedure is proposed for which the exact evaluation of the probability of a correct decision is difficult; however, it is shown that the probability requirement can be met for sufficiently large sample size. Monte Carlo evaluation of the constant associated with the procedure and some ways to reduce the amount of computations involved in the implementation of the procedure are discussed.

7.
Using a forward selection procedure for selecting the best subset of regression variables involves the calculation of critical values (cutoffs) for an F-ratio at each step of a multistep search process. On dropping the restrictive (unrealistic) assumptions used in previous works, the null distribution of the F-ratio depends on unknown regression parameters for the variables already included in the subset. For the case of known σ, by conditioning the F-ratio on the set of regressors included so far and also on the observed (estimated) values of their regression coefficients, we obtain a forward selection procedure whose stepwise type I error does not depend on the unknown (nuisance) parameters. A numerical example with an orthogonal design matrix illustrates the difference between conditional cutoffs, cutoffs for the central F-distribution, and cutoffs suggested by Pope and Webster.

8.
In the context of local interpolators, radial basis functions (RBFs) are known to reduce the computational time by using a subset of the data for prediction purposes. In this paper, we propose a new distance-based spatial RBF method which allows modeling spatial continuous random variables. The trend is incorporated into an RBF according to a detrending procedure with mixed variables, among which we may have categorical variables. In order to evaluate the efficiency of the proposed method, a simulation study is carried out for a variety of practical scenarios for five distinct RBFs, incorporating principal coordinates. Finally, the proposed method is illustrated with an application of prediction of calcium concentration measured at a depth of 0–20 cm in Brazil, selecting the smoothing parameter by cross-validation.

9.
The selection of an appropriate subset of explanatory variables to use in a linear regression model is an important aspect of a statistical analysis. Classical stepwise regression is often used with this aim, but it could be invalidated by a few outlying observations. In this paper, we introduce a robust F-test and a robust stepwise regression procedure based on weighted likelihood in order to achieve robustness against the presence of outliers. The introduced methodology is asymptotically equivalent to the classical one when no contamination is present. Some examples and simulations are presented.

10.
In this article, we propose a multiple decision procedure to test the homogeneity of normal variances. If the null-hypothesis is rejected, our goal is to select a subset containing the population associated with the largest variance. An approximation for the critical value is obtained by deriving an approximate distribution for a linear combination of independent log-gamma distributed random variables. A lower bound for the probability of correct decision is obtained. We also study the determination of the common sample size in order to satisfy a given probability of correct decision when the largest variance is “sufficiently” larger than the rest.

11.
Optimal design theory deals with the assessment of the optimal joint distribution of all independent variables prior to data collection. In many practical situations, however, covariates are involved for which the distribution is not previously determined. The optimal design problem may then be reformulated in terms of finding the optimal marginal distribution for a specific set of variables. In general, the optimal solution may depend on the unknown (conditional) distribution of the covariates. This article discusses the D_A-maximin procedure to account for the uncertain distribution of the covariates. Sufficient conditions will be given under which the uniform design of a subset of independent discrete variables is D_A-maximin. The sufficient conditions are formulated for Generalized Linear Mixed Models with an arbitrary number of quantitative and qualitative independent variables and random effects.

12.
A subset selection procedure is developed for selecting a subset containing the multinomial population that has the highest value of a certain linear combination of the multinomial cell probabilities; such a population is called the ‘best’. The multivariate normal large sample approximation to the multinomial distribution is used to derive expressions for the probability of a correct selection, and for the threshold constant involved in the procedure. The procedure guarantees that the probability of a correct selection is at least at a pre-assigned level. The proposed procedure is an extension of Gupta and Sobel's [14] selection procedure for binomials and of Bakir's [2] restrictive selection procedure for multinomials. One illustration of the procedure concerns population income mobility in four countries: Peru, Russia, South Africa and the USA. Analysis indicates that Russia and Peru fall in the selected subset containing the best population with respect to income mobility from poverty to a higher-income status. The procedure is also applied to data concerning grade distribution for students in a certain freshman class.

13.
In discriminant analysis it is often desirable to find a small subset of the variables that were measured on the individuals of known origin, to be used for classifying individuals of unknown origin. In this paper a Bayesian approach to variable selection is used that includes an additional subset of variables for future classification if the additional measurement costs for this subset are lower than the resulting reduction in expected misclassification costs.

14.
We provide a method for simultaneous variable selection and outlier identification using the mean-shift outlier model. The procedure consists of two steps: the first step is to identify potential outliers, and the second step is to perform all possible subset regressions for the mean-shift outlier model containing the potential outliers identified in step 1. This procedure is helpful for model selection while simultaneously considering outlier identification, and can be used to identify multiple outliers. In addition, we can evaluate the impact on the regression model of simultaneous omission of variables and interesting observations. In an example, we provide detailed output from the R system, and compare the results with those using posterior model probabilities as proposed by Hoeting et al. [Comput. Stat. Data Anal. 22 (1996), pp. 252-270] for simultaneous variable selection and outlier identification.

15.
The structural approach to inference for the parameters of a simultaneous equation model with heteroscedastic error variance is investigated in this paper. The joint and the marginal structural distributions for the coefficients of the exogenous variables and the scale parameters of the error variables, and the marginal likelihood function of the coefficients of the endogenous variables, have been derived. The estimates are directly obtainable from the structural distribution and the marginal likelihood function of the parameters. The marginal distribution of a subset of coefficients of exogenous variables provides the basis for making inference for a particular subset of parameters of interest.

16.
Summary. In regression analysis, a best subset of regressors is usually selected by minimizing Mallows's Cp statistic or some other equivalent criterion, such as the Akaike lambda information criterion or cross-validation. It is known that the resulting procedure suffers from a lack of consistency that can lead to a model with too many variables. For this reason, corrections have been proposed that yield consistent procedures. The object of this paper is to show that these corrected criteria, although asymptotically consistent, are usually too conservative for finite sample sizes. The paper also proposes a new correction of Mallows's statistic that yields better results. A simulation study is conducted that shows that the proposed criterion performs well in a variety of situations.
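
For orientation, Mallows's statistic, usually written C_p, can be computed as in the minimal sketch below. It uses the standard textbook definition, C_p = RSS_p / s² − n + 2p, where s² is the residual variance estimate from the full model and p counts the fitted coefficients of the subset model, including the intercept. The function and parameter names are illustrative, and the corrected criteria discussed in the abstract are not reproduced here.

```python
def mallows_cp(rss_subset, sigma2_full, n, p):
    """Mallows's C_p for a subset model.

    rss_subset:  residual sum of squares of the subset model
    sigma2_full: residual variance estimate from the full model
    n:           sample size
    p:           number of fitted coefficients in the subset model
                 (intercept included)
    """
    return rss_subset / sigma2_full - n + 2 * p
```

A subset model with little bias has C_p close to p, and selection by this criterion picks the subset with the smallest value. The 2p term is the fixed per-variable penalty that the consistency corrections mentioned above modify.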

17.
Here we consider a multinomial probit regression model where the number of variables substantially exceeds the sample size and only a subset of the available variables is associated with the response. Thus selecting a small number of relevant variables for classification has received a great deal of attention. Generally when the number of variables is substantial, sparsity-enforcing priors for the regression coefficients are called for on grounds of predictive generalization and computational ease. In this paper, we propose a sparse Bayesian variable selection method in a multinomial probit regression model for multi-class classification. The performance of our proposed method is demonstrated with one simulated data set and three well-known gene expression profiling data sets: breast cancer data, leukemia data, and small round blue-cell tumors. The results show that, compared with other methods, our method is able to select the relevant variables and can obtain competitive classification accuracy with a small subset of relevant genes.

18.
Variable selection is an important task in regression analysis. Performance of the statistical model highly depends on the determination of the subset of predictors. There are several methods to select the most relevant variables to construct a good model. However, in practice, the dependent variable may take positive continuous values and may not be normally distributed. In such situations, the gamma distribution is more suitable than the normal for building a regression model. This paper introduces a heuristic approach to perform variable selection using artificial bee colony optimization for gamma regression models. We evaluated the proposed method against classical selection methods such as backward and stepwise. Both simulation studies and real data set examples proved the accuracy of our selection procedure.

19.
In market research and some other areas, it is common that a sample of n judges (consumers, evaluators, etc.) is asked to independently rank a series of k objects or candidates. It is usually difficult to obtain the judges' full cooperation to completely rank all k objects. A practical way to overcome this difficulty is to give each judge the freedom to choose the number of top candidates he is willing to rank. A frequently encountered question in this type of survey is how to select the best object or candidate from the incompletely ranked data. This paper proposes a subset selection procedure which constructs a random subset of all the k objects involved in the survey such that the best object is included in the subset with a prespecified confidence. It is shown that the proposed subset selection procedure is distribution-free over a very broad class of underlying distributions. An example from a market research study is used to illustrate the proposed procedure.

20.
This paper studies a sequential procedure R for selecting a random size subset that contains the multinomial cell which has the smallest cell probability. The stopping rule of the proposed procedure R is the composite of the stopping rules of curtailed sampling, inverse sampling, and the Ramey-Alam sampling. A result on the worst configuration is shown and it is employed in computing the procedure parameters that guarantee certain probability requirements. Tables of these procedure parameters, the corresponding probability of correct selection, the expected sample size, and the expected subset size are given for comparison purposes.
