Similar literature
20 similar articles found (search time: 568 ms)
1.
Let X1,…,X2n be independent and identically distributed copies of the non-negative integer valued random variable X distributed according to the unknown frequency function f(x). A total of 2n disjoint sequences of urns, each consisting of k urns, are given. Xj balls are placed in urn sequence j (1 ≤ j ≤ 2n). Each ball is placed in an urn of a given sequence with a certain known probability independently of the other balls. The variables X1,…,X2n are not observed; rather we observe whether certain pairs of urns are both empty or not. Our object is to estimate the mean μ of the number of balls X. Two different kinds of estimators of μ are investigated. One of the estimators studied is a method of moments type estimator while the other is motivated by the maximum likelihood principle. These estimators are compared on the basis of their asymptotic mean squared error as k tends to infinity. An application of these results to a problem in genetics involved with estimating codon substitution rates is discussed.

2.
The multiple inference character of several tests in the same application is usually taken into consideration by requiring that the tests have a multiple level of significance. Also, a prediction problem in an application with several possible predictor variables requires that the multiple inference character of the problem be considered. This is not being done in the methods commonly used to choose predictor variables. Here, we discuss both the test and prediction methods in two-level factorial designs and suggest a principle for choosing variables which is based on multiple inference thinking. An example demonstrates that the proposed principle leads to the use of fewer predictor variables than the Akaike method does.

3.
The K-means clustering method is a widely adopted clustering algorithm in data mining and pattern recognition, where the partitions are made by minimizing the total within-group sum of squares based on a given set of variables. Weighted K-means clustering is an extension of the K-means method that assigns nonnegative weights to the set of variables. In this paper, we aim to obtain more meaningful and interpretable clusters by deriving the optimal variable weights for weighted K-means clustering. Specifically, we improve the weighted K-means clustering method by introducing a new algorithm to obtain the globally optimal variable weights based on the Karush-Kuhn-Tucker conditions. We present the mathematical formulation for the clustering problem, derive the structural properties of the optimal weights, and implement a recursive algorithm to calculate the optimal weights. Numerical examples on simulated and real data indicate that our method is superior in both clustering accuracy and computational efficiency.
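To make the weighted objective in the abstract above concrete, here is a minimal sketch of weighted K-means with the variable weights held fixed. It is not the paper's algorithm: the KKT-based derivation of optimal weights is not reproduced, and the function names, the random initialisation and the `iters` and `seed` parameters are assumptions made for illustration.

```python
import random

def weighted_sq_dist(x, c, w):
    # Weighted squared Euclidean distance between point x and center c.
    return sum(wj * (xj - cj) ** 2 for xj, cj, wj in zip(x, c, w))

def weighted_kmeans(points, k, weights, iters=50, seed=0):
    # Lloyd-style iteration with fixed nonnegative variable weights (assumed given).
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center under the weighted distance.
        labels = [
            min(range(k), key=lambda j: weighted_sq_dist(p, centers[j], weights))
            for p in points
        ]
        # Update step: coordinate-wise mean of each cluster's members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the old center for an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

Setting a variable's weight to 0 removes it from the distance entirely, which is one way to see why the choice of weights affects how interpretable the resulting clusters are.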

4.
A stepwise algorithm for selecting categories for the chi-squared goodness-of-fit test with completely specified continuous null and alternative distributions is described in this paper. The procedure's starting point is an initial partitioning of the sample space into a large number of categories. A second partition with one fewer category is constructed by combining two categories of the original partition. The procedure continues until there are only two categories; the partition in the sequence with the highest estimated power is the one chosen. For illustrative purposes, the performance of the algorithm is evaluated for several hypothesis tests of the form H0: normal distribution vs. H1: a specific mixed normal distribution. For each test considered, the partition identified by the algorithm was compared to several equiprobable partitions, including the equiprobable partition with the highest estimated power. In all cases but one, the algorithm identified a partition with higher estimated power than the best equiprobable partition. Applications of the procedure are discussed.

5.
6.
A multiple regression method based on distance analysis and metric scaling is proposed and studied. This method allows us to predict a continuous response variable from several explanatory variables, is compatible with the general linear model, and is found to be useful when the predictor variables are both continuous and categorical. Real data examples are given to illustrate the results obtained.

7.
Beta regression models are commonly used by practitioners to model variables that assume values in the standard unit interval (0, 1). In this paper, we consider the issue of variable selection for beta regression models with varying dispersion (VBRM), in which both the mean and the dispersion depend upon predictor variables. Based on a penalized likelihood method, the consistency and the oracle property of the penalized estimators are established. Following the coordinate descent algorithm idea of generalized linear models, we develop a new variable selection procedure for the VBRM, which can efficiently estimate and select important variables simultaneously in both the mean model and the dispersion model. Simulation studies and a body fat data analysis are presented to illustrate the proposed methods.

8.
The Chow test compares regressions developed from two samples from possibly different populations. Its use has traditionally been recommended only when the number of observations in one of the samples does not exceed the number of predictor variables. It is shown that when that condition is not satisfied, the test remains uniformly most powerful (UMP) among a certain class of tests against an important class of alternatives.

9.
A fast splitting procedure for classification trees   Total citations: 1, self-citations: 0, citations by others: 1
This paper provides a faster method to find the best split at each node when using the CART methodology. The predictability index is proposed as a splitting rule for growing the same classification tree as CART does when using the Gini index of heterogeneity as an impurity measure. A theorem is introduced to show a new property of the index: the index for a given predictor has a value not lower than the index for any split generated by the predictor. This property is used to make a substantial saving in the time required to generate a classification tree. Three simulation studies are presented in order to show the computational gain in terms of both the number of splits analysed at each node and the CPU time. The proposed splitting algorithm can prove computationally efficient on real data sets, as shown in an example.
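For readers unfamiliar with the baseline that the abstract above speeds up, the following is a minimal sketch of an exhaustive CART-style search for the best threshold on one numeric predictor, using the Gini index as the impurity measure. It does not implement the paper's predictability-index shortcut; the function names `gini` and `best_split` are illustrative assumptions.

```python
from collections import Counter

def gini(labels):
    # Gini index of heterogeneity: 1 minus the sum of squared class proportions.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(xs, ys):
    # Try every midpoint between consecutive distinct values of the predictor
    # and return (threshold, impurity decrease) for the best one found.
    n = len(xs)
    parent = gini(ys)
    best = (None, 0.0)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        decrease = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if decrease > best[1]:
            best = (t, decrease)
    return best
```

The search evaluates every candidate split of every predictor, which is the cost the abstract's bound is meant to reduce: if a predictor's overall index is already below the best split found so far, none of its splits need to be examined.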

10.
Random forests are widely used in many research fields for prediction and interpretation purposes. Their popularity is rooted in several appealing characteristics, such as their ability to deal with high dimensional data, complex interactions and correlations between variables. Another important feature is that random forests provide variable importance measures that can be used to identify the most important predictor variables. Though there are alternatives like complete case analysis and imputation, existing methods for the computation of such measures cannot be applied straightforwardly when the data contain missing values. This paper presents a solution to this pitfall by introducing a new variable importance measure that is applicable to any kind of data, whether or not it contains missing values. An extensive simulation study shows that the new measure meets sensible requirements and shows good variable ranking properties. An application to two real data sets also indicates that the new approach may provide a more sensible variable ranking than the widespread complete case analysis. It takes the occurrence of missing values into account, which makes results also differ from those obtained under multiple imputation.

11.
The effect of a single variable data point, x, on the usual test statistics for traditional hypothesis tests for means is analyzed. It is shown that an outlier may have a profound and unexpected effect on the test statistic. Although it might appear that an outlier would tend to lend support to the alternative hypothesis, it may in fact detract from the significance of the test. In one-population tests and analysis of variance (ANOVA), the value of x that maximizes the significance of the test statistic is given. This value does not have to be unusually large or small. In fact, it often falls within the range of the other sample points. In the general one-population case, the limiting value for the test statistic is shown to be +1. In the case involving more than one population, it is shown that the limiting value of the test statistic is a function only of the number of members in the samples and not their relative values. Special cases are identified in which the test statistic is shown to have unique characteristics depending on the characteristics of the data.
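The limiting value of +1 in the one-population case can be checked numerically. The sketch below assumes the usual one-sample t statistic with null mean 0; the sample values in `base` are made up for illustration and are not from the paper.

```python
import math
import statistics

def one_sample_t(sample, mu0=0.0):
    # Usual one-sample t statistic: (mean - mu0) / (s / sqrt(n)).
    n = len(sample)
    return (statistics.fmean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))

# Hypothetical sample; one extra point x is appended and allowed to grow.
base = [1.2, 0.8, 1.1, 0.9, 1.0]
for x in (1.0, 10.0, 1e3, 1e6):
    print(x, one_sample_t(base + [x]))
```

As x grows, both the sample mean and the standard error behave like x/n, so their ratio tends to 1: a very extreme point pulls the statistic toward insignificance instead of strengthening it, while a point inside the range of the other observations gives a far larger statistic.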

12.
Non-symmetric correspondence analysis (NSCA) is a useful technique for analysing a two-way contingency table. Frequently, there is more than one predictor variable; in this paper, we consider two categorical variables as predictor variables and one response variable. Interaction represents the joint effects of predictor variables on the response variable. When interaction is present, the interpretation of the main effects is incomplete or misleading. To separate the main effects and the interaction term, we introduce a method that, starting from the coordinates of multiple NSCA and using a two-way analysis of variance without interaction, allows a better interpretation of the impact of the predictor variables on the response variable. The proposed method has been applied to a well-known three-way contingency table proposed by Bockenholt and Bockenholt in which they cross-classify subjects by attitude towards abortion, number of years of education and religion. We analyse the case where the variables education and religion influence a person's attitude towards abortion.

13.
Multiple non-symmetric correspondence analysis (MNSCA) is a useful technique for analyzing a two-way contingency table. In more complex cases, there is more than one predictor variable. In this paper, MNSCA, along with the decomposition of the Gray–Williams Tau index into main effects and an interaction term, is used to analyze a contingency table with two categorical predictor variables and an ordinal response variable. The Multiple-Tau index is a measure of association that contains both main effects and an interaction term. The main effects represent the change in the response variable due to the change in the levels/categories of the predictor variables, considering the effects of their addition, while the interaction effect represents the combined effect of the categorical predictor variables on the ordinal response variable. Moreover, for ordinal scale variables, we propose a further decomposition in order to check the existence of power components by using Emerson's orthogonal polynomials.

14.
The present paper studies the validity of inferential procedures which follow the Taguchi method, under saturated designs. The distribution of the signal to noise (S/N) ratio Y [ILM0001] is investigated, for normal parent distributions. We further investigate the distribution of orthonormal contrasts of such S/N variables. Finally, we discuss and provide critical values for mod-F tests of significance of parameters, when the k smallest SS values are pooled to serve as an error variance estimate.

15.
In the last fifty years, a great deal of research effort has been devoted to the construction of simultaneous confidence bands for a linear regression function. The two most frequently quoted confidence bands in the statistics literature are the Scheffé type and constant width bands over a given rectangular region of the predictor variables. For the constant width bands, a method is given by Gafarian [Gafarian, A.V., 1964, Confidence bands in straight line regression. Journal of the American Statistical Association, 59, 182–213.] for the calculation of critical constants only for the special case of one predictor variable. In this article, a method is proposed to construct constant width bands when there are any number of predictor variables. A new criterion for assessing a confidence band is also proposed; it is the probability that a confidence band excludes a false regression function and can be viewed as the power function of a test associated, naturally, with a confidence band. Under this criterion, a numerical comparison between the Scheffé type and constant width bands is then carried out. It emerges from this comparison that the constant width bands can be better than the Scheffé type bands for certain designs.

16.
This note discusses a problem that might occur when forward stepwise regression is used for variable selection and among the candidate variables is a categorical variable with more than two categories. Most software packages (such as SAS, SPSSx, BMDP) include special programs for performing stepwise regression. The user of these programs has to code categorical variables with dummy variables. In this case the forward selection might wrongly indicate that a categorical variable with more than two categories is nonsignificant. This is a disadvantage of the forward selection compared with the backward elimination method. A way to avoid the problem would be to test in a single step all dummy variables corresponding to the same categorical variable rather than one dummy variable at a time, such as in the analysis of covariance. This option, however, is not available in forward stepwise procedures, except for stepwise logistic regression in BMDP. A practical possibility is to repeat the forward stepwise regression and change the reference categories each time.

17.
闫懋博, 田茂再. 《统计研究》 (Statistical Research), 2021, 38(1): 147-160
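The number of variables that penalized variable selection methods such as the Lasso can select into a model is limited by the sample size. Existing methods in the literature for studying the significance of variable coefficients discard the information contained in variables that were not selected into the model. In the high-dimensional setting where the number of variables exceeds the sample size (p > n), this paper uses a randomized bootstrap method to obtain variable weights, constructs the conditional distribution of the selection event when computing the adaptive Lasso, and removes variables whose coefficients are not significant in order to obtain the final estimates. The innovation of this paper is that the proposed method overcomes the limit on the number of variables the adaptive Lasso can select and, when the observed data contain a large number of noise variables, can effectively distinguish the true variables from the noise variables. Compared with existing penalized variable selection methods, simulation studies under a variety of scenarios demonstrate the superiority of the proposed method on these two problems. In the empirical study, the NCI-60 cancer cell line data are analyzed, and the results improve markedly on those in the earlier literature.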

18.

This paper is motivated by our collaborative research, and the aim is to model clinical assessments of upper limb function after stroke using 3D-position and 4D-orientation movement data. We present a new nonlinear mixed-effects scalar-on-function regression model with a Gaussian process prior, focusing on variable selection from a large number of candidates including both scalar and functional variables. A novel variable selection algorithm has been developed, namely functional least angle regression. As it is essential for this algorithm, we studied the representation of functional variables with different methods and the correlation between a scalar and a group of mixed scalar and functional variables. We also propose a new stopping rule for practical use. This algorithm is efficient and accurate for both variable selection and parameter estimation even when the number of functional variables is very large and the variables are correlated, and thus the prediction provided by the algorithm is accurate. Our comprehensive simulation study showed that the method is superior to other existing variable selection methods. When the algorithm was applied to the analysis of the movement data, the use of the nonlinear random-effect model and the functional variables significantly improved the prediction accuracy for the clinical assessment.


19.
A method for constructing two-stage (double sample) tests is presented which does not require the evaluation of complicated bivariate distribution functions. The procedure results from a modification of Fisher's method for combining independent tests of significance and is distribution free in the way it combines the test results from the two samples. However, the one-sample test statistics for the two samples are assumed to have continuous distributions and may be parametric. A rule is also given for the selection of a particular test out of a family of possible two-stage tests which can be generated by this method. Specific examples are given and comparisons are made with two double sample tests which have previously been presented in the literature.
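As background for the abstract above, here is a minimal sketch of Fisher's original method for combining two independent p-values. It is not the paper's modified two-stage procedure. Under the null hypothesis, with continuous test statistics, -2(ln p1 + ln p2) follows a chi-squared distribution with 4 degrees of freedom, whose upper tail has the closed form exp(-x/2)(1 + x/2). The function name is an illustrative assumption.

```python
import math

def fisher_combine_two(p1, p2):
    # Fisher's combination of two independent p-values.
    if not (0.0 < p1 <= 1.0 and 0.0 < p2 <= 1.0):
        raise ValueError("p-values must lie in (0, 1]")
    stat = -2.0 * (math.log(p1) + math.log(p2))
    # Upper tail probability of a chi-squared variable with 4 degrees of freedom.
    return math.exp(-stat / 2.0) * (1.0 + stat / 2.0)
```

Because the combination depends on the two samples only through their p-values, no bivariate distribution of the underlying test statistics has to be evaluated, which is the property the abstract refers to.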

20.
Summary. When a number of distinct models contend for use in prediction, the choice of a single model can offer rather unstable predictions. In regression, stochastic search variable selection with Bayesian model averaging offers a cure for this robustness issue but at the expense of requiring very many predictors. Here we look at Bayes model averaging incorporating variable selection for prediction. This offers similar mean-square errors of prediction but with a vastly reduced predictor space. This can greatly aid the interpretation of the model. It also reduces the cost if measured variables have costs. The development here uses decision theory in the context of the multivariate general linear model. In passing, this reduced predictor space Bayes model averaging is contrasted with single-model approximations. A fast algorithm for updating regressions in the Markov chain Monte Carlo searches for posterior inference is developed, allowing many more variables than observations to be contemplated. We discuss the merits of absolute rather than proportionate shrinkage in regression, especially when there are more variables than observations. The methodology is illustrated on a set of spectroscopic data used for measuring the amounts of different sugars in an aqueous solution.
