Similar Literature
20 similar documents found.
1.
A common data analysis setting consists of a collection of datasets of varying sizes that are all relevant to a particular scientific question, but which include different subsets of the relevant variables, presumably with some overlap. Here, we demonstrate that synthesizing cross-classified categorical datasets drawn from a common population, where many of the sets are incompletely cross-classified (i.e., one or more of the classification variables is unobserved) but at least one is completely observed, is expected to reduce uncertainty about the cell probabilities in the associated multi-way contingency table, as well as about derived quantities such as relative risks and odds ratios. The use of the word “expected” here is the key point. When synthesizing complete datasets from a common population, we are assured of reducing uncertainty. However, when we work with a log-linear model to explain the complete table, improvement is not assured, because this model cannot be fitted to any of the incomplete datasets. We provide technical clarification of this point, as well as a series of simulation examples, motivated by an adverse birth outcomes investigation, to illustrate what can be expected under such synthesis.
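A minimal sketch of the idea in this abstract, with invented counts rather than the paper's log-linear machinery: a fully cross-classified 2x2 table is pooled with an incomplete dataset that observed only the row variable, and posterior uncertainty about a cell probability is compared before and after the synthesis.

```python
# A sketch under invented counts: pool a row-margin-only dataset with a
# complete 2x2 table via a crude EM-style allocation, then compare
# posterior uncertainty about cell (0,0) under a flat Dirichlet prior.
import numpy as np

rng = np.random.default_rng(0)

complete = np.array([[30, 10], [20, 40]])   # fully cross-classified counts
row_margin_only = np.array([25, 35])        # incomplete set: rows observed only

def posterior_cell_sd(counts, n_draws=10_000):
    """Posterior SD of cell probability (0,0) under a flat Dirichlet prior."""
    draws = rng.dirichlet(counts.ravel() + 1.0, size=n_draws)
    return draws[:, 0].std()

print("complete data only :", posterior_cell_sd(complete))

# Crude synthesis: allocate the incomplete rows across columns using the
# within-row column profile estimated from the complete table.
col_profile = complete / complete.sum(axis=1, keepdims=True)
augmented = complete + row_margin_only[:, None] * col_profile
print("after synthesis    :", posterior_cell_sd(augmented))
```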

2.
Although devised in 1936 by Fisher, discriminant analysis is still rapidly evolving, as the complexity of contemporary data sets grows exponentially. Our classification rules explore these complexities by modeling various correlations in higher-order data. Moreover, our classification rules are suitable for data sets where the number of response variables is comparable to or larger than the number of observations. We assume that the higher-order observations have a separable variance-covariance matrix and two different Kronecker product structures on the mean vector. In this article, we develop quadratic classification rules among g different populations where each individual has κth-order (κ ≥ 2) measurements. We also provide computational algorithms to compute the maximum likelihood estimates for the model parameters and, eventually, the sample classification rules.
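A rough sketch of quadratic discrimination for second-order (matrix-valued) observations under a separable covariance, using the standard "flip-flop" iterations to estimate the two Kronecker factors. The simulated data, the dimensions, and the known class means are assumptions for illustration; the paper's rules for general κth-order data and structured means are not reproduced.

```python
# A sketch with simulated matrix-valued data (kappa = 2): covariance
# Sigma = A kron B estimated by "flip-flop" iterations, then a quadratic
# discriminant score used to assign a new observation.
import numpy as np

rng = np.random.default_rng(1)
p, q, n = 4, 3, 60                        # row dim, column dim, sample size

def flip_flop(X, M, iters=20):
    """Estimate row (p x p) and column (q x q) covariance factors."""
    R = [x - M for x in X]
    A, B = np.eye(p), np.eye(q)
    for _ in range(iters):
        Bi = np.linalg.inv(B)
        A = sum(r @ Bi @ r.T for r in R) / (len(R) * q)
        Ai = np.linalg.inv(A)
        B = sum(r.T @ Ai @ r for r in R) / (len(R) * p)
    return A, B

def quad_score(x, M, A, B):
    """Log-density kernel: Mahalanobis trace term plus log-determinants."""
    r = x - M
    maha = np.trace(np.linalg.inv(B) @ r.T @ np.linalg.inv(A) @ r)
    return -0.5 * (maha + q * np.linalg.slogdet(A)[1]
                        + p * np.linalg.slogdet(B)[1])

M0, M1 = np.zeros((p, q)), 0.8 * np.ones((p, q))
X0 = [M0 + rng.standard_normal((p, q)) for _ in range(n)]
X1 = [M1 + rng.standard_normal((p, q)) for _ in range(n)]
A0, B0 = flip_flop(X0, M0)
A1, B1 = flip_flop(X1, M1)

x_new = M1 + rng.standard_normal((p, q))   # truly from population 1
scores = [quad_score(x_new, M0, A0, B0), quad_score(x_new, M1, A1, B1)]
print("assigned to population", int(np.argmax(scores)))
```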

3.
In this paper we analyse the average behaviour of the Bayes-optimal and Gibbs learning algorithms. We do this both for off-training-set error and for conventional IID (independent identically distributed) error, for which test sets may overlap with training sets. For the IID case we provide a major extension to one of the better-known results. We also show that expected IID test set error is a non-increasing function of training set size for either algorithm. On the other hand, as we show, the expected off-training-set error for both learning algorithms can increase with training set size for non-uniform sampling distributions. We characterize the relationship the sampling distribution must have with the prior for such an increase to occur. We show in particular that for uniform sampling distributions and either algorithm, the expected off-training-set error is a non-increasing function of training set size. For uniform sampling distributions, we also characterize the priors for which the expected error of the Bayes-optimal algorithm stays constant. In addition we show that for the Bayes-optimal algorithm, expected off-training-set error can increase with training set size when the target function is fixed, but if and only if the expected error averaged over all targets decreases with training set size. Our results hold for arbitrary noise and arbitrary loss functions.
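A tiny exact computation illustrating one of the constant-error cases quoted above, under assumptions made for this sketch (a 4-point input space, a uniform prior over all binary target functions, noise-free data, zero-one loss, fixed training inputs): off the training set the posterior is split 50/50, so the Bayes-optimal algorithm's expected off-training-set error stays at 1/2 for every training set size.

```python
# Exact toy computation of expected off-training-set (OTS) error for the
# Bayes-optimal algorithm under a uniform prior over binary targets.
import itertools
import numpy as np

m = 4                                         # size of the input space
targets = list(itertools.product([0, 1], repeat=m))

for d in range(m):                            # number of training inputs
    train = range(d)
    ots = [i for i in range(m) if i not in train]
    errs = []
    for f in targets:                         # average over the uniform prior
        consistent = [g for g in targets
                      if all(g[i] == f[i] for i in train)]
        for i in ots:
            vote = np.mean([g[i] for g in consistent])
            if vote > 0.5:
                err = 1 - f[i]                # Bayes-optimal predicts 1
            elif vote < 0.5:
                err = f[i]                    # predicts 0
            else:
                err = 0.5                     # tie: coin flip
            errs.append(err)
    print(f"training size {d}: expected OTS error {np.mean(errs):.3f}")
```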

4.
In this paper, we consider a multivariate linear model with complete/incomplete data, where the regression coefficients are subject to a set of linear inequality restrictions. We first develop an expectation/conditional maximization (ECM) algorithm for calculating restricted maximum likelihood estimates of parameters of interest. We then establish the corresponding convergence properties for the proposed ECM algorithm. Applications to growth curve models and linear mixed models are presented. Confidence interval construction via the double-bootstrap method is provided. Some simulation studies are performed and a real example is used to illustrate the proposed methods.
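A toy ECM-style iteration under assumptions made for this sketch, far simpler than the paper's multivariate algorithm: a normal linear regression with roughly 30% of responses missing at random and a single coefficient restricted to be non-negative. The E-step fills missing responses with their conditional means; the CM-step maximizes the complete-data likelihood subject to the restriction, which here reduces to a projection.

```python
# Toy E-step / restricted CM-step alternation on simulated data.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.standard_normal(n)
y = 1.5 * x + rng.standard_normal(n)
miss = rng.random(n) < 0.3                # ~30% of responses missing
y_obs = np.where(miss, np.nan, y)

beta = 0.0
for _ in range(50):
    y_fill = np.where(miss, beta * x, y_obs)         # E-step: conditional mean
    beta = max(0.0, float(x @ y_fill / (x @ x)))     # CM-step: project onto beta >= 0

print("restricted MLE of beta:", round(beta, 3))
```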

5.
Multiple imputation is now a well-established technique for analysing data sets where some units have incomplete observations. Provided that the imputation model is correct, the resulting estimates are consistent. An alternative, weighting by the inverse probability of observing complete data on a unit, is conceptually simple and involves fewer modelling assumptions, but it is known to be both inefficient (relative to a fully parametric approach) and sensitive to the choice of weighting model. Over the last decade, there has been a considerable body of theoretical work to improve the performance of inverse probability weighting, leading to the development of 'doubly robust' or 'doubly protected' estimators. We present an intuitive review of these developments and contrast these estimators with multiple imputation from both a theoretical and a practical viewpoint.
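A minimal simulated contrast between the complete-case mean, inverse probability weighting, and the doubly robust (AIPW) estimator of a mean under data missing at random. The data-generating design and the scikit-learn working models are assumptions of this sketch.

```python
# Complete-case vs IPW vs doubly robust (AIPW) estimation of E[y].
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
n = 5000
x = rng.standard_normal((n, 1))
y = 2.0 + 3.0 * x[:, 0] + rng.standard_normal(n)          # true mean is 2.0
r = rng.random(n) < 1 / (1 + np.exp(-(0.5 + x[:, 0])))    # r = 1: y observed

pi = LogisticRegression().fit(x, r).predict_proba(x)[:, 1]   # weighting model
m = LinearRegression().fit(x[r], y[r]).predict(x)            # outcome model

cc = y[r].mean()                       # complete-case mean (biased here)
ipw = np.mean(r * y / pi)              # inverse probability weighting
dr = np.mean(m + r * (y - m) / pi)     # doubly robust (AIPW)
print(f"complete-case {cc:.3f}, IPW {ipw:.3f}, DR {dr:.3f} (truth 2.0)")
```

The doubly robust form is consistent if either the weighting model or the outcome model is correctly specified, which is the "double protection" the abstract refers to.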

6.
7.
8.
Bechhofer and Tamhane (1981) proposed a new class of incomplete block designs called BTIB designs for comparing p ≥ 2 test treatments with a control treatment in blocks of equal size k < p + 1. All BTIB designs for given (p,k) can be constructed by forming unions of replications of a set of elementary BTIB designs called generator designs for that (p,k). In general, there are many generator designs for given (p,k) but only a small subset (called the minimal complete set) of these suffices to obtain all admissible BTIB designs (except possibly any equivalent ones). Determination of the minimal complete set of generator designs for given (p,k) was stated as an open problem in Bechhofer and Tamhane (1981). In this paper we solve this problem for k = 3. More specifically, we give the minimal complete sets of generator designs for k = 3, p = 3(1)10; the relevant proofs are given only for the cases p = 3(1)6. Some additional combinatorial results concerning BTIB designs are also given.

9.
Properties of Huber's M-estimators based on estimating equations have been studied extensively and are well understood for complete (i.i.d.) data. Although the concepts of M-estimators and influence curves have been extended for some time by Reid (1981) to incomplete data that are subject to right censoring, results on the general behavior of M-estimators based on incomplete data remain scattered and restrictive. This paper establishes a general large sample theory for M-estimators based on censored data. We show how to extend any asymptotic result available for M-estimators based on complete data to the case of censored data. The extensions are usually straightforward and include the multiparameter situation. Both the lifetime and censoring distributions may be discontinuous. We illustrate several extensions which provide simple and tractable sufficient conditions for an M-estimator to be strongly consistent and asymptotically normal. The influence curves and asymptotic variance of the M-estimators are also derived. The applicability of the new sufficient conditions is demonstrated through several examples, including location and scale M-estimators.
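One concrete instance of carrying an M-estimator over to right-censored data, offered as an illustrative sketch rather than the paper's general construction: the Huber location estimating equation is weighted by the Kaplan-Meier jumps, so only uncensored points contribute, with mass reflecting the estimated lifetime distribution. The exponential lifetimes and censoring times are simulated assumptions.

```python
# Kaplan-Meier-weighted Huber location M-estimator for censored data.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(4)
n = 200
t = rng.exponential(2.0, n)                 # lifetimes
c = rng.exponential(4.0, n)                 # censoring times
x = np.minimum(t, c)
delta = (t <= c).astype(float)              # 1 = event observed

def km_weights(x, delta):
    """Kaplan-Meier mass at each point (zero at censored points)."""
    order = np.argsort(x)
    xs, ds = x[order], delta[order]
    at_risk = len(xs) - np.arange(len(xs))
    surv = np.cumprod(1.0 - ds / at_risk)
    prev = np.concatenate(([1.0], surv[:-1]))
    return xs, prev - surv

def psi(u, k=1.345):
    """Huber's psi function."""
    return np.clip(u, -k, k)

xs, w = km_weights(x, delta)
theta = brentq(lambda th: np.sum(w * psi(xs - th)),
               xs.min() - 10.0, xs.max() + 10.0)
print("censored-data Huber M-estimate of location:", round(theta, 3))
```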

10.
This paper presents missing data methods for repeated measures data in small samples. Most methods currently available are intended for large samples; in particular, no studies have compared the performance of multiple imputation methods with that of non-imputation incomplete-data analysis methods. We first develop a strategy for multiple imputation for repeated measures data under a cell-means model that is applicable to any multivariate data set with small samples. Multiple imputation inference procedures are then applied to the resulting multiply imputed complete data sets. Comparisons with the available non-imputation incomplete-data methods are made via simulation studies; we conclude that, in terms of the power of testing hypotheses about the parameters of interest, there is not much gain in using the computer-intensive multiple imputation methods for small-sample repeated measures analysis.
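For reference, the combining step applied once the m multiply imputed data sets have been analysed is Rubin's rules; a minimal sketch follows, with the per-imputation estimates and within-imputation variances invented purely for illustration.

```python
# Rubin's combining rules for m multiply imputed analyses.
import numpy as np

est = np.array([5.1, 4.8, 5.3, 5.0, 4.9])       # estimate from each imputation
var = np.array([0.40, 0.38, 0.42, 0.41, 0.39])  # within-imputation variances
m = len(est)

qbar = est.mean()                        # pooled point estimate
ubar = var.mean()                        # average within-imputation variance
b = est.var(ddof=1)                      # between-imputation variance
total = ubar + (1 + 1 / m) * b           # Rubin's total variance
df = (m - 1) * (1 + ubar / ((1 + 1 / m) * b)) ** 2   # reference t degrees of freedom

print(f"pooled estimate {qbar:.3f}, SE {np.sqrt(total):.3f}, df {df:.1f}")
```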

11.
The purpose of this paper is to highlight some classic issues in the measurement of change and to show how contemporary solutions can be used to deal with them. Five classic issues are raised here: (1) separating individual changes from group differences; (2) options for incomplete longitudinal data over time; (3) options for nonlinear changes over time; (4) measurement invariance in studies of changes over time; and (5) new opportunities for modeling dynamic changes. For each issue we describe the problem and then review some contemporary solutions based on structural equation models (SEM). We fit these SEMs using existing panel data on cognitive variables from the Health & Retirement Study (HRS). This is not intended as an overly technical treatment, so only a few basic equations are presented, examples are displayed graphically, and more complete references to the contemporary solutions are given throughout.

12.
In this paper we consider the worst-case adaptive complexity of the search problem (S, ℱ), where the family ℱ of candidate sets is also the set of independent sets of a matroid over S. We give a formula for the number of questions needed and an algorithm to find the optimal search algorithm for any matroid. This algorithm uses only O(|S|^3) steps (i.e., questions to the independence oracle), which is also the complexity of Edmonds' partitioning algorithm for matroids and does not seem to be avoidable.

13.
Research on missing data treatment based on clustering and association rules
This paper proposes a new method for handling missing data based on clustering and association rules. Records of the incomplete data set that are similar to one another are first grouped into classes by a clustering method; an improved association-rule method is then used to mine the relationships among variables within each sub-data set, and these relationships are used to fill in the missing values. Analysis of examples shows that the method handles missing data well, especially for massive data sets.
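A minimal sketch of the two-stage idea on simulated data, with a within-cluster mode standing in for the mined association rules the paper actually uses: records are grouped by clustering, and a missing categorical value is filled from the observed patterns in the record's own cluster.

```python
# Cluster-then-fill imputation of a categorical variable.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
n = 300
x = rng.standard_normal((n, 2)) + rng.choice([0.0, 4.0], size=(n, 1))
cat = (x[:, 0] > 2).astype(int)          # categorical variable to impute
missing = rng.random(n) < 0.2
cat_obs = np.where(missing, -1, cat)     # -1 marks a missing value

# Stage 1: group similar records.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x)

# Stage 2: fill each missing value from its own cluster's observed values.
filled = cat_obs.copy()
for g in np.unique(labels):
    seen = cat_obs[(labels == g) & ~missing]
    filled[(labels == g) & missing] = np.bincount(seen).argmax()

print("imputation accuracy:",
      round(float(np.mean(filled[missing] == cat[missing])), 3))
```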

14.
There has been increasing interest in using semi-supervised learning to form a classifier. As is well known, the (Fisher) information in an unclassified feature with unknown class label is less (considerably less for weakly separated classes) than that of a classified feature with known class label. Hence, in the case where the absence of class labels does not depend on the data, the expected error rate of a classifier formed from the classified and unclassified features in a partially classified sample is greater than it would be if the sample were completely classified. We propose to treat the labels of the unclassified features as missing data and to introduce a framework for their missingness, as in the pioneering work of Rubin (Biometrika 63:581–592, 1976) on missingness in incomplete data analysis. An examination of several partially classified data sets in the literature suggests that the unclassified features are not occurring at random in the feature space, but rather tend to be concentrated in regions of relatively high entropy. This suggests that the missingness of the labels can be modelled by representing the conditional probability of a missing label via a logistic model whose covariate depends on the entropy of the feature, or on an appropriate proxy for it. We consider here the case of two normal classes with a common covariance matrix, where for computational convenience the square of the discriminant function is used as the covariate in the logistic model in place of the negative log entropy. Rather paradoxically, we show that the classifier so formed from the partially classified sample may have a smaller expected error rate than it would if the sample were completely classified.
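A small simulation of the missingness mechanism just described, under assumptions made for this sketch (two bivariate normal classes with common covariance, invented logistic coefficients, scikit-learn fits): labels near the class boundary, where the squared discriminant is small and the entropy high, go missing more often, and the mechanism is recovered by a logistic regression with the squared discriminant as covariate. The semi-supervised classifier itself is omitted.

```python
# Logistic missingness model with the squared discriminant as covariate.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 1000
z = rng.random(n) < 0.5
x = rng.standard_normal((n, 2)) + np.where(z[:, None], 1.0, -1.0)

d = LinearDiscriminantAnalysis().fit(x, z).decision_function(x)
d2 = (d ** 2).reshape(-1, 1)             # squared discriminant function

p_miss = 1 / (1 + np.exp(-(1.0 - 0.8 * d2[:, 0])))   # logistic missingness
miss = rng.random(n) < p_miss

fit = LogisticRegression().fit(d2, miss)
print(f"recovered slope {fit.coef_[0][0]:.2f} (truth -0.8)")
```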

15.
Stanislaw Gnot, Statistics (2013), 47(3): 343-349
Sample-based identification rules for the problem of two-group identification are defined. The essentially complete class of tests is used to describe the essentially complete class of sample-based identification rules. For the problem of multinomial identification, the minimal essentially complete class of sample-based rules is found and compared with the rules derived from density estimators.

16.
The term 'representation bias' describes the disparity between treatment effects estimated from field experiments and the effects that would be seen if the treatments were used in the field. In this paper we are specifically concerned with representation bias caused by disease inoculum travelling between plots, or out of the experimental area altogether; the scope for such bias is greatest for airborne-spread diseases. This paper extends the work of Deardon et al. (2004), using simulation methods to explore the relationship between design and representation bias. In doing so, we illustrate the importance of plot size and spacing, as well as treatment-to-plot allocation. We examine a novel class of designs, incomplete column designs, to develop an understanding of the mechanisms behind representation bias. We also introduce general methods for designing field trials that can limit representation bias by carefully controlling treatment-to-block allocation in both incomplete column and incomplete randomized block designs. Finally, we show how the common practice of sampling from the centres of plots, rather than from entire plots, can also help to control representation bias.
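A toy simulation of representation bias from interplot interference, with all numbers invented: plots sit on a circular line, control plots shed inoculum onto their neighbours, and the estimated treatment effect shrinks relative to the effect the treatments would show in isolation.

```python
# Toy demonstration: neighbour spill-over biases the estimated effect.
import numpy as np

rng = np.random.default_rng(10)
n_plots, reps = 40, 2000
effects = []
for _ in range(reps):
    trt = rng.permutation(np.repeat([0, 1], n_plots // 2))  # 0 = control
    base = np.where(trt == 0, 10.0, 2.0)     # disease levels in isolation
    spill = 0.3 * (np.roll(base, 1) + np.roll(base, -1))    # neighbour inoculum
    y = base + spill + rng.standard_normal(n_plots)
    effects.append(y[trt == 0].mean() - y[trt == 1].mean())

print("isolated effect: 8.0, mean estimated effect:",
      round(float(np.mean(effects)), 2))
```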

17.
Linear mixed models are regularly applied to animal and plant breeding data to evaluate genetic potential. Residual maximum likelihood (REML) is the preferred method for estimating variance parameters associated with this type of model. Typically an iterative algorithm is required for the estimation of variance parameters. Two algorithms which can be used for this purpose are the expectation-maximisation (EM) algorithm and the parameter expanded EM (PX-EM) algorithm. Both, particularly the EM algorithm, can be slow to converge when compared to a Newton-Raphson type scheme such as the average information (AI) algorithm. The EM and PX-EM algorithms require specification of the complete data, including the incomplete and missing data. We consider a new incomplete data specification based on a conditional derivation of REML. We illustrate the use of the resulting new algorithm through two examples: a sire model for lamb weight data and a balanced incomplete block soybean variety trial. In the cases where the AI algorithm failed, a REML PX-EM based on the new incomplete data specification converged in 28% to 30% fewer iterations than the alternative REML PX-EM specification. For the soybean example a REML EM algorithm using the new specification converged in fewer iterations than the current standard specification of a REML PX-EM algorithm. The new specification integrates linear mixed models, Henderson's mixed model equations, REML and the REML EM algorithm into a cohesive framework.
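A minimal sketch of an EM iteration for variance components in the simplest random-effects model, y_ij = mu + u_i + e_ij, treating the random effects as the missing data. This is plain ML on simulated data, not the REML EM/PX-EM variants or the incomplete-data specifications the paper compares.

```python
# EM for a one-way random-effects model: u_i are the "missing data".
import numpy as np

rng = np.random.default_rng(7)
g, r = 30, 5                                     # groups, replicates
u = rng.normal(0.0, np.sqrt(2.0), g)             # true sigma_u^2 = 2
y = 10.0 + u[:, None] + rng.standard_normal((g, r))   # true sigma_e^2 = 1

s2u, s2e, mu = 1.0, 1.0, y.mean()
for _ in range(200):
    # E-step: conditional mean and variance of each u_i given the data.
    shrink = s2u / (s2u + s2e / r)
    u_hat = shrink * (y.mean(axis=1) - mu)
    v_u = s2u * (1.0 - shrink)
    # M-step: update the two variances and the fixed effect.
    s2u = np.mean(u_hat ** 2) + v_u
    resid = y - mu - u_hat[:, None]
    s2e = np.mean(resid ** 2) + v_u
    mu = np.mean(y - u_hat[:, None])

print(f"sigma_u^2 ~ {s2u:.2f}, sigma_e^2 ~ {s2e:.2f}")
```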

18.
In this paper, we develop a simple nonparametric test of the independence of time to failure and cause of failure in the competing risks setup. We generalise the test to the situation where the failure data are right censored, and we obtain the asymptotic distribution of the test statistics for both complete and censored data. The efficiency loss due to censoring is studied using Pitman efficiency, and the performance of the proposed test is evaluated through simulations. Finally, we illustrate our test procedure using three real data sets.
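A simple way to see the null hypothesis at issue: independence of time to failure and cause means the failure times share one distribution across causes, so for complete data a permutation two-sample comparison of the cause-specific times gives a nonparametric test. The Kolmogorov-Smirnov statistic below is a stand-in assumed for this sketch; the paper's statistic and its censored-data extension differ.

```python
# Permutation test of independence of time and cause (complete data).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(8)
n = 300
cause = rng.integers(1, 3, n)                         # two competing causes
t = rng.exponential(np.where(cause == 1, 1.0, 1.4))   # dependence built in

obs = ks_2samp(t[cause == 1], t[cause == 2]).statistic
perm = np.array([ks_2samp(t[c == 1], t[c == 2]).statistic
                 for c in (rng.permutation(cause) for _ in range(500))])
print("permutation p-value:", np.mean(perm >= obs))
```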

19.
Complete sets of orthogonal F-squares of order n = s^p, where s is a prime or prime power and p is a positive integer, have been constructed by Hedayat, Raghavarao, and Seiden (1975). Federer (1977) constructed complete sets of orthogonal F-squares of order n = 4t, where t is a positive integer. We give a general procedure for constructing orthogonal F-squares of order n from an orthogonal array (n, k, s, 2) and an OL(s, t) set, where n is not necessarily a prime or prime power. In particular, we show how to construct sets of orthogonal F-squares of order n = 2s^p, where s is a prime or prime power and p is a positive integer. These sets are shown to be nearly complete and to approach complete sets as s and/or p becomes large. We also show how to construct orthogonal arrays by these methods. In addition, the best upper bound on the number t of orthogonal F(n, λ1), F(n, λ2), …, F(n, λt) squares is given.

20.
We propose a multiple imputation method based on principal component analysis (PCA) to deal with incomplete continuous data. To reflect the uncertainty of the parameters from one imputation to the next, we use a Bayesian treatment of the PCA model. Using a simulation study and real data sets, the method is compared with two classical approaches: multiple imputation based on joint modelling and on fully conditional modelling. Contrary to the others, the proposed method can easily be used on data sets where the number of individuals is smaller than the number of variables and where the variables are highly correlated. In addition, it provides unbiased point estimates of quantities of interest, such as an expectation, a regression coefficient or a correlation coefficient, with a smaller mean squared error. Furthermore, the confidence intervals built for the quantities of interest are often narrower whilst ensuring valid coverage.
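A minimal sketch of the deterministic core of PCA-based imputation (iterative PCA) on simulated low-rank data: alternate a rank-k reconstruction with re-filling the missing cells. The paper's method additionally draws the PCA parameters from a Bayesian treatment of the model to produce multiple imputations; that step is omitted here.

```python
# Iterative PCA imputation: alternate low-rank fit and refill.
import numpy as np

rng = np.random.default_rng(9)
n, p, k = 100, 6, 2
x = rng.standard_normal((n, k)) @ rng.standard_normal((k, p))
x += 0.1 * rng.standard_normal((n, p))
mask = rng.random((n, p)) < 0.2                  # 20% of cells missing

x_obs = np.where(mask, np.nan, x)
filled = np.where(mask, np.nanmean(x_obs, axis=0), x_obs)   # mean start

for _ in range(100):
    mu = filled.mean(axis=0)
    u, s, vt = np.linalg.svd(filled - mu, full_matrices=False)
    recon = mu + (u[:, :k] * s[:k]) @ vt[:k]     # rank-k reconstruction
    filled[mask] = recon[mask]                   # refresh missing cells only

rmse = float(np.sqrt(np.mean((filled[mask] - x[mask]) ** 2)))
print("imputation RMSE:", round(rmse, 3))
```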
