Similar Literature
20 similar documents retrieved.
1.
A genuine small-sample theory for post-stratification is developed in this paper. This includes the definition of a ratio estimator of the population mean, the derivation of its bias and its exact variance, and a discussion of variance estimation. The estimator has both a within-strata component of variance, which is comparable with that obtained under proportional-allocation stratified sampling, and a between-strata component of variance, which tends to zero as the overall sample size becomes large. Certain optimality properties of the estimator are obtained. The generalization of post-stratification from simple random sampling to post-stratification used in conjunction with stratification and multi-stage designs is discussed.
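For orientation (this is the familiar large-sample approximation rather than the paper's exact small-sample results), the post-stratified estimator of the population mean and its variance, with W_h the known stratum weights, \bar{y}_h the sample stratum means, S_h^2 the stratum variances and f the sampling fraction, are approximately

\[
\hat{\bar{Y}}_{\mathrm{ps}} = \sum_{h=1}^{L} W_h\,\bar{y}_h,
\qquad
\operatorname{Var}\bigl(\hat{\bar{Y}}_{\mathrm{ps}}\bigr) \approx
\frac{1-f}{n}\sum_{h=1}^{L} W_h S_h^2
+ \frac{1-f}{n^2}\sum_{h=1}^{L} (1-W_h) S_h^2 ,
\]

where the first term coincides with proportional-allocation stratified sampling (the within-strata component) and the second, of order 1/n^2, is the between-strata component that vanishes as the overall sample size grows.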

2.
When there is uncertainty in the attributes, common methods are not applicable to data clustering. In recent years, much research has used fuzzy concepts to represent this uncertainty, but when the attributes of the data elements follow probabilistic distributions, the uncertainty cannot be interpreted by fuzzy theory. In this article, a new concept for clustering elements whose attributes have predefined probabilistic distributions is proposed, so that each observation is a member of a cluster with a particular probability. Two metaheuristic algorithms are applied to deal with the problem. The squared Euclidean distance is used to calculate the similarity of data elements to cluster centers. A sensitivity analysis shows that the proposed approach converges to the results of the classic approaches as the variance of each point tends to zero. Moreover, numerical analysis confirms that the proposed approach is efficient for clustering probabilistic data.
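A minimal sketch of the central idea, assuming each observation is summarised by a mean vector and per-attribute variances and that assignments use the expected squared Euclidean distance to each centre; the simple k-means-style loop and the inverse-distance membership probabilities are illustrative stand-ins for the authors' metaheuristics.

import numpy as np

def expected_sq_dist(means, variances, center):
    # For X with mean mu and independent attribute variances s^2:
    # E||X - c||^2 = ||mu - c||^2 + sum(s^2)
    return ((means - center) ** 2).sum(axis=1) + variances.sum(axis=1)

def probabilistic_kmeans(means, variances, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = means[rng.choice(len(means), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = np.column_stack([expected_sq_dist(means, variances, c) for c in centers])
        labels = d.argmin(axis=1)
        centers = np.array([means[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    probs = 1.0 / (d + 1e-12)                      # soft membership (illustrative choice)
    probs /= probs.sum(axis=1, keepdims=True)
    return labels, probs, centers

# toy usage with Gaussian-distributed attributes
rng = np.random.default_rng(1)
mu = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
var = np.full_like(mu, 0.2)
labels, probs, centers = probabilistic_kmeans(mu, var, k=2)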

3.
Product-limit survival functions with correlated survival times
A simple variance estimator for product-limit survival functions is demonstrated for survival times with nested errors. Such data arise whenever survival times are observed within clusters of related observations. Greenwood's formula, which assumes independent observations, is not appropriate in this situation. A robust variance estimator is developed using Taylor series linearized values and the between-cluster variance estimator commonly used in multi-stage sample surveys. A simulation study shows that the between-cluster variance estimator is approximately unbiased and yields confidence intervals that maintain the nominal level for several patterns of correlated survival times. The simulation study also shows that Greenwood's formula underestimates the variance when the survival times are positively correlated within a cluster and yields confidence intervals that are too narrow. Extension to life table methods is also discussed.
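For reference, Greenwood's formula for the variance of the Kaplan–Meier (product-limit) estimator \hat S(t), with d_i events and n_i subjects at risk at event time t_i, is

\[
\widehat{\operatorname{Var}}\bigl[\hat S(t)\bigr]
= \hat S(t)^2 \sum_{t_i \le t} \frac{d_i}{n_i\,(n_i - d_i)} ,
\]

which treats all observations as independent; it is precisely this independence assumption that fails when survival times are clustered.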

4.
Wang Xing, Ma Xuan. Statistical Research, 2015, 32(10): 74-81
This article studies change-point estimation models for airfare price series shaped by the dynamic pricing mechanisms of the airline industry. It analyses the structural characteristics of airfare price series data and proposes a multi-stage change-point estimation framework suited to high-noise, step-shaped series with multiple pronounced change points. The framework cascades several well-established data analysis methods, including the DBSCAN algorithm, EM-based Gaussian mixture model clustering, agglomerative hierarchical clustering, and a change-point estimation method based on the product partition model. An empirical analysis of flights on the Beijing-Kunming route verifies the effectiveness and general applicability of the proposed framework.
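A rough sketch of what such a cascade could look like with off-the-shelf scikit-learn components; the stage order, the parameters, and the final label-switch rule (standing in for the product partition model step) are illustrative assumptions rather than the authors' exact framework.

import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

def step_change_points(prices, eps=5.0, min_samples=5, max_levels=6):
    """Cluster a noisy, step-shaped price series into levels and read off change points."""
    x = np.asarray(prices, dtype=float).reshape(-1, 1)
    # Stage 1: DBSCAN marks isolated high-noise observations as outliers (label -1)
    keep = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(x) != -1
    # Stage 2: EM / Gaussian mixture groups the retained prices into candidate levels
    gmm = GaussianMixture(n_components=min(max_levels, int(keep.sum())),
                          random_state=0).fit(x[keep])
    comp = gmm.predict(x[keep])
    # Stage 3: agglomerative clustering merges mixture components with nearly equal means
    merged = AgglomerativeClustering(n_clusters=None,
                                     distance_threshold=eps).fit_predict(gmm.means_)
    level = np.full(len(x), -1)
    level[keep] = merged[comp]
    # Stage 4 (placeholder for the product-partition step): declare a change point
    # wherever the price level switches between two retained observations
    return [t for t in range(1, len(level))
            if level[t] != -1 and level[t - 1] != -1 and level[t] != level[t - 1]]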

5.
In this article, we study the problem of estimating the prevalence rate of a disease in a geographical area, based on data collected from a sample of locations within this area. If there are several locations with zero incidence of the disease, the usual estimators are not suitable and so we develop a new estimator, together with an unbiased estimator of its variance, which may be appropriately used in such situations. An application of this estimator is illustrated with data from a large-scale survey, which was carried out in the city of Kolkata, India, to estimate the prevalence rate of stroke. We show that spatial modelling may be used to smooth the observed data before applying our proposed estimator. Our computations show that this smoothing helps to reduce the coefficient of variation and such a model-cum-design-based procedure is useful for estimating the prevalence rate. This method may of course be used in other similar situations.

6.
In this study, an evaluation of Bayesian hierarchical models is made based on simulation scenarios to compare single-stage and multi-stage Bayesian estimations. Simulated datasets of lung cancer disease counts for men aged 65 and older across 44 wards in the London Health Authority were analysed using a range of spatially structured random effect components. The goals of this study are to determine which of these single-stage models performs best given a certain simulating model, how estimation methods (single- vs. multi-stage) compare in yielding posterior estimates of fixed effects in the presence of spatially structured random effects, and finally which of two spatial prior models – the Leroux or the ICAR model – performs best in a multi-stage context under different assumptions concerning spatial correlation. Among the fitted single-stage models without covariates, we found that when there is a low amount of variability in the distribution of disease counts, the BYM model is relatively robust to misspecification in terms of DIC, while the Leroux model is the least robust to misspecification. When these models were fit to data generated from models with covariates, we found that when there was one set of covariates – either spatially correlated or non-spatially correlated – changing the values of the fixed coefficients affected the ability of either the Leroux or ICAR model to fit the data well in terms of DIC. When there were multiple sets of spatially correlated covariates in the simulating model, however, we could not distinguish the goodness of fit to the data between these single-stage models. We found that the multi-stage modelling process via the Leroux and ICAR models generally reduced the variance of the posterior estimated fixed effects for data generated from models with covariates and a UH term compared to analogous single-stage models. Finally, we found the multi-stage Leroux model compares favourably to the multi-stage ICAR model in terms of DIC. We conclude that the multi-stage Leroux model should be seriously considered in applications of Bayesian disease mapping when an investigator desires to fit a model with both fixed effects and spatially structured random effects to Poisson count data.
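As background (the paper's exact prior specifications are not reproduced here), the Leroux prior for a vector of spatial random effects \phi, with binary neighbourhood matrix W, diagonal matrix D of neighbour counts and mixing parameter \rho, can be written as

\[
\phi \sim \mathrm{N}\Bigl(\mathbf{0},\ \bigl[\tau\bigl(\rho\,(D - W) + (1-\rho)\,I\bigr)\bigr]^{-1}\Bigr),
\]

so that \rho = 0 gives an unstructured (independent) prior and \rho \to 1 recovers the ICAR prior, while the BYM model instead adds separate structured and unstructured components.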

7.
Studies on maturation and body composition mention age at peak height velocity (PHV) as an important measure that could predict adulthood outcomes. The age at PHV is often derived from growth models, such as the triple logistic model fitted to stature (height) data. Theoretically, for a well-behaved growth function, age at PHV can be obtained by setting the second derivative of the growth function to zero and solving for age. Such a solution obviously depends on the parameters of the growth function. Therefore, the uncertainty in the estimation of age at PHV resulting from the uncertainty in the estimation of the growth model needs to be accounted for in models in which it is used as a predictor. Explicit expressions for the age at PHV, and consequently for the variance of its estimate, do not exist for some of the commonly used nonlinear growth functions, such as the triple logistic function. Once an estimate of this variance is obtained, it can be incorporated in subsequent modeling either through measurement error models or by using the inverse variances as weights. A numerical method for estimating the variance is implemented. The accuracy of this method is demonstrated through comparisons in models where an explicit solution for the variance exists. The method of estimating the variance is illustrated by applying it to growth data from the Fels study, and the resulting estimates are subsequently used as weights in modeling two adulthood outcomes from the same study.
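A small numerical sketch of the general recipe, assuming a generic smooth growth function: locate the age at PHV as the maximiser of the first derivative (equivalently, the zero of the second derivative), then propagate parameter uncertainty with a delta-method approximation built from numerical gradients. The single-logistic curve, parameter values and covariance matrix below are placeholders, not the triple logistic fit or the Fels data.

import numpy as np
from scipy.optimize import minimize_scalar

def growth(t, theta):
    # placeholder single-logistic growth curve: a / (1 + exp(-b (t - c)))
    a, b, c = theta
    return a / (1.0 + np.exp(-b * (t - c)))

def velocity(t, theta, h=1e-4):
    return (growth(t + h, theta) - growth(t - h, theta)) / (2 * h)

def age_at_phv(theta, lo=2.0, hi=20.0):
    # PHV age maximises the velocity over the age range
    res = minimize_scalar(lambda t: -velocity(t, theta), bounds=(lo, hi),
                          method="bounded", options={"xatol": 1e-10})
    return res.x

def var_age_at_phv(theta, cov_theta, h=1e-4):
    # delta method: Var(g(theta)) ~= grad(g)' Cov grad(g), with a numerical gradient
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[j] += h
        tm[j] -= h
        grad[j] = (age_at_phv(tp) - age_at_phv(tm)) / (2 * h)
    return grad @ cov_theta @ grad

# illustrative use with made-up parameter estimates and covariance
theta_hat = np.array([170.0, 0.9, 12.0])
cov_hat = np.diag([4.0, 0.01, 0.25])
print(age_at_phv(theta_hat), var_age_at_phv(theta_hat, cov_hat))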

8.
Count data often display more zero outcomes than are expected under the Poisson regression model. The zero-inflated Poisson regression model has been suggested to handle zero-inflated data, whereas the zero-inflated negative binomial (ZINB) regression model has been fitted for zero-inflated data with additional overdispersion. For bivariate and zero-inflated cases, several regression models such as the bivariate zero-inflated Poisson (BZIP) and bivariate zero-inflated negative binomial (BZINB) have been considered. This paper introduces several forms of nested BZINB regression models that can be fitted to bivariate and zero-inflated count data. The mean–variance approach is used for comparing the BZIP and our forms of BZINB regression model in this study. A similar approach was also used by past researchers for defining several negative binomial and zero-inflated negative binomial regression models based on the appearance of linear and quadratic terms in the variance function. The nested BZINB regression models proposed in this study have several advantages: likelihood ratio tests can be performed for choosing the best model, the models have flexible forms of marginal mean–variance relationship, the models can be fitted to bivariate zero-inflated count data with positive or negative correlations, and the models allow additional overdispersion of the two dependent variables.
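For context (the paper's nested parameterisations are not reproduced here), the marginal mean–variance relationships of the univariate ZIP and ZINB models, with zero-inflation probability \pi, count mean \mu and negative binomial dispersion k, are

\[
\mathrm{ZIP:}\quad E[Y] = (1-\pi)\mu,\qquad \operatorname{Var}(Y) = (1-\pi)\mu\,(1 + \pi\mu),
\]
\[
\mathrm{ZINB:}\quad E[Y] = (1-\pi)\mu,\qquad \operatorname{Var}(Y) = (1-\pi)\mu\,\bigl(1 + \mu(\pi + 1/k)\bigr),
\]

so the quadratic term in \mu separates the overdispersion contributed by zero inflation (\pi) from that contributed by the negative binomial dispersion (1/k).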

9.
Clustering due to unobserved heterogeneity may seriously affect inference from binary regression models. We examined the performance of the logistic and the logistic-normal models for data with such clustering. For the logistic model, the size of the bias of the maximum likelihood (ML) estimator is determined by the total variance of the unobserved heterogeneity rather than by the level of clustering. Incorrectly specifying the clustering as level 2 in the logistic-normal model yields biased estimates of both the structural and the random parameters, whereas specifying it as level 1 yields unbiased estimates of the former and adequately estimates the latter. The proposed procedure appeals to many research areas.
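For reference, the logistic-normal (random-intercept logistic) model for a binary response y_ij in cluster j can be written as

\[
\operatorname{logit}\,\Pr(y_{ij}=1 \mid u_j) = x_{ij}'\beta + u_j,
\qquad u_j \sim \mathrm{N}(0, \sigma_u^2),
\]

where \sigma_u^2 is the variance of the unobserved heterogeneity; the plain logistic model corresponds to \sigma_u^2 = 0.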

10.
Based on recent developments in the field of operations research, we propose two adaptive resampling algorithms for estimating bootstrap distributions. One algorithm applies the principle of the recently proposed cross-entropy (CE) method for rare event simulation, and does not require calculation of the resampling probability weights via numerical optimization methods (e.g., Newton's method), whereas the other algorithm can be viewed as a multi-stage extension of the classical two-step variance minimization approach. The two algorithms can be easily used as part of a general algorithm for Monte Carlo calculation of bootstrap confidence intervals and tests, and are especially useful in estimating rare event probabilities. We analyze theoretical properties of both algorithms in an idealized setting and carry out simulation studies to demonstrate their performance. Empirical results on both one-sample and two-sample problems as well as a real survival data set show that the proposed algorithms are not only superior to traditional approaches, but may also provide more than an order of magnitude of computational efficiency gains.

11.
Three modified tests for homogeneity of the odds ratio for a series of 2 × 2 tables are studied when the data are clustered. In the case of clustered data, the standard tests for homogeneity of odds ratios ignore the variance inflation caused by positive correlation among responses of subjects within the same cluster, and therefore have inflated Type I error. The modified tests adjust for the variance inflation in the three existing standard tests: Breslow–Day, Tarone and the conditional score test. The degree of clustering effect is measured by the intracluster correlation coefficient, ρ. A variance correction factor derived from ρ is then applied to the variance estimator in the standard tests of homogeneity of the odds ratio. The proposed tests are an application of the variance adjustment method commonly used in correlated data analysis and are shown to maintain the nominal significance level in a simulation study. Copyright © 2004 John Wiley & Sons, Ltd.
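The variance-inflation idea behind such adjustments is the familiar design-effect correction from correlated data analysis: with a common cluster size m and intracluster correlation \rho, a variance computed under independence is rescaled as

\[
\widehat{\operatorname{Var}}_{\text{adj}} = \bigl[\,1 + (m-1)\hat\rho\,\bigr]\,\widehat{\operatorname{Var}}_{\text{indep}},
\]

with an average cluster size substituted when cluster sizes vary; the exact correction factor used in the paper's modified Breslow–Day, Tarone and conditional score tests is not reproduced here.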

12.
Comparison of treatment effects in an experiment is usually done through analysis of variance under the assumption that the errors are normally and independently distributed with zero mean and constant variance. The traditional approach in dealing with non-constant variance is to apply a variance stabilizing transformation and then run the analysis on the transformed data. In this approach, the conclusions of analysis of variance apply only to the transformed population. In this paper, the asymptotic quasi-likelihood method is introduced to the analysis of experimental designs. The weak assumptions of the asymptotic quasi-likelihood method make it possible to draw conclusions on heterogeneous populations without transforming them. This paper demonstrates how to apply the asymptotic quasi-likelihood technique to three commonly used models. This gives a possible way to analyse data given a complex experimental design.

13.
State-space models are widely used in ecology. However, it is well known that in practice it can be difficult to estimate both the process and observation variances that occur in such models. We consider this issue for integrated population models, which incorporate state-space models for population dynamics. To some extent, the mechanism of integrated population models protects against this problem, but it can still arise, and two illustrations are provided, in each of which the observation variance is estimated as zero. In the context of an extended case study involving data on British Grey herons, we consider alternative approaches for dealing with the problem when it occurs. In particular, we consider penalised likelihood, a method based on fitting splines and a method of pseudo-replication, which is undertaken via a simple bootstrap procedure. For the case study of the paper, it is shown that when it occurs, an estimate of zero observation variance is unimportant for inference relating to the model parameters of primary interest. This unexpected finding is supported by a simulation study.
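As a reminder of where the two variances enter (a generic sketch, not the heron model itself), a state-space population model has a process equation and an observation equation,

\[
n_t = f(n_{t-1}) + \epsilon_t,\quad \epsilon_t \sim \mathrm{N}\bigl(0, \sigma^2_{\mathrm{proc}}\bigr),
\qquad
y_t = n_t + \eta_t,\quad \eta_t \sim \mathrm{N}\bigl(0, \sigma^2_{\mathrm{obs}}\bigr),
\]

and a boundary estimate \hat\sigma^2_{\mathrm{obs}} = 0 corresponds to the fitted likelihood attributing all of the variation in the counts to the process equation.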

14.
Monte Carlo simulation is used to study the statistical properties of the sample moments of STAR models. The results show that the sample mean, sample variance, sample skewness and sample kurtosis of a STAR model are all asymptotically normally distributed; that even when the data-generating process of the STAR model contains no constant term, its population mean may still be non-zero, which is a marked difference from linear ARMA models; and that even when the error term in the data-generating process of the STAR model follows a normal distribution, the data may still have a skewed distribution.
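A minimal Monte Carlo sketch in the spirit of the study, simulating a logistic STAR(1) process without a constant term and recording its sample moments across replications; the particular LSTAR specification and parameter values are illustrative assumptions.

import numpy as np
from scipy.stats import skew, kurtosis

def simulate_lstar1(n, phi1=0.3, phi2=-0.6, gamma=5.0, c=0.0, burn=200, seed=None):
    rng = np.random.default_rng(seed)
    y = np.zeros(n + burn)
    e = rng.standard_normal(n + burn)                          # Gaussian errors
    for t in range(1, n + burn):
        G = 1.0 / (1.0 + np.exp(-gamma * (y[t - 1] - c)))      # logistic transition
        y[t] = phi1 * y[t - 1] + phi2 * G * y[t - 1] + e[t]    # no constant term
    return y[burn:]

def sample_moments(y):
    return y.mean(), y.var(ddof=1), skew(y), kurtosis(y)

reps, n = 2000, 500
m = np.array([sample_moments(simulate_lstar1(n, seed=r)) for r in range(reps)])
# Even without a constant term the average sample mean need not be zero, and the
# simulated series can be skewed despite the Gaussian errors.
print(m.mean(axis=0), m.std(axis=0))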

15.
Although there is no shortage of clustering algorithms proposed in the literature, the question of the most relevant strategy for clustering compositional data (i.e. data whose rows belong to the simplex) remains largely unexplored in cases where the observed value is equal or close to zero for one or more samples. This work is motivated by the analysis of two applications, both focused on the categorization of compositional profiles: (1) identifying groups of co-expressed genes from high-throughput RNA sequencing data, in which a given gene may be completely silent in one or more experimental conditions; and (2) finding patterns in the usage of stations over the course of one week in the Velib' bicycle sharing system in Paris, France. For both of these applications, we make use of appropriately chosen data transformations, including the Centered Log Ratio and a novel extension called the Log Centered Log Ratio, in conjunction with the K-means algorithm. We use a non-asymptotic penalized criterion, whose penalty is calibrated with the slope heuristics, to select the number of clusters. Finally, we illustrate the performance of this clustering strategy, which is implemented in the Bioconductor package coseq, on both the gene expression and bicycle sharing system data.
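A compact sketch of the transformation-then-cluster strategy, using the standard Centered Log Ratio with a small pseudo-count standing in for the zero handling; the authors' Log Centered Log Ratio extension and the slope-heuristics penalty calibration implemented in coseq are not reproduced here.

import numpy as np
from sklearn.cluster import KMeans

def clr(counts, pseudo=0.5):
    """Centered log-ratio of compositional rows; a pseudo-count guards against zeros."""
    x = np.asarray(counts, dtype=float) + pseudo
    comp = x / x.sum(axis=1, keepdims=True)           # rows mapped onto the simplex
    logc = np.log(comp)
    return logc - logc.mean(axis=1, keepdims=True)    # subtract the log geometric mean per row

def cluster_profiles(counts, k_grid=range(2, 16), seed=0):
    z = clr(counts)
    fits = {k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(z) for k in k_grid}
    # model-selection placeholder: report within-cluster inertia per k; the paper instead
    # selects k with a non-asymptotic penalized criterion calibrated by the slope heuristics
    return {k: f.inertia_ for k, f in fits.items()}, fits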

16.
A new method for constructing interpretable principal components is proposed. The method first clusters the variables, and then interpretable (sparse) components are constructed from the correlation matrices of the clustered variables. For the first step of the method, a new weighted-variances method for clustering variables is proposed. It reflects the nature of the problem that the interpretable components should maximize the explained variance and thus provide sparse dimension reduction. An important feature of the new clustering procedure is that the optimal number of clusters (and components) can be determined in a non-subjective manner. The new method is illustrated using well-known simulated and real data sets. It clearly outperforms many existing methods for sparse principal component analysis in terms of both explained variance and sparseness.

17.
One of the most popular methods for partitioning data into k clusters is the k-means clustering algorithm. Since this method relies on basic conditions such as the existence of the mean and a finite variance, it is unsuitable for data whose variances are infinite, such as data with heavy-tailed distributions. The Pitman Measure of Closeness (PMC) is a criterion for how close an estimator is to its parameter relative to another estimator. In this article, using PMC and building on k-means clustering, a new distance and clustering algorithm is developed for heavy-tailed data.

18.
A proper log-rank test for comparing two waiting (i.e. sojourn, gap) times under right censored data has been absent in the survival literature. The classical log-rank test provides a biased comparison even under independent right censoring since the censoring induced on the time since state entry depends on the entry time unless the hazards are semi-Markov. We develop test statistics for comparing K waiting time distributions from a multi-stage model in which censoring and waiting times may be dependent upon the transition history in the multi-stage model. To account for such dependent censoring, the proposed test statistics utilize an inverse probability of censoring weighted (IPCW) approach previously employed to define estimators for the cumulative hazard and survival function for waiting times in multi-stage models. We develop the test statistics as analogues to K-sample log-rank statistics for failure time data, and weak convergence to a Gaussian limit is demonstrated. A simulation study demonstrates the appropriateness of the test statistics in designs that violate typical independence assumptions for multi-stage models, under which naive test statistics for failure time data perform poorly, and illustrates the superiority of the test under proportional hazards alternatives to a Mann–Whitney type test. We apply the test statistics to an existing data set of burn patients.

19.
Cluster analysis is one of the most widely used methods in statistical analysis, in which homogeneous subgroups are identified in a heterogeneous population. Because mixed continuous and discrete data arise in many applications, ordinary clustering methods such as hierarchical methods, k-means and model-based methods have been extended to the analysis of mixed data. However, in the available model-based clustering methods, the number of parameters grows as the number of continuous variables increases, and identifying and fitting an appropriate model may be difficult. In this paper, to reduce the number of parameters, a set of parsimonious models is introduced for model-based clustering of mixed continuous (normal) and nominal data. The models in this set use the general location model approach for the joint distribution of the mixed variables and a factor-analyzer structure for the covariance matrices. The ECM algorithm is used to estimate the parameters of these models. To show the clustering performance of the proposed models, results from simulation studies and the analysis of two real data sets are presented.

20.
When modeling multilevel data, it is important to accurately represent the interdependence of observations within clusters. Ignoring data clustering may result in parameter misestimation. However, it is not well established to what degree parameter estimates are affected by model misspecification when applying missing data techniques (MDTs) to incomplete multilevel data. We compare the performance of three MDTs with incomplete hierarchical data. We consider the impact of imputation model misspecification on the quality of parameter estimates by employing multiple imputation under assumptions of a normal model (MI/NM) with two-level cross-sectional data when values are missing at random on the dependent variable at rates of 10%, 30%, and 50%. Five criteria are used to compare estimates from MI/NM to estimates from MI assuming a linear mixed model (MI/LMM) and maximum likelihood estimation to the same incomplete data sets. With 10% missing data (MD), techniques performed similarly for fixed-effects estimates, but variance components were biased with MI/NM. Effects of model misspecification worsened at higher rates of MD, with the hierarchical structure of the data markedly underrepresented by biased variance component estimates. MI/LMM and maximum likelihood provided generally accurate and unbiased parameter estimates but performance was negatively affected by increased rates of MD.
