Similar articles (20 results)
1.
Preferential attachment is a proportionate growth process in networks, where nodes receive new links in proportion to their current degree. Preferential attachment is a popular generative mechanism to explain the widespread observation of power-law-distributed networks. An alternative explanation for the phenomenon is a randomly grown network with large individual variation in growth rates among the nodes (frailty). We analytically derive the distribution of individual rates that reproduces the connectivity distribution obtained from a general preferential attachment process (Yule process), and we examine the structural differences between the two types of graphs by simulation. We present a statistical test to distinguish the two generative mechanisms from each other, and we apply the test to both simulated data and two real data sets of scientific citation and sexual partner networks. The findings from the latter analyses argue for frailty effects as an important mechanism underlying the dynamics of complex networks.
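As a minimal sketch of the preferential attachment mechanism described above, the following grows a graph by degree-proportional attachment, using a sampling pool in which each node appears once per incident edge; the function and parameter names are illustrative and not taken from the paper.

```python
import random

def grow_pa_network(n_nodes, seed=0):
    """Grow a graph by preferential attachment: each new node attaches one
    link to an existing node chosen with probability proportional to degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]   # seed graph: two nodes joined by one edge
    # Each endpoint appears in the pool once per incident edge, so uniform
    # sampling from the pool is degree-proportional sampling.
    pool = [0, 1]
    for new_node in range(2, n_nodes):
        target = rng.choice(pool)
        edges.append((new_node, target))
        pool.extend([new_node, target])
    return edges

edges = grow_pa_network(10000)   # the resulting degree distribution is heavy-tailed
```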

2.
Classical inferential procedures induce conclusions from a set of data to a population of interest, accounting for the imprecision resulting from the stochastic component of the model. Less attention is devoted to the uncertainty arising from (unplanned) incompleteness in the data. Through the choice of an identifiable model for non-ignorable non-response, one narrows the possible data-generating mechanisms to the point where inference suffers only from imprecision. Some proposals have been made for assessing the sensitivity to these modelling assumptions; many are based on fitting several plausible but competing models. For example, we could assume that the missing data are missing at random in one model, and then fit an additional model in which non-random missingness is assumed. On the basis of data from a Slovenian plebiscite conducted in 1991 to prepare for independence, it is shown that such an ad hoc procedure may be misleading. We propose an approach which identifies and incorporates both sources of uncertainty in inference: imprecision due to finite sampling and ignorance due to incompleteness. A simple sensitivity analysis considers a finite set of plausible models. We take this idea one step further by considering more degrees of freedom than the data support. This produces sets of estimates (regions of ignorance) and sets of confidence regions (combined into regions of uncertainty).

3.
Missing data are often problematic in social network analysis, since what is missing may alter the conclusions drawn from what we have observed: tie variables need to be interpreted in relation to both their local neighbourhood and the global structure. Some ad hoc methods for dealing with missing data in social networks have been proposed, but here we consider a model-based approach. We discuss various aspects of fitting exponential family random graph (or p-star) models (ERGMs) to networks with missing data and present a Bayesian data augmentation algorithm for the purpose of estimation. This involves drawing from the full conditional posterior distribution of the parameters, something made possible by recently developed algorithms. With ERGMs already having complicated interdependencies, it is particularly important to provide inference that adequately describes the uncertainty, which the Bayesian approach provides. To the extent that we wish to explore the missing parts of the network, the posterior predictive distributions, immediately available at the termination of the algorithm, allow us to explore the distribution of what is missing without conditioning on any particular parameter values. Some important features of treating missing data, and of the implementation of the algorithm, are illustrated using a well-known collaboration network and a variety of missing data scenarios.
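For reference, the ERGM family referred to above has the standard form below (generic notation, not the paper's); Bayesian data augmentation then alternates between drawing the parameters given a completed network and drawing the missing tie variables given the parameters. The intractable normalising constant κ(θ) is what makes the recently developed algorithms mentioned in the abstract necessary.

```latex
P_{\theta}(Y = y) = \frac{\exp\{\theta^{\top} g(y)\}}{\kappa(\theta)},
\qquad
\kappa(\theta) = \sum_{y' \in \mathcal{Y}} \exp\{\theta^{\top} g(y')\};
\qquad
\theta^{(t+1)} \sim p\bigl(\theta \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}^{(t)}\bigr),
\quad
y_{\mathrm{mis}}^{(t+1)} \sim p\bigl(y_{\mathrm{mis}} \mid \theta^{(t+1)}, y_{\mathrm{obs}}\bigr).
```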

4.
The EM algorithm is often used for finding the maximum likelihood estimates in generalized linear models with incomplete data. In this article, the author presents a robust method, within the framework of maximum likelihood estimation, for fitting generalized linear models when covariates are missing nonignorably. This robust approach downweights influential observations when estimating the model parameters. To avoid computational problems involving irreducibly high-dimensional integrals, he adopts a Metropolis-Hastings algorithm based on a Markov chain sampling method. He carries out simulations to investigate the behaviour of the robust estimates in the presence of outliers and missing covariates, and compares these estimates to the classical maximum likelihood estimates. Finally, he illustrates the approach using data on the occurrence of delirium in patients operated on for abdominal aortic aneurysm.
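The abstract relies on a Metropolis-Hastings step to avoid high-dimensional integration; below is a minimal, generic random-walk Metropolis-Hastings sketch, not the author's sampler, with a placeholder one-dimensional target density.

```python
import math
import random

def metropolis_hastings(log_target, x0, n_iter=5000, step=0.5, seed=0):
    """Generic random-walk Metropolis-Hastings sampler for a scalar parameter.
    log_target: function returning the log of the (unnormalised) target density."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_iter):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_target(proposal) - log_target(x)
        if math.log(rng.random()) < log_ratio:
            x = proposal   # accept the move; otherwise keep the current state
        samples.append(x)
    return samples

# Toy usage: sample from a standard normal target.
draws = metropolis_hastings(lambda z: -0.5 * z * z, x0=0.0)
```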

5.
处理缺失数据中辅助信息的利用 (The Use of Auxiliary Information in Handling Missing Data)
金勇进 《统计研究》1998,15(1):43-45
Missing data are frequently encountered in statistical analysis. They arise in different ways, chiefly from nonresponse in surveys. In addition, interviewers may overlook certain items during data collection, and values found during the checking and processing of the survey data to be illogical, clearly erroneous or deliberately falsified may be discarded; all of these produce missing data. The harm caused by missing data is evident: it not only reduces the number of units actually surveyed, inflating the variance of the estimators in a sample survey, but can also bias the estimators, and it is therefore an important factor affecting the quality of statistical data. In general, missing data call for a follow-up survey so that the missing values can be filled in. Sometimes, however, owing to various reasons and constraints, such a follow-up survey either cannot be carried out or still fails to solve the problem. In that case two questions are of particular concern: first, how large is the impact of the missing data, that is, can the bias of the estimators caused by the missingness be estimated; and second, how can the missing data be compensated for. Both questions are related to auxiliary information, and this paper analyses them from that perspective.
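As a hedged illustration of the first question raised above (estimating the bias caused by missing data): under the standard two-stratum (respondent/nonrespondent) view of nonresponse, the bias of the respondent mean takes the textbook form below; the notation is generic and not necessarily the author's.

```latex
\operatorname{Bias}(\bar{y}_r) \;=\; \bar{Y}_r - \bar{Y} \;=\; W_m\,(\bar{Y}_r - \bar{Y}_m),
```

where W_m is the population proportion of nonrespondents and \bar{Y}_r, \bar{Y}_m are the respondent- and nonrespondent-stratum means; auxiliary information is what provides a handle on the unobservable \bar{Y}_m.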

6.
Bayesian hierarchical formulations are used by the U.S. Bureau of Labor Statistics (BLS) with respondent-level data for missing item imputation, because these formulations are readily parameterized to capture correlation structures. BLS collects survey data under informative sampling designs, in which the probabilities of inclusion are correlated with the response; sampling-weighted pseudo posterior distributions are estimated for asymptotically unbiased inference about population model parameters. Computation is expensive and does not support BLS production schedules. We propose a new method to scale the computation that divides the data into smaller subsets, estimates a sampling-weighted pseudo posterior distribution, in parallel, for every subset, and combines the pseudo posterior parameter samples from all the subsets through their mean in the Wasserstein space of order 2. We construct conditions on a class of sampling designs under which posterior consistency of the proposed method is achieved. We demonstrate, on both synthetic data and in application to the Current Employment Statistics survey, that our method produces results of similar accuracy to the usual approach while offering substantially faster computation.
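For a scalar parameter, the mean in the Wasserstein space of order 2 of equal-sized sample sets can be approximated by averaging order statistics, i.e. averaging the empirical quantile functions; a minimal sketch under that assumption, not BLS production code.

```python
import numpy as np

def wasserstein2_mean(subset_samples):
    """Combine 1-D posterior draws from K equal-sized subsets via their
    Wasserstein-2 barycenter: sort each subset's draws and average the
    order statistics across subsets."""
    sorted_draws = np.sort(np.asarray(subset_samples), axis=1)   # shape (K, S)
    return sorted_draws.mean(axis=0)                             # shape (S,)

# Toy example: three subset pseudo-posteriors for one parameter.
rng = np.random.default_rng(0)
subsets = [rng.normal(loc=mu, scale=1.0, size=2000) for mu in (0.9, 1.0, 1.1)]
combined = wasserstein2_mean(subsets)
print(combined.mean())   # close to the average of the subset means
```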

7.
Data in the social, behavioural and health sciences frequently come from observational studies instead of controlled experiments. In addition to random errors, observational data typically contain additional sources of uncertainty such as missing values, unmeasured confounders and selection biases. Also, the research question is often different from that which a particular source of data was designed to answer, and so not all relevant variables are measured. As a result, multiple sources of data are often necessary to identify the biases and to inform about different aspects of the research question. Bayesian graphical models provide a coherent way to connect a series of local submodels, based on different data sets, into a global unified analysis. We present a unified modelling framework that will account for multiple biases simultaneously and give more accurate parameter estimates than standard approaches. We illustrate our approach by analysing data from a study of water disinfection by-products and adverse birth outcomes in the UK.

8.
Non-likelihood-based methods for repeated measures analysis of binary data in clinical trials can result in biased estimates of treatment effects and associated standard errors when the dropout process is not completely at random. We tested the utility of a multiple imputation approach in reducing these biases. Simulations were used to compare performance of multiple imputation with generalized estimating equations and restricted pseudo-likelihood in five representative clinical trial profiles for estimating (a) overall treatment effects and (b) treatment differences at the last scheduled visit. In clinical trials with moderate to high (40-60%) dropout rates with dropouts missing at random, multiple imputation led to less biased and more precise estimates of treatment differences for binary outcomes based on underlying continuous scores.
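For context, multiple imputation inference pools the M completed-data analyses with Rubin's combining rules; a generic sketch is given below (the simulation settings of the paper are not reproduced).

```python
import numpy as np

def rubins_rules(estimates, variances):
    """Pool M completed-data estimates and their variances.
    Returns the pooled estimate and its total variance
    T = within-imputation variance + (1 + 1/M) * between-imputation variance."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                      # pooled point estimate
    u_bar = u.mean()                      # within-imputation variance
    b = q.var(ddof=1)                     # between-imputation variance
    t = u_bar + (1.0 + 1.0 / m) * b       # total variance
    return q_bar, t

# Hypothetical estimates and variances from M = 5 imputed data sets.
est, var = rubins_rules([0.42, 0.45, 0.40, 0.47, 0.43],
                        [0.010, 0.011, 0.009, 0.012, 0.010])
```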

9.
Although the effect of missing data on regression estimates has received considerable attention, their effect on predictive performance has been neglected. We studied the performance of three missing data strategies (omission of records with missing values, replacement with a mean, and imputation based on regression) on the predictive performance of logistic regression (LR), classification tree (CT) and neural network (NN) models in the presence of data missing completely at random (MCAR). Models were constructed using datasets of size 500 simulated from a joint distribution of binary and continuous predictors including nonlinearities, collinearity and interactions between variables. Though omission produced models that fit better on the data from which the models were developed, imputation was superior on average to omission for all models when evaluating the receiver operating characteristic (ROC) curve area, mean squared error (MSE), pooled variance across outcome categories and calibration χ2 on an independently generated test set. However, in about one-third of simulations, omission performed better. Performance was also more variable with omission, including quite a few instances of extremely poor performance. Replacement and imputation generally produced similar results, except with neural networks, for which replacement, the strategy typically used in neural network algorithms, was inferior to imputation. Missing data affected simpler models much less than they did more complex models such as generalized additive models that focus on local structure. For moderate-sized datasets, logistic regressions that use simple nonlinear structures such as quadratic terms and piecewise linear splines appear to be at least as robust to randomly missing values as neural networks and classification trees.
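A minimal sketch of two of the three strategies compared above, mean replacement and regression-based imputation, for a single continuous predictor missing completely at random; the variable names and the 30% missingness rate are illustrative, not the paper's simulation design.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)    # x2 is correlated with x1
miss = rng.random(n) < 0.3                       # 30% of x2 missing completely at random
x2_obs = np.where(miss, np.nan, x2)

# Strategy 1: replacement with the observed mean.
x2_mean_filled = np.where(miss, np.nanmean(x2_obs), x2_obs)

# Strategy 2: regression imputation -- regress x2 on x1 among complete cases,
# then predict the missing values from the fitted line.
X_cc = np.column_stack([np.ones((~miss).sum()), x1[~miss]])
beta, *_ = np.linalg.lstsq(X_cc, x2[~miss], rcond=None)
x2_reg_filled = np.where(miss, beta[0] + beta[1] * x1, x2_obs)
```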

10.
We adapt existing statistical modeling techniques for social networks to study consumption data observed in trophic food webs. These data describe the (non-negative) feeding volume among organisms grouped into nodes, called trophic species, that form the food web. Model complexity arises from the extensive number of zeros in the data, as each node in the web is predator or prey to only a small number of other trophic species. Many of the zeros are regarded as structural (non-random) in the context of feeding behavior. The presence of basal prey and top predator nodes (those that never consume and those that are never consumed, with probability 1) adds further complexity to the statistical modeling. We develop a special statistical social network model to account for such network features. The model is applied to two empirical food webs; the focus is on the web for which the population size of seals is of concern to various commercial fisheries.

11.
In this paper we propose a latent-class-based multiple imputation approach for analyzing missing categorical covariate data in a highly stratified data model. In this approach, we impute the missing data assuming a latent class imputation model and use likelihood methods to analyze the imputed data. Via extensive simulations, we study its statistical properties and make comparisons with complete case analysis, multiple imputation, saturated log-linear multiple imputation and the Expectation-Maximization approach under seven missing data mechanisms (including missing completely at random, missing at random and not missing at random). These methods are compared with respect to bias, asymptotic standard error, type I error, and 95% coverage probabilities of parameter estimates. Simulations show that, under many missingness scenarios, latent class multiple imputation performs favorably when these criteria are considered jointly. A data example from a matched case-control study of the association between multiple myeloma and polymorphisms of the Interleukin-6 genes is considered.

12.
Gene regulatory networks are collections of genes that interact with one another and with other substances in the cell. By measuring gene expression over time using high-throughput technologies, it may be possible to reverse engineer, or infer, the structure of the gene network involved in a particular cellular process. These gene expression data typically have high dimensionality and a limited number of biological replicates and time points. Because of these issues and the complexity of biological systems, the problem of reverse engineering networks from gene expression data demands a specialized suite of statistical tools and methodologies. We propose a non-standard adaptation of a simulation-based approach known as Approximate Bayesian Computation, based on Markov chain Monte Carlo sampling. This approach is particularly well suited to the inference of gene regulatory networks from longitudinal data. The performance of the approach is investigated via simulations and using longitudinal expression data from a genetic repair system in Escherichia coli.
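A generic sketch of the simulation-based ABC-MCMC idea referred to above: a proposed parameter is retained only if data simulated under it fall within a tolerance of the observed summaries and a prior-ratio test passes. The simulator, distance, tolerance and scalar parameter here are placeholders, not the paper's gene-network model.

```python
import math
import random

def abc_mcmc(simulate, distance, observed, log_prior, theta0,
             n_iter=10000, step=0.1, tol=0.5, seed=0):
    """Approximate Bayesian Computation with MCMC: symmetric random-walk
    proposal, accepted only when the simulated data lie within `tol` of the
    observed data and the prior ratio test passes."""
    rng = random.Random(seed)
    theta = theta0
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        simulated = simulate(proposal, rng)
        if distance(simulated, observed) <= tol:
            log_ratio = log_prior(proposal) - log_prior(theta)
            if math.log(rng.random()) < log_ratio:
                theta = proposal
        chain.append(theta)
    return chain

# Toy usage: infer a normal mean from a single observed value.
chain = abc_mcmc(simulate=lambda th, r: th + r.gauss(0.0, 1.0),
                 distance=lambda s, o: abs(s - o),
                 observed=2.0,
                 log_prior=lambda th: -0.5 * th * th / 100.0,   # N(0, 10^2) prior
                 theta0=0.0)
```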

13.
This article proposes a Bayesian approach, which can simultaneously obtain the Bayesian estimates of unknown parameters and random effects, to analyze nonlinear reproductive dispersion mixed models (NRDMMs) for longitudinal data with nonignorable missing covariates and responses. The logistic regression model is employed to model the missing data mechanisms for missing covariates and responses. A hybrid sampling procedure combining the Gibbs sampler and the Metropolis-Hastings algorithm is presented to draw observations from the conditional distributions. Because the missing data mechanism is not testable, we develop the logarithm of the pseudo-marginal likelihood, the deviance information criterion, the Bayes factor, and the pseudo-Bayes factor to compare several competing missing data mechanism models in the considered NRDMMs with nonignorable missing covariates and responses. Three simulation studies and a real example taken from the paediatric AIDS clinical trial group ACTG are used to illustrate the proposed methodologies. Empirical results show that the proposed methods are effective in selecting missing data mechanism models.
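One of the comparison criteria above, the logarithm of the pseudo-marginal likelihood (LPML), is commonly computed from conditional predictive ordinates (CPO) using posterior draws; a generic sketch follows, with the observation-wise log-likelihoods supplied as input rather than the paper's model.

```python
import numpy as np
from scipy.special import logsumexp

def lpml(loglik_matrix):
    """LPML = sum_i log CPO_i, where CPO_i is estimated by the harmonic mean
    of f(y_i | theta_s) over S posterior draws.
    loglik_matrix: array of shape (S, n) holding log f(y_i | theta_s)."""
    ll = np.asarray(loglik_matrix, dtype=float)
    n_draws = ll.shape[0]
    # log CPO_i = log S - logsumexp_s( -loglik[s, i] ), computed stably.
    log_cpo = np.log(n_draws) - logsumexp(-ll, axis=0)
    return log_cpo.sum()
```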

14.
In longitudinal studies, missingness of data is often an unavoidable problem. Estimators from the linear mixed effects model assume that missing data are missing at random. However, estimators are biased when this assumption is not met. In the paper, theoretical results for the asymptotic bias are established under non-ignorable drop-out, drop-in and other missing data patterns. The asymptotic bias is large when the drop-out subjects have only one or no observation, especially for slope-related parameters of the linear mixed effects model. In the drop-in case, intercept-related parameter estimators show substantial asymptotic bias when subjects enter late in the study. Eight other missing data patterns are considered and these produce asymptotic biases of a variety of magnitudes.

15.
Many epidemic models approximate social contact behavior by assuming random mixing within mixing groups (e.g., homes, schools, and workplaces). The effect of more realistic social network structure on estimates of epidemic parameters is an open area of exploration. We develop a detailed statistical model to estimate the social contact network within a high school using friendship network data and a survey of contact behavior. Our contact network model includes classroom structure, longer durations of contact with friends than with non-friends, and more frequent contacts with friends, based on reports in the contact survey. We performed simulation studies to explore which network structures are relevant to influenza transmission. These studies yield two key findings. First, we found that the friendship network structure important to the transmission process can be adequately represented by a dyad-independent exponential random graph model (ERGM). This means that individual-level sampled data are sufficient to characterize the entire friendship network. Second, we found that contact behavior was adequately represented by a static rather than a dynamic contact network. We then compare a targeted antiviral prophylaxis intervention strategy and a grade closure intervention strategy under random mixing and network-based mixing. We find that random mixing overestimates the effect of targeted antiviral prophylaxis on the probability of an epidemic when the probability of transmission in 10 minutes of contact is less than 0.004, and underestimates it when this transmission probability is greater than 0.004. We found the same pattern for the final size of an epidemic, with a threshold transmission probability of 0.005. We also find that random mixing overestimates the effect of a grade closure intervention on the probability of an epidemic and on the final size for all transmission probabilities. Our findings have implications for policy recommendations based on models assuming random mixing, and can inform further development of network-based models.

16.
The generalized half-normal (GHN) distribution and progressive type-II censoring are considered in this article for studying statistical inference in constant-stress accelerated life testing. The EM algorithm is used to calculate the maximum likelihood estimates. The Fisher information matrix is formed using the missing information principle and is utilized for constructing asymptotic confidence intervals. Further, interval estimation is discussed through bootstrap intervals. The Tierney and Kadane method, an importance sampling procedure and the Metropolis-Hastings algorithm are utilized to compute Bayesian estimates. Furthermore, predictive estimates for censored data and the related prediction intervals are obtained. We consider three optimality criteria to find the optimal stress level. A real data set is used to illustrate the importance of the GHN distribution as an alternative lifetime model to well-known distributions. Finally, a simulation study is provided with discussion.
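For reference, one common parameterization of the generalized half-normal density is shown below; the article may use a different notation or parameterization.

```latex
f(x;\alpha,\theta) \;=\; \sqrt{\frac{2}{\pi}}\;\frac{\alpha}{x}\left(\frac{x}{\theta}\right)^{\alpha}
\exp\!\left\{-\frac{1}{2}\left(\frac{x}{\theta}\right)^{2\alpha}\right\},
\qquad x > 0,\; \alpha,\theta > 0.
```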

17.
The missing response problem is ubiquitous in survey sampling, medical, social science and epidemiology studies. It is well known that non-ignorable missingness, in which whether a response is missing depends on its own value, is the most difficult missing data problem. In the statistical literature, unlike for the ignorable missing data problem, few papers on non-ignorable missing data are available apart from fully parametric model-based approaches. In this paper we study a semiparametric model for non-ignorable missing data in which the missingness probability is known up to some parameters, but the underlying distributions are not specified. By employing Owen's (1988) empirical likelihood method, we obtain the constrained maximum empirical likelihood estimators of the parameters in the missingness probability and of the mean response, which are shown to be asymptotically normal. Moreover, the likelihood ratio statistic can be used to test whether the missingness of the responses is non-ignorable or completely at random. The theoretical results are confirmed by a simulation study. As an illustration, the analysis of data from a real AIDS trial shows that the missingness of CD4 counts at around two years is non-ignorable and that the sample mean based on the observed data only is biased.
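For context, Owen's empirical likelihood profiles a nonparametric likelihood subject to estimating-equation constraints; a generic statement is given below, where g collects the paper's estimating functions (which involve the parametric missingness probability and are not reproduced here), and the empirical likelihood ratio statistic is then used for testing as in the standard theory.

```latex
\max_{p_1,\ldots,p_n}\ \prod_{i=1}^{n} p_i
\quad\text{subject to}\quad
p_i \ge 0,\qquad \sum_{i=1}^{n} p_i = 1,\qquad \sum_{i=1}^{n} p_i\, g(Z_i;\theta) = 0.
```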

18.
We investigate the impact of some characteristics of friendship networks on the timing of first sexual intercourse. We assume that the gender-segregated composition of such networks explains part of the particularly late age at first intercourse in Italy. We use new data from a survey on the sexual behavior and reproductive health of Italian first- and second-year university students. The survey was carried out in 15 different universities in 2000-2001 and includes retrospective data on age at first intercourse, as well as retrospectively collected time-varying measures of the gender composition of the friendship network at different ages, for almost 5,000 cases. After describing the data in terms of transition frequencies, we use a Cox proportional hazards model with time-varying covariates. The results are in accordance with the hypothesis that having friendship networks that include more members of the other gender, and talking about sex with friends, increase the relative risk of first sexual intercourse.
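For reference, the Cox proportional hazards model with time-varying covariates used above, written in generic notation (the covariate vector would include the time-varying gender composition of the friendship network):

```latex
\lambda\bigl(t \mid x_i(t)\bigr) = \lambda_0(t)\,\exp\bigl\{\beta^{\top} x_i(t)\bigr\},
\qquad
L(\beta) = \prod_{i:\,\delta_i = 1}
\frac{\exp\{\beta^{\top} x_i(t_i)\}}{\sum_{j \in R(t_i)} \exp\{\beta^{\top} x_j(t_i)\}},
```

where δ_i indicates an observed event at age t_i and R(t_i) is the risk set at that age.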

19.
Missing observations often occur in cross-classified data collected during observational, clinical, and public health studies. Inappropriate treatment of missing data can reduce statistical power and give biased results. This work extends the Baker, Rosenberger and DerSimonian modeling approach to compute maximum likelihood estimates for cell counts in three-way tables with missing data, and studies the association between two dichotomous variables while controlling for a third variable in 2 × 2 × K tables. The approach is applied to the Behavioral Risk Factor Surveillance System data. Simulation studies are used to investigate the efficiency of estimation of the common odds ratio.
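As a point of reference for the common odds ratio in 2 × 2 × K tables, the classical Mantel-Haenszel estimator is a standard benchmark (the article's likelihood-based approach for incomplete tables is a different estimator); with cell counts a_k, b_k, c_k, d_k and stratum total n_k in stratum k:

```latex
\widehat{\mathrm{OR}}_{\mathrm{MH}}
= \frac{\sum_{k=1}^{K} a_k d_k / n_k}{\sum_{k=1}^{K} b_k c_k / n_k}.
```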

20.
Multivariate mixture regression models can be used to investigate the relationships between two or more response variables and a set of predictor variables while taking into consideration unobserved population heterogeneity. It is common to take multivariate normal distributions as the mixing components, but this mixing model is sensitive to heavy-tailed errors and outliers. Although normal mixture models can in principle approximate any distribution, the number of components needed to account for heavy-tailed distributions can be very large. Mixture regression models based on the multivariate t distribution can be considered a robust alternative. Missing data are inevitable in many situations, and parameter estimates could be biased if the missing values are not handled properly. In this paper, we propose a multivariate t mixture regression model with missing information to model heterogeneity in the regression function in the presence of outliers and missing values. Along with robust parameter estimation, the proposed method can be used for (i) visualization of the partial correlation between response variables across latent classes and heterogeneous regressions, and (ii) outlier detection and robust clustering even in the presence of missing values. We also propose a multivariate t mixture regression model using MM-estimation with missing information that is robust to high-leverage outliers. The proposed methodologies are illustrated through simulation studies and real data analysis.
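For reference, the p-dimensional multivariate t density used as the mixing component, in generic notation; smaller degrees of freedom ν give heavier tails and hence downweight outlying observations relative to the multivariate normal.

```latex
f(\mathbf{y};\boldsymbol{\mu},\boldsymbol{\Sigma},\nu)
= \frac{\Gamma\!\left(\tfrac{\nu+p}{2}\right)}
       {\Gamma\!\left(\tfrac{\nu}{2}\right)\,(\nu\pi)^{p/2}\,|\boldsymbol{\Sigma}|^{1/2}}
\left[1 + \frac{(\mathbf{y}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\mu})}{\nu}\right]^{-(\nu+p)/2},
\qquad \mathbf{y}\in\mathbb{R}^{p}.
```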
