Similar literature (20 results)
1.
Model-based clustering for social networks (total citations: 5; self-citations: 0; citations by others: 5)
Summary. Network models are widely used to represent relations between interacting units or actors. Network data often exhibit transitivity, meaning that two actors that have ties to a third actor are more likely to be tied than actors that do not, homophily by attributes of the actors or dyads, and clustering. Interest often focuses on finding clusters of actors or ties, and the number of groups in the data is typically unknown. We propose a new model, the latent position cluster model, under which the probability of a tie between two actors depends on the distance between them in an unobserved Euclidean 'social space', and the actors' locations in the latent social space arise from a mixture of distributions, each corresponding to a cluster. We propose two estimation methods: a two-stage maximum likelihood method and a fully Bayesian method that uses Markov chain Monte Carlo sampling. The former is quicker and simpler, but the latter performs better. We also propose a Bayesian way of determining the number of clusters that are present by using approximate conditional Bayes factors. Our model represents transitivity, homophily by attributes and clustering simultaneously and does not require the number of clusters to be known. The model makes it easy to simulate realistic networks with clustering, which are potentially useful as inputs to models of more complex systems of which the network is part, such as epidemic models of infectious disease. We apply the model to two networks of social relations. A free software package in the R statistical language, latentnet, is available to analyse data by using the model.
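
The generative idea in this abstract can be illustrated with a minimal sketch. It assumes a logistic link in which the log-odds of a tie equal an intercept minus the Euclidean distance between latent positions, and spherical Gaussian mixture components; the function names, parameters and values are hypothetical, and this is neither the latentnet API nor the authors' estimation procedure.

```python
import math
import random


def tie_probability(zi, zj, beta0):
    """Assumed link: log-odds of a tie = beta0 - Euclidean distance."""
    d = math.dist(zi, zj)
    return 1.0 / (1.0 + math.exp(-(beta0 - d)))


def simulate_lpcm(n, centers, sd, weights, beta0, seed=0):
    """Draw latent positions from a Gaussian mixture, then sample ties."""
    rng = random.Random(seed)
    labels, positions = [], []
    for _ in range(n):
        # Pick a cluster, then place the actor around that cluster's centre.
        g = rng.choices(range(len(centers)), weights=weights)[0]
        labels.append(g)
        positions.append(tuple(rng.gauss(c, sd) for c in centers[g]))
    # Undirected network: sample each dyad once and mirror it.
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < tie_probability(positions[i], positions[j], beta0):
                adj[i][j] = adj[j][i] = 1
    return labels, positions, adj
```

Calling `simulate_lpcm(30, [(-3.0, 0.0), (3.0, 0.0)], 0.5, [0.5, 0.5], 1.0)` would produce a network in which ties tend to be denser within each of the two well-separated clusters than between them, because nearby actors have higher tie probabilities.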

2.
The analysis of complex networks is a rapidly growing topic with many applications in different domains. The analysis of large graphs is often made via unsupervised classification of vertices of the graph. Community detection is the main way to divide a large graph into smaller ones that can be studied separately. However, another definition of a cluster is possible, which is based on the structural distance between vertices. This definition includes the case of community clusters but is more general in the sense that two vertices may be in the same group even if they are not connected. Methods for detecting communities in undirected graphs have been recently reviewed by Fortunato. In this paper we expand Fortunato's work and review methods and algorithms for detecting essentially structurally homogeneous subsets of vertices in binary or weighted and directed and undirected graphs.

3.
Clusters form the basis of a number of research study designs including survey and experimental studies. Cluster-based designs can be less costly but also less efficient than individual-based designs due to correlation between individuals within the same cluster. Their design typically relies on ad hoc choices of correlation parameters, and is insensitive to variations in cluster design. This article examines how to efficiently design clusters where they are geographically defined by demarcating areas incorporating individuals and households or other units. Using geostatistical models for spatial autocorrelation, we generate approximations to within-cluster average covariance in order to estimate the effective sample size given particular cluster design parameters. We show how the number of enumerated locations, cluster area, proportion sampled, and sampling method affect the efficiency of the design and consider the optimization problem of choosing the most efficient design subject to budgetary constraints. We also consider how the parameters from these approximations can be interpreted simply in terms of 'real-world' quantities and used in design analysis.
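
As a point of reference for the effective sample size mentioned above, the following sketch uses the textbook design effect for equal-sized clusters with a common intra-cluster correlation, 1 + (m - 1) * rho. This is an assumption made for illustration; the article itself derives the within-cluster covariance from geostatistical models, which this sketch does not attempt. The function names are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Textbook design effect for equal cluster sizes: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters, cluster_size, icc):
    """Total sample size deflated by the design effect."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)
```

With no within-cluster correlation the effective sample size equals the total number of individuals; with perfect correlation it collapses to the number of clusters.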

4.
A network cluster is defined as a set of nodes with 'strong' within-group ties and 'weak' between-group ties. Most clustering methods focus on finding groups of 'densely connected' nodes, where the dyad (or tie between two nodes) serves as the building block for forming clusters. However, since the unweighted dyad cannot distinguish strong relationships from weak ones, it seems reasonable to consider an alternative building block, i.e. one involving more than two nodes. In the simplest case, one can consider the triad (or three nodes), where the fully connected triad represents the basic unit of transitivity in an undirected network. In this effort we propose a clustering framework for finding highly transitive subgraphs in an undirected/unweighted network, where the fully connected triad (or triangle configuration) is used as the building block for forming clusters. We apply our methodology to four real networks with encouraging results. Monte Carlo simulation results suggest that, on average, the proposed method yields good clustering performance on synthetic benchmark graphs, relative to other popular methods.

5.
Summary. Enormous quantities of geoelectrical data are produced daily and often used for large scale reservoir modelling. To interpret these data requires reliable and efficient inversion methods which adequately incorporate prior information and use realistically complex modelling structures. We use models based on random coloured polygonal graphs as a powerful and flexible modelling framework for the layered composition of the Earth and we contrast our approach with earlier methods based on smooth Gaussian fields. We demonstrate how the reconstruction algorithm may be efficiently implemented through the use of multigrid Metropolis–coupled Markov chain Monte Carlo methods and illustrate the method on a set of field data.

6.
In dental implant research studies, events such as implant complications including pain or infection may be observed recurrently before failure events, i.e. the death of implants. It is natural to assume that recurrent events and failure events are correlated with each other, since they happen on the same implant (subject) and complication times have strong effects on the implant survival time. On the other hand, each patient may have more than one implant. Therefore these recurrent events or failure events are clustered, since implant complication times or failure times within the same patient (cluster) are likely to be correlated. Both the overall implant survival times and the recurrent complication times are of interest. In this paper, a joint modelling approach is proposed for modelling complication events and dental implant survival times simultaneously. The proposed method uses a frailty process to model the correlation within clusters and the correlation within subjects. We use Bayesian methods to obtain estimates of the parameters. Performance of the joint models is shown via simulation studies and data analysis.

7.
We address statistical issues involved in the partially clustered design, where clusters are employed only in the intervention arm and not in the control arm. We develop a cluster adjusted t-test to compare group treatment effects with individual treatment effects for continuous outcomes, in which the individual level data are used as the unit of analysis in both arms. We also develop an approach for determining sample sizes using this cluster adjusted t-test, and use simulation to demonstrate the consistent accuracy of the proposed cluster adjusted t-test and power estimation procedures. Two real examples illustrate how to use the proposed methods.

8.
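Using multivariate statistical techniques, chiefly principal component analysis, comprehensive principal component analysis and hierarchical clustering, this paper analyses data from nearly 500 enterprises worldwide, collected in a survey organized in 2004 by the global continuous innovation network (CInet). A comparison of enterprises in Guizhou Province with those in other countries reveals substantial differences in the organization and operation of continuous improvement capability. To identify the causes of these differences, comprehensive principal component analysis and hierarchical clustering are used to construct a group of target enterprises that are strong in the organization and operation of continuous improvement. A comparative analysis of the constituent factors of organization and operation between Guizhou enterprises and the target enterprises then identifies the problems that Guizhou enterprises face in organizing and operating continuous improvement, and corresponding suggestions and countermeasures are proposed. The selection of target enterprises and the testing of their innovation capability, the method for filling in missing entries in the data table, and the factor comparison method used in the analysis may also serve as a reference for analysing other large survey data sets.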

9.
Inference of interaction networks represented by systems of differential equations is a challenging problem in many scientific disciplines. In the present article, we follow a semi-mechanistic modelling approach based on gradient matching. We investigate the extent to which key factors, including the kinetic model, statistical formulation and numerical methods, impact upon performance at network reconstruction. We emphasize general lessons for computational statisticians when faced with the challenge of model selection, and we assess the accuracy of various alternative paradigms, including recent widely applicable information criteria and different numerical procedures for approximating Bayes factors. We conduct the comparative evaluation with a novel inferential pipeline that systematically disambiguates confounding factors via an ANOVA scheme.

10.
This paper deals with an important problem with large and complex Bayesian networks. Exact inference in these networks is simply not feasible owing to the huge storage requirements of exact methods. Markov chain Monte Carlo methods, however, are able to deal with these large networks but to do this they require an initial legal configuration to set off the sampler. So far nondeterministic methods such as forward sampling have often been used for this, even though the forward sampler may take an eternity to come up with a legal configuration. In this paper a novel algorithm will be presented that allows a legal configuration in a general Bayesian network to be found in polynomial time in almost all cases. The algorithm will not be proved deterministic but empirical results will demonstrate that this holds in most cases. Also, the algorithm will be justified by its simplicity and ease of implementation.

11.
Summary. Multilevel modelling is sometimes used for data from complex surveys involving multistage sampling, unequal sampling probabilities and stratification. We consider generalized linear mixed models and particularly the case of dichotomous responses. A pseudolikelihood approach for accommodating inverse probability weights in multilevel models with an arbitrary number of levels is implemented by using adaptive quadrature. A sandwich estimator is used to obtain standard errors that account for stratification and clustering. When level 1 weights are used that vary between elementary units in clusters, the scaling of the weights becomes important. We point out that not only variance components but also regression coefficients can be severely biased when the response is dichotomous. The pseudolikelihood methodology is applied to complex survey data on reading proficiency from the American sample of the 'Program for international student assessment' 2000 study, using the Stata program gllamm which can estimate a wide range of multilevel and latent variable models. Performance of pseudo-maximum-likelihood with different methods for handling level 1 weights is investigated in a Monte Carlo experiment. Pseudo-maximum-likelihood estimators of (conditional) regression coefficients perform well for large cluster sizes but are biased for small cluster sizes. In contrast, estimators of marginal effects perform well in both situations. We conclude that caution must be exercised in pseudo-maximum-likelihood estimation for small cluster sizes when level 1 weights are used.

12.
This article proposes a new spatial cluster detection method for longitudinal outcomes that detects neighborhoods and regions with elevated rates of disease while controlling for individual level confounders. The proposed method, CumResPerm, utilizes cumulative geographic residuals through a permutation test to detect potential clusters, which are defined as sets of administrative regions, such as a town or group of administrative regions. Previous cluster detection methods are not able to incorporate individual level data including covariate adjustment, while still being able to define potential clusters using informative neighborhood or town boundaries. Often, it is of interest to detect such spatial clusters because individuals residing in a town may have similar environmental exposures or socioeconomic backgrounds due to administrative reasons, such as zoning laws. Therefore, these boundaries can be very informative and more relevant than arbitrary clusters such as the standard circle or square. Application of the CumResPerm method will be illustrated by the Home Allergens and Asthma prospective cohort study, analyzing the relationship between area or neighborhood residence and a repeatedly measured outcome, occurrence of wheeze in the last six months, while taking into account mobile locations.

13.
Accurate and efficient methods to detect unusual clusters of abnormal activity are needed in many fields such as medicine and business. Often the size of clusters is unknown; hence, multiple (variable) window scan statistics are used to identify clusters using a set of different potential cluster sizes. We give an efficient method to compute the exact distribution of multiple window discrete scan statistics for higher-order, multi-state Markovian sequences. We define a Markov chain to efficiently keep track of probabilities needed to compute p-values for the statistic. The state space of the Markov chain is set up by a criterion developed to identify strings that are associated with observing the specified values of the statistic. Using our algorithm, we identify cases where the available approximations do not perform well. We demonstrate our methods by detecting unusual clusters of made free throw shots by National Basketball Association players during the 2009–2010 regular season.
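
To make the object of study concrete, here is a small sketch of how a discrete scan statistic can be computed on an observed 0/1 sequence for several window sizes. It only evaluates the statistic; it does not implement the paper's Markov chain method for the exact distribution or p-values. The function names are hypothetical.

```python
def scan_statistic(seq, window):
    """Largest number of successes in any `window` consecutive trials."""
    if not 1 <= window <= len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    count = sum(seq[:window])  # successes in the first window
    best = count
    for k in range(window, len(seq)):
        # Slide the window one step: add the new trial, drop the oldest.
        count += seq[k] - seq[k - window]
        best = max(best, count)
    return best


def multiple_window_scan(seq, windows):
    """Scan statistic for each candidate window size."""
    return {w: scan_statistic(seq, w) for w in windows}
```

For the sequence `[1, 0, 1, 1, 1, 0, 0, 1]` and windows 2, 3 and 5, this returns `{2: 2, 3: 3, 5: 4}`.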

14.
The problem of modelling multivariate time series of vehicle counts in traffic networks is considered. It is proposed to use a model called the linear multiregression dynamic model (LMDM). The LMDM is a multivariate Bayesian dynamic model which uses any conditional independence and causal structure across the time series to break down the complex multivariate model into simpler univariate dynamic linear models. The conditional independence and causal structure in the time series can be represented by a directed acyclic graph (DAG). The DAG not only gives a useful pictorial representation of the multivariate structure, but it is also used to build the LMDM. Therefore, eliciting a DAG which gives a realistic representation of the series is a crucial part of the modelling process. A DAG is elicited for the multivariate time series of hourly vehicle counts at the junction of three major roads in the UK. A flow diagram is introduced to give a pictorial representation of the possible vehicle routes through the network. It is shown how this flow diagram, together with a map of the network, can suggest a DAG for the time series suitable for use with an LMDM.

15.
Very often researchers plan a balanced design for cluster randomization clinical trials in conducting medical research, but unavoidable circumstances lead to unbalanced data. When adopting nested designs with three or more levels, they usually ignore the higher level of nesting and consider only two levels; this leads to underestimation of variance at higher levels. While calculating the sample size for three-level nested designs, in order to achieve desired power, intra-class correlation coefficients (ICCs) at individual level as well as higher levels need to be considered and must be provided along with respective standard errors. In the present paper, the standard errors of analysis of variance (ANOVA) estimates of ICCs for three-level unbalanced nested design are derived. To overcome the reliance on distributional assumptions, balanced design, equality of variances between clusters and large samples, general expressions for standard errors of ICCs which can be deployed in unbalanced cluster randomization trials are postulated. The expressions are evaluated on real data as well as highly unbalanced simulated data.

16.
We consider the adjustment, based upon a sample of size n, of collections of vectors drawn from either an infinite or finite population. The vectors may be judged to be either normally distributed or, more generally, second-order exchangeable. We develop the work of Goldstein and Wooff (1998) to show how the familiar univariate finite population corrections (FPCs) naturally generalise to individual quantities in the multivariate population. The types of information we gain by sampling are identified with the orthogonal canonical variable directions derived from a generalised eigenvalue problem. These canonical directions share the same co-ordinate representation for all sample sizes and, for equally defined individuals, all population sizes enabling simple comparisons between both the effects of different sample sizes and of different population sizes. We conclude by considering how the FPC is modified for multivariate cluster sampling with exchangeable clusters. In univariate two-stage cluster sampling, we may decompose the variance of the population mean into the sum of the variance of cluster means and the variance of the cluster members within clusters. The first term has a FPC relating to the sampling fraction of clusters, the second term has a FPC relating to the sampling fraction of cluster size. We illustrate how this generalises in the multivariate case. We decompose the variance into two terms: the first relating to multivariate finite population sampling of clusters and the second to multivariate finite population sampling within clusters. We solve two generalised eigenvalue problems to show how to generalise the univariate to the multivariate: each of the two FPCs attaches to one, and only one, of the two eigenbases.
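
The univariate decomposition described in this abstract can be sketched with the classical design-based formula for two-stage sampling with equal cluster sizes: n of N clusters are sampled, then m of M members within each sampled cluster. This textbook form is assumed here for illustration and is not the paper's Bayes linear, multivariate derivation; the function and argument names are hypothetical.

```python
def two_stage_variance(s2_between, s2_within, n, N, m, M):
    """Variance of the sample mean under two-stage cluster sampling.

    s2_between: variance among cluster means
    s2_within:  variance among members within clusters
    """
    fpc_clusters = 1.0 - n / N  # FPC for the sampling fraction of clusters
    fpc_within = 1.0 - m / M    # FPC for the sampling fraction within clusters
    return fpc_clusters * s2_between / n + fpc_within * s2_within / (n * m)
```

Each term carries exactly one of the two corrections: sampling every cluster (n = N) removes the first term, and sampling every member of each selected cluster (m = M) removes the second.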

17.
The ability to infer parameters of gene regulatory networks is emerging as a key problem in systems biology. The biochemical data are intrinsically stochastic and tend to be observed by means of discrete-time sampling systems, which are often limited in their completeness. In this paper we explore how to make Bayesian inference for the kinetic rate constants of regulatory networks, using the stochastic kinetic Lotka-Volterra system as a model. This simple model describes behaviour typical of many biochemical networks which exhibit auto-regulatory behaviour. Various MCMC algorithms are described and their performance evaluated in several data-poor scenarios. An algorithm based on an approximating process is shown to be particularly efficient.
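
For readers unfamiliar with the stochastic kinetic Lotka-Volterra system, the sketch below simulates it forward with Gillespie's direct method. It assumes the usual three reactions (prey reproduction, predation, predator death) with mass-action hazards; the function name, rate constants and initial counts are illustrative, and the sketch covers forward simulation only, not the MCMC inference schemes the paper evaluates.

```python
import random


def gillespie_lv(prey, pred, c1, c2, c3, t_max, seed=0, max_events=100000):
    """Simulate prey/predator counts; returns a list of (time, prey, pred)."""
    rng = random.Random(seed)
    t = 0.0
    path = [(t, prey, pred)]
    for _ in range(max_events):
        # Hazards: prey birth, predation, predator death.
        rates = (c1 * prey, c2 * prey * pred, c3 * pred)
        total = sum(rates)
        if total == 0.0:
            break  # both species extinct, nothing can happen
        t += rng.expovariate(total)  # waiting time to the next reaction
        if t > t_max:
            break
        u = rng.random() * total  # choose a reaction in proportion to its hazard
        if u < rates[0]:
            prey += 1
        elif u < rates[0] + rates[1]:
            prey -= 1
            pred += 1
        else:
            pred -= 1
        path.append((t, prey, pred))
    return path
```

A call such as `gillespie_lv(50, 100, 1.0, 0.005, 0.6, 5.0)` returns the full event history; in a data-poor setting of the kind the abstract describes, only a few discrete-time snapshots of such a path would be observed.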

18.

We propose two nonparametric Bayesian methods to cluster big data and apply them to cluster genes by patterns of gene–gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion which requires a single cycle (or few cycles) of simple deterministic calculations for each observation under study. The second scheme is an exact method that divides the data into smaller subsamples and involves local partitions that can be determined in parallel. In a second step, the method requires only the sufficient statistics of each of these local clusters to derive global clusters. Under simulated and benchmark data sets the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene–gene interactions extracted from the online search tool “Zodiac.”


19.
Social network data represent the interactions between a group of social actors. Interactions between colleagues and friendship networks are typical examples of such data. The latent space model for social network data locates each actor in a network in a latent (social) space and models the probability of an interaction between two actors as a function of their locations. The latent position cluster model extends the latent space model to deal with network data in which clusters of actors exist: actor locations are drawn from a finite mixture model, each component of which represents a cluster of actors. A mixture of experts model builds on the structure of a mixture model by taking account of both observations and associated covariates when modeling a heterogeneous population. Herein, a mixture of experts extension of the latent position cluster model is developed. The mixture of experts framework allows covariates to enter the latent position cluster model in a number of ways, yielding different model interpretations. Estimates of the model parameters are derived in a Bayesian framework using a Markov chain Monte Carlo algorithm. The algorithm is generally computationally expensive; surrogate proposal distributions which shadow the target distributions are derived, reducing the computational burden. The methodology is demonstrated through an illustrative example detailing relationships between a group of lawyers in the USA.

20.
Clustered binary responses are often found in ecological studies. Data analysis may include modeling the marginal probability response. However, when the association is the main scientific focus, modeling the correlation structure between pairs of responses is the key part of the analysis. Second-order generalized estimating equations (GEE) are established in the literature. Some of them are more efficient in computational terms, especially with large clusters. Alternating logistic regression (ALR) and orthogonalized residual (ORTH) GEE methods are presented and compared in this paper. Simulation results show a slight superiority of ALR over ORTH. Marginal probabilities and odds ratios are also estimated and compared in a real ecological study involving a three-level hierarchical clustering. ALR and ORTH models are useful for modeling complex association structure with large cluster sizes.
