1.

Cluster analysis of massive datasets in astronomy

Woncheol Jang Martin Hendry 《Statistics and Computing》2007,17(3):253-262

Clusters of galaxies are a useful proxy to trace the distribution of mass in the universe. By measuring the mass of clusters of galaxies on different scales, one can follow the evolution of the mass distribution (Martínez and Saar, Statistics of the Galaxy Distribution, 2002). It can be shown that finding galaxy clusters is equivalent to finding density contour clusters (Hartigan, Clustering Algorithms, 1975): connected components of the level set S_c ≡ {f > c}, where f is a probability density function. Cuevas et al. (Can. J. Stat. 28, 367–382, 2000; Comput. Stat. Data Anal. 36, 441–459, 2001) proposed a nonparametric method for density contour clusters, attempting to find density contour clusters by the minimal spanning tree. While their algorithm is conceptually simple, it requires intensive computations for large datasets. We propose a more efficient clustering method based on their algorithm with the Fast Fourier Transform (FFT). The method is applied to a study of galaxy clustering on large astronomical sky survey data.
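The idea of density contour clusters as connected components of a level set can be illustrated with a rough sketch. It is not the minimal-spanning-tree method of Cuevas et al., nor the FFT-based method proposed in the paper: it assumes a crude histogram density estimate on a square grid in two dimensions and treats cells that share an edge as connected. The function name `level_set_clusters` and the parameters `cell` and `threshold` are hypothetical.

```python
from collections import deque

def level_set_clusters(points, cell, threshold):
    """Return connected groups of grid cells whose point count exceeds `threshold`.

    Assumption: a histogram count per square cell of side `cell` stands in
    for the density f, and `threshold` plays the role of the level c.
    """
    # Histogram density estimate: count the points falling in each cell.
    counts = {}
    for x, y in points:
        key = (int(x // cell), int(y // cell))
        counts[key] = counts.get(key, 0) + 1

    # Level set: cells whose count is strictly above the threshold.
    dense = {key for key, n in counts.items() if n > threshold}

    # Connected components by breadth-first search over edge-sharing cells.
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            i, j = queue.popleft()
            component.append((i, j))
            for nb in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(sorted(component))
    return clusters

# Two dense groups and one isolated point that falls below the level.
sample = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (1.1, 0.2), (1.3, 0.4),
          (5.1, 5.2), (5.3, 5.4), (5.2, 5.1), (9.5, 0.5)]
print(level_set_clusters(sample, cell=1.0, threshold=1))
```

Raising `threshold` shrinks the level set and can split one component into several, which is why the choice of level c matters in this kind of analysis.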

2.

Model-based clustering for social networks (total citations: 5; self-citations: 0; citations by others: 5)

Mark S. Handcock Adrian E. Raftery Jeremy M. Tantrum 《Journal of the Royal Statistical Society. Series A, (Statistics in Society)》2007,170(2):301-354

Summary. Network models are widely used to represent relations between interacting units or actors. Network data often exhibit transitivity, meaning that two actors that have ties to a third actor are more likely to be tied than actors that do not, homophily by attributes of the actors or dyads, and clustering. Interest often focuses on finding clusters of actors or ties, and the number of groups in the data is typically unknown. We propose a new model, the latent position cluster model, under which the probability of a tie between two actors depends on the distance between them in an unobserved Euclidean 'social space', and the actors' locations in the latent social space arise from a mixture of distributions, each corresponding to a cluster. We propose two estimation methods: a two-stage maximum likelihood method and a fully Bayesian method that uses Markov chain Monte Carlo sampling. The former is quicker and simpler, but the latter performs better. We also propose a Bayesian way of determining the number of clusters that are present by using approximate conditional Bayes factors. Our model represents transitivity, homophily by attributes and clustering simultaneously and does not require the number of clusters to be known. The model makes it easy to simulate realistic networks with clustering, which are potentially useful as inputs to models of more complex systems of which the network is part, such as epidemic models of infectious disease. We apply the model to two networks of social relations. A free software package in the R statistical language, latentnet, is available to analyse data by using the model.
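The distance-dependent tie probability described in this abstract can be illustrated with a minimal sketch. It assumes a logistic link with a single intercept, so that the log-odds of a tie equal the intercept minus the Euclidean distance between the two latent positions; the abstract does not state the link function, and the names `tie_probability` and `beta0` are hypothetical. The mixture distribution over positions and both estimation methods are left out, and this is not the latentnet API.

```python
from math import dist, exp

def tie_probability(z_i, z_j, beta0=0.0):
    """Probability of a tie between actors at latent positions z_i and z_j.

    Assumption: log-odds = beta0 - Euclidean distance (logistic link).
    """
    log_odds = beta0 - dist(z_i, z_j)
    return 1.0 / (1.0 + exp(-log_odds))

# Actors that are close in the latent space are more likely to be tied.
near = tie_probability((0.0, 0.0), (0.0, 1.0), beta0=2.0)
far = tie_probability((0.0, 0.0), (3.0, 4.0), beta0=2.0)
print(near > far)
```

Because distances obey the triangle inequality, two actors that are both close to a third are also close to each other, which is how a model of this form can represent transitivity.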

3.

Clustering microarray data using model-based double K-means

Francesca Martella Maurizio Vichi 《Journal of applied statistics》2012,39(9):1853-1869

Microarray technology allows the measurement of expression levels of thousands of genes simultaneously. The dimension and complexity of gene expression data obtained by microarrays create challenging data analysis and management problems ranging from the analysis of images produced by microarray experiments to biological interpretation of results. Therefore, statistical and computational approaches are beginning to assume a substantial position within the molecular biology area. We consider the problem of simultaneously clustering genes and tissue samples (in general conditions) of a microarray data set. This can be useful for revealing groups of genes involved in the same molecular process as well as groups of conditions where this process takes place. The need to find a subset of genes and tissue samples defining a homogeneous block has led to the application of double clustering techniques on gene expression data. Here, we focus on an extension of standard K-means to simultaneously cluster observations and features of a data matrix, namely double K-means introduced by Vichi (2000). We introduce this model in a probabilistic framework and discuss the advantages of using this approach. We also develop a coordinate ascent algorithm and test its performance via simulation studies and a real data set. Finally, we validate the results obtained on the real data set by building resampling confidence intervals for block centroids.

4.

A network analysis of student mobility patterns from high school to master’s

Genova Vincenzo G. Tumminello Michele Aiello Fabio Attanasio Massimo 《Statistical Methods and Applications》2021,30(5):1445-1464

Human migration involves the movement of people from one place to another. An example of undirected migration is Italian student mobility where students move from the South to the Center-North. This kind of mobility has become of general interest, and this work explores student mobility from Sicily towards universities outside the island. The data used in this paper regards six cohorts of students, from 2008/09 to 2013/14. In particular, our goal is to study the 3-step migration path: the area of origin (Sicilian provinces), the regional university for the bachelor’s degree, and the regional university for the master’s. Our analysis is conducted by building a multipartite network with four sets of nodes: students; Sicilian provinces; bachelor region of studies; and the master region of studies. By projecting the students’ set onto the others, we obtain a tripartite network where the number of students represents the link weight. Results show that the big Sicilian cities—Palermo, Catania, and Messina—have different preferential paths compared to small Sicilian cities. Furthermore, the results reveal preferential paths of 3-step mobility that only, in part, reflect a south-north orientation in the transition from the region of study for the bachelor degree to that for the master’s.

5.

Edge selection for undirected graphs

Meng Hwee Victor Ong Berwin A. Turlach 《Journal of Statistical Computation and Simulation》2018,88(17):3291-3322

This article explores an ‘Edge Selection’ procedure to fit an undirected graph to a given data set. Undirected graphs are routinely used to represent, model and analyse associative relationships among the entities on a social, biological or genetic network. Our proposed method combines the computational efficiency of least angle regression and at the same time ensures symmetry of the selected adjacency matrix. Various local and global properties of the edge selection path are explored analytically. In particular, a suitable parameter that controls the amount of shrinkage is identified and we consider several cross-validation techniques to choose an accurate predictive model on the path. The proposed method is illustrated with a detailed simulation study involving models with various levels of sparsity and variability in the nodal degree distributions. Finally, our method is used to select undirected graphs from various real data sets. We employ it to identify the regulatory network of isoprenoid pathways from gene-expression data and also to identify a genetic network from high-dimensional breast cancer study data.

6.

A Statistical Model for Social Network Labeling

Danyang Huang Jun Yin Tao Shi Hansheng Wang 《Journal of Business & Economic Statistics》2016,34(3):368-374

We consider a social network from which one observes not only network structure (i.e., nodes and edges) but also a set of labels (or tags, keywords) for each node (or user). These labels are self-created and closely related to the user’s career status, life style, personal interests, and many others. Thus, they are of great interest for online marketing. To model their joint behavior with network structure, a complete data model is developed. The model is based on the classical p₁ model but allows the reciprocation parameter to be label-dependent. By focusing on connected pairs only, the complete data model can be generalized into a conditional model. Compared with the complete data model, the conditional model specifies only the conditional likelihood for the connected pairs. As a result, it suffers less risk from model misspecification. Furthermore, because the conditional model involves connected pairs only, the computational cost is much lower. The resulting estimator is consistent and asymptotically normal. Depending on the network sparsity level, the convergence rate could be different. To demonstrate its finite sample performance, numerical studies (based on both simulated and real datasets) are presented.

7.

Information geodesics for gamma models of communication clustering

《Journal of Statistical Computation and Simulation》2012,82(1-4):133-146

Recently, Akyildiz called for further work on non-Poisson models for communication arrivals in distributed networks such as cellular phone systems. The basic ‘random’ model for stochastic events is the Poisson process; for events on a line this results in an exponential distribution of intervals between events. Network designers and managers need to monitor and quantify call clustering in order to optimize resource usage; the natural reference state from which to measure departures is that arising from a Poisson process of calls. Here we consider gamma distributions, which contain exponential distributions as a special case. The surface representing gamma models has a natural Riemannian information metric and we obtain some geodesic sprays for this metric. The exponential distributions form a 1-dimensional subspace of the 2-dimensional space of all gamma distributions, so we have an isometric embedding of the random model as a subspace of the gamma models. This geometry may provide an appropriate structure on which to represent clustering as quantifiable departures from randomness and on which to impose dynamic control algorithms to optimize traffic at receiving nodes in distributed communication networks. In practice, we may expect correlation between call arrival times and call duration, reflecting for example peaks of different users of internet services. This would give rise to a twisted product of two surfaces with the twisting controlled by the correlation. Though bivariate gamma models do exist, such as Kibble's, none has tractable information geometry nor sufficiently general marginal gammas, but a simulation method of approach is suggested.
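The statement that gamma distributions contain the exponential distributions as a special case can be checked numerically with a small sketch. The parametrization by mean and shape is an assumption made here for illustration (the abstract does not fix one), and the function names are hypothetical. The coefficient of variation of a gamma distribution is 1/sqrt(shape), so under this reading a shape below 1 gives inter-event intervals that are more variable than in the Poisson reference case.

```python
from math import exp, gamma, sqrt

def gamma_pdf(x, mean, shape):
    """Gamma density at x > 0, parametrized by its mean and shape (assumed)."""
    rate = shape / mean
    return rate ** shape * x ** (shape - 1.0) * exp(-rate * x) / gamma(shape)

def exponential_pdf(x, mean):
    """Exponential density: intervals between events of a Poisson process."""
    return exp(-x / mean) / mean

def interval_cv(shape):
    """Coefficient of variation of gamma-distributed intervals."""
    return 1.0 / sqrt(shape)

# With shape = 1 the gamma density reduces to the exponential density.
print(abs(gamma_pdf(2.0, 3.0, 1.0) - exponential_pdf(2.0, 3.0)) < 1e-12)
# A shape below 1 gives a coefficient of variation above the exponential value of 1.
print(interval_cv(0.25))
```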

8.

A Non-parametric Frailty Model for Temporally Clustered Multivariate Failure Times

Tommi Härkänen Hannu Hausen Jorma I. Virtanen Elja Arjas 《Scandinavian Journal of Statistics》2003,30(3):523-533

Abstract. A model is introduced here for multivariate failure time data arising from heterogeneous populations. In particular, we consider a situation in which the failure times of individual subjects are often temporally clustered, so that many failures occur during a relatively short age interval. The clustering is modelled by assuming that the subjects can be divided into ‘internally homogeneous’ latent classes, each such class being then described by a time‐dependent frailty profile function. As an example, we reanalysed the dental caries data presented earlier in Härkänen et al. [Scand. J. Statist. 27 (2000) 577], as it turned out that our earlier model could not adequately describe the observed clustering.

9.

KENDALL'S TAU AND CONTINGENCY TABLES

B.M. Brown 《Australian & New Zealand Journal of Statistics》1988,30(3):276-291

Kendall's tau is a coefficient of concordance between two rankings of n objects. Its definition and large sample normal approximation are easily extended to the case where one of the rankings contains ties. In this paper, definition and normal approximation are extended further to the case where both rankings contain ties. The results are applied to give a fully distribution-free test for two-way contingency tables with ordered categories.
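As an illustration of a tau coefficient that allows ties in both rankings, the sketch below computes the commonly used tie-adjusted variant known as tau-b. This is an assumption made for illustration: the abstract does not say which tie-adjusted definition the paper adopts, and the normal approximation is not reproduced. The function name `kendall_tau_b` is hypothetical.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Tie-adjusted Kendall coefficient (tau-b) for two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x = tied_y = total = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        total += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1          # pair tied in the first ranking
        if dy == 0:
            tied_y += 1          # pair tied in the second ranking
        if dx * dy > 0:
            concordant += 1      # both rankings order the pair the same way
        elif dx * dy < 0:
            discordant += 1      # the rankings order the pair oppositely
    denom = sqrt((total - tied_x) * (total - tied_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when a ranking is constant")
    return (concordant - discordant) / denom

# Both rankings contain a tie: one concordant pair out of three.
print(kendall_tau_b([1, 1, 2], [1, 2, 2]))
```

For a two-way contingency table with ordered categories, each observation can be expanded into a (row category, column category) pair and passed to the same function, which makes ties in both rankings the normal case rather than the exception.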

10.

Clustering of Variables Based on Watson Distribution on Hypersphere: A Comparison of Algorithms

Adelaide Figueiredo Paulo Gomes 《Communications in Statistics - Simulation and Computation》2015,44(10):2622-2635

We consider n individuals described by p variables, represented by points on the surface of the unit hypersphere. We suppose that the individuals are fixed and the set of variables comes from a mixture of bipolar Watson distributions. For the mixture identification, we use EM and dynamic clusters algorithms, which enable us to obtain a partition of the set of variables into clusters of variables.

Our aim is to evaluate the clusters obtained in these algorithms, using measures of within-groups variability and between-groups variability, and to compare these clusters with those obtained in other clustering approaches, by analyzing simulated and real data.

11.

PCA likelihood ratio test approach for attributed social networks monitoring

M. Shaghaghi A. Saghaei 《Communications in Statistics - Theory and Methods》2020,49(12):2869-2886

Abstract

One of the most important factors in building and changing communication mechanisms in social networks is considering features of the members of social networks. Most of the existing methods in network monitoring don’t consider effects of features in network formation mechanisms, and others don’t lead to reliable results when the features abound or when there are correlations among them. In this article, we combine two methods, principal component analysis (PCA) and the likelihood method, to monitor the underlying network model when the features of individuals abound and when some of them have high correlations with each other.

12.

Semiparametric Estimators for Limited Dependent Variable (LDV) Models with Endogenous Regressors

Myoung-Jae Lee 《Econometric Reviews》2013,32(2):171-214

This article reviews semiparametric estimators for limited dependent variable (LDV) models with endogenous regressors, where nonlinearity and nonseparability pose difficulties. We first introduce six main approaches in the linear equation system literature to handle endogenous regressors with linear projections: (i) ‘substitution’ replacing the endogenous regressors with their projected versions on the system exogenous regressors x, (ii) instrumental variable estimator (IVE) based on E{(error) × x} = 0, (iii) ‘model-projection’ turning the original model into a model in terms of only x-projected variables, (iv) ‘system reduced form (RF)’ finding RF parameters first and then the structural form (SF) parameters, (v) ‘artificial instrumental regressor’ using instruments as artificial regressors with zero coefficients, and (vi) ‘control function’ adding an extra term as a regressor to control for the endogeneity source. We then check if these approaches are applicable to LDV models using conditional mean/quantiles instead of linear projection. The six approaches provide a convenient forum on which semiparametric estimators in the literature can be categorized, although there are a few exceptions. The pros and cons of the approaches are discussed, and a small-scale simulation study is provided for some reviewed estimators.

13.

Failure analysis of network nodes and edges in scale-free networks

Dui Hongyan Zhang Chi Xu Xin 《Communications in Statistics - Theory and Methods》2020,49(15):3635-3649

14.

Use of inter-block information to obtain uniformly better estimators of treatment contrasts

S. Mejza 《Statistics》2013,47(3):335-341

In this paper the problem of combining the estimates is reexamined by making use of the theory of basic contrasts. For some basic contrasts, called partially confounded, a general method of finding uniformly better combined estimators of treatment contrast is derived. The method is applicable for all proper block designs, not necessarily connected, with equal or different treatment replications, for which there are multiple efficiency factors ε of multiplicity q > 2 and if ν_e > 2, where ν_e is the number of the error degrees of freedom in the intra-block analysis.

15.

Clustering of electrical transmission systems based on network topology and stability

Sebastian Krey Sebastian Brato Uwe Ligges Jürgen Götze Claus Weihs 《Journal of Statistical Computation and Simulation》2015,85(1):47-61

A proper understanding and modelling of the behaviour of heavily loaded large-scale electrical transmission systems is essential for a secure and uninterrupted operation. In this paper, we present methods to cluster electrical power networks based on different criteria into regions. These regions are useful for the efficient modelling of large transcontinental electricity networks, switching operation decisions or placement of redundant parts of the monitoring and control system. In alternating current electricity networks, power oscillations are normal, but they can become dangerous if they build up. The first approach uses the correlation between results of a stability assessment for these oscillations at every node for the cluster criterion. The second method concentrates on the network topology and uses spectral clustering on the network graph to create clusters where all nodes are interconnected. In this work, we also discuss the problem of how to choose the right number of clusters and how the discussed clustering methods can be used for an efficient modelling of large electricity networks or in protection and control systems.

16.

Relation of modified power series distributions to lagrangian probability distributions

P.C. Consul 《Communications in Statistics - Theory and Methods》2013,42(20):2039-2046

The class of Lagrangian probability distributions (LPD), given by the expansion of a probability generating function f(t) under the transformation u = t/g(t) where g(t) is also a p.g.f., has been substantially widened by removing the restriction that the defining functions g(t) and f(t) be probability generating functions. The class of modified power series distributions defined by Gupta (1974) has been shown to be a sub-class of the wider class of LPDs.

17.

A Tandem Queue with Server Slow-Down and Blocking

《Stochastic Models》2013,29(2-3):695-724

Abstract

We consider two variants of a two-station tandem network with blocking. In both variants the first server ceases to work when the queue length at the second station hits a ‘blocking threshold.’ In addition, in variant 2 the first server decreases its service rate when the second queue exceeds a ‘slow-down threshold,’ which is smaller than the blocking level. In both variants the arrival process is Poisson and the service times at both stations are exponentially distributed. Note, however, that in case of slow-downs, server 1 works at a high rate, a slow rate, or not at all, depending on whether the second queue is below or above the slow-down threshold or at the blocking threshold, respectively. For variant 1, i.e., only blocking, we concentrate on the geometric decay rate of the number of jobs in the first buffer and prove that for increasing blocking thresholds the sequence of decay rates decreases monotonically and at least geometrically fast to max{ρ₁, ρ₂}, where ρ_i is the load at server i. The methods used in the proof also allow us to clarify the asymptotic queue length distribution at the second station. Then we generalize the analysis to variant 2, i.e., slow-down and blocking, and establish analogous results.

18.

Asymmetric generalizations of symmetric univariate probability distributions obtained through quantile splicing

Brenda V. Mac’Oduol Paul J. van Staden Robert A. R. King 《Communications in Statistics - Theory and Methods》2020,49(18):4413-4429

Abstract

Balakrishnan et al. proposed a two-piece skew logistic distribution by making use of the cumulative distribution function (CDF) of half distributions as the building block, to give rise to an asymmetric family of two-piece distributions, through the inclusion of a single shape parameter. This paper proposes the construction of asymmetric families of two-piece distributions by making use of quantile functions of symmetric distributions as building blocks. This proposition will enable the derivation of a general formula for the L-moments of two-piece distributions. Examples will be presented, where the logistic, normal, Student’s t(2) and hyperbolic secant distributions are considered.

19.

Classification Error of the Thresholded Independence Rule

Britta Anker Bak Jens Ledet Jensen Morten Fenger‐Grøn 《Scandinavian Journal of Statistics》2015,42(1):32-42

We consider classification in the situation of two groups with normally distributed data in the ‘large p small n’ framework. To counterbalance the high number of variables, we consider the thresholded independence rule. An upper bound on the classification error is established that is tailored to a mean value of interest in biological applications.

20.

Asymptotic Throughput in Discrete‐Time Cyclic Networks with Queue‐Length‐Dependent Service Rates

《Stochastic Models》2013,29(4):483-506

Abstract

For a discrete‐time closed cyclic network of single server queues whose service rates are non‐decreasing in the queue length, we compute the queue‐length distribution at each node in terms of throughputs of related networks. For the asymptotic analysis, we consider sequences of networks where the number of nodes grows to infinity, service rates are taken only from a fixed finite set of non‐decreasing sequences, the ratio of customers to nodes has a limit, and the proportion of nodes for each possible service‐rate sequence has a limit. Under these assumptions, the asymptotic throughput exists and is calculated explicitly. Furthermore, the asymptotic queue‐length distribution at any node can be obtained in terms of the asymptotic throughput. The asymptotic throughput, regarded as a function of the limiting customer‐to‐node ratio, is strictly increasing for ratios up to a threshold value (possibly infinite) and is constant thereafter. For ratios less than the threshold, the asymptotic queue‐length distribution at each node has finite moments of all orders. However, at or above the threshold, bottlenecks (nodes with asymptotically‐infinite mean queue length) do occur, and we completely characterize such nodes.