Similar Articles
Found 20 similar articles (search time: 31 ms)
1.
An important problem in network analysis is to identify significant communities. Most real-world data sets exhibit both a topological structure between nodes and attributes describing the nodes. In this paper, we propose a new community detection criterion that considers both structural similarities and attribute similarities. The clustering method integrates the cost of clustering node attributes with the cost of clustering the structural information via the normalized modularity. We show that the joint clustering problem can be formulated as a spectral relaxation problem. The proposed algorithm is capable of learning the degree of contribution of each node attribute. A number of numerical studies involving simulated and real data sets demonstrate the effectiveness of the proposed method.
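As a rough illustration of the idea (not the authors' exact criterion), the sketch below combines a normalized modularity matrix for the structure with an attribute similarity kernel and applies a spectral relaxation followed by k-means; the trade-off weight `alpha` and the RBF kernel are illustrative assumptions.

```python
# Sketch: joint spectral clustering on structure + attributes.
# `alpha` (structure/attribute trade-off) and the RBF kernel are
# illustrative assumptions, not the paper's learned weights.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def joint_spectral_communities(A, X, k, alpha=0.5):
    """A: (n, n) adjacency matrix; X: (n, p) node attributes; k: #communities."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    m = deg.sum() / 2.0
    # Normalized modularity matrix for the structural part.
    B = A - np.outer(deg, deg) / (2.0 * m)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    B_norm = d_inv_sqrt[:, None] * B * d_inv_sqrt[None, :]
    # Attribute similarity (RBF kernel), combined symmetrically.
    S = rbf_kernel(X)
    M = alpha * B_norm + (1.0 - alpha) * S
    # Spectral relaxation: top-k eigenvectors, then k-means.
    _, vecs = eigh(M, subset_by_index=[n - k, n - 1])
    return KMeans(n_clusters=k, n_init=10).fit_predict(vecs)

# Toy usage: two 10-node cliques with matching attribute blocks.
A = np.zeros((20, 20))
A[:10, :10] = 1; A[10:, 10:] = 1; np.fill_diagonal(A, 0)
X = np.vstack([np.zeros((10, 2)), np.ones((10, 2))])
print(joint_spectral_communities(A, X, k=2))
```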

2.
Choice-based conjoint experiments are used when choice alternatives can be described in terms of attributes. The objective is to infer the value that respondents attach to attribute levels. The method involves designing profiles on the basis of attributes specified at certain levels. Respondents are presented with sets of profiles and asked to select the one they consider best. However, if choice sets contain too many profiles, they may be difficult to implement. In this paper we provide strategies for reducing the number of profiles in choice sets. We consider situations where only a subset of interactions is of interest, and we obtain connected main effect plans with smaller choice sets that are capable of estimating subsets of two-factor and three-factor interactions in 2^n and 3^n plans. We also provide connected main effect plans for mixed-level designs.

3.
孙怡帆 (Sun Yifan) et al., 《统计研究》 (Statistical Research), 2019, 36(3): 124–128
Identifying disease-causing genes from among a large number of genes is an important high-dimensional statistical problem in the era of big data. Because genes are organized in network structures, the identification of disease genes has expanded from identifying single genes to identifying gene modules. Mining gene modules from a gene network is the so-called community detection (or node clustering) problem. The vast majority of community detection methods use only network structure information and ignore the information carried by the nodes themselves. In 2016, Newman and Clauset proposed a statistical-inference-based community detection method (the NC method) that organically combines the two. Taking the NC method as a case study, this paper introduces the application of statistical methods to real gene networks and the results obtained, and proposes improvements from a statistical perspective. The analysis of the NC method shows that, for unstructured data such as gene networks, statistical ideas and principles remain central to data analysis, while the corresponding statistical methods need to be adjusted and optimized for the characteristics of the data and the questions of interest.

4.
We introduce a one-step EM algorithm to estimate the graphical structure in a Poisson-log-normal graphical model. This procedure is equivalent to a normality transformation that makes the problem of identifying relationships in high-throughput microRNA (miRNA) sequence data feasible. The Poisson-log-normal model moreover allows us to account directly for the known overdispersion present in these data. We show that our EM algorithm provides a provable increase in performance in determining the network structure, and simulations confirm this improvement over a range of network structures. The model is applied to high-throughput miRNA sequencing data from patients with breast cancer from The Cancer Genome Atlas (TCGA). By selecting the most highly connected miRNA molecules in the fitted network, we find that nearly all of them are known to be involved in the regulation of breast cancer.
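The one-step EM algorithm and the overdispersion handling are specific to the paper; as a loose stand-in, the sketch below applies a log-based normality transformation to simulated overdispersed counts and fits a sparse Gaussian graphical model with the graphical lasso. The simulated data and the cross-validated penalty are assumptions.

```python
# Sketch: normality transform of count data + graphical lasso.
# This stands in for the paper's one-step EM on the Poisson-log-normal
# model; the simulated data and the CV-chosen penalty are assumptions.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
n, p = 200, 10
# Simulate overdispersed counts: latent Gaussian -> Poisson rates.
Z = rng.multivariate_normal(np.zeros(p), 0.5 * np.eye(p) + 0.5, size=n)
counts = rng.poisson(np.exp(Z))
# Crude normalizing transform, then standardize.
Y = np.log1p(counts)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
# Sparse precision matrix estimate; nonzeros = conditional dependencies.
model = GraphicalLassoCV().fit(Y)
edges = (np.abs(model.precision_) > 1e-6) & ~np.eye(p, dtype=bool)
print(np.argwhere(np.triu(edges)))
```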

5.
We consider a social network from which one observes not only the network structure (i.e., nodes and edges) but also a set of labels (tags, keywords) for each node (or user). These labels are self-created and closely related to the user's career status, lifestyle, personal interests, and more; they are therefore of great interest for online marketing. To model their joint behavior with the network structure, a complete data model is developed. The model is based on the classical p1 model but allows the reciprocation parameter to be label-dependent. By focusing on connected pairs only, the complete data model can be generalized into a conditional model that specifies only the conditional likelihood for the connected pairs. As a result, it suffers less risk of model misspecification and, because it involves connected pairs only, its computational cost is much lower. The resulting estimator is consistent and asymptotically normal, with a convergence rate that depends on the network sparsity level. To demonstrate its finite sample performance, numerical studies (based on both simulated and real datasets) are presented.
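Read loosely, the conditional model asks: given that a dyad is connected, is it reciprocated, and does that depend on the labels? The sketch below fits that conditional logistic regression on simulated data; the single label-match feature and the planted signal are assumptions, not the paper's specification.

```python
# Sketch: label-dependent reciprocity among connected dyads only.
# For each dyad with at least one edge, regress "is mutual" on a
# shared-label indicator; the simulated data are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 300
labels = rng.integers(0, 2, size=n)          # one binary label per user
A = (rng.random((n, n)) < 0.02).astype(int)  # sparse directed edges
np.fill_diagonal(A, 0)

feats, mutual = [], []
for i in range(n):
    for j in range(i + 1, n):
        if A[i, j] or A[j, i]:               # connected pairs only
            feats.append([labels[i] == labels[j]])
            mutual.append(A[i, j] * A[j, i])
feats, mutual = np.array(feats, float), np.array(mutual)
# Plant extra reciprocation for same-label pairs to make the signal real.
boost = (feats[:, 0] == 1) & (rng.random(len(mutual)) < 0.3)
mutual = np.maximum(mutual, boost.astype(int))

fit = LogisticRegression().fit(feats, mutual)
print("label-match effect on reciprocity:", fit.coef_[0][0])
```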

6.
This study investigated the impact of spatial location on the effectiveness of population-based breast screening, compared to other detection methods, in reducing breast cancer mortality among Queensland women. The analysis was based on linked population-based datasets from BreastScreen Queensland and the Queensland Cancer Registry for the period 1997–2008, for women aged less than 90 years at diagnosis. A Bayesian hierarchical regression modelling approach was adopted, and posterior estimation was performed using Markov chain Monte Carlo techniques. This approach accommodated the sparse data resulting from rare outcomes in small geographic areas, while allowing spatial correlation and demographic influences to be included. A relative survival model was chosen to evaluate the relative excess risk for each breast-cancer-related factor. Several models were fitted to examine the influence of demographic information, cancer stage, geographic information and detection method on women's relative survival. Overall, the study demonstrated that including the detection method and geographic information when assessing the relative survival of breast cancer patients helped capture unexplained and spatial variability. The study also found evidence of better survival among women whose breast cancer was diagnosed in a screening program than among those whose cancer was detected otherwise, as well as lower risk for those residing in more urban or socio-economically advantaged regions, even after adjusting for tumour stage, environmental factors and demographics. However, no evidence of dependence between method of detection and geographic location was found. This project provides a sophisticated approach to examining the benefit of a screening program while considering the influence of geographic factors.

7.
A network cluster is defined as a set of nodes with 'strong' within-group ties and 'weak' between-group ties. Most clustering methods focus on finding groups of 'densely connected' nodes, where the dyad (the tie between two nodes) serves as the building block for forming clusters. However, since the unweighted dyad cannot distinguish strong relationships from weak ones, it seems reasonable to consider an alternative building block involving more than two nodes. In the simplest case, one can consider the triad (three nodes), where the fully connected triad represents the basic unit of transitivity in an undirected network. In this work we propose a clustering framework for finding highly transitive subgraphs in an undirected, unweighted network, where the fully connected triad (or triangle configuration) is used as the building block for forming clusters. We apply our methodology to four real networks with encouraging results. Monte Carlo simulation results suggest that, on average, the proposed method yields good clustering performance on synthetic benchmark graphs relative to other popular methods.
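A tiny illustration of the triad as a building block: enumerate the triangles of a graph and merge triangles that share an edge into clusters. The merging rule used here is one simple choice and not necessarily the paper's exact procedure.

```python
# Sketch: triangles as building blocks for clusters. Triangles sharing
# an edge are merged; this simple merging rule is an assumption, not
# necessarily the paper's exact procedure.
import networkx as nx

def triangle_clusters(G):
    # Enumerate triangles: for each edge, common neighbors close a triad.
    triangles = set()
    for u, v in G.edges():
        for w in set(G[u]) & set(G[v]):
            triangles.add(frozenset((u, v, w)))
    # Merge triangles that share an edge (two common nodes).
    H = nx.Graph()
    H.add_nodes_from(triangles)
    tris = list(triangles)
    for i in range(len(tris)):
        for j in range(i + 1, len(tris)):
            if len(tris[i] & tris[j]) == 2:
                H.add_edge(tris[i], tris[j])
    return [set().union(*comp) for comp in nx.connected_components(H)]

G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6), (5, 7), (6, 7)])
print(triangle_clusters(G))  # two transitive clusters; the 3-4 bridge is weak
```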

8.
Ipsilateral breast tumor relapse (IBTR) often occurs in breast cancer patients after breast conservation therapy. The classification of IBTR status (true local recurrence versus new ipsilateral primary tumor) is subject to error, and there is no widely accepted gold standard. Time to IBTR is likely informative for IBTR classification because a new primary tumor tends to have a longer mean time to IBTR and is associated with improved survival compared with true local recurrence. Moreover, some patients may die from breast cancer or other causes in a competing-risk scenario during the follow-up period. Because the time to death can be correlated with the unobserved true IBTR status and the time to IBTR (if relapse occurs), this terminal mechanism is non-ignorable. In this paper, we propose a unified framework that addresses these issues simultaneously by modeling the misclassified binary outcome without a gold standard together with the correlated time to IBTR, subject to dependent competing terminal events. We evaluate the proposed framework in a simulation study and apply it to a real data set consisting of 4477 breast cancer patients. The adaptive Gaussian quadrature tools in the SAS procedure NLMIXED can be conveniently used to fit the proposed model. We expect to see broad applications of our model in other studies with a similar data structure.

9.
We give new constructions for discrete choice experiments (DCEs) in which all attributes have the same number of levels. These constructions use several combinatorial structures, such as orthogonal arrays, balanced incomplete block designs and Hadamard matrices. If we assume that only the main effects of the attributes are to be used to explain the results and that all attribute-level combinations are equally attractive, we show that the constructed DCEs are D-optimal.
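One classical construction of this flavor: read the rows of a Hadamard matrix as two-level profiles and pair each profile with its fold-over (all levels switched). The sketch below builds such paired choice sets and checks that the main-effects information matrix has full rank; the D-optimality claim itself comes from the design literature and is not verified here.

```python
# Sketch: fold-over pairs from a Hadamard matrix as a 2-level DCE.
# Rows -> profiles in {0,1}^7; each choice set = (profile, fold-over).
import numpy as np
from scipy.linalg import hadamard

H = hadamard(8)                     # 8x8 Hadamard matrix (+1/-1 entries)
profiles = (H[:, 1:] + 1) // 2      # drop the all-ones column -> 7 attributes
for p in profiles:
    print("choice set:", p, "vs", 1 - p)
# Main-effects information matrix (effects coding) for the paired design:
D = 2 * profiles - 1                # back to +/-1 coding
info = D.T @ D                      # orthogonal columns -> diagonal matrix
print("full rank:", np.linalg.matrix_rank(info) == 7)
```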

10.
Neural networks are a popular machine learning tool, particularly in applications such as protein structure prediction; however, overfitting can pose an obstacle to their effective use. Due to the large number of parameters in a typical neural network, one may obtain a network fit that perfectly predicts the learning data yet fails to generalize to other data sets. One way of reducing the size of the parameter space is to alter the network topology so that some edges are removed; however, it is often not immediately apparent which edges should be eliminated. We propose a data-adaptive method of selecting an optimal network architecture using a deletion/substitution/addition algorithm. Results of this approach to classification are presented on simulated data and on the breast cancer data of Wolberg and Mangasarian [1990. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc. Nat. Acad. Sci. 87, 9193–9196].
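The deletion/substitution/addition algorithm in the paper operates on network edges; as a much simplified analogue, the sketch below greedily applies delete, substitute and add moves to the hidden-layer widths of a scikit-learn MLP, keeping any move that improves cross-validated accuracy. The move set, the data (scikit-learn's Wisconsin diagnostic breast cancer set, a relative of the Wolberg data) and the scoring are illustrative assumptions.

```python
# Sketch: greedy deletion/substitution/addition over hidden-layer widths.
# A loose analogue of D/S/A model search; moves and scoring are assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)

def score(arch):
    clf = MLPClassifier(hidden_layer_sizes=arch, max_iter=500, random_state=0)
    return cross_val_score(clf, X, y, cv=3).mean()

arch, best = (16, 16), score((16, 16))
improved = True
while improved:
    improved = False
    moves = []
    if len(arch) > 1:                          # deletion: drop a layer
        moves += [arch[:i] + arch[i+1:] for i in range(len(arch))]
    for i in range(len(arch)):                 # substitution: resize a layer
        moves += [arch[:i] + (w,) + arch[i+1:] for w in (8, 32)]
    moves.append(arch + (8,))                  # addition: append a layer
    for cand in moves:
        s = score(cand)
        if s > best:
            arch, best, improved = cand, s, True
print("selected architecture:", arch, "cv accuracy:", round(best, 3))
```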

11.
Statistical agencies have conflicting obligations: to protect the confidential information provided by respondents to surveys or censuses, and to make data available for research and planning activities. To achieve both objectives when the microdata themselves are to be released, statistical agencies apply statistical disclosure limitation (SDL) methods to the data, such as noise addition, swapping or microaggregation. Some of these methods do not preserve important structure and constraints in the data, such as positivity of some attributes or inequality constraints between attributes. Failure to preserve constraints is not only problematic in terms of data utility, but may also increase disclosure risk. In this paper, we describe an SDL method that preserves both the positivity of attributes and the mean vector and covariance matrix of the original data. The basis of the method is to apply multiplicative noise with a proper, data-dependent covariance structure.
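In outline: multiplying positive attributes by unit-mean noise keeps them positive and leaves the mean unchanged in expectation. The sketch below shows that part with lognormal noise; the paper's data-dependent noise covariance, which also preserves the covariance matrix exactly, is not reproduced.

```python
# Sketch: unit-mean multiplicative lognormal noise. Positivity is
# preserved and the mean is preserved in expectation; the paper's
# data-dependent noise covariance (for exact covariance preservation)
# is not reproduced here.
import numpy as np

rng = np.random.default_rng(42)
X = rng.gamma(shape=2.0, scale=3.0, size=(1000, 3))   # positive attributes

sigma = 0.2                                           # noise scale (assumed)
# Lognormal with E[E] = 1 requires mu = -sigma^2 / 2.
E = rng.lognormal(mean=-sigma**2 / 2, sigma=sigma, size=X.shape)
Y = X * E

print("all positive:", (Y > 0).all())
print("mean before :", X.mean(axis=0).round(2))
print("mean after  :", Y.mean(axis=0).round(2))
```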

12.
We study reliable multinomial probabilistic group testing models with incomplete identification. We assume that each of the pooled items has none or some of k attributes, one of which causes contamination. Any group possessing this latter attribute is discarded, while the others are collected and separated according to the attributes found in them. The objective is to choose an optimal group size for pooled screening so as to collect prespecified numbers of items of the various types with minimum testing expenditure. We derive exact results for the underlying distributions of the stopping times, enabling us to find optimal procedures by numerical methods.
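In the simplest single-attribute special case (classical Dorfman group testing), the expected number of tests per item for groups of size k at prevalence p is 1/k + 1 - (1-p)^k, and the optimal group size can be found by direct numerical search. The paper's multinomial, multi-attribute objective is richer, but the sketch conveys the optimization step; the prevalence value is an assumption.

```python
# Sketch: numerical search for the optimal pooled-group size in the
# classical Dorfman special case. The paper's multinomial/multi-attribute
# cost is richer; this shows only the one-attribute optimization.
p = 0.03                                   # contamination prevalence (assumed)

def expected_tests_per_item(k, p):
    # One pooled test per group, plus k retests when the pool is positive.
    return 1.0 / k + 1.0 - (1.0 - p) ** k

best_k = min(range(2, 101), key=lambda k: expected_tests_per_item(k, p))
print(best_k, round(expected_tests_per_item(best_k, p), 4))
# For p = 0.03 the optimum is k = 6, using about 0.33 tests per item.
```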

13.
A statistical model assuming a preferential attachment network, generated by adding nodes sequentially according to a few simple rules, usually describes real-life networks better than, for example, a model assuming a Bernoulli random graph, in which any two nodes have the same probability of being connected. Therefore, to study the propagation of "infection" across a social network, we propose a network epidemic model that combines a stochastic epidemic model with a preferential attachment model. A simulation study based on the associated Markov chain Monte Carlo algorithm reveals an identifiability issue with the model parameters. Finally, the network epidemic model is applied to a set of online commissioning data.
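The combination can be sketched in a few lines: grow a Barabási-Albert (preferential attachment) graph, then run a discrete-time stochastic SIR epidemic over it. The infection and recovery rates here are placeholders, not estimates from the paper's data.

```python
# Sketch: stochastic SIR epidemic on a preferential-attachment graph.
# beta, gamma and the seed node are placeholder assumptions.
import random
import networkx as nx

random.seed(0)
G = nx.barabasi_albert_graph(n=500, m=2, seed=0)
beta, gamma = 0.1, 0.05                      # per-contact infection / recovery
state = {v: "S" for v in G}
state[0] = "I"

history = []
while any(s == "I" for s in state.values()):
    new_state = dict(state)
    for v, s in state.items():
        if s == "I":
            for u in G[v]:                   # try to infect susceptible neighbors
                if state[u] == "S" and random.random() < beta:
                    new_state[u] = "I"
            if random.random() < gamma:
                new_state[v] = "R"
    state = new_state
    history.append(sum(s == "I" for s in state.values()))
print("peak infected:", max(history), "final recovered:",
      sum(s == "R" for s in state.values()))
```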

14.
It is widely recognized that early detection of malignant cancer is associated with improved survival prospects. If regular examinations are given, then in the case of breast cancer there is a high chance of locating lesions before they would normally be found by the patient. Such examinations are called screenings and may involve several detection modalities. The two main parameters in the design of a screening program are the frequency of examination and the sensitivity of the detection modality. Models are developed in this paper to examine the effect of screening on the size of tumors at the time of detection. They are then used to assess the effect of the frequency and sensitivity of the screening program on the non-recurrence rate for breast cancer. As a result of the modelling, various recommendations on screening design are given.
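A toy simulation of the two design parameters: tumors grow exponentially, screens occur every `delta` time units with size-dependent sensitivity, and the patient otherwise self-detects at a threshold size. All growth and detection parameters are invented for illustration, not taken from the paper's models.

```python
# Sketch: effect of screening interval and sensitivity on detected size.
# Growth rate, thresholds, and the logistic sensitivity curve are all
# invented illustration parameters, not the paper's fitted model.
import math
import random

random.seed(0)

def detected_size(delta, sens_scale, growth=0.5, self_detect=30.0):
    t = 0.0
    while True:
        t += delta
        size = math.exp(growth * t)             # exponential tumor growth
        if size >= self_detect:
            return self_detect                  # found by the patient
        # Screening sensitivity rises with tumor size (logistic curve).
        sens = 1.0 / (1.0 + math.exp(-(size - 10.0) / sens_scale))
        if random.random() < sens:
            return size                         # found at a screen

for delta in (0.5, 1.0, 2.0):
    sizes = [detected_size(delta, sens_scale=3.0) for _ in range(2000)]
    print(f"interval {delta}: mean detected size {sum(sizes)/len(sizes):.1f}")
```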

15.
A subset T of S is said to be a Pareto optimal subset of m ordered attributes (factors) if no profile (combination of attribute levels) in T 'dominates' another; that is, for profiles (x1, …, xm) and (y1, …, ym) in T, there exists no pair such that xi ≤ yi for every i = 1, …, m. Pareto optimal designs have specific applications in economics, cognitive psychology, and marketing research, where investigators use main effects linear models to infer how respondents value levels of costs and benefits from their preferences for the sets of profiles offered to them. In such studies, it is desirable that no profile dominates the others in a set. This paper shows how to construct a Pareto optimal subset, proves that a single Pareto optimal subset is not a connected main effects plan, provides subsets of two or more attributes that are connected in symmetric designs, and gives corresponding results for asymmetric designs.
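The defining condition translates directly into code: keep a profile only if no other profile is at least as good on every attribute and strictly better on at least one. A minimal filter:

```python
# Sketch: extract the non-dominated (Pareto optimal) profiles from a set.
# y dominates x when y >= x on every attribute and y > x on at least one.
def pareto_optimal(profiles):
    def dominates(y, x):
        return all(b >= a for a, b in zip(x, y)) and any(b > a for a, b in zip(x, y))
    return [x for x in profiles if not any(dominates(y, x) for y in profiles)]

profiles = [(1, 3), (2, 2), (3, 1), (2, 3), (1, 1)]
print(pareto_optimal(profiles))   # [(3, 1), (2, 3)]; dominated profiles removed
```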

16.
This paper compares and contrasts two methods of obtaining opinions using questionnaires. As the name suggests, a conjoint study makes it possible to consider several attributes jointly; conjoint analysis is a statistical method for analysing preferences. However, conjoint analysis requires a certain amount of effort from the respondent. The alternative is ordinary survey questions, answered one at a time. Survey questions are easier to grasp mentally, but they do not challenge the respondent to prioritize. This investigation used both methods, survey and conjoint, making it possible to compare them on real data. Attribute importance, attribute correlations, case clustering and attribute grouping are evaluated by both methods, and the correspondence between how the two methods measure each attribute is also given. Overall, both methods yield the same picture concerning the relative importance of the attributes. Taken one attribute at a time, the correspondence between the methods varies from good to none. Considering all attributes together by cluster analysis of the cases, the conjoint and survey data yield different cluster structures. The attributes are grouped by factor analysis, and there is reasonable correspondence. The data originate from the EU project 'New intermediary services and the transformation of urban water supply and wastewater disposal systems in Europe'.

17.
A discrete approximation to the Polya tree prior, suitable for latent data, is proposed that enjoys surprisingly simple and efficient conjugate updating. This approximation is illustrated in two applied contexts: a nonparametric meta-analysis of studies on the relationship between alcohol consumption and breast cancer, and random intercept Poisson regression for Ache armadillo hunting treks. The discrete approximation is then smoothed with Gaussian kernels to provide a smooth density for use with continuous data; the smoothed approximation is illustrated on a classic dataset on galaxy velocities and on recent data involving breast cancer survival in Louisiana.
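For a finite dyadic Polya tree on [0, 1], the conjugate updating the abstract alludes to is Beta-binomial bookkeeping: each observation increments a count for the child interval it falls in at every depth, and posterior split probabilities come from the updated Beta parameters. A minimal sketch of the posterior mean density, using the common c·j² prior weighting (an assumed convention here):

```python
# Sketch: conjugate updating of a finite dyadic Polya tree on [0, 1].
# Prior Beta(c*j^2, c*j^2) at depth j is a common convention (assumed);
# each data point just increments counts along its dyadic path.
import numpy as np

J, c = 4, 1.0
rng = np.random.default_rng(0)
data = rng.beta(2, 5, size=200)             # toy sample in [0, 1]

# counts[j][k]: observations falling in interval k at depth j+1.
counts = [np.zeros(2 ** (j + 1)) for j in range(J)]
for x in data:
    for j in range(J):
        k = min(int(x * 2 ** (j + 1)), 2 ** (j + 1) - 1)
        counts[j][k] += 1

# Posterior mean density over the 2^J leaf intervals.
dens = np.ones(2 ** J)
for j in range(J):
    a = c * (j + 1) ** 2
    for leaf in range(2 ** J):
        k = leaf >> (J - j - 1)             # ancestor interval at depth j+1
        sib = k ^ 1                         # its sibling under the same parent
        prob = (a + counts[j][k]) / (2 * a + counts[j][k] + counts[j][sib])
        dens[leaf] *= prob
dens *= 2 ** J                              # divide by leaf width 2^-J
print(dens.round(2))                        # posterior mean density over 16 bins
```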

18.
Screening programs for breast cancer are widely used to reduce the impact of breast cancer in populations. For example, the South Australian Breast X-ray Service, BreastScreen SA, established in 1989, is a participant in the National Program of Early Detection of Breast Cancer. BreastScreen SA has collected information on both screening-detected and interval or self-reported cases, which enables the estimation of various important attributes of the screening mechanism. In this paper, a tailored model is fitted to the BreastScreen SA data. The probabilities that screening detects a tumour of a given size, and that an individual reports a tumour by a specified size in the absence of screening, are estimated. Estimates of the distribution of sizes detected in the absence of screening, and at the first two screenings, are also given.

19.
A cure rate model is a survival model incorporating a cure rate, under the assumption that the population contains both uncured and cured individuals. It is a powerful statistical tool for prognostic studies, especially in cancer, and the cure rate is important for making treatment decisions in clinical practice. The proportional hazards (PH) cure model can predict the cure rate for each patient; it contains a logistic regression component for the cure rate and a Cox regression component to estimate the hazard for uncured patients. A measure for quantifying the predictive accuracy of the cure rate estimated by the Cox PH cure model is required, as there has been a lack of previous research in this area. We used the Cox PH cure model for breast cancer data; however, the area under the receiver operating characteristic curve (AUC) could not be estimated because many patients were censored. In this study, we used imputation-based AUCs to assess the predictive accuracy of the cure rate from the PH cure model, and we examined the precision of these AUCs in simulation studies. The results demonstrated that the imputation-based AUCs were estimable and that their biases were negligibly small in many cases, even though the ordinary AUC could not be estimated. Additionally, we introduce a bias-correction method for imputation-based AUCs and find that the bias-corrected estimate successfully compensates for the overestimation in the simulation studies. We also illustrate the estimation of the imputation-based AUCs using the breast cancer data.
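In outline, an imputation-based AUC fills in each censored patient's unknown cure status with draws from an estimated conditional cure probability, computes an ordinary AUC on each completed data set, and averages. The sketch below uses synthetic data and a deliberately simplified imputation probability; the paper's imputation is model-based.

```python
# Sketch: imputation-based AUC for a cure probability when many
# outcomes are censored. The conditional cure probability used for
# imputation is simplified; the paper's estimator is model-based.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
p_cure = rng.beta(2, 2, size=n)                 # model-estimated cure prob
cured = rng.random(n) < p_cure                  # true status (unknown if censored)
censored = rng.random(n) < 0.4                  # ~40% censored at random
observed = np.where(censored, -1, cured.astype(int))   # -1 = unknown

aucs = []
for _ in range(50):                             # multiple imputations
    status = observed.copy()
    idx = status == -1
    # Impute unknown statuses from the estimated cure probabilities.
    status[idx] = (rng.random(idx.sum()) < p_cure[idx]).astype(int)
    aucs.append(roc_auc_score(status, p_cure))
print("imputation-based AUC:", round(float(np.mean(aucs)), 3))
```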

20.
Current statistical methods for analyzing epidemiological data with disease subtype information allow us to learn not only about associations between risk factors and disease subtypes but also, more profoundly, about heterogeneity in these associations across multiple disease characteristics (the so-called etiologic heterogeneity of the disease). Current interest, particularly in cancer epidemiology, lies in obtaining a valid p-value for testing whether a particular cancer is etiologically heterogeneous. We consider the two-stage logistic regression model, together with a pseudo-conditional likelihood estimation method, and design a testing strategy based on Rao's score test. An extensive Monte Carlo simulation study is carried out, and the false discovery rate and statistical power of the suggested test are investigated. Simulation results indicate that, with the proposed testing strategy, even a small degree of true etiologic heterogeneity can be recovered with large statistical power from the sampled data. The strategy is then applied to a breast cancer data set to illustrate its use in practice, where there are multiple risk factors and multiple disease characteristics of simultaneous concern.
