1.

On selection biases with prediction rules formed from gene expression data

J.X. Zhu G.J. McLachlan L. Ben-Tovim Jones I.A. Wood 《Journal of statistical planning and inference》2008

There has been ever-increasing interest in the use of microarray experiments as a basis for the provision of prediction (discriminant) rules for improved diagnosis of cancer and other diseases. Typically, the microarray cancer studies provide only a limited number of tissue samples from the specified classes of tumours or patients, whereas each tissue sample may contain the expression levels of thousands of genes. Thus researchers are faced with the problem of forming a prediction rule on the basis of a small number of classified tissue samples, which are of very high dimension. Usually, some form of feature (gene) selection is adopted in the formation of the prediction rule. As the subset of genes used in the final form of the rule has not been randomly selected but rather chosen according to some criterion designed to reflect the predictive power of the rule, there will be a selection bias inherent in estimates of the error rates of the rules if care is not taken. We shall present various situations where selection bias arises in the formation of a prediction rule and where there is a consequent need for the correction of this bias. We describe the design of cross-validation schemes that are able to correct for the various selection biases.
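The selection bias described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes pure-noise expression data, a simple mean-difference gene ranking, and a nearest-centroid rule, and all function names are hypothetical. It contrasts leave-one-out cross-validation with genes selected once on all samples (biased) against selection repeated inside each fold (bias-corrected).

```python
import random

def make_noise_data(n_per_class, n_genes, rng):
    """Pure-noise expression matrix: the class labels carry no real signal."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)]
         for _ in range(2 * n_per_class)]
    y = [0] * n_per_class + [1] * n_per_class
    return X, y

def class_means(X, y, idx, genes):
    """Per-class mean of each listed gene, computed from the rows in idx."""
    means = {}
    for c in (0, 1):
        rows = [X[i] for i in idx if y[i] == c]
        means[c] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    return means

def select_genes(X, y, idx, k):
    """Keep the k genes with the largest absolute difference in class means."""
    all_genes = range(len(X[0]))
    m = class_means(X, y, idx, all_genes)
    scores = [abs(a - b) for a, b in zip(m[0], m[1])]
    return sorted(all_genes, key=lambda g: -scores[g])[:k]

def predict(X, y, train_idx, genes, i):
    """Nearest-centroid prediction for sample i using the selected genes."""
    m = class_means(X, y, train_idx, genes)
    dist = {c: sum((X[i][g] - mu) ** 2 for g, mu in zip(genes, m[c]))
            for c in (0, 1)}
    return 0 if dist[0] <= dist[1] else 1

def loocv_error(X, y, k, select_inside_fold):
    """Leave-one-out error; gene selection is either redone per fold or fixed."""
    everyone = list(range(len(y)))
    fixed_genes = select_genes(X, y, everyone, k)  # uses the held-out sample too
    errors = 0
    for i in everyone:
        train = [j for j in everyone if j != i]
        genes = select_genes(X, y, train, k) if select_inside_fold else fixed_genes
        errors += predict(X, y, train, genes, i) != y[i]
    return errors / len(y)

if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_noise_data(n_per_class=10, n_genes=1000, rng=rng)
    print("selection outside CV:", loocv_error(X, y, 10, select_inside_fold=False))
    print("selection inside CV: ", loocv_error(X, y, 10, select_inside_fold=True))
```

Because the data contain no signal, an honest error estimate should sit near 50%; the estimate obtained with genes chosen on the full data set tends to be far more optimistic.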

2.

Statistical methodology for assessing homology of intronic regions of genes

Deborah L. Hall Karen Kafadar Alvin M. Malkinson 《Revue canadienne de statistique》1998,26(3):455-465

We consider the problem of statistically evaluating the similarity of DNA intronic regions of genes. Present algorithms are based on matching a sequence of interest with known DNA sequences in a gene bank and are designed primarily to assess homology among exonic regions of genes. Most research focuses on exonic regions because they have a clear biological significance, coding for proteins, and therefore tend to be more conserved in evolution than intronic regions. To investigate whether the intronic features of genes whose expression is highly sensitive to environmental perturbations differ from those of genes that have a more constant expression, a collection of oncogenes, tumor suppressor genes, and nonregulatory genes involved in energy metabolism is compared. An analysis of the features of these genes' intronic regions results in clustering by regulatory group. In addition, Billingsley's test for Markov structure (1961) suggests that 67% of the intronic regions in this collection of genes show evidence of nonrandom structure, indicating the possibility of a biological function for these regions. The result of Billingsley's test for homology is used as input to a clustering algorithm. The biological significance of this methodology lies in the identification of groups based on the intronic regions from genes of unknown function. With the advent of rapid sequencing techniques, there is a great need for statistical techniques to help identify the purpose of poorly understood portions of genes. These methods can be utilized to assess the functional group to which such a gene might possibly belong.
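As a rough illustration of testing a DNA sequence for Markov (nonrandom) structure, the sketch below computes a chi-square statistic that compares observed base-to-base transition counts with the counts expected if successive bases were independent. This is one common form of such a test; the exact statistic used in the paper may differ. The function name and the constant are hypothetical, and the 9 degrees of freedom assume that all four bases occur in the sequence.

```python
from collections import Counter

# Upper 5% point of the chi-square distribution with (4 - 1)^2 = 9 degrees of freedom.
CHI2_CRIT_9DF_05 = 16.919

def markov_chi_square(seq, alphabet="ACGT"):
    """Chi-square statistic for independence of successive bases in seq."""
    pairs = Counter(zip(seq, seq[1:]))      # observed transition counts
    n = len(seq) - 1                        # number of transitions
    row = {a: sum(pairs[(a, b)] for b in alphabet) for a in alphabet}
    col = {b: sum(pairs[(a, b)] for a in alphabet) for b in alphabet}
    stat = 0.0
    for a in alphabet:
        for b in alphabet:
            expected = row[a] * col[b] / n  # count expected under independence
            if expected > 0:
                stat += (pairs[(a, b)] - expected) ** 2 / expected
    return stat

if __name__ == "__main__":
    periodic = "ACGT" * 50                  # strongly structured toy sequence
    stat = markov_chi_square(periodic)
    print(stat, stat > CHI2_CRIT_9DF_05)
```

A statistic above the critical value suggests that the next base depends on the current one, which is the kind of nonrandom structure the abstract refers to.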

3.

Clustering microarray data using model-based double K-means

Francesca Martella Maurizio Vichi 《Journal of applied statistics》2012,39(9):1853-1869

Microarray technology allows the measurement of expression levels of thousands of genes simultaneously. The dimension and complexity of gene expression data obtained by microarrays create challenging data analysis and management problems, ranging from the analysis of images produced by microarray experiments to the biological interpretation of results. Therefore, statistical and computational approaches are beginning to assume a substantial position within the molecular biology area. We consider the problem of simultaneously clustering genes and tissue samples (in general, conditions) of a microarray data set. This can be useful for revealing groups of genes involved in the same molecular process as well as groups of conditions where this process takes place. The need to find a subset of genes and tissue samples defining a homogeneous block has led to the application of double clustering techniques to gene expression data. Here, we focus on an extension of standard K-means to simultaneously cluster observations and features of a data matrix, namely double K-means, introduced by Vichi (2000). We introduce this model in a probabilistic framework and discuss the advantages of using this approach. We also develop a coordinate ascent algorithm and test its performance via simulation studies and a real data set. Finally, we validate the results obtained on the real data set by building resampling confidence intervals for block centroids.

4.

A new approach: Interrelated two-way clustering of gene expression data

B. Chandra S. Shanker Saroj Mishra 《Statistical Methodology》2006,3(1):93

The paper presents a new approach to interrelated two-way clustering of gene expression data. Clustering of genes has been effected using entropy and a correlation measure, whereas the samples have been clustered using the fuzzy C-means. The efficiency of this approach has been tested on two well known data sets: the colon cancer data set and the leukemia data set. Using this approach, we were able to identify the important co-regulated genes and group the samples efficiently at the same time.

5.

An empirical comparison of Canonical Correspondence Analysis and STATICO in the identification of spatio-temporal ecological relationships

Susana Mendes M. José Fernández-Gómez Mário Jorge Pereira Ulisses Miranda Azeiteiro M. Purificación Galindo-Villardón 《Journal of applied statistics》2012,39(5):979-994

The wide-ranging and rapidly evolving nature of ecological studies means that it is not possible to cover all existing and emerging techniques for analyzing multivariate data. However, two important methods have attracted many followers: Canonical Correspondence Analysis (CCA) and the STATICO analysis. Each has its own characteristics, and their similarities and differences, when analyzed properly, can together provide important results that complement those usually exploited by researchers. On the one hand, the use of CCA is widespread and well implemented, solving many problems formulated by ecologists; on the other hand, the method has some weaknesses, mainly caused by the restriction imposed on the number of variables to which it can be applied (much higher in comparison with samples). The STATICO method has no such restriction, but requires that the number of variables (species or environment) be the same at each time or location. Yet the STATICO method presents information that can be more detailed, since it allows the variability within groups (either in time or space) to be visualized. In this study, the data needed for implementing these methods are outlined, and the two methods are compared to show the advantages and disadvantages of each. The ecological data treated are a sequence of pairs of ecological tables, where species abundances and environmental variables are measured at different, specified locations over the course of time.

6.

An extended association rule mining strategy for gene relationship discovery from microarray data

《Journal of Statistical Computation and Simulation》2012,82(2):384-396

DNA microarrays allow the expression levels of a large number of genes to be measured across different experimental conditions and/or samples. Association rule mining (ARM) methods are helpful in finding associational relationships between genes. However, classical association rule mining (CARM) algorithms extract only a subset of the associations that exist among different binary states, and can therefore infer only part of the gene regulation relationships. To resolve this problem, we developed an extended association rule mining (EARM) strategy along with a new definition of association rules. Compared with the CARM method, our new approach extracted more frequent genesets from a public microarray data set. The EARM method discovered some biologically interesting association rules that were not detected by CARM. Therefore, EARM provides an effective tool for exploring relationships among genes.

7.

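Ordered cluster analysis of panel data and its application: the case of cluster analysis of global climate change

杨毅 赵国浩 秦爱民 《Statistics & Information Forum》2012,27(7):13-18

Ordered cluster analysis of panel data is an emerging research area in multivariate statistical analysis. Drawing on principal component analysis from multivariate statistics, the panel data are reduced in dimension along the time variable, which minimizes the loss of variation information and reflects fairly accurately the overall level of change of the samples within each time period. Fisher's optimal partition algorithm is then applied to the principal component scores for ordered clustering, offering some ideas for studying the closeness relationships in ordered panel data. A cluster analysis of global climate change is carried out to analyze the characteristics of global and regional climate change over the past fifty years; comparison with the conclusions of international studies shows good applicability.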

8.

Performance of localized regression tree splitting criteria on data with discontinuities

Alexandra P. Bremner Ross H. Taplin 《Australian & New Zealand Journal of Statistics》2004,46(3):367-381

Properties of the localized regression tree splitting criterion, described in Bremner & Taplin (2002) and referred to as the BT method, are explored in this paper and compared to those of Clark & Pregibon's (1992) criterion (the CP method). These properties indicate why the BT method can result in superior trees. This paper shows that the BT method exhibits a weak bias towards edge splits, and the CP method exhibits a strong bias towards central splits in the presence of main effects. A third criterion, called the SM method, that exhibits no bias towards a particular split position is introduced. The SM method is a modification of the BT method that uses more symmetric local means. The BT and SM methods are more likely to split at a discontinuity than the CP method because of their relatively low bias towards particular split positions. The paper shows that the BT and SM methods can be used to discover discontinuities in the data, and that they offer a way of producing a variety of different trees for examination or for tree averaging methods.

9.

Missing data in clinical trials: from clinical assumptions to statistical analysis using pattern mixture models

Bohdana Ratitch Michael O'Kelly Robert Tosiello 《Pharmaceutical statistics》2013,12(6):337-347

The need to use rigorous, transparent, clearly interpretable, and scientifically justified methodology for preventing and dealing with missing data in clinical trials has been a focus of much attention from regulators, practitioners, and academicians over the past years. New guidelines and recommendations emphasize the importance of minimizing the amount of missing data and carefully selecting primary analysis methods on the basis of assumptions regarding the missingness mechanism suitable for the study at hand, as well as the need to stress-test the results of the primary analysis under different sets of assumptions through a range of sensitivity analyses. Some methods that could be effectively used for dealing with missing data have not yet gained widespread usage, partly because of their underlying complexity and partly because of a lack of relatively easy approaches to their implementation. In this paper, we explore several strategies for missing data on the basis of pattern mixture models that embody clear and realistic clinical assumptions. Pattern mixture models provide a statistically reasonable yet transparent framework for translating clinical assumptions into statistical analyses. Implementation details for some specific strategies are provided in an Appendix (available online as Supporting Information), whereas the general principles of the approach discussed in this paper can be used to implement various other analyses with different sets of assumptions regarding missing data. Copyright © 2013 John Wiley & Sons, Ltd.

10.

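A study on extending canonical correlation analysis

杜子芳 常志勇 《Statistics & Information Forum》2014,(5):3-7

In canonical correlation analysis, obtaining the expressions of the canonical variates does not complete the task; for example, the number of canonical variates must still be determined and variables must be selected. For the number of canonical variates, shortcomings of the commonly used chi-square test and of redundancy analysis are identified, and a new algorithm is proposed. For the selection of the original variables, three possible approaches are proposed. Finally, the conclusions are verified using human body dimension data.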

11.

Assessing the impact of initial nonresponse and attrition in the analysis of unemployment duration with panel surveys

Marjo Pyy-Martikainen Ulrich Rendtel 《AStA Advances in Statistical Analysis》2008,92(3):297-318

We show how register data combined at person-level with survey data can be used to conduct a novel type of nonresponse analysis in a panel survey. The availability of register data provides a unique opportunity to directly test the type of the missingness mechanism as well as estimate the size of bias due to initial nonresponse and attrition. We are also able to study in-depth the determinants of initial nonresponse and attrition. We use the Finnish subset of the European Community Household Panel (FI ECHP) data combined with register panel data and unemployment spells as outcome variables of interest. Our results show that initial nonresponse and attrition are clearly different processes driven by different background variables. Both the initial nonresponse and attrition mechanisms are nonignorable with respect to analysis of unemployment spells. Finally, our results suggest that initial nonresponse may play a role at least as important as attrition in causing bias. This result challenges the common view of attrition being the main threat to the value of panel data.

12.

Accounting for clinical covariates and interactions in ranking genomic markers using ROC

Tao Yu Shuangge Ma 《Communications in Statistics - Simulation and Computation》2017,46(5):3735-3755

In biomedical research, profiling is now commonly conducted, generating high-dimensional genomic measurements (without loss of generality, say genes). An important analysis objective is to rank genes according to their marginal associations with a disease outcome/phenotype. Clinical covariates, including for example clinical risk factors and environmental exposures, usually exist and need to be properly accounted for. In this study, we propose conducting marginal ranking of genes using a receiver operating characteristic (ROC) based method. This method can accommodate categorical, censored survival, and continuous outcome variables in a very similar manner. Unlike logistic-model-based methods, it does not make very specific assumptions about the model, which makes it robust. In ranking genes, we account for both the main effects of clinical covariates and their interactions with genes, and develop multiple diagnostic accuracy improvement measurements. Using simulation studies, we show that the proposed method is effective in that genes associated with the outcome, or with gene–covariate interactions associated with the outcome, receive high rankings. In data analysis, we observe some differences between the rankings using the proposed method and the logistic-model-based method.

13.

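Analysis of the environmental Kuznets curve based on panel data and SEA: a discussion with Ma Shucai (马树才) and Li Guozhu (李国柱)

李刚 《Statistical Research》2007,24(5):54-59

To address problems in current research on China's environmental Kuznets curve, this paper uses panel data models and spatial econometric models, which overcome the small sample sizes encountered with time series models and the spatial autocorrelation that readily arises with cross-sectional data. The results show that some of China's environmental indicators exhibit the inverted-U shape of the environmental Kuznets curve.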

14.

Estimation of the need for child care in Canada

Jane F. Gentleman G.A. Whitmore 《Revue canadienne de statistique》1991,19(3):242-249

Finding adequate child care is a serious problem for many Canadian parents. The purpose of this case study is to estimate the need for child care in Canada, including the portion of this need that may be hidden. We utilize the Family History Survey conducted by Statistics Canada (1984) to explore patterns of both met and unmet child-care needs, based, in the latter case, on varying assumptions about the degree of parents' desire for child care.

15.

Functional data analysis: estimation of the relative error in functional regression under random left-truncation model

Belkais Altendji Jacques Demongeot Ali Laksaci 《Journal of nonparametric statistics》2018,30(2):472-490

In this paper, we investigate the relationship between a functional random covariate and a scalar response which is subject to left-truncation by another random variable. Specifically, we use the mean squared relative error as a loss function to construct a nonparametric estimator of the regression operator for these functional truncated data. Under some standard assumptions in functional data analysis, we establish the almost sure consistency, with rates, of the constructed estimator as well as its asymptotic normality. A simulation study on finite-sized samples is then carried out to show the efficiency of our estimation procedure and to highlight its superiority over classical kernel estimation for different levels of simulated truncated data.

16.

Teaching Bayesian Statistics Using Sampling Methods and MINITAB

James H. Albert 《The American statistician》2013,67(3):182-191

Bayesian statistics can be hard to teach at an elementary level due to the difficulty in deriving the posterior distribution for interesting nonconjugate problems. One attractive method of summarizing the posterior distribution is to directly simulate from the probability distribution of interest and then explore the simulated sample. We illustrate the use of Rubin's Sampling-Importance-Resampling (SIR) algorithm to simulate posterior distributions for three inference problems. In each example, we focus on the construction of the prior distribution and then use exploratory data analysis techniques to describe the posterior samples and make inferences. The use of MINITAB macros is presented to illustrate the ease of performing this simulation on standard statistical computer programs.
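The SIR idea mentioned in this abstract can be sketched in a few lines. The example below is written in Python rather than MINITAB and is not taken from the article: it assumes a binomial likelihood, a hypothetical nonconjugate prior proportional to exp(-4|p - 0.5|), and a uniform proposal on (0, 1). Draws from the proposal are weighted by the unnormalized posterior and then resampled in proportion to those weights.

```python
import math
import random

def unnormalized_posterior(p, successes, trials):
    """Prior times likelihood, up to a constant (illustrative nonconjugate prior)."""
    prior = math.exp(-4.0 * abs(p - 0.5))
    likelihood = p ** successes * (1.0 - p) ** (trials - successes)
    return prior * likelihood

def sir_sample(successes, trials, n_draws, n_keep, rng):
    """Sampling-Importance-Resampling with a Uniform(0, 1) proposal."""
    # Step 1 (sampling): draw candidates from the proposal.
    draws = [rng.random() for _ in range(n_draws)]
    # Step 2 (importance): weight = target / proposal; the proposal density is 1.
    weights = [unnormalized_posterior(p, successes, trials) for p in draws]
    # Step 3 (resampling): resample candidates with probability proportional to weight.
    return rng.choices(draws, weights=weights, k=n_keep)

if __name__ == "__main__":
    rng = random.Random(42)
    sample = sir_sample(successes=14, trials=20, n_draws=20000, n_keep=2000, rng=rng)
    print("approximate posterior mean:", sum(sample) / len(sample))
```

The resampled values can then be explored with ordinary descriptive tools (histograms, quantiles), which is the teaching approach the abstract describes.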

17.

Comparison of algorithms for replacing missing data in discriminant analysis

J.Twedt Daniel D.S. Gill 《Communications in Statistics - Theory and Methods》2013,42(6):1567-1578

We examined the impact of different methods for replacing missing data in discriminant analyses conducted on randomly generated samples from multivariate normal and non-normal distributions. The probabilities of correct classification were obtained for these discriminant analyses before and after randomly deleting data, as well as after deleted data were replaced using: (1) variable means, (2) principal component projections, and (3) the EM algorithm. Populations compared were: (1) multivariate normal with covariance matrices ∑₁=∑₂, (2) multivariate normal with ∑₁≠∑₂, and (3) multivariate non-normal with ∑₁=∑₂. Differences in the probabilities of correct classification were most evident for populations with small Mahalanobis distances or high proportions of missing data. The three replacement methods performed similarly, but all were better than non-replacement.

18.

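Analysis of the production efficiency of health care services across China's regions

罗良清 胡美玲 《Statistics & Information Forum》2008,23(2):47-51

Health care is an issue that closely concerns the population, so the production efficiency of health care services attracts a great deal of attention. The DEA model can be used to study the production efficiency of health care services in China's regions and to analyze why efficiency differs between regions. The results show that although the production efficiency of China's health care services is low overall, there are large differences between regions, and even regions that are all high-efficiency or all low-efficiency differ in their input and output levels.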

19.

Mixture model on the variance for the differential analysis of gene expression data

Paul Delmar Stéphane Robin Diana Tronik-Le Roux Jean Jacques Daudin 《Journal of the Royal Statistical Society. Series C, Applied statistics》2005,54(1):31-50

Summary. In microarray experiments, accurate estimation of the gene variance is a key step in the identification of differentially expressed genes. Variance models range from the overly stringent homoscedastic assumption to the overparameterized model that assumes a specific variance for each gene. Between these two extremes there is some room for intermediate models. We propose a method that identifies clusters of genes with equal variance. We use a mixture model on the gene variance distribution. A test statistic for ranking and detecting differentially expressed genes is proposed. The method is illustrated with publicly available complementary deoxyribonucleic acid microarray experiments, an unpublished data set and further simulation studies.

20.

On the Estimation of the Density of a Directional Data Stream

Aboubacar Amiri Baba Thiam Thomas Verdebout 《Scandinavian Journal of Statistics》2017,44(1):249-267

Many directional data, such as wind directions, can be collected extremely easily, so that experiments typically yield a huge number of data points that are collected sequentially. With such big data, the traditional nonparametric techniques quickly become too slow to compute and are therefore useless in practice if real-time or online forecasts are expected. In this paper, we propose a recursive kernel density estimator for directional data which (i) can be updated extremely easily when a new set of observations is available and (ii) asymptotically keeps the nice features of the traditional kernel density estimator. Our methodology is based on Robbins–Monro stochastic approximation ideas. We show that our estimator outperforms the traditional techniques in terms of computational time while being extremely competitive in terms of efficiency with respect to its competitors in the sequential context considered here. We obtain expressions for its asymptotic bias and variance together with an almost sure convergence rate and an asymptotic normality result. Our technique is illustrated on a wind dataset collected in Spain. A Monte-Carlo study confirms the nice properties of our recursive estimator with respect to its non-recursive counterpart.
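The recursive update described in this abstract can be sketched for the simplest directional case, angles on the circle. The code below is an illustration under stated assumptions rather than the authors' estimator: it assumes a von Mises kernel, a step size of 1/n, and a concentration (inverse bandwidth) sequence n^0.4, and all function names are hypothetical. Each new observation updates the stored estimate on a fixed grid without revisiting earlier data.

```python
import math
import random

def bessel_i0(kappa):
    """Modified Bessel function I0 via its power series (normalizes the kernel)."""
    term, total, m = 1.0, 1.0, 0
    while term > 1e-15 * total:
        m += 1
        term *= (kappa / 2.0) ** 2 / (m * m)
        total += term
    return total

def recursive_density(stream, grid, exponent=0.4):
    """Stochastic-approximation update f_n = (1 - g_n) f_{n-1} + g_n K_n(., X_n)."""
    estimate = [0.0] * len(grid)
    for n, x in enumerate(stream, start=1):
        gamma = 1.0 / n                     # step size (assumed choice)
        kappa = n ** exponent               # kernel concentration grows with n
        norm = 2.0 * math.pi * bessel_i0(kappa)
        estimate = [
            (1.0 - gamma) * f + gamma * math.exp(kappa * math.cos(t - x)) / norm
            for f, t in zip(estimate, grid)
        ]
    return estimate

if __name__ == "__main__":
    rng = random.Random(1)
    grid = [2.0 * math.pi * k / 360 for k in range(360)]
    # Simulated "wind directions" centred at pi radians.
    stream = (rng.vonmisesvariate(math.pi, 4.0) for _ in range(2000))
    density = recursive_density(stream, grid)
    print("mode (radians):", grid[density.index(max(density))])
```

Because the stream is consumed one observation at a time, the cost of each update depends only on the grid size, not on how many observations have already been seen.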