期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Clustering microarray data using model-based double K-means

Francesca Martella Maurizio Vichi 《Journal of applied statistics》2012,39(9):1853-1869

The microarray technology allows the measurement of expression levels of thousands of genes simultaneously. The dimension and complexity of gene expression data obtained by microarrays create challenging data analysis and management problems ranging from the analysis of images produced by microarray experiments to biological interpretation of results. Therefore, statistical and computational approaches are beginning to assume a substantial position within the molecular biology area. We consider the problem of simultaneously clustering genes and tissue samples (in general conditions) of a microarray data set. This can be useful for revealing groups of genes involved in the same molecular process as well as groups of conditions where this process takes place. The need of finding a subset of genes and tissue samples defining a homogeneous block had led to the application of double clustering techniques on gene expression data. Here, we focus on an extension of standard K-means to simultaneously cluster observations and features of a data matrix, namely double K-means introduced by Vichi (2000). We introduce this model in a probabilistic framework and discuss the advantages of using this approach. We also develop a coordinate ascent algorithm and test its performance via simulation studies and real data set. Finally, we validate the results obtained on the real data set by building resampling confidence intervals for block centroids. 相似文献

2.

Performance of the hotelling T2 control chart for compositional data in the presence of measurement errors

F. S. Zaidi K. P. Tran M. B. C. Khoo 《Journal of applied statistics》2019,46(14):2583-2602

相似文献

3.

Clustering gene expression time course data using mixtures of multivariate t-distributions

Paul D. McNicholas Sanjeena Subedi 《Journal of statistical planning and inference》2012,142(5):1114-1127

Clustering gene expression time course data is an important problem in bioinformatics because understanding which genes behave similarly can lead to the discovery of important biological information. Statistically, the problem of clustering time course data is a special case of the more general problem of clustering longitudinal data. In this paper, a very general and flexible model-based technique is used to cluster longitudinal data. Mixtures of multivariate t-distributions are utilized, with a linear model for the mean and a modified Cholesky-decomposed covariance structure. Constraints are placed upon the covariance structure, leading to a novel family of mixture models, including parsimonious models. In addition to model-based clustering, these models are also used for model-based classification, i.e., semi-supervised clustering. Parameters, including the component degrees of freedom, are estimated using an expectation-maximization algorithm and two different approaches to model selection are considered. The models are applied to simulated data to illustrate their efficacy; this includes a comparison with their Gaussian analogues—the use of these Gaussian analogues with a linear model for the mean is novel in itself. Our family of multivariate t mixture models is then applied to two real gene expression time course data sets and the results are discussed. We conclude with a summary, suggestions for future work, and a discussion about constraining the degrees of freedom parameter. 相似文献

4.

Clustering and classification problems in genetics through U-statistics

Gabriela B. Cybis Marcio Valk Sílvia R. C. Lopes 《Journal of Statistical Computation and Simulation》2018,88(10):1882-1902

ABSTRACT

Genetic data are frequently categorical and have complex dependence structures that are not always well understood. For this reason, clustering and classification based on genetic data, while highly relevant, are challenging statistical problems. Here we consider a versatile U-statistics-based approach for non-parametric clustering that allows for an unconventional way of solving these problems. In this paper we propose a statistical test to assess group homogeneity taking into account multiple testing issues and a clustering algorithm based on dissimilarities within and between groups that highly speeds up the homogeneity test. We also propose a test to verify classification significance of a sample in one of two groups. We present Monte Carlo simulations that evaluate size and power of the proposed tests under different scenarios. Finally, the methodology is applied to three different genetic data sets: global human genetic diversity, breast tumour gene expression and Dengue virus serotypes. These applications showcase this statistical framework's ability to answer diverse biological questions in the high dimension low sample size scenario while adapting to the specificities of the different datatypes. 相似文献

5.

On selection biases with prediction rules formed from gene expression data

J.X. Zhu G.J. McLachlan L. Ben-Tovim Jones I.A. Wood 《Journal of statistical planning and inference》2008

There has been ever increasing interest in the use of microarray experiments as a basis for the provision of prediction (discriminant) rules for improved diagnosis of cancer and other diseases. Typically, the microarray cancer studies provide only a limited number of tissue samples from the specified classes of tumours or patients, whereas each tissue sample may contain the expression levels of thousands of genes. Thus researchers are faced with the problem of forming a prediction rule on the basis of a small number of classified tissue samples, which are of very high dimension. Usually, some form of feature (gene) selection is adopted in the formation of the prediction rule. As the subset of genes used in the final form of the rule have not been randomly selected but rather chosen according to some criterion designed to reflect the predictive power of the rule, there will be a selection bias inherent in estimates of the error rates of the rules if care is not taken. We shall present various situations where selection bias arises in the formation of a prediction rule and where there is a consequent need for the correction of this bias. We describe the design of cross-validation schemes that are able to correct for the various selection biases. 相似文献

6.

Regression imputation with Q-mode clustering for rounded zero replacement in high-dimensional compositional data

Jiajia Chen Karel Hron Matthias Templ Shengjia Li 《Journal of applied statistics》2018,45(11):2067-2080

The logratio methodology is not applicable when rounded zeros occur in compositional data. There are many methods to deal with rounded zeros. However, some methods are not suitable for analyzing data sets with high dimensionality. Recently, related methods have been developed, but they cannot balance the calculation time and accuracy. For further improvement, we propose a method based on regression imputation with Q-mode clustering. This method forms the groups of parts and builds partial least squares regression with these groups using centered logratio coordinates. We also prove that using centered logratio coordinates or isometric logratio coordinates in the response of partial least squares regression have the equivalent results for the replacement of rounded zeros. Simulation study and real example are conducted to analyze the performance of the proposed method. The results show that the proposed method can reduce the calculation time in higher dimensions and improve the quality of results. 相似文献

7.

Bayesian modeling of factorial time-course data with applications to a bone aging gene expression study

Joseph Wu Mayetri Gupta Amira I. Hussein Louis Gerstenfeld 《Journal of applied statistics》2021,48(10):1730

相似文献

8.

Feature selection and machine learning with mass spectrometry data for distinguishing cancer and non-cancer samples

Susmita Datta Lara M. DePadilla 《Statistical Methodology》2006,3(1):79

This is a comparative study of various clustering and classification algorithms as applied to differentiate cancer and non-cancer protein samples using mass spectrometry data. Our study demonstrates the usefulness of a feature selection step prior to applying a machine learning tool. A natural and common choice of a feature selection tool is the collection of marginal p-values obtained from t-tests for testing the intensity differences at each m/z ratio in the cancer versus non-cancer samples. We study the effect of selecting a cutoff in terms of the overall Type 1 error rate control on the performance of the clustering and classification algorithms using the significant features. For the classification problem, we also considered m/z selection using the importance measures computed by the Random Forest algorithm of Breiman. Using a data set of proteomic analysis of serum from ovarian cancer patients and serum from cancer-free individuals in the Food and Drug Administration and National Cancer Institute Clinical Proteomics Database, we undertake a comparative study of the net effect of the machine learning algorithm–feature selection tool–cutoff criteria combination on the performance as measured by an appropriate error rate measure. 相似文献

9.

A new regression model for rates and proportions data with applications

F. Prataviera G. M. Cordeiro E. M. M. Ortega E. M. Hashimoto V. G. Cancho 《Journal of applied statistics》2022,49(16):4137

We propose a new continuous distribution in the interval

(0, 1)

based on the generalized odd log-logistic-G family, whose density function can be symmetrical, asymmetric, unimodal and bimodal. The new model is implemented using the gamlss packages in R. We propose an extended regression based on this distribution which includes as sub-models some important regressions. We employ a frequentist and Bayesian analysis to estimate the parameters and adopt the non-parametric and parametric bootstrap methods to obtain better efficiency of the estimators. Some simulations are conducted to verify the empirical distribution of the maximum likelihood estimators. We compare the empirical distribution of the quantile residuals with the standard normal distribution. The extended regression can give more realistic fits than other regressions in the analysis of proportional data. 相似文献

10.

On simultaneous confidence intervals based on rank-estimates with application to analysis of gene expression data

Hossein Mansouri Bo Li 《统计学通讯:理论与方法》2013,42(17):4339-4349

Abstract

Inferential methods based on ranks present robust and powerful alternative methodology for testing and estimation. In this article, two objectives are followed. First, develop a general method of simultaneous confidence intervals based on the rank estimates of the parameters of a general linear model and derive the asymptotic distribution of the pivotal quantity. Second, extend the method to high dimensional data such as gene expression data for which the usual large sample approximation does not apply. It is common in practice to use the asymptotic distribution to make inference for small samples. The empirical investigation in this article shows that for methods based on the rank-estimates, this approach does not produce a viable inference and should be avoided. A method based on the bootstrap is outlined and it is shown to provide a reliable and accurate method of constructing simultaneous confidence intervals based on rank estimates. In particular it is shown that commonly applied methods of normal or t-approximation are not satisfactory, particularly for large-scale inferences. Methods based on ranks are uniquely suitable for analysis of microarray gene expression data since they often involve large scale inferences based on small samples containing a large number of outliers and violate the assumption of normality. A real microarray data is analyzed using the rank-estimate simultaneous confidence intervals. Viability of the proposed method is assessed through a Monte Carlo simulation study under varied assumptions. 相似文献

11.

A new regression model for bimodal data and applications in agriculture

Julio Cezar Souza Vasconcelos Gauss Moutinho Cordeiro Edwin Moises Marcos Ortega dila Maria de Rezende 《Journal of applied statistics》2021,48(2):349

We define the odd log-logistic exponential Gaussian regression with two systematic components, which extends the heteroscedastic Gaussian regression and it is suitable for bimodal data quite common in the agriculture area. We estimate the parameters by the method of maximum likelihood. Some simulations indicate that the maximum-likelihood estimators are accurate. The model assumptions are checked through case deletion and quantile residuals. The usefulness of the new regression model is illustrated by means of three real data sets in different areas of agriculture, where the data present bimodality. 相似文献

12.

Incomplete covariates in the Cox model with applications to biological marker data

Traci Leong Stuart R. Lipsitz & Joseph G. Ibrahim 《Journal of the Royal Statistical Society. Series C, Applied statistics》2001,50(4):467-484

A common occurrence in clinical trials with a survival end point is missing covariate data. With ignorably missing covariate data, Lipsitz and Ibrahim proposed a set of estimating equations to estimate the parameters of Cox's proportional hazards model. They proposed to obtain parameter estimates via a Monte Carlo EM algorithm. We extend those results to non-ignorably missing covariate data. We present a clinical trials example with three partially observed laboratory markers which are used as covariates to predict survival. 相似文献

13.

Model-based clustering,classification, and discriminant analysis of data with mixed type

Ryan P. Browne Paul D. McNicholas 《Journal of statistical planning and inference》2012

We propose a mixture of latent variables model for the model-based clustering, classification, and discriminant analysis of data comprising variables with mixed type. This approach is a generalization of latent variable analysis, and model fitting is carried out within the expectation-maximization framework. Our approach is outlined and a simulation study conducted to illustrate the effect of sample size and noise on the standard errors and the recovery probabilities for the number of groups. Our modelling methodology is then applied to two real data sets and their clustering and classification performance is discussed. We conclude with discussion and suggestions for future work. 相似文献

14.

Semiparametric regression analysis of two‐sample current status data,with applications to tumorigenicity experiments

Xingwei Tong Chao Zhu Jianguo Sun 《Revue canadienne de statistique》2007,35(4):575-584

Current status data frequently occur in failure time studies, particularly in demographical studies and tumorigenicity experiments. Although commonly used in this context, proportional hazards and odds models are inadequate when survival functions cross. The authors consider a class of two‐sample models which is suitable for this situation and encompasses the proportional hazards and odds models. The estimating equations they propose lead to consistent and asymptotically Gaussian estimates of regression parameters in the extended model. Their approach is assessed through simulations and illustrated using data from a tumorigenicity experiment. 相似文献

15.

Multivariate outlier detection in incomplete survey data: the epidemic algorithm and transformed rank correlations

Cédric Béguin Beat Hulliger 《Journal of the Royal Statistical Society. Series A, (Statistics in Society)》2004,167(2):275-294

Summary. As a part of the EUREDIT project new methods to detect multivariate outliers in incomplete survey data have been developed. These methods are the first to work with sampling weights and to be able to cope with missing values. Two of these methods are presented here. The epidemic algorithm simulates the propagation of a disease through a population and uses extreme infection times to find outlying observations. Transformed rank correlations are robust estimates of the centre and the scatter of the data. They use a geometric transformation that is based on the rank correlation matrix. The estimates are used to define a Mahalanobis distance that reveals outliers. The two methods are applied to a small data set and to one of the evaluation data sets of the EUREDIT project. 相似文献

16.

Density estimation using asymmetric kernels and Bayes bandwidths with censored data

C.N. Kuruwita K.B. Kulasekera W.J. Padgett 《Journal of statistical planning and inference》2010

We propose a modification to the regular kernel density estimation method that use asymmetric kernels to circumvent the spill over problem for densities with positive support. First a pivoting method is introduced for placement of the data relative to the kernel function. This yields a strongly consistent density estimator that integrates to one for each fixed bandwidth in contrast to most density estimators based on asymmetric kernels proposed in the literature. Then a data-driven Bayesian local bandwidth selection method is presented and lognormal, gamma, Weibull and inverse Gaussian kernels are discussed as useful special cases. Simulation results and a real-data example illustrate the advantages of the new methodology. 相似文献

17.

A new lifetime model with regression models,characterizations and applications

Haitham M. Yousof Mahdi Rasekhi Morad Alizadeh G. G. Hamedani M. Masoom Ali 《统计学通讯:模拟与计算》2019,48(1):264-286

In this article, we introduce a new extension of Burr XII distribution called Topp Leone Generated Burr XII distribution. We derive some of its properties. Useful characterizations are presented. Simulation study is performed to assess the performance of the maximum likelihood estimators. Censored maximum likelihood estimation is presented in the general case of multi-censored data. The new location-scale regression model based on the proposed distribution is introduced. The usefulness of the proposed models is illustrated empirically by means of three real datasets. 相似文献

18.

Estimation in autoregressive models with surrogate data and validation data

Shi-Hang Yu Kun Li Zhi-Wen Zhao 《统计学通讯:理论与方法》2017,46(3):1532-1545

Time-series data are often subject to measurement error, usually the result of needing to estimate the variable of interest. Generally, however, the relationship between the surrogate variables and the true variables can be rather complicated compared to the classical additive error structure usually assumed. In this article, we address the estimation of the parameters in autoregressive models in the presence of function measurement errors. We first develop a parameter estimation method with the help of validation data; this estimation method does not depend on functional form and the distribution of the measurement error. The proposed estimator is proved to be consistent. Moreover, the asymptotic representation and the asymptotic normality of the estimator are also derived, respectively. Simulation results indicate that the proposed method works well for practical situation. 相似文献

19.

The relationship between the mean,median, and mode with grouped data

Shimin Zheng Eunice Mogusu Sreenivas P. Veeranki Megan Quinn Yan Cao 《统计学通讯:理论与方法》2017,46(9):4285-4295

It is widely believed that the median is “usually” between the mean and the mode for skewed unimodal distributions. However, this inequality is not always true, especially with grouped data. Unavailability of complete raw data further necessitates the importance of evaluating this characteristic in grouped data. There is a gap in the current statistical literature on assessing mean–median–mode inequality for grouped data. The study aims to evaluate the relationship between the mean, median, and mode with unimodal grouped data; derive conditions for their inequalities; and present their application. 相似文献

20.

Estimation of conditional mode with truncated,censored, and dependent data

Jong-Il Baek 《统计学通讯:理论与方法》2017,46(12):6000-6016

In this paper, we studied the uniform convergence with rates for the kernel estimator of the conditional mode function for a left truncated and right censored model. It is assumed that the lifetime observations with multivariate covariates form a stationary α-mixing sequence. Also, the asymptotic normality of the estimator is established. 相似文献