20 similar articles found; search time: 93 ms
1.
Microarray technology allows the measurement of expression levels of thousands of genes simultaneously. The dimension and complexity of gene expression data obtained by microarrays create challenging data analysis and management problems, ranging from the analysis of images produced by microarray experiments to the biological interpretation of results. Statistical and computational approaches are therefore assuming an increasingly substantial role in molecular biology. We consider the problem of simultaneously clustering genes and tissue samples (more generally, conditions) of a microarray data set. This can be useful for revealing groups of genes involved in the same molecular process as well as groups of conditions where this process takes place. The need to find a subset of genes and tissue samples defining a homogeneous block has led to the application of double clustering techniques to gene expression data. Here, we focus on an extension of standard K-means that simultaneously clusters observations and features of a data matrix, namely the double K-means introduced by Vichi (2000). We introduce this model in a probabilistic framework and discuss the advantages of this approach. We also develop a coordinate ascent algorithm and test its performance via simulation studies and a real data set. Finally, we validate the results obtained on the real data set by building resampling confidence intervals for block centroids.
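The double K-means idea, alternately reassigning rows (genes) and columns (samples) to minimize within-block squared error, can be sketched as follows. This is a minimal illustration only, not Vichi's implementation or the probabilistic version the abstract discusses; the function name `double_kmeans` and the random initialization are my own assumptions.

```python
import numpy as np

def double_kmeans(X, P, Q, n_iter=50, seed=0):
    """Alternately cluster the rows and columns of X into P x Q blocks,
    reducing the within-block squared error at each sweep."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    u = rng.integers(0, P, n)   # row (gene) labels
    v = rng.integers(0, Q, m)   # column (sample) labels
    for _ in range(n_iter):
        # block centroids: mean of the entries in each (p, q) block
        C = np.array([[X[u == p][:, v == q].mean()
                       if (u == p).any() and (v == q).any() else 0.0
                       for q in range(Q)] for p in range(P)])
        # reassign each row to the row-cluster minimizing its squared error
        u = np.argmin([np.square(X - C[p][v]).sum(axis=1) for p in range(P)], axis=0)
        # reassign each column likewise
        v = np.argmin([np.square(X - C[u, q][:, None]).sum(axis=0) for q in range(Q)], axis=0)
    return u, v, C
```

Each sweep is a coordinate step, in the spirit of the coordinate ascent algorithm mentioned above.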
2.
3.
Paul D. McNicholas, Sanjeena Subedi. Journal of Statistical Planning and Inference, 2012, 142(5): 1114-1127
Clustering gene expression time course data is an important problem in bioinformatics because understanding which genes behave similarly can lead to the discovery of important biological information. Statistically, the problem of clustering time course data is a special case of the more general problem of clustering longitudinal data. In this paper, a very general and flexible model-based technique is used to cluster longitudinal data. Mixtures of multivariate t-distributions are utilized, with a linear model for the mean and a modified Cholesky-decomposed covariance structure. Constraints are placed upon the covariance structure, leading to a novel family of mixture models, including parsimonious models. In addition to model-based clustering, these models are also used for model-based classification, i.e., semi-supervised clustering. Parameters, including the component degrees of freedom, are estimated using an expectation-maximization algorithm and two different approaches to model selection are considered. The models are applied to simulated data to illustrate their efficacy; this includes a comparison with their Gaussian analogues, the use of which with a linear model for the mean is novel in itself. Our family of multivariate t mixture models is then applied to two real gene expression time course data sets and the results are discussed. We conclude with a summary, suggestions for future work, and a discussion about constraining the degrees of freedom parameter.
4.
Gabriela B. Cybis, Marcio Valk, Sílvia R. C. Lopes. Journal of Statistical Computation and Simulation, 2018, 88(10): 1882-1902
Genetic data are frequently categorical and have complex dependence structures that are not always well understood. For this reason, clustering and classification based on genetic data, while highly relevant, are challenging statistical problems. Here we consider a versatile U-statistics-based approach for non-parametric clustering that allows for an unconventional way of solving these problems. In this paper we propose a statistical test to assess group homogeneity, taking into account multiple testing issues, and a clustering algorithm based on dissimilarities within and between groups that greatly speeds up the homogeneity test. We also propose a test to verify the significance of classifying a sample into one of two groups. We present Monte Carlo simulations that evaluate the size and power of the proposed tests under different scenarios. Finally, the methodology is applied to three genetic data sets: global human genetic diversity, breast tumour gene expression, and Dengue virus serotypes. These applications showcase this statistical framework's ability to answer diverse biological questions in the high-dimension, low-sample-size setting while adapting to the specificities of the different data types.
5.
J.X. Zhu, G.J. McLachlan, L. Ben-Tovim Jones, I.A. Wood. Journal of Statistical Planning and Inference, 2008
There has been ever-increasing interest in the use of microarray experiments as a basis for the provision of prediction (discriminant) rules for improved diagnosis of cancer and other diseases. Typically, microarray cancer studies provide only a limited number of tissue samples from the specified classes of tumours or patients, whereas each tissue sample may contain the expression levels of thousands of genes. Thus researchers are faced with the problem of forming a prediction rule on the basis of a small number of classified tissue samples of very high dimension. Usually, some form of feature (gene) selection is adopted in the formation of the prediction rule. As the subset of genes used in the final form of the rule has not been randomly selected but rather chosen according to some criterion designed to reflect the predictive power of the rule, there will be a selection bias inherent in estimates of the error rates of the rules if care is not taken. We present various situations where selection bias arises in the formation of a prediction rule and where there is a consequent need for the correction of this bias. We describe the design of cross-validation schemes that are able to correct for the various selection biases.
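The bias-correcting principle above amounts to repeating the gene-selection step inside every cross-validation fold, rather than selecting once on the full data. A minimal sketch, assuming a nearest-centroid rule and a simple mean-difference selection score (both stand-ins I chose for brevity, not the paper's specific choices):

```python
import numpy as np

def external_cv_error(X, y, n_genes=10, n_folds=5, seed=0):
    """Estimate the error of a nearest-centroid rule, redoing gene
    selection inside each fold so the estimate avoids selection bias."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        Xtr, ytr = X[train], y[train]
        # select genes on TRAINING data only (largest absolute mean difference)
        score = np.abs(Xtr[ytr == 0].mean(0) - Xtr[ytr == 1].mean(0))
        genes = np.argsort(score)[-n_genes:]
        c0 = Xtr[ytr == 0][:, genes].mean(0)
        c1 = Xtr[ytr == 1][:, genes].mean(0)
        for i in test:
            pred = 0 if np.linalg.norm(X[i, genes] - c0) < np.linalg.norm(X[i, genes] - c1) else 1
            errors += pred != y[i]
    return errors / len(y)
```

Selecting genes before splitting, by contrast, lets test information leak into the rule and understates the error.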
6.
Jiajia Chen, Karel Hron, Matthias Templ, Shengjia Li. Journal of Applied Statistics, 2018, 45(11): 2067-2080
The logratio methodology is not applicable when rounded zeros occur in compositional data. Many methods exist to deal with rounded zeros, but some are unsuitable for data sets of high dimensionality, and recently developed alternatives cannot balance calculation time and accuracy. As a further improvement, we propose a method based on regression imputation with Q-mode clustering. This method forms groups of parts and builds partial least squares regressions on these groups using centered logratio coordinates. We also prove that using centered logratio coordinates or isometric logratio coordinates in the response of the partial least squares regression yields equivalent results for the replacement of rounded zeros. A simulation study and a real example are conducted to analyze the performance of the proposed method. The results show that the proposed method reduces calculation time in higher dimensions and improves the quality of the results.
7.
8.
This is a comparative study of various clustering and classification algorithms as applied to differentiating cancer and non-cancer protein samples using mass spectrometry data. Our study demonstrates the usefulness of a feature selection step prior to applying a machine learning tool. A natural and common choice of feature selection tool is the collection of marginal p-values obtained from t-tests of the intensity differences at each m/z ratio in the cancer versus non-cancer samples. We study the effect of the cutoff chosen to control the overall Type I error rate on the performance of the clustering and classification algorithms that use the significant features. For the classification problem, we also consider m/z selection using the importance measures computed by Breiman's Random Forest algorithm. Using a data set of proteomic analysis of serum from ovarian cancer patients and from cancer-free individuals in the Food and Drug Administration and National Cancer Institute Clinical Proteomics Database, we undertake a comparative study of the net effect of the combination of machine learning algorithm, feature selection tool, and cutoff criterion on performance, as measured by an appropriate error rate.
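The marginal t-test selection step can be sketched as below. To keep the sketch dependency-free, a normal approximation to the t null is used for the p-values; the function name and cutoff handling are illustrative assumptions, not the study's code.

```python
import math
import numpy as np

def select_mz_features(X, y, alpha=0.05):
    """Rank m/z features by marginal two-sample t statistics
    (label 0 vs. label 1) and keep those whose approximate two-sided
    p-value falls below the cutoff alpha."""
    a, b = X[y == 0], X[y == 1]
    na, nb = len(a), len(b)
    # Welch-style standard error per feature
    se = np.sqrt(a.var(0, ddof=1) / na + b.var(0, ddof=1) / nb)
    t = (a.mean(0) - b.mean(0)) / np.where(se == 0, np.inf, se)
    # two-sided p-value via the normal approximation to the t null
    p = np.array([math.erfc(abs(ti) / math.sqrt(2)) for ti in t])
    return np.flatnonzero(p < alpha), p
```

Lowering `alpha` trades fewer false-positive features against the risk of discarding weak but real signals, which is exactly the cutoff effect the study examines.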
9.
F. Prataviera, G. M. Cordeiro, E. M. M. Ortega, E. M. Hashimoto, V. G. Cancho. Journal of Applied Statistics, 2022, 49(16): 4137
We propose a new continuous distribution on the unit interval based on the generalized odd log-logistic-G family, whose density function can be symmetric, asymmetric, unimodal, or bimodal. The new model is implemented using the gamlss package in R. We propose an extended regression based on this distribution which includes some important regressions as sub-models. We employ frequentist and Bayesian analyses to estimate the parameters and adopt non-parametric and parametric bootstrap methods to improve the efficiency of the estimators. Simulations are conducted to verify the empirical distribution of the maximum likelihood estimators. We compare the empirical distribution of the quantile residuals with the standard normal distribution. The extended regression can give more realistic fits than other regressions in the analysis of proportional data.
10.
Inferential methods based on ranks provide robust and powerful alternatives for testing and estimation. This article pursues two objectives. First, we develop a general method of simultaneous confidence intervals based on the rank estimates of the parameters of a general linear model and derive the asymptotic distribution of the pivotal quantity. Second, we extend the method to high-dimensional data, such as gene expression data, for which the usual large-sample approximation does not apply. It is common in practice to use the asymptotic distribution to make inference for small samples; the empirical investigation in this article shows that, for methods based on rank estimates, this approach does not produce viable inference and should be avoided. A method based on the bootstrap is outlined and shown to provide a reliable and accurate way of constructing simultaneous confidence intervals based on rank estimates. In particular, the commonly applied normal or t-approximations are not satisfactory, particularly for large-scale inference. Rank-based methods are uniquely suitable for the analysis of microarray gene expression data, which often involves large-scale inference based on small samples that contain many outliers and violate the normality assumption. A real microarray data set is analyzed using the rank-estimate simultaneous confidence intervals, and the viability of the proposed method is assessed through a Monte Carlo simulation study under varied assumptions.
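One standard bootstrap route to simultaneous intervals calibrates a single margin from the maximum absolute deviation across coordinates. The sketch below replaces the paper's rank-based estimator with the per-gene sample median for brevity, so it shows only the simultaneity mechanism, not the authors' estimator; the function name is my own.

```python
import numpy as np

def simultaneous_boot_ci(X, B=2000, level=0.95, seed=0):
    """Bootstrap simultaneous confidence intervals for per-column robust
    locations (here: medians). Simultaneity comes from calibrating one
    common margin with the max-|deviation| statistic over columns."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    est = np.median(X, axis=0)
    maxdev = np.empty(B)
    for b in range(B):
        Xb = X[rng.integers(0, n, n)]          # resample rows with replacement
        maxdev[b] = np.max(np.abs(np.median(Xb, axis=0) - est))
    c = np.quantile(maxdev, level)             # common half-width
    return est - c, est + c
```

Because one margin covers all m coordinates at once, the intervals are wider than per-gene intervals but control the family-wise coverage.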
11.
Julio Cezar Souza Vasconcelos, Gauss Moutinho Cordeiro, Edwin Moises Marcos Ortega, dila Maria de Rezende. Journal of Applied Statistics, 2021, 48(2): 349
We define the odd log-logistic exponential Gaussian regression with two systematic components, which extends the heteroscedastic Gaussian regression and is suitable for the bimodal data quite common in agriculture. We estimate the parameters by maximum likelihood. Simulations indicate that the maximum-likelihood estimators are accurate. The model assumptions are checked through case deletion and quantile residuals. The usefulness of the new regression model is illustrated by means of three real data sets from different areas of agriculture, all of which exhibit bimodality.
12.
Traci Leong, Stuart R. Lipsitz, Joseph G. Ibrahim. Journal of the Royal Statistical Society, Series C (Applied Statistics), 2001, 50(4): 467-484
A common occurrence in clinical trials with a survival end point is missing covariate data. With ignorably missing covariate data, Lipsitz and Ibrahim proposed a set of estimating equations to estimate the parameters of Cox's proportional hazards model. They proposed to obtain parameter estimates via a Monte Carlo EM algorithm. We extend those results to non-ignorably missing covariate data. We present a clinical trials example with three partially observed laboratory markers which are used as covariates to predict survival.
13.
We propose a mixture of latent variables model for the model-based clustering, classification, and discriminant analysis of data comprising variables of mixed type. This approach is a generalization of latent variable analysis, and model fitting is carried out within the expectation-maximization framework. Our approach is outlined, and a simulation study is conducted to illustrate the effect of sample size and noise on the standard errors and the recovery probabilities for the number of groups. Our modelling methodology is then applied to two real data sets, and their clustering and classification performance is discussed. We conclude with discussion and suggestions for future work.
14.
Current status data frequently occur in failure time studies, particularly in demographical studies and tumorigenicity experiments. Although commonly used in this context, proportional hazards and odds models are inadequate when survival functions cross. The authors consider a class of two‐sample models which is suitable for this situation and encompasses the proportional hazards and odds models. The estimating equations they propose lead to consistent and asymptotically Gaussian estimates of regression parameters in the extended model. Their approach is assessed through simulations and illustrated using data from a tumorigenicity experiment.
15.
Cédric Béguin, Beat Hulliger. Journal of the Royal Statistical Society, Series A (Statistics in Society), 2004, 167(2): 275-294
Summary. As a part of the EUREDIT project, new methods to detect multivariate outliers in incomplete survey data have been developed. These methods are the first to work with sampling weights and to be able to cope with missing values. Two of these methods are presented here. The epidemic algorithm simulates the propagation of a disease through a population and uses extreme infection times to find outlying observations. Transformed rank correlations are robust estimates of the centre and the scatter of the data; they use a geometric transformation that is based on the rank correlation matrix, and the estimates are used to define a Mahalanobis distance that reveals outliers. The two methods are applied to a small data set and to one of the evaluation data sets of the EUREDIT project.
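A rough stand-in for the rank-correlation idea (handling neither sampling weights nor missing values, unlike the methods above) combines a robust centre and scale with a Spearman correlation matrix inside a Mahalanobis distance; all names below are illustrative.

```python
import numpy as np

def rank_correlation_outliers(X, quantile=0.95):
    """Flag multivariate outliers with a Mahalanobis distance built from a
    robust centre (medians), robust scales (MAD) and the Spearman rank
    correlation matrix -- a simplified sketch, complete data only."""
    n, p = X.shape
    # Spearman correlation = Pearson correlation of the column-wise ranks
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1.0
    R = np.corrcoef(ranks, rowvar=False)
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0) * 1.4826   # consistency factor
    Z = (X - med) / mad
    # squared robust Mahalanobis distance per observation
    d2 = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(R), Z)
    return d2 > np.quantile(d2, quantile)
```

Using ranks makes the scatter estimate insensitive to the very outliers it is meant to reveal, which is the core appeal of rank-correlation-based detection.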
16.
We propose a modification of the regular kernel density estimation method that uses asymmetric kernels to circumvent the spill-over problem for densities with positive support. First, a pivoting method is introduced for placement of the data relative to the kernel function. This yields a strongly consistent density estimator that integrates to one for each fixed bandwidth, in contrast to most density estimators based on asymmetric kernels proposed in the literature. Then a data-driven Bayesian local bandwidth selection method is presented, and lognormal, gamma, Weibull, and inverse Gaussian kernels are discussed as useful special cases. Simulation results and a real-data example illustrate the advantages of the new methodology.
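A basic gamma-kernel estimator for positive data illustrates how asymmetric kernels avoid placing mass below zero: each kernel is a gamma density whose shape depends on the evaluation point. Note this is the standard (Chen-style) form, not the pivoted variant the abstract proposes, and the function name and bandwidth default are my own.

```python
import math
import numpy as np

def gamma_kernel_kde(x, data, b=0.2):
    """Gamma-kernel density estimate at x > 0 for positive-valued data:
    average of gamma(shape = x/b + 1, scale = b) densities evaluated at
    the data points. The kernel's support is [0, inf), so no probability
    mass spills below zero."""
    shape = x / b + 1.0
    logpdf = ((shape - 1.0) * np.log(data) - data / b
              - shape * math.log(b) - math.lgamma(shape))
    return float(np.exp(logpdf).mean())
```

Because the kernel shape varies with x, the estimator adapts near the boundary, which a fixed symmetric kernel cannot do.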
17.
Haitham M. Yousof, Mahdi Rasekhi, Morad Alizadeh, G. G. Hamedani, M. Masoom Ali. Communications in Statistics - Simulation and Computation, 2019, 48(1): 264-286
In this article, we introduce a new extension of the Burr XII distribution called the Topp-Leone generated Burr XII distribution. We derive some of its properties and present useful characterizations. A simulation study is performed to assess the performance of the maximum likelihood estimators. Censored maximum likelihood estimation is presented for the general case of multi-censored data. A new location-scale regression model based on the proposed distribution is introduced. The usefulness of the proposed models is illustrated empirically by means of three real data sets.
18.
Time-series data are often subject to measurement error, usually as a result of needing to estimate the variable of interest. In general, however, the relationship between the surrogate variables and the true variables can be rather complicated compared with the classical additive error structure usually assumed. In this article, we address the estimation of the parameters of autoregressive models in the presence of such measurement errors. We first develop a parameter estimation method with the help of validation data; this method does not depend on the functional form or the distribution of the measurement error. The proposed estimator is proved to be consistent, and its asymptotic representation and asymptotic normality are also derived. Simulation results indicate that the proposed method works well in practical situations.
19.
Shimin Zheng, Eunice Mogusu, Sreenivas P. Veeranki, Megan Quinn, Yan Cao. Communications in Statistics - Theory and Methods, 2017, 46(9): 4285-4295
It is widely believed that the median is "usually" between the mean and the mode for skewed unimodal distributions. However, this inequality is not always true, especially with grouped data. The unavailability of complete raw data further underscores the importance of evaluating this characteristic for grouped data, and there is a gap in the current statistical literature on assessing the mean–median–mode inequality in this setting. This study aims to evaluate the relationship between the mean, median, and mode of unimodal grouped data; derive conditions for their inequalities; and present their application.
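For grouped data, the three summaries are computed from class boundaries and frequencies rather than raw values, using the standard interpolation formulas (midpoints for the mean, linear interpolation within the median class, and the modal-class formula). The sketch below uses these textbook formulas; the function name and signature are illustrative.

```python
def grouped_summary(edges, freqs):
    """Mean, median and mode for grouped (binned) data given class
    boundaries `edges` (length k+1) and class frequencies `freqs` (length k)."""
    n = sum(freqs)
    # mean: frequency-weighted class midpoints
    mids = [(edges[i] + edges[i + 1]) / 2 for i in range(len(freqs))]
    mean = sum(m * f for m, f in zip(mids, freqs)) / n
    # median: interpolate inside the class containing the n/2-th observation
    cum = 0
    for i, f in enumerate(freqs):
        if cum + f >= n / 2:
            L, w = edges[i], edges[i + 1] - edges[i]
            median = L + (n / 2 - cum) / f * w
            break
        cum += f
    # mode: grouped-data formula within the modal class
    i = max(range(len(freqs)), key=lambda k: freqs[k])
    d1 = freqs[i] - (freqs[i - 1] if i > 0 else 0)
    d2 = freqs[i] - (freqs[i + 1] if i + 1 < len(freqs) else 0)
    L, w = edges[i], edges[i + 1] - edges[i]
    mode = L + (d1 / (d1 + d2)) * w if d1 + d2 else L + w / 2
    return mean, median, mode
```

For a left-skewed table such as boundaries [0, 10, 20, 30, 40] with frequencies [2, 5, 8, 3], these formulas give mean < median < mode, the ordering whose failure cases the paper studies.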
20.
Jong-Il Baek. Communications in Statistics - Theory and Methods, 2017, 46(12): 6000-6016
In this paper, we study the uniform convergence, with rates, of the kernel estimator of the conditional mode function for a left-truncated and right-censored model. The lifetime observations with multivariate covariates are assumed to form a stationary α-mixing sequence. The asymptotic normality of the estimator is also established.