Similar documents
20 similar documents found.
1.
Recent advances in technology have allowed researchers to collect large-scale complex biological data, often simultaneously and in matrix format. In genomic studies, for instance, measurements from tens to hundreds of thousands of genes are taken from individuals across several experimental groups. In time course microarray experiments, gene expression is measured at several time points for each individual across the whole genome, resulting in a high-dimensional matrix for each gene. In such experiments, researchers are faced with high-dimensional longitudinal data. Unfortunately, traditional methods for longitudinal data are not appropriate for high-dimensional situations. In this paper, we use the growth curve model and introduce a test useful for high-dimensional longitudinal data, and we evaluate its performance using simulations. We also show how our approach can be used to filter genes in time course genomic experiments. We illustrate this using publicly available genomic data from experiments comparing normal human lung tissue with vanadium pentoxide-treated human lung tissue, designed with the aim of understanding the susceptibility of individuals working in petro-chemical factories to airway re-modelling. Using our method, we were able to filter 1053 genes (about 5%) as non-noise genes from a pool of 22,277. Although our focus is on hypothesis testing, we also provide a modified maximum likelihood estimator for the mean parameter of the growth curve model and assess its performance through bias and mean squared error.
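A minimal Python sketch of one way such a per-gene filter could work: fit a polynomial time curve that is either common to both groups or group-specific, and compare the two fits with an F-test. The nested-OLS test, the polynomial degree and the function names are illustrative assumptions, not the authors' high-dimensional growth-curve test.

import numpy as np
from scipy.stats import f as f_dist

def poly_design(t, degree):
    # Polynomial-in-time design matrix: 1, t, t^2, ..., t^degree
    return np.vander(t, degree + 1, increasing=True)

def curve_equality_pvalue(y, t, group, degree=2):
    """F-test comparing group-specific polynomial time curves with a single common
    curve for one gene (y: expression values, t: time points, group: 0/1 labels)."""
    X0 = poly_design(t, degree)                      # common curve for both groups
    X1 = np.hstack([X0, X0 * group[:, None]])        # group-specific shift of every coefficient
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ beta) ** 2)
    df1 = X1.shape[1] - X0.shape[1]
    df2 = len(y) - X1.shape[1]
    F = ((rss(X0) - rss(X1)) / df1) / (rss(X1) / df2)
    return 1.0 - f_dist.cdf(F, df1, df2)

Genes whose p-values survive a multiplicity adjustment would then be retained as non-noise genes.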

2.
We consider a two-component mixture model where one component distribution is known while the mixing proportion and the other component distribution are unknown. These kinds of models were first introduced in biology to study the differences in expression between genes. The various estimation methods proposed so far have all assumed that the unknown distribution belongs to a parametric family. In this paper, we show how this assumption can be relaxed. First, we note that the above model is generally not identifiable, but we show that under moment and symmetry conditions some 'almost everywhere' identifiability results can be obtained. When such identifiability conditions are fulfilled, we propose an estimation method for the unknown parameters which is shown to be strongly consistent under mild conditions. We discuss applications of our method to microarray data analysis and to the training data problem. We compare our method to the parametric approach using simulated data and, finally, we apply our method to real data from microarray experiments.
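For reference, a minimal sketch of the parametric comparator the abstract mentions, assuming the known component is standard normal and the unknown component is Gaussian with unknown mean and variance; this EM fit is the parametric alternative, not the semiparametric estimator of the paper.

import numpy as np
from scipy.stats import norm

def em_known_component(x, n_iter=200):
    """EM for f(x) = p * N(0, 1) + (1 - p) * N(mu, sigma^2), with the first
    component fixed to the known distribution; returns estimates of (p, mu, sigma)."""
    p, mu, sigma = 0.5, np.mean(x), np.std(x)
    for _ in range(n_iter):
        # E-step: posterior probability that each observation comes from the known component
        f0 = p * norm.pdf(x, 0.0, 1.0)
        f1 = (1.0 - p) * norm.pdf(x, mu, sigma)
        tau = f0 / (f0 + f1)
        # M-step: update the mixing proportion and the unknown component's parameters
        p = tau.mean()
        w = 1.0 - tau
        mu = np.sum(w * x) / np.sum(w)
        sigma = np.sqrt(np.sum(w * (x - mu) ** 2) / np.sum(w))
    return p, mu, sigma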

3.
In this paper we introduce a new method for detecting outliers in a set of proportions. It is based on the construction of a suitable two-way contingency table and on the application of an algorithm for the detection of outlying cells in such a table. We exploit the special structure of the relevant contingency table to increase the efficiency of the method. The main properties of our algorithm, together with a guide to the choice of the parameters, are investigated through simulations, and in simple cases some theoretical justifications are provided. Several examples on synthetic data and an example based on pseudo-real data from biological experiments demonstrate the good performance of our algorithm.
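The abstract does not spell out the cell-detection algorithm, so the sketch below uses adjusted Pearson residuals under independence as a generic stand-in for flagging outlying cells of a two-way table; the cut-off of 2.5 is an arbitrary assumption, not the authors' rule.

import numpy as np

def outlying_cells(table, cutoff=2.5):
    """Flag cells of a two-way contingency table whose adjusted Pearson residuals
    under the independence model exceed `cutoff` in absolute value."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    r = table.sum(axis=1, keepdims=True) / n     # row margins as proportions
    c = table.sum(axis=0, keepdims=True) / n     # column margins as proportions
    expected = n * r * c
    adj = (table - expected) / np.sqrt(expected * (1 - r) * (1 - c))
    return np.argwhere(np.abs(adj) > cutoff), adj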

4.
We consider the detection of changes in the mean of a set of time series. The breakpoints are allowed to be series-specific, and the series are assumed to be correlated. The correlation between the series is assumed to be constant over time but is allowed to take an arbitrary form. We show that such a dependence structure can be encoded in a factor model. Thanks to this representation, the inference of the breakpoints can be achieved via dynamic programming, which remains one of the most efficient algorithms. We propose a model selection procedure to determine both the number of breakpoints and the number of factors. The proposed method is implemented in the FASeg R package, which is available on CRAN. We demonstrate the performance of our procedure through simulation experiments and present an application to geodesic data.
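For reference, a minimal single-series version of the dynamic programming step (least-squares segmentation of one series into K mean-constant segments); the factor-model decorrelation and the joint selection of the number of breakpoints and factors described in the abstract are not reproduced here.

import numpy as np

def segment_costs(y):
    """cost[i, j] = within-segment sum of squares of y[i:j] (j exclusive)."""
    n = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))
    cost = np.full((n + 1, n + 1), np.inf)
    for i in range(n):
        for j in range(i + 1, n + 1):
            m = j - i
            cost[i, j] = (s2[j] - s2[i]) - (s1[j] - s1[i]) ** 2 / m
    return cost

def dp_segmentation(y, K):
    """Optimal segmentation of y into K segments (changes in mean) by O(K n^2)
    dynamic programming; returns the internal breakpoints and the total cost."""
    n = len(y)
    cost = segment_costs(y)
    best = np.full((K + 1, n + 1), np.inf)
    argmin = np.zeros((K + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            cand = best[k - 1, :j] + cost[:j, j]
            argmin[k, j] = int(np.argmin(cand))
            best[k, j] = cand[argmin[k, j]]
    # Backtrack the segment start indices
    bkps, j = [], n
    for k in range(K, 0, -1):
        i = argmin[k, j]
        bkps.append(i)
        j = i
    return sorted(bkps)[1:], best[K, n]   # drop the leading 0

For a series with a single jump in mean, dp_segmentation(y, K=2) returns the estimated change-point and the total within-segment sum of squares.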

5.
Supersaturated designs are a large class of factorial designs which can be used for screening out the important factors from a large set of potentially active variables. The major advantage of these designs is that they reduce the experimental cost drastically, but their critical disadvantage is the confounding involved in the statistical analysis. In this article, we propose a method for analyzing data obtained with several types of supersaturated designs. Modifications of widely used information criteria are given and applied to the variable selection procedure for the identification of the active factors. The effectiveness of the proposed method is demonstrated via simulated experiments and comparisons.
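A hedged sketch of one possible variable selection pass for a supersaturated design: greedy forward selection scored by an information criterion, with plain BIC standing in for the modified criteria proposed in the article.

import numpy as np

def forward_select(X, y, penalty=None, max_terms=None):
    """Greedy forward selection of active factors in a supersaturated design
    (n runs, p > n candidate factors), stopping when the criterion stops improving."""
    n, p = X.shape
    penalty = np.log(n) if penalty is None else penalty        # BIC-type penalty per parameter
    max_terms = min(n - 2, p) if max_terms is None else max_terms

    def criterion(subset):
        Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
        return n * np.log(rss / n) + penalty * (len(subset) + 1)

    active, best = [], criterion([])
    while len(active) < max_terms:
        cands = [j for j in range(p) if j not in active]
        crits = [criterion(active + [j]) for j in cands]
        if min(crits) >= best:      # no candidate improves the criterion
            break
        best = min(crits)
        active.append(cands[int(np.argmin(crits))])
    return active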

6.
The current paradigm for the identification of candidate drugs within the pharmaceutical industry typically involves the use of high-throughput screens. High-content screening (HCS) is the term given to the process of using an imaging platform to screen large numbers of compounds for some desirable biological activity. Classification methods have important applications in HCS experiments, where they are used to predict which compounds have the potential to be developed into new drugs. In this paper, a new classification method is proposed for batches of compounds, where the rule is updated sequentially using information from the classification of previous batches. This methodology accounts for the possibility that the training data are not a representative sample of the test data and that the underlying group distributions may change as new compounds are analysed. The technique is illustrated on an example data set using linear discriminant analysis, k-nearest neighbour and random forest classifiers. Random forests are shown to be superior to the other classifiers and are further improved by the updating algorithm, in terms of both an increase in the number of true positives and a decrease in the number of false positives.
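A simplified sketch of the sequential updating idea using a random forest: each new batch is classified, and compounds classified with high confidence are fed back into the training set before the next refit. The 0.9 confidence threshold and the self-training-style feedback rule are assumptions made here for illustration, not necessarily the authors' exact updating algorithm.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classify_batches(X_train, y_train, batches, threshold=0.9):
    """Classify successive batches of compounds, augmenting the training set with
    confidently classified compounds from earlier batches before each refit."""
    X_cur, y_cur = X_train.copy(), y_train.copy()
    predictions = []
    for X_batch in batches:
        rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_cur, y_cur)
        proba = rf.predict_proba(X_batch)
        y_hat = rf.classes_[proba.argmax(axis=1)]
        predictions.append(y_hat)
        confident = proba.max(axis=1) >= threshold
        X_cur = np.vstack([X_cur, X_batch[confident]])          # updated training data
        y_cur = np.concatenate([y_cur, y_hat[confident]])
    return predictions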

7.
Gene regulation plays a fundamental role in biological activities. The gene regulation network (GRN) is a high-dimensional complex system, which can be represented by various mathematical or statistical models. The ordinary differential equation (ODE) model is one of the popular dynamic GRN models. We propose a comprehensive statistical procedure for the ODE model to identify the dynamic GRN. In this article, we apply this model to different segments of time course gene expression data from a simulation experiment and a yeast cell cycle study. We found that the two-cell-cycle and one-cell-cycle data provided consistent results, but the half-cell-cycle data produced biased estimates. Therefore, we may conclude that the proposed model can quantify both two-cell-cycle and one-cell-cycle gene expression dynamics, but not half-cycle dynamics. The findings suggest that the model can identify the dynamic GRN correctly if the time course gene expression data are sufficient to capture the overall dynamics of the underlying biological mechanism.
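A bare-bones sketch of a two-stage fit of a linear ODE GRN model, dx/dt = A x: derivatives are estimated numerically and then regressed on expression levels gene by gene. The published procedure is more comprehensive (smoothing, variable selection, inference); this only illustrates the model form.

import numpy as np

def fit_linear_grn(expr, times):
    """Two-stage least-squares fit of dx/dt = A x to time-course expression data
    (expr: genes x time points, times: sampling times)."""
    dexpr = np.gradient(expr, times, axis=1)        # crude numerical derivative estimates
    n_genes = expr.shape[0]
    A = np.zeros((n_genes, n_genes))
    for i in range(n_genes):
        # Row i of A: contributions of all genes to gene i's rate of change
        A[i], *_ = np.linalg.lstsq(expr.T, dexpr[i], rcond=None)
    return A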

8.
The mean residual life (MRL) measures the remaining life expectancy and is useful in actuarial studies, biological experiments and clinical trials. To assess covariate effects, an additive MRL regression model has been proposed in the literature. In this paper, we focus on the topic of model checking. Specifically, we develop two goodness-of-fit tests to assess the additive MRL model assumption. We explore the large sample properties of the test statistics and show that both of them are based on asymptotic Gaussian processes, so that resampling approaches can be applied to find the rejection regions. Simulation studies indicate that our methods work reasonably well for sample sizes ranging from 50 to 200. Two empirical data sets are analyzed to illustrate the approaches.

9.
A new statistical approach is developed for estimating the carcinogenic potential of drugs and other chemical substances used by humans. Improved statistical methods are developed for rodent tumorigenicity assays that have interval sacrifices but not cause-of-death data. For such experiments, this paper proposes a nonparametric maximum likelihood estimation method for estimating the distributions of the time to onset of, and the time to death from, the tumour. The log-likelihood function is optimized using a constrained direct search procedure. Using the maximum likelihood estimators, the number of fatal tumours in an experiment can be imputed. By applying the proposed procedure to a real data set, the effect of calorie restriction is investigated. In this study, we found that calorie restriction significantly delays the tumour onset time for pituitary tumours. The present method can result in substantial economic savings by relieving the need for a case-by-case assignment of the cause of death or context of observation by pathologists. The ultimate goal of the proposed method is to use the imputed number of fatal tumours to modify Peto's International Agency for Research on Cancer test for application to tumorigenicity assays that lack cause-of-death data.

10.
Costs associated with the evaluation of biomarkers can restrict the number of relevant biological samples to be measured. This common problem has been dealt with extensively in the epidemiologic and biostatistical literature, which proposes various cost-efficient procedures, including pooling and random sampling strategies. The pooling design has been widely addressed as a very efficient sampling method under certain parametric assumptions about the data distribution. When cost is not a main factor in the evaluation of biomarkers but measurement is subject to a limit of detection, a common instrumental limitation on the measurement process, the pooling design can partially overcome this limitation. In certain situations, the pooling design can provide data that are less informative than a simple random sample; however, this is not always the case. Pooled-data-based nonparametric inference has not been well addressed in the literature. In this article, a distribution-free method based on the empirical likelihood technique is proposed as a substitute for the traditional parametric-likelihood approach, providing correct coverage, confidence interval estimation and powerful tests based on data obtained from the cost-efficient designs. We also consider several nonparametric tests for comparison with the proposed procedure. We examine the proposed methodology via a broad Monte Carlo study and a real data example.
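As a pointer to the empirical likelihood machinery involved, the sketch below computes Owen's empirical likelihood ratio and a grid-based confidence interval for a mean from individual measurements; the pooled-design and limit-of-detection extensions of the article are not reproduced.

import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def el_logratio(x, mu):
    """-2 log empirical likelihood ratio for the mean mu (Owen's EL)."""
    d = x - mu
    if d.max() <= 0 or d.min() >= 0:          # mu outside the convex hull of the data
        return np.inf
    lo, hi = -1.0 / d.max(), -1.0 / d.min()   # admissible range for the Lagrange multiplier
    eps = 1e-8 * (hi - lo)
    g = lambda lam: np.sum(d / (1.0 + lam * d))
    lam = brentq(g, lo + eps, hi - eps)
    w = 1.0 / (len(x) * (1.0 + lam * d))      # implied probability weights
    return -2.0 * np.sum(np.log(len(x) * w))

def el_confint(x, level=0.95, n_grid=400):
    """Grid-search empirical likelihood confidence interval for the mean."""
    cut = chi2.ppf(level, df=1)
    mus = np.linspace(x.min() + 1e-6, x.max() - 1e-6, n_grid)
    keep = [m for m in mus if el_logratio(x, m) <= cut]
    return min(keep), max(keep)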

11.
We investigate an optimization problem for mixture experiments. We consider the case where a large number of ingredients are available but mixtures can contain only a small number of them. These conditions hold in experiments on self-assembling molecular systems. First, we introduce a concept of uniform coverage design specialized for this situation. Next, we propose to use a stepwise technique for estimating the coefficients of a third-order Scheffé model which describes the response surface. After that, we propose a method of adding new mixtures in order to move towards an extremum region. Under this method, the additional mixtures are the extremum points of the currently estimated model, together with points that lead to more accurate estimation of the current model's predictions. This methodology is studied numerically for a model constructed from real data.

12.
The latent class model, or multivariate multinomial mixture, is a powerful approach for clustering categorical data. It uses a conditional independence assumption given the latent class to which a statistical unit belongs. In this paper, we exploit the fact that a fully Bayesian analysis with Jeffreys non-informative prior distributions presents no technical difficulty, and we propose an exact expression for the integrated complete-data likelihood, which is known to be a meaningful model selection criterion from a clustering perspective. Similarly, a Monte Carlo approximation of the integrated observed-data likelihood can be obtained in two steps: an exact integration over the parameters is followed by an approximation of the sum over all possible partitions through an importance sampling strategy. The exact and approximate criteria are then compared experimentally with their standard asymptotic BIC approximations for choosing the number of mixture components. Numerical experiments on simulated data and a biological example highlight that the asymptotic criteria are usually dramatically more conservative than the non-asymptotic criteria presented here, not only for moderate sample sizes as expected but also for quite large sample sizes. This research highlights that standard asymptotic criteria may often fail to select interesting structures present in the data.

13.
When the results of biological experiments are tested for a possible difference between treatment and control groups, the inference is only valid if based upon a model that fits the experimental results satisfactorily. In dominant-lethal testing, foetal death has previously been assumed to follow a variety of models, including Poisson, binomial, beta-binomial and various mixture models. However, discriminating between models has always been a particularly difficult problem. In this paper, we consider the data from six separate dominant-lethal assay experiments and discriminate between the competing models which could be used to describe them. We adopt a Bayesian approach and illustrate how a variety of different models may be considered, using Markov chain Monte Carlo (MCMC) simulation techniques and comparing the results with the corresponding maximum likelihood analyses. We present an auxiliary variable method for determining the probability that any particular data cell is assigned to a given component in a mixture, and we illustrate the value of this approach. Finally, we show how the Bayesian approach provides a natural and unique perspective on the model selection problem via reversible jump MCMC, and we illustrate how probabilities associated with each of the different models may be calculated for each data set. In terms of estimation, we show how, by averaging over the different models, we obtain reliable and robust inference for any statistic of interest.

14.
This paper introduces regularized functional principal component analysis for multidimensional functional data sets, utilizing Gaussian basis functions. An essential point in a functional approach via basis expansions is the evaluation of the matrix of integrals of products of pairs of basis functions (the cross-product matrix). Advantages of Gaussian basis functions in the functional approach are that their cross-product matrix can be calculated easily and that they provide a much more flexible instrument for transforming each individual's observations into functional form. The proposed method is applied to the analysis of three-dimensional (3D) protein structural data, which can be referred to as unbalanced data. Through this application, it is shown that our method extracts useful information from unbalanced data. Numerical experiments are conducted to investigate the effectiveness of our method with Gaussian basis functions, compared to the method based on B-splines. In performing regularized functional principal component analysis with B-splines, we also derive the exact form of their cross-product matrix. The numerical results show that our methodology is superior to the method based on B-splines for unbalanced data.
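In the one-dimensional case, with all basis functions sharing a common width s and the integral taken over the whole real line (assumptions made here purely for illustration; the paper's setting is multidimensional), the Gaussian cross-product matrix has the closed form computed below.

import numpy as np

def gaussian_cross_product(centers, s):
    """Cross-product matrix of Gaussian basis functions phi_i(t) = exp(-(t - c_i)^2 / (2 s^2)):
    J[i, j] = integral over the real line of phi_i(t) * phi_j(t) dt
            = s * sqrt(pi) * exp(-(c_i - c_j)^2 / (4 s^2))."""
    c = np.asarray(centers, dtype=float)
    diff = c[:, None] - c[None, :]
    return s * np.sqrt(np.pi) * np.exp(-diff ** 2 / (4.0 * s ** 2))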

15.
Recently developed genotype imputation methods are a powerful tool for detecting untyped genetic variants that affect disease susceptibility in genetic association studies. However, existing imputation methods require individual-level genotype data, whereas in practice it is often the case that only summary data are available. For example, this may occur because, for reasons of privacy or politics, only summary data are made available to the research community at large, or because only summary data are collected, as in DNA pooling experiments. In this article, we introduce a new statistical method that can accurately infer the frequencies of untyped genetic variants in these settings, and indeed substantially improve frequency estimates at typed variants in pooling experiments where observations are noisy. Our approach, which predicts each allele frequency using a linear combination of observed frequencies, is statistically straightforward and related to a long history of the use of linear methods for estimating missing values (e.g. kriging). The main statistical novelty is our approach to regularizing the covariance matrix estimates, and the resulting linear predictors, which is based on methods from population genetics. We find that, besides being both fast and flexible (allowing new problems to be tackled that cannot be handled by existing imputation approaches purpose-built for the genetic context), these linear methods are also very accurate. Indeed, imputation accuracy using this approach is similar to that obtained by state-of-the-art imputation methods that use individual-level data, but at a fraction of the computational cost.
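A stripped-down sketch of the linear (kriging-type) predictor: untyped frequencies are predicted by the conditional mean given the typed frequencies, using means and a regularized covariance estimated from a reference panel. The simple shrinkage-to-diagonal regularizer here is only a stand-in for the population-genetics regularization that is the paper's main novelty.

import numpy as np

def impute_frequencies(F_panel, typed_idx, untyped_idx, f_typed, shrink=0.1):
    """Predict untyped allele frequencies from typed ones via the conditional-mean
    linear predictor  mu_u + S_uo S_oo^{-1} (f_o - mu_o).
    F_panel: reference panel of frequencies (panels x SNPs) used to estimate the
    means and the covariance; `shrink` controls shrinkage towards the diagonal."""
    mu = F_panel.mean(axis=0)
    S = np.cov(F_panel, rowvar=False)
    S = (1.0 - shrink) * S + shrink * np.diag(np.diag(S))    # regularized covariance
    S_oo = S[np.ix_(typed_idx, typed_idx)]
    S_uo = S[np.ix_(untyped_idx, typed_idx)]
    w = np.linalg.solve(S_oo, f_typed - mu[typed_idx])
    return mu[untyped_idx] + S_uo @ w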

16.
Complex biological processes are usually studied over time in a collection of individuals, so that longitudinal data are available. The statistical challenge is to better understand the underlying biological mechanisms. A standard statistical approach is the mixed-effects model, where the regression function is highly developed so as to describe the biological processes precisely (solutions of multi-dimensional ordinary differential equations or of partial differential equations). A classical estimation method relies on coupling a stochastic version of the EM algorithm with a Markov chain Monte Carlo algorithm. This algorithm requires many evaluations of the regression function, which is clearly prohibitive when the solution is numerically approximated with a time-consuming solver. In this paper, a meta-model relying on a Gaussian process emulator is proposed to approximate the regression function, leading to what is called a mixed meta-model. The uncertainty of the meta-model approximation can be incorporated in the model. A control on the distance between the maximum likelihood estimates of the mixed meta-model and those of the exact mixed model is guaranteed. Finally, numerical simulations are performed to illustrate the efficiency of this approach.
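A small sketch of the emulation idea: an expensive regression function (here a toy one-parameter logistic ODE standing in for the real solver) is evaluated on a design of inputs, a Gaussian process is fitted to those runs, and the emulator then replaces the solver inside the many likelihood evaluations, with its predictive standard deviation quantifying the approximation uncertainty. The mixed-effects (SAEM-MCMC) estimation layer itself is omitted.

import numpy as np
from scipy.integrate import solve_ivp
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def solver(theta, t):
    # Expensive regression function: logistic growth solved numerically (toy stand-in)
    sol = solve_ivp(lambda s, x: theta * x * (1 - x), (0.0, t), [0.1], t_eval=[t])
    return sol.y[0, -1]

# Design of (theta, t) points at which the solver is actually run
design = np.array([(th, t) for th in np.linspace(0.2, 2.0, 10)
                           for t in np.linspace(0.5, 10.0, 10)])
runs = np.array([solver(th, t) for th, t in design])

# Gaussian process emulator fitted to the solver runs
gp = GaussianProcessRegressor(ConstantKernel() * RBF(length_scale=[0.5, 2.0]),
                              normalize_y=True).fit(design, runs)

# The meta-model replaces the solver; the standard deviation measures emulation uncertainty
mean, sd = gp.predict(np.array([[1.0, 3.0]]), return_std=True)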

17.
Massive correlated data with many inputs are often generated from computer experiments to study complex systems. The Gaussian process (GP) model is a widely used tool for the analysis of computer experiments. Although GPs provide a simple and effective approximation to computer experiments, two critical issues remain unresolved. One is the computational issue in GP estimation and prediction, where intensive manipulations of a large correlation matrix are required; for a large sample size and a large number of variables, this task is often unstable or infeasible. The other issue is how to improve the naive plug-in predictive distribution, which is known to underestimate the uncertainty. In this article, we introduce a unified framework that can tackle both issues simultaneously. It consists of a sequential split-and-conquer procedure, an information-combining technique using confidence distributions (CDs), and a frequentist predictive distribution based on the combined CD. It is shown that the proposed method maintains the same asymptotic efficiency as conventional likelihood inference under mild conditions, but dramatically reduces the computation in both estimation and prediction. The predictive distribution contains comprehensive information for inference and provides a better quantification of predictive uncertainty than the plug-in approach. Simulations are conducted to compare the estimation and prediction accuracy with some existing methods, and the computational advantage of the proposed method is also illustrated. The proposed method is demonstrated on a real data example based on tens of thousands of computer experiments generated from a computational fluid dynamics simulator.

18.
The thin plate volume matching and volume smoothing histosplines are described. These histosplines are suitable for estimating densities or incidence rates as a function of position on the plane when data are aggregated by area, for example by county. We give a numerical algorithm for the volume matching histospline and for the volume smoothing histospline, using generalized cross-validation to estimate the smoothing parameter. Some numerical experiments were performed using synthetic data, population data and SMRs (standardized mortality ratios) aggregated by county over the state of Wisconsin. The method turns out to be not particularly suited for obtaining population density maps where the population density can vary by two orders of magnitude, because the histospline can be negative in …

19.
Inference on whole biological systems is a recent focus in bioscience. Different biomarkers, although they seem to function separately, can actually control events of interest simultaneously. This fundamental biological principle has motivated researchers to develop joint models that can explain the biological system efficiently. With advances in biotechnology, huge amounts of biological information can now be obtained easily, so dimension reduction is one of the major issues in current biological research. In this article, we propose a Bayesian semiparametric approach for jointly modeling observed longitudinal trait and event-time data. A sure independence screening procedure based on distance correlation and a modified version of the Bayesian lasso are used for dimension reduction. The traditional Cox proportional hazards model is used for modeling the event time. Our proposed model is used for detecting marker genes controlling the biomass and first flowering time of soybean plants. Simulation studies are performed to assess the practical usefulness of the proposed model. The proposed model can be used for the joint analysis of traits and diseases for humans, animals and plants.
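A minimal sketch of the screening step: markers are ranked by their sample distance correlation with the trait and only the top-ranked ones are passed on to the joint model. The Bayesian lasso, the Cox component and the joint longitudinal model are not shown, and the cut-off `keep` is an assumption left to the user.

import numpy as np

def dist_corr(x, y):
    """Sample distance correlation between two 1-D arrays (Szekely et al.)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    A = np.abs(x[:, None] - x[None, :])
    B = np.abs(y[:, None] - y[None, :])
    A = A - A.mean(0) - A.mean(1)[:, None] + A.mean()   # double centering
    B = B - B.mean(0) - B.mean(1)[:, None] + B.mean()
    dcov2 = (A * B).mean()
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

def dc_sis(X, y, keep):
    """Sure independence screening: rank the columns of X (markers) by their
    distance correlation with the trait y and retain the top `keep` indices."""
    scores = np.array([dist_corr(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:keep]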

20.
High-content automated imaging platforms allow the multiplexing of several targets simultaneously to generate multi-parametric single-cell data sets over extended periods of time. Typically, standard simple measures such as the mean value over all cells at each time point are calculated to summarize the temporal process, resulting in a loss of the single-cell time dynamics. Multiple experiments are performed, but the observation time points are not necessarily identical, leading to difficulties when integrating summary measures from different experiments. We used functional data analysis to analyze continuous curve data, where the temporal process of a response variable for each single cell can be described by a smooth curve. This allows analyses to be performed on continuous functions rather than on the original discrete data points. Functional regression models were applied to determine common temporal characteristics of a set of single-cell curves, and random effects were employed in the models to explain variation between experiments. The aim of the multiplexing approach is to simultaneously analyze the effect of a large number of compounds in comparison to a control, in order to discriminate between their modes of action. Functional principal component analysis based on T-statistic curves for pairwise comparisons to the control was used to study time-dependent compound effects.
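A minimal sketch of the functional-data step: each cell's irregularly observed time course is smoothed onto a common grid and the resulting set of curves is decomposed by functional PCA. The spline smoother, the smoothing parameter and the plain PCA-of-curves decomposition are illustrative assumptions; the paper works with T-statistic curves and adds random effects for between-experiment variation.

import numpy as np
from scipy.interpolate import UnivariateSpline

def smooth_to_grid(times, values, grid, s=1.0):
    """Smooth one cell's time series onto a common grid (times must be strictly increasing
    with at least four observations for the default cubic spline)."""
    return UnivariateSpline(times, values, s=s)(grid)

def functional_pca(curves, n_comp=2):
    """Functional PCA of a (cells x grid) matrix of smoothed curves: returns the mean
    curve, the leading eigenfunctions and the per-cell scores on each mode."""
    mean_curve = curves.mean(axis=0)
    centred = curves - mean_curve
    U, sing, Vt = np.linalg.svd(centred, full_matrices=False)
    eigenfunctions = Vt[:n_comp]             # principal modes of temporal variation
    scores = centred @ eigenfunctions.T      # per-cell scores on each mode
    return mean_curve, eigenfunctions, scores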

