Similar Documents
20 similar documents found (search time: 796 ms).
1.
The current paradigm for the identification of candidate drugs within the pharmaceutical industry typically involves the use of high-throughput screens. High-content screening (HCS) is the term given to the process of using an imaging platform to screen large numbers of compounds for some desirable biological activity. Classification methods have important applications in HCS experiments, where they are used to predict which compounds have the potential to be developed into new drugs. In this paper, a new classification method is proposed for batches of compounds where the rule is updated sequentially using information from the classification of previous batches. This methodology accounts for the possibility that the training data are not a representative sample of the test data and that the underlying group distributions may change as new compounds are analysed. This technique is illustrated on an example data set using linear discriminant analysis, k-nearest neighbour and random forest classifiers. Random forests are shown to be superior to the other classifiers and are further improved by the additional updating algorithm in terms of an increase in the number of true positives as well as a decrease in the number of false positives.
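As a rough illustration of the batch-updating idea, the sketch below retrains a random forest after each batch and folds confidently classified compounds back into the training set. This is a minimal self-training sketch, not the paper's algorithm; the scikit-learn classifier, the 0.9 confidence threshold and the function name are assumptions.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def screen_batches(X_train, y_train, batches, conf=0.9):
        """Classify batches sequentially; after each batch, fold the compounds
        whose predicted class probability exceeds `conf` back into the
        training set, so the rule adapts to drifting group distributions."""
        X, y = X_train, y_train
        predictions = []
        for Xb in batches:
            rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
            proba = rf.predict_proba(Xb)[:, 1]          # assumes 0/1 labels
            pred = (proba >= 0.5).astype(int)
            predictions.append(pred)
            sure = np.maximum(proba, 1 - proba) >= conf  # confident calls only
            X = np.vstack([X, Xb[sure]])
            y = np.concatenate([y, pred[sure]])
        return predictions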

2.
Canonical correlation analysis (CCA) is often used to analyze the correlation between two random vectors. However, interpretation of CCA results can sometimes be hard. In an attempt to address these difficulties, principal canonical correlation analysis (PCCA) was proposed. PCCA is CCA between two sets of principal component (PC) scores. We consider the problem of selecting useful PC scores in CCA. A variable selection criterion for one set of PC scores was proposed by Ogura (2010); here, we propose a variable selection criterion for two sets of PC scores in PCCA. Furthermore, we demonstrate the effectiveness of this criterion.
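A minimal sketch of the PCCA pipeline itself (the selection criterion is not reproduced): compute PC scores for each random vector, then run CCA between the two sets of scores. The simulated data and the retained component counts are arbitrary illustrative choices.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    Y = X @ rng.normal(size=(10, 8)) + rng.normal(size=(200, 8))

    Sx = PCA(n_components=4).fit_transform(X)   # PC scores of the first vector
    Sy = PCA(n_components=3).fit_transform(Y)   # PC scores of the second vector

    cca = CCA(n_components=3).fit(Sx, Sy)       # CCA between the two score sets
    U, V = cca.transform(Sx, Sy)
    print([np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(3)])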

3.
Canonical correlations are maximized correlation coefficients indicating the relationships between pairs of canonical variates, which are linear combinations of the two sets of original variables. The number of non-zero canonical correlations in a population is called its dimensionality. Parallel analysis (PA) is an empirical method for determining the number of principal components or factors that should be retained in factor analysis. An example is given to illustrate how procedures based on PA and on a bootstrap-modified PA can be adapted to the context of canonical correlation analysis (CCA). The performance of the proposed procedures is evaluated in a simulation study by comparison with traditional sequential test procedures with respect to under-, correct- and over-determination of dimensionality in CCA.
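A hedged sketch of how PA transfers to CCA: the observed canonical correlations are compared with quantiles obtained by permuting the rows of one data set, which destroys the cross-set association. The QR/SVD route to canonical correlations is standard; the permutation count and 5% level are arbitrary choices, not the paper's settings.

    import numpy as np

    def canonical_corrs(X, Y):
        """Canonical correlations = singular values of Qx'Qy."""
        qx, _ = np.linalg.qr(X - X.mean(0))
        qy, _ = np.linalg.qr(Y - Y.mean(0))
        return np.linalg.svd(qx.T @ qy, compute_uv=False)

    def pa_dimension(X, Y, n_perm=500, alpha=0.05, seed=0):
        """Retain each correlation that exceeds its permutation (1-alpha) quantile."""
        rng = np.random.default_rng(seed)
        obs = canonical_corrs(X, Y)
        null = np.array([canonical_corrs(X, Y[rng.permutation(len(Y))])
                         for _ in range(n_perm)])
        thresh = np.quantile(null, 1 - alpha, axis=0)
        k = 0
        while k < len(obs) and obs[k] > thresh[k]:
            k += 1
        return k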

4.
Confronted with multivariate group-structured data, one is in fact always interested in describing differences between groups. In this paper, canonical correlation analysis (CCA) is used as an exploratory data analysis tool to detect and describe differences between groups of objects. CCA allows for the construction of Gabriel biplots, relating representations of objects and variables in the plane that best represents the distinction between the groups of object points. In the case of non-linear CCA, transformations of the original variables are suggested to achieve a better group separation than that obtained by linear CCA. One can detect which (transformed) variables are responsible for this separation. The separation itself might be due to several characteristics of the data (e.g. distances between the centres of gravity of the original or transformed groups of object points, or differences in the structure of the original groups). Four case studies illustrate the range of possibilities offered by linear and non-linear CCA.
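One concrete way to use CCA as a group-separation tool, sketched below under assumptions: run CCA between the data matrix and a group indicator matrix, so the canonical variates span the plane that best separates the groups (a discriminant-analysis view). The simulated data are illustrative and the biplot construction is omitted.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(1)
    groups = np.repeat([0, 1, 2], 50)
    X = rng.normal(size=(150, 6))
    X[:, :2] += np.column_stack([groups, -groups])   # group-dependent shift

    G = np.eye(3)[groups][:, :2]      # indicator matrix, one redundant column dropped
    cca = CCA(n_components=2).fit(X, G)
    scores = cca.transform(X)         # object coordinates in the separating plane
    # a scatter of `scores` coloured by `groups` displays the between-group structure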

5.
The wide-ranging and rapidly evolving nature of ecological studies means that it is not possible to cover all existing and emerging techniques for analyzing multivariate data. However, two important methods have attracted many followers: Canonical Correspondence Analysis (CCA) and the STATICO analysis. Despite the particular characteristics of each, they have similarities and differences which, when analyzed properly, can together provide important complementary results beyond those usually exploited by researchers. If, on one hand, the use of CCA is completely generalized and implemented, solving many problems formulated by ecologists, on the other hand the method has some weaknesses, mainly the constraint it imposes on the number of variables relative to the number of samples. The STATICO method has no such restriction, but it requires that the number of variables (species or environmental) be the same at each time or space point. In turn, the STATICO method can provide more detailed information, since it allows the variability within groups (either in time or space) to be visualized. In this study, the data needed to implement these methods are sketched, and a comparison is made showing the advantages and disadvantages of each method. The ecological data treated here are a sequence of pairs of ecological tables, where species abundances and environmental variables are measured at different, specified locations over the course of time.

6.
This work investigates the use of canonical correlation analysis (CCA) in the definition of weight restrictions for data envelopment analysis (DEA). With this purpose, CCA-derived limits are introduced into Wong and Beasley's DEA model. The method is applied to data from hospitals in 27 Brazilian cities, with average payment (average admission values) and the percentage of hospital admissions by disease group (International Classification of Diseases, 9th Edition) as outputs, and mortality rates and average length of stay after admission (days) as inputs. In this application, performance scores were calculated for both the CCA-restricted and the unrestricted DEA models. It can be concluded that the use of CCA-based weight limits in DEA models increases the consistency of the estimated DEA scores (more homogeneous weights), and that these limits do not present mathematical infeasibility problems while avoiding the need to restrict weight variation in DEA subjectively.
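For context, a minimal multiplier-form CCR DEA sketch in which box bounds on the output weights stand in for externally derived (e.g. CCA-based) limits. This is not the Wong and Beasley virtual-weight formulation used in the paper; the epsilon lower bounds and function name are assumptions.

    import numpy as np
    from scipy.optimize import linprog

    def dea_ccr(X, Y, j0, w_lo=None, w_hi=None):
        """Input-oriented CCR multiplier model for unit j0:
        maximise u'y0 subject to v'x0 = 1, u'Yj - v'Xj <= 0 for all j,
        with optional box bounds (w_lo, w_hi) on the output weights u."""
        n, m = X.shape                       # units x inputs
        _, s = Y.shape                       # units x outputs
        c = np.concatenate([-Y[j0], np.zeros(m)])             # vars = (u, v)
        A_ub = np.hstack([Y, -X])                             # u'Yj - v'Xj <= 0
        A_eq = np.concatenate([np.zeros(s), X[j0]])[None, :]  # v'x0 = 1
        bounds = [(1e-6 if w_lo is None else w_lo[r],
                   None if w_hi is None else w_hi[r]) for r in range(s)]
        bounds += [(1e-6, None)] * m
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                      bounds=bounds, method="highs")
        return -res.fun                      # efficiency score of unit j0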

7.
In this paper we consider the worst-case adaptive complexity of the search problem over a ground set S, where the family of sets to be searched is the set of independent sets of a matroid over S. We give a formula for the number of questions needed and an algorithm that finds the optimal search algorithm for any matroid. This algorithm uses only O(|S|³) steps (i.e. questions to the independence oracle). This is also the length of Edmonds' partitioning algorithm for matroids, which does not seem to be avoidable.
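For context on the oracle model in which the questions are counted, here is a toy independence oracle together with the standard greedy basis routine; this is not the paper's optimal search algorithm, and the class and function names are illustrative.

    class UniformMatroid:
        """Toy independence oracle: independent sets = subsets of size <= r."""
        def __init__(self, S, r):
            self.S, self.r = set(S), r
            self.questions = 0
        def independent(self, A):
            self.questions += 1
            return len(A) <= self.r

    def greedy_basis(oracle):
        """Standard greedy: builds a basis with |S| oracle questions; it only
        illustrates how 'questions to the independence oracle' are counted."""
        B = set()
        for x in oracle.S:
            if oracle.independent(B | {x}):
                B.add(x)
        return B

    M = UniformMatroid(range(10), 4)
    print(greedy_basis(M), M.questions)   # a basis, found with 10 questions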

8.
9.
Partially linear models provide a useful class of tools for modeling complex data by naturally incorporating a combination of linear and nonlinear effects within one framework. One key question in partially linear models is the choice of model structure, that is, how to decide which covariates are linear and which are nonlinear. This is a fundamental, yet largely unsolved, problem for partially linear models. In practice, one often assumes that the model structure is given or known and then makes estimation and inference based on that structure. Alternatively, two methods are in common use for tackling the problem: hypothesis testing and visual screening based on marginal fits. Both methods are quite useful in practice but have their drawbacks. First, it is difficult to construct a powerful procedure for testing multiple hypotheses of linear against nonlinear fits. Second, the screening procedure based on scatterplots of individual covariate fits may provide an educated guess at the regression function form, but the procedure is ad hoc and lacks theoretical justification. In this article, we propose a new approach to structure selection for partially linear models, called the LAND (Linear And Nonlinear Discoverer). The procedure is developed in an elegant mathematical framework and possesses desirable theoretical and computational properties. Under certain regularity conditions, we show that the LAND estimator is able to identify the underlying true model structure correctly and at the same time estimate the multivariate regression function consistently. The convergence rate of the new estimator is established as well. We further propose an iterative algorithm to implement the procedure and illustrate its performance by simulated and real examples. Supplementary materials for this article are available online.
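The LAND procedure itself is a penalized method not reproduced here; as a point of reference, the sketch below implements the naive marginal-testing baseline the abstract mentions, comparing a linear fit against a spline fit covariate by covariate. The knot count and the scikit-learn spline basis are assumptions.

    import numpy as np
    from scipy.stats import f as f_dist
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import SplineTransformer

    def marginal_nonlinearity_pvalue(x, y, n_knots=6):
        """F-test of a marginal linear fit against a cubic-spline fit;
        a small p-value suggests the covariate enters nonlinearly."""
        x = x.reshape(-1, 1)
        n = len(y)
        rss0 = np.sum((y - LinearRegression().fit(x, y).predict(x)) ** 2)
        B = SplineTransformer(n_knots=n_knots, degree=3).fit_transform(x)
        rss1 = np.sum((y - LinearRegression().fit(B, y).predict(B)) ** 2)
        p0, p1 = 2, B.shape[1] + 1            # parameters in each fit
        F = (rss0 - rss1) / (p1 - p0) / (rss1 / (n - p1))
        return f_dist.sf(F, p1 - p0, n - p1)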

10.
A special class of supersaturated design, called the marginally oversaturated design (MOSD), in which the number of variables under investigation (k) is only slightly larger than the number of experimental runs (n), is presented. Several optimality criteria for supersaturated designs are discussed. It is shown that the resolution rank criterion is the most appropriate for screening situations. The construction method builds on two major theorems which provide an efficient way to evaluate resolution rank. Examples are given for the cases n = 8, 12, 16, and 20. Potential extensions for future work are discussed.
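Resolution rank is evaluated via the paper's theorems, which are not shown here; for orientation, the sketch below computes the standard E(s²) criterion that discussions of supersaturated-design optimality usually start from, assuming a ±1-coded design matrix.

    import numpy as np

    def e_s2(D):
        """E(s^2) criterion for a two-level supersaturated design D
        (n runs x k factors, +/-1 coded): the average squared
        off-diagonal entry of D'D. Smaller is better."""
        S = D.T @ D
        k = S.shape[0]
        off = S[~np.eye(k, dtype=bool)]
        return np.mean(off ** 2)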

11.
It is quite a challenge to develop model-free feature screening approaches for missing-response problems, because the existing standard missing-data analysis methods cannot be applied directly to the high-dimensional case. This paper develops some novel methods that borrow information from the missingness indicator, so that any feature screening procedure for ultrahigh-dimensional covariates with full data can be applied to the missing-response case. The first method is the so-called missing indicator imputation screening, which is developed by proving that, under some mild conditions, the set of active predictors of interest for the response is a subset of the active predictors for the product of the response and the missingness indicator. As an alternative, another method called the Venn diagram-based approach is also developed. The sure screening property is proven for both methods. It is also shown that complete-case analysis preserves the sure screening property of any feature screening approach that possesses it.
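A minimal sketch of the first method's recipe, using plain SIS-style marginal correlation as the full-data screener: screen against the product of the zero-imputed response and the missingness indicator. The screener choice and the retained count d are assumptions, not the paper's prescriptions.

    import numpy as np

    def mii_screen(X, y, d):
        """Missing-indicator-imputation screening: replace missing responses
        by 0 (equivalently, use y * delta, with delta the observation
        indicator) and rank covariates by absolute marginal correlation."""
        z = np.where(np.isnan(y), 0.0, y)     # y * delta
        corr = np.abs([np.corrcoef(X[:, j], z)[0, 1] for j in range(X.shape[1])])
        return np.argsort(corr)[::-1][:d]     # indices of the d retained covariates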

12.
Extended log-linear models (ELMs) are the natural generalization of log-linear models when the positivity assumption is relaxed. The hypergraph language, which is currently used to specify the syntax of ELMs, both provides insight into key notions of the theory of ELMs, such as collapsibility and decomposability, and makes it possible to work out efficient algorithms for some problems of inference. This is the case for the three search problems addressed in this paper, referred to as the approximation problem, the selective-reduction problem and the synthesis problem. The approximation problem consists in finding the smallest decomposable ELM that contains a given ELM and is such that the given ELM is collapsible onto each of its generators. The selective-reduction problem consists in deleting the maximum number of generators of a given ELM in such a way that the resulting ELM is a submodel and none of certain variables of interest is missing. The synthesis problem consists in finding a minimal ELM containing the intersection of ELMs specified by given independence relations. We show that each of these three search problems can be reduced to an equivalent search problem on hypergraphs, which can be solved in polynomial time.
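For a feel of why the hypergraph view pays off algorithmically: decomposable models correspond to acyclic generating hypergraphs, and acyclicity can be tested in polynomial time by Graham reduction. A minimal sketch (not one of the paper's three algorithms):

    def graham_acyclic(hyperedges):
        """Graham reduction: repeatedly delete vertices that lie in exactly one
        edge and edges contained in another edge; the hypergraph is acyclic
        (the model decomposable) iff everything reduces away."""
        H = [set(e) for e in hyperedges]
        changed = True
        while changed and H:
            changed = False
            for e in H:                       # vertices in exactly one edge
                for v in list(e):
                    if sum(v in f for f in H) == 1:
                        e.discard(v); changed = True
            H = [e for e in H if e]
            for i, e in enumerate(H):         # edges contained in another edge
                if any(i != j and e <= f for j, f in enumerate(H)):
                    H.pop(i); changed = True
                    break
        return not H

    print(graham_acyclic([{1, 2}, {2, 3}, {3, 1}]))   # False: a cycle
    print(graham_acyclic([{1, 2}, {2, 3}]))           # True: decomposable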

13.
Taguchi (1984, 1987) derived tolerances for subsystems, subcomponents, parts and materials. However, he assumed that the relationship between a higher-rank and a lower-rank quality characteristic is deterministic. The basic structure of this tolerance design problem is very similar to that of the screening problem. Tang (1987) proposed three cost models and derived an economic design for the screening problem of a "the-bigger-the-better" quality characteristic, in which the optimal specification limit (or tolerance) for a screening variable (or a lower-rank quality characteristic) was obtained by minimizing the expected total cost function. Tang considered that the quality cost is incurred only when the quality characteristic is out of specification, while Taguchi considered that a quality cost is incurred whenever the quality characteristic deviates from its nominal value. In this paper, a probabilistic relationship between the two quality characteristics, namely a bivariate normal distribution as in a screening problem, is combined with Taguchi's quadratic loss function to develop a closed-form solution of the tolerance design for a subsystem.
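The paper's closed form is not reproduced here, but the quantity being optimized can be sketched numerically: with (X, Y) standard bivariate normal and units accepted when the lower-rank characteristic satisfies |X| <= t, the conditional expected quadratic loss of Y follows from E[Y | X = x] = rho * x. All constants below are illustrative assumptions.

    import numpy as np
    from scipy import integrate, stats

    def expected_loss(t, rho=0.8, k=1.0, target=0.0):
        """E[k (Y - target)^2 | |X| <= t] for standard bivariate normal
        (X, Y) with correlation rho; X is the screening variable."""
        def inner(x):
            # E[(Y - target)^2 | X = x] = (rho x - target)^2 + (1 - rho^2)
            return ((rho * x - target) ** 2 + 1 - rho ** 2) * stats.norm.pdf(x)
        num, _ = integrate.quad(inner, -t, t)
        return k * num / (stats.norm.cdf(t) - stats.norm.cdf(-t))

    # tighter tolerances reduce the expected quadratic loss of the subsystem
    print([round(expected_loss(t), 4) for t in (0.5, 1.0, 2.0, 3.0)])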

14.
In this article the screening problem is studied by a predictive approach in a general setting. The problem of optimal screening, which is to raise the probability of success after screening to a prespecified value while retaining as many individuals as possible, is solved. The relation between such an optimal screening procedure and that considered in Turkman & Turkman (1989) is illuminated. The bivariate normal model is investigated as an illustration of the general theory.
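Under the bivariate normal illustration, the optimal rule can be sketched numerically: find the lowest cutoff c on the screening variable such that P(Y > y0 | X > c) reaches the prespecified success probability, so that the retained proportion P(X > c) is as large as possible. The root-finding bracket and parameter values are assumptions.

    from scipy import stats
    from scipy.optimize import brentq

    def optimal_cutoff(y0, p_star, rho=0.7):
        """Smallest c with P(Y > y0 | X > c) = p_star for standard bivariate
        normal (X, Y) with correlation rho > 0 (the conditional success
        probability is then increasing in c)."""
        bvn = stats.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])
        def gap(c):
            upper = 1 - stats.norm.cdf(c) - stats.norm.cdf(y0) + bvn.cdf([c, y0])
            return upper / (1 - stats.norm.cdf(c)) - p_star
        return brentq(gap, -4.0, 3.0)      # bracket assumed to contain the root

    c = optimal_cutoff(y0=0.0, p_star=0.8)
    print(c, 1 - stats.norm.cdf(c))        # cutoff and retained proportion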

15.
A variable screening procedure via correlation learning was proposed by Fan and Lv (2008) to reduce dimensionality in sparse ultra-high-dimensional models. Even when the true model is linear, the marginal regression can be highly nonlinear. To address this issue, we extend correlation learning to marginal nonparametric learning. Our nonparametric independence screening, called NIS, is a specific member of the sure independence screening family. Several closely related variable screening procedures are proposed. It is shown that, under general nonparametric models and some mild technical conditions, the proposed independence screening methods enjoy a sure screening property. The extent to which the dimensionality can be reduced by independence screening is also explicitly quantified. As a methodological extension, a data-driven thresholding rule and an iterative nonparametric independence screening (INIS) procedure are also proposed to enhance the finite-sample performance for fitting sparse additive models. Simulation results and a real data analysis demonstrate that the proposed procedure works well with moderate sample size and large dimension, and performs better than competing methods.
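A hedged sketch of the marginal nonparametric learning step: fit a spline regression of the response on each covariate separately and rank covariates by the empirical norm of the fitted component. The basis choice and the fixed retained count d are assumptions; the paper's data-driven threshold and iterative INIS refinement are not reproduced.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import SplineTransformer

    def nis_rank(X, y, d):
        """Nonparametric independence screening: rank covariates by the
        sample variance of the marginal spline fit of y on X_j, keep top d."""
        scores = []
        for j in range(X.shape[1]):
            B = SplineTransformer(n_knots=6, degree=3).fit_transform(X[:, [j]])
            fit = LinearRegression().fit(B, y).predict(B)
            scores.append(np.var(fit))     # empirical norm of the fitted component
        return np.argsort(scores)[::-1][:d]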

16.
Comparison of Four New General Classes of Search Designs
A factor screening experiment identifies a few important factors from a large list of factors that potentially influence the response. If a list consists of m factors, each at three levels, a design is a subset of all 3^m possible runs. This paper considers the problem of finding designs with small numbers of runs, using the search linear model introduced in Srivastava (1975). The paper presents four new general classes of these 'search designs', each with 2^m − 1 runs, which permit at most two important factors out of the m factors to be searched for and identified. The paper compares the designs for 4 ≤ m ≤ 10, using arithmetic and geometric means of the determinants, traces and maximum characteristic roots of particular matrices. Two of the designs are found to be superior under all six criteria studied. The four designs are identical for m = 3, and this design is optimal in the class of all search designs under the six criteria. The four designs are also identical for m = 4 under some row and column permutations.

17.
In this paper we formulate the problem of constructing 1-rotational near resolvable difference families as a combinatorial optimization problem where a global optimum corresponds to a desired difference family. Then, we develop an algorithm based on scatter search in conjunction with a tabu search to construct many of these difference families. In particular, we construct three new near resolvable difference families which lead to an equal number of new 1-rotational near resolvable block designs with parameters: (46,9,8), (51,10,9) and (55,9,8). Our results indicate that this conjunction outperforms both scatter search and tabu search.
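For flavour, a toy tabu-style local search for an ordinary (v, k, lambda) difference family over Z_v: the objective counts how far the difference multiset of the base blocks is from covering every nonzero residue lambda times, and a global optimum (score 0) is a difference family. The near-resolvable and 1-rotational requirements, and the scatter-search layer, are deliberately not modelled here; all tuning constants are assumptions.

    import random

    def deficiency(blocks, v, lam):
        """Distance of the blocks' difference multiset from covering every
        nonzero residue mod v exactly lam times (0 = difference family)."""
        count = [0] * v
        for blk in blocks:
            for a in blk:
                for b in blk:
                    if a != b:
                        count[(a - b) % v] += 1
        return sum(abs(c - lam) for c in count[1:])

    def tabu_search(v, k, lam, n_blocks, iters=50000, tabu_len=60, seed=1):
        rng = random.Random(seed)
        blocks = [rng.sample(range(v), k) for _ in range(n_blocks)]
        cur, tabu = deficiency(blocks, v, lam), []
        for _ in range(iters):
            i, j = rng.randrange(n_blocks), rng.randrange(k)
            old = blocks[i][j]
            if (i, j, old) in tabu:           # recently changed cells are frozen
                continue
            blocks[i][j] = rng.choice([x for x in range(v) if x not in blocks[i]])
            score = deficiency(blocks, v, lam)
            if score == 0:
                return blocks
            if score <= cur:                  # accept improving or sideways moves
                cur = score
                tabu.append((i, j, blocks[i][j]))
                del tabu[:-tabu_len]
            else:
                blocks[i][j] = old            # reject worsening moves
        return None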

18.
An analysis for univariate and multivariate categorical data in block designs is given and illustrated through examples. The univariate analysis compares the treatments on the basis of their frequency distributions pooled over blocks. The test statistic used is called Q, after Cochran (1950). The large-sample null distribution of Q is chi-square. Analysis of p-variate categorical data (the kth variable having c_k classes, k = 1, ..., p) can be done by treating it as a univariate categorical problem with c_1 × c_2 × ... × c_p classes. Very often this product is large in relation to the size of the experiment, which makes the expected frequencies for some of the cells very small and the univariate method inapplicable. In these circumstances it may be reasonable to compare the treatments on the basis of the marginal distributions up to the mth dimension, 1 ≤ m < p, and such a method is given in this paper. The method is also illustrated for missing observations.
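For the binary special case, Cochran's (1950) Q has a simple closed form, sketched below; the paper's general categorical version and its marginal variant are not reproduced.

    import numpy as np
    from scipy.stats import chi2

    def cochran_q(Z):
        """Cochran's Q for a b x k table Z of 0/1 outcomes (blocks x treatments).
        Under H0 (no treatment effect), Q is approximately chi-square, k - 1 df."""
        b, k = Z.shape
        C = Z.sum(axis=0)          # treatment totals
        R = Z.sum(axis=1)          # block totals
        N = Z.sum()
        Q = k * (k - 1) * np.sum((C - N / k) ** 2) / (k * N - np.sum(R ** 2))
        return Q, chi2.sf(Q, k - 1)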

19.
The multivariate adaptive regression splines (MARS) model is a well-known additive non-parametric model that can deal successfully with highly correlated and nonlinear datasets. From our previous analyses, we have seen that the lasso-type MARS (LMARS) can be a strong alternative to the Gaussian graphical model (GGM), a well-known probabilistic method that describes the steady-state behaviour of complex biological systems via lasso regression. In this study, we extend our original LMARS model by taking into account second-order interaction effects of genes, as representatives of the feed-forward loops in biological networks. In this way, we can describe both linear and nonlinear activations of the genes in the same model. We evaluate the performance of the new model on simulated and real systems of different dimensions, and then compare the accuracy of the estimates with GGM and LMARS outputs. The results show the advantage of the new model over its close alternatives.
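The LMARS extension itself is not reproduced; below is a generic MARS-flavoured stand-in under assumptions: a fixed hinge basis per gene, all pairwise hinge products as second-order interactions, and a lasso fit. Knot placement, the simulated data and the LassoCV choice are illustrative only.

    import numpy as np
    from itertools import combinations
    from sklearn.linear_model import LassoCV

    def hinge_basis(X, n_knots=3):
        """Fixed hinges max(0, x - t) at quantile knots, plus all pairwise
        hinge products (second-order interactions)."""
        cols = []
        for j in range(X.shape[1]):
            for t in np.quantile(X[:, j], np.linspace(0.2, 0.8, n_knots)):
                cols.append(np.maximum(0.0, X[:, j] - t))
        B = np.column_stack(cols)
        pairs = [B[:, a] * B[:, b] for a, b in combinations(range(B.shape[1]), 2)]
        return np.column_stack([B] + pairs)

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 5))
    y = np.maximum(0, X[:, 0]) * np.maximum(0, X[:, 1]) + rng.normal(size=100)
    model = LassoCV(cv=5).fit(hinge_basis(X), y)
    print((model.coef_ != 0).sum(), "active basis functions")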

20.
To estimate a high-dimensional covariance matrix, row sparsity is often assumed, so that each row has a small number of nonzero elements. However, in some applications, such as factor modeling, there may be many nonzero loadings on the common factors. The corresponding variables are then correlated with one another and the rows are non-sparse, or dense. This paper has three main aims. First, a detection method is proposed to identify the rows that may be non-sparse, or at least dense with many nonzero elements. These rows are called dense rows and the corresponding variables are called pivotal variables. Second, to determine the number of dense rows, a ridge ratio method is suggested, which can be regarded as a sure screening procedure. Third, to handle the estimation of high-dimensional factor models, a two-step procedure is suggested, with the above screening as the first step. Simulations are conducted to examine the performance of the new method, and a real dataset is analyzed for illustration.
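A hedged sketch of an eigenvalue ridge-ratio rule in the spirit described (the paper's exact criterion and ridge constant may differ): perturb the spectrum of the sample correlation matrix by a small ridge c and take the index with the largest consecutive ratio as the number of dense rows / common factors.

    import numpy as np

    def ridge_ratio_count(X, c=None):
        """Estimate the number of dense rows / common factors from the
        spectrum of the sample correlation matrix via ridge-adjusted
        consecutive eigenvalue ratios."""
        n, p = X.shape
        lam = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
        if c is None:
            c = np.sqrt(np.log(p) / n)       # illustrative ridge, not the paper's
        ratios = (lam[:-1] + c) / (lam[1:] + c)
        return int(np.argmax(ratios)) + 1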
