Similar Documents

20 similar documents retrieved.
1.
This paper focuses on unsupervised curve classification in the context of the nuclear industry. At the Commissariat à l'Energie Atomique (CEA), Cadarache (France), the thermal-hydraulic computer code CATHARE is used to study the reliability of reactor vessels. The code inputs are physical parameters and the outputs are time-evolution curves of a few other physical quantities. As the CATHARE code is quite complex and CPU-time-consuming, it has to be approximated by a regression model. This regression process involves a clustering step. In the present paper, the CATHARE output curves are clustered using a k-means scheme with a projection onto a lower-dimensional space. We study the properties of the empirically optimal cluster centres found by the projection-based clustering method, compared with the 'true' ones. The choice of the projection basis is discussed, and an algorithm is implemented to select the best projection basis from a library of orthonormal bases. The approach is illustrated on a simulated example and then applied to the industrial problem.
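To make the clustering step concrete, here is a minimal Python sketch of the general idea: project discretized curves onto a small orthonormal basis and run k-means on the coefficients. The cosine basis, toy curves, and all parameter values are illustrative stand-ins; the paper itself selects the basis from a library of orthonormal bases.

```python
# Minimal sketch: project curves onto an orthonormal basis, then k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)                 # common time grid
curves = np.array([np.sin(2 * np.pi * (1 + g) * t) + 0.1 * rng.standard_normal(t.size)
                   for g in rng.integers(0, 3, size=60)])   # toy curves, 3 groups

# Orthonormal cosine basis on the grid (first d functions, for illustration).
d = 5
basis = np.array([np.cos(np.pi * k * t) for k in range(d)])
basis /= np.linalg.norm(basis, axis=1, keepdims=True)

coeffs = curves @ basis.T                      # projection coefficients, shape (60, d)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coeffs)
print(labels)
```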

2.
This paper develops an extension of the Riemann sum techniques of Philippe (J. Statist. Comput. Simul. 59: 295–314) in the setting of MCMC algorithms. It shows that these techniques apply equally well to the output of these algorithms, with similar speeds of convergence, which improve upon the regular estimator. The restriction on the dimension associated with Riemann sums can furthermore be overcome by Rao–Blackwellization methods. This approach can also be used as a control variate technique in the convergence assessment of MCMC algorithms, either by comparing the values of alternative versions of Riemann sums that estimate the same quantity, or by using genuine control variates, that is, functions with known expectations, which are available in full generality for constants and scores.
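As a hedged illustration of the basic Riemann-sum estimator being extended here (in its simplest univariate form, which requires the normalized target density), consider estimating E[X^2] = 1 under a standard normal target from the output of a random-walk Metropolis chain. All tuning values are arbitrary.

```python
# Sort the chain's draws, then weight h by the spacings times the density.
import numpy as np

rng = np.random.default_rng(1)
log_pi = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)   # N(0,1) log-density
h = lambda x: x**2                                          # true expectation is 1

# Random-walk Metropolis sampler for the target.
T, x = 5000, 0.0
chain = np.empty(T)
for i in range(T):
    prop = x + rng.normal(scale=1.0)
    if np.log(rng.random()) < log_pi(prop) - log_pi(x):
        x = prop
    chain[i] = x

# Regular ergodic average vs. Riemann-sum estimator on the ordered sample.
xs = np.sort(chain)
riemann = np.sum(np.diff(xs) * h(xs[:-1]) * np.exp(log_pi(xs[:-1])))
print("ergodic average:", h(chain).mean())
print("Riemann sum    :", riemann)
```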

3.
Probabilistic sensitivity analysis of complex models: a Bayesian approach
In many areas of science and technology, mathematical models are built to simulate complex real-world phenomena. Such models are typically implemented in large computer programs and are also very complex, such that the way that the model responds to changes in its inputs is not transparent. Sensitivity analysis is concerned with understanding how changes in the model inputs influence the outputs. This may be motivated simply by a wish to understand the implications of a complex model, but often arises because there is uncertainty about the true values of the inputs that should be used for a particular application. A broad range of measures has been advocated in the literature to quantify and describe the sensitivity of a model's output to variation in its inputs. In practice the most commonly used measures are those that are based on formulating uncertainty in the model inputs by a joint probability distribution and then analysing the induced uncertainty in outputs, an approach which is known as probabilistic sensitivity analysis. We present a Bayesian framework which unifies the various tools of probabilistic sensitivity analysis. The Bayesian approach is computationally highly efficient. It allows effective sensitivity analysis to be achieved by using far smaller numbers of model runs than standard Monte Carlo methods. Furthermore, all measures of interest may be computed from a single set of runs.
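For contrast with the paper's approach, the following sketch shows the standard Monte Carlo computation of a first-order (main-effect) probabilistic sensitivity index, the kind of brute-force estimate that the Bayesian framework aims to reproduce with far fewer model runs. The test function and sample sizes are placeholders; the nested loop makes the cost in model evaluations obvious.

```python
# Brute-force main-effect indices S_i = Var(E[Y|X_i]) / Var(Y).
import numpy as np

rng = np.random.default_rng(2)
f = lambda x1, x2, x3: np.sin(x1) + 7 * np.sin(x2) ** 2 + 0.1 * x3**4 * np.sin(x1)

N, M = 500, 500                                    # outer/inner Monte Carlo sizes
X = rng.uniform(-np.pi, np.pi, size=(N, 3))
var_y = f(X[:, 0], X[:, 1], X[:, 2]).var()         # total output variance

for i in range(3):
    cond_means = []
    for xi in rng.uniform(-np.pi, np.pi, size=N):  # outer loop over X_i
        Z = rng.uniform(-np.pi, np.pi, size=(M, 3))
        Z[:, i] = xi                               # freeze X_i, vary the rest
        cond_means.append(f(Z[:, 0], Z[:, 1], Z[:, 2]).mean())
    print(f"S_{i+1} ~ {np.array(cond_means).var() / var_y:.2f}")
```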

4.
In observational studies, unbalanced observed covariates between treatment groups often cause biased inferences in the estimation of treatment effects. Recently, the generalized propensity score (GPS) has been proposed to overcome this problem; however, a practical technique for applying the GPS is lacking. This study demonstrates how clustering algorithms can be used to group similar subjects based on the transformed GPS. We compare four popular clustering algorithms, k-means clustering (KMC), model-based clustering, fuzzy c-means clustering, and partitioning around medoids, based on the following three criteria: average dissimilarity between subjects within clusters, average Dunn index, and average silhouette width, under four different covariate scenarios. Simulation studies show that the KMC algorithm has overall better performance than the other three clustering algorithms. Therefore, we recommend using the KMC algorithm to group similar subjects based on the transformed GPS.
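A minimal sketch of the recommended procedure, assuming the GPS values have already been estimated (they are simulated here), groups subjects by k-means on a logit-transformed GPS and reports the average silhouette width for several choices of k:

```python
# Group subjects by k-means on transformed GPS; score with silhouette width.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
gps = np.concatenate([rng.beta(2, 8, 200), rng.beta(8, 2, 200)])  # toy GPS values
X = np.log(gps / (1 - gps)).reshape(-1, 1)                        # logit transform

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))               # avg silhouette width
```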

5.
Clustering algorithms are used in the analysis of gene expression data to identify groups of genes with similar expression patterns. These algorithms group genes with respect to a predefined dissimilarity measure without using any prior classification of the data. Most clustering algorithms require the number of clusters as input, and all the objects in the dataset are usually assigned to one of the clusters. We propose a clustering algorithm that finds clusters sequentially and allows for sporadic objects, that is, objects that are not assigned to any cluster. The proposed sequential clustering algorithm has two steps. First, it finds candidates for the centers of clusters; multiple candidates are used to make the search for clusters more efficient. Second, it conducts a local search around the candidate centers to find the set of objects that defines a cluster. The candidate clusters are compared using a predefined score, the best cluster is removed from the data, and the procedure is repeated. We investigate the performance of this algorithm using simulated data, and we apply the method to analyze gene expression profiles in a study on the plasticity of dendritic cells.

6.
Clustering of Variables Around Latent Components
Clustering of variables around latent components is investigated as a means to organize multivariate data into meaningful structures. The coverage includes (i) the case where it is desirable to lump together correlated variables no matter whether the correlation coefficient is positive or negative; (ii) the case where negative correlation shows high disagreement among variables; (iii) an extension of the clustering techniques which makes it possible to explain the clustering of variables taking account of external data. The strategy basically consists in performing a hierarchical cluster analysis, followed by a partitioning algorithm. Both algorithms aim at maximizing the same criterion which reflects the extent to which variables in each cluster are related to the latent variable associated with this cluster. Illustrations are outlined using real data sets from sensory studies.
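The following is a rough Python sketch of the two ingredients for case (i): a hierarchical clustering of variables under a dissimilarity based on squared correlation (so strongly positively or negatively correlated variables are lumped together), and the homogeneity criterion computed from each cluster's latent component, taken here as its first principal component. The simulated data and the two-cluster cut are illustrative, and the sketch omits the partitioning refinement step.

```python
# Cluster variables with dissimilarity 1 - r^2, then score the partition by
# the sum of squared correlations with each cluster's latent component.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
z = rng.standard_normal((300, 2))
X = np.column_stack([z[:, 0], -z[:, 0], z[:, 0], z[:, 1], -z[:, 1], z[:, 1]])
X = X + 0.3 * rng.standard_normal(X.shape)       # six variables, two latent groups

R = np.corrcoef(X, rowvar=False)
D = 1.0 - R**2                                   # dissimilarity between variables
labels = fcluster(linkage(D[np.triu_indices_from(D, 1)], "average"), 2, "maxclust")

crit = 0.0
for c in np.unique(labels):
    Xc = X[:, labels == c]
    Xc = (Xc - Xc.mean(0)) / Xc.std(0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    latent = Xc @ vt[0]                          # first principal component scores
    crit += sum(np.corrcoef(latent, x)[0, 1] ** 2 for x in Xc.T)
print(labels, round(crit, 2))
```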

7.
This paper reviews five related types of analysis, namely (i) sensitivity or what-if analysis, (ii) uncertainty or risk analysis, (iii) screening, (iv) validation, and (v) optimization. The main questions are: when should which type of analysis be applied, and which statistical techniques may then be used? This paper claims that the proper sequence to follow in the evaluation of simulation models is as follows. 1) Validation, in which the availability of data on the real system determines which type of statistical technique to use. 2) Screening: in the simulation's pilot phase the really important inputs can be identified through a novel technique, called sequential bifurcation, which uses aggregation and sequential experimentation. 3) Sensitivity analysis: the really important inputs should be subjected to a more detailed analysis, which includes interactions between these inputs; relevant statistical techniques are design of experiments (DOE) and regression analysis. 4) Uncertainty analysis: the important environmental inputs may have values that are not precisely known, so the uncertainties of the model outputs that result from the uncertainties in these model inputs should be quantified; relevant techniques are the Monte Carlo method and Latin hypercube sampling. 5) Optimization: the policy variables should be controlled; a relevant technique is Response Surface Methodology (RSM), which combines DOE, regression analysis, and steepest-ascent hill-climbing. The recommended sequence implies that sensitivity analysis precedes uncertainty analysis. Several case studies for each phase are briefly discussed in this paper.
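As a small illustration of step 4, the sketch below propagates input uncertainty through a toy model with Latin hypercube sampling, one of the two techniques named for uncertainty analysis. The model and input ranges are placeholders.

```python
# Uncertainty analysis via Latin hypercube sampling of the input space.
import numpy as np
from scipy.stats import qmc

model = lambda x: x[:, 0] ** 2 + 3.0 * x[:, 1]        # stand-in simulation model

sampler = qmc.LatinHypercube(d=2, seed=0)
u = sampler.random(n=1000)                            # LHS design on the unit square
x = qmc.scale(u, l_bounds=[0.0, -1.0], u_bounds=[2.0, 1.0])

y = model(x)                                          # propagate through the model
print("mean:", y.mean(), "95% interval:", np.percentile(y, [2.5, 97.5]))
```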

8.
Running complex computer models can be expensive in computer time, while learning about the relationships between input and output variables can be difficult. An emulator is a fast approximation to a computationally expensive model that can be used as a surrogate for the model, to quantify uncertainty or to improve process understanding. Here, we examine emulators based on singular value decompositions (SVDs) and use them to emulate global climate and vegetation fields, examining how these fields are affected by changes in the Earth's orbit. The vegetation field may be emulated directly from the orbital variables, but an appealing alternative is to relate it to emulations of the climate fields, which involves high-dimensional input and output. The SVDs radically reduce the dimensionality of the input and output spaces and are shown to clarify the relationships between them. The method could potentially be useful for any complex process with correlated, high-dimensional inputs and/or outputs.
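A minimal sketch of an SVD-based emulator in this spirit: decompose the training outputs, emulate only the leading SVD coefficients as functions of the inputs (a plain linear regression stands in for a proper emulator here), and reconstruct full fields for new inputs. All data and dimensions are toy values.

```python
# Reduce high-dimensional outputs with an SVD, emulate the leading scores.
import numpy as np

rng = np.random.default_rng(5)
grid = np.linspace(0, 1, 400)                          # spatial grid (toy "field")
inputs = rng.uniform(0, 1, size=(40, 2))               # e.g. orbital parameters
Y = np.array([np.sin(2 * np.pi * (a * grid + b)) for a, b in inputs])

U, s, Vt = np.linalg.svd(Y - Y.mean(0), full_matrices=False)
k = 5
scores = U[:, :k] * s[:k]                              # per-run SVD coefficients

A = np.column_stack([np.ones(len(inputs)), inputs])    # linear emulator of scores
beta, *_ = np.linalg.lstsq(A, scores, rcond=None)

x_new = np.array([[1.0, 0.3, 0.7]])                    # new input (with intercept)
field = Y.mean(0) + (x_new @ beta) @ Vt[:k]            # emulated field on the grid
print(field.shape)
```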

9.
Biclustering is the simultaneous clustering of two related dimensions, for example, of individuals and features, or genes and experimental conditions. Very few statistical models for biclustering have been proposed in the literature. Instead, most of the research has focused on algorithms to find biclusters, and the models underlying them have not received much attention. Hence, very little is known about the adequacy and limitations of the models and the efficiency of the algorithms. In this work, we shed light on the statistical models behind the algorithms. This allows us to generalize most of the known popular biclustering techniques, and to justify, and in many cases improve on, the algorithms used to find the biclusters. It turns out that most of the known techniques have a hidden Bayesian flavor. Therefore, we adopt a Bayesian framework to model biclustering. We propose a measure of biclustering complexity (number of biclusters and overlapping) through a penalized plaid model, and present a suitable version of the deviance information criterion to choose the number of biclusters, a problem that has not yet been adequately addressed. Our ideas are motivated by the analysis of gene expression data.

10.
The problem of estimating the unknown response function of a time-invariant continuous linear system is considered. The integral sample input–output cross-correlogram is taken as an estimator of the response function. The inputs are assumed to be zero-mean stationary Gaussian processes. A criterion on the shape of the impulse response function is given. For this purpose, we apply the theory of square-Gaussian random processes and estimate the probability that the supremum of a square-Gaussian process exceeds a level specified by some function.
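In discrete time, the estimator can be sketched as follows: for a linear time-invariant system driven by zero-mean white Gaussian noise, the input-output cross-correlogram divided by the input variance estimates the impulse response. The true response and noise level below are illustrative.

```python
# Impulse response estimation via the input-output cross-correlogram.
import numpy as np

rng = np.random.default_rng(6)
h_true = np.exp(-0.3 * np.arange(30))            # toy true impulse response
x = rng.standard_normal(20000)                   # zero-mean Gaussian input
y = np.convolve(x, h_true)[: x.size] + 0.1 * rng.standard_normal(x.size)

# Sample cross-correlogram at each lag, normalized by the input variance.
lags = np.arange(30)
h_hat = np.array([np.mean(y[k:] * x[: x.size - k]) for k in lags]) / x.var()
print(np.round(h_hat[:5], 2), np.round(h_true[:5], 2))
```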

11.
We consider the problem of statistically evaluating the similarity of DNA intronic regions of genes. Present algorithms are based on matching a sequence of interest with known DNA sequences in a gene bank and are designed primarily to assess homology among exonic regions of genes. Most research focuses on exonic regions because they have a clear biological significance, coding for proteins, and therefore tend to be more conserved in evolution than intronic regions. To investigate whether the intronic features of genes whose expression is highly sensitive to environmental perturbations differ from those of genes that have a more constant expression, a collection of oncogenes, tumor suppressor genes, and nonregulatory genes involved in energy metabolism are compared. An analysis of the features of these genes' intronic regions results in clustering by regulatory group. In addition, Billingsley's (1961) test for Markov structure suggests that 67% of the intronic regions in this collection of genes show evidence of nonrandom structure, indicating the possibility of a biological function for these regions. The result of Billingsley's test for homology is used as input to a clustering algorithm. The biological significance of this methodology lies in the identification of groups based on the intronic regions of genes of unknown function. With the advent of rapid sequencing techniques, there is a great need for statistical techniques to help identify the purpose of poorly understood portions of genes. These methods can be utilized to assess the functional group to which such a gene might belong.
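A hedged sketch of a Markov-structure test in the spirit of Billingsley (1961): compare the observed transition counts of a nucleotide sequence against the independence model with a likelihood-ratio statistic, asymptotically chi-squared with (k-1)^2 degrees of freedom for a k-letter alphabet. The sequence here is simulated i.i.d., so the test should not reject.

```python
# Likelihood-ratio test of independence vs. first-order Markov structure.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(11)
seq = rng.choice(list("ACGT"), size=2000, p=[0.25, 0.25, 0.25, 0.25])

idx = {b: i for i, b in enumerate("ACGT")}
counts = np.zeros((4, 4))
for a, b in zip(seq[:-1], seq[1:]):              # tally observed transitions
    counts[idx[a], idx[b]] += 1

n = counts.sum()
expected = np.outer(counts.sum(1), counts.sum(0)) / n
ratio = np.where(counts > 0, counts / expected, 1.0)
lr = 2.0 * np.sum(counts * np.log(ratio))
print("LR =", round(lr, 2), "p =", round(chi2.sf(lr, df=9), 3))
```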

12.
This paper discusses the development of a multivariate control charting technique for short-run manufacturing environments with autocorrelated data. The proposed approach combines multivariate residual charts for autocorrelated data with the multivariate transformation technique for i.i.d. process observations of short lengths. It consists of fitting an adequate multivariate time-series model to the various process outputs, computing the residuals, transforming them into standard normal N(0, 1) data, and then using the standardized data as inputs to conventional univariate i.i.d. control charts. The objective of applying multivariate finite-horizon techniques to autocorrelated processes is to allow continuous process monitoring, since all process outputs are controlled through the use of a single control chart with constant control limits. Through simulated examples, it is shown that the proposed short-run process monitoring technique provides shift-detection properties approximately similar to those of VAR residual charts.
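A minimal sketch of the residual-charting part of the approach: fit a VAR model to autocorrelated multivariate output, standardize the one-step residuals against an in-control reference period, and monitor them with constant ±3 limits. The short-run transformation step of the paper is omitted, and the data, injected shift, and limits are illustrative.

```python
# VAR residual chart: model the autocorrelation, chart the residuals.
import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(7)
n, A = 300, np.array([[0.6, 0.2], [0.1, 0.5]])   # VAR(1) dynamics
y = np.zeros((n, 2))
for t in range(1, n):
    y[t] = A @ y[t - 1] + rng.standard_normal(2)
y[250:, 0] += 2.0                                 # inject a mean shift

res = VAR(y).fit(1).resid                         # one-step-ahead residuals
z = (res - res[:200].mean(0)) / res[:200].std(0)  # standardize vs. in-control phase
alarms = np.where(np.abs(z).max(axis=1) > 3.0)[0] # points beyond +/-3 limits
print("out-of-control points:", alarms)
```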

13.
The simulation of statistical models on a computer is a fundamental aspect of research in the field of nonparametric curve estimation. Methods such as the FFT (Fast Fourier Transform) or WARP (Weighted Average of Rounded Points) have been developed and analysed for the computer implementation of the different techniques in this realm, with the aim of reducing the computation time as much as possible. In this work we analyse two techniques with this objective: the vectorization of the source code in which the different algorithms are implemented, and their distributed execution. It can be observed that the vectorization of the programs can improve the results obtained with techniques such as the FFT or WARP, or, in some cases, can remove the need for them.
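To fix ideas about WARP itself, here is a minimal binned (rounded-points) kernel density sketch: the data are rounded to a fine grid of bins and the estimate is obtained by convolving the bin counts with discretized kernel weights, so the cost is one pass over the bins rather than n kernel evaluations per evaluation point. The triangular weights and all tuning values are illustrative.

```python
# WARP-style density estimate: bin the data, convolve with kernel weights.
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=10000)

delta, M = 0.05, 5                                # bin width, smoothing parameter
edges = np.arange(x.min(), x.max() + delta, delta)
counts, _ = np.histogram(x, bins=edges)           # "rounding" step

w = 1.0 - np.abs(np.arange(-M, M + 1)) / M        # triangular kernel weights
w /= w.sum() * delta                              # normalize so the density integrates to 1
density = np.convolve(counts / x.size, w, mode="same")
centers = edges[:-1] + delta / 2
print(centers[len(centers) // 2], density[len(density) // 2])
```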

14.
One of the most popular algorithms for partitioning data into k clusters is the k-means clustering algorithm. Since this method relies on basic conditions such as the existence of the mean and a finite variance, it is unsuitable for data whose variances are infinite, such as data with heavy-tailed distributions. The Pitman Measure of Closeness (PMC) is a criterion that shows how close an estimator is to its parameter relative to another estimator. In this article, using PMC and building on k-means clustering, a new distance and a new clustering algorithm are developed for heavy-tailed data.

15.
A Bayesian network (BN) is a probabilistic graphical model that represents a set of variables and their probabilistic dependencies. Formally, BNs are directed acyclic graphs whose nodes represent variables and whose arcs encode the conditional dependencies among the variables. Nodes can represent any kind of variable, be it a measured parameter, a latent variable, or a hypothesis; they are not restricted to representing random variables, which forms the "Bayesian" aspect of a BN. Efficient algorithms exist that perform inference and learning in BNs. BNs that model sequences of variables are called dynamic BNs. In this context, [A. Harel, R. Kenett, and F. Ruggeri, Modeling web usability diagnostics on the basis of usage statistics, in Statistical Methods in eCommerce Research, W. Jank and G. Shmueli, eds., Wiley, 2008] provide a comparison between Markov chains and BNs in the analysis of web usability from e-commerce data. A comparison of regression models, structural equation models, and BNs is presented in Anderson et al. [R.D. Anderson, R.D. Mackoy, V.B. Thompson, and G. Harrell, A Bayesian network estimation of the service-profit chain for transport service satisfaction, Decision Sciences 35(4), (2004), pp. 665–689]. In this article we apply BNs to the analysis of customer satisfaction surveys and demonstrate the potential of the approach. In particular, BNs offer advantages in implementing models of cause and effect over other statistical techniques designed primarily for testing hypotheses. Other advantages include the ability to conduct probabilistic inference for prediction and diagnostic purposes, with an output that can be intuitively understood by managers.
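As a toy illustration of the kind of diagnostic inference described here, the sketch below encodes a three-node network (Service -> Satisfaction -> Recommendation), loosely evoking the customer-satisfaction setting, and computes P(Service = good | Recommend = yes) by brute-force enumeration. All probabilities are invented.

```python
# Tiny Bayesian network with diagnostic inference by enumeration.
import itertools

p_service = {0: 0.3, 1: 0.7}                         # P(Service); 1 = good
p_sat = {(0, 0): 0.8, (0, 1): 0.2,                    # P(Satisfaction | Service)
         (1, 0): 0.2, (1, 1): 0.8}
p_rec = {(0, 0): 0.9, (0, 1): 0.1,                    # P(Recommend | Satisfaction)
         (1, 0): 0.3, (1, 1): 0.7}

def joint(s, a, r):
    """Joint probability factorized along the DAG."""
    return p_service[s] * p_sat[(s, a)] * p_rec[(a, r)]

# Diagnostic query: P(Service = good | Recommend = yes).
num = sum(joint(1, a, 1) for a in (0, 1))
den = sum(joint(s, a, 1) for s, a in itertools.product((0, 1), repeat=2))
print(round(num / den, 3))
```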

16.
The correspondence analysis (CA) method appears to be an effective tool for the analysis of interrelations between rows and columns in two-way contingency data. A discrete version of the method, box clustering, is developed in the paper using an approximate version of the CA model extended to the case when CA factor values are required to be Boolean. Several properties of the proposed SEFIT-BOX algorithm are proved to facilitate interpretation of its output. It is also shown that two known partitioning algorithms (applied within row or column sets only) can be considered locally optimal algorithms for fitting the model, and extensions of these algorithms to a simultaneous row and column partitioning problem are proposed.

17.
In the literature, technical efficiency is measured as the ratio of observed output to potential output. Although there is no a priori theoretical reasoning for it, in the stochastic framework of measuring technical efficiency, potential output has conventionally been assumed to be a neutral shift from observed output, owing solely to a larger intercept term in the frontier production function and without change in the input response coefficients. The objective of this paper is to propose and apply a method to measure technical efficiency without the above assumption. Furthermore, this methodology does not require the restrictive assumption of a particular distribution for the efficiency-related error term, as has been the case until now in the stochastic production function literature. A random sample of farmers from Madurai district in Tamil Nadu, India, was used. The analysis revealed substantial variation in the farm-specific input response coefficients between farms, which means that the contributions of individual inputs to the output differ from farm to farm, because the methods of application of the individual inputs vary. The frontier production function, which defines the potential of a technology, is determined by the highest values of the coefficients of each individual input, which may come from one or more farms. Farm-specific frontier functions generally showed considerable potential for improving the technical performance of each input.

18.
A tutorial on spectral clustering
In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance, spectral clustering appears slightly mysterious, and it is not obvious why it works at all or what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
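A minimal sketch of the basic algorithm the tutorial derives: build a similarity graph, form the unnormalized graph Laplacian L = D - W, take the eigenvectors of the smallest eigenvalues, and run k-means on their rows. Two concentric rings, which defeat plain k-means, serve as the standard illustration; the bandwidth and other values are arbitrary.

```python
# Unnormalized spectral clustering on two concentric rings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(9)
theta = rng.uniform(0, 2 * np.pi, 300)
r = np.repeat([1.0, 3.0], 150) + 0.1 * rng.standard_normal(300)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])

d2 = ((X[:, None] - X[None]) ** 2).sum(-1)       # squared pairwise distances
W = np.exp(-d2 / 0.5)                            # Gaussian similarity graph
L = np.diag(W.sum(1)) - W                        # unnormalized Laplacian L = D - W
eigvals, eigvecs = np.linalg.eigh(L)
U = eigvecs[:, :2]                               # eigenvectors of the 2 smallest eigenvalues
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(U)
print(np.bincount(labels))
```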

19.
Understanding the causes and consequences of genetic variation in human immunodeficiency virus (HIV) is one of the most important tasks facing medical and evolutionary biologists alike. A powerful analytical tool which is available to those working in this field is the phylogenetic tree, which describes the evolutionary relationships of the sequences in a sample and the history of the mutational events which separate them. Although phylogenetic trees of HIV are becoming commonplace, their use can be improved by tailoring the underlying statistical models to the idiosyncrasies of viral biology. The design and refinement of phylogenetic analyses consequently represents an important practical use of statistical methods in HIV research.

20.
Cluster analysis is an important technique of explorative data mining. It refers to a collection of statistical methods for learning the structure of data by solely exploring pairwise distances or similarities. Often, meaningful structures are not detectable in these high-dimensional feature spaces, as relevant features can be obfuscated by noise from irrelevant measurements. These observations led to the design of subspace clustering algorithms, which can identify clusters that originate from different subsets of features. Hunting for clusters in arbitrary subspaces is intractable due to the exponentially large search space spanned by all feature combinations. In this work, we present a subspace clustering algorithm that can be applied for exhaustively screening all feature combinations of small- or medium-sized datasets (up to approximately 30 features). Based on a robustness analysis via subsampling, we are able to identify a set of stable candidate subspace cluster solutions.
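A rough sketch of the exhaustive screening idea, with silhouette width standing in for the paper's subsampling-based robustness analysis: enumerate every feature combination of a small dataset, cluster within each subspace, and rank the candidate solutions. With d features there are 2^d - 1 subspaces, which is why this is feasible only for modest d.

```python
# Exhaustive subspace screening: cluster in every feature combination.
import itertools
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(10)
X = rng.standard_normal((120, 6))
X[:60, 2] += 4.0                                   # cluster structure lives in feature 2

results = []
for k in range(1, X.shape[1] + 1):
    for subset in itertools.combinations(range(X.shape[1]), k):
        labels = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(X[:, subset])
        results.append((silhouette_score(X[:, subset], labels), subset))
print(max(results))                                # best-scoring subspace
```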
