Similar literature
20 similar documents found
1.
This paper proposes modified splitting criteria for classification and regression trees by modifying the definition of the deviance. The modified deviance is based on local averaging instead of global averaging and is more successful at modelling data with interactions. The paper shows that the modified criteria result in much simpler trees for pure interaction data (no main effects) and can produce trees with fewer errors and lower residual mean deviances than those produced by Clark & Pregibon's (1992) method when applied to real datasets with strong interaction effects.

2.
The broken stick model is a model of the abundance of species in a habitat, and it has been widely extended. In this paper, we present results from exploratory data analysis of this model. To obtain some of the statistics, we formulate the broken stick model as a probability distribution function, and we provide an expression for the cumulative distribution function, which is needed to obtain the results from exploratory data analysis. The inequalities we present are useful in ecological studies that apply broken stick models. These results are also useful for testing the goodness of fit of the broken stick model as an alternative to the chi-square test, which has often been the main test used. These results may therefore be used in several alternative and complementary ways for testing the goodness of fit of the broken stick model.
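As a point of reference for the model named in this abstract, the sketch below computes the classical expected relative abundances under the broken stick model, in which the i-th most abundant of S species has expected share (1/S) Σ_{k=i}^{S} 1/k. It does not reproduce the distribution function or the inequalities derived in the paper; the function name `broken_stick_expected` is purely illustrative.

```python
from fractions import Fraction


def broken_stick_expected(num_species):
    """Expected relative abundances, largest first, under the broken stick model.

    The i-th largest share is (1/S) * sum_{k=i}^{S} 1/k for S species.
    """
    if num_species < 1:
        raise ValueError("num_species must be at least 1")
    shares = []
    tail = Fraction(0)
    # Accumulate the harmonic tail from k = S down to k = 1.
    for k in range(num_species, 0, -1):
        tail += Fraction(1, k)
        shares.append(tail / num_species)
    shares.reverse()  # put the largest share first
    return shares


if __name__ == "__main__":
    for rank, share in enumerate(broken_stick_expected(4), start=1):
        print(rank, float(share))
```

Exact fractions are used so that the shares sum to exactly one; observed relative abundances could then be compared against these expected values as a first informal check of fit.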

3.
Datasets are sometimes divided into distinct subsets, e.g. due to multi-center sampling, or to variations in instruments, questionnaire item ordering or mode of administration, and the data analyst then needs to assess whether a joint analysis is meaningful. The Principal Component Analysis-based Data Structure Comparisons (PCADSC) tools are three new non-parametric, visual diagnostic tools for investigating differences in structure for two subsets of a dataset through covariance matrix comparisons by use of principal component analysis. The PCADSC tools are demonstrated in a data example using European Social Survey data on psychological well-being in three countries: Denmark, Sweden, and Bulgaria. The data structures are found to be different in Denmark and Bulgaria, and thus a comparison of, for example, mean psychological well-being scores is not meaningful. However, when comparing Denmark and Sweden, very similar data structures, and thus comparable concepts of well-being, are found. Therefore, inter-country comparisons are warranted for these countries.

4.
We describe inferactive data analysis, so named to denote an interactive approach to data analysis with an emphasis on inference after data analysis. Our approach is a compromise between Tukey's exploratory and confirmatory data analysis, allowing also for Bayesian data analysis. We see this as a useful step in providing concrete tools (with statistical guarantees) for current data scientists. The basis of inference we use is (a conditional approach to) selective inference, in particular its randomized form. The relevant reference distributions are constructed from what we call a DAG-DAG (a Data Analysis Generative DAG), and a selective change of variables formula is crucial to any practical implementation of inferactive data analysis via sampling these distributions. We discuss a canonical example of an incomplete cross-validation test statistic to discriminate between black box models, and a real HIV dataset example to illustrate inference after making multiple queries on data.

5.
Families of splitting criteria for classification trees
Several splitting criteria for binary classification trees are shown to be expressible as weighted sums of two values of divergence measures. This weighted sum approach is then used to form two families of splitting criteria. One of them contains the chi-squared and entropy criteria; the other contains the mean posterior improvement criterion. Members of both families are shown to have the property of exclusive preference. Furthermore, the optimal splits based on the proposed families are studied. We find that the best splits depend on the parameters in the families. The results reveal interesting differences among various criteria. Examples are given to demonstrate the usefulness of both families.
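The weighted-sum idea in this abstract can be illustrated for two standard cases. For a binary split, the entropy reduction equals the weighted sum of the Kullback-Leibler divergences of each child's class distribution from the parent's, and Pearson's chi-squared statistic divided by the sample size equals the same weighted sum with the chi-squared divergence. The sketch below shows only these two identities; it is not the paper's parametrized families, and all function names are illustrative.

```python
import math


def class_proportions(counts):
    """Convert class counts in a node into class proportions."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), with 0 * log(0) taken as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi2_divergence(p, q):
    """Pearson chi-squared divergence: sum over classes of (p - q)^2 / q."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q) if qi > 0)


def weighted_divergence(left_counts, right_counts, divergence):
    """Weighted sum of the two child-versus-parent divergences of a binary split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    parent = class_proportions([a + b for a, b in zip(left_counts, right_counts)])
    left_term = divergence(class_proportions(left_counts), parent)
    right_term = divergence(class_proportions(right_counts), parent)
    return (n_left / n) * left_term + (n_right / n) * right_term


if __name__ == "__main__":
    left, right = [8, 2], [2, 8]  # class counts in the two child nodes
    print("entropy reduction:", weighted_divergence(left, right, kl_divergence))
    print("chi-squared / n:", weighted_divergence(left, right, chi2_divergence))
```

Writing both criteria through one `weighted_divergence` function makes the shared structure visible: only the divergence measure changes between them.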

6.
There is much interest in predicting the impact of global warming on the genetic diversity of natural populations, and the influence of climate on biodiversity is an important ecological question. Since the Holocene, there have been many climate perturbations, and the geographical ranges of plant taxa have changed substantially. The present-day genetic diversity of plants is a result of these processes, and a first step in studying the impact of future climate change is to understand how the important features of reconstructed climate variables, such as temperature or precipitation over the last 15,000 years, relate to the present-day genetic diversity of forests. We model the relationship between genetic diversity in European beech (Fagus sylvatica) forests and curves of temperature and precipitation reconstructed from pollen databases. Our model links the genetic measure to the climate curves. We adapt the classical functional linear model to take into account interactions between climate variables as a bilinear form. Since the data are georeferenced, our extensions also account for the spatial dependence among the observations. The practical issues of these methodological extensions are discussed.

7.
The performance of computationally inexpensive model selection criteria in the context of tree-structured subgroup analysis is investigated. It is shown through simulation that no single model selection criterion exhibits uniformly superior performance over a wide range of scenarios. Therefore, a two-stage approach for model selection is proposed and shown to perform satisfactorily. An applied example of subgroup analysis is presented. Problems associated with tree-structured subgroup analysis are discussed and practical solutions are suggested.

8.
In modeling count data with multivariate predictors, we often encounter problems with clustering of observations and interdependency of predictors. We propose to use principal components of the predictors to mitigate the multicollinearity problem. To abate information loss due to dimension reduction, a semiparametric link between the count dependent variable and the principal components is postulated. Clustering of observations is accounted for in the model as a random component, and the model is estimated via the backfitting algorithm. A simulation study illustrates the advantages of the proposed model over standard Poisson regression in a wide range of scenarios.

9.
The most common techniques for graphically presenting a multivariate dataset involve projection onto a one- or two-dimensional subspace. Interpretation of such plots is not always straightforward because projections are smoothing operations, in that structure can be obscured by projection but never enhanced. In this paper an alternative procedure for finding interesting features is proposed that is based on locating the modes of an induced hyperspherical density function, and a simple algorithm for this purpose is developed. Emphasis is placed on identifying non-linear effects, such as clustering, and to this end the data are first sphered to remove all of the location, scale and correlational structure. A set of simulated bivariate data and data on the artistic qualities of painters are used as examples.

10.
We present a graphical method based on the empirical probability generating function for preliminary statistical analysis of distributions for counts. The method is especially useful in fitting a Poisson model, or for identifying alternative models as well as possible outlying observations from general discrete distributions.
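To make the central object of this abstract concrete, the sketch below evaluates the empirical probability generating function, the sample mean of t^X, and its logarithm on a grid. For a Poisson distribution with mean λ the generating function is exp(λ(t − 1)), so the logarithm is linear in t with slope λ. Checking such a curve for straightness is one natural diagnostic, offered here as an assumption rather than as the paper's exact procedure; the function names are illustrative.

```python
import math


def empirical_pgf(counts, t):
    """Empirical probability generating function: the sample mean of t ** x."""
    if not counts:
        raise ValueError("counts must be non-empty")
    return sum(t ** x for x in counts) / len(counts)


def log_epgf_curve(counts, grid):
    """Pairs (t, log of the empirical pgf at t) for each t in grid, with 0 < t <= 1."""
    return [(t, math.log(empirical_pgf(counts, t))) for t in grid]


if __name__ == "__main__":
    sample = [0, 1, 1, 2, 3, 0, 2, 1]  # made-up counts for illustration only
    grid = [k / 10 for k in range(1, 11)]
    for t, value in log_epgf_curve(sample, grid):
        print(f"t={t:.1f}  log epgf={value:.4f}")
```

The grid is restricted to positive t so that the logarithm is always defined, even when the sample contains no zeros.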

11.
A multiple regression method based on distance analysis and metric scaling is proposed and studied. This method allows us to predict a continuous response variable from several explanatory variables, is compatible with the general linear model, and is found to be useful when the predictor variables are both continuous and categorical. Real data examples are given to illustrate the results obtained.

12.
ABSTRACT

We present methods for modeling and estimation of a concurrent functional regression when the predictors and responses are two-dimensional functional datasets. The implementations use spline basis functions, and model fitting is based on smoothing penalties and mixed model estimation. The proposed methods are implemented in available statistical software, allow the construction of confidence intervals for the bivariate model parameters, and can be applied to completely or sparsely sampled responses. The methods are tested in simulations and show favorable results in practice. The usefulness of the methods is illustrated in an application to environmental data.

13.
We introduce the log-odd Weibull regression model based on the odd Weibull distribution (Cooray, 2006). We derive some mathematical properties of the log-transformed distribution. The new regression model represents a parametric family of models that includes as sub-models some widely known regression models that can be applied to censored survival data. We employ a frequentist analysis and a parametric bootstrap for the parameters of the proposed model. We derive the appropriate matrices for assessing local influence on the parameter estimates under different perturbation schemes and present some ways to assess global influence. Further, some simulations are performed for different parameter settings, sample sizes and censoring percentages. In addition, the empirical distribution of some modified residuals is given and compared with the standard normal distribution. These studies suggest that the residual analysis usually performed in normal linear regression models can be extended to a modified deviance residual in the proposed regression model applied to censored data. We define martingale and deviance residuals to check the model assumptions. The extended regression model is very useful for the analysis of real data.

14.
15.
16.
This paper discusses the regression analysis of current status failure time data arising from the additive hazards model with auxiliary covariates. As often occurs in practice, it is impossible or impractical to measure the exact magnitude of covariates for all subjects in a study. To compensate for the missing information, some auxiliary covariates are utilized instead. We propose two easy-to-implement procedures for estimation of regression parameters by making use of auxiliary information. The asymptotic properties of the resulting estimators are established, and extensive numerical studies indicate that both procedures work well in practice.

17.
Summary  In panel studies, binary outcome measures together with time-stationary and time-varying explanatory variables are collected over time on the same individual. Therefore, a regression analysis for this type of data must allow for the correlation among the outcomes of an individual. The multivariate probit model of Ashford and Sowden (1970) was the first regression model for multivariate binary responses. However, a likelihood analysis of the multivariate probit model with general correlation structure for higher dimensions is intractable due to the maximization over high-dimensional integrals, thus severely restricting its applicability so far. Czado (1996) developed a Markov Chain Monte Carlo (MCMC) algorithm to overcome this difficulty. In this paper we present an application of this algorithm to unemployment data from the Panel Study of Income Dynamics involving 11 waves of the panel study. In addition, we adapt Bayesian model checking techniques based on the posterior predictive distribution (see for example Gelman et al. (1996)) for the multivariate probit model. These help to identify mean and correlation specifications which fit the data well. C. Czado was supported by research grant OGP0089858 of the Natural Sciences and Engineering Research Council of Canada.

18.
We consider the problem of estimation of a density function in the presence of incomplete data and study the Hellinger distance between our proposed estimators and the true density function. Here, the presence of incomplete data is handled by utilizing a Horvitz–Thompson-type inverse weighting approach, where the weights are the estimates of the unknown selection probabilities. We also address the problem of estimation of a regression function with incomplete data.
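The inverse weighting idea in this abstract can be sketched as a kernel density estimate in which each observed value is weighted by the reciprocal of its (estimated) selection probability. The sketch below assumes a Gaussian kernel, selection probabilities supplied by the caller, and weights normalized to sum to one; these are illustrative choices and not necessarily the estimator studied in the paper. The function name `ht_weighted_kde` is hypothetical.

```python
import math


def ht_weighted_kde(x, observed, selection_prob, bandwidth, values):
    """Gaussian kernel density estimate at x using inverse-probability weights.

    values[i] is used only when observed[i] is True; its weight is
    1 / selection_prob[i], normalized so that the weights sum to one.
    """
    weights = [1.0 / p if seen else 0.0 for seen, p in zip(observed, selection_prob)]
    total = sum(weights)
    if total == 0:
        raise ValueError("no observed values")
    density = 0.0
    for v, w in zip(values, weights):
        if w > 0:  # skip missing entries, whose value may be None
            z = (x - v) / bandwidth
            density += w * math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return density / (total * bandwidth)


if __name__ == "__main__":
    values = [0.2, None, 1.1, 0.7]        # None marks a missing observation
    observed = [True, False, True, True]
    probs = [0.9, 0.5, 0.6, 0.8]          # assumed estimates of selection probabilities
    print(ht_weighted_kde(0.5, observed, probs, 0.4, values))
```

Observations that were unlikely to be seen receive larger weights, which compensates for the under-representation of similar values among the complete cases.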

19.
In binary regression, imbalanced data result from the presence of values equal to zero (or one) in a proportion that is significantly greater than the corresponding real values of one (or zero). In this work, we evaluate two methods developed to deal with imbalanced data and compare them to the use of asymmetric links. The results of a simulation study show that correction methods do not adequately correct bias in the estimation of regression coefficients and that the models with the power and reverse power links considered produce better results for certain types of imbalanced data. Additionally, we present an application for imbalanced data, identifying the best model among the various ones proposed. The parameters are estimated using a Bayesian approach based on the Hamiltonian Monte Carlo method with the No-U-Turn Sampler algorithm, and the models are compared using different criteria for model comparison, predictive evaluation and quantile residuals.

20.
ABSTRACT

We aim at analysing geostatistical and areal data observed over irregularly shaped spatial domains and having a distribution within the exponential family. We propose a generalized additive model that allows us to account for spatially varying covariate information. The model is fitted by maximizing a penalized log-likelihood function, with a roughness penalty term that involves a differential quantity of the spatial field, computed over the domain of interest. Efficient estimation of the spatial field is achieved by resorting to the finite element method, which provides a basis for piecewise polynomial surfaces. The proposed model is illustrated by an application to the study of criminality in the city of Portland, OR, USA.
