Similar Literature

20 similar documents retrieved.
1.
The integration of different data sources is a widely discussed topic among both researchers and official statistics agencies. Integrating data helps contain the costs and time required by new data collections. Non-parametric micro Statistical Matching (SM) makes it possible to integrate 'live' data using only the observed information, potentially avoiding misspecification bias and reducing the computational effort. Despite these advantages, there is no robust way to assess the goodness of an integration produced by this method. Moreover, many applications follow commonly accepted practices, such as using the biggest data set as the donor. We propose a validation strategy to assess integration goodness. We apply it to investigate these practices and to explore how different combinations of SM techniques and distance functions perform in terms of the reliability of the synthetic (complete) data set generated. The validation strategy exploits the relations existing among the variables before and after the integration. The results show that the 'the bigger, the better' rule should no longer be considered mandatory. Indeed, integration goodness increases with the variability of the matching variables rather than with the dimensionality ratio between the recipient and the donor data set.
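As a concrete illustration of the distance hot-deck step at the core of non-parametric micro SM, here is a minimal nearest-neighbour sketch; the Euclidean distance and the donor-based standardisation are our illustrative choices, not the paper's, which compares several distance functions:

```python
import numpy as np

def nn_match(recipient_X, donor_X, donor_Z):
    """Distance hot-deck: each recipient record receives the Z value of
    the donor record closest in the matching variables X. Euclidean
    distance on donor-standardised X is an illustrative choice."""
    recipient_X = np.asarray(recipient_X, float)
    donor_X = np.asarray(donor_X, float)
    mu, sd = donor_X.mean(axis=0), donor_X.std(axis=0)
    r, d = (recipient_X - mu) / sd, (donor_X - mu) / sd
    # pairwise distances: one row per recipient, one column per donor
    dist = np.linalg.norm(r[:, None, :] - d[None, :, :], axis=2)
    return np.asarray(donor_Z)[dist.argmin(axis=1)]
```

Swapping the donor and recipient roles, or the distance function, is then a one-line change, which is exactly the kind of comparison the validation strategy exercises.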

2.
Summary. We present a technique for extending generalized linear models to the situation where some of the predictor variables are observations from a curve or function. The technique is particularly useful when only fragments of each curve have been observed. We demonstrate, on both simulated and real data sets, how this approach can be used to perform linear, logistic and censored regression with functional predictors. In addition, we show how functional principal components can be used to gain insight into the relationship between the response and functional predictors. Finally, we extend the methodology to apply generalized linear models and principal components to standard missing data problems.
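A minimal sketch of the functional-principal-components route to a functional GLM, assuming fully observed curves on a common grid; handling the fragmentary curves that motivate the paper requires the more careful machinery described in the abstract:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fpca_logistic(curves, y, n_components=3):
    """Logistic regression on functional principal component scores.
    curves: (n, T) matrix of curves observed on a common grid (the
    fragment case of the paper needs more careful component estimation).
    Returns the fitted GLM and the component functions."""
    curves = np.asarray(curves, float)
    centred = curves - curves.mean(axis=0)
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ Vt[:n_components].T      # FPC scores per curve
    return LogisticRegression().fit(scores, y), Vt[:n_components]
```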

3.
This paper describes how embedded sequences of positive interpolatory integration rules (PIIRs) obtained from Gauss-Hermite product rules can be applied in Bayesian analysis. These embedded sequences are very promising for two major reasons. First, they provide a rich class of spatially distributed rules which are particularly useful in high dimensions. Second, they provide a way of producing more efficient integration strategies by enabling approximations to be updated sequentially through the addition of new nodes at each step rather than through changing to a completely new set of nodes. Moreover, as points are added successive rules change naturally from spatially distributed non-product rules to product rules. This feature is particularly attractive when the rules are used for the evaluation of marginal posterior densities. We illustrate the use of embedded sequences of PIIRs in two examples. These illustrate how embedded sequences can be applied to improve the efficiency of the adaptive integration strategy currently in use.
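For orientation, the Gauss-Hermite product rule from which the embedded PIIR sequences are derived can be sketched as follows; the exponential cost in the dimension visible here is precisely what the sequential node-adding strategy is designed to mitigate (illustrative code, not the paper's rules):

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def gh_posterior_expectation(g, mean, scale, n_nodes=10):
    """E[g(theta)] for theta ~ N(mean, diag(scale**2)) via a product
    Gauss-Hermite rule. The node count grows exponentially with the
    dimension, which is the cost the embedded PIIR sequences reduce."""
    mean, scale = np.asarray(mean, float), np.asarray(scale, float)
    x, w = hermgauss(n_nodes)                   # rule for exp(-x^2) weight
    xs = mean[:, None] + np.sqrt(2.0) * scale[:, None] * x
    d, total = len(mean), 0.0
    for idx in np.ndindex(*([n_nodes] * d)):
        theta = xs[np.arange(d), idx]
        total += np.prod(w[list(idx)]) * g(theta)
    return total / np.pi ** (d / 2)
```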

4.
Pattern Matching     
An important subtask of the pattern discovery process is pattern matching, where the pattern sought is already known and we want to determine how often and where it occurs in a sequence. In this paper we review the most practical techniques for finding patterns of different kinds. We show how regular expressions can be searched for with general techniques, and how simpler patterns can be dealt with more simply and efficiently. We consider exact as well as approximate pattern matching. We also cover both sequential searching, where the sequence cannot be preprocessed, and indexed searching, where a data structure built over the sequence is used to speed up the search.
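Two of the basic sequential-searching building blocks reviewed in this line of work, exact matching and approximate matching under an edit-distance bound (the classical O(mn) dynamic-programming scheme), can be sketched as:

```python
import re

def exact_occurrences(pattern, text):
    """Start positions of all (possibly overlapping) exact occurrences."""
    return [m.start() for m in re.finditer(f'(?={re.escape(pattern)})', text)]

def approx_occurrences(pattern, text, k):
    """End positions (1-based) where the pattern matches with at most k
    edits: the classical O(mn) dynamic-programming (Sellers) scheme,
    with a free starting position in the text."""
    m = len(pattern)
    prev = list(range(m + 1))          # column for the empty text prefix
    hits = []
    for j, c in enumerate(text, 1):
        curr = [0]                     # occurrence may begin anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            curr.append(min(prev[i] + 1, curr[i - 1] + 1, prev[i - 1] + cost))
        if curr[m] <= k:
            hits.append(j)
        prev = curr
    return hits
```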

5.
Abstract

Dominance analysis is a procedure for measuring the importance of predictors in multiple regression analysis. We show that dominance analysis can be enhanced using a dynamic programming approach for the rank-ordering of predictors. Using customer satisfaction data from a call center operation, we demonstrate how the integration of dominance analysis with dynamic programming can provide a better understanding of predictor importance. As a cautionary note, we recommend careful reflection on the relationship between predictor importance and variable subset selection. We observed that slight changes in the selected predictor subset can have an impact on the importance rankings produced by a dominance analysis.
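A minimal exhaustive computation of general dominance weights (each predictor's incremental R-squared, averaged within and then across subset sizes) looks like the following; the paper's dynamic-programming enhancement for rank-ordering is not reproduced here:

```python
from itertools import combinations
import numpy as np

def r2(X, y, cols):
    """R-squared of the OLS fit of y on the given columns (with intercept)."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return 1 - resid.var() / y.var()

def general_dominance(X, y):
    """General dominance weights via exhaustive enumeration of subsets,
    so only practical for a modest number of predictors."""
    p = X.shape[1]
    weights = np.zeros(p)
    for j in range(p):
        others = [c for c in range(p) if c != j]
        size_means = [np.mean([r2(X, y, list(S) + [j]) - r2(X, y, list(S))
                               for S in combinations(others, size)])
                      for size in range(p)]
        weights[j] = np.mean(size_means)
    return weights
```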

6.
Abstract. A stochastic epidemic model is defined in which each individual belongs to a household, a secondary grouping (typically school or workplace) and also the community as a whole. Moreover, infectious contacts take place in these three settings according to potentially different rates. For this model, we consider how different kinds of data can be used to estimate the infection rate parameters with a view to understanding what can and cannot be inferred. Among other things we find that temporal data can be of considerable inferential benefit compared with final size data, that the degree of heterogeneity in the data can have a considerable effect on inference for non-household transmission, and that inferences can be materially different from those obtained from a model with only two levels of mixing. We illustrate our findings by analysing a highly detailed dataset concerning a measles outbreak in Hagelloch, Germany.
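A toy discrete-time rendering of the three-level mixing structure (households, workplaces, community) is sketched below; the rates, the one-day infectious period, and the daily time step are illustrative assumptions, not the paper's continuous-time model or its inference machinery:

```python
import numpy as np

def simulate_three_level(households, workplaces, beta_h, beta_w, beta_c,
                         n_days=60, seed=1):
    """Toy discrete-time SIR with three levels of mixing. Each day a
    susceptible is infected with probability 1 - exp(-pressure), where
    pressure = beta_h * (infectives in own household)
             + beta_w * (infectives in own workplace/school)
             + beta_c * (infectives in the community) / N.
    One-day infectious period and daily step are simplifications."""
    rng = np.random.default_rng(seed)
    households, workplaces = np.asarray(households), np.asarray(workplaces)
    n = len(households)
    state = np.zeros(n, dtype=int)              # 0=S, 1=I, 2=R
    state[rng.integers(n)] = 1                  # one initial infective
    daily = []
    for _ in range(n_days):
        inf = state == 1
        I_h = np.bincount(households[inf], minlength=households.max() + 1)
        I_w = np.bincount(workplaces[inf], minlength=workplaces.max() + 1)
        pressure = (beta_h * I_h[households] + beta_w * I_w[workplaces]
                    + beta_c * inf.sum() / n)
        new = (state == 0) & (rng.random(n) < 1 - np.exp(-pressure))
        state[inf], state[new] = 2, 1
        daily.append(int(new.sum()))
    return daily
```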

7.
A variety of primary endpoints are used in clinical trials treating patients with severe infectious diseases, and existing guidelines do not provide a consistent recommendation. We propose to study two primary endpoints, cure and death, simultaneously in a comprehensive multistate cure-death model as the starting point for a treatment comparison. This technique enables us to study the temporal dynamics of the patient-relevant probability of being cured and alive. We describe and compare traditional and innovative methods suitable for a treatment comparison based on this model. Traditional analyses using risk differences focus on one prespecified timepoint only. A restricted logrank-based test of treatment effect is sensitive to ordered categories of responses and integrates information on duration of response. Pseudo-value regression provides a direct regression model for examining the treatment effect via differences in transition probabilities. Applying these methods to a topical real-data example and to simulation scenarios, we demonstrate their advantages and limitations and provide insight into how they handle different kinds of treatment imbalance. The cure-death model provides a suitable framework for understanding how a new treatment influences the time-dynamic cure and death process. This might help the future planning of randomised clinical trials, sample size calculations, and data analyses.
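The patient-relevant quantity in question, the probability of being cured and alive over time, has a trivial empirical estimate when there is no censoring; the sketch below shows that special case only, since handling censoring is precisely what the multistate machinery above is for:

```python
import numpy as np

def cure_alive_curve(t_cure, t_death, grid):
    """Empirical P(cured and alive at t) for fully observed data (no
    censoring): cured by t and death after t. Use np.inf in t_cure for
    patients who are never cured."""
    t_cure = np.asarray(t_cure, float)
    t_death = np.asarray(t_death, float)
    return np.array([((t_cure <= t) & (t_death > t)).mean() for t in grid])
```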

8.
We show how register data combined at person level with survey data can be used to conduct a novel type of nonresponse analysis in a panel survey. The availability of register data provides a unique opportunity to directly test the type of the missingness mechanism as well as to estimate the size of the bias due to initial nonresponse and attrition. We are also able to study in depth the determinants of initial nonresponse and attrition. We use the Finnish subset of the European Community Household Panel (FI ECHP) combined with register panel data, with unemployment spells as the outcome variables of interest. Our results show that initial nonresponse and attrition are clearly different processes driven by different background variables. Both the initial nonresponse and attrition mechanisms are nonignorable with respect to the analysis of unemployment spells. Finally, our results suggest that initial nonresponse may play a role at least as important as attrition in causing bias. This result challenges the common view of attrition as the main threat to the value of panel data.

9.
Multiple measurable biomarkers for a disease are used as indicators for studying the response variable of interest in order to monitor and model disease progression. However, it is common for subjects to drop out of studies prematurely, resulting in unbalanced data and hence complicating inferences involving such data. In this paper we consider a case where the data are unbalanced among subjects and also within a subject, because only a subset of the multiple outcomes of the response variable is observed at any one occasion. We propose a nonlinear mixed-effects model for the multivariate response data and derive a joint likelihood function that takes into account the partial dropout of the outcomes of the response variable. We further show how the methodology can be used to estimate the parameters that characterise HIV disease dynamics. An approximation technique for the parameters is also given and illustrated using a routine observational HIV dataset.
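The 'partial dropout' idea in its simplest linear-normal form: the likelihood contribution of an occasion uses only the marginal distribution of the observed outcome components. The paper's model is a nonlinear mixed-effects one, so this is only the basic building block:

```python
import numpy as np
from scipy.stats import multivariate_normal

def occasion_loglik(y, mu, Sigma):
    """Likelihood contribution of one occasion when only some components
    of the multivariate outcome are observed (NaN = missing): the
    marginal normal density of the observed components. Assumes at least
    one component is observed."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    Sigma = np.asarray(Sigma, float)
    obs = ~np.isnan(y)
    return multivariate_normal.logpdf(y[obs], mu[obs], Sigma[np.ix_(obs, obs)])
```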

10.
Abstract

In this article, we consider a panel data partially linear regression model with fixed effects and a nonparametric time trend function. The data may be cross-sectionally dependent through both the linear regressors and the error components. Unlike methods based on nonparametric smoothing, the proposed difference-based method estimates the linear regression coefficients of the model without bandwidth selection. Here the difference technique is employed to eliminate entirely the effect of the nonparametric trend function (though not the fixed effects) on the estimation of the linear coefficients. A more efficient estimator for the parametric part is therefore anticipated, which the simulation results confirm. For the nonparametric component, the polynomial spline technique is implemented. The asymptotic properties of the estimators for both the parametric and nonparametric parts are presented. We also show how to select informative covariates in the linear part by applying smoothly clipped absolute deviation (SCAD)-penalized estimation to a difference-based least-squares objective function; the resulting estimators perform asymptotically as well as the oracle procedure in terms of selecting the correct model.
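The classical difference-based idea for a partially linear model can be sketched as below; this plain cross-sectional version ignores the fixed effects and the cross-individual dependence that the paper's panel construction handles:

```python
import numpy as np

def difference_based_beta(t, X, y):
    """Difference-based estimator for y = X @ beta + f(t) + error: sort by
    t and difference adjacent observations so that the smooth trend f
    (nearly) cancels, then estimate beta by OLS on the differenced data."""
    t, X, y = np.asarray(t), np.asarray(X, float), np.asarray(y, float)
    order = np.argsort(t)
    dX, dy = np.diff(X[order], axis=0), np.diff(y[order])
    return np.linalg.lstsq(dX, dy, rcond=None)[0]
```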

11.
Meta-analytical approaches have been extensively used to analyze medical data. In most cases, the data come from different studies or independent trials with similar characteristics. However, these methods can be applied in a broader sense. In this paper, we show how existing meta-analytic techniques can also be used when dealing with parameters estimated from individual hierarchical data. Specifically, we propose to apply statistical methods that account for the variances (and possibly covariances) of such measures. The estimated parameters together with their estimated variances can be incorporated into a general linear mixed model framework. We illustrate the methodology using data from a first-in-man study and a simulated data set. The analysis was implemented with the SAS procedure MIXED and example code is offered.
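The simplest special case of this idea, fixed-effect inverse-variance pooling of the subject-level estimates, fits in a few lines; the paper itself embeds the estimates and their variances in a general linear mixed model (via SAS PROC MIXED), which this Python sketch does not reproduce:

```python
import numpy as np

def inverse_variance_pool(estimates, variances):
    """Fixed-effect meta-analytic pooling: weight each subject-level
    estimate by the inverse of its estimated variance."""
    est = np.asarray(estimates, float)
    w = 1.0 / np.asarray(variances, float)
    pooled = np.sum(w * est) / np.sum(w)
    return pooled, 1.0 / np.sum(w)   # pooled estimate and its variance
```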

12.
Ensemble methods using the same underlying algorithm trained on different subsets of observations have recently received increased attention as practical prediction tools for massive data sets. We propose Subsemble: a general subset ensemble prediction method, which can be used for small, moderate, or large data sets. Subsemble partitions the full data set into subsets of observations, fits a specified underlying algorithm on each subset, and uses a clever form of V-fold cross-validation to output a prediction function that combines the subset-specific fits. We give an oracle result that provides a theoretical performance guarantee for Subsemble. Through simulations, we demonstrate that Subsemble can be a beneficial tool for small- to moderate-sized data sets, and often has better prediction performance than the underlying algorithm fit just once on the full data set. We also describe how to include Subsemble as a candidate in a SuperLearner library, providing a practical way to evaluate the performance of Subsemble relative to the underlying algorithm fit just once on the full data set.
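A simplified Subsemble can be assembled from standard tools; this sketch uses plain stacking with V-fold cross-validation over the full data, whereas the paper's scheme aligns the folds within subsets:

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

def subsemble(X, y, base=DecisionTreeRegressor(max_depth=4),
              n_subsets=3, v=5, seed=0):
    """Simplified Subsemble: fit `base` on disjoint random subsets, then
    learn a linear combination of the subset-specific fits from V-fold
    cross-validated predictions (plain stacking, for brevity)."""
    rng = np.random.default_rng(seed)
    groups = rng.integers(n_subsets, size=len(y))
    Z = np.empty((len(y), n_subsets))           # level-one features
    for train, test in KFold(v, shuffle=True, random_state=seed).split(X):
        for j in range(n_subsets):
            idx = np.intersect1d(train, np.flatnonzero(groups == j))
            Z[test, j] = clone(base).fit(X[idx], y[idx]).predict(X[test])
    meta = LinearRegression().fit(Z, y)         # learn the combination
    fits = [clone(base).fit(X[groups == j], y[groups == j])
            for j in range(n_subsets)]
    return lambda Xn: meta.predict(
        np.column_stack([f.predict(Xn) for f in fits]))
```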

13.
Abstract. In this article we consider a problem from bone marrow transplant (BMT) studies where there is interest in assessing the effect of haplotype match between donor and patient on overall survival. The BMT study we consider is based on donors and patients that are genotype matched, and this therefore leads to a missing-data problem. We show how Aalen's additive risk model can be applied in this setting, with the benefit that the time-varying haplomatch effect can be easily studied. This problem has not been considered before, and the standard approach using the expectation-maximization (EM) algorithm cannot be applied to this model because the likelihood is hard to evaluate without additional assumptions. We suggest an approach based on multivariate estimating equations that are solved using a recursive structure. This approach leads to an estimator whose large-sample properties can be developed using product-integration theory. Small-sample properties are investigated using simulations in a setting that mimics the motivating haplomatch problem.
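For reference, the standard least-squares form of Aalen's additive estimator (fully observed covariates, distinct event times) is sketched below; the paper's contribution is the recursive estimating-equation extension to the missing-haplotype case, which this sketch does not attempt:

```python
import numpy as np

def aalen_additive(times, status, X):
    """Least-squares Aalen additive hazard estimator, assuming fully
    observed covariates and distinct event times: at each event time the
    increment of the cumulative regression function B(t) is
    (Xr' Xr)^{-1} Xr' dN(t), computed on the at-risk design matrix Xr."""
    times = np.asarray(times, float)
    status, X = np.asarray(status, bool), np.asarray(X, float)
    order = np.argsort(times)
    times, status, X = times[order], status[order], X[order]
    B, event_times = [np.zeros(X.shape[1])], []
    for i in np.flatnonzero(status):
        at_risk = times >= times[i]
        Xr = X[at_risk]
        dN = np.zeros(at_risk.sum())
        dN[0] = 1.0                  # subject i is first in the risk set
        B.append(B[-1] + np.linalg.lstsq(Xr, dN, rcond=None)[0])
        event_times.append(times[i])
    return np.array(event_times), np.array(B[1:])
```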

14.
15.
There has been considerable interest in studying the magnitude and type of inheritance of specific diseases. This is typically derived from family or twin studies, where the basic idea is to compare the correlation for pairs that share different proportions of their genes. Here we consider data from the Danish twin registry and discuss how to define heritability for cancer occurrence. The key point is that this should be done taking censoring as well as competing risks, due to, e.g., death, into account. We describe the dependence between twins on the probability scale and show that various models can be used to achieve sensible estimates of the dependence within monozygotic and dizygotic twin pairs that may vary over time. These dependence measures can subsequently be decomposed into genetic and environmental components using random-effects models. We present several novel models that in essence describe the association in terms of the concordance probability, i.e., the probability that both twins experience the event, in the competing-risks setting. We also discuss how to deal with the left truncation present in the Nordic twin registries, which arises because only twin pairs where both twins were alive at the initiation of the registries were sampled.
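Ignoring censoring and truncation (which the paper's models exist to handle), the casewise concordance that carries the basic heritability signal has a one-line empirical estimate:

```python
import numpy as np

def casewise_concordance(cause1, cause2, cause=1):
    """Casewise concordance for fully observed twin pairs: the probability
    that twin 2 has the event of interest given that twin 1 has it.
    cause1/cause2 hold competing-risk cause codes per pair (0 = none).
    Comparing this between MZ and DZ pairs is the basic heritability
    signal; censoring and left truncation are ignored in this sketch."""
    c1 = np.asarray(cause1) == cause
    c2 = np.asarray(cause2) == cause
    return (c1 & c2).mean() / c1.mean()
```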

16.
Estimating a curve nonparametrically from data measured with error is a difficult problem that has been studied by many authors. Constructing a consistent estimator in this context can sometimes be quite challenging, and in this paper we review some of the tools that have been developed in the literature for kernel-based approaches, founded on the Fourier transform and a more general unbiased score technique. We use those tools to rederive some of the existing nonparametric density and regression estimators for data contaminated by classical or Berkson errors, and discuss how to compute these estimators in practice. We also review some mistakes made by those working in the area, and highlight a number of problems with an existing R package decon.
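For concreteness, the classical Fourier-based deconvoluting kernel density estimator in the model W_j = X_j + U_j with known error density f_U is:

```latex
% Classical error model: W_j = X_j + U_j, with the error density f_U known.
% Deconvoluting kernel density estimator of f_X:
\hat f_X(x) = \frac{1}{nh} \sum_{j=1}^{n} K_U\!\left(\frac{x - W_j}{h}\right),
\qquad
K_U(u) = \frac{1}{2\pi} \int e^{-\mathrm{i} t u}\,
         \frac{\varphi_K(t)}{\varphi_U(t/h)} \, \mathrm{d}t .
```

Here φ_K and φ_U denote the Fourier transforms of the kernel K and of the error density; K is chosen so that φ_K has compact support, which keeps the integral defining K_U well defined.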

17.
In this paper we consider the linear compartment model and study estimation procedures for its different parameters. We discuss a method for obtaining initial estimators, which can be used to start any iterative procedure for computing the least-squares estimators. Four different types of confidence intervals are discussed and compared by computer simulation. We also propose different methods to estimate the number of components of the linear compartment model. One data set is used to show how the different methods work in practice.
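A minimal sketch of the iterative least-squares step, for a hypothetical two-compartment (bi-exponential) model with illustrative starting values standing in for the paper's initial estimators:

```python
import numpy as np
from scipy.optimize import curve_fit

def two_compartment(t, a1, l1, a2, l2):
    """Hypothetical two-compartment mean response (sum of exponentials)."""
    return a1 * np.exp(-l1 * t) + a2 * np.exp(-l2 * t)

t = np.linspace(0.1, 10, 50)                     # simulated design points
y = two_compartment(t, 3.0, 1.5, 1.0, 0.2) \
    + np.random.default_rng(0).normal(0, 0.05, t.size)
p0 = [2.0, 1.0, 1.0, 0.1]                        # illustrative initial values
params, cov = curve_fit(two_compartment, t, y, p0=p0)
se = np.sqrt(np.diag(cov))                       # basis for Wald-type intervals
```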

18.
An exploratory model analysis device we call CDF knotting is introduced. It is a technique we have found useful for exploring relationships between points in the parameter space of a model and global properties of the associated distribution functions. It can be used to alert the model builder to a condition we call lack of distinguishability, which is to nonlinear models what multicollinearity is to linear models. While there are simple remedial actions for multicollinearity in linear models, techniques such as deleting redundant variables do not have obvious parallels for nonlinear models. In some of these nonlinear situations, however, CDF knotting may lead to alternative models with fewer parameters whose distribution functions are very similar to those of the original overparameterized model. We also show how CDF knotting can be exploited as a mathematical tool for deriving limiting distributions, and we illustrate the technique for the 3-parameter Weibull family, obtaining limiting forms and moment ratios which correct and extend previously published results. Finally, geometric insights obtained by CDF knotting are verified relative to data fitting and estimation.
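An exploratory computation in the spirit of CDF knotting might look as follows; the parameter path is illustrative, not taken from the paper:

```python
import numpy as np
from scipy.stats import weibull_min

# Scan a path in the 3-parameter Weibull space (shape c, location, scale;
# all values here are illustrative) and check how close the implied CDFs
# stay to a reference CDF. Near-coincident CDFs for well-separated
# parameter points signal lack of distinguishability.
x = np.linspace(0.01, 12, 500)
base = weibull_min.cdf(x, 2.0, loc=0.0, scale=3.0)
for c, loc, scale in [(2.5, -0.55, 3.6), (3.0, -1.10, 4.2), (4.0, -2.10, 5.3)]:
    F = weibull_min.cdf(x, c, loc=loc, scale=scale)
    print(f"c={c}: max |F - F_base| = {np.abs(F - base).max():.3f}")
```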

19.
This paper shows how recursive integration methodologies can be used to evaluate high-dimensional integral expressions. This has applications to many areas of statistical inference where probability calculations and critical point evaluations often require such high-dimensional integral evaluations. Recursive integration can allow an integral expression of a given dimension to be evaluated by a series of calculations of a smaller dimension. This significantly reduces the computation time. The application of the recursive integration methodology is illustrated with several examples.
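The mechanics of recursive integration are easy to sketch for an integrand with a chain (product) structure, where a d-dimensional integral collapses into d - 1 one-dimensional ones (illustrative code, not the paper's examples):

```python
import numpy as np

def recursive_chain_integral(g, a, b, d, n=201):
    """Integral of prod_{k=1}^{d-1} g(x_k, x_{k+1}) over [a, b]^d by
    recursion: f_1 = 1 and f_{k+1}(x) = integral of g(y, x) f_k(y) dy,
    so only one-dimensional (here trapezoidal) quadratures are needed."""
    x = np.linspace(a, b, n)
    w = np.full(n, (b - a) / (n - 1))
    w[0] = w[-1] = w[0] / 2                      # trapezoidal weights
    G = g(x[:, None], x[None, :])                # G[i, j] = g(x_i, x_j)
    f = np.ones(n)
    for _ in range(d - 1):
        f = G.T @ (w * f)                        # one 1-D integral per step
    return w @ f

# e.g. recursive_chain_integral(lambda u, v: np.exp(-(u - v) ** 2), 0, 1, d=6)
```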

20.
It is often desirable to select a subset of regression variables so as to maximise the accuracy of prediction at a pre-specified point. There are a variety of possible mean-square-error-type criteria which could be used to measure the accuracy of prediction and hence to select an optimal subset. We shall show how these can easily be included in existing stepwise regression codes. The performance of the criteria is compared on a data set, where it becomes obvious that not only do different criteria give rise to different subsets at the same prediction point, but the same criterion quite commonly gives rise to different subsets at different prediction points. Thus the choice of a criterion has a major effect on the subset selected, and so requires conscious selection.
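One simple member of the MSE-type family, estimated prediction variance at x0 plus a squared-bias proxy relative to the full model, can be wired into an exhaustive search as follows (the criterion choice here is ours, for illustration):

```python
from itertools import combinations
import numpy as np

def best_subset_at_point(X, y, x0):
    """Exhaustive search for the subset minimising an estimated MSE of
    prediction at x0: prediction variance (scaled by the full-model error
    variance) plus the squared deviation of the subset prediction from
    the full-model prediction, used here as a bias proxy."""
    X, y, x0 = np.asarray(X, float), np.asarray(y, float), np.asarray(x0, float)
    n, p = X.shape
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.sum((y - X @ beta_full) ** 2) / (n - p)
    yhat_full = x0 @ beta_full
    best, best_S = np.inf, None
    for size in range(1, p + 1):
        for S in combinations(range(p), size):
            cols = list(S)
            XS, x0S = X[:, cols], x0[cols]
            betaS, *_ = np.linalg.lstsq(XS, y, rcond=None)
            crit = (s2 * x0S @ np.linalg.pinv(XS.T @ XS) @ x0S
                    + (x0S @ betaS - yhat_full) ** 2)
            if crit < best:
                best, best_S = crit, S
    return best_S, best
```

Re-running the search at a different x0 will, as the abstract notes, quite commonly return a different subset.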
