Similar Documents (20 results)
1.
2.
Statistical agencies have conflicting obligations to protect confidential information provided by respondents to surveys or censuses and to make data available for research and planning activities. When the microdata themselves are to be released, in order to achieve these conflicting objectives, statistical agencies apply statistical disclosure limitation (SDL) methods to the data, such as noise addition, swapping or microaggregation. Some of these methods do not preserve important structure and constraints in the data, such as positivity of some attributes or inequality constraints between attributes. Failure to preserve constraints is not only problematic in terms of data utility, but also may increase disclosure risk. In this paper, we describe a method for SDL that preserves both positivity of attributes and the mean vector and covariance matrix of the original data. The basis of the method is to apply multiplicative noise with the proper, data-dependent covariance structure.
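Below is a minimal, hypothetical Python sketch of multiplicative noise masking. It only illustrates the generic mechanism — unit-mean lognormal multipliers, which keep masked values positive and leave means unchanged in expectation; the paper's data-dependent noise covariance, which additionally preserves the covariance matrix exactly, is not reproduced here.

```python
import numpy as np

def multiplicative_noise_mask(X, noise_var=0.05, seed=0):
    """Mask a positive data matrix X (n x p) with unit-mean lognormal
    multiplicative noise.  Positivity is preserved exactly and means are
    preserved in expectation.  (The paper's data-dependent noise covariance,
    which also preserves the covariance matrix, is not implemented here.)"""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    sigma2 = np.log(1.0 + noise_var)          # lognormal parameter so Var(M) = noise_var
    # E[M] = 1 when log M ~ N(-sigma2/2, sigma2)
    M = rng.lognormal(mean=-sigma2 / 2.0, sigma=np.sqrt(sigma2), size=(n, p))
    return X * M

# toy example with strictly positive attributes
X = np.abs(np.random.default_rng(1).normal(10.0, 2.0, size=(500, 3)))
X_masked = multiplicative_noise_mask(X)
print(X.mean(axis=0), X_masked.mean(axis=0))   # close on average
print((X_masked > 0).all())                    # positivity preserved
```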

3.
In this paper, we discuss a parsimonious approach to estimation of high-dimensional covariance matrices via the modified Cholesky decomposition with lasso. Two different methods are proposed. They are the equi-angular and equi-sparse methods. We use simulation to compare the performance of the proposed methods with others available in the literature, including the sample covariance matrix, the banding method, and the L1-penalized normal loglikelihood method. We then apply the proposed methods to a portfolio selection problem using 80 series of daily stock returns. To facilitate the use of lasso in high-dimensional time series analysis, we develop the dynamic weighted lasso (DWL) algorithm that extends the LARS-lasso algorithm. In particular, the proposed algorithm can efficiently update the lasso solution as new data become available. It can also add or remove explanatory variables. The entire solution path of the L1-penalized normal loglikelihood method is also constructed.
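As a rough illustration (not the authors' equi-angular or equi-sparse schemes, and not their DWL algorithm), the sketch below estimates a covariance matrix through the modified Cholesky decomposition with lasso-penalized autoregressions, using scikit-learn's ordinary Lasso for each row.

```python
import numpy as np
from sklearn.linear_model import Lasso

def cholesky_lasso_cov(X, lam=0.1):
    """Modified Cholesky covariance estimate: regress each variable on its
    predecessors with an L1 penalty, collect coefficients in a unit
    lower-triangular T and residual variances in D; then Sigma^{-1} = T' D^{-1} T."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    T = np.eye(p)
    d = np.empty(p)
    d[0] = Xc[:, 0].var()
    for j in range(1, p):
        fit = Lasso(alpha=lam, fit_intercept=False).fit(Xc[:, :j], Xc[:, j])
        T[j, :j] = -fit.coef_
        resid = Xc[:, j] - Xc[:, :j] @ fit.coef_
        d[j] = resid.var()
    prec = T.T @ np.diag(1.0 / d) @ T          # Sigma^{-1}
    return np.linalg.inv(prec)                 # Sigma

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(5), np.eye(5) + 0.3, size=200)
print(cholesky_lasso_cov(X).round(2))
```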

4.
Box and Behnken [1958. Some new three level second-order designs for surface fitting. Statistical Technical Research Group Technical Report No. 26. Princeton University, Princeton, NJ; 1960. Some new three level designs for the study of quantitative variables. Technometrics 2, 455–475.] introduced a class of 3-level second-order designs for fitting the second-order response surface model. These 17 Box–Behnken designs (BB designs) are available for 3–12 and 16 factors. Although BB designs were developed nearly 50 years ago, they and the central-composite designs of Box and Wilson [1951. On the experimental attainment of optimum conditions. J. Royal Statist. Soc., Ser. B 13, 1–45.] are still the most often recommended response surface designs. Of the 17 aforementioned BB designs, 10 were constructed from balanced incomplete block designs (BIBDs) and seven were constructed from partially BIBDs (PBIBDs). In this paper we show that these seven BB designs constructed from PBIBDs can be improved in terms of rotatability as well as average prediction variance, D- and G-efficiency. In addition, we also report new orthogonally blocked solutions for 5, 8, 9, 11 and 13 factors. Note that an 11-factor BB design is available but cannot be orthogonally blocked. All new designs can be found at http://www.math.montana.edu/jobo/bbd/.

5.
This article advocates the following perspective: When confronting a scientific problem, the field of statistics enters by viewing the problem as one where the scientific answer could be calculated if some missing data, hypothetical or real, were available. Thus, statistical effort should be devoted to three steps:
  1. formulate the missing data that would allow this calculation,
  2. stochastically fill in these missing data, and
  3. do the calculations as if the filled-in data were available.
This presentation discusses: conceptual benefits, such as for causal inference using potential outcomes; computational benefits, such as afforded by using the EM algorithm and related data augmentation methods based on MCMC; and inferential benefits, such as valid interval estimation and assessment of assumptions based on multiple imputation.
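A toy illustration of the three steps, using simple normal-model multiple imputation and Rubin's combining rules for a mean with values missing completely at random (the function name and the deliberately crude imputation model below are hypothetical):

```python
import numpy as np

def mi_mean(y, m=20, seed=0):
    """Step 1: the 'missing data' are the unobserved entries of y.
    Step 2: fill them in stochastically from a normal fitted to the observed
            values (a proper implementation would also draw the parameters
            from their posterior).
    Step 3: compute the complete-data estimate on each filled-in data set,
            then combine across imputations with Rubin's rules."""
    rng = np.random.default_rng(seed)
    obs = y[~np.isnan(y)]
    mu_hat, sd_hat = obs.mean(), obs.std(ddof=1)
    n_mis = np.isnan(y).sum()
    ests, variances = [], []
    for _ in range(m):
        y_fill = y.copy()
        y_fill[np.isnan(y_fill)] = rng.normal(mu_hat, sd_hat, n_mis)
        ests.append(y_fill.mean())
        variances.append(y_fill.var(ddof=1) / len(y_fill))
    ests, variances = np.array(ests), np.array(variances)
    within, between = variances.mean(), ests.var(ddof=1)
    total_var = within + (1 + 1 / m) * between        # Rubin's rules
    return ests.mean(), np.sqrt(total_var)

y = np.random.default_rng(1).normal(5.0, 1.0, 200)
y[np.random.default_rng(2).random(200) < 0.3] = np.nan   # 30% MCAR
print(mi_mean(y))
```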

6.
Methods have been developed by several authors to address the problem of bias in regression coefficients due to errors in exposure measurement. These approaches typically assume that there is one surrogate for each exposure. Occupational exposures are quite complex and are often described by characteristics of the workplace and the amount of time that one has worked in a particular area. In this setting, there are several surrogates which are used to define an individual's exposure. To analyze this type of data, regression calibration methodology is extended to adjust the estimates of exposure-response associations for the bias and additional uncertainty due to exposure measurement error from multiple surrogates. The health outcome is assumed to be binary and related to the quantitative measure of exposure by a logistic link function. The model for the conditional mean of the quantitative exposure measurement in relation to job characteristics is assumed to be linear. This approach is applied to a cross-sectional epidemiologic study of lung function in relation to metal working fluid exposure and the corresponding exposure assessment study with quantitative measurements from personal monitors. A simulation study investigates the performance of the proposed estimator for various values of the baseline prevalence of disease, exposure effect and measurement error variance. The efficiency of the proposed estimator relative to the one proposed by Carroll et al. [1995. Measurement Error in Nonlinear Models. Chapman & Hall, New York] is evaluated numerically for the motivating example. User-friendly and fully documented Splus and SAS routines implementing these methods are available (http://www.hsph.harvard.edu/faculty/spiegelman/multsurr.html).
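The sketch below illustrates generic two-stage regression calibration with one quantitative exposure and several surrogates: fit the linear measurement model on the exposure-assessment data, predict exposure in the main study, and plug the prediction into the logistic outcome model. It is not the authors' estimator, omits the correction of standard errors for the extra uncertainty from the calibration step, and all variable names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# --- exposure-assessment (validation) study: quantitative exposure + surrogates
n_val = 300
W_val = rng.normal(size=(n_val, 3))                 # job-characteristic surrogates
x_val = 1.0 + W_val @ np.array([0.8, 0.3, -0.5]) + rng.normal(0, 0.7, n_val)

# Stage 1: linear model for E[X | surrogates]
calib = sm.OLS(x_val, sm.add_constant(W_val)).fit()

# --- main (health) study: only surrogates and a binary outcome are observed
n_main = 1000
W_main = rng.normal(size=(n_main, 3))
x_true = 1.0 + W_main @ np.array([0.8, 0.3, -0.5]) + rng.normal(0, 0.7, n_main)
p = 1.0 / (1.0 + np.exp(-(-2.0 + 0.6 * x_true)))
y = rng.binomial(1, p)

# Stage 2: plug the predicted exposure into the logistic outcome model
x_hat = calib.predict(sm.add_constant(W_main))
outcome = sm.Logit(y, sm.add_constant(x_hat)).fit(disp=0)
print(outcome.params)     # slope approximates the exposure-response log odds ratio
```

In practice the standard error of the slope would also need to be adjusted (e.g., by the bootstrap or a sandwich-type correction) for the uncertainty in the fitted calibration model.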

7.
In this paper, we describe an overall strategy for robust estimation of multivariate location and shape, and the consequent identification of outliers and leverage points. Parts of this strategy have been described in a series of previous papers (Rocke, Ann. Statist., in press; Rocke and Woodruff, Statist. Neerlandica 47 (1993), 27–42, J. Amer. Statist. Assoc., in press; Woodruff and Rocke, J. Comput. Graphical Statist. 2 (1993), 69–95; J. Amer. Statist. Assoc. 89 (1994), 888–896) but the overall structure is presented here for the first time. After describing the first-level architecture of a class of algorithms for this problem, we review available information about possible tactics for each major step in the process. The major steps that we have found to be necessary are as follows: (1) partition the data into groups of perhaps five times the dimension; (2) for each group, search for the best available solution to a combinatorial estimator such as the Minimum Covariance Determinant (MCD) — these are the preliminary estimates; (3) for each preliminary estimate, iterate to the solution of a smooth estimator chosen for robustness and outlier resistance; and (4) choose among the final iterates based on a robust criterion, such as minimum volume. Use of this algorithm architecture can enable reliable, fast, robust estimation of heavily contaminated multivariate data in high (> 20) dimension even with large quantities of data. A computer program implementing the algorithm is available from the authors.
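A compressed sketch of the four-step architecture (partition, per-group MCD starts, refinement, minimum-volume selection), here using scikit-learn's MinCovDet for the preliminary estimates and a simple chi-square reweighting pass in place of the authors' smooth estimator; it is an assumption-laden toy, not the authors' program.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def partitioned_robust_estimate(X, seed=0):
    n, p = X.shape
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    group_size = max(5 * p, p + 1)                     # step 1: partition
    groups = [idx[i:i + group_size] for i in range(0, n, group_size)
              if len(idx[i:i + group_size]) > p]

    candidates = []
    cutoff = chi2.ppf(0.975, df=p)
    for g in groups:
        # step 2: preliminary estimate from MCD on the subgroup
        mcd = MinCovDet(random_state=0).fit(X[g])
        mu, S = mcd.location_, mcd.covariance_
        # step 3 (simplified): one reweighting pass on the full data
        d2 = np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(S), X - mu)
        keep = d2 <= cutoff
        mu, S = X[keep].mean(axis=0), np.cov(X[keep], rowvar=False)
        candidates.append((mu, S))

    # step 4: choose the candidate of minimum volume (smallest determinant)
    return min(candidates, key=lambda c: np.linalg.det(c[1]))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (900, 4)), rng.normal(8, 1, (100, 4))])  # 10% outliers
mu, S = partitioned_robust_estimate(X)
print(mu.round(2))
```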

8.
In this paper we consider the long-run availability of a parallel system having several independent renewable components with exponentially distributed failure and repair times. We are interested in testing availability of the system or constructing a lower confidence bound for the availability by using component test data. For this problem, there is no exact test or confidence bound available and only approximate methods are available in the literature. Using the generalized p-value approach, an exact test and a generalized confidence interval are given. An example is given to illustrate the proposed procedures. A simulation study is given to demonstrate their advantages over the other available approximate procedures. Based on type I and type II error rates, the simulation study shows that the generalized procedures outperform the other available methods.

9.
This article advocates the following perspective: When confronting a scientific problem, the field of statistics enters by viewing the problem as one where the scientific answer could be calculated if some missing data, hypothetical or real, were available. Thus, statistical effort should be devoted to three steps:
1.  formulate the missing data that would allow this calculation,
2.  stochastically fill in these missing data, and
3.  do the calculations as if the filled-in data were available.
This presentation discusses: conceptual benefits, such as for causal inference using potential outcomes; computational benefits, such as afforded by using the EM algorithm and related data augmentation methods based on MCMC; and inferential benefits, such as valid interval estimation and assessment of assumptions based on multiple imputation. JEL classification: C10, C14, C15

10.
This article deals with Bayesian inference and prediction for M/G/1 queueing systems. The general service time density is approximated with a class of Erlang mixtures which are phase-type distributions. Given this phase-type approximation, an explicit evaluation of measures such as the stationary queue size, waiting time and busy period distributions can be obtained. Given arrival and service data, a Bayesian procedure based on reversible jump Markov Chain Monte Carlo methods is proposed to estimate system parameters and predictive distributions.

11.
For attribute data with (very) small failure rates, control charts are often used which decide whether to stop or to continue each time r failures have occurred, for some r ≥ 1. Because of the small probabilities involved, such charts are very sensitive to estimation effects. This is true in particular if the underlying failure rate varies and hence the distributions involved are not geometric. Such a situation calls for a nonparametric approach, but this may require far more Phase I observations than are typically available in practice. In the present paper it is shown how this obstacle can be effectively overcome by looking not at the sum but rather at the maximum of each group of size r.
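One hypothetical, minimal way to operationalize the idea with geometric inter-failure counts (the paper's nonparametric Phase I limit-setting and its run-length analysis are not reproduced): chart the maximum of each group of r counts and signal when it falls at or below a lower limit taken as a low empirical quantile of the Phase I group maxima.

```python
import numpy as np

def group_max_chart(phase1_counts, phase2_counts, r=5, alpha=0.01):
    """Lower control limit = empirical alpha-quantile of Phase I group maxima;
    signal whenever a Phase II group maximum falls at or below that limit
    (small counts between failures indicate a deteriorated process)."""
    p1 = np.asarray(phase1_counts)
    p1_max = p1[:len(p1) // r * r].reshape(-1, r).max(axis=1)
    lcl = np.quantile(p1_max, alpha)

    p2 = np.asarray(phase2_counts)
    p2_max = p2[:len(p2) // r * r].reshape(-1, r).max(axis=1)
    return lcl, np.flatnonzero(p2_max <= lcl)

rng = np.random.default_rng(0)
in_control = rng.geometric(1e-3, size=5000)           # Phase I: counts between failures
deteriorated = rng.geometric(5e-3, size=200)          # Phase II: failure rate increased
lcl, signals = group_max_chart(in_control, deteriorated)
print(lcl, signals)                                   # most deteriorated groups signal
```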

12.
In this paper we further consider the problem of determining optimal block designs which can be used to compare v test treatments to a standard treatment in experimental situations where the available experimental units are to be arranged in b blocks of size k. A design is said to be MV-optimal in such an experimental setting if it minimizes the maximal variance with which treatment differences involving the standard treatment are estimated. We derive some further sufficient conditions for a design to be MV-optimal in an experimental situation such as described above.

13.
Various approximate methods have been proposed for obtaining a two-tailed confidence interval for the ratio R of two proportions (independent samples). This paper evaluates 73 different methods (64 of which are new methods or modifications of older methods) and concludes that: (1) none of the classic methods (including the well-known score method) is acceptable, since they are too liberal; (2) the best of the classic methods is the one based on the logarithmic transformation (after increasing the data by 0.5), but it is only valid for large samples and moderate values of R; (3) the best of the 73 methods is based on an approximation to the score method (after adding 0.5 to all the data), with the added advantage that the interval is obtained by a simple method (i.e. solving a second-degree equation); and (4) an option that is simpler than the previous one, and almost as effective for moderate values of R, consists of applying the classic Wald method (after adding a quantity to the data which is usually $z_{\alpha/2}^{2}/4$).
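For reference, a minimal implementation of the classic logarithmic-transformation interval mentioned in (2), with 0.5 added to every count (one common variant of the adjustment; the score-approximation interval the paper actually recommends, and the adjusted Wald variant using $z_{\alpha/2}^{2}/4$, are not reproduced here):

```python
import numpy as np
from scipy.stats import norm

def ratio_ci_log(x1, n1, x2, n2, conf=0.95):
    """Katz-type CI for R = p1/p2 on the log scale, after adding 0.5 to all
    counts (independent binomial samples x1/n1 and x2/n2)."""
    x1, n1, x2, n2 = (v + 0.5 for v in (x1, n1, x2, n2))
    r_hat = (x1 / n1) / (x2 / n2)
    se_log = np.sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)   # delta-method SE of log R
    z = norm.ppf(0.5 + conf / 2)
    return r_hat * np.exp(-z * se_log), r_hat * np.exp(z * se_log)

print(ratio_ci_log(12, 50, 5, 60))   # toy data: 12/50 vs 5/60
```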

14.
In this paper, we examine the potential determinants of foreign direct investment. For this purpose, we apply new exact subset selection procedures, which are based on idealized assumptions, as well as their possibly more plausible empirical counterparts to an international data set to select the optimal set of predictors. Unlike the standard model selection procedures AIC and BIC, which penalize only the number of variables included in a model, and the subset selection procedures RIC and MRIC, which consider also the total number of available candidate variables, our data-specific procedures even take the correlation structure of all candidate variables into account. Our main focus is on a new procedure, which we have designed for situations where some of the potential predictors are certain to be included in the model. For a sample of 73 developing countries, this procedure selects only four variables, namely imports, net income from abroad, gross capital formation, and GDP per capita. An important secondary finding of our study is that the data-specific procedures, which are based on extensive simulations and are therefore very time-consuming, can be approximated reasonably well by the much simpler exact methods.

15.
In this paper, we propose maximum entropy in the mean methods for propensity score matching classification problems. We provide a new methodological approach and estimation algorithms to handle explicitly cases where data are available: (i) in interval form; (ii) with bounded measurement or observational errors; or (iii) both as intervals and with bounded errors. We show that entropy in the mean methods for these three cases generally outperform benchmark error-free approaches.

16.
We propose a Bayesian implementation of the lasso regression that accomplishes both shrinkage and variable selection. We focus on the appropriate specification for the shrinkage parameter λ through Bayes factors that evaluate the inclusion of each covariate in the model formulation. We associate this parameter with the values of Pearson and partial correlation at the limits between significance and insignificance as defined by Bayes factors. In this way, a meaningful interpretation of λ is achieved that leads to a simple specification of this parameter. Moreover, we use these values to specify the parameters of a gamma hyperprior for λ. The parameters of the hyperprior are elicited such that appropriate levels of practical significance of the Pearson correlation are achieved and, at the same time, the prior support of λ values that activate the Lindley-Bartlett paradox or lead to over-shrinkage of model coefficients is avoided. The proposed method is illustrated using two simulation studies and a real dataset. For the first simulation study, results for different prior values of λ are presented as well as a detailed robustness analysis concerning the parameters of the hyperprior of λ. In all examples, detailed comparisons with a variety of ordinary and Bayesian lasso methods are presented.

17.
When a large amount of spatial data is available, computational and modeling challenges arise; they are often labeled the “big n problem”. In this work we present a brief review of the literature. We then focus on two approaches, one based on stochastic partial differential equations and the integrated nested Laplace approximation, and the other on tapering of the spatial covariance matrix. The fitting and predictive abilities of the two methods, used in conjunction with kriging interpolation, are compared in a simulation study.
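As a small illustration of the second approach — tapering the spatial covariance matrix so that it becomes sparse — the sketch below multiplies an exponential covariance elementwise by a Wendland-type compactly supported taper and uses the sparse result for simple kriging weights. This is generic tapering under assumed parameter values, not the article's simulation set-up.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import spsolve

def exp_cov(d, sigma2=1.0, phi=0.2):
    return sigma2 * np.exp(-d / phi)

def wendland1_taper(d, theta=0.15):
    """Wendland-1 taper: (1 - d/theta)_+^4 (4 d/theta + 1), zero beyond theta."""
    t = np.clip(1.0 - d / theta, 0.0, None)
    return t**4 * (4.0 * d / theta + 1.0)

rng = np.random.default_rng(0)
sites = rng.random((2000, 2))                 # observation locations in [0,1]^2
s0 = np.array([[0.5, 0.5]])                   # prediction location

D = cdist(sites, sites)
C_tap = csc_matrix(exp_cov(D) * wendland1_taper(D))   # sparse tapered covariance
c0 = exp_cov(cdist(sites, s0)) * wendland1_taper(cdist(sites, s0))

weights = spsolve(C_tap, c0.ravel())          # simple-kriging weights via sparse solve
print(C_tap.nnz / D.size)                     # fraction of nonzeros retained
```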

18.
The analysis of survival endpoints subject to right-censoring is an important research area in statistics, particularly among econometricians and biostatisticians. The two most popular semiparametric models are the proportional hazards model and the accelerated failure time (AFT) model. Rank-based estimation in the AFT model is computationally challenging due to optimization of a non-smooth loss function. Previous work has shown that rank-based estimators may be written as solutions to linear programming (LP) problems. However, the size of the LP problem is O(n² + p) subject to n² linear constraints, where n denotes sample size and p denotes the dimension of parameters. As n and/or p increases, the feasibility of such a solution in practice becomes questionable. Among data mining and statistical learning enthusiasts, there is interest in extending ordinary regression coefficient estimators for low dimensions into high-dimensional data mining tools through regularization. Applying this recipe to rank-based coefficient estimators leads to formidable optimization problems which may be avoided through smooth approximations to non-smooth functions. We review smooth approximations and quasi-Newton methods for rank-based estimation in AFT models. The computational cost of our method is substantially smaller than that of the corresponding LP problem, and the method can be applied to small- or large-scale problems alike. The algorithm described here allows one to couple rank-based estimation for censored data with virtually any regularization and is exemplified through four case studies.
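As a generic illustration of the idea (not the authors' specific smoothing or regularization), the sketch below replaces the non-smooth Gehan rank loss for the AFT model with a softplus approximation and minimizes it with a quasi-Newton method from SciPy; all names and the simulated data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def smoothed_gehan_loss(beta, logT, X, delta, h=0.05):
    """Gehan loss sum_{i,j} delta_i * max(e_j - e_i, 0) with e = logT - X beta,
    smoothed via softplus h*log(1 + exp(u/h)) so quasi-Newton methods apply."""
    e = logT - X @ beta
    diff = (e[None, :] - e[:, None]) / h          # (e_j - e_i) / h
    softplus = h * np.logaddexp(0.0, diff)        # smooth surrogate for max(., 0)
    return np.sum(delta[:, None] * softplus) / len(e)**2

rng = np.random.default_rng(0)
n, p = 300, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.25])
logT = X @ beta_true + rng.normal(0, 0.5, n)       # AFT model on the log scale
logC = rng.normal(1.0, 1.0, n)                     # censoring times
delta = (logT <= logC).astype(float)               # 1 = event observed
logY = np.minimum(logT, logC)                      # observed (possibly censored) log times

fit = minimize(smoothed_gehan_loss, x0=np.zeros(p),
               args=(logY, X, delta), method='BFGS')
print(fit.x.round(2))                              # roughly recovers beta_true
```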

19.
This article describes a recursive nonparametric estimator of the local partial first derivative of an arbitrary function satisfying some regularity conditions, and establishes its consistency and asymptotic normality under a strong-mixing assumption. The proposed estimator is a variable-window-width version of the Watson–Nadaraya type derivative estimator. Varying the window width as more data points become available enables a recursive algorithm that reduces the computational complexity from order N³, normally required by batch methods for kernel regression, to order N². The approach is computationally simple and attractive from a practical viewpoint, especially when the situation calls for frequent updating of first-derivative estimates; maintaining a delta-hedged position of a portfolio of equities with index options is one of many applications of such estimation.
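For orientation only, a plain batch, fixed-bandwidth Gaussian-kernel version of the Nadaraya–Watson first-derivative estimator; the recursive, variable-window-width updating that yields the O(N²) cost is not implemented in this sketch.

```python
import numpy as np

def nw_derivative(x0, X, Y, h=0.3):
    """Derivative of the Nadaraya-Watson estimate m(x) = S1(x)/S0(x) with a
    Gaussian kernel, obtained by differentiating the ratio analytically:
    m'(x) = (S1'(x) S0(x) - S1(x) S0'(x)) / S0(x)^2."""
    u = (x0 - X) / h
    K = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)   # kernel weights K_h(x0 - X_i)
    dK = -(x0 - X) / h**2 * K                            # d/dx0 of the kernel weights
    S0, S1 = K.sum(), (K * Y).sum()
    dS0, dS1 = dK.sum(), (dK * Y).sum()
    return (dS1 * S0 - S1 * dS0) / S0**2

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, 2000)
Y = np.sin(X) + rng.normal(0, 0.1, 2000)
print(nw_derivative(0.5, X, Y), np.cos(0.5))   # estimate vs. true derivative
```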

20.
Profile data arise when the quality of a product or process is characterized by a functional relationship among (input and output) variables. In this paper, we focus on the case where each profile has one response variable Y, one explanatory variable x, and a functional relationship between these two variables that can be rather arbitrary; the basic concept can, however, be applied much more widely. We propose a general method based on the Generalized Likelihood Ratio Test (GLRT) for monitoring of profile data. The proposed method uses nonparametric regression to estimate the on-line profiles and thus does not require any functional form for the profiles. Both Shewhart-type and EWMA-type control charts are considered. The average run length (ARL) performance of the proposed method is studied. It is shown that the proposed GLRT-based control chart can efficiently detect both location and dispersion shifts of the on-line profiles from the baseline profile. An upper control limit (UCL) corresponding to a desired in-control ARL value is constructed.

