Similar literature (20 results)
1.
In the analysis of time-to-event data, restricted mean survival time has been well investigated in the literature and is provided by many commercial software packages, while calculating mean survival time remains a challenge due to censoring or insufficient follow-up time. Several researchers have proposed a hybrid estimator of mean survival based on the Kaplan–Meier curve with an extrapolated tail. However, this approach often yields a biased estimate, both because the parameters of the extrapolated “tail” are poorly estimated and because the tail of the Kaplan–Meier curve is highly variable when few patients remain at risk. Two key challenges in this approach are (1) where the extrapolation should start and (2) how to estimate the parameters of the extrapolated tail. The authors propose a novel approach to calculating mean survival time that addresses these two challenges. In the proposed approach, an algorithm searches for time points where the hazard rate changes significantly. The survival function is estimated by the Kaplan–Meier method prior to the last change point and approximated by an exponential function beyond the last change point, whose parameter is estimated locally. Mean survival time is derived from this survival function. Simulation and case studies demonstrate the superiority of the proposed approach.
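To make the estimator concrete, here is a minimal numpy sketch of such a hybrid mean-survival estimator. It assumes the last change point `tau` has already been identified (the change-point search itself is not shown), and the local exponential rate is taken as events divided by person-time beyond `tau`; names and choices are illustrative, not the authors' exact implementation.

```python
import numpy as np

def km_survival(time, event):
    """Kaplan-Meier estimate: distinct event times and S(t) just after each."""
    uniq = np.unique(time[event == 1])
    at_risk = np.array([(time >= t).sum() for t in uniq])
    deaths = np.array([((time == t) & (event == 1)).sum() for t in uniq])
    return uniq, np.cumprod(1.0 - deaths / at_risk)

def hybrid_mean_survival(time, event, tau):
    """Mean survival: KM curve up to tau, exponential tail beyond tau."""
    t_ev, surv = km_survival(time, event)
    keep = t_ev <= tau
    t_ev, surv = t_ev[keep], surv[keep]
    # area under the KM step function on [0, tau]
    steps = np.concatenate(([0.0], t_ev, [tau]))
    heights = np.concatenate(([1.0], surv))
    area_km = np.sum(np.diff(steps) * heights)
    # "local" exponential rate after tau: events / person-time at risk beyond tau
    exposure = np.clip(time - tau, 0.0, None).sum()
    lam = ((time > tau) & (event == 1)).sum() / exposure
    s_tau = heights[-1]
    return area_km + s_tau / lam   # integral of s_tau * exp(-lam*(t - tau)) over (tau, inf)

# toy usage with simulated exponential survival times and independent censoring
rng = np.random.default_rng(0)
t_true = rng.exponential(10.0, size=300)
c = rng.exponential(25.0, size=300)
time, event = np.minimum(t_true, c), (t_true <= c).astype(int)
print(hybrid_mean_survival(time, event, tau=np.quantile(time, 0.8)))
```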

2.
Post-marketing data offer rich information and cost-effective resources for physicians and policy-makers to address critical scientific questions in clinical practice. However, the complex confounding structures (e.g., nonlinear and nonadditive interactions) embedded in these observational data often pose major analytical challenges for drawing valid conclusions. Furthermore, often made available as electronic health records (EHRs), these data are usually massive, with hundreds of thousands of observational records, which introduces additional computational challenges. In this paper, for comparative effectiveness analysis, we propose a statistically robust yet computationally efficient propensity score (PS) approach to adjust for the complex confounding structures. Specifically, we propose a kernel-based machine learning method for flexible and robust PS modeling, yielding valid PS estimates from observational data with complex confounding structures. The estimated propensity score is then used in a second-stage analysis to obtain a consistent estimate of the average treatment effect. An empirical variance estimator based on the bootstrap is adopted. A split-and-merge algorithm is further developed to reduce the computational workload of the proposed method for big data and to obtain a valid variance estimator of the average treatment effect estimate as a by-product. As shown by extensive numerical studies and an application to a comparative effectiveness analysis of postoperative pain EHR data, the proposed approach consistently outperforms competing methods, demonstrating its practical utility.
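A rough sketch of the two-stage idea follows, using scikit-learn's `KernelRidge` as a generic stand-in kernel learner for the PS model and the usual inverse-probability-weighted ATE estimator with a small bootstrap for the variance; the paper's specific kernel method and its split-and-merge scheme are not reproduced here.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def ipw_ate(X, a, y, clip=(0.02, 0.98)):
    """ATE via inverse-probability weighting with a kernel-based PS stand-in."""
    ps_model = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5).fit(X, a)
    ps = np.clip(ps_model.predict(X), *clip)          # crude propensity scores
    w1, w0 = a / ps, (1 - a) / (1 - ps)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

def bootstrap_se(X, a, y, n_boot=100, seed=1):
    """Empirical bootstrap standard error of the IPW estimate."""
    rng = np.random.default_rng(seed)
    n = len(y)
    ests = [ipw_ate(X[idx], a[idx], y[idx])
            for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return np.std(ests, ddof=1)

# toy data with a nonlinear, nonadditive confounding structure
rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 3))
p = 1 / (1 + np.exp(-(X[:, 0] * X[:, 1] + np.sin(X[:, 2]))))
a = rng.binomial(1, p)
y = 1.0 * a + X[:, 0] ** 2 + rng.normal(size=n)       # true ATE = 1
print(ipw_ate(X, a, y), bootstrap_se(X, a, y))
```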

3.
Because many illnesses show heterogeneous response to treatment, there is increasing interest in individualizing treatment to patients [11]. An individualized treatment rule is a decision rule that recommends treatment according to patient characteristics. We consider the use of clinical trial data in the construction of an individualized treatment rule that leads to the highest mean response. This is a difficult computational problem because the objective function is the expectation of a weighted indicator function that is non-concave in the parameters. Furthermore, there are frequently many pretreatment variables that may or may not be useful in constructing an optimal individualized treatment rule, yet cost and interpretability considerations imply that only a few variables should be used by the rule. To address these challenges we consider estimation based on l1-penalized least squares. This approach is justified via a finite-sample upper bound on the difference between the mean response due to the estimated individualized treatment rule and the mean response due to the optimal individualized treatment rule.
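As an illustration of the estimation strategy (not the paper's exact formulation), the sketch below fits an l1-penalized least-squares regression of the response on covariates, treatment, and treatment-by-covariate interactions with scikit-learn's `Lasso`, and derives a rule that treats whenever the estimated treatment contrast is positive; the penalty level is an arbitrary placeholder.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_itr(X, a, y, alpha=0.05):
    """l1-penalized least squares on (X, A, A*X); returns a treatment rule."""
    design = np.hstack([X, a[:, None], a[:, None] * X])
    model = Lasso(alpha=alpha).fit(design, y)
    p = X.shape[1]
    def rule(Xnew):
        # treat iff the estimated treatment part of the response is positive
        contrast = model.coef_[p] + Xnew @ model.coef_[p + 1:]
        return (contrast > 0).astype(int)
    return rule, model

# toy randomized trial: only X1 modifies the treatment effect
rng = np.random.default_rng(3)
n, p = 500, 10
X = rng.normal(size=(n, p))
a = rng.binomial(1, 0.5, size=n)
y = X[:, 0] + a * (1.0 * X[:, 0] - 0.3) + rng.normal(size=n)
rule, model = fit_itr(X, a, y)
print("selected interaction coefficients:", np.round(model.coef_[p + 1:], 2))
print("fraction recommended treatment:", rule(X).mean())
```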

4.
The optimal learner for prediction modeling varies depending on the underlying data-generating distribution. Super Learner (SL) is a generic ensemble learning algorithm that uses cross-validation to select among a ‘library’ of candidate prediction models. While SL has been widely studied in a number of settings, it has not been thoroughly evaluated in the large electronic healthcare databases that are common in pharmacoepidemiology and comparative effectiveness research. In this study, we applied SL and evaluated its ability to predict the propensity score (PS), the conditional probability of treatment assignment given baseline covariates, using three electronic healthcare databases. We considered a library of algorithms consisting of both nonparametric and parametric models. We also proposed a novel strategy for prediction modeling that combines SL with the high-dimensional propensity score (hdPS) variable selection algorithm. Predictive performance was assessed using three metrics: the negative log-likelihood, the area under the curve (AUC), and time complexity. Results showed that the best individual algorithm, in terms of predictive performance, varied across datasets. The SL was able to adapt to the given dataset and optimize predictive performance relative to any individual learner. Combining SL with the hdPS was the most consistent prediction method and may be promising for PS estimation and prediction modeling in electronic healthcare databases.
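The following is a bare-bones stacked-ensemble sketch in the spirit of SL: out-of-fold predictions from a small, hypothetical candidate library are combined with non-negative weights chosen by least squares, and the stacked probabilities serve as the estimated PS. The hdPS variable-selection step and the paper's full library are not shown.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

def super_learner_ps(X, a, cv=5):
    """Stacked propensity-score prediction from a small candidate library."""
    library = [LogisticRegression(max_iter=1000),
               RandomForestClassifier(n_estimators=200, random_state=0),
               GradientBoostingClassifier(random_state=0)]
    # out-of-fold predicted treatment probabilities for each candidate learner
    Z = np.column_stack([cross_val_predict(m, X, a, cv=cv,
                                           method="predict_proba")[:, 1]
                         for m in library])
    w, _ = nnls(Z, a.astype(float))            # non-negative stacking weights
    w = w / w.sum()
    fitted = [m.fit(X, a) for m in library]
    def predict_ps(Xnew):
        Znew = np.column_stack([m.predict_proba(Xnew)[:, 1] for m in fitted])
        return np.clip(Znew @ w, 1e-3, 1 - 1e-3)
    return predict_ps, w

# toy example with a nonlinear treatment-assignment mechanism
rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 5))
a = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + X[:, 1] * X[:, 2]))))
predict_ps, w = super_learner_ps(X, a)
print("stacking weights:", np.round(w, 2))
```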

5.
Ensemble methods using the same underlying algorithm trained on different subsets of observations have recently received increased attention as practical prediction tools for massive data sets. We propose Subsemble: a general subset ensemble prediction method, which can be used for small, moderate, or large data sets. Subsemble partitions the full data set into subsets of observations, fits a specified underlying algorithm on each subset, and uses a clever form of V-fold cross-validation to output a prediction function that combines the subset-specific fits. We give an oracle result that provides a theoretical performance guarantee for Subsemble. Through simulations, we demonstrate that Subsemble can be a beneficial tool for small- to moderate-sized data sets, and often has better prediction performance than the underlying algorithm fit just once on the full data set. We also describe how to include Subsemble as a candidate in a SuperLearner library, providing a practical way to evaluate the performance of Subsemble relative to the underlying algorithm fit just once on the full data set.
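A compact, simplified sketch of the subset-ensemble idea (not the exact Subsemble procedure): partition the rows, fit the same base learner within each subset, form V-fold cross-validated predictions from each subset-specific fit, and learn non-negative combination weights on those predictions.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.base import clone
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import KFold

def subsemble(X, y, base=DecisionTreeRegressor(max_depth=4),
              n_subsets=4, v=5, seed=0):
    """Simplified Subsemble: subset-specific fits combined via CV predictions."""
    rng = np.random.default_rng(seed)
    subset_id = rng.integers(0, n_subsets, size=len(y))  # random row partition
    Z = np.zeros((len(y), n_subsets))                    # cross-validated predictions
    for j in range(n_subsets):
        rows = np.where(subset_id == j)[0]
        for train, test in KFold(n_splits=v, shuffle=True,
                                 random_state=seed).split(X):
            sub_train = np.intersect1d(rows, train)
            fit = clone(base).fit(X[sub_train], y[sub_train])
            Z[test, j] = fit.predict(X[test])
    w, _ = nnls(Z, y)                                    # combination weights
    fits = [clone(base).fit(X[subset_id == j], y[subset_id == j])
            for j in range(n_subsets)]
    return lambda Xnew: np.column_stack([f.predict(Xnew) for f in fits]) @ w

# toy usage
rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(4000, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=4000)
predict = subsemble(X, y)
print("training MSE:", np.mean((predict(X) - y) ** 2))
```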

6.
Bayesian hierarchical modeling with Gaussian process random effects provides a popular approach for analyzing point-referenced spatial data. For large spatial data sets, however, generic posterior sampling is infeasible due to the extremely high computational burden of decomposing the spatial correlation matrix. In this paper, we propose an efficient algorithm, the adaptive griddy Gibbs (AGG) algorithm, to address the computational issues with large spatial data sets. The proposed algorithm dramatically reduces the computational complexity. We show theoretically that the proposed method approximates the true posterior distribution accurately, and we derive the number of grid points sufficient for a required accuracy. We compare the performance of AGG with that of state-of-the-art methods in simulation studies. Finally, we apply AGG to spatially indexed data on building energy consumption.
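The sketch below shows only the basic griddy-Gibbs ingredient that AGG builds on: an unnormalized full conditional evaluated on a grid and sampled from the resulting discrete approximation. The target used here (the variance of a small Gaussian process given a fixed correlation matrix) is a hypothetical stand-in, and the adaptive grid refinement of AGG is not implemented.

```python
import numpy as np

def griddy_gibbs_draw(log_cond, grid, rng):
    """One griddy-Gibbs update: evaluate an unnormalized log full conditional
    on a grid, normalize, and sample from the discrete approximation."""
    logp = np.array([log_cond(g) for g in grid])
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return rng.choice(grid, p=p)

# hypothetical stand-in: update the GP variance sigma2 given a small spatial data set
rng = np.random.default_rng(6)
n = 60
coords = rng.uniform(0, 1, size=(n, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
R = np.exp(-dist / 0.3)                       # fixed exponential correlation
y = rng.multivariate_normal(np.zeros(n), 2.0 * R)

def log_cond_sigma2(s2):
    # log-density of N(0, s2 * R) up to terms constant in s2; flat prior on s2 > 0
    return -0.5 * (n * np.log(s2) + y @ np.linalg.solve(s2 * R, y))

grid = np.linspace(0.2, 6.0, 60)
draws = np.array([griddy_gibbs_draw(log_cond_sigma2, grid, rng) for _ in range(200)])
print("grid-approximate posterior mean of sigma2:", draws.mean())
```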

7.
We develop a novel estimation algorithm for a dynamic factor model (DFM) applied to panel data with a short time dimension and a large cross-sectional dimension. Current DFMs usually require panels with a minimum of 20 years of quarterly data (80 time observations per panel). In contrast, the application we consider includes panels with a median of 8 annual observations. As a result, the time dimension in our paper is substantially shorter than in previous work in the DFM literature. This difference increases the computational challenges of the estimation process, which we address by developing the “Two-Cycle Conditional Expectation-Maximization” (2CCEM) algorithm, a variant of the EM algorithm and its extensions. We analyze the conditions under which our model is identified and provide simulation results demonstrating consistency of our 2CCEM estimator. We apply the DFM to a dataset of 802 water and sanitation utilities from 43 countries and use the 2CCEM algorithm to estimate dynamic performance trajectories for each utility.

8.
The Type-II progressive censoring scheme has become very popular for analyzing lifetime data in reliability and survival analysis. However, no published papers address parameter estimation under progressive Type-II censoring for the mixed exponential distribution (MED), an important model for reliability and survival analysis. This is the problem we address in this paper. The maximum likelihood estimates of the unknown parameters cannot be obtained in closed form due to the complicated log-likelihood function. We solve this problem using the EM algorithm and obtain closed-form estimates of the model parameters. The proposed methods are illustrated by simulations and a case analysis.
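As a simplified illustration, the EM updates for a two-component exponential mixture with complete (uncensored) data are sketched below; the paper's version additionally handles progressive Type-II censoring, which is omitted here.

```python
import numpy as np

def em_mixed_exponential(x, n_iter=200):
    """EM for f(x) = p*Exp(l1) + (1-p)*Exp(l2) on uncensored data.
    The M-step updates are available in closed form."""
    p, l1, l2 = 0.5, 2.0 / np.mean(x), 0.5 / np.mean(x)   # crude starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each observation comes from component 1
        d1 = p * l1 * np.exp(-l1 * x)
        d2 = (1 - p) * l2 * np.exp(-l2 * x)
        w = d1 / (d1 + d2)
        # M-step: closed-form updates
        p = w.mean()
        l1 = w.sum() / np.sum(w * x)
        l2 = (1 - w).sum() / np.sum((1 - w) * x)
    return p, l1, l2

# toy data from a known mixture: p = 0.4, rates 3.0 and 0.5
rng = np.random.default_rng(7)
x = np.where(rng.random(3000) < 0.4,
             rng.exponential(1 / 3.0, 3000),
             rng.exponential(1 / 0.5, 3000))
print(em_mixed_exponential(x))   # roughly (0.4, 3.0, 0.5)
```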

9.
The accurate estimation of an individual's usual dietary intake is an important topic in nutritional epidemiology. This paper considers the best linear unbiased predictor (BLUP) computed from repeatedly measured dietary data and derives several nonparametric prediction intervals for true intake. However, the performance of the BLUP and the validity of the prediction intervals depend on whether the model assumptions required for the true-intake estimation problem hold. To address this issue, the paper examines how the BLUP and prediction intervals behave when the model assumptions are violated, and then proposes an analysis pipeline for checking them with data.
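For the simple one-way random-effects model y_ij = mu + b_i + e_ij, the BLUP of an individual's usual intake shrinks the personal mean toward the overall mean by the reliability ratio. The sketch below assumes plug-in (method-of-moments) variance components and is only a toy version of the setting discussed above, not the paper's procedure.

```python
import numpy as np

def blup_usual_intake(person_means, n_reps, mu, sigma2_b, sigma2_e):
    """BLUP of true usual intake under y_ij = mu + b_i + e_ij:
    shrink each person's mean toward mu by the reliability ratio."""
    shrink = sigma2_b / (sigma2_b + sigma2_e / n_reps)
    return mu + shrink * (person_means - mu)

# toy data: 200 people, 4 repeated 24-hour recalls each
rng = np.random.default_rng(8)
true_intake = 2000 + rng.normal(scale=300, size=200)             # person-level truth
y = true_intake[:, None] + rng.normal(scale=500, size=(200, 4))  # within-person noise
pm, n = y.mean(axis=1), y.shape[1]
# plug-in (method-of-moments) variance components
sigma2_e = y.var(axis=1, ddof=1).mean()
sigma2_b = max(pm.var(ddof=1) - sigma2_e / n, 0.0)
blup = blup_usual_intake(pm, n, pm.mean(), sigma2_b, sigma2_e)
print("RMSE of raw person means:", np.sqrt(np.mean((pm - true_intake) ** 2)))
print("RMSE of BLUP            :", np.sqrt(np.mean((blup - true_intake) ** 2)))
```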

10.
"The U.S. Bureau of the Census will increase significantly the automation of operations for the 1990 Census of Population and Housing, thus eliminating or reducing many of the labor-intensive clerical operations of past censuses and contributing to the speedier release of data products. An automated address control file will permit the computer to monitor the enumeration status of an address. The automated address file will also make it possible to begin electronic data processing concurrently with data collection, and, thus, 5-7 months earlier than for the 1980 Census. An automated geographic support system will assure consistency between various census geographic products, and computer-generated maps will be possible. Other areas where automation will be introduced or increased are questionnaire editing, coding of written entries on questionnaires, and reporting of progress and cost by field offices."  相似文献   

11.
Wu Mengyun et al., 《统计研究》 (Statistical Research), 2021, 38(8): 132-145
Multi-class data analysis is of great importance in empirical research. However, owing to high dimensionality, small sample sizes, and low signal-to-noise ratios, existing multi-class methods often suffer from insufficient information and perform poorly. To address this, researchers have collected data from additional sources to characterize the problem of interest more completely. Unlike multi-source samples that share the same covariates, the currently popular multi-source data collect different sets of covariates on the same samples, and the independence and correlation among these sources pose new challenges for statistical modeling. This paper proposes a multi-class vertical integrative analysis method based on canonical variate regression, in which penalization is used for variable selection and, distinctively, the association structure across data sources is taken into account; an efficient ADMM algorithm is developed for model optimization. Simulation results show that the method is superior in both variable selection and classification prediction. Using multi-source stock data from China's SSE 50 index, the method is applied to an empirical study of the factors influencing daily stock returns in 2019. The results show that the proposed multi-class integrative analysis selects interpretable variables while achieving better predictive performance.

12.
Searching for regions of the input space where a statistical model is inappropriate is useful in many applications. This study proposes an algorithm for finding local departures from a regression-type prediction model. The algorithm returns low-dimensional hypercubes where the average prediction error clearly departs from zero. The study describes the developed algorithm and shows successful applications on simulated data and on real data from steel plate production. Algorithms originally developed for finding regions of high response in the input space are reviewed and considered as alternative methods for locating model departures. The proposed algorithm locates the model-departure regions better than the compared alternatives. The algorithm can be used for sequential follow-up of a model as time passes and new data are observed.
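A very reduced stand-in for the idea, not the paper's algorithm: scan one- and two-dimensional quantile boxes of the input space and rank them by how strongly the average residual of a fitted model departs from zero.

```python
import numpy as np
from itertools import combinations

def departure_boxes(X, resid, n_bins=4, min_n=30, top=5):
    """Scan 1-D and 2-D quantile boxes and rank them by |mean residual| / SE."""
    p = X.shape[1]
    edges = [np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)) for j in range(p)]
    boxes = []
    for dims in list(combinations(range(p), 1)) + list(combinations(range(p), 2)):
        for cells in np.ndindex(*([n_bins] * len(dims))):
            mask = np.ones(len(resid), dtype=bool)
            for d, c in zip(dims, cells):
                mask &= (X[:, d] >= edges[d][c]) & (X[:, d] <= edges[d][c + 1])
            if mask.sum() >= min_n:
                m = resid[mask].mean()
                se = resid[mask].std(ddof=1) / np.sqrt(mask.sum())
                boxes.append((abs(m) / se, dims, cells, m, int(mask.sum())))
    return sorted(boxes, reverse=True)[:top]

# toy example: a fitted model that is wrong in one corner of the input space
rng = np.random.default_rng(9)
X = rng.uniform(0, 1, size=(3000, 4))
y = X[:, 0] + 2.0 * ((X[:, 1] > 0.75) & (X[:, 2] > 0.75)) + rng.normal(0, 0.5, 3000)
resid = y - X[:, 0]              # residuals of the misspecified model y_hat = x0
for score, dims, cells, m, n in departure_boxes(X, resid):
    print(f"dims={dims} cells={cells} mean residual={m:.2f} n={n} score={score:.1f}")
```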

13.
Due to the significant increase in communication between individuals via social media (Facebook, Twitter, LinkedIn) or electronic formats (email, web, e-publication) over the past two decades, network analysis has become an unavoidable discipline. Many random graph models have been proposed to extract information from networks based on person-to-person links only, without taking into account information on the contents. This paper introduces the stochastic topic block model, a probabilistic model for networks with textual edges. We address here the problem of discovering meaningful clusters of vertices that are coherent with respect to both the network interactions and the text contents. A classification variational expectation-maximization algorithm is proposed to perform inference. Simulated datasets are considered in order to assess the proposed approach and to highlight its main features. Finally, we demonstrate the effectiveness of our methodology on two real-world datasets: a directed communication network and an undirected co-authorship network.

14.
The estimand framework included in the addendum to the ICH E9 guideline facilitates discussions to ensure alignment between the key question of interest, the analysis, and the interpretation. Therapeutic knowledge and drug mechanism play a crucial role in determining the strategy and defining the estimand for clinical trial designs. Clinical trials in patients with hematological malignancies often present unique challenges for trial design due to the complexity of treatment options and the existence of potentially curative but highly risky procedures, for example, stem cell transplant, or treatment sequences across different phases (induction, consolidation, maintenance). Here, we illustrate how to apply the estimand framework in hematological clinical trials and how the framework can address potential difficulties in interpreting trial results. This paper is the result of a cross-industry collaboration to connect the International Conference on Harmonisation (ICH) E9 addendum concepts to applications. Three randomized phase 3 trials are used to consider common challenges, including intercurrent events in hematologic oncology trials, and to illustrate different scientific questions and the consequences of the estimand choice for trial design, data collection, analysis, and interpretation. Template language for describing estimands in both study protocols and statistical analysis plans is suggested for statisticians' reference.

15.
This paper deals with the prediction, from a Bayesian viewpoint, of future failures of repairable equipment subjected to both minimal repairs and periodic overhauls. The effect of major overhauls on the reliability of the equipment is modeled by a proportional age reduction model, while the failure process between two successive overhaul epochs is modeled by the power law process. Predictions of both the future failure times and the number of failures in a future time interval are provided on the basis of the observed data and of a number of suitable prior densities, which reflect different degrees of belief about the failure mechanism and overhaul effectiveness. Finally, a numerical application illustrates the proposed prediction procedures and their use in assessing the adequacy of the model to the observed data set.
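Under one common proportional-age-reduction formulation, each overhaul removes a fraction rho of the age accumulated since the previous overhaul, and the expected number of failures in a future interval is the integral of the power-law intensity evaluated at the virtual age. The sketch below uses hypothetical parameter values in place of posterior draws and is only an illustration of that calculation, not the paper's Bayesian procedure.

```python
import numpy as np

def virtual_age(t, overhauls, rho):
    """Virtual age at calendar time t when each overhaul removes a fraction rho
    of the age accumulated since the previous overhaul (one common PAR form)."""
    v, prev = 0.0, 0.0
    for T in overhauls:
        if T > t:
            break
        v += (1.0 - rho) * (T - prev)
        prev = T
    return v + (t - prev)

def expected_failures(a, b, overhauls, rho, alpha, beta, n_grid=2000):
    """E[N(a,b)]: numerically integrate the power-law intensity
    lambda(t) = (beta/alpha) * (v(t)/alpha)**(beta-1) over [a, b]."""
    ts = np.linspace(a, b, n_grid)
    v = np.array([virtual_age(t, overhauls, rho) for t in ts])
    intensity = (beta / alpha) * (v / alpha) ** (beta - 1)
    return float(np.sum(0.5 * (intensity[1:] + intensity[:-1]) * np.diff(ts)))

# hypothetical values standing in for posterior summaries
overhauls = [1000.0, 2000.0, 3000.0]        # overhaul epochs (hours)
print(expected_failures(3000.0, 3500.0, overhauls, rho=0.6, alpha=400.0, beta=1.8))
```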

16.
Lin Chen et al., 《统计研究》 (Statistical Research), 2020, 37(6): 93-105
Against the background of the national strategy of developing strategic emerging industries, this paper combines the notion of a "fundamental" position in the technological structure with the potential for technological progress, derives the underlying logic for selecting key industries, and demonstrates theoretically their significance for economic growth. The paper further provides a numerical method, based on input-output table data, for identifying key industries. Numerical analysis based on matrix triangularization shows that China's key industries include the manufacture of communication equipment, computers and other electronic equipment; general- and special-purpose equipment manufacturing; and transport equipment manufacturing. The paper also measures the competitiveness of these key industries and carries out international comparisons. The results show that general- and special-purpose equipment manufacturing and transport equipment manufacturing are relatively competitive, while the production inputs within the communication equipment, computer and other electronic equipment manufacturing industry are relatively weak in competitiveness.

17.
In high-dimensional data settings, sparse model fits are desired, and these can be obtained through shrinkage or boosting techniques. We investigate classical shrinkage techniques such as the lasso, which is theoretically known to be biased; newer techniques that address this problem, such as the elastic net and SCAD; and the boosting technique CoxBoost and its extensions, which allow additional structure to be incorporated. To examine whether these methods, which are designed for or frequently used in high-dimensional survival data analysis, also provide sensible results in low-dimensional settings, we consider the well-known GBSG breast cancer data. In detail, we study the bias, stability and sparseness of these model-fitting techniques via comparison to the maximum likelihood estimate and resampling, and their prediction performance via prediction error curve estimates.

18.
Economists attempting to build econometric or forecasting models are frequently restricted by data scarcity in the form of short time series, as well as by parameter non-constancy and under-specification. In such cases, a realistic alternative is often to guess rather than estimate the parameters of such models. An algorithm of repetitive guessing (drawing) of parameters from iteratively changing distributions, with the objective of minimizing the squared ex-post prediction errors, weighted by penalty weights and subject to a learning process, has recently been introduced. Despite obvious advantages, especially when applied to undersized empirical models with a large number of parameters, applications of Repetitive Stochastic Guesstimation have so far been limited. This has presumably been caused by the lack of a rigorous proof of its convergence. Such a proof, for a class of linear models both identifiable (in the economic sense) and not, is provided in this article.
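A toy version of the repetitive-guessing loop, an illustration only and not the published RSG algorithm: parameters are repeatedly drawn from a proposal distribution that is re-centred on the best guess found so far and gradually tightened, with guesses scored by penalty-weighted squared ex-post prediction errors.

```python
import numpy as np

def rsg_guesstimate(loss, n_params, n_rounds=60, n_draws=50, seed=0):
    """Repeated guessing: draw parameter vectors, keep the best, and re-centre
    and tighten the drawing distribution around it (a simple learning process)."""
    rng = np.random.default_rng(seed)
    centre, spread = np.zeros(n_params), np.ones(n_params) * 2.0
    best, best_loss = centre, loss(centre)
    for _ in range(n_rounds):
        draws = centre + spread * rng.normal(size=(n_draws, n_params))
        losses = np.array([loss(d) for d in draws])
        if losses.min() < best_loss:
            best, best_loss = draws[losses.argmin()], losses.min()
        centre = 0.5 * centre + 0.5 * best     # learning: move toward the best guess
        spread *= 0.95                          # gradually narrow the search
    return best, best_loss

# toy "undersized" model: 6 parameters, only 12 observations
rng = np.random.default_rng(11)
X = rng.normal(size=(12, 6))
theta_true = np.array([1.0, -0.5, 0.0, 2.0, 0.0, 0.3])
y = X @ theta_true + rng.normal(scale=0.2, size=12)
penalty = np.ones(12)                           # penalty weights (all equal here)
loss = lambda th: np.sum(penalty * (y - X @ th) ** 2)
theta_hat, l = rsg_guesstimate(loss, n_params=6)
print(np.round(theta_hat, 2), round(l, 3))
```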

19.
Large cohort studies are commonly launched to study the effect of genetic variants or other risk factors on a chronic disorder. In these studies, family data are often collected to provide additional information for improving the inference. Statistical analysis of the family data can be very challenging due to missing genotypes, incomplete records of disease occurrence in family members, and the complicated dependence attributable to shared genetic background and environmental factors. In this article, we investigate a class of logistic models with family-shared random effects to tackle these challenges, and develop a robust regression method based on the conditional logistic technique for statistical inference. A computationally fast expectation-maximization (EM) algorithm is developed to handle the missing genotypes. The proposed estimators are shown to be consistent and asymptotically normal. Additionally, a score test based on the proposed method is derived to test for the genetic effect. Extensive simulation studies demonstrate that the proposed method performs well in finite samples in terms of estimation accuracy, robustness and computational speed. The proposed procedure is applied to an Alzheimer's disease study.

20.
In survival analysis, we may encounter three problems: nonlinear covariate effects, variable selection, and measurement error. Existing studies address only one or two of these problems. The goal of this study is to fill that gap and develop a novel approach that addresses all three simultaneously. Specifically, a partially time-varying coefficient proportional hazards model is proposed to describe covariate effects more flexibly. Corrected score and conditional score approaches are employed to accommodate potential measurement error. For the selection of relevant variables and regularised estimation, a penalisation approach is adopted. The proposed approach is shown to have satisfactory asymptotic properties and can be effectively realised using an iterative algorithm. Its performance is assessed via simulation studies and further illustrated by application to data from an AIDS clinical trial.
