Subsemble: an ensemble method for combining subset-specific algorithm fits
Authors: Stephanie Sapp, Mark J. van der Laan, John Canny
Institution: 1. Department of Statistics, University of California at Berkeley, Berkeley, CA, USA; 2. Division of Biostatistics, University of California at Berkeley, Berkeley, CA, USA; 3. Division of Computer Science, University of California at Berkeley, Berkeley, CA, USA
Abstract: Ensemble methods using the same underlying algorithm trained on different subsets of observations have recently received increased attention as practical prediction tools for massive data sets. We propose Subsemble: a general subset ensemble prediction method, which can be used for small, moderate, or large data sets. Subsemble partitions the full data set into subsets of observations, fits a specified underlying algorithm on each subset, and uses a clever form of V-fold cross-validation to output a prediction function that combines the subset-specific fits. We give an oracle result that provides a theoretical performance guarantee for Subsemble. Through simulations, we demonstrate that Subsemble can be a beneficial tool for small- to moderate-sized data sets, and often has better prediction performance than the underlying algorithm fit just once on the full data set. We also describe how to include Subsemble as a candidate in a SuperLearner library, providing a practical way to evaluate the performance of Subsemble relative to the underlying algorithm fit just once on the full data set.
Keywords: ensemble methods, prediction, cross-validation, machine learning, big data
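
Illustrative sketch: The procedure summarized in the abstract (partition the data, fit the underlying algorithm on each subset, combine the subset-specific fits using V-fold cross-validation) might look roughly like the following in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the underlying algorithm is a placeholder one-variable least-squares line, the combination step is assumed to be an intercept-free linear regression of the outcome on the cross-validated subset predictions, a tiny ridge term is added purely for numerical stability, and all names (`fit_line`, `solve`, `subsemble`) are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Placeholder underlying algorithm: least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def solve(A, b):
    """Solve the small linear system A w = b by Gaussian elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][k] * w[k] for k in range(r + 1, n))) / M[r][r]
    return w

def subsemble(xs, ys, J=3, V=5, ridge=1e-6, seed=0):
    """Return (predict, weights) for a J-subset, V-fold subset ensemble."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    # Partition the data into J disjoint subsets, and each subset into V folds.
    subsets = [idx[j::J] for j in range(J)]
    folds = [[s[v::V] for v in range(V)] for s in subsets]

    # Z[i][j]: prediction for observation i from the fit on subset j
    # with fold v (the fold containing i) held out of every subset.
    Z = {}
    for v in range(V):
        fits = []
        for j in range(J):
            train = [i for u in range(V) if u != v for i in folds[j][u]]
            fits.append(fit_line([xs[i] for i in train], [ys[i] for i in train]))
        for j in range(J):
            for i in folds[j][v]:
                Z[i] = [f(xs[i]) for f in fits]

    # Assumed combination step: least squares of y on the J cross-validated
    # predictions (normal equations, with a small ridge term on the diagonal).
    A = [[sum(Z[i][p] * Z[i][q] for i in idx) + (ridge if p == q else 0.0)
          for q in range(J)] for p in range(J)]
    rhs = [sum(Z[i][p] * ys[i] for i in idx) for p in range(J)]
    w = solve(A, rhs)

    # Final subset-specific fits use each full subset; combine with weights w.
    final = [fit_line([xs[i] for i in s], [ys[i] for i in s]) for s in subsets]
    return (lambda x: sum(wj * f(x) for wj, f in zip(w, final))), w
```

Calling `predict, w = subsemble(xs, ys)` on paired lists of inputs and outcomes returns a prediction function and the learned combination weights. Because every cross-validated prediction for an observation comes from fits that never saw that observation's fold, the weights are chosen on held-out predictions rather than in-sample ones, which appears to be the role the abstract assigns to V-fold cross-validation.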