Subsemble: an ensemble method for combining subset-specific algorithm fits |
| |
Authors: | Stephanie Sapp; Mark J. van der Laan; John Canny |
| |
Institution: | 1. Department of Statistics, University of California at Berkeley, Berkeley, CA, USA; 2. Division of Biostatistics, University of California at Berkeley, Berkeley, CA, USA; 3. Division of Computer Science, University of California at Berkeley, Berkeley, CA, USA |
| |
Abstract: | Ensemble methods using the same underlying algorithm trained on different subsets of observations have recently received increased attention as practical prediction tools for massive data sets. We propose Subsemble: a general subset ensemble prediction method, which can be used for small, moderate, or large data sets. Subsemble partitions the full data set into subsets of observations, fits a specified underlying algorithm on each subset, and uses a clever form of V-fold cross-validation to output a prediction function that combines the subset-specific fits. We give an oracle result that provides a theoretical performance guarantee for Subsemble. Through simulations, we demonstrate that Subsemble can be a beneficial tool for small- to moderate-sized data sets, and often has better prediction performance than the underlying algorithm fit just once on the full data set. We also describe how to include Subsemble as a candidate in a SuperLearner library, providing a practical way to evaluate the performance of Subsemble relative to the underlying algorithm fit just once on the full data set. |
| |
Keywords: | ensemble methods; prediction; cross-validation; machine learning; big data |
|
|
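The procedure summarized in the abstract (partition the data, fit the underlying algorithm on each subset, combine the subset-specific fits through V-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `fit_ols` and `subsemble` are hypothetical. Ordinary least squares stands in for the underlying algorithm, unconstrained least squares stands in for the combination step, and the fold construction (each subset split into V folds, with fold v taken across all subsets as the validation set) is one plausible reading of the abstract.

```python
import numpy as np

def fit_ols(X, y):
    """Hypothetical underlying algorithm: least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def subsemble(X, y, fit=fit_ols, n_subsets=3, n_folds=5, seed=0):
    """Sketch of a subset ensemble combined via V-fold cross-validation."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Randomly partition the observations into n_subsets disjoint subsets.
    subset = rng.permutation(n) % n_subsets
    # Assumption: split each subset into V folds; fold v across all subsets
    # together forms the v-th validation set.
    fold = np.empty(n, dtype=int)
    for j in range(n_subsets):
        idx = np.flatnonzero(subset == j)
        fold[idx] = rng.permutation(len(idx)) % n_folds
    # Z[i, j] is the cross-validated prediction for observation i from the
    # fit trained on subset j with observation i's fold held out.
    Z = np.empty((n, n_subsets))
    for v in range(n_folds):
        valid = fold == v
        for j in range(n_subsets):
            train = (subset == j) & ~valid
            Z[valid, j] = fit(X[train], y[train])(X[valid])
    # Assumption: combine the subset-specific fits by regressing y on Z.
    weights, *_ = np.linalg.lstsq(Z, y, rcond=None)
    # Refit the underlying algorithm on each full subset for the final predictor.
    fits = [fit(X[subset == j], y[subset == j]) for j in range(n_subsets)]

    def predict(X_new):
        return np.column_stack([f(X_new) for f in fits]) @ weights

    return predict, weights

# Usage on simulated data (the data-generating model is illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 1))
y = 2 + 3 * X[:, 0] + rng.normal(scale=0.1, size=300)
predict, weights = subsemble(X, y)
print(predict(np.array([[0.0], [0.5]])))
```

Any learner exposing the same fit-then-predict shape could be passed as `fit`. Because the combination weights are estimated only from held-out predictions, the combination step never sees a fit evaluated on its own training data.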