High dimensional variable selection with clustered data: an application of random multivariate survival forests for detection of outlier medical device components |
| |
Authors: | Guy Cafri, Peter Calhoun, Juanjuan Fan |
| |
Affiliation: | 1. Surgical Outcomes and Analysis, Kaiser Permanente, San Diego, CA, USA; 2. Computational Science Research Center, San Diego State University, San Diego, CA, USA; 3. Department of Mathematics and Statistics, San Diego State University, San Diego, CA, USA |
| |
Abstract: | In many medical studies, patients are nested or clustered within doctors. With many explanatory variables, variable selection with clustered data can be challenging. We propose a method for variable selection based on random forests that addresses clustered data through stratified binary splits. Our motivating example involves the detection of outlier orthopedic device components from a large pool of candidates, where each patient belongs to a surgeon. Simulations compare survival forests grown using the stratified logrank statistic with forests grown using conventional and robust logrank statistics, and evaluate a method for selecting variables using a threshold value based on each variable's empirical null distribution. The stratified logrank test outperforms conventional and robust methods when data are generated to have cluster-specific effects and, when cluster sizes are sufficiently large, performs comparably to the splitting alternatives in the absence of cluster-specific effects. Thresholding was effective at distinguishing between important and unimportant variables. |
| |
Keywords: | Medical devices; multivariate; random forest; stratification; survival |
|
|
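The abstract's central idea, scoring a candidate binary split with a logrank statistic stratified by cluster (here, surgeon), can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `logrank_parts` and `stratified_logrank` are hypothetical, the chi-square form (squared sum of observed minus expected events, divided by the summed variance) is the textbook stratified logrank statistic, and the abstract does not say which exact form the forests use.

```python
import numpy as np

def logrank_parts(time, event, group):
    """Within one stratum, return (O - E, V) for the group == True arm.

    Hypothetical helper for illustration only.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=bool)
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(time[event]):          # distinct event times
        at_risk = time >= t
        n = at_risk.sum()                     # subjects at risk at t
        n1 = (at_risk & group).sum()          # at risk in the group arm
        died = event & (time == t)
        d = died.sum()                        # events at t
        d1 = (died & group).sum()             # events at t in the group arm
        o_minus_e += d1 - d * n1 / n
        if n > 1:                             # hypergeometric variance
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e, var

def stratified_logrank(time, event, group, stratum):
    """Chi-square-type stratified logrank statistic for a binary split.

    O - E and V are computed separately within each stratum (cluster)
    and summed before forming the statistic, so comparisons are only
    made between patients who share a cluster.
    """
    time, event, group, stratum = map(np.asarray, (time, event, group, stratum))
    num, den = 0.0, 0.0
    for s in np.unique(stratum):
        m = stratum == s
        u, v = logrank_parts(time[m], event[m], group[m])
        num += u
        den += v
    return num ** 2 / den if den > 0 else 0.0
```

In this sketch, a cluster whose patients all fall on the same side of the split contributes nothing to either sum. That is one way to read the abstract's observation that the stratified statistic needs sufficiently large clusters to perform comparably to unstratified splitting rules.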