Using balanced iterative reducing and clustering hierarchies to compute approximate rank statistics on massive datasets

Abstract:	The balanced iterative reducing and clustering hierarchies (BIRCH) algorithm handles massive datasets by reading the data file only once, clustering the data as it is read, and retaining only a few clustering features to summarize the data read so far. BIRCH thus makes it possible to analyse datasets that are too large to fit in the computer's main memory. We propose estimates of Spearman's ρ and Kendall's τ that are calculated from BIRCH output and assess their performance through Monte Carlo studies. The numerical results show that the BIRCH-based estimates can achieve the same efficiency as the usual estimates of ρ and τ while using only a fraction of the memory otherwise required.

Keywords:	correlation; rank statistics; massive dataset; Kendall's τ; Spearman's ρ; BIRCH
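
The single-pass summarization described in the abstract can be illustrated with a minimal sketch of one clustering feature. It assumes the standard BIRCH definition of a clustering feature as the triple (N, LS, SS): the point count, the per-coordinate linear sum, and the sum of squared norms. The names `ClusteringFeature` and `summarize` are hypothetical, the tree construction and threshold logic of BIRCH are omitted, and the ρ and τ estimates proposed in the paper are not reproduced here.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ClusteringFeature:
    """Summary (N, LS, SS) of a set of d-dimensional points (illustrative only)."""
    n: int = 0
    linear_sum: list = field(default_factory=list)
    square_sum: float = 0.0

    def add_point(self, x):
        # Update the summary in O(d) time; the point itself is not stored.
        if not self.linear_sum:
            self.linear_sum = [0.0] * len(x)
        self.n += 1
        for j, v in enumerate(x):
            self.linear_sum[j] += v
            self.square_sum += v * v

    def merge(self, other):
        # Clustering features are additive, so two summaries combine exactly.
        if other.n == 0:
            return
        if not self.linear_sum:
            self.linear_sum = [0.0] * len(other.linear_sum)
        self.n += other.n
        for j, v in enumerate(other.linear_sum):
            self.linear_sum[j] += v
        self.square_sum += other.square_sum

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        # Root mean squared distance to the centroid: sqrt(SS/N - ||c||^2).
        c = self.centroid()
        var = self.square_sum / self.n - sum(v * v for v in c)
        return math.sqrt(max(var, 0.0))


def summarize(stream):
    """Read each point once and keep only the running summary."""
    cf = ClusteringFeature()
    for x in stream:
        cf.add_point(x)
    return cf
```

Because the summary is additive, memory use depends on the number of clustering features retained rather than on the number of observations read, which is what allows the data file to be processed in one pass.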