Random sampling from databases: a survey |
| |
Authors: | Frank Olken Doron Rotem |
| |
Institution: | (1) Information and Computing Sciences Division, Lawrence Berkeley Laboratory, 94720 Berkeley, CA, USA;(2) Management Information Systems Department, School of Business, San Jose State University, San Jose, CA, USA |
| |
Abstract: | This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g. acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B
+ trees, hash files, spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g. single relational operators such as selection, intersection, union, set difference, projection, and join are then described. We then describe sampling for estimation of aggregates (e.g. the size of query results). Here we discuss both clustered sampling, and sequential sampling approaches. Decision-theoretic approaches to sampling for query optimization are reviewed. |
| |
Keywords: | clustered sampling estimation materialized views query optimization query result size estimation query processing random sampling relational databases reservoir sampling sampling algorithms simple random sampling sequential sampling size of projections size of transitive closure |
本文献已被 SpringerLink 等数据库收录! |
|