Concise and accurate data summaries for fast approximate query answering.

Concise and accurate data summaries for fast ...
Hai Wang, Hai Wang
Not in Library

My Reading Lists:

Create a new list

Check-In

×Close
Add an optional check-in date. Check-in dates are used to track yearly reading goals.
Today


Buy this book

Last edited by WorkBot
December 15, 2009 | History

Concise and accurate data summaries for fast approximate query answering.

Many techniques have been proposed to support fast approximate query answering using summarized information of the data. Among them, histogram techniques and wavelet techniques are two popular types that have been extensively studied.Second, we present a thorough experimental evaluation of previously proposed histogram and wavelet techniques.In this thesis, we investigate the trade-off between the space used and the accuracy of various histogram and wavelet techniques. We also examine their construction costs and query answering time. The major contributions of this thesis are as follows.Finally, we identify the characteristics of the data for which wavelet techniques perform poorly or excellently. We present an algorithm, called the Majorization Ranking Test (MRT) algorithm, to quickly determine which wavelet technique to use for fast approximate query answering (if any). The MRT algorithm also allows us to decide whether to use wavelet techniques or histogram techniques. We also present a new family of wavelet techniques, the Space Efficient Wavelet (SEW) techniques, which improve on previously proposed wavelet techniques by utilizing space in a more efficient way. We show that the SEW techniques dominate previously proposed wavelet techniques in both one-dimensional and multi-dimensional cases.Third, we present a new family of histograms, the Hierarchical Model Fitting (HMF) histograms, based on the Minimum Description Length (MDL) principle, which has been widely used for model selection in statistics and machine learning. The one-dimensional HMF histogram is applicable to one-dimensional data, and the multi-dimensional HMF histogram is applicable to multi-dimensional data. The HMF histograms can be constructed to either seek the highest possible accuracy within a given space budget, or seek the most concise representation that leads to accuracy within a specified tolerance. We show that the HMF histograms are capable of providing more accurate approximations than previously proposed techniques for many real and synthetic data sets across a variety of query workloads.Fourth, using Information Theory, we quantitatively assess the information gain due to each of the different types of histogram information both individually and in combination. Based on theoretical and experimental evidence, we suggest effective heuristics for allocating space to utilize different types of histogram information. We also present a new type of multi-dimensional histogram, called the multi-dimensional Values & Intervals (VI) histogram, that can be constructed in just one scan through the data. All other types of multi-dimensional histograms require much larger construction costs than the multi-dimensional VI histogram, and they are seldom used in practice due to their high construction costs. Through a set of experiments, we show that the multi-dimensional VI histogram is capable of providing more accurate approximations than the techniques currently used in major commercial database management systems, including IBM DB2, Oracle Database, and Microsoft SQL Server, with similar construction time.First, we present a general model for fast approximate query answering in many database applications. This model unifies different scenarios so that histogram and wavelet techniques can be systematically evaluated and compared.

Publish Date
Language
English
Pages
288

Buy this book

Edition Availability
Cover of: Concise and accurate data summaries for fast approximate query answering.

Add another edition?

Book Details


Edition Notes

Adviser: Kenneth C. Sevcik.

Thesis (Ph.D.)--University of Toronto, 2004.

Electronic version licensed for access by U. of T. users.

Source: Dissertation Abstracts International, Volume: 65-05, Section: B, page: 2485.

The Physical Object

Pagination
288 leaves.
Number of pages
288

ID Numbers

Open Library
OL20339036M
ISBN 10
0612916731

Community Reviews (0)

Feedback?
No community reviews have been submitted for this work.

Lists

This work does not appear on any lists.

History

Download catalog record: RDF / JSON / OPDS | Wikipedia citation
December 15, 2009 Edited by WorkBot link works
October 26, 2008 Created by ImportBot Imported from University of Toronto MARC record