OMII-UK Home

Adaptive Cost Estimation Techniques for OGSA-DAI DQP

Google Summer of Code 2009 ideas

Primary Mentor: Bartosz Dobrzelecki, EPCC, The University of Edinburgh
Secondary Mentor: Steven Lynden, AIST
OMII Project: OGSA-DAI.

Background

The OGSA-DAI DQP product, currently in development, is a distributed query processor based on the OGSA-DAI technology. It is going to extend the functionality of the existing OGSA-DQP system and aims to provide a flexible framework for data integration and research in distributed query processing.

Implementation of methods that would allow monitoring of query execution and adaptive cost estimation is the focus of this project.

Project Goals

Implementation of:

  • monitoring activities for OGSA-DAI
  • optimizers that insert monitoring operators into the query plan
  • a module that gathers post execution data and refines dynamic data structures storing information about value distributions and costs
  • selectivity estimation module that uses self tuning histograms
  • optimizer that uses self tuning cost functions

Project Description

The OGSA-DAI DQP product, currently in development, is a distributed query processor based on the OGSA-DAI technology. It is going to extend the functionality of the existing OGSA-DQP system and aims to provide a flexible framework for data integration and research in distributed query processing.

The optimization of query execution plans requires accurate information about characteristics of the data being queried. This information is used to estimate predicate selectivity and cardinality of intermediate results.

Assume we have two relations: cities and people, exposed by two distinct resources R1 and R2. The user wants to execute the following SQL query:

SELECT * FROM cities, people 
  WHERE age>30 AND people.city = cities.name

This query is equivalent to the following query plan

     JOIN (people.city = cities.name)
    /    \
cities  SELECT (age > 30)
          |
        people

Before this plan can be executed, the optimizer needs to decide which resource will execute the JOIN operator. There are two options:

  1. Transfer all the tuples for the cities table from R1 to R2
  2. Transfer the result of the SELECT operator from R2 to R1

Our goal might be to minimize the data being transferred over the network. To be able to do this we need to know how many records are stored in the cities table and how many records are stored in the people table. This information is usually available.

The problem is that we also need to estimate how many rows in the people table have values of the age attribute greater than 30. To be able to calculate such an estimate we need some information about how age values are distributed.

There are several approaches to this problem. One approach relies on histograms generated for each table, which can be a time consuming process and therefore it is usually done once before executing any queries. One problem with this approach is that if new records are being inserted our histograms quickly become outdated.

Another way to approach this problem would be to make a rough first estimate of the selectivity, monitor execution of the plan and gather data about the actual selectivity for each operator/predicate in the query plan. This information could then be used to build a self tuning histogram that would be constantly refined from query to query.

Such self tuning histograms are one of the ways of bringing adaptive cardinality estimation for distributed query processing. Implementation of methods that would allow monitoring of query execution and adaptive selectivity estimation is the focus of this project.

Possible monitoring probes could be extended to not only measure cardinalities, but also average size of attributes and the average time taken to process a tuple. Accurate cost estimates could be of a great use in queries where user defined functions are being used. Consider the following query:

SELECT * FROM cities, people
  WHERE distance(people.city, cities.name) > 10

where distance() is a user defined function. Usually it is very hard to estimate the actual cost of executing a user defined function. If our dynamically gathered post execution data included the average time needed to calculate the distance(people.city, cities.name) > 10 predicate, we could make some informed decisions whether parallel execution of the SELECT operator would decrease the total query execution time.

Project Requirements

  • fluent in Java
  • experience with relational databases
  • basic knowledge about relational theory

Add new attachment

Only authorized users are allowed to upload new attachments.
« This page (revision-6) was last changed on 19-Mar-2009 17:10 by MarioAntonioletti [RSS]

© The University of Southampton on behalf of OMII-UK. All Rights Reserved. | Terms of Use | Privacy Policy | PageRank Checker