Approximation Techniques for Semi-Structured Data

Project Description

The Extensible Markup Language (XML) has rapidly emerged as a de facto standard for data exchange and integration over the Internet, and its increasing popularity has created a real need for processing the growing volume of available XML data. Within the realm of XML query processing, XML summarization has become a cost-effective solution for providing fast yet accurate approximate computations over XML data. In short, an XML summary, or synopsis, captures in limited space the key statistical characteristics of the underlying data set and thus represents a highly compressed, approximate version of the base data.
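To make the idea of a synopsis concrete, here is a minimal sketch in Python. It is not the XClusters model or the AQAX implementation. It assumes a much simpler, textbook-style summary that keeps only per-tag counts and parent/child tag counts, and it estimates the number of matches for a simple child-axis path under an independence assumption. The function names build_synopsis and estimate_path_count are hypothetical and chosen for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_synopsis(xml_text):
    """Summarize a document as tag counts and parent/child tag-pair counts.

    Illustrative assumption: this is a simple edge-count summary, not the
    synopsis structure used by the project.
    """
    root = ET.fromstring(xml_text)
    tag_counts = Counter()   # number of elements per tag
    edge_counts = Counter()  # number of (parent tag, child tag) edges
    stack = [root]
    while stack:
        node = stack.pop()
        tag_counts[node.tag] += 1
        for child in node:
            edge_counts[(node.tag, child.tag)] += 1
            stack.append(child)
    return tag_counts, edge_counts


def estimate_path_count(synopsis, path):
    """Estimate how many elements match a child-axis path such as 'a/b/c'.

    The first step may match at any depth. Each later step multiplies by the
    average number of `child` children per `parent` element, which assumes
    that fan-out depends only on the parent's tag.
    """
    tag_counts, edge_counts = synopsis
    steps = path.strip("/").split("/")
    estimate = float(tag_counts[steps[0]])
    for parent, child in zip(steps, steps[1:]):
        if tag_counts[parent] == 0:
            return 0.0
        estimate *= edge_counts[(parent, child)] / tag_counts[parent]
    return estimate


doc = """
<lib>
  <book><author/><author/></book>
  <book><author/></book>
  <article><author/></article>
</lib>
"""
synopsis = build_synopsis(doc)
print(estimate_path_count(synopsis, "lib/book/author"))  # 3.0, which is exact here
```

The summary stores one counter entry per distinct tag and tag pair, so its size is independent of the number of elements. The price is accuracy: for the document <a><b><c/></b><d><b/></d></a>, the estimate for a/b/c is 0.5 while the true count is 1, because the summary cannot distinguish the two contexts in which b occurs.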

The goal of this project is to develop novel summarization techniques that support approximate query answering over semi-structured data sets. The resulting synopses can serve as the "statistics" component of query optimizers, or help users explore large collections of semi-structured data.

As part of our ongoing work on this topic, we have implemented an approximate query answering system termed AQAX. AQAX is based on the XClusters framework and provides fast and accurate approximate results over large XML data stores. Some information on AQAX can be found here.

Research in this project is mainly supported by NSF CAREER grant 0447966 and partly by an IBM Faculty Development Award.

People

External Collaborators

Alumni

Publications (by year)

2009 2008 2007 2006 2005 2004