SUXESS: Schema-based Utilities for XML: Efficient Storage and Statistics

Database Systems Lab
Indian Institute of Science

[Projects] [Download] [Publications] [Reports] [Contact]


People: K. Leela, Jayant Haritsa
User queries on XML documents are typically expressed as regular path expressions. A variety of indexing techniques for efficiently retrieving the results to such queries have been proposed in the recent literature. While these techniques are applicable to documents that are completely schema-less, in practice XML documents often adhere to a schema, such as a DTD. In this project, we have developed SphinX, a new XML indexing scheme that utilizes the schema to significantly enhance the search process. SphinX implements a persistent index structure that seamlessly combines the schema information with standard B-tree technology, resulting in a simple and scalable solution. A performance evaluation over a variety of XML documents, including the Xmark benchmark, indicates significant benefits with regard to both index construction and index access.

People: Pankaj Tolani, Jayant Haritsa
XML documents are extremely verbose since the "schema" is repeated for every "record" in the document. While a variety of compressors are available to address this problem, they are not designed to support direct querying of the compressed document, a useful feature from a database perspective. In this work, we propose a new compression tool called XGrind, that directly supports queries in the compressed domain. A special feature of XGrind is that the compressed document retains the structure of the original document, permitting reuse of the standard XML techniques for processing the compressed document. Performance evaluation over a variety of XML documents and user queries indicates that XGrind simultaneously delivers improved query processing times and reasonable compression ratios.

People: Maya Ramanath, Jayant Haritsa
Juliana Freire, Prasan Roy
We consider the problem of cost-based strategies to derive efficient relational configurations for XML applications that subscribe to an XML Schema. In particular, we propose a flexible framework for XML schema transformations and show how it can be used to design algorithms to search the space of equivalent relational configurations. We study the impact of the schema transformations and query workload on the search strategies for finding efficient XML-to-relational mappings. In addition, we propose several optimizations to speed up the search process. Our experiments indicate that a judicious choice of transformations and search strategies can lead to relational configurations of substantially higher quality than those recommended by previous approaches.

People: Maya Ramanath, Jayant Haritsa
Juliana Freire, Prasan Roy, Jerome Simeon, Lingzhi Zhang
The availability of summary data for XML documents has many applications, from providing users with quick feedback about their queries, to cost-based storage design and query optimization. StatiX is a novel XML Schema-aware statistics framework that exploits the structure derived by regular expressions (which define elements in an XML Schema) to pinpoint places in the schema that are likely sources of structural skew. This information can be used to build concise, yet accurate, statistical summaries for XML data. StatiX leverages standard XML technology for gathering statistics, notably XML Schema validators, and it uses histograms to summarize both the structure and values in an XML document. We develop algorithms that decompose schemas to obtain statistics at different granularities and discuss how statistics can be gathered as documents are validated. Our experimental evaluation demonstrates the accuracy and scalability of our approach.

People: Maya Ramanath, Jayant Haritsa
Juliana Freire, Lingzhi Zhang
The problem of estimating the cardinality of XML queries has received considerable attention recently. Existing approaches assume a static scenario wherein the XML data does not change, and focus on collecting sufficiently detailed statistics to accurately estimate the cardinality of various subsets of XQuery. However, many applications are dynamic and involve updates to the XML data. In this project, we investigate efficient strategies for incrementally maintaining statistical summaries as and when updates are applied to the data. We propose algorithms that handle both the addition of new documents as well as random insertions in  the document tree. Our performance evaluation indicates that our incremental techniques are significantly faster than the naive recomputation approach and that accuracy can be maintained even with a fixed memory budget.

People: Priti Patil, Jayant Haritsa
When hosting XML information on relational backends, a mapping has to be established between the schemas of the information source and the target storage repositories. A rich body of recent literature exists for mapping isolated components of XML Schema to their relational counterparts, especially with regard to table configurations. In this project, we are developing the Elixir system for designing ``industrial-strength'' mappings for real-world applications. Specifically, it produces an information-preserving holistic mapping that transforms the complete XML world-view (XML documents, XML schema with constraints, XQuery queries including triggers and views) into a full-scale relational mapping (table definitions, integrity constraints, indices, triggers, and views) that is tuned to the application workload. A key design feature of Elixir is that it performs all its mapping-related optimizations in the XML source space, rather than in the relational target space. Further, unlike the XML mapping tools of commercial database systems, which rely heavily on user input, Elixir takes a principled cost-based approach to automatically find an efficient relational mapping. A prototype of Elixir is operational and we have quantitatively demonstrated its functionality and efficacy on a variety of real-life XML schemas.


XGrind (hosted at sourceforge)


Schema-conscious XML Indexing
K. Leela and J. Haritsa
Information Systems Journal (in press)

Holistic Schema Mappings for XML-on-RDBMS
P. Patil and J. Haritsa
Proc. of 11th Intl. Conf. on Database Systems for Advanced Applications (DASFAA), Singapore, April 2006 (in press)

Incremental Maintenance of Schema-based Dynamic XML Statistics
M. Ramanath, L. Zhang, J. Freire and J. Haritsa
Proc. of 21st IEEE Intl. Conf. on Data Engineering (ICDE), Tokyo, Japan, April 2005, pgs. 273-284

A Flexible Infrastructure for Gathering XML Statistics and Estimating Query Cardinality (demo)
J. Freire, M. Ramanath and L. Zhang
Proc. of 20th IEEE Intl. Conf. on Data Engineering (ICDE), Boston, USA, March 2004, pgs. 857

Searching for Efficient XML-to-Relational Mappings
M. Ramanath, J. Freire, J. Haritsa and P. Roy
Proc. of 1st Intl. XML Database Symposium (XSym 2003), Berlin, Germany, September 2003
published as Database and XML Technologies, Springer, Lecture Notes in Computer Science (LNCS) 2824,  eds. Z. Bellahsene, A.B. Chaudhri, E. Rahm, M. Rys and R. Unland, pgs. 19-36

Bridging the XML-Relational Divide with LegoDB: A Demonstration (demo)
P. Bohannon, J. Freire, J. Haritsa, M. Ramanath, P. Roy and J. Simeon
Proc. of 19th IEEE Intl. Conf. on Data Engineering (ICDE), Bangalore, India, March 2003, pgs. 759-761

LegoDB: Customizing Relational Storage for XML Documents (demo)
P. Bohannon, J. Freire, J. Haritsa, M. Ramanath, P. Roy and J. Simeon
Proc. of 28th Intl. Conf. on Very Large Data Bases (VLDB), Hong Kong, China, August 2002, pgs. 1091-1094

StatiX: Making XML Count
J. Freire, J. Haritsa, M. Ramanath, P. Roy and J. Simeon
Proc. of ACM SIGMOD Intl. Conf. on Management of Data, Madison, Wisconsin, USA, June 2002, pgs. 181-192

XGrind: A Query-Friendly XML Compressor
P. Tolani and J. Haritsa
Proc. of 18th IEEE Intl. Conf. on Data Engineering (ICDE), San Jose, USA, February 2002, pgs. 225-234


IMAX: Incremental maintenance of schema-based XML statistics
M. Ramanath, L. Zhang, J. Freire and J. Haritsa
TR-2004-05, DSL/SERC

Searching for Efficient XML-to-Relational Mappings
M. Ramanath, J. Freire, J. Haritsa and P. Roy
TR-2003-01, DSL/SERC

SphinX: Schema-conscious XML Indexing
K. Leela and J. Haritsa
TR-2001-04, DSL/SERC

XGRIND: A Query-friendly XML Compressor
P. Tolani and J. Haritsa
TR-2001-03, DSL/SERC


Email: haritsa [AT] dsl [dot] serc [dot] iisc [dot] ernet [dot] in