Last Updated: Oct 25, 2002
The heuristic phase of optimization uses query-to-query rewrites to transform queries into equivalent expressions that are more amenable to plan generation. This process has proven to be error-prone in the past, even when queries have been confined to the relatively simple Relational Data Model [Cod70]. Rewrites over nested queries and queries returning duplicates have been especially problematic, as evidenced by the well-known COUNT bug revealed of the query rewrites for nested queries presented in [Kim82]. The advent of object databases (i.e., object-oriented and object-relational databases) only exacerbates this issue by introducing more complex data and by implication, more complex queries and more complex query rewrites.
We introduce a novel approach to expressing query rewrites that permits formal reasoning about their correctness. Unlike existing approaches which express query rewrites fully or partially with programming language code, this approach enables verification of query rewrites with a theorem prover. At its foundation lies KOLA: our combinator-based (i.e., variable-free) query algebra that permits expression of simple query rewrites (rewrite rules) without code. Rewrite rules are formal, but lack the expressivity to capture many query rewrites that get used in practice. We address this issue in two ways:
This page describes our completed and on-going work in this area. The first section lists the papers we've written describing the development and features of COKO and KOLA. In the second section, we include source code for various implementations of our work. In the third section, we present a formal specification of KOLA expressed in Larch [GHG+93] as well as theorem prover scripts that verify well over 500 query rewrite rules expressed over KOLA queries. In the fourth section, we show the project members and in the fifth section we show related work in the field.
This material is based upon work supported by the National Science Foundation under Grant No. 9984960. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
Mitch Cherniack and Xiaoyu Wang, Submitted to VLDB '02. (pdf)
As very large and complex database components, query optimizers are brittle and error-prone. Over the past several years, testing has revealed that database systems frequently return the wrong results to queries. Although it can be shown that testing is sound (i.e., that identified bugs are indeed bugs), it is difficult to prove that testing is complete (i.e., that it finds all bugs).
Testing is a form of software validation. In this paper, we present work addressing the verification, rather than validation, of query optimizers. This work introduces formal methods to the design and construction of query optimizers, thereby ensuring that errors are never introduced into optimizer software in the first place. Specifically, in this paper we present the formal specification of a generic plan algebra (GPA), as well as the files of records it acts upon and the supporting structures (e.g., indexes) it uses. We then show the viability of this algebra by describing translations of standard SQL queries into plans, and some verified plan generation rules.
Daniel Abadi and Mitch Cherniack, Demonstration: Proceedings of the ACM SIGMOD International Conference on Management of Data, Madison, WI, June, 2002. To appear. (pdf)
Mitch Cherniack, Ph.D. Dissertation, Brown University, December, 1998. (Compressed Postscript)
Query optimizers generate plans to retrieve data requested by queries. Optimizers are hard to build because for any given query, there can be a prohibitively large number of plans to choose from. Typically, the complexity of optimization is handled by dividing optimization into two phases: a heuristic phase (called query rewriting) that narrows the space of plans to consider, and a cost-based phase that compares the relative merits of plans that lie in the narrowed space.
The goal of query rewriting is to transform queries into equivalent queries that are more amenable to plan generation. This process has proven to be error-prone. Rewrites over nested queries and queries returning duplicates have been especially problematic, as evidenced by the well-known COUNT bug of the unnesting rewrites of Kim. The advent of object-oriented and object-relational databases only exacerbates this issue by introducing more complex data and by implication, more complex queries and query rewrites.
This thesis addresses the correctness issue for query rewriting. We introduce a novel framework (COKO-KOLA) for expressing query rewrites that can be verified with an automated theorem prover. At its foundation lies KOLA: our combinator-based query algebra that permits expression of simple query rewrites (rewrite rules) without imperative code. While rewrite rules are easily verified, they lack the expressivity to capture many query rewrites used in practice. We address this issue in two ways:
The recurring theme of this work is that all of the proposed techniques are made possible by a combinator-based representation of queries.
Mitch Cherniack, Ashok Malhotra and Stan Zdonik, (Compressed Postscript)
COKO-KOLA is a two-tiered framework for building efficient and correct query rewriters. KOLA is a combinator-based (i.e., variable-free) query algebra and representation that permits the expression of basic query rewrites in terms of rewrite rules that can be verified with a theorem prover. COKO is a programming language for expressing complex query rewrites using sets of KOLA rewrite rules supplemented with a (firing) algorithm that controls their application. This paper describes our experiences using COKO-KOLA in an industrial setting. The goal of this work was to build a translating query rewriter for an object-oriented database built on top of the relational database DB2, that would translate object queries posed in a subset of OQL into equivalent SQL queries over DB2.
To assess the difficulty of using COKO-KOLA in a practical setting, we reflect on the effort that was required to specify normalizations specific to translation, and the rewrites to do the ``actual work'' of making object queries relational. Interestingly, the normalization of translated queries proved most difficult to write. But the experience revealed deficiencies in the COKO language and compiler implementation that can be addressed in future versions. Most encouraging was the ease with which we were able to write the transformations that did the actual work of the query rewrite. We were able to express these useful query rewrites with only a few rules and few lines of firing algorithm code.
Mitch Cherniack and Stan Zdonik, Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), New York, NY, August, 1998. (Compressed Postscript)
Rule-based optimizers approach query optimization from a software engineering perspective. Rules make optimizers extensible and straightforward to verify. But declarative (i.e., code-free) rewrite rules are too simplistic to express the kinds of query transformations that get used in practice. Existing rule-based systems (e.g., [PHH92], [GM93], [Gra95]) address this problem by adding code to rules that seize control of rule-firing from the unification engine. But code makes rules hard to reason about and therefore counter the verifability benefits provided by rules in the first place.
The goal of the COKO-KOLA project is to express the rules of rule-based optimizers in a manner facilitating their verification with a theorem prover. In [CZ96], we addressed the issue of expressing query transformations that were too general to be expressed with rewrite rules. In this paper, we address the complementary issue of expressing query transformations that are too specific to express with rewrite rules. Such transformations require rewrite rules to be supplemented with additional semantic conditions that matching queries must satisfy to avoid erroneous firings. This work considers both the expression of such transformations (with conditional rewrite rules), as well as the expression of inference rules that instruct the optimizer in how to decide if the semantic conditions hold.
Semantic query optimization has been considered before (e.g., [HZ80], [Kin81], [AF95], [GGMR97]). But our work is unique in that semantic transformations are expressed declaratively and therefore are verifiable with a theorem prover. Further, our use of a modifiable set of inference rules to guide semantic inference makes rule-based optimizers extensible in ways not previously considered. This complements the extensibility benefits that come from using rewrite rules to express query transformations. By modifying the set of rewrite rules used by an optimizer, an optimizer implementor can change the potential query expressions the optimizer might produce. On the other hand, by modifying the inference rules used by an optimizer, an optimizer implementor can change the circumstances under which each of these rewrite rules is fired.
On-line Appendix:
To Execute the Theorem Prover Scripts, download the package of specifications and theorem prover scripts described below
Mitch Cherniack and Stan Zdonik, Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, June, 1998. (Compressed Postscript)
Rule-based optimizers are extensible because they consist of modifiable sets of rules. For modification to be straightforward, rules must be easily reasoned about (i.e., understood and verified). At the same time, rules must be expressive and efficient (to fire) for rule-based optimizers to be practical. Production-style rules (as in Starburst) are expressed with code and are hard to reason about. Pure rewrite rules (as in Gral) lack code, but cannot atomically express complex transformations (e.g., normalizations). Some systems allow rules to be grouped, but sacrifice efficiency by providing limited control over their firing. Therefore, none of these approaches succeeds in making rules expressive, efficient and understandable.
We propose a language (COKO) for expressing an alternative form of input to a rule-based optimizer. A COKO transformation consists of a set of declarative (KOLA) rewrite rules and a (firing) algorithm that specifies their firing. It is straightforward to reason about COKO transformations because all query modification is expressed with declarative rewrite rules. Firing is specified algorithmically with an expressive language that provides direct control over how query representations are traversed, and under what conditions rules are fired. Therefore, COKO achieves a delicate balance of understandability, efficiency and expressivity.
On-line Appendices:
Mitch Cherniack, Technical Report, December, 1996. (Compressed Postscript)
This paper presents our OQL -> KOLA translator and a proof of its correctness. We begin by introducing a nested relational variant of OQL, giving it a denotational semantics. We then provide an operational (reduction) semantics of KOLA, showing how every KOLA query reduces to an OQL query. We then define an OQL -> KOLA translation function, T, and show its correctness by showing that translation of any OQL query gives a KOLA query whose OQL equivalent has the same semantics as the original query.
Mitch Cherniack and Stanley B. Zdonik, Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, June, 1996. (Compressed Postscript)
Rule-based optimizers and optimizer generators use rules to specify query transformations. Rules act directly on query representations, which typically are based on query algebras. But most algebras complicate rule formulation, and rules over these algebras must often resort to calling to externally defined bodies of code. Code makes rules difficult to formulate, prove correct and reason about, and therefore compromises the effectiveness of rule-based systems.
In this paper we present KOLA; a combinator-based algebra designed to simplify rule formulation. KOLA is not a user language, and KOLA's variable-free queries are difficult for humans to read. But KOLA is an effective internal algebra because its combinator-style makes queries manipulable and structurally revealing. As a result, rules over KOLA queries are easily expressed without the need for supplemental code. We illustrate this point, first by showing some transformations that despite their simplicity, require head and body routines when expressed over algebras that include variables. We show that these transformations are expressible without supplemental routines in KOLA. We then show complex transformations of a class of nested queries expressed over KOLA. Nested query optimizations, while having been studied before, have seriously challenged the rule-based paradigm.
Mitch Cherniack and Eui-Suk Chung, February, 1996. (Compressed Postscript)
The ability of an object to mutate is one of the defining characteristics of Object-Oriented Databases (OODBs). But the impact of object mutability on querying is typically ignored. Presumably, this is because queries are usually assumed free of side-effects, and therefore objects are effectively immutable while they are queried. But mutability affects other object behaviors aside from how they change, such as how they are compared for equality, and if it is possible to detect that they are shared. In this paper, we show how these behaviors impact querying in general and query optimization and mutable set construction in particular.
The issues considered in this paper arose during our efforts to provide OQL-based query facilities for Thor. Whereas OQL (and specifically, the ODMG object model) make specific assumptions about the mutability of sets, Thor had not adopted a policy for sets when this project began. In formulating a policy, we considered the impact of the ODMG policy on query efficiency and then considered how Thor could reconcile different assumptions about set mutability with OQL. The analysis of these issues is the focus of the paper. This analysis, though done with respect to Thor, has implications for any system that wishes to adopt the ODMG standard (as many have pledged to do).
Marian Nodine, Mitch Cherniack and Mark Nodine, Proceedings of the Seventh International Workshop on Persistent Object Systems, Cape May, New Jersey, USA, June, 1996. (Compressed Postscript)
In this paper, we present a simple cost model that uses semantic information about collections (namely, knowledge about what are their supersets) to predict sizes of collections generated as a result of querying. The results presented are described in terms of AQUA; the precursor algebra to KOLA.
Mitch Cherniack, Stanley B. Zdonik and Marian Nodine, Proceedings of the Fifth International Workshop on Database Programming Languages, Gubbio, Italy, September, 1995. (Compressed Postscript)
The AQUA [LMSV+93] query algebra allows user-defined equivalence relations as arguments to query operators that generalize standard set operations. These predicates determine what objects are included in the query result, and the duplicates that must be removed.
While an expressive enhancement, the use of arbitrary equivalence relations to decide set membership can result in sets with counterintuitive behavior, and therefore can make queries return unreasonable results. In this paper, we show that equality predicates assume two roles with respect to sets. Distinguishers differentiate between set members and implicitly give meaning to standard set properties such as set equality. Constructors determine which object from input sets contribute to the query result. The requirements of distinguishers and constructors differ. AQUA's set operators are problematic because they use constructors where distinguishers are required. We propose alternatives to AQUA's set operators that address this limitation.
The COKO compiler translates COKO transformations into C++ classes that
can be subsequently compiled into executables that transform KOLA
trees.
This implementation uses Ox:
a preprocessor of yacc and lex files. As this software is no
longer supported, we offer two versions of the COKO-KOLA
implementation. The first does not require Ox to be installed,
and instead includes the lex and yacc files generated by Ox at our site.
The second requires Ox to be installed. We have also provided a link
to a site where Ox can be found. Note that the linux binary only
works on libc.so.4 and with a.out executable support.
Distribution:
the theorem prover scripts verify the rules presented in our papers presented in SIGMOD '98 and VLDB '98. As well, the specifications give a formal specification for KOLA. Compressed and tar'd directory tree of specifications and theorem prover scripts is located here. The README file is located here.
Our work on query optimization builds upon the work of Gail Mitchell who proposed a region-based design of a modular extensible query optimizer. COKO can be viewed as a language for building regions. Papers concerning the design and function of EPOQ, the extensible query optimizer, are listed below:
The roots of KOLA are found in the query algebra AQUA, designed by Ted Leung, Gail Mitchell, Bharathi Subramanian, Bennet Vance, Scott Vandenburg and Stan Zdonik.
Papers concerning AQUA are listed below: