Picasso Database Query Optimizer Visualizer
© Indian Institute of Science

ALGORITHMIC DETAILS
1. Plan Reduction Algorithms

A detailed assessment of the Plan Reduction problem is available in this technical report. It is shown in the report that, given an optimality criterion of minimizing the plan cardinality of the reduced plan diagram for a user-specified cost increase threshold, finding the optimal solution is NP-Hard. Therefore, the only recourse is to develop heuristic-based algorithms – Picasso 1.0 supported two greedy algorithms, referred to as AreaGreedy and CostGreedy, respectively. In Picasso 2.0, only CostGreedy is retained since it has significantly better performance characteristics, with regard to both reduction and efficiency, than AreaGreedy.
CostGreedy

This algorithm operates under the assumption that the Cost Domination Principle (query processing costs increase as we move outwards from the origin in the selectivity space) holds, and therefore only plan swallowing possibilities in the first quadrant are considered with respect to the plan under consideration. A direct corollary of this assumption is that in exploring swallowing possibilities for a given plan, we do not need to retain all points contained in each potential swallower plan, but only those corresponding to the minimum cost in the first quadrant. The algorithm processes the query points starting from the top-right corner (with the Cost Domination assumption, this corner will have the highest cost of all points) and progressively makes its way to the origin, in the process using the Greedy SetCover algorithm to recolor the points – the complete details are in this technical report.

The time complexity of CostGreedy is O(mn), where n is the number of plans in the plan diagram and m is the number of points in the diagram, a significant improvement over AreaGreedy, whose complexity is O(m^2). More importantly, it can be shown that CostGreedy provides the best possible approximation factor with respect to the optimal reduction.
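To illustrate the set-cover core of this approach, here is a minimal Python sketch (all names are hypothetical, and it deliberately omits the first-quadrant pruning and top-right-to-origin scan order that make the real CostGreedy efficient): plans that can swallow the most still-uncovered points within the λ-threshold are kept greedily, and every point is then recolored with a kept plan.

```python
def reduce_plans(points, cost, threshold):
    """Greedy set-cover sketch of threshold-based plan reduction.

    points:    dict mapping query point -> its optimal plan
    cost:      dict mapping (plan, point) -> cost of that plan at that
               point (must include each point's own optimal plan)
    threshold: allowed fractional cost increase, e.g. 0.2 for a 20% lambda
    """
    opt = {p: cost[(points[p], p)] for p in points}
    plans = set(points.values())
    # Points each candidate swallower can cover within the threshold;
    # a plan always covers its own optimality region.
    covers = {q: {p for p in points
                  if cost.get((q, p), float("inf")) <= (1 + threshold) * opt[p]}
              for q in plans}
    uncovered, kept = set(points), []
    while uncovered:
        # Greedy SetCover step: keep the plan covering the most
        # still-uncovered points.
        best = max(plans - set(kept), key=lambda q: len(covers[q] & uncovered))
        kept.append(best)
        uncovered -= covers[best]
    # Recolor: assign every point to the first kept plan that covers it.
    return {p: next(q for q in kept if p in covers[q]) for p in points}
```

With λ = 0 no swallowing is possible and the diagram is unchanged; as λ grows, the cover sets grow and fewer plans survive.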
Knee Estimator
As an aid to help users make informed choices of the values to enter in the Enter Cost Increase Threshold dialog box, a rough estimate of the location of the threshold value corresponding to the “knee” in the graph characterizing “Plan Cardinality of the Reduced Plan Diagram versus Cost Increase Threshold”, is provided in the box. Also provided is an estimate of the minimum threshold required to reach a desired number of plans (specified by the DESIRED_NUM_PLANS macro in PicassoConstants.java, set to 10 in the distribution) in the reduced plan diagram. The estimator implemented in Picasso is the AmmEst estimator described in this technical report.
Cost Increase Statistics

In the panel immediately to the right of the reduced diagram, we show bounds on the minimum, average and maximum cost increase (in percentage terms) computed over all points that have undergone plan replacement during the reduction process. By definition, these values will all be smaller than the user-specified cost increase threshold.
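The statistics themselves are straightforward to compute; a small sketch over hypothetical (original cost, replacement cost) pairs:

```python
def cost_increase_stats(replacements):
    """replacements: (original_cost, replacement_cost) pairs, one per
    point whose plan was swapped during reduction.
    Returns (min, average, max) percentage cost increases."""
    increases = [100.0 * (new - old) / old for old, new in replacements]
    return min(increases), sum(increases) / len(increases), max(increases)
```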
2. Robust Plan Reduction Algorithms

The above-mentioned reduction algorithms, AreaGreedy and CostGreedy, ensure that the replacement plans are within the cost-increase-threshold at all points in the optimality regions of the replaced plans. However, in practice, query optimizers are subject to significant errors in their selectivity estimates, and the location of the query at run-time may therefore lie outside the optimality region of the replaced plan. At such locations, the performance of the replacement plan could be much worse than the replaced plan, making the replacement harmful from a global perspective. This possibility naturally leads to the concept of a robust replacement – that is, a replacement where the λ-threshold criterion is satisfied at all points in the selectivity space, i.e. the replacement ensures global safety. Providing robust replacements requires costing of plans outside their optimality regions, and it is therefore possible only with database engines that support the Abstract Plan Costing feature (in Picasso, these engines are SQL Server and Sybase ASE).

Direct approaches to achieving robust plan reduction are computationally prohibitively expensive, but through empirical modeling of plan cost function behavior, we have been able to develop efficient alternatives, which are described in this technical report. The algorithm described there, called SEER, guarantees global safety and requires Abstract-plan-costing operations only along the surfaces of the selectivity space hyper-cube. More remarkably, we have found through detailed experiments with industrial-strength benchmark environments that SEER provides global safety without jeopardizing anorexic reduction. Unlike the Cost Bounding-based reduction algorithms, which are completely conditioned on Cost Domination holding true, the Abstract-plan-costing-based approach can support a richer class of plan cost models wherein the plan cost function can have one local maximum or one local minimum in the interior of the selectivity space. In Picasso 2.0, the following two variations of SEER have been implemented:
CC-SEER (CornerCube SEER)

CC-SEER guarantees global safety like SEER but improves on its efficiency by implementing a more conservative test for global robust replacement. Specifically, CC-SEER only involves Abstract-plan-costing operations at the corner hyper-cubes of the selectivity space and is therefore significantly faster than SEER. Moreover, its performance is resolution-independent unlike SEER, and therefore the performance gap between CC-SEER and SEER increases with higher-resolution diagrams. Concretely, while SEER's time complexity is O(m·r^(d-1)), where m is the number of points in the diagram, r is the resolution and d is the dimensionality, CC-SEER cuts it down to O(m·4^d). Our experimental results indicate that CC-SEER's reduction quality is comparable to that of SEER, and it therefore provides an extremely attractive tradeoff between efficiency and reduction quality.
LiteSEER

LiteSEER is a light-weight heuristic-based variant of SEER that makes its replacement decisions solely based on Abstract-plan-costing operations at the corners of the hypercube, and is therefore extremely efficient. In fact, it can be shown that LiteSEER is optimal in the sense that it incurs the minimum work (complexity-wise) required by any reduction algorithm. While it does not guarantee global safety, our experimental results indicate that in practice, its safety and reduction characteristics are quite close to those of SEER and CC-SEER.
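The corner-based decision rule underlying LiteSEER can be sketched as follows; this is a simplified illustration with a hypothetical interface, where cost_at stands in for an Abstract Plan Costing call:

```python
from itertools import product

def corner_safe(cost_at, dims, replaced, replacement, lam):
    """Accept replacing plan `replaced` by plan `replacement` only if,
    at every corner of the d-dimensional selectivity hypercube, the
    replacement's (abstract-plan) cost stays within a factor (1 + lam)
    of the replaced plan's cost. LiteSEER-style: very cheap, but the
    interior of the space is not examined, so global safety is a
    heuristic expectation rather than a guarantee."""
    for corner in product((0.0, 1.0), repeat=dims):
        if cost_at(replacement, corner) > (1 + lam) * cost_at(replaced, corner):
            return False
    return True
```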
Knee Estimator
Currently, no estimators have been designed for CC-SEER, and we intend to address this issue in our future work. As an interim measure, we suggest that users first obtain the estimates from CostGreedy, and then use the same with CC-SEER. A word of caution, however – there may be significant differences between the Reduced-Plan-Cardinality vs. Threshold graphs of CostGreedy and CC-SEER. This is because, on the one hand, there is greater potential for reduction in CC-SEER due to using explicit costs instead of cost bounds. But, on the other hand, the global safety guarantee may prevent some replacements permitted by CostGreedy. Our experience has been that the latter factor dominates in higher-resolution diagrams and therefore the knee of the graph is typically at a larger threshold for CC-SEER as compared to CostGreedy.
The above-mentioned issues for CC-SEER arise with LiteSEER as well, and due to its relaxed safety criterion, LiteSEER's knee will always be lower than that of CC-SEER.
Cost Increase Statistics
The cost increase statistics, mentioned above for the CostGreedy and AreaGreedy algorithms, are not provided for the robust reduction algorithms, CC-SEER and LiteSEER. This is because computing these statistics would require costing (using Abstract-plan-costing) of the replacement plans at all replaced points in the selectivity space, which would be prohibitively expensive. However, at any specific replaced point in the reduced plan diagram, the cost of the replacement plan can be ascertained by using the Shift+Right-click command which provides the QueryInfo message box.
3. Plan Difference Algorithm

The plan difference algorithm visually highlights the difference between a pair of selected plans. The common parts between the two plan trees are shown in white, whereas all differences are colored as mentioned in the Plan Tree Windows documentation, reproduced below:
§ Node labels, Node parameters and Node inputs are identical: White nodes with Black input links
§ Node labels are the same: White Fill
  o Parameters different: Green-bordered nodes with Black input links
  o Left and right inputs swapped: Orange-bordered nodes with Blue input links
  o Left and right inputs different: Orange-bordered nodes with Red input links
§ Node labels are different: Red-bordered nodes filled with the native node color
  o Left and right inputs swapped: Blue input links
  o Left and right inputs different: Red input links
§ Nodes are unmapped (i.e. no corresponding node in the other tree): Nodes filled with the native node color and Black input links
We describe here the node matching process between the two trees, PA and PB. In our description, the term “branch” is used to refer to any connected chain of nodes between a pair of branching nodes, or between a branching node and a leaf, in these trees. Branches are directed from the lower node to the higher node. The matching proceeds as follows:
1. First, all the leaf nodes (relations) and all the nodes with binary inputs (typically join nodes) are identified for PA and PB.

2. A leaf of PA is matched with a leaf of PB if and only if they both have the same relation label. In the situation that there are multiple matches available (that is, if the same relation label appears in multiple leaves), an edit-distance computation is made between the branches of all pairs of matching leaves between PA and PB. The assignments are then made in increasing order of edit-distances.

3. A binary node of PA is matched with a binary node of PB if the set of base relations that are processed is the same. If the node label, node parameters, and the left and right inputs are identical (in terms of base relations), the node is colored white and the input links are black. However, if the node label is different, or the node parameters are different, or if the left and right input relation subsets are different, then the node and the input links are colored as per the above coloring scheme.
4. A minimal edit-distance computation is made between the branches arising out of each pair of matched nodes, and the nodes that have to be added or deleted, if any, in order to make the branches identical are colored as per the above coloring scheme. Unmodified nodes, on the other hand, are matched with their counterparts in the sibling tree and colored either white or white with green border.
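The edit-distance-based leaf assignment of Step 2 can be sketched as below. The data layout is hypothetical (each leaf carries its relation label and its branch as a list of node labels); leaves with equal relation labels are paired greedily in increasing order of branch edit distance:

```python
def edit_distance(a, b):
    """Levenshtein distance between two branches, each given as a
    sequence of node labels (rolling single-row implementation)."""
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # delete from a
                                   d[j - 1] + 1,      # insert into a
                                   prev + (x != y))   # match/substitute
    return d[len(b)]

def match_leaves(leaves_a, leaves_b):
    """leaves_*: dict leaf_id -> (relation_label, branch), where branch
    is the chain of node labels from the leaf up to the next branching
    node. Pairs same-label leaves greedily, in increasing order of
    branch edit distance."""
    pairs = sorted(
        ((edit_distance(ba, bb), la, lb)
         for la, (ra, ba) in leaves_a.items()
         for lb, (rb, bb) in leaves_b.items() if ra == rb),
        key=lambda t: t[0])
    used_a, used_b, matches = set(), set(), {}
    for _, la, lb in pairs:
        if la not in used_a and lb not in used_b:
            matches[la] = lb
            used_a.add(la)
            used_b.add(lb)
    return matches
```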
Note: When plan difference is carried out on an OperatorDiff-based plan diagram, the trees will always include at least one colored node or link (that is, unlike ParameterDiff-based plan diagrams, it is not possible to have pure black-and-white trees with only green node borders).
4. Approximation Algorithms

Plan diagram production can be computationally extremely expensive for higher-dimensional query templates and fine-grained diagram resolutions. For example, a 3D plan diagram with a resolution of 100 on each dimension could well take about a week to produce. To address this problem, diagram approximation algorithms have been incorporated in Picasso 2.0. Specifically, these algorithms produce approximate diagrams by explicitly optimizing only a small subset of points from the query selectivity space and inferring the plans at the remaining points, thereby drastically reducing the computation time. Two categories of errors can arise in the approximation process:
1. Plan Identity Error: This error metric refers to the possibility of the approximation missing out on a subset of the plans present in the exact plan diagram. It is computed as the percentage of plans lost in the approximation process.
2. Plan Location Error: This error metric refers to the possibility of incorrectly assigning plans to query points in the approximate plan diagram. It is computed as the percentage of incorrectly assigned points. This error is also challenging to control since the plan boundaries can be highly non-linear and are sometimes even irregular in shape.
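Given the exact and approximate diagrams as point-to-plan mappings (a hypothetical representation), the two metrics can be computed as in this sketch:

```python
def approximation_errors(exact, approx):
    """exact, approx: dict mapping query point -> plan.
    Returns (identity_error, location_error) as percentages:
    the fraction of exact-diagram plans missing from the approximation,
    and the fraction of points assigned a different plan."""
    exact_plans = set(exact.values())
    missing = exact_plans - set(approx.values())
    identity = 100.0 * len(missing) / len(exact_plans)
    wrong = sum(1 for p in exact if approx.get(p) != exact[p])
    location = 100.0 * wrong / len(exact)
    return identity, location
```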
We take from the user the tolerance levels for these two errors, and our algorithms attempt to ensure that the actual errors in the approximate diagram are in close proximity to these thresholds, while at the same time minimizing the diagram production overheads. The two approximation algorithms currently implemented in Picasso are briefly described below.
Random Sampling with Nearest Neighbor Inference (RS_NN)
This algorithm employs classical random sampling and nearest neighbor classification to generate approximate diagrams. It begins by optimizing a small seed-set of randomly chosen points in the diagram, and then repeatedly optimizes a similarly-sized random group from the remaining un-optimized query points in each successive iteration. The stopping condition for the iteration is when the identity-error is expected, using a statistical estimator, to have reached below the user's tolerance level. After this, the algorithm repeatedly (a) infers the plan assignments for all un-optimized points using nearest neighbor classification, and then (b) explicitly optimizes a subset of these points, until the location-error is estimated to have reached below the user's tolerance level. To reduce the number of misclassifications along plan region boundaries, the plan assignments of the inferred points in the diagram are passed through a low-pass filter in the final step.
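The skeleton of this scheme can be sketched as follows. This is a toy illustration with a hypothetical interface: fixed iteration counts stand in for the statistical stopping estimators, the final low-pass filter is omitted, and optimize stands in for an optimizer invocation.

```python
import random

def rs_nn(points, optimize, seed_frac=0.1, iters=3, rng=None):
    """Toy RS_NN-style approximation: explicitly optimize a few random
    batches of points, then let every remaining point inherit the plan
    of its nearest optimized neighbor.

    points:   list of numeric coordinate tuples
    optimize: function(point) -> plan
    """
    rng = rng or random.Random(0)
    batch = max(1, int(seed_frac * len(points)))
    order = list(points)
    rng.shuffle(order)
    known = {}
    for _ in range(iters):            # fixed count in place of the
        for p in order[:batch]:       # identity/location-error estimators
            known[p] = optimize(p)
        order = order[batch:]

    def nearest(p):                   # nearest-neighbor inference
        return min(known, key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))

    return {p: known[p] if p in known else known[nearest(p)] for p in points}
```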
Grid Partitioning with Parametric Query Optimization Inference (GS_PQO)
This algorithm employs grid partitioning and the basic notions of parametric query optimization, along with plan-tree differencing techniques, to generate approximate diagrams. Unlike RS_NN, heuristics are used to measure the expected quality of the approximation. The entire diagram is partitioned into a coarse grid and the plan richness factor is computed for each of the grid cells. The plan richness is determined by the extent of structural differences between the plans at the corners of the cell. The cells are first arranged in a max-heap structure based on their plan richness factor. Then, the cell at the top of the heap is extracted and repeatedly sub-divided until the plan richness of the resultant cells falls below a certain threshold (this value is derived directly from the user-given error tolerances).
The plan assignment process operates as follows: If any pair of adjacent corner points have different plans, then explicitly optimize the middle point of the edge joining these two corners, otherwise assign the corner plan to the middle point. Then, process the center point of the cell based on the plans lying on the cross-hairs joining the middle points. Break the cell into smaller cells using the newly assigned points and calculate their plan richness factor. Insert them into the heap.
Once all the rectangles in the heap reach a plan richness factor below the desired threshold, the plans are inferred for all the remaining un-optimized points – the middle point of each edge is assigned a plan chosen randomly from the plans at the corners of the edge. The algorithm terminates when the heap is empty.
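The richness-driven refinement loop can be sketched in drastically simplified 2D form. Everything here is hypothetical and approximate: the count of distinct corner plans stands in for the structural plan-richness factor, plain quadrant splits replace the edge-midpoint and cross-hair assignment rules, and a cell budget substitutes for full convergence.

```python
import heapq

def refine_grid(optimize, cell, max_richness=1, max_cells=64):
    """Toy GS_PQO-style refinement: keep a max-heap of cells ordered by
    plan richness, and repeatedly split the richest cell into four
    quadrants until every cell's richness drops to max_richness or the
    cell budget is exhausted.

    optimize: function((x, y)) -> plan
    cell:     ((x0, y0), (x1, y1)) bounding box
    """
    def corners(c):
        (x0, y0), (x1, y1) = c
        return [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]

    def richness(c):
        # Stand-in richness measure: number of distinct corner plans.
        return len({optimize(p) for p in corners(c)})

    heap = [(-richness(cell), cell)]  # negate for a max-heap via heapq
    done = []
    while heap and len(done) + len(heap) < max_cells:
        neg, c = heapq.heappop(heap)
        if -neg <= max_richness:
            done.append(c)
            continue
        (x0, y0), (x1, y1) = c
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        for sub in (((x0, y0), (mx, my)), ((mx, y0), (x1, my)),
                    ((x0, my), (mx, y1)), ((mx, my), (x1, y1))):
            heapq.heappush(heap, (-richness(sub), sub))
    return done + [c for _, c in heap]
```

A uniform plan space is never split, while a cell straddling a plan boundary keeps being refined until the budget or richness threshold is met.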
For detailed descriptions of the approximation algorithms and the associated experimental results, please refer to this technical report.