Picasso Database Query Optimizer Visualizer
© Indian Institute of Science

ALGORITHMIC DETAILS
1. Plan Reduction Algorithms

A detailed assessment of the Plan Reduction problem is available in this technical report. It is shown in the report that, given an optimality criterion of minimizing the plan cardinality of the reduced plan diagram for a user-specified cost increase threshold, finding the optimal solution is NP-Hard. Therefore, the only recourse is to develop heuristic-based algorithms – Picasso 1.0 supported two greedy algorithms, referred to as AreaGreedy and CostGreedy, respectively. In Picasso 2.0, only CostGreedy is retained since it has significantly better performance characteristics, with regard to both reduction and efficiency, than AreaGreedy.
CostGreedy

This algorithm operates under the assumption that the Cost Domination Principle (query processing costs increase as we move outwards from the origin in the selectivity space) holds, and therefore only plan swallowing possibilities in the first quadrant are considered with respect to the plan under consideration. A direct corollary of this assumption is that in exploring swallowing possibilities for a given plan, we do not need to retain all points contained in each potential swallower plan, but only those corresponding to the minimum cost in the first quadrant. The algorithm processes the query points starting from the top-right corner (with the Cost Domination assumption, this corner will have the highest cost of all points) and progressively makes its way to the origin, in the process using the Greedy SetCover algorithm to recolor the points – the complete details are in this technical report.

The time complexity of CostGreedy is O(mn), where n is the number of plans in the plan diagram and m is the number of points in the diagram, a significant improvement over AreaGreedy, whose complexity is O(m^2). More importantly, it can be shown that CostGreedy provides the best possible approximation factor with respect to the optimal reduction.
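To illustrate the set-cover core of this approach, here is a minimal Python sketch (all names are hypothetical, and it deliberately omits the first-quadrant pruning and top-right-to-origin scan order that make the real CostGreedy efficient): plans that can swallow the most still-uncovered points within the λ-threshold are kept greedily, and every point is then recolored with a kept plan.

```python
def reduce_plans(points, cost, threshold):
    """Greedy set-cover sketch of threshold-based plan reduction.

    points:    dict mapping query point -> its optimal plan
    cost:      dict mapping (plan, point) -> cost of that plan at that
               point (must include each point's own optimal plan)
    threshold: allowed fractional cost increase, e.g. 0.2 for a 20% lambda
    """
    opt = {p: cost[(points[p], p)] for p in points}
    plans = set(points.values())
    # Points each candidate swallower can cover within the threshold;
    # a plan always covers its own optimality region.
    covers = {q: {p for p in points
                  if cost.get((q, p), float("inf")) <= (1 + threshold) * opt[p]}
              for q in plans}
    uncovered, kept = set(points), []
    while uncovered:
        # Greedy SetCover step: keep the plan covering the most
        # still-uncovered points.
        best = max(plans - set(kept), key=lambda q: len(covers[q] & uncovered))
        kept.append(best)
        uncovered -= covers[best]
    # Recolor: assign every point to the first kept plan that covers it.
    return {p: next(q for q in kept if p in covers[q]) for p in points}
```

With λ = 0 no swallowing is possible and the diagram is unchanged; as λ grows, the cover sets grow and fewer plans survive.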
Knee Estimator
As an aid to help users make informed choices of the values to enter in the Enter Cost Increase Threshold dialog box, a rough estimate of the location of the threshold value corresponding to the “knee” in the graph characterizing “Plan Cardinality of the Reduced Plan Diagram versus Cost Increase Threshold”, is provided in the box. Also provided is an estimate of the minimum threshold required to reach a desired number of plans (specified by the DESIRED_NUM_PLANS macro in PicassoConstants.java, set to 10 in the distribution) in the reduced plan diagram. The estimator implemented in Picasso is the AmmEst estimator described in this technical report.
Cost Increase Statistics

In the panel immediately to the right of the reduced diagram, we show bounds on the minimum, average and maximum cost increase (in percentage terms) computed over all points that have undergone plan replacement during the reduction process. By definition, these values will all be smaller than the user-specified cost increase threshold.
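The statistics themselves are straightforward to compute; a small sketch over hypothetical (original cost, replacement cost) pairs:

```python
def cost_increase_stats(replacements):
    """replacements: (original_cost, replacement_cost) pairs, one per
    point whose plan was swapped during reduction.
    Returns (min, average, max) percentage cost increases."""
    increases = [100.0 * (new - old) / old for old, new in replacements]
    return min(increases), sum(increases) / len(increases), max(increases)
```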
2. Robust Plan Reduction Algorithms

The above-mentioned reduction algorithms, AreaGreedy and CostGreedy, ensure that the replacement plans are within the cost-increase-threshold at all points in the optimality regions of the replaced plans. However, in practice, query optimizers are subject to significant errors in their selectivity estimates, and the location of the query at run-time may therefore lie outside the optimality region of the replaced plan. At such locations, the performance of the replacement plan could be much worse than the replaced plan, making the replacement harmful from a global perspective. This possibility naturally leads to the concept of a robust replacement – that is, a replacement where the λ-threshold criterion is satisfied at all points in the selectivity space, i.e. the replacement ensures global safety. Providing robust replacements requires costing of plans outside their optimality regions, and it is therefore possible only with database engines that support the Abstract Plan Costing feature (in Picasso, these engines are SQL Server and Sybase ASE).

Direct approaches to achieving robust plan reduction are computationally prohibitively expensive, but through empirical modeling of plan cost function behavior, we have been able to develop efficient alternatives, which are described in this technical report. The algorithm described there, called SEER, guarantees global safety and requires Abstract-plan-costing operations only along the surfaces of the selectivity space hyper-cube. More remarkably, we have found through detailed experiments with industrial-strength benchmark environments that SEER provides global safety without jeopardizing anorexic reduction. Unlike the Cost Bounding-based reduction algorithms, which are completely conditioned on Cost Domination holding true, the Abstract-plan-costing-based approach can support a richer class of plan cost models wherein the plan cost function can have one local maximum or one local minimum in the interior of the selectivity space. In Picasso 2.0, the following two variations of SEER have been implemented:
CC-SEER (CornerCube SEER)

CC-SEER guarantees global safety like SEER but improves on its efficiency by implementing a more conservative test for global robust replacement. Specifically, CC-SEER only involves Abstract-plan-costing operations at the corner hyper-cubes of the selectivity space and is therefore significantly faster than SEER. Moreover, its performance is resolution-independent unlike SEER, and therefore the performance gap between CC-SEER and SEER increases with higher-resolution diagrams. Concretely, while SEER's time complexity is O(m·r^(d-1)), where m is the number of points in the diagram, r is the resolution and d is the dimensionality, CC-SEER cuts it down to O(m·4^d). Our experimental results indicate that CC-SEER's reduction quality is comparable to that of SEER, and it therefore provides an extremely attractive tradeoff between efficiency and reduction quality.
LiteSEER

LiteSEER is a light-weight heuristic-based variant of SEER that makes its replacement decisions solely based on Abstract-plan-costing operations at the corners of the hypercube, and is therefore extremely efficient. In fact, it can be shown that LiteSEER is optimal in the sense that it incurs the minimum work (complexity-wise) required by any reduction algorithm. While it does not guarantee global safety, our experimental results indicate that in practice, its safety and reduction characteristics are quite close to those of SEER and CC-SEER.
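The corner-based decision rule underlying LiteSEER can be sketched as follows; this is a simplified illustration with a hypothetical interface, where cost_at stands in for an Abstract Plan Costing call:

```python
from itertools import product

def corner_safe(cost_at, dims, replaced, replacement, lam):
    """Accept replacing plan `replaced` by plan `replacement` only if,
    at every corner of the d-dimensional selectivity hypercube, the
    replacement's (abstract-plan) cost stays within a factor (1 + lam)
    of the replaced plan's cost. LiteSEER-style: very cheap, but the
    interior of the space is not examined, so global safety is a
    heuristic expectation rather than a guarantee."""
    for corner in product((0.0, 1.0), repeat=dims):
        if cost_at(replacement, corner) > (1 + lam) * cost_at(replaced, corner):
            return False
    return True
```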
Knee Estimator
Currently, no estimators have been designed for CC-SEER, and we intend to address this issue in our future work. As an interim measure, we suggest that users first obtain the estimates from CostGreedy, and then use the same with CC-SEER. A word of caution, however – there may be significant differences between the Reduced-Plan-Cardinality vs. Threshold graphs of CostGreedy and CC-SEER. This is because, on the one hand, there is greater potential for reduction in CC-SEER due to using explicit costs instead of cost bounds. But, on the other hand, the global safety guarantee may prevent some replacements permitted by CostGreedy. Our experience has been that the latter factor dominates in higher-resolution diagrams and therefore the knee of the graph is typically at a larger threshold for CC-SEER as compared to CostGreedy.
The above-mentioned issues for CC-SEER arise with LiteSEER as well, and due to its relaxed safety criterion, LiteSEER's knee will always be lower than that of CC-SEER.
Cost Increase Statistics
The cost increase statistics, mentioned above for the CostGreedy and AreaGreedy algorithms, are not provided for the robust reduction algorithms, CC-SEER and LiteSEER. This is because computing these statistics would require costing (using Abstract-plan-costing) of the replacement plans at all replaced points in the selectivity space, which would be prohibitively expensive. However, at any specific replaced point in the reduced plan diagram, the cost of the replacement plan can be ascertained by using the Shift+Right-click command which provides the QueryInfo message box.
3. Plan Difference Algorithm

The plan difference algorithm visually highlights the difference between a pair of selected plans. The common parts between the two plan trees are shown in white, whereas all differences are colored as mentioned in the Plan Tree Windows documentation, reproduced below:
§ Node labels, Node parameters and Node inputs are identical: White nodes with Black input links
§ Node labels are the same: White Fill
  o Parameters different: Green-bordered nodes with Black input links
  o Left and right inputs swapped: Orange-bordered nodes with Blue input links
  o Left and right inputs different: Orange-bordered nodes with Red input links
§ Node labels are different: Red-bordered nodes filled with the native node color
  o Left and right inputs swapped: Blue input links
  o Left and right inputs different: Red input links
§ Nodes are unmapped (i.e. no corresponding node in the other tree): Nodes filled with the native node color and Black input links
We describe here the node matching process between the two trees, PA and PB. In our description, the term “branch” is used to refer to any connected chain of nodes between a pair of branching nodes, or between a branching node and a leaf, in these trees. Branches are directed from the lower node to the higher node. The matching proceeds as follows:
1. First, all the leaf nodes (relations) and all the nodes with binary inputs (typically join nodes) are identified for PA and PB.

2. A leaf of PA is matched with a leaf of PB if and only if they both have the same relation label. In the situation that there are multiple matches available (that is, if the same relation label appears in multiple leaves), an edit-distance computation is made between the branches of all pairs of matching leaves between PA and PB. The assignments are then made in increasing order of edit-distances.

3. A binary node of PA is matched with a binary node of PB if the set of base relations that are processed is the same. If the node label, node parameters, and the left and right inputs are identical (in terms of base relations), the node is colored white and the input links are black. However, if the node label is different, or the node parameters are different, or if the left and right input relation subsets are different, then the node and the input links are colored as per the above coloring scheme.
4. A minimal edit-distance computation is made between the branches arising out of each pair of matched nodes, and the nodes that have to be added or deleted, if any, in order to make the branches identical are colored as per the above coloring scheme. Unmodified nodes, on the other hand, are matched with their counterparts in the sibling tree and colored either white or white with green border.
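The edit-distance-based leaf assignment of Step 2 can be sketched as below. The data layout is hypothetical (each leaf carries its relation label and its branch as a list of node labels); leaves with equal relation labels are paired greedily in increasing order of branch edit distance:

```python
def edit_distance(a, b):
    """Levenshtein distance between two branches, each given as a
    sequence of node labels (rolling single-row implementation)."""
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # delete from a
                                   d[j - 1] + 1,      # insert into a
                                   prev + (x != y))   # match/substitute
    return d[len(b)]

def match_leaves(leaves_a, leaves_b):
    """leaves_*: dict leaf_id -> (relation_label, branch), where branch
    is the chain of node labels from the leaf up to the next branching
    node. Pairs same-label leaves greedily, in increasing order of
    branch edit distance."""
    pairs = sorted(
        ((edit_distance(ba, bb), la, lb)
         for la, (ra, ba) in leaves_a.items()
         for lb, (rb, bb) in leaves_b.items() if ra == rb),
        key=lambda t: t[0])
    used_a, used_b, matches = set(), set(), {}
    for _, la, lb in pairs:
        if la not in used_a and lb not in used_b:
            matches[la] = lb
            used_a.add(la)
            used_b.add(lb)
    return matches
```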
Note: When plan difference is carried out on an OperatorDiff-based plan diagram, the trees will always include at least one colored node or link (that is, unlike ParameterDiff-based plan diagrams, it is not possible to have pure black-and-white trees with only green node borders).
4. Approximation Algorithms

Plan diagram production can be computationally extremely expensive for higher-dimensional query templates and fine-grained diagram resolutions. For example, a 3D plan diagram with a resolution of 100 on each dimension could well take about a week to produce. To address this problem, diagram approximation algorithms have been incorporated in Picasso 2.0. Specifically, these algorithms produce approximate diagrams by explicitly optimizing only a small subset of points from the query selectivity space and inferring the plans at the remaining points, thereby drastically reducing the computation time. Two categories of errors can arise in the approximation process:
1. Plan Identity Error: This error metric refers to the possibility of the approximation missing out on a subset of the plans present in the exact plan diagram. It is computed as the percentage of plans lost in the approximation process.
2. Plan Location Error: This error metric refers to the possibility of incorrectly assigning plans to query points in the approximate plan diagram. It is computed as the percentage of incorrectly assigned points. This error is also challenging to control since the plan boundaries can be highly non-linear and are sometimes even irregular in shape.
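Given the exact and approximate diagrams as point-to-plan mappings (a hypothetical representation), the two metrics can be computed as in this sketch:

```python
def approximation_errors(exact, approx):
    """exact, approx: dict mapping query point -> plan.
    Returns (identity_error, location_error) as percentages:
    the fraction of exact-diagram plans missing from the approximation,
    and the fraction of points assigned a different plan."""
    exact_plans = set(exact.values())
    missing = exact_plans - set(approx.values())
    identity = 100.0 * len(missing) / len(exact_plans)
    wrong = sum(1 for p in exact if approx.get(p) != exact[p])
    location = 100.0 * wrong / len(exact)
    return identity, location
```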
We take from the user the tolerance levels for these two errors, and our algorithms attempt to ensure that the actual errors in the approximate diagram are in close proximity to these thresholds, while at the same time minimizing the diagram production overheads. The two approximation algorithms currently implemented in Picasso are briefly described below.
Random Sampling with Nearest Neighbor Inference (RS_NN)
This algorithm employs classical random sampling and nearest neighbor classification to generate approximate diagrams. It begins by optimizing a small seed-set of randomly chosen points in the diagram, and then repeatedly optimizes a similarly-sized random group from the remaining un-optimized query points in each successive iteration. The stopping condition for the iteration is when the identity-error is expected, using a statistical estimator, to have reached below the user's tolerance level. After this, the algorithm repeatedly (a) infers the plan assignments for all un-optimized points using nearest neighbor classification, and then (b) explicitly optimizes a subset of these points, until the location-error is estimated to have reached below the user's tolerance level. To reduce the number of misclassifications along plan region boundaries, the plan assignments of the inferred points in the diagram are passed through a low-pass filter in the final step.
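The skeleton of this scheme can be sketched as follows. This is a toy illustration with a hypothetical interface: fixed iteration counts stand in for the statistical stopping estimators, the final low-pass filter is omitted, and optimize stands in for an optimizer invocation.

```python
import random

def rs_nn(points, optimize, seed_frac=0.1, iters=3, rng=None):
    """Toy RS_NN-style approximation: explicitly optimize a few random
    batches of points, then let every remaining point inherit the plan
    of its nearest optimized neighbor.

    points:   list of numeric coordinate tuples
    optimize: function(point) -> plan
    """
    rng = rng or random.Random(0)
    batch = max(1, int(seed_frac * len(points)))
    order = list(points)
    rng.shuffle(order)
    known = {}
    for _ in range(iters):            # fixed count in place of the
        for p in order[:batch]:       # identity/location-error estimators
            known[p] = optimize(p)
        order = order[batch:]

    def nearest(p):                   # nearest-neighbor inference
        return min(known, key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))

    return {p: known[p] if p in known else known[nearest(p)] for p in points}
```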
Grid Partitioning with Parametric Query Optimization Inference (GS_PQO)
This algorithm employs grid partitioning and the basic notions of parametric query optimization, along with plan-tree differencing techniques, to generate approximate diagrams. Unlike RS_NN, heuristics are used to measure the expected quality of the approximation. The entire diagram is partitioned into a coarse grid and the plan richness factor is computed for each of the grid cells. The plan richness is determined by the extent of structural differences between the plans at the corners of the cell. The cells are first arranged in a max-heap structure based on their plan richness factor. Then, the cell at the top of the heap is extracted and repeatedly sub-divided until the plan richness of the resultant cells falls below a certain threshold (this value is derived directly from the user-given error tolerances).
The plan assignment process operates as follows: If any pair of adjacent corner points have different plans, then explicitly optimize the middle point of the edge joining these two corners, otherwise assign the corner plan to the middle point. Then, process the center point of the cell based on the plans lying on the cross-hairs joining the middle points. Break the cell into smaller cells using the newly assigned points and calculate their plan richness factor. Insert them into the heap.
Once all the rectangles in the heap reach a plan richness factor below the desired threshold, the plans are inferred for all the remaining un-optimized points – the middle point of each edge is assigned a plan chosen randomly from the plans at the corners of the edge. The algorithm terminates when the heap is empty.
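The richness-driven refinement loop can be sketched in drastically simplified 2D form. Everything here is hypothetical and approximate: the count of distinct corner plans stands in for the structural plan-richness factor, plain quadrant splits replace the edge-midpoint and cross-hair assignment rules, and a cell budget substitutes for full convergence.

```python
import heapq

def refine_grid(optimize, cell, max_richness=1, max_cells=64):
    """Toy GS_PQO-style refinement: keep a max-heap of cells ordered by
    plan richness, and repeatedly split the richest cell into four
    quadrants until every cell's richness drops to max_richness or the
    cell budget is exhausted.

    optimize: function((x, y)) -> plan
    cell:     ((x0, y0), (x1, y1)) bounding box
    """
    def corners(c):
        (x0, y0), (x1, y1) = c
        return [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]

    def richness(c):
        # Stand-in richness measure: number of distinct corner plans.
        return len({optimize(p) for p in corners(c)})

    heap = [(-richness(cell), cell)]  # negate for a max-heap via heapq
    done = []
    while heap and len(done) + len(heap) < max_cells:
        neg, c = heapq.heappop(heap)
        if -neg <= max_richness:
            done.append(c)
            continue
        (x0, y0), (x1, y1) = c
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        for sub in (((x0, y0), (mx, my)), ((mx, y0), (x1, my)),
                    ((x0, my), (mx, y1)), ((mx, my), (x1, y1))):
            heapq.heappush(heap, (-richness(sub), sub))
    return done + [c for _, c in heap]
```

A uniform plan space is never split, while a cell straddling a plan boundary keeps being refined until the budget or richness threshold is met.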
For detailed descriptions of the approximation algorithms and the associated experimental results, please refer to this technical report.