CSE-4/562 Spring 2019

Cost-Based Optimization

February 25, 2019

Textbook: Ch. 16

General Query Optimizers

Apply blind heuristics (e.g., push down selections)
Enumerate all possible execution plans by varying (or for a reasonable subset)
- Join/Union Evaluation Order (commutativity, associativity, distributivity)
- Algorithms for Joins, Aggregates, Sort, Distinct, and others
- Data Access Paths
Estimate the cost of each execution plan
Pick the execution plan with the lowest cost

Idea 1: Run each plan

If we can't get the exact cost of a plan, what can we do?

Idea 2: Run each plan on a small sample of the data.

Idea 3: Analytically estimate the cost of a plan.

Plan Cost

CPU Time: How much time is spent processing.
# of IOs: How many random reads + writes go to disk.
Memory Required: How much memory do you need.

Randal Munroe (cc-by-nc)

Remember the Real Goals

Accurately rank the plans.
Don't spend more time optimizing than you get back.
Don't pick a plan that uses more memory than you have.

Accounting

Figure out the cost of each individual operator.

Only count the number of IOs added by each operator.

Operation	RA	IOs Added (#pages)	Memory (#tuples)
Table Scan	$R$	$\frac{\|R\|}{\mathcal P}$	$O(1)$
Projection	$\pi(R)$	$0$	$O(1)$
Selection	$\sigma(R)$	$0$	$O(1)$
Union	$R \uplus S$	$0$	$O(1)$
Sort (In-Mem)	$\tau(R)$	$0$	$O(\|R\|)$
Sort (On-Disk)	$\tau(R)$	$\frac{2 \cdot \lfloor log_{\mathcal B}(\|R\|) \rfloor}{\mathcal P}$	$O(\mathcal B)$
(B+Tree) Index Scan	$Index(R, c)$	$\log_{\mathcal I}(\|R\|) + \frac{\|\sigma_c(R)\|}{\mathcal P}$	$O(1)$
(Hash) Index Scan	$Index(R, c)$	$1$	$O(1)$

Tuples per Page ($\mathcal P$) – Normally defined per-schema
Size of $R$ ($|R|$)
Pages of Buffer ($\mathcal B$)
Keys per Index Page ($\mathcal I$)

Operation	RA	IOs Added (#pages)	Memory (#tuples)
Nested Loop Join (Buffer $S$ in mem)	$R \times S$	$0$	$O(\|S\|)$
Nested Loop Join (Buffer $S$ on disk)	$R \times_{disk} S$	$(1+ \|R\|) \cdot \frac{\|S\|}{\mathcal P}$	$O(1)$
1-Pass Hash Join	$R \bowtie_{1PH, c} S$	$0$	$O(\|S\|)$
2-Pass Hash Join	$R \bowtie_{2PH, c} S$	$\frac{2\|R\| + 2\|S\|}{\mathcal P}$	$O(1)$
Sort-Merge Join	$R \bowtie_{SM, c} S$	[Sort]	[Sort]
(Tree) Index NLJ	$R \bowtie_{INL, c}$	$\|R\| \cdot (\log_{\mathcal I}(\|S\|) + \frac{\|\sigma_c(S)\|}{\mathcal P})$	$O(1)$
(Hash) Index NLJ	$R \bowtie_{INL, c}$	$\|R\| \cdot 1$	$O(1)$
(In-Mem) Aggregate	$\gamma_A(R)$	$0$	$adom(A)$
(Sort/Merge) Aggregate	$\gamma_A(R)$	[Sort]	[Sort]

Tuples per Page ($\mathcal P$) – Normally defined per-schema
Size of $R$ ($|R|$)
Pages of Buffer ($\mathcal B$)
Keys per Index Page ($\mathcal I$)
Number of distinct values of $A$ ($adom(A)$)

Symbol	Parameter	Type
$\mathcal P$	Tuples Per Page	Fixed ($\frac{\|\text{page}\|}{\|\text{tuple}\|}$)
$\|R\|$	Size of $R$	Precomputed$^*$ ($\|R\|$)
$\mathcal B$	Pages of Buffer	Configurable Parameter
$\mathcal I$	Keys per Index Page	Fixed ($\frac{\|\text{page}\|}{\|\text{key+pointer}\|}$)
$adom(A)$	Number of distinct values of $A$	Precomputed$^*$ ($\|\delta_A(R)\|$)

* unless $R$ is a query

Estimating IOs requires Estimating $|Q(R)|$, $|\delta_A(Q(R))|$

Cardinality Estimation

Unlike estimating IOs, cardinality estimation doesn't care about the algorithm, so we'll just be working with raw RA.

Also unlike estimating IOs, we care about the cardinality of $|Q(R)|$ as a whole, rather than the contribution of each individual operator.

Operator	RA	Estimated Size
Table	$R$	$\|R\|$
Projection	$\pi(Q)$	$\|Q\|$
Union	$Q_1 \uplus Q_2$	$\|Q_1\| + \|Q_2\|$
Cross Product	$Q_1 \times Q_2$	$\|Q_1\| \times \|Q_2\|$
Sort	$\tau(Q)$	$\|Q\|$
Limit	$\texttt{LIMIT}_N(Q)$	$N$
Selection	$\sigma_c(Q)$	$\|Q\| \times \texttt{SEL}(c, Q)$
Join	$Q_1 \bowtie_c Q_2$	$\|Q_1\| \times \|Q_2\| \times \texttt{SEL}(c, Q_1\times Q_2)$
Distinct	$\delta_A(Q)$	$\texttt{UNIQ}(A, Q)$
Aggregate	$\gamma_{A, B \leftarrow \Sigma}(Q)$	$\texttt{UNIQ}(A, Q)$

$\texttt{SEL}(c, Q)$: Selectivity of $c$ on $Q$, or $\frac{|\sigma_c(Q)|}{|Q|}$
$\texttt{UNIQ}(A, Q)$: # of distinct values of $A$ in $Q$.

Cardinality Estimation

(The Hard Parts)

$\sigma_c(Q)$ (Cardinality Estimation): How many tuples will a condition $c$ allow to pass?
$\delta_A(Q)$ (Distinct Values Estimation): How many distinct values of attribute(s) $A$ exist?

Idea 1: Assume each selection filters down to 10% of the data.

no... really!

... there are problems

Inconsistent estimation

$|\sigma_{c_1}(\sigma_{c_2}(R))| \neq |\sigma_{c_1 \wedge c_2}(R)|$

Too consistent estimation

$|\sigma_{id = 1}(\texttt{STUDENTS})| = |\sigma_{residence = 'NY'}(\texttt{STUDENTS})|$

... but remember that all we need is to rank plans.

Many major databases (Oracle, Postgres, Teradata, etc...) use something like 10% rule if they have nothing better.

(The specific % varies by DBMS. E.g., Teradata uses 10% for the first AND clause, and 75% for every subsequent clause)

(Some) Estimation Techniques

The 10% rule: Rules of thumb if you have no other options...
Uniform Prior: Use basic statistics to make a very rough guess.
Sampling / History: Small, Quick Sampling Runs (or prior executions of the query).
Histograms: Using more detailed statistics for improved guesses.
Constraints: Using rules about the data for improved guesses.

Uniform Prior

We assume that for $\sigma_c(Q)$ or $\delta_A(Q)$...

Basic statistics are known about $Q$:
- COUNT(*)
- COUNT(DISTINCT A) (for each A)
- MIN(A), MAX(A) (for each numeric A)
Attribute values are uniformly distributed.
No inter-attribute correlations.

If necessary statistics aren't available (point 1), fall back to the 10% rule.

If statistical assumptions (points 2, 3) aren't perfectly true, we'll still likely be getting a better estimate than the 10% rule.

COUNT(DISTINCT A)

$\texttt{UNIQ}(A, \pi_{A, \ldots}(R)) = \texttt{UNIQ}(A, R)$

$\texttt{UNIQ}(A, \sigma(R)) \approx \texttt{UNIQ}(A, R)$

$\texttt{UNIQ}(A, R \times S) = \texttt{UNIQ}(A, R)$ or $\texttt{UNIQ}(A, S)$

$$max(\texttt{UNIQ}(A, R), \texttt{UNIQ}(A, S)) \leq\\ \texttt{UNIQ}(A, R \uplus S)\\ \leq \texttt{UNIQ}(A, R) + \texttt{UNIQ}(A, S)$$

MIN(A), MAX(A)

$min_A(\pi_{A, \ldots}(R)) = min_A(R)$

$min_A(\sigma_{A, \ldots}(R)) \approx min_A(R)$

$min_A(R \times S) = min_A(R)$ or $min_A(S)$

$min_A(R \uplus S) = min(min_A(R), min_A(S))$

Estimating $\delta_A(Q)$ requires only COUNT(DISTINCT A)

Estimating Selectivity

Selectivity is a probability ($\texttt{SEL}(c, Q) = P(c)$)

$P(A = x_1)$	$=$	$\frac{1}{\texttt{COUNT(DISTINCT A)}}$
$P(A \in (x_1, x_2, \ldots, x_N))$	$=$	$\frac{N}{\texttt{COUNT(DISTINCT A)}}$
$P(A \leq x_1)$	$=$	$\frac{x_1 - \texttt{MIN(A)}}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
$P(x_1 \leq A \leq x_2)$	$=$	$\frac{x_2 - x_1}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
$P(A = B)$	$=$	$\textbf{min}\left( \frac{1}{\texttt{COUNT(DISTINCT A)}}, \frac{1}{\texttt{COUNT(DISTINCT B)}} \right)$
$P(c_1 \wedge c_2)$	$=$	$P(c_1) \cdot P(c_2)$
$P(c_1 \vee c_2)$	$=$	$1 - (1 - P(c_1)) \cdot (1 - P(c_2))$

(With constants $x_1$, $x_2$, ...)

Limitations

Don't always have statistics for $Q$: For example, $\pi_{A \leftarrow (B \cdot C)}(R)$
Don't always have clear rules for $c$: For example, $\sigma_{\texttt{FitsModel}(A, B, C)}(R)$
Attribute values are not always uniformly distributed.: For example, $|\sigma_{SPC\_COMMON = 'pin\ oak'}(T)|$ vs $|\sigma_{SPC\_COMMON = 'honeylocust'}(T)|$
Attribute values are sometimes correlated.: For example, $\sigma_{(stump < 5) \wedge (diam > 3)}(T)$

...but handles most usage patterns

... next class more!