- Exam Review

Exam Review

May 4, 2021

Exam Format

You will need:

A smartphone or second device with a camera and zoom installed.
A quiet, well lit area.
A view of your work surface, you, and your computer screen.

Day-Of

Remember to enable "Do-not-Disturb" mode on your devices.
Log into Zoom 10-15 min prior to start.
Everyone logged into Zoom will be given a pin to access their exam.
Write your answers on paper in dark, visible ink.
Scan your exam via zoom session in a breakout room.


          SELECT  [DISTINCT] targetlist
          FROM    relationlist
          WHERE   condition

Compute the $2^n$ combinations of tuples in all relations appearing in relationlist
Discard tuples that fail the condition
Delete attributes not in targetlist
If DISTINCT is specified, eliminate duplicate rows

This is the least efficient strategy to compute a query! A good optimizer will find more efficient strategies to compute the same answer.

Record Layouts

Record Layout 1: Fixed

Record Layout 2: Delimiters

Record Layout 2: Headers

File: A collection of pages (or records)
Page: A fixed-size collection of records; Page size is usually dictated by hardware.
Mem Page $\approx$ 4K
Record: One or more fields (for now)
Field: A primitive value (for now)

Relational Algebra

We start with a database instance with a fixed schema

Queries are applied to Relations $$Q(\textbf{Trees}, \textbf{SpeciesInfo})$$

Queries are also Relations! $$Q_2(\textbf{SpeciesInfo}, Q_1(\textbf{Trees}))$$ (Relational Algebra is Closed)

Relational Algebra

Operation	Sym	Meaning
Selection	$\sigma$	Select a subset of the input rows
Projection	$\pi$	Delete unwanted columns
Cross-product	$\times$	Combine two relations
Set-difference	$-$	Tuples in Rel 1, but not Rel 2
Union	$\cup$	Tuples either in Rel 1 or in Rel 2
Intersection	$\cap$	Tuples in both Rel 1 and Rel 2
Join	$\bowtie$	Pairs of tuples matching a specified condition
Division	$/$	"Inverse" of cross-product
Sort	$\tau_A$	Sort records by attribute(s) $A$
Limit	$\texttt{LIMIT}_N$	Return only the first $N$ records (according to sort order if paired with sort).

RA Equivs

We say that $Q_1 \equiv Q_2$ if and only if
we can guarantee that the bag of tuples produced by $Q_1(R, S, T, \ldots)$
is the same as the bag of tuples produced by $Q_2(R, S, T, \ldots)$
for any combination of valid inputs $R, S, T, \ldots$.

... that satisfy any necessary properties.

Rule	Notes
$\sigma_{C_1\wedge C_2}(R) \equiv \sigma_{C_1}(\sigma_{C_2}(R))$
$\sigma_{C_1\vee C_2}(R) \equiv \sigma_{C_1}(R) \cup \sigma_{C_2}(R)$	Note, this is only true for set, not bag union
$\sigma_C(R \times S) \equiv R \bowtie_C S$
$\sigma_C(R \times S) \equiv \sigma_C(R) \times S$	If $C$ references only $R$'s attributes, also works for joins
$\pi_{A}(\pi_{A \cup B}(R)) \equiv \pi_{A}(R)$
$\sigma_C(\pi_{A}(R)) \equiv \pi_A(\sigma_C(R))$	If $A$ contains all of the attributes referenced by $C$
$\pi_{A\cup B}(R\times S) \equiv \pi_A(R) \times \pi_B(S)$	Where $A$ (resp., $B$) contains attributes in $R$ (resp., $S$)
$R \times (S \times T) \equiv (R \times S) \times T$	Also works for joins
$R \times S \equiv S \times R$	Also works for joins
$R \cup (S \cup T) \equiv (R \cup S) \cup T$	Also works for intersection and bag-union
$R \cup S \equiv S \cup R$	Also works for intersections and bag-union
$\sigma_{C}(R \cup S) \equiv \sigma_{C}(R) \cup \sigma_{C}(S)$	Also works for intersections and bag-union
$\pi_{A}(R \cup S) \equiv \pi_{A}(R) \cup \pi_{A}(S)$	Also works for intersections and bag-union
$\sigma_{C}(\gamma_{A, AGG}(R)) \equiv \gamma_{A, AGG}(\sigma_{C}(R))$	If $A$ contains all of the attributes referenced by $C$

Algorithms

"Volcano" Operators (Iterators): Operators "pull" tuples, one-at-a-time, from their children.
2-Pass (External) Sort: Create sorted runs, then repeatedly merge runs
Join Algorithms: Quickly picking out specific pairs of tuples.
Aggregation Algorithms: In-Memory vs 2-Pass, Normal vs Group-By

Nested-Loop Join

Block-Nested Loop Join

Strategies for Implementing $R \bowtie_{R.A = S.A} S$

Sort/Merge Join: Sort all of the data upfront, then scan over both sides.
In-Memory Index Join (1-pass Hash; Hash Join): Build an in-memory index on one table, scan the other.
Partition Join (2-pass Hash; External Hash Join): Partition both sides so that tuples don't join across partitions.

Sort/Merge Join

Sort/Merge typically expressed as 3 operators
(2xSort + Merge)

1-Pass Hash Join

2-Pass Hash Join

Limited Queries: Only supports join conditions of the form $R.A = S.B$
Low Memory: Never need more than 1 pair of partitions in memory
High IO Cost: Every record gets written out to disk, and back in.

Can partition on data-values to support other types of queries.

Index Nested Loop Join

To compute $R \bowtie_{R.A < S.B} S$ with an index on $S.B$

Read One Row of $R$
Get the value of $R.A$
Start index scan on $S.B > [R.A]$
Return rows as normal

Basic Aggregate Pattern

Init: Define a starting value for the accumulator
Fold(Accum, New): Merge a new value into the accumulator
Finalize(Accum): Extract the aggregate from the accumulator.

Basic Aggregate Types

Grey et. al. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals

Distributive: Finite-sized accumulator and doesn't need a finalize (COUNT, SUM)
Algebraic: Finite-sized accumulator but needs a finalize (AVG)
Holistic: Unbounded accumulator (MEDIAN)

Grouping Algorithms

2-pass Hash Aggregate: Like 2-pass Hash Join: Distribute groups across buckets, then do an in-memory aggregate for each bucket.
Sort-Aggregate: Like Sort-Merge Join: Sort data by groups, then group elements will be adjacent.

Cost-Based Optimization

Accounting

Figure out the cost of each individual operator.

Only count the number of IOs added by each operator.

Operation	RA	Total IOs (#pages)	Memory (#tuples)
Table Scan	$R$	$\frac{\|R\|}{\mathcal P}$	$O(1)$
Projection	$\pi(R)$	$\textbf{io}(R)$	$O(1)$
Selection	$\sigma(R)$	$\textbf{io}(R)$	$O(1)$
Union	$R \uplus S$	$\textbf{io}(R) + \textbf{io}(S)$	$O(1)$
Sort (In-Mem)	$\tau(R)$	$\textbf{io}(R)$	$O(\|R\|)$
Sort (On-Disk)	$\tau(R)$	$\frac{2 \cdot \lfloor log_{\mathcal B}(\|R\|) \rfloor}{\mathcal P} + \textbf{io}(R)$	$O(\mathcal B)$
(B+Tree) Index Scan	$Index(R, c)$	$\log_{\mathcal I}(\|R\|) + \frac{\|\sigma_c(R)\|}{\mathcal P}$	$O(1)$
(Hash) Index Scan	$Index(R, c)$	$1$	$O(1)$

Tuples per Page ($\mathcal P$) – Normally defined per-schema
Size of $R$ ($|R|$)
Pages of Buffer ($\mathcal B$)
Keys per Index Page ($\mathcal I$)

Operation	RA	Total IOs (#pages)	Mem (#tuples)
Nested Loop Join (Buffer $S$ in mem)	$R \times_{mem} S$	$\textbf{io}(R)+\textbf{io}(S)$	$O(\|S\|)$
Block NLJ (Buffer $S$ on disk)	$R \times_{disk} S$	$\frac{\|R\|}{\mathcal B} \cdot \frac{\|S\|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$	$O(1)$
Block NLJ (Recompute $S$)	$R \times_{redo} S$	$\textbf{io}(R) + \frac{\|R\|}{\mathcal B} \cdot \textbf{io}(S)$	$O(1)$
1-Pass Hash Join	$R \bowtie_{1PH, c} S$	$\textbf{io}(R) + \textbf{io}(S)$	$O(\|S\|)$
2-Pass Hash Join	$R \bowtie_{2PH, c} S$	$\frac{2\|R\| + 2\|S\|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$	$O(1)$
Sort-Merge Join	$R \bowtie_{SM, c} S$	[Sort]	[Sort]
(Tree) Index NLJ	$R \bowtie_{INL, c}$	$\|R\| \cdot (\log_{\mathcal I}(\|S\|) + \frac{\|\sigma_c(S)\|}{\mathcal P})$	$O(1)$
(Hash) Index NLJ	$R \bowtie_{INL, c}$	$\|R\| \cdot 1$	$O(1)$
(In-Mem) Aggregate	$\gamma_A(R)$	$\textbf{io}(R)$	$adom(A)$
(Sort/Merge) Aggregate	$\gamma_A(R)$	[Sort]	[Sort]

Tuples per Page ($\mathcal P$) – Normally defined per-schema
Size of $R$ ($|R|$)
Pages of Buffer ($\mathcal B$)
Keys per Index Page ($\mathcal I$)
Number of distinct values of $A$ ($adom(A)$)

Estimating IOs requires Estimating $|Q(R)|$

Operator	RA	Estimated Size
Table	$R$	$\|R\|$
Projection	$\pi(Q)$	$\|Q\|$
Union	$Q_1 \uplus Q_2$	$\|Q_1\| + \|Q_2\|$
Cross Product	$Q_1 \times Q_2$	$\|Q_1\| \times \|Q_2\|$
Sort	$\tau(Q)$	$\|Q\|$
Limit	$\texttt{LIMIT}_N(Q)$	$N$
Selection	$\sigma_c(Q)$	$\|Q\| \times \texttt{SEL}(c, Q)$
Join	$Q_1 \bowtie_c Q_2$	$\|Q_1\| \times \|Q_2\| \times \texttt{SEL}(c, Q_1\times Q_2)$
Distinct	$\delta_A(Q)$	$\texttt{UNIQ}(A, Q)$
Aggregate	$\gamma_{A, B \leftarrow \Sigma}(Q)$	$\texttt{UNIQ}(A, Q)$

$\texttt{SEL}(c, Q)$: Selectivity of $c$ on $Q$, or $\frac{|\sigma_c(Q)|}{|Q|}$
$\texttt{UNIQ}(A, Q)$: # of distinct values of $A$ in $Q$.

(Some) Estimation Techniques

Guess Randomly: Rules of thumb if you have no other options...
Uniform Prior: Use basic statistics to make a very rough guess.
Sampling / History: Small, Quick Sampling Runs (or prior executions of the query).
Histograms: Using more detailed statistics for improved guesses.
Constraints: Using rules about the data for improved guesses.

Sketching

Flips	Score	Probability	E[# Games]
(👽)	0	0.5	2
(🐕)(👽)	1	0.25	4
(🐕)(🐕)(👽)	2	0.125	8
(🐕)$\times N$ (👽)	$N$	$\frac{1}{2^{N+1}}$	$2^{N+1}$

If I told you that in a series of games, my best score was $N$, you might expect that I played $2^{N+1}$ games.

To do that, I only need to track my top score!

Flajolet-Martin Sketches

($\approx$ HyperLogLog)

For each record...
1. Hash each record
2. Find the index of the lowest-order non-zero bit
3. Add the index of the bit to a set
Find $R$, the lowest index not in the set
Estimate Count-Distinct as $\frac{2^R}{\phi}$ ($\phi \approx 0.77351$)
Repeat (in parallel) as needed

Count Sketches

Pick a number of "trials" and a number of "bins"
For each record $O_i$
1. For each "trial" $j$
  1. Use a hash function $h_j(O_i)$ to pick a bin
  2. Add a $\pm 1$ value determined by hash function $\delta_j(O_i)$ to the bin

For each trial $j$, estimate the count of $O_i$ by the value of bin $h_j(O_i)$

Take the median value for all trials.

Count-Min Sketches

Pick a number of "trials" and a number of "bins"
For each record $O_i$
1. For each "trial" $j$
  1. Use a hash function $h_j(O_i)$ to pick a bin
  2. Add 1 to the bin

For each trial $j$, estimate the count of $O_i$ by the value of bin $h_j(O_i)$

Take the minimum value for all trials.

Flajolet-Martin Sketches (HyperLogLog): Estimating Count-Distinct
Count Sketches: Estimating Count-GroupBy
(roughly uniform counts)
Count-Min Sketches: Estimating Count-GroupBy
(small number of heavy hitters)