Data Sketching

March 25, 2021

Garcia-Molina/Ullman/Widom: (readings only)

SELECT COUNT(DISTINCT A) FROM R
SELECT A, COUNT(*) FROM R GROUP BY A
SELECT A, COUNT(*) ... ORDER BY COUNT(*) DESC LIMIT 10

These are all "Holistic" aggregates ($O(|A|)$ memory). What happens when you run out of memory?
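As a baseline for comparison, here is a minimal sketch of the exact answers to the three queries above. It assumes the column `A` fits in a Python list; the function name `exact_aggregates` and the sample values are hypothetical. It keeps one counter per distinct value, which is the $O(|A|)$ memory cost that sketching tries to avoid.

```python
from collections import Counter

def exact_aggregates(values, k=10):
    """Exact answers; memory grows with the number of distinct values."""
    counts = Counter(values)        # one counter per distinct value of A
    count_distinct = len(counts)    # SELECT COUNT(DISTINCT A) FROM R
    group_by = dict(counts)         # SELECT A, COUNT(*) FROM R GROUP BY A
    top_k = counts.most_common(k)   # ... ORDER BY COUNT(*) DESC LIMIT k
    return count_distinct, group_by, top_k

# Hypothetical sample data, not taken from the slides.
distinct, groups, top = exact_aggregates(["a", "b", "a", "c", "a", "b"], k=2)
print(distinct, groups, top)
```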

Sketching: Hash function tricks used to estimate useful statistical properties.

Flajolet-Martin Sketches (HyperLogLog): Estimating Count-Distinct
Count Sketches: Estimating Count-GroupBy
Count-Min Sketches: Estimating Count-GroupBy-TopK

Count-Distinct

$3$ $5$ $4$ $4$ $2$ $4$ $3$ $\ldots$

$3$ $5$ $4$ $2$ $\ldots$

Challenge: To avoid double counting, we need to track which values of $A$ we've seen. $O(|A|)$ memory required.
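The challenge can be illustrated with a small sketch of exact Count-Distinct over the visible prefix of the stream above. The function name `count_distinct_exact` is hypothetical. The `seen` set is the part that grows to $O(|A|)$ entries.

```python
def count_distinct_exact(stream):
    """Exact Count-Distinct: remember every value seen so far."""
    seen = set()
    for value in stream:
        seen.add(value)  # adding a duplicate leaves the set unchanged
    return len(seen), seen

# Visible prefix of the example stream (the slide's stream continues).
n, seen = count_distinct_exact([3, 5, 4, 4, 2, 4, 3])
print(n, sorted(seen))  # 4 [2, 3, 4, 5]
```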

A brief digression

The Coin Flip Game

Start with 0 points and flip a coin

Tails (🐕): Get a point and flip again.
Heads (👽): Game over.

Flips	Score
(👽)	0
(🐕) (👽)	1
(🐕) (🐕) (🐕) (🐕) (🐕) (👽)	5

Flips	Score	Probability	E[# Games]
(👽)	0	0.5	2
(🐕)(👽)	1	0.25	4
(🐕)(🐕)(👽)	2	0.125	8
(🐕)$\times N$ (👽)	$N$	$\frac{1}{2^{N+1}}$	$2^{N+1}$

If I told you that in a series of games, my best score was $N$, you might expect that I played $2^{N+1}$ games.

To do that, I only need to track my top score!
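The game is easy to simulate. The following is a rough sketch, assuming a fair coin modeled by `random.Random`; the function names and the seed are illustrative choices, not part of the slides. It only remembers the best score, then applies the $2^{N+1}$ guess from the table.

```python
import random
from fractions import Fraction

def play_game(rng):
    """Flip until heads; the score is the number of tails before it."""
    score = 0
    while rng.random() < 0.5:  # tails: earn a point and flip again
        score += 1
    return score               # heads: game over

def probability_of_score(n):
    """n tails followed by one heads: (1/2)^(n+1)."""
    return Fraction(1, 2 ** (n + 1))

def guess_games_played(best_score):
    """A best score of N suggests roughly 2^(N+1) games."""
    return 2 ** (best_score + 1)

rng = random.Random(0)  # arbitrary seed so that runs are repeatable
best = max(play_game(rng) for _ in range(1000))
print(best, guess_games_played(best))
```

A single run can land far from the true 1000 games, which is the noise problem the later slides address.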

Idea: Simulate coin flips with a hash function

... take the index of the lowest-order nonzero bit

Object	Hash Bits	Score
$O_1$	01011011	0
$O_2$	00110111	0
$O_3$	00111000	3
$O_4$	10010010	1
$O_3$	00111000	3
Top score		3

Estimate: $2^{3+1} = 16$

Duplicates can't raise the top score!
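The table above can be reproduced with a short sketch. The hash values are copied from the table rather than computed by a real hash function, and the names `score` and `hashes` are hypothetical. Raising an error on an all-zero hash is just one possible convention.

```python
def score(hash_bits):
    """Index of the lowest-order nonzero bit (number of trailing zeros)."""
    if hash_bits == 0:
        raise ValueError("an all-zero hash has no nonzero bit")
    # x & -x isolates the lowest set bit; bit_length gives its position + 1.
    return (hash_bits & -hash_bits).bit_length() - 1

# Hash bits copied from the table above.
hashes = {"O1": 0b01011011, "O2": 0b00110111, "O3": 0b00111000, "O4": 0b10010010}
stream = ["O1", "O2", "O3", "O4", "O3"]  # O3 appears twice

top_score = max(score(hashes[obj]) for obj in stream)
estimate = 2 ** (top_score + 1)
print(top_score, estimate)  # 3 16
```

Because an object always hashes to the same bits, the second $O_3$ produces the same score as the first and cannot change the maximum.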

Problem: Noisy estimate!

Idea 1: Instead of your top score, track the lowest score you have not gotten yet ($R$).

Object	Hash Bits	Score
$O_1$	01011011	0
$O_2$	00110111	0
$O_3$	00111000	3
$O_4$	10010010	1
$O_3$	00111000	3
Scores seen		$\{0, 1, 3\}$, so $R = 2$

Estimate: $\frac{2^R}{\phi} = \frac{2^{2}}{0.77351} \approx 5.2$
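The arithmetic for Idea 1 can be checked with a few lines. This is only a sketch: the helper name `lowest_missing` is hypothetical, and the set of scores is taken from the table above.

```python
PHI = 0.77351  # correction factor quoted above

def lowest_missing(scores):
    """Smallest non-negative integer that is not in the set of scores."""
    r = 0
    while r in scores:
        r += 1
    return r

seen_scores = {0, 1, 3}           # scores observed in the table
R = lowest_missing(seen_scores)   # 2 is the first score never seen
estimate = 2 ** R / PHI
print(R, round(estimate, 1))  # 2 5.2
```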

Idea 2: Compute several estimates in parallel and average estimates.

Flajolet-Martin Sketches

($\approx$ HyperLogLog)

For each record...
1. Hash the record
2. Find the index of the lowest-order non-zero bit
3. Add the index of the bit to a set
After the last record...
4. Find $R$, the lowest index not in the set
5. Estimate Count-Distinct as $\frac{2^R}{\phi}$ ($\phi \approx 0.77351$)
Repeat (in parallel) with different hash functions and average the estimates
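The steps above can be put together as follows. This is a minimal sketch under several assumptions: SHA-256 from `hashlib` with a seed prefix stands in for a family of hash functions, and the class name, `num_estimators`, and the plain average are illustrative choices rather than the reference algorithm.

```python
import hashlib

PHI = 0.77351

def hash_with_seed(value, seed):
    """64-bit hash of value; each seed acts as a different hash function."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def lowest_nonzero_bit(x):
    """Index of the lowest-order nonzero bit; 64 if all 64 bits are zero."""
    return (x & -x).bit_length() - 1 if x else 64

class FlajoletMartin:
    def __init__(self, num_estimators=32):
        # One small set of bit indices per parallel estimator.
        self.bit_sets = [set() for _ in range(num_estimators)]

    def add(self, value):
        for seed, bits in enumerate(self.bit_sets):
            bits.add(lowest_nonzero_bit(hash_with_seed(value, seed)))

    def estimate(self):
        estimates = []
        for bits in self.bit_sets:
            r = 0
            while r in bits:  # R = lowest index not in the set
                r += 1
            estimates.append(2 ** r / PHI)
        return sum(estimates) / len(estimates)

sketch = FlajoletMartin()
for value in [3, 5, 4, 4, 2, 4, 3]:
    sketch.add(value)
print(sketch.estimate())
```

Each estimator stores at most 65 small integers no matter how many records arrive, and re-adding a value never changes the state.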

Group-By Count

Problem: Need a counter for each individual A

Idea: Keep only one counter!

No... seriously

$$\delta(O_i) = \begin{cases} -1 & \textbf{if } h(O_i) \equiv 0 \pmod 2 \\ +1 & \textbf{if } h(O_i) \equiv 1 \pmod 2 \end{cases}$$

$$\sum_i \delta(O_i)$$

Object	$\delta(O_i)$	Running Count
$O_3$	-1	-1
$O_1$	+1	0
$O_4$	-1	-1
$O_2$	+1	0
$O_4$	-1	-1
$O_1$	+1	0
$O_3$	-1	-1
$O_3$	-1	-2
$O_1$	+1	-1

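$$Total = \texttt{COUNT_OF}(O_i) \cdot \delta(O_i) + \sum_{j \neq i}\texttt{COUNT_OF}(O_j) \cdot \delta(O_j)$$

Each $\delta(O_j)$ is $+1$ or $-1$ with equal probability, so the contribution of the other objects cancels in expectation:

$$E\left[\sum_{j \neq i}\texttt{COUNT_OF}(O_j) \cdot \delta(O_j)\right] = \frac{1}{2}\sum_{j \neq i} \texttt{COUNT_OF}(O_j) - \frac{1}{2}\sum_{j \neq i} \texttt{COUNT_OF}(O_j) = 0$$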

$$Total \approx \texttt{COUNT_OF}(O_i) \cdot \delta(O_i) + 0$$

Running total was $-1$. Since $\delta(O_i)^2 = 1$, the estimate for $O_i$ is $Total \cdot \delta(O_i)$:

Object	$\delta(O_i)$	Estimate
$O_1$	+1	-1
$O_2$	+1	-1
$O_3$	-1	+1
$O_4$	-1	+1
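The single-counter experiment above can be replayed as a short sketch. The signs are copied from the table instead of being derived from a real hash function, and the names `delta`, `run_counter`, and `estimate` are hypothetical.

```python
# Signs copied from the table; a real sketch would derive them from h(O) mod 2.
delta = {"O1": +1, "O2": +1, "O3": -1, "O4": -1}
stream = ["O3", "O1", "O4", "O2", "O4", "O1", "O3", "O3", "O1"]

def run_counter(stream, delta):
    """Maintain the single shared counter: the sum of delta over the stream."""
    total = 0
    for obj in stream:
        total += delta[obj]
    return total

def estimate(total, obj, delta):
    """Estimated count of obj: Total * delta(obj)."""
    return total * delta[obj]

total = run_counter(stream, delta)
print(total)  # -1
print({obj: estimate(total, obj, delta) for obj in sorted(delta)})
# {'O1': -1, 'O2': -1, 'O3': 1, 'O4': 1}
```

The true counts in this stream are 3, 1, 3, and 2, so the estimates are far off, as the next slides point out.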

Not... so... great

Problem 1: All of the objects use the same counter (no way to differentiate an estimate for $O_1$ from $O_2$).

Problem 2: The estimate is really noisy

Idea 1: Multiple Buckets ($h(x)$ picks a bucket)

Idea 2: Multiple Trials ($h \rightarrow h_1, h_2, \ldots$; $\delta \rightarrow \delta_1, \delta_2, \ldots$)

Object	$h_1(O_i)$	$\delta_1(O_i)$	$h_2(O_i)$	$\delta_2(O_i)$
$O_1$	Bucket 1	-1	Bucket 2	1
$O_2$	Bucket 1	-1	Bucket 1	-1
$O_3$	Bucket 2	1	Bucket 1	-1
$O_4$	Bucket 1	-1	Bucket 1	1

Objects Seen: (none)

	Bucket 1	Bucket 2
Trial 1	0	0
Trial 2	0	0

Object	Trial 1	Trial 2	Estimate	Real
$O_1$	0	0	0.0	0
$O_2$	0	0	0.0	0
$O_3$	0	0	0.0	0
$O_4$	0	0	0.0	0

Objects Seen: $O_2$

	Bucket 1	Bucket 2
Trial 1	-1	0
Trial 2	-1	0

Object	Trial 1	Trial 2	Estimate	Real
$O_1$	1	0	0.5	0
$O_2$	1	1	1.0	1
$O_3$	0	1	0.5	0
$O_4$	1	-1	0.0	0

Objects Seen: $O_2,O_1$

	Bucket 1	Bucket 2
Trial 1	-2	0
Trial 2	-1	1

Object	Trial 1	Trial 2	Estimate	Real
$O_1$	2	1	1.5	1
$O_2$	2	1	1.5	1
$O_3$	0	1	0.5	0
$O_4$	2	-1	0.5	0

Objects Seen: $O_2,O_1,O_4$

	Bucket 1	Bucket 2
Trial 1	-3	0
Trial 2	0	1

Object	Trial 1	Trial 2	Estimate	Real
$O_1$	3	1	2.0	1
$O_2$	3	0	1.5	1
$O_3$	0	0	0.0	0
$O_4$	3	0	1.5	1

Objects Seen: $O_2,O_1,O_4,O_1$

	Bucket 1	Bucket 2
Trial 1	-4	0
Trial 2	0	2

Object	Trial 1	Trial 2	Estimate	Real
$O_1$	4	2	3.0	2
$O_2$	4	0	2.0	1
$O_3$	0	0	0.0	0
$O_4$	4	0	2.0	1

Objects Seen: $O_2,O_1,O_4,O_1,O_2$

	Bucket 1	Bucket 2
Trial 1	-5	0
Trial 2	-1	2

Object	Trial 1	Trial 2	Estimate	Real
$O_1$	5	2	3.5	2
$O_2$	5	1	3.0	2
$O_3$	0	1	0.5	0
$O_4$	5	-1	2.0	1

In practice, use Median and not Mean to combine trials
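The walkthrough above can be replayed with a small Count Sketch. This is a sketch under stated assumptions: the bucket and sign assignments are copied from the table rather than computed by hash functions, buckets are 0-indexed, and the class and parameter names are hypothetical. The `combine` argument defaults to the average used in the tables and also accepts `median`.

```python
from statistics import mean, median

# (bucket, sign) per object for each trial, copied from the table above.
TRIALS = [
    {"O1": (0, -1), "O2": (0, -1), "O3": (1, +1), "O4": (0, -1)},  # h_1, delta_1
    {"O1": (1, +1), "O2": (0, -1), "O3": (0, -1), "O4": (0, +1)},  # h_2, delta_2
]

class CountSketch:
    def __init__(self, trials, num_buckets):
        self.trials = trials
        self.counters = [[0] * num_buckets for _ in trials]

    def add(self, obj):
        # In every trial, add the object's sign to its bucket.
        for row, trial in zip(self.counters, self.trials):
            bucket, sign = trial[obj]
            row[bucket] += sign

    def per_trial(self, obj):
        # Each trial's estimate is (bucket counter) * (object's sign).
        return [row[trial[obj][0]] * trial[obj][1]
                for row, trial in zip(self.counters, self.trials)]

    def estimate(self, obj, combine=mean):
        return combine(self.per_trial(obj))

sketch = CountSketch(TRIALS, num_buckets=2)
for obj in ["O2", "O1", "O4", "O1", "O2"]:
    sketch.add(obj)
print(sketch.counters)  # [[-5, 0], [-1, 2]]
print({obj: sketch.estimate(obj) for obj in ["O1", "O2", "O3", "O4"]})
```

With only two trials the median and the average coincide; the difference shows up once there are three or more trials and one of them collides with a heavy object.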

Top-K Group-By Count

Problem: "Heavy Hitters" overwhelm smaller counts

Idea: Give up. Drop $\delta$.

Count-Min Sketch

Object	Appearances	$h_1(O_i)$	$h_2(O_i)$
$O_1$	10	Bucket 1	Bucket 2
$O_2$	32	Bucket 1	Bucket 1
$O_3$	1002	Bucket 2	Bucket 1
$O_4$	500	Bucket 1	Bucket 1

	Bucket 1	Bucket 2
Trial 1	542	1002
Trial 2	1534	10

Object	Appearances	Estimate 1	Estimate 2	Min
$O_1$	10	542	10	10
$O_2$	32	542	1534	542
$O_3$	1002	1002	1534	1002
$O_4$	500	542	1534	542
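The Count-Min tables above can be reproduced with the following sketch. The bucket assignments and appearance counts are copied from the tables rather than produced by real hash functions, buckets are 0-indexed, and the class name and the `count` parameter are hypothetical.

```python
# Appearance counts and bucket assignments copied from the tables above.
appearances = {"O1": 10, "O2": 32, "O3": 1002, "O4": 500}
BUCKETS = [
    {"O1": 0, "O2": 0, "O3": 1, "O4": 0},  # h_1
    {"O1": 1, "O2": 0, "O3": 0, "O4": 0},  # h_2
]

class CountMinSketch:
    def __init__(self, bucket_maps, num_buckets):
        self.bucket_maps = bucket_maps
        self.counters = [[0] * num_buckets for _ in bucket_maps]

    def add(self, obj, count=1):
        # No sign: every appearance increments the object's bucket.
        for row, h in zip(self.counters, self.bucket_maps):
            row[h[obj]] += count

    def estimate(self, obj):
        # Collisions only inflate a counter, so take the smallest one.
        return min(row[h[obj]] for row, h in zip(self.counters, self.bucket_maps))

cms = CountMinSketch(BUCKETS, num_buckets=2)
for obj, n in appearances.items():
    cms.add(obj, n)  # same effect as adding obj n times
print(cms.counters)  # [[542, 1002], [1534, 10]]
print({obj: cms.estimate(obj) for obj in appearances})
# {'O1': 10, 'O2': 542, 'O3': 1002, 'O4': 542}
```

Because counters only ever grow, an estimate is never below the true count. The heavy hitter $O_3$ is recovered exactly here, while the small count for $O_2$ is swamped by $O_4$ in both trials.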