
Checkpoint 2

March 16, 2021

Garcia-Molina/Ullman/Widom: Ch. 15.1-15.5, 16.2-16.3, 16.7

Checkpoint 2

Just like Checkpoint 1, but now...

  • Selecting actual expressions
  • New SQL features (Limit, Order-By)
  • Tighter Constraints (Bigger data and less time/query)
  • Bonus Leaderboard Queries: More data than memory

New SQL Features: Sort

ORDER BY col1 asc/desc, col2 asc/desc, ...
Sort the data on col1 (breaking ties with col2, col3, ...), each in ascending or descending order.

  'Sort ['bar ASC NULLS FIRST], true
  +- 'Project [*]
     +- 'UnresolvedRelation [FOO], [], false
    

You can rely on Scala's native sort.


  case class Sort(
    order: Seq[SortOrder],  // Order clauses
    global: Boolean,        // ignore this (for distributed execution)
    child: LogicalPlan      // Input plan
  ) extends UnaryNode with Product with Serializable
    

  case class SortOrder(
    child: Expression,                    // The expression to sort by
    direction: SortDirection,             // Ascending or Descending
    nullOrdering: NullOrdering,           // NullsFirst or NullsLast
    sameOrderExpressions: Seq[Expression] // ignore this (used by spark)
  )
    
https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/plans/logical/Sort.html

      SELECT * FROM R ORDER BY A ASC, B DESC
    

      Sort(Seq(
          SortOrder(UnresolvedAttribute(Seq("A")), 
                    Ascending,
                    Ascending.defaultNullOrder,
                    Seq()),
          SortOrder(UnresolvedAttribute(Seq("B")), 
                    Descending,
                    Descending.defaultNullOrder,
                    Seq())
        ), true, UnresolvedRelation(Seq("R")))
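With Scala's built-in sort, a multi-key ORDER BY reduces to a comparator. A minimal sketch over plain tuples (the tuple `Row` and the hard-coded comparisons stand in for evaluating each `SortOrder`'s expression against your own row type):

```scala
type Row = (Int, String)

// ORDER BY a ASC, b DESC via Scala's sortWith: compare on the first
// key, and fall through to the tiebreaker only on equality.
def orderBy(rows: Seq[Row]): Seq[Row] =
  rows.sortWith { (x, y) =>
    if (x._1 != y._1) x._1 < y._1 // a ASC
    else x._2 > y._2              // b DESC as tiebreaker
  }
```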
    

New SQL Features: Limit

LIMIT N
Return only the first N rows. If the data is sorted, return the first N rows according to the sort order; if not, return an arbitrary set of N rows.

  'GlobalLimit 3
  +- 'LocalLimit 3
      +- 'Project [*]
         +- 'UnresolvedRelation [FOO], [], false
    

Output exactly limitExpr rows if available


  case class GlobalLimit(
    limitExpr: Expression, 
    child: LogicalPlan
  ) extends OrderPreservingUnaryNode with Product with Serializable
    

Output exactly limitExpr rows if available per partition


  case class LocalLimit(
    limitExpr: Expression, 
    child: LogicalPlan
  ) extends OrderPreservingUnaryNode with Product with Serializable
    

For this checkpoint: Pick one, ignore the other

https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/plans/logical/GlobalLimit.html
https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/plans/logical/LocalLimit.html

      SELECT * FROM R LIMIT 3
    

  GlobalLimit(
    Literal(3),
    LocalLimit(
      Literal(3),
      UnresolvedRelation(Seq("R"))
    )
  )
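Either node reduces to truncating the input. A minimal sketch, assuming rows arrive as a plain iterator:

```scala
// LIMIT n over an iterator of rows: take(n) stops pulling from the
// input after n rows, so the rest of the plan is never evaluated.
def limit[Row](rows: Iterator[Row], n: Int): Iterator[Row] =
  rows.take(n)
```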
    

Handling Tighter Constraints

Selection Pushdown
Push selections down through projections, joins, unions; as close to the data as possible
Join Conversion
Identify useful join conditions (e.g., $R.B = S.B$) and replace the cross product with a better join algorithm.

$\sigma_{c_1 \wedge c_2 \wedge c_3}(R \times S) = \sigma_{c_1}(R) \bowtie_{c_2} \sigma_{c_3}(S)$

$\sigma(\pi(...)) \rightarrow \pi(\sigma(...))$?

      plan transform {
        case Filter(condition, 
              Project(expressions, child)) => ???
      }
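One way to fill in the `???`: substitute the projection's definitions into the filter condition, then re-emit the two operators in swapped order. A toy sketch with stand-in expression classes (not the Catalyst ones):

```scala
sealed trait Expr
case class Attr(name: String)               extends Expr
case class Plus(l: Expr, r: Expr)           extends Expr
case class Alias(child: Expr, name: String) extends Expr

// Replace references to projected names with their definitions.
def substitute(e: Expr, defs: Map[String, Expr]): Expr = e match {
  case Attr(n)     => defs.getOrElse(n, e)
  case Plus(l, r)  => Plus(substitute(l, defs), substitute(r, defs))
  case Alias(c, n) => Alias(substitute(c, defs), n)
}

// The condition that goes below the projection:
//   Filter(c, Project(es, child)) => Project(es, Filter(c', child))
def pushDownCondition(condition: Expr, projections: Seq[Expr]): Expr = {
  val defs = projections.collect { case Alias(c, n) => n -> c }.toMap
  substitute(condition, defs)
}
```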
    
$\sigma(L\times R) \rightarrow L \bowtie R$?

      plan transform {
        case Filter(condition, 
              Join(lhs, rhs, Cross, joinCondition, hint)) => ???
      }
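Filling in this `???` usually starts by flattening the filter into its conjuncts and sorting them by which side's attributes they mention. A toy sketch with stand-in predicate classes:

```scala
sealed trait Pred { def refs: Set[String] }
case class Cmp(l: String, r: String) extends Pred { def refs = Set(l, r) }
case class And(l: Pred, r: Pred)     extends Pred { def refs = l.refs ++ r.refs }

// Flatten a tree of ANDs into its conjuncts.
def conjuncts(p: Pred): Seq[Pred] = p match {
  case And(l, r) => conjuncts(l) ++ conjuncts(r)
  case other     => Seq(other)
}

// Partition the conjuncts of a filter over L x R: left-only and
// right-only predicates can be pushed below the join; predicates that
// mention both sides become the join condition.
def classify(p: Pred, lhs: Set[String], rhs: Set[String])
    : (Seq[Pred], Seq[Pred], Seq[Pred]) = {
  val cs = conjuncts(p)
  ( cs.filter(_.refs.subsetOf(lhs)),
    cs.filter(_.refs.subsetOf(rhs)),
    cs.filterNot(c => c.refs.subsetOf(lhs) || c.refs.subsetOf(rhs)) )
}
```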
    

Spark's Optimizer

(simplified version)


  trait OptimizationRule {
    def apply(plan: LogicalPlan): LogicalPlan
  }

  object PushDownSelections 
    extends OptimizationRule  
  {
    def apply(plan: LogicalPlan) = 
      plan.transform { 
        case Filter(condition, 
                Project(expressions, child)) => ???
        /* and other cases here... */
      }
  }
    

Spark's Optimizer

(simplified version)

  1. Save a copy of the plan
  2. Apply every optimization rule once
  3. If the plan changed since the saved copy, goto 1
  4. If not, done!

Test whether two logical plans are the same with fastEquals

Watch out for infinite loops.


  val rules = Seq[OptimizationRule]( ??? )

  def onePass(plan: LogicalPlan): LogicalPlan = 
    { var current = plan
      for(rule <- rules){ current = rule.apply(current) }
      current
    }

  def fixpoint(plan: LogicalPlan): LogicalPlan = 
    { var current = plan
      var last: LogicalPlan = null
      while(last == null || !current.fastEquals(last)){
        last = current
        current = onePass(current)
      }
      current
    }
    

Handling Tighter Constraints

Selection Pushdown
Push selections down through projections, joins, unions; as close to the data as possible
Join Conversion
Identify useful join conditions (e.g., $R.B = S.B$) and replace the cross product with a better join algorithm.

One-Pass Hash Join

The reference implementation uses Scala's Map.

Keep in mind that multiple tuples may share the same join key, so each key needs to map to a collection of tuples.
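A minimal sketch of the build-and-probe structure, with `groupBy` handling the multiple-tuples-per-key case:

```scala
// One-pass hash join: build a map from join key to all matching left
// rows, then stream the right side through it. keyL/keyR extract the
// join attribute from each side's row type.
def hashJoin[L, R, K](left: Seq[L], right: Seq[R])
                     (keyL: L => K, keyR: R => K): Seq[(L, R)] = {
  // groupBy keeps *all* rows sharing a key, not just the last one
  val index: Map[K, Seq[L]] = left.groupBy(keyL)
  for {
    r <- right
    l <- index.getOrElse(keyR(r), Seq.empty)
  } yield (l, r)
}
```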

Picking a Join Order

  • Pick the smaller table for the outer relation.
  • LogicalPlan.maxRows is one way to bound the size of the relation (but needs to be defined for your Table class).
  • Optional: Use commutativity/associativity to find all join orders.
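For the first bullet, a hedged sketch of picking the side to hold in memory (`sizeBound` is a hypothetical stand-in for something like `LogicalPlan.maxRows`):

```scala
// Choose the build (in-memory) side of a hash join by bounding each
// input's size; fall back to the left side when either bound is unknown.
def chooseBuildSide[P](lhs: P, rhs: P)(sizeBound: P => Option[Long]): (P, P) =
  (sizeBound(lhs), sizeBound(rhs)) match {
    case (Some(l), Some(r)) if r < l => (rhs, lhs) // rhs is smaller: build on it
    case _                           => (lhs, rhs) // default: build on lhs
  }
```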

Cross product is expensive!
Can we do better?

$\sigma_c(R\times S) \equiv R\bowtie_c S$

Cross Product

Problem: Naively, any tuple matches any other

Join Conditions

Solution: First organize the data

Strategies for Implementing $R \bowtie_{R.A = S.A} S$

In-Memory Index Join (1-pass Hash; Hash Join)
Build an in-memory index on one table, scan the other.
Partition Join (2-pass Hash; External Hash Join)
Partition both sides so that tuples don't join across partitions.
Sort/Merge Join
Sort all of the data upfront, then scan over both sides.

Hash Functions

  • A hash function is a function that maps a large data value to a small fixed-size value
    • Typically deterministic & pseudorandom
  • Used in Checksums, Hash Tables, Partitioning, Bloom Filters, Caching, Cryptography, Password Storage, …
  • Examples: MD5, SHA1, SHA2
    • MD5() part of OpenSSL (on most OSX / Linux / Unix)
  • Can map h(k) to range [0,N) with h(k) % N (modulus)
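The last bullet in Scala, with one caveat: `hashCode % N` can be negative for a negative hash code, so use `floorMod`:

```scala
// Map an arbitrary key into one of n buckets [0, n).
def bucketOf(key: Any, n: Int): Int =
  java.lang.Math.floorMod(key.hashCode, n)
```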

Hash Functions

$$h(X) \mod N$$

  • Pseudorandom output between $[0, N)$
  • Always the same output for a given $X$

1-Pass Hash Join

1-Pass Hash Join

Limited Queries
Only supports join conditions of the form $R.A = S.B$
Moderate-High Memory
Keeps 1 full relation in memory
Low Added IO Cost
Only requires 1 scan over each input.

Bonus Problem: Memory

  • Which on-disk algorithms to use?
  • When to use them?

On-Disk

  • 2-Pass sort is the only on-disk algorithm you need
  • (plus Sort/Merge join)

Sort/Merge Join

Sort/Merge Join

Limited Queries
Only supports join conditions of the form $R.A = S.B$
Low Memory
Only needs to keep ~2 rows in memory at a time (not counting sort).
Low Added IO Cost
No added IO! (not counting sort).
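A minimal sketch of the merge step, assuming both inputs are already sorted on an integer join key (in-memory vectors stand in for sorted runs on disk):

```scala
// Merge step of a sort/merge join: advance whichever side has the
// smaller key; on a match, pair up the full run of equal keys from
// both sides before moving on.
def mergeJoin[L, R](left: IndexedSeq[L], right: IndexedSeq[R])
                   (keyL: L => Int, keyR: R => Int): Seq[(L, R)] = {
  val out = scala.collection.mutable.Buffer[(L, R)]()
  var i = 0; var j = 0
  while (i < left.size && j < right.size) {
    val kl = keyL(left(i)); val kr = keyR(right(j))
    if      (kl < kr) i += 1
    else if (kl > kr) j += 1
    else {
      var iEnd = i; while (iEnd < left.size  && keyL(left(iEnd))  == kl) iEnd += 1
      var jEnd = j; while (jEnd < right.size && keyR(right(jEnd)) == kl) jEnd += 1
      for (a <- i until iEnd; b <- j until jEnd) out += ((left(a), right(b)))
      i = iEnd; j = jEnd
    }
  }
  out.toSeq
}
```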

2-Way Sort

Pass 1
Create lots of (small) sorted lists.
Pass 2+
Merge sorted lists of size $N$ into sorted lists of size $2N$

Pass 1: Create Sorted Runs

Pass 2: Merge Sorted Runs

Repeat Pass 2 As Needed.
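The passes above, simulated in memory (each `Seq` run stands in for a file on disk):

```scala
// Toy 2-way external sort: pass 1 sorts fixed-size "pages" into runs;
// each later pass merges pairs of runs until one run remains.
def externalSort(data: Seq[Int], pageSize: Int): Seq[Int] = {
  def merge(a: Seq[Int], b: Seq[Int]): Seq[Int] =
    if (a.isEmpty) b
    else if (b.isEmpty) a
    else if (a.head <= b.head) a.head +: merge(a.tail, b)
    else b.head +: merge(a, b.tail)

  // Pass 1: create sorted runs of pageSize rows each
  var runs: Seq[Seq[Int]] = data.grouped(pageSize).map(_.sorted).toSeq
  // Pass 2+: merge runs of size N into runs of size 2N
  while (runs.size > 1)
    runs = runs.grouped(2).map {
      case Seq(a, b) => merge(a, b)
      case Seq(a)    => a
    }.toSeq
  runs.headOption.getOrElse(Seq.empty)
}
```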

Questions?