April, 2021
Just like Checkpoint 3, but now...
CREATE TABLE CUSTOMER(
CUSTKEY INT,
NAME STRING,
ADDRESS STRING,
NATIONKEY INT,
PHONE STRING,
ACCTBAL FLOAT,
MKTSEGMENT STRING,
COMMENT STRING
) USING csv OPTIONS(
path 'data/CUSTOMER.data',
delimiter = '|',
primary_key = 'custkey'
)
primary_key = 'custkey',
or
primary_key = 'orderkey,lineitem',
comma-separated list describing the primary key of the table
Time to scan SF 0.1 LINEITEM
Source | Time |
---|---|
CSV on NVME SSD | 0.88745s |
Time to scan SF 0.1 LINEITEM
Source | Time |
---|---|
CSV on NVME SSD | 0.88745s |
IndexedSeq[InternalRow] | 0.018s |
~30x speedup
Takeaway: Read data in at CREATE TABLE
Table(...)
Return an iterator over the preloaded table.
$\sigma_C(R)$ and $(\ldots \bowtie_C R)$
Original Query: $\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$
Possible Implementations:
Sort data on $(A, B, C, \ldots)$
First sort on $A$, $B$ is a tiebreaker for $A$,
$C$ is a tiebreaker for $B$, etc...
Which one do we pick?
(Start picking arbitrarily, then experiment)
These are called "Access Paths"
Time to filter SF 0.1 LINEITEM for one orderkey
Source | Time |
---|---|
CSV on NVME SSD | 0.9196s |
IndexedSeq[InternalRow] | 0.0624s |
Time to filter SF 0.1 LINEITEM for one orderkey
Source | Time |
---|---|
CSV on NVME SSD | 0.9196s |
IndexedSeq[InternalRow] | 0.0624s |
Sorted IndexedSeq[InternalRow] + Bin Search | 0.0008s |
~80x speedup
Takeaway: Sort on primary key and binary search.
Filter(expression, Table(...))
If expression is a ...
If expression is a ...
But TPC-H doesn't have filters on keys...
USING csv OPTIONS(
path '../TPCH/LINEITEM.csv',
delimiter = '|',
primary_key = 'orderkey,linenumber',
tree_index = 'shipdate',
hash_index = 'linestatus|shipmode'
)
tree_index and hash_index are |-separated lists of ,-separated indexes.