Active Students: Aaron Huber
Supported By: NSF Award #IIS-1956149
Probabilistic databases allow users to track uncertainty in data and to better understand its effects on the outcome of queries. These are key requirements for accurate decision-making and rigorous data science over noisy data. Unfortunately, probabilistic databases have historically been slow and hard to use. The FastPDB system, part of the overall Uncertainty4U project, is an effort to overcome the performance limitations of probabilistic databases.
Past efforts to make probabilistic databases efficient have focused on so-called set-probabilistic databases, where the underlying data model is one of sets, and the primary objective is to compute the probability of a specific outcome. By contrast, several database systems, including Pip and MCDB, adopt the more common and more efficient bag semantics, focusing instead on computing expectations. Although expectations are cheaper to compute than probabilities, one of our key findings is that even expectation queries must be asymptotically slower than analogous non-probabilistic queries.
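To make the distinction between the two objectives concrete, here is a minimal sketch in Python. It is not FastPDB code: the relation `ROWS`, the function names, and the assumption that rows exist independently of one another are all illustrative. For a simple projection, the bag-semantics answer is the expected number of copies of a value, while the set-semantics answer is the probability that the value appears at all. A brute-force pass over every possible world cross-checks both.

```python
from itertools import product
from math import prod

# Hypothetical tuple-independent relation: (city, probability the row exists).
ROWS = [("Buffalo", 0.5), ("Buffalo", 0.5), ("Chicago", 0.25)]


def expected_multiplicity(rows, city):
    """Bag semantics: expected number of copies of `city` after projection.

    By linearity of expectation this is the sum of the row probabilities.
    """
    return sum(p for c, p in rows if c == city)


def presence_probability(rows, city):
    """Set semantics: probability that `city` appears at least once.

    With independent rows, this is one minus the chance that every row is absent.
    """
    return 1.0 - prod(1.0 - p for c, p in rows if c == city)


def brute_force(rows, city):
    """Enumerate every possible world to cross-check both closed forms."""
    exp_count = 0.0
    prob_present = 0.0
    for world in product([False, True], repeat=len(rows)):
        # Probability of this particular combination of present/absent rows.
        weight = prod(p if keep else 1.0 - p for keep, (_, p) in zip(world, rows))
        count = sum(1 for keep, (c, _) in zip(world, rows) if keep and c == city)
        exp_count += weight * count
        if count > 0:
            prob_present += weight
    return exp_count, prob_present
```

On this toy input the two objectives already disagree: "Buffalo" has an expected multiplicity of 1.0 but appears with probability 0.75. The example is far too small to show any asymptotic behavior; it only illustrates what each semantics computes.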
Our key insight is that standard processes for sampling from query results can be inlined into standard query evaluation. We are developing the FastPDB system as an extension of the XDB Approximate Query Processing System and the GProm provenance tracking tool. XDB approximates query results in a fraction of the time of a normal database system. As our SIGMOD 2025 paper shows, the resulting system can approximate the expectation of a bag-probabilistic query result in a fraction of the time required to produce a deterministic result.
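The general idea of folding sampling into query evaluation can be sketched as follows. This is an illustrative toy, not the algorithm from the paper and not the XDB or GProm API: the relations, the helper names, the independence assumption, and the use of Bernoulli sampling with inverse-probability reweighting are all assumptions made for the example. The sketch estimates the expected size of a join result by keeping each join result with probability `rate` as it is produced, rather than materializing the full result first.

```python
import random


def make_relation(n_rows, n_keys, rng):
    """Hypothetical relation: a list of (join_key, probability the row exists)."""
    return [(i % n_keys, rng.random()) for i in range(n_rows)]


def join_weights(r, s):
    """Yield one weight per deterministic join result.

    Assuming rows exist independently, the weight is the probability that
    both contributing rows are present.
    """
    by_key = {}
    for key, p in s:
        by_key.setdefault(key, []).append(p)
    for key, p_r in r:
        for p_s in by_key.get(key, []):
            yield p_r * p_s


def exact_expected_count(r, s):
    """Expected join size: the sum of all weights (linearity of expectation)."""
    return sum(join_weights(r, s))


def sampled_expected_count(r, s, rate, rng):
    """Estimate the expected join size from a Bernoulli sample of the results.

    Each result is kept with probability `rate` while the join runs, and
    kept weights are scaled by 1 / rate so the estimate stays unbiased.
    """
    total = 0.0
    for w in join_weights(r, s):
        if rng.random() < rate:
            total += w / rate
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    r = make_relation(200, 10, rng)
    s = make_relation(200, 10, rng)
    print("exact:   ", exact_expected_count(r, s))
    print("estimate:", sampled_expected_count(r, s, 0.25, rng))
```

In this sketch the join itself is still enumerated in full, so it saves no time; it only shows where a sampling decision can sit inside evaluation. A lower `rate` touches fewer results at the cost of a noisier estimate.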
(The FastPDB project is being developed in collaboration with Boris Glavic, Atri Rudra, and Zhuoyue Zhao.)
This page last updated 2025-03-04 15:41:55 -0500