ODIn: Online Data INteractions

Making Compilers and Program Analysis Declarative

Modern compilation pipelines (compilers, optimizers, program analysis tools) consume, transform, and produce abstract syntax trees. If you squint a little, these processes start to look a lot like database queries, updates, and views. The ODIn Lab is looking at ways to leverage expertise from the database community to improve the performance of compiler tool chains, while at the same time making it easier to write compilers and program analyses by decoupling the core specification logic from logic aimed at making the compiler faster.

Active Participants: Victoria Dib
Alumni: Ankur Upadhyay, Daniel Bellinger, Darshana Balakrishnan, Hank Lin, Nick Brown
Sponsors: NSF (Award IIS-1617586)
Recent Publications
- Flow-centric Query Evaluation Pipelines. Victoria Dib, Andrew J. Mikalsen, Krishna Sivakumar, Jaroslaw Zola, Oliver Kennedy. North East Database Day 2026.
- ASTral: A framework for modern SQL compilers. Darshana Balakrishnan. PhD Thesis 2025.
- Flow-centric Data Pipelines. Victoria Dib, Andrew J. Mikalsen, Jaroslaw Zola, Oliver Kennedy. Ontario Database Day 2025.

Draupnir💾📜

Collaborators: Victoria Dib, Andrew J. Mikalsen, Krishna Sivakumar, Jaroslaw Zola, Ethan Canton

Multiple efforts in the PL community have adopted Datalog as a basis for large-scale program analysis, due to its ability to easily define recursive properties. However, existing Datalog engines are incapable of running these analyses on very large programs (e.g., the Linux kernel or the Chrome browser). Draupnir is a Datalog engine that draws on group-/ring-theoretic provenance models (e.g., Semiring Provenance, Ring Databases) to scale to large Datalog workloads on commodity hardware.
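
To illustrate what a "recursive property" looks like in practice, here is a minimal sketch of semi-naive evaluation for a two-rule reachability program over a call graph. This is a generic textbook technique, not Draupnir's algorithm; the function name `transitive_closure` and the example call graph are hypothetical and chosen only for illustration.

```python
def transitive_closure(edges):
    """Semi-naive evaluation of the Datalog program:
         reach(x, y) :- edge(x, y).
         reach(x, z) :- reach(x, y), edge(y, z).
    """
    # Index edge(y, z) by its first column so the join is a dictionary lookup.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    reach = set(edges)   # all facts derived so far
    delta = set(edges)   # facts that were new in the previous round
    while delta:
        new_facts = set()
        # Only join the facts from the last round against edge(y, z);
        # older facts have already contributed everything they can.
        for x, y in delta:
            for z in successors.get(y, ()):
                if (x, z) not in reach:
                    new_facts.add((x, z))
        reach |= new_facts
        delta = new_facts
    return reach


# Hypothetical call graph: main calls parse, parse calls lex, and so on.
calls = [("main", "parse"), ("parse", "lex"), ("lex", "read_char")]
result = transitive_closure(calls)
cyclic = transitive_closure([("a", "b"), ("b", "a")])
```

The loop stops once a round derives no new facts, so it also terminates on cyclic inputs such as mutually recursive functions.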

Hedgelog

Collaborators:

Datalog excels at recursive queries, but struggles with aggregate queries. The Hedgelog project is exploring the underlying nature of this struggle and developing group-theoretic techniques that serve as the basis for a language that is fully backwards-compatible with Datalog, while providing ergonomic, optimizer-friendly support for aggregation.

DiDB

Collaborators: Victoria Dib

Incremental computation produces a unique data access pattern that is highly predictable. The DiDB (Distributed Incremental Data Base) project is exploring ways to leverage the predictability of these access patterns to minimize the overhead of concurrent updates.

Automating Re-usable and Reproducible Data Science

Data science is often an iterative process, with scientists frequently revisiting past efforts and adapting them for new research goals. A key challenge in this process is ensuring that efforts are re-used safely. Just as compilers for strongly typed languages help users understand the potential implications of changes in their code, the ODIn Lab is looking to develop similar techniques to assist data scientists in developing safe, re-usable, and reproducible data science pipelines.

Active Participants: Pratik Pokharel
Alumni: Poonam Kumari, Ting Xie, Ying Yang
Sponsors: NASA (Awards CURE-C STTR Phase 2, CURE-C STTR Phase 1), NSF (Awards CNS-2125516, ACI-1640864)
Recent Publications
- Drag, Drop, Merge: A Tool for Streamlining Integration of Longitudinal Survey Instruments. Pratik Pokharel, Juseung Lee, Oliver Kennedy, Jeff Good, Marianthi Markatou, Andrew H. Talal, Raktim Mukhopadhyay. Workshop on Human-In-the-Loop Data Analytics 2024.
- Embedding The Context of Attributes in Longitudinal Study Schemas To Refine Their Semantic Match Candidates. Pratik Pokharel, Oliver Kennedy. Ontario Database Day 2024.
- Social determinants of health derived from people with opioid use disorder: Improving data collection, integration and use with cross-domain collaboration and reproducible, data-centric, notebook-style workflows. Marianthi Markatou, Oliver Kennedy, Michael Brachmann, Raktim Mukhopadhyay, Arpan Dharia, Andrew H. Talal. Frontiers in Medicine 2023.

Vizier🕸️💾📜

Collaborators: Marianthi Markatou, Michael Brachmann, Raktim Mukhopadhyay, Arpan Dharia, Andrew H. Talal, Boris Glavic, Nachiket Deo, Juliana Freire, Poonam Kumari, Su Feng, Will Spoth, Ying Lu, Beda Hammerschmidt, Zhen Hua Liu, Aaron Huber, Heiko Mueller, Sonia Castelo, Carlos Bautista, Fatemeh Nargesian, Remi Rampin, Ying Yang, Zhengjie Miao, Qitian Zeng, Chenjie Li, Sudeepa Roy, Ting Xie, Dieter Gawlick, Bahareh Sadat Arab, Eric S. Chan, Adel Ghoneimy, Seokki Lee, Xing Niu, Bahareh Arab, Vasudha Krishnaswamy

Notebooks are a convenient tool for data exploration, allowing users to rapidly explore datasets while preserving a record of the steps they took to get to their results. However, the notebook metaphor takes several design shortcuts in the name of efficiency that make it difficult to create safely reusable notebooks, and that create confusion for novice users. Vizier is the prototype of a new "workbook" metaphor that presents a notebook-style interface to a "microkernel workflow system", which keeps cell results up to date and doesn't have a kernel that needs to be re-started and re-populated.

Theseus📜

Collaborators: Pratik Pokharel, Juseung Lee, Jeff Good, Marianthi Markatou, Andrew H. Talal, Raktim Mukhopadhyay, Poonam Kumari

Over the course of a multi-year (longitudinal) study, as social sciences researchers gain an improved understanding of the study's domain and subjects, it may be necessary to make incremental changes to the study's data collection procedures. Theseus is a set of tools for working with data collected under evolving methodologies, allowing study designers to flag differences between study iterations, and allowing researchers to quickly select the subset of the data that is appropriate for their analysis.

Safe, Efficient Management of Uncertain Data

Ambiguous, incomplete, conflicting, and otherwise uncertain data is difficult to use safely, often requiring days or even weeks of careful study to understand the uncertainty's implications for downstream analyses. Exploration and re-use of uncertain data pose a particularly significant challenge, as it becomes easy to lose track of assumptions made during one phase of a data science project. The ODIn Lab is exploring how uncertainty in data can be precisely modeled without impeding a researcher's ability to explore or analyze their data.

Active Participants
Alumni: Aaron Huber, Arindam Nandi, Niccolò Meneghetti, Poonam Kumari, Shivang Aggarwal, Ying Yang
Sponsors: NASA (Awards CURE-C STTR Phase 2, CURE-C STTR Phase 1), NSF (Awards IIS-1956149, IIS-1750460), Oracle University Relations, The US Naval Postgraduate School (Award N00244-16-1-0022)
Recent Publications
- FastPDB: Towards Bag-Probabilistic Queries at Interactive Speeds. Aaron Huber, Oliver Kennedy, Atri Rudra, Zhuoyue Zhao, Su Feng, Boris Glavic. Proceedings of the ACM on Management of Data 2025.
- Theoretical Bounds and Systems Development for Exact/Approximate Multiplicities in Bag PDBs. Aaron Huber. PhD Thesis 2025.
- Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data. Su Feng, Boris Glavic, Oliver Kennedy. Proceedings of the VLDB Endowment 2023.

Mimir💾📜

Collaborators: Aaron Huber, Atri Rudra, Zhuoyue Zhao, Su Feng, Boris Glavic, Juliana Freire, Michael Brachmann, Zhengjie Miao, Qitian Zeng, Chenjie Li, Sudeepa Roy, Poonam Kumari, Will Spoth, Ting Xie, Ying Yang, Beda Hammerschmidt, Zhen Hua Liu, Dieter Gawlick, Niccolò Meneghetti, Wolfgang Gatterbauer, Bahareh Sadat Arab, Eric S. Chan, Adel Ghoneimy, Seokki Lee, Xing Niu, Arindam Nandi, Ronny Fehling, Bahareh Arab, Vasudha Krishnaswamy, Heiko Mueller, Said Achmiz, Jan Chomicki, Anand Sankar Bhagavandas, Gourab Mitra, Shivang Aggarwal, Sneha Krishnamurthy, Vinayak Karuppasamy

Mimir adds support for uncertain data to Apache Spark by using lightweight annotations called "Caveats" that are propagated through queries. Caveated data and records can carry additional metadata (e.g., textual descriptions of uncertainty, or details about the scope of the uncertainty), but this information is not propagated during normal query evaluation. By only propagating the presence of a caveat in a query result, the performance impact on query evaluation is minimal, and the content of the resulting caveats can be subsequently derived through a form of program slicing applied to queries.
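
The idea of propagating only the presence of a caveat can be sketched in a few lines. The following is an illustrative toy in plain Python, not Mimir's implementation or its Spark API; the `Cell` and `Row` classes and the `with_sum` and `select` helpers are hypothetical names introduced for this example. It also ignores rows that a filter drops because of a caveated value, which a complete treatment would need to account for.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    value: float
    caveated: bool = False   # one bit: "this value depends on a caveat"


@dataclass
class Row:
    cells: dict              # column name -> Cell
    caveated: bool = False   # "this row's presence depends on a caveat"


def with_sum(rows, left, right, out):
    """Add column `out` = left + right; the result is caveated if either input is."""
    result = []
    for row in rows:
        a, b = row.cells[left], row.cells[right]
        cells = dict(row.cells)
        cells[out] = Cell(a.value + b.value, a.caveated or b.caveated)
        result.append(Row(cells, row.caveated))
    return result


def select(rows, column, predicate):
    """Keep rows whose `column` satisfies `predicate`. If the tested value is
    caveated, the surviving row's presence is marked as caveated too."""
    result = []
    for row in rows:
        cell = row.cells[column]
        if predicate(cell.value):
            result.append(Row(row.cells, row.caveated or cell.caveated))
    return result


rows = [
    Row({"price": Cell(10.0), "tax": Cell(1.0)}),
    Row({"price": Cell(20.0, caveated=True), "tax": Cell(2.0)}),  # a guessed price
]
totals = with_sum(rows, "price", "tax", "total")
expensive = select(totals, "total", lambda v: v > 15)
```

Only boolean flags travel with the data here; the descriptive metadata attached to a caveat would be looked up separately, after the fact.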

FastPDB💾📜

Collaborators: Aaron Huber, Atri Rudra, Zhuoyue Zhao, Su Feng, Boris Glavic

Although probabilistic databases (PDBs) are incredibly important for helping data scientists keep track of uncertainty and ambiguity in their data, existing PDBs can also be orders of magnitude slower than normal database engines. FastPDB is the first system to couple a PDB with an Approximate Query Processing engine, providing accurate approximations of probabilistic query results with competitive runtimes.

Uncertainty Annotated Databases📜

Collaborators: Su Feng, Boris Glavic, Aaron Huber

The UADB project, in collaboration with UIC and Nanjing Tech, provides a lightweight model of incomplete data based on ranges. Instead of attempting to produce exact results, UADBs answer queries with bounds (both lower and upper) on the space of possible results. In addition to providing more valuable feedback for users, these bounds can be relaxed to improve performance.
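
As a rough illustration of answering a query with bounds rather than an exact value, the sketch below computes lower and upper bounds for a sum and for a filtered count over values known only as ranges. It is a simplified toy under assumed names (`Range`, `sum_bounds`, `count_at_least`), not the UADB data model or its query semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    lo: float   # smallest value the true answer could take
    hi: float   # largest value the true answer could take


def sum_bounds(values):
    """Bound SUM(v): the true sum lies between the sum of the lower
    bounds and the sum of the upper bounds."""
    return Range(sum(v.lo for v in values), sum(v.hi for v in values))


def count_at_least(values, threshold):
    """Bound COUNT(*) WHERE v >= threshold.
    lo counts values that certainly pass; hi counts values that possibly pass."""
    certain = sum(1 for v in values if v.lo >= threshold)
    possible = sum(1 for v in values if v.hi >= threshold)
    return Range(certain, possible)


# Hypothetical sensor readings: one exact, one imprecise, one nearly unknown.
readings = [Range(10, 10), Range(5, 8), Range(0, 20)]
total = sum_bounds(readings)
passing = count_at_least(readings, 8)
```

Any wider interval that still contains the true answer remains a valid bound, which is the sense in which precision can be traded away in a range-based model.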

Paused and Completed Projects

DBToaster🕸️📜

Collaborators: Christoph Koch, Yanif Ahmad, Milos Nicolic, Andres Nötzli, Daniel Lupei, Amir Shaikhana

DBToaster is an SQL-to-native-code compiler. It generates lightweight, specialized, embeddable query engines for applications that require real-time, low-latency data processing and monitoring capabilities. The DBToaster compiler generates code that can be easily incorporated into any C++ or JVM-based (Java, Scala, ...) project. Since 2009, DBToaster has spearheaded the currently ongoing database compilers revolution. If you are looking for the fastest possible execution of continuous analytical queries, DBToaster is the answer. DBToaster code is 3-6 orders of magnitude faster than all other systems known to us.

Status: This project continues at EPFL and the University of Edinburgh; here, it evolved into the declarative compiler project.

Ettu📜

Collaborators: Ting Xie, Varun Chandola, Gökhan Kul, Duc Thanh Luong, Shambhu Upadhyaya, Patrick Coonan, Thomas Mitchell

One of the greatest threats to the security of a database system comes from within: users who have been granted access to data using it in a malicious or illegitimate way. Our goal is to develop new types of statistical signatures for a user or role's behavior as they access a database. Using these signatures, we can identify non-standard behavior that could be evidence of malicious activity.

Status: Project completed

Jxplain

Collaborators: Will Spoth, Ying Lu, Beda Hammerschmidt, Zhen Hua Liu

Ad-hoc data models like JSON simplify schema evolution and enable multiplexing various data sources into a single stream. While useful when writing data, this flexibility makes JSON harder to validate and query, forcing such tasks to rely on automated schema discovery techniques. Unfortunately, ambiguity in the schema design space forces existing schema discovery systems to make simplifying, data-independent assumptions about schema structure. When these assumptions are violated, most notably by APIs, the generated schemas are imprecise, creating numerous opportunities for false positives during validation. The Jxplain system adopts a set of heuristics that effectively resolve this ambiguity.

Status: Project completed

Pocket Data📜

Collaborators: Carl Nuessle, Lukasz Ziarek, Jerry Antony Ajay, Geoffrey Challen, Gökhan Kul, Gourab Mitra, Grant Wrazen

Mobile phone CPUs can slow down to save power. An OS component called the 'governor' chooses when and how this happens. It turns out most governors on Android are over-engineered, making predictions based on useless data. The Pocket Data project explores how we can use live information from apps to make better choices for power management.

Status: Project completed

Leaky Joins📜

Collaborators: Ying Yang

One of the primary challenges in graphical models is inference, or re-constructing a marginal probability from the graphical model's factorized representation. While tractable for some graphs, the cost of inference grows exponentially with the graphical model's complexity, necessitating approximation for more complex graphs. The leaky join algorithm couples incremental view maintenance with approximate query processing to produce an anytime algorithm for inference in very large-scale Bayes nets.

Status: Project completed

Loki📜

Collaborators: Poonam Kumari, Will Spoth, Fatemeh Nargesian, Gourab Mitra

The Label Once and Keep It project looks at how we can incrementally curate data lakes by tracking how people integrate data from multiple sources. Like popular work on Unionability, our goal is to identify related datasets, but we also support datasets that must be transformed (or 'mapped') in order to be unionable.

Status: Evolved into the DDM project

Pip🕸️📜

Collaborators: Suman Nath, Steve Lee, Slawek Smyl, Charles Loboz, Christoph Koch

Pip was one of the first probabilistic database systems to cope with value-level uncertainty (as opposed to row-level uncertainty). In contrast to purely sampling-based approaches like MCDB, Pip was able to produce exact analytical results in a fraction of the time.

Status: This project evolved into the Mimir system