Informational Referential Integrity Constraints Support in Apache Spark


An informational (or statistical) constraint is a unique, primary key, foreign key, or check constraint that Apache Spark can use to improve query performance. Informational constraints are not enforced by the Spark SQL engine; rather, Catalyst uses them to optimize query processing. They are primarily targeted at applications that load and analyze data originating from a data warehouse. For such applications, the conditions of a given constraint are already known to hold, so the constraint does not need to be enforced during data load operations.

This session will cover the support for primary and foreign key (referential integrity) constraints in Spark. You’ll learn about constraint specification, metastore storage, and constraint validation and maintenance. You’ll also see examples of query optimizations that rely on referential integrity constraints, such as join elimination, distinct elimination, and star schema detection.
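To make the idea behind join and distinct elimination concrete, here is a minimal sketch of the equivalences those rewrites depend on. It does not use Spark or the constraint syntax presented in the session; it assumes Python's built-in `sqlite3` module as a stand-in engine, and the `customer` and `orders` tables, their columns, and the sample rows are hypothetical. The point is only that when a foreign key is known to hold, a join to the parent table that contributes no columns can be dropped, and a `DISTINCT` over a primary key is redundant.

```python
import sqlite3

# Illustration only: SQLite stands in for the engine, and the schema
# and data below are hypothetical, not taken from the session.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer (c_id INTEGER PRIMARY KEY, c_name TEXT);
CREATE TABLE orders (
    o_id    INTEGER PRIMARY KEY,
    o_cust  INTEGER NOT NULL REFERENCES customer(c_id),
    o_total REAL
);
INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")

def rows(sql):
    """Run a query and return its rows in a canonical (sorted) order."""
    return sorted(cur.execute(sql).fetchall())

# Join elimination: every orders.o_cust matches exactly one customer row
# (foreign key, NOT NULL, unique parent key), and the query uses no
# customer columns, so the join neither drops nor duplicates rows.
with_join = rows(
    "SELECT o.o_id, o.o_total "
    "FROM orders o JOIN customer c ON o.o_cust = c.c_id"
)
without_join = rows("SELECT o_id, o_total FROM orders")

# Distinct elimination: c_id is a primary key, so rows that include it
# are already unique and DISTINCT does no work.
with_distinct = rows("SELECT DISTINCT c_id, c_name FROM customer")
without_distinct = rows("SELECT c_id, c_name FROM customer")

join_is_redundant = with_join == without_join
distinct_is_redundant = with_distinct == without_distinct
print(join_is_redundant, distinct_is_redundant)
```

An optimizer that trusts the declared constraints can pick the cheaper form of each query without inspecting the data. If the constraint does not actually hold in the data, the two forms can return different results, which is why informational constraints suit data whose integrity was already established upstream.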

Session hashtag: #SFdev21

Ioana Delaney, Senior Software Engineer at IBM

About Ioana

Ioana Delaney is a Senior Software Engineer at the Silicon Valley Laboratory in San Jose, California. She was part of the DB2 for LUW development team until she recently joined the Spark Technology Center at IBM. She has worked in many areas of SQL query compilation, including query semantics, query rewrite, query optimization, and the federated/distributed compiler.

Suresh Thalamati, Advisory Software Engineer at IBM

About Suresh

Suresh Thalamati is an Advisory Software Engineer at the Spark Technology Center at IBM. He is an Apache Spark contributor and works in the open source community. He is an Apache Derby committer and a PMC member. He is experienced in relational databases, distributed computing, and big data analytics, with a focus on Hadoop MapReduce technologies.