Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real-world data is expressed in natural language, images, and other “dark” data formats that are fraught with imprecision and ambiguity, and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine dark-data and other prediction tasks dramatically easier. At its core, Snorkel attacks a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, rather than performing manual feature engineering or tediously hand-labeling individual data items. We’ll also provide a set of tutorials that show how to write Snorkel applications that use Spark.
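To make the labeling-programs idea concrete, here is a minimal plain-Python sketch. It is not Snorkel's actual API; the task (spam detection), the labeling functions, and the simple majority-vote combiner are all illustrative assumptions. Snorkel itself combines labeling-function votes with a learned generative model rather than a majority vote.

```python
# Illustrative sketch of the labeling-function idea (NOT Snorkel's real API):
# instead of hand-labeling items, the user writes small programs that each
# vote on an item's label, and the votes are combined into training labels.

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

# Hypothetical labeling functions for a spam-detection task.
def lf_contains_link(text):
    # Messages with a raw link are often spam.
    return SPAM if "http://" in text else ABSTAIN

def lf_short_message(text):
    # Very short messages are usually benign.
    return NOT_SPAM if len(text.split()) < 4 else ABSTAIN

def lf_mentions_prize(text):
    # "Prize" is a common spam keyword.
    return SPAM if "prize" in text.lower() else ABSTAIN

LFS = [lf_contains_link, lf_short_message, lf_mentions_prize]

def majority_label(text):
    """Combine labeling-function votes by simple majority.

    Snorkel instead learns each function's accuracy and correlations
    with a generative model; majority vote is just the simplest stand-in.
    """
    votes = [lf(text) for lf in LFS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

print(majority_label("Claim your prize at http://spam.example"))  # prints 1 (SPAM)
```

In practice the same pattern scales out naturally: because each labeling function is a pure function of one data item, applying them across a large corpus is an embarrassingly parallel map, which is why pairing this workflow with Spark is attractive.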
Snorkel is open source on GitHub and available from Snorkel.Stanford.edu.
I’m an associate professor in the InfoLab, affiliated with the Statistical Machine Learning Group, PPL, and SAIL. I work on the foundations of the next generation of data analytics systems. These systems extend ideas from databases, machine learning, and theory, and our group is active in all three areas. A major application of our work is making it dramatically easier to build high-quality machine learning systems that process dark data, including text, images, and video, e.g., Snorkel.