Realtime Analytical Query Processing and Predictive Model Building on High Dimensional Document Datasets with Timestamps

Slides PDF Video

Spark SQL and Mllib are optimized for running feature extraction and machine learning algorithms on row based columnar datasets through full scan but does not provide constructs for column indexing and time series analysis. For dealing with document datasets with timestamps where the features are represented as variable number of columns in each document and use-cases demand searching over columns and time to retrieve documents to generate learning models in realtime, a close integration within Spark and Lucene was needed. We introduced LuceneDAO in Spark Summit Europe 2016 to build distributed lucene shards from data frame but the time series attributes were not part of the data model. In this talk we present our extension to LuceneDAO to maintain time stamps with document-term view for search and allow time filters. Lucene shards maintain the time aware document-term view for search and vector space representation for machine learning pipelines. We used Spark as our distributed query processing engine where each query is represented as boolean combination over terms with filters on time. LuceneDAO is used to load the shards to Spark executors and power sub-second distributed document retrieval for the queries.

Our synchronous API uses Spark-as-a-Service to power analytical queries while our asynchronous API uses kafka, spark streaming and HBase to power time series prediction algorithms. In this talk we will demonstrate LuceneDAO write and read performance on millions of documents with 1M+ terms and configurable time stamp aggregate columns. We will demonstrate the latency of APIs on a suite
of queries generated from terms. Key takeaways from the talk will be a thorough understanding of how to make Lucene powered time aware search a first class citizen in Spark to build interactive analytical query processing and time series prediction algorithms.

Debasish Das, Data Scientist at Verizon

About Debasish

Debasish Das joined Verizon’s Big Data Analytics Group in 2013. Prior to joining Verizon Debasish worked at Intel, Synopsys, Magma and Mentor Graphics. He did his PhD in EECS from Northwestern and BTech in CS from IIT Kharagpur. His current interests include scaling distributed convex optimization, developing machine learning algorithms for batch/streaming workflows and serving the prediction output at scale. His current focus is on developing recommendation engines for mobile advertising and anomaly detection for network security. He has contributed to open source projects like Apache Spark, ScalaNLP Breeze and Embotech ECOS.