Tagging and Processing Data in Real-Time Using Spark Streaming


Apache Spark is a flexible, scalable, and fault-tolerant data processing framework that specializes in processing large amounts of data. Spark Streaming builds on top of the core library to consume data in real time from ingest systems like Apache Kafka, Apache Flume, and Amazon Kinesis. In this talk, we will cover recent advances in Spark Streaming, including the design of several new features that have improved performance and eliminated the possibility of data loss. We will discuss the use of Spark Streaming to normalize data coming in from a variety of sources in real time, and how this normalized data is then tagged and made available to downstream applications for consumption. We will also discuss the integration of Spark Streaming with Kafka in both directions and why such an integration is important for this use case.


About Hari

Hari Shreedharan is a PMC member and committer on the Apache Flume project, a committer on the Apache Sqoop project, and a regular contributor to Apache Spark. Hari is a Software Engineer at Cloudera, where he works on Apache Flume, Apache Spark, and Apache Sqoop. He also ensures that customers can successfully deploy and manage Flume, Spark, and Sqoop on their clusters by helping them resolve any issues they face. Hari completed his bachelor's degree at Malaviya National Institute of Technology, Jaipur, India, and his master's degree in Computer Science at Cornell University in 2010.


About Siddhartha

Siddhartha Jain is a CISSP with more than 10 years of experience in information security. He has held a variety of infosec roles, including design and architecture, incident response, security operations, forensics, and standards implementation (ISO 27001). Siddhartha is a Director of Information Security at Salesforce, where he currently leads the development and deployment of several systems that use Spark and Spark Streaming.