San Francisco
June 30 - July 2, 2014

Spark Summit 2014 brought the Apache Spark community together on June 30 - July 2, 2014 at The Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming and related projects.


Spark Summit 2014
Using Spark to generate analytics for international cable TV video distribution
Christopher Burdorf (NBC Universal)

Digital video media clips for cable TV programs, commercials, etc. are produced and broadcast from our Los Angeles office to cable TV channels in Europe, Asia, and Australia/New Zealand. Metadata associated with these media clips is stored in an Oracle database and in broadcast automation playlists. We needed to generate analytics from this metadata to better understand how the clips are used. In addition, analysis of historical broadcast timestamps is used to determine which clips may be purged from near-line broadcast media caches.

The Spark Scala framework is used to query the Oracle database and to distribute the loading of metadata from the broadcast automation playlists into large in-memory resilient distributed datasets (RDDs). Mesos manages the resources in our cluster for deployment of the Spark Scala application. The framework then filters the data into domain-specific objects, which are processed to build broadcast frequency counts via Spark’s map/reduceByKey utilities.
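The map/reduceByKey counting step can be sketched in a few lines. This is a plain-Python illustration of the same logic, not the Spark Scala code from the talk: the `Broadcast` record, its field names, the `reduce_by_key` helper, and the sample rows are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical domain object; the field names are illustrative only.
Broadcast = namedtuple("Broadcast", ["media_id", "channel", "timestamp"])


def reduce_by_key(pairs, combine):
    """Merge the values of (key, value) pairs that share a key.

    Mimics the semantics of Spark's reduceByKey on a local list.
    """
    merged = {}
    for key, value in pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


def broadcast_counts(playlist_rows):
    # Filter raw rows into domain objects, dropping rows with no media ID.
    events = [Broadcast(*row) for row in playlist_rows if row[0]]
    # map: emit (media_id, 1) per broadcast; reduce by key: sum the ones.
    pairs = [(event.media_id, 1) for event in events]
    return reduce_by_key(pairs, lambda a, b: a + b)


# Made-up sample rows: (media ID, channel, broadcast timestamp).
rows = [
    ("clip-a", "EU-1", "2014-03-01T20:00:00"),
    ("clip-b", "ASIA-2", "2014-03-01T21:00:00"),
    ("clip-a", "ANZ-1", "2014-03-02T09:30:00"),
    ("", "EU-1", "2014-03-02T10:00:00"),
]
counts = broadcast_counts(rows)
```

Keying the pairs on a tuple such as `(channel, media_id)` instead of `media_id` alone would give per-channel counts with the same helper.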

The resulting data is then bulk loaded into Hadoop/HBase, where it is queried by a Java/Spring web application that converts the results into graphs illustrating media broadcast frequency counts by week, month, and year – in total, and on a per-channel basis. The data structure and processes will be discussed and graphs will be presented showing some of the results.

A report listing media clips in ascending order of last broadcast timestamp is generated from a distributed in-memory join using the Spark Scala framework API. The join combines two datasets, both built using Spark: one containing media IDs and an associated most-recent broadcast timestamp (using map/distinct/groupBy) and another, queried from Oracle, containing a list of media IDs that have media files (e.g., MXF, MPEG) residing in a near-line broadcast media cache. The result of the join is sorted into ascending order on the most recent broadcast timestamp field. The media at the top of the list are therefore prime candidates for purging from our near-line broadcast media cache, as they have not been broadcast recently. Details on the Spark system performance will be presented.
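The shape of this purge report can be sketched as follows. Again this is a plain-Python illustration under stated assumptions, not the Spark join used in production: the function names and sample data are hypothetical, and timestamps are assumed to be ISO 8601 strings in a single format so that string comparison matches chronological order.

```python
def latest_broadcast(events):
    """Reduce (media_id, timestamp) pairs to the most recent timestamp per ID."""
    latest = {}
    for media_id, timestamp in events:
        if media_id not in latest or timestamp > latest[media_id]:
            latest[media_id] = timestamp
    return latest


def purge_candidates(events, cached_ids):
    latest = latest_broadcast(events)
    # Inner join on media ID: keep only IDs present in both sources.
    joined = [(mid, latest[mid]) for mid in set(cached_ids) if mid in latest]
    # Ascending sort on timestamp (ties broken by ID): least recent first.
    return sorted(joined, key=lambda pair: (pair[1], pair[0]))


# Made-up sample data: broadcast history and IDs with files in the cache.
events = [
    ("clip-a", "2014-01-05T10:00:00"),
    ("clip-b", "2013-11-20T08:00:00"),
    ("clip-a", "2014-02-01T12:00:00"),
    ("clip-c", "2012-06-15T18:00:00"),
    ("clip-d", "2014-03-01T09:00:00"),
]
cached = ["clip-a", "clip-b", "clip-c", "clip-e"]
report = purge_candidates(events, cached)
```

Because the sketch uses an inner join, a cached clip with no broadcast history (`clip-e` here) does not appear in the report; whether such clips should be listed separately is a policy choice the abstract does not address.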

I’m a software engineer at NBC Universal. I have extensive experience in research and development applied to distributed systems for the entertainment industry, academia, and the military. I have a PhD in Mathematics from the University of Bath, UK.

Slides PDF | Video