Cascading is a popular framework for developing, maintaining, and executing large-scale, robust batch data analysis applications. Originally, Cascading flows were compiled into Apache Hadoop MapReduce programs. With the recent 3.0 release, Cascading added an extensible rule-based planner and support for Apache Tez as a runtime back-end. Apache Flink’s execution engine features scalable, low-latency pipelined and batched data transfers as well as high-performance, in-memory operators for sorting and joining that gracefully go out of core when memory is scarce. With its native support for Hadoop YARN, Flink is another attractive runtime back-end for Cascading.
This talk introduces the Cascading Connector for Apache Flink, which translates Cascading flows into Apache Flink programs. Flows executed through the connector benefit from Flink’s runtime features, such as pipelined data shuffles and efficient, robust in-memory operators. The talk describes how Cascading and Flink are integrated, highlights the connector’s features, and points out its current limitations. We will show how to use the connector and conclude with a demo.
About the speaker
Fabian Hueske is a PMC member of Apache Flink. He started working on the project as part of his PhD studies at TU Berlin in 2009. Fabian did internships with IBM Research, SAP Research, and Microsoft Research, and is a co-founder of data Artisans, a Berlin-based start-up devoted to fostering Apache Flink. He frequently gives talks on Apache Flink at conferences and meetups. Fabian is interested in distributed data processing and query optimization.