Session starts - 16:35
Fault Tolerance and Recovery of Flink Jobs
Running data flow programs in production requires dealing with unforeseen system failures. These failures can have many causes, most notably hardware outages or failures of external services. Many of them are not fatal and thus should not cause the job itself to fail.
This talk will shed light on Apache Flink’s capabilities to detect and recover from failures. At the system level, Flink supports high availability using Apache ZooKeeper. This eliminates all single points of failure and thus allows Flink to stay responsive at all times. At the operator level, Flink uses its own variant of the Chandy-Lamport algorithm to periodically take state snapshots of a running streaming topology. These low-overhead checkpoints are used to restore the operators’ state in case of a failure. By guaranteeing consistent checkpoints throughout the whole topology, Flink achieves exactly-once processing semantics.
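As a rough illustration of the checkpoint-and-restore idea described above, here is a minimal Python sketch. It is not Flink's API and not the distributed barrier algorithm itself: it models a single summing operator, and the function name `run_with_checkpoints` and its parameters are hypothetical. The point it tries to show is that the source offset and the operator state are snapshotted together, so a rollback followed by replay counts each record exactly once.

```python
def run_with_checkpoints(records, checkpoint_every, fail_at=None):
    """Sum `records`, surviving one simulated crash at index `fail_at`.

    Hypothetical single-operator model: a checkpoint stores the source
    offset together with the operator state (the running total).
    """
    snapshot = {"offset": 0, "total": 0}  # last completed checkpoint
    offset, total = 0, 0
    failed = False

    while offset < len(records):
        if fail_at is not None and not failed and offset == fail_at:
            failed = True
            # Simulated crash: in-flight state is lost, so roll back to the
            # last checkpoint and replay the source from the stored offset.
            offset, total = snapshot["offset"], snapshot["total"]
            continue

        total += records[offset]
        offset += 1

        if offset % checkpoint_every == 0:
            # Offset and state are captured at the same logical point in
            # the stream, which keeps the checkpoint consistent.
            snapshot = {"offset": offset, "total": total}

    return total


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7]
    print(run_with_checkpoints(data, checkpoint_every=3))
    print(run_with_checkpoints(data, checkpoint_every=3, fail_at=5))
```

Both calls yield the same sum, because the records processed after the last checkpoint are discarded together with the state they produced and are then replayed. In a real topology the snapshot would have to be coordinated across many parallel operators, which is the part the talk covers.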
About the speaker
Till Rohrmann
Till is a PMC member of Apache Flink.
His work focuses mainly on enhancing Flink’s scalability as a distributed system and on building a large-scale machine learning library with Flink. Till has also contributed to Apache Mahout and is currently helping to add Flink support to the Mahout DSL. He earned his MS in computer science from Technische Universität Berlin, where he focused on machine learning and massively parallel dataflow systems.