Session starts - 16:45
Gradoop: Scalable Graph Analytics with Apache Flink
Many Big Data applications in business and science require the management and analysis of huge amounts of graph data. The flexibility of graph data models (e.g., the property graph model) and the variety of graph algorithms (e.g., PageRank, graph pattern matching, frequent subgraph mining) make graph analytics attractive in many domains, e.g., social networks, business intelligence, or the life sciences. Graphs in these domains are often very large, with millions of vertices and billions of edges, which makes efficient data management and distributed execution of graph algorithms challenging.
Existing approaches to graph analytics, such as graph databases (e.g., Neo4j) and parallel graph processing systems (e.g., Giraph), either lack sufficient scalability or lack flexibility and expressiveness. We are therefore developing “Gradoop”, a new framework for end-to-end graph data management and analytics based on Apache Flink and Apache HBase.
Gradoop is designed around the so-called Extended Property Graph Model (EPGM), which supports semantically rich, schema-free graph data. In this model, a database consists of multiple property graphs, which we call logical graphs. These graphs are application-specific subsets of shared vertex and edge sets. The EPGM provides operators for both single graphs and collections of graphs. Operators may in turn return single graphs or graph collections, which enables the definition of analytical workflows. We mapped the EPGM to Gelly graphs and implemented the EPGM operators using existing Gelly and Flink functionality; a simplified sketch of such an operator is shown below. We plan to integrate some of our operators into Apache Flink [1].
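To give a flavor of how such an operator can be expressed on Flink, here is a minimal, hypothetical sketch of a vertex-induced subgraph operator written directly against Flink's DataSet API. The tuple layout (id, label), the predicate, and the class and method names are assumptions for illustration only, not Gradoop's actual implementation.

```java
// Hypothetical sketch: a vertex-induced subgraph operator on Flink's DataSet API.
// Vertices are (id, label) tuples, edges are (sourceId, targetId, label) tuples.

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

public class SubgraphSketch {

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Toy input: vertices (vertexId, label) and edges (sourceId, targetId, label).
    DataSet<Tuple2<Long, String>> vertices = env.fromElements(
        Tuple2.of(1L, "Person"), Tuple2.of(2L, "Person"), Tuple2.of(3L, "Forum"));
    DataSet<Tuple3<Long, Long, String>> edges = env.fromElements(
        Tuple3.of(1L, 2L, "knows"), Tuple3.of(1L, 3L, "memberOf"));

    // Keep only "Person" vertices, i.e. extract a vertex-induced logical subgraph.
    DataSet<Tuple2<Long, String>> subVertices =
        vertices.filter(v -> v.f1.equals("Person"));

    // Retain only edges whose source AND target both survived the vertex filter.
    DataSet<Tuple3<Long, Long, String>> subEdges = edges
        .join(subVertices).where(0).equalTo(0)
        .with(new KeepEdge())
        .join(subVertices).where(1).equalTo(0)
        .with(new KeepEdge());

    subVertices.print();
    subEdges.print();
  }

  /** Projects the joined (edge, vertex) pair back to the edge tuple. */
  private static class KeepEdge implements
      JoinFunction<Tuple3<Long, Long, String>, Tuple2<Long, String>, Tuple3<Long, Long, String>> {
    @Override
    public Tuple3<Long, Long, String> join(Tuple3<Long, Long, String> edge,
                                           Tuple2<Long, String> vertex) {
      return edge;
    }
  }
}
```

In Gradoop itself, vertices and edges additionally carry labels, properties, and graph-membership information as described above, so the real operators work on richer data types than plain tuples.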
In the talk, I will give an overview of Gradoop, the EPGM, and its operators, and show how Apache Flink helps us by presenting a subset of our operator implementations. Furthermore, I will illustrate the usefulness of Gradoop with an analytical use case from the business intelligence domain. Since our prototype was initially based on Apache Hadoop and Apache Giraph, I will also explain the lessons learned and why we moved our project to Apache Flink.
The Gradoop source code and short documentation can be found on GitHub [2]; a more detailed description of the data model and our operators is available in a recent technical report [3].
[1] https://issues.apache.org/jira/browse/FLINK-2411
[2] https://github.com/dbs-leipzig/gradoop
[3] http://arxiv.org/pdf/1506.00548.pdf
About the speaker
Martin Junghanns
Martin is a PhD student and lecturer at the University of Leipzig. His area of research is distributed data management and processing with a focus on graph data, in particular techniques for graph data integration, declarative graph analytics, and meaningful result representation. Besides that, he is interested in optimization techniques such as graph partitioning and replication.
He received his Master’s degree in April 2014. During his studies, he worked at the former graph database vendor sones, was a research assistant at the University of Leipzig, and completed an internship at SAP in Palo Alto.