Okkam is working on innovative solutions that combine semantic and big data technologies to define a new generation of tax assessment tools. We have used Apache Flink as a daily companion since the days when it was still called Stratosphere, for activities that range from duplicate detection on the Okkam Entity Name System (ENS) datasets, which contain millions of records and rely on Apache HBase and a global index (Apache Solr), to the creation of complex ElasticSearch indexes for business intelligence and tax assessment (temporal) reasoning. We also use it to perform data quality and telemetry analysis based on MongoDB in the context of TAGCLOUD (EU FP7 Project), and to debug a complex semantic ETL pipeline by running preliminary analyses aimed at early error detection. We mostly run Apache Flink on a single developer machine, treating scarcity of resources as a primary constraint and aiming to optimize our processes. We learned that the chosen serialization method has a heavy impact on the efficiency of a job, and that combining Thrift with Parquet (or Kryo) saves a lot of execution time. We also learned the hard way how to manage sequences of Join operations, where a simple distinct() or project() can make the difference between "Job Failure" and a smooth, fast, successful execution. Apache Flink lets us save a lot of time by shortening development and testing cycles on datasets of hundreds of millions of RDF triples, and we count on the fact that when these datasets scale up, we will be able to perform the same valuable operations simply by relying on a few consumer machines.
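The remark about distinct() and project() before a Join can be illustrated with a small sketch. It is written in plain Python and does not use the Flink API: the helper functions `project`, `distinct` and `hash_join`, as well as the toy records, are hypothetical and only mimic the idea. The point it tries to show is that dropping unused fields and duplicate records before a join keeps the intermediate result from multiplying.

```python
from collections import defaultdict


def project(rows, *fields):
    """Keep only the named fields of each record, as a tuple."""
    return [tuple(row[f] for f in fields) for row in rows]


def distinct(rows):
    """Drop duplicate records while preserving first-seen order."""
    return list(dict.fromkeys(rows))


def hash_join(left, right):
    """Inner join two lists of tuples on their first element."""
    index = defaultdict(list)
    for row in right:
        index[row[0]].append(row)
    return [(l, r) for l in left for r in index.get(l[0], [])]


# Hypothetical toy data: records carry a bulky or irrelevant field,
# and the same logical record appears more than once.
entities = [
    {"id": 1, "name": "Acme", "payload": "x" * 1000},
    {"id": 1, "name": "Acme", "payload": "y" * 1000},
    {"id": 2, "name": "Beta", "payload": "z" * 1000},
]
links = [
    {"id": 1, "tax_code": "T-1", "source": "a"},
    {"id": 1, "tax_code": "T-1", "source": "b"},
    {"id": 2, "tax_code": "T-2", "source": "a"},
]

# Naive join: every duplicate on one side pairs with every duplicate on the other.
naive = hash_join(
    project(entities, "id", "name", "payload"),
    project(links, "id", "tax_code", "source"),
)

# Slimmed join: keep only the needed fields, deduplicate, then join.
slim = hash_join(
    distinct(project(entities, "id", "name")),
    distinct(project(links, "id", "tax_code")),
)

print(len(naive), len(slim))
```

In a chain of several joins this multiplication compounds at every step, which is one plausible reason why a job over large inputs can run out of resources unless the inputs are slimmed down first.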
About the speakers
Stefano Bortoli holds a PhD in ICT from the International Doctoral School of the University of Trento (Italy) and works as technical director and researcher at Okkam S.R.L. (Trento, Italy). His research and development interests are in the area of Information Integration, with a special focus on entity-centric applications that exploit semantic technologies.
Flavio Pompermaier holds an MSc in Computer Science from the University of Trento (Italy) and works as senior software engineer at Okkam S.R.L. (Trento, Italy). Flavio is a passionate developer who works with state-of-the-art technologies, combining semantic and big data technologies. He works on the core components of the tax assessment application developed in the context of the SICRaS project (http://www.sicras-project.