BIG DATA CONFERENCE
Vilnius and Online
Principal Solution Architect
DXC Technology, Denmark
Josef Habdank is an expert in designing advanced analytics platforms for large-scale enterprises. He is currently working on two very distinct types of projects: extreme-throughput systems for self-driving cars, and extreme-complexity systems for enterprise data lake management. He is a long-term champion of Apache Spark and other open source technologies, and a frequent speaker at large events such as Spark Summit.
Management of a Cloud Data Lake in Practice: How to Manage 1000s of ETLs Using Apache Spark
Nowadays the problem of processing speed is seemingly solved. Unless you process tens of petabytes, an off-the-shelf toolset will suffice for most problems. Currently, the main challenges in data lake systems are in the field of data governance:
- how do you make sure data is discoverable, reusable, up to date and of high quality?
- how do you avoid huge technical debt when developing a massive number of complex data flows?
- how do you guarantee that the project can scale when human resources and technical talent are very scarce?
The goal of this talk is to showcase how to design a data lake management system that is scalable in the broadest sense of the word: one that scales not only with the growth of the data, but also with the growing complexity of the whole enterprise. The talk will outline the business reasoning, the key design principles, and the technical solution. Expect some (but not too many) nerdy details related to the Apache Spark implementation.