27-29 November, Vilnius

Conference about Big Data, High Load, Data Science, Machine Learning & AI

The conference is over. See you next year!


Treasure Data, Japan


Kai Sasaki is a software engineer working at Treasure Data Inc., where he develops and maintains the distributed processing platform behind their service. He is an author of “Professional Hadoop” and “Spark: Big Data Cluster Computing in Production”. As an open source contributor, he is a committer of Apache Hivemall, a scalable machine learning library running on Hive/Spark.


Infrastructure for Auto Scaling Distributed System

Large-scale distributed systems, including distributed databases, are the technology that lets us extract useful information from so-called big data. But maintaining a distributed system in the real world still requires intensive human effort, even today.

We at Treasure Data provide a cloud-based data analysis platform. Our customers run their data analysis jobs on our distributed processing engines built on Hadoop and Presto. Based on our experience, there are three challenges to overcome in keeping the data analysis platform available and reliable:

– Testing: Compatibility for applications running on the platform is crucial for users. We always need to test the data analysis platform to make sure this compatibility is preserved.
– Deployment: Since we cannot guarantee that a new version is free of bugs, deployment and rollback must be done safely and quickly, which is especially difficult in a distributed system.
– Monitoring: Distinguishing what may change and what must not change in each release is the key to judging a release successful. It is necessary to monitor performance, existing user-facing functionality, and new features.
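The monitoring challenge above can be sketched as a simple statistical check: compare a metric observed on a new release against the distribution seen on the previous one. This is a minimal illustration of the idea only; the function names, metric, and threshold are our own assumptions, not the actual Treasure Data system:

```python
from statistics import mean, stdev

def is_regression(baseline_latencies, new_latency, k=3.0):
    """Flag a query latency that deviates more than k standard
    deviations above the baseline release's distribution."""
    mu = mean(baseline_latencies)
    sigma = stdev(baseline_latencies)
    return new_latency > mu + k * sigma

# Latencies (seconds) observed for one query pattern on the old release.
baseline = [1.2, 1.3, 1.1, 1.25, 1.15, 1.22, 1.18, 1.3]

print(is_regression(baseline, 1.27))  # within normal variation
print(is_regression(baseline, 2.50))  # well outside the baseline
```

A real deployment would of course track many metrics per query pattern, but the principle is the same: a release is "successful" when what should not change stays within its historical bounds.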

Despite the importance of these operations, we have only limited human resources for them. We therefore need to build infrastructure that makes these operations easy, so that our business can succeed as well.
In this talk, we will introduce our approach to building the infrastructure that makes our distributed system reliable and robust at enterprise level:
– Detecting query patterns by clustering the query signatures.
– Blue/Green deployment by decoupling the processing and storage layers.
– Performance probing based on state-of-the-art benchmark set.
– Monitoring criteria based on the statistics of query performance metrics.
By improving this operational infrastructure, we have succeeded in reducing the time taken for each release and for handling customer inquiries.
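The first technique, detecting query patterns by clustering query signatures, can be illustrated with a toy sketch: mask out the literals in each SQL string so that structurally identical queries collapse into one signature, then group by signature. The normalization rules here are simplified assumptions for illustration, not the actual signature scheme used in the talk:

```python
import re
from collections import defaultdict

def query_signature(sql):
    """Reduce a SQL query to a pattern signature by masking out
    literals, so structurally identical queries collapse together."""
    s = sql.lower().strip()
    s = re.sub(r"'[^']*'", "?", s)          # string literals -> ?
    s = re.sub(r"\b\d+(\.\d+)?\b", "?", s)  # numeric literals -> ?
    s = re.sub(r"\s+", " ", s)              # normalize whitespace
    return s

def cluster_queries(queries):
    """Group queries that share the same signature."""
    clusters = defaultdict(list)
    for q in queries:
        clusters[query_signature(q)].append(q)
    return clusters

queries = [
    "SELECT * FROM events WHERE user_id = 42",
    "SELECT * FROM events WHERE user_id = 7",
    "SELECT COUNT(*) FROM logs WHERE day = '2017-11-27'",
]
print(len(cluster_queries(queries)))  # 2 distinct query patterns
```

Grouping a workload this way makes it possible to test and benchmark a representative query per pattern instead of every customer query individually.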