27-29 November, Vilnius

Conference about Big Data, High Load, Data Science, Machine Learning & AI

The conference is over. See you next year!

GERARD TOONSTRA

BigData Republic, The Netherlands

Biography

Gerard Toonstra is an Apache Airflow enthusiast and has been excited about it ever since it was announced as open source. He was the initial contributor of the HttpHook and HttpOperator and set up the site “ETL with airflow”, which is one of the richest practical sources of information about Apache Airflow. Gerard has a background in nautical engineering, but has worked in information technology since 1998, holding various engineering positions in the UK, the Netherlands and Brazil.
He now works at BigData Republic in the Netherlands as a BigData Architect / Engineer. BigData Republic is a multidisciplinary team of experienced and business-oriented Data Scientists, Data Engineers and Architects. Irrespective of an organization’s data maturity level, the team helps translate business goals into the design, implementation and utilization of innovative solutions. In his spare time Gerard likes oil painting, and on holidays he visits a beautiful beach in Brazil to read spy novels or psychology books.

Workshop

Apache Airflow hands on

Apache Airflow is attracting more attention worldwide as a de facto ETL platform. As the author of the site “ETL with airflow”, I’d like to share this knowledge and get novices up to speed with Apache Airflow as their ETL platform. Learn how to write your first DAG in Python, set up email notifications, configure the scheduler, write your own hooks and operators, and follow important principles when composing your DAGs.
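
To give a flavour of what the first session covers, here is a minimal sketch of a DAG. The DAG id, schedule and dates are purely illustrative and not part of the workshop material:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Arguments inherited by every task in the DAG.
    default_args = {
        'owner': 'airflow',
        'start_date': datetime(2018, 11, 1),   # illustrative start date
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    # The DAG itself, scheduled to run once a day.
    dag = DAG(
        dag_id='my_first_dag',
        default_args=default_args,
        schedule_interval='@daily',
    )

    # One task that prints the execution date via a templating macro.
    print_date = BashOperator(
        task_id='print_execution_date',
        bash_command='echo "execution date: {{ ds }}"',
        dag=dag,
    )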

Apache Airflow has become a very popular tool for running ETL, machine learning and data processing pipelines. Embedded in the implementation are the insights and learnings from years of experience in data engineering.

The workshop explains what these principles are and how they can be achieved rather effortlessly by putting the components of Apache Airflow together in a data processing workflow.

Agenda

Installing Apache Airflow – 45 mins

  • Introduction
  • Run Apache Airflow on docker

Exploring the UI – 45 mins

  • Monitoring DAG statuses
  • Administrative tasks
  • DAG detail screens

Your first DAG – 45 mins

  • Setting a DAG schedule
  • Start date and execution date
  • Understanding macros

Failure emails, SLAs – 45 mins

  • When tasks fail
  • Sending custom emails
  • SLAs and their uses (see the sketch after this list)
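
As a rough illustration of how failure emails and SLAs are attached to tasks, settings like these are usually passed as default_args to a DAG; the address and timings below are made up:

    from datetime import timedelta

    # Applied to every task that receives these default_args.
    default_args = {
        'email': ['data-team@example.com'],  # illustrative address
        'email_on_failure': True,            # mail out when a task fails
        'email_on_retry': False,
        'sla': timedelta(hours=2),           # flag tasks that finish too late
    }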

Applying best practices – 2 hours

  • Explaining best practices
  • Implementing them in Airflow

Extending Airflow – 45 mins

  • How to build your own hook
  • How to build your own operator (a short sketch follows this list)
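
As a small preview, a custom operator is a class that extends BaseOperator and implements execute(); the class and parameter names below are purely illustrative:

    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    class PrintMessageOperator(BaseOperator):
        """Illustrative operator that simply logs a message."""

        @apply_defaults
        def __init__(self, message, *args, **kwargs):
            super(PrintMessageOperator, self).__init__(*args, **kwargs)
            self.message = message

        def execute(self, context):
            # 'context' holds the execution date, task instance, etc.
            self.log.info('Message: %s', self.message)
            return self.message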

Deploying Airflow – 45 mins

  • System components
  • Important things to keep in mind
  • PaaS solutions

Round off – 45 mins

  • Room for questions and exploration

Course objectives

The workshop allows you to get your feet wet with the Apache Airflow platform. No prior knowledge is assumed. The workshop focuses on getting familiar with the user interface, building and configuring a data processing workflow, and building pipelines that adhere to best practices. The objective is to have you walk away with a rough understanding of what Apache Airflow can do for your company and of the challenges you will face that are specific to your organization.

Target audience

This workshop is hands-on; the intended audience is people with basic code-reading abilities. All sessions rely on pre-existing code, so no code will be developed from scratch.

Course prerequisites

A laptop, notebook or MacBook with an internet connection, ideally with Docker preinstalled.

Gerard’s preference is to use Docker for the workshop, for the following reasons:

  • Your personal machine doesn’t get polluted with anything you do in the workshop
  • It’s easy to remove afterwards
  • We all start from a known state
  • Docker images are contained and can’t damage anything on your personal machine.

The first step is to make sure you have Docker installed on your laptop. It is available for all flavours of Windows, Mac and Linux.

If you prefer to run Airflow directly on your personal machine instead, you can follow the tutorial here:

https://airflow.incubator.apache.org/start.html

Install docker

Let’s install Docker first! The Docker website has clear instructions, but I’m linking to them directly from here.

Windows versions that support virtualization:

https://docs.docker.com/docker-for-windows/install/

Windows versions that do NOT support virtualization:

https://docs.docker.com/toolbox/toolbox_install_windows/

Mac OS:

https://docs.docker.com/docker-for-mac/install/

Instructions for the common Linux distributions can be found via the links in this section (use the “CE” edition):

https://docs.docker.com/v17.12/install/#server

Prepare the Airflow and Postgres images

We’ll be using a very basic Airflow image (version 1.10), made available by Matthieu Roisil, and we will also use a Postgres image as the underlying database. It’s best to pull both of those images prior to the session, so we don’t have to wait for the downloads, which can take a long time over wifi. The Postgres image allows us to run tasks in parallel, so it can help speed up processing a bit.

Pull the specific versions of the images to your local computer with:

  •  docker pull puckel/docker-airflow:1.10.0-5
  •  docker pull postgres:9.6

(The GitHub repository used to build the image is available here for reference: https://github.com/puckel/docker-airflow)

DATE:
27 November, 2018

TIME:
10:00-17:30

VENUE:
Crowne Plaza Vilnius – M. K. Čiurlionio str. 84, Vilnius, Lithuania

Due to the high number of attendees, we have a very limited number of open seats and operate on a first-come, first-served basis.
