BIG DATA CONFERENCE

EUROPE 2022

November 23-24

Online

Frank Munz

Developer Relations EMEA

Databricks, Germany

Biography

Dr. Frank Munz is a Developer Advocate at Databricks. He authored three computer science books, built up technical evangelism for Amazon Web Services in Germany, Austria, and Switzerland, and once upon a time worked as a data scientist with a group that won a Nobel Prize.

Frank realized his dream of speaking at top-notch conferences on every continent (except Antarctica, because it is too cold there). He has presented at conferences such as Devoxx, KubeCon, and JavaOne. He holds a PhD in Computer Science from TU Munich.

Talk

Share Massive Amounts of Live Data with Delta Sharing

“Data comes at us fast” is what they say. In fact, the last couple of years taught us how to successfully cleanse, store, retrieve, process, and visualize large amounts of data in a batch or streaming way. Despite these advances, data sharing has been severely limited because sharing solutions were tied to a single vendor, did not work for live data, came with severe security issues, and did not scale to the bandwidth of modern object stores.

Conferences have been filled for many years with sessions about how to architect applications and master the APIs of your services, but recent events have shown a huge business demand for sharing massive amounts of live data in the most direct and scalable way possible. One example is open data sets of genomic data shared publicly for the development of vaccines. Many commercial use cases, however, share news, financial, or geological data with a restricted audience, so the data has to be secured.

In this session, dive deep into an open-source solution for sharing massive amounts of live data in a cheap, secure, and scalable way. Delta Sharing is an open-source project donated to the Linux Foundation. It uses an open REST protocol to secure the real-time exchange of large data sets, enabling secure data sharing across products for the first time.

It leverages modern cloud object stores, such as S3, ADLS, or GCS, to reliably transfer large data sets. There are two parties involved: Data Providers and Recipients. The data provider decides what data to share and runs a sharing server. An open-source reference sharing server is available for getting started with sharing Apache Parquet or Delta Lake (delta.io) tables.

Any client supporting pandas, Apache Spark™, Rust, or Python can connect to the sharing server. Clients always read the latest version of the data, and they can provide filters on the data (e.g., “country=ES”) to read a subset of the data.
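The filter mechanism described above can be sketched as a request against the sharing server. This is a minimal sketch, assuming the table query endpoint and the `predicateHints` field of the open Delta Sharing REST protocol; the endpoint URL, bearer token, and the share, schema, and table names are hypothetical placeholders. The request is only built here, not sent.

```python
import json
import urllib.request


def build_query_request(endpoint, token, share, schema, table, predicate_hints):
    """Build (but do not send) a table query request that carries filter hints."""
    url = (
        f"{endpoint.rstrip('/')}/shares/{share}"
        f"/schemas/{schema}/tables/{table}/query"
    )
    # The filter travels as a hint: the server may use it to return fewer files.
    body = json.dumps({"predicateHints": predicate_hints}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# All values below are hypothetical placeholders, not a real deployment.
request = build_query_request(
    endpoint="https://sharing.example.com/delta-sharing",
    token="<bearer-token>",
    share="demo_share",
    schema="default",
    table="sales",
    predicate_hints=["country = 'ES'"],
)
```

In practice a recipient would more likely use a connector than hand-built requests; the open-source Python connector, for example, offers `delta_sharing.load_as_pandas("<profile-file>#<share>.<schema>.<table>")` to read a shared table into a pandas DataFrame.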

The server then verifies whether the client is allowed to access the data, logs the request, and then determines which data to send back. This will be a subset of the data objects in S3, ADLS Gen2, or GCS that actually make up the table.
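The three server-side steps (verify, log, select files) can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the reference server's implementation: the `GRANTS` and `TABLE_FILES` structures, the token, and the file paths are all hypothetical, and filtering is reduced to exact matches on partition values.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("sharing_server_sketch")


@dataclass
class DataFile:
    path: str        # object key inside the bucket
    partition: dict  # partition values, e.g. {"country": "ES"}


# Hypothetical in-memory state: which token may read which table,
# and which data objects currently make up each table.
GRANTS = {"token-alice": {"demo_share.default.sales"}}
TABLE_FILES = {
    "demo_share.default.sales": [
        DataFile("sales/country=ES/part-0.parquet", {"country": "ES"}),
        DataFile("sales/country=DE/part-0.parquet", {"country": "DE"}),
        DataFile("sales/country=ES/part-1.parquet", {"country": "ES"}),
    ]
}


def handle_query(token, table, partition_filter=None):
    """Return the data files a recipient should read for this query."""
    # Step 1: verify that the recipient is allowed to access the table.
    if table not in GRANTS.get(token, set()):
        raise PermissionError(f"recipient may not read {table}")

    # Step 2: log the request (the token itself is deliberately not logged).
    logger.info("query table=%s filter=%s", table, partition_filter)

    # Step 3: keep only the files whose partition values match the filter.
    wanted = partition_filter or {}
    return [
        data_file
        for data_file in TABLE_FILES[table]
        if all(data_file.partition.get(k) == v for k, v in wanted.items())
    ]
```

Calling `handle_query("token-alice", "demo_share.default.sales", {"country": "ES"})` returns only the two files under `country=ES`, which is the "subset of the data objects" the paragraph above refers to.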

To transfer the data, the server generates short-lived pre-signed URLs that allow the client to read these Parquet files directly from the cloud provider. This comes with the benefit that the transfer can happen in parallel at the massive bandwidth of the public cloud’s object store without streaming through the sharing server as a bottleneck.
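The idea of short-lived signed URLs combined with parallel reads can be sketched as follows. This is a simplified stand-in, not the signing scheme of any real cloud provider: the HMAC format, the `SECRET` key, the `store.example.com` host, and the helper names are assumptions for illustration. The `fetch` argument is any callable that maps a URL to bytes, so the sketch runs without network access.

```python
import hashlib
import hmac
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import parse_qs, urlencode, urlsplit

SECRET = b"demo-secret"  # hypothetical; real stores use provider-managed keys


def _sign(path, expires):
    """Compute the signature over the object path and its expiry time."""
    message = f"{path}:{expires}".encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def presign(path, ttl_seconds, now=None):
    """Return a URL for `path` that carries an expiry time and a signature."""
    issued = now if now is not None else time.time()
    expires = int(issued) + ttl_seconds
    query = urlencode({"expires": expires, "sig": _sign(path, expires)})
    return f"https://store.example.com/{path}?{query}"


def is_valid(url, now=None):
    """Check signature and expiry, as the object store would on each read."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    expires = int(query["expires"][0])
    expected = _sign(parts.path.lstrip("/"), expires)
    current = now if now is not None else time.time()
    return hmac.compare_digest(expected, query["sig"][0]) and current < expires


def fetch_all(urls, fetch, max_workers=8):
    """Read all files in parallel, bypassing the sharing server entirely."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input URLs.
        return list(pool.map(fetch, urls))
```

Because each URL is self-authorizing until it expires, the client needs no cloud credentials of its own, and the degree of parallelism (`max_workers` here) is bounded by the client and the object store rather than by the sharing server.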

With Delta Sharing, dozens of popular open-source and commercial systems will connect directly to shared data so that any user can use it, reducing friction for everyone. Building on this open-source, open-format project, several companies have announced support in products such as Tableau, Qlik, Power BI, and Looker.

The proposed session is a technical session for developers and big data architects. It includes a live, hands-on demonstration of Delta Sharing and a detailed explanation of how to get started using purely open-source components.

Session Keywords

🔑 Data Science

🔑 Open Source

🔑 Data Sharing
