Catallaxy Services | Building Your First Data Pipeline in Apache Spark

ABSTRACT

Apache Spark provides data engineers with a great deal of functionality designed to solve common problems around data movement and processing, particularly in the cloud. In this session, we will learn how to use Apache Spark in Microsoft Azure. We will see which Azure services provide Apache Spark integration points, look at use cases in which Apache Spark is a great choice, and use the metaphor of the data pipeline to perform data movement and transformation in the cloud. We will also learn how to use notebook workflows in Azure Databricks to simplify the process.

ADDITIONAL MEDIA

No recordings or additional media are available for this talk.

SLIDES

Click here to access the slides for this presentation.

The slides are licensed under Creative Commons Attribution-ShareAlike.

DEMO CODE

Click here to access demo code for this presentation.

The source code is licensed under the terms offered by the GPL.

LINKS & FURTHER INFO

Helpful Resources

Building a dynamic data pipeline with Databricks and Azure Data Factory. This is a high-level look at the topic.
Databricks goes into detail on data pipelines.
Azure Data Factory allows you to run a Databricks notebook. This is helpful if you use ADF for several other tasks and want to integrate that process with Databricks, rather than scheduling jobs to do the work.
Bill Chambers has a three-part series on writing Spark applications. Part 3 involves data engineering pipelines.
Jules Damji and Jason Pohl show an example of an ML pipeline in Databricks.
Dave Wang et al. show off notebook workflows.
The official Databricks documentation on notebook workflows.
Managing dependencies in data pipelines, specifically when using Apache Airflow or ADF with Databricks.
Performing ETL with Azure Databricks.
This Microsoft Learn module takes us through the Delta Lake architecture.
Databricks Connect isn't a topic I cover in this talk, but it becomes useful as you spend more time in the platform and want to do this work in a full IDE.
Data pipelines should also follow proper software development lifecycles, notes Jesus de Diego. We have a tendency to forget that in the data science world.