Building Your First Data Pipeline in Apache Spark

Kevin Feasel (@feaselkl)
https://csmore.info/on/pipeline

Who Am I? What Am I Doing Here?

Motivation

My goals in this talk:

  • Explain what Apache Spark is.
  • Flesh out the metaphor of data pipelines.
  • Provide a quick primer on Azure Databricks.
  • Explain what a data lake is and how to make the best use of it.
  • Implement a data science pipeline, including development, deployment, and management.

Agenda

  1. What Is Apache Spark?
  2. The Data Pipeline
  3. Getting Started with Databricks
  4. Transforming Data in the Data Lake
  5. Machine Learning and Experimentation
  6. Workflow Management

The Genesis of Spark

Spark started as a research project at the University of California Berkeley’s Algorithms, Machines, People Lab (AMPLab) in 2009. The project's goal was to develop in-memory cluster computing, avoiding MapReduce's reliance on heavy I/O use.

The first open-source release of Spark came in 2010, concurrent with a paper from Matei Zaharia, et al.

In 2012, Zaharia, et al. released a paper on Resilient Distributed Datasets.

Resilient Distributed Datasets

The Resilient Distributed Dataset (RDD) forms the core of Apache Spark. It is:

  • Immutable – you never change an RDD itself; instead, you apply transformation functions to return a new RDD

Resilient Distributed Datasets

The Resilient Distributed Dataset (RDD) forms the core of Apache Spark. It is:

  • Immutable
  • Distributed – executors (akin to data nodes) split up the data set into sizes small enough to fit into those machines’ memory

Resilient Distributed Datasets

The Resilient Distributed Dataset (RDD) forms the core of Apache Spark. It is:

  • Immutable
  • Distributed
  • Resilient – in the event that one executor fails, the driver (akin to a name node) recognizes this failure and enlists a new executor to finish the job

Resilient Distributed Datasets

The Resilient Distributed Dataset (RDD) forms the core of Apache Spark. It is:

  • Immutable
  • Distributed
  • Resilient
  • Lazy – Spark does not execute transformations until an action requires a result, minimizing the amount of work executors need to perform

Add all of this together and you have the key component behind Spark.
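As a minimal PySpark sketch of these properties (the app name and numbers are invented for illustration), note how transformations return new RDDs and nothing runs until an action is called:

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; its sparkContext exposes the RDD API.
spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 1001), numSlices=4)  # split across partitions

# Transformations are lazy and return *new* RDDs; `numbers` itself never changes.
evens = numbers.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Nothing has executed yet. The action below triggers the actual computation.
print(squares.count())   # 500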

The Evolution of Spark

One of the first additions to Spark was SQL support, first with Shark and then with Spark SQL.

With Apache Spark 2.0, Spark SQL can take advantage of Datasets (strongly typed RDDs) and DataFrames (Datasets with named columns).

Spark SQL functions are accessible within the SparkSession object, created by default as “spark” in the Spark shell.
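For example, in a notebook or the Spark shell, the same query can be expressed through SQL or the DataFrame API. A minimal sketch, assuming a table named movies with Title and Rating columns (both invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Spark SQL: query a table registered in the catalog.
top_sql = spark.sql("SELECT Title, Rating FROM movies WHERE Rating >= 4 ORDER BY Rating DESC")

# Equivalent DataFrame API: named columns, same optimizer underneath.
top_df = (spark.table("movies")
               .filter(F.col("Rating") >= 4)
               .orderBy(F.col("Rating").desc())
               .select("Title", "Rating"))

top_sql.show(5)
top_df.show(5)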

Use Cases for Spark

Apache Spark is a versatile platform and is popular for several things:

  • Processing and transforming data in batches
  • Distributed machine learning
  • Centralizing analytics for organizations
  • Near-real-time streaming of data

Language Support

Apache Spark has "first-class" support for three languages: Scala, Python, and Java.

In addition to this, it has "second-class" support for two more languages: SQL and R.

Beyond these, there are projects which extend support even further, such as .NET for Apache Spark, which adds C# and F#.

Use Spark in Azure

There are several ways to use Apache Spark in the Azure ecosystem, such as:

  • Azure Databricks
  • Azure Synapse Analytics
  • Azure HDInsight
  • Azure Data Factory
  • Installing Spark on a VM
  • Running a container with Spark installed

Agenda

  1. What Is Apache Spark?
  2. The Data Pipeline
  3. Getting Started with Databricks
  4. Transforming Data in the Data Lake
  5. Machine Learning and Experimentation
  6. Workflow Management

Pipelines as a Metaphor

Pipelines are a common metaphor in computer science. For example:

  • Unix (sh):   grep "Aubrey" Employee.txt | uniq | wc -l
  • F#:          movies |> Seq.filter(fun m -> m.Title = Title) |> Seq.head
  • PowerShell:  gci *.sql -Recurse | Select-String "str"
  • R:           inputs %>% filter(val > 5) %>% select(col1, col2)

Pipelines for Data

In all of the prior examples, sections of code link together like connected pipes in a pipeline, with data flowing through them and transforming at each step.
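Spark's DataFrame API reads the same way. Here is a minimal sketch (the file path and column name are invented for illustration, and the spark session is the one a Databricks notebook provides):

from pyspark.sql import functions as F

employees = spark.read.csv("/data/Employee.txt", header=True, inferSchema=True)

aubrey_count = (employees
                .filter(F.col("FirstName") == "Aubrey")   # transformer: filter
                .dropDuplicates()                          # transformer: de-duplicate
                .count())                                  # action: terminate the pipeline

print(aubrey_count)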

In addition to these code-based pipelines, we have many examples of graphical pipelines.

SQL Server Integration Services

Informatica

Apache NiFi

Filling Out the Metaphor

The pipeline metaphor works extremely well:

Filling Out the Metaphor

Each segment of code is a transformer which modifies the fluid in some way.

Filling Out the Metaphor

Transformers are connected together with pipes.

Filling Out the Metaphor

Data acts as a fluid, moving from transformer to transformer via the pipes.

Filling Out the Metaphor

Some transformers are "streaming" or non-blocking, meaning that fluid pushes through without collecting. Think of a mesh filter which strains out particulate matter.

Filling Out the Metaphor

Some transformers are blocking, meaning that they need to collect all of the data before moving on. Think of a reservoir which collects all of the fluid, mixes it with something, and then turns a flow control valve to allow the fluid to continue to the next transformer.

Filling Out the Metaphor

Backpressure happens when a downstream component processes data more slowly than upstream components produce it. With sufficient backpressure, the fluid stops pushing forward and backs up.

Filling Out the Metaphor

Breaking our metaphor, data doesn't move exactly like a fluid. Instead, it typically moves in buffers: discrete blocks of information held in memory and transferred from one stage to the next. We see this as a specific number of rows processed at a time or a specific amount of data processed at a time.

Filling Out the Metaphor

The way to regulate backpressure in a data system is to have the downstream component request a buffer before the upstream component sends it and begins preparing the next one. That way, we avoid the risk of out-of-memory errors from creating too many buffers, as well as buffers being lost because the downstream component can't collect them.
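Outside of Spark, a minimal Python sketch of this pull-based idea (purely illustrative, not Spark code) is a bounded queue: the producer blocks when the consumer falls behind, so buffers neither pile up until memory runs out nor get lost:

import queue
import threading

buffers = queue.Queue(maxsize=4)   # at most four in-flight buffers

def producer():
    for batch_id in range(20):
        buffers.put(f"buffer {batch_id}")   # blocks when the queue is full (backpressure)
    buffers.put(None)                       # sentinel: no more data

def consumer():
    while (buffer := buffers.get()) is not None:
        print("processing", buffer)         # slow downstream work would go here

threading.Thread(target=producer).start()
consumer()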

Agenda

  1. What Is Apache Spark?
  2. The Data Pipeline
  3. Getting Started with Databricks
  4. Transforming Data in the Data Lake
  5. Machine Learning and Experimentation
  6. Workflow Management

Getting Started with Databricks

Databricks, the company founded by the creators of Apache Spark, makes the Databricks Unified Analytics Platform available in AWS and Azure. They also have a free Community Edition.

Getting Started with Databricks

First, search for "databricks" in the Azure portal and select the Azure Databricks option.

Demo Time

Agenda

  1. What Is Apache Spark?
  2. The Data Pipeline
  3. Getting Started with Databricks
  4. Transforming Data in the Data Lake
  5. Machine Learning and Experimentation
  6. Workflow Management

The Data Lake

The concept of the data lake comes from a key insight: not all relevant data fits the structure of a data warehouse. Furthermore, there is a lot of effort involved in adding new data to a warehouse.

Data is typically stored in the data lake as files, and may include delimited files, files using specialized formats (ORC, Parquet), or even binaries (images, videos, audio) depending on the need.
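In Spark, reading from the lake is a matter of pointing a reader at the files. A minimal sketch (the storage paths are invented for illustration; the spark session is the one a notebook provides):

# Delimited files: Spark infers or accepts an explicit schema.
raw_rides = (spark.read
                  .option("header", "true")
                  .option("inferSchema", "true")
                  .csv("/mnt/datalake/bronze/rides/2024/*.csv"))

# Specialized formats like Parquet carry their schema with them.
ride_facts = spark.read.parquet("/mnt/datalake/silver/rides/")

# Binary files (images, audio) can be read as raw bytes for downstream processing.
images = spark.read.format("binaryFile").load("/mnt/datalake/bronze/images/")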

A Sample Data Lake

The Layers

We usually think of three layers in a data lake, which Databricks calls "bronze" (raw, ingested data), "silver" (cleansed and conformed data), and "gold" (curated, business-ready data).

Moving Between Layers

The process of cleaning, refining, enriching, and polishing data allows us to migrate it from bronze to silver to gold. All of this happens through data pipelines, either automated pipelines (which can feed results into a data warehouse or other analytics platform) or manual pipelines for ad hoc research. A sketch of one such pipeline follows.
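As a minimal sketch (paths, column names, and cleanup rules are all invented for illustration, and Delta Lake is assumed to be available, as it is in Databricks), a notebook cell might promote data from bronze to silver like this:

from pyspark.sql import functions as F

bronze = spark.read.option("header", "true").csv("/mnt/datalake/bronze/sales/")

silver = (bronze
          .dropDuplicates(["OrderID"])                          # clean
          .withColumn("OrderDate", F.to_date("OrderDate"))      # refine types
          .withColumn("LoadDate", F.current_timestamp())        # enrich with metadata
          .filter(F.col("Quantity") > 0))                       # polish away bad rows

silver.write.mode("overwrite").format("delta").save("/mnt/datalake/silver/sales/")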

Demo Time

The Data Lakehouse

Databricks has coined the term Lakehouse to represent the combination of data warehouse and data lake in one managed area.

The Data Lakehouse

Agenda

  1. What Is Apache Spark?
  2. The Data Pipeline
  3. Getting Started with Databricks
  4. Transforming Data in the Data Lake
  5. Machine Learning and Experimentation
  6. Workflow Management

The Data Science Process

The Microsoft Team Data Science Process is one example of a process for implementing data science and machine learning tasks.

Machine Learning with Spark

Spark supports machine learning through its native MLlib library, whose original API (spark.mllib) works with RDDs.

On top of this, the newer spark.ml package works with DataFrames and is the recommended API today.
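A minimal spark.ml sketch (the feature columns, label column, and the input DataFrame training_df are invented for illustration):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

assembler = VectorAssembler(inputCols=["Age", "Tenure", "MonthlySpend"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="Churned")

pipeline = Pipeline(stages=[assembler, lr])   # transformers + estimator in one pipeline
model = pipeline.fit(training_df)             # training_df is an existing DataFrame

predictions = model.transform(training_df)
predictions.select("Churned", "prediction", "probability").show(5)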

Experiments

Experiments give you an opportunity to group together different trials in solving a given problem.
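In Databricks, experiments are backed by MLflow, so individual trials can be logged from a notebook. A minimal sketch (the experiment path, parameter, and metric values are invented for illustration; model is the spark.ml model from the previous sketch):

import mlflow
import mlflow.spark

mlflow.set_experiment("/Shared/churn-experiment")

with mlflow.start_run(run_name="logistic-regression-baseline"):
    mlflow.log_param("regParam", 0.01)
    mlflow.log_metric("areaUnderROC", 0.87)
    mlflow.spark.log_model(model, "model")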

Models

Once you've landed on an answer, register your trained model.
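Registration can happen through the UI or in code. A minimal MLflow sketch (the run ID placeholder and model name are invented for illustration):

import mlflow

# Register the model logged in a specific run under a name in the model registry.
mlflow.register_model(model_uri="runs:/<run-id>/model", name="churn-classifier")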

Deployment

Serving a model is easy through the UI.

CI/CD

Just as with data loading, we'll want to create repeatable processes to train, register, and serve models, and one answer is the same as before: notebook workflows.
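A minimal sketch of such a workflow (notebook paths and parameters are invented for illustration), where a driver notebook chains the steps together with dbutils.notebook.run:

# Each call runs a child notebook, blocks until it finishes (or the timeout in seconds
# elapses), and returns whatever the child passes to dbutils.notebook.exit.
train_result = dbutils.notebook.run("/Repos/pipeline/train_model", 3600, {"model_name": "churn-classifier"})

if train_result == "success":
    dbutils.notebook.run("/Repos/pipeline/register_model", 600, {"model_name": "churn-classifier"})
    dbutils.notebook.run("/Repos/pipeline/serve_model", 600, {"model_name": "churn-classifier"})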

Demo Time

Agenda

  1. What Is Apache Spark?
  2. The Data Pipeline
  3. Getting Started with Databricks
  4. Transforming Data in the Data Lake
  5. Machine Learning and Experimentation
  6. Workflow Management

Jobs

Databricks jobs allow us to schedule the execution of notebooks, Java executables (JAR files), Python scripts, and spark-submit calls.

Creating a Job

When creating a job, it's best to use a purpose-built job cluster: job compute costs $0.25 per DBU-hour less than all-purpose compute clusters, which can add up to considerable savings over the course of a month.

Scheduling a Job

Each job can run manually or on a fixed schedule of your choosing.
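The UI is the usual route, but the same can be scripted against the Databricks Jobs REST API. A rough sketch using Python's requests library (the workspace URL, token, cluster ID, notebook path, and cron expression are placeholders, and exact payload fields may vary by API version):

import requests

payload = {
    "name": "Nightly data lake refresh",
    "tasks": [{
        "task_key": "refresh",
        "notebook_task": {"notebook_path": "/Repos/pipeline/refresh_silver"},
        "existing_cluster_id": "<cluster-id>"
    }],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",   # 2:00 AM daily
        "timezone_id": "America/New_York"
    }
}

resp = requests.post("https://<workspace-url>/api/2.1/jobs/create",
                     headers={"Authorization": "Bearer <personal-access-token>"},
                     json=payload)
print(resp.json())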

Job Results

Review job results after a run to see what it did and to investigate any errors which pop up.

Demo Time

Wrapping Up

Over the course of this talk, we have learned about the basics of Apache Spark, data pipelines, and data lakes. Along the way, we created notebook workflows for data lake population as well as machine learning.

Wrapping Up

To learn more, go here:
https://csmore.info/on/pipeline


And for help, contact me:
feasel@catallaxyservices.com | @feaselkl


Catallaxy Services consulting:
https://CSmore.info/on/contact