Getting Started with Apache Spark

ABSTRACT

As companies work to gain insight from ever-increasing amounts of data, data platform practitioners need tools that can scale along with the data. Early big data solutions in the Hadoop ecosystem assumed that data sizes overwhelmed available memory, emphasizing heavy disk usage to coordinate work between nodes. As the cost of memory decreases and the amount of memory available per server increases, we see a shift in the makeup of big data systems toward heavy memory usage instead of disk. Apache Spark, which focuses on memory-intensive operations, has taken advantage of this hardware shift to become the dominant solution for problems requiring distributed data. In this talk, we will take an introductory look at Apache Spark. We will review where it fits in the Hadoop ecosystem, cover how to get started and some of the basic functional programming concepts needed to understand Spark, and see examples of how we can use Spark to solve problems that arise when analyzing large data sets.
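To give a flavor of the functional style the talk covers, here is a minimal word-count sketch in Scala using Spark's RDD API. It is not an example from the talk itself; the application name, local master setting, and input path are placeholders you would replace with your own values.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Run locally on all cores; on a cluster you would point at a real master URL.
    val spark = SparkSession.builder()
      .appName("WordCountSketch")
      .master("local[*]")
      .getOrCreate()

    // "data/sample.txt" is a placeholder path to any plain-text file.
    val lines = spark.sparkContext.textFile("data/sample.txt")

    // Classic functional pipeline: flatMap to words, map to pairs, reduceByKey to counts.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word.toLowerCase, 1))
      .reduceByKey(_ + _)

    // Print the ten most frequent words.
    counts
      .sortBy({ case (_, count) => count }, ascending = false)
      .take(10)
      .foreach { case (word, count) => println(s"$word: $count") }

    spark.stop()
  }
}
```

Each transformation (flatMap, filter, map, reduceByKey) is lazy; Spark only executes the pipeline when an action such as take or foreach is called, which is one of the core concepts the talk walks through.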

ADDITIONAL MEDIA

On February 9, 2020, I gave a version of this talk at SQL Saturday Austin BI. You can get the recording at UserGroup.tv.