Data Virtualization with PolyBase

Kevin Feasel (@feaselkl)
https://csmore.info/on/polybase

Who Am I? What Am I Doing Here?

Motivation

The goal of this talk is to cover PolyBase, Microsoft's technology for data virtualization and query scale-out.

We will look at installation, configuration, and a number of practical business scenarios involving the product.

Agenda

  1. What is PolyBase?
  2. Installation + Configuration
  3. The Building Blocks
  4. Cold Storage
  5. ELT
  6. Data Virtualization
  7. Tuning and Administration

What is PolyBase?

PolyBase is Microsoft's newest technology for covering three use cases:

  1. Connecting to heterogeneous, remote data sources.
  2. Data virtualization: creating the appearance of local data while still having the data live remotely.
  3. Scale-out versus scale-up and Massively Parallel Processing (MPP).

A Brief History

  • Parallel Data Warehouse (2012): PolyBase introduced as a way to integrate SQL Server and Hadoop.
  • SQL Server 2016: PolyBase "V1" introduced.
  • SQL Server 2019: PolyBase "V2" introduced. PolyBase also brought into SQL Server Big Data Clusters.

Supported Data Sources: V1

PolyBase V1 supported two external data sources:

  1. HDFS -- Hadoop
  2. WASB(S) -- Blob Storage

Supported Data Sources: V2

PolyBase V2 has blown open the door in supported data sources:

  1. SQL Server
  2. Oracle
  3. MongoDB / Cosmos DB
  4. Teradata
  5. Any ODBC connector -- Hive, Spark, Excel

Massively Parallel Processing

SQL Server is a classic "scale-up" technology: if you want more power, add more RAM/CPUs/resources to the single server.

Hadoop is a great example of an MPP system: if you want more power, add more servers; the system will coordinate processing.

MPP and Scale-Out Versus Scale-Up

Upsides:

  • Scale-out is often cheaper than scale-up
  • You hit the ceiling on scale-up long before scale-out

Downsides:

  • Scale-out is typically more complex
  • Scale-out has a longer startup time

Agenda

  1. What is PolyBase?
  2. Installation + Configuration
  3. The Building Blocks
  4. Cold Storage
  5. ELT
  6. Data Virtualization
  7. Tuning and Administration
New installation
Install only V2 components
Install V1 components as well
Install PolyBase standalone
Install Azul Zulu Open JRE
Configure PolyBase services
Save install details
Installation is complete

Enabling PolyBase

Driver Installation

In order to use the generic ODBC connector, you need to install an appropriate ODBC driver on the machine hosting SQL Server. Grab the right driver and follow the normal install steps.

We will demonstrate with Apache Hive and Apache Spark, both for Hortonworks Data Platform (HDP) 3.0.

Install the Hive ODBC driver
Install the Hive ODBC driver
Install the Hive ODBC driver
Install the Hive ODBC driver
Install the Hive ODBC driver

Agenda

  1. What is PolyBase?
  2. Installation + Configuration
  3. The Building Blocks
  4. Cold Storage
  5. ELT
  6. Data Virtualization
  7. Tuning and Administration

External Objects

Three sorts of external objects exist.

  1. External Data Source
  2. External File Format
  3. External Table

External Data Source

An external data source tells SQL Server where it can find remote data.

That data does not migrate to SQL Server! It lives on the external data source.

External File Format

An external file format tells SQL Server what type of file you intend to use. SQL Server supports delimited files (e.g., comma or tab separated), ORC, and Parquet formats.

This is not required for V2 data sources, as they do not read from files.

External Table

An external table tells SQL Server the structure of your external resource. SQL Server requires structured data and will reject records which do not fit the data types and sizes you set.

Rejection and You

Going from unstructured to structured data is a risk: conversions may fail. When that happens, the PolyBase engine rejects that row. After we meet a rejection threshold, PolyBase fails the query.

Rejection and You

The three parameters we control are:

REJECT_TYPE : { VALUE, PERCENTAGE }

REJECT_VALUE

REJECT_SAMPLE_VALUE = Number of rows to pull in before recalculating rejection percentage.

Demo Time

Agenda

  1. What is PolyBase?
  2. Installation + Configuration
  3. The Building Blocks
  4. Cold Storage
  5. ELT
  6. Data Virtualization
  7. Tuning and Administration

Cold Storage

Assumptions:

  1. You have one or more giant tables in SQL Server.
  2. Users only query a small, pre-known fraction of that data (e.g., last year or last 2 years).
  3. Occasionally someone needs this under-utilized data.
  4. When users do need the under-utilized data, they can afford to wait a while.

Cold Storage

Technique: write old data to Azure Blob Storage / Hadoop and use partitioned views to join back together.

Benefits:

  1. Blob Storage and HDFS tend to be much cheaper than SQL Server disks
  2. Less data >> faster queries on SQL server
  3. Partitioned views mean no code changes are necessary and queries automatically benefit

Demo Time

Alternate Approaches

  • Migrate old data to Azure Blob Storage and skip the partitioned view
  • Store all old data in one external table

Agenda

  1. What is PolyBase?
  2. Installation + Configuration
  3. The Building Blocks
  4. Cold Storage
  5. ELT
  6. Data Virtualization
  7. Tuning and Administration

ELT

Assumptions:

  1. You need the same data in multiple databases, such as OLTP + warehouse
  2. You have processes in place (e.g., SSIS packages) to Extract, Transform, and Load this data
  3. You have a large amount of data

ELT

Technique: instead of Extract-Transform-Load, use PolyBase and the Extract-Load-Transform (ELT) pattern to land and transform data.

Benefits:

  1. ELT is typically less memory-intensive than ETL
  2. ELT scales to larger data sizes than ETL
  3. ELT with PolyBase lets you stick with T-SQL instead of T-SQL + SSIS / Biml + C#

Demo Time

Alternate Approaches

  • Create views on the source system and perform more transformation on the source side
  • Use ELT to combine data from multiple source systems as part of the same process

Agenda

  1. What is PolyBase?
  2. Installation + Configuration
  3. The Building Blocks
  4. Cold Storage
  5. ELT
  6. Data Virtualization
  7. Tuning and Administration

Data Virtualization

Assumptions:

  1. You have data in several, different source systems
  2. These source systems use several technologies, such as SQL Server, Oracle, Spark, Cosmos DB, etc.
  3. You would like to query these source systems with T-SQL and integrate them together in your queries
  4. Performance is not a key consideration

Data Virtualization

Technique: use PolyBase external data sources and external tables to create SQL Server representations of data sets.

Benefits:

  1. One language to query them all
  2. Users do not need to know where the data really lives
  3. All filtering, reshaping, and data logic works as T-SQL regardless of the data's origin

Demo Time

Agenda

  1. What is PolyBase?
  2. Installation + Configuration
  3. The Building Blocks
  4. Cold Storage
  5. ELT
  6. Data Virtualization
  7. Tuning and Administration

Statistics

PolyBase allows you to create statistics on external tables.

Limitations:

  1. Statistics are not auto-created.
  2. Statistics are not updatable--need to drop and re-create.
  3. Not all external sources support sampling.
  4. SQL Server generates statistics, so you need to bring the data over. Make sure you have enough disk space!

Using Statistics

Once created, external table statistics behave the same way as normal table statistics and the optimizer can use them similarly.

Demo Time

Important DMVs

There are several important DMVs for query tuning and regular administration.

Demo Time

Execution Plans

PolyBase adds execution information to query plans, including much of what you will find in the DMVs and sometimes a bit more.

Demo Time

Wrapping Up

PolyBase has a number of useful business cases. It won't give you the greatest performance, but if you want a SQL interface and performance is not a top consideration, PolyBase has you covered.

Wrapping Up

To learn more, go here:
https://csmore.info/on/polybase


And for help, contact me:
feasel@catallaxyservices.com | @feaselkl


Catallaxy Services consulting:
https://CSmore.info/on/contact