Data Virtualization in SQL Server

Kevin Feasel (@feaselkl)
https://csmore.info/on/polybase

Who Am I? What Am I Doing Here?

The Scenario

Volcano facts live in Cosmos DB.

Country economic statistics live as CSV files in Azure Blob Storage.

Eruption history lives in another SQL Server instance.


Your boss wants one report that joins all three.

You have an afternoon. Good luck.

One T-SQL Query to Solve the Problem

Agenda

  1. The Basics of PolyBase
  2. External Data Sources
  3. External Tables
  4. Data Virtualization
  5. Cold Storage
  6. Wrapping Up

What is PolyBase?

SQL Server's framework for querying data that lives outside the database.

First shipped in SQL Server 2016 with support for Hadoop and Azure Blob Storage. Since then, it has expanded to support a wide range of external data sources.

The Primary Conceit

External data sources are represented in SQL Server as external tables. They have a schema defined in SQL Server, but the data itself lives at the source.

External resources behave like regular tables. Filtering, joining, and aggregating all just work; the engine translates as needed and pulls only what the query asks for.

How We Got Here

YearWhat shipped
2016PolyBase v1: Hadoop and Azure Blob Storage only
2019PolyBase v2: ODBC connectors (SQL, Oracle, MongoDB, Teradata, generic ODBC on Windows)
2022Hadoop and scale-out groups removed. Native S3, Azure Blob (abs), and ADLS Gen2 (adls) added.
2025Native engine for Parquet/CSV/Delta with no PolyBase service required.

PolyBase vs Linked Servers

Linked ServerPolyBase
ScopeInstance-levelDatabase-scoped
DriversOLE DBODBC
AuthSQL and Windows authSQL auth only
Sweet spotOLTP, distributed transactionsAnalytic queries, large rowsets

Query performance tends to be fairly similar between the two because of overlap in how they operate.

Security: The Big Difference

Linked Server security is at the server. A login mapping reaches everything the remote login can see, including each database, schema, and table.

PolyBase security is at the object level. You GRANT SELECT on the specific external table, and that's all the user can reach, even if the underlying credential has broader access.

Enabling PolyBase

Only needed for database connectors (SQL Server, Cosmos DB, Postgres, Oracle). File-based queries on Azure / S3 work without it in SQL Server 2025.

Agenda

  1. The Basics of PolyBase
  2. External Data Sources
  3. External Tables
  4. Data Virtualization
  5. Cold Storage
  6. Wrapping Up

External Data Source

Tells SQL Server where the remote data lives. Data does not migrate to your SQL Server instance. Instead, SQL Server learns how to reach it.

Connector targets:

  • Azure Blob Storage, ADLS Gen2, S3-compatible object storage
  • Cosmos DB (MongoDB API)
  • SQL Server / Azure SQL Database, Oracle, Teradata
  • Generic ODBC: Postgres, MySQL, others (on Linux as of 2025)

Predicate Pushdown

A naive approach pulls all rows from the external source and filters them locally. This can be very inefficient, especially for large datasets.

With PUSHDOWN = ON on the data source, supported predicates execute at the source instead. The remote engine uses its own indexes and partitioning, and only matching rows make the trip.

Predicate Pushdown Support

Database connectors (SQL Server, Oracle, MongoDB, generic ODBC) tend to honor pushdown. Storage connectors likely will not and will send the entire file.

The rule of thumb: is there something on the other end that spends compute power to filter data?

TDS 8.0

SQL-to-SQL external data sources require a trusted certificate by default in 2025. ODBC Driver 18 ships with Encrypt=Yes on by default, and queries may fail at runtime if the target can't present a valid certificate.

This is a breaking change from previous versions, which defaulted to unencrypted connections for SQL connectors. The new default is more secure, but may require some configuration work to get existing queries running again.

Valid Solutions

  • Install a trusted certificate on the target server.
  • TrustServerCertificate=True in CONNECTION_OPTIONS for self-signed scenarios.
  • Strict posture: Encrypt=Strict, TrustServerCertificate=No, use TLS 1.3.

Agenda

  1. The Basics of PolyBase
  2. External Data Sources
  3. External Tables
  4. Data Virtualization
  5. Cold Storage
  6. Wrapping Up

External File Format

Tells SQL Server how a file is shaped. Only needed for file-based sources (Blob / ADLS / S3). ODBC-based external objects can skip this step.

Primary supported formats:

  • DELIMITEDTEXT: CSV, TSV, custom delimiters
  • PARQUET: columnar, embedded schema, snappy or gzip compression
  • DELTA: read-only

External Table

The SQL Server projection over the external resource. Defines column names, types, and nullability. Looks and feels like a regular table within SQL Server.

External tables allow a limited set of DDL commands and do not let you create indexes or (most) constraints.

Rejection and You

Type-checking is just as strict as regular tables in SQL Server. If a row does not fit the schema, PolyBase rejects it. Three knobs exist to tune this behavior:

  • REJECT_TYPEVALUE (count) or PERCENTAGE (cumulative)
  • REJECT_VALUE — the threshold; query fails when crossed
  • REJECT_SAMPLE_VALUE — rows per percentage-check window

Rejection only matters for object storage. ODBC sources require every row to be valid by definition.

Demo Time

Agenda

  1. The Basics of PolyBase
  2. External Data Sources
  3. External Tables
  4. Data Virtualization
  5. Cold Storage
  6. Wrapping Up

Data Virtualization

The primary use case for PolyBase is data virtualization: when data lives across multiple systems and you want to query and join it with T-SQL.

This technique is best for occasional federated queries. Because of performance and scalability issues, it is not a replacement for moving the data when latency or volume demand it.

The Volcano Stack

SourceTechnologyHolds
Cosmos DBMongoDB APIVolcano facts
Azure Blob StorageCSV fileCountry statistics
PostgreSQLGeneric ODBCVolcano-type descriptions
SQL ServerSQL Server ODBCEruption history (NCEI)

Demo Time

Going the Other Way

Reading external data is half the story. PolyBase can also write a query result back out to cloud storage.

The mechanism is CREATE EXTERNAL TABLE AS SELECT, also known as CETAS.

CETAS requires the allow polybase export option to be enabled.

Demo Time

Agenda

  1. The Basics of PolyBase
  2. External Data Sources
  3. External Tables
  4. Data Virtualization
  5. Cold Storage
  6. Wrapping Up

Cold Storage

Your SQL Server-based disks are likely very fast and very expensive. As data ages, we can move it to cheaper (but slower) storage options, while still being able to query it when necessary.

A view can (optionally) stitch the two halves together so the application sees one table, regardless of where the bytes actually live.

Best when older data is rarely touched but still occasionally needed for compliance, history, or analytics.

The Raleigh Fire Stack

TierYearsWhere it lives
Hot2024 – 2025Local SQL Server tables
Cold2007 – 2023Parquet on S3 (MinIO)
StitchAll yearsOne view, UNION ALL across the year tables

Each year table carries a CHECK constraint on dispatch_date_time so the optimizer can skip partitions.

Partition Elimination

When a query filters by date, the optimizer reads each table's CHECK constraint and prunes the partitions that can't contribute rows.

Cold partitions only get touched when the query actually references that data. Otherwise, we don't need to waste the compute resources.

Demo Time

Agenda

  1. The Basics of PolyBase
  2. External Data Sources
  3. External Tables
  4. Data Virtualization
  5. Cold Storage
  6. Wrapping Up

Recap

  • SQL Server can query remote databases and external storage (Azure Blob Storage or Amazon S3) via T-SQL.
  • In 2025, file queries are built into the engine, meaning no separate PolyBase service is required for that task.
  • Microsoft continues to invest in external object capabilities.
  • PolyBase is best when you need to join heterogeneous, remote data sources and performance is not the primary concern.

Wrapping Up

To learn more, go here:
https://csmore.info/on/polybase


And for help, contact me:
feasel@catallaxyservices.com | @feaselkl


Catallaxy Services consulting:
https://csmore.info/contact