Volcano facts live in Cosmos DB.
Country economic statistics live as CSV files in Azure Blob Storage.
Eruption history lives in another SQL Server instance.
Your boss wants one report that joins all three.
You have an afternoon. Good luck.
SQL Server's framework for querying data that lives outside the database.
First shipped in SQL Server 2016 with support for Hadoop and Azure Blob Storage. Since then, it has expanded to support a wide range of external data sources.
External data sources are represented in SQL Server as external tables. They have a schema defined in SQL Server, but the data itself lives at the source.
External resources behave like regular tables. Filtering, joining, and aggregating all just work; the engine translates as needed and pulls only what the query asks for.
| Year | What shipped |
|---|---|
| 2016 | PolyBase v1: Hadoop and Azure Blob Storage only |
| 2019 | PolyBase v2: ODBC connectors (SQL, Oracle, MongoDB, Teradata, generic ODBC on Windows) |
| 2022 | Hadoop and scale-out groups removed. Native S3, Azure Blob (abs), and ADLS Gen2 (adls) added. |
| 2025 | Native engine for Parquet/CSV/Delta with no PolyBase service required. |
| Linked Server | PolyBase | |
|---|---|---|
| Scope | Instance-level | Database-scoped |
| Drivers | OLE DB | ODBC |
| Auth | SQL and Windows auth | SQL auth only |
| Sweet spot | OLTP, distributed transactions | Analytic queries, large rowsets |
Query performance tends to be fairly similar between the two because of overlap in how they operate.
Linked Server security is at the server. A login mapping reaches everything the remote login can see, including each database, schema, and table.
PolyBase security is at the object level. You GRANT SELECT on the specific external table, and that's all the user can reach, even if the underlying credential has broader access.
Only needed for database connectors (SQL Server, Cosmos DB, Postgres, Oracle). File-based queries on Azure / S3 work without it in SQL Server 2025.
Tells SQL Server where the remote data lives. Data does not migrate to your SQL Server instance. Instead, SQL Server learns how to reach it.
Connector targets:
A naive approach pulls all rows from the external source and filters them locally. This can be very inefficient, especially for large datasets.
With PUSHDOWN = ON on the data source, supported predicates execute at the source instead. The remote engine uses its own indexes and partitioning, and only matching rows make the trip.
Database connectors (SQL Server, Oracle, MongoDB, generic ODBC) tend to honor pushdown. Storage connectors likely will not and will send the entire file.
The rule of thumb: is there something on the other end that spends compute power to filter data?
SQL-to-SQL external data sources require a trusted certificate by default in 2025. ODBC Driver 18 ships with Encrypt=Yes on by default, and queries may fail at runtime if the target can't present a valid certificate.
This is a breaking change from previous versions, which defaulted to unencrypted connections for SQL connectors. The new default is more secure, but may require some configuration work to get existing queries running again.
TrustServerCertificate=True in CONNECTION_OPTIONS for self-signed scenarios.Encrypt=Strict, TrustServerCertificate=No, use TLS 1.3.Tells SQL Server how a file is shaped. Only needed for file-based sources (Blob / ADLS / S3). ODBC-based external objects can skip this step.
Primary supported formats:
DELIMITEDTEXT: CSV, TSV, custom delimitersPARQUET: columnar, embedded schema, snappy or gzip compressionDELTA: read-onlyThe SQL Server projection over the external resource. Defines column names, types, and nullability. Looks and feels like a regular table within SQL Server.
External tables allow a limited set of DDL commands and do not let you create indexes or (most) constraints.
Type-checking is just as strict as regular tables in SQL Server. If a row does not fit the schema, PolyBase rejects it. Three knobs exist to tune this behavior:
REJECT_TYPE — VALUE (count) or PERCENTAGE (cumulative)REJECT_VALUE — the threshold; query fails when crossedREJECT_SAMPLE_VALUE — rows per percentage-check windowRejection only matters for object storage. ODBC sources require every row to be valid by definition.
The primary use case for PolyBase is data virtualization: when data lives across multiple systems and you want to query and join it with T-SQL.
This technique is best for occasional federated queries. Because of performance and scalability issues, it is not a replacement for moving the data when latency or volume demand it.
| Source | Technology | Holds |
|---|---|---|
| Cosmos DB | MongoDB API | Volcano facts |
| Azure Blob Storage | CSV file | Country statistics |
| PostgreSQL | Generic ODBC | Volcano-type descriptions |
| SQL Server | SQL Server ODBC | Eruption history (NCEI) |
Reading external data is half the story. PolyBase can also write a query result back out to cloud storage.
The mechanism is CREATE EXTERNAL TABLE AS SELECT, also known as CETAS.
CETAS requires the allow polybase export option to be enabled.
Your SQL Server-based disks are likely very fast and very expensive. As data ages, we can move it to cheaper (but slower) storage options, while still being able to query it when necessary.
A view can (optionally) stitch the two halves together so the application sees one table, regardless of where the bytes actually live.
Best when older data is rarely touched but still occasionally needed for compliance, history, or analytics.
| Tier | Years | Where it lives |
|---|---|---|
| Hot | 2024 – 2025 | Local SQL Server tables |
| Cold | 2007 – 2023 | Parquet on S3 (MinIO) |
| Stitch | All years | One view, UNION ALL across the year tables |
Each year table carries a CHECK constraint on dispatch_date_time so the optimizer can skip partitions.
When a query filters by date, the optimizer reads each table's CHECK constraint and prunes the partitions that can't contribute rows.
Cold partitions only get touched when the query actually references that data. Otherwise, we don't need to waste the compute resources.
To learn more, go here:
https://csmore.info/on/polybase
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://csmore.info/contact