My goals in this talk:
Azure Synapse Analytics is Microsoft's platform for modern data warehousing. It is made up of three components:
Azure Synapse Analytics dedicated SQL pools, née Azure SQL Data Warehouse, offer a Massively Parallel Processing (MPP) approach to data warehousing and work best in classic data warehousing scenarios:
Azure Synapse Analytics Spark pools allow you to spin up Apache Spark clusters. Key use cases for these Spark clusters include:
Azure Synapse Analytics serverless SQL pools are a bit different from dedicated SQL pools in the following ways:
Key use cases for Azure Synapse Analytics serverless SQL pools include:
Each Azure Synapse Analytics workspace comes with a built-in serverless pool.
You cannot create additional serverless SQL pools. Although there is an option to create a new SQL pool, it only allows you to create dedicated SQL pools.
Similarly, you cannot delete a serverless SQL pool. But if you don't use the serverless SQL pool to retrieve data, you never get charged for it.
We work for an agency known as Martian Agricultural Data Collection and Ownership (MADCOW).
Our job is to collect sensor data from agricultural plots in several Martian cities in order to better understand which plots of land are best suited for specific crops.
Our system has collected a significant amount of IoT sensor data and management would like us to build a rapid access system for reviewing this data. We have chosen Azure Synapse Analytics serverless SQL pools for the job.
Serverless SQL pools allow you to write T-SQL queries. But the product does not support the entire T-SQL surface area.
One major reason for this is that serverless SQL pools are (almost entirely) read-only.
The easiest way to access data from a serverless SQL pool is to load it from Azure Data Lake Storage.
Every Azure Synapse Analytics workspace has an associated data lake storage account.
The OPENROWSET command allows you to read data from a file or folder.
The OPENROWSET command supports the following file formats:
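As a minimal sketch of what such a query looks like (the storage account name and file path here are hypothetical placeholders):

```sql
-- Read the first rows of a delimited file in the data lake.
-- The storage account, container, and file names are placeholders.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://madcowlake.dfs.core.windows.net/sensors/readings.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS r;
```

PARSER_VERSION = '2.0' is the faster CSV parser and is needed for options like HEADER_ROW.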
Azure Synapse Analytics has some support for the Linux Foundation's Delta Lake (donated to the Linux Foundation by Databricks). Delta Lake has several beneficial features:
Azure Synapse Analytics extends a Microsoft technology called PolyBase, which allows you to virtualize data from a number of different data platform technologies. The most important one for our use case is, again, Azure Data Lake Storage Gen2.
For more on PolyBase itself, go to https://csmore.info/on/polybase.
Three sorts of external objects exist.
An external data source tells the serverless SQL pool where it can find remote data.
That data does not migrate to the serverless SQL pool! It lives on the external data source.
An external file format tells the serverless SQL pool what type of file you intend to use from your data lake. We can use delimited files (e.g., comma- or tab-separated), ORC, and Parquet formats.
An external table tells your serverless SQL pool the structure of your external resource. Serverless SQL pools require structured data and will reject records which do not fit the data types and sizes you set.
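A sketch of all three external objects together; every name, location, and column below is a hypothetical stand-in for our MADCOW scenario:

```sql
-- Hypothetical names throughout; LOCATION points at a placeholder container.
CREATE EXTERNAL DATA SOURCE MadcowDataLake
WITH (LOCATION = 'https://madcowlake.dfs.core.windows.net/sensors');

CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (FORMAT_TYPE = PARQUET);

-- Records that do not fit these types and sizes will be rejected.
CREATE EXTERNAL TABLE dbo.SensorReadings
(
    City         VARCHAR(50),
    PlotId       INT,
    SoilMoisture DECIMAL(5, 2),
    ReadingTime  DATETIME2(0)
)
WITH (
    LOCATION = 'readings/',
    DATA_SOURCE = MadcowDataLake,
    FILE_FORMAT = ParquetFileFormat
);
```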
You can combine the OPENROWSET approach with PolyBase's CREATE EXTERNAL TABLE AS SELECT (CETAS) in order to shape external tables before creation.
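A sketch of CETAS, assuming an external data source, file format, and source table already exist (the names here are hypothetical):

```sql
-- CETAS writes the query results to the data lake as Parquet files
-- and creates an external table over them in one statement.
-- MadcowDataLake, ParquetFileFormat, and dbo.SensorReadings are
-- hypothetical, pre-existing objects.
CREATE EXTERNAL TABLE dbo.DailyMoistureByCity
WITH (
    LOCATION = 'curated/daily_moisture/',
    DATA_SOURCE = MadcowDataLake,
    FILE_FORMAT = ParquetFileFormat
)
AS
SELECT City,
       CAST(ReadingTime AS DATE) AS ReadingDate,
       AVG(SoilMoisture) AS AvgSoilMoisture
FROM dbo.SensorReadings
GROUP BY City, CAST(ReadingTime AS DATE);
```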
Serverless SQL pools are great for ad hoc data exploration. They allow you to query arbitrary files or folders within a data lake, shape that data using T-SQL, and display or export the data in several ways.
OPENROWSET queries may load data from a single file.
Using wildcard characters like "*", we can expand that to include a variety of files.
Going further, we can use FILENAME() to get details on a file.
The end result is a listing of numbers of rows per file.
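A sketch of that rows-per-file query (the wildcard path is a placeholder):

```sql
-- Count rows per source file matched by the wildcard.
SELECT r.filename() AS source_file,
       COUNT(*)     AS row_count
FROM OPENROWSET(
    BULK 'https://madcowlake.dfs.core.windows.net/sensors/readings/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS r
GROUP BY r.filename();
```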
But we can include more than one wildcard and can even reference them in the query itself using FILEPATH().
This lets us analyze data even when we don't specify year or month in the files themselves: just by virtue of being in the right folder, we get relevant context!
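A sketch of this, assuming a hypothetical year/month folder partitioning scheme:

```sql
-- filepath(1) and filepath(2) return the values matched by the first
-- and second wildcards: here, the year and month folder names.
SELECT r.filepath(1) AS [year],
       r.filepath(2) AS [month],
       COUNT(*)      AS reading_count
FROM OPENROWSET(
    BULK 'https://madcowlake.dfs.core.windows.net/sensors/readings/*/*/*.parquet',
    FORMAT = 'PARQUET'
) AS r
GROUP BY r.filepath(1), r.filepath(2);
```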
We can even use these in the WHERE clause to filter:
In this case, we filter to include only files which start with the letter "C" and ignore all other files.
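Sketched out (the path is again a placeholder):

```sql
-- Keep only files whose names begin with "C"; all other files are ignored.
SELECT COUNT(*) AS row_count
FROM OPENROWSET(
    BULK 'https://madcowlake.dfs.core.windows.net/sensors/readings/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS r
WHERE r.filename() LIKE 'C%';
```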
Azure Synapse Analytics queries can be exported directly to CSV, JSON, or XML.
Azure Synapse Analytics includes a charting library similar to that in Azure Data Studio.
Historically, we have built physical data warehouses as a way of storing data for solving known business problems. We can also create logical (virtual) data warehouses and query from separate systems when that makes sense.
Databricks has coined the term Lakehouse to represent the combination of data warehouse and data lake in one managed area.
A logical data warehouse in a serverless SQL pool isn't exactly the same thing as the Data Lakehouse, but there are some similarities in approach.
Serverless SQL pools don't have the same breadth of security options and roles as on-premises SQL Server instances, but there are still ways to control access to data.
The easiest way to grant access to files or folders in a Data Lake is to generate a SAS token.
SAS tokens can expire after a set amount of time and have specific rights. For our needs, we want at least Read and List permissions.
You may have several tokens available to a serverless SQL pool as database scoped credentials.
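A sketch of storing a SAS token as a database scoped credential; the credential name and secret are placeholders, and a database master key must exist first:

```sql
-- A master key is required before creating a database scoped credential.
-- The password and SAS token below are placeholders.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password here>';

CREATE DATABASE SCOPED CREDENTIAL MadcowSasCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET   = '<SAS token, without the leading question mark>';
```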
Serverless SQL pools allow you to create distinct roles and assign users to them, just as you can on-premises.
GRANT permissions to specific objects or schemas. This is a good option for regular users who just need to query tables in a logical data warehouse.
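A sketch of the role-plus-GRANT pattern; the role, member, and schema names are hypothetical:

```sql
-- Create a role, add an analyst to it, and let the role query
-- everything in the dbo schema of the logical data warehouse.
CREATE ROLE CityAnalyst;
ALTER ROLE CityAnalyst ADD MEMBER [analyst@madcow.example];
GRANT SELECT ON SCHEMA::dbo TO CityAnalyst;
```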
In order to allow users to view data using OPENROWSET, you will need to grant rights for bulk database operations, as OPENROWSET is a bulk operation in SQL Server.
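As a sketch (the user name is a placeholder):

```sql
-- Required before a user can run OPENROWSET queries in the pool.
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [analyst@madcow.example];
```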
Just as on-premises, the CONTROL permission is very strong. It allows a user to set up permissions, create (or drop) external tables or views, and query data via OPENROWSET. Reserve this for administrators or power users.
We can also control access to specific rows in data based on role. Unlike SQL Server on-premises, there is no native row-level security. We can, however, create it via views and (optional) table-valued functions.
Suppose we have specific analysts by city.
Another option is to wrap the security in an inline table-valued function.
We can then use CROSS APPLY to apply the security rules.
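A sketch of the view-plus-function approach; the mapping table, function, and view names are all hypothetical, and we assume a dbo.CityAnalyst table mapping analyst logins to their cities:

```sql
-- Hypothetical mapping of analysts to cities, used as a security predicate.
CREATE FUNCTION dbo.fn_CityAccess (@City VARCHAR(50))
RETURNS TABLE
AS
RETURN
    SELECT 1 AS AccessGranted
    FROM dbo.CityAnalyst ca
    WHERE ca.City = @City
      AND ca.AnalystName = SUSER_SNAME();
GO

-- Each analyst sees only rows for cities assigned to them.
CREATE VIEW dbo.SensorReadingsSecured
AS
SELECT sr.City, sr.PlotId, sr.SoilMoisture, sr.ReadingTime
FROM dbo.SensorReadings sr
    CROSS APPLY dbo.fn_CityAccess(sr.City) ca;
GO
```

Granting SELECT on the view, rather than on the underlying table, is what makes the filter effective.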
The different components of Azure Synapse Analytics are priced differently. Whereas dedicated SQL pools and Spark pools are priced by compute utilization, serverless SQL pools are priced by the amount of data processed, at a rate of approximately $5 per terabyte.
Data processing includes data read, metadata read, data in intermediate storage during queries (e.g., spools or temp tables), data written to storage via CETAS, and data transferred between nodes or out to the end user.
Each query has a minimum of 10 MB of data processed. As such, try to avoid running an enormous number of small queries against tiny files, as those will add up quickly.
We know that scanning the minimum amount of data necessary is key to maximizing performance. It's also the key to saving money here. Serverless SQL pools offer a few techniques for doing this:
Typically, Parquet format will be superior to CSVs for handling data in a data lake. This is for a few reasons:
The downside to compression is that result sets come back to you uncompressed.
If your query reads 1 TB of Parquet data with a 5:1 compression ratio and returns the entirety of that data, you will be charged for 1 TB of compressed data read from disk plus 5 TB of data transferred out, or 6 TB in total.
By contrast, aggregating the data will result in much smaller result sets and lower prices.
Using filters on FILEPATH() can allow you to narrow down which files to read and what data to return.
Furthermore, when using OPENROWSET for ad hoc queries, narrow down to the smallest number of folders or files needed to get the job done.
If you have frequently used aggregates, it may make sense to calculate them once and store the results in files for later use. In that case, you will incur a one-time charge to calculate the aggregates but then a small charge (down to 10 MB of data processed) per access.
Sometimes, in order to improve performance, serverless SQL pools will process more data than absolutely necessary. For example, the CSV reader pulls in data in chunks, and although you only asked for 5 rows, the chunk might contain 50 or 100. The net effects of this are typically small, but can add up with a large number of queries.
To reduce this, store data in Parquet format and write queries which intend to read from an entire file rather than part of the file. This might include reshaping files in your data lake.
Each serverless SQL pool has an associated DMV which calculates the amount of data processed:
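If memory serves, that DMV is sys.dm_external_data_processed; treat the exact name and columns as something to verify against current documentation:

```sql
-- Daily, weekly, and monthly totals of data processed, in megabytes.
SELECT type, data_processed_mb
FROM sys.dm_external_data_processed;
```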
This has been a look at serverless SQL pools in Azure Synapse Analytics. Although they are in the SQL Server family, they're closer to second cousins of on-premises SQL Server than to twins.
To learn more, go here:
Catallaxy Services consulting: